Last model - H2O deep learning

So far we used the Spark MLlib for building different models; however, we can use also H2O algorithms as well. So let's try them!

At first, we are going to transfer our training and testing datasets over to H2O and create a DNN for our binary classification problem. To reiterate, this is made possible because Spark and H2O are sharing the same JVM which facilitates passing Spark RDDs over to H2O hex frames and vice versa.

All the models that we have run up to now have been in MLlib but now we are going to use H2O to build a DNN using the same training and testing sets that we used, which means we need to send this data over to our H2O cloud as follows:

val trainingHF = h2oContext.asH2OFrame(trainingData.toDF, "trainingHF") 
val testHF = h2oContext.asH2OFrame(testData.toDF, "testHF") 

To verify that we have successfully transferred our training and testing RDDs (which we converted to DataFrames), we can execute this command in our Flow notebook (all commands are executed with Shift+Enter). Notice that we have two H2O frames now called trainingRDD and testRDD which you can see in our H2O notebook by running the command getFrames.

Figure 12 - List of available H2O frames is available by typing "getFrames" into Flow UI.

We can easily explore frames to see their structure by typing getFrameSummary "trainingHF" into the Flow cell or just by clicking on the frame name (see Figure 13).

Figure 13 - Structure of training frame.

The preceding figure shows structure of the training frame - it has 80,491 rows and 29 columns; there are numeric columns named features0, features1, ... with real values and the first column label containing integer values.

Since we would like to perform a binary classification, we need to transform the "label" column from the integer to categorical type. You can do that easily by clicking on the action Convert to enum in the Flow UI or in the Spark console by executing the following commands:

trainingHF.replace(0, trainingHF.vecs()(0).toCategoricalVec).remove() 
trainingHF.update() 
 
testHF.replace(0, testHF.vecs()(0).toCategoricalVec).remove() 
testHF.update() 

The code replaces the first vector by a transformed vector and removes the original vector from memory. Furthermore, the call update propagates changes into the shared distributed store, so they become visible by all the nodes in the cluster.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset