An H2O example

For this example, we will again use the adult census dataset to predict income. As with our Keras example, this will be kept extremely minimal and we will cover just enough to illustrate the syntax for working with H2O, as well as the design nuances that differ from other packages:

  1. The first major difference when working with H2O is that we must explicitly initialize an H2O session, which starts a Java Virtual Machine (JVM) instance and connects it to R. This is accomplished with the following lines of code:
# load H2O package
library(h2o)

# start H2O
h2o::h2o.init()
  2. Loading data for use with H2O requires converting it to an H2OFrame. An H2OFrame is very similar to a data frame, with the major distinction being where the object is stored: while data frames are held in R's memory, an H2OFrame is stored on the H2O cluster. This can be an advantage with very large datasets. In the following example, we convert the data into the proper format in two steps. First, we load the data with read.csv() in the usual way; second, we convert the resulting data frames to H2OFrames with as.h2o():
## load data 
train <- read.csv("adult_processed_train.csv")
test <- read.csv("adult_processed_test.csv")

# convert the data frames to H2OFrames
train <- as.h2o(train)
test <- as.h2o(test)
  3. For this example, we will perform imputation as the sole pre-processing step, replacing all missing values using the mean for numeric columns and the mode for factor columns. In h2o.impute(), setting column = 0 applies the imputation to the entire frame. Note that although the function is called on the data, it is not necessary to assign the result to a new object: the imputations are reflected directly in the H2OFrame passed as an argument. It is also worth highlighting that we can pass a vector to the method argument; for each column, H2O checks whether the first method can be applied and, if not, moves on to the next. Pre-processing the data is accomplished by running the following lines of code:
## pre-process
h2o.impute(train, column = 0, method = c("mean", "mode"))
h2o.impute(test, column = 0, method = c("mean", "mode"))
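Because the imputation modifies the H2OFrames in place, it can be reassuring to confirm that no missing values remain. A minimal sketch, assuming the h2o.nacnt() helper from the h2o package, which returns the count of missing values per column:

```r
# count remaining missing values in each column;
# after imputation, every entry should be zero
h2o.nacnt(train)
h2o.nacnt(test)
```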
  4. In addition, in this step, we define the dependent and independent variables. The dependent variable is held in the target column, while all the remaining columns contain the independent variables that will be used to predict it:
# set dependent and independent variables
target <- "target"
predictors <- colnames(train)[1:14]
  5. With all of the preparation steps complete, we can now create a minimal model. The h2o.deeplearning() function creates a feedforward artificial neural network. In this example, just the minimum required to run the model is included; however, this function accepts 80 to 90 arguments, and we will cover many of them in later chapters. In the following code, we provide a name for our model, identify the training data, set a seed so that the pseudo-random numbers involved in the model can be replicated for reproducibility, define the dependent and independent variables, set the number of epochs (passes over the training data), and specify the number of cross-validation folds:
# train the model, leaving the hidden layers at their defaults
model <- h2o.deeplearning(model_id = "h2o_dl_example",
                          training_frame = train,
                          seed = 321,
                          y = target,
                          x = predictors,
                          epochs = 10,
                          nfolds = 5)
  6. After running the model, its performance can be evaluated on the out-of-fold (cross-validation) samples using the following line of code:
h2o.performance(model, xval = TRUE)
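Beyond printing the full performance object, individual cross-validated metrics can be extracted. A brief sketch, assuming the metric accessors h2o.auc() and h2o.confusionMatrix() from the h2o package (the confusion matrix is pulled from the metrics object returned by h2o.performance()):

```r
# cross-validated AUC for the binary income classifier
h2o.auc(model, xval = TRUE)

# confusion matrix computed on the out-of-fold predictions
perf <- h2o.performance(model, xval = TRUE)
h2o.confusionMatrix(perf)
```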
  7. Finally, when our model is complete, the cluster must be explicitly shut down, just as it was initialized. The following function closes the current H2O instance:
h2o::h2o.shutdown()

We can observe the following in this example:

  • The syntax for H2O varies quite a bit from other machine learning libraries.
  • First, we need to initialize the Java Virtual Machine, and our data must be stored in H2O's own data containers.
  • In addition, imputation happens by calling the function on a data object without assigning the changes back to an object.
  • We also need to pass the independent variables as an explicit vector of column names, which differs slightly from the formula interface used by many other R modeling functions.
  • All of this is to say that H2O may feel a little unfamiliar as you use it. It is also limited in terms of the algorithms available; however, the ability to work with larger datasets is a definite advantage of this package.

Now that we have looked at the comprehensive deep learning packages, we will focus on packages written with R that perform a specific modeling task or a limited set of tasks.
