Shifting gears away from the Space Shuttle, let's work through a practical example of deep learning, using the h2o
package. We will do this on data that we used in some of the earlier chapters: the Pima Indian diabetes data. In Chapter 5, More Classification Techniques — K-Nearest Neighbors and Support Vector Machines, the best classifier was the Support Vector Machine with a sigmoid kernel. We've already gone through the business and data understanding work in that chapter, so in this section, we will focus on how to load the data into the H2O platform and run the deep learning code.
H2O is an open source predictive analytics platform with prebuilt algorithms, such as k-nearest neighbor, gradient boosted machines, and deep learning. You can upload data to the platform via Hadoop, AWS, Spark, SQL, NoSQL, or your hard drive. The great thing about it is that you can utilize these machine learning algorithms from R at a much greater scale than is possible with R alone on your local machine. If you are interested in learning more, you can visit the site, http://h2o.ai/product/.
What we will do here is prepare the data, save it to the drive, and load it in H2O. The data is in two different datasets, so we will first combine them. We will also need to scale the inputs. For the labeled outcome, the deep learning algorithm in H2O requires a factor rather than a numeric response, which means that we will not need to transform it. This code gets us to where we need to be. The rbind() function concatenates the datasets, as follows:
> library(MASS) #contains the Pima datasets
> data(Pima.tr)
> data(Pima.te)
> pima = rbind(Pima.tr, Pima.te)
> pima.scale = as.data.frame(scale(pima[,-8]))
> pima.scale$type = pima$type
> str(pima.scale)
'data.frame':   532 obs. of  8 variables:
 $ npreg: num  0.448 1.052 0.448 -1.062 -1.062 ...
 $ glu  : num  -1.13 2.386 -1.42 1.418 -0.453 ...
 $ bp   : num  -0.285 -0.122 0.852 0.365 -0.935 ...
 $ skin : num  -0.112 0.363 1.123 1.313 -0.397 ...
 $ bmi  : num  -0.391 -1.132 0.423 2.181 -0.943 ...
 $ ped  : num  -0.403 -0.987 -1.007 -0.708 -1.074 ...
 $ age  : num  -0.708 2.173 0.315 -0.522 -0.801 ...
 $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...
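If you want to convince yourself of what scale() is doing before running it on the real data, here is a minimal sketch with simulated data (the column names here are made up for illustration): each column is centered to mean zero and rescaled to a standard deviation of one.

```r
# Illustrative only: simulated data standing in for the Pima inputs
set.seed(123)
df <- data.frame(x = rnorm(100, mean = 50, sd = 10),
                 y = runif(100, min = 0, max = 200))

# scale() returns a matrix; wrap it back into a data frame as in the text
df.scale <- as.data.frame(scale(df))

# Each column now has mean ~0 (to floating-point precision) and sd 1
round(colMeans(df.scale), 10)
apply(df.scale, 2, sd)
```

This is why the Mean rows in the later summary() output show values on the order of 1e-15 rather than exactly zero.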
I will save this to the hard drive, but you can save it anywhere that is accepted by H2O. Let's make a note of the working directory as well:
> getwd()
[1] "C:/Users/clesmeister/chap7 NN"
> write.csv(pima.scale, file="pimaScale.csv", row.names=FALSE)
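As a quick aside, the row.names=FALSE argument matters here: without it, write.csv() prepends a column of row labels, which would show up as an extra, unnamed variable when the file is read back in. A small self-contained sketch (using a temporary file and made-up values, not the actual Pima data):

```r
# Sketch: write.csv()/read.csv() round trip using a temporary file
tmp <- tempfile(fileext = ".csv")
df  <- data.frame(npreg = c(0.45, 1.05), type = c("No", "Yes"))

# row.names = FALSE keeps R from adding a spurious first column of row labels
write.csv(df, file = tmp, row.names = FALSE)
df2 <- read.csv(tmp)

names(df2)  # only the two original columns survive the round trip
```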
We can now connect to H2O and start an instance. Please note that your output will differ from what is displayed below:
> library(h2o)
> localH2O = h2o.init()
Successfully connected to http://127.0.0.1:54321/
> h2o.getConnection()
IP Address: 127.0.0.1
Port      : 54321
Session ID: _sid_8102ec2ab4585cc63b8186735b594e00
Key Count : 39
The H2O function h2o.uploadFile() allows you to upload/import your file to the H2O cloud. The following functions are also available for uploads:
h2o.importFolder
h2o.importURL
h2o.importHDFS
There are a number of arguments that you can use to upload data, but the two that we need are the path of the file and a key specification. I like to specify the path outside of the function, as follows:
> path = "C:/Users/clesmeister/chap7 NN/pimaScale.csv"
We will specify the key with destination_frame=" " in the function. It is quite simple to upload the file, and a percent indicator tracks the status:
> pima.hex = h2o.uploadFile(path=path, destination_frame="pima.hex")
  |=========================================================| 100%
The data is now in an H2OFrame, which you can verify with class(), as follows:
> class(pima.hex)
[1] "H2OFrame"
attr(,"package")
[1] "h2o"
Many of the R commands in H2O may produce a different output than what you are used to seeing. For instance, look at the structure of our data:
> str(pima.hex)
Formal class 'H2OFrame' [package "h2o"] with 4 slots
  ..@ conn      :Formal class 'H2OConnection' [package "h2o"] with 3 slots
  .. .. ..@ ip     : chr "127.0.0.1"
  .. .. ..@ port   : num 54321
  .. .. ..@ mutable:Reference class 'H2OConnectionMutableState' [package "h2o"] with 2 fields
  .. .. .. ..$ session_id: chr "_sid_8102ec2ab4585cc63b8186735b594e00"
  .. .. .. ..$ key_count : int 43
  .. .. .. ..and 13 methods, of which 1 is possibly relevant:
  .. .. .. ..  initialize
  ..@ frame_id  : chr "pima.hex"
  ..@ finalizers: list()
  ..@ mutable   :Reference class 'H2OFrameMutableState' [package "h2o"] with 4 fields
  .. ..$ ast      : NULL
  .. ..$ nrows    : num 532
  .. ..$ ncols    : int 8
  .. ..$ col_names: chr "npreg" "glu" "bp" "skin" ...
  .. ..and 13 methods, of which 1 is possibly relevant:
  .. ..  initialize
The head() and summary() functions work exactly the same. Here are the first six rows of our data in H2O, along with a summary:
> head(pima.hex)
       npreg        glu         bp       skin        bmi
1  0.4477858 -1.1300306 -0.2847739 -0.1123474 -0.3909581
2  1.0516440  2.3861862 -0.1223077  0.3627626 -1.1321178
3  0.4477858 -1.4203605  0.8524894  1.1229387  0.4228642
4 -1.0618597  1.4184201  0.3650908  1.3129827  2.1813017
5 -1.0618597 -0.4525944 -0.9346387 -0.3974135 -0.9431947
6  0.4477858 -0.7751831  0.3650908 -0.2073695  0.3937991
         ped        age type
1 -0.4033309 -0.7075782   No
2 -0.9867069  2.1730387  Yes
3 -1.0070235  0.3145762   No
4 -0.7080796 -0.5217319   No
5 -1.0737779 -0.8005013   No
6 -0.3626978  1.8942693  Yes
> summary(pima.hex)
 npreg               glu                 bp
 Min.   :-1.062e+00  Min.   :-2.098e+00  Min.   :-3.859e+00
 1st Qu.:-7.642e-01  1st Qu.:-7.220e-01  1st Qu.:-6.105e-01
 Median :-4.613e-01  Median :-1.972e-01  Median : 3.918e-02
 Mean   :-1.252e-15  Mean   :-1.557e-16  Mean   :-1.004e-17
 3rd Qu.: 4.472e-01  3rd Qu.: 6.504e-01  3rd Qu.: 6.889e-01
 Max.   : 4.071e+00  Max.   : 2.515e+00  Max.   : 3.127e+00
 skin                bmi                 ped
 Min.   :-2.108e+00  Min.   :-2.135e+00  Min.   :-1.213e+00
 1st Qu.:-6.829e-01  1st Qu.:-7.313e-01  1st Qu.:-7.116e-01
 Median :-1.847e-02  Median :-1.715e-02  Median :-2.541e-01
 Mean   : 5.828e-17  Mean   : 3.085e-16  Mean   : 6.115e-17
 3rd Qu.: 6.459e-01  3rd Qu.: 5.798e-01  3rd Qu.: 4.490e-01
 Max.   : 6.634e+00  Max.   : 4.972e+00  Max.   : 5.564e+00
 age                 type
 Min.   :-9.863e-01  No :355
 1st Qu.:-8.024e-01  Yes:177
 Median :-3.396e-01
 Mean   :-1.876e-16
 3rd Qu.: 5.915e-01
 Max.   : 4.589e+00
You can upload your own train and test partitioned datasets to H2O, or you can use the built-in functionality. I will demonstrate the latter with a 70/30 split. The first thing to do is create a vector of random, uniform numbers for our data:
> rand = h2o.runif(pima.hex, seed = 123)
You can then build your partitioned data and assign it with your desired key name, as follows:
> train = pima.hex[rand <= 0.7, ]
> train = h2o.assign(train, key = "train")
> test = pima.hex[rand > 0.7, ]
> test = h2o.assign(test, key = "test")
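The logic behind this split can be sketched in base R, without an H2O cluster: draw one uniform number per row, then send rows at or below 0.7 to the training set and the rest to the test set. This is a sketch of the idea only; h2o.runif() performs the equivalent draw on the cluster side, so the actual row assignments will differ.

```r
# Base-R analogue of the h2o.runif() 70/30 split (illustrative sketch)
set.seed(123)
n <- 532                     # rows in the combined Pima data
rand <- runif(n)             # one uniform draw per row

train.idx <- which(rand <= 0.7)
test.idx  <- which(rand > 0.7)

# The two partitions are disjoint and together cover every row
length(intersect(train.idx, test.idx))   # 0
length(train.idx) + length(test.idx)     # 532
```

Because the cut is random rather than exact, the training set will hold roughly, not exactly, 70 percent of the rows.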
With these created, it is a good idea to confirm that the response variable is balanced between the train and test sets. To do this, you can use the h2o.table() function; in our case, the response is column 8:
> h2o.table(train[,8])
H2OFrame with 2 rows and 2 columns
  type Count
1   No   253
2  Yes   124
> h2o.table(test[,8])
H2OFrame with 2 rows and 2 columns
  type Count
1   No   102
2  Yes    53
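To see that the two partitions really are balanced, it helps to compare proportions rather than raw counts. A small base-R sketch, rebuilding factors from the counts shown in the output above:

```r
# Rebuild the response factors from the counts in the h2o.table() output
train.type <- factor(rep(c("No", "Yes"), times = c(253, 124)))
test.type  <- factor(rep(c("No", "Yes"), times = c(102, 53)))

# Class proportions in train vs. test: about 67/33 in both
round(prop.table(table(train.type)), 3)  # No 0.671, Yes 0.329
round(prop.table(table(test.type)), 3)   # No 0.658, Yes 0.342
```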
This looks well balanced, so we can begin the modeling process.
As we will see, the deep learning function has quite a few arguments and parameters that you can tune. The thing that I like about the package is the ability to keep things as simple as possible and let the defaults do their work. If you want to see all the possibilities along with the defaults, consult the help files or run the following command:
> args(h2o.deeplearning)
Documentation on all the arguments and tuning parameters is available online at: http://h2o.ai/docs/master/model/deep-learning/.
On a side note, you can run a demo of the various machine learning methods by just running demo("method"). For instance, you can go through the deep learning demo with demo(h2o.deeplearning).
The critical items to specify in this example will be as follows:
training_frame = train
validation_frame = test
variable_importances = TRUE
When combined, this code will create the deep learning object, as follows:
> dlmodel <- h2o.deeplearning(x = 1:7, y = 8,
                              training_frame = train,
                              validation_frame = test,
                              seed = 123,
                              variable_importances = TRUE)
  |=========================================================| 100%
An indicator bar tracks the progress and with this relatively small dataset, it only takes a couple of seconds.
Calling the dlmodel object produces quite a bit of information. The two things that I will show here are the confusion matrices for the train and test sets:
> dlmodel
Model Details:
==============
        No Yes    Error     Rate
No     204  49 0.193676  =49/253
Yes     29  95 0.233871  =29/124
Totals 233 144 0.206897  =78/377

        No Yes    Error     Rate
No      86  16 0.156863  =16/102
Yes     18  35 0.339623   =18/53
Totals 104  51 0.219355  =34/155
The first matrix shows the performance on the train set, with the columns as the predictions and the rows as the actuals; the model achieved 79.3 percent accuracy. The false positive and false negative rates are roughly equivalent. The accuracy on the test data was 78 percent, but with a higher false negative rate.
A further exploration of the model parameters can also be called, which produces a lengthy output:
> dlmodel@allparameters
Let's have a look at the variable importance:
> dlmodel@model$variable_importances
Variable Importances:
  variable relative_importance scaled_importance percentage
1      glu            1.000000          1.000000   0.156574
2       bp            0.942461          0.942461   0.147565
3    npreg            0.910888          0.910888   0.142622
4      age            0.902482          0.902482   0.141305
5     skin            0.894196          0.894196   0.140008
6      ped            0.882988          0.882988   0.138253
7      bmi            0.853734          0.853734   0.133673
The variable importance is calculated using the so-called Gedeon method. Keep in mind that these results can be misleading. The table shows an ordering of the variables, but their contributions are all relatively similar, so the ranking is probably of limited value; this opacity is a common criticism of neural network techniques. The importance is also subject to sampling variation, and if you change the seed value, the order of variable importance can change quite a bit.
You are also able to see the predicted values and put them in a data frame, if you want. We will first create the predicted values:
> dlPredict = h2o.predict(dlmodel, newdata = test)
> dlPredict
H2OFrame with 155 rows and 3 columns
First 10 rows:
   predict        No         Yes
1       No 0.9973673 0.002632645
2      Yes 0.2167641 0.783235848
3      Yes 0.1707465 0.829253495
4      Yes 0.1609832 0.839016795
5       No 0.9740857 0.025914235
6       No 0.9957688 0.004231210
7       No 0.9989172 0.001082811
8       No 0.9342141 0.065785915
9       No 0.7045852 0.295414835
10      No 0.9003637 0.099636205
As the output defaults to the first ten observations, putting it in a data frame will give us all of the predicted values, as follows:
> dlPred = as.data.frame(dlPredict)
> head(dlPred)
  predict        No         Yes
1      No 0.9973673 0.002632645
2     Yes 0.2167641 0.783235848
3     Yes 0.1707465 0.829253495
4     Yes 0.1609832 0.839016795
5      No 0.9740857 0.025914235
6      No 0.9957688 0.004231210
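Notice how the predict label tracks the class probabilities: each row is labeled with the more probable class. The sketch below reproduces that relationship for the first three rows of the output above. This is illustrative only; H2O chooses its own classification threshold internally, so picking the larger probability is an approximation of its behavior rather than its exact rule.

```r
# Sketch: relating the class probabilities to the predicted label
# (probabilities copied from the first three rows of the output above)
dlPred.demo <- data.frame(No  = c(0.9973673, 0.2167641, 0.1707465),
                          Yes = c(0.0026326, 0.7832358, 0.8292535))

# Label each row with whichever class has the larger probability
dlPred.demo$predict <- ifelse(dlPred.demo$Yes > dlPred.demo$No, "Yes", "No")
dlPred.demo$predict  # "No" "Yes" "Yes"
```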
With this, we have completed the introduction to deep learning in R using the capabilities of the H2O package. It is simple to use while offering plenty of flexibility to tune the hyperparameters in order to optimize the model fit. Enjoy!