Shifting gears away from the Space Shuttle, let's work through a practical example of deep learning, using the h2o
package. We will do this on data that we used in some of the earlier chapters: the Pima Indian diabetes data. In Chapter 5, More Classification Techniques — K-Nearest Neighbors and Support Vector Machines, the best classifier was the Support Vector Machine with a sigmoid kernel. We've already gone through the business and data understanding work in that chapter, so in this section, we will focus on how to load the data into the H2O platform and run the deep learning code.
H2O is an open source predictive analytics platform with prebuilt algorithms, such as k-nearest neighbor, gradient boosted machines, and deep learning. You can upload data to the platform via Hadoop, AWS, Spark, SQL, NoSQL, or your hard drive. The great thing about it is that you can utilize these machine learning algorithms from R at a much greater scale than is possible with R alone on your local machine. If you are interested in learning more, you can visit the site, http://h2o.ai/product/.
What we will do here is prepare the data, save it to the drive, and load it in H2O. The data is in two different datasets, so we will first combine them. We will also need to scale the inputs. For the labeled outcome, the deep learning algorithm in H2O requires a factor rather than a numeric response, which means that we will not need to transform it. This code gets us to where we need to be. The rbind() function concatenates the datasets, as follows:
> library(MASS) #contains the Pima datasets
> data(Pima.tr)
> data(Pima.te)
> pima = rbind(Pima.tr, Pima.te)
> pima.scale = as.data.frame(scale(pima[,-8]))
> pima.scale$type = pima$type
> str(pima.scale)
'data.frame':   532 obs. of  8 variables:
 $ npreg: num  0.448 1.052 0.448 -1.062 -1.062 ...
 $ glu  : num  -1.13 2.386 -1.42 1.418 -0.453 ...
 $ bp   : num  -0.285 -0.122 0.852 0.365 -0.935 ...
 $ skin : num  -0.112 0.363 1.123 1.313 -0.397 ...
 $ bmi  : num  -0.391 -1.132 0.423 2.181 -0.943 ...
 $ ped  : num  -0.403 -0.987 -1.007 -0.708 -1.074 ...
 $ age  : num  -0.708 2.173 0.315 -0.522 -0.801 ...
 $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...
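If you want to convince yourself of what scale() is doing before running it on the real data, here is a minimal sketch with simulated data (the column names here are made up for illustration): each column is centered to mean zero and rescaled to a standard deviation of one.

```r
# Illustrative only: simulated data standing in for the Pima inputs
set.seed(123)
df <- data.frame(x = rnorm(100, mean = 50, sd = 10),
                 y = runif(100, min = 0, max = 200))

# scale() returns a matrix; wrap it back into a data frame as in the text
df.scale <- as.data.frame(scale(df))

# Each column now has mean ~0 (to floating-point precision) and sd 1
round(colMeans(df.scale), 10)
apply(df.scale, 2, sd)
```

This is why the Mean rows in the later summary() output show values on the order of 1e-15 rather than exactly zero.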
I will save this to the hard drive, but you can save it anywhere that is accepted by H2O. Let's make a note of the working directory as well:
> getwd()
[1] "C:/Users/clesmeister/chap7 NN"
> write.csv(pima.scale, file="pimaScale.csv", row.names=FALSE)
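As a quick aside, the row.names=FALSE argument matters here: without it, write.csv() prepends a column of row labels, which would show up as an extra, unnamed variable when the file is read back in. A small self-contained sketch (using a temporary file and made-up values, not the actual Pima data):

```r
# Sketch: write.csv()/read.csv() round trip using a temporary file
tmp <- tempfile(fileext = ".csv")
df  <- data.frame(npreg = c(0.45, 1.05), type = c("No", "Yes"))

# row.names = FALSE keeps R from adding a spurious first column of row labels
write.csv(df, file = tmp, row.names = FALSE)
df2 <- read.csv(tmp)

names(df2)  # only the two original columns survive the round trip
```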
We can now connect to H2O and start an instance. Please note that your output will differ from what is displayed below:
> library(h2o)
> localH2O = h2o.init()
Successfully connected to http://127.0.0.1:54321/
> h2o.getConnection()
IP Address: 127.0.0.1
Port      : 54321
Session ID: _sid_8102ec2ab4585cc63b8186735b594e00
Key Count : 39
The H2O function h2o.uploadFile() allows you to upload/import your file to the H2O cloud. The following functions are also available for uploads:
h2o.importFolder
h2o.importURL
h2o.importHDFS
There are a number of arguments that you can use to upload data, but the two that we need are the path of the file and a key specification. I like to specify the path outside of the function, as follows:
> path = "C:/Users/clesmeister/chap7 NN/pimaScale.csv"
We will specify the key with destination_frame=" " in the function. It is quite simple to upload the file, and a percent indicator tracks the status:
> pima.hex = h2o.uploadFile(path=path, destination_frame="pima.hex")
  |=========================================================| 100%
The data is now in an H2OFrame, which you can verify with class(), as follows:
> class(pima.hex)
[1] "H2OFrame"
attr(,"package")
[1] "h2o"
Many of the R commands in H2O may produce a different output than what you are used to seeing. For instance, look at the structure of our data:
> str(pima.hex)
Formal class 'H2OFrame' [package "h2o"] with 4 slots
  ..@ conn      :Formal class 'H2OConnection' [package "h2o"] with 3 slots
  .. .. ..@ ip     : chr "127.0.0.1"
  .. .. ..@ port   : num 54321
  .. .. ..@ mutable:Reference class 'H2OConnectionMutableState' [package "h2o"] with 2 fields
  .. .. .. ..$ session_id: chr "_sid_8102ec2ab4585cc63b8186735b594e00"
  .. .. .. ..$ key_count : int 43
  .. .. .. ..and 13 methods, of which 1 is possibly relevant:
  .. .. .. ..  initialize
  ..@ frame_id  : chr "pima.hex"
  ..@ finalizers: list()
  ..@ mutable   :Reference class 'H2OFrameMutableState' [package "h2o"] with 4 fields
  .. ..$ ast      : NULL
  .. ..$ nrows    : num 532
  .. ..$ ncols    : int 8
  .. ..$ col_names: chr "npreg" "glu" "bp" "skin" ...
  .. ..and 13 methods, of which 1 is possibly relevant:
  .. ..  initialize
The head() and summary() functions work exactly the same. Here are the first six rows of our data in H2O, along with a summary:
> head(pima.hex)
       npreg        glu         bp       skin        bmi
1  0.4477858 -1.1300306 -0.2847739 -0.1123474 -0.3909581
2  1.0516440  2.3861862 -0.1223077  0.3627626 -1.1321178
3  0.4477858 -1.4203605  0.8524894  1.1229387  0.4228642
4 -1.0618597  1.4184201  0.3650908  1.3129827  2.1813017
5 -1.0618597 -0.4525944 -0.9346387 -0.3974135 -0.9431947
6  0.4477858 -0.7751831  0.3650908 -0.2073695  0.3937991
         ped        age type
1 -0.4033309 -0.7075782   No
2 -0.9867069  2.1730387  Yes
3 -1.0070235  0.3145762   No
4 -0.7080796 -0.5217319   No
5 -1.0737779 -0.8005013   No
6 -0.3626978  1.8942693  Yes
> summary(pima.hex)
 npreg               glu                 bp
 Min.   :-1.062e+00  Min.   :-2.098e+00  Min.   :-3.859e+00
 1st Qu.:-7.642e-01  1st Qu.:-7.220e-01  1st Qu.:-6.105e-01
 Median :-4.613e-01  Median :-1.972e-01  Median : 3.918e-02
 Mean   :-1.252e-15  Mean   :-1.557e-16  Mean   :-1.004e-17
 3rd Qu.: 4.472e-01  3rd Qu.: 6.504e-01  3rd Qu.: 6.889e-01
 Max.   : 4.071e+00  Max.   : 2.515e+00  Max.   : 3.127e+00
 skin                bmi                 ped
 Min.   :-2.108e+00  Min.   :-2.135e+00  Min.   :-1.213e+00
 1st Qu.:-6.829e-01  1st Qu.:-7.313e-01  1st Qu.:-7.116e-01
 Median :-1.847e-02  Median :-1.715e-02  Median :-2.541e-01
 Mean   : 5.828e-17  Mean   : 3.085e-16  Mean   : 6.115e-17
 3rd Qu.: 6.459e-01  3rd Qu.: 5.798e-01  3rd Qu.: 4.490e-01
 Max.   : 6.634e+00  Max.   : 4.972e+00  Max.   : 5.564e+00
 age                 type
 Min.   :-9.863e-01  No :355
 1st Qu.:-8.024e-01  Yes:177
 Median :-3.396e-01
 Mean   :-1.876e-16
 3rd Qu.: 5.915e-01
 Max.   : 4.589e+00
You can upload your own train and test partitioned datasets to H2O, or you can use the built-in functionality. I will demonstrate the latter with a 70/30 split. The first thing to do is create a vector of random, uniform numbers for our data:
> rand = h2o.runif(pima.hex, seed = 123)
You can then build your partitioned data and assign it with your desired key name, as follows:
> train = pima.hex[rand <= 0.7, ]
> train = h2o.assign(train, key = "train")
> test = pima.hex[rand > 0.7, ]
> test = h2o.assign(test, key = "test")
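The logic behind this split can be sketched in base R, without an H2O cluster: draw one uniform number per row, then send rows at or below 0.7 to the training set and the rest to the test set. This is a sketch of the idea only; h2o.runif() performs the equivalent draw on the cluster side, so the actual row assignments will differ.

```r
# Base-R analogue of the h2o.runif() 70/30 split (illustrative sketch)
set.seed(123)
n <- 532                     # rows in the combined Pima data
rand <- runif(n)             # one uniform draw per row

train.idx <- which(rand <= 0.7)
test.idx  <- which(rand > 0.7)

# The two partitions are disjoint and together cover every row
length(intersect(train.idx, test.idx))   # 0
length(train.idx) + length(test.idx)     # 532
```

Because the cut is random rather than exact, the training set will hold roughly, not exactly, 70 percent of the rows.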
With these created, it is a good idea to confirm that the response variable is balanced between the train and test sets. To do this, you can use the h2o.table() function; in our case, the response is column 8:
> h2o.table(train[,8])
H2OFrame with 2 rows and 2 columns
  type Count
1   No   253
2  Yes   124
> h2o.table(test[,8])
H2OFrame with 2 rows and 2 columns
  type Count
1   No   102
2  Yes    53
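To see that the two partitions really are balanced, it helps to compare proportions rather than raw counts. A small base-R sketch, rebuilding factors from the counts shown in the output above:

```r
# Rebuild the response factors from the counts in the h2o.table() output
train.type <- factor(rep(c("No", "Yes"), times = c(253, 124)))
test.type  <- factor(rep(c("No", "Yes"), times = c(102, 53)))

# Class proportions in train vs. test: about 67/33 in both
round(prop.table(table(train.type)), 3)  # No 0.671, Yes 0.329
round(prop.table(table(test.type)), 3)   # No 0.658, Yes 0.342
```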
This looks well balanced, so we can begin the modeling process.
As we will see, the deep learning function has quite a few arguments and parameters that you can tune. The thing that I like about the package is the ability to keep things as simple as possible and let the defaults do their work. If you want to see all the possibilities along with the defaults, consult the help files or run the following command:
> args(h2o.deeplearning)
Documentation on all the arguments and tuning parameters is available online at: http://h2o.ai/docs/master/model/deep-learning/.
On a side note, you can run a demo of the various machine learning methods by just running demo("method"). For instance, you can go through the deep learning demo with demo(h2o.deeplearning).
The critical items to specify in this example will be as follows:
training_frame = train
validation_frame = test
variable_importances = TRUE
When combined, this code will create the deep learning object, as follows:
> dlmodel <- h2o.deeplearning(x = 1:7, y = 8,
                              training_frame = train,
                              validation_frame = test,
                              seed = 123,
                              variable_importances = TRUE)
  |=========================================================| 100%
An indicator bar tracks the progress and with this relatively small dataset, it only takes a couple of seconds.
Calling the dlmodel object produces quite a bit of information. The two things that I will show here are the confusion matrices for the train and test sets:
> dlmodel
Model Details:
==============
        No Yes    Error     Rate
No     204  49 0.193676  =49/253
Yes     29  95 0.233871  =29/124
Totals 233 144 0.206897  =78/377

        No Yes    Error     Rate
No      86  16 0.156863  =16/102
Yes     18  35 0.339623   =18/53
Totals 104  51 0.219355  =34/155
The first matrix shows the performance on the train set, with the columns as the predictions and the rows as the actuals; the model achieved 79.3 percent accuracy. The false positive and false negative rates are roughly equivalent. The accuracy on the test data was 78 percent, but with a higher false negative rate.
A further exploration of the model parameters can also be called, which produces a lengthy output:
> dlmodel@allparameters
Let's have a look at the variable importance:
> dlmodel@model$variable_importances
Variable Importances:
  variable relative_importance scaled_importance percentage
1      glu            1.000000          1.000000   0.156574
2       bp            0.942461          0.942461   0.147565
3    npreg            0.910888          0.910888   0.142622
4      age            0.902482          0.902482   0.141305
5     skin            0.894196          0.894196   0.140008
6      ped            0.882988          0.882988   0.138253
7      bmi            0.853734          0.853734   0.133673
The variable importance is calculated using the so-called Gedeon method. Keep in mind that these results can be misleading. The table shows an ordering of the variables, but their contributions are all relatively similar, so the ranking is probably of limited value; this opacity is a common criticism of neural network techniques. The importance is also subject to sampling variation, and if you change the seed value, the order of variable importance can change quite a bit.
You are also able to see the predicted values and put them in a data frame, if you want. We will first create the predicted values:
> dlPredict = h2o.predict(dlmodel, newdata = test)
> dlPredict
H2OFrame with 155 rows and 3 columns
First 10 rows:
   predict        No         Yes
1       No 0.9973673 0.002632645
2      Yes 0.2167641 0.783235848
3      Yes 0.1707465 0.829253495
4      Yes 0.1609832 0.839016795
5       No 0.9740857 0.025914235
6       No 0.9957688 0.004231210
7       No 0.9989172 0.001082811
8       No 0.9342141 0.065785915
9       No 0.7045852 0.295414835
10      No 0.9003637 0.099636205
As the output defaults to the first ten observations, putting it in a data frame will give us all of the predicted values, as follows:
> dlPred = as.data.frame(dlPredict)
> head(dlPred)
  predict        No         Yes
1      No 0.9973673 0.002632645
2     Yes 0.2167641 0.783235848
3     Yes 0.1707465 0.829253495
4     Yes 0.1609832 0.839016795
5      No 0.9740857 0.025914235
6      No 0.9957688 0.004231210
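Notice how the predict label tracks the class probabilities: each row is labeled with the more probable class. The sketch below reproduces that relationship for the first three rows of the output above. This is illustrative only; H2O chooses its own classification threshold internally, so picking the larger probability is an approximation of its behavior rather than its exact rule.

```r
# Sketch: relating the class probabilities to the predicted label
# (probabilities copied from the first three rows of the output above)
dlPred.demo <- data.frame(No  = c(0.9973673, 0.2167641, 0.1707465),
                          Yes = c(0.0026326, 0.7832358, 0.8292535))

# Label each row with whichever class has the larger probability
dlPred.demo$predict <- ifelse(dlPred.demo$Yes > dlPred.demo$No, "Yes", "No")
dlPred.demo$predict  # "No" "Yes" "Yes"
```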
With this, we have completed the introduction to deep learning in R using the capabilities of the H2O package. It is simple to use while offering plenty of flexibility to tune the hyperparameters in order to optimize the model fit. Enjoy!