Building a classification model using H2O RandomForest

H2O provides multiple algorithms for building classification models. In this chapter, we will focus on tree ensembles again, but we are going to demonstrate their usage in the context of our sensor data problem.

We have already prepared data which we can use directly to build the H2O RandomForest model. To transfer it into the H2O format, we need to create an H2OContext and then call the corresponding transformation:

import org.apache.spark.h2o._ 
val h2oContext = H2OContext.getOrCreate(sc) 
 
val trainHF = h2oContext.asH2OFrame(trainingData, "trainHF") 
trainHF.setNames(columnNames) 
trainHF.update() 
val testHF = h2oContext.asH2OFrame(testData, "testHF") 
testHF.setNames(columnNames) 
testHF.update() 

We created two tables referenced by the names trainHF and testHF. The code also updates the column names by calling the method setNames, since the input RDD does not carry information about columns. The important step is the call to the update method, which saves the changes into H2O's distributed memory store. This is an important pattern exposed by the H2O API: all changes made on an object are done locally; to make them visible to other computation nodes, it is necessary to save them into the memory store, the so-called distributed key-value store (DKV).

Having the data stored as H2O tables, we can open the H2O Flow user interface by calling h2oContext.openFlow and explore the data graphically. For example, the distribution of the activityId column as a numeric feature is shown in Figure 4:

Figure 4: The view of numeric column activityId which needs transformation to categorical type.

We can directly compare the results and verify that we observe the right distribution with a piece of Spark code:

println(s"""^Distribution of activityId:
^${table(Seq("activityId", "Count"),
testData.map(row => (row.label, 1)).reduceByKey(_ + _).collect.sortBy(_._1),
Map.empty[Int, String])}
""".stripMargin('^'))

The output is as follows:
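The table helper used in the snippet above belongs to the book's shared utility code; a minimal sketch of such a formatter, assuming each row tuple has the same arity as the header, might look like this:

```scala
// Sketch of a table-formatting helper similar to the one used in this chapter:
// renders a header row and tuple rows as aligned, pipe-separated text columns.
// The formats map assigns printf-style formats to columns (indexed from 0).
def table(header: Seq[String],
          rows: Seq[Product],
          formats: Map[Int, String] = Map.empty): String = {
  val body = rows.map(_.productIterator.zipWithIndex.map {
    case (value, i) => formats.get(i).map(_.format(value)).getOrElse(value.toString)
  }.toSeq)
  val widths = (header +: body).transpose.map(col => col.map(_.length).max)
  (header +: body)
    .map(row => row.zip(widths).map { case (cell, w) => cell.padTo(w, ' ') }.mkString(" | "))
    .mkString("\n")
}
```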

The next step is to prepare the input data for running H2O algorithms. First, we need to verify that the column types are in the form expected by the algorithms. The H2O Flow UI provides a list of columns with basic attributes (Figure 5):

Figure 5: Columns of imported training dataset shown in Flow UI.

We can see that the activityId column is numeric; however, to perform classification, H2O requires the column to be categorical. So we need to transform the column, either by clicking Convert to enum in the UI or programmatically:

trainHF.replace(0, trainHF.vec(0).toCategoricalVec).remove 
trainHF.update 
testHF.replace(0, testHF.vec(0).toCategoricalVec).remove 
testHF.update 

Again, we need to update the modified frame in the memory store by calling the update method. Furthermore, since we are transforming a vector to another vector type and no longer need the original vector, we can call the remove method on the result of the replace call.

After the transformation, the activityId column is categorical; however, the vector domain contains the values "0", "1", ..., "6" - they are stored in the field trainHF.vec("activityId").domain. Nevertheless, we can update the vector with the actual category names. We have already prepared an index-to-name transformation called idx2Activity - hence we prepare a new domain and update the activityId vector domain for the training and test tables:

val domain = trainHF.vec(0).domain.map(i => idx2Activity(i.toDouble)) 
trainHF.vec(0).setDomain(domain) 
water.DKV.put(trainHF.vec(0)) 
testHF.vec(0).setDomain(domain) 
water.DKV.put(testHF.vec(0)) 

In this case, we need to update the modified vector in the memory store as well - instead of calling the update method, the code makes an explicit call to the method water.DKV.put, which directly saves the object into the memory store.

In the UI, we can again explore the activityId column of the test dataset and compare it with the computed results - Figure 6:

Figure 6: The column activityId values distribution in test dataset.

At this point, we have prepared the data to perform model building. The configuration of H2O RandomForest for a classification problem follows the same pattern we introduced in the previous chapter:

import _root_.hex.tree.drf.DRF 
import _root_.hex.tree.drf.DRFModel 
import _root_.hex.tree.drf.DRFModel.DRFParameters 
import _root_.hex.ScoreKeeper._ 
import _root_.hex.ConfusionMatrix 
import water.Key.make 
 
val drfParams = new DRFParameters 
drfParams._train = trainHF._key 
drfParams._valid = testHF._key 
drfParams._response_column = "activityId" 
drfParams._max_depth = 20 
drfParams._ntrees = 50 
drfParams._score_each_iteration = true 
drfParams._stopping_rounds = 2 
drfParams._stopping_metric = StoppingMetric.misclassification 
drfParams._stopping_tolerance = 1e-3 
drfParams._seed = 42 
drfParams._nbins = 20 
drfParams._nbins_cats = 1024 
 
 
val drfModel = new DRF(drfParams, make[DRFModel]("drfModel")).trainModel.get 

There are several important differences which distinguish the H2O algorithm from Spark. The first is that we can directly specify a validation dataset as an input parameter (the _valid field). This is not strictly necessary, since we can perform validation after the model is built; however, when the validation dataset is specified, we can track the quality of the model in real time during building and stop model building if we consider the model good enough (see Figure 7 - the "Cancel Job" action stops training, but the model is still available for further actions). Furthermore, we can later continue model building and append more trees if demanded. The parameter _score_each_iteration controls how often scoring should be performed:

Figure 7: Model training can be tracked in Flow UI and also stopped by pressing "Cancel Job" button.

Another difference is represented by the parameters _nbins, _nbins_top_level, and _nbins_cats. The Spark RandomForest implementation accepts the parameter maxBins, which controls the discretization of continuous features. In the H2O case, it corresponds to the parameter _nbins. However, the H2O machine learning platform allows finer-grained tuning of discretization. Since top-level splits are the most important and can suffer from loss of information due to discretization, H2O permits a temporary increase in the number of discrete categories for top-level splits via the parameter _nbins_top_level. Furthermore, high-cardinality categorical features (> 1,024 levels) often degrade the performance of computation by forcing an algorithm to consider all possible splits into two distinct subsets. Since there are 2^N subsets for N categorical levels, finding split points for these features can be expensive. For such cases, H2O provides the parameter _nbins_cats, which controls the number of categorical levels - if a feature contains more categorical levels than the value stored in the parameter, then the values are re-binned to fit into _nbins_cats bins.
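For instance, building on the drfParams object defined above, the discretization could be tuned as follows (the concrete bin counts are only illustrative, not recommendations):

```scala
// Illustrative discretization tuning (example values only):
drfParams._nbins = 20             // bins for continuous features at lower tree levels
drfParams._nbins_top_level = 512  // finer discretization for top-level splits
drfParams._nbins_cats = 1024      // re-bin categorical features with more levels than this
```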

The last important difference is that we specified an additional stopping criterion together with the traditional depth and number of trees in the ensemble. The criterion tracks the improvement of the misclassification rate computed on validation data - in this case, we specified that model building should stop if two consecutive scoring measurements on validation data (the field _stopping_rounds) do not improve by 0.001 (the value of the field _stopping_tolerance). This is a perfect criterion if we know the expected quality of the model and would like to limit model training time. In our case, we can explore the number of trees in the resulting ensemble:

println(s"Number of trees: ${drfModel._output._ntrees}") 

The output is as follows:

Even though we demanded 50 trees, the resulting model has only 14 trees, since model training was stopped because the misclassification rate did not improve with respect to the given threshold.

The H2O API exposes multiple stopping criteria which can be used by any of the algorithms - a user can use the AUC value for binomial problems or MSE for regression problems. This is one of the most powerful features, allowing you to decrease computation time when a huge space of hyper-parameters is explored.
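For example, for a binomial problem, the same early-stopping mechanism could be driven by AUC instead of misclassification; a sketch, reusing the StoppingMetric enumeration imported above on a hypothetical parameter object binomialParams:

```scala
// Sketch: early stopping on AUC for a binomial problem
// (binomialParams stands for any H2O tree algorithm's parameter object)
binomialParams._stopping_rounds = 3                   // look at 3 consecutive scoring events
binomialParams._stopping_metric = StoppingMetric.AUC  // track AUC on the validation frame
binomialParams._stopping_tolerance = 1e-4             // stop if AUC improves by less than 0.0001
```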

The quality of the model can be explored in two ways: (1) directly, by using the Scala API and accessing the model field _output, which carries all output metrics, or (2) using the graphical interface to explore metrics in a more user-friendly way. For example, the Confusion matrix on the specified validation set can be displayed as part of the model view directly in the Flow UI. Refer to the following figure:

Figure 8: Confusion matrix for initial RandomForest model composed of 14 trees.

It directly gives us the error rate (0.22%) and the misclassification per class, and we can compare the results directly with the accuracy computed using the Spark model. Furthermore, the Confusion matrix can be used to compute the additional metrics which we explored earlier.

For example, we can compute recall, precision, and F-1 metrics per class. We can simply transform H2O's Confusion matrix to a Spark Confusion matrix and reuse all the defined methods. However, we have to be careful not to confuse actual and predicted values in the resulting Confusion matrix (the Spark matrix has predicted values in columns while the H2O matrix has them in rows):

import org.apache.spark.mllib.linalg.{Matrices, Matrix}

val drfCM = drfModel._output._validation_metrics.cm
def h2oCM2SparkCM(h2oCM: ConfusionMatrix): Matrix = {
  // Matrices.dense is column-major, so flattening H2O's row-major _cm
  // effectively transposes the matrix into Spark's orientation
  Matrices.dense(h2oCM.size, h2oCM.size, h2oCM._cm.flatMap(x => x))
}
val drfSparkCM = h2oCM2SparkCM(drfCM) 
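The helpers rfRecall, rfPrecision, and rfF1 were defined in the previous chapter; a minimal sketch, assuming a Spark Matrix with actual classes in rows and predicted classes in columns, might look like this:

```scala
import org.apache.spark.mllib.linalg.Matrix

// Per-class metrics over a confusion matrix with actual classes in rows
// and predicted classes in columns (sketch of the previous chapter's helpers)
def rfRecall(cm: Matrix, i: Int): Double =
  cm(i, i) / (0 until cm.numCols).map(j => cm(i, j)).sum  // TP / actual count (row sum)

def rfPrecision(cm: Matrix, i: Int): Double =
  cm(i, i) / (0 until cm.numRows).map(j => cm(j, i)).sum  // TP / predicted count (column sum)

def rfF1(cm: Matrix, i: Int): Double = {
  val r = rfRecall(cm, i)
  val p = rfPrecision(cm, i)
  2 * p * r / (p + r)                                     // harmonic mean of precision and recall
}
```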

You can see that the computed metrics for the specified validation dataset are stored in the model output field _output._validation_metrics. It contains the Confusion matrix, but also additional information about model performance tracked during training. We simply transformed the H2O representation into a Spark matrix; now we can easily compute macro-performance per class:

val drfPerClassSummary = drfCM._domain.indices.map { i =>
  (drfCM._domain(i), rfRecall(drfSparkCM, i), rfPrecision(drfSparkCM, i), rfF1(drfSparkCM, i))
}

println(s"""^Per class summary
^${table(Seq("Label", "Recall", "Precision", "F-1"),
drfPerClassSummary,
Map(1 -> "%.4f", 2 -> "%.4f", 3 -> "%.4f"))}
""".stripMargin('^'))

The output is as follows:

You can see that the results are slightly better than the Spark results computed before, even though H2O used fewer trees. The explanation lies in the H2O implementation of the RandomForest algorithm - H2O uses an algorithm based on generating a regression decision tree per output class, an approach often referred to as a "one-versus-all" scheme. This algorithm allows more fine-grained optimization with respect to individual classes. Hence, in this case, 14 RandomForest trees are internally represented by 14 * 7 = 98 internal decision trees.

The reader can find more explanation about the benefits of a "one-versus-all" scheme for multiclass classification problems in the paper In Defense of One-Vs-All Classification by Ryan Rifkin and Aldebaro Klautau. The authors show that the scheme is as accurate as any other approach; on the other hand, the algorithm forces the generation of more decision trees, which can negatively influence computation time and memory consumption.

We can explore more properties of the trained model. One of the important RandomForest metrics is variable importance. It is stored under the model's field _output._varimp. The object contains raw values, which can be scaled by calling the scaled_values method, or relative importances can be obtained by calling the summary method. Nevertheless, they can also be explored visually in the Flow UI, as shown in Figure 9. The graph shows that the most significant features are the measured temperatures from all three sensors, followed by various movement data. Surprisingly, contrary to our expectation, the heart rate is not included among the top-level features:
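Programmatic access to the importances can be sketched as follows (assuming, in addition to the methods named above, that the feature names are stored in the field _names alongside the values):

```scala
// Sketch: print the top-10 scaled variable importances of the model
// (_names is assumed to hold the feature names aligned with the values)
val varimp = drfModel._output._varimp
varimp._names.zip(varimp.scaled_values)
  .sortBy { case (_, importance) => -importance }
  .take(10)
  .foreach { case (name, importance) => println(f"$name%-30s $importance%.4f") }
```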

Figure 9: Variable importance for model "drfModel". The most important features include measured temperature.

If we are not satisfied with the quality of the model, it can be extended by more trees. We can reuse defined parameters and modify them in the following way:

  • Set the desired number of trees in the resulting ensemble (for example, 20).
  • Disable the early stopping criterion to avoid stopping model training before reaching the demanded number of trees.
  • Configure a so-called model checkpoint to point to the previously trained model. The model checkpoint is a unique feature of the H2O machine learning platform, available for all published models. It is useful in situations when you need to improve a given model by performing more training iterations.

After that, we can launch model building again. In this case, the H2O platform simply continues model training, reconstructs the model state, and builds and appends new trees into a new model:

drfParams._ntrees = 20 
drfParams._stopping_rounds = 0 
drfParams._checkpoint = drfModel._key 
 
val drfModel20 = new DRF(drfParams, make[DRFModel]("drfModel20")).trainModel.get 
println(s"Number of trees: ${drfModel20._output._ntrees}") 

The output is as follows:

In this case, only 6 trees were built - to see that, we can explore the model training output in the console and find the line which concludes the model training report:

The 6th tree was generated in 2 seconds, and it was the last tree appended to the existing ensemble, creating a new model. We can again explore the Confusion matrix of the newly built model and see the improvement in the overall error rate from 0.23% to 0.2% (see Figure 10):

Figure 10: Confusion matrix for RandomForest model with 20 trees.