Grid search

As with most algorithms in MLlib and H2O, there are many hyper-parameters to choose from, and they can have a significant effect on the performance of the model. Given the huge number of possible combinations, is there an intelligent way to find out which combinations look more promising than others? Thankfully, the answer is an emphatic "YES!" and the solution is known as a grid search, which is ML-speak for training many models, one for each combination of hyper-parameter values taken from a predefined set.
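To make the idea concrete before we involve any real model, here is a minimal sketch of a grid search in plain Scala. Everything in it is an assumption made for illustration: the parameter values are arbitrary, and toyScore is a hypothetical stand-in for "train a model and evaluate it", not part of MLlib or H2O.

```scala
// Hypothetical hyper-parameter values, chosen only for illustration.
val numTrees = Seq(15, 20)
val impurities = Seq("entropy", "gini")
val depths = Seq(20, 30)

// Every combination of the values above (the Cartesian product of the lists).
// The last generator varies fastest, so the grid starts with (15, "entropy", 20).
val grid = for {
  n <- numTrees
  i <- impurities
  d <- depths
} yield (n, i, d)

// Stand-in for training and scoring a model. The formula is arbitrary
// and exists only so that the sketch runs without a dataset.
def toyScore(params: (Int, String, Int)): Double = {
  val (n, i, d) = params
  n * 0.01 + (if (i == "gini") 0.05 else 0.0) - d * 0.001
}

// Evaluate every combination and keep the one with the highest score.
val results = grid.map(p => (p, toyScore(p)))
val best = results.maxBy(_._2)

println(s"Tried ${grid.size} combinations")
println(s"Best combination: ${best._1}")
```

With three parameters of two values each, the grid holds 2*2*2=8 combinations. A real grid search follows the same shape; only the scoring step becomes expensive, because each combination requires training a full model.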

Let's try running a simple grid search using the RF algorithm. In this case, the RF model builder is invoked for each combination of parameters from a defined hyper-space of parameters:

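val rfGrid =
  for (
    gridNumTrees <- Array(15, 20);
    gridImpurity <- Array("entropy", "gini");
    gridDepth <- Array(20, 30);
    gridBins <- Array(20, 50)) yield {
    // Train one binary classifier (2 classes, no categorical features) per combination
    val gridModel = RandomForest.trainClassifier(trainingData, 2, Map[Int, Int](), gridNumTrees, "auto", gridImpurity, gridDepth, gridBins)
    // Score the model on the hold-out set; computeMetrics and computeError are helpers defined outside this snippet
    val gridAUC = computeMetrics(gridModel, testData).areaUnderROC
    val gridErr = computeError(gridModel, testData)
    // Keep the parameters together with both metrics
    ((gridNumTrees, gridImpurity, gridDepth, gridBins), gridAUC, gridErr)
  }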

What we have just written is a for-loop that tries every combination of the listed values for the number of trees, the impurity type, the depth of the trees, and the number of bins. For each model created from one of these hyper-parameter combinations, we score the trained model against our hold-out set and compute the AUC metric and the overall error rate. In total, we get 2*2*2*2=16 models. Again, your models will be slightly different from the ones we show here, but your output should resemble something like this:

Look at the first entry of our output:

|(15,entropy,20,20)|0.697|0.302|

We can interpret this as follows: for the combination of 15 decision trees, using entropy as our impurity measure, along with a tree depth of 20 (for each tree) and a bin value of 20, our AUC is 0.697 and our error rate is 0.302. Note that the results appear in the order in which the for-loop generates the combinations, with the last parameter listed (the bins) varying fastest. For our grid search using the RF algorithm, we can easily get the combination of hyper-parameters that produces the highest AUC:

val rfParamsMaxAUC = rfGrid.maxBy(g => g._2)
println(f"RF Model: Parameters ${rfParamsMaxAUC._1}%s producing max AUC = ${rfParamsMaxAUC._2}%.3f (error = ${rfParamsMaxAUC._3}%.3f)")

The output is as follows:
