Gradient boosting machine

So far, the best AUC we have been able to muster is 0.698, from a random forest with 15 decision trees. Now, let's go through the same process: running a single gradient boosted machine with hardcoded hyper-parameters, and then performing a grid search over those parameters to see whether this algorithm can achieve a higher AUC.

Recall that a GBM differs slightly from an RF due to its iterative nature: it builds trees one after another, each trying to reduce an overall loss function that we declare beforehand. Within MLlib, there are three different loss functions to choose from as of Spark 1.6.0:

  • Log-loss: Use this loss function for classification tasks (note that Spark's GBM only supports binary classification; if you wish to use a GBM for multi-class classification, please use H2O's implementation, which we will show in the next chapter).
  • Squared-error: Use this loss function for regression tasks; it is the current default loss function for this type of problem.
  • Absolute-error: Another loss function available for regression tasks. Because it takes the absolute difference between the predicted and actual values, it handles outliers much better than the squared error.
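To make the outlier point concrete, here is a small, self-contained comparison (the numbers are made up purely for illustration) of how a single large residual affects the two regression losses:

```scala
// Illustrative only: one large residual dominates the squared error
// but contributes only linearly to the absolute error.
val actual    = Array(1.0, 2.0, 3.0, 100.0)   // the last point is an outlier
val predicted = Array(1.5, 2.5, 3.5, 4.0)

val residuals = actual.zip(predicted).map { case (a, p) => a - p }

// Mean squared error: the outlier's residual (96) is squared to 9216.
val mse = residuals.map(r => r * r).sum / residuals.length   // 2304.1875

// Mean absolute error: the outlier contributes just 96 before averaging.
val mae = residuals.map(math.abs).sum / residuals.length     // 24.375
```

The squared error is almost two orders of magnitude larger here, driven entirely by the single outlying point.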

Given our task of binary classification, we will employ the log-loss function and begin by building a 10-tree GBM model:

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.configuration.Algo

val gbmStrategy = BoostingStrategy.defaultParams(Algo.Classification)
gbmStrategy.setNumIterations(10)
gbmStrategy.setLearningRate(0.1)
gbmStrategy.treeStrategy.setNumClasses(2)
gbmStrategy.treeStrategy.setMaxDepth(10)
gbmStrategy.treeStrategy.setCategoricalFeaturesInfo(java.util.Collections.emptyMap[Integer, Integer])

val gbmModel = GradientBoostedTrees.train(trainingData, gbmStrategy)

Notice that we must declare a boosting strategy before we can build the model. The reason is that MLlib does not know beforehand what type of problem we are tackling: classification or regression? The strategy tells Spark that this is a binary classification problem and that the declared hyper-parameters should be used to build the model.
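For completeness, the loss can also be set explicitly on the strategy. The following is a minimal sketch, assuming the `gbmStrategy` defined above; `defaultParams(Algo.Classification)` already selects log-loss, and the two regression losses live in the same package:

```scala
import org.apache.spark.mllib.tree.loss.{AbsoluteError, LogLoss, SquaredError}

// LogLoss is already the default for Algo.Classification; shown explicitly here.
gbmStrategy.setLoss(LogLoss)

// For a regression strategy, SquaredError or AbsoluteError would be set instead.
```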

Following are some hyper-parameters to keep in mind when training GBMs:

  • numIterations: By definition, a GBM builds trees one at a time in order to minimize the declared loss function. This hyper-parameter controls the number of trees to build; be careful not to build too many, since each additional tree increases training and scoring time and can eventually overfit the training data.
  • loss: Which loss function to use depends on the question being asked and the dataset, as described earlier.
  • learningRate: Controls the speed of learning. Lower values (< 0.1) mean slower learning but improved generalization; however, a lower rate also needs a higher number of iterations and hence longer computation time.
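To see what learningRate does mechanically, the following sketch (plain Scala, with hypothetical per-tree outputs rather than real trees) applies the additive update a GBM performs at each iteration, F_m(x) = F_(m-1)(x) + learningRate · h_m(x):

```scala
val learningRate = 0.1
// Hypothetical outputs h_1(x)..h_4(x) of four boosted trees for one point x.
val treeOutputs = Seq(2.0, 1.5, 1.0, 0.5)

// scanLeft keeps the running ensemble prediction after each iteration:
// each new tree only nudges the prediction by a damped step.
val stagedPredictions = treeOutputs.scanLeft(0.0)((f, h) => f + learningRate * h)

stagedPredictions.foreach(p => println(f"$p%.2f"))  // 0.00, 0.20, 0.35, 0.45, 0.50
```

With a smaller rate, each tree contributes less, so more iterations are needed to reach the same fitted value, which is exactly the trade-off described above.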

Let's score this model against the hold-out set and compute our AUC:

val gbmTestErr = computeError(gbmModel, testData)
println(f"GBM Model: Test Error = ${gbmTestErr}%.3f")
val gbmMetrics = computeMetrics(gbmModel, testData)
println(f"GBM Model: AUC on Test Data = ${gbmMetrics.areaUnderROC()}%.3f")

The output is as follows:

As a final step, we will perform a grid search over a few hyper-parameters (3 × 4 × 2 = 24 combinations in total) and, as in our previous RF grid search example, output the combinations and their respective errors and AUC calculations:

val gbmGrid =  
for ( 
  gridNumIterations <- Array(5, 10, 50); 
  gridDepth <- Array(2, 3, 5, 7); 
  gridLearningRate <- Array(0.1, 0.01))  
yield { 
  gbmStrategy.numIterations = gridNumIterations 
  gbmStrategy.treeStrategy.maxDepth = gridDepth 
  gbmStrategy.learningRate = gridLearningRate 
 
  val gridModel = GradientBoostedTrees.train(trainingData, gbmStrategy) 
  val gridAUC = computeMetrics(gridModel, testData).areaUnderROC 
  val gridErr = computeError(gridModel, testData) 
  ((gridNumIterations, gridDepth, gridLearningRate), gridAUC, gridErr) 
} 

We can print the first 10 lines of the result sorted by AUC:

println(
s"""GBM Model: Grid results:
|
${table(Seq("iterations, depth, learningRate", "AUC", "error"), gbmGrid.sortBy(-_._2).take(10), format = Map(1 -> "%.3f", 2 -> "%.3f"))}
""".stripMargin)

The output is as follows:

And we can easily retrieve the parameter combination that produced the maximal AUC:

val gbmParamsMaxAUC = gbmGrid.maxBy(g => g._2) 
println(f"GBM Model: Parameters ${gbmParamsMaxAUC._1}%s producing max AUC = ${gbmParamsMaxAUC._2}%.3f (error = ${gbmParamsMaxAUC._3}%.3f)") 

The output is as follows:
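Note that the grid loop above mutates gbmStrategy in place, so after it finishes the strategy holds whatever combination ran last, not the best one. Here is a small sketch (reusing trainingData and gbmStrategy from earlier) that rebuilds the winning model before reusing it:

```scala
// Restore the best parameter triple found by the grid search.
val (bestIters, bestDepth, bestRate) = gbmParamsMaxAUC._1
gbmStrategy.numIterations = bestIters
gbmStrategy.treeStrategy.maxDepth = bestDepth
gbmStrategy.learningRate = bestRate

// Retrain the GBM with the winning configuration.
val bestGbmModel = GradientBoostedTrees.train(trainingData, gbmStrategy)
```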
