Spark GBM model

Finally, we move on to the gradient boosting machine (GBM), the last model in our ensemble. In the previous chapters we used H2O's version of GBM; here we stay with Spark and use its implementation of GBM, as follows:

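import org.apache.spark.ml.classification.{GBTClassifier, GBTClassificationModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val gbmModelPath = s"$MODELS_DIR/gbmModel"
val gbmModel = {
  // Train a GBM on the output of the idf stage: 20 boosting iterations, trees of depth at most 6
  val model = new GBTClassifier()
      .setFeaturesCol(idf.getOutputCol)
      .setLabelCol("label")
      .setMaxIter(20)
      .setMaxDepth(6)
      .setCacheNodeIds(true)
      .fit(trainData)
  // Score the held-out test set
  val gbmPrediction = model.transform(testData)
  gbmPrediction.show()
  // Compute AUC (the evaluator's default metric). The predicted label column is
  // passed as the score here, so the ROC curve is built from hard 0/1 predictions.
  val gbmAUC = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol(model.getPredictionCol)
      .evaluate(gbmPrediction)
  println(s"GBM AUC on test data: $gbmAUC")
  // Persist the fitted model, replacing any previously saved copy
  model.write.overwrite().save(gbmModelPath)
  model
}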
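The evaluator above receives `model.getPredictionCol`, that is, the predicted 0/1 label, as its score column. To illustrate what that implies for the reported AUC, here is a minimal sketch in plain Scala (no Spark required). It uses the pairwise-ranking view of AUC: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. The object `AucSketch`, its helper `auc`, the 0.5 threshold, and the six toy examples are all invented for illustration; they are not part of Spark or of this chapter's data.

```scala
// Plain-Scala sketch; AucSketch and its toy data are hypothetical, not Spark APIs.
object AucSketch {

  /** AUC as the fraction of (positive, negative) pairs in which the positive
    * example receives the higher score; tied pairs count as one half.
    * Each element is (score, label), with label 1.0 = positive, 0.0 = negative. */
  def auc(scoreAndLabel: Seq[(Double, Double)]): Double = {
    val positives = scoreAndLabel.collect { case (score, label) if label == 1.0 => score }
    val negatives = scoreAndLabel.collect { case (score, label) if label == 0.0 => score }
    require(positives.nonEmpty && negatives.nonEmpty, "both classes must be present")

    val pairCredits = for (p <- positives; n <- negatives) yield {
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    }
    pairCredits.sum / (positives.size.toLong * negatives.size)
  }

  // Toy data: continuous scores for three positives and three negatives.
  val softScores: Seq[(Double, Double)] = Seq(
    (0.9, 1.0), (0.8, 1.0), (0.4, 1.0),
    (0.7, 0.0), (0.3, 0.0), (0.2, 0.0)
  )

  // The same examples after thresholding the score at 0.5 (hard 0/1 predictions).
  val hardPredictions: Seq[(Double, Double)] =
    softScores.map { case (score, label) => (if (score >= 0.5) 1.0 else 0.0, label) }

  def main(args: Array[String]): Unit = {
    println(s"AUC from continuous scores: ${auc(softScores)}")      // 8 of 9 pairs ranked correctly
    println(s"AUC from hard predictions:  ${auc(hardPredictions)}") // ties reduce this to 6 of 9
  }
}
```

On this toy data, thresholding turns several correctly ordered pairs into ties, so the AUC drops from 8/9 to 6/9. The same effect can apply whenever a hard prediction column is evaluated in place of a continuous score, which is worth keeping in mind when comparing AUC values across models.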

We have now trained four different learning algorithms: a (single) decision tree, a random forest, Naive Bayes, and a gradient boosting machine. Each yields a different AUC, as summarized in the following table. The best-performing model is the random forest, followed by GBM. However, it is fair to say that we did not perform an exhaustive search over the GBM's parameters, nor did we use as high a number of iterations as is usually recommended:

Model	AUC
Decision tree	0.659
Naive Bayes	0.484
Random forest	0.769
GBM	0.755
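As noted, the GBM was trained with a single, fixed set of parameters. A parameter search could be sketched with Spark's `ParamGridBuilder` and `CrossValidator`, as shown below. This is only an outline under stated assumptions: the object `GbmTuningSketch`, its methods, the grid values (iterations, depth, step size), and the choice of three folds are illustrative and were not used to produce the numbers in the table. The evaluator mirrors the one used earlier in this section.

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.DataFrame

// Hypothetical helper object; names and grid values are assumptions for illustration.
object GbmTuningSketch {

  /** Candidate settings: 3 iteration counts x 2 depths x 2 step sizes = 12 combinations. */
  def buildGrid(gbt: GBTClassifier): Array[ParamMap] =
    new ParamGridBuilder()
      .addGrid(gbt.maxIter, Array(20, 50, 100))
      .addGrid(gbt.maxDepth, Array(4, 6))
      .addGrid(gbt.stepSize, Array(0.05, 0.1))
      .build()

  /** Cross-validates every combination in the grid and refits the best one on all of trainData.
    * featuresCol is the name of the feature vector column, e.g. idf.getOutputCol. */
  def tuneGbm(trainData: DataFrame, featuresCol: String): CrossValidatorModel = {
    val gbt = new GBTClassifier()
      .setFeaturesCol(featuresCol)
      .setLabelCol("label")
      .setCacheNodeIds(true)

    // Same evaluator setup as in the section above; its default metric is areaUnderROC.
    val evaluator = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol(gbt.getPredictionCol)

    new CrossValidator()
      .setEstimator(gbt)
      .setEstimatorParamMaps(buildGrid(gbt))
      .setEvaluator(evaluator)
      .setNumFolds(3)
      .fit(trainData)
  }
}
```

A call such as `GbmTuningSketch.tuneGbm(trainData, idf.getOutputCol)` would return a `CrossValidatorModel` whose `bestModel` holds the winning GBM and whose `avgMetrics` holds the mean AUC for each combination. With 12 combinations and 3 folds this trains 36 models plus a final refit, so the search is considerably more expensive than the single fit shown earlier.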
