Spark random forest model

Next, we move on to the random forest algorithm, which, as you will recall from the previous chapters, is an ensemble of decision trees. We again perform a grid search, this time varying the number of trees, the impurity measure, and the maximum tree depth, all of which will be familiar hyperparameters:

import org.apache.spark.ml.classification.{RandomForestClassifier, RandomForestClassificationModel}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val rfModelPath = s"$MODELS_DIR/rfModel"
val rfModel = {
  // Train one forest per (numTrees, impurity, depth) combination: 2 x 2 x 2 = 8 models
  val rfGridSearch = for (
    rfNumTrees <- Array(10, 15);
    rfImpurity <- Array("entropy", "gini");
    rfDepth <- Array(3, 5))
    yield {
      println(s"Training random forest: numTrees: $rfNumTrees, impurity: $rfImpurity, depth: $rfDepth")
      val rfModel = new RandomForestClassifier()
        .setFeaturesCol(idf.getOutputCol)
        .setLabelCol("label")
        .setNumTrees(rfNumTrees)
        .setImpurity(rfImpurity)
        .setMaxDepth(rfDepth)
        .setMaxBins(10)
        .setSubsamplingRate(0.67)
        .setSeed(42)
        .setCacheNodeIds(true)
        .fit(trainData)
      // Score the held-out test data; the evaluator's default metric is area under the ROC curve
      val rfPrediction = rfModel.transform(testData)
      val rfAUC = new BinaryClassificationEvaluator()
        .setLabelCol("label")
        .evaluate(rfPrediction)
      println(s"  RF AUC on test data: $rfAUC")
      ((rfNumTrees, rfImpurity, rfDepth), rfModel, rfAUC)
    }
  // Show the top five configurations by AUC, best first
  println(rfGridSearch.sortBy(-_._3).take(5).mkString(" "))
  val bestModel = rfGridSearch.sortBy(-_._3).head._2
  // Stress that the model is minimal because of the defined grid space
  bestModel.write.overwrite.save(rfModelPath)
  bestModel
}

The highest AUC found by our grid search is 0.769.
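
The shape of this search can be tried out without Spark. The following is a minimal sketch rather than the code used in this chapter: it enumerates the same 2 x 2 x 2 grid with a for comprehension and selects the winner with the same negated-score sort. The names GridSearchSketch and mockScore are hypothetical, and mockScore is only a stand-in for training a model and computing its AUC; the numbers it returns are arbitrary and say nothing about which random forest settings actually perform better.

```scala
object GridSearchSketch {
  // Hypothetical stand-in for "fit a model and evaluate its AUC". The formula
  // is arbitrary and only gives every configuration a distinct number.
  def mockScore(numTrees: Int, impurity: String, depth: Int): Double =
    numTrees * 0.01 + depth * 0.02 + (if (impurity == "gini") 0.005 else 0.0)

  // The for comprehension yields the Cartesian product of the three arrays,
  // so every combination of hyperparameters is visited exactly once.
  val grid: Array[((Int, String, Int), Double)] = for {
    numTrees <- Array(10, 15)
    impurity <- Array("entropy", "gini")
    depth    <- Array(3, 5)
  } yield ((numTrees, impurity, depth), mockScore(numTrees, impurity, depth))

  // Sorting by the negated score places the highest-scoring entry first.
  val best: ((Int, String, Int), Double) = grid.sortBy(-_._2).head

  def main(args: Array[String]): Unit = {
    println(s"Evaluated ${grid.length} configurations")
    grid.sortBy(-_._2).foreach { case (params, score) =>
      println(s"$params -> $score")
    }
    println(s"Best configuration: ${best._1}")
  }
}
```

Because every extra value in any of the three arrays multiplies the number of models to train, the size of the grid is the main cost to keep in mind when widening the search.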
