Spark decision tree model

Let's start with a simple decision tree and perform a grid search over a few of its hyper-parameters. We will follow the code from Chapter 2, Detecting Dark Matter: The Higgs-Boson Particle, to build models and select the one that maximizes the AUC statistic. However, instead of using models from the MLlib library, we will adopt models from the Spark ML package. The motivation for using the ML package will become clearer later, when we need to compose the models into a pipeline. In the following code, we use DecisionTreeClassifier, which we fit to trainData, generate predictions for testData, and evaluate the model's AUC with the help of BinaryClassificationEvaluator:

import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import java.io.File
val dtModelPath = s"$MODELS_DIR/dtModel"
val dtModel = {
  // Train one tree for every (impurity, depth) combination
  val dtGridSearch = for (
    dtImpurity <- Array("entropy", "gini");
    dtDepth <- Array(3, 5))
    yield {
      println(s"Training decision tree: impurity $dtImpurity, depth: $dtDepth")
      val dtModel = new DecisionTreeClassifier()
        .setFeaturesCol(idf.getOutputCol)
        .setLabelCol("label")
        .setImpurity(dtImpurity)
        .setMaxDepth(dtDepth)
        .setMaxBins(10)
        .setSeed(42)
        .setCacheNodeIds(true)
        .fit(trainData)
      val dtPrediction = dtModel.transform(testData)
      // The evaluator's default metric is areaUnderROC
      val dtAUC = new BinaryClassificationEvaluator().setLabelCol("label")
        .evaluate(dtPrediction)
      println(s" DT AUC on test data: $dtAUC")
      ((dtImpurity, dtDepth), dtModel, dtAUC)
    }
  // Report the results ordered by AUC, best first
  println(dtGridSearch.sortBy(-_._3).take(5).mkString(" "))
  val bestModel = dtGridSearch.sortBy(-_._3).head._2
  // Persist the winner so that it can be reloaded without retraining
  bestModel.write.overwrite.save(dtModelPath)
  bestModel
}
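
The grid search itself is plain Scala and does not depend on Spark. As a minimal sketch of that control flow, the following snippet replaces model training with a hypothetical fakeAuc function (its scores are made up for illustration and are not real results) and shows how two generators in one for-comprehension enumerate every parameter combination, and how sorting by the negated score puts the best candidate first:

```scala
// Hypothetical stand-in for "train a model and return its AUC".
// The numbers are invented purely to illustrate the control flow.
def fakeAuc(impurity: String, depth: Int): Double =
  (if (impurity == "gini") 0.60 else 0.55) + depth * 0.01

// Two generators in one for-comprehension enumerate every
// (impurity, depth) pair: 2 x 2 = 4 combinations.
val grid = for (
  impurity <- Array("entropy", "gini");
  depth <- Array(3, 5))
  yield ((impurity, depth), fakeAuc(impurity, depth))

// Sorting by the negated score places the highest score first.
val best = grid.sortBy(-_._2).head
println(s"Best parameters: ${best._1}")
```

Note that every additional hyper-parameter multiplies the number of models to train, so the grid should stay small when a single fit is expensive.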

After selecting the best model, we write it to a file. This is a useful trick, since model training can be expensive in both time and resources; the next time, we can load the model directly from the file instead of retraining it:

val dtModel = if (new File(dtModelPath).exists()) {
  DecisionTreeClassificationModel.load(dtModelPath)
} else { /* do training */ }
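
The same load-or-train logic can be factored into a small reusable helper. The following is only a sketch: loadOrTrain is a hypothetical function of our own, not part of Spark, and it takes the loading, saving, and training steps as arguments. Like the snippet above, it relies on java.io.File, so it only sees paths on the local file system:

```scala
import java.io.File

// Hypothetical helper (not a Spark API): return the stored result when
// `path` exists on the local file system; otherwise compute it with
// `train`, persist it with `save`, and return it.
def loadOrTrain[M](path: String)
                  (load: String => M)
                  (save: (M, String) => Unit)
                  (train: => M): M =
  if (new File(path).exists()) {
    load(path)
  } else {
    val model = train // by-name argument: evaluated only when nothing is stored
    save(model, path)
    model
  }
```

For the decision tree, load could be DecisionTreeClassificationModel.load, save could call write.overwrite.save on the model, and train would wrap the grid search. Because train is passed by name, the expensive training runs only when no saved model is found.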