Our first model – decision tree

Our first attempt at distinguishing the Higgs-Boson from background noise will use a decision tree algorithm. We purposely refrain from explaining the intuition behind this algorithm, as it is already well documented in supporting literature for the reader to consume (http://www.saedsayad.com/decision_tree.htm, http://spark.apache.org/docs/latest/mllib-decision-tree.html). Instead, we will focus on the hyper-parameters and on how to interpret the model's efficacy with respect to certain criteria and error measures. Let's start with the basic parameters:

val numClasses = 2 
val categoricalFeaturesInfo = Map[Int, Int]() 
val impurity = "gini" 
val maxDepth = 5 
val maxBins = 10 

With these values, we are explicitly telling Spark that we wish to build a decision tree classifier that distinguishes between two classes. Let's take a closer look at some of the hyper-parameters for our decision tree and see what they mean:

  • numClasses: How many classes are we trying to classify? In this example, we wish to distinguish between the Higgs-Boson particle and background noise, and thus there are two classes.

  • categoricalFeaturesInfo: A specification whereby we declare what features are categorical features and should not be treated as numbers (for example, ZIP code is a popular example). There are no categorical features in this dataset that we need to worry about.
  • impurity: A measure of the homogeneity of the labels at the node. Currently in Spark, there are two measures of impurity for classification, Gini and entropy, and one for regression, variance.
  • maxDepth: A stopping criterion which limits the depth of constructed trees. Generally, deeper trees fit the training data more closely and can lead to more accurate results, but they run the risk of overfitting.
  • maxBins: Number of bins (think "values") for the tree to consider when making splits. Generally, increasing the number of bins allows the tree to consider more values but also increases computation time.
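To make the impurity parameter more concrete, here is a minimal sketch of how the two classification impurities can be computed from the label counts at a single node. It is plain Scala with no Spark dependency; the object name ImpuritySketch and its helper functions are hypothetical names chosen for illustration, not part of the Spark API, and the entropy is assumed to use a base-2 logarithm.

```scala
// Illustrative only: ImpuritySketch is a hypothetical helper, not a Spark class.
object ImpuritySketch {

  // Convert per-class label counts at a node into class proportions.
  private def proportions(counts: Seq[Int]): Seq[Double] = {
    val total = counts.sum.toDouble
    require(total > 0, "a node must contain at least one example")
    counts.map(_ / total)
  }

  // Gini impurity: 1 - sum(p_i^2). It is 0 for a pure node.
  def gini(counts: Seq[Int]): Double =
    1.0 - proportions(counts).map(p => p * p).sum

  // Entropy: -sum(p_i * log2(p_i)), skipping empty classes so that
  // 0 * log(0) is treated as 0 (base-2 logarithm assumed here).
  def entropy(counts: Seq[Int]): Double =
    proportions(counts)
      .filter(_ > 0)
      .map(p => -p * math.log(p) / math.log(2))
      .sum
}
```

For a two-class node such as ours, a perfectly mixed node (for example, ImpuritySketch.gini(Seq(50, 50))) gives a Gini impurity of 0.5 and an entropy of 1.0, whereas a pure node (Seq(100, 0)) gives 0 for both. A split is attractive when the child nodes it produces are, on average, purer than the parent.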