Summary

In this chapter on decision trees, we first examined the structure and meaning of a decision tree. This was followed by a discussion of the mathematics behind building one. Apart from implementing a decision tree in Python, the chapter also covered the mathematics of related algorithms such as regression trees and random forests. Here is a brief summary of the chapter:

  • A decision tree is a classification algorithm in which the target variable is categorical; the predictor variables may be either categorical or continuous numerical variables.
  • The primary goal while building a tree is to split a node into subnodes so that each subnode has a more homogeneous distribution, that is, similar observations are grouped together.
  • There are various methods for deciding which variable should be used to split a node. These include information gain, the Gini index, and maximum reduction in variance (a minimal sketch of the first two measures follows this list).
  • The method of building a regression tree is very similar to that of a decision tree. However, the target variable of a regression tree is a continuous numerical variable, unlike in a decision tree, where it is categorical.
  • Random forest is an algorithm that comes under the ambit of ensemble methods. In ensemble methods, many models are fitted over the same dataset, and the final prediction is a combination (an average or a majority vote) of the outputs of these models.
  • In the case of random forests, the models are all decision or regression trees. A random forest is usually more accurate than a single decision or regression tree because averaging the outputs of many trees reduces the variance of the prediction (see the second sketch after this list).
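
As a quick illustration of the split criteria mentioned above, here is a minimal sketch (not the chapter's own code) of the Gini impurity and information-gain calculations for a binary split, using only NumPy:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted
    entropy of the two child nodes."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# A split that separates the two classes perfectly yields maximal gain.
parent = np.array([0, 0, 0, 1, 1, 1])
left, right = parent[:3], parent[3:]
print(gini(parent))                           # 0.5
print(information_gain(parent, left, right))  # 1.0
```

The variable with the highest information gain (or, equivalently, the largest drop in Gini impurity or variance) is the one chosen to split the node.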
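
Similarly, a single decision tree and a random forest can be compared in a few lines of scikit-learn. This sketch assumes scikit-learn is installed and uses its bundled Iris dataset rather than any dataset from the chapter; `DecisionTreeRegressor` and `RandomForestRegressor` are the analogous classes when the target is continuous, as in a regression tree:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A single decision tree (Gini impurity is the default split criterion).
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A random forest: 100 trees fitted on bootstrap samples of the data,
# with predictions combined by majority vote across the trees.
forest = RandomForestClassifier(n_estimators=100,
                                random_state=42).fit(X_train, y_train)

print("single tree:  ", tree.score(X_test, y_test))
print("random forest:", forest.score(X_test, y_test))
```

On most splits of the data, the forest's test accuracy matches or exceeds that of the single tree, which is the variance-reduction effect described above.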

In the next and final chapter, we will go through some best practices in predictive modeling to get optimal results.
