Chapter 8. Trees and Random Forests with Python

Clustering, discussed in the last chapter, is an unsupervised algorithm. It is now time to switch back to supervised algorithms. Classification is a class of problems that surfaces frequently, and in various forms, in predictive modelling; accordingly, a whole family of classification algorithms is used to deal with them.

A decision tree is a supervised classification algorithm used when the target variable is a discrete or categorical variable (having two or more classes) and the predictor variables are either categorical or numerical. A decision tree can be thought of as a set of if-then rules for such a classification problem, with the rules represented in the form of a tree.

A decision tree is used when the decision is based on multi-stage criteria and multiple variables. It is very effective as a decision-making tool because its pictorial output is easier to understand and implement than the output of other predictive models.

Decision trees and related classification algorithms will be the theme running throughout this chapter. By the end of this chapter, the reader will have a basic understanding of the concept of a decision tree, the mathematics behind it, the implementation of decision trees in Python, and the efficiency of the classification performed by the model.

We will cover the following topics in depth:

  • Introducing decision trees
  • Understanding the mathematics behind decision trees
  • Implementing a decision tree
  • Understanding and implementing regression trees
  • Understanding and implementing random forests

Introducing decision trees

A tree is a data structure that can be used to state certain decision rules, because it can be represented in a way that pictorially illustrates these rules. A tree has three basic elements: nodes, branches, and leaves. Nodes are the points from which one or more branches emerge; a node from which no branch originates is a leaf. A typical tree looks as follows:

Fig. 8.1: A representation of a decision tree with its basic elements—node, branches, and leaves

A tree, specifically a decision tree, starts with a root node, proceeds through the decision nodes, and ends at the terminal nodes, where the final decisions are made. Each node, except a terminal node, represents one variable, and its branches represent the different categories (values) of that variable. A terminal node represents the final decision or value for that route.
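
As a quick illustration of this structure, such a tree can be sketched in Python as nested dictionaries, where each inner key names the variable tested at a node, each branch is one of that variable's values, and each string leaf is a final decision. This is only a toy sketch; the variable and value names below are purely illustrative and not from this chapter's dataset:

# A toy decision tree as nested dicts: an internal node maps each value
# (branch) of the tested variable to a subtree; a plain string is a leaf.
tree = {
    'Weather': {                  # root node tests the variable 'Weather'
        'Sunny': 'Play',          # leaf: final decision for this route
        'Rainy': {
            'Wind': {             # decision node tests 'Wind'
                'Strong': 'Stay in',
                'Weak': 'Play',
            }
        },
    }
}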

A decision tree

To understand what decision trees look like and how to make sense of them, let us consider an example. Consider a situation where one wants to predict whether a particular Place will get a Bumper, Moderate, or Meagre Harvest of a crop, based on information about the Rainfall, Terrain, availability of Groundwater, and usage of Fertilizers. Consider the following dataset for that situation:

Fig. 8.2: A simulated dataset containing information about the Harvest type and conditions such as the Rainfall, Terrain, Usage of Fertilizers, and Availability of Groundwater

For a while, let us forget how these trees are constructed, and focus on interpreting a tree and how it comes in handy in classification problems. For illustrative purposes, let us assume that the final decision tree looks as follows (we will learn the mathematics behind it later):

Fig. 8.3: A representational decision tree based on the above dataset

There are multiple decisions one can make using this decision tree:

  • If Terrain is Plain and Rainfall is High, then the harvest will be Bumper
  • If Terrain is Plain, Rainfall is Low, and Groundwater is Yes (present), then the harvest will be Moderate
  • If Terrain is Plain, Rainfall is Low, and Groundwater is No (absent), then the harvest will be Meagre
  • If Terrain is Hills and Rainfall is High, then the harvest will be Moderate
  • If Terrain is Hills and Rainfall is Low, then the harvest will be Meagre
  • If Terrain is Plateau, Groundwater is Yes, and Fertilizer is Yes (used), then the harvest will be Bumper
  • If Terrain is Plateau, Groundwater is Yes, and Fertilizer is No (not used), then the harvest will be Moderate
  • If Terrain is Plateau, Groundwater is No, and Fertilizer is Yes (used), then the harvest will be Meagre
  • If Terrain is Plateau, Groundwater is No, and Fertilizer is No (not used), then the harvest will be Meagre
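
These if-then rules translate directly into code. The following is a minimal sketch of the tree in Fig. 8.3 as a Python function; the argument names and category strings are assumptions based on the dataset shown in Fig. 8.2:

def predict_harvest(terrain, rainfall, groundwater, fertilizer):
    """Encode the decision tree of Fig. 8.3 as a set of if-then rules.
    Each argument is a string matching a category in the dataset."""
    if terrain == 'Plain':
        if rainfall == 'High':
            return 'Bumper'
        # Rainfall is Low: the decision depends on groundwater
        return 'Moderate' if groundwater == 'Yes' else 'Meagre'
    if terrain == 'Hills':
        return 'Moderate' if rainfall == 'High' else 'Meagre'
    # Terrain is Plateau
    if groundwater == 'Yes':
        return 'Bumper' if fertilizer == 'Yes' else 'Moderate'
    return 'Meagre'  # no groundwater on a plateau: Meagre either way

print(predict_harvest('Plain', 'Low', 'Yes', 'No'))  # Moderate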

A decision tree, or the decisions implied by it, can also be represented as disjunction statements. For example, the conditions for the harvest being Bumper can be written as a combination of AND and OR operators (a disjunction of conjunctions), as follows:

(Terrain = Plain AND Rainfall = High) OR (Terrain = Plateau AND Groundwater = Yes AND Fertilizer = Yes)

In the same way, the conditions for a Moderate harvest can be summarized using disjunctions as follows:

(Terrain = Plain AND Rainfall = Low AND Groundwater = Yes) OR (Terrain = Hills AND Rainfall = High) OR (Terrain = Plateau AND Groundwater = Yes AND Fertilizer = No)
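
In Python, such a disjunction is simply a boolean expression built from and and or; a small sketch, reusing the hypothetical variable names from the earlier function:

# Illustrative values for the four predictor variables
terrain, rainfall, groundwater, fertilizer = 'Plateau', 'Low', 'Yes', 'Yes'

# The Bumper condition as a disjunction (OR) of conjunctions (AND)
is_bumper = ((terrain == 'Plain' and rainfall == 'High') or
             (terrain == 'Plateau' and groundwater == 'Yes' and fertilizer == 'Yes'))
print(is_bumper)  # True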

In the preceding example, we have used only categorical variables as predictor variables; however, there is no such restriction. Numerical variables can very well be used as predictors. That said, numerical variables are not the most preferred variables for a decision tree, as they lose some information when they are categorized into groups (that is, split at thresholds) for use in the tree algorithm.
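
For instance, scikit-learn's DecisionTreeClassifier handles a numerical predictor by searching for threshold splits of the form variable <= value. The following is a minimal sketch with made-up data; the feature values and the resulting threshold are illustrative, not taken from this chapter's dataset:

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: annual rainfall in mm and the corresponding harvest label
X = [[300], [450], [600], [800], [950], [1100]]
y = ['Meagre', 'Meagre', 'Meagre', 'Bumper', 'Bumper', 'Bumper']

clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
clf.fit(X, y)

# The numerical variable is split at a threshold, e.g. rainfall_mm <= 700.0
print(export_text(clf, feature_names=['rainfall_mm']))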

A decision tree is advantageous because it is easy to understand and doesn't require much data cleaning, in the sense that it is not influenced very much by missing values and outliers.
