Clustering, discussed in the last chapter, is an unsupervised algorithm. It is now time to switch back to supervised algorithms. Classification is a class of problems that surfaces frequently, and in various forms, in predictive modelling. Accordingly, a family of classification algorithms is used to deal with them.
A decision tree is a supervised classification algorithm that is used when the target variable is a discrete or categorical variable (having two or more than two classes) and the predictor variables are either categorical or numerical variables. A decision tree can be thought of as a set of if-then rules for a classification problem where the target variables are discrete or categorical variables. The if-then rules are represented as a tree.
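To make the if-then reading concrete, here is a minimal sketch in Python. The variables and split values are hypothetical (they anticipate the harvest example used later, not an actual fitted tree); each `return` corresponds to a leaf of the tree:

```python
# A decision tree read as nested if-then rules. The rules below are
# hypothetical, purely to show the structure of such a classifier.
def predict_harvest(rainfall, groundwater):
    if rainfall == "High":
        return "Bumper"        # leaf
    elif groundwater == "Yes":
        return "Moderate"      # leaf
    return "Meagre"            # leaf

print(predict_harvest("High", "No"))   # follows one root-to-leaf path
```

Each call traces exactly one path from the root condition down to a leaf, which is precisely how a decision tree classifies a record.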
A decision tree is used when the decision is based on multiple-staged criteria and variables. A decision tree is very effective as a decision-making tool because its pictorial output is easier to understand and implement than the output of other predictive models.
Decision trees and related classification algorithms will be the theme running throughout this chapter. By the end of this chapter, the reader will have a basic understanding of the concept of a decision tree, the mathematics behind it, the implementation of decision trees in Python, and how to assess the efficiency of the classification performed by the model.
We will cover the following topics in depth:
A tree is a data structure that can be used to state certain decision rules, because it can be drawn in a way that pictorially illustrates those rules. A tree has three basic elements: nodes, branches, and leaves. Nodes are the points from which one or more branches come out. A node from which no branch originates is a leaf. A typical tree looks as follows:
A tree, specifically a decision tree, starts with a root node, proceeds through decision nodes, and ends at the terminal nodes where the final decisions are made. Every node except a terminal node represents one variable, and the branches leaving it represent the different categories (values) of that variable. A terminal node represents the final decision or value for that route.
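The node-branch-leaf structure described above can be sketched as a small data type. This is an illustrative sketch of the idea, not the implementation used later in the chapter; the tiny hand-built tree at the bottom is invented for the example:

```python
# Minimal decision-tree structure: an internal node holds a variable
# name and branches keyed by that variable's categories; a leaf holds
# only a decision. Illustrative sketch, not a fitted tree.
class Node:
    def __init__(self, variable=None, decision=None):
        self.variable = variable      # None for a leaf
        self.decision = decision      # set only on leaves
        self.branches = {}            # category value -> child Node

def classify(node, record):
    """Follow branches from the root until a leaf is reached."""
    while node.variable is not None:
        node = node.branches[record[node.variable]]
    return node.decision

# Hand-built example: the root splits on a hypothetical "Rainfall".
root = Node(variable="Rainfall")
root.branches["High"] = Node(decision="Bumper")
root.branches["Low"] = Node(decision="Meagre")

print(classify(root, {"Rainfall": "High"}))  # Bumper
```

Classifying a record is nothing more than walking from the root to a leaf, choosing the branch that matches the record's value at each node.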
To understand what decision trees look like and how to make sense of them, let us consider an example. Consider a situation where one wants to predict whether a particular Place will get a Bumper, Moderate, or Meagre Harvest of a crop, based on information about the Rainfall, the Terrain, the availability of groundwater, and the usage of fertilizers. Consider the following dataset for that situation:
For a while, let us forget how these trees are constructed and focus on interpreting a tree and how it comes in handy in classification problems. For illustrative purposes, let us assume that the final decision tree looks as follows (we will learn the mathematics behind it later):
There are multiple decisions one can make using this decision tree:
A decision tree, or the decisions implied by it, can also be represented as disjunction statements. For example, the conditions for the harvest being Bumper can be written as a combination of AND and OR operators (or disjunctions), as follows:
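Such a disjunction maps directly onto a boolean expression in code. The specific conditions below are hypothetical, since they depend on the fitted tree; they only illustrate the AND/OR form that every root-to-leaf path takes:

```python
# Hypothetical disjunction for a Bumper harvest: each AND-clause is
# one root-to-leaf path, and the OR joins the paths that end in the
# same class. The conditions themselves are invented for illustration.
def is_bumper(rainfall, terrain, groundwater):
    return ((rainfall == "High" and terrain == "Plains")
            or (rainfall == "Medium" and groundwater == "Yes"))

print(is_bumper("High", "Plains", "No"))   # True
print(is_bumper("Low", "Hills", "No"))     # False
```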
In the same way, the conditions for a Moderate harvest can be summarized using disjunctions as follows:
In the preceding example, we have used only categorical variables as the predictor variables. However, there is no such restriction; numerical variables can very well be used as predictors. That said, numerical variables are not the most preferred variables for a decision tree, as they lose some information when they are binned into groups for use in the decision tree algorithm.
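For instance, a numerical rainfall measurement might be binned into Low/Medium/High categories before being fed to the tree. The readings and the cut points below are arbitrary, chosen only to show the binning step with `pandas.cut`:

```python
import pandas as pd

# Hypothetical rainfall readings in mm; the bin edges are arbitrary.
rainfall_mm = pd.Series([120, 480, 950, 610, 75])
rainfall_cat = pd.cut(rainfall_mm,
                      bins=[0, 250, 750, float("inf")],
                      labels=["Low", "Medium", "High"])
print(rainfall_cat.tolist())
```

Note the information loss: once binned, 480 mm and 610 mm are both just "Medium", which is exactly why numerical predictors are less preferred here.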
A decision tree is advantageous because it is easy to understand and requires relatively little data cleaning, in the sense that it is not much influenced by missing values and outliers.
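Putting the pieces together, a decision tree can be fitted in Python with scikit-learn's `DecisionTreeClassifier`. The tiny dataset below is invented for illustration, and the categorical predictors are integer-encoded first because scikit-learn trees expect numeric inputs:

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Invented toy data: each row is (Rainfall, Groundwater) -> Harvest.
X_raw = [["High", "Yes"], ["High", "No"],
         ["Low", "Yes"], ["Low", "No"]]
y = ["Bumper", "Moderate", "Moderate", "Meagre"]

# Encode each categorical column as integers.
encoders = [LabelEncoder().fit(list(col)) for col in zip(*X_raw)]
X = [[enc.transform([v])[0] for enc, v in zip(encoders, row)]
     for row in X_raw]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Predict for a new place with High rainfall and groundwater available.
new_row = [encoders[0].transform(["High"])[0],
           encoders[1].transform(["Yes"])[0]]
print(clf.predict([new_row]))  # ['Bumper']
```

On data this small the tree simply memorizes the rows; the point is only to show the fit/predict workflow that the rest of the chapter builds on.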