The main goal of a decision tree algorithm is to identify a variable, and a split on that variable, that gives a more homogeneous distribution with reference to the target variable. A homogeneous distribution means that similar values of the target variable are grouped together so that a concrete decision can be made.
In the preceding example, the first goal would be to find a parameter (out of four: Terrain, Rainfall, Groundwater, and Fertilizers) that results in a better homogeneous distribution of the target variable within those categories.
Without any parameter, the count of harvest types looks as follows:

Bumper | Meagre | Moderate
---|---|---
4 | 9 | 7
Let us calculate, for each parameter, how splitting on that parameter affects the homogeneity of the target variable's distribution:
If one observes carefully, the classification based on the Rainfall parameter is more homogeneous than the one based on the Terrain parameter. Homogeneity means more similar things grouped together or, in other words, fewer dissimilar things grouped together. It can be understood as follows: for Low rainfall, the harvest is distributed as 0%, 71%, and 29% across Bumper, Meagre, and Moderate, which is a more homogeneous classification than the 0% (Bumper), 50% (Meagre), and 50% (Moderate) obtained for hilly terrains. This is because Low rainfall was able to group more of the Meagre harvests (71%) together than any class of the Terrain parameter could. For the Fertilizers and Groundwater parameters, the highest homogeneity that can be achieved is 67%. Thus, Rainfall is the parameter most suited to classify the target variable, that is, the Harvest type. This can be refined further as we add more variables (parameters) to the classification.
To understand it even better, let us look at the following pictures. Which one is more homogeneous? Which one can represent a node where a concrete decision (about the type of dot: bold or normal) can be made? Definitely C, because it has only one type of dot. B is the next best choice as, in this case, the decision of the dot type is skewed more towards the bold dots (compared to A) than the normal dots.
Making these classifications more and more homogeneous, so that they can be identified as concrete decisions, is the ultimate goal in decision tree algorithms. Identifying the variable that results in the best homogeneous classification can be done in many ways. There are multiple algorithms available to do this. Let us take a look at some of them.
The entropy technique takes its cues from information theory. The premise is that more homogeneous or pure nodes require less information to be represented. One example of information is an encoded message that has to be transmitted: if the constituents of the message are more homogeneous, it will consume fewer bits. The configuration that requires less information is always preferred.
The impurity or heterogeneity of a node can be measured using a quantity called entropy. One interpretation of entropy (from information theory) is the minimum number of bits required to encode the classification of an arbitrary member of the set. The reduction in this undesirable entropy achieved by a split is measured as information gain. In the case of a decision tree, the nodes whose addition results in information gain should be added to the configuration.
Entropy is defined as follows:

Entropy(S) = −Σ pi log2(pi)

Here, the summation runs over the different categories of the target variable, and pi is the proportion (out of the total number of observations) of the i-th category of the target variable.
Entropy ranges from 0 to log2c, c being the number of categories present in the target variable. The entropy will be 0 when the observations are perfectly homogeneous, while it will be log2c when the observations are perfectly heterogeneous. The entropy reaches the maximum value (log2c) when all the pi's are equal. The entropy reaches the minimum value (0) when one pi is equal to 1 and all the others are 0. If pi=1, the receiver of the information will know with certainty what classification an arbitrary member should belong to, so that there will be no information required and, thus, the entropy will be 0.
In the preceding example, we have three categories of the target variable; hence, the entropy can range from 0 to log2(3), which is about 1.58. The three categories, namely Bumper, Meagre, and Moderate, occur with proportions 4/20, 9/20, and 7/20. Thus, the entropy can be calculated as follows:

Entropy(S) = −(4/20) log2(4/20) − (9/20) log2(9/20) − (7/20) log2(7/20) ≈ 1.5
When the target variable is perfectly homogeneously distributed, the entropy will be 0. Suppose that all 20 observations had Bumper (or Meagre, or Moderate) as the target variable; then the entropy would be as follows:

Entropy(S) = −(20/20) log2(20/20) = 0
On the other hand, if the classification is perfectly heterogeneous (that is, when each category of the target variable has an equal number of observations), the entropy reaches its maximum value:

Entropy(S) = −3 × (1/3) log2(1/3) = log2(3) ≈ 1.58
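These entropy values can be reproduced with a short Python sketch; the `entropy` helper below is illustrative, not code from the chapter:

```python
from math import log2

def entropy(counts):
    """Entropy of a node, given the count of observations in each target category."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

initial = entropy([4, 9, 7])      # ~1.51; the chapter rounds this to 1.5
pure    = entropy([20, 0, 0])     # 0: perfectly homogeneous
maximum = entropy([1, 1, 1])      # log2(3) ~ 1.58: perfectly heterogeneous
```

Note the `if c > 0` guard: empty categories contribute nothing, which avoids evaluating log2(0).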
A plot of entropy versus pi takes the shape of an inverted U with a maximum in the middle. For a binary classification, the plot looks as follows:

As you can see in the preceding picture, the entropy increases as pi increases, reaches its maximum at the point where all the pi's are equal (pi = 0.5 in the binary case), and then decreases as pi increases further.
The entropy (S = 1.5) we calculated is the entropy of the system at the beginning. It reduces as we introduce variables as nodes into the system, and this reduction is reflected as information gain. Let us see how information gain is calculated for different variables. The information gain for a particular variable V is defined as follows:

Gain(S, V) = Entropy(S) − Σc (|Vc| / |V|) × Entropy(Vc)

Here, Entropy(S) is the entropy of the system before the split, the summation runs over the categories c of the node variable V, |Vc| is the number of observations with category c of the node variable, |V| is the total number of observations, and Entropy(Vc) is the entropy of the subset of observations with category c.
Let us calculate the information gain based on using the Terrain variable as a node.
For this, let us calculate the entropy for the different categories of the Terrain variable. Entropy is always calculated with the target variable in mind: for each category of the variable in question, we count how many observations of each target category are present:
Now, we can calculate the information gain as follows:
The preceding calculation can be summarized in a tree-branch as follows:
Similarly, the information gain for the Rainfall variable can be calculated as 0.42, that for the Fertilizers variable can be calculated as 0.36, and that for Groundwater can be calculated as 0.16:
Parameter | Information Gain
---|---
Terrain | 0.14
Rainfall | 0.42
Fertilizers | 0.36
Groundwater | 0.16
The variable that provides the maximum information gain is chosen to be the node. In this case, the Rainfall variable is chosen as it results in the maximum information gain of 0.42.
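The gain computation itself can be sketched in Python. The two child splits used in the demo below are made-up counts for illustration; they are not the chapter's actual per-category counts:

```python
from math import log2

def entropy(counts):
    """Entropy of a node, given per-category observation counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts):
    """Gain(S, V) = Entropy(S) - sum_c (|Vc| / |V|) * Entropy(Vc)."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts)
    return entropy(parent_counts) - weighted

# Parent node: 4 Bumper, 9 Meagre, 7 Moderate, as in the chapter.
# The split into two children below is hypothetical.
parent = [4, 9, 7]
children = [[4, 2, 1], [0, 7, 6]]
gain = information_gain(parent, children)   # positive: the split reduces entropy
```

A split that leaves all observations in one group yields a gain of zero, which is why uninformative variables are never chosen.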
The tree keeps growing: for each branch of the node just created (Rainfall in this case), the variable that yields the maximum information gain on the observations in that branch is selected as the subnode, and a subnode is chosen separately for each branch. A categorical variable that has already been used along a branch is excluded from further consideration on that branch (a continuous variable, however, can be reused with a different threshold). This recursive process stops when a node is totally homogeneous (pure), or when there are no variables left to split on, which bounds the maximum possible depth of the tree.
This algorithm, which uses entropy and information gain, is called the ID3 algorithm and can be summarized as follows:
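As a rough sketch of the ID3 recursion, the following builds a nested-dictionary tree. The function names and the tiny demo dataset are illustrative, not the chapter's full harvest data:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a list of target labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction from splitting the rows on the given attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    gain = entropy(labels)
    for subset in groups.values():
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

def id3(rows, labels, attributes):
    """Recursive ID3 sketch: returns a nested dict; leaves are class labels."""
    if len(set(labels)) == 1:              # pure node: stop
        return labels[0]
    if not attributes:                     # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {}
    for value in sorted({row[best] for row in rows}):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[value] = id3(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [a for a in attributes if a != best],   # exclude the used attribute
        )
    return {best: tree}

# Toy rows, not the chapter's dataset: Rainfall alone determines the harvest here.
rows = [
    {"Rainfall": "High", "Fertilizers": "Yes"},
    {"Rainfall": "Low",  "Fertilizers": "Yes"},
    {"Rainfall": "High", "Fertilizers": "No"},
    {"Rainfall": "Low",  "Fertilizers": "No"},
]
labels = ["Bumper", "Meagre", "Bumper", "Meagre"]
tree = id3(rows, labels, ["Rainfall", "Fertilizers"])
```

On this toy data, Rainfall has the higher information gain, so it becomes the root and each branch is already pure.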
Apart from the ID3 algorithm, which involves entropy and information gain, there are other algorithms that can be used to decide which variable should be chosen as a subnode, such as the Gini index method, the Chi-Square Automatic Interaction Detector (CHAID) method, and the Reduction in Variance method. Each of these is useful and can give better results than the others in specific scenarios. A summary of these methods is given in the following table:
Method | Properties
---|---
Gini index | This can be used only if the target variable is a binary variable. Classification and Regression Trees (CART), a famous classification technique, uses the Gini index method.
CHAID | This uses the chi-square statistic to find the statistical significance of the difference between a parent node and a subnode. It can handle more than two categories of the target variable.
Reduction in Variance | This is used when the target variable is a continuous numerical variable. It calculates the variance at each node and takes a weighted average of the variances to decide the subnode.
To give an idea of the nitty-gritty of these methods, two of them, the Gini index and Reduction in Variance methods, are explained in detail below.
The Gini index method works only when the target variable is a binary variable. This method is used by the famous CART algorithm.
Gini is defined as the sum of the squares of the proportions of the target variable's categories within a particular category of the node variable. Suppose the node variable for which we are looking for a subnode has two categories, C1 and C2. C1 has 2 yes's and 8 no's (out of the 10 observations with C1 as the value of the node variable), while C2 has 6 yes's and 4 no's (giving a total of 8 yes's and 12 no's). Gini for each category is calculated as follows:

Gini(C1) = (2/10)² + (8/10)² = 0.68
Gini(C2) = (6/10)² + (4/10)² = 0.52
Then, the Gini index for the variable is calculated by taking a weighted average over the categories:

Gini index = (10/20) × 0.68 + (10/20) × 0.52 = 0.60
The Gini index is calculated for all the candidate variables, and the variable with the highest Gini index (that is, the one producing the purest subnodes under this definition) is selected to create the subnode.
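The worked numbers above can be checked with a small sketch; the helper names are illustrative:

```python
def gini(counts):
    """Sum of squared class proportions (higher = more homogeneous)."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def gini_index(category_counts):
    """Weighted average of per-category Gini scores."""
    total = sum(sum(c) for c in category_counts)
    return sum(sum(c) / total * gini(c) for c in category_counts)

c1 = [2, 8]   # category C1: 2 yes, 8 no
c2 = [6, 4]   # category C2: 6 yes, 4 no
print(round(gini(c1), 2))              # 0.68
print(round(gini(c2), 2))              # 0.52
print(round(gini_index([c1, c2]), 2))  # 0.6
```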
In the Reduction in Variance method, the target variable is assumed to be a numerical variable. Let us consider the same example that we used to understand the Gini index method. To convert the target variable (yes and no) to a numerical variable, let us transform yes to 1 and no to 0 (resulting in 8 1's and 12 0's).
The weighted variance is calculated for each candidate variable. The variable with the smallest weighted variance is chosen to be the node.
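Under the same yes → 1 / no → 0 encoding, the weighted variance for the two categories of the example can be computed as follows (a minimal sketch; helper names are illustrative):

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def weighted_variance(groups):
    """Weighted average of per-group variances, weighted by group size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * variance(g) for g in groups)

c1 = [1] * 2 + [0] * 8   # category C1: 2 yes (1), 8 no (0) -> variance 0.16
c2 = [1] * 6 + [0] * 4   # category C2: 6 yes, 4 no        -> variance 0.24
wv = weighted_variance([c1, c2])   # (10/20) * 0.16 + (10/20) * 0.24 = 0.2
```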
As we have seen earlier, a decision tree keeps growing by adding subnodes to the parent node until the data is perfectly classified or totally homogeneous. An extreme case is a decision tree that has as many nodes as there are observations in the dataset. This happens rarely, but a more common phenomenon is over-fitting, where the tree over-learns the given training dataset but doesn't perform well on other similar datasets. A tree that is too small, on the other hand, loses accuracy because it may miss some important variables.
A common strategy to overcome this situation is to first allow the tree to grow until the nodes have the minimum number of instances under them and then prune the tree to remove the branches or nodes that don't provide a lot of classification power to the tree. This procedure is called pruning a decision tree and can be done in both a bottom-up and top-down fashion. The following are a few methods that are used to do this:
Earlier in the discussion, we mentioned that continuous numerical variables can also be used as a predictor variable while creating a decision tree. However, the algorithms we have discussed to choose a subnode and grow the tree, all require a categorical variable. How do we then use continuous numerical variables to create a decision tree?
The answer is to define thresholds for the continuous numerical variable, based on which the variable will be dynamically categorized into several classes. How do we define such a threshold for a numerical variable? To answer this question, let us suppose that the harvest dataset used earlier also has information about Temperature. Let us look at the first 10 observations of such a dataset (showing only the Temperature and Harvest variables):
Temp (°C) | Harvest
---|---
35 | Bumper
27 | Moderate
12 | Meagre
51 | Meagre
46 | Meagre
38 | Bumper
4 | Meagre
22 | Moderate
29 | Moderate
17 | Meagre
To find the appropriate thresholds that can be used to divide Temperature into categories for the sake of creating a decision tree, the following steps can be performed:
Temp (°C) | Harvest
---|---
4 | Meagre
12 | Meagre
17 | Meagre
22 | Moderate
27 | Moderate
29 | Moderate
35 | Bumper
38 | Bumper
46 | Meagre
51 | Meagre
This method works because it has been proved that the points where the target variable transitions from one category to another provide the thresholds for a continuous numerical variable that maximize the information gain.
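The transition points for the Temperature example can be located with a small sketch. Taking the midpoint between the two values at each transition is one common convention; the helper name is illustrative:

```python
def candidate_thresholds(values, labels):
    """Midpoints between consecutive sorted values where the target class changes."""
    pairs = sorted(zip(values, labels))
    return [(a[0] + b[0]) / 2
            for a, b in zip(pairs, pairs[1:])
            if a[1] != b[1]]

# The 10 Temperature observations and their Harvest labels from the text.
temps  = [35, 27, 12, 51, 46, 38, 4, 22, 29, 17]
labels = ["Bumper", "Moderate", "Meagre", "Meagre", "Meagre",
          "Bumper", "Meagre", "Moderate", "Moderate", "Meagre"]
print(candidate_thresholds(temps, labels))  # [19.5, 32.0, 42.0]
```

The three candidates correspond to the Meagre→Moderate, Moderate→Bumper, and Bumper→Meagre transitions in the sorted table.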
One of the advantages of using a decision tree is that it can handle missing values of an attribute. In many real-life situations where decision trees are used, one faces data points with missing attribute values. For example, consider a situation where one wants to predict a patient's condition based on a number of laboratory results. What if the results of some tests are not available for some of the patients? How do we handle such a situation?
Let us consider the first 10 observations of the dataset we used above, albeit with a few missing values:
Place | Rainfall | Terrain | Fertilizers | Groundwater | Harvest
---|---|---|---|---|---
P1 | High | Plains | Yes | Yes | Bumper
P2 | Low | | No | Yes | Meagre
P3 | Low | Plateau | No | Yes | Moderate
P4 | High | Plateau | No | Yes | Moderate
P5 | High | Plains | Yes | No | Bumper
P6 | Low | Hills | No | No | Meagre
P7 | Low | Plateau | No | No | Meagre
P8 | Mild | Plateau | No | No | Meagre
P9 | High | Hills | Yes | Yes | Moderate
P10 | Mild | | Yes | Yes | Bumper
Some of the approaches that can be used to handle the missing values while creating a decision tree are as follows:
Once the missing values have been handled, the usual decision tree algorithms can be applied without modification.
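One simple approach of this kind is to replace a missing categorical value with the attribute's most frequent observed category. A minimal sketch, using the Terrain column above (the helper name is illustrative):

```python
from collections import Counter

def fill_most_common(values):
    """Replace missing entries (None) with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

# Terrain column for P1..P10; the values for P2 and P10 are missing.
terrain = ["Plains", None, "Plateau", "Plateau", "Plains",
           "Hills", "Plateau", "Plateau", "Hills", None]
filled = fill_most_common(terrain)   # both gaps become "Plateau", the mode
```

A refinement is to compute the mode only among rows sharing the same target category, so that the imputation respects the class structure.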