Understanding the mathematics behind decision trees

The main goal of a decision tree algorithm is to identify the variable, and the split on it, that produces the most homogeneous distribution of the target variable. A homogeneous distribution means that similar values of the target variable are grouped together, so that a concrete decision can be made.

Homogeneity

In the preceding example, the first goal would be to find the parameter (out of the four: Terrain, Rainfall, Groundwater, and Fertilizers) that results in the most homogeneous distribution of the target variable within its categories.

Without any parameter, the count of harvest type looks as follows:

Bumper | Moderate | Meagre
------ | -------- | ------
4      | 9        | 7

Let us see, for each parameter, how splitting on it affects the homogeneity of the target variable:


Fig. 8.4: Splitting the predictor and the target variables into categories to see their effect on the homogeneity of the dataset

If one observes carefully, the classification done on the Rainfall parameter is more homogeneous than that done on the Terrain parameter. Homogeneity means more similar things grouped together or, in other words, fewer dissimilar things grouped together. It can be understood as follows: for Low rainfall, the harvest is distributed as 0%, 71%, and 29% across Bumper, Meagre, and Moderate, which is a more homogeneous classification than the 0% (Bumper), 50% (Meagre), and 50% (Moderate) obtained for hilly terrains. This is because Low rainfall was able to group more of the Meagre harvests (71%) together than any class of the Terrain parameter could. For the Fertilizers and Groundwater parameters, the highest homogeneity that can be achieved is 67%. Thus, Rainfall is the parameter most suited to classify the target variable, that is, the Harvest type. This classification can be refined further as we add more variables (parameters).

To understand it even better, let us look at the following pictures. Which one is more homogeneous? Which one can represent a node where a concrete decision (about the type of dot: bold or normal) can be made? Definitely C, because it has only one type of dot. B is the next best choice because its distribution is skewed more heavily towards the bold dots than A's is.

[Figure: three sets of dots, A, B, and C, with C being the most homogeneous]

Making these classifications more and more homogeneous, so that they can be turned into concrete decisions, is the ultimate goal of decision tree algorithms. The variable that results in the most homogeneous classification can be identified in several ways; there are multiple algorithms available to do this. Let us take a look at some of them.

Entropy

The entropy technique takes its cues from information theory. The premise is that more homogeneous, or pure, nodes require less information to be represented. One example of information is an encoded signal that has to be transmitted: if the constituents of the message are more homogeneous, the message can be encoded in fewer bits. The configuration that requires less information is always preferred.

The impurity or heterogeneity of a node can be measured using a quantity called entropy. One interpretation of entropy (from information theory) is the minimum number of bits required to encode the classification of an arbitrary member of the set. The reduction in this undesirable entropy achieved by a split is measured as information gain. In the case of a decision tree, the nodes whose addition results in the greatest information gain should be added to the configuration.

Entropy is defined as follows:

Entropy(S) = -Σ_i p_i × log2(p_i)

Here, the summation runs over the different categories of the target variable, and p_i is the proportion (out of the total number of observations) of the i-th category of the target variable.

Entropy ranges from 0 to log2(c), c being the number of categories present in the target variable. The entropy is 0 when the observations are perfectly homogeneous, and log2(c) when they are perfectly heterogeneous. The maximum value (log2(c)) is reached when all the p_i's are equal, and the minimum value (0) is reached when one p_i is equal to 1 and all the others are 0. If some p_i = 1, the receiver of the information knows with certainty which classification an arbitrary member belongs to, so no information needs to be transmitted and the entropy is 0.

In the preceding example, we have three categories of the target variable; hence, the entropy can range from 0 to log2(3) ≈ 1.58. The three categories, Bumper, Moderate, and Meagre, have proportions 4/20, 9/20, and 7/20. Thus, the entropy can be calculated as follows:

Entropy(S) = -(4/20) × log2(4/20) - (9/20) × log2(9/20) - (7/20) × log2(7/20) ≈ 1.5
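The following is a minimal Python sketch of this calculation; the helper function and variable names are ours, and the counts (4, 9, and 7 out of 20 observations) are taken from the table above.

```python
import math

def entropy(counts):
    """Entropy (in bits) of a categorical distribution given as raw counts."""
    total = sum(counts)
    # Sum of -p * log2(p) over the non-empty categories (0 * log2(0) is taken as 0).
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([4, 9, 7]))    # ~1.51, the starting entropy used in the text (~1.5)
print(entropy([20, 0, 0]))   # 0.0 -- perfectly homogeneous observations
print(entropy([7, 7, 6]))    # ~1.58 -- close to log2(3), nearly perfectly heterogeneous
```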

When the target variable is perfectly homogeneously distributed, the entropy is 0. Suppose all the observations had the same harvest type, say Bumper; then the entropy would be as follows:

Entropy(S) = -(20/20) × log2(20/20) = -1 × 0 = 0

On the other hand, if the classification is perfectly heterogeneous (that is, when each category of the target variable has an equal number of observations), the entropy reaches its maximum value.

A plot of entropy versus p_i takes the shape of an inverted U with a maximum in the middle. For a binary classification, the plot of entropy versus the proportion p of one of the two classes looks as follows:


Fig. 8.5: A plot of entropy versus the proportions

As you can see in the preceding plot, the entropy increases with p, reaches its maximum at the point where the two classes are present in equal proportion (p = 0.5), and then decreases as p increases further.
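A plot like Fig. 8.5 can be reproduced with a few lines of Python; this sketch assumes numpy and matplotlib are available and simply evaluates the binary entropy formula over a grid of proportions.

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)                 # proportion of one of the two classes
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)     # binary entropy in bits

plt.plot(p, H)
plt.xlabel("p (proportion of one class)")
plt.ylabel("Entropy")
plt.title("Entropy versus the class proportion for a binary target")
plt.show()                                         # peaks at 1 bit when p = 0.5
```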

Information gain

The entropy we calculated (Entropy(S) ≈ 1.5) is the entropy of the system at the beginning. It reduces as we introduce variables as nodes in the system, and this reduction is reflected as information gain. Let us see how the information gain is calculated for different variables. The information gain for a particular variable V is defined as follows:

Gain(S, V) = Entropy(S) - Σ_c (|V_c| / |V|) × Entropy(V_c)

Here, Entropy(S) is the entropy of the system before the split, c stands for the categories of the node variable V, |V_c| is the number of observations with category c of the node variable, |V| is the total number of observations, and Entropy(V_c) is the entropy of the subset of observations that have category c of the node variable. The summation is over the categories of the node variable.
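This definition translates almost directly into code. The following is a minimal sketch; the function names are ours, and instead of the full 20-row dataset it uses the 10-observation sample shown later in this section (in the discussion of missing values).

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of target labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Information gain obtained by splitting `labels` on the categories in `values`."""
    total = len(labels)
    groups = {}                      # category of the node variable -> target labels
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    weighted = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Rainfall and Harvest for the 10 places (P1-P10) listed later in this section
rainfall = ["High", "Low", "Low", "High", "High", "Low", "Low", "Mild", "High", "Mild"]
harvest  = ["Bumper", "Meagre", "Moderate", "Moderate", "Bumper",
            "Meagre", "Meagre", "Meagre", "Moderate", "Bumper"]
print(information_gain(rainfall, harvest))   # ~0.65 on this 10-row sample
```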

Let us calculate the information gain based on using the Terrain variable as a node.

For this, let us calculate the entropy for the different categories of the Terrain variable. Entropy is always calculated keeping the target variable in mind: for each category of the particular variable, we need to count how many observations of each target category are present:

[Table: counts of each Harvest type within each category of the Terrain variable]

Now, we can calculate the information gain as follows:

Gain(S, Terrain) = Entropy(S) - Σ_c (|Terrain_c| / 20) × Entropy(Terrain_c) = 0.14

The preceding calculation can be summarized in a tree-branch as follows:


Fig. 8.6: The calculation of information gain for the Terrain variable

Similarly, the information gain for the Rainfall variable can be calculated as 0.42, that for the Fertilizers variable can be calculated as 0.36, and that for Groundwater can be calculated as 0.16:

Parameter | Information Gain
--------- | ----------------
Terrain | 0.14
Rainfall | 0.42
Fertilizers | 0.36
Groundwater | 0.16

The variable that provides the maximum information gain is chosen to be the node. In this case, the Rainfall variable is chosen as it results in the maximum information gain of 0.42.

ID3 algorithm to create a decision tree

The tree keeps growing by selecting, for each branch of the previous node (Rainfall in this case), the variable that results in the maximum information gain over the observations in that branch as the next subnode. A variable that has already been used for a split along a branch is not considered again for that branch. It is a recursive process that stops when a node is totally homogeneous (pure), or when the tree reaches its maximum possible depth (at most the number of observations in the dataset).

This algorithm, which uses entropy and information gain, is called the ID3 algorithm and can be summarized as follows (a short scikit-learn sketch follows the steps):

  1. Calculate the initial entropy of the system based on the target variable.
  2. Calculate the information gains for each candidate variable for a node. Select the variable that provides the maximum information gain as a decision node.
  3. Repeat step 2 for each branch (category) of the variable identified in step 2. The newly identified nodes are termed leaf nodes.
  4. Check whether the leaf node classifies the entire data perfectly. If not, repeat the steps from step 2 onwards. If yes, stop.
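The ID3 algorithm itself is not shipped with scikit-learn, but its DecisionTreeClassifier (an optimized, CART-style implementation) can be asked to split on entropy, which selects splits by information gain in the same spirit. The following is a minimal sketch, assuming pandas and scikit-learn are installed; the eight rows are taken from the sample table shown later in this section.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Eight observations in the shape of the harvest dataset (P1, P3-P9 from the sample table).
df = pd.DataFrame({
    "Rainfall":    ["High", "Low", "High", "High", "Low", "Low", "Mild", "High"],
    "Terrain":     ["Plains", "Plateau", "Plateau", "Plains", "Hills", "Plateau", "Plateau", "Hills"],
    "Fertilizers": ["Yes", "No", "No", "Yes", "No", "No", "No", "Yes"],
    "Groundwater": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"],
    "Harvest":     ["Bumper", "Moderate", "Moderate", "Bumper", "Meagre", "Meagre", "Meagre", "Moderate"],
})

# One-hot encode the categorical predictors so the tree can split on them.
X = pd.get_dummies(df.drop(columns="Harvest"))
y = df["Harvest"]

tree = DecisionTreeClassifier(criterion="entropy")   # split on information gain (entropy reduction)
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```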

Apart from the ID3 algorithm, which involves entropy and information gain, there are other algorithms that can be used to decide which variable should be chosen as a subnode. Some of them are the Gini index method, the Chi-Square Automatic Interaction Detector (CHAID) method, and the Reduction in Variance method. All of these algorithms are useful, and each might give better results than the others in specific scenarios. A summary of these methods is given in the following table:

Method | Properties
------ | ----------
Gini index | This can be used only if the target variable is a binary variable. Classification and Regression Trees (CART), a famous classification technique, uses the Gini index method.
CHAID | This uses the chi-square statistic to find the statistical significance of the difference between a parent node and a subnode. It can handle more than two categories of the target variable.
Reduction in Variance | This is used to handle a continuous numerical variable as the target variable in decision tree creation. It calculates the variance at each node and takes a weighted average of the variances to decide the subnode.

To give an idea of the nitty-gritty of these methods, two of them, Gini index and Reduction in Variance, are explained in detail below.

Gini index

The Gini method works only when the target variable is a binary variable. This method is used by the famous CART algorithm.

Gini, for a particular category of the node variable, is defined as the sum of the squares of the proportions of the target classes within that category. Suppose the node variable for which we are trying to find a subnode has two categories, C1 and C2. C1 has 2 yes's and 8 no's (out of the 10 observations with C1 as the value of the node variable), while C2 has 6 yes's and 4 no's out of the remaining 10 observations (a total of 8 yes's and 12 no's across the 20 observations). Gini for each category is calculated as follows:

Gini(C1) = (2/10)^2 + (8/10)^2 = 0.04 + 0.64 = 0.68

Gini(C2) = (6/10)^2 + (4/10)^2 = 0.36 + 0.16 = 0.52

Then, Gini index for a variable is calculated by taking a weighted average over the categories:

Gini index = (10/20) × 0.68 + (10/20) × 0.52 = 0.60

The Gini index is calculated for all the candidate variables, and the variable with the highest Gini index is selected to create the subnode.
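A minimal Python sketch of this calculation, using the Gini definition given above (sum of squared class proportions per category, followed by a weighted average); the function names are ours and the counts are those of the C1/C2 example.

```python
def gini_category(counts):
    """Gini for one category of the node variable: sum of squared class proportions."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

def gini_index(category_counts):
    """Weighted average of the per-category Gini values.

    `category_counts` maps each category of the node variable to the
    class counts of the target within it, for example {"C1": [2, 8], ...}.
    """
    grand_total = sum(sum(counts) for counts in category_counts.values())
    return sum(sum(counts) / grand_total * gini_category(counts)
               for counts in category_counts.values())

print(gini_category([2, 8]))                     # ≈ 0.68
print(gini_category([6, 4]))                     # ≈ 0.52
print(gini_index({"C1": [2, 8], "C2": [6, 4]}))  # ≈ 0.60 = 0.5 * 0.68 + 0.5 * 0.52
```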

Reduction in Variance

In the Reduction in Variance method, the target variable is supposed to be a numerical variable. Let's consider the same example that we considered for understanding the Gini index method. To convert the target variable (yes and no) to a numerical variable, let's transform yes to 1 and no to 0 (resulting in 8 1's and 12 0's).

Variance(C1) = (2 × (1 - 0.2)^2 + 8 × (0 - 0.2)^2) / 10 = 0.16 (the mean of C1 is 2/10 = 0.2)

Variance(C2) = (6 × (1 - 0.6)^2 + 4 × (0 - 0.6)^2) / 10 = 0.24 (the mean of C2 is 6/10 = 0.6)

Weighted variance = (10/20) × 0.16 + (10/20) × 0.24 = 0.20

The weighted variance is calculated for each candidate variable, and the variable with the smallest weighted variance is chosen to be the node.
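The same numbers can be reproduced with a short Python sketch (the helper functions are ours; the two groups encode the yes/no example above as 1s and 0s):

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def weighted_variance(groups):
    """Size-weighted average of the per-category variances."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * variance(g) for g in groups)

# yes -> 1, no -> 0: C1 has 2 ones and 8 zeros, C2 has 6 ones and 4 zeros.
c1 = [1] * 2 + [0] * 8
c2 = [1] * 6 + [0] * 4
print(variance(c1), variance(c2))    # ≈ 0.16 and ≈ 0.24
print(weighted_variance([c1, c2]))   # ≈ 0.20
```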

Pruning a tree

As we have seen earlier, a decision tree keeps growing by adding subnodes until the data is perfectly classified or totally homogeneous. An extreme case is a decision tree that has as many nodes as there are observations in the dataset. This happens rarely, but a more common phenomenon is over-fitting, where the tree over-learns the given training dataset but doesn't perform as well on other similar datasets. On the other hand, a very small tree might miss out on some important variables and therefore lose accuracy.

A common strategy to overcome this situation is to first allow the tree to grow until the nodes have the minimum number of instances under them and then prune the tree to remove the branches or nodes that don't provide a lot of classification power to the tree. This procedure is called pruning a decision tree and can be done in both a bottom-up and top-down fashion. The following are a few methods that are used to do this:

  • Reduced error pruning: This is a naïve bottom-up approach for pruning a tree. Starting from the terminal nodes or leaves, each node is replaced with its most popular category. If the accuracy of the tree is not affected, the replacement is kept.
  • Cost complexity pruning: This method generates a sequence of trees, T0, T1, …, Tn, where T0 is the unpruned tree and Tn is just the root node. The tree Ti is created by replacing a subtree of Ti-1 with a leaf node, typically labeled with the most common target category among the observations in that subtree. The subtree to be replaced is chosen as follows (a scikit-learn sketch of this kind of pruning follows these steps):
    1. Define an error rate for a decision tree T over a dataset D as err(T,D).
    2. The number of terminal nodes or leaves in a decision tree T is given by leaves(T). A decision tree T from which a subtree t has been pruned is denoted by prune(T,t).
    3. Define a function M, where M=[err(prune(T,t),D)-err(T,D)]/[|leaves(T)|-|leaves(prune(T,t))|].
    4. Calculate M for all the candidate subtrees.
    5. The subtree that minimizes M is chosen for removal.
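scikit-learn exposes cost complexity pruning through the ccp_alpha parameter of its tree estimators; this is the standard minimal cost-complexity formulation rather than the exact function M defined above, but the idea of generating a sequence of progressively smaller trees is the same. A sketch, using the built-in iris data purely as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow the unpruned tree (T0) and compute its cost-complexity pruning path.
tree = DecisionTreeClassifier(random_state=0)
path = tree.cost_complexity_pruning_path(X, y)

# Each ccp_alpha corresponds to one tree in the sequence T0, T1, ..., Tn;
# refitting with a larger alpha yields a more heavily pruned (smaller) tree.
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}")
```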

Handling a continuous numerical variable

Earlier in the discussion, we mentioned that continuous numerical variables can also be used as predictor variables while creating a decision tree. However, the algorithms we have discussed for choosing a subnode and growing the tree all require categorical variables. How, then, do we use continuous numerical variables to create a decision tree?

The answer is to define thresholds for the continuous numerical variable, based on which the variable is dynamically divided into several categories. How do we define such thresholds for a numerical variable? To answer this question, let us suppose that the harvest dataset used earlier also has information about Temperature. Let us look at the first 10 observations of such a dataset (showing only the Temperature and Harvest variables):

Temp (°C) | Harvest
--------- | -------
35 | Bumper
27 | Moderate
12 | Meagre
51 | Meagre
46 | Meagre
38 | Bumper
4 | Meagre
22 | Moderate
29 | Moderate
17 | Meagre

To find the appropriate thresholds that can be used to divide Temperature into categories for the sake of creating a decision tree, the following steps can be performed:

  1. Sort the dataset based on Temperature (the numerical variable) in ascending order. The sorted dataset will look as shown in the following table:

    Temp (°C) | Harvest
    --------- | -------
    4 | Meagre
    12 | Meagre
    17 | Meagre
    22 | Moderate
    27 | Moderate
    29 | Moderate
    35 | Bumper
    38 | Bumper
    46 | Meagre
    51 | Meagre

  2. Mark the Temperature ranges where the Harvest type transitions from one category to another. For example, in the 17-22 range, the Harvest type changes from Meagre to Moderate. Similarly, in the 29-35 range, the Harvest type transitions from Moderate to Bumper, and in the 38-46 range, it transitions from Bumper back to Meagre.
  3. Corresponding to each transition in Harvest, there will be one threshold. In this case, there are three transitions so there will be three thresholds. Let us call these c1, c2, and c3.
  4. The threshold of Temperature for each transition is simply the average of the Temperature range over which the transition occurs. Thus, c1=(17+22)/2=19.5, c2=(29+35)/2=32, and c3=(38+46)/2=42.
  5. Corresponding to each of the thresholds, there will be a separate category of Temperature. Thus, the three categories of Temperature will be as follows: Temperature>19.5, Temperature>32, and Temperature>42.

This method works because it has been shown that the thresholds that maximize the information gain for a continuous numerical variable always lie at points where the target variable transitions from one category to another.
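These steps are easy to script. The following is a minimal sketch using the 10 Temperature/Harvest values from the table above (the variable names are ours):

```python
# Temperature and Harvest for the 10 observations shown above
temps   = [35, 27, 12, 51, 46, 38, 4, 22, 29, 17]
harvest = ["Bumper", "Moderate", "Meagre", "Meagre", "Meagre",
           "Bumper", "Meagre", "Moderate", "Moderate", "Meagre"]

# Step 1: sort the observations by temperature.
pairs = sorted(zip(temps, harvest))

# Steps 2-4: wherever the harvest type changes between two consecutive
# observations, the candidate threshold is the midpoint of their temperatures.
thresholds = [(t1 + t2) / 2
              for (t1, h1), (t2, h2) in zip(pairs, pairs[1:])
              if h1 != h2]
print(thresholds)   # [19.5, 32.0, 42.0] for this sample
```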

Handling a missing value of an attribute

One of the advantages of using a decision tree is that it can handle missing values of an attribute. In many real-life situations where decision trees are used, one faces data points with missing attribute values. For example, consider a situation where one wants to predict a patient's condition based on a number of laboratory results. What if some of the test results are not available for some of the patients? How do we handle such a situation?

Let us consider the first 10 observations of the dataset we used above, albeit with a few missing values:

Place | Rainfall | Terrain | Fertilizers | Groundwater | Harvest
----- | -------- | ------- | ----------- | ----------- | -------
P1 | High | Plains | Yes | Yes | Bumper
P2 | Low |  | No | Yes | Meagre
P3 | Low | Plateau | No | Yes | Moderate
P4 | High | Plateau | No | Yes | Moderate
P5 | High | Plains | Yes | No | Bumper
P6 | Low | Hills | No | No | Meagre
P7 | Low | Plateau | No | No | Meagre
P8 | Mild | Plateau | No | No | Meagre
P9 | High | Hills | Yes | Yes | Moderate
P10 | Mild |  | Yes | Yes | Bumper

Some of the approaches that can be used to handle the missing values while creating a decision tree are as follows:

  • Assign the most common (highest frequency) category of that variable to the missing values. In the preceding case, the Terrain variable has missing values, and its most common category is Plateau, so the missing values will be replaced by Plateau.
  • Assign to each missing value the most common (highest frequency) category of that variable among the observations that have the same class of the target variable. For the first blank (P2), the Harvest class is Meagre, and the most common Terrain for Harvest==Meagre is Plateau, so we replace the blank with Plateau. For the second blank (P10), the Harvest class is Bumper, and the most common Terrain for Harvest==Bumper is Plains, so we replace the blank with Plains.

Once the missing values have been replaced, the usual decision tree algorithms can be applied.
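Both approaches can be expressed in a few lines of pandas. The following sketch uses the Place, Rainfall, Terrain, and Harvest columns from the table above; the column names Terrain_approach1 and Terrain_approach2 are ours, added just to show the two fills side by side.

```python
import pandas as pd

# The 10 observations above, with Terrain missing for P2 and P10 (other columns omitted).
df = pd.DataFrame({
    "Place":    [f"P{i}" for i in range(1, 11)],
    "Rainfall": ["High", "Low", "Low", "High", "High", "Low", "Low", "Mild", "High", "Mild"],
    "Terrain":  ["Plains", None, "Plateau", "Plateau", "Plains",
                 "Hills", "Plateau", "Plateau", "Hills", None],
    "Harvest":  ["Bumper", "Meagre", "Moderate", "Moderate", "Bumper",
                 "Meagre", "Meagre", "Meagre", "Moderate", "Bumper"],
})

# Approach 1: fill with the overall mode of Terrain (Plateau here).
df["Terrain_approach1"] = df["Terrain"].fillna(df["Terrain"].mode()[0])

# Approach 2: fill with the mode of Terrain among rows sharing the same Harvest class.
mode_per_class = df.groupby("Harvest")["Terrain"].agg(lambda s: s.mode()[0])
df["Terrain_approach2"] = df["Terrain"].fillna(df["Harvest"].map(mode_per_class))

print(df[["Place", "Harvest", "Terrain_approach1", "Terrain_approach2"]])
```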
