Entropy and information gain

Before we explain how to create a Decision Tree, we need to introduce two important concepts—entropy and information gain.

Entropy measures how homogeneous a dataset is: the more homogeneous the dataset, the lower its entropy. Imagine a dataset of 10 observations with a single attribute, as shown in the following diagram, where the value of this attribute is A for all 10 observations. This dataset is completely homogeneous, so it is easy to predict the value of the next observation; it will probably be A:

The entropy of a completely homogeneous dataset is zero. Now, imagine a similar dataset in which each observation has a different value, as shown in the following diagram:

Now, the dataset is very heterogeneous and it's hard to predict the next observation. In this dataset, the entropy is higher. The formula to calculate the entropy is $H(X) = -\sum_{x} p(x) \log_2 p(x)$, where $p(x)$ is the probability of x.
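As a quick check of the two cases described above, the following minimal sketch computes the entropy of a list of values. The function name `entropy` and the two sample lists are our own, chosen for illustration; we assume a base-2 logarithm, so the result is expressed in bits.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy, in bits, of a sequence of categorical values."""
    total = len(values)
    counts = Counter(values)
    # p * log2(1 / p) is the same quantity as -p * log2(p), with p = n / total.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

homogeneous = ["A"] * 10            # ten observations, all with the value A
heterogeneous = list("ABCDEFGHIJ")  # ten observations, each with a different value

print(entropy(homogeneous))    # 0.0: completely predictable
print(entropy(heterogeneous))  # log2(10), roughly 3.32 bits
```

With ten equally likely values, the entropy reaches its maximum for a dataset of this size, which matches the intuition that the next observation is hard to predict.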

Try to calculate the entropy for the following datasets:

Now we understand how entropy tells us how predictable a dataset is. A dataset with low entropy is very predictable; a dataset with high entropy is very hard to predict. We're ready to look at information gain and at how entropy and information gain can help us to create a Decision Tree.

Information gain measures the decrease in entropy you achieve when you split a dataset. We use it in the process of building a Decision Tree. We're going to use an example to understand this concept. In this example, our objective is to create a tree that classifies loan applications, according to their probability of default, into low-risk applications and high-risk applications. Our dataset has three input variables, Purpose, Sex, and Age, and one output variable, Default?.

The following image shows the dataset:

To create the Decision Tree, we will start by choosing an attribute for the root node. This attribute will split our dataset into two or more subsets. We will choose the attribute that adds the most predictability, that is, the one that reduces the entropy the most. We will start by calculating the entropy for the current dataset:

We start with an entropy of 0.97; our objective is to reduce the entropy in order to increase the predictability. What happens if we choose the attribute Purpose for our root node? By choosing Purpose for our root node, we divide the dataset into three subsets. Each subset contains five observations. We can calculate the entropy of each subset and aggregate the results, weighting each subset by its size, to obtain a global entropy value.
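The aggregation step can be sketched as a size-weighted average of the subset entropies. The dataset table is not reproduced here, so the class counts below are assumptions, picked only because they are consistent with the figures quoted in the text (15 observations, three Purpose subsets of five, and entropies of roughly 0.97 and 0.89); the function names are ours as well.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits of a class distribution given as counts, e.g. [6, 9]."""
    total = sum(counts)
    # Classes with a zero count contribute nothing; skip them to avoid dividing by zero.
    return sum((n / total) * math.log2(total / n) for n in counts if n > 0)

def weighted_entropy(subsets):
    """Average the subset entropies, weighting each subset by its share of the rows."""
    total = sum(sum(counts) for counts in subsets)
    return sum(sum(counts) / total * entropy_from_counts(counts) for counts in subsets)

# ASSUMED counts per class (default, no default); not read from the original table.
whole_dataset = [6, 9]
purpose_subsets = [[2, 3], [3, 2], [1, 4]]  # three subsets of five observations

print(round(entropy_from_counts(whole_dataset), 2))  # 0.97
print(round(weighted_entropy(purpose_subsets), 2))   # 0.89
```

Weighting by subset size matters when the subsets are unequal, as they are for the Sex and Age splits; for three subsets of five observations it reduces to a plain average.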

The original entropy was 0.97. If we use Purpose for our root node and divide the dataset into three sets, the entropy will be 0.89, so our new dataset will be more predictable. The difference between the original entropy and the new entropy is the information gain. In this example, the information gain is 0.08. However, what happens if we choose Sex or Age for our root node?
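Written as a formula, and using notation that is ours rather than the original text's ($S$ for the dataset, $A$ for the attribute, $S_v$ for the subset of $S$ where $A$ takes the value $v$, and $H$ for the entropy), the information gain of a split is usually defined as follows. The second line plugs in the values quoted above for Purpose.

```latex
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)

IG(S, \text{Purpose}) \approx 0.97 - 0.89 = 0.08
```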

If we use Sex to split the dataset, we create two subsets. The male subset contains seven observations and the female subset contains eight observations; the new entropy is 0.91. In this case, the information gain is 0.06, so Purpose is a better option than Sex to split the dataset. Splitting the dataset by Purpose makes the result more predictable. This is illustrated in the following diagram:

Finally, if we use Age to split the dataset, we obtain three subsets. The subset of young people (< 25) contains nine observations, the subset of middle-aged people contains four observations, and the subset of people older than 65 years contains two observations. In this case, the entropy is 0.52 and the information gain is 0.45.

The attribute Age has the highest information gain, so we will choose it for our root node, as illustrated in the following diagram:
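The comparison above is simple arithmetic, which the following sketch reproduces from the entropy values quoted in the text. The variable names are ours; the numbers are taken as given rather than recomputed from the dataset.

```python
# Entropy of the whole dataset and after each candidate split, as quoted in the text.
parent_entropy = 0.97
entropy_after_split = {"Purpose": 0.89, "Sex": 0.91, "Age": 0.52}

# Information gain is the drop in entropy produced by each split.
gains = {
    attribute: round(parent_entropy - after, 2)
    for attribute, after in entropy_after_split.items()
}

# The root node is the attribute with the largest gain.
best = max(gains, key=gains.get)

print(gains)
print(best)  # Age
```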

We've divided our dataset into three subsets, divided by Age.

After the root node, we need to choose a second attribute to split each of our three subsets and create a deeper tree.
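Repeating the same choice inside each subset leads naturally to a recursive procedure. The following is a minimal sketch in the spirit of the ID3 algorithm, not necessarily the exact procedure used later in this chapter; the function names, the nested-dictionary tree representation, and the six sample rows are all hypothetical and are not the loan table from the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())

def split(rows, attribute):
    """Group rows by their value of the given attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row)
    return groups

def information_gain(rows, attribute, target):
    """Entropy before the split minus the size-weighted entropy after it."""
    before = entropy([row[target] for row in rows])
    after = sum(
        len(subset) / len(rows) * entropy([row[target] for row in subset])
        for subset in split(rows, attribute).values()
    )
    return before - after

def build_tree(rows, attributes, target):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    labels = [row[target] for row in rows]
    # Stop when the subset is pure or no attributes remain; predict the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {
        best: {
            value: build_tree(subset, remaining, target)
            for value, subset in split(rows, best).items()
        }
    }

# Made-up rows for illustration only.
rows = [
    {"Age": "young", "Sex": "M", "Default": "yes"},
    {"Age": "young", "Sex": "F", "Default": "no"},
    {"Age": "middle", "Sex": "M", "Default": "no"},
    {"Age": "middle", "Sex": "F", "Default": "no"},
    {"Age": "senior", "Sex": "M", "Default": "yes"},
    {"Age": "senior", "Sex": "F", "Default": "yes"},
]

tree = build_tree(rows, ["Age", "Sex"], "Default")
print(tree)
```

On these sample rows, Age is chosen for the root because it has the larger gain, and only the young subset needs a second split, on Sex. Real implementations usually add further stopping rules, such as a maximum depth or a minimum subset size, which this sketch leaves out.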

