Implementing a decision tree with scikit-learn

Now that we are sufficiently familiar with the mathematics behind decision trees, let us implement a simple decision tree using the methods in scikit-learn. The dataset we will use is the commonly available iris dataset, which contains the petal and sepal dimensions of flowers along with their species. The purpose of this exercise is to create a classifier that can assign a flower to one of the species based on its petal and sepal dimensions.

To do this, let's first import the dataset and have a look at it:

import pandas as pd
data=pd.read_csv('E:/Personal/Learning/Predictive Modeling Book/My Work/Chapter 7/iris.csv')
data.head()
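
If the CSV file at the preceding path is not available, a roughly equivalent DataFrame can be built from the copy of the iris data that ships with scikit-learn. This is only a convenience sketch; the column names and species labels below are assumptions chosen to match the ones used in this chapter and may differ slightly from the CSV:

from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
# Column names are assumed to match the chapter's CSV
data = pd.DataFrame(iris.data, columns=['Sepal-length', 'Sepal-width', 'Petal-length', 'Petal-width'])
# Map the numeric targets (0, 1, 2) to species names; the CSV's labels may be spelled differently
data['Species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
data.head()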

The dataset looks as follows:


Fig. 8.7: The first few observations of the iris dataset

Sepal-length, Sepal-width, Petal-length, and Petal-width are the dimensions of the flower, while Species denotes the class the flower belongs to. There are three species classes here, which can be listed as follows:

data['Species'].unique()

The output will be three categories of the species as follows:


Fig. 8.8: The categories of the species in the iris dataset

The purpose of this exercise will be to classify the flowers as belonging to one of the three species based on the dimensions. Let us see how we can do this.

Let us first get the predictors and the target variables separated:

colnames=data.columns.values.tolist()
predictors=colnames[:4]
target=colnames[4]

The first four columns of the dataset become the predictors and the last one, that is, Species, becomes the target variable.

Next, let's split the dataset into training and testing data:

import numpy as np
data['is_train'] = np.random.uniform(0, 1, len(data)) <= .75
train, test = data[data['is_train']==True], data[data['is_train']==False]

In the second line, we create as many uniformly distributed random numbers between 0 and 1 as there are observations in the dataset. If the random number for an observation is less than or equal to .75, that observation goes to the training dataset; otherwise, it goes to the testing dataset.
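
As an aside, scikit-learn offers a more common way of creating such a split, train_test_split; this is only an alternative sketch, not the approach used above, and it holds out an exact 25% of the rows rather than a random proportion of roughly 25%:

from sklearn.model_selection import train_test_split

# Hold out 25% of the observations as the test set; random_state makes the split reproducible
train, test = train_test_split(data, test_size=0.25, random_state=99)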

We have everything ready to create a decision tree now. As we have seen earlier, there are several criteria for splitting a node into subnodes. The criterion can be specified while instantiating the DecisionTreeClassifier class of the sklearn library:

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion='entropy',min_samples_split=20, random_state=99)
dt.fit(train[predictors], train[target])

The min_samples_split parameter specifies the minimum number of observations a node must contain before it can be split. By default, it is set to 2, which can be troublesome and can lead to over-fitting, because the tree can keep growing until its nodes contain as few as two observations. In this case, we have set it to 20. Our decision tree is now ready. Let us now test it by using it for prediction over the testing dataset:

preds=dt.predict(test[predictors])
pd.crosstab(test['Species'],preds,rownames=['Actual'],colnames=['Predictions'])

In the first line of the preceding code snippet, the decision tree is used to predict the class (species) for the flowers in the test dataset using the flower dimensions. The second line creates a table comparing the Actual species and the Predicted species. The table looks as follows:


Fig. 8.9: Comparing the Actual and Predicted categories

This table can be interpreted as follows: all the actual setosas were classified correctly as setosas. Out of the total 13 versicolors, 11 were classified correctly and 2 were classified wrongly as virginicas. Out of the total 12 virginicas, 11 were classified correctly while 1 was classified wrongly as versicolor. This accuracy rate is pretty good.
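
To put a single number on this, the overall accuracy over the testing dataset can be computed directly; a minimal sketch using the variables defined above (the exact value depends on the random split):

from sklearn.metrics import accuracy_score

# Fraction of test observations whose species was predicted correctly
print(accuracy_score(test[target], preds))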

Visualizing the tree

In scikit-learn, there are the following four steps to visualize a tree:

  1. Creating a .dot file from the Decision Tree Classifier model that has been fit to the data.
  2. In Python, this can be done using the export_graphviz function in the sklearn.tree module. A .dot file contains the information necessary to draw a tree: the entropy (or Gini) value at each node, the number of observations in that node, the splitting condition at that node, and edges of the form node number -> node number denoting which node is connected to which. For example, 2->3 and 3->4 mean that node 2 is connected to node 3, node 3 is connected to node 4, and so on. You can specify the directory where you want to create the .dot file:
    from sklearn.tree import export_graphviz
    with open('E:/Personal/Learning/Predictive Modeling Book/My Work/Chapter 7/dtree2.dot', 'w') as dotfile:
        export_graphviz(dt, out_file=dotfile, feature_names=predictors)
  3. Take a look at the .dot file after it is created to get a better idea of its contents. It looks as follows:

    Fig. 8.10: Information inside a .dot file

  4. Rendering a .dot file into a tree:

    This can be done using the system function of the os module, which runs command-line (cmd) commands from within Python. This is done as follows:

    from os import system
    system("dot -Tpng /E:/Personal/Learning/Predictive Modeling Book/My Work/Chapter 7/dtree2.dot -o /E:/Personal/Learning/Predictive Modeling Book/My Work/Chapter 7/dtree2.png")

    Fig. 8.11: The decision tree

This is what the tree looks like. The left arrow from a node corresponds to the condition in that node being True, and the right arrow to it being False. Each node carries several important pieces of information, such as the entropy at that node (remember, the lower, the better), the number of samples (observations) at that node, and the number of samples belonging to each species (under the heading value).

The tree is read as follows:

  • If Petal Length<=2.45, then the flower species is setosa.
  • If Petal Length>2.45, then the tree checks Petal Width next. On the branch where the Petal Width condition in that node is True, Petal Length is checked again: if Petal Length<=4.95, then the species is versicolor; if Petal Length>4.95, then the node contains 1 versicolor and 2 virginicas, and further classification is not possible.
  • On the branch where the Petal Width condition is False, Petal Length is checked again: if Petal Length<=4.85, then the node contains 1 versicolor and 2 virginicas, and further classification is not possible; if Petal Length>4.85, then the species is virginica.

Some other observations from the tree are as follows:

  • The maximum depth (the number of levels) of the tree is 3. Within 3 levels, the tree has been able to identify the categories, making the terminal nodes largely homogeneous.
  • Sepal dimensions don't seem to be playing any role in the tree formation or, in other words, in the classification of these flowers into one of the species.
  • There is a terminal node at the 1st level itself. If Petal Length<=2.45, one gets only the setosa species of flowers.
  • The value in each node denotes the number of observations belonging to the three species (setosa, versicolor, and virginica in that order) at that node. Thus, the terminal node in the 1st level has 34 setosas, 0 versicolors, and 0 virginicas.
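
If Graphviz is not available, newer versions of scikit-learn (0.21 and later) can also print a plain-text rendering of the same fitted tree, showing the same split conditions and per-node counts; a minimal sketch:

from sklearn.tree import export_text

# Text rendering of the fitted tree; feature_names must be a list of strings
print(export_text(dt, feature_names=list(predictors)))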

Cross-validating and pruning the decision tree

The tree might have grown very complex even after setting min_samples_split to 20. DecisionTreeClassifier has a parameter that can be used to limit the maximum depth to which the tree grows. This is called max_depth. Let us use this parameter, together with the cross validation accuracy score, to find an optimum depth for the tree. We are effectively pruning the tree to an optimum depth at which it neither overfits nor underfits the dataset.

We will do cross validation over the entire dataset. If you remember, cross validation splits the dataset into training and testing sets on its own and does this a number of times to generalize the results of the model.

Let us cross validate our decision tree:

X=data[predictors]
Y=data[target]
dt1 = DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_split=20, random_state=99)
dt1.fit(X,Y)

In these lines, we just assigned predictor variables to X and the target variable to Y. We have created a new decision tree that is very similar to the tree we created previously, except that it has an additional parameter, namely, max_depth=5.

The next step is to import the cross validation methods in sklearn and perform the cross validation:

from sklearn.model_selection import KFold, cross_val_score
# In older versions of scikit-learn, these utilities lived in sklearn.cross_validation
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)
score = np.mean(cross_val_score(dt1, X, Y, scoring='accuracy', cv=crossvalidation, n_jobs=1))
score

We have chosen to do a 10-fold cross validation, and the score is the mean of the accuracy scores obtained from the folds. The score in this case comes out to be 0.933, which signifies the accuracy of the classification.

If we vary the max_depth from 1 to 10, this is how the mean accuracy score varies:
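
A minimal sketch of the loop that produces these scores, reusing the X, Y, and crossvalidation objects defined above (the exact numbers will vary with the folds):

# Mean cross-validated accuracy for each candidate tree depth
for depth in range(1, 11):
    dt_depth = DecisionTreeClassifier(criterion='entropy', max_depth=depth, min_samples_split=20, random_state=99)
    depth_score = np.mean(cross_val_score(dt_depth, X, Y, scoring='accuracy', cv=crossvalidation, n_jobs=1))
    print(depth, depth_score)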


As you can observe, for max_depth >= 4, the score remains almost constant. The maximum score is obtained when max_depth = 3. Hence, we will choose to grow our tree to only three levels from the root node.

Let us now do a feature importance test to determine which of the variables in the preceding dataset are actually important for the model. This can be easily done as follows:

dt1.feature_importances_

Fig. 8.12: Feature importance scores of the variables in the iris dataset

The higher the value, the higher the feature importance. Hence, we conclude that Petal width and Petal length (listed in ascending order of importance) are the important features for predicting the flower species in this dataset.
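
To make this output easier to read, the importance scores can be paired with the predictor names; a small sketch using the objects defined above:

# Pair each predictor with its importance score, sorted with the most important first
importance = pd.Series(dt1.feature_importances_, index=predictors)
print(importance.sort_values(ascending=False))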
