Understanding the shortcomings of decision trees

Overfitting the dataset, something a decision tree often falls victim to, is best demonstrated through a simple example.

For this, we will return to the make_moons function from scikit-learn's datasets module, which we previously used in Chapter 8, Discovering Hidden Structures with Unsupervised Learning, to generate data organized into two interleaving half circles. Here, we generate 100 data samples belonging to the two half circles, combined with some Gaussian noise with a standard deviation of 0.25:

In [1]: from sklearn.datasets import make_moons
...     X, y = make_moons(n_samples=100, noise=0.25,
...                       random_state=100)

We can visualize this data using matplotlib and the scatter function:

In [2]: import matplotlib.pyplot as plt
...     %matplotlib inline
...     plt.style.use('ggplot')
In [3]: plt.scatter(X[:, 0], X[:, 1], s=100, c=y)
...     plt.xlabel('feature 1')
...     plt.ylabel('feature 2');

The resulting plot shows the data colored according to the class they belong to:

Because of all the noise we added, the two half moons might not be apparent at first glance. That's a perfect scenario for our current intention, which is to show that decision trees are tempted to overlook the general arrangement of the data points (that is, the fact that they are organized in half circles) and instead focus on the noise in the data.

To illustrate this point, we first need to split the data into training and test sets. We choose a comfortable 75-25 split (the default when neither train_size nor test_size is specified), as we have done a number of times before:

In [4]: from sklearn.model_selection import train_test_split
...     X_train, X_test, y_train, y_test = train_test_split(
...         X, y, random_state=100
...     )

Now, let's have some fun. What we want to do is to study how the decision boundary of a decision tree changes as we make it deeper.

For this, we will bring back the plot_decision_boundary function that we used in Chapter 6, Detecting Pedestrians with Support Vector Machines, among other chapters:

In [5]: import numpy as np
...     def plot_decision_boundary(classifier, X_test, y_test):

The function itself involves a number of processing steps:

  1. Create a dense 2D grid of data points between (x_min, y_min) and (x_max, y_max):

...         h = 0.02  # step size in mesh
...         x_min = X_test[:, 0].min() - 1
...         x_max = X_test[:, 0].max() + 1
...         y_min = X_test[:, 1].min() - 1
...         y_max = X_test[:, 1].max() + 1
...         xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
...                              np.arange(y_min, y_max, h))

  2. Classify every point on the grid:

...         X_hypo = np.c_[xx.ravel().astype(np.float32),
...                        yy.ravel().astype(np.float32)]
...         ret = classifier.predict(X_hypo)

  3. If the predict method returns a tuple, we are dealing with an OpenCV classifier; otherwise, we are dealing with scikit-learn:

...         if isinstance(ret, tuple):
...             zz = ret[1]
...         else:
...             zz = ret
...         zz = zz.reshape(xx.shape)

  4. Use the predicted target labels (zz) to color the decision landscape:

...         plt.contourf(xx, yy, zz, cmap=plt.cm.coolwarm,
...                      alpha=0.8)

  5. Draw all the data points from the test set on top of the colored landscape:

...         plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test,
...                     s=200)
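Assembled in one place, the helper from the preceding steps might look as follows; this is the same logic reproduced as a single, self-contained function so it can be copied and run as one block:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_decision_boundary(classifier, X_test, y_test):
        # 1. Create a dense 2D grid spanning the test data (with a 1-unit margin)
        h = 0.02  # step size in mesh
        x_min, x_max = X_test[:, 0].min() - 1, X_test[:, 0].max() + 1
        y_min, y_max = X_test[:, 1].min() - 1, X_test[:, 1].max() + 1
        xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                             np.arange(y_min, y_max, h))

        # 2. Classify every point on the grid
        X_hypo = np.c_[xx.ravel().astype(np.float32),
                       yy.ravel().astype(np.float32)]
        ret = classifier.predict(X_hypo)

        # 3. OpenCV's predict returns a (retval, results) tuple;
        #    scikit-learn returns the labels directly
        zz = ret[1] if isinstance(ret, tuple) else ret
        zz = zz.reshape(xx.shape)

        # 4. Color the decision landscape ...
        plt.contourf(xx, yy, zz, cmap=plt.cm.coolwarm, alpha=0.8)

        # 5. ... and draw the test points on top of it
        plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=200)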

Then, we can code a for loop where, at each iteration, we fit a tree of a different depth:

In [6]: from sklearn.tree import DecisionTreeClassifier
...     for depth in range(1, 9):

In order to plot the decision landscape for every created decision tree, we need to create a plot with a total of eight subplots. The depth iterator can then serve as an index for the current subplot:

...         plt.subplot(2, 4, depth)

After fitting the tree to the data, we pass the predicted test labels to the plot_decision_boundary function:

...         tree = DecisionTreeClassifier(max_depth=depth)
...         tree.fit(X_train, y_train)
...         y_pred = tree.predict(X_test)
...         plot_decision_boundary(tree, X_test, y_pred)

For the sake of clarity, we also turn off the axes and decorate each plot with a title:

...         plt.axis('off')
...         plt.title('depth = %d' % depth)

This produces the following plot:

Recall the procedure of a decision tree: at every node in the tree, the algorithm picks a feature to split on, either along the x axis (the first feature) or along the y axis (the second feature). For example, a tree of depth = 1 has only one decision node and, thus, makes only one decision. This particular tree (upper row, leftmost panel) chose to split the second feature right down the middle, leading to a horizontal decision boundary that separates the decision landscape into an upper and a lower half.
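If we want to double-check which feature and threshold the depth-1 tree picked, we could refit it and inspect the fitted tree_ attribute; a minimal sketch (the refit here is purely for illustration):

    from sklearn.tree import DecisionTreeClassifier

    stump = DecisionTreeClassifier(max_depth=1)
    stump.fit(X_train, y_train)

    # tree_.feature[0] and tree_.threshold[0] describe the split at the root node
    print('root node splits feature', stump.tree_.feature[0],
          'at threshold %.3f' % stump.tree_.threshold[0])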

Similarly, a decision tree of depth = 2 will go a step further. After making the first decision of splitting the second feature in layer 1, the algorithm ends up with two additional nodes in layer 2. This particular tree (upper row, second panel from the left) chose to split the first feature in layer 2, leading to two additional vertical lines.
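A textual view of these nested splits can be helpful, too. scikit-learn's export_text function prints one indentation level per layer of the tree; a brief sketch, again refitting a depth-2 tree for illustration:

    from sklearn.tree import DecisionTreeClassifier, export_text

    tree2 = DecisionTreeClassifier(max_depth=2)
    tree2.fit(X_train, y_train)

    # Each line shows the feature and threshold used at that node
    print(export_text(tree2, feature_names=['feature 1', 'feature 2']))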

As we continue to build deeper and deeper trees, we notice something strange: the deeper the tree, the more likely it is to produce oddly shaped decision regions, such as the tall and skinny patches in the rightmost panel of the lower row. It's clear that these patches are more a result of the noise in the data than of some characteristic of the underlying data distribution. This is an indication that most of the trees are overfitting the data. After all, we know for a fact that the data is organized into two half circles! As such, the trees with depth = 3 or depth = 5 are probably closest to the real data distribution.
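One way to back this visual impression up with numbers is to compare training and test accuracy at each depth; a widening gap between the two scores is a classic symptom of overfitting. A minimal sketch:

    from sklearn.tree import DecisionTreeClassifier

    for depth in range(1, 9):
        tree = DecisionTreeClassifier(max_depth=depth, random_state=100)
        tree.fit(X_train, y_train)
        # score() reports the mean accuracy on the given data
        print('depth %d: train %.2f, test %.2f'
              % (depth, tree.score(X_train, y_train),
                 tree.score(X_test, y_test)))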

There are at least two different ways to make a decision tree less powerful:

  • Train the tree only on a subset of the data
  • Train the tree only on a subset of the features

Random forests do just that. In addition, they repeat the experiment many times by building an ensemble of trees, each of which is trained on a randomly chosen subset of data samples and/or features.
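As a quick preview of what the rest of this chapter covers, such an ensemble is available in scikit-learn as RandomForestClassifier; a minimal sketch, where the number of trees (n_estimators) and the per-split feature subset (max_features) are illustrative choices rather than tuned values:

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(n_estimators=100,    # number of trees in the ensemble
                                    max_features='sqrt', # features considered at each split
                                    random_state=100)
    forest.fit(X_train, y_train)
    print('test accuracy: %.2f' % forest.score(X_test, y_test))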

Random forests are powerful ensemble classifiers that come with several advantages:

  • Both training and evaluation are fast. Not only are the underlying decision trees relatively simple data structures, but every tree in the forest is independent, making it easy to train them in parallel.
  • Multiple trees allow for probabilistic classification. Later in this chapter, we will talk about voting procedures, which allow an ensemble to predict the probability with which a data point belongs to a particular class; a quick preview follows this list.
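For instance, scikit-learn exposes these per-class probabilities through predict_proba, which aggregates the votes of the individual trees; a small sketch, reusing the illustrative forest object fitted above:

    # Each row holds the estimated probabilities for class 0 and class 1
    print(forest.predict_proba(X_test[:3]))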

That being said, one of the disadvantages of random forests is that they might be hard to interpret, and the same goes for individual trees that are relatively deep. Although cumbersome, it is still possible to investigate the decision path within a particular tree. You might want to check out Chapter 5, Using Decision Trees to Make a Medical Diagnosis, for a refresher.
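For example, scikit-learn's decision_path method reports which nodes a sample passes through in a single tree; a brief sketch, applied to the first tree of the illustrative forest fitted above:

    # estimators_ holds the individual fitted trees of the forest
    first_tree = forest.estimators_[0]

    # decision_path returns a sparse indicator matrix whose entry (i, j) is 1
    # if sample i passes through node j of the tree
    node_indicator = first_tree.decision_path(X_test[:1])
    print('nodes visited by the first test sample:', node_indicator.indices)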
