Implementing extremely randomized trees

Random forests already inject a good deal of randomness. But what if we wanted to take that randomness to its extreme?

In extremely randomized trees (see the ExtraTreesClassifier and ExtraTreesRegressor classes), the randomness is taken even further than in random forests. Remember how standard decision trees search, for every candidate feature, for the threshold that maximizes the purity of the resulting split? Extremely randomized trees instead draw one threshold at random for each candidate feature, and the best of these randomly generated thresholds is then used as the splitting rule.
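To make this concrete, here is a minimal, simplified sketch of how a single extremely randomized split could be chosen. The gini_impurity and random_split helpers are illustrative names invented for this sketch; they are not part of scikit-learn, whose actual implementation lives in optimized Cython and handles many more details:

import numpy as np

def gini_impurity(y):
    # Gini impurity of a label array: 1 minus the sum of squared class probabilities
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def random_split(X, y, rng):
    # Draw one random threshold per feature and keep the (feature, threshold)
    # pair that yields the lowest weighted impurity of the two child nodes
    best = None
    for feature in range(X.shape[1]):
        lo, hi = X[:, feature].min(), X[:, feature].max()
        threshold = rng.uniform(lo, hi)  # random threshold, not an optimized one
        mask = X[:, feature] <= threshold
        left, right = y[mask], y[~mask]
        if len(left) == 0 or len(right) == 0:
            continue  # degenerate split, skip it
        score = (len(left) * gini_impurity(left)
                 + len(right) * gini_impurity(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, feature, threshold)
    return best  # (weighted impurity, feature index, threshold)

Given a feature matrix X and labels y, random_split(X, y, np.random.default_rng(100)) would return one such randomized splitting rule for the root node.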

We can build an extremely randomized tree as follows:

In [16]: from sklearn.ensemble import ExtraTreesClassifier
... extra_tree = ExtraTreesClassifier(n_estimators=10, random_state=100)
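ExtraTreesClassifier accepts largely the same hyperparameters as RandomForestClassifier (one notable difference is that it does not bootstrap the training samples by default). The following variant is purely illustrative, with values chosen for demonstration rather than used anywhere else in this section:

extra_tree_tuned = ExtraTreesClassifier(
    n_estimators=100,     # more trees usually yield a smoother, more stable ensemble
    max_depth=None,       # grow each tree until its leaves are pure
    max_features='sqrt',  # number of candidate features considered at each split
    random_state=100
)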

To illustrate the difference between a single decision tree, a random forest, and extremely randomized trees, let's consider a simple dataset, such as the Iris dataset:

In [17]: from sklearn.datasets import load_iris
... iris = load_iris()
... X = iris.data[:, [0, 2]]
... y = iris.target
In [18]: from sklearn.model_selection import train_test_split
... X_train, X_test, y_train, y_test = train_test_split(
...     X, y, random_state=100
... )

We can then fit and score the extra_tree object the same way we did before:

In [19]: extra_tree.fit(X_train, y_train)
... extra_tree.score(X_test, y_test)
Out[19]: 0.92105263157894735

For comparison, using a random forest would have resulted in the same performance:

In [20]: from sklearn.ensemble import RandomForestClassifier
... forest = RandomForestClassifier(n_estimators=10,
...                                 random_state=100)
... forest.fit(X_train, y_train)
... forest.score(X_test, y_test)
Out[20]: 0.92105263157894735

In fact, the same is true for a single tree:

In [21]: from sklearn.tree import DecisionTreeClassifier
... tree = DecisionTreeClassifier()
... tree.fit(X_train, y_train)
... tree.score(X_test, y_test)
Out[21]: 0.92105263157894735

So, what's the difference between them? To answer this question, we have to look at the decision boundaries. Fortunately, we have already imported our plot_decision_boundary helper function in the preceding section, so all we need to do is pass the different classifier objects to it.
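If you are jumping into this section directly, a minimal sketch of what such a helper might look like is shown below. This is a mesh-grid-based implementation written for this example; it is not necessarily identical to the helper defined in the preceding section:

import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(model, X, y):
    # Evaluate the classifier on a dense grid covering the two features
    h = 0.02  # grid step size
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    zz = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    # Color the background by predicted class and overlay the data points
    plt.contourf(xx, yy, zz, alpha=0.3)
    plt.scatter(X[:, 0], X[:, 1], c=y, s=30, edgecolors='k')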

We will build a list of classifiers, where each entry in the list is a tuple that contains an index, a name for the classifier, and the classifier object:

In [22]: classifiers = [
... (1, 'decision tree', tree),
... (2, 'random forest', forest),
... (3, 'extremely randomized trees', extra_tree)
... ]

Then, it's easy to pass the list of classifiers to our helper function such that the decision landscape of every classifier is drawn in its own subplot:

In [23]: import matplotlib.pyplot as plt
... for sp, name, model in classifiers:
...     plt.subplot(1, 3, sp)
...     plot_decision_boundary(model, X_test, y_test)
...     plt.title(name)
...     plt.axis('off')

The result looks like this:

Now the differences between the three classifiers become clearer. We see the single tree drawing by far the simplest decision boundaries, splitting the landscape with horizontal cuts. The random forest is able to more clearly separate the cloud of data points in the lower-left of the decision landscape. However, only the extremely randomized trees were able to corner the cloud of data points toward the center of the landscape from all sides.

Now that we know about all the different variations of tree ensembles, let's move on to a real-world dataset.
