Training and testing the random forest

We continue to follow our best practice to split the data into training and test sets:

In [8]: from sklearn.model_selection import train_test_split
...     X_train, X_test, y_train, y_test = train_test_split(
...         X, y, random_state=21
...     )

Then, we are ready to apply a random forest to the data:

In [9]: import cv2
...     rtree = cv2.ml.RTrees_create()

Here, we want to create an ensemble with 50 decision trees:

In [10]: n_trees = 50
...      eps = 0.01
...      criteria = (cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS,
...                  n_trees, eps)
...      rtree.setTermCriteria(criteria)
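For RTrees, these two criteria have a specific meaning: TERM_CRITERIA_MAX_ITER caps the number of trees in the forest, and TERM_CRITERIA_EPS stops training early once the out-of-bag (OOB) error falls below eps. As a quick sanity check (a small sketch, not part of the original listing), the criteria can be read back from the model:

# Sketch: read back the termination criteria we just set.
# The Python bindings return a (type, max_count, epsilon) tuple,
# so this should echo something like (3, 50, 0.01).
print(rtree.getTermCriteria())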

Because we have a large number of categories (that is, 40), we want to make sure the random forest is set up to handle them accordingly:

In [11]: rtree.setMaxCategories(len(np.unique(y)))

We can also play with other optional parameters, such as the minimum number of data points required in a node before it can be split:

In [12]: rtree.setMinSampleCount(2)

However, we might not want to limit the depth of each tree. This is another parameter we will eventually have to experiment with, but for now, let's set it to a large integer value, making the depth effectively unconstrained:

In [13]: rtree.setMaxDepth(1000)

Then, we can fit the classifier to the training data:

In [14]: rtree.train(X_train, cv2.ml.ROW_SAMPLE, y_train);

We can check the resulting depth of the tree using the following function:

In [15]: rtree.getMaxDepth()
Out[15]: 25

This means that although we allowed the tree to go up to depth 1,000, in the end, only 25 layers were needed.

As before, we evaluate the classifier by first predicting the labels (y_hat) and then passing them to the accuracy_score function:

In [16]: _, y_hat = rtree.predict(X_test)
In [17]: from sklearn.metrics import accuracy_score
...      accuracy_score(y_test, y_hat)
Out[17]: 0.87

We find 87% accuracy, which turns out to be much better than with a single decision tree:

In [18]: from sklearn.tree import DecisionTreeClassifier
...      tree = DecisionTreeClassifier(random_state=21, max_depth=25)
...      tree.fit(X_train, y_train)
...      tree.score(X_test, y_test)
Out[18]: 0.46999999999999997

Not bad! We can play with the optional parameters to see whether we can do better. The most important one seems to be the number of trees in the forest. Let's repeat the experiment with a forest of 1,000 trees instead of 50:

In [19]: n_trees = 1000
...      eps = 0.01
...      criteria = (cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS,
...                  n_trees, eps)
...      rtree.setTermCriteria(criteria)
...      rtree.train(X_train, cv2.ml.ROW_SAMPLE, y_train);
...      _, y_hat = rtree.predict(X_test)
...      accuracy_score(y_test, y_hat)
Out[19]: 0.94

With this configuration, we get 94% accuracy!
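For reference, a similar experiment can be run with scikit-learn's own random forest implementation. The following is only a sketch (it is not part of the original listings, and the exact score will differ because the two libraries use different defaults), assuming X_train, X_test, y_train, and y_test are the splits created above:

from sklearn.ensemble import RandomForestClassifier

# Sketch: a scikit-learn forest of comparable size for cross-checking.
forest = RandomForestClassifier(n_estimators=1000, random_state=21)
forest.fit(X_train, y_train)
forest.score(X_test, y_test)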

Here, we tried to improve the performance of our model through creative trial and error: we varied some of the parameters we deemed important and observed the resulting change in performance until we found a configuration that satisfied our expectations. We will learn more sophisticated techniques for improving a model in Chapter 11, Selecting the Right Model with Hyperparameter Tuning.
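As a small taste of that, here is a sketch (not from the original text, and assuming rtree and the splits from above are still in scope) of how the manual search could be made slightly more systematic by sweeping over a few forest sizes and recording the accuracy of each:

from sklearn.metrics import accuracy_score

# Sketch: retrain the same RTrees model with different tree counts.
# Each call to train() fits the forest from scratch with the new criteria.
for n_trees in [10, 50, 100, 500, 1000]:
    criteria = (cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS,
                n_trees, 0.01)
    rtree.setTermCriteria(criteria)
    rtree.train(X_train, cv2.ml.ROW_SAMPLE, y_train)
    _, y_hat = rtree.predict(X_test)
    print(n_trees, accuracy_score(y_test, y_hat))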

Another interesting use case of decision tree ensembles is AdaBoost.
