We continue to follow the best practice of splitting the data into training and test sets:
In [8]: from sklearn.model_selection import train_test_split
... X_train, X_test, y_train, y_test = train_test_split(
... X, y, random_state=21
... )
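By default, `train_test_split` reserves 25% of the samples for the test set. A minimal sketch with synthetic stand-in data (the shapes below are assumptions chosen to mirror a 400-sample, 40-class dataset, not the real face data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 400 samples, 64 features, 40 classes of 10 samples each
rng = np.random.RandomState(21)
X = rng.rand(400, 64).astype(np.float32)
y = np.repeat(np.arange(40), 10)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21)

# With the default test_size=0.25, we get a 300/100 split
print(X_train.shape, X_test.shape)
```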
Then, we are ready to apply a random forest to the data:
In [9]: import cv2
... rtree = cv2.ml.RTrees_create()
Here, we want to create an ensemble with 50 decision trees:
In [10]: n_trees = 50
... eps = 0.01
... criteria = (cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS,
... n_trees, eps)
... rtree.setTermCriteria(criteria)
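The criteria tuple tells OpenCV to stop growing the forest as soon as either condition is met: the ensemble reaches `n_trees` trees, or the out-of-bag (OOB) error drops below `eps`. The same stopping logic can be sketched with scikit-learn's `warm_start` mode, which lets us add trees one at a time (a conceptual illustration on synthetic data, not OpenCV's internal code):

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small forests produce "too few trees for OOB" warnings; silence them here
warnings.simplefilter("ignore")

X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=4, random_state=21)

n_trees, eps = 50, 0.01
forest = RandomForestClassifier(n_estimators=0, warm_start=True,
                                oob_score=True, random_state=21)
for n in range(1, n_trees + 1):
    forest.set_params(n_estimators=n)
    forest.fit(X, y)  # warm_start: only the new tree is trained
    # Stop early if the out-of-bag error falls below eps
    if n >= 15 and 1.0 - forest.oob_score_ < eps:
        break
print(n, 1.0 - forest.oob_score_)
```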
Because we have a large number of categories (that is, 40), we want to make sure the random forest is set up to handle them accordingly:
In [10]: rtree.setMaxCategories(len(np.unique(y)))
We can play with other optional arguments, such as the number of data points required in a node before it can be split:
In [11]: rtree.setMinSampleCount(2)
However, we might not want to limit the depth of each tree. This is another parameter we will want to experiment with eventually, but for now, let's set it to a large integer value, making the depth effectively unconstrained:
In [12]: rtree.setMaxDepth(1000)
Then, we can fit the classifier to the training data:
In [13]: rtree.train(X_train, cv2.ml.ROW_SAMPLE, y_train);
We can check the resulting depth of the tree using the following function:
In [13]: rtree.getMaxDepth()
Out[13]: 25
This means that although we allowed the tree to go up to depth 1,000, in the end, only 25 layers were needed.
The evaluation of the classifier is done once again by predicting the labels first (y_hat) and then passing them to the accuracy_score function:
In [14]: _, y_hat = rtree.predict(X_test)
In [15]: from sklearn.metrics import accuracy_score
... accuracy_score(y_test, y_hat)
Out[15]: 0.87
We find 87% accuracy, which turns out to be much better than with a single decision tree:
In [16]: from sklearn.tree import DecisionTreeClassifier
... tree = DecisionTreeClassifier(random_state=21, max_depth=25)
... tree.fit(X_train, y_train)
... tree.score(X_test, y_test)
Out[16]: 0.46999999999999997
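The same forest-versus-tree comparison can be reproduced entirely in scikit-learn. A hedged sketch on synthetic data (the dataset parameters below are assumptions for illustration, not the face data used above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=64, n_informative=16,
                           n_classes=8, random_state=21)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21)

# A single decision tree versus a 50-tree forest on the same split
tree = DecisionTreeClassifier(random_state=21).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=50,
                                random_state=21).fit(X_train, y_train)

tree_acc = tree.score(X_test, y_test)
forest_acc = forest.score(X_test, y_test)
print(tree_acc, forest_acc)
```

Averaging many decorrelated trees reduces variance, which is why the ensemble typically outperforms any single tree on held-out data.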
Not bad! We can play with the optional parameters to see whether we get better results. The most important one seems to be the number of trees in the forest. We can repeat the experiment with a forest made from 1,000 trees instead of 50 trees:
In [18]: num_trees = 1000
... eps = 0.01
... criteria = (cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS,
... num_trees, eps)
... rtree.setTermCriteria(criteria)
... rtree.train(X_train, cv2.ml.ROW_SAMPLE, y_train);
... _, y_hat = rtree.predict(X_test)
... accuracy_score(y_test, y_hat)
Out[18]: 0.94
With this configuration, we get 94% accuracy!
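To see how accuracy depends on the forest size more systematically, we can sweep the number of trees. Again a sketch with scikit-learn on synthetic data (the dataset and the sweep values are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=64, n_informative=16,
                           n_classes=8, random_state=21)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=21)

# Test accuracy as a function of the number of trees
accs = {}
for n in (1, 10, 50, 200):
    forest = RandomForestClassifier(n_estimators=n, random_state=21)
    forest.fit(X_train, y_train)
    accs[n] = forest.score(X_test, y_test)
print(accs)
```

Accuracy typically rises steeply for the first few dozen trees and then plateaus, so the gain from 50 to 1,000 trees is usually much smaller than the gain from 1 to 50.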
Another interesting use case of decision tree ensembles is AdaBoost.