Implementing our first random forest

In OpenCV, random forests can be built using the RTrees_create function from the ml module:

In [7]: import cv2
... import numpy as np
... rtree = cv2.ml.RTrees_create()

The tree object provides a number of options, the most important of which are the following (a short usage sketch follows the list):

  • setMaxDepth: This sets the maximum possible depth of each tree in the ensemble. The depth actually reached may be smaller if other termination criteria are met first.
  • setMinSampleCount: This sets the minimum number of samples that a node must contain in order to be split.
  • setMaxCategories: This sets the maximum number of categories allowed. Setting this to a value smaller than the actual number of classes in the data leads to subset estimation.
  • setTermCriteria: This sets the termination criteria of the algorithm. This is also where you set the number of trees in the forest.
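
For example, the first three options can be set directly on the tree object. The values below are illustrative choices for a small toy problem, not OpenCV's defaults:

rtree.setMaxDepth(10)        # no tree may grow deeper than 10 levels
rtree.setMinSampleCount(2)   # don't split a node holding fewer than 2 samples
rtree.setMaxCategories(15)   # consider at most 15 categories when splitting
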
Although we might have hoped for a setNumTrees method to set the number of trees in the forest (kind of the most important parameter of them all, no?), we instead need to rely on the setTermCriteria method. Confusingly, the number of trees is conflated with cv2.TERM_CRITERIA_MAX_ITER, which is usually reserved for the number of iterations that an algorithm is run for, not for the number of estimators in an ensemble.

We can specify the number of trees in the forest by including an integer, n_trees, in the criteria tuple passed to the setTermCriteria method. Here, we also want to tell the algorithm to quit once the score does not improve by at least eps from one iteration to the next:

In [8]: n_trees = 10
... eps = 0.01
... criteria = (cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS,
... n_trees, eps)
... rtree.setTermCriteria(criteria)

Then, we are ready to train the classifier on the data from the preceding code:

In [9]: rtree.train(X_train.astype(np.float32), cv2.ml.ROW_SAMPLE,
... y_train);

The test labels can be predicted with the predict method:

In [10]: _, y_hat = rtree.predict(X_test.astype(np.float32))
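
Note that OpenCV returns the predictions as an n x 1 column of 32-bit floats rather than a flat array. accuracy_score below copes with this shape, but if you prefer a flat integer array that mirrors y_test, a one-line conversion (a sketch using a hypothetical name, not part of the preceding code) does the job:

y_hat_flat = y_hat.ravel().astype(int)   # flatten the (n, 1) float column to (n,) ints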

Using scikit-learn's accuracy_score, we can evaluate the model on the test set:

In [11]: from sklearn.metrics import accuracy_score
... accuracy_score(y_test, y_hat)
Out[11]: 0.83999999999999997
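
If scikit-learn is not at hand, the same number can be computed with plain NumPy; a minimal equivalent sketch:

np.mean(y_test == y_hat.ravel())   # fraction of correctly predicted test labels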

After training, we can visualize the model's decisions by passing the classifier and the test data to the plot_decision_boundary function:

In [12]: plot_decision_boundary(rtree, X_test, y_test)

This will produce a plot showing the decision landscape of the random forest classifier on the test set.
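
The plot_decision_boundary helper is defined earlier in the book; in case it is not available in your session, a minimal sketch, assuming two-dimensional features and an OpenCV-style predict that returns a (retval, results) tuple, might look like this:

import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(model, X_test, y_test):
    # Build a dense grid spanning both feature dimensions
    x_min, x_max = X_test[:, 0].min() - 1, X_test[:, 0].max() + 1
    y_min, y_max = X_test[:, 1].min() - 1, X_test[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))

    # OpenCV's predict expects float32 rows and returns (retval, results)
    X_grid = np.c_[xx.ravel(), yy.ravel()].astype(np.float32)
    _, zz = model.predict(X_grid)
    zz = zz.reshape(xx.shape)

    # Color the landscape by predicted class and overlay the test points
    plt.contourf(xx, yy, zz, cmap=plt.cm.coolwarm, alpha=0.8)
    plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=100)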
