Alternatively, we can implement random forests using scikit-learn:
In [13]: from sklearn.ensemble import RandomForestClassifier
... forest = RandomForestClassifier(n_estimators=10, random_state=200)
Here, we have a number of options to customize the ensemble:
- n_estimators: This specifies the number of trees in the forest.
- criterion: This specifies the node-splitting criterion. Setting criterion='gini' implements the Gini impurity, whereas setting criterion='entropy' implements information gain.
- max_features: This specifies the number (or fraction) of features to consider at each node split.
- max_depth: This specifies the maximum depth of each tree.
- min_samples_split: This specifies the minimum number of samples required to split an internal node.
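To see these options together, here is a minimal sketch of a customized ensemble; the specific values (entropy criterion, half the features, depth 5, and so on) are chosen purely for illustration and are not from the text:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical settings chosen to exercise each option listed above
forest = RandomForestClassifier(
    n_estimators=10,        # 10 trees in the forest
    criterion='entropy',    # split nodes by information gain
    max_features=0.5,       # consider half of the features at each split
    max_depth=5,            # cap each tree at depth 5
    min_samples_split=4,    # need at least 4 samples to split a node
    random_state=200,
)
print(forest.get_params()['criterion'])
```

All of these settings can also be inspected or changed later via `get_params` and `set_params`, as with any scikit-learn estimator.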
We can then fit the random forest to the data and score it like any other estimator:
In [14]: forest.fit(X_train, y_train)
... forest.score(X_test, y_test)
Out[14]: 0.84
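The splits `X_train`, `X_test`, `y_train`, and `y_test` come from earlier in the chapter. As a self-contained stand-in, the same fit-and-score workflow can be sketched on synthetic data (the `make_moons` dataset here is an assumption for illustration, not the chapter's data):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for the chapter's dataset
X, y = make_moons(n_samples=200, noise=0.25, random_state=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

forest = RandomForestClassifier(n_estimators=10, random_state=200)
forest.fit(X_train, y_train)
score = forest.score(X_test, y_test)  # mean accuracy on the test set
print(score)
```

`score` returns the mean accuracy on the held-out data, which is what `Out[14]` reports above.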
This gives roughly the same result as in OpenCV. We can use our helper function to plot the decision boundary:
In [15]: plot_decision_boundary(forest, X_test, y_test)
The resulting plot shows the decision boundary that the random forest learned for the test set.
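The helper function `plot_decision_boundary` is defined earlier in the book; a minimal sketch of how such a helper might work is shown below. It classifies every point on a dense grid covering the test data and draws the predictions as filled contours (the grid step, colormap, and return value are assumptions for this sketch):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen so the sketch runs headless
import matplotlib.pyplot as plt

def plot_decision_boundary(classifier, X_test, y_test):
    # Build a dense grid spanning the test data, classify every grid
    # point, and draw the predicted labels as filled contours.
    x_min, x_max = X_test[:, 0].min() - 1, X_test[:, 0].max() + 1
    y_min, y_max = X_test[:, 1].min() - 1, X_test[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    zz = classifier.predict(np.c_[xx.ravel(), yy.ravel()])
    zz = zz.reshape(xx.shape)
    plt.contourf(xx, yy, zz, cmap=plt.cm.coolwarm, alpha=0.8)
    plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, s=30)
    return zz  # returned so the grid predictions can be inspected
```

Because the forest averages many axis-aligned trees, the boundary it draws is piecewise rectangular but typically smoother than that of a single decision tree.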