Random forests for classification

Random Forest classification is implemented in the RandomForestClassifier class of the sklearn.ensemble package. It exposes a number of parameters, such as the ensemble's size, the maximum tree depth, the minimum number of samples required to split an internal node or to form a leaf, and many more.
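As a minimal sketch (not part of this chapter's listings), these parameters map onto the constructor arguments shown below; the values are scikit-learn's defaults and are illustrative only:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100,     # ensemble size (number of trees)
                             max_depth=None,       # grow each tree fully
                             min_samples_split=2,  # samples needed to split an internal node
                             min_samples_leaf=1)   # samples needed to form a leaf node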

In this example, we will classify the hand-written digits dataset using a Random Forest classification ensemble. As usual, we load the required classes and data, and set the seed for our random number generator:

# --- SECTION 1 ---
# Libraries and data loading
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import numpy as np

digits = load_digits()

train_size = 1500
train_x, train_y = digits.data[:train_size], digits.target[:train_size]
test_x, test_y = digits.data[train_size:], digits.target[train_size:]

np.random.seed(123456)

Following this, we create the ensemble by setting the n_estimators and n_jobs parameters. These dictate the number of trees that will be generated and the number of parallel jobs that will run. We train the ensemble using the fit method and evaluate it on the test set by measuring its achieved accuracy:

# --- SECTION 2 ---
# Create the ensemble
ensemble_size = 500
ensemble = RandomForestClassifier(n_estimators=ensemble_size, n_jobs=4)

# --- SECTION 3 ---
# Train the ensemble
ensemble.fit(train_x, train_y)

# --- SECTION 4 ---
# Evaluate the ensemble
ensemble_predictions = ensemble.predict(test_x)

ensemble_acc = metrics.accuracy_score(test_y, ensemble_predictions)

# --- SECTION 5 ---
# Print the accuracy
print('Random Forest: %.2f' % ensemble_acc)

The classifier achieves an accuracy of 93%, which is even higher than the previously best-performing method, XGBoost (Chapter 6, Boosting). We can visualize the approximation of the error limit we mentioned earlier by plotting validation curves (from Chapter 2, Getting Started with Ensemble Learning) for a number of ensemble sizes. We test sizes of 10, 50, 100, 150, 200, 250, 300, 350, and 400 trees. The curves are depicted in the following graph. We can see that the ensemble approaches a 10-fold cross-validation accuracy of 96%:

Validation curves for a number of ensemble sizes
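A plot along these lines can be produced with scikit-learn's validation_curve function. The following is a sketch, assuming the same 1,500-sample training split used in the earlier listing:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

digits = load_digits()
train_x, train_y = digits.data[:1500], digits.target[:1500]

# Ensemble sizes to test
param_range = [10, 50, 100, 150, 200, 250, 300, 350, 400]

# 10-fold cross-validated accuracy for each ensemble size
train_scores, test_scores = validation_curve(
    RandomForestClassifier(n_jobs=4), train_x, train_y,
    param_name='n_estimators', param_range=param_range,
    cv=10, scoring='accuracy')

# Plot the mean accuracy across folds for each ensemble size
plt.plot(param_range, np.mean(train_scores, axis=1), label='Train')
plt.plot(param_range, np.mean(test_scores, axis=1), label='10-fold CV')
plt.xlabel('Ensemble size (n_estimators)')
plt.ylabel('Accuracy')
plt.legend()
plt.show()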