Finally, we will employ a random forest ensemble. Once again, we use validation curves to determine the optimal ensemble size. From the following graph, we conclude that 50 trees yield the lowest variance in our model, so we proceed with an ensemble size of 50:
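The validation-curve search described above can be sketched as follows. This is a minimal illustration, not the book's code: a synthetic imbalanced dataset stands in for creditcard.csv, and the candidate sizes are assumptions.

```python
# Sketch: choosing n_estimators via a validation curve.
# A synthetic imbalanced dataset stands in for creditcard.csv.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

x, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=123456)

sizes = [10, 25, 50, 100]
train_scores, test_scores = validation_curve(
    RandomForestClassifier(n_jobs=4), x, y,
    param_name='n_estimators', param_range=sizes,
    cv=3, scoring='f1')

# The size with the smallest cross-validation variance is a
# reasonable pick; the chapter's curves point to 50 trees.
best = sizes[int(np.argmin(test_scores.std(axis=1)))]
print('chosen ensemble size:', best)
```

Plotting the mean and standard deviation of `test_scores` per size reproduces the kind of curve shown in the graph.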
The following code loads the required libraries and data, creates the train and test splits, and trains and evaluates the ensemble on both the original and the filtered datasets. We first load the libraries and data and create the splits:
# --- SECTION 1 ---
# Libraries and data loading
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn import metrics
np.random.seed(123456)
data = pd.read_csv('creditcard.csv')
data.Time = (data.Time-data.Time.min())/data.Time.std()
data.Amount = (data.Amount-data.Amount.mean())/data.Amount.std()
# Train-test split of 70%-30%
x_train, x_test, y_train, y_test = train_test_split(
data.drop('Class', axis=1).values, data.Class.values, test_size=0.3)
We then train and evaluate the ensemble, both on the original dataset, as well as on the filtered dataset:
# --- SECTION 2 ---
# Ensemble evaluation
ensemble = RandomForestClassifier(n_estimators=50, n_jobs=4)
ensemble.fit(x_train, y_train)
print('RF f1', metrics.f1_score(y_test, ensemble.predict(x_test)))
print('RF recall', metrics.recall_score(y_test, ensemble.predict(x_test)))
# --- SECTION 3 ---
# Filter features according to their correlation to the target
np.random.seed(123456)
threshold = 0.1
correlations = data.corr()['Class'].drop('Class')
fs = list(correlations[(abs(correlations)>threshold)].index.values)
fs.append('Class')
data = data[fs]
x_train, x_test, y_train, y_test = train_test_split(
data.drop('Class', axis=1).values, data.Class.values, test_size=0.3)
ensemble = RandomForestClassifier(n_estimators=50, n_jobs=4)
ensemble.fit(x_train, y_train)
print('RF f1', metrics.f1_score(y_test, ensemble.predict(x_test)))
print('RF recall', metrics.recall_score(y_test, ensemble.predict(x_test)))
Dataset  | Metric | Value
Original | F1     | 0.845
Original | Recall | 0.743
Filtered | F1     | 0.867
Filtered | Recall | 0.794
As our dataset is highly skewed, we can speculate that changing the criterion for a tree's split to entropy would benefit our model. Indeed, by specifying criterion='entropy' in the constructor (ensemble = RandomForestClassifier(criterion='entropy', n_jobs=4)), we are able to increase the performance on the original dataset to an F1 score of 0.859 and a recall score of 0.787, two of the highest scores for the original dataset:
Dataset  | Metric | Value
Original | F1     | 0.859
Original | Recall | 0.787
Filtered | F1     | 0.856
Filtered | Recall | 0.787
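The entropy-criterion variant discussed above can be sketched end to end as follows. Again, this is a hedged illustration: a synthetic imbalanced dataset stands in for creditcard.csv, so the printed scores will not match the table.

```python
# Sketch: random forest with the entropy split criterion, on a
# synthetic imbalanced dataset in place of creditcard.csv.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

np.random.seed(123456)
x, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=123456)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# Entropy (information gain) instead of the default Gini impurity
ensemble = RandomForestClassifier(n_estimators=50,
                                  criterion='entropy', n_jobs=4)
ensemble.fit(x_train, y_train)
preds = ensemble.predict(x_test)
print('RF f1', metrics.f1_score(y_test, preds))
print('RF recall', metrics.recall_score(y_test, preds))
```

The only change relative to the earlier code is the `criterion` argument; the rest of the pipeline is identical.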