Boosting

As we move on, we will start to utilize generative methods. The first generative method we will experiment with is boosting, and we will first try to classify the dataset using AdaBoost. As AdaBoost re-weights (and, in some implementations, resamples) the dataset based on misclassifications, we expect it to handle our imbalanced dataset relatively well.

First, we must decide on the ensemble's size. We generate validation curves over a range of ensemble sizes, depicted as follows:

Validation curves of various ensemble sizes for AdaBoost

As we can observe, 70 base learners provide the best trade-off between bias and variance. As such, we will proceed with ensembles of size 70.
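The following is a minimal sketch of how such curves can be generated with scikit-learn's validation_curve; the exact range of ensemble sizes swept here is an assumption, as is the use of F1 as the scoring metric:

# Sketch: validation curves over n_estimators for AdaBoost
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import validation_curve

data = pd.read_csv('creditcard.csv')
x, y = data.drop('Class', axis=1).values, data.Class.values

# Assumed sweep of ensemble sizes around the chosen value of 70
param_range = [10, 30, 50, 70, 90, 110]
train_scores, test_scores = validation_curve(
    AdaBoostClassifier(), x, y,
    param_name='n_estimators', param_range=param_range,
    cv=5, scoring='f1')
for size, tr, te in zip(param_range, train_scores.mean(axis=1),
                        test_scores.mean(axis=1)):
    print('n_estimators=%3d, train f1=%.3f, cv f1=%.3f' % (size, tr, te))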

The following code handles the library imports and data loading for AdaBoost:

# --- SECTION 1 ---
# Libraries and data loading
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

np.random.seed(123456)
data = pd.read_csv('creditcard.csv')
# Shift Time to start at zero and scale it; standardize Amount
data.Time = (data.Time-data.Time.min())/data.Time.std()
data.Amount = (data.Amount-data.Amount.mean())/data.Amount.std()
# Train-test split of 70%-30%
x_train, x_test, y_train, y_test = train_test_split(
    data.drop('Class', axis=1).values, data.Class.values, test_size=0.3)

We then train and evaluate our ensemble, using 70 estimators and a learning rate of 1.0:

# --- SECTION 2 ---
# Ensemble evaluation
ensemble = AdaBoostClassifier(n_estimators=70, learning_rate=1.0)
ensemble.fit(x_train, y_train)
predictions = ensemble.predict(x_test)
print('AdaBoost f1', metrics.f1_score(y_test, predictions))
print('AdaBoost recall', metrics.recall_score(y_test, predictions))

We then reduce the number of features by keeping only those whose absolute correlation with the target exceeds a threshold. Finally, we repeat the training and evaluation procedure:

# --- SECTION 3 ---
# Filter features according to their correlation to the target
np.random.seed(123456)
threshold = 0.1
# Keep features whose absolute correlation with Class exceeds the threshold
correlations = data.corr()['Class'].drop('Class')
fs = list(correlations[(abs(correlations) > threshold)].index.values)
fs.append('Class')
data = data[fs]
x_train, x_test, y_train, y_test = train_test_split(
    data.drop('Class', axis=1).values, data.Class.values, test_size=0.3)
ensemble = AdaBoostClassifier(n_estimators=70, learning_rate=1.0)
ensemble.fit(x_train, y_train)
predictions = ensemble.predict(x_test)
print('AdaBoost f1', metrics.f1_score(y_test, predictions))
print('AdaBoost recall', metrics.recall_score(y_test, predictions))

The results are depicted in the following table. As is evident, AdaBoost does not perform as well as our previous models:

Dataset    Metric    Value
Original   F1        0.778
           Recall    0.721
Filtered   F1        0.794
           Recall    0.721

Performance of AdaBoost

We can try to increase the learning rate to 1.3, which seems to improve overall performance. If we further increase it to 1.4, we notice a drop in performance. If, at this learning rate, we also increase the number of base learners to 80, we notice an increase in performance for the filtered dataset, while the original dataset seems to trade recall for F1 performance. The results are listed in the following tables (a sketch that reproduces this sweep follows them):

Dataset    Metric    Value
Original   F1        0.788
           Recall    0.765
Filtered   F1        0.815
           Recall    0.743

Performance of AdaBoost, learning_rate=1.3

Dataset    Metric    Value
Original   F1        0.800
           Recall    0.765
Filtered   F1        0.800
           Recall    0.735

Performance of AdaBoost, learning_rate=1.4

Dataset    Metric    Value
Original   F1        0.805
           Recall    0.757
Filtered   F1        0.805
           Recall    0.743

Performance of AdaBoost, learning_rate=1.4, ensemble_size=80
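
These experiments amount to a small manual sweep over learning_rate and n_estimators. A minimal sketch of how the sweep could be automated is shown below; it assumes the train and test arrays from SECTION 1, and the reporting format is our own:

# Sketch: sweep learning_rate and ensemble size as in the tables above
# (assumes x_train, x_test, y_train, y_test and metrics from SECTION 1/2)
for lr in [1.0, 1.3, 1.4]:
    for size in [70, 80]:
        ensemble = AdaBoostClassifier(n_estimators=size, learning_rate=lr)
        ensemble.fit(x_train, y_train)
        predictions = ensemble.predict(x_test)
        print('lr=%.1f, size=%d, f1=%.3f, recall=%.3f' % (
              lr, size,
              metrics.f1_score(y_test, predictions),
              metrics.recall_score(y_test, predictions)))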

We can, in fact, observe a Pareto front of F1 and recall, which is directly linked to the learning rate and the number of base learners in the ensemble. The front is depicted in the following graph:

Pareto front of F1 and Recall for AdaBoost
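
A configuration lies on this front if no other configuration achieves both higher F1 and higher recall. As an illustration, a small helper for extracting the non-dominated points from recorded (F1, recall) pairs might look as follows; the sample points are the original-dataset entries from the tables above:

# Sketch: extract the Pareto front from recorded (f1, recall) pairs
def pareto_front(points):
    # Keep a point only if no other point matches or beats it on both metrics
    front = []
    for f1, rec in points:
        dominated = any(o_f1 >= f1 and o_rec >= rec and (o_f1, o_rec) != (f1, rec)
                        for o_f1, o_rec in points)
        if not dominated:
            front.append((f1, rec))
    return front

# Original-dataset (F1, recall) entries from the tables above
results = [(0.778, 0.721), (0.788, 0.765), (0.800, 0.765), (0.805, 0.757)]
print(pareto_front(results))  # [(0.800, 0.765), (0.805, 0.757)]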