Testing the base learners

To gauge how well the base learners perform on their own, we first benchmark each of them individually. We load the libraries and the dataset (using pandas to import the CSV), rescale the Time and Amount columns, and split the data into a 70% train set and a 30% test set. Our goal is to train and evaluate each individual base learner before we train and evaluate the ensemble as a whole:

# --- SECTION 1 ---
# Libraries and data loading
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import metrics

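# Seed NumPy's global generator so that the split is reproducible
np.random.seed(123456)
data = pd.read_csv('creditcard.csv')
# Rescale Time (shifted by its minimum) and Amount (shifted by its mean)
# by their standard deviations
data.Time = (data.Time-data.Time.min())/data.Time.std()
data.Amount = (data.Amount-data.Amount.mean())/data.Amount.std()

# Train-test split of 70%-30%
x_train, x_test, y_train, y_test = train_test_split(
    data.drop('Class', axis=1).values, data.Class.values, test_size=0.3)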

After loading the libraries and data, we train each classifier and print the required metrics from the sklearn.metrics package. The F1 score is computed by the f1_score function and recall by the recall_score function. The decision tree is restricted to a maximum depth of three (max_depth=3) in order to avoid overfitting:

# --- SECTION 2 ---
# Base learners evaluation
base_classifiers = [('DT', DecisionTreeClassifier(max_depth=3)),
                    ('NB', GaussianNB()),
                    ('LR', LogisticRegression())]

for name, clf in base_classifiers:
    # Fit on the train set, then score on the held-out test set
    clf.fit(x_train, y_train)

    predictions = clf.predict(x_test)
    print(name+' f1', metrics.f1_score(y_test, predictions))
    print(name+' recall', metrics.recall_score(y_test, predictions))

The results are depicted in the following table. As is evident, the decision tree outperforms the other two learners in terms of F1 score. Naive Bayes has a higher recall score, but its F1 score is considerably worse than that of the decision tree:

Learner               Metric   Value
-------------------   ------   -----
Decision Tree         F1       0.770
Decision Tree         Recall   0.713
Naive Bayes           F1       0.107
Naive Bayes           Recall   0.824
Logistic Regression   F1       0.751
Logistic Regression   Recall   0.632
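The gap between a learner's recall and its F1 score says something about its precision, which the table does not report. Because F1 = 2PR / (P + R), the precision P can be recovered from the reported F1 and recall R as P = F1 * R / (2R - F1). The following is a minimal sketch of that arithmetic; the helper name implied_precision is hypothetical, and because the table values are rounded to three decimals, the results are only approximate:

```python
def implied_precision(f1, recall):
    """Solve F1 = 2PR / (P + R) for the precision P (hypothetical helper)."""
    return f1 * recall / (2 * recall - f1)


# (F1, recall) pairs copied from the table above (rounded values)
reported = {
    'Decision Tree': (0.770, 0.713),
    'Naive Bayes': (0.107, 0.824),
    'Logistic Regression': (0.751, 0.632),
}

for name, (f1, recall) in reported.items():
    precision = implied_precision(f1, recall)
    print(name, 'implied precision', round(precision, 3))
```

Under these rounded figures, the implied precision of Naive Bayes is roughly 0.06, against roughly 0.84 for the decision tree. In other words, Naive Bayes reaches its high recall by flagging many legitimate transactions as fraudulent, which is what drags its F1 score down.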

 

We can also experiment with the number of features present in the dataset. By examining each feature's correlation to the target, we can filter out the features that are only weakly correlated with it. The following table depicts each feature's correlation to the target:

Correlation between each variable and the target

By filtering out any feature whose absolute correlation to the target is lower than 0.1, we hope that the base learners will be able to better detect the fraudulent transactions, as the dataset's noise will be reduced.

To test this hypothesis, we repeat the experiment, but remove any columns from the DataFrame where the absolute correlation is lower than 0.1, as indicated by fs = list(correlations[(abs(correlations)>threshold)].index.values).

Here, fs holds all column names with an absolute correlation greater than the indicated threshold:

# --- SECTION 3 ---
# Filter features according to their correlation to the target
np.random.seed(123456)
threshold = 0.1

correlations = data.corr()['Class'].drop('Class')
fs = list(correlations[(abs(correlations)>threshold)].index.values)
fs.append('Class')
data = data[fs]

x_train, x_test, y_train, y_test = train_test_split(
    data.drop('Class', axis=1).values, data.Class.values, test_size=0.3)

for name, clf in base_classifiers:
    # Refit each base learner on the reduced feature set
    clf.fit(x_train, y_train)

    predictions = clf.predict(x_test)
    print(name+' f1', metrics.f1_score(y_test, predictions))
    print(name+' recall', metrics.recall_score(y_test, predictions))
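The filtering step itself can be checked in isolation on a small synthetic DataFrame. The sketch below is not part of the original experiment: the column names strong, inverse, and noise and the data behind them are made up for illustration, and they are built deterministically so that the first two are clearly correlated with Class (one positively, one negatively) while the third is uncorrelated with it:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: an alternating binary target and a pattern
# that is constructed to be uncorrelated with it
target = np.tile([0, 1], 500)
pattern = np.tile([1.0, 1.0, -1.0, -1.0], 250)

toy = pd.DataFrame({
    'strong': target + 0.3 * pattern,    # positively correlated with Class
    'inverse': -target + 0.5 * pattern,  # negatively correlated with Class
    'noise': pattern,                    # uncorrelated with Class
    'Class': target,
})

threshold = 0.1

# Same filtering expression as in Section 3, applied to the toy data
correlations = toy.corr()['Class'].drop('Class')
kept = list(correlations[(abs(correlations) > threshold)].index.values)

print(correlations)
print(kept)
```

Taking the absolute value matters here: the inverse column has a strongly negative correlation and is kept, whereas a plain correlations > threshold comparison would discard it. On the real dataset, the same expression is what Section 3 applies before the base learners are retrained.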

Again, we present the results in the following table. As we can see, the decision tree has increased its F1 score (from 0.770 to 0.785), while its recall has dropped (from 0.713 to 0.699). Naive Bayes has improved on both metrics, while the logistic regression model has become worse on both, with its F1 score falling from 0.751 to 0.735 and its recall from 0.632 to 0.610:

Learner               Metric   Value
-------------------   ------   -----
Decision Tree         F1       0.785
Decision Tree         Recall   0.699
Naive Bayes           F1       0.208
Naive Bayes           Recall   0.846
Logistic Regression   F1       0.735
Logistic Regression   Recall   0.610

Performance metrics for the three base learners for the filtered dataset