Logistic regression classifier – skewed data

Having tested our previous approach, it is interesting to apply the same process to the skewed data. Our intuition is that the skewness will introduce issues that are difficult to capture and will therefore yield a less effective model.
To be fair, taking into account the fact that the training and test datasets are substantially bigger than the under-sampled ones, K-fold cross-validation is necessary. Alternatively, we could split the data three ways: 60% for the training set, 20% for cross-validation, and 20% for the test set. However, let's take the same approach as before; there is no harm in this, it is just that K-fold is computationally more expensive.
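For reference, here is a minimal sketch of how such a 60/20/20 split could be produced with scikit-learn's train_test_split. This helper is not used in the rest of the section; X and y are assumed to hold the features and Class labels loaded earlier, and stratify keeps the rare fraud class present in every split:

from sklearn.model_selection import train_test_split

# Hold out 40% of the data first, then split that 40% in half,
# giving 60% train / 20% cross-validation / 20% test overall.
X_train_alt, X_rest, y_train_alt, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_cv, X_test_alt, y_cv, y_test_alt = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

Sticking with the grid search approach, we run the same helper as before: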

best_c = print_gridsearch_scores(X_train,y_train)

Best parameters set found on development set:

{'C': 10}
Grid scores on development set:
0.591 (+/-0.121) for {'C': 0.01}
0.594 (+/-0.076) for {'C': 0.1}
0.612 (+/-0.106) for {'C': 1}
0.620 (+/-0.122) for {'C': 10}
0.620 (+/-0.122) for {'C': 100}

Use the preceding parameter to build the final model on the whole training dataset and predict the classes in the test set, as follows:

# Build the final model on the whole (skewed) training set using the best C found above
# (the L1 penalty needs the liblinear solver in recent scikit-learn versions)
lr = LogisticRegression(C = best_c, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train, y_train.values.ravel())
y_pred = lr.predict(X_test.values)

# Compute the confusion matrix on the full test set
cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset: ", cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1]))

# Plot non-normalized confusion matrix
class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()

Here is the output for the confusion matrix:

Before continuing, we need to look at the classification threshold. We have seen that, by under-sampling the data, our algorithm does a much better job of detecting fraud. We can also tweak the final classification by changing this threshold. So far we have used the predict() method to decide whether a record belongs to class 1 or class 0. There is another method, predict_proba(), which returns the estimated probability of each class. The idea is that by changing the probability threshold above which a record is assigned to class 1, we can trade precision against recall. Let's check this using the under-sampled data (C_param = 0.01):

lr = LogisticRegression(C = 0.01, penalty = 'l1', solver = 'liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
# Probability of each class for every record in the under-sampled test set
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
plt.figure(figsize=(10,10))
j = 1
for i in thresholds:
    # Assign class 1 whenever its predicted probability exceeds the threshold
    y_test_predictions_high_recall = y_pred_undersample_proba[:,1] > i

    plt.subplot(3, 3, j)
    j += 1

    # Compute confusion matrix
    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    print("Recall metric in the testing dataset for threshold {}: {}".format(i, cnf_matrix[1,1]/(cnf_matrix[1,0]+cnf_matrix[1,1])))

    # Plot non-normalized confusion matrix
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title='Threshold >= %s' % i)

Here is the output:
Recall metric in the testing dataset for threshold 0.1: 1.0
Recall metric in the testing dataset for threshold 0.2: 1.0
Recall metric in the testing dataset for threshold 0.3: 1.0
Recall metric in the testing dataset for threshold 0.4: 0.979591836735
Recall metric in the testing dataset for threshold 0.5: 0.925170068027
Recall metric in the testing dataset for threshold 0.6: 0.857142857143
Recall metric in the testing dataset for threshold 0.7: 0.829931972789
Recall metric in the testing dataset for threshold 0.8: 0.741496598639
Recall metric in the testing dataset for threshold 0.9: 0.585034013605
...

The pattern is very clear: the more you lower the probability required to put a record in the class 1 category, the more records end up in that bucket.

This implies an increase in recall (we want all the 1s), but at the same time, a decrease in precision (we misclassify many records from the other class).

Therefore, even though recall is our goal metric (do not miss a fraudulent transaction), we also want the model to remain reasonably accurate as a whole:

  • There is an interesting option to tackle this: we could assign a cost to each type of misclassification. Since we are interested in classifying the 1s correctly, the cost of misclassifying a 1 should be larger than the cost of misclassifying a 0; the algorithm would then select the threshold that minimizes the total cost. A drawback is that we have to pick the weight of each cost manually (see the first sketch after this list).
  • Going back to changing the threshold, another option is the precision-recall curve. By visually inspecting the model's performance as a function of the chosen threshold, we can look for a sweet spot where recall is high enough while precision remains acceptable (see the second sketch after this list).
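As an illustration of the cost-based idea, here is a minimal sketch using scikit-learn's class_weight parameter, which penalizes errors on class 1 more heavily than errors on class 0 during fitting. Note that this reweights the training loss rather than selecting a threshold, but it expresses the same intent; the 1:10 weighting and the reuse of the under-sampled variables are assumptions for illustration, not tuned values:

# Penalize misclassified frauds (class 1) ten times more heavily than
# misclassified normal transactions (class 0); the weights are illustrative only.
lr_weighted = LogisticRegression(C = 0.01, penalty = 'l1', solver = 'liblinear',
                                 class_weight = {0: 1, 1: 10})
lr_weighted.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_weighted = lr_weighted.predict(X_test_undersample.values)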
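For the second option, here is a minimal sketch of the precision-recall curve, assuming the lr model fitted on the under-sampled data in the threshold experiment above; sklearn.metrics.precision_recall_curve returns precision and recall for every candidate threshold, which we can plot to look for the sweet spot:

from sklearn.metrics import precision_recall_curve

# Predicted probabilities of class 1 on the under-sampled test set
y_scores = lr.predict_proba(X_test_undersample.values)[:, 1]
precision, recall, pr_thresholds = precision_recall_curve(
    y_test_undersample.values.ravel(), y_scores)

plt.figure(figsize=(6, 6))
plt.plot(recall, precision, color='b')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve')
plt.show()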