Model validation

Once the model has been built and evaluated, the next step is to validate it. For logistic regression models, and classification models in general, validation amounts to comparing the actual class with the predicted class. There are various ways to do this, but the most widely used tool is the Receiver Operating Characteristic (ROC) curve.

The ROC curve

An ROC curve is a graphical tool to understand the performance of a classification model. For a logistic regression model, a prediction can either be positive or negative. Also, this prediction can either be correct or incorrect.

There are four categories in which the predictions of a logistic regression model can fall:

Actual \ Predicted | Predicted positive  | Predicted negative
Actual positive    | True Positive (TP)  | False Negative (FN)
Actual negative    | False Positive (FP) | True Negative (TN)

True Positive (TP):

  • Correct positive prediction
  • Actually positive and the prediction is also positive

False Negative (FN):

  • Incorrect negative prediction
  • Actually positive and the prediction is negative

False Positive (FP):

  • Incorrect positive prediction
  • Actually negative and the prediction is positive

True Negative (TN):

  • Correct negative prediction
  • Actually negative and the prediction is also negative

So, True Positives are the observations that are actually positive and for which the model has also predicted a positive outcome, while True Negatives are actually negative and are predicted as negative. False Positives are false successes: they are actually failures (negatives), but the model predicts them as successes (positives). False Negatives are actually successes, but the model predicts them as failures.

Let us state some totals in terms of these categories:

  • The total number of actual positives = TP + FN
  • The total number of actual negatives = TN + FP
  • The total number of correct predictions = TP + TN
  • The total number of incorrect predictions = FP + FN

With these categories defined, we can now understand the two quantities that constitute an ROC curve. They are as follows:

Sensitivity (True Positive Rate): This is the proportion of the positive outcomes that are identified as such (as positives) by the model:

Sensitivity = TP/(TP+FN)

Specificity (True Negative Rate): This is the proportion of the negative outcomes that are identified as such (as negatives) by the model:

Specificity = TN/(TN+FP)

A high Sensitivity guards against False Negatives, while a high Specificity guards against False Positives. A perfect model would be 100% sensitive and also 100% specific.

An ROC curve is a plot of the True Positive Rate against the False Positive Rate, where False Positive Rate = FP/(TN+FP) = 1-Specificity.
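
As a quick numerical illustration, the three rates can be computed directly from the four cell counts. The counts below are made up purely for demonstration:

# Hypothetical cell counts of a confusion matrix (illustrative values only)
TP, FN = 87, 13                       # actual positives: correctly predicted and missed
TN, FP = 38, 62                       # actual negatives: correctly predicted and false alarms
sensitivity = TP / float(TP + FN)     # True Positive Rate = 0.87
specificity = TN / float(TN + FP)     # True Negative Rate = 0.38
fpr = FP / float(TN + FP)             # False Positive Rate = 1 - Specificity = 0.62
print(sensitivity, specificity, fpr)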

As we saw earlier, the numbers of positive and negative predictions change as we change the probability threshold used to classify a predicted probability as a positive or a negative outcome. Thus, the Sensitivity and Specificity change as well.

An ROC curve has the following important properties:

  • Any increase in Sensitivity will decrease the Specificity
  • The closer the curve is to the top and left borders of the plot, the better the model's predictions
  • The closer the curve is to the diagonal line, the worse the model's predictions
  • The larger the area under the curve, the better the prediction

The following are the steps in plotting an ROC curve:

  1. Define several probability thresholds and calculate Sensitivity and 1-Specificity for each threshold.
  2. Plot Sensitivity and 1-Specificity points obtained in this way.
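
A minimal sketch of these two steps, assuming an array of actual 0/1 labels called y_true and an array of predicted probabilities called y_prob (both hypothetical names standing in for your own data), could look like this:

import numpy as np
import matplotlib.pyplot as plt
# Step 1: compute Sensitivity and (1-Specificity) at several thresholds
thresholds = np.arange(0.0, 1.01, 0.05)
sens_list, fpr_list = [], []
for t in thresholds:
    y_pred = (y_prob >= t).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    sens_list.append(tp / float(tp + fn))    # Sensitivity
    fpr_list.append(fp / float(tn + fp))     # 1 - Specificity
# Step 2: plot the (1-Specificity, Sensitivity) pairs
plt.plot(fpr_list, sens_list, marker='o')
plt.xlabel('(1-Specificity)')
plt.ylabel('Sensitivity')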

Let us plot the ROC curve for the model we built earlier in this chapter by following the steps described above. Later, we will see how to do it using the built-in methods in scikit-learn.

This is the model that we ran and calculated the probabilities for each observation:

from sklearn.model_selection import train_test_split   # sklearn.cross_validation is deprecated
from sklearn import linear_model
from sklearn import metrics
# Hold out 30% of the data for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
# Fit the logistic regression model and compute class probabilities for the test set
clf1 = linear_model.LogisticRegression()
clf1.fit(X_train, Y_train)
probs = clf1.predict_proba(X_test)
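
Note that predict_proba returns one column per class; for a 0/1 target, the second column holds the probability of the positive outcome, and it is this column that we threshold in the next step. A quick peek confirms the shape:

print(probs.shape)   # (number of test observations, 2)
print(probs[:5])     # each row is [P(class 0), P(class 1)] for one observation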

Each probability value is then compared to a threshold probability and categorized as 1 (a positive outcome) if it is greater than or equal to the threshold, and as 0 otherwise. This can be done using the following code snippet (we have chosen a threshold probability of 0.05 in this case):

import numpy as np
import pandas as pd
prob = probs[:, 1]                                        # probability of the positive class
prob_df = pd.DataFrame(prob)
prob_df['predict'] = np.where(prob_df[0] >= 0.05, 1, 0)   # classify at a threshold of 0.05
prob_df['actual'] = Y_test                                # use Y_test.values if Y_test is a pandas Series
prob_df.head()

The resulting data frame looks as follows:

Fig. 6.26: Predicted and actual outcomes for the bank dataset

Confusion matrix

The numbers of correct and incorrect predictions can be summarized using what is called a confusion matrix. A confusion matrix is simply a tabular representation of the numbers of TPs, TNs, FPs, and FNs. Once we have a data frame in the preceding format, we can calculate the confusion matrix using the crosstab function, as follows:

confusion_matrix = pd.crosstab(prob_df['actual'], prob_df['predict'])
confusion_matrix

The confusion matrices obtained at two different threshold probabilities are as follows:

At p=0.05:

Fig. 6.27.1: Confusion matrix at p=0.05

At p=0.10:

Fig. 6.27.2: Confusion matrix at p=0.10
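
As an alternative sketch, the same counts can be obtained with scikit-learn's metrics.confusion_matrix (metrics was imported earlier), and the Sensitivity and Specificity at a given threshold can then be read off its cells, assuming the 0/1 label convention used here:

# Rows are actual classes and columns are predicted classes, in the order 0, 1
cm = metrics.confusion_matrix(prob_df['actual'], prob_df['predict'])
tn, fp, fn, tp = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]
sensitivity = tp / float(tp + fn)
specificity = tn / float(tn + fp)
print(sensitivity, 1 - specificity)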

The Sensitivity and Specificity can be calculated in the same manner at various other probability thresholds, and the Sensitivity can then be plotted against (1-Specificity). The values of Sensitivity and (1-Specificity), or FPR, at different threshold probabilities are summarized as follows:

Threshold p    Sensitivity    (1-Specificity)
0.04           0.95           0.76
0.05           0.87           0.62
0.07           0.67           0.27
0.10           0.62           0.23
0.12           0.59           0.17
0.20           0.50           0.12
0.25           0.41           0.07

As one can observe, as the threshold probability increases, both the Sensitivity and the FPR (1-Specificity) decrease. Now that we have the Sensitivity and Specificity at different threshold probabilities, we can use this data to plot our ROC curve. A diagonal (y=x) line is a good benchmark for an ROC curve. If the ROC curve lies above the diagonal line, the model is a better predictor than a random guess (which the diagonal line represents). An ROC curve lying below the diagonal line indicates that the model is a worse predictor than a random guess.

Let us plot our ROC curve and also the diagonal line and see whether the ROC curve lies above the diagonal line or below. This can be done using the following code snippet:

import matplotlib.pyplot as plt
%matplotlib inline
# (FPR, Sensitivity) pairs from the table above, ordered by decreasing FPR,
# with the end points (1, 1) and (0, 0) added
Sensitivity = [1, 0.95, 0.87, 0.67, 0.62, 0.59, 0.5, 0.41, 0]
FPR = [1, 0.76, 0.62, 0.27, 0.23, 0.17, 0.12, 0.07, 0]
plt.plot(FPR, Sensitivity, marker='o', linestyle='--', color='r')
# Diagonal benchmark line y = x
x = [i * 0.01 for i in range(101)]
y = [i * 0.01 for i in range(101)]
plt.plot(x, y)
plt.xlabel('(1-Specificity)')
plt.ylabel('Sensitivity')
plt.title('ROC Curve')

The ROC curve looks as follows:

Fig. 6.28: The ROC curve drawn without using the scikit-learn methods

The curve in red is the ROC curve, while the blue line is the benchmark diagonal line. The ROC curve lies above the diagonal line and, hence, the model is a better predictor than a random guess. However, many ROC curves can lie above the diagonal line. How do we determine whether one ROC curve is better than another? This is done by comparing the area enclosed under each ROC curve: the larger the area enclosed, the better the model. The area under the curve lies between 0 and 1, and the closer it is to 1, the better.
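
For the hand-drawn curve above, a rough estimate of this area can be obtained with NumPy's trapezoidal rule applied to the Sensitivity and FPR lists defined for the plot (the points must be sorted by increasing FPR before integrating):

import numpy as np
# Sort the (FPR, Sensitivity) pairs by increasing FPR before integrating
pairs = sorted(zip(FPR, Sensitivity))
fpr_sorted = [p[0] for p in pairs]
sens_sorted = [p[1] for p in pairs]
approx_auc = np.trapz(sens_sorted, fpr_sorted)
approx_auc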

Let us now see how we can draw an ROC curve and calculate the area under the curve using the methods built into scikit-learn. This can be done using the following code snippet. Make sure you install the ggplot package before running this snippet:

from sklearn import metrics
from ggplot import *

prob = clf1.predict_proba(X_test)[:, 1]                  # probability of the positive class
fpr, sensitivity, _ = metrics.roc_curve(Y_test, prob)    # FPR and TPR at every threshold

df = pd.DataFrame(dict(fpr=fpr, sensitivity=sensitivity))
# Wrapping the expression in parentheses lets it span multiple lines
(ggplot(df, aes(x='fpr', y='sensitivity')) +
    geom_line() +
    geom_abline(linetype='dashed'))

Fig. 6.29: The ROC curve drawn using the scikit-learn methods

The area under the curve can be found as follows:

auc = metrics.auc(fpr, sensitivity)   # area under the ROC curve
auc
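
The same value can also be obtained in one step from the actual labels and the predicted probabilities with metrics.roc_auc_score, which skips the intermediate roc_curve call and returns the same number:

metrics.roc_auc_score(Y_test, prob)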

The area under the curve comes out to be 0.76, which is pretty good. The area under the curve can be plotted using the following code snippet:

# Shade the area under the curve and show the AUC value in the title
(ggplot(df, aes(x='fpr', ymin=0, ymax='sensitivity')) +
    geom_area(alpha=0.2) +
    geom_line(aes(y='sensitivity')) +
    ggtitle("ROC Curve w/ AUC=%s" % str(auc)))

The plot looks as follows:

Fig. 6.30: The area under the ROC curve drawn using the scikit-learn methods
