Evaluating classification models

Now that we have fit a classification model, we can examine its accuracy on the test set. One common tool for this kind of analysis is the Receiver Operating Characteristic (ROC) curve. To draw an ROC curve, we select a particular cutoff for the classifier (here, a value between 0 and 1 above which we consider a data point to be classified as positive, or 1) and ask what fraction of 1s are correctly classified at this cutoff (the true positive rate) and, concurrently, what fraction of 0s are incorrectly predicted to be positive (the false positive rate). Mathematically, this is represented by choosing a threshold and computing four values:

TP = true positives = # of class 1 points above the threshold
FP = false positives = # of class 0 points above the threshold
TN = true negatives = # of class 0 points below the threshold
FN = false negatives = # of class 1 points below the threshold

The true positive rate (TPR) plotted by the ROC is then TP/(TP+FN), while the false positive rate (FPR) is FP/(FP+TN).
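As a concrete illustration (using small hypothetical score and label arrays rather than our census data), the following sketch computes these four counts and the two rates at a single threshold:

>>> import numpy as np
>>> scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2])  # hypothetical predicted probabilities
>>> labels = np.array([0, 0, 1, 1, 1, 0])                # hypothetical true classes
>>> threshold = 0.5
>>> predicted_positive = scores > threshold
>>> TP = np.sum(predicted_positive & (labels == 1))   # class 1 points above the threshold
>>> FP = np.sum(predicted_positive & (labels == 0))   # class 0 points above the threshold
>>> TN = np.sum(~predicted_positive & (labels == 0))  # class 0 points below the threshold
>>> FN = np.sum(~predicted_positive & (labels == 1))  # class 1 points below the threshold
>>> TPR, FPR = TP/float(TP + FN), FP/float(FP + TN)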

If both rates are equal, the classifier is no better than random: at whatever cutoff we choose, a prediction of class 1 by the model is equally likely regardless of whether the point is actually positive or negative. Thus, a diagonal line from the lower left to the upper right corner represents the performance of a classifier that assigns labels to data points at random, since the true positive and false positive rates are always equal. Conversely, if the classifier performs better than random, the true positive rate rises much faster than the false positive rate, as correctly classified points are enriched above the threshold. Integrating the area under the curve (AUC) of the ROC curve, which has a maximum of 1, is a common way to report the accuracy of classifier methods. To find the best threshold to use for classification, we find the point on this curve where the ratio between the true positive and false positive rates is maximal.
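As a sketch of this threshold search in code, continuing with the hypothetical scores and labels arrays above, scikit-learn's roc_curve returns the two rates at every candidate cutoff; here we pick the cutoff with the largest gap between TPR and FPR (Youden's J statistic), a common variant of the ratio criterion just described:

>>> from sklearn import metrics
>>> fpr, tpr, thresholds = metrics.roc_curve(labels, scores, pos_label=1)
>>> metrics.auc(fpr, tpr)        # area under the ROC curve, maximum of 1
>>> best = np.argmax(tpr - fpr)  # index of the largest gap between the two rates
>>> thresholds[best]             # candidate classification cutoff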

In our example, this is important because 1 is less frequent than 0. As we mentioned at the beginning of this chapter when we were examining the data set, this can cause problems in training a classification model. While the naïve choice would be to consider events with predicted probability above 0.5 as 1, in practice we find that, due to this dataset imbalance, a lower threshold is optimal, as the solution is biased toward the zeros. This effect can become even more extreme in highly skewed data: consider an example where only 1 in 1,000 points has label 1. We could have an excellent classifier that predicts that every data point is 0: it is 99.9 percent accurate! However, it would not be very useful in identifying rare events. There are a few ways we could counteract this bias besides adjusting the classification threshold.

One way would be to rebalance the model by constructing a training set that is 50 percent 1s and 50 percent 0s. We can then evaluate the performance on the unbalanced test dataset. If the imbalance is very large, our rebalanced training set might capture only a small fraction of the variation present in the 0s: thus, to generate a model representative of the entire dataset, we may want to construct many such datasets and average the results of the models generated from them. This approach is not dissimilar to the Bagging method used in constructing Random Forest models, as we saw in Chapter 4, Connecting the Dots with Models – Regression Methods.
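A minimal sketch of constructing one such rebalanced training set, assuming a feature array X and label array y (hypothetical names standing in for our training data), might downsample the 0s to match the number of 1s:

>>> import numpy as np
>>> pos_idx = np.where(y == 1)[0]   # indices of the rare class
>>> neg_idx = np.where(y == 0)[0]   # indices of the majority class
>>> sampled_neg = np.random.choice(neg_idx, size=len(pos_idx), replace=False)  # random subset of the 0s
>>> balanced_idx = np.concatenate([pos_idx, sampled_neg])
>>> np.random.shuffle(balanced_idx)  # mix the two classes together
>>> X_balanced, y_balanced = X[balanced_idx], y[balanced_idx]

Repeating this draw with different random subsets of the 0s, training a model on each, and averaging the predictions gives the bagging-like procedure described above.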

Second, we can use our knowledge of the imbalance to change the contribution of each data point as we optimize the parameters. For example, in the SGD equations, we could penalize errors on 1s 1,000 times as much as errors on 0s. This weighting then corrects for the bias in the model.
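In scikit-learn, this kind of reweighting can be expressed through the class_weight argument of the stochastic gradient descent classifier; the following sketch (using the 1,000:1 penalty above and hypothetical X_train, y_train arrays) illustrates the idea:

>>> from sklearn.linear_model import SGDClassifier
>>> weighted_model = SGDClassifier(loss='log_loss', class_weight={0: 1, 1: 1000})  # logistic loss; use loss='log' in older scikit-learn versions
>>> weighted_model.fit(X_train, y_train)

Passing class_weight='balanced' instead derives the weights automatically from the observed class frequencies.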

Our interpretation of the AUC also changes for very imbalanced datasets. While an overall AUC of 0.9 might be considered good, if the ratio between the TPR and FPR at a false positive rate of 0.001 (the fraction of data containing the rare class) is not > 1, it indicates we may have to search through a large portion of the head of the ranking to enrich the rare events. Thus, while the overall accuracy appears good, the accuracy in the range of data we care most about may be poor.

These scenarios are not uncommon in practice. For example, ad clicks are usually much less frequent than non-clicks, as are responses to sales inquiries. Visually, a classifier that is not well fit to imbalanced data is indicated by an ROC curve where the difference between TPR and FPR is greatest near the middle of the curve (~0.5). Conversely, in the ROC curve of a classifier that is appropriately adapted to rare events, most of the area is concentrated on the left-hand side (rising sharply from a cutoff of 0 and leveling off to the right), representing enrichment of positives at high thresholds.
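To make the enrichment check at a low false positive rate concrete, one could interpolate the ROC curve at that point and compare the two rates directly; a sketch, assuming fpr and tpr are the arrays returned by metrics.roc_curve as computed later in this section:

>>> import numpy as np
>>> tpr_at_low_fpr = np.interp(0.001, fpr, tpr)  # TPR interpolated at FPR = 0.001
>>> tpr_at_low_fpr / 0.001 > 1                   # are positives enriched over random at the head of the ranking?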

Note that the true positive rate and false positive rate are just two examples of accuracy metrics we could compute. We may also be interested in knowing, above a given cutoff for the model score, 1) how many of our positive examples are captured (recall) and 2) what percentage of points above this threshold are actually positive (precision). These are calculated as:

Precision = TP/(TP+FP)

Recall = TP/(TP+FN)

In fact, recall is identical to the true positive rate. While the ROC curve allows us to evaluate whether the model generates true positive predictions at a greater rate than false positives, comparing precision versus recall gives a sense of how reliable and complete the predictions above a given score threshold are. We could have very high precision, but only be able to detect a minority of the overall positive examples. Conversely, we could have high recall at the cost of low precision, as we incur false positives by lowering the score threshold used to call positives in our model. The tradeoff between these is application specific. For example, if the model is largely exploratory, such as a classifier used to generate potential sales leads for marketing, then we may accept fairly low precision, since the value of each positive is quite high even if the true predictions are interspersed with noise. On the other hand, in a model for spam identification, we may want to err on the side of high precision, since the cost of incorrectly moving a valid business email to the user's trash folder may be higher than the occasional piece of unwanted mail that gets through the filter. Finally, we could also consider performance metrics that are appropriate even for imbalanced data, because they represent a tradeoff between the precision and recall for the majority and minority classes. These include the F-measure:

F-measure = 2*(Precision*Recall)/(Precision+Recall)

And the Matthews correlation coefficient (Matthews, Brian W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 405.2 (1975): 442-451.):

MCC = (TP*TN - FP*FN)/sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))
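Both of these metrics, along with the precision-recall tradeoff itself, are available in scikit-learn; a brief sketch, assuming y_true holds the true labels, y_pred the thresholded class predictions, and y_score the predicted probabilities (hypothetical names):

>>> from sklearn import metrics
>>> precision, recall, cutoffs = metrics.precision_recall_curve(y_true, y_score)  # precision and recall at each score cutoff
>>> metrics.f1_score(y_true, y_pred)           # harmonic mean of precision and recall
>>> metrics.matthews_corrcoef(y_true, y_pred)  # Matthews correlation coefficient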

Returning to our example, we have two choices in how we could compute the predictions from our model: either a class label (0 or 1) or a probability of a particular individual being class 1. For computing the ROC curve, we want the second choice, since this will allow us to evaluate the accuracy of the classifier over a range of probabilities used as a threshold for classification:

>>> train_prediction = log_model_newton.predict_proba(census_features_train)
>>> test_prediction = log_model_newton.predict_proba(census_features_test)

With these probabilities, we can see visually that our model gives subpar accuracy, using the following code to plot the ROC curves for the training and test sets:

>>> from sklearn import metrics
>>> fpr_train, tpr_train, thresholds_train = metrics.roc_curve(np.array(census_income_train),
…       np.array(train_prediction[:,1]), pos_label=1)
>>> fpr_test, tpr_test, thresholds_test = metrics.roc_curve(np.array(census_income_test),
…       np.array(test_prediction[:,1]), pos_label=1)
>>> plt.plot(fpr_train, tpr_train)
>>> plt.plot(fpr_test, tpr_test)
>>> plt.xlabel('False Positive Rate')
>>> plt.ylabel('True Positive Rate')
[Figure: ROC curves for the training and test sets]

Numerically, we find that the AUC of the training and test sets is little better than random (0.5), as both of the commands:

>>> metrics.auc(fpr_train,tpr_train)

and

>>> metrics.auc(fpr_test,tpr_test)

give results of ~ 0.6.

If possible, we would like to improve the performance of our classifier: how can we diagnose the problems with our existing logistic regression model and work toward a better prediction?

Strategies for improving classification models

Confronted with this less than desirable performance, we typically have a few options:

  • Train with more data
  • Regularize the model to reduce over-fitting
  • Choose another algorithm

In our example with an under-performing logistic regression model, which option makes most sense?

Let us consider the first option: that we simply need more data to improve performance. In some cases, we may not have enough data in our training set to represent the patterns we observe in the test set. If this were the case, we would expect to see our performance on the test set improve as we increase the size of the training set used to build the model. However, we do not always have the convenience of getting more data. In this example, we don't actually have more data to train with; even if it were possible to collect more data in theory, in practice it may be too expensive to justify the cost, or we may need to make a decision before more data becomes available.
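One way to check whether more data would actually help is to plot a learning curve: train on increasingly large subsets of the training data and see whether held-out performance is still climbing. A sketch using scikit-learn's learning_curve utility (found in sklearn.model_selection in recent versions), with cross-validation standing in for the test set:

>>> from sklearn.model_selection import learning_curve
>>> sizes, train_scores, valid_scores = learning_curve(log_model_newton,
…       census_features_train, census_income_train,
…       train_sizes=[0.1, 0.3, 0.5, 0.7, 1.0], cv=5, scoring='roc_auc')
>>> plt.plot(sizes, valid_scores.mean(axis=1))  # if this curve is still rising at the right, more data may help
>>> plt.xlabel('Training set size')
>>> plt.ylabel('Cross-validated AUC')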

What about over-fitting? In other words, perhaps our model is precisely tuned to the patterns in the training set but does not generalize to the test set. As with the first option, we would observe better performance on the training set than on the test set. However, the solution is not necessarily to add more data, but rather to prune features to make the model more general. In our case, we see that the performance on both training and test sets is similar, so this does not seem like the most likely explanation.
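Had over-fitting been the issue, a natural remedy would be stronger regularization, which effectively prunes features; the sketch below uses an L1 penalty (which drives uninformative coefficients to zero) with a smaller C than the default. The solver and C value here are illustrative choices rather than the settings used elsewhere in this chapter:

>>> from sklearn.linear_model import LogisticRegression
>>> regularized_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)  # smaller C means stronger regularization
>>> regularized_model.fit(census_features_train, census_income_train)
>>> (regularized_model.coef_[0] == 0).sum()  # number of coefficients pruned to exactly zero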

Finally, we might try another algorithm. To do so, let us consider the limitations of our current model. For one, the logistic regression only incorporates single features: it has no way to represent interactions between them. For instance, it can only model the effect of marital status, not marital status conditional on education and age. It is plausible that such factors predict income in combination, but not necessarily individually. It may help to look at the values of the coefficients, and to do so, we will need to map the original column headers to column names in our one-hot encoding, where each of the categorical features is now represented by several columns. In this format, the numerical columns are appended to the end of the data frame, so we need to add them last to the list of columns. The following code remaps the column headers using the mapping of category to one-hot position we calculated earlier:

>>> expanded_headers = []
>>> non_categorical_headers = []
>>> categorical_index = 0
>>> for e,h in enumerate(np.array(census.columns[:-1])):
…       if e in set(categorical_features):
…           unsorted_category = np.array([h+key for key in categorical_dicts[categorical_index].keys()]) # append the category label h to each feature 'key'
…           category_indices = np.array(list(categorical_dicts[categorical_index].values())) # mapping from category label to position in the one-hot array
…           expanded_headers += list(unsorted_category[np.argsort(category_indices)]) # re-sort the category values in the order they appear in the one-hot encoding
…           categorical_index += 1 # increment to the next categorical feature
…       else:
…           non_categorical_headers += [h]
>>> expanded_headers += non_categorical_headers # numerical columns are appended at the end
>>> expanded_headers = np.array(expanded_headers) # convert to an array so we can index it by sorted coefficient position

We can check that the individual coefficients make sense: keep in mind that the sort function arranges items in ascending order, so to find the largest coefficients we sort by the negative of their values:

>>> expanded_headers[np.argsort(-1*log_model.coef_[0])]
array(['capital-gain', 'capital-loss', 'hours-per-week', 'age', 'education-num', 'marital-status Married-civ-spouse', 'relationship Husband', 'sex Male', 'occupation Exec-managerial', 'education Bachelors', 'occupation Prof-specialty', 'education Masters', 'relationship Wife', 'education Prof-school', 'workclass Self-emp-inc', 'education Doctorate', 'workclass Local-gov', 'workclass Federal-gov', 'workclass Self-emp-not-inc', 'race White', 'occupation Tech-support', 'occupation Protective-serv', 'workclass State-gov', 'occupation Sales', …

Logically, the order appears to make sense, since we would expect age and education to be important predictors of income. However, we see that only about a third of the features have a large influence on the model, as shown by the following plot:

>>> plt.bar(np.arange(108),np.sort(log_model_newton.coef_[0]))
[Figure: logistic regression coefficients sorted by value]

Thus, it looks like the model is only able to learn information from a subset of the features. We could potentially try to generate interaction features by making combinatorial labels (for example, a binary flag representing married and maximum education level as Master's degree) by taking the product of all features with each other. Generating potential nonlinear features in this way is known as polynomial expansion, since we are taking single-coefficient terms and converting them into products that have squared, cubic, or higher-power relationships. However, for the purposes of this example, we will try some alternative algorithms.
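For reference, this kind of expansion could be generated with scikit-learn's PolynomialFeatures transformer; the sketch below keeps only pairwise interaction terms, which is one possible choice rather than the approach we take next:

>>> from sklearn.preprocessing import PolynomialFeatures
>>> expander = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)  # pairwise products of features only
>>> census_features_train_expanded = expander.fit_transform(census_features_train)  # note: the number of columns grows quadratically
>>> census_features_test_expanded = expander.transform(census_features_test)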
