Getting classification straight with the confusion matrix

Accuracy measures the fraction of predictions that a model gets right, and it is the default evaluation metric of scikit-learn classifiers. Unfortunately, accuracy is one-dimensional, and it is misleading when the classes are unbalanced. The rain data we examined in Chapter 9, Ensemble Learning and Dimensionality Reduction, is fairly balanced: the number of rainy days is almost equal to the number of days on which it doesn't rain. In the case of e-mail spam classification, at least for me, the balance is shifted toward spam.
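
To see why accuracy falls short, consider a heavily unbalanced dataset. The following minimal sketch (with synthetic labels, not the rain data) shows that a do-nothing classifier that always predicts the majority class still scores 95 percent accuracy:

    import numpy as np
    from sklearn.metrics import accuracy_score

    y_true = np.array([0] * 95 + [1] * 5)  # 95 negatives, 5 positives
    y_pred = np.zeros(100, dtype=int)      # always predict the majority class

    # High accuracy despite catching none of the positives
    print(accuracy_score(y_true, y_pred))  # 0.95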

A confusion matrix is a table that is commonly used to summarize the results of classification. The two dimensions of the table are the predicted class and the target class. In the context of binary classification, we speak of positive and negative classes. Naming a class negative is arbitrary; it doesn't necessarily mean that the class is bad in some way. We can reduce any multi-class problem to a set of one-versus-the-rest problems, so once we know how to evaluate binary classification, we can extend the framework to multi-class classification. An instance can either be classified correctly or not; we label those cases with the words true and false accordingly.

We have four combinations of true, false, positive, and negative, as described in the following table:

|                  | Predicted: rain                                               | Predicted: no rain                                               |
|------------------|---------------------------------------------------------------|------------------------------------------------------------------|
| Target: rain     | True positives: It rained and we correctly predicted it.      | False negatives: It did rain, but we predicted that it wouldn't. |
| Target: no rain  | False positives: We incorrectly predicted that it would rain. | True negatives: It didn't rain, and we correctly predicted it.   |
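
As a quick illustration, the following sketch (with made-up rain labels) shows how scikit-learn's confusion_matrix() maps onto the four cells; with the sorted labels [0, 1], class 0 occupies the first row and column, so the cells unravel as tn, fp, fn, tp:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # 1 = rain, 0 = no rain
    y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

    # Rows are true labels, columns are predictions
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(tp, fp, fn, tn)  # 3 1 1 3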

How to do it...

  1. The imports are as follows (dautil and ch10util are this book's helper modules; a standalone sketch that does without them follows the recipe):
    import numpy as np
    from sklearn.metrics import confusion_matrix
    import seaborn as sns
    import dautil as dl
    from IPython.display import HTML
    import ch10util
  2. Define the following function to plot the confusion matrix (scikit-learn's confusion_matrix() expects the true labels first and the predictions second, so the parameters are named to match the calls in the next step):
    def plot_cm(y_test, preds, title, cax):
        # confusion_matrix(y_true, y_pred) counts each combination of
        # target and predicted class
        cm = confusion_matrix(y_test, preds)
        # Normalize the counts to fractions of all test cases
        normalized_cm = cm/cm.sum().astype(float)
        sns.heatmap(normalized_cm, annot=True, fmt='.2f', vmin=0, vmax=1,
                    xticklabels=['Rain', 'No Rain'],
                    yticklabels=['Rain', 'No Rain'], ax=cax)
        cax.set_xlabel('Predicted class')
        cax.set_ylabel('Expected class')
        cax.set_title('Confusion Matrix for Rain Forecast | ' + title)
  3. Load the target values and plot confusion matrices for the random forest, bagging, voting, and stacking classifiers:
    y_test = np.load('rain_y_test.npy')
    # 'context' is the dautil notebook context created earlier in the notebook
    sp = dl.plotting.Subplotter(2, 2, context)
    
    plot_cm(y_test, np.load('rfc.npy'), 'Random Forest', sp.ax)
    
    plot_cm(y_test, np.load('bagging.npy'), 'Bagging', sp.next_ax())
    
    plot_cm(y_test, np.load('votes.npy'), 'Votes', sp.next_ax())
    
    plot_cm(y_test, np.load('stacking.npy'), 'Stacking', sp.next_ax())
    sp.fig.text(0, 1, ch10util.classifiers())
    HTML(sp.exit())

Refer to the following screenshot for the end result:


The source code is in the conf_matrix.ipynb file in this book's code bundle.
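
The dautil and ch10util modules are this book's own helpers, and the .npy files are produced by the classifiers of Chapter 9. If you only want to experiment with the plot itself, the following rough standalone sketch substitutes synthetic stand-in labels and a plain matplotlib figure, as mentioned in step 1:

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.metrics import confusion_matrix

    rng = np.random.RandomState(42)
    y_test = rng.randint(0, 2, size=200)          # stand-in targets (1 = rain)
    preds = np.where(rng.rand(200) < 0.8,         # stand-in predictions,
                     y_test, 1 - y_test)          # right about 80% of the time

    cm = confusion_matrix(y_test, preds)
    normalized_cm = cm / cm.sum().astype(float)   # fractions of all test cases

    fig, ax = plt.subplots()
    # confusion_matrix() sorts the labels, so class 0 (no rain) comes first
    sns.heatmap(normalized_cm, annot=True, fmt='.2f', vmin=0, vmax=1,
                xticklabels=['No Rain', 'Rain'],
                yticklabels=['No Rain', 'Rain'], ax=ax)
    ax.set_xlabel('Predicted class')
    ax.set_ylabel('Expected class')
    ax.set_title('Confusion Matrix for Rain Forecast | Standalone sketch')
    plt.show()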

How it works...

We displayed a confusion matrix for each of the four classifiers, and the four cell values appear to repeat from one matrix to the next. Of course, the numbers are not exactly equal; however, you have to allow for some random variation.
