Accuracy is a metric that measures how well a model performs: the fraction of predictions it gets right. It is the default evaluation metric of scikit-learn classifiers. Unfortunately, accuracy is one-dimensional, and it is misleading when the classes are imbalanced. The rain data we examined in Chapter 9, Ensemble Learning and Dimensionality Reduction, is fairly balanced: the number of rainy days is almost equal to the number of days on which it doesn't rain. In the case of e-mail spam classification, at least for me, the balance is shifted toward spam.
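To see why accuracy misleads on imbalanced data, consider a small sketch with made-up labels: a classifier that always predicts the majority class still scores a high accuracy while never detecting the minority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 95% "not spam" (0), 5% "spam" (1).
y_true = np.array([0] * 95 + [1] * 5)

# A trivial classifier that always predicts the majority class.
y_pred = np.zeros(100, dtype=int)

# Accuracy looks great even though this classifier never catches spam.
print(accuracy_score(y_true, y_pred))  # 0.95
```

This is exactly the blind spot that a confusion matrix exposes.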
A confusion matrix is a table that summarizes the results of classification. The two dimensions of the table are the predicted class and the target class. In the context of binary classification, we talk about positive and negative classes. Naming a class negative is arbitrary; it doesn't necessarily mean that it is bad in some way. We can reduce any multi-class problem to a one-class-versus-the-rest problem, so once we can evaluate binary classification, we can extend the framework to multi-class classification. A class can either be correctly predicted or not; we label those instances with the words true and false accordingly.
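The one-versus-the-rest reduction can be sketched with a toy three-class example (made-up labels, not the rain data): pick one class as "positive", treat everything else as "negative", and read the four binary counts off the multi-class confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy three-class problem (labels 0, 1, and 2).
y_true = np.array([0, 1, 2, 2, 1, 0, 2, 1])
y_pred = np.array([0, 2, 2, 2, 1, 0, 1, 1])

cm = confusion_matrix(y_true, y_pred)  # rows = target, columns = predicted

# Treat class 1 as "positive" and the rest as "negative".
tp = cm[1, 1]
fp = cm[:, 1].sum() - tp   # predicted 1, but the target was another class
fn = cm[1, :].sum() - tp   # target was 1, but we predicted another class
tn = cm.sum() - tp - fp - fn
print(tp, fp, fn, tn)      # 2 1 1 4
```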
We have four combinations of true, false, positive, and negative, as described in the following table:
| | Predicted class: rain | Predicted class: no rain |
|---|---|---|
| **Target class: rain** | True positives: It rained and we correctly predicted it. | False negatives: It did rain, but we predicted that it wouldn't. |
| **Target class: no rain** | False positives: We incorrectly predicted that it would rain. | True negatives: It didn't rain, and we correctly predicted it. |
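For a binary problem, the four cells of the table can be read directly from `confusion_matrix()` by flattening the 2x2 result with `ravel()`. Here is a small sketch with made-up rain labels (1 for rain, 0 for no rain), not the actual test data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 means "rain", 0 means "no rain".
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# With labels sorted as [0, 1], ravel() yields tn, fp, fn, tp in order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```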
```python
import numpy as np
from sklearn.metrics import confusion_matrix
import seaborn as sns
import dautil as dl
from IPython.display import HTML
import ch10util
```
```python
def plot_cm(y_test, preds, title, cax):
    # confusion_matrix() expects (y_true, y_pred)
    cm = confusion_matrix(y_test, preds)
    normalized_cm = cm/cm.sum().astype(float)
    sns.heatmap(normalized_cm, annot=True, fmt='.2f', vmin=0, vmax=1,
                xticklabels=['Rain', 'No Rain'],
                yticklabels=['Rain', 'No Rain'], ax=cax)
    cax.set_xlabel('Predicted class')
    cax.set_ylabel('Expected class')
    cax.set_title('Confusion Matrix for Rain Forecast | ' + title)

y_test = np.load('rain_y_test.npy')
sp = dl.plotting.Subplotter(2, 2, context)
plot_cm(y_test, np.load('rfc.npy'), 'Random Forest', sp.ax)
plot_cm(y_test, np.load('bagging.npy'), 'Bagging', sp.next_ax())
plot_cm(y_test, np.load('votes.npy'), 'Votes', sp.next_ax())
plot_cm(y_test, np.load('stacking.npy'), 'Stacking', sp.next_ax())
sp.fig.text(0, 1, ch10util.classifiers())
HTML(sp.exit())
```
Refer to the following screenshot for the end result:
The source code is in the conf_matrix.ipynb file in this book's code bundle.
We displayed four confusion matrices for four classifiers, and the four numbers of each matrix are roughly the same from one classifier to the next. The numbers are not exactly equal, of course; you have to allow for some random variation.
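The heatmaps show fractions rather than raw counts because plot_cm() divides by the grand total. A minimal sketch of that normalization, using a made-up 2x2 count matrix, also shows a common alternative: normalizing each row, which puts the per-class recall on the diagonal.

```python
import numpy as np

# Hypothetical counts: rows = target class, columns = predicted class.
cm = np.array([[50, 10],
               [ 7, 33]])

# Grand-total normalization, as plot_cm() does: every cell becomes a
# fraction of all test instances, so the cells sum to 1.
total_norm = cm / cm.sum()

# Row normalization: each diagonal entry is then that class's recall.
row_norm = cm / cm.sum(axis=1, keepdims=True)
print(total_norm)
print(row_norm)
```

Row normalization is often easier to compare across classifiers when the class sizes differ.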
See also the confusion_matrix() function documented at http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html (retrieved November 2015).