D.7. Performance metrics

The most important piece of any machine learning pipeline is the performance metric. If you don’t know how well your machine learning model is working, you can’t make it better. The first thing we do when starting a machine learning pipeline is set up a performance metric, such as “.score()” on any sklearn machine learning model. We then build a completely random classification/regression pipeline with that performance score computed at the end. This lets us make incremental improvements to our pipeline that gradually improve the score, getting us closer to our goal. It’s also a great way to keep your bosses and coworkers convinced that you’re on the right track.
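
For example, here's a minimal sketch of that kind of random baseline, using sklearn's DummyClassifier on made-up data (the feature matrix X and the labels y here are invented purely for illustration):

>>> import numpy as np
>>> from sklearn.dummy import DummyClassifier
>>> X = np.random.rand(10, 2)                      # made-up features
>>> y = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])   # made-up labels
>>> baseline = DummyClassifier(strategy='uniform', random_state=0)
>>> baseline = baseline.fit(X, y)
>>> baseline.score(X, y)                           # accuracy of random guessing, the baseline to beat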

D.7.1. Measuring classifier performance

There are two things you want a classifier to get right: labeling things that truly belong in the class with that class label, and not labeling things that aren't in the class with that label. The counts of predictions it got right are called the true positives and the true negatives, respectively. If you have your model's predictions and the true labels in numpy arrays, you can count these correct predictions as shown in the following listing.

Listing D.3. Count what the model got right
>>> import numpy as np
>>> y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])              1
>>> y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0])              2
>>> true_positives = ((y_pred == y_true) & (y_pred == 1)).sum()
>>> true_positives                                                 3
4
 
>>> true_negatives = ((y_pred == y_true) & (y_pred == 0)).sum()
>>> true_negatives                                                 4
2

  • 1 y_true is a numpy array of the true (correct) class labels. Usually these are determined by a human.
  • 2 y_pred is a numpy array of your model’s predicted class labels (0 or 1.)
  • 3 true_positives are the positive class labels (1) that your model got right (correctly labeled 1.)
  • 4 true_negatives are the negative class labels (0) that your model got right (correctly labeled 0.)

Often it’s also important to count up the predictions that your model got wrong, as shown in the following listing.

Listing D.4. Count what the model got wrong
>>> false_positives = ((y_pred != y_true) & (y_pred == 1)).sum()
>>> false_positives                                               1
1
>>> false_negatives = ((y_pred != y_true) & (y_pred == 0)).sum()
>>> false_negatives                                               2
3

  • 1 false_positives are the negative class examples (0) that were falsely labeled positive by your model (labeled 1 when they should be 0.)
  • 2 false_negatives are the positive class examples (1) that were falsely labeled negative by your model (labeled 0 when they should be 1.)

Sometimes these four numbers are combined into a single 2 × 2 matrix called an error matrix or confusion matrix. The following listing shows what our made-up predictions and truth values would look like in a confusion matrix.

Listing D.5. Confusion matrix
>>> confusion = [[true_positives, false_positives],
...              [false_negatives, true_negatives]]
>>> confusion
[[4, 1], [3, 2]]
>>> import pandas as pd
>>> confusion = pd.DataFrame(confusion, columns=[1, 0], index=[1, 0])
>>> confusion.index.name = r'pred  truth'
>>> confusion
              1  0
pred  truth
1             4  1
0             3  2

In a confusion matrix, you want to have large numbers along the diagonal (upper left and lower right) and low numbers in the off-diagonal (upper right and lower left). However, the order of positives and negatives is arbitrary, so sometimes you may see this table transposed. Always label your confusion matrix columns and indexes. And sometimes you might hear statisticians call this matrix a classifier contingency table, but you can avoid confusion if you stick with the name "confusion matrix."
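
If you'd rather not assemble this table by hand, sklearn provides a confusion_matrix function. Note that it uses the transposed convention mentioned above, with truth in the rows and predictions in the columns; this quick sketch assumes the y_true and y_pred arrays from listing D.3:

>>> from sklearn.metrics import confusion_matrix
>>> confusion_matrix(y_true, y_pred, labels=[1, 0])  # rows: truth, columns: predictions
array([[4, 3],
       [1, 2]])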

There are two useful ways to combine some of these four counts into a single performance metric for your machine learning classification problem: precision and recall. Information retrieval (search engines) and semantic search are examples of such classification problems, since your goal is to classify documents as a match or not. In chapter 2, you learned how stemming and lemmatization can improve recall but reduce precision.

Precision measures how many of the examples your model labeled with the class you're interested in, called the positive class, actually belong to that class. For this reason it's also called the positive predictive value. Since your true positives are the positive predictions that your model got right and false positives are the negative examples that it mislabeled as positive, the precision calculation is as shown in the following listing.

Listing D.6. Precision
>>> precision = true_positives / (true_positives + false_positives)
>>> precision
0.8

The example confusion matrix gives a precision of 80%, because 80% of the model's positive predictions were correct.
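
As a sanity check, sklearn.metrics will compute the same number for you; this sketch assumes the y_true and y_pred arrays from listing D.3:

>>> from sklearn.metrics import precision_score
>>> precision_score(y_true, y_pred)
0.8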

The recall performance number is similar. It's also called the sensitivity, the true positive rate, or the probability of detection. Because the total number of positive examples in your dataset is the sum of the true positives and the false negatives, you can calculate recall, the percentage of positive examples that were detected, with the code shown in the following listing.

Listing D.7. Recall
>>> recall = true_positives / (true_positives + false_negatives)
>>> recall
0.571...

So this says that our example model detected only about 57% of the positive examples in the dataset.
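
Again, sklearn.metrics can double-check the arithmetic on the same arrays:

>>> from sklearn.metrics import recall_score
>>> recall_score(y_true, y_pred)
0.571...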

D.7.2. Measuring regressor performance

The two most common performance scores used for machine learning regression problems are root mean square error (RMSE) and the Pearson correlation coefficient. It turns out that classification problems are really regression problems under the hood, so you can use your regression metrics on your class labels if they've been converted to numbers, as we did in the previous section. These code examples reuse those example predictions and truth values. RMSE is the most useful for most problems because it tells you how far away from the truth your predictions are likely to be. RMSE behaves like the standard deviation of your error, as shown in the following listing.

Listing D.8. RMSE
>>> y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
>>> y_pred = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0])
>>> rmse = np.sqrt(((y_true - y_pred) ** 2).mean())
>>> rmse
0.632...
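
You can get the same number from sklearn by taking the square root of its mean_squared_error function; a quick cross-check on the same arrays:

>>> from sklearn.metrics import mean_squared_error
>>> np.sqrt(mean_squared_error(y_true, y_pred))
0.632...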

Another common performance metric for regressors is the Pearson correlation coefficient. A close relative, the coefficient of determination (R²), is what sklearn uses as the default .score() method on most regression models. You should calculate these scores manually if you're unclear on exactly what they measure. See the following listing.

Listing D.9. Correlation
>>> corr = pd.DataFrame([y_true, y_pred]).T.corr()
>>> corr[0][1]
0.218...
>>> (np.mean((y_pred - np.mean(y_pred)) * (y_true - np.mean(y_true)))
...     / np.std(y_pred) / np.std(y_true))
0.218...

So our example predictions are correlated with the truth by only about 22%.
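
numpy can confirm this with its built-in correlation function; a quick cross-check on the same arrays:

>>> np.corrcoef(y_true, y_pred)[0, 1]
0.218...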
