Computing precision, recall, and F1-score

In the Getting classification straight with the confusion matrix recipe, you learned that we can label classified samples as true positives, false positives, true negatives, and false negatives. With the counts of these categories, we can calculate many evaluation metrics, four of which we will cover in this recipe, as given by the following equations:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)              (10.1)
    Precision = TP / (TP + FP)                              (10.2)
    Recall = TP / (TP + FN)                                 (10.3)
    F1 = 2 * (Precision * Recall) / (Precision + Recall)    (10.4)

These metrics range from zero to one, with zero being the worst theoretical score and one the best. In practice, the worst score you are likely to see is the one you would get by random guessing. The best score in practice may also be lower than one, because in some cases we can only hope to match human performance, and there may be ambiguity about what the correct classification should be, for instance, in the case of sentiment analysis (covered in the Python Data Analysis book).

  • The accuracy (10.1) is the ratio of correct predictions.
  • Precision (10.2) measures how many of the samples we labeled as positive are actually positive. Choosing which class is positive is somewhat arbitrary, but let's say that a rainy day is positive. High precision then means that we labeled relatively few non-rainy (negative) days as rainy. For a search (web, database, or other), it means that a relatively high proportion of the returned results are relevant.
  • Recall (10.3) is the fraction of positive samples that we actually find. If, again, rainy days are our positive class, then the more rainy days we classify correctly, the higher the recall. For a search, we can get perfect recall by returning all the documents, because this automatically returns all the relevant ones. A human brain is a bit like a database, and in that context, recall means the likelihood of remembering, for instance, how a certain Python function works.
  • The F1 score (10.4) is the harmonic mean of precision and recall (actually, there are multiple variations of the F score). The G score uses the geometric mean instead but, as far as I know, is less popular. The idea behind the F1 score, the related F scores, and the G scores is to combine precision and recall into a single number. That doesn't necessarily make it the best metric; there are other metrics you may prefer, such as the Matthews correlation coefficient (refer to the Taking a look at the Matthews correlation coefficient recipe) and Cohen's kappa (refer to the Examining kappa of classification recipe). When faced with the choice of so many classification metrics, we obviously want the best one. However, you have to make the choice based on your situation, as there is no metric that fits all. A small worked sketch of these formulas follows this list.
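To make the formulas concrete, here is a minimal sketch (not part of the book's code bundle) that computes the four metrics, plus the geometric-mean G score, from made-up confusion-matrix counts:

    tp, fp, tn, fn = 80, 20, 60, 40  # hypothetical counts for illustration

    accuracy = (tp + tn) / (tp + tn + fp + fn)            # (10.1)
    precision = tp / (tp + fp)                            # (10.2)
    recall = tp / (tp + fn)                               # (10.3)
    f1 = 2 * precision * recall / (precision + recall)    # (10.4)
    g_score = (precision * recall) ** 0.5                 # geometric mean

    print(accuracy, precision, recall, f1, g_score)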

How to do it...

  1. The imports are as follows:
    import numpy as np
    from sklearn import metrics
    import ch10util
    import dautil as dl
    from IPython.display import HTML
  2. Load the target values and calculate the metrics:
    y_test = np.load('rain_y_test.npy')
    accuracies = [metrics.accuracy_score(y_test, preds)
                  for preds in ch10util.rain_preds()]
    precisions = [metrics.precision_score(y_test, preds)
                  for preds in ch10util.rain_preds()]
    recalls = [metrics.recall_score(y_test, preds)
               for preds in ch10util.rain_preds()]
    f1s = [metrics.f1_score(y_test, preds)
           for preds in ch10util.rain_preds()]
  3. Plot the metrics for the rain forecasts:
    sp = dl.plotting.Subplotter(2, 2, context)
    ch10util.plot_bars(sp.ax, accuracies)
    sp.label()
    
    ch10util.plot_bars(sp.next_ax(), precisions)
    sp.label()
    
    ch10util.plot_bars(sp.next_ax(), recalls)
    sp.label()
    
    ch10util.plot_bars(sp.next_ax(), f1s)
    sp.label()
    sp.fig.text(0, 1, ch10util.classifiers())
    HTML(sp.exit())

Refer to the following screenshot for the end result:

(Screenshot: a two-by-two grid of bar charts showing the accuracy, precision, recall, and F1 score of each rain classifier.)

The code is in the precision_recall.ipynb file in this book's code bundle.
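
As a side note, scikit-learn can also compute precision, recall, and the F1 score in a single call. The following minimal sketch (assuming the same rain_y_test.npy file and ch10util.rain_preds() helper used above) illustrates precision_recall_fscore_support() and classification_report() as alternatives to the separate list comprehensions in step 2:

    import numpy as np
    from sklearn import metrics
    import ch10util

    y_test = np.load('rain_y_test.npy')

    for preds in ch10util.rain_preds():
        # Precision, recall, and F1 for the positive class in one call
        # (the support value is None when an average is requested).
        p, r, f1, _ = metrics.precision_recall_fscore_support(
            y_test, preds, average='binary')
        print(p, r, f1)

        # A formatted per-class summary of the same metrics
        print(metrics.classification_report(y_test, preds))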

See also
