An alternative way to measure classifier performance using receiver operating characteristic curves

We already learned that measuring accuracy is not enough to truly evaluate a classifier. Instead, we relied on precision-recall (P/R) curves to get a deeper understanding of how our classifiers perform.

There is a sister of P/R curves, called receiver operating characteristic (ROC) curves, which measures similar aspects of the classifier's performance but provides another view of it. The key difference is that P/R curves are more suitable for tasks where the positive class is much more interesting than the negative one, or where the number of positive examples is much smaller than the number of negative ones. Information retrieval and fraud detection are typical application areas. ROC curves, on the other hand, provide a better picture of how well the classifier behaves in general.

To better understand the differences, let's consider the performance of the previously trained classifier in classifying country songs correctly, as shown in the following graph:

On the left, we see the P/R curve. For an ideal classifier, the curve would go from the top-left directly to the top-right and then down to the bottom-right, resulting in an area under the curve (AUC) of 1.0.

The right graph depicts the corresponding ROC curve. It plots the True Positive Rate (TPR) over the False Positive Rate (FPR). Here, an ideal classifier would have a curve going from the lower-left to the top-left, and then to the top-right. A random classifier would be a straight line from the lower-left to the upper-right, as shown by the dashed line, having an AUC of 0.5. Because the baseline of a random classifier differs between the two plots (an AUC of 0.5 in ROC space versus roughly the fraction of positive examples in P/R space), we cannot directly compare the AUC of a P/R curve with that of an ROC curve.
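
As a quick illustration of why the two AUC values are not directly comparable, the following sketch scores an imbalanced toy dataset with purely random predictions; the arrays, the 10 percent positive rate, and the random scorer are assumptions made up for illustration and are not part of our genre classifier:

import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

# toy, imbalanced labels: roughly 10% positives (illustrative only)
rng = np.random.RandomState(0)
y_true = (rng.rand(1000) < 0.1).astype(int)
y_score = rng.rand(1000)  # a "classifier" that guesses at random

fpr, tpr, _ = roc_curve(y_true, y_score)
precision, recall, _ = precision_recall_curve(y_true, y_score)

print("ROC AUC:", auc(fpr, tpr))           # close to 0.5
print("P/R AUC:", auc(recall, precision))  # close to the positive rate, ~0.1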

Independent of the curve, when comparing two different classifiers on the same dataset, it is safe to assume that when one classifier's curve dominates the other's in P/R space, it also dominates it in ROC space, and vice versa. Hence, we never bother to generate both. More on this can be found in the very insightful paper The Relationship Between Precision-Recall and ROC Curves, by Davis and Goadrich (ICML, 2006).

The following table summarizes the differences between P/R and ROC curves:

        x axis                      y axis
P/R     Recall = TP / (TP + FN)     Precision = TP / (TP + FP)
ROC     FPR = FP / (FP + TN)        TPR = TP / (TP + FN)
Looking at the definitions of the x and y axes, we see that the TPR in the ROC curve's y axis is the same as the Recall of the P/R graph's x axis.

The FPR measures the fraction of true negative examples that were falsely classified as positive ones, ranging from 0 in a perfect case (no false positives) to 1 (all are false positives).
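
To make the formulas in the preceding table concrete, here is a tiny sketch using hypothetical confusion-matrix counts; the numbers are invented purely for illustration:

# hypothetical counts of true/false positives/negatives
TP, FN, FP, TN = 80, 20, 30, 170

recall = TP / (TP + FN)     # 0.80  -> x axis of P/R, identical to TPR
precision = TP / (TP + FP)  # ~0.73 -> y axis of P/R
TPR = TP / (TP + FN)        # 0.80  -> y axis of ROC
FPR = FP / (FP + TN)        # 0.15  -> x axis of ROC
print(recall, precision, TPR, FPR)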

Going forward, let's use ROC curves to measure our classifiers' performance to get a better feeling for it. The only challenge for our multiclass problem is that both ROC and P/R curves assume a binary classification problem. For our purpose, let's therefore create one chart per genre that shows how the classifier performs in a one-versus-rest classification:

from sklearn.metrics import roc_curve
import numpy as np

y_pred = clf.predict(X_test)

for label in labels:
    # binarize: 1 for songs of the current genre, 0 for all others
    y_label_test = np.asarray(y_test == label, dtype=int)

    # class probabilities; the column at index `label` belongs to the
    # current genre
    proba = clf.predict_proba(X_test)
    proba_label = proba[:, label]

    # calculate false and true positive rates as well as the
    # ROC thresholds
    fpr, tpr, roc_thres = roc_curve(y_label_test, proba_label)

    # plot tpr over fpr ...
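
The plotting itself is left out above. As a sketch rather than the notebook's exact code, the final comment in the loop could be replaced by something like the following, using the fpr and tpr arrays just computed:

from sklearn.metrics import auc
import matplotlib.pyplot as plt

roc_auc = auc(fpr, tpr)                   # area under this genre's ROC curve
plt.plot(fpr, tpr, label='AUC = %.2f' % roc_auc)
plt.plot([0, 1], [0, 1], linestyle='--')  # random-classifier baseline
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')
plt.show()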

The outcomes are the following six ROC plots (again, for the full code, please follow the accompanying Jupyter notebook). As we have already found out, our first version of the classifier only performs well on classical songs. Looking at the individual ROC curves, however, tells us that we are really underperforming for most of the other genres. Only jazz and country provide some hope. The remaining genres are clearly not usable:
