Evaluation Metrics

We also need to be careful when selecting the evaluation metrics for our model. Suppose that we have two algorithms with accuracies of 98% and 96%, respectively, for a dog/not-dog classification problem. At first glance, the two algorithms appear to have similar performance. Recall that classification accuracy is defined as the number of correct predictions divided by the total number of predictions made; in other words, the number of True Positive (TP) and True Negative (TN) predictions divided by the total number of predictions. However, it might be the case that, along with dog images, we are also getting a large number of background or similar-looking objects falsely classified as dogs, commonly known as False Positives (FP). Another undesirable behavior could be that many dog images are misclassified as negatives, known as False Negatives (FN). Clearly, by definition, classification accuracy does not capture the notion of false positives or false negatives, so better evaluation metrics are required.

As a first step, we build a confusion matrix that summarizes the preceding paragraph:

                   | Actual Positive     | Actual Negative
Predicted Positive | True Positive (TP)  | False Positive (FP)
Predicted Negative | False Negative (FN) | True Negative (TN)

Based on this table, we can define four additional metrics that give us better insight into the achieved results; a small code sketch computing them follows the list. These are:

  • True Positive Rate (TPR), also called sensitivity or recall: the probability that a test result will be positive when the object is present = TP / (TP + FN)
  • False Positive Rate (FPR): the probability that a test result will be positive when the object is not present, that is, the probability of falsely flagging an actual negative = FP / (FP + TN)
  • Positive Predictive Value (PPV), also called precision: the probability that the object is present when the test is positive = TP / (TP + FP)
  • Negative Predictive Value (NPV): the probability that the object is not present when the test is negative = TN / (TN + FN)
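
To make these definitions concrete, here is a minimal Python sketch (illustrative only; the helper name evaluation_metrics is ours, not from any particular library) that computes accuracy together with the four metrics directly from the raw TP, FP, FN, and TN counts:

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Compute accuracy, TPR, FPR, PPV, and NPV from raw confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,  # sensitivity / recall
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,  # false alarm rate
        "ppv": tp / (tp + fp) if (tp + fp) else 0.0,  # precision; 0 when nothing is predicted positive
        "npv": tn / (tn + fn) if (tn + fn) else 0.0,
    }
```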

 

To better understand how these metrics are useful, let's take the following two confusion matrices for two different algorithms as an example and calculate the preceding metrics for each.

 

Example 1:

 

                   | Actual Positive | Actual Negative | Total
Predicted Positive | 10 (TP)         | 13 (FP)         | 23
Predicted Negative | 75 (FN)         | 188 (TN)        | 263
Total              | 85              | 201             | 286

 

Accuracy: (TP + TN) / (TP + TN + FP + FN) = 198/286 = 0.69
TPR: TP / (TP + FN) = 10/85 = 0.12
FPR: FP / (FP + TN) = 13/201 = 0.06
PPV: TP / (TP + FP) = 10/23 = 0.43
NPV: TN / (TN + FN) = 188/263 = 0.71

Example 2:

 

                   | Actual Positive | Actual Negative | Total
Predicted Positive | 0 (TP)          | 0 (FP)          | 0
Predicted Negative | 85 (FN)         | 201 (TN)        | 286
Total              | 85              | 201             | 286

 

Accuracy: (TP + TN) / (TP + TN + FP + FN) = 201/286 = 0.70
TPR: TP / (TP + FN) = 0/85 = 0
FPR: FP / (FP + TN) = 0/201 = 0
PPV: TP / (TP + FP) = 0/0 = undefined (no positive predictions are made; conventionally reported as 0)
NPV: TN / (TN + FN) = 201/286 = 0.70
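
As a quick check, feeding the two confusion matrices above into the evaluation_metrics helper sketched earlier (again, an illustrative function of ours) reproduces these numbers:

```python
# Example 1
print(evaluation_metrics(tp=10, fp=13, fn=75, tn=188))
# {'accuracy': 0.692..., 'tpr': 0.117..., 'fpr': 0.064..., 'ppv': 0.434..., 'npv': 0.714...}

# Example 2: the model predicts negative for every input
print(evaluation_metrics(tp=0, fp=0, fn=85, tn=201))
# {'accuracy': 0.702..., 'tpr': 0.0, 'fpr': 0.0, 'ppv': 0.0, 'npv': 0.702...}
```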

In the first example, we get a modest accuracy of 69%; however, in the second example, by just predicting negative for every input, we actually increase our accuracy to 70%! Obviously, a model that predicts the negative class for everything is not a useful model, and this is what we call the Accuracy Paradox. In simple terms, the Accuracy Paradox says that even though a model might have a higher accuracy, it may not actually be a better model.

This phenomenon is more likely to happen when the class imbalance becomes large, as in the preceding examples. The reader is encouraged to repeat the preceding test for a balanced dataset with 85 positive examples and 85 negative examples. If each algorithm keeps the same per-class behavior as in the preceding examples (the same proportions of true positives and true negatives), this leads to a classification accuracy of roughly 53% for the first example and 50% for the second, showing that on a balanced dataset the better algorithm also obtains the higher accuracy, so the accuracy paradox disappears.
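
A small sketch of that exercise, assuming each algorithm keeps the same per-class rates it showed above (the helper name is again ours):

```python
def rebalanced_accuracy(tp, fp, fn, tn, n_pos=85, n_neg=85):
    """Accuracy on a balanced test set, keeping each algorithm's per-class rates fixed."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # fraction of positives correctly detected
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # fraction of negatives correctly rejected
    return (tpr * n_pos + tnr * n_neg) / (n_pos + n_neg)

print(rebalanced_accuracy(tp=10, fp=13, fn=75, tn=188))  # ~0.53 (Example 1)
print(rebalanced_accuracy(tp=0, fp=0, fn=85, tn=201))    # 0.50  (Example 2)
```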

 

In order to properly evaluate the algorithms, we need to look at other evaluation metrics, such as TPR and FPR. In the second example, both are zero, which shows that the algorithm cannot detect the desired positive objects at all.

 

Another case of unbalanced datasets where the precision metric is useful is cancer tests, where the number of sick patients is considerably smaller than the number of healthy ones. Following is a worked-out example for this.

 

                     | Sick    | Healthy     | Total
Test result positive | 99 (TP) | 999 (FP)    | 1,098
Test result negative | 1 (FN)  | 98,901 (TN) | 98,902
Total                | 100     | 99,900      | 100,000

 

Accuracy: (TP + TN) / (TP + TN + FP + FN) = 99,000/100,000 = 0.99
TPR: TP / (TP + FN) = 99/100 = 0.99
FPR: FP / (FP + TN) = 999/99,900 = 0.01
PPV: TP / (TP + FP) = 99/1,098 = 0.09
NPV: TN / (TN + FN) = 98,901/98,902 ≈ 1.00

The test here seems to perform reasonably well, as the accuracy is 99%. However, a positive diagnosis does not mean that the probability that you have the disease is 99%. Note that only 99 of the 1,098 people who tested positive actually have the disease. This means that, given a positive test, the probability that you actually have the disease is only 9%, even though the test is 99% accurate.
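
The same 9% figure can also be reproduced with Bayes' rule from the disease prevalence, the sensitivity (TPR), and the false positive rate; a short illustrative calculation (the variable names are ours):

```python
# P(sick | positive test) via Bayes' rule, using the numbers from the table above
prevalence = 100 / 100_000           # P(sick)
sensitivity = 99 / 100               # P(positive | sick)    = TPR = 0.99
fpr = 999 / 99_900                   # P(positive | healthy) = FPR = 0.01

p_positive = sensitivity * prevalence + fpr * (1 - prevalence)
print(sensitivity * prevalence / p_positive)  # ~0.09, i.e. the PPV computed above
```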

These examples are a good warning that you should aim for a balanced class split in your test data, especially if you are using the accuracy metric to compare the effectiveness of different models.

Other useful means of comparing different algorithms are the precision-recall and receiver operating characteristic curves. These can be plotted if we calculate the preceding metrics for different threshold values. If the output of our algorithm is not binary (0 for negative and 1 for positive) but instead a score that approaches 1 when the test is positive and 0 when it is negative, then the numbers of TP, TN, FP, and FN will depend on the threshold value that we select.
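
For example, given ground-truth labels and the classifier's scores, the counts at any given threshold can be obtained with a few lines of illustrative Python (NumPy assumed; everything at or above the threshold is treated as a positive prediction):

```python
import numpy as np

def confusion_counts(y_true, scores, threshold):
    """TP, FP, FN, TN when scores >= threshold are treated as positive predictions."""
    y_pred = scores >= threshold
    tp = int(np.sum(y_pred & (y_true == 1)))
    fp = int(np.sum(y_pred & (y_true == 0)))
    fn = int(np.sum(~y_pred & (y_true == 1)))
    tn = int(np.sum(~y_pred & (y_true == 0)))
    return tp, fp, fn, tn

# Sweeping a range of thresholds produces a table like the one that follows
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.6, 0.8, 0.3, 0.2, 0.55, 0.7, 0.1])
for t in (0.4, 0.5, 0.6):
    print(t, confusion_counts(y_true, scores, t))
```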

Let's take the example of cat detection in images. For every region, the classifier outputs a score that shows how confident it is about the detection. If we set a threshold of 0.5, then a score of 0.6 indicates a positive detection, while a score of 0.45 indicates a negative one. If the threshold is lowered to 0.4, both detections become positive. The following table illustrates how the preceding metrics vary with the threshold value.

 

Threshold | FPR   | TPR  | PPV  | TP  | TN  | FN  | FP
0.72      | 1     | 0.98 | 0.33 | 487 | 0   | 7   | 990
0.88      | 0.5   | 0.97 | 0.46 | 485 | 430 | 9   | 560
0.97      | 0.1   | 0.94 | 0.8  | 464 | 878 | 30  | 112
0.99      | 0.05  | 0.93 | 0.87 | 460 | 923 | 34  | 67
1.06      | 0.01  | 0.87 | 0.96 | 430 | 976 | 64  | 14
1.08      | 0.005 | 0.84 | 0.98 | 416 | 985 | 78  | 5
1.16      | 0.001 | 0.69 | 0.99 | 344 | 989 | 150 | 1

 

If we plot the TPR against the FPR, we get what we call a ROC (Receiver Operating Characteristic) curve, as shown in the following graph:

To get a Precision-Recall (PR) curve, we plot precision (PPV) against recall (TPR). An example of this curve is shown in the following graph. The reader is advised to investigate further how to interpret the ROC and PR curves.
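
A minimal plotting sketch for both curves, assuming scikit-learn and matplotlib are available and using a small made-up set of labels and scores purely for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve

# Toy ground-truth labels and classifier scores, for illustration only
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

fpr, tpr, _ = roc_curve(y_true, scores)
precision, recall, _ = precision_recall_curve(y_true, scores)

fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
ax_roc.plot(fpr, tpr)
ax_roc.set(xlabel="False Positive Rate (FPR)", ylabel="True Positive Rate (TPR)", title="ROC curve")
ax_pr.plot(recall, precision)
ax_pr.set(xlabel="Recall (TPR)", ylabel="Precision (PPV)", title="Precision-Recall curve")
plt.show()
```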
