Measuring the performance of classifiers

In this section, we'll see how to measure the performance of a classifier. In the example we saw in the previous chapter, a Decision Tree can predict that a new customer will not default when, in fact, the customer does default. We need a mechanism to evaluate the error rate of a decision tree; this mechanism is the confusion matrix, also called the error matrix.

Confusion matrix, accuracy, sensitivity, and specificity

Coming back to our loan example, imagine you have classified 1000 loans using a Decision Tree. For each loan, our classifier has added a label with the value yes or no, depending on whether the algorithm predicts that the customer will default. To generalize, we will use the terms positive and negative classification. In our loans example, an observation is classified as positive when the classifier predicts that the customer will default, that is, when the value of the Default? attribute is Yes.

In this scenario, there are four types of predictions, listed as follows:

  • True Positive: The observation has been correctly classified as positive. In our example, the classifier predicts that the customer will default and the customer defaults.
  • False Positive: The observation has been incorrectly classified as positive. In our example, the classifier predicts that the customer will default but the customer doesn't default.
  • True Negative: The observation has been correctly classified as negative. In our example, the classifier predicts that the customer will not default and the customer doesn't default.
  • False Negative: The observation has been incorrectly classified as negative. In our example, the classifier predicts that the customer will not default but the customer defaults.

We represent these four outcomes in a matrix known as the confusion matrix, or error matrix, because it gives us an idea of the prediction error. The following diagram illustrates the confusion matrix:

[Figure: The confusion matrix, showing True Positives, False Negatives, False Positives, and True Negatives]

As you can see in the matrix, True Positives and True Negatives are correctly classified by the algorithm, whereas False Positives and False Negatives are incorrectly classified. To evaluate the performance of the classifier, the first measure is accuracy. To calculate the accuracy, we divide the number of correctly classified observations by the total number of observations:

Accuracy = (True Positives + True Negatives) / Total Observations

The accuracy is a number between 0 and 1. If the accuracy is 0, the classifier has failed in all its predictions; if the accuracy is 1, the classifier has classified every observation correctly. When the accuracy is close to one, the performance of the classifier is good, and when it is close to zero, the performance is bad. If you use a model that randomly guesses whether a credit is classified as bad or good risk (binary classification), the expected accuracy is about 0.5.
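
As an illustration, here is a minimal sketch in R of how the accuracy can be computed from a confusion matrix built with table(); the actual and predicted vectors are made-up labels for ten loans, not output from Rattle.

    # Hypothetical actual and predicted labels for ten loans
    actual    <- c("Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No")
    predicted <- c("Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "No", "No")

    # Confusion matrix: rows are actual classes, columns are predicted classes
    cm <- table(actual, predicted)

    # Accuracy: correctly classified observations divided by all observations
    accuracy <- sum(diag(cm)) / sum(cm)
    accuracy   # 0.8 in this made-up example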

The following diagram shows the results of two classifiers, Classifier A and Classifier B, both evaluated on 1000 observations but with very different accuracy values. In this example, the performance of Classifier A is better than the performance of Classifier B:

[Figure: Confusion matrices and accuracy values for Classifier A and Classifier B]

There is no general rule for what counts as a good accuracy, because it depends on the problem you are solving. To know whether your model performs well, you have to compare it with another model. A first step can be to compare the performance of your model against a random classifier or a simple classifier used as a baseline. Imagine that by exploring your data, you have discovered that young people have a greater probability of defaulting than older people. You can create a naive classifier that predicts Default?=Yes for young people and Default?=No for older people. If your assumption is true, the accuracy of this naive classifier will be greater than 0.5 (the random classifier); maybe 0.6 or 0.7. You can use the performance of this naive model as the baseline against which to measure the performance of your model.
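
As a rough sketch, a naive baseline of this kind could be built and compared against a random classifier as follows; the loans data frame with its age and default columns is hypothetical and only serves to illustrate the idea.

    # Hypothetical naive baseline: predict default for customers under 35
    naive_pred  <- ifelse(loans$age < 35, "Yes", "No")

    # Random classifier for comparison
    random_pred <- sample(c("Yes", "No"), nrow(loans), replace = TRUE)

    mean(naive_pred  == loans$default)   # baseline accuracy, hopefully above 0.5
    mean(random_pred == loans$default)   # should be close to 0.5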

The opposite measure is the error rate, that is, the percentage of observations that are misclassified (one minus the accuracy).

Accuracy and error rate are good measures for understanding the performance of a classifier, but keep in mind that they do not distinguish between the types of errors. Imagine you have developed a classifier that classifies tumor images as malignant or benign. A false positive is a tumor classified as malignant that is actually benign. If you have a false positive, the doctor will probably ask for additional tests to confirm the diagnosis and will discover that the image was misclassified. But what happens if you have a false negative? In that case, a malignant tumor will be classified as benign. The doctor probably won't ask for additional tests, and a malignant tumor would be treated as a benign one.

As you can see, for this domain, a false negative is more dangerous than a false positive. For this reason, we need a way to differentiate between the kinds of misclassifications.

We can use the Sensitivity or True Positive Rate to measure the ability of a classifier to correctly classify positive observations:

Sensitivity = True Positives / (True Positives + False Negatives)

If a classifier has a high sensitivity, it correctly identifies most of the positive observations in the data; in other words, it produces few false negatives.

In order to measure the ability of a classifier to correctly classify negative observations, we will use Specificity or True Negative Rate:

Specificity = True Negatives / (True Negatives + False Positives)
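
Continuing the earlier made-up example, and treating Yes (the customer defaults) as the positive class, sensitivity and specificity can be computed from the confusion matrix cm like this:

    # Counts taken from the confusion matrix; "Yes" is the positive class
    TP <- cm["Yes", "Yes"]   # actual Yes, predicted Yes
    FN <- cm["Yes", "No"]    # actual Yes, predicted No
    TN <- cm["No",  "No"]    # actual No,  predicted No
    FP <- cm["No",  "Yes"]   # actual No,  predicted Yes

    sensitivity <- TP / (TP + FN)   # True Positive Rate
    specificity <- TN / (TN + FP)   # True Negative Rate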

Now, we'll explore Rattle's confusion matrix options using the kind of models we built in the previous chapter to classify the risk of loan applications into two classes: 1 (good credit risk) and 2 (bad credit risk). Load the dataset we used in Chapter 6, Decision Trees and Other Supervised Learning Methods, and create two different models. I've created a decision tree (Minimum Split = 30, Minimum Bucket = 20, and Maximum Depth = 10) and a Random Forest model (default parameters).

Now, go to Rattle's Evaluate tab. You have three rows of options. In the top row (highlighted in the following screenshot), you can choose the type of evaluation you want to perform; choose Error Matrix (the confusion matrix). The middle row is for the model you want to evaluate. As explained earlier, I've created two models, a decision tree and a random forest, so I have the option of calculating the confusion matrix for either of them. In the following screenshot, you can see that I've selected Tree and Forest because I want to compare the confusion matrices of both models.

Finally, the bottom row allows you to choose the dataset that we'll use to perform the evaluation. By default, Rattle chooses the Validation dataset. Usually, we use this dataset to optimize the model, as shown in this screenshot:

[Figure: The Evaluate tab showing the Error Matrix output for the Tree and Forest models]

In the previous screenshot, we can see the output of the Error Matrix option. For each model, we can see the confusion matrix both as counts of observations and as percentages of observations. After the matrices, Rattle provides us with the error rate, or Overall error.
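
If you prefer to reproduce this step in the R console instead of the Evaluate tab, a sketch along these lines produces the same kind of error matrix; the object names tree_model and validation_df and the target column credit_risk are assumptions for illustration, not names used internally by Rattle.

    library(rpart)

    # tree_model is assumed to be an rpart decision tree fitted on the training data
    pred <- predict(tree_model, newdata = validation_df, type = "class")

    # Error matrix on the validation dataset, as counts and as percentages
    err_mat <- table(actual = validation_df$credit_risk, predicted = pred)
    err_mat
    round(100 * prop.table(err_mat), 1)

    # Overall error
    1 - sum(diag(err_mat)) / sum(err_mat)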

Risk Chart

For binary classification models, a Risk Chart, also known as a Cumulative Gain Chart, is a good way to measure model performance.

To obtain a Risk Chart, after creating a binary classification model, go to the Evaluate tab, choose the Risk type and the Validation dataset, and press Execute, as shown here:

[Figure: The Evaluate tab with the Risk type and the Validation dataset selected]

In the following screenshot, we can see a Risk Chart for our credit example. In this example, we have 1000 credit applications, 700 classified as good risk and 300 classified as bad risk, and a risk score of 33 percent: the risk of giving credit to an application classified as bad risk is 33 percent.

Imagine we want to use our model to choose some applications to inspect before granting credit. We would like the model to help us choose the riskiest applications. The Risk Chart tells us whether our model is appropriate for that task. The following screenshot demonstrates this:

[Figure: Risk Chart for the credit risk model, plotting Performance (%) against Caseload (%)]

In order to understand a Risk Chart, we need to know two important concepts: Precision and Recall. In a classification problem, precision is the percentage of observations classified as positive by the model that are actually positive:

Precision = True Positives / (True Positives + False Positives)

A model with a high recall will be able to find most of the positive observations in our dataset; Recall and Sensitivity are the same measure. The formula is shown here:

Recall = True Positives / (True Positives + False Negatives)
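
Reusing the counts TP, FP, and FN from the earlier made-up confusion matrix, precision and recall translate into R as follows:

    precision <- TP / (TP + FP)   # how many predicted positives are truly positive
    recall    <- TP / (TP + FN)   # identical to sensitivity: how many positives we find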

On the y axis, Performance (%), we can see the percentage of positive observations, that is, of applications classified as bad risk, covered by the review.

On the x axis, Caseload (%), we can see the percentage of the observations (the 1000 credit applications) that we review.

In this plot, the diagonal line is the baseline corresponding to reviewing credit applications at random: to find 50 percent of the bad risk applications, we would need to review 50 percent of all applications.

The Recall line shows us how well the model ranks the applications. If we review the 500 applications ranked riskiest by the model (50 percent of the whole dataset), we'll cover approximately 84 percent of the bad risk applications.
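
To see where these numbers come from, here is a rough sketch of how a cumulative gain (risk) curve can be computed by hand: applications are sorted by the model's predicted probability of bad risk, and we track how many bad risk cases are covered as the caseload grows. The score vector and the bad_risk indicator are hypothetical.

    # score: predicted probability of bad risk; bad_risk: 1 if the application is bad risk
    ord         <- order(score, decreasing = TRUE)               # rank applications by risk
    caseload    <- seq_along(ord) / length(ord) * 100            # percentage of applications reviewed
    performance <- cumsum(bad_risk[ord]) / sum(bad_risk) * 100   # percentage of bad risk covered

    # Percentage of bad risk applications covered when reviewing 50 percent of the caseload
    performance[which.min(abs(caseload - 50))]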

ROC Curve

The ROC Curve is a chart that shows the performance of a binary classification model. This chart plots True Positive Rate (sensitivity) versus the False Positive Rate of our model.

Imagine that we want to develop a binary classification model to classify new loan applications into high risk and low risk applications. We have a model that returns the probability that an application is high risk, so we need to choose a threshold to split applications between low and high risk. For example, if the probability is greater than or equal to 0.7, we predict high risk, and if it is lower than 0.7, we predict low risk. If we try different thresholds, we'll discover that with higher thresholds we obtain lower true positive rates and lower false positive rates, and with lower thresholds we obtain higher true positive rates and higher false positive rates. Obviously, there is a trade-off between the true positive rate and the false positive rate; the ROC Curve represents this trade-off for our model and helps us understand its performance.
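
The following sketch makes the trade-off concrete by computing the true positive rate and false positive rate at several thresholds; prob is a hypothetical vector of predicted probabilities of high risk and high_risk is the corresponding true 0/1 label.

    # For each threshold, classify as high risk when prob >= threshold
    for (t in c(0.3, 0.5, 0.7, 0.9)) {
      pred <- as.integer(prob >= t)
      tpr  <- sum(pred == 1 & high_risk == 1) / sum(high_risk == 1)  # sensitivity
      fpr  <- sum(pred == 1 & high_risk == 0) / sum(high_risk == 0)
      cat("threshold:", t, " TPR:", round(tpr, 2), " FPR:", round(fpr, 2), "\n")
    }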

A good way to summarize the performance of a classifier is the Area Under the ROC Curve, or AUC. An area of 1 represents a perfect classifier; an area of 0.5 represents a random classifier. The AUC of our model will be a number between 0 and 1. A rule of thumb to interpret the AUC (computed by hand in the sketch after this list) is:

  • 1 to 0.90: Excellent
  • 0.90 to 0.80: Good
  • 0.80 to 0.70: Fair
  • 0.70 to 0.60: Poor
  • 0.60 to 0.50: Bad
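
As a sketch of how the AUC can be obtained without extra packages, the Mann-Whitney formulation below computes it directly from the predicted probabilities; prob and high_risk are the same hypothetical vectors as in the previous sketch (packages such as pROC also provide this calculation).

    # AUC as the probability that a random positive scores higher than a random negative
    pos <- prob[high_risk == 1]
    neg <- prob[high_risk == 0]
    r   <- rank(c(pos, neg))   # ranks of all scores combined
    auc <- (sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2) /
           (length(pos) * length(neg))
    auc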

In the following screenshot, you can see the ROC Curve of our credit risk example. The diagonal line is the ROC Curve of a random classifier, and we can use it as a baseline to compare the performance of our model; the area under this baseline is 0.5. In the screenshot, you can see that the Area Under the Curve, or AUC, of our model is 0.8:

[Figure: ROC Curve for the credit risk model, with an Area Under the Curve of 0.8]