Chapter 6. Choosing and evaluating models

This chapter covers

  • Mapping business problems to machine learning tasks
  • Evaluating model quality
  • Explaining model predictions

In this chapter, we will discuss the modeling process (figure 6.1). We discuss this process before getting into the details of specific machine learning approaches, because the topics in this chapter apply generally to any kind of model. First, let’s discuss choosing an appropriate model approach.

Figure 6.1. Mental model

6.1. Mapping problems to machine learning tasks

As a data scientist, your task is to map a business problem to a good machine learning method. Let’s look at a real-world situation. Suppose that you’re a data scientist at an online retail company. There are a number of business problems that your team might be called on to address:

  • Predicting what customers might buy, based on past transactions
  • Identifying fraudulent transactions
  • Determining price elasticity (the rate at which a price increase will decrease sales, and vice versa) of various products or product classes
  • Determining the best way to present product listings when a customer searches for an item
  • Customer segmentation: grouping customers with similar purchasing behavior
  • AdWord valuation: how much the company should spend to buy certain AdWords on search engines
  • Evaluation of marketing campaigns
  • Organizing new products into a product catalog

Your intended uses of the model have a big influence on what methods you should use. If you want to know how small variations in input variables affect the outcome, then you likely want to use a regression method. If you want to know what single variable drives most of a categorization, then decision trees might be a good choice. Also, each business problem suggests a statistical approach to try. For the purposes of this discussion, we will group the different kinds of problems that a data scientist typically solves into these categories:

  • Classification— Assigning labels to datums
  • Scoring— Assigning numerical values to datums
  • Grouping— Discovering patterns and commonalities in data

In this section, we’ll describe these problem classes and list some typical approaches to each.

6.1.1. Classification problems

Let’s try the following example.

Example

Suppose your task is to automate the assignment of new products to your company’s product categories, as shown in figure 6.2.

Figure 6.2. Assigning products to product categories

This can be more complicated than it sounds. Products that come from different sources may have their own product classification that doesn’t coincide with the one that you use on your retail site, or they may come without any classification at all. Many large online retailers use teams of human taggers to hand categorize their products. This is not only labor intensive, but inconsistent and error prone. Automation is an attractive option; it’s labor saving, and can improve the quality of the retail site.

Product categorization based on product attributes and/or text descriptions of the product is an example of classification: deciding how to assign (known) labels to an object. Classification itself is an example of what is called supervised learning: in order to learn how to classify objects, you need a dataset of objects that have already been classified (called the training set). Building training data is the major expense for most classification tasks, especially text-related ones.

Multicategory vs. two-category classification

Product classification is an example of multicategory or multinomial classification. Most classification problems and most classification algorithms are specialized for two-category, or binomial, classification. There are tricks to using binary classifiers to solve multicategory problems (for example, building one classifier for each category, called a one-versus-rest classifier). But in most cases it’s worth the effort to find a suitable multiple-category implementation, as they tend to work better than multiple binary classifiers (for example, using the package mlogit instead of the base method glm() for logistic regression).
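For instance, here is a minimal sketch (not part of this book’s running examples) of fitting a multicategory classifier with nnet::multinom, one readily available multinomial implementation, using R’s built-in iris data as a stand-in for a product catalog:

library(nnet)                                         # provides multinom()

iris_model <- multinom(Species ~ ., data = iris,      # fit a three-category classifier
                       trace = FALSE)                 # suppress the fitting trace

predict(iris_model, newdata = head(iris), type = "class")  # predicted categories for a few rows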

Common classification methods that we will cover in this book include logistic regression (with a threshold) and decision tree ensembles.

6.1.2. Scoring problems

Scoring can be explained as follows.

Example

Suppose that your task is to help evaluate how different marketing campaigns can increase valuable traffic to the website. The goal is not only to bring more people to the site, but to bring more people who buy.

In this situation, you may want to consider a number of different factors: the communication channel (ads on websites, YouTube videos, print media, email, and so on); the traffic source (Facebook, Google, radio stations, and so on); the demographic targeted; the time of year, and so on. You want to measure if these factors increase sales, and by how much.

Predicting the increase in sales from a particular marketing campaign based on factors such as these is an example of regression, or scoring. In this case, a regression model would map the different factors being measured into a numerical value: sales, or the increase in sales from some baseline.

Predicting the probability of an event (like belonging to a given class) can also be considered scoring. For example, you might think of fraud detection as classification: is this event fraud or not? However, if you are trying to estimate the probability that an event is fraud, this can be considered scoring. This is shown in figure 6.3. Scoring is also an instance of supervised learning.

Figure 6.3. Notional example of determining the probability that a transaction is fraudulent

6.1.3. Grouping: working without known targets

The preceding methods require that you have a training dataset of situations with known outcomes. In some situations, there’s not (yet) a specific outcome that you want to predict. Instead, you may be looking for patterns and relationships in the data that will help you understand your customers or your business better.

These situations correspond to a class of approaches called unsupervised learning: rather than predicting outputs based on inputs, the objective of unsupervised learning is to discover similarities and relationships in the data. Some common unsupervised tasks include these:

  • Clustering— Grouping similar objects together
  • Association rules— Discovering common behavior patterns, for example, items that are always bought together, or library books that are always checked out together

Let’s expand on these two types of unsupervised methods.

When to use basic clustering

A good clustering example is the following.

Example

Suppose you want to segment your customers into general categories of people with similar buying patterns. You might not know in advance what these groups should be.

This problem is a good candidate for k-means clustering. K-means clustering is one way to sort the data into groups such that members of a cluster are more similar to each other than they are to members of other clusters.

Suppose that you find (as in figure 6.4) that your customers cluster into those with young children, who make more family-oriented purchases, and those with no children or with adult children, who make more leisure- and social-activity-related purchases. Once you have assigned a customer into one of those clusters, you can make general statements about their behavior. For example, a customer in the with-young-children cluster is likely to respond more favorably to a promotion on attractive but durable glassware than to a promotion on fine crystal wine glasses.

Figure 6.4. Notional example of clustering your customers by purchase pattern and purchase amount

We will cover k-means and other clustering approaches in more detail in section 9.1.
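As a small preview, the following sketch runs kmeans() on simulated stand-in data; the column names and the choice of two clusters are hypothetical, not taken from a real customer dataset:

set.seed(1345)
customers <- data.frame(
  purchase_amount = rexp(200, rate = 1 / 50),     # simulated purchase sizes
  family_fraction = runif(200))                   # simulated fraction of family-oriented items

scaled <- scale(customers)                        # scale the columns so no one unit dominates
clusters <- kmeans(scaled, centers = 2, nstart = 25)

table(clusters$cluster)                                                   # cluster sizes
aggregate(customers, by = list(cluster = clusters$cluster), FUN = mean)   # per-cluster averages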

When to use association rules

You might be interested in directly determining which products tend to be purchased together. For example, you might find that bathing suits and sunglasses are frequently purchased at the same time, or that people who purchase certain cult movies, like Repo Man, will often buy the movie soundtrack at the same time.

This is a good application for association rules (or even recommendation systems). You can mine useful product recommendations: whenever you observe that someone has put a bathing suit into their shopping cart, you can recommend suntan lotion, as well. This is shown in figure 6.5. We’ll cover the Apriori algorithm for discovering association rules in section 9.2.

Figure 6.5. Notional example of finding purchase patterns in your data
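As a preview of section 9.2, here is a minimal sketch using the arules package and its bundled Groceries transaction data as a stand-in for your own purchase logs:

library(arules)                                   # provides apriori() and the Groceries data
data(Groceries)

rules <- apriori(Groceries,                       # find rules with at least 1% support
                 parameter = list(support = 0.01, #   and 50% confidence
                                  confidence = 0.5))

inspect(head(sort(rules, by = "lift"), 3))        # look at a few of the strongest rules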

6.1.4. Problem-to-method mapping

To summarize the preceding, table 6.1 maps some typical business problems to their corresponding machine learning tasks.

Table 6.1. From problem to approach

Example tasks: Identifying spam email; sorting products in a product catalog; identifying loans that are about to default; assigning customers to preexisting customer clusters.
Machine learning terminology: Classification—Assigning known labels to objects. Classification is a supervised method, so you need preclassified data in order to train a model.

Example tasks: Predicting the value of AdWords; estimating the probability that a loan will default; predicting how much a marketing campaign will increase traffic or sales; predicting the final price of an auction item based on the final prices of similar products that have been auctioned in the past.
Machine learning terminology: Regression—Predicting or forecasting numerical values. Regression is also a supervised method, so you need data where the output is known, in order to train a model.

Example tasks: Finding products that are purchased together; identifying web pages that are often visited in the same session; identifying successful (often-clicked) combinations of web pages and AdWords.
Machine learning terminology: Association rules—Finding objects that tend to appear in the data together. Association rules are an unsupervised method; you do not need data where you already know the relationships, but are trying to discover the relationships within your data.

Example tasks: Identifying groups of customers with the same buying patterns; identifying groups of products that are popular in the same regions or with the same customer clusters; identifying news items that are all discussing similar events.
Machine learning terminology: Clustering—Finding groups of objects that are more similar to each other than to objects in other groups. Clustering is also an unsupervised method; you do not need pregrouped data, but are trying to discover the groupings within your data.
Prediction vs. forecasting

In everyday language, we tend to use the terms prediction and forecasting interchangeably. Technically, to predict is to pick an outcome, such as “It will rain tomorrow,” and to forecast is to assign a probability: “There’s an 80% chance it will rain tomorrow.” For unbalanced class applications (such as predicting credit default), the difference is important. Consider the case of modeling loan defaults, and assume the overall default rate is 5%. Identifying a group that has a 30% default rate is an inaccurate prediction (you don’t know who in the group will default, and most people in the group won’t default), but potentially a very useful forecast (this group defaults at six times the overall rate).

6.2. Evaluating models

When building a model, you must be able to estimate model quality in order to ensure that your model will perform well in the real world. To attempt to estimate future model performance, we often split our data into training data and test data, as illustrated in figure 6.6. Test data is data not used during training, and is intended to give us some experience with how the model will perform on new data.

Figure 6.6. Schematic of model construction and evaluation

One of the things the test set can help you identify is overfitting: building a model that memorizes the training data, and does not generalize well to new data. A lot of modeling problems are related to overfitting, and looking for signs of overfit is a good first step in diagnosing models.

6.2.1. Overfitting

An overfit model looks great on the training data and then performs poorly on new data. A model’s prediction error on the data that it trained from is called training error. A model’s prediction error on new data is called generalization error. Usually, training error will be smaller than generalization error (no big surprise). Ideally, though, the two error rates should be close. If generalization error is large (that is, the model performs poorly on data it did not train on), then your model has probably overfit—it’s memorized the training data instead of discovering generalizable rules or patterns. You want to avoid overfitting by preferring (as long as possible) simpler models, which do in fact tend to generalize better.[1] Figure 6.7 shows the typical appearance of a reasonable model and an overfit model.

1

Other techniques to prevent overfitting include regularization (preferring small effects from model variables) and bagging (averaging different models to reduce variance).

Figure 6.7. A notional illustration of overfitting

An overly complicated and overfit model is bad for at least two reasons. First, an overfit model may be much more complicated than anything useful. For example, the extra wiggles in the overfit part of figure 6.7 could make optimizing with respect to x needlessly difficult. Also, as we mentioned, overfit models tend to be less accurate in production than during training, which is embarrassing.

Testing on held-out data

In section 4.3.1 we introduced the idea of splitting your data into test-train or test-train-calibration sets, as shown in figure 6.8. Here we’ll go into more detail about why you want to split your data this way.

Figure 6.8. Splitting data into training and test (or training, calibration, and test) sets

Example

Suppose you are building models to predict used car prices, based on various features of the car. You fit both a linear regression model and a random forest model, and you wish to compare the two.[2]

2

Both these modeling techniques will be covered in later chapters of the book.

If you do not split your data, but instead use all available data to both train and evaluate each model, then you might think that you will pick the better model, because the model evaluation has seen more data. However, the data used to build a model is not the best data for evaluating that model’s performance: there’s an optimistic measurement bias in this data, because it was seen during model construction. Model construction optimizes your performance measure (or at least something related to it), so you tend to get exaggerated estimates of performance on your training data.

In addition, data scientists naturally tend to tune their models to get the best possible performance out of them. This also leads to exaggerated measures of performance. This is often called multiple comparison bias. And since this tuning might sometimes take advantage of quirks in the training data, it can potentially lead to overfit.

A recommended precaution against this optimistic bias is to split your available data into test and training sets. Perform all of your clever work on the training data alone, and delay measuring your performance with respect to your test data until as late as possible in your project (as all choices you make after seeing your test or holdout performance introduce a modeling bias). The desire to keep the test data secret for as long as possible is why we often actually split data into training, calibration, and test sets (as we’ll demonstrate in section 8.2.1).

When partitioning your data, you want to balance the trade-off between keeping enough data to fit a good model, and holding out enough data to make good estimates of the model’s performance. Some common splits are 70% training to 30% test, or 80% training to 20% test. For large datasets, you may even sometimes see a 50–50 split.
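For example, a simple random 70/30 split can be made with base R, as in the following sketch (the data frame d is a stand-in for your own data):

set.seed(25643)                          # make the split reproducible
d <- data.frame(x = runif(1000))         # stand-in data
d$y <- 2 * d$x + rnorm(nrow(d), sd = 0.1)

is_train <- runif(nrow(d)) <= 0.7        # mark about 70% of the rows for training
dTrain <- d[is_train, , drop = FALSE]
dTest  <- d[!is_train, , drop = FALSE]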

K-fold cross-validation

Testing on holdout data, while useful, uses each example only once: either as part of the model construction or as part of the held-out model evaluation set. This is not statistically efficient,[3] because the test set is often much smaller than our whole dataset. This means we are losing some precision in our estimate of model performance by partitioning our data so simply. In our example scenario, suppose you were not able to collect a very large dataset of historical used car prices. Then you might feel that you do not have enough data to split into training and test sets that are large enough to both build good models and evaluate them properly. In this situation, you might choose to use a more thorough partitioning scheme called k-fold cross-validation.

3

An estimator is called statistically efficient when it has minimal variance for a given dataset size.

The idea behind k-fold cross-validation is to repeat the construction of a model on different subsets of the available training data and then evaluate that model only on data not seen during construction. This allows us to use each and every example in both training and evaluating models (just never the same example in both roles at the same time). The idea is shown in figure 6.9 for k = 3.

Figure 6.9. Partitioning data for 3-fold cross-validation

In the figure, the data is split into three non-overlapping partitions, and the three partitions are arranged to form three test-train splits. For each split, a model is trained on the training set and then applied to the corresponding test set. The entire set of predictions is then evaluated, using the appropriate evaluation scores that we will discuss later in the chapter. This simulates training a model and then evaluating it on a holdout set that is the same size as the entire dataset. Estimating the model’s performance on all the data gives us a more precise estimate of how a model of a given type would perform on new data. Assuming that this performance estimate is satisfactory, then you would go back and train a final model, using all the training data.
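The following sketch carries out the 3-fold procedure by hand on simulated stand-in data, so you can see all the moving parts:

set.seed(2352)
d <- data.frame(x = runif(300))                  # stand-in data
d$y <- 2 * d$x + rnorm(nrow(d), sd = 0.2)

fold <- sample(rep(1:3, length.out = nrow(d)))   # randomly assign each row to one of 3 folds
d$pred <- NA_real_
for (k in 1:3) {
  model <- lm(y ~ x, data = d[fold != k, ])      # train on the two folds not held out
  d$pred[fold == k] <- predict(model, newdata = d[fold == k, ])
}

sqrt(mean((d$pred - d$y)^2))                     # score all the out-of-fold predictions together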

For big data, a test-train split tends to be good enough and is much quicker to implement. In data science applications, cross-validation is generally used for tuning modeling parameters, which is basically trying many models in succession. Cross-validation is also used when nesting models (using one model as input to another model). This is an issue that can arise when transforming data for analysis, and is discussed in chapter 7.

6.2.2. Measures of model performance

In this section, we’ll introduce some quantitative measures of model performance. From an evaluation point of view, we group model types this way:

  • Classification
  • Scoring
  • Probability estimation
  • Clustering

For most model evaluations, we just want to compute one or two summary scores that tell us if the model is effective. To decide if a given score is high or low, we generally compare our model’s performance to a few baseline models.

The null model

The null model is the best version of a very simple model you’re trying to outperform. The most typical null model is a model that returns the same answer for all situations (a constant model). We use null models as a lower bound on desired performance. For example, in a categorical problem, the null model would always return the most popular category, as this is the easy guess that is least often wrong. For a score model, the null model is often the average of all the outcomes, as this constant has the smallest total squared deviation from the outcomes.

The idea is that if you’re not outperforming the null model, you’re not delivering value. Note that it can be hard to do as well as the best null model, because even though the null model is simple, it’s privileged to know the overall distribution of the items it will be quizzed on. We always assume the null model we’re comparing to is the best of all possible null models.
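As a small illustration with made-up outcome vectors, the best constant models and their scores can be computed directly:

y_class <- c("spam", "non-spam", "non-spam", "non-spam", "spam")   # made-up class labels
null_label <- names(which.max(table(y_class)))   # the most popular category
mean(y_class == null_label)                      # accuracy of the categorical null model

y_num <- c(3.1, 2.7, 4.0, 3.6)                   # made-up numeric outcomes
null_pred <- mean(y_num)                         # the constant prediction for a score model
sqrt(mean((y_num - null_pred)^2))                # the null model's RMSE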

Single-variable models

We also suggest comparing any complicated model against the best single-variable model you have available (please see chapter 8 for how to convert single variables into single-variable models). A complicated model can’t be justified if it doesn’t outperform the best single-variable model available from your training data. Also, business analysts have many tools for building effective single-variable models (such as pivot tables), so if your client is an analyst, they’re likely looking for performance above this level.

We’ll present the standard measures of model quality, which are useful in model construction. In all cases, we suggest that in addition to the standard model quality assessments, you try to design your own custom business-oriented metrics with your project sponsor or client. Usually this is as simple as assigning a notional dollar value to each outcome and then seeing how your model performs under that criterion. Let’s start with how to evaluate classification models and then continue from there.

6.2.3. Evaluating classification models

A classification model places examples into one of two or more categories. For measuring classifier performance, we’ll first introduce the incredibly useful tool called the confusion matrix and show how it can be used to calculate many important evaluation scores. The first score we’ll discuss is accuracy.

Example

Suppose we want to classify email into spam (email we in no way want) and non-spam (email we want).

A ready-to-go example (with a good description) is the “Spambase Data Set” (http://mng.bz/e8Rh). Each row of this dataset is a set of features measured for a specific email and an additional column telling whether the mail was spam (unwanted) or non-spam (wanted). We’ll quickly build a spam classification model using logistic regression so we have results to evaluate. We will discuss logistic regression in section 7.2, but for right now you can just download the file Spambase/spamD.tsv from the book’s GitHub site (https://github.com/WinVector/PDSwR2/tree/master/Spambase) and then perform the steps shown in the following listing.

Listing 6.1. Building and applying a logistic regression spam model
spamD <- read.table('spamD.tsv', header = TRUE, sep = '\t')     1

spamTrain <- subset(spamD,spamD$rgroup  >= 10)                  2
spamTest <- subset(spamD,spamD$rgroup < 10)

spamVars <- setdiff(colnames(spamD), list('rgroup','spam'))     3
spamFormula <- as.formula(paste('spam == "spam"',
paste(spamVars, collapse = ' + '),sep = ' ~ '))

spamModel <- glm(spamFormula,family = binomial(link = 'logit'), 4
                                  data = spamTrain)

spamTrain$pred <- predict(spamModel,newdata = spamTrain,        5
                              type = 'response')
spamTest$pred <- predict(spamModel,newdata = spamTest,
                            type = 'response')

  • 1 Reads in the data
  • 2 Splits the data into training and test sets
  • 3 Creates a formula that describes the model
  • 4 Fits the logistic regression model
  • 5 Makes predictions on the training and test sets

The spam model predicts the probability that a given email is spam. A sample of the results of our simple spam classifier is shown in the next listing.

Listing 6.2. Spam classifications
sample <- spamTest[c(7,35,224,327), c('spam','pred')]
print(sample)
##          spam         pred       1
## 115      spam 0.9903246227
## 361      spam 0.4800498077
## 2300 non-spam 0.0006846551
## 3428 non-spam 0.0001434345

  • 1 The first column gives the actual class label (spam or non-spam). The second column gives the predicted probability that an email is spam. If the probability > 0.5, the email is labeled “spam;” otherwise, it is “non-spam.”
The confusion matrix

The absolute most interesting summary of classifier performance is the confusion matrix. This matrix is just a table that summarizes the classifier’s predictions against the actual known data categories.

The confusion matrix is a table counting how often each combination of known outcomes (the truth) occurred in combination with each prediction type. For our email spam example, the confusion matrix is calculated by the R command in the following listing.

Listing 6.3. Spam confusion matrix
confmat_spam <- table(truth = spamTest$spam,
                         prediction = ifelse(spamTest$pred > 0.5,
                         "spam", "non-spam"))
print(confmat_spam)
##          prediction
## truth   non-spam spam
##   non-spam   264   14
##   spam        22  158

The rows of the table (labeled truth) correspond to the actual labels of the datums: whether they are really spam or not. The columns of the table (labeled prediction) correspond to the predictions that the model makes. So the first cell of the table (truth = non-spam and prediction = non-spam) corresponds to the 264 emails in the test set that are not spam, and that the model (correctly) predicts are not spam. These correct negative predictions are called true negatives.

Confusion matrix conventions

A number of tools, as well as Wikipedia, draw confusion matrices with the actual truth values controlling the x-axis in the figure. This is likely due to the math convention that the first coordinate in matrices and tables names the row (vertical offset), and not the column (horizontal offset). It is our feeling that direct labels, such as “pred” and “actual,” are much clearer than any convention. Also note that in residual graphs the prediction is always the x-axis, and being visually consistent with this important convention is a benefit. So in this book, we will plot predictions on the x-axis (regardless of how that axis is labeled).

It is standard terminology to refer to datums that are in the class of interest as positive instances, and those not in the class of interest as negative instances. In our scenario, spam emails are positive instances, and non-spam emails are negative instances.

In a two-by-two confusion matrix, every cell has a special name, as illustrated in table 6.2.

Table 6.2. Two-by-two confusion matrix

Truth=NEGATIVE (non-spam):
  Prediction=NEGATIVE (predicted as non-spam): True negatives (TN), confmat_spam[1,1] = 264
  Prediction=POSITIVE (predicted as spam): False positives (FP), confmat_spam[1,2] = 14
Truth=POSITIVE (spam):
  Prediction=NEGATIVE (predicted as non-spam): False negatives (FN), confmat_spam[2,1] = 22
  Prediction=POSITIVE (predicted as spam): True positives (TP), confmat_spam[2,2] = 158

Using this summary, we can now start to calculate various performance metrics of our spam filter.

Changing a score to a classification

Note that we converted the numerical prediction score into a decision by checking if the score was above or below 0.5. This means that if the model returned a probability higher than 50% that an email is spam, we classify it as spam. For some scoring models (like logistic regression) the 0.5 score is likely a threshold that gives a classifier with reasonably good accuracy. However, accuracy isn’t always the end goal, and for unbalanced training data, the 0.5 threshold won’t be good. Picking thresholds other than 0.5 can allow the data scientist to trade precision for recall (two terms that we’ll define later in this chapter). You can start at 0.5, but consider trying other thresholds and looking at the ROC curve (see section 6.2.5).

Accuracy

Accuracy answers the question, “When the spam filter says this email is or is not spam, what’s the probability that it’s correct?” For a classifier, accuracy is defined as the number of items categorized correctly divided by the total number of items: it is simply the fraction of the classifier’s decisions that are correct. This is shown in figure 6.10.

Figure 6.10. Accuracy

At the very least, you want a classifier to be accurate. Let’s calculate the accuracy of the spam filter:

(confmat_spam[1,1] + confmat_spam[2,2]) / sum(confmat_spam)
## [1] 0.9213974

The error of around 8% is unacceptably high for a spam filter, but is good for illustrating different sorts of model evaluation criteria.

Before we move on, we’d like to share the confusion matrix of a good spam filter. In the next listing, we create the confusion matrix for the Akismet comment spam filter from the Win-Vector blog.[4]

4

Listing 6.4. Entering the Akismet confusion matrix by hand
confmat_akismet <- as.table(matrix(data=c(288-1,17,1,13882-17),nrow=2,ncol=2))
rownames(confmat_akismet) <- rownames(confmat_spam)
colnames(confmat_akismet) <- colnames(confmat_spam)
print(confmat_akismet)
##       non-spam  spam
## non-spam   287     1
## spam        17 13865

Because the Akismet filter uses link destination clues and determination from other websites (in addition to text features), it achieves a more acceptable accuracy:

(confmat_akismet[1,1] + confmat_akismet[2,2]) / sum(confmat_akismet)
## [1] 0.9987297

More importantly, Akismet seems to have suppressed fewer good comments. Our next section on precision and recall will help quantify this distinction.

Accuracy is an inappropriate measure for unbalanced classes

Suppose we have a situation where we have a rare event (say, severe complications during childbirth). If the event we’re trying to predict is rare (say, around 1% of the population), the null model that says the rare event never happens is very (99%) accurate. The null model is in fact more accurate than a useful (but not perfect) model that identifies 5% of the population as being “at risk” and captures all of the bad events in that 5%. This is not any sort of paradox. It’s just that accuracy is not a good measure for events that have an unbalanced distribution or unbalanced costs.
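The arithmetic is easy to check with the rates used above:

prevalence <- 0.01                     # 1% of the population has the rare event
null_accuracy <- 1 - prevalence        # the "never happens" model is right 99% of the time

flagged <- 0.05                        # the useful model flags 5% of the population
useful_accuracy <- prevalence + (1 - flagged)   # catches every event, plus the unflagged non-events
c(null = null_accuracy, useful = useful_accuracy)
## null = 0.99, useful = 0.96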

Precision and recall

Another evaluation measure used by machine learning researchers is a pair of numbers called precision and recall. These terms come from the field of information retrieval and are defined as follows.

Precision answers the question, “If the spam filter says this email is spam, what’s the probability that it’s really spam?” Precision is defined as the ratio of true positives to predicted positives. This is shown in figure 6.11.

Figure 6.11. Precision

We can calculate the precision of our spam filter as follows:

confmat_spam[2,2] / (confmat_spam[2,2]+ confmat_spam[1,2])
## [1] 0.9186047

It is only a coincidence that the precision is so close to the accuracy number we reported earlier. Again, precision is how often a positive indication turns out to be correct. It’s important to remember that precision is a function of the combination of the classifier and the dataset. It doesn’t make sense to ask how precise a classifier is in isolation; it’s only sensible to ask how precise a classifier is for a given dataset. The hope is that the classifier will be similarly precise on the overall population that the dataset is drawn from—a population with the same distribution of positive instances as the dataset.

In our email spam example, 92% precision means 8% of what was flagged as spam was in fact not spam. This is an unacceptable rate for losing possibly important messages. Akismet, on the other hand, had a precision of over 99.99%, so it throws out very little non-spam email.

confmat_akismet[2,2] / (confmat_akismet[2,2] + confmat_akismet[1,2])
## [1] 0.9999279

The companion score to precision is recall. Recall answers the question, “Of all the spam in the email set, what fraction did the spam filter detect?” Recall is the ratio of true positives over all actual positives, as shown in figure 6.12.

Figure 6.12. Recall

Let’s compare the recall of the two spam filters.

confmat_spam[2,2] / (confmat_spam[2,2] + confmat_spam[2,1])
## [1] 0.8777778

confmat_akismet[2,2] / (confmat_akismet[2,2] + confmat_akismet[2,1])
## [1] 0.9987754

For our email spam filter, this is 88%, which means about 12% of the spam email we receive will still make it into our inbox. Akismet has a recall of 99.88%. In both cases, most spam is in fact tagged (we have high recall) and precision is emphasized over recall. This is appropriate for a spam filter, because it’s more important to not lose non-spam email than it is to filter every single piece of spam out of our inbox.

It’s important to remember this: precision is a measure of confirmation (when the classifier indicates positive, how often it is in fact correct), and recall is a measure of utility (how much the classifier finds of what there actually is to find). Precision and recall tend to be relevant to business needs and are good measures to discuss with your project sponsor and client.

F1
Example

Suppose that you had multiple spam filters to choose from, each with different values of precision and recall. How do you pick which spam filter to use?

In situations like this, some people prefer to have just one number to compare all the different choices by. One such score is the F1 score. The F1 score measures a trade-off between precision and recall. It is defined as the harmonic mean of the precision and recall. This is most easily shown with an explicit calculation:

precision <- confmat_spam[2,2] / (confmat_spam[2,2]+ confmat_spam[1,2])
recall <- confmat_spam[2,2] / (confmat_spam[2,2] + confmat_spam[2,1])

(F1 <- 2 * precision * recall / (precision + recall) )
## [1] 0.8977273

Our spam filter with 0.92 precision and 0.88 recall has an F1 score of 0.90. F1 is 1.00 when a classifier has perfect precision and recall, and goes to 0.00 for classifiers that have either very low precision or recall (or both). Suppose you think that your spam filter is losing too much real email, and you want to make it “pickier” about marking email as spam; that is, you want to increase its precision. Quite often, increasing the precision of a classifier will also lower its recall: in this case, a pickier spam filter will mark fewer actual spam emails as spam and allow more spam into your inbox. If the filter’s recall falls too low as its precision increases, this will result in a lower F1. This possibly means that you have traded too much recall for better precision.

Sensitivity and specificity
Example

Suppose that you have successfully trained a spam filter with acceptable precision and recall, using your work email as training data. Now you want to use that same spam filter on a personal email account that you use primarily for your photography hobby. Will the filter work as well?

It’s possible the filter will work just fine on your personal email as is, since the nature of spam (the length of the email, the words used, the number of links, and so on) probably doesn’t change much between the two email accounts. However, the proportion of spam you get on the personal email account may be different than it is on your work email. This can change the performance of the spam filter on your personal email.[5]

5

The spam filter performance can also change because the nature of the non-spam will be different, too: the words commonly used will be different; the number of links or images in a legitimate email may be different; the email domains of people you correspond with may be different. For this discussion, we will assume that the proportion of spam email is the main reason that a spam filter’s performance will be different.

Let’s see how changes in the proportion of spam can change the performance metrics of the spam filter. Here we simulate email sets with both higher and lower proportions of spam than the data that we trained the filter on.

Listing 6.5. Seeing filter performance change when spam proportions change
set.seed(234641)

N <- nrow(spamTest)
pull_out_ix <- sample.int(N, 100, replace=FALSE)
removed = spamTest[pull_out_ix,]                                  1

get_performance <- function(sTest) {                              2
  proportion <- mean(sTest$spam == "spam")
  confmat_spam <- table(truth = sTest$spam,
                        prediction = ifelse(sTest$pred>0.5,
                                            "spam",
                                            "non-spam"))
  precision <- confmat_spam[2,2]/sum(confmat_spam[,2])
  recall <- confmat_spam[2,2]/sum(confmat_spam[2,])
  list(spam_proportion = proportion,
       confmat_spam = confmat_spam,
       precision = precision, recall = recall)
}

sTest <- spamTest[-pull_out_ix,]                                  3
get_performance(sTest)

## $spam_proportion
## [1] 0.3994413
##
## $confmat_spam
##           prediction
## truth      non-spam spam
##   non-spam      204   11
##   spam           17  126
##
## $precision
## [1] 0.919708
##
## $recall
## [1] 0.8811189

get_performance(rbind(sTest, subset(removed, spam=="spam")))      4

## $spam_proportion
## [1] 0.4556962
##
## $confmat_spam
##           prediction
## truth      non-spam spam
##   non-spam      204   11
##   spam           22  158
##
## $precision
## [1] 0.9349112
##
## $recall
## [1] 0.8777778

get_performance(rbind(sTest, subset(removed, spam=="non-spam"))) 5

## $spam_proportion
## [1] 0.3396675
##
## $confmat_spam
##           prediction
## truth      non-spam spam
##   non-spam      264   14
##   spam           17  126
##
## $precision
## [1] 0.9
##
## $recall
## [1] 0.8811189

  • 1 Pulls 100 emails out of the test set at random
  • 2 A convenience function to print out the confusion matrix, precision, and recall of the filter on a test set.
  • 3 Looks at performance on a test set with the same proportion of spam as the training data
  • 4 Adds back only additional spam, so the test set has a higher proportion of spam than the training set
  • 5 Adds back only non-spam, so the test set has a lower proportion of spam than the training set

Note that the recall of the filter is the same in all three cases: about 88%. When the data has more spam than the filter was trained on, the filter has higher precision, which means it throws a lower proportion of non-spam email out. This is good! However, when the data has less spam than the filter was trained on, the precision is lower, meaning the filter will throw out a higher fraction of non-spam email. This is undesirable.

Because there are situations where a classifier or filter may be used on populations where the prevalence of the positive class (in this example, spam) varies, it’s useful to have performance metrics that are independent of the class prevalence. One such pair of metrics is sensitivity and specificity. This pair of metrics is common in medical research, because tests for diseases and other conditions will be used on different populations, with different prevalence of a given disease or condition.

Sensitivity is also called the true positive rate and is exactly equal to recall. Specificity is also called the true negative rate: it is the ratio of true negatives to all negatives. This is shown in figure 6.13.

Figure 6.13. Specificity

Sensitivity and recall answer the question, “What fraction of spam does the spam filter find?” Specificity answers the question, “What fraction of the non-spam does the spam filter correctly identify as non-spam?”

We can calculate specificity for our spam filter:

confmat_spam[1,1] / (confmat_spam[1,1] + confmat_spam[1,2])
## [1] 0.9496403

One minus the specificity is also called the false positive rate. False positive rate answers the question, “What fraction of non-spam will the model classify as spam?” You want the false positive rate to be low (or the specificity to be high), and the sensitivity to also be high. Our spam filter has a specificity of about 0.95, which means that it will mark about 5% of non-spam email as spam.

An important property of sensitivity and specificity is this: if you flip your labels (switch from spam being the class you’re trying to identify to non-spam being the class you’re trying to identify), you just switch sensitivity and specificity. Also, a trivial classifier that always says positive or always says negative will always return a zero score on either sensitivity or specificity. So useless classifiers always score poorly on at least one of these measures.
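You can check the label-flipping property directly from the confusion matrix of listing 6.3 (a sketch, assuming confmat_spam is still in memory):

sens <- function(cm) cm[2, 2] / sum(cm[2, ])     # true positive rate
spec <- function(cm) cm[1, 1] / sum(cm[1, ])     # true negative rate

c(sensitivity = sens(confmat_spam), specificity = spec(confmat_spam))
## sensitivity is about 0.878, specificity about 0.950

flipped <- confmat_spam[c(2, 1), c(2, 1)]        # relabel so non-spam is the positive class
c(sensitivity = sens(flipped), specificity = spec(flipped))
## the two values swap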

Why have both precision/recall and sensitivity/specificity? Historically, these measures come from different fields, but each has advantages. Sensitivity/specificity is good for fields like medicine, where it’s important to have an idea of how well a classifier, test, or filter separates positive from negative instances independently of the distribution of the different classes in the population. But precision/recall gives you an idea of how well a classifier or filter will work on a specific population. If you want to know the probability that an email identified as spam is really spam, you have to know how common spam is in that person’s email box, and the appropriate measure is precision.

Summary: Using common classification performance measures

You should use these standard scores while working with your client and sponsor to see which measures best model their business needs. For each score, you should ask them whether they need that score to be high, and then run a quick thought experiment with them to confirm that you’ve captured their business need. You should then be able to write a project goal in terms of a minimum bound on a pair of these measures. Table 6.3 shows a typical business need and an example follow-up question for each measure.

Table 6.3. Classifier performance measures and their typical business stories.

Accuracy
Typical business need: “We need most of our decisions to be correct.”
Follow-up question: “Can we tolerate being wrong 5% of the time? And do users see mistakes like spam marked as non-spam or non-spam marked as spam as being equivalent?”

Precision
Typical business need: “Most of what we marked as spam had darn well better be spam.”
Follow-up question: “That would guarantee that most of what is in the spam folder is in fact spam, but it isn’t the best way to measure what fraction of the user’s legitimate email is lost. We could cheat on this goal by sending all our users a bunch of easy-to-identify spam that we correctly identify. Maybe we really want good specificity.”

Recall
Typical business need: “We want to cut down on the amount of spam a user sees by a factor of 10 (eliminate 90% of the spam).”
Follow-up question: “If 10% of the spam gets through, will the user see mostly non-spam mail or mostly spam? Will this result in a good user experience?”

Sensitivity
Typical business need: “We have to cut a lot of spam; otherwise, the user won’t see a benefit.”
Follow-up question: “If we cut spam down to 1% of what it is now, would that be a good user experience?”

Specificity
Typical business need: “We must be at least three nines on legitimate email; the user must see at least 99.9% of their non-spam email.”
Follow-up question: “Will the user tolerate missing 0.1% of their legitimate email, and should we keep a spam folder the user can look at?”

One conclusion for this dialogue process on spam classification could be to recommend writing the business goals as maximizing sensitivity while maintaining a specificity of at least 0.999.

6.2.4. Evaluating scoring models

Let’s demonstrate evaluation on a simple example.

Example

Suppose you’ve read that the rate at which crickets chirp is proportional to the temperature, so you have gathered some data and fit a model that predicts temperature (in Fahrenheit) from the chirp rate (chirps/sec) of a striped ground cricket. Now you want to evaluate this model.

You can fit a linear regression model to this data, and then make predictions, using the following listing. We will discuss linear regression in detail in chapter 8. Make sure you have the dataset crickets.csv in your working directory.[6]

6

George W. Pierce, The Song of Insects, Harvard University Press, 1948. You can find the dataset here: https://github.com/WinVector/PDSwR2/tree/master/cricketchirps

Listing 6.6. Fitting the cricket model and making predictions
crickets <- read.csv("cricketchirps/crickets.csv")

cricket_model <- lm(temperatureF ~ chirp_rate, data=crickets)
crickets$temp_pred <- predict(cricket_model, newdata=crickets)

Figure 6.14 compares the actual data (points) to the model’s predictions (the line). The differences between the predictions of temperatureF and temp_pred are called the residuals or error of the model on the data. We will use the residuals to calculate some common performance metrics for scoring models.

Figure 6.14. Scoring residuals

Root mean square error

The most common goodness-of-fit measure is called root mean square error (RMSE). The RMSE is the square root of the average squared residuals (also called the mean squared error). RMSE answers the question, “How much is the predicted temperature typically off?” We calculate the RMSE as shown in the following listing.

Listing 6.7. Calculating RMSE
error_sq <- (crickets$temp_pred - crickets$temperatureF)^2
( RMSE <- sqrt(mean(error_sq)) )
## [1] 3.564149

The RMSE is in the same units as the outcome: since the outcome (temperature) is in degrees Fahrenheit, the RMSE is also in degrees Fahrenheit. Here the RMSE tells you that the model’s predictions will typically (that is, on average) be about 3.6 degrees off from the actual temperature. Suppose that you consider a model that typically predicts the temperature to within 5 degrees to be “good.” Then, congratulations! You have fit a model that meets your goals.

RMSE is a good measure, because it is often what the fitting algorithms you’re using are explicitly trying to minimize. In a business setting, a good RMSE-related goal would be “We want the RMSE on account valuation to be under $1,000 per account.”

The quantity mean(error_sq) is called the mean squared error. We will call the quantity sum(error_sq) the sum squared error, and also refer to it as the model’s variance.

R-squared

Another important measure of fit is called R-squared (or R2, or the coefficient of determination). We can motivate the definition of R-squared as follows.

For the data that you’ve collected, the simplest baseline prediction of the temperature is simply the average temperature in the dataset. This is the null model; it’s not a very good model, but you have to perform at least better than it does. The data’s total variance is the sum squared error of the null model. You want the sum squared error of your actual model to be much smaller than the data’s variance—that is, you want the ratio of your model’s sum squared error to the total variance to be near zero. R-squared is defined as one minus this ratio, so we want R-squared to be close to one. This leads to the following calculation for R-squared.

Listing 6.8. Calculating R-squared
error_sq <- (crickets$temp_pred - crickets$temperatureF)^2             1
numerator <- sum(error_sq)                                             2

delta_sq <- (mean(crickets$temperatureF) - crickets$temperatureF)^2    3
denominator = sum(delta_sq)                                            4

(R2 <- 1 - numerator/denominator)                                      5
## [1] 0.6974651

  • 1 Calculates the squared error terms
  • 2 Sums them to get the model’s sum squared error, or variance
  • 3 Calculates the squared error terms from the null model
  • 4 Calculates the data’s total variance
  • 5 Calculates R-squared

As R-squared is formed from a ratio comparing your model’s variance to the total variance, you can think of R-squared as a measure of how much variance your model “explains.” R-squared is also sometimes referred to as a measure of how well the model “fits” the data, or its “goodness of fit.”

The best possible R-squared is 1.0, with near-zero or negative R-squareds being horrible. Some other models (such as logistic regression) use deviance to report an analogous quantity called pseudo R-squared.

Under certain circumstances, R-squared is equal to the square of another measure called the correlation (see http://mng.bz/ndYf). A good statement of an R-squared business goal would be “We want the model to explain at least 70% of variation in account value.”

6.2.5. Evaluating probability models

Probability models are models that both decide if an item is in a given class and return an estimated probability (or confidence) of the item being in the class. The modeling techniques of logistic regression and decision trees are fairly famous for being able to return good probability estimates. Such models can be evaluated on their final decisions, as we’ve already shown in section 6.2.3, but they can also be evaluated in terms of their estimated probabilities.

In our opinion, most of the measures for probability models are very technical and very good at comparing the qualities of different models on the same dataset. It’s important to know them, because data scientists generally use these criteria among themselves. But these criteria aren’t easy to precisely translate into business needs. So we recommend tracking them, but not using them with your project sponsor or client.

To motivate the use of the different metrics for probability models, we’ll continue the spam filter example from section 6.2.3.

Example

Suppose that, while building your spam filter, you try several different algorithms and modeling approaches and come up with several models, all of which return the probability that a given email is spam. You want to compare these different models quickly and identify the one that will make the best spam filter.

In order to turn a probability model into a classifier, you need to select a threshold: items that score higher than that threshold will be classified as spam; otherwise, they are classified as non-spam. The easiest (and probably the most common) threshold for a probability model is 0.5, but the “best possible” classifier for a given probability model may require a different threshold. This optimal threshold can vary from model to model. The metrics in this section compare probability models directly, without having turned them into classifiers. If you make the reasonable assumption that the best probability model will make the best classifier, then you can use these metrics to quickly select the most appropriate probability model, and then spend some time tuning the threshold to build the best classifier for your needs.

The double density plot

When thinking about probability models, it’s useful to construct a double density plot (illustrated in figure 6.15).

Listing 6.9. Making a double density plot
library(WVPlots)
DoubleDensityPlot(spamTest,
                  xvar = "pred",
                  truthVar = "spam",
                  title = "Distribution of scores for spam filter")
Figure 6.15. Distribution of scores broken up by known classes

The x-axis in the figure corresponds to the prediction scores returned by the spam filter. Figure 6.15 illustrates what we’re going to try to check when evaluating estimated probability models: examples in the class should mostly have high scores, and examples not in the class should mostly have low scores.

Double density plots can be useful when picking classifier thresholds, or the threshold score where the classifier switches from labeling an email as non-spam to spam. As we mentioned earlier, the standard classifier threshold is 0.5, meaning that if the probability that an email is spam is greater than one-half, then we label the email as spam. This is the threshold that you used in section 6.2.3. However, in some circumstances you may choose to use a different threshold. For instance, using a threshold of 0.75 for the spam filter will produce a classifier with higher precision (but lower recall), because a higher fraction of emails that scored higher than 0.75 are actually spam.
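You can verify this directly with a quick sketch, assuming spamTest (with its pred column) from listing 6.1 is still in memory:

precision_recall <- function(threshold) {        # precision and recall at a given threshold
  cm <- table(truth = spamTest$spam,
              prediction = ifelse(spamTest$pred > threshold, "spam", "non-spam"))
  c(threshold = threshold,
    precision = cm[2, 2] / sum(cm[, 2]),
    recall    = cm[2, 2] / sum(cm[2, ]))
}

rbind(precision_recall(0.5), precision_recall(0.75))
# precision should rise and recall fall as the threshold moves from 0.5 to 0.75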

The receiver operating characteristic curve and the AUC

The receiver operating characteristic curve (or ROC curve) is a popular alternative to the double density plot. For each different classifier we’d get by picking a different score threshold between spam and not-spam, we plot both the true positive (TP) rate and the false positive (FP) rate. The resulting curve represents every possible trade-off between true positive rate and false positive rate that is available for classifiers derived from this model. Figure 6.16 shows the ROC curve for our spam filter, as produced in the next listing. In the last line of the listing, we compute the AUC or area under the curve, which is another measure of the quality of the model.

Figure 6.16. ROC curve for the email spam example

Listing 6.10. Plotting the receiver operating characteristic curve
library(WVPlots)
ROCPlot(spamTest,                                 1
        xvar = 'pred',
        truthVar = 'spam',
        truthTarget = 'spam',
        title = 'Spam filter test performance')
library(sigr)
calcAUC(spamTest$pred, spamTest$spam=='spam')     2
 ## [1] 0.9660072

  • 1 Plots the receiver operating characteristic (ROC) curve
  • 2 Calculates the area under the ROC curve explicitly
The reasoning behind the AUC

At one end of the spectrum of models is the ideal perfect model that would return a score of 1 for spam emails and a score of 0 for non-spam. This ideal model would form an ROC with three points:

  • (0,0)—Corresponding to a classifier defined by the threshold p = 1: nothing gets classified as spam, so this classifier has a zero false positive rate and a zero true positive rate.
  • (1,1)—Corresponding to a classifier defined by the threshold p = 0: everything gets classified as spam, so this classifier has a false positive rate of 1 and a true positive rate of 1.
  • (0,1)—Corresponding to any classifier defined by a threshold between 0 and 1: everything is classified correctly, so this classifier has a false positive rate of 0 and a true positive rate of 1.

The shape of the ROC for the ideal model is shown in figure 6.17. The area under the curve for this model is 1. A model that returns random scores would have an ROC that is the diagonal line from the origin to the point (1,1): at every threshold, the true positive rate equals the false positive rate. The area under the curve for the random model is 0.5. So you want a model whose AUC is close to 1, and greater than 0.5.

Figure 6.17. ROC curve for an ideal model that classifies perfectly

When comparing multiple probability models, you generally want to prefer models that have a higher AUC. However, you also want to examine the shape of the ROC to explore possible project goal trade-offs. Each point on the curve shows the trade-off between achievable true positive and false positive rates with this model. If you share the information from the ROC curve with your client, they may have an opinion about the acceptable trade-offs between the two.
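The points on the curve can also be computed by hand, as in this sketch (again assuming spamTest from listing 6.1):

roc_point <- function(threshold) {               # one (FPR, TPR) point of the ROC curve
  predicted_spam <- spamTest$pred > threshold
  actually_spam  <- spamTest$spam == "spam"
  c(threshold = threshold,
    false_positive_rate = sum(predicted_spam & !actually_spam) / sum(!actually_spam),
    true_positive_rate  = sum(predicted_spam &  actually_spam) / sum(actually_spam))
}

t(sapply(c(0.25, 0.5, 0.75), roc_point))         # three of the trade-offs this model offers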

Log likelihood

Log likelihood is a measure of how well the model’s predictions “match” the true class labels. It is a non-positive number, where a log likelihood of 0 means a perfect match: the model scores all the spam as being spam with a probability of 1, and all the non-spam as having a probability 0 of being spam. The larger the magnitude of the log likelihood, the worse the match.

The log likelihood of a model’s prediction on a specific instance is the logarithm of the probability that the model assigns to the instance’s actual class. As shown in figure 6.18, for a spam email with an estimated probability of p of being spam, the log likelihood is log(p); for a non-spam email, the same score of p gives a log likelihood of log(1 - p).

Figure 6.18. Log likelihood of a spam filter prediction

The log likelihood of a model’s predictions on an entire dataset is the sum of the individual log likelihoods:

log_likelihood = sum(y * log(py) + (1-y) * log(1 - py))

Here y is the true class label (0 for non-spam and 1 for spam) and py is the probability that an instance is of class 1 (spam). We are using multiplication to select the correct logarithm. We also use the convention that 0 * log(0) = 0 (though for simplicity, this isn’t shown in the code).

Figure 6.19 shows how log likelihood rewards matches and penalizes mismatches between the actual label of an email and the score assigned by the model. For positive instances (spam), the model should predict a value close to 1, and for negative instances (non-spam), the model should predict a value close to 0. When the prediction and the class label match, the contribution to the log likelihood is a small negative number. When they don’t match, the contribution to the log likelihood is a larger negative number. The closer to 0 the log likelihood is, the better the prediction.

Figure 6.19. Log likelihood penalizes mismatches between the prediction and the true class label.

The next listing shows one way to calculate the log likelihood of the spam filter’s predictions.

Listing 6.11. Calculating log likelihood
ylogpy <- function(y, py) {              1
   logpy = ifelse(py > 0, log(py), 0)
  y*logpy
}

y <- spamTest$spam == 'spam'             2

sum(ylogpy(y, spamTest$pred) +           3
       ylogpy(1-y, 1-spamTest$pred))
## [1] -134.9478

  • 1 A function to calculate y * log(py), with the convention that 0 * log(0) = 0
  • 2 Gets the class labels of the test set as TRUE/FALSE, which R treats as 1/0 in arithmetic operations
  • 3 Calculates the log likelihood of the model’s predictions on the test set

The log likelihood is useful for comparing multiple probability models that were evaluated on the same test dataset. Because the log likelihood is an unnormalized sum, its magnitude implicitly depends on the size of the dataset, so you can't directly compare log likelihoods that were computed on different datasets. When comparing multiple models, you generally want to prefer models with a larger (that is, smaller magnitude) log likelihood.

At the very least, you want to compare the model’s performance to the null model of predicting the same probability for every example. The best observable single estimate of the probability of being spam is the observed rate of spam on the training set.

Listing 6.12. Computing the null model’s log likelihood
(pNull <- mean(spamTrain$spam == 'spam'))
## [1] 0.3941588

sum(ylogpy(y, pNull) + ylogpy(1-y, 1-pNull))
## [1] -306.8964

The spam model assigns a log likelihood of -134.9478 to the test set, which is much better than the null model’s -306.8964.

Deviance

Another common measure when fitting probability models is the deviance. The deviance is defined as -2*(logLikelihood-S), where S is a technical constant called “the log likelihood of the saturated model.” In most cases, the saturated model is a perfect model that returns probability 1 for items in the class and probability 0 for items not in the class (so S=0). The lower the deviance, the better the model.

We’re most concerned with ratios of deviance, such as the ratio between the null deviance and the model deviance. These deviances can be used to calculate a pseudo R-squared (see http://mng.bz/j338). Think of the null deviance as how much variation there is to explain, and the model deviance as how much was left unexplained by the model. You want a pseudo R-squared that is close to 1.
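Written out directly, these definitions fit in a couple of lines. The following is a minimal sketch assuming S = 0 (the saturated-model case described above); the sigr functions in the next listing do this kind of calculation for you.

deviance_from_loglik <- function(loglik) -2 * loglik        # deviance when S = 0
pseudo_r2 <- function(dev, null_dev) 1 - dev / null_dev     # share of the null deviance explained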

In the next listing, we show a quick calculation of deviance and pseudo R-squared using the sigr package.

Listing 6.13. Computing the deviance and pseudo R-squared
library(sigr)

(deviance <- calcDeviance(spamTest$pred, spamTest$spam == 'spam'))
## [1] 253.8598
(nullDeviance <- calcDeviance(pNull, spamTest$spam == 'spam'))
## [1] 613.7929
(pseudoR2 <- 1 - deviance/nullDeviance)
## [1] 0.586408

Like the log likelihood, deviance is unnormalized, so you should only compare deviances that are computed over the same dataset. When comparing multiple models, you will generally prefer models with smaller deviance. The pseudo R-squared is normalized (it’s a function of a ratio of deviances), so in principle you can compare pseudo R-squareds even when they were computed over different test sets. When comparing multiple models, you will generally prefer models with larger pseudo R-squareds.

AIC

An important variant of deviance is the Akaike information criterion (AIC). This is equivalent to deviance + 2*numberOfParameters used in the model. The more parameters in the model, the more complex the model is; the more complex a model is, the more likely it is to overfit. Thus, AIC is deviance penalized for model complexity. When comparing models (on the same test set), you will generally prefer the model with the smaller AIC. The AIC is useful for comparing models with different measures of complexity and modeling variables with differing numbers of levels. However, adjusting for model complexity is often more reliably achieved using the holdout and cross-validation methods discussed in section 6.2.1.
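As a small, self-contained illustration of the relationship (using a toy logistic regression on R's built-in mtcars data, not the chapter's spam model), the manual calculation agrees with R's built-in AIC() for a binary glm:

m <- glm(am ~ wt + hp, data = mtcars, family = binomial)   # toy binary classifier
k <- length(coef(m))                                       # number of model parameters
m$deviance + 2 * k                                         # deviance penalized for complexity
AIC(m)                                                     # matches the manual calculation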

So far, we have evaluated models on how well they perform in general: the overall rates at which a model returns correct or incorrect predictions on test data. In the next section, we look at one method for evaluating a model on specific examples, or explaining why a model returns a specific prediction on a given example.

6.3. Local interpretable model-agnostic explanations (LIME) for explaining model predictions

In many people’s opinion, the improved prediction performance of modern machine learning methods like deep learning or gradient boosted trees comes at the cost of decreased explanation. As you saw in chapter 1, a human domain expert can review the if-then structure of a decision tree and compare it to their own decision-making processes to decide if the decision tree will make reasonable decisions. Linear models also have an easily explainable structure, as you will see in chapter 8. However, other methods have far more complex structures that are difficult for a human to evaluate. Examples include the multiple individual trees of a random forest (as in figure 6.20), or the highly connected topology of a neural net.

Figure 6.20. Some kinds of models are easier to manually inspect than others.

If a model evaluates well on holdout data, that is an indication that the model will perform well in the wild—but it’s not foolproof. One potential issue is that the holdout set generally comes from the same source as the training data, and has all the same quirks and idiosyncrasies of the training data. How do you know whether your model is learning the actual concept of interest, or simply the quirks in the data? Or, putting it another way, will the model work on similar data from a different source?

Example

Suppose you want to train a classifier to distinguish documents about Christianity from documents about atheism.

One such model was trained using a corpus of postings from the 20 Newsgroups Dataset, a dataset frequently used for research in machine learning on text. The resulting random forest model was 92% accurate on holdout.[7] On the surface, this seems pretty good.

7

The experiment is described in Ribeiro, Singh, and Guestrin, “‘Why Should I Trust You?’ Explaining the Predictions of Any Classifier,” https://arxiv.org/pdf/1602.04938v1.pdf.

However, delving deeper into the model showed that it was exploiting idiosyncrasies in the data, using the distribution of words like “There” or “Posting” or “edu” to decide whether a post was about Christianity or about atheism. In other words, the model was looking at the wrong features in the data. An example of a classification by this model is shown in figure 6.21.[8]

8

Figure 6.21. Example of a document and the words that most strongly contributed to its classification as “atheist” by the model

In addition, the documents in the corpus seem to have included the names of specific posters. This means the model could also be learning whether a particular frequent poster in the training corpus is a Christian or an atheist, which is not the same as learning whether a text is Christian or atheist. Such a model is unlikely to transfer to a document from a different corpus, with different authors.

Another real-world example is Amazon’s recent attempt to automate resume reviews, using the resumes of people hired by Amazon over a 10-year period as training data.[9] As Reuters reported, the company discovered that their model was discriminating against women. It penalized resumes that included words like “women’s,” and downvoted applicants who had graduated from two particular all-women’s colleges. Researchers also discovered that the algorithm ignored common terms that referred to specific skills (such as the names of computer programming languages), and favored words like executed or captured that were disproportionately used by male applicants.

9

Jeffrey Dastin, “Amazon scraps secret AI recruiting tool that showed bias against women,” Reuters, October 9, 2018, https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G.

In this case, the flaw was not in the machine learning algorithm, but in the training data, which had apparently captured existing biases in Amazon’s hiring practices—which the model then codified. Prediction explanation techniques like LIME can potentially discover such issues.

6.3.1. LIME: Automated sanity checking

In order to detect whether a model is really learning the concept, and not just data quirks, it’s not uncommon for domain experts to manually sanity-check a model by running some example cases through and looking at the answers. Generally, you would want to try a few typical cases, and a few extreme cases, just to see what happens. You can think of LIME as one form of automated sanity checking.

LIME produces an “explanation” of a model’s prediction on a specific datum. That is, LIME tries to determine which features of that datum contributed the most to the model’s decision about it. This helps data scientists attempt to understand the behavior of black-box machine learning models.

To make this concrete, we will demonstrate LIME on two tasks: classifying iris species, and classifying movie reviews.

6.3.2. Walking through LIME: A small example

The first example is iris classification.

Example

Suppose you have a dataset of petal and sepal measurements for three varieties of iris. The object is to predict whether a given iris is a setosa based on its petal and sepal dimensions.

Let’s get the data and split it into test and training.

Listing 6.14. Loading the iris dataset
iris <- iris

iris$class <- as.numeric(iris$Species == "setosa")   1

set.seed(2345)
intrain <- runif(nrow(iris)) < 0.75                  2
train <- iris[intrain,]
test <- iris[!intrain,]

head(train)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species class
## 1          5.1         3.5          1.4         0.2  setosa     1
## 2          4.9         3.0          1.4         0.2  setosa     1
## 3          4.7         3.2          1.3         0.2  setosa     1
## 4          4.6         3.1          1.5         0.2  setosa     1
## 5          5.0         3.6          1.4         0.2  setosa     1
## 6          5.4         3.9          1.7         0.4  setosa     1

  • 1 Setosa is the positive class.
  • 2 Uses 75% of the data for training, the remainder as holdout (test data)

The variables are the length and width of the sepals and petals. The outcome you want to predict is class, which is 1 when the iris is setosa, and 0 otherwise. You will fit a gradient boosting model (from the package xgboost) to predict class.

You will learn about gradient boosting models in detail in chapter 10; for now, we have wrapped the fitting procedure into the function fit_iris_example() that takes as input a matrix of inputs and a vector of class labels, and returns a model that predicts class.[10] The source code for fit_iris_example() is in https://github.com/WinVector/PDSwR2/tree/master/LIME_iris/lime_iris_example.R; in chapter 10, we will unpack how the function works in detail.

10

The xgboost package requires that the input be a numeric matrix, and the class labels be a numeric vector.
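If you're curious what such a wrapper might roughly look like before chapter 10, here is a hedged sketch using the xgboost package; the function name fit_iris_sketch and the parameter choices are illustrative assumptions, not the repository's actual code.

library(xgboost)

fit_iris_sketch <- function(input, class_labels) {
  xgboost(data = input,                     # numeric matrix of predictors
          label = class_labels,             # numeric 0/1 class labels
          nrounds = 20,                     # number of boosting rounds (an assumption)
          objective = "binary:logistic",    # predict class probabilities
          verbose = 0)
}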

To get started, convert the training data to a matrix and fit the model. Make sure that lime_iris_example.R is in your working directory.

Listing 6.15. Fitting a model to the iris training data
source("lime_iris_example.R")                     1

input <- as.matrix(train[, 1:4])                  2
model <- fit_iris_example(input, train$class)

  • 1 Loads the convenience function
  • 2 The input to the model is the first four columns of the training data, converted to a matrix.

After you fit the model, you can evaluate the model on the test data. The model’s predictions are the probability that a given iris is setosa.

Listing 6.16. Evaluating the iris model
predictions <- predict(model, newdata=as.matrix(test[,1:4]))   1

teframe <- data.frame(isSetosa = ifelse(test$class == 1,       2
                                        "setosa",
                                        "not setosa"),
                      pred = ifelse(predictions > 0.5,
                                "setosa",
                                "not setosa"))
with(teframe, table(truth=isSetosa, pred=pred))                3

##             pred
## truth        not setosa setosa
##   not setosa         25      0
##   setosa              0     11

  • 1 Makes predictions on the test data. The predictions are the probability that an iris is a setosa.
  • 2 A data frame of predictions and actual outcome
  • 3 Examines the confusion matrix

Note that all the datums in the test set fall on the diagonal of the confusion matrix: the model correctly labels all setosa examples as “setosa” and all the others as “not setosa.” This model predicts perfectly on the test set! However, you might still want to know which features of an iris are most important when classifying it with your model. Let’s take a specific example from the test dataset and explain it, using the lime package.[11]

11

The lime package does not support every type of model out of the box. See help(model_support) for the list of model classes that it does support (xgboost is one), and how to add support for other types of models. See also LIME’s README (https://cran.r-project.org/web/packages/lime/README.html) for other examples.

First, use the training set and the model to build an explainer: a function that you will use to explain the model’s predictions.

Listing 6.17. Building a LIME explainer from the model and training data
library(lime)
explainer <- lime(train[,1:4],                  1
                      model = model,
                      bin_continuous = TRUE,    2
                      n_bins = 10)              3

  • 1 Builds the explainer from the training data
  • 2 Bins the continuous variables when making explanations
  • 3 Uses 10 bins

Now pick a specific example from the test set.

Listing 6.18. An example iris datum
(example <- test[5, 1:4, drop=FALSE])                    1
##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 30          4.7         3.2          1.6         0.2

test$class[5]
## [1] 1                                                 2

round(predict(model, newdata = as.matrix(example)))
## [1] 1                                                 3

  • 1 A single row data frame
  • 2 This example is a setosa.
  • 3 And the model predicts that it is a setosa.

Now explain the model’s prediction on example. Note that the dplyr package also has a function called explain(), so if you have dplyr in your namespace, you may get a conflict trying to call lime’s explain() function. To prevent this ambiguity, specify the function using namespace notation: lime::explain(...).

Listing 6.19. Explaining the iris example
explanation <- lime::explain(example,
                                explainer,
                                n_labels = 1,     1
                                n_features = 4)   2

  • 1 The number of labels to explain; use 1 for binary classification.
  • 2 The number of features to use when fitting the explanation

You can visualize the explanation using plot_features(), as shown in figure 6.22.

plot_features(explanation)
Figure 6.22. Visualize the explanation of the model’s prediction.

The explainer expects that the model will predict that this example is a setosa (Label = 1), and that the example’s value of Petal.Length is strong evidence supporting this prediction.

How LIME works

In order to better understand LIME’s explanations, and to diagnose when the explanations are trustworthy or not, it helps to understand how LIME works at a high level. Figure 6.23 sketches out the LIME procedure for a classifier at a high level. The figure shows these points:

  • The model’s decision surface. A classifier’s decision surface is the surface in variable space that separates where the model classifies datums as positive (in our example, as “setosa”) from where it classifies them as negative (in our example, as “not setosa”).
  • The datum we want to explain marked as the circled plus in the figure. In the figure, the datum is a positive example. In the explanation that follows, we’ll call this point “the original example,” or example.
  • Synthetic data points that the algorithm creates and gives to the model to evaluate. We’ll detail how the synthetic examples come about.
  • LIME’s estimate of the decision surface near the example we are trying to explain. We’ll detail how LIME comes up with this estimate.
Figure 6.23. Notional sketch of how LIME works

The procedure is as follows:

  1. “Jitter” the original example to generate synthetic examples that are similar to it. You can think of each jittered point as the original example with the value of each variable changed slightly. For example, if the original example is
    Sepal.Length Sepal.Width Petal.Length Petal.Width
             5.1         3.5          1.4         0.2
    then a jittered point might be
    Sepal.Length Sepal.Width Petal.Length Petal.Width
        5.505938    3.422535       1.3551   0.4259682
    To make sure that the synthetic examples are plausible, LIME uses the distributions of the data in the training set to generate the jittered data. For our discussion, we’ll call the set of synthetic examples {s_i}. Figure 6.23 shows the synthetic data as the additional pluses and minuses. Note that the jittering is randomized. This means that running explain() on the same example multiple times will produce different results each time. If LIME’s explanation is reliable, the results should not differ much from run to run, so that the explanations remain qualitatively similar. In our case, it’s likely that Petal.Length will always show up as the variable with the most weight; it’s just the exact value of Petal.Length’s weight and its relationship to the other variables that will vary.
  2. Use the model to make predictions {y_i} on all the synthetic examples. In figure 6.23, the pluses indicate synthetic examples that the model classified as positive, and the minuses indicate synthetic examples that the model classified as negative. LIME will use the values of {y_i} to get an idea of what the decision surface of the model looks like near the original example. In figure 6.23, the decision surface is the large curvy structure that separates the regions where the model classifies datums as positive from the regions where it classifies datums as negative.
  3. Fit an m-dimensional linear model for {y_i} as a function of {s_i}. The linear model is LIME’s estimate of the original model’s decision surface near example, shown as a dashed line in figure 6.23. Using a linear model means that LIME assumes that the model’s decision surface is locally linear (flat) in a small neighborhood around example. You can think of LIME’s estimate as the flat surface (in the figure, it’s a line) that separates the positive synthetic examples from the negative synthetic examples most accurately. The R2 of the linear model (reported as the “explanation fit” in figure 6.22) indicates how well this assumption is met. If the explanation fit is close to 0, then there isn’t a flat surface that separates the positive examples from the negative examples well, and LIME’s explanation is probably not reliable. You specify the value of m with the n_features parameter in the function explain(). In our case, we are using four features (all of them) to fit the linear model. When there is a large number of features (as in text processing), LIME tries to pick the best m features to fit the model. The coefficients of the linear model give us the weights of the features in the explanation. For classification, a large positive weight means that the corresponding feature is strong evidence in favor of the model’s prediction, and a large negative weight means that the corresponding feature is strong evidence against it.
Taking the steps as a whole

This may seem like a lot of steps, but they are all supplied in a convenient wrapper by the lime package. Altogether, the steps answer a simple counterfactual question: how would a given example score if its attributes were slightly different? The explanation summarizes which plausible variations matter most.
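To make the procedure concrete, here is a minimal, hand-rolled sketch of the idea. It is a rough approximation, not the lime package's actual implementation; the function name, its Gaussian jittering, and its proximity weighting are illustrative assumptions.

explain_point_sketch <- function(model, example, train_vars, n_samples = 500) {
  # 1. Jitter: perturb each variable using its spread in the training data
  synth <- as.data.frame(lapply(names(example), function(v) {
    example[[v]] + rnorm(n_samples, mean = 0, sd = sd(train_vars[[v]]))
  }))
  names(synth) <- names(example)

  # 2. Score the synthetic points with the black-box model
  y_synth <- predict(model, newdata = as.matrix(synth))

  # 3. Weight synthetic points by proximity to the original example
  d2 <- rowSums(scale(synth, center = unlist(example), scale = FALSE)^2)
  w <- exp(-d2)

  # 4. Fit a local linear surrogate; its coefficients are the "explanation"
  surrogate <- lm(y_synth ~ ., data = synth, weights = w)
  coef(surrogate)
}

# For instance: explain_point_sketch(model, test[5, 1:4], train[, 1:4])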

Back to the iris example

Let’s pick a couple more examples and explain the model’s predictions on them.

Listing 6.20. More iris examples
(example <- test[c(13, 24), 1:4])

##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 58           4.9         2.4          3.3         1.0
## 110          7.2         3.6          6.1         2.5

test$class[c(13,24)]                                 1
## [1] 0 0

round(predict(model, newdata=as.matrix(example)))    2
## [1] 0 0

explanation <- lime::explain(example,
                             explainer,
                             n_labels = 1,
                             n_features = 4,
                             kernel_width = 0.5)

plot_features(explanation)

  • 1 Both examples are negative (not setosa).
  • 2 The model predicts that both examples are negative.

The explainer expects that the model will predict that both these examples are not setosa (Label = 0). For case 110 (the second row of example and the right side plot of figure 6.24), this is again because of Petal.Length. Case 58 (the left side plot of figure 6.24) seems strange: most of the evidence seems to contradict the expected classification! Note that the explanation fit for case 58 is quite small: it’s an order of magnitude less than the fit for case 110. This tells you that you may not want to trust this explanation.

Figure 6.24. Explanations of the two iris examples

Let’s look at how these three examples compare to the rest of the iris data. Figure 6.25 shows the distribution of petal and sepal dimensions in the data, with the three sample cases marked.

Figure 6.25. Distributions of petal and sepal dimensions by species

It’s clear from figure 6.25 that petal length strongly differentiates setosa from the other species of iris. With respect to petal length, case 30 is obviously setosa, and case 110 is obviously not. Case 58 appears to be not setosa due to petal length, but as noted earlier, the entire explanation of case 58 is quite poor, probably because case 58 sits at some sort of kink on the model’s decision surface.

Now let’s try LIME on a larger example.

6.3.3. LIME for text classification

Example

For this example, you will classify movie reviews from the Internet Movie Database (IMDB). The task is to identify positive reviews.

For convenience, we’ve converted the data from the original archive[12] into two RDS files, IMDBtrain.RDS and IMDBtest.RDS, found at https://github.com/WinVector/PDSwR2/tree/master/IMDB. Each RDS object is a list with two elements: a character vector representing 25,000 reviews, and a vector of numeric labels where 1 means a positive review and 0 a negative review.[13] You will again fit an xgboost model to classify the reviews.

12

The original data can be found at http://s3.amazonaws.com/text-datasets/aclImdb.zip.

13

The extraction/conversion script we used to create the RDS files can be found at https://github.com/WinVector/PDSwR2/tree/master/IMDB/getIMDB.R.

You might wonder how LIME jitters a text datum. It does so by randomly removing words from the document, and then converting the resulting new text into the appropriate representation for the model. If removing a word tends to change the classification of a document, then that word is probably important to the model.
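A rough sketch of that idea (a simplification for illustration, not lime's actual implementation):

# Jitter a document by randomly dropping a fraction of its words
jitter_text <- function(doc, drop_prob = 0.3) {
  words <- unlist(strsplit(doc, "\\s+"))
  keep <- runif(length(words)) >= drop_prob
  paste(words[keep], collapse = " ")
}

# jitter_text("Great story, great music. Too bad there is no soundtrack CD.")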

First, load the training set. Make sure you have downloaded the RDS files into your working directory.

Listing 6.21. Loading the IMDB training data
library(zeallot)                                 1

c(texts, labels) %<-% readRDS("IMDBtrain.RDS")   2

  • 1 Loads the zeallot library. Call install.packages("zeallot") if this fails.
  • 2 The command readRDS("IMDBtrain.RDS") returns a list object. The zeallot assignment arrow %<-% unpacks the list into two elements: texts is a character vector of reviews, and labels is a 0/1 vector of class labels. The label 1 designates a positive review.

You can examine the reviews and their corresponding labels. Here’s a positive review:

list(text = texts[1], label = labels[1])
## $text
## train_21317
## train_21317
## "Forget depth of meaning, leave your logic at the door, and have a
## great time with this maniacally funny, totally absurdist, ultra-
## campy live-action "cartoon". MYSTERY MEN is a send-up of every
## superhero flick you've ever seen, but its unlikelysuper-wannabes
## are so interesting, varied, and well-cast that they are memorable
## characters in their own right. Dark humor, downright silliness,
## bona fide action, and even a touchingmoment or two, combine to
## make this comic fantasy about lovable losers a true winner. The
## comedic talents of the actors playing the Mystery Men --
## including one Mystery Woman -- are a perfect foil for Wes Studi
## as what can only be described as a bargain-basement Yoda, and
## Geoffrey Rush as one of the most off-the-wall (and bizarrely
## charming) villains ever to walk off the pages of a Dark Horse
## comic book and onto the big screen. Get ready to laugh, cheer,
## and say "huh?" more than once.... enjoy!"
##
## $label
## train_21317
##           1

Here’s a negative review:

list(text = texts[12], label = labels[12])
## $text
## train_385
## train_385
## "Jameson Parker And Marilyn Hassett are the screen's most unbelievable
## couple since John Travolta and Lily Tomlin. Larry Peerce's direction
## wavers uncontrollably between black farce and Roman tragedy. Robert
## Klein certainly think it's the former and his self-centered  performance
## in a minor role underscores the total lack of balance and chemistry
## between the players in the film. Normally, I don't like to let myself
## get so ascerbic, but The Bell Jar is one of my all-time favorite books,
## and to watch what they did with it makes me literally crazy."
##
## $label
## train_385
##         0
Representing documents for modeling

For our text model, the features are the individual words, and there are a lot of them. To use xgboost to fit a model on texts, we have to build a finite feature set, or the vocabulary. The words in the vocabulary are the only features that the model will consider.

We don’t want to use words that are too common, because common words that show up in both positive reviews and negative reviews won’t be informative. We also don’t want to use words that are too rare, because a word that rarely shows up in a review is not that useful. For this task, let’s define “too common” as words that show up in more than half the training documents, and “too rare” as words that show up in fewer than 0.1% of the documents.

We’ll build a vocabulary of 10,000 words that are not too common or too rare, using the package text2vec. For brevity, we’ve wrapped the procedure in the function create_pruned_vocabulary(), which takes a vector of documents as input and returns a vocabulary object. The source code for create_pruned_vocabulary() is in https://github.com/WinVector/PDSwR2/tree/master/IMDB/lime_imdb_example.R.

Once we have the vocabulary, we have to convert the texts (again using text2vec) into a numeric representation that xgboost can use. This representation is called a document-term matrix, where the rows represent each document in the corpus, and each column represents a word in the vocabulary. For a document-term matrix dtm, the entry dtm[i, j] is the number of times that the vocabulary word w[j] appeared in document texts[i]. See figure 6.26. Note that this representation loses the order of the words in the documents.

Figure 6.26. Creating a document-term matrix
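To make figure 6.26 concrete, here is a tiny hand-rolled illustration on toy data (not the IMDB corpus, and not the chapter's actual tooling):

toy_texts <- c("great movie great fun", "boring movie")
toy_vocab <- c("great", "movie", "fun", "boring")

dtm_toy <- t(sapply(toy_texts, function(d) {
  words <- unlist(strsplit(d, " "))
  vapply(toy_vocab, function(w) sum(words == w), numeric(1))   # count each vocabulary word
}))

dtm_toy
##                       great movie fun boring
## great movie great fun     2     1   1      0
## boring movie              0     1   0      1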

The document-term matrix will be quite large: 25,000 rows by 10,000 columns. Luckily, most words in the vocabulary won’t show up in a given document, so each row will be mostly zeros. This means that we can use a special representation called a sparse matrix, of class dgCMatrix, that represents large, mostly zero matrices in a space-efficient way.

We’ve wrapped this conversion in the function make_matrix() that takes as input a vector of texts and a vocabulary, and returns a sparse matrix. As in the iris example, we’ve also wrapped the model fitting into a function fit_imdb_model() that takes as input a document term matrix and the numeric document labels, and returns an xgboost model. The source code for these functions is also in https://github.com/WinVector/PDSwR2/tree/master/IMDB/lime_imdb_example.R.
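Here is a hedged sketch of roughly what those wrappers do with text2vec; the real code in lime_imdb_example.R differs in details such as tokenization and stopword handling.

library(text2vec)

# Build a pruned vocabulary from a character vector of documents
it <- itoken(texts, preprocessor = tolower, tokenizer = word_tokenizer)
vocab_sketch <- create_vocabulary(it)
vocab_sketch <- prune_vocabulary(vocab_sketch,
                                 doc_proportion_max = 0.5,    # drop words in more than half the documents
                                 doc_proportion_min = 0.001,  # drop words in fewer than 0.1% of documents
                                 vocab_term_max = 10000)      # keep at most 10,000 words

# Convert documents to a sparse document-term matrix using that vocabulary
vectorizer <- vocab_vectorizer(vocab_sketch)
dtm_sketch <- create_dtm(it, vectorizer)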

6.3.4. Training the text classifier

After you download lime_imdb_example.R into your working directory, you can create the vocabulary and a document-term matrix from the training data, and fit the model. This may take a while.

Listing 6.22. Converting the texts and fitting the model
source("lime_imdb_example.R")

vocab <- create_pruned_vocabulary(texts)      1
dtm_train <- make_matrix(texts, vocab)        2
model <- fit_imdb_model(dtm_train, labels)    3

  • 1 Creates the vocabulary from the training data
  • 2 Creates the document-term matrix of the training corpus
  • 3 Trains the model

Now load the test corpus and evaluate the model.

Listing 6.23. Evaluate the review classifier
c(test_txt, test_labels) %<-%  readRDS("IMDBtest.RDS")               1
dtm_test <- make_matrix(test_txt, vocab)                             2

predicted <- predict(model, newdata=dtm_test)                        3

teframe <- data.frame(true_label = test_labels,
                         pred = predicted)                           4

(cmat <- with(teframe, table(truth=true_label, pred=pred > 0.5)))    5

##      pred
## truth FALSE  TRUE
##     0 10836  1664
##     1  1485 11015

sum(diag(cmat))/sum(cmat)                                            6
## [1] 0.87404

library(WVPlots)
DoubleDensityPlot(teframe, "pred", "true_label",
                  "Distribution of test prediction scores")          7

  • 1 Reads in the test corpus
  • 2 Converts the corpus to a document-term matrix
  • 3 Makes predictions (probabilities) on the test corpus
  • 4 Creates a frame with true and predicted labels
  • 5 Computes the confusion matrix
  • 6 Computes the accuracy
  • 7 Plots the distribution of predictions

Based on its performance on the test set, the model does a good, but not perfect, job of classifying reviews. The distribution of test prediction scores (figure 6.27) shows that most negative (class 0) reviews have low scores, and most positive (class 1) reviews have high scores. However, some positive reviews get scores near 0, and some negative reviews get scores near 1. And some reviews have scores near 0.5, meaning the model isn’t sure about them at all. You would like to improve the classifier to do a better job on these seemingly ambiguous reviews.

Figure 6.27. Distribution of test prediction scores

6.3.5. Explaining the classifier’s predictions

Try explaining the predictions for a few example reviews to get some insight into the model. First, build the explainer from the training data and the model. For text models, the lime() function takes a preprocessor function that converts the training texts and the synthetic examples to a document-term matrix for the model.

Listing 6.24. Building an explainer for a text classifier
explainer <- lime(texts, model = model,
                  preprocess = function(x) make_matrix(x, vocab))

Now take a short sample text from the test corpus. This review is positive, and the model predicts that it is positive.

Listing 6.25. Explaining the model’s prediction on a review
casename <- "test_19552";
sample_case <- test_txt[casename]
pred_prob <- predict(model, make_matrix(sample_case, vocab))
list(text = sample_case,
     label = test_labels[casename],
     prediction = round(pred_prob) )

## $text
## test_19552
## "Great story, great music. A heartwarming love story that's beautiful to
## watch and delightful to listen to. Too bad there is no soundtrack CD."
##
## $label
## test_19552
##          1
##
## $prediction
## [1] 1

Now explain the model’s classification in terms of the five most evidential words. The words that affect the prediction the most are shown in figure 6.28.

Figure 6.28. Explanation of the prediction on the sample review

Listing 6.26. Explaining the model’s prediction
explanation <- lime::explain(sample_case,
                       explainer,
                       n_labels = 1,
                       n_features = 5)

plot_features(explanation)

In listing 6.26, you used plot_features() to visualize the explanation, as you did in the iris example, but lime also has a special visualization for text, plot_text_explanations().

As shown in figure 6.29, plot_text_explanations() highlights the key words within the text, green for supporting evidence, and red for contradictory. The stronger the evidence, the darker the color. Here, the explainer expects that the model will predict that this review is positive, based on the words delightful, great, and beautiful, and in spite of the word bad.

plot_text_explanations(explanation)

Figure 6.29. Text explanation of the prediction in listing 6.26

Let’s look at a couple more reviews, including one that the model misclassified.

Listing 6.27. Examining two more reviews
casenames <-  c("test_12034", "test_10294")
sample_cases <- test_txt[casenames]
pred_probs <- predict(model, newdata=make_matrix(sample_cases, vocab))
list(texts = sample_cases,
     labels = test_labels[casenames],
     predictions = round(pred_probs))

## $texts
## test_12034
## "I don't know why I even watched this film. I think it was because
## I liked the idea of the scenery and was hoping the film would be
## as good. Very boring and pointless."
##
## test_10294
## "To anyone who likes the TV series: forget the movie. The jokes
## are bad and some topics are much too sensitive to laugh about it.
## <br /><br />We have seen much better acting by R. Dueringer in
## "Hinterholz 8"".
##
## $labels
## test_12034 test_10294                          1
##          0          0
##
## $predictions                                   2
## [1] 0 1

explanation <- lime::explain(sample_cases,
                                    explainer,
                                    n_labels = 1,
                                    n_features = 5)

plot_features(explanation)
plot_text_explanations(explanation)

  • 1 Both these reviews are negative.
  • 2 The model misclassified the second review.

As shown in figure 6.30, the explainer expects that the model will classify the first review as negative, based mostly on the words pointless and boring. It expects that the model will classify the second review as positive, based on the words 8, sensitive, and seen, and in spite of the words bad and (somewhat surprisingly) better.

Figure 6.30. Explanation visualizations for the two sample reviews in listing 6.27

Note that according to figure 6.30, the probability of the classification of the second review appears to be 0.51—in other words, the explainer expects that the model won’t be sure of its prediction at all. Let’s compare this to what the model predicted in reality:

predict(model, newdata=make_matrix(sample_cases[2], vocab))
## [1] 0.6052929

The model actually predicts the label 1 with probability 0.6: not a confident prediction, but slightly more confident than the explainer estimated (though still wrong). The discrepancy is because the label and probability that the explainer returns are from the predictions of the linear approximation to the model, not from the model itself. You may occasionally even see cases where the explainer and the model return different labels for the same example. This will usually happen when the explanation fit is poor, so you don’t want to trust those explanations, anyway.

As the data scientist responsible for classifying reviews, you may wonder about the seemingly high importance of the number 8. On reflection, you might remember that some movie reviews include the ratings “8 out of 10,” or “8/10.” This may lead you to consider extracting apparent ratings out of the reviews before passing them to the text processor, and adding them to the model as an additional special feature. You may also not like using words like seen or idea as features.

As a simple experiment, you can try removing the numbers 1 through 10 from the vocabulary,[14] and then refitting the model. The new model correctly classifies test_10294 and returns a more reasonable explanation, as shown in figure 6.31.

14

This involves adding the numbers 1 through 10 as strings to the stopword list in the function create_pruned_vocabulary() in the file lime_imdb_example.R. We leave recreating the vocabulary and document-term matrices, and refitting the review classifier, as an exercise for the reader.
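If the vocabulary is built with text2vec as sketched earlier, the change amounts to something like the following. This is a hedged sketch; the repository's create_pruned_vocabulary() may structure its stopword handling differently.

number_stopwords <- as.character(1:10)    # "1" through "10" as strings

# inside the vocabulary-building step:
# vocab_sketch <- create_vocabulary(it, stopwords = number_stopwords)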

Figure 6.31. Explanation visualizations for test_10294

Looking at the explanations of other reviews that the model misclassifies can lead you to improved feature engineering or data preprocessing that can potentially improve your model. You may decide that sequences of words (good idea, rather than just idea) make better features. Or you may decide that you want a text representation and model that looks at the order of the words in the document rather than just word frequencies. In any case, looking at explanations of a model’s predictions on corner cases can give you insight into your model, and help you decide how to better achieve your modeling goals.

Summary

You now have some solid ideas on how to choose among modeling techniques. You also know how to evaluate the quality of data science work, be it your own or that of others. The remaining chapters of part 2 of the book will go into more detail on how to build, test, and deliver effective predictive models. In the next chapter, we’ll actually start building predictive models.

In this chapter you have learned

  • How to match the problem you want to solve to appropriate modeling approaches.
  • How to partition your data for effective model evaluation.
  • How to calculate various measures for evaluating classification models.
  • How to calculate various measures for evaluating scoring (regression) models.
  • How to calculate various measures for evaluating probability models.
  • How to use the lime package to explain individual predictions from a model.