5 Local Model-Agnostic Interpretation Methods

In the previous two chapters, we dealt exclusively with global interpretation methods. This chapter will foray into local interpretation methods, which explain why a single prediction or a group of predictions was made. It will cover how to leverage SHapley Additive exPlanations' (SHAP's) KernelExplainer as well as another method, Local Interpretable Model-agnostic Explanations (LIME), for local interpretations. We will also explore how to use these methods with both tabular and text data.

These are the main topics we are going to cover in this chapter:

  • Leveraging SHAP's KernelExplainer for local interpretations with SHAP values
  • Employing LIME
  • Using LIME for natural language processing (NLP)
  • Trying SHAP for NLP
  • Comparing SHAP with LIME

Technical requirements

This chapter's example uses the mldatasets, pandas, numpy, sklearn, nltk, lightgbm, rulefit, matplotlib, seaborn, shap, and lime libraries. Instructions on how to install all of these libraries are in the preface of the book. The code for this chapter is located here: https://github.com/PacktPublishing/Interpretable-Machine-Learning-with-Python/tree/master/Chapter06

The mission

Who doesn't love chocolate?! It's a global favorite, with around nine out of ten people loving it and about a billion people eating it every day. One popular form in which it is consumed is as a chocolate bar. However, even universally beloved ingredients can be used in ways that aren't universally appealing—so, chocolate bars can range from the sublime to the mediocre, to downright unpleasant. Often, this is solely determined by the quality of the cocoa or additional ingredients, and sometimes it becomes an acquired taste once it's combined with exotic flavors.

A French chocolate manufacturer who is obsessed with excellence has reached out to you. They have a problem. All of their bars have been highly rated by critics, yet critics have very particular taste buds. And some bars they love have inexplicably mediocre sales, but non-critics seem to like them in focus groups and tastings, so they are puzzled why sales don't coincide with their market research. They have found a dataset of chocolate bars rated by knowledgeable lovers of chocolate, and these ratings happen to coincide with their sales. To get an unbiased opinion, they have sought your expertise.

As for the dataset, members of the Manhattan Chocolate Society have been meeting since 2007 for the sole purpose of tasting and judging fine chocolate, to educate consumers and inspire chocolate makers to produce higher-quality chocolate. Since then, they have compiled a dataset of over 2,200 chocolate bars, rated by their members with the following scale:

  • 4.0 - 5.00 = Outstanding
  • 3.5 - 3.99 = Highly Recommended
  • 3.0 - 3.49 = Recommended
  • 2.0 - 2.99 = Disappointing
  • 1.0 - 1.99 = Unpleasant

These ratings are derived from a rubric that factors in aroma, appearance, texture, flavor, aftertaste, and overall opinion, and the bars rated are mostly darker chocolate bars since the aim is to appreciate the flavors of cacao. In addition to the ratings, the Manhattan Chocolate Society dataset includes many characteristics, such as the country where the cocoa bean was farmed, how many ingredients the bar has, whether it includes salt, and the words used to describe it.

The goal is to understand why one of the chocolate manufacturer's bars is rated Outstanding yet sells poorly, while another one, whose sales are impressive, is rated as Disappointing.

The approach

You have decided to use local model interpretation to explain why each bar is rated as it is. To that end, you will prepare the dataset and then train classification models to predict if chocolate-bar ratings are above or equal to Highly Recommended, because the client would like all their bars to fall above this threshold. You will need to train two models: one for tabular data, and another NLP one for the words used to describe the chocolate bars. We will employ support vector machines (SVMs) and Light Gradient Boosting Machine (LightGBM), respectively, for these tasks. If you haven't used these black-box models, no worries—we will briefly explain them. Once you train the models, then comes the fun part: leverage two local model-agnostic interpretation methods to understand what makes a specific chocolate bar Highly Recommended or not. These methods are SHAP and LIME, which when combined will provide a richer explanation to convey back to your client. Then, we will compare both methods to understand their strengths and limitations.

The preparations

Loading the libraries

To run this example, you need to install the following libraries:

  • mldatasets to load the dataset
  • pandas, numpy, and nltk to manipulate it
  • sklearn (scikit-learn) and lightgbm to split the data and fit the models
  • matplotlib, seaborn, shap, and lime to visualize the interpretations

You should load all of them first, as follows:

import math
import mldatasets
import pandas as pd
import numpy as np
import re
import nltk
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn import metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer
import lightgbm as lgb
import matplotlib.pyplot as plt
import seaborn as sns
import shap
import lime
import lime.lime_tabular
from lime.lime_text import LimeTextExplainer

Understanding and preparing the data

We load the data into a dataframe we call chocolateratings_df, like this:

chocolateratings_df = mldatasets.load("chocolate-bar-ratings_v2")

There should be over 2,200 records and 18 columns. We can verify this was the case simply by inspecting the contents of the dataframe, like this:

chocolateratings_df

The output shown here in Figure 5.1 corresponds to what we were expecting:

Figure 5.1 – Contents of chocolate-bar dataset

The data dictionary

The data dictionary comprises the following:

  • company: Categorical; the manufacturer of the chocolate bar (out of over 500 different ones)
  • company_location: Categorical; country of the manufacturer (66 different countries)
  • review_date: Continuous; year in which the bar was reviewed (from 2006 to 2020)
  • country_of_bean_origin: Categorical; country where the cocoa beans were harvested (62 different countries)
  • cocoa_percent: Continuous; what percentage of the bar is cocoa
  • rating: Continuous; rating given by the Manhattan Chocolate Society (possible values: 1-5)
  • counts_of_ingredients: Continuous; the number of ingredients in the bar
  • cocoa_butter: Binary; was it made with cocoa butter?
  • vanilla: Binary; was it made with vanilla?
  • lecithin: Binary; was it made with lecithin?
  • salt: Binary; was it made with salt?
  • sugar: Binary; was it made with sugar?
  • sweetener_without_sugar: Binary; was it made with sweetener without sugar?
  • first_taste: Text; word(s) used to describe the first taste
  • second_taste: Text; word(s) used to describe the second taste
  • third_taste: Text; word(s) used to describe the third taste
  • fourth_taste: Text; word(s) used to describe the fourth taste

Now that we have taken a peek at the data, we can quickly prepare this and then work on the modeling and interpretation!

Data preparation

The first thing we ought to do is set aside the text features so that we can process them separately. We can start by creating a dataframe called tastes_df with them and then drop them from chocolateratings_df. We can then take a look at tastes_df using head and tail, as illustrated in the following code snippet:

tastes_df = chocolateratings_df[['first_taste', 'second_taste',
                                 'third_taste', 'fourth_taste']]
chocolateratings_df = chocolateratings_df.drop(
    ['first_taste', 'second_taste', 'third_taste', 'fourth_taste'],
    axis=1)
tastes_df

The preceding code produces the dataframe shown here in Figure 5.2:

Figure 5.2 – Tastes columns have quite a few null values

Now, let's encode the categorical features. There are too many countries in company_location and country_of_bean_origin, so let's establish a threshold: if a country accounts for fewer than 3.333% of rows (roughly 74), we bucket it into an Other category and then encode the categories. We can easily do this with the make_dummies_with_limits function, as shown in the following code snippet:

chocolateratings_df = mldatasets.make_dummies_with_limits(
    chocolateratings_df, 'company_location', 0.03333)
chocolateratings_df = mldatasets.make_dummies_with_limits(
    chocolateratings_df, 'country_of_bean_origin', 0.03333)

Now, to process the contents of tastes_df, the following code replaces all the null values with empty strings, then joins all the columns in tastes_df together, forming a single series. Then, it strips leading and trailing whitespace. The code is illustrated in the following snippet:

tastes_s = (tastes_df.replace(np.nan, '', regex=True)
                     .agg(' '.join, axis=1).str.strip())

And voilà! You can verify that the result is a pandas series (tastes_s) with (mostly) taste-related adjectives by printing it. As expected, this series is the same length as the chocolateratings_df dataframe, as illustrated in its output:

0          cocoa blackberry robust
1             cocoa vegetal savory
2                rich fatty bready
3              fruity melon roasty
4                    vegetal nutty
                   ...            
2221       muted roasty accessible
2222    fatty mild nuts mild fruit
2223            fatty earthy cocoa
Length: 2224, dtype: object

But let's find out how many of its phrases are unique, with print(np.unique(tastes_s).shape). Since the output is (2178,) it means fewer than 50 phrases are duplicated, so tokenizing by phrases would be a bad idea.

There are many approaches you could take here, such as tokenizing by bi-grams (sequences of two words) or even subwords (dividing words into logical parts). However, even though order matters slightly (because the first words had to do with the first taste, and so on), our dataset is too small and has too many nulls (especially in third_taste and fourth_taste) to derive meaning from the order. This is why it was a good choice to concatenate all the "tastes" together, thus removing their discernible division.

Another thing to note is that our words are (mostly) adjectives. We made a small effort to remove adverbs, but there are still some nouns present, such as "fruit" and "nuts", versus adjectives such as "fruity" and "nutty". We can't be sure if the chocolate connoisseurs who judged the bars meant something different by using "fruit" rather than "fruity". However, if we were sure of this, we could have performed stemming or lemmatization to turn all instances of "fruit", "fruity", and "fruitiness" to a consistent "fru" (stem) or "fruiti" (lemma). We won't concern ourselves with this because many of our adjectives' variations are not as common in the phrases anyway.
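If you did want to go down that path, a minimal sketch using NLTK's stemmer and lemmatizer might look like the following (the exact outputs depend on the algorithms, and the variants won't necessarily collapse into a single token):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')  # corpus required by the lemmatizer
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ['fruit', 'fruity', 'fruitiness']:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word))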

Let's find out the most common words by first tokenizing them with word_tokenize and using FreqDist to count their frequency. We can then place the resulting tastewords_fdist dictionary into a dataframe (tastewords_df). We can save only those words with more than 74 instances as a list (commontastes_l). The code is illustrated in the following snippet:

tastewords_fdist = FreqDist(
    word for word in word_tokenize(tastes_s.str.cat(sep=' ')))
tastewords_df = pd.DataFrame.from_dict(
    tastewords_fdist, orient='index').rename(columns={0:'freq'})
commontastes_l = tastewords_df[tastewords_df.freq > 74].index.to_list()
print(commontastes_l)
print(commontastes_l)

As you can tell from the following output for commontastes_l, the most common words are mostly distinct from one another (except for spice and spicy):

['cocoa', 'rich', 'fatty', 'roasty', 'nutty', 'sweet', 'sandy', 'sour', 'intense', 'mild', 'fruit', 'sticky', 'earthy', 'spice', 'molasses', 'floral', 'spicy', 'woody', 'coffee', 'berry', 'vanilla', 'creamy']

Something we can do with this list to enhance our tabular dataset is to turn these common words into binary features. In other words, there would be a column for each one of these "common tastes" (commontastes_l), and if the "tastes" for the chocolate bar include it, the column would have a 1, otherwise a 0. Fortunately, we can easily do this with two lines of code. First, we create a new column with our text-tastes series (tastes_s). Then, we use the make_dummies_from_dict function we used in the last chapter to generate the dummy features by looking for each "common taste" in the contents of our new column, as follows:

chocolateratings_df['tastes'] = tastes_s
chocolateratings_df = mldatasets.make_dummies_from_dict(
    chocolateratings_df, 'tastes', commontastes_l)

Now that we are done with our feature engineering, we can examine the dataframe with chocolateratings_df.info(). The output has all numeric, non-null features except for company. There are over 500 companies, so categorically encoding this feature would be complicated and, because it would be advisable to bucket most companies as Other, it would likely introduce bias toward the few companies that are most represented. Therefore, it's better to remove this column altogether. The output is shown here:

RangeIndex: 2224 entries, 0 to 2223
Data columns (total 46 columns):
#   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
0   company                     2224 non-null   object
1   review_date                 2224 non-null   int64  
2   cocoa_percent               2224 non-null   float64
:        :                         :     :        :
43  tastes_berry                2224 non-null   int64  
44  tastes_vanilla              2224 non-null   int64  
45  tastes_creamy               2224 non-null   int64  
dtypes: float64(2), int64(30), object(1), uint8(13)

Our last step to prepare the data for modeling starts with initializing rand, a constant to serve as our "random state" throughout this exercise. Then, we define y as the rating column converted to 1s if greater than or equal to 3.5, and 0 otherwise. X is everything else (excluding company). Then, we split X and y into train and test datasets with train_test_split, as illustrated in the following code snippet:

rand = 9
y = chocolateratings_df['rating'].apply(lambda x: 1 if x >= 3.5 else 0)
X = chocolateratings_df.drop(['rating','company'], axis=1).copy()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=rand)

In addition to the tabular train and test datasets, for our NLP model we will need text-only feature datasets that are consistent with our train_test_split so that we can use the same y labels. We can achieve this by subsetting our tastes series (tastes_s) with the indexes of our X_train and X_test sets, yielding NLP-specific versions of the series, as follows:

X_train_nlp = tastes_s[X_train.index]
X_test_nlp = tastes_s[X_test.index]

OK! We are all set now. Let's start modeling and interpreting our models!

Leveraging SHAP's KernelExplainer for local interpretations with SHAP values

For this section, and for subsequent use, we will train a Support Vector Classifier (SVC) model first.

Training a C-SVC model

SVM is a family of model classes that operate in high-dimensional space to find an optimal hyperplane, where they attempt to separate the classes with the maximum margin between them. Support vectors are the points closest to the decision boundary (the dividing hyperplane) that would change it if they were removed. To find the best hyperplane, they use a cost function called hinge loss, along with a computationally cheap method for operating in high-dimensional space called the kernel trick; and even though a hyperplane suggests linear separability, SVMs are not limited to a linear kernel.

The scikit-learn implementation we will use is called C-SVC. SVC uses an L2 regularization parameter called C and, by default, uses a kernel called the radial basis function (RBF), which is decidedly non-linear. For an RBF, a gamma hyperparameter defines the radius of influence of each training example in the kernel, but in an inversely proportional fashion. Hence, a low value increases the radius, while a high value decreases it.
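To make the previous two paragraphs more concrete, here is a minimal sketch of hinge loss and the RBF kernel. These are illustrative functions of ours, not scikit-learn's implementation:

import numpy as np

def hinge_loss(y_pm1, decision_scores):
    # y_pm1 holds labels encoded as -1/+1; decision_scores are the signed
    # scores relative to the hyperplane. A sample is only penalized when
    # it falls on the wrong side of the margin (y * score < 1)
    return np.mean(np.maximum(0, 1 - y_pm1 * decision_scores))

def rbf_kernel(x1, x2, gamma):
    # K(x1, x2) = exp(-gamma * ||x1 - x2||^2). A higher gamma shrinks each
    # training example's radius of influence; a lower gamma widens it
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))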

The SVM family includes several variations for classification, and it even covers regression through support vector regression (SVR). The most significant advantage of SVM models is that they tend to work effectively and efficiently when there are many features compared to the observations, and even when the features outnumber the observations! They also tend to find latent non-linear relationships in the data without overfitting or becoming unstable. However, SVMs don't scale as well to larger datasets, and their hyperparameters are hard to tune.

Since we will use seaborn plot styling, which is activated with set(), for some of this chapter's plots, we will first save the original matplotlib settings (rcParams) so that we can restore them later. One thing to note about SVC is that it doesn't natively produce probabilities because it scores points by their distance to the separating hyperplane. However, if probability=True, the scikit-learn implementation uses cross-validation and then fits a logistic regression model to the SVC's scores to produce the probabilities. We are also using gamma='auto', which means it is set to 1 divided by the number of features, so 1/44. As always, it is recommended to set your random_state parameter for reproducibility. Once we fit the model to the training data, we can use evaluate_class_mdl to evaluate our model's predictive performance, as illustrated in the following code snippet:

svm_mdl = svm.SVC(probability=True, gamma='auto', random_state=rand)
fitted_svm_mdl = svm_mdl.fit(X_train, y_train)
y_train_svc_pred, y_test_svc_prob, y_test_svc_pred =\
    mldatasets.evaluate_class_mdl(fitted_svm_mdl, X_train, X_test,
                                  y_train, y_test)

The preceding code produces the output shown here in Figure 5.3:

Figure 5.3 – Predictive performance of our SVC model

The performance achieved (see Figure 5.3) is not bad, considering this is a small, imbalanced dataset in an already challenging domain for machine learning models: user ratings. In any case, the Area Under the Curve (AUC) is above the dotted coin-toss line, and the Matthews correlation coefficient (MCC) is safely above 0. More importantly, precision is substantially higher than recall, and this is very good given the hypothetical cost of misclassifying a lousy chocolate bar as Highly Recommended. We favor precision over recall because we would prefer to have fewer false positives than false negatives.

Computing SHAP values using KernelExplainer

Given how computationally intensive calculating SHAP values by brute force can be, the SHAP library takes many statistically valid shortcuts. As we learned in Chapter 4, Global Model-Agnostic Interpretation Methods, these shortcuts range from leveraging a decision tree's structure (TreeExplainer), to the difference between a neural network's activations and a baseline (DeepExplainer), to a neural network's gradients (GradientExplainer). These shortcuts make the explainers significantly less model-agnostic since they are limited to a family of model classes. However, there is a truly model-agnostic explainer in SHAP, called the KernelExplainer.

KernelExplainer takes two shortcuts: it samples only a subset of all possible feature coalitions, and it uses a weighting scheme based on the size of the coalition to compute SHAP values. The first shortcut is a recommended technique to reduce computation time. The second one is drawn from LIME's weighting scheme, which we will cover next in this chapter, and the authors of SHAP did this so that it remains compliant with Shapley values. However, for "missing" features in the coalition, it randomly samples from the features' values in a background training dataset, which violates the dummy property of Shapley values. More importantly, as with permutation feature importance, if there's multicollinearity, it puts too much weight on unlikely instances. Despite this near-fatal flaw, KernelExplainer has all the other benefits of Shapley values, as well as one of LIME's main advantages: its weighting scheme.
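The coalition weighting is worth seeing in code. Here is a minimal sketch of the Shapley kernel weight as defined in the Kernel SHAP paper (the function name is ours):

from math import comb, inf

def shapley_kernel_weight(num_features, coalition_size):
    # Empty and full coalitions get infinite weight (enforced as
    # constraints in practice); mid-sized coalitions weigh the least
    if coalition_size in (0, num_features):
        return inf
    return (num_features - 1) / (comb(num_features, coalition_size) *
                                 coalition_size *
                                 (num_features - coalition_size))

# With 10 features, coalitions of size 1 or 9 weigh far more than size 5
print([round(shapley_kernel_weight(10, s), 4) for s in range(1, 10)])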

Before we engage with KernelExplainer, it's important to note that for classification models, it yields a list of SHAP value arrays, one per class, which you access with an index. Confusion may arise if this index is not in the order you expect, because it's in the order provided by the model. So, it is essential to confirm the order of the classes in your model by running print(svm_mdl.classes_).

The output array([0, 1]) tells you that Not Highly Recommended has an index of 0, as you would expect, and Highly Recommended has an index of 1. We are interested in the SHAP values for the latter because this is what we are trying to predict.

KernelExplainer takes a predict function for a model (fitted_svm_mdl.predict_proba) and some background training data (X_train_summary). KernelExplainer strongly suggests other measures to minimize computation. One of these is using k-means to summarize the background training data instead of using it whole. Another method could be using a sample of the training data. In this case, we opted for k-means clustering into 10 centroids. Once we have initialized our explainer, we can use samples of our test dataset (nsamples=200) to come up with the SHAP values. It uses L1 regularization (l1_reg) during the fitting process. What we are telling it here is to regularize to a point where it only has 20 relevant features. Lastly, we can use a summary_plot to plot our SHAP values for class 1. The code is illustrated in the following snippet:

np.random.seed(rand)
X_train_summary = shap.kmeans(X_train, 10)
shap_svm_explainer = shap.KernelExplainer(
    fitted_svm_mdl.predict_proba, X_train_summary)
shap_svm_values_test = shap_svm_explainer.shap_values(
    X_test, nsamples=200, l1_reg="num_features(20)")
shap.summary_plot(shap_svm_values_test[1], X_test, plot_type="dot")

The preceding code produces the output shown in Figure 5.4. Even though the point of this chapter is local model interpretation, it's important to start with the global form of this to make sure outcomes are intuitive. If they aren't, perhaps something is amiss.

Figure 5.4 – Global model interpretation with SHAP using a summary plot

In Figure 5.4, we can tell that the highest (red) cocoa percentages (cocoa_percent) tend to correlate with a decrease in the likelihood of Highly Recommended, but the middle values (purple) tend to increase it. This finding makes intuitive sense because the darkest chocolates are more of an acquired taste than less-dark chocolates. The low values (blue) are scattered throughout, so they show no trend, but this could be because there aren't many. On the other hand, review_date suggests that bars reviewed in earlier years were more likely to be Highly Recommended, although there are significant shades of red and purple on both sides of 0, so it's hard to identify a clear trend here. A dependence plot, such as those used in Chapter 4, Global Model-Agnostic Interpretation Methods, would be better for this purpose. However, for binary features, it's very easy to visualize how high and low values (ones and zeros) impact the model. For instance, we can tell that the presence of cocoa, creamy, rich, and berry tastes increases the likelihood of the chocolate being recommended, while sweet, earthy, sour, and fatty tastes do the opposite. Likewise, the odds for Highly Recommended decrease if the chocolate was manufactured in the US! Sorry, US.

Local interpretation for a group of predictions using decision plots

For local interpretation, you don't have to visualize one point at a time—you can instead interpret several at a time. The key is providing some context to compare the points adequately, and there can't be so many that you can't distinguish them. Usually, you would find outliers or only those that meet specific criteria. For this exercise, we will select only those bars that were produced by your client, as follows:

sample_test_idx = X_test.index.get_indexer_for(
    [5, 6, 7, 18, 19, 21, 24, 25, 27])

One great thing about Shapley is its additivity property, which can be easily demonstrated. If you add all the SHAP values to the expected value used to compute them, you get a prediction. Of course, this is a classification problem, so the prediction is a probability; to get a Boolean array instead, we have to check whether the probability is greater than 0.5. We can check if this Boolean array matches our model's test dataset predictions (y_test_svc_pred) by running the following code:

expected_value = shap_svm_explainer.expected_value[1]
y_test_shap_pred = (shap_svm_values_test[1].sum(1) +
                    expected_value) > 0.5
print(np.array_equal(y_test_shap_pred, y_test_svc_pred))

It should, and it does!

SHAP's decision plot comes with a highlight feature that we can use to make false negatives (FN) stand out. Now, let's figure out which of our sample observations are FN, as follows:

FN = (~y_test_shap_pred[sample_test_idx]) &\
     (y_test.iloc[sample_test_idx] == 1).to_numpy()

We can now quickly reset our plotting style back to the default matplotlib style, and plot a decision_plot. It takes the expected_value, the SHAP values, and actual values of those items we wish to plot. Optionally, we can provide a Boolean array of the items we want to highlight, with dotted lines—in this case, the false negatives (FN), as illustrated in the following code snippet:

shap.decision_plot(expected_value,
            shap_svm_values_test[1][sample_test_idx],
              X_test.iloc[sample_test_idx], highlight=FN)

The plot produced in Figure 5.5 has a single color-coded line for each observation.

Figure 5.5 – Local model interpretation with SHAP for a sample of predictions, highlighting false negatives

The color of each line represents not the value of any feature, but the model output. Since we used predict_proba in KernelExplainer, this is a probability (otherwise, it would display SHAP values), and the value at which each line strikes the top x axis is the predicted value. The features are sorted in terms of importance, but only among the observations plotted, and you can tell that the lines move left and right as they cross each feature. How much they vary, and in which direction, depends on the feature's contribution to the outcome. The gray line represents the class's expected value, which is like the intercept in a linear model. In fact, all lines start at this value, making it best to read the plot from bottom to top.

You can tell that there are three false negatives plotted in Figure 5.5 because they have dotted lines. Using this plot, we can easily visualize which features made them veer toward the left the most, because this is what made them negative predictions. For instance, we know that the leftmost false negative was to the right of the expected value line until lecithin and then continued decreasing until company_location_France, and review_date increased its likelihood of Highly Recommended, but it wasn't enough. You can tell that country_of_bean_origin_Other decreased the likelihood of two of the misclassifications. This decision could be unfair because the country could be any of over 50 countries that didn't get their own feature. Quite possibly, there's a lot of variation between the beans of these countries grouped together.

Decision plots can also isolate a single observation. When they do this, they print the value of each feature next to the dotted line. Let's plot a decision plot for a true positive from the same company (observation #696), as follows:

shap.decision_plot(expected_value, shap_svm_values_test[1][696],
    X_test.iloc[696], highlight=0)

Figure 5.6 here was outputted by the preceding code:

Figure 5.6 – Local model interpretation with SHAP for a single true positive in the sample of predictions

In Figure 5.6, you can see that lecithin and counts_of_ingredients decreased the Highly Recommended likelihood to a point where it could have jeopardized it. Fortunately, all features above those veered the line decidedly rightward because company_location_France=1, cocoa_percent=70, and tastes_berry=1 are all favorable.

Local interpretation for a single prediction at a time using a force plot

Your client, the chocolate manufacturer, has two bars they want you to compare. Bar #5 is Outstanding and #24 is Disappointing. They are both in your test dataset. One way of comparing them is to place their values side by side in a dataframe to understand how exactly they differ. We will concatenate the rating, the actual label y, and the y_pred predicted label to these observations' values, as follows:

eval_idxs = (X_test.index==5) | (X_test.index==24)
X_test_eval = X_test[eval_idxs]
eval_compare_df = pd.concat([
     chocolateratings_df.iloc[X_test[eval_idxs].index].rating,
     pd.DataFrame({'y':y_test[eval_idxs]}, index=[5,24]),
     pd.DataFrame({'y_pred':y_test_svc_pred[eval_idxs]},
     index=[24,5]), X_test_eval], axis=1).transpose()
eval_compare_df

The preceding code produces the dataframe shown in Figure 5.7.

Figure 5.7 – Observations #5 and #24 side by side, with feature differences highlighted in yellow

With this dataframe, you can confirm that they aren't misclassifications because y=y_pred. A misclassification could make model interpretations unreliable to understand why people tend to like one chocolate bar more than another. Then, you can examine the features to spot the differences—for instance, you can tell that the review_date is 2 years apart. Also, the beans for the Outstanding bar were from Venezuela, and the Disappointing beans came from another, lesser-represented country. The Outstanding one had a berry taste, and the Disappointing one was earthy.

The force plot can tell us a complete story of what weighed in the model's decisions (and, presumably, the reviewers'), and gives us clues as to what consumers might prefer. Plotting a force_plot requires the expected value for the class of your interest (expected_value), the SHAP values for the observation of your interest, and this observation's actual values. We will start with observation #5, as illustrated in the following code snippet:

shap.force_plot(expected_value,     
            shap_svm_values_test[1][X_test.index==5],
            X_test[X_test.index==5], matplotlib=True)

The preceding code produces the plot shown in Figure 5.8. This force plot depicts how much review_date, cocoa_percent, and tastes_berry weigh in the prediction, while the only feature that seems to be weighing in the opposite direction is counts_of_ingredients.

Figure 5.8 – Force plot for observation #5 (Outstanding)

Let's compare it with a force plot of observation #24, as follows:

shap.force_plot(expected_value,  
                shap_svm_values_test[1][X_test.index==24],
                X_test[X_test.index==24], matplotlib=True)

The preceding code produces the plot shown in Figure 5.9. We can easily tell that tastes_earthy and country_of_bean_origin_Other are considered highly negative attributes by our model. The outcome could be mostly explained by the difference in the chocolate tasting of "berry" versus "earthy". Despite our findings, the beans' origin country needs further investigation. After all, it is possible that the actual country of origin doesn't correlate with poor ratings.

Figure 5.9 – Force plot for observation #24 (Disappointing)

In this section, we covered the KernelExplainer, which uses some tricks it learned from LIME. But what is LIME? We will find that out next!

Employing LIME

Until now, the model-agnostic interpretation methods we've covered attempt to reconcile the totality of outputs of a model with its inputs. For these methods to get a good idea of how and why X becomes y_pred, they need some data first. Then, they perform simulations with this data, pushing variations of it in and evaluating what comes out of the model. Sometimes, they even leverage a global surrogate to connect the dots. By using what they learned in this process, they yield importances, scores, rules, or values that quantify a feature's impact, interactions, or decisions on a global level. For many methods such as SHAP, these can be observed locally too. However, even when it can be observed locally, what was quantified globally may not apply locally. For this reason, there should be another approach that quantifies the local effects of features solely for local interpretation—one such as LIME!

What is LIME?

LIME trains local surrogates to explain a single prediction. To this end, it starts by asking you which data point you want to interpret. You also provide it with your black-box model and a sample dataset. It then makes predictions on a perturbed version of the dataset with the model, creating a scheme whereby it samples and weighs points higher if they are closer to your chosen data point. This area around your point is called a neighborhood. Then, using the sampled points and black-box predictions in this neighborhood, it trains a weighted intrinsically interpretable surrogate model. Lastly, it interprets the surrogate model.

There are lots of keywords to unpack here so let's define them, as follows:

  • Chosen data point: LIME calls the data point, row, or observation you want to interpret an instance. It's just another word for this concept.
  • Perturbation: LIME simulates new samples by perturbing each feature, drawing from the training dataset's distribution for categorical features and from a normal distribution for continuous features.
  • Weighting scheme: LIME uses an exponential smoothing kernel to both define the neighborhood radius and determine how to weigh the points farthest versus those closest.
  • Closer: LIME uses Euclidean distance for tabular and image data, and cosine similarity for text. This is hard to imagine in high-dimensional feature spaces, but you can calculate the distance between points for any number of dimensions and find which points are closest to the one of interest.
  • Intrinsically interpretable surrogate model: LIME uses a sparse linear model with weighted ridge regularization. However, it could use any intrinsically interpretable model as long as the data points can be weighted. The idea behind this is twofold. It needs a model that can yield reliable intrinsic parameters such as coefficients that tell it how much each feature impacts the prediction. It also needs to consider data points closest to the chosen point more because these are more relevant.

Much like with k-Nearest Neighbors (k-NN), the intuition behind LIME is that points in a neighborhood have commonality because you could expect points close to each other to have similar, if not the same, labels. Classifiers have decision boundaries, though, so this could be a very naive assumption to make when close points are divided by one.

Similar to another model class in the Nearest Neighbors family, Radius Nearest Neighbors, LIME factors in distance along a radius and weighs points accordingly, although it does this exponentially. However, LIME is not a model class but an interpretation method, so the similarities stop there. Instead of "voting" for predictions among neighbors, it fits a weighted surrogate sparse linear model because it assumes that every complex model is linear locally, and because it's not a model class, the predictions the surrogate model makes don't matter. In fact, the surrogate model doesn't even have to fit the data like a glove because all you need from it is the coefficients. Of course, that being said, it is best if it fits well so that there is higher fidelity in the interpretation.

LIME works for tabular, image, and text data and generally has high local fidelity, meaning that it can approximate the model predictions quite well on a local level. However, this is contingent on the neighborhood being defined correctly, which stems from choosing the right kernel width and the assumption of local linearity holding true.
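To tie these pieces together, here is a heavily simplified sketch of LIME's procedure for continuous tabular features only; it skips categorical handling, discretization, and feature selection, the function name and defaults are ours, and predict_fn is assumed to behave like predict_proba:

import numpy as np
from sklearn.linear_model import Ridge

def lime_sketch(instance, predict_fn, X_sample, num_samples=5000,
                kernel_width=None):
    rng = np.random.default_rng(0)
    # 1. Perturb: sample new points around the training distribution
    mu, sigma = X_sample.mean(axis=0), X_sample.std(axis=0)
    Z = rng.normal(mu, sigma, size=(num_samples, X_sample.shape[1]))
    Z[0] = instance  # keep the instance of interest itself
    # 2. Weight: exponential kernel on the distance to the instance
    if kernel_width is None:
        kernel_width = np.sqrt(X_sample.shape[1]) * 0.75
    d = np.sqrt(((Z - instance) ** 2).sum(axis=1))
    weights = np.sqrt(np.exp(-(d ** 2) / kernel_width ** 2))
    # 3. Fit a weighted linear surrogate to the black-box's probabilities
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(Z, predict_fn(Z)[:, 1], sample_weight=weights)
    return surrogate.coef_  # local feature effects

The real LimeTabularExplainer layers discretization, categorical sampling, and feature selection on top of this core idea.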

Local interpretation for a single prediction at a time using LimeTabularExplainer

To explain a single prediction, you first instantiate a LimeTabularExplainer by providing it with your sample dataset in a NumPy 2D array (X_test.values), a list with the names of the features (X_test.columns), a list with the indices of the categorical features (only the first three features aren't categorical), and the class names. Even though only the sample dataset is required, it is recommended that you provide names for your features and classes so that the interpretation makes sense. For tabular data, telling LIME which features are categorical (categorical_features) is important because it treats categorical features differently from continuous ones, and not specifying this could potentially make for a poor-fitting local surrogate. Another parameter that can greatly impact the local surrogate is kernel_width. This defines the diameter of the neighborhood, thus answering the question of what is considered local. It has a default value, which may or may not yield interpretations that make sense for your instance. You could tune this parameter on an instance-by-instance basis to optimize your explanations. The code can be seen in the following snippet:

lime_svm_explainer = lime.lime_tabular.LimeTabularExplainer(
    X_test.values,
    feature_names=X_test.columns,
    categorical_features=list(range(3, 44)),
    class_names=['Not Highly Recomm.', 'Highly Recomm.'])
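If the default neighborhood produces a poorly fitting surrogate for a given instance, you could pass kernel_width explicitly when instantiating the explainer. The variable name and value below are purely illustrative:

lime_svm_explainer_narrow = lime.lime_tabular.LimeTabularExplainer(
    X_test.values,
    feature_names=X_test.columns,
    categorical_features=list(range(3, 44)),
    class_names=['Not Highly Recomm.', 'Highly Recomm.'],
    kernel_width=3)  # a smaller width means a tighter, more local neighborhood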

With the instantiated explainer, you can now use explain_instance to fit a local surrogate model to observation #5. We also will use our model's classifier function (predict_proba) and limit our number of features to eight (num_features=8). We can take the "explanation" returned and immediately visualize it with show_in_notebook. At the same time, the predict_proba parameter makes sure it also includes a plot to show which class is the most probable, according to the local surrogate model. The code is illustrated in the following snippet:

lime_svm_explainer.explain_instance(
    X_test[X_test.index==5].values[0],
    fitted_svm_mdl.predict_proba,
    num_features=8).show_in_notebook(predict_proba=True)

The preceding code provides the output shown in Figure 5.10. According to the local surrogate, a cocoa_percent value smaller or equal to 70 is a favorable attribute, as is the berry taste. A lack of sour, sweet, and molasses tastes also weighs in favorably in this model. However, a lack of rich, creamy, and cocoa tastes does the opposite, but not enough to push the scales toward Not Highly Recommended.

Figure 5.10 – LIME tabular explanation for observation #5 (Outstanding)

With a small adjustment to the code that produced Figure 5.10, we can produce the same plot but for observation #24, as follows:

lime_svm_explainer.explain_instance(
    X_test[X_test.index==24].values[0],
    fitted_svm_mdl.predict_proba,
    num_features=8).show_in_notebook(predict_proba=True)

Here, in Figure 5.11, we can clearly see why the local surrogate believes that observation #24 is Not Highly Recommended:

Figure 5.11 – LIME tabular explanation for observation #24 (Disappointing)

Once you compare the explanation of #24 (Figure 5.11) with that of #5 (Figure 5.10), the problems become evident. A single feature, tastes_berry, is what differentiates both explanations. Of course, we have limited it to the top eight features, so there's probably much more to it. However, you would expect the top eight features to include the ones that make the most difference.

According to SHAP, tastes_earthy=1 is what chiefly explains the disappointing nature of the #24 chocolate bar, so LIME's explanation appears to be counterintuitive. So, what happened? It turns out that observations #5 and #24 are relatively similar and, thus, in the same neighborhood. This neighborhood also includes many chocolate bars with berry tastes, and very few with earthy ones. However, there are not enough earthy ones to consider it a salient feature, so it attributes the difference between Highly Recommended and Not Highly Recommended to other features that seem to differentiate more often, at least locally. The reason for this is twofold: the local neighborhood could be too small, and linear models, given their simplicity, are on the bias end of a bias-variance trade-off. This bias is only exacerbated by the fact that some features such as tastes_berry can appear relatively more often than tastes_earthy. There's an approach we can use to fix this, and we'll cover it in the next section.

Using LIME for NLP

At the beginning of the chapter, we set aside training and test datasets with the cleaned-up contents of all the "tastes" columns for NLP. We can take a peek at the test dataset for NLP, as follows:

print(X_test_nlp)

This outputs the following:

1194                 roasty nutty rich
77      roasty oddly sweet marshmallow
121              balanced cherry choco
411                sweet floral yogurt
1259           creamy burnt nuts woody
                     ...              
327          sweet mild molasses bland
1832          intense fruity mild sour
464              roasty sour milk note
2013           nutty fruit sour floral
1190           rich roasty nutty smoke
Length: 734, dtype: object

No machine learning model can ingest the data as text, so we need to turn it into a numerical format—in other words, vectorize it. There are many techniques we can use to do this. In our case, we are not interested in the position of words in each phrase, nor the semantics. However, we are interested in their relative occurrence—after all, that was an issue for us in the last section.

For these reasons, Term Frequency-Inverse Document Frequency (TF-IDF) is the ideal method because it's meant to evaluate how often a term (each word) appears in a document (each phrase), weighted according to its frequency in the entire corpus (all phrases). We can easily vectorize our datasets using the TF-IDF method with TfidfVectorizer from scikit-learn. Note that the TF-IDF scores are fitted to the training dataset only, so that the transformed train and test datasets have consistent scoring for each term. Have a look at the following code snippet:

vectorizer = TfidfVectorizer(lowercase=False)
X_train_nlp_fit = vectorizer.fit_transform(X_train_nlp)
X_test_nlp_fit = vectorizer.transform(X_test_nlp)
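For intuition, each row's scores are computed roughly as follows. This is a sketch mirroring scikit-learn's documented defaults (smooth_idf=True, norm='l2'), not the library's actual code:

import numpy as np

def tfidf_row(term_counts, doc_freqs, n_docs):
    # term_counts: raw counts of each term in one (non-empty) document
    # doc_freqs: number of documents in the corpus containing each term
    idf = np.log((1 + n_docs) / (1 + doc_freqs)) + 1
    weights = term_counts * idf
    return weights / np.linalg.norm(weights)  # l2-normalize the row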

To get an idea of what the TF-IDF scores look like, we can place all the feature names in one column of a dataframe, and their respective scores for a single observation in another. Note that since the vectorizer produces a SciPy sparse matrix, we have to convert it into a NumPy matrix with todense() and then a NumPy array with asarray(). We can sort this dataframe in descending order by TF-IDF score. The code is shown in the following snippet:

pd.DataFrame(
    {'taste': vectorizer.get_feature_names(),
     'tf-idf': np.asarray(
         X_test_nlp_fit[X_test_nlp.index==5].todense())[0]}
).sort_values(by='tf-idf', ascending=False)

The preceding code produces the output shown here in Figure 5.12:

Figure 5.12 – The TF-IDF scores for words present in observation #5

And as you can tell from Figure 5.12, the TF-IDF scores are normalized values between 0 and 1, and those most common in the corpus have a lower value. Interestingly enough, we realize that observation #5 in our tabular dataset had berry=1 because of raspberry. The categorical encoding method we used searched for occurrences of berry regardless of whether it matched an entire word or not. This isn't a problem because raspberry is a kind of berry, and raspberry wasn't one of our common tastes with its own binary column.

Now that we have vectorized our NLP datasets, we can proceed with the modeling.

Training a LightGBM model

LightGBM, like XGBoost, is another very popular and performant gradient-boosting framework that leverages boosted-tree ensembles and histogram-based split finding. The main differences lie in the split-finding algorithms, which for LightGBM involve sampling with Gradient-based One-Side Sampling (GOSS) and bundling sparse features with Exclusive Feature Bundling (EFB), versus XGBoost's more rigorous Weighted Quantile Sketch and Sparsity-aware Split Finding. Another difference lies in how the trees are grown, which is level-wise (depth-wise) for XGBoost and leaf-wise (best-first) for LightGBM. We won't get into the details of how these algorithms work because that would derail the topic at hand. However, it's important to note that thanks to GOSS, LightGBM is usually even faster than XGBoost, and though it can lose predictive performance due to GOSS split approximations, it gains some of it back with its leaf-wise approach. On the other hand, EFB makes LightGBM ideal for training on sparse features efficiently and effectively, such as those in our X_train_nlp_fit sparse matrix! That pretty much sums up why we are using LightGBM for this exercise.

To train the LightGBM model, we first initialize the model by setting the maximum tree depth (max_depth), the learning rate (learning_rate), the number of boosted trees to fit (n_estimators), the objective, which is binary classification, and—last but not least—the random_state for reproducibility. With fit, we train the model using our vectorized NLP training dataset (X_train_nlp_fit) and the same labels used for the SVM model (y_train). Once trained, we can evaluate using the evaluate_class_mdl we used with SVM. The code is illustrated in the following snippet:

lgb_mdl = lgb.LGBMClassifier(max_depth=13, learning_rate=0.05,
                             n_estimators=100, objective='binary',
                             random_state=rand)
fitted_lgb_mdl = lgb_mdl.fit(X_train_nlp_fit, y_train)
y_train_lgb_pred, y_test_lgb_prob, y_test_lgb_pred =\
    mldatasets.evaluate_class_mdl(fitted_lgb_mdl, X_train_nlp_fit,
                                  X_test_nlp_fit, y_train, y_test)

The preceding code produces Figure 5.13, shown here:

Figure 5.13 – Predictive performance of our LightGBM model

The performance achieved by LightGBM (see Figure 5.13) is slightly lower than for SVM (Figure 5.3) but it's still pretty good, safely above the coin-toss line. The comments for SVM about favoring precision over recall for this model also apply here.

Local interpretation for a single prediction at a time using LimeTextExplainer

To interpret any black-box model prediction with LIME, you need to specify a classifier function such as predict_proba for your model, and it will use this function to make predictions with perturbed data in the neighborhood of your instance and then train a linear model with it. The instance must be in its numerical form—in other words, vectorized. However, it would be easier if you could provide any arbitrary text, and it could then vectorize it on the fly. This is precisely what a pipeline can do for you. With the make_pipeline function from scikit-learn, you can define a sequence of estimators that transform the data, followed by one that can fit it. In this case, we just need vectorizer to transform our data, followed by our LightGBM model (lgb_mdl) that takes the transformed data, as illustrated in the following code snippet:

lgb_pipeline = make_pipeline(vectorizer, lgb_mdl)
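As a quick sanity check, the pipeline accepts raw text and returns class probabilities because both of its steps are already fitted (the phrase below is arbitrary):

print(lgb_pipeline.predict_proba(['creamy berry cocoa']))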

Initializing a LimeTextExplainer is pretty simple. All parameters are optional, but it's recommended to specify names for your classes. Just as with LimeTabularExplainer, a kernel_width optional parameter can be critical because it defines the neighborhood's size, and there's a default that may not be optimal but can be tuned on an instance-by-instance basis. The code is illustrated here:

lime_lgb_explainer = LimeTextExplainer(
                      class_names=['Not Highly Recomm.', 'Highly Recomm.'])

Explaining an instance with LimeTextExplainer is similar to doing it for LimeTabularExplainer. The difference is that we are using a pipeline (lgb_pipeline), and the data we are providing (first parameter) is text since the pipeline can transform it for us. The code is illustrated in the following snippet:

lime_lgb_explainer.explain_instance(
    X_test_nlp[X_test_nlp.index==5].values[0],
    lgb_pipeline.predict_proba,
    num_features=4).show_in_notebook(text=True)

According to the LIME text explainer (see Figure 5.14), the LightGBM model predicts Highly Recommended for observation #5 because of the word caramel. At least according to the local neighborhood, raspberry is not a factor.

Figure 5.14 – LIME text explanation for observation #5 (Outstanding)

Now, let's contrast the interpretation for observation #5 with that of #24, as we've done before. We can use the same code but simply replace 5 with 24, as follows:

lime_lgb_explainer.explain_instance(
    X_test_nlp[X_test_nlp.index==24].values[0],
    lgb_pipeline.predict_proba,
    num_features=4).show_in_notebook(text=True)

According to Figure 5.15, you can tell that observation #24, described as tasting like burnt wood earthy choco is Not Highly Recommended because of the words earthy and burnt.

Figure 5.15 – LIME text explanation for observation #24 (Disappointing)

Given that we are using a pipeline that can vectorize any arbitrary text, let's have some fun with that! We will first try a phrase made out of adjectives we suspect that our model favors, then try one with unfavorable adjectives, and lastly try using words that our model shouldn't be familiar with, as follows:

lime_lgb_explainer.explain_instance(
    'creamy rich complex fruity',
    lgb_pipeline.predict_proba,
    num_features=4).show_in_notebook(text=True)
lime_lgb_explainer.explain_instance(
    'sour bitter roasty molasses',
    lgb_pipeline.predict_proba,
    num_features=4).show_in_notebook(text=True)
lime_lgb_explainer.explain_instance(
    'nasty disgusting gross stuff',
    lgb_pipeline.predict_proba,
    num_features=4).show_in_notebook(text=True)

In Figure 5.16, the explanations are spot-on for creamy rich complex fruity and sour bitter roasty molasses since the model knows these words to be either very favorable or unfavorable. These words are also common enough to be appreciated on a local level.

You can see the output here:

Figure 5.16 – Arbitrary phrases not in the training or test dataset can be effortlessly explained with LIME, as long as words are in the corpus

However, you'd be mistaken to think that the prediction of Not Highly Recommended for nasty disgusting gross stuff has anything to do with the words. The LightGBM model hasn't seen these words before, so the prediction has more to do with Not Highly Recommended being the majority class, which is a good guess, and the sparse matrix for this phrase is all zeros. Therefore, LIME likely found few, if any, informative points in its neighborhood, and the near-zero coefficients of its local surrogate model reflect this.
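You can verify this by transforming the phrase with the fitted vectorizer. Assuming none of those words appear in the training corpus, as suggested above, the resulting row has no non-zero entries:

# Should print 0 if none of the words are in the TF-IDF vocabulary
print(vectorizer.transform(['nasty disgusting gross stuff']).nnz)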

Trying SHAP for NLP

Most of SHAP's explainers will work with tabular data. DeepExplainer can do text but is restricted to deep learning models, and, as we will cover in Chapter 7, Visualizing Convolutional Neural Networks, three of them do images, including KernelExplainer. In fact, SHAP's KernelExplainer was designed to be a general-purpose, truly model-agnostic method, but it's not promoted as an option for NLP. It's easy to understand why: it's slow, and NLP models tend to be very complex, with hundreds—if not thousands—of features to boot. In cases such as this one, where word order is not a factor and you have a few hundred features, but the top 100 are present in most of your observations, KernelExplainer could work.

In addition to overcoming slowness, there are a couple of technical hurdles you would need to clear. One of them is that KernelExplainer is compatible with a pipeline, but it expects a single set of predictions back, whereas LightGBM returns two sets, one for each class: Not Highly Recommended and Highly Recommended. To overcome this problem, we can create a lambda function (predict_fn) that wraps the predict_proba function and returns only the predictions for Highly Recommended. This is illustrated in the following code snippet:

predict_fn = lambda X: lgb_mdl.predict_proba(X)[:,1]

The second technical hurdle has to do with SHAP's incompatibility with SciPy's sparse matrices, and for our explainer we will need sample vectorized test data, which is in this format. To overcome this issue, we can convert our data in SciPy sparse-matrix format to a NumPy matrix and then to a pandas dataframe (X_test_nlp_samp_df). To overcome any slowness, we can use the same kmeans trick we used last time. Other than the adjustments made to overcome these obstacles, the following code is exactly the same as the SHAP code we ran with the SVM model:

X_test_nlp_samp_df = pd.DataFrame(
    shap.sample(X_test_nlp_fit, 50).todense())
shap_lgb_explainer = shap.KernelExplainer(
    predict_fn, shap.kmeans(X_train_nlp_fit.todense(), 10))
shap_lgb_values_test = shap_lgb_explainer.shap_values(
    X_test_nlp_samp_df, l1_reg="num_features(20)")
shap.summary_plot(shap_lgb_values_test, X_test_nlp_samp_df,
                  plot_type="dot",
                  feature_names=vectorizer.get_feature_names())

By using SHAP's summary plot in Figure 5.17, you can tell that globally the words creamy, rich, cocoa, fruit, spicy, nutty, and berry have a positive impact on the model toward predicting Highly Recommended. On the other hand, sweet, sour, earthy, hammy, sandy, and fatty have the opposite effect. These results shouldn't be entirely unexpected given what we learned with our prior SVM model with the tabular data and local LIME interpretations. That being said, the SHAP values were derived from samples of a sparse matrix, and they could be missing details and perhaps even be partially incorrect, especially for underrepresented features. Therefore, we should take the conclusions with a grain of salt, especially toward the bottom half of the plot. To increase interpretation fidelity it's best to increase sample size, but given the slowness of KernelExplainer, there's a trade-off to consider.

You can view the output here:

Figure 5.17 – SHAP summary plot for the LightGBM NLP model

Now that we have validated our SHAP values globally, we can use them for local interpretation with a force plot. Unlike LIME, we cannot use arbitrary data for this. With SHAP, we are limited to those data points we have previously generated SHAP values for. For instance, let's take the 18th observation from our test dataset sample, as follows:

print(shap.sample(X_test_nlp, 50).to_list()[18])

The preceding code outputs this phrase:

woody earthy medicinal

It's important to note which words are represented in the 18th observation because the X_test_nlp_samp_df dataframe contains the vectorized representation. The 18th observation's row in this dataframe is what you use to generate the force plot, along with the SHAP values for this observation and the expected value for the class, as illustrated in the following code snippet:

shap.force_plot(shap_lgb_explainer.expected_value,
                shap_lgb_values_test[18,:],
                X_test_nlp_samp_df.iloc[18,:],
                feature_names=vectorizer.get_feature_names())

Figure 5.18 is the force plot for woody earthy medicinal. As you can tell, earthy and woody weigh heavily in a prediction against Highly Recommended. The word medicinal is not featured in the force plot and instead you get a lack of creamy and cocoa as negative factors. As you can imagine, medicinal is not a word used often to describe chocolate bars, so there was only one observation in the sampled dataset that included it. Therefore, its average marginal contribution across possible coalitions would be greatly diminished.

Figure 5.18 – SHAP force plot for the 18th observation of the sampled test dataset

Let's try another one, as follows:

print(shap.sample(X_test_nlp, 50).to_list()[9])

The 9th observation is the following phrase:

intense spicy floral

Generating a force_plot for this observation is the same as before, except you replace 18 with 9.
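For reference, the call is a direct repeat of the previous force_plot with the row index swapped:

shap.force_plot(shap_lgb_explainer.expected_value,
                shap_lgb_values_test[9,:],
                X_test_nlp_samp_df.iloc[9,:],
                feature_names=vectorizer.get_feature_names())

Running this code produces the output shown here in Figure 5.19: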

Figure 5.19 – SHAP force plot for the 9th observation of the sampled test dataset

As you can appreciate in Figure 5.19, all words in the phrase are featured in the force plot: floral and spicy push toward Highly Recommended, and intense toward Not Highly Recommended. Now that you know how to perform both tabular and NLP interpretations with SHAP, how does it compare with LIME?

Comparing SHAP with LIME

As you will have noticed by now, both SHAP and LIME have limitations, but they also have strengths. SHAP is grounded in game theory and approximates Shapley values, so its SHAP values mean something. These have great properties, such as additivity, efficiency, and substitutability, that make them consistent, although KernelExplainer's treatment of "missing" features violates the dummy property. They always add up and don't need parameter tuning to accomplish this. However, SHAP is more suited for global interpretations, and one of its most model-agnostic explainers, KernelExplainer, is painfully slow. KernelExplainer also deals with missing values by using random ones, which can put too much weight on unlikely observations.

LIME is speedy, very model-agnostic, and adaptable to all kinds of data. However, it's not grounded on strict and consistent principles but has the intuition that neighbors are alike. Because of this, it can require tricky parameter tuning to define the neighborhood size optimally, and even then, it's only suitable for local interpretations.

Mission accomplished

The mission was to understand why one of your client's bars is Outstanding while another one is Disappointing. Your approach employed the interpretation of machine learning models to arrive at the following conclusions:

  • According to SHAP on the tabular model, the Outstanding bar owes that rating to its berry taste and its cocoa percentage of 70%. On the other hand, the unfavorable rating for the Disappointing bar is due mostly to its earthy flavor and bean country of origin (Other). Review date plays a smaller role, but it seems that chocolate bars reviewed in that period (2013-15) were at an advantage.
  • LIME confirms that cocoa_percent<=70 is a desirable property, and that, in addition to berry, creamy, cocoa, and rich are favorable tastes, while sweet, sour, and molasses are unfavorable.
  • The commonality between both methods using the tabular model is that despite the many non-taste-related attributes, taste features are among the most salient. Therefore, it's only fitting to interpret the words used to describe each chocolate bar via an NLP model.
  • The Outstanding bar was represented by the phrase oily nut caramel raspberry, of which, according to LimeTextExplainer, caramel is positive and oily is negative. The other two words are neutral. On the other hand, the Disappointing bar was represented by burnt wood earthy choco, of which burnt and earthy are unfavorable and the other two are favorable.
  • The inconsistencies between the tastes in tabular and NLP interpretations are due to the presence of lesser-represented tastes, including raspberry, which is not as common as berry.
  • According to SHAP's global explanation of the NLP model, creamy, rich, cocoa, fruit, spicy, nutty, and berry have a positive impact on the model toward predicting Highly Recommended. On the other hand, sweet, sour, earthy, hammy, sandy, and fatty have the opposite effect.

With these notions of which chocolate-bar characteristics and tastes are considered less attractive by Manhattan Chocolate Society members, the client can apply changes to their chocolate-bar formulas to appeal to a broader audience—that is, if the assumption is correct that this group is representative of their target audience.

It could be argued that it is pretty apparent that words such as earthy and burnt are not favorable words to associate with chocolate bars, while caramel is. Therefore, we could have reached this conclusion without machine learning! But first of all, a conclusion not informed by data would have been an opinion, and, secondly, context is everything. Furthermore, humans can't always be relied upon to place one point objectively in its context—especially considering it's among thousands of records!

Also, local model interpretation is not only about the explanation for one prediction because it's connected to how a model makes all predictions but, more importantly, to how it makes predictions for similar points—in other words, in the local neighborhood! In the next chapter, we will expand on what it means to be in the local neighborhood by looking at the commonalities (anchors) and inconsistencies (counterfactuals) we can find there.

Summary

After reading this chapter, you should know how to use SHAP's KernelExplainer, as well as its decision and force plot to conduct local interpretations. You also should know how to do the same with LIME's instance explainer for both tabular and text data. Lastly, you should understand the strengths and weaknesses of SHAP's KernelExplainer and LIME. In the next chapter, we will learn how to create even more human-interpretable explanations of a model's decisions, such as "if X conditions are met, then Y is the outcome".

Dataset sources

Further reading
