Chapter 9: Model Evaluation Metrics

Overview

General Information

Data Source Statement

Data Dictionary

Variable Summaries

Model Output

Model Overview

Accuracy Statistics

Regression Models

Classification Models

Black-Box Evaluation Tools

Variable Importance

Partial Dependence (PD) Plots

Individual Conditional Expectation (ICE) Plots

Chapter Review

Overview

Model evaluation is essential for determining the accuracy and effectiveness of your model. Evaluation metrics can help a data scientist develop a robust model on the training and validation data sets while also providing a framework to evaluate the accuracy of the final model by applying these metrics to the hold-out test and out-of-time data sets. A data scientist must select the appropriate evaluation metric before developing their model and must understand the pros and cons of each evaluation metric.

This chapter will focus on many of the most common model evaluation metrics that are required in both the academic and corporate data science settings.

General Information

Before you create any model, you should first have a strong understanding of all the data assets that you will be using to develop your analysis and subsequent model. Not only is this very important for your own use, but once you build your model, you will need to demonstrate its quality and effectiveness to a team of people who were not involved in the model development process. They will require a start-to-finish explanation of the data sources, modeling techniques, and evaluation metrics before they can approve your model for production.

Data Source Statement

The first item that is generally required is a data source statement. This statement is a summary of information about all the data sets that you have pooled together to create your modeling data set. Table 9.1 shows a general structure of the data source statement.

Table 9.1: Data Source Statement Structure

The data source statement is usually in the form of a spreadsheet or a Word document. It contains all the information that a reviewer would need to understand what type of data you will be using, where it can be found, and some indication of the volume of data.

Data Dictionary

The next critical piece of information that a reviewer would need is the data dictionary. The data dictionary is a summary of all the variables that are used in the model. Some model governance organizations require that you submit a data dictionary for all variables that were even considered by the model, while other model governance bodies require only a data dictionary for those variables that were retained in the final model.

Table 9.2 shows the general structure of the data dictionary.

Table 9.2: Data Dictionary Structure

The table shows the variable name as represented in the data set, the actual business name (which is the common language name of the data asset), and a definition of the variable.

Variable Summaries

The last pieces of general information that we will touch on are the variable summaries, statistics, and graphs. These pieces of information provide the reviewer with detailed information about each variable in your data set.

We will use the Airbnb data set that we developed in Chapter 2 to demonstrate the variable summaries. This data set contains both continuous numeric variables and categorical character variables. For the numeric variables, we can run a simple PROC MEANS to get a general overview of these variables. Program 9.1 develops the numeric variable summaries.

Program 9.1: Numeric Variable Summaries

PROC MEANS DATA=TRAIN N NMISS MIN MAX MEAN STDDEV;
       VAR price bedrooms bathrooms accommodates square_feet;
RUN;

This program pulls only a few selected numeric variables for demonstration purposes. The data that is used is the raw data that has not been treated for missing values, outliers, or limits set on specific values. Output 9.1 shows the data output.

Output 9.1: Variable Summary for a Few Selected Variables

The raw data shows missing values and wide standard deviations of the data. The model reviewers will need to know what the data looked like prior to any treatments of the data.

Once the data has been treated, you can provide another variable summary of the final modeling data set. This summary provides the model reviewers with a before and after look at the modeling data set. Output 9.2 shows the same PROC MEANS applied to the final modeling data set.

Output 9.2: Variable Summary for the Final Modeling Data Set

By evaluating the before and after look at the data, the model reviewers can easily see that there are no more missing values, there are thresholds placed on the target variable price, the standard deviations have been significantly reduced, and the variable square_feet has been removed due to a large number of missing values.

For critical variables such as the target variable, additional analysis may be required. This analysis would include a statistical review of the variable. The most common approach is to run PROC UNIVARIATE on the target variable. This analysis will show all of the relevant information pertaining to that variable. Program 9.2 develops a statistical analysis of the price variable.

Program 9.2: Univariate Analysis of the Target Variable

PROC UNIVARIATE DATA=train_final;
    VAR price;
    HISTOGRAM;
RUN;

The output of PROC UNIVARIATE has been reviewed in both Chapter 2 and Chapter 3, so please refer to those chapters for more detail on the interpretation of the output. The model reviewers will generally inspect the Moments and Basic Statistical Measures tables shown in Output 9.3.

Output 9.3: PROC UNIVARIATE Output

For character variables, PROC FREQ will provide the necessary information on the number of classes and the volume in each category for each character variable. Program 9.3 shows a sample PROC FREQ statement.

Program 9.3: Character Variable Analysis

PROC FREQ DATA=train_final;
  TABLES _CHARACTER_;
RUN;

The final piece of information that a model reviewer may require is a graphical depiction of the distribution of a set of variables. This chart is not generally required for all variables, but it is common to produce a graphical distribution for the target variable and a few of the critical predictive variables.

Output 9.4: Graphical Distribution for a Few Key Variables

PROC UNIVARIATE in Program 9.2 included a HISTOGRAM statement that produces a graphical distribution of the data shown in Output 9.4.

This view of the target variable allows the model reviewer to see that the variable is skewed right with most of the values occurring between a price of $50 and $150 per night.

Model Output

Academic review and organizational model governance will require a detailed description of how the model was developed. There are standard templates that you can use for academic purposes, and most organizations have detailed forms that you need to complete. These forms walk you through the process of detailing your development process. Some parts of these forms will require you to provide model overview information and model assessment metrics. This section will focus on walking you through the explanation of each of the most common model metrics and what is expected in the model overview section.

Model Overview

Once a model has been developed, we can provide detailed information on certain features of the model. If the model that was developed is a linear regression, logistic regression or a simple decision tree, then we can provide a model overview table similar to Table 9.3.

Table 9.3: Model Overview Table

This table is very useful for “white-box” models. These models allow us to know precisely what is going on inside the model at the variable level. Therefore, we can list the relevant variables and provide information on each variable such as the coefficient value, the percent contribution based on the Wald Chi-Square value, the variable description, the direction of the relationship between the variable and the target variable, and the business description that defines how this variable works in relation to the target variable.

“Black-box” models will require some alternative methods of explanation that we will explore throughout this chapter.

Accuracy Statistics

Model accuracy metrics can be divided into two classes, regression and classification. Regression models contain a target variable that is a continuous numeric value, while classification models contain a categorical target variable. Because of this difference, the way that model accuracy is assessed differs depending on which of these model designs is being used.

Regression Models

There are different stages of model accuracy that we need to inspect during the construction and review of regression models. During the construction phase, we need to assess the training error rate that compares the actual observation values with the predicted values (for example, MAE and RMSE). Once the given error metric is minimized, we will need to determine which of a range of models will perform best on the testing data set (for example, AIC and BIC). Finally, once a model is selected, we need to determine the overall accuracy of the model (for example, R2 and Adjusted R2).

RSS, MAE, MSE, RMSE, and MAPE

In Chapter 6, we reviewed linear regression models and we introduced the subject of accuracy metrics such as the Residual Sum of Squares (RSS). The RSS is simply the sum of the squared values of the actual minus the predicted value. Equation 9.1 shows the formula for the RSS. This metric is used to construct the coefficient weights of the regression model. These weights are selected where the RSS is minimized.

Equation 9.1: Residual Sum of Squares Equation
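In standard notation, with y_i the actual value and \hat{y}_i the predicted value for observation i of n observations, the RSS takes the form:

$$\text{RSS} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$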

Although this is an important error metric that we need to calculate to understand the predictive accuracy of the model, it is by no means the only error metric that we can calculate by using the difference between the actual and the predicted values.

Figure 9.1 shows us an example of the actual versus the predicted values for a sample data set.

Figure 9.1: Actual vs. Predicted Values

The Mean Absolute Error (MAE) is calculated by taking the absolute value of the difference between the actual and predicted values.

Equation 9.2: Mean Absolute Error Equation
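In the same notation, the MAE averages the absolute residuals:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$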

This metric mitigates the impact that outliers can have on your model performance metric because each residual contributes proportionally to the total amount of error. This means that outliers contribute linearly to the overall error.

The Mean Squared Error (MSE) can be thought of as a combination of the RSS and MAE approaches. This equation takes the RSS formula and divides it by the total number of observations.

Equation 9.3: Mean Squared Error Equation
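In the same notation, the MSE is the RSS averaged over the number of observations:

$$\text{MSE} = \frac{\text{RSS}}{n} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$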

Because we are squaring the difference of the terms, the MSE will almost always be greater than the MAE. Because the MAE and MSE are computationally different, we cannot justly compare the two results. For the MSE metric, the square term magnifies the impact of outliers. This means that the error term will grow quadratically in relation to outliers.

The Root Mean Squared Error (RMSE) is the square root of the MSE metric. This metric normalizes the units to be on the same scale as the target variable. If the MSE can be thought of as the variance of the model, then the RMSE can be thought of as the standard deviation of the model.

Equation 9.4: Root Mean Squared Error Equation
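Written out, the RMSE is simply the square root of the MSE:

$$\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$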

The Mean Absolute Percent Error (MAPE) is calculated by converting the MAE into a percentage. The main benefit of the MAPE is that people understand percentages, so this metric is much easier to interpret. We still get the linear relation of outliers to the overall error, just like the MAE. Where the MAE is the average magnitude of error produced by the model, the MAPE expresses how far the model’s predictions are from the actual values on average, as a percentage.

Equation 9.5: Mean Absolute Percent Error Equation
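In the same notation, the MAPE is the average absolute residual expressed as a percentage of the actual value:

$$\text{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$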

Mallows’s Cp, AIC, and BIC

When we built our regression model, we selected a method that specifies how the variables will be added to the model. The possible methods included full, forward, backward, and stepwise. Each of these approaches results in a set of models that contain subsets of a different number of predictors. We can evaluate the model errors on the training data set using the model error metrics discussed in the previous section (MAE, MSE, RMSE, MAPE).

Those metrics are beneficial in determining the best model for the training data. However, we select the overall best model based on the TEST data. To evaluate the error metrics on the TEST data set, we can either create separate hold-out test data sets and investigate the accuracy statistics on those data sets, or we can adjust the training error for the model size, which will simulate a test data set performance metric. This approach can help us select the best model among a set of models with a different number of variables.

Mallows’s Cp adds a penalty factor to the training RSS to adjust for the training error underestimating the test error. Equation 9.6 shows that Mallows’s Cp adds a penalty factor of 2dσ̂2.

Equation 9.6: Mallows’s Cp Equation
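A common form of the statistic, consistent with the penalty described here (with d the number of predictors, n the number of observations, and σ̂2 an estimate of the error variance), is:

$$C_p = \frac{1}{n}\left(\text{RSS} + 2d\hat{\sigma}^2\right)$$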

This penalty factor has two main effects:

1. As more variables are added to the model (represented by d), the penalty increases.

2. The σ̂2 term is an estimate of the variance of the error associated with each response measurement.

The Cp statistic can be interpreted as “lower is better.” This interpretation is because the value of Cp tends to be small for models with a low test error.

The Akaike Information Criterion (AIC) functions similarly to the Mallows’s Cp statistic. Equation 9.7 shows that the difference between the AIC and the Cp statistic is that the AIC includes the variance in the denominator.

Equation 9.7: Akaike Information Criterion Equation
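For least squares models, a common form that is consistent with this description is:

$$\text{AIC} = \frac{1}{n\hat{\sigma}^2}\left(\text{RSS} + 2d\hat{\sigma}^2\right)$$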

An alternative formulation of AIC is shown in Equation 9.8.

Equation 9.8: Alternative Formulation of AIC

AIC = –2 ln (L) + 2k

This approach highlights that the AIC uses a model’s maximum likelihood estimation (log-likelihood) as a measure of fit (where L is the likelihood and k is the number of parameters).

AIC can be interpreted in the same manner as the Mallows’s Cp statistic. Lower is better.

The Bayesian Information Criterion (BIC) produces results very similar to the Mallows’s Cp and the AIC. The BIC replaces the 2dσ̂2 term used by Mallows’s Cp with log(n)dσ̂2. This replacement forces the BIC to place heavier penalties on models with more predictors and can result in more conservative models compared to the Mallows’s Cp and AIC approaches.

Equation 9.9: Bayesian Information Criterion Equation
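A common form for least squares models, consistent with the description above, is:

$$\text{BIC} = \frac{1}{n}\left(\text{RSS} + \log(n)\,d\,\hat{\sigma}^2\right)$$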

R2 and Adjusted R2

The above metrics allow us to evaluate and choose the best model across a range of model designs. But how do we evaluate our overall model fit? For regression models, the model fit is generally measured by a variety of metrics including R2 and Adjusted R2.

The R2 metric represents the percentage of variation of the observations explained by the model. Equation 9.10 shows that R2 is calculated as one minus the ratio of the Residual Sum of Squares (RSS) to the Total Sum of Squares (TSS).

Equation 9.10: R2 Equation
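In standard notation, with TSS measuring the total variation of the target variable around its mean, the formula is:

$$R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}, \qquad \text{TSS} = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$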

It is tempting to use this metric to evaluate your model. It is straightforward and easy to understand. However, the R2 metric has some inherent problems with it:

● The R2 metric will always increase when you add more predictors to the model. This is the case even when these predictors are random variables with no relation to the target variable.

● The R2 metric can misrepresent the strength of the model. It does not measure goodness-of-fit and can be artificially low when the model is entirely correct. If the standard deviation of the data is large, this will drive R2 towards zero, even when the model is correctly specified.

● The R2 metric can be artificially high even when the model is wrong. This often occurs when the data is clearly non-linear, as demonstrated by a scatter plot of the data. A straight line through the data might capture many observations and produce a high R2 value, but the most appropriate model would be a non-linear approach.

● If you change the range of your predictor variables, it can have a dramatic effect on the R2 value.

A key component of the regression model building process is that the model is trying to reduce the RSS value of the training data set. It does this by adjusting the coefficients of the model and selects the coefficients that result in the lowest training RSS. However, this does not mean that this is the lowest RSS of the testing data set. Therefore, the training RSS and the training R2 cannot be used to select from among a set of models with different numbers of variables.

Instead of using the R2, you can instead use the Adjusted R2. This error metric penalizes the R2 for each additional term added to the model. Therefore, you cannot artificially improve model performance by merely adding more features to your model. The Adjusted R2 requires that each additional feature must add a significant amount of predictive power to the model to overcome an inherent adjustment based on degrees of freedom.

Equation 9.11: R2 Adjusted Equation
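A standard form of the adjustment, with n observations and d predictors, is:

$$\text{Adjusted } R^2 = 1 - \frac{\text{RSS}/(n - d - 1)}{\text{TSS}/(n - 1)}$$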

The d term in the equation represents the number of variables in the model. The Adjusted R2 is maximized when the numerator of Equation 9.11 is minimized. This equation demonstrates that the Adjusted R2 metric is penalized for the inclusion of unwarranted variables in the model.

For most regression designs, SAS provides detailed output on each of the statistics reviewed above. These statistics include graphical representations such as the one shown in Figure 9.2.

Figure 9.2: Fit Criteria Graphics

The graphical representation provides an easy-to-understand overview of each of the model fit statistics. These graphics show us that for the R2 and Adjusted R2 higher values are better, and for the remaining fit metrics, lower values are better. SAS also includes an indicator (a star on the graph) that shows where the number of predictors generates the lowest RSS value.

Variance Inflation Factor

The variance inflation factor (VIF) measures how much the variance of an estimated regression coefficient is increased because of collinearity with the other independent variables in the model. The VIF is produced with the VIF option in the REG procedure. Although the issue of where to set the threshold for the VIF varies by project, a good rule of thumb is that the VIF for each of the variables in the model should be less than 5.
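In standard notation, the VIF for the jth predictor is computed from the R2 obtained by regressing that predictor on all of the other predictors in the model:

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$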

Program 9.4 shows the VIF option included in the PROC REG model.

Program 9.4: Variance Inflation Factor Metric in a PROC REG Model

PROC REG DATA=train_final;
    MODEL price_log = r_entire log_accom n_manhattan 
           p_group1 log_bath poly_accom / SELECTION=STEPWISE
    SLE=0.1 SLS=0.1 INCLUDE=0 COLLIN VIF;
RUN;

The standard output of the PROC REG model includes the Parameter Estimates table. Since we included the VIF option in the program, the last column contains the VIF values.

Output 9.5: PROC REG Parameter Estimate Output

Classification Models

Data scientists who are new to the field often make a common mistake when evaluating the performance of a classification model. That is, they examine only the classification accuracy or misclassification rate of the training and test data sets. Although this metric is important, it can provide you with a false sense of accomplishment with misleading scores. For example, if you have an unbalanced data set where the event rate is only 1.3% of your total observations, it is very easy to get a low misclassification rate by simply predicting that all the observations are a non-event. In this case, your accuracy rate would be 98.7%. In many cases, models built on unbalanced data sets result in a bias towards the majority class.

A better approach is to create a balanced data set where there is an equal number of events and non-events for your training data set. If your data set is large enough, you can simply pull all the events and take a random sample of the non-events at a sample size equivalent to the event population. If your population is not large enough, you can create a bootstrapped data set by sampling with replacement to increase the size of your population.

Remember that the train/test split must occur before the creation of a balanced data set. This is because you will build your model on the balanced training data set, but its final accuracy assessment must be determined by the hold-out testing data set that contains the true event rate.
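As a minimal sketch of this balancing approach (assuming a training data set named TRAIN with a binary target variable named bad; the data set and variable names here are illustrative, not taken from the text), PROC SURVEYSELECT can draw the random sample of non-events:

/* Split the training data into events and non-events */
DATA events non_events;
	SET train;
	IF bad = 1 THEN OUTPUT events;
	ELSE OUTPUT non_events;
RUN;

/* Count the events so that the non-event sample matches that size */
PROC SQL NOPRINT;
	SELECT COUNT(*) INTO :n_events TRIMMED FROM events;
QUIT;

/* Simple random sample of non-events, equal in size to the event population */
PROC SURVEYSELECT DATA=non_events OUT=non_events_sample
	METHOD=SRS SAMPSIZE=&n_events SEED=42;
RUN;

/* Combine the events and the sampled non-events into a balanced training set */
DATA train_balanced;
	SET events non_events_sample;
RUN;

The SEED= option makes the sample reproducible, which is helpful when the balanced data set needs to be recreated for a model review.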

Classification Accuracy and Misclassification Rate

Classification accuracy is simply the ratio of the number of correct predictions to the total number of observations. Equation 9.12 shows this calculation.

Equation 9.12: Classification Accuracy
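Written out, with TP, TN, FP, and FN denoting the true positive, true negative, false positive, and false negative counts defined in the classification matrix later in this chapter:

$$\text{Classification Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$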

The misclassification rate represents the ratio of incorrect predictions to the total number of observations. Equation 9.13 shows this calculation.

Equation 9.13: Misclassification Rate
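Using the same counts, the misclassification rate is the complement of the classification accuracy:

$$\text{Misclassification Rate} = \frac{FP + FN}{TP + TN + FP + FN} = 1 - \text{Classification Accuracy}$$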

On a balanced data set, these metrics can provide a good overview of your model performance.

Classification Matrix and Error Rate

A classification matrix (also called a confusion matrix) is a table of information that shows a 2X2 matrix of the actual number of observations for a given group and the predicted number of observations that were classified for a given group. These values allow the researcher to determine the number of predictions that the model got right and the number of predictions that the model got wrong.

If the actual value for an observation is 1 and the predicted value for an observation is categorized as 1, then it is an accurate prediction. However, when the actual value is 0, and the model predicts that it is 1, then it is an incorrect prediction. This holds true in the opposite direction also.

Figure 9.3 displays the classification matrix.

Figure 9.3: Classification Matrix

 

                     Predicted Positive              Predicted Negative
Actual Positive      TP                              FN                              Condition Positive
Actual Negative      FP                              TN                              Condition Negative
                     Predicted Condition Positive    Predicted Condition Negative    Total Population

Classification Accuracy = (TP + TN) / Total Population
Misclassification Rate = (FP + FN) / Total Population
Recall (Sensitivity) = TP / Condition Positive
Precision = TP / Predicted Condition Positive
F1 Score = 2 / ((1/Recall) + (1/Precision))

Figure 9.3 shows that the classification matrix can be used to develop critical model evaluation metrics for classification models. These metrics tell us more than the overall error rate.

● The False Positive (FP) cell is also known as the Type I Error.

● The False Negative (FN) cell is also known as the Type II Error.

● The True Positive Rate is also referred to as Recall or Sensitivity. This value answers the question, “when the value is actually positive, how often does the model predict that it is positive?”

● The False Positive Rate is calculated as (1- Specificity). This value answers the question, “when the value is actually negative, how often does the model predict that it is positive?”

● The Positive Predictive Value is also referred to as Precision. This value answers the question, “when the value is predicted to be positive, how often is the value actually positive?”

● The F1 Score is the harmonic mean of Precision and Recall.

Receiver Operating Characteristic (ROC) and Area Under the Curve (AUC)

It is a common approach in classification models to create a probability threshold of 0.5 and judge any observation with a value greater than or equal to the threshold as positive, and any observation with a value less than the threshold as negative. It is a reasonable place to start; however, there is no hard and fast rule that your threshold must be 0.5. We can choose to set the threshold at any value between 0 and 1. With all the possible choices, how do we determine which point is the optimal place to set our threshold?

Figure 9.4: ROC Curve

The ROC curve is a visual representation of all the possible choices between 0 and 1. This graphic displays the True Positive Rate (sensitivity) on the y-axis and the False Positive Rate (1-specificity) on the x-axis. Figure 9.4 shows a sample ROC graphic.

Both the x- and y-axes range from 0 to 1. The diagonal line that bisects the graph represents points where the True Positive Rate equals the False Positive Rate. Any point on this line means that the model is doing no better than random guessing (a coin flip): it flags the same proportion of actual positives and actual negatives as positive.

The curved line represents the model’s probability scores at various thresholds. Therefore, the ROC is a probability curve. The optimal threshold point can be determined where the curve comes closest to the upper left-hand corner of the graph. This location is the balance point where the True Positive Rate is maximized while the False Positive Rate is minimized.

The AUC is the area under the ROC’s probability curve. This metric tells us the degree or measure of separability. The higher the AUC, the better the model is at separating the positives from the negatives.

Gini Score

The Gini score (also called the Somers’ D score) is calculated as 2 × AUC − 1. This formula provides us with a single metric that represents how well a binary classifier model can separate the classes.

Depending on the subject matter that you are studying, the Gini score can be interpreted in different ways. A score of 40 may be weak for a marketing model, but it may be strong for a clinical model. For my own purposes of building marketing and risk models, I developed Table 9.4 that provides a very broad rule of thumb to interpret your Gini score results.

Table 9.4: Model Strength Table

Gini Score         Model Strength
0 to 20            Not a model
20 to 30           Very weak model
31 to 40           Mildly predictive
41 to 60           Intermediately predictive
61 to 70           Strongly predictive
71 to 80           Very strong model
Greater than 80    Suspiciously strong - check for data bleed

Lift Table

Lift tables provide additional detail on the information that is summarized in the ROC chart. Remember that every point along the ROC curve is a probability threshold. The lift table provides detailed information for every point along the ROC curve.

The model evaluation macro that I have referenced throughout Chapters 7 and 8 was developed by Wensui Liu. This easy-to-use macro is labeled “separation” and can be applied to any binary classification model output to evaluate the model results.

I have placed this macro on my C drive, and I call it with a %INCLUDE statement. Program 9.5 shows the application of this macro.

Program 9.5: Lift Table Evaluation Macro

%INCLUDE 'C:/Users/James Gearheart/Desktop/SAS Book Stuff/Projects/separation.sas';
%separation(data = TEST_SCORE, score = P_1, y = bad);

The first line simply states where I placed the macro and the name of the macro. The actual macro is called with the %separation statement followed by specifications of your data set that you want to evaluate.

● The DATA statement allows you to specify the name of the data set that you want to evaluate.

● The SCORE statement allows you to specify the name of the predicted probability variable.

● The Y statement allows you to specify the name of the target variable.

With these three simple statements, the macro will develop a detailed analysis of the separation power and accuracy of your scored data set. Output 9.6 shows the lift table generated by this macro.

Output 9.6: Lift Table Macro Output

GOOD BAD SEPARATION REPORT FOR P_1 IN DATA TEST_SCORE

(AUC STATISTICS = 0.7097, GINI COEFFICIENT = 0.4193, DIVERGENCE = 0.5339 )

 

MIN       MAX       GOOD      BAD      TOTAL     ODDS     BAD        CUMULATIVE   BAD        CUMU. BAD
SCORE     SCORE     #         #        #                  RATE       BAD RATE     PERCENT    PERCENT
0.342     0.7104     1,689    1,088     2,777     1.55    39.18%     39.18%       22.18%      22.18%
0.2708    0.342      1,947      830     2,777     2.35    29.89%     34.53%       16.92%      39.10%
0.2239    0.2708     2,112      665     2,777     3.18    23.95%     31.00%       13.56%      52.66%
0.1874    0.2239     2,196      581     2,777     3.78    20.92%     28.48%       11.85%      64.51%
0.1562    0.1874     2,288      490     2,778     4.67    17.64%     26.31%        9.99%      74.50%
0.1284    0.1562     2,384      393     2,777     6.07    14.15%     24.29%        8.01%      82.51%
0.1021    0.1284     2,438      339     2,777     7.19    12.21%     22.56%        6.91%      89.42%
0.0766    0.1021     2,548      229     2,777    11.13     8.25%     20.77%        4.67%      94.09%
0.0533    0.0766     2,581      196     2,777    13.17     7.06%     19.25%        4.00%      98.08%
0.0142    0.0533     2,683       94     2,777    28.54     3.38%     17.66%        1.92%     100.00%
0.0142    0.7104    22,866    4,905    27,771

The report also draws a vertical BAD-to-GOOD arrow along its right-hand side, indicating that the bins run from the worst-scoring decile at the top to the best-scoring decile at the bottom.

This lift table was constructed with ten bins of data sorted by the maximum score. It is also common to specify twenty bins for even more detailed analysis. Although it is a common practice to set your threshold at the midpoint of the range of scores, the lift table allows you to make more strategic decisions by selecting a maximum score value based on your business strategy.

For example, if this was a credit risk model and I wanted to minimize my risk while not excluding good customers, I might select the maximum score that would choose the top two bins (max score = 0.342). Since there are ten bins, I am choosing the top 20% of the population sorted by the probability of their account going delinquent. The column labeled “Cumu. Bad Percent” shows that we would capture 39.10% of delinquent accounts.

If we examine the Gini score in the header information, we can see that the model has a Gini score of only 41.93. This is not a very strong risk model, so the “goods” and “bads” are not separated very well. Some risk models are very strong and have good separation of the good and bad accounts. The lift table for these models can help you determine thresholds where you can capture over 70% of predicted delinquent accounts in the top two deciles.

Population Stability Index (PSI)

The Population Stability Index compares a variable’s distribution in one time period to the same variable’s distribution in a different time period. This metric is valuable because it is important to know whether the distribution of the variable is changing over time. If it is changing over time, then the model that you developed will not be very effective in a different time period. This evaluation metric tries to determine whether the underlying data is stable across time periods or varies significantly.

After you have developed separate data sets for modeling and Out-Of-Time (OOT) based on distinct time periods, the PSI is calculated according to these steps:

1. Use PROC RANK to divide the numeric variables into an equal number of bins for both the modeling and OOT data sets. This is generally set to ten bins to create decile categories (a brief code sketch follows this list).

2. Calculate the percentage of records for each bin in the modeling sample.

3. Obtain the cutoff points for the intervals in the modeling sample.

4. Apply the cutoff points to the OOT sample.

5. Create the distributions for the OOT sample.

6. Calculate PSI from the statistics gathered from the modeling and OOT data sets.
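As a minimal sketch of steps 1 and 2 (assuming a modeling data set named MODEL_DATA and the variable acc_open_past_24mths used in the example below; the data set name is illustrative), PROC RANK and PROC FREQ can produce the decile bins and their percentages:

/* Step 1: divide the variable into ten decile bins */
PROC RANK DATA=model_data GROUPS=10 OUT=model_ranked;
	VAR acc_open_past_24mths;
	RANKS bin;
RUN;

/* Step 2: percentage of records in each bin of the modeling sample */
PROC FREQ DATA=model_ranked;
	TABLES bin / NOCUM;
RUN;

The same binning, using the cutoff points from the modeling sample, is then applied to the OOT sample before the PSI is calculated.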

Table 9.5 shows an example PSI table.

Table 9.5: Population Stability Index Table

This table was created for the variable acc_open_past_24mths. I had created two separate data sets that were specified by time period. The modeling data set contains records from Jan to Jun 2015 and the OOT data set contains records from Jul to Dec 2015.

The PSI is calculated by Equation 9.14.

Equation 9.14: Population Stability Index Equation
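The standard formulation sums, across the B bins, the difference between the share of records in the OOT sample and the share in the modeling sample, weighted by the natural log of their ratio:

$$\text{PSI} = \sum_{i=1}^{B}\left(\%\text{OOT}_i - \%\text{Model}_i\right)\,\ln\!\left(\frac{\%\text{OOT}_i}{\%\text{Model}_i}\right)$$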

For example, the top row PSI can be calculated as:

The PSI contributions of the individual bins are summed to create a single PSI metric for the variable. This metric can be interpreted with the following rule of thumb:

<0.1: Very slight change

0.1-0.2: Some minor change

>0.2: Significant change

Figure 9.5 shows a graphical demonstration of the PSI output for the example above. The overall PSI for the variable acc_open_past_24mths is 0.001453. This is far below the range that would indicate significant change, so we can determine that this variable is stable across the modeling and OOT time frames.

Figure 9.5: PSI Macro Output

The table of information and the output graphic have been generated from an excellent macro created by Alec Zhixiao Lin. I have included this macro in the GitHub repository, and I have labeled it “PSI Macro.” For further information on the population stability index and the development of this macro, check out Alec’s excellent paper titled, “Examining Distributional Shifts by Using Population Stability Index (PSI) for Model Validation and Diagnosis.”

Black-Box Evaluation Tools

Black-box models are models where we cannot say precisely how a value was generated. This lack of insight could be because of randomization within the model (such as a random forest), a shifting target value and constant rebalancing of coefficients (such as gradient boosting), complex hidden layers of non-linear functions with changing weights (such as neural networks), or several other factors.

Both academic and organizational model review boards must pay particular attention to black-box models because of these mysterious processes that occur within the model development period. In order to understand the contributing factors and the importance of these factors, we must be able to develop evaluation approaches that allow us to understand how the final output value is determined.

Variable Importance

Although we may not be able to determine the exact effect that a variable has on the development of an output variable (such as coefficient weights in a regression model), we can still create an importance metric that tells us the relative importance of each variable in a black-box model.

Variable Importance tables are commonly created as output from tree-based models. This output is a great help when evaluating the variables in random forests and gradient boosting algorithms.

Table 9.6 shows us the Variable Importance table. This table provides us critical information on the power of each variable in the model and its relative importance to predicting the target variable. It is important to remember that the most critical variables might not be at the top of the tree. Variable importance is determined through four metrics:

Count – The number of times a variable was used as a splitting rule.

Surrogate Count – The number of times a variable was used in a surrogate splitting rule.

Reduction in RSS – The change in the residual sum of squares when a split is found at a node.

Relative Importance – The individual variable RSS-based importance divided by the maximum RSS importance.

Table 9.6: Variable Importance Table

The variable importance can be interpreted as each predictor’s ranking based on the contribution that the predictors make to the model. A relative importance of 0.000 can be interpreted to mean that since this variable was never used to split a node, it does not contribute to the predictive value of the model.

Variable importance can be used as a variable reduction technique. Since the table provides a ranked order of variables based on their usefulness, a data scientist can make selections on which variables to retain for a more parsimonious model based on their relative value.

Partial Dependence (PD) Plots

One of the most critical questions that we will need to answer to any model reviewer is, “what is the relationship between the model inputs and the prediction?” This question is easily answered with transparent models such as regression models and simple decision trees, but it becomes much more difficult to answer for black-box models. We know what the inputs are and because of the variable importance chart, we know which variables are the strongest drivers of the model prediction, but we still need to be able to express the actual relationship between the model inputs and the predicted value.

Partial Dependence Plots are graphs that depict the functional relationship between a model’s inputs and its output. This plot can show whether the relationship between a single variable and the prediction is a linear relationship or a step function or some other kind of relationship.

The PD plot attempts to explain how the model’s predictions vary depending on the value of the inputs. These plots are called “model agnostic” because they can be used across a wide variety of model designs. The PD plot is an ad hoc method of model interpretation because it shows how the model behaves when the inputs are changed, but it does not evaluate how the model creates the predictions.

The PD plot simply plots the average predicted value for each value of an input variable. This plot is a simple chart that visually demonstrates the relationship between an input variable and the prediction.

An excellent explanation and accompanying SAS code has been provided by Ray Wright, a Principal Machine Learning Developer at SAS, in his paper “Interpreting Black-Box Machine Learning Models Using Partial Dependence and Individual Conditional Expectation Plots.” This paper and the accompanying macros can be found in my GitHub repository.

Let’s develop an example using Ray’s PD plot macro titled “%PDFunction.” The example will be developed on the MYDATA.BANK_TRAIN data set that we used in Chapter 8.

The PD plot allows us to not only examine the relationship between a predictor and the output, but it can also allow us to compare the difference between model designs. For example, Program 9.6 develops a simple decision tree on the MYDATA.BANK_TRAIN data set with a small number of predictors for demonstration purposes.

Once you have downloaded the macro from either Ray’s paper or through my GitHub site and you have run the macro, you can apply the %PDFunction macro with the appropriate specifications.

Program 9.6: Partial Dependence Macro Example

proc hpsplit data=MYDATA.BANK_TRAIN leafsize = 10;
       target target / level = interval;
       input age pdays pout_succ contact_tele 
pout_non contact_cell previous / level = int;
       code file='C:/Users/James Gearheart/Desktop/SAS Book Stuff/Data/treecode.sas';
run;
%PDFunction( dataset=MYDATA.BANK_TRAIN, target=target, PDVars=age, 
otherIntervalInputs=pdays pout_succ contact_tele pout_non contact_cell previous, otherClassInputs=, 
scorecodeFile='C:/Users/James Gearheart/Desktop/SAS Book Stuff/Data/treecode.sas', outPD=partialDependence );
proc sgplot data=partialDependence;
       series x = age y = AvgYHat;
run;

The %PDFunction uses the stored modeling code and creates a PD output data set of the variables that you defined in the macro. The final step is to apply a line plot to visualize the relationship developed in the PD output.

Output 9.7 shows the relationship between the variable “age” and the target variable. We can see that the relationship is a step function.

Output 9.7: Partial Dependence Plot

In order to compare model designs, we can create a new model and apply the PD plot macro to it so we can examine the relationship between the same two variables, but defined through a new model.

Program 9.7 develops a neural network model on the MYDATA.BANK_TRAIN data set with the same input variables that we used in the decision tree example. The model output code is input to the %PDFunction macro, and a line chart is developed to show the relationship between the same input variable and the target variable.

Program 9.7: Partial Dependence Macro Example

PROC HPNEURAL DATA=MYDATA.BANK_TRAIN;
   ARCHITECTURE MLP;
   INPUT age pdays pout_succ contact_tele 
   pout_non contact_cell previous;
   TARGET target / LEVEL=nom;
   HIDDEN 10;
   HIDDEN 5;
   TRAIN;
   SCORE OUT=scored_NN;
   CODE FILE='C:/Users/James Gearheart/Desktop/SAS Book Stuff/Data/NN_Model.sas';
run;
%PDFunction( dataset=mydata.bank_train, target=target, PDVars=age, 
otherIntervalInputs= pdays pout_succ contact_tele pout_non contact_cell previous, otherClassInputs=, 
scorecodeFile='C:/Users/James Gearheart/Desktop/SAS Book Stuff/Data/NN_Model.sas', outPD=partialDependence );
proc sgplot data=partialDependence;
       series x = age y = AvgYHat;
run;

Output 9.8 shows that the neural network model develops a smooth relationship between the two variables. The overall direction of the relationship is similar for both models. The average target value is at a midpoint for lower age ranges, then it decreases significantly for the middle age ranges, and then it increases significantly for older age ranges. Both models and their associated PD plots tell the same story, but we can easily see that the functional relationship is different between the models.

Output 9.8: Partial Dependence Plot

Individual Conditional Expectation (ICE) Plots

A concept that is closely associated with the partial dependency plots is Individual Conditional Expectation plots. The ICE plots allow a data scientist to dig deeper into the relationship between variables and discover subgroups and interactions between model inputs.

The PD plots examined the average value of the target variable and compared it to a given variable’s values. The ICE plot drills down to the individual observation level. Individual observation plots can be overwhelming to look at. For a data set of even a hundred observations, examining a graph with one hundred different line plots would be challenging to comprehend. That is why ICE plots either sample or cluster individual observations.

In the paper by Ray Wright referenced in the Partial Dependence section above, Ray also provides a macro for the development of ICE plots (thanks, Ray!). This macro can also be found in the GitHub repository for this book.

Program 9.8 shows that the %ICEPlot macro is a continuation of the %PDFunction macro. Once you have run the %PDFunction macro, the %ICEPlot macro will take the same data set, input features, and model prediction as inputs to the development of the ICE plot.

Program 9.8: Individual Conditional Expectation Macro Call

%ICEPlot( ICEVar=age, samples=10, YHatVar=p_target1 );

The %ICEPlot macro can be called with only a few statements. You need to specify only the plot variable, the number of observations to sample, and the prediction variable.

Output 9.9 shows the output of the ICE plot created from the neural network model.

Output 9.9: Individual Conditional Expectation Plot

We can see that although we specified ten individual observations, there are only four represented in the chart. This difference is because some observations overlap with one another. This chart shows us that the predictive value is influenced by age in some cases, but the top line indicates that there are observations where age does not affect the predicted value.

If we were to increase the number of samples to fifty, we would get a chart like the one in Output 9.10.

Output 9.10: Individual Conditional Expectation Plot

This chart clearly shows that there is a distinction between observations where the predicted value is a function of age and observations where the predicted value is not influenced by age.

Chapter Review

This chapter might be the most important in this book. This is because evaluation metrics are a way of holding ourselves accountable for the data science products that we produce. Properly developed evaluation metrics shine a light on all of our false assumptions, data mistakes, and biases. If the appropriate evaluations are made at each stage of the model development process, then there should be no surprises at the end of the project.

We can often tell how well a model will fit the data simply by visualizing the distribution of the data. We can also create histograms of the continuous target variables and frequency distributions of the binary target variables to gain some insight into how well a model will perform. We can look at scatter plots and bar charts of predictor variables and see how strongly they are associated with the target variable. These evaluations are performed before any modeling has occurred.

At this point, you should have a good idea of the strength of the relationship between the predictors and the target variable. Strong associations suggest that a predictive model will produce accurate predictions while weak associations suggest that a predictive model will produce poor predictions.

The next phase of the model development is the initial construction of the model. Through evaluation metrics such as the AIC or the BIC, we can determine the influence of the predictors on the model outcome for regression models, and for classification models, we can gain the same insight through the Variable Importance metrics. Both of these techniques, among others, can provide us with insight into how the target variable relates to the predictors within the confines of the modeling algorithm. The strength of the relationship between the target and the predictors can again be evaluated at this stage of model development.

Finally, the model that was developed on the TRAIN data set is applied to the hold-out TEST data set. The evaluation metrics that are applied to the hold-out TEST data set will determine your overall model fit. These metrics are often the RMSE for regression models and the Gini for classification models. At this point, you should have developed a good intuition of how well the model evaluation metrics will look. There is little chance that you would have seen weak relationships between the target and the predictors at each stage of the model development and somehow end up with a very strong model.

This chapter covered a variety of model evaluation techniques. It is important to align the proper technique with the associated model structure. You wouldn’t want to evaluate a continuous regression model with a Gini metric, and you wouldn’t want to evaluate a binary classifier with the RMSE metric. By understanding how each of the evaluation metrics is constructed, you can have a deeper understanding of how and when to apply each of these metrics.
