Chapter 9. Reporting and Testing – Iterating on Analytic Systems

In previous chapters we have considered many components of an analytical application, from the input data set to the choice of algorithm and tuning parameters, and even illustrated a potential deployment strategy using a web server. In this process, we considered parameters such as scalability, interpretability, and flexibility in making our applications robust to both later refinements of an algorithm and changing requirements of scale. However, these sorts of details miss the most important element of this application: your business partners who hope to derive insight from the model and the continuing needs of the organization. What metrics should we gather on the performance of a model to make the case for its impact? How can we iterate on an initial model to optimize its use for a business application? How can these results be articulated to stakeholders? These sorts of questions are key in conveying the benefit of building analytical applications for your organization.

Just as we can use increasingly larger data sets to build predictive models, automated analysis packages and "big data" systems are making it easier to gather substantial amounts of information about the behavior of algorithms. Thus, the challenge becomes not so much whether we can collect data on an algorithm or how to measure its performance, but choosing which statistics are most relevant to demonstrating value in the context of a business analysis. To equip you with the skills to better monitor the health of your predictive applications, improve them through iterative milestones, and explain these techniques to others, in this chapter we will:

  • Review common model diagnostics and performance indicators.
  • Describe how A/B testing may be used to iteratively improve upon a model.
  • Summarize ways in which insights from predictive models can be communicated in reports.

Checking the health of models with diagnostics

Throughout the previous chapters, we have primarily focused on the initial steps of predictive modeling, from data preparation and feature extraction to optimization of parameters. However, it is unlikely that our customers or business will remain unchanging, so predictive models must typically adapt as well. We can use a number of diagnostics to check the performance of models over time, which serve as a useful benchmark to evaluate the health of our algorithms.

Evaluating changes in model performance

Let us consider a scenario in which we train a predictive model on customer data and evaluate its performance on a set of new records each day for a month afterward. If this were a classification model, such as predicting whether a customer will cancel their subscription in the next pay period, we could use a metric such as the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve that we saw previously in Chapter 5, Putting Data in its Place – Classification Methods and Analysis. Alternatively, in the case of a regression model, such as predicting average customer spend, we can use the R² value or the average squared error:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

Here $y_i$ is the observed value and $\hat{y}_i$ the model's prediction for the i-th of n records. Either statistic lets us quantify performance over time. If we observe a drop in one of these statistics, how can we analyze further what the root cause may be?

In the graph above, we show such a scenario, where we measured the AUC for a hypothetical ad-targeting algorithm for 30 days after initial training by quantifying how many of the targeted users clicked on an ad sent in an e-mail and visited the website for our company. We see AUC begin to dip on day 18, but because the AUC is an overall measure of accuracy, it is not clear whether all observations are being poorly predicted or only a subpopulation is leading to this drop in performance. Thus, in addition to measuring overall AUC, we might think of calculating the AUC for subsets of data defined by the input features. In addition to providing a way of identifying problematic new data (and suggesting when the model needs to be retrained), such reports provide a way of identifying the overall business impact of our model.

As an example, for our ad-targeting algorithm, we might look at overall performance and compare that to one of the labels in our data set: whether the user is a current subscriber or not. It may not be surprising that the performance on subscribers is usually higher, as these users were already likely to visit our site. The non-subscribers, who may never have visited our site before, represent the real opportunity in this scenario. In the preceding graph, we see that, indeed, the performance on non-subscribers dropped on day 18. However, this does not necessarily tell the whole story: we still do not know why the performance on non-subscribers is lower. We can subset the data again and look for a correlated variable. For example, if we looked along a number of ad IDs (which correspond to different images displayed to a customer in an e-mail), we might find that the performance dip is due to one particular ad (please refer to the following graph). Following up with our business stakeholders, we might find that this particular ad was for a seasonal product and is only shown every 12 months. Therefore, the product was familiar to subscribers, who may have seen it before, but not to non-subscribers, who were unfamiliar with the item and did not click on it. We might be able to confirm this hypothesis by looking at subscriber data and seeing whether performance of the model also dips for subscribers with tenure less than 12 months.
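The segment-level comparison described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code behind the graphs: the record fields (`subscriber`, `score`, `clicked`) and the sample values are hypothetical, and the pairwise AUC calculation is quadratic in the number of records, so a library routine such as scikit-learn's `roc_auc_score` would be the more typical choice in practice.

```python
from collections import defaultdict

def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Returns None if only one class is present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def auc_by_segment(records, segment_key):
    """Group records by the value of one feature and compute AUC per group."""
    groups = defaultdict(lambda: ([], []))
    for r in records:
        labels, scores = groups[r[segment_key]]
        labels.append(r["clicked"])
        scores.append(r["score"])
    return {seg: auc(l, s) for seg, (l, s) in groups.items()}

# Hypothetical scored records for a single day (assumed field names)
records = [
    {"subscriber": True,  "score": 0.9, "clicked": 1},
    {"subscriber": True,  "score": 0.8, "clicked": 1},
    {"subscriber": True,  "score": 0.3, "clicked": 0},
    {"subscriber": True,  "score": 0.2, "clicked": 0},
    {"subscriber": False, "score": 0.7, "clicked": 0},
    {"subscriber": False, "score": 0.6, "clicked": 1},
    {"subscriber": False, "score": 0.4, "clicked": 1},
    {"subscriber": False, "score": 0.1, "clicked": 0},
]

overall = auc([r["clicked"] for r in records], [r["score"] for r in records])
per_segment = auc_by_segment(records, "subscriber")
print(overall, per_segment)
```

In this toy data the overall AUC looks healthy while the non-subscriber segment is no better than chance, which is the kind of gap an overall statistic can hide. Running the same function with a different `segment_key` (an ad ID, for instance) repeats the analysis along another feature.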

This sort of investigation can then lead us to ask how to optimize this particular advertisement for non-subscribers, but it can also indicate ways to improve our model training. In this scenario, it is likely that we trained the algorithm on a simple random sample of event data, which is biased toward current subscribers: subscribers are more active, and thus both produce more impressions (as they may have registered for promotional e-mails) and are more likely to have clicked on ads. To improve our model, we might want to balance our training data between subscribers and non-subscribers to compensate for this bias.
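One simple way to balance the two groups without discarding data is to weight each record inversely to the size of its group. The following sketch assumes hypothetical group labels and an 80/20 split. Many estimators accept such per-record weights when fitting, though whether a particular model does should be checked in its documentation.

```python
from collections import Counter

def balancing_weights(groups):
    """Weight each record so that every group carries the same total weight."""
    counts = Counter(groups)
    n_groups = len(counts)
    total = len(groups)
    return [total / (n_groups * counts[g]) for g in groups]

# Hypothetical training sample dominated by subscribers
groups = ["subscriber"] * 8 + ["non_subscriber"] * 2
weights = balancing_weights(groups)

# Total weight per group after balancing
totals = Counter()
for g, w in zip(groups, weights):
    totals[g] += w
print(dict(totals))
```

Resampling (undersampling subscribers or oversampling non-subscribers) is an alternative that achieves a similar effect at the cost of either discarding or duplicating records.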

In this simple example, we were able to diagnose the problem by examining performance of the model in only a small number of sub-segments of data. However, we cannot guarantee that this will always be the case, and manually searching through hundreds of variables will be inefficient. Thus, we might consider using a predictive model to help narrow down the search space. For example, consider using a gradient boosted machine (GBM) from Chapter 5, Putting Data in its Place – Classification Methods and Analysis, with the inputs being the same data we used to train our predictive model, and the output being the error of that model on each record (either a label of 1 or 0 indicating a misclassification for a classification model, or a continuous error such as squared error for a regression model or log loss for a model that outputs probabilities). We now have a model that predicts errors in the first model. Using a method such as a GBM allows us to examine systematically a large number of potential variables and use the resulting variable importance to pinpoint a smaller number of hypotheses.
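A minimal sketch of this "model of the errors" idea follows, using scikit-learn's `GradientBoostingClassifier`. Everything about the data is an assumption made for illustration: the feature names are hypothetical, and the misclassification flag is simulated so that errors concentrate on one ad ID, rather than taken from a real first-stage model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.RandomState(0)
n = 2000

# Hypothetical input features of the original model
feature_names = ["tenure_months", "ad_id", "hours_per_week", "is_subscriber"]
X = np.column_stack([
    rng.randint(0, 48, n),   # tenure_months
    rng.randint(0, 5, n),    # ad_id
    rng.uniform(0, 20, n),   # hours_per_week
    rng.randint(0, 2, n),    # is_subscriber
])

# Simulated flag: 1 if the first model was wrong on this record.
# A small background error rate, plus many errors when ad_id == 3.
base_error = rng.uniform(size=n) < 0.05
ad_error = (X[:, 1] == 3) & (rng.uniform(size=n) < 0.6)
misclassified = (base_error | ad_error).astype(int)

# Second-stage model: predict where the first model fails
error_model = GradientBoostingClassifier(
    n_estimators=50, max_depth=2, random_state=0
)
error_model.fit(X, misclassified)

# Rank features by how much they help explain the errors
ranked = sorted(
    zip(feature_names, error_model.feature_importances_),
    key=lambda kv: kv[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The ranking only suggests where to look. A highly ranked feature is a hypothesis to check against the segment-level metrics, not a confirmed root cause.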

Of course, the success of any of these approaches hinges on the variable causing the drop in performance being part of our training set, and on the issue having to do with the underlying algorithm or data. It is also certainly possible to imagine other cases where the problem is caused by a variable we are not using to construct our training data set, such as poor connections on a given Internet service provider that prevent users from clicking through an ad to our webpage, or a system problem such as a failure in e-mail delivery.

Looking at performance by segments can also help us determine if the algorithm is functioning as intended when we make changes. For example, if we reweighted our training data to emphasize non-subscribers, we would hope that the AUC for these customers would improve. If we only examined overall performance, we might observe increases that are improvements on existing customers, but not the effect we actually wished to achieve.

Changes in feature importance

Besides examining the accuracy of models over time, we might also want to examine changes in the importance of different input data. In a regression model, we might examine the important coefficients as judged by magnitude and statistical significance, while in a decision-tree-based algorithm such as a Random Forest or GBM, we can look at measures of variable importance. Even if the model is performing at the same level as described by evaluation statistics discussed previously, shifts in the underlying variables may signal issues in data logging or real changes in the underlying data that are of business significance.

Let us consider a churn model, where we input a number of features for a user account (such as zip code, income level, gender, and engagement metrics such as hours spent per week on our website) and try to predict whether a given user will cancel his/her subscription at the end of each billing period. While it is useful to have a score predicting the likelihood of churn, as we would target these users with additional promotional campaigns or targeted messaging, the underlying features that contribute to this prediction may provide insight for more specific action.

In this example, we generate a report of the 10 most important features in the predictive model each week. Historically, this list has been consistent, with the customer's profession and income being the top variables. However, in one week, we find that income is no longer in this list, and that instead zip code has replaced it. When we check the data flowing into the model, we find that the income variable is no longer being logged correctly; thus, zip code, which is correlated with income, becomes a substitute for this feature in the model, and our regular analysis of variable importance helped us detect a significant data issue.
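A weekly check of this kind can be automated by comparing the sets of top-ranked features between two reports. The sketch below assumes the importances are available as a dictionary of feature name to score; the feature names and values are hypothetical, and the example uses the top 3 rather than the top 10 for brevity.

```python
def top_k(importances, k):
    """Names of the k features with the largest importance."""
    ordered = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ordered[:k]]

def compare_top_features(previous, current, k=10):
    """Report which features left or entered the top-k list."""
    prev_top = set(top_k(previous, k))
    curr_top = set(top_k(current, k))
    return {
        "dropped": sorted(prev_top - curr_top),
        "added": sorted(curr_top - prev_top),
    }

# Hypothetical importance reports from two consecutive weeks
last_week = {"profession": 0.30, "income": 0.25, "hours_per_week": 0.20,
             "zip_code": 0.05, "gender": 0.02}
this_week = {"profession": 0.31, "income": 0.01, "hours_per_week": 0.21,
             "zip_code": 0.24, "gender": 0.02}

changes = compare_top_features(last_week, this_week, k=3)
print(changes)
```

A non-empty `dropped` or `added` list is a prompt to inspect the data feeding those features before drawing any business conclusion.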

What if, instead, the income variable was being logged correctly? In this case, it seems unlikely that zip code is a more powerful predictor than income if the underlying feature both are capturing is a customer's finances. Thus, we might examine whether there are particular zip codes for which churn has changed significantly over the past week. Upon doing so, we find that a competitor recently launched a site with a lower price in certain zip codes. This both explains the rise of zip code as a predictor (customers with the lower price option are more likely to abandon our site) and reveals market dynamics that are of larger interest.

This second scenario also suggests another variable we might monitor: the correlation between variables in our data set. While it is both computationally difficult and practically restrictive to comprehensively consider every pair of variables in large data sets, we can use dimensionality reduction techniques such as Principal Components Analysis described in Chapter 6, Words and Pixels – Working with Unstructured Data, to provide a high-level summary of the correlations between variables. This reduces the task of monitoring such correlations to examination of a few diagrams of the important components, which, in turn, can alert us to changes in the underlying structure of the data.

Changes in unsupervised model performance

The examples we looked at previously all concern a supervised model where we have a target to predict, and we measure performance by looking at AUC or similar metrics. In the case of the unsupervised models we examined in Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning, our outcome is a cluster membership rather than a target. What sort of diagnostics can we look at in this scenario?

In cases where we have a gold-standard label, such as a human-annotated label of spam versus non-spam messages if we are clustering e-mail documents, we can examine whether the messages end up in distinct clusters or are mixed. In a sense, this is comparable to looking at classification accuracy. However, for unsupervised models, we might frequently not have any known label, and the clustering is purely an exploratory tool. We might still use human-annotated examples as a guide, but this becomes prohibitive for larger data sets. Other scenarios, such as sentiment in online media, remain subjective enough that human labels may not significantly enrich labels derived from automated methods such as the LDA topic model we discussed in Chapter 6, Words and Pixels – Working with Unstructured Data. In this case, how can we judge the quality of the clustering over time?

In cases where the number of groups is determined dynamically, such as through the Affinity Propagation Clustering algorithm described in Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning, we can examine whether the number of clusters remains stable over time. In most cases we previously examined, though, the number of clusters is fixed in advance. Thus, we could envision one diagnostic in which we examine the distance between the centers of the nearest clusters between training cycles: for example, with a k-means model with 20 clusters, assign each cluster in week 1 its closest match in week 2 and compare the distribution of the 20 distances. If the clustering remains stable, then the distribution of these distances should as well. Changes could indicate that 20 is no longer a good number to fit the data or that the composition of the 20 clusters is significantly changing over time. We might also examine a value such as the sum of squares error in k-means clustering over time to see if the quality of the obtained clusters is significantly varying.
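The center-matching diagnostic can be sketched as follows. The cluster centers are hypothetical, and three clusters are used instead of 20 to keep the example readable. Note that taking the closest match for each cluster independently, as described above, does not force a one-to-one pairing: two week 1 clusters may share the same nearest week 2 cluster.

```python
def euclidean(a, b):
    """Euclidean distance between two points given as coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_center_distances(old_centers, new_centers):
    """For each old center, the distance to its closest new center."""
    return [
        min(euclidean(old, new) for new in new_centers)
        for old in old_centers
    ]

# Hypothetical k-means centers from two consecutive training cycles
week1 = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
week2 = [(0.0, 1.0), (10.0, 0.0), (3.0, 14.0)]

drift = nearest_center_distances(week1, week2)
print(drift)
```

Tracking a summary of these distances (for example, the median and maximum) from one cycle to the next gives a compact signal that the cluster structure is moving.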

Another quality metric that is agnostic to a specific clustering algorithm is Silhouette analysis (Rousseeuw, Peter J. "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis." Journal of computational and applied mathematics 20 (1987): 53-65). For each data point i in the set, we ask how dissimilar (as judged by the distance metric used in the clustering algorithm) on average it is to other points in its cluster, giving a value d(i). If the point i is appropriately assigned, then d(i) is near 0, as the average dissimilarity between i and other points in its cluster is low. We could also calculate the same average dissimilarity value between i and each of the other clusters; the lowest of these values (corresponding to the second best cluster assignment for i) is given by d'(i). We then obtain a silhouette score between –1 and 1 using the formula:

$$s(i) = \frac{d'(i) - d(i)}{\max\left(d(i),\, d'(i)\right)}$$

If a data point is well assigned to its cluster, then it is much more dissimilar on average to other clusters. Thus, d'(i) (the 'second best cluster for i') is larger than d(i), and the ratio in the silhouette score formula is near 1. Conversely, if the point is poorly assigned to its cluster, then the value of d'(i) could be less than d(i), giving a negative value in the numerator of the silhouette score formula. Values near zero suggest the point could be assigned to either of the two clusters equally well. By looking at the distribution of silhouette scores over a data set, we can get a sense of how well points are being clustered over time.
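To make the definition concrete, here is a small from-scratch sketch of the silhouette calculation using Euclidean distance. The points and labels are hypothetical. The code assumes at least two clusters and at least two distinct points, and it assigns a score of 0 to a point that is alone in its cluster, which is a common convention rather than something stated above. For real data sets, scikit-learn provides `silhouette_samples` and `silhouette_score`.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def silhouette_scores(points, labels):
    """Silhouette score for each point: (d' - d) / max(d, d')."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Sum of distances from p to the members of each cluster,
        # excluding p itself.
        sums, counts = {}, {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i == j:
                continue
            sums[lab] = sums.get(lab, 0.0) + euclidean(p, q)
            counts[lab] = counts.get(lab, 0) + 1
        if own not in counts:
            scores.append(0.0)  # singleton cluster (assumed convention)
            continue
        d_own = sums[own] / counts[own]                 # d(i)
        d_other = min(sums[lab] / counts[lab]           # d'(i)
                      for lab in sums if lab != own)
        scores.append((d_other - d_own) / max(d_own, d_other))
    return scores

# Two well-separated hypothetical groups
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]

scores_good = silhouette_scores(points, [0, 0, 1, 1])
scores_bad = silhouette_scores(points, [0, 1, 1, 1])  # second point misassigned
print(scores_good)
print(scores_bad)
```

With the sensible labeling, every score is close to 1. When the second point is placed in the distant cluster, its score turns negative, matching the interpretation given above.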

Finally, we might use a bootstrap approach, where we rerun the clustering algorithm many times and ask how often two points end up in the same cluster. The distribution of these cluster co-occurrences (between 0 and 1) can also give a sense of how stable the assignment is over time.
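A rough sketch of this bootstrap stability check is given below. It is one possible reading of the idea, under stated assumptions: each run fits a plain k-means (Lloyd's algorithm, written out here to keep the example self-contained) on a resample drawn with replacement, then assigns every original point to its nearest center. The data, the number of runs, and the seed are all hypothetical choices.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))

def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd's algorithm; returns a list of k centers."""
    # Start from k distinct observed points
    centers = rng.sample(sorted(set(points)), k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return centers

def co_occurrence(points, k, n_runs=50, seed=0):
    """Fraction of runs in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    together = [[0] * n for _ in range(n)]
    for _ in range(n_runs):
        sample = [rng.choice(points) for _ in range(n)]  # bootstrap resample
        centers = kmeans(sample, k, rng)
        assign = [nearest(p, centers) for p in points]
        for i in range(n):
            for j in range(n):
                if assign[i] == assign[j]:
                    together[i][j] += 1
    return [[t / n_runs for t in row] for row in together]

# Two tight, well-separated hypothetical groups of 10 points each
group_a = [(0.5 * i, 0.5 * j) for i in range(5) for j in range(2)]
group_b = [(20 + 0.5 * i, 20 + 0.5 * j) for i in range(5) for j in range(2)]
points = group_a + group_b

freq = co_occurrence(points, k=2)
print(freq[0][1], freq[0][10])  # same-group pair vs. cross-group pair
```

Pairs whose frequency sits near 0 or 1 are assigned consistently. Values in between mark points whose cluster membership depends on the particular sample, and a growing share of such pairs over time suggests the clustering is becoming less stable.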

Like clustering models, dimensionality reduction techniques also do not lend themselves easily to a gold standard by which to judge model quality over time. However, we can take values such as the principal components vectors of a data set and examine their pairwise dissimilarity (for example, using the cosine score described in Chapter 3, Finding Patterns in the Noise – Clustering and Unsupervised Learning) to determine if they are changing significantly. In the case of matrix decomposition techniques, we could also look at the reconstruction error (for example, averaged squared difference over all matrix elements) between the original matrix and the product of the factored elements (such as the W and H matrices in nonnegative matrix factorization).
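Both diagnostics reduce to short calculations, sketched here in plain Python. The component vectors and matrices are hypothetical toy values. One practical detail worth noting: the sign of a principal component is arbitrary, so when comparing components between training cycles it is usually the absolute value of the cosine similarity that matters.

```python
def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def reconstruction_error(X, W, H):
    """Mean squared difference between X and the matrix product of W and H."""
    n_rows, n_cols, rank = len(X), len(X[0]), len(H)
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            approx = sum(W[i][r] * H[r][j] for r in range(rank))
            total += (X[i][j] - approx) ** 2
    return total / (n_rows * n_cols)

# Hypothetical first principal component from two training cycles
pc_week1 = [0.6, 0.8]
pc_week2 = [0.8, 0.6]
similarity = abs(cosine_similarity(pc_week1, pc_week2))

# Hypothetical rank-1 factorization of two small matrices
W = [[1.0], [2.0]]
H = [[1.0, 2.0]]
exact = reconstruction_error([[1.0, 2.0], [2.0, 4.0]], W, H)
approximate = reconstruction_error([[1.0, 2.0], [2.0, 5.0]], W, H)
print(similarity, exact, approximate)
```

A similarity that drifts away from 1, or a reconstruction error that rises from one cycle to the next, indicates that the low-dimensional summary learned earlier describes the current data less well.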

