When discussing customer churn prediction in Chapter 3, we explained that by developing and adopting a customer churn prediction model, we can target the customers who are most likely to churn in a retention campaign. As such, the use of a predictive model significantly increases the efficiency and return of a retention campaign by allowing the selection of true would-be churners and the exclusion of nonchurners. The reader may have realized that a further improvement can be achieved by selecting customers who are not only likely to churn but also likely to be retained when targeted in a retention campaign. If we exclude would-be churners who have made up their minds and therefore cannot be retained, a further increase in profitability is achieved.
To this end, we introduce uplift modeling approaches in this chapter, which aim at estimating the net effect of a treatment, such as a marketing campaign, on customer behavior. Uplift models allow users to optimize the selection of customers to include in marketing campaigns, as well as to further customize the campaign design at the individual customer level, for example in terms of the contact channel and the characteristics of the incentive that is offered. Such customization may further increase the effect and return of the campaign.
In the first section of this chapter, we will broadly introduce and motivate the use of uplift modeling as an alternative to the standard predictive analytics discussed earlier in this book. As will be elaborated in the second section, specific data requirements hold for developing uplift models, which may require running dedicated experiments. Next, in the third section, various uplift modeling approaches will be introduced, and subsequently the evaluation of uplift models is discussed. Finally, practical guidelines and a two-step approach toward developing uplift models are discussed, before the conclusions of the chapter are presented.
Predictive analytics are widely adopted for developing response models as introduced in Chapter 3 of this book. Response models aim at predicting which customers are likely to respond. By targeting these customers, the efficiency and expected returns of a marketing campaign are boosted. Remember that various types of responses can be considered when developing response models, e.g., a soft response such as reading or clicking on an advertisement, or a hard response such as purchasing or converting.
Response modeling is used for setting up different types of marketing campaigns, for instance, campaigns aimed at the following (Lo 2002):
The idea is to identify the customers who are most likely to respond and to offer these customers an incentive to effectively convert. Targeting a customer in a marketing campaign is referred to more generally as treating a customer, and a campaign or action toward a customer as a treatment. Typically, not all customers are targeted in a marketing campaign (i.e., are treated), because marketing budgets are limited and including a customer in a campaign comes at a cost—that is, the cost of setting up and developing the campaign and the cost involved in contacting the customer. Contacting the customer can occur through various channels, for example, by mail or email, by telephone, or by sales representatives visiting customers or addressing prospects in-store. Additionally, there is the cost of the incentive that is offered. Examples of incentives include coupons, vouchers, samples, promotional offers, and reductions.
Contacting costs are not necessarily uniform across the target population, although for simplicity they are often assumed to be uniform. For instance, some customers must be called multiple times before they are reached, or sales representatives might be required to travel longer to visit prospective customers living in remote areas. Incentives can also be diversified and customized in terms of contacting channel, type, and value of the incentive, to further optimize the effect of the campaign. Predictive analytics are also used to customize campaign characteristics at the individual customer level, for instance to estimate the preferred channel or the minimum incentive for the campaign to be effective.
However, the use of such traditional response models is suboptimal because these models are developed to estimate gross response rather than net response. Estimating gross response consists of predicting all responders, whereas estimating net response consists of predicting those who will respond only when treated. In other words, traditional response models do not allow for distinguishing between customers who respond or convert because they were targeted in the campaign and received an incentive, and those who would have responded anyway, even had they not been treated.
When examining the profitability of running targeted marketing campaigns, no net profit in fact results from including the second group, those who would have responded anyway, in a marketing campaign. Instead, a net loss is incurred by including these customers, because no additional revenues are generated to cover the costs involved. For instance, when coupons offering a 10% discount are sent to customers who would have purchased the product anyway, they pay less for the product or service than they would have had they not been contacted. The marketing effort then leads to decreased revenues and hence a net loss.
Clearly, we must know the net effect of a treatment on a customer rather than the gross effect to optimize the concrete actions that are undertaken. Uplift modeling, also called net-lift, true-lift, or difference modeling, aims at precisely establishing the difference in customer behavior because of a specific treatment that is extended to a customer. In this chapter, we will discuss various approaches for uplift modeling.
In line with the literature and in line with a number of business applications discussed extensively in the previous chapter, the discussion on uplift modeling in this chapter will center on marketing applications. However, note that uplift modeling can have great use and generate significant added value beyond the marketing context. A number of example applications in other fields include the following:
Whenever data used for developing a predictive model are somehow affected or subject to change because of interactions between a business or organization and its customers, uplift modeling might be a more correct approach to reach unbiased conclusions. Uplift modeling allows distilling the effect of these interactions from the data and accounting for the effects of interactions within the model.
A key requirement for uplift modeling to distill the effects of interactions on the behavior of customers is the availability of the right data. In the following section, we will discuss in detail what exactly "right" means and how these data can be gathered. A specific preliminary data collection strategy, which is an indispensable precondition for uplift modeling, will be discussed extensively. For uplift modeling, it is necessary either to actively gather the required data by means of well-designed experiments or, alternatively, to passively gather the required data by tracking information on marketing campaigns at the customer level.
The remainder of this chapter is structured as follows. The next section discusses the effects of a campaign in terms of the achieved change in customer behavior. Subsequently, data requirements will be discussed for developing uplift models as an improved alternative to traditional response modeling. The third section discusses various approaches for developing uplift models. The fourth and final section of this chapter is dedicated to the evaluation of uplift models, which will appear even more challenging than evaluating traditional predictive models and will require specific measures and approaches. Therefore, tailored evaluation procedures for assessing the effectiveness of uplift models will be extensively discussed and illustrated. Both visual evaluation approaches and performance metrics will be covered. To conclude, notes on optimally operating uplift models in practice from a profit-driven perspective are provided in line with discussions in the previous chapter.
As introduced previously, the aim of uplift modeling is to distinguish between responders and nonresponders and additionally to distinguish within the group of responders between customers who respond because of the campaign and those who would respond even when not treated. In fact, within the group of nonresponders, a further and similar segmentation can and should be made in terms of response behavior when treated or not treated.
In some situations, it has been observed that customers can be adversely affected by targeting them in a marketing campaign. In other words, some customers do not respond when treated, whereas they would purchase when not treated. Kane et al. (2014) identify four groups of customers. As seen in Figure 4.1, these groups are differentiated along two dimensions: response behavior and whether or not the customer is treated.
The resulting four customer types are named Sure Things, Lost Causes, Do-Not-Disturbs, and Persuadables. In the remainder of this chapter, we will make extensive use of these customer types. Hence, the reader is advised to memorize them.
Note that the actual behavior of a customer in terms of responding or not responding when treated or not treated likely depends on the various characteristics of the marketing campaign, that is, of the treatment that is applied. These characteristics can include, for example, the channel through which the customer is contacted and the type and size or amount of the financial incentive that is offered. It makes sense to optimize and customize these characteristics, because we are in full control of them and can freely decide on them to maximize the returns of the campaign. In some settings, we might even customize these characteristics at the individual customer level. Given the important precondition of availability of the required data, uplift modeling approaches do accommodate such further optimization and customization at the individual customer level, as will be discussed in the section on uplift modeling approaches. Nonetheless, few case studies elaborating such a setup can be found in the literature, likely due to the complexity involved in gathering the required data, developing the appropriate uplift model, and implementing and operating the model for tuning and elaborating the marketing campaign.
Finally, note that the customer types might or might not exist in a customer base. More generally, within a particular customer base, any possible combination of the four customer types discussed above can be present. Whether these types exist and their exact combination depends on the characteristics of the population and of the campaign. For instance, there can be no Do-Not-Disturbs in a customer base for a particular type of campaign. In such a situation, there is no risk of adversely affecting the customers. Conversely, when there are no Persuadables, one should not run a campaign, because no profits will be generated. Although such an extreme situation might be rather exceptional, the fraction of Persuadables is often small. Performing a cost-benefit analysis therefore can be sensible.
Specific data requirements hold for building uplift models and for identifying Persuadables in a customer population, which in essence is the objective of uplift modeling. Information must be available on a sample of customers, that is, whether they responded to a campaign in which they were targeted. This campaign preferably is identical or, if not, as similar as possible to the campaign that is to be launched on a larger scale or that is to be reiterated and for which we intend to develop the uplift model. If the campaign that is eventually run differs significantly from the campaign employed to gather response data for a sample of customers, then the eventual uplift model and its estimates concerning the effects of the new campaign on the behavior of customers might be less trustworthy and accurate.
In addition to this first sample containing information on the behavior of treated customers, data are also needed describing or capturing the behavior of a similar sample of customers who were not treated. Information is needed concerning whether these customers responded or, more generally, whether these customers displayed the behavior we aim to instigate without being treated. The underlying idea of gathering these two samples is to contrast them and thus distill the net effect of the treatment as a function of individual customer characteristics.
The first sample of treated customers is called the treatment group, whereas the sample of customers who were not treated is called the control group (which is also called the reference group). The difference in behavior observed between the customers in the control and treatment group is what allows uplift modeling approaches to estimate the net effect of a treatment on individual customers. Ideally, both samples have been selected randomly and are similar in terms of all relevant characteristics.
Figure 4.2 provides a conceptual overview of both the data collection and uplift modeling and of the subsequent campaign setup. In a first step, a development base is randomly selected from the full customer base. Note that the full customer base as represented in the figure might or might not include prospective customers, depending on the application of the uplift model that is developed. For instance, customer acquisition modeling aims at evaluating prospective customers in terms of their responsiveness to an acquisition campaign whose goal is to attract new customers. Therefore, the meaning of full depends on the application at hand and should be determined as appropriate in the particular application setting to ensure a representative sample is drawn.
The development base is randomly split into treatment and control groups of equal size. The treatment and control groups will and will not be treated, respectively, with the envisaged campaign. Response data are recorded for both samples and are then pooled and randomly split into training and test sets, as discussed in Chapter 2. An uplift model can then be developed using the training set, as will be discussed in the next section of this chapter. The resulting model is to be evaluated on the test set using the evaluation procedures, as will be discussed in a separate section at the end of this chapter. The evaluation procedure allows deciding whether the model is likely to perform well in terms of selecting an appropriate model base for running the actual campaign.
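The splitting and pooling steps described above can be sketched in a few lines of code. Note that the data frame, the `recency` feature, and the sample sizes below are illustrative assumptions, not part of the chapter:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical development base of 1,000 customers with one feature.
development_base = pd.DataFrame({
    "customer_id": np.arange(1000),
    "recency": rng.integers(1, 365, 1000),
})

# Randomly split the development base into treatment and control
# groups of equal size.
shuffled = development_base.sample(frac=1.0, random_state=42).reset_index(drop=True)
treatment_group = shuffled.iloc[: len(shuffled) // 2].assign(treated=1)
control_group = shuffled.iloc[len(shuffled) // 2:].assign(treated=0)

# After running the campaign on the treatment group, responses are
# recorded for both groups; here they are simulated placeholders.
pooled = pd.concat([treatment_group, control_group], ignore_index=True)
pooled["response"] = rng.integers(0, 2, len(pooled))

# Pool both groups and randomly split into training and test sets.
train = pooled.sample(frac=0.7, random_state=42)
test = pooled.drop(train.index)
```

The `treated` flag is retained in the pooled data because uplift modeling approaches need to contrast the behavior of treated and untreated customers.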
To measure the campaign and model effectiveness and to gather additional data for further uplift modeling purposes, a random base is also to be set up. The random base is a sample of customers that is randomly drawn from the customer base. The random base itself should be randomly split into treatment and control groups of equal size that respectively will and will not be treated with the campaign, similar to the treatment and control groups in the development base.
Note that, as previously discussed, the development base used to develop the uplift model could be the combined model base and random base samples of a previous similar campaign (if available), or of a previous run of a campaign that is repeated on a regular basis.
It is recommended to set up the development of the uplift model and the running of the campaign as an iterative process. In each iteration, data are gathered that record the response behavior of a control and a treatment group selected using the model, and of a control and a treatment group selected randomly from the customer base. In other words, a random base sample is randomly selected from the customer base and subsequently split into a control and a treatment group. The development base used for the next iteration then consists of the combined random base and model base samples.
One could argue that the model base sample should not be used for developing an uplift model, because the model base was not randomly selected. In fact, a random sample is always preferred when building analytical models because the randomness eliminates possible biases. However, the model base sample is too valuable and represents a rich source of information that should be explored and exploited for further improving the uplift model, for further optimizing the model base sample selection, and for further maximizing the returns of future campaigns.
To test and control for a possible bias due to the nonrandom selection of the model base, a dummy variable indicating whether a customer was selected for the model base or the random base can be included in the data sample that is used for developing the uplift model. If the dummy indicator turns out to be an important, statistically significant variable, a selection bias is present; retaining the dummy in the model then controls for it.
Based on the observed response rates in the control and treatment groups in the random base and model base, a data scientist can measure the effectiveness of the uplift model and the effectiveness of the campaign (Lo 2002). Note that the effectiveness of the uplift model is not the same as the effectiveness of the campaign and that both should be well distinguished when assessing model effectiveness. Table 4.1 summarizes the response rates observed in the various groups or samples in the campaign setup as illustrated in Figure 4.2, allowing measurement of effectiveness. The effects of the campaign, the effects of the model, and the combined effects of model and campaign together can be assessed by comparing the response rates among the resulting four groups of customers.
Table 4.1 Overview of Model and Campaign Effect Measurement
 | Treatment Group | Control Group | Treatment Minus Control
Model Base | RM,T | RM,C | RM,T − RM,C
Random Base | RR,T | RR,C | RR,T − RR,C
Model Minus Random | RM,T − RR,T | RM,C − RR,C | True Lift
The effect of the campaign can be evaluated by checking whether the response rate in the treatment group is greater than the response rate in the control group, both for the model base and for the random base. Within each base, the treatment and control groups are similar except for the effect of the campaign, which affects the response rate. Therefore, the greater the difference in response rate observed between these fully comparable groups, the greater the effectiveness of the campaign.
Conversely, for the model to effectively contribute to increasing the effect of the campaign by improving the selection of customers to treat, the difference in response rates between the treatment and control groups in the model base should be greater than the difference in response rates observed between the treatment and control groups in the random base. The difference in response rate in the random base measures the increase in response rate due to the campaign. For the model to be proved effective in selecting customers to treat, the increase in response rate in the model base should be greater than the increase attributable to the campaign. Hence, as proposed by Lo (2002), the quantitative business objective of a response model, as measured by its effectiveness for running campaigns, is to maximize this difference, which is called the true lift:

True Lift = (RM,T − RM,C) − (RR,T − RR,C)
The true lift evaluates the gain, for example in terms of response rate, revenues, and sales, that is achieved due to selecting the target population to be treated based on the model.
One can also arrive at the above equation for the true lift by adopting a complementary perspective on assessing model performance. The effectiveness of the model in identifying Persuadables can also be evaluated by contrasting the response rates in the treatment groups of the model base and the random base. Both groups receive a treatment and differ only in how they were selected, that is, by the model or in a random manner. Therefore, the larger the observed difference between the response rates, the more effective the model is in selecting Persuadables.
The difference in response rates between the control groups of the model base and the random base can be expected to be negative, because the uplift model aims at selecting Persuadables from the customer base, who only respond when treated. Because no treatment is applied, the response rate in the control group of the model base (RM,C) can be expected to be less than the average response rate, which is exactly what is observed in the random base control group (RR,C). Hence, because RM,C < RR,C, the difference RM,C − RR,C will be negative.
Comparing the difference in response rates for the treatment groups with the difference in response rates for the control groups provides a net measure for model effectiveness, the true lift:

True Lift = (RM,T − RR,T) − (RM,C − RR,C)
Reworking this equation leads to the same equation of true lift provided above. Hence, both perspectives on assessing model effectiveness yield the same equation, as could be expected. Nonetheless, both rationales provide complementary insights.
Following the previous discussion, Table 4.2 provides a practical illustration of model and campaign effectiveness measurement by elaborating an example. As seen from the table, for the treatment group selected with the model, a response rate of 5.3% is achieved, whereas in the control group, a response rate of 1.2% is achieved. In the randomly selected treatment and control groups, the response rates are 2.1% and 0.8%, respectively. Therefore, when randomly selecting customers to target, the treatment results in an increase in response rate, or uplift, equal to 2.1% − 0.8% = 1.3%. This result is considered the base effect of the campaign on which the model should further improve. Note that the campaign is effective in the sense that it boosts the response rate.
Table 4.2 Example Model and Campaign Effect Measurement
 | Treatment Group | Control Group | Treatment − Control
Model Base | 5.3% | 1.2% | 4.1% |
Random Base | 2.1% | 0.8% | 1.3% |
Model − Random | 3.2% | 0.4% | 2.8% |
When examining the uplift effect of the campaign when selecting target customers by making use of the model, we observe a treatment effect equal to 5.3% − 1.2% = 4.1%, which is much greater than the benchmark effect of 1.3% uplift that is achieved when randomly selecting customers. The true lift equals 4.1% − 1.3% = 2.8%; therefore, one can conclude that the model is effective and reinforces the effect of the campaign.
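Using the figures from Table 4.2, the true-lift calculation can be reproduced with a small helper function. The function name and argument names are ours, chosen for illustration:

```python
def true_lift(r_mt, r_mc, r_rt, r_rc):
    """Net model effectiveness: the uplift achieved in the model base
    minus the uplift attributable to the campaign alone, as measured
    in the random base (Lo 2002)."""
    return (r_mt - r_mc) - (r_rt - r_rc)

# Response rates from Table 4.2: model base treatment/control,
# random base treatment/control.
lift = true_lift(r_mt=0.053, r_mc=0.012, r_rt=0.021, r_rc=0.008)
print(round(lift, 3))  # 0.028, i.e., a true lift of 2.8%
```

The same number results from the complementary formulation (RM,T − RR,T) − (RM,C − RR,C), since both expressions are algebraic rearrangements of each other.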
Several practical issues and challenges emerge when setting up the experimental design for uplift modeling, as shown in Figure 4.2.
A first major challenge can be to convince management of the required shift in its modeling paradigm and to switch from a traditional response modeling approach to uplift modeling. A significant investment is required to develop and implement the proposed experimental setup and modeling process, which comes without a guarantee of yielding increased returns. Additionally, for the following reasons, management might be reluctant to elaborate the proposed experimental setup for gathering the required data:
The investments and efforts to set up the experimental design and collect the required data will not immediately result in additional returns or benefits. Instead, immediate losses are experienced due to missed sales and suboptimal targeting. In fact, however, these losses are essential investments in data collection and model evaluation, which will eventually yield additional revenues and profits in subsequent campaigns.
When approval is obtained for implementing the experimental setup, a next challenge and important decision that must be made concerns the numbers of customers to select for the model and random base and for the control and treatment groups within the model and random base. As shown in the literature, the more observations available for developing a model, typically the better the result. However, practical limitations obviously exist:
The samples used to measure campaign effectiveness, and more specifically to calculate the response rates shown in Table 4.1 and to statistically test the observed differences in response rates, can be selected randomly from the available samples in the experimental setup so as to balance the sizes of these samples. This approach might be preferable from a statistical perspective or to address specific concerns related to the applied statistical tests. However, if all samples are sufficiently large, the effect of imbalanced sample sizes can be expected to be small. Practically speaking, the sizes of the random base treatment group and of the model base control group will be the restrictive ones.
In the previous section, the possible effects of campaign characteristics on the behavior of customers were briefly mentioned. In addition to setting up an experiment and gathering data that allow building an uplift model for selecting the optimal set of customers to target with a given campaign with fixed characteristics, the experimental design can be extended to accommodate optimization and customization of campaign characteristics at the individual customer level. Similar to the A/B testing discussed in the next section, this approach allows optimizing the campaign design to complement optimizing customer selection. Such an extended analysis requires treating multiple customer subsamples with different campaign characteristics, which further complicates both the setup and the analysis of these experiments and makes them more costly and complex. The added value may be greater, but it must compensate for the additional cost to be worth the effort.
Note that when extending the experimental design, there is a requirement to include sufficient customers in each subsample that is treated in a different manner—for example, with a unique combination of campaign characteristics such as contact channel and type and value of the incentive. As always, sufficiently large (preferably including more than 100 observations) subsamples are required to draw conclusions and derive patterns that are robust and hold for the full population of customers. Robust in this setting means that the findings do not depend on the exact observations selected in the treatment samples. In other words, if another sample of customers were selected as the development base for gathering data and developing an uplift model, the resulting uplift model should not be substantially different in terms of relationships and predictions.
The experimental setup shown in Figure 4.2 and previously discussed, with two or more groups treated differently, might remind the reader of so-called A/B testing, which is also known as split testing or bucket testing (Kohavi and Longbotham 2015). A/B testing is a common practice in webpage design and more generally in software development. The aim is to compare different designs of a webpage or variations of application interfaces experimentally to decide on the optimal layout. Usually, two designs are compared, version A and version B, hence the name A/B testing. When more than two designs are compared, we speak of multivariate testing. In A/B testing, visitors to a webpage or users of an application are shown variations of the page or interface. Comparing the performance of the alternative setups in terms of appropriate performance indicators allows determining which design is preferred and should eventually be implemented. Example performance indicators that are often used are the conversion rate, which measures the fraction of visitors of a webpage who purchase the offered product or service, and the click-through rate, which measures the fraction of visitors of a webpage who click on a link that is shown on the webpage.
A/B testing is similar to uplift modeling in the sense that an experiment is set up in which different customers receive different treatments. However, in uplift modeling, a predictive model is developed by analyzing the behavior of these customers, allowing subsequent customization of the treatment given to individual customers, whereas in A/B testing a single design or treatment is selected, implemented, and applied to the full population of users. Hence, in A/B testing, the overall optimal design or treatment at the aggregated population level is selected, whereas uplift modeling aims at optimizing the treatment at the individual level by making use of advanced data analytics. For more information on A/B testing, one can refer to Kohavi and Longbotham (2015). Note that, in this sense, the practice of A/B testing is suboptimal and could theoretically be replaced by an uplift modeling approach. Rather than selecting a single layout of a webpage, the optimal layout might be customized depending on user characteristics to improve the performance of the website or application. This is to some extent what recommender systems are about, as discussed in Chapters 2 and 3. However, from a practical perspective, such a dynamic, customized interface might be complex to develop and implement. In addition, the consistency of the interface that is shown to users might be important, thereby limiting the practical use of uplift modeling.
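The statistical comparison underlying a basic A/B test can be sketched with a standard two-proportion z-test. The visitor counts below are illustrative assumptions:

```python
from math import sqrt
from statistics import NormalDist

def ab_test_z(conversions_a, n_a, conversions_b, n_b):
    """Two-proportion z-test comparing the conversion rates of
    design A and design B."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: 10,000 visitors per variant.
z, p = ab_test_z(conversions_a=320, n_a=10_000, conversions_b=380, n_b=10_000)
```

A small p-value suggests the difference in conversion rates between the two designs is unlikely to be due to chance, so the better-performing design would be rolled out to the full population; uplift modeling, in contrast, would try to decide per user which variant works best.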
The purpose of uplift modeling is to estimate the expected net effect of a campaign or treatment on the behavior of individual customers. In a marketing campaign setting, the aim of uplift modeling is to identify Persuadables (i.e., customers who will purchase only if treated). Traditional response modeling approaches estimate gross response, leading to the identification and treatment of Sure Things and possibly Do-Not-Disturbs. These two groups should not be treated, but are labeled as positives in a traditional response modeling setup because they have been observed to purchase. Conversely, Persuadables who were not treated have not been observed to purchase and hence are labeled as nonresponders. Therefore, traditional response models are trained to predict the wrong targets for a campaign—Sure Things and Do-Not-Disturbs rather than Persuadables. This conceptual design error often goes unnoticed because response rates to campaigns indeed are higher when using a traditional response model than when randomly selecting target customers because of the Sure Things identified by the response model and included in the campaign.
Hence, the core issue with traditional response modeling is the objective function that is adopted for building the model and that does not capture the true objective because it is incorrectly specified. Response models aim at estimating the probability of responding instead of the increase in probability of responding because of a treatment. The increment in probability is exactly what uplift models estimate. In other words, uplift models estimate the change in behavior because of the marketing campaign, which is captured by the target variable y, defined as follows:

y = (response when treated) − (response when not treated)
However, note that the value of this target variable cannot be observed for individual customers because a customer cannot be treated and not treated simultaneously, i.e., belong to both the treatment and the control groups. Therefore, specific analytical techniques are required—that is, uplift modeling approaches. This section provides an overview of uplift modeling approaches that have been selected by assessing the following essential properties:
The selected approaches can be categorized in the following four groups:
This categorization is adopted to structure this section and allows the reader to further identify and frame approaches as proposed in the literature on uplift modeling. In this section, we will introduce and discuss the most representative, useful, powerful, and/or popular approaches from these different groups. References are provided throughout the text to the original works, presenting the selected approaches and providing full details, discussions, and experimental evaluations.
The approach that performs best and that should be applied in a particular setting depends on the exact application, the available data, the characteristics of the population, and the personal preferences and skills of the involved data scientists and management. Experimentation with different approaches is highly recommended if permitted by time and budget constraints because a relatively large variability is observed in the reported performance across different application settings. Even within similar application settings but for heterogeneous campaign and customer population characteristics, a strong variability has been observed. This observation leads to an important recommendation: Be cautious, and carefully and precisely test and monitor the performance of uplift models both when in development and even more so when in operation.
A rather simple and intuitive approach for developing uplift models is called the two-model approach. The two-model approach builds on the traditional response modeling approach and combines two independent response models that are developed on two subsamples:
- a treatment response model MT, developed on the subsample of customers who were treated (i.e., the treatment group); and
- a control response model MC, developed on the subsample of customers who were not treated (i.e., the control group).
The aggregated uplift model (MU) combines these two models and estimates the net effect of a marketing campaign on the behavior of customers in terms of the change in probability of responding. In other words, the uplift model estimates the uplift by subtracting the response probability when not treated, P(y = 1 | x, t = 0), estimated by the control response model MC, from the response probability when treated, P(y = 1 | x, t = 1), estimated by the treatment response model MT. The uplift model MU is formulated as follows:

MU(x) = MT(x) − MC(x) = P(y = 1 | x, t = 1) − P(y = 1 | x, t = 0)
In the literature, this approach is also known as the naïve approach, difference score method, or double classifier approach (Radcliffe 2007; Soltys, Jaroszewicz, and Rzepakowski 2015). The approach is indirect in the sense that uplift is not directly estimated by a model that is fitted to produce uplift scores. Instead, uplift is calculated indirectly from estimated response probabilities.
This approach has the benefit of being straightforward to implement. Two standard classification models must be developed, following the same methodology used for developing traditional response models estimating the gross response. For building the treatment and control response models, any supervised learning technique as discussed in Chapter 2 can be adopted. For instance, a two-model approach using logistic regression has been discussed and applied in Hansotia and Rukstales (2002a, 2002b).
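As an illustration, the two-model approach can be sketched in a few lines of code. The function names and the synthetic dataset below are our own, and any classifier producing probability estimates could be substituted for the logistic regression.

```python
# Sketch of the two-model approach: two independent response models, one per
# group, whose score difference yields the uplift estimate.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_two_model(X, y, t):
    """Fit the treatment response model M_T and the control response model M_C."""
    m_t = LogisticRegression(max_iter=1000).fit(X[t == 1], y[t == 1])
    m_c = LogisticRegression(max_iter=1000).fit(X[t == 0], y[t == 0])
    return m_t, m_c

def uplift_scores(m_t, m_c, X):
    """MU(x) = P(y=1 | x, t=1) - P(y=1 | x, t=0)."""
    return m_t.predict_proba(X)[:, 1] - m_c.predict_proba(X)[:, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
t = rng.integers(0, 2, size=400)
# Synthetic responses: the treatment lifts the response odds when x0 > 0.
logits = 0.5 * X[:, 0] + t * (X[:, 0] > 0)
y = (rng.random(400) < 1 / (1 + np.exp(-logits))).astype(int)

m_t, m_c = fit_two_model(X, y, t)
uplift = uplift_scores(m_t, m_c, X)  # one uplift estimate per customer
```

Note that the two models never see each other's data, which is precisely the drawback discussed next.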
An important drawback of this approach is that the individual treatment and control response models constituting the uplift model are each built using data from either the treatment group or the control group—without considering the other group (Chickering and Heckerman 2000; Hansotia and Rukstales 2002b; Radcliffe and Surry 2011). Uplift is then estimated by subtracting both model scores for each observation in the test set, as discussed above. By independently building the models, the model-building process does not actively search for and does not focus on finding patterns that are directly related to and hence indicative or predictive for uplift. The models are not built directly with the aim of estimating uplift but, rather, of estimating response behavior for two separate groups of customers.
The model-building process could result in one of the models selecting a rather different set of predictor variables than would a modeling approach directly estimating uplift (see below on direct estimation approaches). Moreover, to predict the uplift accurately, errors in the individual estimates of both models should not reinforce one another. Both models must therefore be highly accurate, since errors in one or both of the models could be amplified when predicting uplift and result in an inaccurate aggregate uplift model (Radcliffe and Surry 2011). However, highly accurate models appear difficult to achieve in practice. Consistency is not enforced among the models, and the effects are not directly assessed and modeled.
The two-model approach discussed in this section is considered an indirect means of estimating uplift, leading to associated issues and limitations. Instead, an estimation method that builds a single model to predict uplift directly using data from both control and treatment groups is preferred.
Several such approaches allowing direct estimation of uplift have been independently developed in the academic literature. Most of these approaches are either regression-based or tree-based. The next two sections introduce a series of methods stemming from these two respective classes of estimation approaches. A subsequent section will then discuss a selection of ensemble-based approaches for estimating uplift.
In this section, two approaches will be presented. Lo's method is based on logistic regression, whereas Lai's method and the generalized Lai method reformulate the uplift modeling problem to allow standard approaches to be used, as discussed in Chapter 2. As will be shown, Lo's method can also be generalized for use in combination with any standard classification technique.
A direct approach for uplift modeling that makes use of logistic regression was introduced in Lo (2002). The proposed methodology pools the treatment and control groups and incorporates a treatment dummy variable t, which indicates treatment or control group membership. See, for instance, the example dataset provided in Table 4.3, in which t is assigned a value of zero for control group membership and a value of one for treatment group membership.
Table 4.3 Dataset Including Treatment Dummy Variable t, Predictor Variables xi and Target Variable y
Customer | Age | Income | … | Treatment t | Target y |
John | 32 | 1,530 | … | 1 | 1 |
Sophie | 48 | 2,680 | … | 1 | 0 |
… | … | … | … | … | … |
Josephine | 23 | 1,720 | … | 0 | 0 |
Bart | 39 | 2,390 | … | 0 | 1 |
… | … | … | … | … | … |
Lo's method includes the predictor variables x, the treatment indicator t, and the interaction variables x × t as predictors in a logistic regression model that is fitted to estimate the target variable y, which indicates whether a customer responded (y = 1) or did not respond (y = 0).
Similar to the standard procedure for estimating a logistic regression model discussed in Chapter 2, Lo's method calls for a variable selection procedure to be applied in a forward, backward, or stepwise manner, depending on the data scientist's preference. The number of candidate variables is doubled by including the interaction variables xi × t. Hence, the aim of the variable selection procedure is to reduce the set of included predictor variables, which allows reaching a stable or robust logistic regression model that includes only statistically significant variables, allows interpretation and validation of the incorporated relationships, and makes accurate predictions and generalizes well to new, unseen observations.
The interaction variables that are explicitly advanced in Lo's method and included in the set of candidate predictor variables allow the model to account for the heterogeneous effect of a treatment based on the characteristics of a customer, as expressed by the predictor variables x. If a treatment works well in a particular segment (i.e., significantly increases the response rates in that segment), then including the interaction variables allows the logistic regression model to pick up this pattern in the uplift model and to more accurately predict uplift. In other words, including these interaction variables increases the versatility of the approach.
Usually, interaction variables combining pairs of variables (or triplets, or more) are not preferred in business applications because of the reduced interpretability of the resulting model. For instance, the meaning of an interaction variable combining two continuous predictors, such as age × income, is arguably difficult to interpret unless you are to some extent trained in understanding the meaning of interaction effects. In Lo's method, however, the interaction effects are less complex because t is a simple dummy variable, and the reason for adopting these interaction effects can be quite easily explained to a nonexpert.
Note that the interaction effects allow data scientists and marketers or campaign developers to gain insight into the possibly divergent effects of a campaign on various subgroups in the customer base. By setting up well-designed experiments and by applying differentiated treatments to different samples of customers, one can customize the campaign characteristics in terms of, for example, channel or incentive to match the exact customer profile and allow further boosting of the returns to marketing efforts. This idea was discussed in the extended experimental design section earlier in this chapter.
Lo's method then applies logistic regression to model uplift as follows:

P(y = 1 | x, t) = 1 / (1 + e^−(β0 + β1x1 + … + βkxk + βk+1x1t + … + β2kxkt + β2k+1t))

As discussed in Chapter 2, β0 represents the intercept, β1 to βk are coefficients measuring the main effects of the k predictor variables, and βk+1 to β2k capture the additional effects of the predictor variables due to the treatment. In other words, βk+1 to β2k measure the interaction effects between campaign and customer characteristics as discussed above. Finally, β2k+1 captures the main treatment effect.
When applying a variable selection procedure (e.g., a forward, backward, or stepwise variable selection), the treatment dummy indicator t and interaction effects can be eliminated from the model. However, for estimating uplift, the treatment indicator t should somehow be in the model. If not, then the model does not allow calculation of the difference between the probability of responding for t = 1 and for t = 0—that is, the uplift for a customer when targeted in the campaign. If the selection procedure eliminates t and the interaction effects from the final model, there are two possible explanations, which come with different solutions:
Another possibility is that the coefficient of the treatment dummy indicator in the model is significant but negative. This result would actually mean that the treatment has a negative effect on the probability of responding, so the campaign achieves the opposite of what is aimed for. Clearly, then, an improved campaign should be designed that does positively affect customers.
The predicted uplift is essentially the difference between the probability of a customer responding when treated and the probability of a customer responding when not treated. This approach is similar to the two-model approach. However, a single integrated model is now estimated using data from both the treatment and control groups to produce both estimates, whereas in the two-model approach, two separate and therefore nonintegrated models were constructed. Additional advantages of Lo's method are the intuitiveness of the approach in accommodating the estimation of the effect of a campaign by including the treatment dummy indicator, and the interpretability of the resulting model, which allows understanding and validating the relationships that are incorporated in the model and which explains the observed behavior.
The disadvantages of Lo's approach, according to Kane et al. (2014), are that some compound errors might remain when subtracting two model scores, and a substantial collinearity between variables might be present in the model because some characteristics might be included as both baseline and interaction variables. Full details on this approach are provided by Lo (2002) and Kane et al. (2014).
Note that the underlying principle of this approach (i.e., the inclusion of a treatment dummy indicator) is not restricted to logistic regression and can be implemented in combination with any supervised learning approach. For example, one can include the treatment dummy variable t and the interaction variables x × t in a neural network classification model and use this model similarly to obtain uplift estimates.
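This generalization of Lo's method can be sketched as follows. The function names and the synthetic data are our own; any scikit-learn classifier exposing probability estimates could replace the logistic regression.

```python
# Sketch of Lo's method with a generic classifier: the design matrix stacks
# [x, t, x*t]; scoring the fitted model twice, with t forced to one and to
# zero, yields the uplift estimate.
import numpy as np
from sklearn.linear_model import LogisticRegression

def lo_design(X, t):
    """Stack the predictors, the treatment dummy, and the interactions."""
    t = t.reshape(-1, 1)
    return np.hstack([X, t, X * t])

def lo_uplift(model, X):
    """Uplift = P(y=1 | x, t=1) - P(y=1 | x, t=0) from one integrated model."""
    p1 = model.predict_proba(lo_design(X, np.ones(len(X))))[:, 1]
    p0 = model.predict_proba(lo_design(X, np.zeros(len(X))))[:, 1]
    return p1 - p0

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
t = rng.integers(0, 2, size=500).astype(float)
logits = 0.4 * X[:, 1] + t * X[:, 0]  # treatment effect grows with x0
y = (rng.random(500) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression(max_iter=1000).fit(lo_design(X, t), y)
uplift = lo_uplift(model, X)
```

Because the interaction columns are part of the design matrix, the model can pick up the heterogeneous treatment effect directly, in contrast to the two independently fitted models of the two-model approach.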
Finally, note that this approach might not work when adopting a base classifier that inherently incorporates a variable selection procedure, as discussed above. For instance, decision trees and ensembles of decision trees (e.g., obtained through bagging, boosting, or random forests) incorporate variable selection procedures. Therefore, the treatment dummy indicator t and the interaction variables might not be selected in the final model, thus not allowing calculation of uplift estimates by setting t to zero and one. Again, this condition might or might not mean the treatment has no significant effect, as discussed above. Other variables that are correlated with the treatment variables might be more predictive for the target variable and therefore preferred, making the treatment variables redundant.
As will be discussed in the next sections on decision tree-based approaches for developing uplift models, when the treatment nonetheless is significant, the dummy or an interaction effect can be enforced in the model, facilitating the development of an uplift model. This approach comes with greater complexity than enforcing t in the logistic regression model and therefore might be less preferred.
Two alternative regression-based uplift modeling approaches are Lai's method and the generalized Lai method as introduced in Lai et al. (2006) and Kane et al. (2014), respectively. These models essentially redefine the target variable as follows.
In the introduction section of this chapter, the customer population was categorized into four groups based on whether a customer responds when treated or not treated: Sure Things, Lost Causes, Persuadables, and Do-Not-Disturbs (see Figure 4.1). Preferably, we would like to know for each customer individually to which group he or she belongs, which would allow us to treat all Persuadables and to maximize the returns of a campaign. However, we do not have this information; we therefore develop uplift models with the aim of identifying the Persuadables.
What we do know, based on previous campaign or experimental data that were gathered to build an uplift model, is whether a customer was treated and whether the customer responded. Hence, a customer can be grouped into one of the following four categories as discussed in Lai et al. (2006) and Kane et al. (2014); see Figure 4.3:
Based on these four categories of customers, an alternative direct uplift modeling approach is proposed in Lai (2006) and Lai et al. (2006). This approach labels control nonresponders and treated responders as good targets because both groups together contain all Persuadables and do not contain Do-Not-Disturbs, who should not be treated. In addition to Persuadables, these groups (control nonresponders and treated responders) also contain Lost Causes and Sure Things. These groups should optimally not be treated, but only involve a minor cost compared with treating Do-Not-Disturbs.
Control responders and treated nonresponders are labeled bad targets because these groups only contain Do-Not-Disturbs, Lost Causes, and Sure Things. Targeting these groups always involves costs but never generates additional revenues. Hence, control responders and treated nonresponders should not be treated.
Thus, the uplift modeling problem is converted into a binary classification problem. Logistic regression or any other supervised learning technique as discussed in Chapter 2 can be applied to estimate the new target variable. The probability of being good resulting from the developed model allows ranking customers from high to low likelihood of being a Persuadable and selecting a target population for running the campaign by setting a cutoff score for a customer to be included in the campaign, as will be further discussed in a subsequent section. Note that including a treatment dummy in this approach is of no use because both the good and the bad targets stem from both the control and the treatment group.
Table 4.4 applies Lai's method to the example dataset provided in Table 4.3, with a value y′ = 1 of the new target variable representing good customers (i.e., control nonresponders and treated responders) and y′ = 0 representing bad customers (i.e., control responders and treated nonresponders). Applying a binary classification technique to the relabeled dataset, excluding the treatment indicator and the original target variable, then yields an uplift model.
Table 4.4 Relabeled Dataset of Table 4.3 Following Lai's Method
x1 | x2 | … | t | y | y′ |
32 | 1,530 € | … | 1 | 1 | 1 |
48 | 2,680 € | … | 1 | 0 | 0 |
… | … | … | … | … | … |
23 | 1,720 € | … | 0 | 0 | 1 |
39 | 2,390 € | … | 0 | 1 | 0 |
… | … | … | … | … | … |
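The relabeling illustrated in Table 4.4 reduces to a one-line rule, sketched below with variable names of our own choosing.

```python
# Lai's relabeling: good targets (y' = 1) are treated responders and control
# nonresponders; bad targets (y' = 0) are treated nonresponders and control
# responders. With binary t and y, this is simply the test t == y.
import numpy as np

def lai_relabel(t, y):
    """Return y' = 1 for (t=1, y=1) and (t=0, y=0), else y' = 0."""
    return (t == y).astype(int)

# The four example rows of Table 4.4: John, Sophie, Josephine, Bart.
t = np.array([1, 1, 0, 0])
y = np.array([1, 0, 0, 1])
y_prime = lai_relabel(t, y)  # [1, 0, 1, 0]
```

Any standard binary classifier can then be trained on the predictors x against y_prime, with t and the original y excluded from the inputs.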
In Kane et al. (2014), a variation of this approach is suggested by adopting a supervised learning model for predicting multiple nominal outcomes (e.g., multinomial logistic regression or decision tree) to directly estimate probability scores for each quadrant of the left panel in Figure 4.3—for belonging to class TR, TN, CR, or CN. This approach might lead to more flexibility in model development compared with the original formulation and to a higher precision in estimating class membership and lift score. The drawback of this approach is an increase in complexity. Table 4.5 relabels the dataset provided in Table 4.3 following this approach. Subsequently, a multiclass classification technique can be applied, again excluding the treatment indicator and the original target variable from the dataset, which will yield a classification model that allows predicting a probability score for each of the classes (TR, TN, CR, and CN).
Table 4.5 Relabeled Dataset of Table 4.3 Following the Generalized Lai Method
x1 | x2 | … | t | y | y′ |
32 | 1,530 € | … | 1 | 1 | TR |
48 | 2,680 € | … | 1 | 0 | TN |
… | … | … | … | … | … |
23 | 1,720 € | … | 0 | 0 | CN |
39 | 2,390 € | … | 0 | 1 | CR |
… | … | … | … | … | … |
The resulting probability scores can then be combined in an unweighted fashion as follows:

lift(x) = P(TR | x) + P(CN | x) − P(TN | x) − P(CR | x)
The lift score again allows ranking and selecting a target population to be treated by setting a cutoff. Essentially, this process reduces to the original approach in the sense that the probabilities for belonging to classes TR and CN are summed, thus representing the probability of good (i.e., the probability of being a Persuadable). In addition, the probabilities for belonging to classes TN and CR are summed, thus representing the probability of bad (i.e., the probability of not being a Persuadable). Hence, the lift score, with P(good | x) = P(TR | x) + P(CN | x) and P(bad | x) = P(TN | x) + P(CR | x), equals the following:

lift(x) = P(good | x) − P(bad | x)
The essential difference from Lai's approach is the moment of aggregation, which in Lai's method occurs before estimating a model by redefining the target variable; however, in Kane's variation, the aggregation occurs after model estimation by recombining the probabilities, thus yielding the lift scores.
The development of a multiclass model as in Kane's variation allows further refining and generalizing the above equation and is therefore preferred. This model leads to the generalized Lai method. In Kane et al. (2014), it is shown that an adjustment of the above equation for the lift score is required when using Kane's variation to account for different sample sizes of the control and treatment groups.
Only when both samples are randomly drawn from the customer base and include the same number of customers (which is often not true) is the above equation statistically correct. In other words, only then will the estimated models produce tuned probabilities for a customer to belong to a particular class that can be reliably used to calculate the lift score and for ranking and selecting customers. Tuned in this setting means that a probability can be interpreted as an exact estimate of the probability—that is, one representing the precise likelihood of belonging to either class.
Because of imbalanced sample sizes, probability estimates might become biased. To correct for this bias, the lift scores should be calculated using the following equation:

lift(x) = P(TR | x) / p(T) + P(CN | x) / p(C) − P(TN | x) / p(T) − P(CR | x) / p(C)
with p(T) the proportion of treated customers, p(C) the proportion of customers in the control group, and p(T) + p(C) = 1. Intuitively, this correction makes sense. If the sample of treated customers is relatively small compared with the control group, then the fraction of treated responders will also be relatively small. The model will therefore produce small absolute probability estimates for customers to belong to the treated responder or treated nonresponder groups, whereas this probability should not depend on the original sample sizes of the treatment and control groups. In essence, only the customer characteristics should determine the lift score; hence, the probability estimates produced by the model are scaled in the above equation by the proportions of the samples used to develop the model.
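A minimal sketch of the corrected lift score follows; the function and variable names are ours, and the four probabilities would in practice come from the fitted multiclass model.

```python
# Corrected lift score of the generalized Lai method: probabilities of the
# treated classes (TR, TN) are rescaled by p(T), those of the control classes
# (CR, CN) by p(C) = 1 - p(T).
def generalized_lai_lift(p_tr, p_tn, p_cr, p_cn, p_treat):
    """Lift = P(TR)/p(T) + P(CN)/p(C) - P(TN)/p(T) - P(CR)/p(C)."""
    p_ctrl = 1.0 - p_treat
    return (p_tr / p_treat + p_cn / p_ctrl) - (p_tn / p_treat + p_cr / p_ctrl)

# Hypothetical probability scores for a single customer, equal group sizes:
lift = generalized_lai_lift(p_tr=0.30, p_tn=0.20, p_cr=0.10, p_cn=0.40,
                            p_treat=0.5)  # about 0.8
```

With equal group sizes (p(T) = p(C) = 0.5), the correction merely rescales every term by the same factor, so the ranking of customers is unchanged; the scaling matters only for imbalanced samples.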
An advantage of the simple and the generalized Lai methods is that traditional supervised learning techniques and model estimation procedures can be applied for developing an uplift model. Conversely, the approach is rather rough because by design it does not exclude Lost Causes and Sure Things from the control nonresponders and treatment responders. Hence, a further optimization can be desirable to maximize returns.
The two-model approach, the simple and generalized Lai method, and Lo's approach were evaluated on three datasets in Kane et al. (2014). Overall, the generalized Lai method appeared to perform best given the imbalanced sizes of the different samples, whereas Lo's approach worked well for some but not all datasets.
Most tree-based approaches for uplift modeling, such as uplift trees, are adaptations from popular classification tree algorithms such as C4.5 (Quinlan 1993), CART (Breiman et al. 1984), or CHAID (Kass 1980). Chapter 2 provided an introductory discussion of these standard classification-tree induction algorithms. The original setup of these approaches does not directly accommodate uplift estimation but is oriented toward class estimation. Nonetheless, it seems intuitive that classification trees can be altered in a relatively simple manner for uplift modeling purposes. In this section, a selected number of such adaptations are discussed in detail.
Standard classification trees consider a single sample with observations belonging to two or more classes. In uplift modeling, however, two samples exist: the treatment and control groups. To account for these two groups and to estimate lift rather than class membership, the proposed tree-based approaches alter the splitting criterion, the pruning technique, or both the splitting criterion and pruning technique involved in building a classification tree.
In Radcliffe and Surry (2011), a powerful tree-based uplift modeling approach is introduced called the significance-based uplift tree (SBUT), which is similar to the well-known CART and C4.5 decision tree induction algorithms.
As with most tree-based approaches, SBUT grows a tree by evaluating all potential splits. To do so, a quality measure indicating the goodness of each potential split is calculated. This approach allows iteratively selecting the best split and growing the tree until a stopping criterion is met or, alternatively, until a fully pure tree is obtained. SBUT aims at selecting for each node the split that simultaneously does two things:
- maximizes the difference in uplift between the resulting child nodes; and
- yields child nodes that are of sufficient and comparable size.
Both properties are important determinants of the quality of a split because a split can quite easily be found that yields a large difference in uplift between the resulting child nodes by simply separating a small number of observations exhibiting a high response rate and uplift in a node. However, the strong uplift observed in the respective child node then only applies to a very limited number of observations and therefore does not have broad validity or applicability. Hence, a good splitting quality criterion achieves strong uplift in the child nodes but also accounts for the size of the child nodes. Figure 4.4 provides an illustration of the trade-off between achieving high uplift in a child node and accounting for the number of observations, showing a bad and a good split.
Note that some methods proposed in the literature strongly simplify the split quality evaluation, for instance by ignoring the sizes of the child nodes and selecting splits solely based on the difference in uplift (Hansotia and Rukstales 2002b). As an alternative, the tree-based approach proposed by Chickering and Heckerman (2000) does not adapt the splitting criterion to accommodate the specific objective of uplift modeling but rather enforces the final splits in all leaf nodes to be on the treatment dummy indicator t. Remember that the treatment variable indicates whether a customer received a treatment. Hence, in each leaf node, a probability of responding when treated and when not treated can be calculated by setting t to one or zero. Thus, the uplift can be estimated as the difference between these probabilities. Note that this approach strongly resembles Lo's method discussed in the section above, which also forces the treatment indicator t into the model for uplift estimation. However, Lo's approach directly aims at distilling the effect of the campaign on the probability of responding. Conversely, this tree-based variant does not fit trees that directly aim at estimating uplift but rather at estimating class membership (i.e., respond or not). The responders and nonresponders that are found are eventually split into treated and nontreated, but this split might not be meaningful.
Both simplified approaches (Hansotia and Rukstales, 2002b; Chickering and Heckerman, 2000) are intuitive but less well equipped to accommodate the specific modeling objective (i.e., estimating uplift). Consequently, as can be expected, they have already been reported to be less powerful in inducing uplift trees than the more refined decision tree approaches just discussed.
A measure similar to the standard information gain measure implemented in CART and C4.5 is proposed in Radcliffe and Surry (1999) and reported on more extensively in Radcliffe and Surry (2011). The measure, which we will call the uplift gain (GainU), evaluates the quality of a split by penalizing the difference in uplift between the left and right child nodes, Δ = |UL − UR|, which is to be maximized, by multiplying Δ with a factor between zero and one that is a function of the difference in the sizes of the child nodes, nL and nR, respectively representing the number of observations in the left and the right child node:

GainU = Δ × (1 − |nL − nR| / (nL + nR))^k
Parameter k in the GainU equation determines the importance or effect of the sample size difference in correcting the difference in uplift Δ. The larger k is, the smaller is the penalty factor for imbalanced node sizes (see the example of Figure 4.5). The value of parameter k is set heuristically (e.g., using a standard tuning approach).
When the child nodes are exactly balanced, then nL = nR and the penalty factor equals 1, which intuitively leads the uplift gain to be equal to the difference in uplift Δ between the child nodes. Note that we then essentially obtain the measure proposed by Hansotia and Rukstales (2002b), as already explained. However, when the difference in child node sizes is large, the penalty factor shifts toward zero. GainU then becomes small, indicating poor split quality, so imbalanced splits are penalized.
The split with the largest GainU is considered the optimal split because it balances between the two objectives introduced above; i.e., it maximizes the difference in uplift between the child nodes and minimizes the difference in size of the child nodes.
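The uplift gain can be sketched as follows. The exact shape of the penalty factor is our reconstruction from the properties described above: equal to one for balanced child nodes, shrinking toward zero for imbalanced ones, and shrinking faster for larger k.

```python
# Uplift gain: difference in uplift between the child nodes, penalized for
# imbalanced child node sizes (penalty form is a reconstruction, not the
# authors' published code).
def gain_u(uplift_left, uplift_right, n_left, n_right, k=2):
    delta = abs(uplift_left - uplift_right)
    penalty = (1.0 - abs(n_left - n_right) / (n_left + n_right)) ** k
    return delta * penalty

balanced = gain_u(0.02, 0.10, 500, 500)  # penalty factor 1: gain equals delta
skewed = gain_u(0.02, 0.10, 980, 20)     # tiny child node: heavily penalized
```

Although both candidate splits show the same difference in uplift, the split isolating only 20 observations receives a far lower gain, reflecting the trade-off illustrated in Figure 4.4.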
The split with the largest GainU is iteratively selected when growing the tree. Alternative formulations for the uplift gain measure have been proposed in the literature but are beyond the scope of this chapter because these measures might or might not work well depending on the real-world problem that must be solved (Radcliffe and Surry 2011).
As an alternative to the above uplift gain measure, the significance-based splitting criterion is proposed in Radcliffe and Surry (2011). This measure is more complex and less intuitive but likely is also more powerful. When considering a candidate split, the SBUT approach fits a linear regression model that estimates the probability of responding for all observations in the treatment and control groups in both child nodes as a function of a number of dummy variables indicating three things:
- whether an observation belongs to the left or the right child node (dummy variable N);
- whether an observation belongs to the control or the treatment group (dummy variable G); and
- whether an observation belongs to the treatment group in the right child node (interaction N × G).
The linear regression model then becomes the following:

p = β0 + βN N + βG G + βNG (N × G)
β0 is the intercept representing the baseline response probability for observations with all dummies having a value equal to zero (i.e., for control group customers in the left child node). βN captures the effect of belonging to the right child node compared with the baseline response probability, and βG of belonging to the treatment group. Finally, βNG captures the difference in response probability for belonging to the treatment group in the right child node compared with all other groups. Hence, βNG estimates the effects of both the split and the treatment in the right child node treatment group compared with all other subgroups, i.e., compared with the control groups in both the left and right child node and the treatment group in the left child node. Thus, βNG captures the difference in uplift, which is exactly the effect we aim to distill. The significance of this coefficient is therefore indicative of the strength of the split. We can find the optimal split by evaluating for all possible splits the significance of the interaction term.
The significance of the interaction term can be tested with a t-statistic following a t-distribution, an approach that provides an indication of significance given the other variables in the model and isolates the effect of the split on uplift, which is exactly what we need in this setting (Radcliffe and Surry 2011). The expression used to evaluate split quality is the following:

t²{βNG} = (UR − UL)² (n − 4) / (C44 × SSE)
where n is the number of observations in the node that is split, UR and UL are the uplift in the right and left child nodes, and C44 is the (4, 4)-element of the matrix (XᵀX)⁻¹, with X the design matrix of the linear regression. The sum of squared errors, SSE, in the denominator of the statistic can be calculated as follows:

SSE = Σij nij pij (1 − pij)
where nij is the size of the various groups in the child nodes, and pij is the estimated response probability in each node by the linear regression model.
Splits can be ranked according to the associated value of t2{βNG}, with a higher value of the statistic indicating a stronger significance of the coefficient.
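Assuming the fitted cell means equal the observed response rates per group, the statistic can be computed directly from the group sizes and response rates of a candidate split. This is our reconstruction from the description above, not code from the original authors.

```python
# Significance-based split statistic t^2 for the interaction coefficient
# beta_NG, computed from grouped data.
import numpy as np

def t2_beta_ng(n, p):
    """n[g][s] and p[g][s]: group size and response rate for group g in
    {'T', 'C'} and child node s in {'L', 'R'} of a candidate split."""
    u_l = p['T']['L'] - p['C']['L']  # uplift in the left child node
    u_r = p['T']['R'] - p['C']['R']  # uplift in the right child node
    n_tot = sum(n[g][s] for g in n for s in n[g])
    sse = sum(n[g][s] * p[g][s] * (1.0 - p[g][s]) for g in n for s in n[g])
    # X'X for the design [1, N, G, NG] reduces to cell counts:
    n_r = n['T']['R'] + n['C']['R']   # observations in the right child node
    n_t = n['T']['L'] + n['T']['R']   # treated observations
    n_rt = n['T']['R']                # treated observations, right child node
    xtx = np.array([[n_tot, n_r,  n_t,  n_rt],
                    [n_r,   n_r,  n_rt, n_rt],
                    [n_t,   n_rt, n_t,  n_rt],
                    [n_rt,  n_rt, n_rt, n_rt]], dtype=float)
    c44 = np.linalg.inv(xtx)[3, 3]
    return (u_r - u_l) ** 2 * (n_tot - 4) / (c44 * sse)

n = {'T': {'L': 100, 'R': 100}, 'C': {'L': 100, 'R': 100}}
p = {'T': {'L': 0.10, 'R': 0.30}, 'C': {'L': 0.10, 'R': 0.10}}
stat = t2_beta_ng(n, p)  # larger values indicate a stronger split
```

A split whose child nodes exhibit identical uplift yields a statistic of zero, since the interaction coefficient then vanishes.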
One might wonder why a linear regression model is used in the significance-based splitting procedure instead of a logistic regression model, because the target variable is binary. Although not explicitly indicated by the developers, there are two main advantages to the use of linear regression over logistic regression in this setting:
Given the usually small effect of a treatment compared with the baseline or background effect and the often small size of the treatment group in the development base, a main challenge when developing uplift models concerns the stability of the resulting model. Stability or robustness refers to the generalization behavior and precision toward future applications on new customers different from those in the development sample. Particularly for decision-tree-based approaches, stability is a concern, and overfitting must actively be addressed. A pruning strategy involving a holdout validation sample as discussed in Chapter 2 can be used for this purpose.
In Radcliffe and Surry (2011), a variance-based pruning approach is proposed for application in combination with the significance-based splitting criterion discussed above. The training data for this purpose must be split randomly into k equally sized sets, with k equal to eight by default. The tree is grown in full using one of these sets until, for example, all leaf nodes are pure, a maximum depth is reached, or a leaf node contains a minimum number of observations. In a second step, splits are removed if the uplift at a child node exhibits a standard deviation (as a measure of variability and thus of stability) greater than some predetermined threshold, with the standard deviation measured on the sets that were not used for growing the tree. The exact value to use as a threshold is highly application dependent. The developers of this approach recommend experimenting with and tuning this parameter to test the effect and sensitivity of the uplift tree and to optimize its value for maximum performance. Radcliffe and Surry (2011) provide an indication, placing the pruning threshold in the range of 0.5% to 3% for a baseline response rate (i.e., the response rate in the control group) in the range of 1% to 3% and uplift (i.e., the difference between the response rates in the treatment and the control group) between 0.1% and 2%.
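The pruning rule itself reduces to a simple stability check per split; the fold layout and names below are our own sketch, not the authors' implementation.

```python
# Variance-based pruning check: a split is kept only if the uplift observed
# in its child nodes is stable across the sets not used for growing the tree.
import numpy as np

def keep_split(child_uplifts_per_fold, threshold=0.02):
    """child_uplifts_per_fold: shape (n_folds, 2), the uplift observed in the
    (left, right) child node on each held-out set. Keep the split only if
    both standard deviations stay below the pruning threshold."""
    stds = np.std(child_uplifts_per_fold, axis=0)
    return bool(np.all(stds <= threshold))

# Hypothetical uplift measurements on three held-out sets:
stable = np.array([[0.010, 0.030], [0.012, 0.028], [0.011, 0.031]])
erratic = np.array([[0.010, 0.090], [0.050, -0.040], [-0.030, 0.070]])
```

The first split produces nearly identical uplift on every held-out set and survives pruning; the second fluctuates wildly (even changing sign) and is removed.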
Given the inherent instability of decision trees, an alternative approach for improving the stability of the obtained uplift model is to grow an ensemble of uplift trees. Bagging, boosting, and random forests approaches for uplift modeling will be discussed in a later section.
As an alternative to the significance-based splitting criterion, a number of divergence-based splitting criteria have been proposed in Rzepakowski and Jaroszewicz (2012), with the concept of distributional divergence drawn from the field of information theory. We will call these approaches divergence-based uplift trees (DBUT).
The aim of a split in an uplift tree is in essence to maximize the distance in the class distributions of the response between treatment and control groups in the child nodes. In other words, the fractions of responders and nonresponders (i.e., the class distribution of the target variable, with the fraction of nonresponders equal to one minus the fraction of responders) in the treatment group should be as different as possible from the fractions of responders and nonresponders in the control group. When, in a relative sense (i.e., in proportion to the group size), there are many more responders in the treatment group than in the control group in a node, then the uplift in that node is large. Remember that the fraction of responders in a group corresponds to the probability of responding in that group, and the difference in probability of responding between the treatment and control group is the uplift in that node.
Before putting forward specific candidate divergence measures D, we introduce the divergence-based splitting approach in general terms by defining the weighted aggregate divergence D over the left and right child nodes of a split S between the class distributions P(y) of the response variable y in the treatment and control groups:

D(S) = (NL/N) D(PT,L(y) : PC,L(y)) + (NR/N) D(PT,R(y) : PC,R(y))

with NL and NR the number of observations in the left and right child nodes, respectively, and N = NL + NR the number of observations in the parent node.
The subscripts T and C, respectively, indicate treatment and control, and subscripts L and R indicate left and right child node. Consequently, if D(PT(y) : PC(y)) represents the divergence as measured in the parent node of the split, then a gain measure based on the divergence measure D for evaluating the quality of a split can be defined as follows:

GainD(S) = D(S) − D(PT(y) : PC(y))
GainD(S) is called the divergence gain measure and is similar to the general information gain and uplift gain measures defined respectively in Chapter 2 and in the previous section. Again, each candidate split can be evaluated by calculating the divergence gain measure, and the split with the highest divergence gain is selected. The procedure is repeated recursively as in standard decision tree approaches until a stopping criterion is met or until the tree is grown in full, after which it is pruned (see below).
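To make the mechanics of the divergence gain concrete, a small sketch follows, using the squared Euclidean distance as an example divergence D. The distributions, node sizes, and function names are illustrative only.

```python
def euclid(p, q):
    """Squared Euclidean divergence between two discrete distributions."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def divergence_gain(parent, children, D=euclid):
    """parent and each child are (P_T, P_C, n) triples: the class
    distributions of the response in the treatment and control groups,
    plus the node size. The gain is the weighted aggregate divergence
    over the child nodes minus the divergence in the parent node."""
    p_T, p_C, n = parent
    weighted = sum(m / n * D(t, c) for t, c, m in children)
    return weighted - D(p_T, p_C)

# Parent node: 30% responders in treatment, 20% in control, 200 observations
parent = ([0.3, 0.7], [0.2, 0.8], 200)
# Candidate split separating high-uplift from low-uplift customers
left = ([0.5, 0.5], [0.2, 0.8], 100)
right = ([0.1, 0.9], [0.2, 0.8], 100)
gain = divergence_gain(parent, [left, right])
```

Each candidate split is scored this way, and the split with the highest gain is selected, exactly as with the information gain in standard decision trees.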
Several divergence measures D that allow quantifying the difference in the (discrete) distribution P(y) of the response variable between the treatment and control groups have been proposed and tested in the literature for uplift modeling. These measures include the Kullback-Leibler divergence, the squared Euclidean distance, and the chi-squared divergence (Rzepakowski and Jaroszewicz 2012).
These measures are generally defined to express the divergence or distance from a distribution Q, which is the baseline distribution, to a distribution P, which deviates to some extent from Q as quantified by the measure.
For a more elaborate discussion on these divergence measures, one may refer to Csiszar and Shields (2004); Lee (1999); and Rzepakowski and Jaroszewicz (2012).
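For illustration, these divergence measures can be coded directly from their standard definitions; the function names below are ours, and P and Q represent the class distributions of the response in the treatment and control groups, respectively.

```python
from math import log

def kl(p, q):
    """Kullback-Leibler divergence KL(P : Q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def euclid(p, q):
    """Squared Euclidean distance E(P : Q) = sum_i (p_i - q_i)**2."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

def chi2(p, q):
    """Chi-squared divergence chi2(P : Q) = sum_i (p_i - q_i)**2 / q_i."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

# Class distribution of the response: treatment (P) versus control (Q)
p, q = [0.3, 0.7], [0.2, 0.8]
```

All three measures equal zero when the two distributions coincide (no uplift) and grow as the response distributions in treatment and control diverge.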
Soltys, Jaroszewicz, and Rzepakowski (2015) introduced an approach for making divergence-based uplift trees robust that is equivalent to the pruning approach discussed in Chapter 2. A validation set is to be held aside for pruning and is not used when growing the tree. The tree is grown on the training set, and the performance of the tree is monitored on the validation set in terms of an uplift performance metric that will be discussed in detail in the final section of this chapter—that is, the Qini measure or area under the uplift curve (AUUC). Note that this pruning approach permits the use of any preferred and suitable evaluation metric for pruning instead of the AUUC. Early stopping can be applied to determine the size of the tree and to stop adding splits when overfitting occurs. Preferably, however, post-pruning is employed to maximize simultaneously the performance and the generalization power of the final tree.
The previous sections discussed stand-alone decision trees for estimating uplift. Inspired by ensemble methods such as bagging, boosting, and random forests (see Chapter 2), one can also construct an ensemble of decision trees for uplift estimation. This approach can be expected to improve the stability of the resulting uplift model and increase the precision of the predictions because in various benchmarking studies, ensembles have been shown to yield superior performance (Dejaeger et al. 2012; Verbeke et al. 2012). The basic idea is to construct a set of B uplift trees, each built on a randomly selected fraction ν of the training data containing both treatment and control group observations. For learning the individual uplift trees, any of the approaches discussed in the previous section can be adopted.
In this section, a number of ensemble approaches for uplift modeling will be discussed that have recently been proposed in the literature (Radcliffe and Surry 2011; Guelman, Guillén, and Pérez-Marín 2012, 2014, 2015; Soltys et al. 2015). The key idea of these approaches is to replace the base learner in bagging, boosting, or random forests with an uplift decision-tree learner, with additional adjustments that are made to address specific uplift modeling challenges and to exploit opportunities offered by these meta-learning schemes.
Algorithm 4.1 was adopted from Guelman et al. (2012) and introduces the uplift random forests ensemble approach for uplift modeling.
Note that the variable/split-point selection in step 7 of Algorithm 4.1 in the original approach defined in Guelman et al. (2012) is performed with the Kullback-Leibler divergence-based splitting criterion, as discussed above. However, alternative splitting criteria can be used in this step.
In step 8 of the algorithm, it is mentioned that nodes are split into two or more branches. The Kullback-Leibler divergence-based splitting criterion as proposed in Guelman et al. (2014) limits the uplift decision trees to two child nodes. However, the original formulation in Rzepakowski and Jaroszewicz (2012) and the later adoption for uplift ensembles in Soltys et al. (2015) allow for the possibility of having splits with more than two child nodes. This latter approach can be useful to induce more-compact trees by providing more flexibility when growing the trees.
Uplift random forests come with two important parameters that must be determined: the number of trees B and the minimum node size lmin.
The final predicted uplift is then calculated by averaging the uplift over all trees. Again, alternative aggregation functions could be used—for instance, a function similar to the approach applied in boosting that accounts for tree performance when combining the estimates.
Algorithm 4.2 is obtained from Soltys et al. (2015) and adapts the standard bagging meta-learning scheme, as proposed by Leo Breiman (1996), for uplift modeling. DT and DC represent the datasets containing the observations of the treatment and the control groups, respectively.
Note that in step 7 of Algorithm 4.2, as in Algorithm 4.1, any splitting criterion can be used. In Soltys et al. (2015), the Euclidean divergence criterion as discussed in the previous section is used. Similar to Algorithm 4.1, the final predicted uplift is calculated by averaging the uplift over all trees, and the same two parameters B and lmin steer the ensemble learning process. Other than a minimum node size lmin, no pruning or stopping criterion is used when building the ensemble, meaning that full, or unpruned, trees are trained and included in the final model. It has been shown in the literature that this strategy in general improves the performance of the uplift ensemble when compared with including pruned trees (Soltys et al. 2015). Note that in the standard bagging technique as discussed in Chapter 2, pruned trees are used.
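The core of the uplift bagging scheme can be sketched as follows. This is a hedged illustration, not the exact Algorithm 4.2: `learn_tree` stands in for any uplift-tree learner from the previous section, and the single-node `stump` used in the demo is a deliberately degenerate stand-in for a real tree.

```python
import random

def uplift_bagging(D_T, D_C, learn_tree, B=100, seed=42):
    """Uplift bagging sketch: each tree is grown in full (unpruned) on a
    bootstrap sample drawn separately from the treatment data D_T and the
    control data D_C; learn_tree returns a scoring function x -> uplift."""
    rng = random.Random(seed)
    trees = []
    for _ in range(B):
        boot_T = [rng.choice(D_T) for _ in D_T]  # bootstrap treatment group
        boot_C = [rng.choice(D_C) for _ in D_C]  # bootstrap control group
        trees.append(learn_tree(boot_T, boot_C))
    # Final predicted uplift: average over all trees
    return lambda x: sum(t(x) for t in trees) / len(trees)

def stump(T, C):
    """Degenerate 'tree' for the demo: a single node whose uplift estimate
    is the response-rate difference on its bootstrap sample."""
    rt = sum(y for _, y in T) / len(T)
    rc = sum(y for _, y in C) / len(C)
    return lambda x: rt - rc

# Toy (x, response) data: 60% response when treated, 30% in control
D_T = [(0, 1)] * 6 + [(0, 0)] * 4
D_C = [(0, 1)] * 3 + [(0, 0)] * 7
model = uplift_bagging(D_T, D_C, stump, B=25)
```

In line with Soltys et al. (2015), the sketch averages unpruned trees; alternative aggregation functions could be substituted, as noted above for uplift random forests.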
Two aspects of uplift random forests have been further improved in the causal conditional inference tree (CCIT) and forests (CCIF) approaches as described in Algorithm 4.3, adopted from Guelman et al. (2014). The CCIT is a decision tree learner that can be applied for constructing an ensemble (e.g., using the random forests meta-learning scheme as used in the CCIF approach and described in Algorithm 4.3). The enhancements of the CCIT approach over the uplift tree implemented in the uplift tree ensemble approach address overfitting and variable selection bias of variables with many possible splits or missing values. Overfitting has been addressed in other approaches through standard pruning strategies as explained in Chapter 2 (Radcliffe and Surry 2011; Soltys et al. 2015) but can also be addressed in the specific context of uplift modeling by performing statistical tests on the significance of the interactions between the treatment and splitting variables. The significance of the interactions can be evaluated with a procedure based on the theoretical framework of permutation tests. Full details on the testing procedure applied in step 7 in Algorithm 4.3 are provided in Guelman et al. (2014) and are beyond the scope of this chapter on uplift modeling. Because the proposed enhancements disentangle the splitting criterion into a variable selection step (step 11 in Algorithm 4.3) and a variable splitting step (step 12 in Algorithm 4.3), this procedure also addresses the variable selection bias already mentioned.
If the tested null hypotheses—stating that no significant interactions exist between the treatment variable and the predictor variables—cannot be rejected at some significance level α, then there exists no significant difference between the treatment and control group in terms of response. Therefore, a split based on such a variable will not induce child nodes in which an actual difference in response between the treatment and control group exists. In other words, no uplift is observed because of applying the treatment.
Conversely, when the null hypothesis is rejected and significant interactions exist, the predictor variable with the most significant or strongest interaction effect is selected. Subsequently, a split is defined on this variable as indicated in step 12 of Algorithm 4.3 by applying the G2(S) split criterion as proposed by Su et al. (2009). The latter is essentially a chi-squared interaction test between the treatment variable and the selected variable, which is binned in two groups by a given split.
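The exact testing procedure of Guelman et al. (2014) is beyond our scope, but the flavor of a permutation test for a treatment-split interaction can be sketched as follows. The statistic and data layout are our own simplified choices for illustration, not the CCIT implementation.

```python
import random

def interaction_stat(rows):
    """rows: (side, treated, responded) per observation in a node. The
    statistic is the absolute difference in uplift between the two child
    nodes ("L" and "R") of a candidate split: large values indicate a
    treatment-split interaction."""
    def up(side):
        t = [y for s, w, y in rows if s == side and w == 1]
        c = [y for s, w, y in rows if s == side and w == 0]
        if not t or not c:  # degenerate permutation: no information
            return 0.0
        return sum(t) / len(t) - sum(c) / len(c)
    return abs(up("L") - up("R"))

def permutation_test(rows, n_perm=500, seed=0):
    """p-value of the interaction: permute the treatment indicator and
    count how often the permuted statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = interaction_stat(rows)
    treats = [w for _, w, _ in rows]
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(treats)
        permuted = [(s, w, y) for (s, _, y), w in zip(rows, treats)]
        if interaction_stat(permuted) >= observed:
            exceed += 1
    return exceed / n_perm
```

A small p-value indicates a significant interaction, so the split separates customers with genuinely different treatment effects; a large p-value suggests the split would not induce child nodes with an actual difference in uplift.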
The discussion on uplift modeling has thus far focused on binary outcome prediction and, more specifically, on response modeling. In the literature on uplift modeling, we almost exclusively find approaches and case studies aimed at estimating the effect of a treatment on a binary outcome variable. However, in many settings, the target variable of interest is continuous or ordinal in nature. For instance, we could setup a marketing campaign that aims at increasing customer spending rather than response. Alternatively, we could aim at increasing customer lifetime value, as discussed in Chapter 3.
It would be of great value and further boost the profitability of marketing campaigns and customer loyalty programs if we were able to model the long-term effects of such efforts on partial or total spending behavior. By partial, we refer to a single or limited selection of products or services over a short time span, whereas total refers to all products or services over a long time horizon.
Another example of a continuous target variable stems from the field of credit risk analytics, in which the loss given default (LGD) related to a credit represents the fraction of the outstanding exposure that is not recovered in the case of default (Baesens, Roesch, and Scheule 2016). From a profitability perspective, it is of great importance to know which collection strategies and actions effectively reduce the LGD. In this setting, uplift modeling with a continuous target variable has practical use to estimate the net effect of different treatments on the final loss. This use will allow minimization of the final loss by optimizing and customizing the exact treatments applied to defaulted obligors.
Except for Radcliffe and Surry (2011), who explicitly indicate the extensibility of the significance-based uplift tree approach for the continuous case, little effort appears to have been invested in developing uplift regression approaches. Hence, this field remains to be explored by scientists and practitioners.
Similar to regression trees, which define an alternative to classification trees for continuous target prediction (see Chapter 2), an equivalent approach has been proposed for the significance-based uplift tree (SBUT). To find the optimal split, SBUT fits a linear regression model to predict the target variable in each child node as a function of treatment and child node membership and the interaction between these two. This technique allows testing for the significance of the treatment and uplift in increasing the response probability. Such a linear model can, of course, also be fitted in the case of a continuous outcome target variable. Thus, SBUT is directly applicable to the continuous case without major adjustments.
Additionally, bagging, boosting, and random forests meta-learning schemes can be applied in combination with SBUT as a base learner leading to ensembles of uplift regression trees. Such schemes can lead to a more powerful uplift regression model than a stand-alone tree.
Alternatively, a two-model-based approach can be applied to the continuous case, fitting a separate regression model to both the treatment and control groups. The difference in the estimates is the uplift. The same drawbacks and advantages apply to the two-model approach for continuous target uplift modeling as discussed for the binary case.
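A minimal sketch of the two-model approach for a continuous target follows, using simple one-variable least-squares models; the spending data are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x

def two_model_uplift(treat_data, control_data):
    """Two-model approach: fit a separate regression model on the
    treatment and control groups; the predicted uplift for a customer is
    the difference between the two estimates."""
    f_T = fit_line(*zip(*treat_data))
    f_C = fit_line(*zip(*control_data))
    return lambda x: f_T(x) - f_C(x)

# Hypothetical (age, spend) data; the campaign raises spend by roughly 10
treat = [(20, 35.0), (30, 42.0), (40, 55.0), (50, 60.0)]
control = [(20, 25.0), (30, 33.0), (40, 44.0), (50, 51.0)]
model = two_model_uplift(treat, control)
```

Any regression technique could replace the simple line fit; the drawback, as in the binary case, is that neither model is trained to estimate the difference itself.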
Finally, Lo's approach can also be extended in a straightforward manner. For example, including the treatment variable and interaction terms with the treatment variable in a linear regression or neural network allows calculation of the difference in estimated output for the treatment variable taking a value of one and zero, respectively.
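This extension of Lo's approach can be sketched as follows: a single linear regression including the treatment indicator t and its interaction with the predictor x is fitted by ordinary least squares, and the predicted uplift is the difference between the fitted values for t = 1 and t = 0. The small solver and the synthetic data are illustrative.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small system A x = b."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

def lo_uplift_model(data):
    """Lo's approach, continuous target: least squares on [1, x, t, x*t];
    the predicted uplift is f(x, t=1) - f(x, t=0) = b2 + b3 * x."""
    rows = [[1.0, x, t, x * t] for x, t, _ in data]
    ys = [y for _, _, y in data]
    # Normal equations: (X'X) beta = X'y
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(4)]
    b0, b1, b2, b3 = solve(XtX, Xty)
    return lambda x: b2 + b3 * x

# Synthetic (x, t, y) data in which the true treatment effect is 3 + 0.1 * x
data = [(x, t, 2 + 0.5 * x + t * (3 + 0.1 * x))
        for x in (10, 20, 30, 40) for t in (0, 1)]
uplift = lo_uplift_model(data)
```

The same construction carries over to a neural network: score the fitted network twice, once with the treatment input set to one and once to zero, and take the difference.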
Evaluating these continuous uplift models will require the development of specific evaluation measures because, to our knowledge, no metrics have been defined in the literature on uplift modeling thus far that apply to the continuous case. This development could be challenging because evaluating binary uplift models is also not straightforward, as will be illustrated and discussed in the next section.
A conventional classification or regression model is evaluated by comparing the predictions made by the model on observations in a holdout test set with the observed outcomes for these observations. The differences between predictions and outcomes—that is, the errors on the observations in the test set—can be aggregated or summarized by means of performance measures discussed in Chapter 2 such as AUC, accuracy ratio, or MSE. Further insight into the performance of the model is provided by plotting the receiver operating characteristic (ROC) curve, correlation plots, and lift curves. Such visualizations can offer more-detailed insight into the performance but are less handy for comparing or expressing the overall accuracy or precision, which is exactly when performance indices are useful.
Although the use of an independent test set is also recommended for evaluating uplift models, we need alternative evaluation measures and visual evaluation approaches because of the fundamental problem of causal inference (FPCI) (Holland 1986). The FPCI essentially reduces to the simple fact that we cannot simultaneously observe for a single individual or entity the outcome of all possible treatments. Note that applying no treatment should, in general, also be considered one of the possible treatments.
Because of the FPCI, we cannot be certain whether a treatment has any effect on an individual's behavior. Consequently, we cannot observe the value of the target variable we aim to estimate with an uplift model. Because the target variable, or uplift, for a single entity is not observable, we cannot calculate the error that is made by an uplift model by comparing estimates with outcomes at the entity level.
What we can observe at the entity level and what we must use for evaluation is the post hoc outcome, given that a treatment was or was not applied. By grouping individuals accordingly and calculating the observed difference in behavior between similar groups of individuals who received and did not receive a treatment, an indicator of the performance of the model is obtained, as will be further elaborated below. The meaning of similar in this sentence will prove to be of crucial importance.
To evaluate an uplift model, the first step is to randomly select a test set consisting of observations from both the treatment and control groups as present in the development base; see Figure 4.2. The distribution of treatment and control group customers in the test set is preferably identical to the overall distribution in the development base and in the training set, thus avoiding possible sources of bias.
In the remainder of this section, we will initially discuss visual evaluation approaches, after which we will define a number of performance metrics.
The uplift by decile graph and the gains charts are two intuitive visual evaluation approaches that are often used to gain insight into the performance of an uplift model.
To plot the uplift by decile graph, in a first step, all of the observations in both the treatment and control group in the test set are scored with the uplift model. Subsequently, observations from both groups together are ranked from high to low estimated uplift. The response rate can then be plotted separately for the observations of the treatment and control groups for each decile, as shown in the upper panel of Figure 4.6. Additionally, the difference in response rate in each decile (i.e., the uplift by decile graph) can be plotted as shown in the lower panel of Figure 4.6.
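These steps can be sketched as follows, assuming the scored test set is available as (score, treated, responded) triples and that each decile contains observations from both groups; the names and layout are our own.

```python
def uplift_by_decile(scored):
    """scored: (uplift_score, treated, responded) triples for the test
    set. Observations from treatment and control groups are ranked
    together from high to low score; per decile, the response-rate
    difference between the treated and control observations (i.e., the
    uplift in that decile) is returned."""
    ranked = sorted(scored, key=lambda r: -r[0])
    n = len(ranked)
    uplifts = []
    for d in range(10):
        decile = ranked[d * n // 10:(d + 1) * n // 10]
        t = [y for _, w, y in decile if w == 1]
        c = [y for _, w, y in decile if w == 0]
        uplifts.append(sum(t) / len(t) - sum(c) / len(c))
    return uplifts
```

Plotting the two per-decile response rates separately gives the upper panel of a figure like Figure 4.6; the returned differences give the lower panel.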
The plots in Figure 4.6 allow analyzing the performance of an uplift model. Customers in the treatment group received a treatment to boost the response rate, whereas customers in the control group received no treatment. By comparing the response rates in both groups, the effect of the treatment can be observed. Two factors affect the differences in response rates observed in Figure 4.6:
The effectiveness of the campaign will affect the overall amount of uplift that is observed when comparing the full treatment group with the full control group. When the bars pertaining to the treatment group in the top panel of Figure 4.6 are of the same height as the bars pertaining to the control group, then the bars in the bottom panel of Figure 4.6 will be short, meaning that little uplift is achieved and the campaign appears to be ineffective. How much uplift is obtained or can be expected depends on the nature of the particular application. For some products, a strong uplift can be anticipated, whereas for other products (e.g., very expensive products), a limited uplift is to be expected.
The effectiveness of the model, conversely, becomes apparent from the bar plot in the lower panel of Figure 4.6. Ideally, the (positive) uplift should be situated as much as possible on the left side of the bar plot, meaning that the model assigns a high uplift score to customers who indeed display a higher response rate when treated (i.e., to the Persuadables in the customer base).
A good uplift model allows selecting treatment groups for which the treatment has a significant net effect on the response rate. Hence, if observations in both treatment and control groups are ranked using the uplift model, then by comparing the response rates in the treatment and control groups as a function of the uplift cutoff score, an indication of the quality of the model is obtained. The difference in response rate between treatment and control groups is expected to be relatively large for high cutoffs and decreases for decreasing cutoffs. The higher the difference at high cutoffs, the better the model manages to detect Persuadables.
By comparing the response rates in the treatment and control groups for the observations with the lowest predicted uplift scores, we can observe whether the treatment produces a negative effect on the propensity to respond. This result would occur if a lower response rate were observed among the treated versus the control groups, yielding a negative uplift or downlift because of the treatment.
Uplift can be negative for subgroups when the treatment has a negative effect on response behavior. This effect is observed for the so-called Do-Not-Disturb customers, as discussed previously in this chapter (see Figure 4.1). Observing downlift for the lowest ranked customers, that is, the observations with the lowest predicted uplift by the model, indicates good model quality because the model manages to identify Do-Not-Disturb customers accurately.
Between the Persuadables, who should be assigned the highest uplift values by the model, and the Do-Not-Disturbs, who should be assigned the lowest uplift by the model, the model should rank the Sure Things and Lost Causes. In an absolute sense, these last two groups should both be assigned a zero value for uplift because the treatment has no effect at all, in either a positive manner (e.g., for the Persuadables, who therefore should receive a positive uplift value) or in a negative manner (for example, for the Do-Not-Disturbs, who therefore should receive a negative uplift score).
This ranking leads to the response rate curve for a perfect uplift model, as displayed in Figure 4.7. The rationale behind the optimal response rate curve is the following: the perfect uplift model allows ranking customers according to expected effect. If we rank from high (left) to low (right) expected effect, then the optimal model initially ranks all Persuadables. The response rate can therefore initially be expected to equal zero in the control group and to equal 100% in the treatment group.
Next, after the Persuadables, the optimal uplift model ranks the customers for whom the treatment has no effect, either in a positive or in a negative manner. These customers are the Sure Things, who buy anyway, and the Lost Causes, who will never buy. The optimal uplift model does not distinguish between these two groups because the uplift for both groups is the same and equal to zero. The response rate in this group is different from zero because we have the Sure Things, who respond, and is different from 100% because we also have the Lost Causes, who do not respond. Note that the response rate level shown in Figure 4.7 for the combined group of Sure Things and Lost Causes was arbitrarily set at approximately 50%, assuming this group consists evenly of Sure Things and Lost Causes. Of course, the proportion of Persuadables, Sure Things, Lost Causes and Do-Not-Disturbs depends on the nature of the application. Therefore, the cutoff points on the x-axis shown on the curve for the optimal model in Figure 4.7 are also arbitrary.
Finally, the model ranks the customers for whom the treatment has a negative effect, that is, the Do-Not-Disturbs. Because these customers respond when not treated, we observe a response rate equal to 100% in the control group. Conversely, a response rate of zero is observed in the treatment group for these customers because they do not respond if treated.
Note that in Figure 4.6, the uplift in the top deciles is rather low and even negative in the third decile. The largest uplift is situated in the bottom deciles, corresponding to the customers who have been assigned a low score or uplift by the uplift model! Therefore, we can conclude that the uplift model yielding the uplift by decile graph in Figure 4.6 is not performing well and has limited practical value. In Figure 4.8, an uplift by decile curve of a good (i.e., accurate) uplift model is shown. A large uplift is achieved for the customers receiving the highest uplift scores, whereas negative uplift is observed for the customers assigned the lowest scores. Note that this curve is closer to the optimal shape of the curve, as shown in Figure 4.7.
An important final remark concerns the actual values of the estimated uplift, which are less relevant. Although calibrated uplift estimates that are exact in an absolute sense could be useful, the ranking of the customers, or the relative scores, is more important.
The performance of an uplift model can alternatively be visualized by plotting the cumulative difference in response rate between treatment and control groups in the test set as a function of a selected fraction x of the customers as ranked by the uplift model from high to low uplift. This curve is called the cumulative uplift, cumulative incremental gains, or Qini curve (Radcliffe 2007). The cumulative difference in response rate is measured in absolute or relative terms, that is, expressed either as a count of additional responders or as a fraction of the total population. Note that performance is again evaluated by comparing groups of observations rather than individual observations.
Figure 4.9 displays a Qini curve for an uplift model that can be benchmarked to the diagonal representing the Qini curve of a baseline random model. The cumulative incremental gains achieved by the random model are purely due to the treatment effect, and there is no additional gain by using the model to select customers. A good uplift model therefore should have a Qini curve that is well above the diagonal, thus further increasing the effect of the treatment as measured in terms of response.
Note that in Figure 4.9, the cumulative incremental gains on the y-axis could be expressed relative to the overall uplift effect, or increase in response rate, that is achieved by the treatment when comparing the full treatment and control group response rates. The overall uplift effect is equal to the cumulative incremental gain for treating 100% of the population. The Qini curves of the two uplift models shown in Figure 4.9 have not been normalized, and, as seen in the figure, the overall uplift effect of the treatment is approximately 0.8%. Moreover, the uplift effect is greater than 0.8% when selecting a smaller fraction of customers because the cumulative incremental gains for fractions below 100% are well above the diagonal. Because the distribution of treatment and control group customers along the uplift score range is generally not identical, a correction might be applied when plotting the Qini curve.
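The construction of the cumulative incremental gains curve can be sketched as follows. This simple version assumes comparable treatment and control distributions along the score range and therefore omits the correction just mentioned; it expresses the gains as additional responders relative to the total population.

```python
def qini_curve(scored, points=10):
    """Cumulative incremental gains: for each targeted fraction x, the
    difference in response rate between the treated and control customers
    among the top-x scored observations, multiplied by x (additional
    responders as a fraction of the population)."""
    ranked = sorted(scored, key=lambda r: -r[0])
    n = len(ranked)
    curve = [(0.0, 0.0)]
    for k in range(1, points + 1):
        top = ranked[: k * n // points]
        t = [y for _, w, y in top if w == 1]
        c = [y for _, w, y in top if w == 0]
        x = len(top) / n
        curve.append((x, (sum(t) / len(t) - sum(c) / len(c)) * x))
    return curve
```

Plotting these points against the diagonal of the random model reproduces a figure like Figure 4.9: a good model's curve lies well above the diagonal for small fractions.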
Performance metrics assess the quality of uplift models by summarizing the accuracy of the predictions made by the model in a single number. Although such metrics are less apt to provide detailed insight into fine-grained results, they have the advantage of allowing easy comparison between different models. Often, alternative models are developed and subsequently need to be compared. Likewise, a performance metric is also required for pruning decision trees. In addition to helping identify the best model, performance metrics may serve as the objective function when building an uplift model (Naranjo 2012). This is outside the scope of this chapter but will be discussed in the next chapter on profit-driven analytics.
When adopting the visual evaluation approaches discussed above for comparing different models, one may find inconclusive results because performance is a function of the targeting depth, i.e., the fraction of customers that is treated. For example, in Figure 4.9, the Qini curves of two models are plotted. It can be seen that the black curve of uplift model B tops the gray curve of uplift model A for most values x, representing the fraction of customers that are treated. However, for x = 50%, that is, when approximately half of the customers are treated, it appears from the curves that model A performs better than model B. Thus, the plotted Qini curves do not offer a conclusive answer to the question of which of these two models performs best (although indeed one may be inclined to select model B).
Although metrics are clearly of great practical use, it is challenging to develop an adapted and intuitive measure for uplift model evaluation. In this section, we will discuss two measures proposed in the literature that are directly related to the visual evaluation approaches introduced in the previous section: quantile uplift and Qini measures.
In many studies on uplift modeling, performance is evaluated by reporting the uplift obtained when treating a specific quantile or proportion of the population. For instance, the top decile uplift is often used as a metric. Note that this value can directly be observed in the uplift by deciles graph discussed above, which additionally provides the uplift for other deciles. Accordingly, little added value is provided using quantile uplift values, although they do facilitate the communication, reporting, and comparison of model performance. They also may be directly linked to the actual usage of the uplift model when in practice the top decile will be effectively selected and targeted in a campaign. Note that top decile uplift relates to the uplift by decile graph in a manner very similar to how the top decile lift relates to the lift curve, as discussed in Chapter 2.
Somewhat more refined and informative are quantile uplift ratio measures, such as the ratio of the uplift achieved in a quantile over the overall or baseline uplift (Radcliffe and Surry 2011). For instance, the ratio of the uplift in the top decile over the baseline uplift can be used, which again reminds us of the top-decile lift measure discussed in Chapter 2 in terms of providing an indication of how much improvement the model offers over randomly selecting a treatment group.
One alternative ratio measure that provides an indication of the ability of a model to rank customers accurately in terms of the resulting effect of the treatment is the ratio of an upper quantile uplift value and a bottom quantile uplift value. For instance, the ratio of the top decile uplift and the bottom decile uplift provides an indication of how well Persuadables receive a high uplift score and Do-Not-Disturbs a low uplift score.
Note that quantile uplift based measures may be highly sensitive to the selected quantile or cutoff value of the uplift score and therefore can lead to unreliable conclusions when they are used for model comparison. As indicated by Radcliffe and Surry (2011), “Changing the value of the cutoff may change or reverse the results of a comparison.” The Qini measure discussed next provides an alternative that does not depend on the cutoff score, thus greatly improving on the quantile uplift measures in this respect.
The Qini measure, introduced by Radcliffe (2007) and equivalent to the AUUC (Rzepakowski and Jaroszewicz 2010), is adapted from the well-known Gini measure to evaluate binary classification models. The Gini measure (also called the accuracy ratio) is related to the Gini curve (also called the cumulative gains, cumulative percentage captured, or cumulative accuracy profile curve), which plots the fraction of captured positive class observations as a function of increasing cutoff score or selected fraction of the population as ranked by the model, and is related to the AUC measure as discussed in Chapter 2.
The Qini measure is defined in relation to the cumulative incremental gains, cumulative uplift, or Qini curve discussed above. It measures the area between the Qini curve of the uplift model and the Qini curve of the baseline random model—that is, the diagonal in Figure 4.9.
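Given the points of a cumulative incremental gains curve, the unscaled Qini can be computed with the trapezoid rule; the list-of-points representation below is our own.

```python
def qini(curve):
    """Unscaled Qini: area between the model's cumulative incremental
    gains (Qini) curve and the diagonal of the baseline random model,
    via the trapezoid rule. `curve` is a list of (fraction, gain)
    points starting at (0, 0)."""
    area = sum((x2 - x1) * (g1 + g2) / 2
               for (x1, g1), (x2, g2) in zip(curve, curve[1:]))
    overall = curve[-1][1]        # gain when treating 100% of customers
    return area - overall / 2     # the diagonal's area is overall / 2
```

A model whose curve coincides with the diagonal thus obtains a Qini of zero, and larger values indicate more gain from using the model to select customers.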
Note that the Qini measure is different from the Gini measure in the sense that it is unscaled and not limited between zero and one. The Gini measure takes the ratio of the area between the diagonal and the Gini curve of the model and the area between the diagonal and the optimal curve of the perfect classification model. Similarly, one could think of taking the ratio of the unscaled Qini of a model and the unscaled Qini of a perfect model, which is the area between the diagonal and the optimal Qini curve. One obvious question is what the optimal curve would look like. Indeed, this curve relates to the optimal non-cumulative uplift curve, which is shown in Figure 4.7 and stems from the optimal response curves for the treatment and control group, also shown in Figure 4.7. The cumulative version of this optimal uplift curve is the optimal cumulative uplift or Qini curve.
The issue with the optimal Qini curve is that the cutoff points are unknown since we cannot know (because of the FPCI) how many Persuadables, Sure Things, Lost Causes, and Do-Not-Disturbs there are in the population. Thus, the denominator required to scale or normalize the unscaled Qini measure in a manner similar to that of the Gini measure cannot be determined. In Radcliffe and Surry (2011), a simplified optimal Qini curve is proposed. This optimal curve ignores downlift and characterizes the optimal uplift model as fully achieving the observed overall uplift ū in the treatment over the control group by selecting the proportion ū of highest-ranked or -scored customers. This enables the calculation of an unscaled Qini metric for the optimal model and the scaling of the Qini metric so its value is between zero and one.
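Under this simplified optimal curve, the scaling factor has a closed form. If the curve is expressed in incremental gains, it climbs linearly to the total gains G at fraction ū and stays flat thereafter, so the area between it and the diagonal equals G·ū/2 + G·(1 − ū) − G/2 = G(1 − ū)/2. A minimal sketch of the resulting normalization, with our own naming:

```python
def scale_qini(q_unscaled, total_gains, u_bar):
    """Scale an unscaled Qini value by the area between the simplified
    optimal Qini curve of Radcliffe and Surry (2011) and the diagonal.
    The optimal curve reaches total_gains at fraction u_bar (the overall
    observed uplift) and is flat afterward, so that area equals
    total_gains * (1 - u_bar) / 2."""
    return q_unscaled / (total_gains * (1.0 - u_bar) / 2.0)
```

In practice, sampling noise in the estimated Qini curve means the scaled value can occasionally fall slightly outside the zero-to-one range.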
Although it facilitates a comparison of uplift models and is often used in the academic literature on uplift modeling, the unscaled Qini has limited practical use because it does not offer a benchmark value that allows easy interpretation and, relatedly, because it is application dependent. This means that Qini values of models developed on different datasets, for different treatments, and so on, cannot be compared. This is exactly what is valuable about measures such as the AUC and R2, which allow data scientists to interpret the performance of a model by comparing it to that of other models.
Thus, the scaled Qini effectively improves on the unscaled Qini in that it facilitates interpretation and comparison. Nonetheless, a drawback that remains even after scaling is that the Qini measure assesses the full ranking. Because the Qini measure (like the AUC or Gini measure) evaluates model performance across all possible cutoff scores, it accounts for how well the model ranks every entity in the population, whether that entity has a very low, low, medium, high, or very high uplift score.
Typically, treating the full customer base is not optimal in terms of profitability, as will be further elaborated in Chapter 6. Thus, in many practical settings, we are interested only in how well the model performs on a fraction of the population. More specifically, we are concerned about how well the model performs on the customers that will effectively be treated. Often this concerns only a small fraction of customers (i.e., those with the highest uplift scores).
As will be discussed in Chapter 6, to maximize the payoff of running a marketing campaign, the selected fraction of customers to be treated needs to be optimized as a function of the power or accuracy of the model (Verbeke et al. 2012; Verbraken et al. 2013). It will be shown that this optimal fraction should be used as the cutoff score when evaluating the performance of a model in order to reach valid conclusions. A similar procedure could be developed for evaluating uplift models, but it is too complex to elaborate in detail here. A sufficient, simplified alternative sets a cutoff score close or equal to the operating point, as discussed for the quantile uplift measures. Such a cutoff score can then be used to calculate a quantile Qini measure, for instance the top decile Qini, which equals the area between the Qini curve and the diagonal over the interval [0, x], with x representing the selected fraction of customers.
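A quantile Qini of this kind is straightforward to compute once the Qini curve itself is available. The sketch below restricts the area calculation to the interval up to the chosen fraction x; the name and the evenly spaced curve representation are assumptions of ours:

```python
import numpy as np

def quantile_qini(fractions, gains, x=0.1):
    """Area between a Qini curve and the random-targeting diagonal over
    the top fraction x only (x = 0.1 gives the top decile Qini).

    fractions, gains: the Qini curve as two arrays on an evenly spaced
    grid, gains[i] being the cumulative incremental gains when targeting
    the top fractions[i] of customers."""
    fractions, gains = np.asarray(fractions), np.asarray(gains)
    diagonal = gains[-1] * fractions          # random model baseline
    step = fractions[1] - fractions[0]        # even grid spacing
    mask = fractions <= x                     # keep only the top fraction x
    return (gains - diagonal)[mask].sum() * step
```

Setting x to the fraction of customers that will effectively be treated evaluates the model exactly where it will be used.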
Various approaches for uplift modeling have been discussed in this chapter. An important question that arises, therefore, is which technique to adopt when practically developing an uplift model. Although the answer depends heavily on the time available for developing the model, on the specific requirements of the application setting and personal preferences, and on the skill and experience of the data scientist(s) involved, we present the following guidelines and two-step approach for developing uplift models:
Purely from a performance perspective, experimental evaluations indicate that ensemble-based approaches yield the best overall performance in terms of both the Qini measure and quantile uplift measures (Kane et al. 2014; Soltys et al. 2015). Experiments we conducted ourselves indicate that uplift random forests are overall a powerful approach, but we also find the two-model approach, Lo's method, and the generalized Lai method to perform well. However, substantial variation in performance is observed across applications, and alternative approaches often do better for a specific dataset. The base response rate and the fraction of customers that will eventually be treated also have an important impact and should be acknowledged when adopting the most appropriate approach in a particular setting. Therefore, the second step in the two-step approach, the experimental evaluation of the selected candidate techniques, is of critical importance in selecting the optimal solution.
All of the ensemble uplift modeling approaches discussed in this section have been implemented in the open source Uplift Modeling R package published and maintained by Leo Guelman, which can be found online at https://CRAN.R-project.org/package=uplift and which seamlessly integrates within the open source statistical software R (R Core Team 2015). The SBUT and DBUT uplift decision tree techniques have been implemented in this package as base learners that can be selected in combination with the uplift random forests, uplift bagging, or causal conditional inference forests meta-learning schemes.
The two-model and regression-based approaches can be implemented in a straightforward manner in standard analytics software. Lo's approach requires including the treatment and treatment interaction effects in the set of candidate predictor variables when developing a binary prediction model, whereas Lai's method and the generalized Lai method require a transformation of the target variable as illustrated in Table 4.4 and Table 4.5.
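As an illustration, both of these preprocessing steps can be set up in a few lines. The sketch below gives a commonly used formulation of Lo's feature augmentation and of Lai's class variable transformation; the exact definitions used in this book are those of Table 4.4 and Table 4.5, and the function names are our own:

```python
import numpy as np

def lo_features(X, treated):
    """Lo's approach: augment the predictors with the treatment indicator
    and the treatment-predictor interaction terms.  Uplift is then scored
    as the model's predicted difference between treated = 1 and treated = 0."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(treated, dtype=float).reshape(-1, 1)
    return np.hstack([X, t, X * t])

def lai_target(treated, outcome):
    """Lai's class variable transformation: label 1 for treated responders
    and control nonresponders, 0 otherwise.  With a 50/50 treatment split,
    uplift(x) = 2 * P(target = 1 | x) - 1 for a classifier trained on it."""
    treated, outcome = np.asarray(treated), np.asarray(outcome)
    return (((treated == 1) & (outcome == 1)) |
            ((treated == 0) & (outcome == 0))).astype(int)
```

Any standard binary classifier can then be trained on the augmented features or on the transformed target, which is exactly what makes these approaches straightforward to implement in standard analytics software.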
Elaborated example applications, including data and code to develop uplift models, are published on the accompanying book website: www.profit-analytics.com.
Uplift modeling essentially allows us to distill and estimate the net effect of a treatment on an individual entity (e.g., the change in a customer's purchasing or churning behavior resulting from targeting that customer in a response or retention campaign). The main challenge in uplift modeling is the fundamental problem of causal inference (FPCI), which concerns the impossibility of simultaneously observing the effect of multiple different treatments on the same entity. For instance, a customer cannot be targeted with a response campaign while also being excluded from that campaign. If this were possible, we would be able to observe or measure the exact difference in behavior caused by the treatment, but clearly, we cannot. Therefore, uplift modeling contrasts and analyzes the behavior of groups of entities receiving different treatments to indirectly learn the effect of a treatment as a function of an entity's characteristics.
For the specific purpose of uplift modeling, a relatively elaborate experimental design needs to be established to gather the required data, as extensively discussed in this chapter. Various modeling approaches have been introduced for estimating uplift, ranging from simple yet intuitive approaches to complex but more powerful techniques. Evaluating the obtained uplift models is not a trivial task: again, because of the FPCI, it is impossible to simply compare predictions and outcomes. Therefore, to provide detailed insight into uplift model performance, specialized plots and measures such as uplift curves and the Qini measure are needed.