Chapter 2. Intro to Analytical Thinking

In the last chapter I defined analytical thinking as the ability to translate business problems into prescriptive solutions. There is a lot to unpack from this definition, and this will be our task in this chapter.

To really understand the power of prescriptive solutions, I will start by precisely defining each of the three stages present in any analysis of business decisions: the descriptive, predictive, and prescriptive steps we already mentioned in Chapter 1.

Since one crucial skill in our analytical toolbox will be formulating the right business questions from the outset, I will provide a first glimpse into this topic. Spoiler alert: we will only care about business questions that entail business decisions. We will then dissect business decisions into levers, consequences, and business results. The link between levers and consequences is mediated by causation, so I will devote quite a bit of time to this topic. Finally, I will talk about the role that uncertainty plays in business decisions. Each of these topics is tied to one skill that we will develop throughout the book.

Descriptive, Predictive and Prescriptive Questions

In Chapter 1 we saw that data maturity models usually depict a nice, smooth road that starts at the descriptive stage, goes through the predictive plateau, and finally ascends to the prescriptive summit. But why is this the case? Let’s start by understanding what these stages mean, and then we can discuss why commentators and practitioners alike believe that this is the natural ascension in the data evolution.

In a nutshell, descriptive relates to how things are, predictive to how we believe things will be, and prescriptive to how things ought to be. Take Tyrion Lannister’s quote in Game of Thrones’ “The Dance of the Dragons” episode: “It’s easy to confuse what is with what ought to be, especially when what is has worked out in your favor” (my emphasis). Tyrion seems to be claiming that when the outcome of a decision turns out to be positive, we think that this was the best we could do (a type of confirmation bias). Incidentally, when the outcome is negative, our tendency is to think that this was the worst possible result and attribute our fate to some version of Murphy’s Law.

In any case, as this discussion shows, the prescriptive stage is where we can rank different options, so that words like “best” or “worst” make sense at all. It follows that the prescriptive layer can never be inferior to the descriptive one, since in the former we can always make the best decision.

But what about prediction? To start, its intermediate ranking is at least problematic, since description and prescription relate to the quality of decisions, whereas prediction is an input for making decisions, which may or may not be optimal or even good. The implicit assumption in all maturity models is that the quality of decisions can be improved when we have better predictions about the underlying uncertainty in the problem; that good predictions allow us to plan ahead and move proactively, instead of reacting to the past with little or no room for maneuver. And of course, there is the fact that we cannot change the past; its role is to help us learn and prepare to make better future decisions. Learning from the past, and the ability to move proactively: that’s where the power of prediction kicks in.

When is predictive analysis powerful: the case of cancer detection

Let’s take an example where better prediction can make a huge difference: cancer detection.1 Oncologists usually use some type of visual aid such as X-rays or the more advanced CT scans for early detection of different pathologies. In the case of lung cancer, an X-ray or a CT scan is a description of the patient’s current health status. Unfortunately, visual inspection is highly ineffective unless the disease has already reached a late stage, so description here, by itself, may not give us enough time to react. AI has shown remarkable prowess in predicting the existence of lung cancer from CT scans, by identifying spots that will eventually prove to be malignant.2 But prediction can only take us so far. A doctor must then recommend the right course of action for the patient to fully recover. AI provides the predictive muscle, but humans (doctors) prescribe the treatment.

Descriptive Analysis: the case of customer churn

Let’s run a somewhat typical descriptive analysis of a use case that every company has dealt with: customer churn. We will see that without guidance from our business objectives, this type of analysis might take us to a dead end.

How has customer churn evolved in the recent past?

Suppose that your boss wants to get a handle on customer churn. As a first step, she may ask you to diagnose the magnitude of the problem. After wrangling with the data you come up with the two plots in Figure 2-1. The left plot shows a time series of daily churn rates. Confidently, you state two things. First, after a relatively stable beginning of the year, churn is now on the rise. Second, there is a clear seasonal pattern, with weekends having lower-than-average churn. In the right panel you show that municipalities with higher average income also have higher churn rates, which, of course, is a cause for concern, since your most valuable customers may be switching to other companies.

Figure 2-1. Descriptive analysis of our company’s churn rate

This is a great example of what can be achieved with descriptive analysis. Here we have a relatively granular, up-to-date picture of the problem, and unfortunately, the news is not good for the company: your boss was right to ask for this analysis, since churn is on the rise, having reached a yearly record at the end of May with no signs of going back to previous levels. Moreover, our remarkable pattern-recognition abilities allow us to identify three patterns in the data: a change in trend and seasonal effects in the time series, and a positive correlation between the two variables in the scatterplot.

There are problems and risks with this type of analysis, however. As you’ve probably heard elsewhere, correlation does not imply causation, a topic that will be discussed at length later in this chapter. For instance, one plausible recommendation could be that the company should pull away from richer municipalities, as measured by average household income. This follows from a causal interpretation going from household incomes to churn rates, albeit an incorrect one.

The trap of finding actionable insights

One common catchphrase among consultants and vendors of big data solutions is that, given enough data, your data analysts and data scientists will be able to find actionable insights.

This is a common trap among business people and novice data practitioners: the idea that given some data, if we inspect it long enough, these actionable insights will emerge, almost magically. I’ve seen teams spending weeks waiting for the actionable insights to appear, without luck of course.

Experienced practitioners reverse engineer the problem: start with the question, formulate hypotheses, and use your descriptive analysis to find evidence for or against these hypotheses. Note the difference: under this approach we actively search for actionable insights by first deciding where to look for them, as opposed to waiting for them to emerge from chaos.

Moreover, the question remains as to how to create value from this picture. We now know that churn is rising (admittedly, this is better than not knowing), but since we don’t know why, we cannot devise a retention policy. You may argue that if we further inspected the data we might find the root causes for this upward trend, and this is the right way to proceed: formulate hypotheses that guide our analysis of the data. Inspecting data without advancing some plausible explanations is the perfect recipe for making your analytics and data science teams waste valuable time.

How many customers will churn this period?

As a next step, your boss may ask you to predict churn in the future. How should you proceed? It really depends on what you want to achieve with this analysis. If you work in finance, for example, you may be interested in forecasting the income statement for the next quarter, so you’d be happy to predict aggregate churn rates into the future. If you are in the marketing department, however, you may want to predict which customers are at risk of leaving the company, possibly because you want to try different retention campaigns. Each of these questions calls for different methods and hypotheses, at different levels of aggregation.

What is the best lever we can pull to prevent customers from churning?

Finally, suppose that your boss asks you to recommend alternative courses of action to reduce the rate of customer churn. This is where the prescriptive toolkit becomes quite handy and where the impact of making good decisions can be most appreciated. You may then set up a cost-benefit analysis for customer retention and come up with a rule such as the following: retain a customer whenever the expected profits from retention are positive, taking into account your budget; profits are calculated by subtracting the expected costs from the expected savings. You can then order all customers by this metric in decreasing fashion, and target your campaign at the customers with the largest net savings until you exhaust your budget (or reach the customer where the campaign breaks even).
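As a minimal sketch of this decision rule (the numbers and column names below are hypothetical, and the churn probabilities and values are assumed to come from some predictive model):

import pandas as pd

# Hypothetical customer-level estimates (in practice these come from predictive models)
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "churn_prob": [0.40, 0.10, 0.65, 0.25],            # predicted probability of churning
    "value_if_retained": [120.0, 300.0, 80.0, 150.0],  # expected value saved if we retain them
    "offer_cost": [20.0, 20.0, 20.0, 20.0],            # cost of the retention incentive
})

# Expected savings and expected profit from making the retention offer
customers["expected_savings"] = customers["churn_prob"] * customers["value_if_retained"]
customers["expected_profit"] = customers["expected_savings"] - customers["offer_cost"]

# Target customers with positive expected profit, highest first, until the budget is exhausted
budget = 50.0
targets = (customers[customers["expected_profit"] > 0]
           .sort_values("expected_profit", ascending=False))
targets = targets[targets["offer_cost"].cumsum() <= budget]
print(targets[["customer_id", "expected_profit"]])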

We will have the opportunity to go into greater detail on this use case, but let me just single out two characteristics of any prescriptive analysis. First, as opposed to the two previous analyses, here we actively recommend courses of action that can improve our position, by incentivizing a likely-to-leave customer to stay longer with us. Second, prediction is used as an input in the decision-making process, helping us calculate expected savings and costs. AI will help us better estimate these quantities, which are necessary for our proposed decision rule. But it is the decision rule that creates value, not prediction by itself.

One of the objectives of this book is to prepare us to translate business questions into prescriptive solutions, so don’t worry if it’s not obvious yet. We will have time to go through many step-by-step examples.

Business questions and KPIs

One foundational idea in the book is that value is created from making decisions. As such, prediction in the form of AI or machine learning is just an input to create value. In this book, whenever we talk about business questions, we will always have in mind business decisions. Surely, there are business questions that are purely informative and no actions are involved. But since our aim is to systematically create value, we will only consider actionable questions. As a matter of fact, one byproduct of this book is that we will learn to look for actionable insights in an almost automatic fashion.

This raises the question, then, of why we have to make a decision in the first place. Only by answering it will we know how to measure the appropriateness of the choices we make. Decisions that cannot be judged against any relevant evidence are to be discarded. As such, we will have to learn how to select the right metrics to track our performance. Many data science projects and business decisions fail not because of the logic used but because the metrics were just not right for the problem.

There is a whole literature on how to select the right key performance indicators (KPIs), and I believe I have little to add on this topic. The two main characteristics I look for are relevance and measurability. A KPI is relevant when it allows us to clearly assess the results of our decisions with respect to the business question. Notice that this has nothing to do with how pertinent the business question is, but rather with whether we are able to evaluate if the decision worked or not, and by how much. It follows that a good KPI should be measurable, ideally with little or no delay with respect to the time when the decision was made. As a general rule, the longer we take to measure the outcome, the lower the signal-to-noise ratio, so attributing the outcome to the decision becomes fuzzier.

SMART KPIs

For us, a good KPI has to be relevant and measurable. Compare this with the now classic SMART definition of good KPIs. The acronym stands for: Specific, Measurable, Achievable, Relevant and Timely. We have already mentioned the time dimension, and it’s hard to argue against specificity: there is a long distance between “improving our company’s state” and “increasing our profit margins”. The latter is quite specific and the former is so abstract that it can’t be actionable.

In my opinion, however, the property of being achievable sounds closer to a definition of a goal than to a performance indicator.

KPIs to measure the success of a loyalty program

Let’s briefly discuss one example. Suppose that our Chief Marketing Officer asks us to evaluate the creation of a loyalty program for the company. The question starts with an action (creating or not creating the loyalty program), so it immediately qualifies for us as a business problem. What metrics should we track? To answer this, let’s start the sequence of why questions.

  • Create a loyalty program. Why?

  • Because you want to reward loyal customers. Why?

  • Because you want to incentivize customers to stay longer with the company. Why?

  • Because you want to increase your revenues in the longer term. Why?

The sequence of why questions

This example is showcasing a technique that I call the sequence of why questions. It is used to identify the business metric that we want to optimize.

It works by starting with what you, your boss, or your colleagues think you want to achieve and questioning the reasons for focusing on that objective. Move one step up and repeat. The sequence terminates when you’re satisfied with the answer. Just in passing, recall that to be satisfied you must have a relevant and measurable KPI to quantify the business outcome you will focus on.

And of course, the list can go on. The important thing is that the final answer to these questions will usually let you clearly identify the KPI that is relevant for the problem at hand, along with any intermediate metrics that may prove useful; if it’s also measurable then you have found the right metric for your problem.

Consider the second question, for example. Why would anyone want to reward loyal customers? They are already loyal, without the need for any extrinsic motivation, so this strategy may even backfire. But putting aside the underlying reasoning, why is loyalty meaningful and how would you go about measuring the impact of the reward? I argue that loyalty by itself is not meaningful: we prefer loyal customers to not-so-loyal customers because they represent a more stable stream of revenues in the future. If you’re not convinced, think about those loyal but unprofitable customers. Do you still rank their loyalty as high as before? Don’t feel bad if your answer is negative: it just means that you are doing business because you want to make a decent living. If loyalty per se is not what you’re pursuing then you should keep going down the sequence of why questions, since it appears that we are aiming at the wrong objective.

Just for the sake of discussion, suppose that you still want to reward loyal customers. How do we measure if our rewards program worked, that is, what is a good KPI for this? One method is to directly ask our customers to rate their level of satisfaction, or whether they would recommend us to a friend or colleague, as is commonly done with the Net Promoter Score. Let us briefly discuss the pros and cons of these metrics.

On the bright side, it is a pretty direct assessment: we just go and ask our customers if they value the reward. It can’t get more straightforward than that. The problem here is that since humans act upon motivations, we can’t tell if the answer is truthful or if there’s some other underlying motive. It is not unreasonable that after creating the program and having already delivered some retention price discounts, Daniel, the customer, says that he is deeply dissatisfied with the company. We may then judge that our action (discounts) did not accomplish our objective of encouraging loyalty. But could it be that Daniel is trying to game our program? These kinds of strategic considerations matter when we assess the impact of our decisions.

An alternative is to let the customers indirectly reveal their level of satisfaction through their actions, say, from the amount, frequency, or ticket size of their recent transactions, or through a lower churn rate for those who receive the reward relative to a well-designed control group.3 Companies will always have customer surveys, and they should be treated as a potentially rich source of information. But a good practice is to always check if what customers say is supported by their actions. In some cases, we will find that it is better to replace attitudinal surveys that are sensitive to strategic manipulation with actions as our source for constructing KPIs. Assuming that both of these are relevant KPIs (they both proxy loyalty), which one ranks better on the measurability dimension? I argue that the second type does, since this property relates to our capacity to observe changes in the business objective, and surveys can be manipulated strategically.

The Net Promoter Score (NPS)

To calculate the NPS we directly ask (a sample of) our customers how likely they would be to recommend the company to others, on a 0 to 10 scale. A customer is classified as a Promoter, Detractor, or Passive depending on the answer: promoters answered 9 or 10, detractors answered 6 or below, and the remaining ones (7 or 8) are passives.

The NPS is an aggregate measure computed as:

NPS = % Promoters - % Detractors

and can thus vary between -100 and 100.
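As a quick illustration, here is one way to compute the NPS from a list of 0-10 answers (the scores below are made up):

def net_promoter_score(scores):
    """Compute the NPS from a list of 0-10 recommendation scores."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

# Made-up sample of survey answers: 4 promoters, 3 detractors, 2 passives out of 9
print(net_promoter_score([10, 9, 8, 7, 6, 3, 9, 10, 5]))  # roughly 11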

An Anatomy of a Decision: a simple decomposition

Figure 2-2 shows the general framework we will use to decompose and understand business decisions. Starting from the right, it is useful to repeat one more time that we always start with the business objective —  possibly making use of the sequence of why questions described above, which allows us to precisely pinpoint what we wish to accomplish. If your objective is unclear or fuzzy, most likely the decision shouldn’t be made at all. Companies tend to have a bias for action, so fruitless decisions are sometimes made; this not only may have unintended negative consequences on the business side but may also take a toll on employees’ energy and morale. Moreover, we now take for granted that our business objective can be measured through relevant KPIs. This is not to say that metrics arise naturally: they must be carefully chosen by us, the humans, as will be shown below with an example.

Figure 2-2. Decomposing decisions

It is generally the case that we can’t simply manipulate those business objectives ourselves (remember Enron4), so we need to take some actions or pull some levers to try to generate results. Actions themselves map to a set of consequences that directly affect our business objective. To be sure: we pull the levers, and our business objective depends on the consequences that arise when the “environment” reacts. The environment can be humans or technology, as we will see later.

Even if the mapping is straightforward (most times it isn’t) it’s still mediated by uncertainty, since at the time of the decision it is impossible to know exactly what the consequences will be. We will use the powers of AI to embrace this underlying uncertainty, allowing us to make better decisions. But make no mistake: value is derived from the decision and prediction is an input to make better decisions.

To sum up, in our daily lives and in business, we generally pursue well-chosen, measurable objectives. Decision-making is the act of choosing among competing actions to attain these objectives. Data-driven decision-making is acting upon evidence to assess alternative courses of action. Prescriptive decision-making is the science of choosing the action that produces the best results for us; we must therefore be able to rank our choices relative to a measurable and relevant KPI.

An example: why did you buy this book?

One example should illustrate how this decomposition works for every decision we make (Figure 2-3). Take your choice to purchase this book. This is an action you already took, but surely you could have decided otherwise. Since we always start with the business problem, let me imagine what type of problem you were trying to solve.

Figure 2-3. Decomposing your decision to buy this book

Since this book is published by O’Reilly Media, most likely your objective is to advance your career and not just to have a nice, pleasurable Friday night read.5 This sounds like a medium-to-long-run goal, and one possible metric is the number of interviews you get once you master the material (or at least write it down on your résumé, update your LinkedIn profile, or the like). If you don’t want to change jobs, but rather be more effective at your current position, alternative metrics could be the number of data science projects delivered end-to-end, the number of ideas for new projects, and so on. Notice how we must adjust the KPIs to different objectives. For now, let me just assume that the goal you want to attain is to be more productive at work, which can be readily measured.

The set of possible levers you can pull is now larger than just “buying” this book or “not”. You could have, for instance, adopted the “seven habits of highly effective people”, enrolled in an online course, kept improving your technical skills, worked on your interpersonal skills, bought other books, or done nothing at all. The advantage of starting with the business problem —  as opposed to a set of specific actions like “buy” or “not buy” —  is that your menu of options usually gets enlarged.

To simplify even more, let us assume that we only consider two actions: buy or not buy. If you don’t buy it (but please do) your productivity may keep increasing at the current rate. This is not the only possible consequence, of course. It could be that you get a sudden burst of inspiration and surprisingly start understanding all the intricacies of your job, positively and dramatically increasing your productivity. Or the opposite could happen, of course. The universe is full of examples where symmetry dominates. Nonetheless, let’s appeal to Occam’s razor and consider the only consequence that seems likely to occur: no impact on your productivity.

Occam’s razor

When there are many plausible explanations for a problem, the principle known as Occam’s razor favors the simplest one. Similarly, in statistics, when we have many possible models to explain an outcome, applying this principle means using the most parsimonious one.

We will devote a whole chapter to the skill of simplification.

If you do buy and read this book, we now have at least three likely consequences: the book works and improves your analytical skills, it does nothing, or it worsens your skills. Contrary to the previous analysis, in this case the latter is plausible enough to survive Occam’s razor: I could be presenting some really bad practices that you haven’t heard of and that you end up naively adopting. Now, at the time of making the decision you don’t really know the actual consequence, so you may have to resort to finding additional information, reading reviews, or using heuristics to assess the likelihood of each possible outcome. This is the underlying uncertainty in this specific problem.

To sum up, notice how a simple action helped us to clearly and logically find the problem being solved, a set of levers, their consequences, and the underlying uncertainty. This is in general a good practice that applies to any decision you make: if you are already making choices or decisions, think back to what specific problem you are attempting to solve — you can even try answering the sequence of why questions — and then reverse engineer a set of possible actions, consequences, and sources of uncertainty. Once this is set up as a complete decision problem, we can attempt to find the best course of action.

A primer on causation

We will devote a chapter to each of the “stages” in the decomposition, so there will be enough time to understand where these levers come from and how they map to consequences. It is important, though, to stop now and recognize that this mapping is mediated by causal forces.

Going back to the saying that “correlation does not imply causation”, no matter how many times we’ve heard it, it is still very common to get the two terms confused. Our human brain evolved to become a powerful pattern-recognizing machine, but we are not so well equipped to distinguish causation from correlation. To be fair, even after taking this apparent impairment into account, we are by far the most sophisticated causal creatures that we know of, and infinitely superior to machines (since at the time of writing they completely lack this ability, and it is not even clear when, or whether, it may be achieved).

Defining correlation and causation

Strictly speaking, correlation measures the presence or absence of a linear dependence between two or more variables. Though this is the technically accurate definition, we can dispense with the “linear” part and be concerned with general relationships between variables. For instance, the scatterplot in Figure 2-1 showed that average household income in each municipality was positively correlated with churn: they tend to move in the same direction, so that, on average, higher (lower) churn in a municipality is associated with higher (lower) average income.
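To make the distinction concrete, here is a small sketch that compares the Pearson (linear) correlation with a rank-based measure on made-up data where the relationship is strong but nonlinear:

import numpy as np
from scipy import stats

np.random.seed(0)
x = np.linspace(0, 5, 100)
y = np.exp(x) + np.random.randn(100)   # y grows with x, but not linearly

print(np.corrcoef(x, y)[0, 1])   # Pearson correlation: linear dependence only
rho, _ = stats.spearmanr(x, y)   # rank correlation: captures any monotonic relationship
print(rho)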

Causality is harder to define, so let us take the shortcut followed by almost everyone: a relation of causality is one of cause and effect. X (partially) causes Y if Y is (partially) an effect of X. The “partial” qualifier is used because rarely is one factor the unique source of a relationship. To provide an alternative, less circular definition, let us think in terms of counterfactuals: had X not taken place, would Y still have been observed? If the answer is positive, then it is unlikely that a causal relationship from X to Y exists. Again, the qualifier “unlikely” is important and related to the previous qualifier “partial”: there are causal relations that only occur if the right combination of causes is present. One example is whether our genes determine our behavior: it has been found that our genetic makeup by itself is generally not the unique cause of our behavior; instead, the right combination of genes and environmental conditions is needed for a behavior to arise.6

Going back to the scatterplot example, our brain immediately recognizes a pattern of positive correlation. How do we even start thinking about causality? It is common to analyze each of the two possible causal directions and see whether one, the other or both, make sense with respect to our understanding of the world. Is it possible that the high churn rates cause the higher average income in each municipality? Since household income usually depends on more structural economic forces —  such as the education levels of the members of the household, their occupations and employment status —  this direction of causality seems dubious, to say the least. We can easily imagine a counterfactual world where we lower the churn rates (say, by aggressively giving retention discounts) without changing the recipient household’s income in a significant way.

What about the other direction? Can higher income be the cause of the higher rates of churn? It is plausible that higher-income customers — paying higher prices — also expect higher quality, on average. If the quality of our product doesn’t match their expectations, they may be more likely to switch companies. How would the counterfactual work? Imagine we could artificially increase some of our customers’ household incomes. Would their churn rate increase? This ability to create counterfactuals is fundamental to even have a conversation about causality.

Understanding Causality: some examples

To fully appreciate the difficulty in identifying causality from data let’s look at some examples.

Simulated data

Let’s start by analyzing the data in Figure 2-4. Here, again, we immediately identify a very strong positive correlation between variables Y and X. Can it be that two variables move together in such a strong manner, and yet there is no causal relationship between them? One thing should be clear from the outset: there is no way we can devise causal stories without having some context, that is, without knowing what X and Y are and how they relate to the world.

Figure 2-4. A simulation of two highly correlated variables

This is an example of a spurious correlation, the case when two variables falsely appear to be related. The source of this deception is the presence of a third variable Z that affects both variables (Z → X and Z → Y); if we don’t control for this third variable, the two will appear to move together even when they are not related at all. I know that this is the case, because the following Python code was used to simulate the data.

Example 2-1. Simulating the effect of a third unaccounted variable on the correlation of other two
import numpy as np

# fix a seed for our random number generator and the number of observations to simulate
np.random.seed(422019)
nobs = 1000
# our third (confounding) variable is standard normal
z = np.random.randn(nobs, 1)
# let's say that z --> x and z --> y
# notice that x and y are not causally related!
x = 0.5 + 0.4*z + 0.1*np.random.randn(nobs, 1)
y = 1.5 + 0.2*z + 0.01*np.random.randn(nobs, 1)
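Continuing with the simulated arrays, a quick check shows that x and y are strongly correlated even though neither causes the other, and that the correlation essentially disappears once we control for z (here, by correlating the residuals left after regressing each variable on z):

# x and y are highly correlated even though neither causes the other
print(np.corrcoef(x.flatten(), y.flatten())[0, 1])

# Control for z: correlate what is left of x and y after removing their linear relation with z
x_resid = x.flatten() - np.poly1d(np.polyfit(z.flatten(), x.flatten(), 1))(z.flatten())
y_resid = y.flatten() - np.poly1d(np.polyfit(z.flatten(), y.flatten(), 1))(z.flatten())
print(np.corrcoef(x_resid, y_resid)[0, 1])  # close to zero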

A simulation was used to unequivocally show the dangers of a third variable that is not taken into account in our analysis, so you may wonder if this is something to worry about in your day-to-day work. Unfortunately, spurious correlations abound in the real world, so we had better learn to identify them and find workarounds. Misrepresenting causation will not only lead to ineffective decision-making but also to a loss of valuable time when developing predictive algorithms.

Churn and Income

Let us quickly revisit the positive association found between churn rates and households’ average income per municipality (Figure 2-1), and imagine that a strong competitor has entered the market with an aggressive pricing strategy targeted at customers in the medium-to-high income segments. It may well be, then, that this third variable explains the positive correlation: more of your higher-income customers will churn across all municipalities, and the effect will be larger in municipalities where their relative share is also larger.

Can divorces in England explain pollution in Mexico?

Consider now the examples in Figure 2-5. The top left panel plots a measure of global CO2 emissions and per capita real gross domestic product (GDP) in Mexico for the period 1900-2016. The top right panel plots the number of divorces in Wales and England against Mexican GDP for 1900-2014. The bottom panel plots the three time series, indexed so that the 1900 observation is 100.7

Inspecting the first scatter plot, we find a very strong, almost linear relationship between the state of the Mexican economy, as measured by per capita real GDP, and global CO2 emissions. How can this be? Let’s explore causality in both directions: it is unlikely that CO2 emissions cause Mexican economic growth (to the best of my knowledge, CO2 is not an important input for any production processes in the Mexican economy). Since the Mexican economy isn’t that large on a global scale, it is also unlikely that Mexico’s economic growth has had such an effect on global contaminants. One can imagine that fast-growing economies like China and India (or the US and Great Britain during the 19th and 20th Centuries) would be responsible for a big part of global CO2 emissions, but this is unlikely for the case of Mexico.

Figure 2-5. Top left panel plots global CO2 emissions against real per capita Gross Domestic Product (GDP) for Mexico for the period 1900-2016. Top right panel does the same, replacing CO2 emission with the number of divorces in Wales and England during 1900-2014. Bottom plot shows the time series for each of these variables.

The second scatterplot shows an even more striking relationship: per capita real GDP in Mexico is positively related to the number of divorces in England and Wales, but only up to a certain point (close to $10K dollars per person); after reaching that level the relationship becomes negative. Causal stories in this case become rather convoluted. Just for illustration, one such story —  from economic growth in Mexico to divorce rates in the UK —  could be that as the Mexican economy developed, more English and Welsh people migrated to the North American country to find jobs and share in the pie of economic prosperity. This could have broken homes apart and raised the prevalence of divorce. This story is possible, but highly unlikely, so there must be some other explanation.

As before, there is a third variable that explains the very strong but spurious correlations found in the data. This is what statisticians and econometricians call a time trend, that is, the tendency of a time series to increase (or decrease) over time. The bottom plot depicts the three time series over time. Observe first the evolution of per capita GDP and CO2 emissions. The two series evolve hand in hand until the late 1960s and early 1970s, thereafter maintaining different, but still positive, trends or growth rates. A similar comment applies to the number of divorces. This common trend is the third variable shared by the three time series, creating strong but spurious correlations. For this reason, practitioners always start by detrending, or controlling for the common trend among different time series, which allows them to extract more information from this noisy data.
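As a sketch of what detrending buys us, the following simulation creates two unrelated series that share only an upward time trend; removing the fitted linear trend from each makes the spurious correlation collapse (the series are simulated, not the actual GDP, emissions, or divorce data):

import numpy as np

np.random.seed(0)
t = np.arange(100)
# Two unrelated series that both happen to trend upward over time
a = 2.0 * t + 10 * np.random.randn(100)
b = 0.5 * t + 3 * np.random.randn(100)
print(np.corrcoef(a, b)[0, 1])   # very high: they share a common time trend

# Detrend each series by removing its fitted linear trend
a_detrended = a - np.poly1d(np.polyfit(t, a, 1))(t)
b_detrended = b - np.poly1d(np.polyfit(t, b, 1))(t)
print(np.corrcoef(a_detrended, b_detrended)[0, 1])   # close to zero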

High value customers have lower Net Promoter Scores

Let’s now give some examples closer to the business, starting with the risks of comparing customer satisfaction across customer segments. Recall that the NPS is a metric commonly used to track customer satisfaction, so it is natural to compare it across segments as in Figure 2-6.

Figure 2-6. Net Promoter Score (NPS) for two different value segments

The plot depicts the average NPS for two customer value segments, Premium and Regular, corresponding to customers with high and low customer lifetime values (CLV), respectively. The bars show that NPS is negatively correlated with the value of the customer, as measured by the CLV. Since this is a customer-centric company, one possible recommendation could be to focus only on lower-value customers (since they are the most satisfied). This is a causal interpretation that goes from value to satisfaction, and a customer’s value is treated as a lever. It could be, however, that a third variable is affecting both the NPS and the CLV, and that once we control for this intervening variable the relationship disappears. One such possible third variable is our customers’ socioeconomic level, possibly capturing the higher quality expectations that we described in the churn example.
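One simple way to probe this hypothesis is to stratify the comparison: compute the average score within each socioeconomic level and compare the value segments level by level. A sketch with hypothetical data and column names:

import pandas as pd

# Hypothetical survey data: value segment, socioeconomic level, and a 0-10 score
df = pd.DataFrame({
    "segment": ["Premium", "Premium", "Premium", "Regular", "Regular", "Regular"],
    "socio_level": ["High", "High", "Low", "High", "Low", "Low"],
    "score": [7, 8, 9, 8, 9, 10],
})

# Raw comparison, potentially confounded by socioeconomic level
print(df.groupby("segment")["score"].mean())

# Stratified comparison: compare segments within each socioeconomic level
print(df.groupby(["socio_level", "segment"])["score"].mean())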

Selection effects and the health status of our employees

Another important reason why we cannot immediately identify causality when analyzing our data is the presence of selection effects.8 Suppose that our Chief Human Resources Officer is considering saving costs by eliminating the company’s on-site medical service, since in her opinion most employees use the company-provided off-site services. Since this is a data-driven company, she decides to run an anonymous survey asking employees the following two questions:

  1. How is your health in general? Please provide an answer from 1 to 5, where 1 is Very Bad, 2 is Bad, 3 is Fair, 4 is Good and 5 is Excellent.

  2. In the last 2 months, have you been treated by our on-site doctors?

Figure 2-7 shows the average self-reported health status for groups of employees that declare having used the medical service or not. The CHRO was deeply concerned with the result presented by the Analytics Unit: treated employees report having worse health than those not treated. It seems that our on-site medical service is generating the exact opposite result it was designed for.

Figure 2-7. Self-reported health status for employees treated and not treated on-site. Vertical lines correspond to confidence intervals.

But is this analysis sound? One data scientist pointed out that it could be the result of selection effects. Her rationale was that employees who feel sick self-select into using the on-site service. She concluded that the results reflect the sick employees’ higher likelihood of going to the doctor, and not the causal effect of providing on-site assistance on the employees’ health, so this evidence alone should not be used to decide whether to close the service.

Figure 2-8 shows the two directions of causality. Self-selection (1) implies that sick employees are more likely to use the on-site service. They get treated according to standard medical practices, on average improving their conditions (2). The causal effect that the CHRO expected to find was given by (2), but unfortunately the selection effect was strong enough to counter the positive impact that the company’s doctors have.

Figure 2-8. Self-selection explains why treated employees show worse health conditions than those that have not attended the on-site medical service

Adverse selection in the credit card business

Let’s now discuss adverse selection, a type of selection effect common in situations where the parties involved have asymmetric information. Since it is commonly used to analyze credit outcomes in the financial sector, let us suppose that the Chief Risk Officer of a bank is deeply concerned about the recent upsurge in delinquency rates for the credit card products. After discussing alternative courses of action in the weekly Credit Committee, they decide to increase interest rates, hoping that this will lower the demand for credit and help cover the extra cost of risk.

One quarter later they find that, indeed, their credit card portfolio shrank. However, delinquency rates increased. What happened here? At the new, higher cost of credit, higher-risk customers were the only ones willing to accept such high interest rates. Our choice of lever incentivized exactly the market segment we wanted to avoid to self-select into asking for credit card loans.

Can investing in infrastructure increase customer churn?

Another example, from a capital-intensive industry like telecommunications, will show the dangers of decision-making in the presence of selection effects. Telcos have very large capital expenditures (CAPEX) since they need to constantly invest in building and maintaining a network that provides high-quality communication services to their customers. Suppose that our Chief Financial Officer must decide where to focus our investment efforts during the next quarter. After looking at the data, they plot churn rates in cities with no CAPEX last year against those with positive investment (Figure 2-9).

Figure 2-9. Average churn rates in cities with and without CAPEX during the previous year

The results are both surprising and frustrating: it appears that CAPEX has had unintended consequences, as churn is higher in cities where they invested last year relative to those without CAPEX. Could it be, maybe, that competitors reacted strategically, investing even more heavily in those cities and capturing an increasing market share? That is possible, of course, but here the most plausible explanation is a selection effect. Last year they focused their investment efforts on cities that were lagging in terms of customer churn and satisfaction. The result is exactly like the one in the medical example: the patients (the cities) still haven’t fully recovered, so the data still reflects their disadvantaged initial conditions.

Marketing mix and channel optimization: the case of online advertising

Our Chief Marketing Officer asks us to estimate the revenue impact of advertising spending on different channels, with a specific focus on our digital channels. After devoting considerable effort to getting the data, we find some very pleasant news: the online advertising Return on Investment (ROI) was 12% for the year, the second consecutive year with double digits. However, members of the data science team raised concerns that we may be overestimating the real impact. Their logic was as follows (Figure 2-10).

Our retargeting partner usually waits for internet users to visit our webpage. When that happens, they place a cookie so that we can track their online behavior. Some time later, an ad is served on a publisher’s website with the hope of converting this lead. Some of these customers end up buying our products on our website, so it seems that our advertising investment has done wonders for us.

Figure 2-10. Selection effects explains high digital marketing ROI

The problem here is conceptual, though: ideally, advertising should convert a user who was not planning to buy from us into a buyer. But those who end up buying from us had already self-selected by visiting our webpage, hence showing interest in our products. Had we not placed the ad, would they have purchased anyway? If the answer to this counterfactual is affirmative, then there is a case that the ROI might be overestimated; it could even be negative! To get a reasonable estimate we must get rid of our customers’ self-selection.

Estimating the impact that influencers have on our revenues

Our data-driven CMO also wants us to evaluate the impact that our army of influencers had on our revenues. Ideally, influencers’ reach should create a network multiplier effect, the multiplier given by the number of followers, their followers’ followers, and so on. That’s both the beauty and complexity of social networks.

Influence is difficult to measure, however, since followers self-select to follow the influencer. No one forced them. They already liked something about the influencer, making them more likely to behave like the influencer, not because of the influencer, but because they share some common tastes, possibly unobserved to us. This is the classic chicken-and-egg problem.

Figure 2-11. Counterfactual analysis used to estimate the effect that influencers have. Situation (A) depicts the case where followers were influenced and changed their consumption patterns. In situation (B) followers self-selected themselves.

Figure 2-11 shows two alternative scenarios, A and B, at two different moments in time: before we measure the influencer’s impact and during the measurement. Most of us would agree that scenario A is closer to what we would expect from an influencer: her followers changed their behavior thanks to her own change (not observed here). Similar to the online advertising case, here our influencer converted non-customers into customers, so she may well be worth her price. Scenario B, however, can potentially be explained by self-selection of like-minded followers, and our influencer’s impact seems lower or even null.

Some difficulties in estimating causal effects

Estimating the causal impact of pulling a lever X on an outcome Y is paramount, since we are trying to engineer optimal decision-making. The analogy is not an accident: like the engineer who has to understand the laws of physics to build skyscrapers, bridges, cars, or planes, the analytical leaders of today must have some level of understanding of the causal laws mediating between our actions and their consequences in order to make the best possible decisions. And this is something that humans must do; AI will help us later in the decision-making process, but we must first overcome the causal hurdles.

Problem 1: We can’t observe counterfactuals

As discussed in the previous sections, there are several problems that make the identification of causal effects much harder. The first one is that we only observe the facts, so we must imagine alternative counterfactual scenarios. In each of the previous examples, we knew that a direct causal interpretation was problematic because we were able to imagine alternative universes with different outcomes. One of the most important skills analytical thinkers must develop is to question the initial interpretation given to empirical results, and to come up with counterfactual alternatives to be tested. Would the consequences have been different had we pulled different levers, or the same levers but under different conditions?

Let’s stop briefly to discuss what this question entails. Suppose we want to increase lead conversion in our telemarketing campaigns. Tom, a junior analyst who took one college class on Freudian psychoanalysis, suggests that female call center representatives should have higher conversion rates, so the team decides to make all outbound calls for a day with their very capable group of women representatives. The next day they meet to review the results: lead conversion went from the normal 5% to an outstanding 8.3%. It appears that Freud was right, or better, that Tom’s decision to take the class had finally proven correct. Or did it?

To get the right answer, we need to imagine a customer receiving one call from a female representative in one universe, and the exact same call from a male representative in a parallel universe (Figure 2-12). Exact customer, exact timing, exact mood, and exact message; everything is the same in the two scenarios: we only change the tone of voice from that of a male to that of a female. Needless to say, putting such a counterfactual into practice sounds impossible. Later in this chapter we will describe how we can simulate these impossible counterfactuals through well-designed randomized experiments or A/B tests.

Figure 2-12. Counterfactual analysis of lead conversion rates in a call center

Problem 2: Heterogeneity

A second problem is heterogeneity. Humans are intrinsically different, each and every one of us the product of both our genetic makeup and our lifetime experiences, creating unique world visions and behaviors. Our task is not only to estimate how behavior changes when we choose to pull a specific lever —  the causal effect —  but also to take care of the fact that different customers react differently. An influencer recommending our product will have different effects on you and me: I may now be willing to try it, while you may choose to remain loyal to your favorite brand. How do we even measure heterogeneous effects?

Figure 2-13 shows the famous bell curve, the normal distribution, the darling of statistical aficionados. I’m using it here to represent the natural variation we may encounter in our customers’ responses when our influencer recommends our product. Some of her followers, like me, will accept the cue and react positively —  represented as an action to the right of the vertical dashed line, which marks the average response across all followers, followers’ followers, and so on. Some will have no reaction whatsoever, and some may even react negatively; that’s the beauty of human behavior: we sometimes get the full spectrum of possible actions and reactions. The shape of the distribution has important implications, and in reality our responses may not be as symmetric; we may have longer left or right tails, and reactions may be skewed towards the positive or the negative. The important thing here is that people react differently, making things even more difficult when we try to estimate a causal effect.

Figure 2-13. A normal distribution as a way to think about customer heterogeneity

The way we usually deal with heterogeneity is by dispensing with it: we estimate a single response, usually given by the average or mean (the vertical line in Figure 2-13). The mean, however, is overly sensitive to extreme observations, so we may sometimes replace it with the median, which has the property that 50% of responses are lower (to the left) and 50% higher (to the right); with symmetric, bell-shaped distributions the mean and the median conveniently coincide.
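A quick illustration of how sensitive the mean is to extreme observations (the responses are made up):

import numpy as np

# Nine moderate responses plus one extreme outlier
responses = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 100])
print(np.mean(responses))    # 12.7: pulled up by the single outlier
print(np.median(responses))  # 3.0: unaffected by it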

Problem 3: Selection Effects

One final problem, covered in detail in the previous section, is the prevalence of selection effects. Either we choose the customer segments we want to act upon, or they self-select, or both. One general result when empirically comparing the average outcomes of two groups, say those treated and those not treated, is:9

Observed Difference = Causal Effect + Selection Bias

It is standard practice to plot average outcomes as in the left panel of Figure 2-14. In this case, the outcome for the treated (customers receiving a call from female reps) is 0.29 units (say, dollars) higher than for those not exposed to our action or lever. This number corresponds to the left-hand side of the previous equation. The right panel shows the corresponding distributions of outcomes. Using the mean to calculate differences is standard practice, but it is useful to remember that there is a full spectrum of responses, in some cases with a clear overlap between the two groups: the shaded areas show responses from customers in the two groups that are indistinguishable from each other.

Figure 2-14. Left panel plots the observed differences in average sales for customers receiving a call from female and male representatives. Right panel shows the actual distributions of outcomes.

In any case, the difference in observed outcomes (the left-hand side) is not enough for us, since we already know that it is potentially biased by selection effects; because our interest is in estimating the causal effect, we must devise a method to cancel out the pervasive effect of selection.
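A small simulation may help fix ideas. In the following sketch, customers with higher baseline sales are more likely to self-select into treatment, so the observed difference in means equals the true causal effect plus a selection bias (all numbers are made up):

import numpy as np

np.random.seed(1)
n = 100_000
baseline = np.random.normal(100, 20, n)   # sales each customer would make with no treatment
true_effect = 5.0                         # the causal effect we would like to recover

# Self-selection: customers with higher baseline sales are more likely to be treated
treated = baseline + np.random.normal(0, 10, n) > 105

sales = baseline + true_effect * treated  # observed outcome
observed_diff = sales[treated].mean() - sales[~treated].mean()
selection_bias = baseline[treated].mean() - baseline[~treated].mean()

print(observed_diff)                    # causal effect plus selection bias
print(selection_bias)                   # far from zero because of self-selection
print(observed_diff - selection_bias)   # recovers the true effect of 5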

Statisticians and econometricians, not to mention philosophers and scientists, have been thinking about this problem for centuries. Since it is physically impossible to get an exact copy of each of our customers, is there a way to assign our treatments that circumvents the selection bias? It was Ronald A. Fisher, the famous 20th-century statistician and scientist, who put the method of experimentation on firm ground; it remains the most prevalent method among practitioners when we want to estimate causal effects. The idea is simple enough to describe without making use of technical jargon.

A primer on A/B testing

While we may not be able to get exact copies of our customers, we may still be able to simulate such a copying device using randomization, that is, by randomly assigning customers to two groups: those who receive the treatment and those who don’t. Note that the choice of two groups is made for ease of exposition, as the methodology applies to more than two treatments.

We know that customers in each group are different, but by correctly using random assignment we dispose of any selection bias: our customers were selected by chance, and chance is thought to be unbiased. In practical terms, before our customers get a call from our call center representatives, customers in the female-rep treatment are, on average, ex-ante the same as those in the male-rep treatment. Luckily, we can always check whether random assignment created groups that are, on average, ex-ante equal.

Risks when running randomized trials

We have noted that randomization is unbiased in the sense that the result of a random draw is obtained by chance. In practice we simulate pseudorandom numbers that have the look and feel of a random outcome but are in fact computed with a deterministic algorithm. For instance, in Excel, you can use the =RAND() function to simulate a pseudorandom draw from a uniform distribution.

It is important to remember, however, that using randomization does not necessarily eliminate selection bias. For example, even though the probability is extremely low, by pure chance we may end up with only male customers in the male-representative group and only female customers in the female-representative group, so our random assignment ended up selecting by gender, potentially biasing our results. That’s why we always check, ex post, whether the random assignment passes the test of equal means on observable variables.

Last but not least, there may be ethical concerns, since in practice we are potentially affecting the outcomes of one group of customers. One should always go through a checklist of any ethical considerations before running an experiment.

You may be wondering what it means for two groups to be indistinguishable before making the random assignment (ex-ante equal). Think about how you would tell two people apart: start checking, one by one, each and every observable characteristic and see if they match. If there’s something where they look different then they are not indistinguishable. We do the same for two different groups of people: list all observable characteristics and check if their group averages are the same, after taking into account the natural random variation. For instance, if customers in the female and male representative groups are on average 23 and 42 years old respectively, we should repeat the randomization to make them indistinguishable in terms of all observables, including age.
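In practice, this balance check is just a comparison of group means on observables after the random split. A minimal sketch, assuming a customer table with a single observable characteristic (the column names are hypothetical):

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Hypothetical customer base with one observable characteristic
customers = pd.DataFrame({"age": rng.integers(18, 80, size=10_000)})

# Random assignment to the two treatments
customers["group"] = rng.choice(["female_rep", "male_rep"], size=len(customers))

# Ex-ante balance check: group means should differ only by random variation
print(customers.groupby("group")["age"].agg(["mean", "std", "count"]))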

A/B testing in practice

In the industry, the process of randomizing to assign different treatments is called A/B testing. The name comes from the idea that we want to test an alternative B to our default action A, the one we commonly use. As opposed to the techniques in the machine learning toolbox, A/B testing can be performed by anyone without a strong technical background. We may need, however, to guarantee that our test satisfies a couple of technical statistical properties, but these are relatively easy to understand and put in practice. The process usually goes as follows:

  1. Select an actionable hypothesis you want to test: for example, female call center representatives have a higher conversion rate than male ones. This is a crisp hypothesis that is falsifiable.

  2. Choose a relevant and measurable KPI to quantify the results of the test: in the example, we have chosen average conversion rates. If conversion rates for female reps are not sufficiently higher than those for male reps, we have falsified the hypothesis.

  3. Select the number of customers that will be participating in the test: this is the first technical property that must be carefully selected and will be discussed below.

  4. Randomly assign the customers to both groups and check that randomization produced groups that satisfy the ex-ante indistinguishable property.

  5. After the test is performed, measure the difference in average outcomes. We should take care of the rather technical detail of whether a difference is generated by pure chance or not.

If randomization was done correctly, we have eliminated the selection bias, and the difference in average outcomes provides an estimate of the causal effect.
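Step 5 usually boils down to comparing average outcomes and checking whether the difference could plausibly be due to chance, for example with a two-sample t-test. A sketch with simulated conversion outcomes:

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated conversion outcomes (1 = converted); treatment B converts slightly better
conversions_a = rng.binomial(1, 0.05, size=5_000)
conversions_b = rng.binomial(1, 0.06, size=5_000)

lift = conversions_b.mean() - conversions_a.mean()
t_stat, p_value = stats.ttest_ind(conversions_b, conversions_a)

print(lift)      # estimated difference in conversion rates
print(p_value)   # small values suggest the difference is unlikely to be pure chance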

Understanding power and size calculations

Step 3, selecting the number of customers, is what practitioners call power and size calculations, and unfortunately there are key trade-offs we must face. Recall that one common property of statistical estimation is that the larger the sample size, the lower the uncertainty we have about our estimate. We can always estimate the average outcome for groups of 5, 10, or 1,000 customers assigned to the B group, but our estimate will be more precise for the latter than for the former. From a strictly statistical point of view, we prefer having large experiments or tests.

From a business perspective, however, testing with large groups may not be desirable. First, our assignment must be respected until the test comes to an end, so there is the opportunity cost of not trying other potentially more profitable treatments, or even our control or base scenario. Because of this, it is not uncommon for business stakeholders to want to finish the test as quickly as possible. In our call center example, it could very well have been the case that conversion rates were lower for the group of female reps, so for a full day we operated suboptimally, which may take an important toll on the business (and on our colleagues’ bonuses). We simply can’t know this at the outset (but a well-designed experiment should include some analysis of the cost of this happening).

Because of this trade-off we usually select the minimum number of customers that satisfies two statistical properties: the experiment should have the right statistical size and power so that we can conclude with enough confidence whether it was a success or not. This takes us to the topic of false positives and false negatives.

False positives and false negatives

In our call center example, suppose that contrary to Tom's assumption, male and female representatives have exactly the same productivity or conversion efficiency. The real but unobserved effect of our treatment is then zero, but when we compare the average outcomes the estimated difference will almost never be exactly zero, even if it is small. How do we know if the difference in average outcomes is due to random noise or if it reflects a real, but possibly small, difference? Here's where statistics enters the story.

There is a false positive when we mistakenly conclude that the treatment had an effect, that is, that there is a difference in averages between groups, when in fact there is none. The statistical size of the test is chosen to cap the probability of this happening.

On the other hand, it could be that the treatment actually worked, but we chose a sample so small that we cannot confidently rule out random noise as the explanation for the difference we observe. This is quite a common scenario, and could happen if our business counterparts fear that a larger test will put the current quarter's results at risk. The result is an underpowered test. In our call center example, we would falsely conclude that representatives' productivity is the same across genders when in fact one group has higher conversion rates.

The left panel in Figure 2-15 shows the case of an underpowered test. The alternative B treatment creates 30 additional sales, but because of the small sample sizes, this difference is estimated with insufficient precision. The right panel shows the case where the real difference is close to 50 extra sales, and we were able to precisely estimate the averages and their differences.

Figure 2-15. Left panel shows the result of an underpowered test: there is a difference in the average outcomes for the treated and untreated, but the small sample sizes do not allow us to estimate this effect with enough precision. Right panel shows the ideal result, where there is a difference and we can correctly conclude this is the case.
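
To see the role of sample size in action, here is a minimal simulation sketch; the conversion rates, significance level and group sizes are illustrative assumptions rather than numbers from the text. With a true uplift in conversions, a small test detects it only occasionally, while a larger test detects it most of the time.

# A minimal sketch of why sample size matters (illustrative numbers):
# simulate many A/B tests with a real uplift and count how often each
# test size manages to detect it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
p_a, p_b = 0.10, 0.12   # hypothetical true conversion rates
alpha = 0.05            # size of the test

def detection_rate(n_per_group, n_simulations=2_000):
    """Fraction of simulated tests that detect the true difference."""
    detected = 0
    for _ in range(n_simulations):
        a = (rng.random(n_per_group) < p_a).astype(float)
        b = (rng.random(n_per_group) < p_b).astype(float)
        _, p_value = stats.ttest_ind(b, a, equal_var=False)
        detected += p_value < alpha
    return detected / n_simulations

print("Small test (200 per group), detection rate:", detection_rate(200))
print("Large test (5,000 per group), detection rate:", detection_rate(5000))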

Let's briefly talk about the costs of false positives and false negatives in the context of A/B testing. For this, recall what we wanted to achieve with the experiment in the first place: we are currently pulling a lever and want to know if an alternative is superior for a given metric that impacts our business. As such, there are two possible outcomes: we either continue pulling our A lever, or we substitute it with the B alternative. In the case of a false positive, the outcome is making a subpar substitution. Similarly, with a false negative we mistakenly continue pulling the A lever, which also impacts our results. In this sense the two errors are roughly symmetric (in both cases we face an uncertain long-term impact), but it is not uncommon to treat them asymmetrically, by setting the probability of a false positive at 5% or 10% (the size) and the probability of a false negative at 20% (one minus the power).

There is, however, the opportunity cost of designing and running the experiment. With an underpowered experiment we cannot conclude anything meaningful: if there is a real difference across treatments we won't be able to identify it with confidence, and if there isn't, we won't be able to tell whether any observed difference is more than pure chance. That's why most practitioners fix the size of the test and find the minimum sample size that allows them to detect some minimum effect.

Selecting the sample size

In tests where we only compare two alternatives, it is common to encounter the following relationship between the variables of interest:

MDE = \left(t_{\alpha} + t_{1-\beta}\right)\sqrt{\dfrac{\mathrm{Var}(\mathrm{Outcome})}{N\,P\,(1-P)}}

Here t_α and t_{1-β} are critical values from a t distribution associated with the size (α) and power (1-β) of the test, MDE is the minimum detectable effect of the experiment, N is the total number of customers in the test, P is the fraction assigned to the treatment group, and Var(Outcome) is the variance of the outcome metric you're using to decide whether the test is successful or not.

The next snippet shows how to calculate the sample size for your experiment with Python.

# Example: calculating the sample size for an A/B test
import numpy as np
from scipy import stats

def calculate_sample_size(var_outcome, size, power, MDE):
    '''
    Calculate the total sample size N for an A/B test by inverting
    MDE = (t_alpha + t_oneminusbeta)*np.sqrt(var_outcome/(N*P*(1-P)))
    df: degrees of freedom for the t critical values
    (any large value gives essentially the same result)
    '''
    df = 1000
    # one-sided critical values for the chosen size and power
    t_alpha = stats.t.ppf(1 - size, df)
    t_oneminusbeta = stats.t.ppf(power, df)
    # same number of customers in treatment and control group
    P = 0.5
    N = ((t_alpha + t_oneminusbeta)**2 * var_outcome) / (MDE**2 * P * (1 - P))
    return N

# Illustrative inputs: replace with your own variance, size, power and MDE
var_y = 4500    # variance of the outcome metric (e.g., sales per customer)
size = 0.05     # probability of a false positive
power = 0.80    # probability of detecting a true effect of at least MDE
MDE = 5         # minimum detectable effect that makes the test worthwhile

sample_size_for_experiment = calculate_sample_size(var_y, size, power, MDE)
print('We need at least {0:.0f} customers in the experiment'.format(
    np.ceil(sample_size_for_experiment)))

In practice, we start by setting the size and the power (if symmetry seems like an appealing property, you can set them to the same level; otherwise you can follow standard practices such as the ones described above). We then choose a minimum detectable effect, that is, the minimum impact on the metric we wish to affect (say, profits) that makes the experiment worthwhile for the business. We can then reverse engineer the sample size we need.

Uncertainty

We have now talked about each of the stages in the decomposition: starting from the business objective and its corresponding KPIs, we reverse engineer the actions or levers that impact them, mediated by some consequences. Since decisions are made under uncertainty, the mapping from actions to consequences is not known at the time of the decision. But by now we already know that uncertainty is not our enemy: we can embrace it thanks to the advances in the predictive power of AI.

But why do we have uncertainty? Let us first discuss what this uncertainty is not, and then we can talk about what it is. When we flip a coin, the outcome is driven by chance: if the coin is balanced, the chances that it lands on heads or tails are 50%, but we cannot fully anticipate which it will be. Arguably, this is the most familiar example of randomness, since we have played heads or tails since childhood.

This is not, however, the type of uncertainty we face when making decisions, and that is good news for us. The fact that ours is not pure randomness allows us to use powerful predictive algorithms, combined with our knowledge of the problem to select input variables or features, to create a prediction. With pure randomness, the best we can do is learn or model the distribution of outcomes and derive some theoretical properties that allow us to make smart choices or predictions.10
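
To make the contrast concrete, here is a minimal sketch of what we can do under pure randomness, along the lines of the coin tossing note: model flips as Bernoulli trials, estimate the probability of heads from data, and predict the expected number of heads in future flips. All numbers are illustrative.

# A minimal sketch of modelling pure randomness (see note 10): estimate
# the probability of heads from simulated coin flips and use it to
# predict the expected number of heads in future trials.
import numpy as np

rng = np.random.default_rng(seed=4)
flips = rng.random(1_000) < 0.5          # 1,000 simulated fair-coin flips
p_heads_hat = flips.mean()               # estimated probability of heads

n_future = 200
expected_heads = n_future * p_heads_hat  # theoretically derived prediction
print(f"Estimated P(heads) = {p_heads_hat:.2f}; "
      f"expected heads in {n_future} future flips: {expected_heads:.0f}")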

The four main sources of uncertainty are our need to simplify, heterogeneity, complex and strategic behavior and pure ignorance about the phenomenon, each of which will be described in turn. Note that as analytical thinkers we should always know where uncertainty comes from, but it is not uncommon that we end up being taken by surprise.

Figure 2-16 shows one way to classify our uncertainties.11 The technique divides the world into four quadrants: things we know we know (upper left), things we know we don't know (upper right), things we don't know we know (lower left) and things we don't know we don't know (lower right). Well-done simplification usually moves things we know into unknown territory for the sake of convenience (A). Heterogeneity should ideally be a known unknown (B), but it may sometimes sit on the tenuous border between being conscious or not of our not knowing (C). The same can be said about behavior with possibly unknown results: good training will make us conscious of our uncertainty, but many times we don't really know. Ignorance is usually the most dangerous source and the origin of unintended consequences. Scientists and analytical leaders alike embrace these opportunities and try to move them to the known-known quadrant.

Figure 2-16. The four quadrants allow us to classify phenomena with respect to whether we are conscious or not of knowing/not knowing the truth value of each occurrence.

Uncertainty from simplification

Albert Einstein has many great quotes, but one of my favorites is "everything should be made as simple as possible. But not simpler." In the same vein, statistician George Box famously said that "all models are wrong, but some are useful". Models are simplifications, metaphors that help us understand the workings of the highly complex world we live in.

I cannot overstate the importance that this ability to simplify has for the modern analytical thinker. We will have plenty of time later in the book to exercise our analytical muscle through some well-known techniques, but we should now discuss the toll that simplification takes.

As analytical thinkers we constantly face the trade-off between getting a good-enough answer now or devoting more time to developing a more realistic picture of the problem at hand. The cost is not only time, but also the precision or certainty we have about the answer being correct. We must decide how much uncertainty we are willing to accept in order to get a timely solution. This calibration takes practice, as Einstein's quote succinctly suggests.

One clear example of the powers and dangers of simplification is maps. Figure 2-17 shows a section of the official Transport for London (TfL) Tube map on the left and a more realistic version on the right, also published by the transportation authority.12 With the objective of making our transportation decisions fast and easy, the map trades off realism for ease of use. As users of the map, we now face uncertainty about the geography, distances, angles and even the existence of possibly relevant venues such as parks or museums. But to a first approximation we feel comfortable with this choice of granularity, since our first objective is to get from our origin to our destination. We can later take care of the remaining parts of the problem.

Figure 2-17. Sections of the London underground maps. Left panel corresponds to the official tube map. Right panel shows a more realistic version of the same section.

This last point takes me to another related issue: one common simplification technique is to divide a complex problem into simpler subproblems that can each be tackled independently, something computer scientists call divide and conquer. Each of these components generates uncertainty that will later be aggregated, and the result of this aggregation may or may not be cleaner than what we would get from simplifying the more general problem directly. Nonetheless, as important as simplification is, so is being conscious of what we simplified and where. As Box, the statistician, added: "(…) the approximate nature of the model must always be borne in mind".13

Uncertainty from heterogeneity

The large variety of behaviors, tastes and responses can usually be modelled with the use of distributions, as we did in Figure 2-13. In this case, for the explicit purpose of solving a problem, we can dispense with understanding the nitty-gritty details of how and why outcomes are so diverse, and instead just take a modelling approach. It becomes handy, then, to know some basic properties of distributions.

For instance, when we have no information about the distribution other than its range, or there are grounds to argue that outcomes will not accumulate anywhere in particular, we might model the heterogeneity using a uniform distribution. Uniformity is first and foremost assumed for simplification purposes, but it can also be used to model heterogeneous outcomes. Take the distribution of people on a subway platform waiting for the next train to arrive. Their main goal may be to take the train as quickly as possible. If this can be guaranteed, they may next want to be able to sit, or to be as close to their gates as possible. What happens at peak hours? We see that people end up occupying the full extent of the platform in an almost uniform fashion, and that is what we should expect given the goals we just stated.

As discussed previously, the bell curve or Gaussian distribution is pervasive in the sciences. We sometimes use it for simplification purposes, as it has some highly desirable properties, but again, we can ground its use on first principles, that is, as a way to actually model the distribution of heterogeneous results. We may appeal to a version of the Central Limit Theorem, which states that, under certain conditions, the distribution of averages or sums of numbers ends up being close enough to a Normal.14
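
As a quick illustration of the Central Limit Theorem at work, the following sketch (with illustrative parameters of my own choosing) averages batches of uniform draws and checks that these averages bunch around the true mean much like a bell curve would.

# A minimal CLT sketch (illustrative): averages of draws from a very
# non-Normal (uniform) distribution end up looking approximately Normal.
import numpy as np

rng = np.random.default_rng(seed=5)

# 10,000 averages, each computed over 50 uniform draws
averages = rng.uniform(0, 1, size=(10_000, 50)).mean(axis=1)

# The averages concentrate around 0.5 with a roughly Gaussian shape:
# about 95% of them should fall within two standard deviations of the mean
mean, std = averages.mean(), averages.std()
within_2sd = np.mean(np.abs(averages - mean) < 2 * std)
print(f"mean = {mean:.3f}, fraction within 2 standard deviations = {within_2sd:.3f}")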

Other commonly used distributions are power-law distributions, which, contrary to the Gaussian distribution, have much heavier tails.15 For instance, when modelling the reach or simply the number of followers of an influencer, we may resort to a power-law distribution, but there are many other examples where these distributions arise naturally.16

Figure 2-18 shows the results of drawing one million observations from uniform, normal and power-law distributions. You can immediately see how results accumulate, or not, as described above. General knowledge of these and other distributions can be very handy in our simplification strategies.

Figure 2-18. Histograms for the results of drawing one million observations from a uniform (left), normal (center) and power-law (right) distribution
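
If you want to generate draws like the ones in Figure 2-18 yourself, the following sketch uses NumPy and Matplotlib with illustrative parameters; the power-law panel uses a Pareto distribution, one common choice, though it need not match the exact distribution used for the figure.

# A minimal sketch to draw one million observations from a uniform, a
# normal and a power-law (Pareto) distribution and plot their histograms.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=6)
n = 1_000_000

draws = {
    "uniform": rng.uniform(0, 1, n),
    "normal": rng.normal(0, 1, n),
    "power-law (Pareto)": rng.pareto(a=2.0, size=n),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, (name, sample) in zip(axes, draws.items()):
    ax.hist(sample, bins=100)   # the Pareto panel shows a very long tail
    ax.set_title(name)
plt.tight_layout()
plt.show()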

One final word about this source of uncertainty is in order: when making decisions it may be reasonable to deal with heterogeneity by modelling and simplifying it. Nonetheless, many valuable business opportunities may arise from embracing it, allowing us to move to personalization and customization strategies built upon the diversity in tastes and behaviors.

Uncertainty from interactions

Another source of uncertainty arises from the simple fact that we are social animals who continuously interact with each other. If acting in isolation sometimes creates the appearance that we are flipping coins to make our decisions, social interactions add another layer of complexity that we need to deal with. This has always been true, but modern social networks and their impact on our businesses make it even more important.

Take strategic behavior, for example. We constantly interact with our customers, and naturally, they want to get the best deal possible. This can create situations where they understand our motivations and end up gaming our system. This commonly happens with retention offers where we identify individuals who are likely to switch companies and offer them some type of discount to convince them that it is better to stay loyal to us. They may then act as if they were going to leave to continue getting these very nice discounts. Who wouldn’t?

Something similar happens with the design of compensation schemes for our sales executives. It is not uncommon to use a combination of fixed salary and variable bonus when they reach targets set up in advance. But many times we see executives delaying sales, either because they have already reached their quota or because they believe the current quota is unattainable and would rather start the next sales period with some secure transactions. Economists have studied the design of incentive-compatible compensation schemes that align our sales personnel's self-interest with what is optimal for the company, but putting these schemes into practice can be a painful process of trial and error.

Examples of strategic interactions abound, but now let's consider the case where interactions follow simple deterministic rules such as "if person A says hi, be nice". Even in these cases we can get behavior that appears to be random. One well-studied example is John Conway's Game of Life.17 This type of cellular automaton evolves on a two-dimensional grid such as the one depicted in Figure 2-19.18 Each colored pixel or cell interacts only with its immediate neighbors, and at each step there are three possible outcomes: it lives, dies or multiplies. There are only three simple rules of interaction, and depending on the initial conditions you can get completely different outcomes that appear random to any observer.

Figure 2-19. John Conway's Game of Life. A plethora of aggregate phenomena arises from three simple rules dictating how each cell or pixel interacts with its neighbors.
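
For the curious, here is a minimal sketch of a single update step of the Game of Life using the standard neighbor-count rules; the glider pattern and grid size are illustrative choices of mine, not the book's code.

# A minimal Game of Life sketch: each cell counts its eight immediate
# neighbors; a live cell survives with two or three live neighbors, a
# dead cell with exactly three comes alive, everything else dies.
import numpy as np

def game_of_life_step(grid):
    """Return the next generation of a 0/1 grid (wrapping at the edges)."""
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1) for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    survive = (grid == 1) & ((neighbors == 2) | (neighbors == 3))
    born = (grid == 0) & (neighbors == 3)
    return (survive | born).astype(int)

# A "glider" pattern on a small board: watch it move across the grid
grid = np.zeros((10, 10), dtype=int)
grid[1, 2] = grid[2, 3] = grid[3, 1] = grid[3, 2] = grid[3, 3] = 1
for _ in range(4):
    grid = game_of_life_step(grid)
print(grid)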

You may wonder if this is worth your time and attention or if it's just an intellectual curiosity. For starters, it should serve as a cautionary tale: even simple rules of behavior can create complex outcomes, so we don't even need sophisticated consumers gaming our systems to face this kind of uncertainty. But social scientists have also been using these tools to make sense of human behavior, so, at a minimum, they ought to be useful for us when making decisions in our businesses.

Uncertainty from ignorance

The last source of uncertainty is pure ignorance: many times we simply don’t know what the consequence will be when we pull a lever that has never been pulled in the past, or at least for a specific group of customers. Modern analytical leaders embrace this uncertainty, mostly with the use of A/B testing: these experiments help us understand how customers react when new situations arise, in controlled, low-risk environments. A company’s ability to scale testing at the organizational level can create a rich knowledge base to innovate and create value in the medium-to-long term. But there is always a trade-off: we may need to sacrifice short-term profits for medium term value and market leadership. That’s why we need a new brand of analytical decision makers in our organizations.

Key takeaways

  • Analytical thinking: the ability to identify and translate business questions into prescriptive solutions.

  • Value is created by making decisions: we create value for our companies by making better decisions. Prediction is only one input necessary in our decision-making process.

  • Stages in the analysis of decisions: there are generally three stages when we analyze a decision. We first gather, understand and interpret the facts (descriptive stage). We may then wish to predict the outcomes of interest (predictive stage). Finally, we choose which levers to pull to achieve the best possible outcome (prescriptive stage).

  • Anatomy of a decision: we choose an action that may have one or several consequences that impact our business outcomes. Since generally we don’t know which consequence will result, this choice is made under conditions of uncertainty. The link between actions and consequences is mediated by causality.

  • Estimating causal effects has several important difficulties: selection biases abound, so directly estimating the causal effect of a lever is generally not possible. We also need to master counterfactual thinking and learn to deal with heterogeneous effects.

Further Reading

Almost every book on data science or big data describes the distinction between descriptive, predictive and prescriptive analysis. You may check Thomas Davenport’s now classic Competing on Analytics (or any of the sequels) or Bill Schmarzo’s Big Data: Understanding How Data Powers Big Business (or any of the prequels and sequels).

The anatomy of decisions used here follows that literature and is quite standard. We will come back to this topic in a later chapter where I will provide enough references.

My favorite treatments of causality can be found in Joshua Angrist and Jörn-Steffen Pischke's books Mostly Harmless Econometrics and the more recent Mastering 'Metrics: The Path from Cause to Effect. If you are interested, you can find there the mathematical derivation of the decomposition of the difference in observed outcomes into the causal effect plus selection bias. They also present alternative methods to identify causality from observational data, that is, from data that was not obtained through a well-designed test.

A substantially different approach to causal reasoning can be found in Judea Pearl and Dana Mackenzie's The Book of Why: The New Science of Cause and Effect. Scott Cunningham's Causal Inference: The Mixtape provides a great bridge between the two approaches, focusing mostly on the first literature (the econometrics of causal inference) but devoting a chapter and several passages to Pearl's approach using causal graphs and diagrams. At the time of writing it is also free to download from https://www.scunning.com/cunningham_mixtape.pdf.

There are many treatments of A/B testing, starting with Dan Siroker and Pete Koomen's A/B Testing: The Most Powerful Way to Turn Clicks into Customers. Peter Bruce and Andrew Bruce's Practical Statistics for Data Scientists, from O'Reilly Media, provides an accessible introduction to statistical foundations, including power and size calculations. Carl Anderson's Creating a Data-Driven Organization, also from O'Reilly, briefly discusses some best practices in A/B testing, emphasizing its role in data- and analytics-driven organizations. Ron Kohavi (previously at Microsoft and now at Airbnb) has been forcefully advancing the use of experimentation in the industry. You can find some great material on his (and others') ExP Experimentation Platform site (https://exp-platform.com/), including an online version of a book coauthored with Diane Tang and Ya Xu, Advanced Topics in Experimentation (https://exp-platform.com/advanced-topics-in-online-experiments/).

My discussion of uncertainty follows many ideas in Scott E. Page's The Model Thinker: What You Need to Know to Make Data Work for You. This is a great place to start thinking about simplification and modelling, and it provides many examples where distinct distributions, complex behavior and network effects appear in real life. Ariel Rubinstein's Economic Fables gives an entertaining but authoritative description of the role that models play as metaphors of a complex world.

1 https://www.theguardian.com/society/2014/sep/22/cancer-late-diagnosis-half-patients

2 https://www.nytimes.com/2019/05/20/health/cancer-artificial-intelligence-ct-scans.html

3 We will talk about designing experiments or A/B tests later in this chapter.

4 https://www.investopedia.com/updates/enron-scandal-summary/

5 Not that it couldn’t be used like that, of course.

6 To see the plethora of information on this topic just Google “genes environment causal behavior” and pick the study that you find most striking.

7 Sources: GDP data comes from https://www.rug.nl/ggdc/historicaldevelopment/maddison/releases/maddison-project-database-2018. CO2 emissions from https://www.co2.earth/images/data/2100-projections_climate-scoreboard_2015-1027.xlsx. Divorce rates from https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/birthsdeathsandmarriages/divorce/datasets/divorcesinenglandandwales/2014/divorcetables2014.xls.

8 This use case is motivated by the opening example in the book Mostly Harmless Econometrics. See the references at the end of the chapter.

9 Hereafter I will use the term "treated" or "those who receive a treatment" to refer to those customers that are exposed to our action or lever. This jargon is common in the statistical analysis of experiments, and it is no coincidence that we have already encountered it when discussing our employees' health status, as it was first used in the analysis of medical trials.

10 In the coin tossing example, for instance, after observing the outcomes we may end up modelling the distribution as Bernoulli trials and predict a theoretically derived expected value (the number of trials times the estimated probability of heads, say).

11 https://en.wikipedia.org/wiki/There_are_known_knowns

12 https://www.timeout.com/london/blog/tfl-has-secretly-made-a-geographically-accurate-tube-map-091515

13 https://en.wikipedia.org/wiki/All_models_are_wrong

14 https://en.wikipedia.org/wiki/Central_limit_theorem

15 The Normal distribution accumulates 99% of the possible outcomes within 2.57 standard deviations from the mean and 99.9% within almost 3.3 standard deviations.

16 Other examples and applications of power-law distributions in business can be found in http://www.hermanaguinis.com/JBV2015.pdf

17 https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life

18 You can “play” the game yourself at https://playgameoflife.com/ and marvel at the rich diversity of outcomes that can be generated by simple deterministic rules.
