When graphs and statistics lie

I should be clear: statistics don't lie; people lie. One of the easiest ways to trick your audience is to confuse correlation with causation.

Correlation versus causation

I don't think I would be allowed to publish this book without taking a deeper dive into the differences between correlation and causation. For this example, I will continue to use my data for TV consumption and work performance.

Correlation is a quantitative metric between -1 and 1 that measures how two variables move with each other. A correlation close to -1 means that as one variable increases, the other decreases, while a correlation close to +1 means that the two variables move together in the same direction: as one increases, so does the other, and likewise when decreasing.
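The most common such metric, and the one pandas computes by default, is Pearson's correlation coefficient which, informally stated, is the covariance of the two variables divided by the product of their standard deviations: r = cov(X, Y) / (std(X) * std(Y)).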

Causation is the idea that one variable actually affects another. For example, we can look at two variables: the average hours of TV watched in a day and a 0-100 scale of work performance (0 being very poor performance and 100 being excellent performance). One might expect these two factors to be negatively correlated, meaning that as the number of hours of TV watched in a 24-hour day increases, overall work performance goes down. Recall the code from earlier, which is as follows. Here, I am looking at the same sample of 14 people as before and their answers to the question: how many hours of TV do you watch on average per night?

import pandas as pd

# Average hours of TV watched per night for each of the 14 respondents
hours_tv_watched = [0, 0, 0, 1, 1.3, 1.4, 2, 2.1, 2.6, 3.2, 4.1, 4.4, 4.4, 5]

These are the same 14 people as mentioned earlier, in the same order, but now, instead of the number of hours of TV they watched, we have their work performance as graded by the company or a third-party system:

# Work performance scores (0-100) for the same 14 people, in the same order
work_performance = [87, 89, 92, 90, 82, 80, 77, 80, 76, 85, 80, 75, 73, 72]

Then, we produce a DataFrame:

df = pd.DataFrame({'hours_tv_watched': hours_tv_watched, 'work_performance': work_performance})

Earlier, we looked at a scatter plot of these two variables, and it seemed to show a clear downward trend: as TV consumption went up, work performance seemed to go down. A scatter plot alone, however, only suggests a relationship; a correlation coefficient, a number between -1 and 1, lets us both identify the relationship and quantify its strength.

Now, we can introduce a new line of code that shows us the correlation between these two variables:

df.corr()  # the off-diagonal entries are approximately -0.824
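Note that df.corr() actually returns a full correlation matrix, giving each variable's correlation with every other variable (the diagonal is always 1, since every variable correlates perfectly with itself). If you only want the single number, pandas can compute it directly between two columns:

# Pull out the single coefficient between the two columns
df['hours_tv_watched'].corr(df['work_performance'])  # approximately -0.824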

Recall that a correlation, if close to -1, implies a strong negative correlation, while a correlation close to +1 implies a strong positive correlation.

This number helps support the hypothesis: a correlation coefficient close to -1 implies not only a negative correlation but a strong one at that. Again, we can see this via a scatter plot between the two variables, so both our visual and our numbers are aligned with each other. This alignment is important when communicating results; if your visuals and your numbers disagree, people are less likely to take your analysis seriously.
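The scatter plot itself can be reproduced with pandas' built-in plotting, which wraps matplotlib (a minimal sketch, assuming matplotlib is installed):

import matplotlib.pyplot as plt

# Scatter plot of TV consumption against work performance
df.plot.scatter(x='hours_tv_watched', y='work_performance')
plt.show()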

Figure: Correlation between hours of TV watched and work performance

I cannot stress enough that correlation and causation are different from each other. Correlation simply quantifies the degree to which variables change together, whereas causation is the idea that one variable actually determines the value of another. If you share the findings of purely correlational work, you might be met with challengers in the audience asking for more work to be done. What is more frightening is that no one might realize that the analysis is incomplete, and actionable decisions may be made on the basis of a simple correlation.

It is very often the case that two variables might be correlated with each other but they do not have any causation between them. This can be for a variety of reasons, some of which are as follows:

  • There might be a confounding factor between them. This means that there is a third, lurking variable that is not being factored in, and that acts as a bridge between the two variables. For example, we previously showed that the amount of TV you watch is negatively correlated with work performance; that is, as the number of hours of TV you watch increases, your overall work performance may decrease. That is a correlation. It doesn't seem quite right to suggest that watching TV is the actual cause of a decrease in the quality of work performed. It seems more plausible that a third factor, perhaps hours of sleep every night, answers this question: watching more TV leaves less time for sleep, which in turn limits your work performance. The number of hours of sleep per night is the confounding factor (a short simulation after this discussion makes this concrete).
  • They might not have anything to do with each other! It might simply be a coincidence. There are many variables that are correlated but simply do not cause each other. Consider the following example:
Figure: Correlation analysis of cheese consumption and civil engineering doctorates

It is much more likely that these two variables merely happen to correlate (more strongly than in our previous example, I may add) than that cheese consumption determines the number of civil engineering doctorates in the world.

You have likely heard the statement correlation does not imply causation, and the last graph is exactly why data scientists must take it to heart. Just because there exists a mathematical correlation between two variables does not mean there is causation between them: there might be a confounding factor between them, or they might simply have nothing to do with each other!
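To make the confounding-factor case concrete, here is a minimal simulation sketch. Everything in it is invented for illustration: a hypothetical hours_of_sleep variable drives both the simulated TV consumption and the simulated work performance, and neither of those two variables causes the other:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical confounder: hours of sleep per night for 500 simulated people
hours_of_sleep = rng.normal(7, 1, 500)

# Less sleep leaves more late-night hours for TV (plus noise)
sim_tv = 9 - hours_of_sleep + rng.normal(0, 0.5, 500)

# More sleep improves performance (plus noise); TV plays no direct role
sim_performance = 40 + 6 * hours_of_sleep + rng.normal(0, 4, 500)

sim = pd.DataFrame({'hours_tv_watched': sim_tv,
                    'work_performance': sim_performance,
                    'hours_of_sleep': hours_of_sleep})
print(sim.corr())

The simulated TV and performance columns come out strongly negatively correlated even though, by construction, there is no direct causal link between them; the correlation exists only because both depend on sleep.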

Let's see what happens when we ignore confounding variables and correlations become extremely misleading.

Simpson's paradox

Simpson's paradox is a formal reason for why we need to take confounding variables seriously. The paradox states that a correlation between two variables can be completely reversed when we take different factors into account. This means that even if a graph might show a positive correlation, these variables can become anti-correlated when another factor (most likely a confounding one) is taken into consideration. This can be very troublesome to statisticians.

Suppose we wish to explore the relationship between two different splash pages (recall our previous A/B testing in Chapter 7, Basic Statistics). We will call these pages Page A and Page B once again. We have two splash pages that we wish to compare and contrast, and our main metric for choosing between them will be the conversion rate, just as earlier.

Suppose we run a preliminary test and find the following conversion results:

Page A           Page B
75% (263/350)    83% (248/300)

This means that Page B's conversion rate is 8 percentage points (about 10% in relative terms) higher than Page A's. So, right off the bat, it seems like Page B is the better choice because it has a higher rate of conversion. If we were to communicate this data to our colleagues, it would seem that we are in the clear!

However, let's see what happens when we also take into account the coast that the user was closer to, as shown:

 

                 Page A           Page B

West Coast       95% (76/80)      93% (231/250)

East Coast       69% (187/270)    34% (17/50)

Both             75% (263/350)    83% (248/300)

Thus the paradox! When we break the sample down by location, Page A was better in both categories yet worse overall. That is the beauty, and also the horror, of the paradox. It happens because of the unbalanced class sizes across the four groups.

The Page A/East Coast group and the Page B/West Coast group provide most of the people in the sample, skewing the aggregated results away from what we might expect. The confounding variable here might be that the pages were shown at different hours of the day, so West Coast users were more likely to see Page B, while East Coast users were more likely to see Page A.
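You can verify the paradox yourself with a few lines of pandas, using the counts from the preceding table (a minimal sketch; the column names are mine):

import pandas as pd

results = pd.DataFrame({
    'page': ['A', 'A', 'B', 'B'],
    'coast': ['West', 'East', 'West', 'East'],
    'conversions': [76, 187, 231, 17],
    'visitors': [80, 270, 250, 50]
})

# Per-coast conversion rates: Page A wins on both coasts
results['rate'] = results['conversions'] / results['visitors']
print(results)

# Pooled conversion rates: Page B wins once the coasts are combined
pooled = results.groupby('page')[['conversions', 'visitors']].sum()
pooled['rate'] = pooled['conversions'] / pooled['visitors']
print(pooled)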

There is a resolution to Simpson's paradox (and therefore an answer); however, the proof lies in a complex system of Bayesian networks and is a bit beyond the scope of this book.

The main takeaway from Simpson's paradox is that we should not unduly ascribe causal power to correlated variables. There might be confounding variables that have to be examined. Therefore, if you are able to reveal a correlation between two variables (such as website category and conversion rate, or TV consumption and work performance), you should absolutely try to isolate as many variables as possible that might be the reason for the correlation, or that can at least help explain your case further.

If correlation doesn't imply causation, then what does?

As a data scientist, it is often quite frustrating to work with correlations and not be able to draw conclusive causality. The best way to confidently establish causality is usually through randomized experiments, such as the ones we saw in Chapter 8, Advanced Statistics. One would have to split the population into randomly sampled groups and run hypothesis tests to conclude, with a degree of certainty, that there is true causation between variables.
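As a minimal sketch of what such a test might look like, suppose visitors had been randomly assigned to each page, so that coast (and any other confounder) is balanced across groups by design. A two-proportion z-test, here via statsmodels (assuming it is installed), could then compare the conversion rates:

from statsmodels.stats.proportion import proportions_ztest

# Conversions and randomly assigned visitor counts for Page A and Page B
conversions = [263, 248]
visitors = [350, 300]

z_stat, p_value = proportions_ztest(conversions, visitors)
print(z_stat, p_value)  # a small p-value suggests a genuine difference in rates

Under random assignment, a significant difference can reasonably be attributed to the page itself rather than to a lurking variable.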
