B.2. Statistical theory

In this book, we necessarily concentrate on (correctly) processing data, without stopping to explain a lot of theory. The steps we use will be more understandable after we review a bit of statistical theory in this section.

B.2.1. Statistical philosophy

The predictive tools and machine learning methods we demonstrate in this book get their predictive power not from uncovering cause and effect (which would be a great thing to do), but by tracking and trying to eliminate differences in data and by reducing different sources of error. In this section, we’ll outline a few of the key concepts that describe what’s going on and why these techniques work.

Exchangeability

Since basic statistical modeling isn’t enough to reliably attribute predictions to true causes, we’ve been quietly relying on a concept called exchangeability to ensure we can build useful predictive models.

The formal definition of exchangeability is this: suppose all the data in the world is x[i,],y[i] (i=1,...m). Then we call the data exchangeable if for any permutation j_1, ...j_m of 1, ...m, the joint probability of seeing x[i,],y[i] is equal to the joint probability of seeing x[j_i, ], y[j_i]. In other words, the joint probability of seeing a tuple x[i, ], y[i] does not depend on when we see it, or where it comes in the sequence of observations.

The idea is that if all permutations of the data are equally likely, then when we draw subsets from the data using only indices (not snooping the x[i,],y[i]), the data in each subset, though different, can be considered as independent and identically distributed. We rely on this when we make train/test splits (or even train/calibrate/test splits), and we hope (and should take steps to ensure) this is true between our training data and future data we’ll encounter in production.
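
For instance, here is a minimal sketch (not one of the book's listings; the data frame and the 20% holdout fraction are made up for illustration) of drawing a train/test split using only row indices, so the split never looks at the x[i,], y[i] values themselves:

set.seed(25643)
d <- data.frame(x = runif(1000), y = runif(1000) >= 0.5)   # stand-in data
isTest <- runif(nrow(d)) <= 0.2      # hold out about 20% of the rows by index alone
dTrain <- d[!isTest, , drop = FALSE]
dTest <- d[isTest, , drop = FALSE]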

Our hope in building a model is that in the unknown future, data the model will be applied to is exchangeable with our training data. If this is the case, then we’d expect good performance on training data to translate into good model performance in production. It’s important to defend exchangeability from problems such as overfit and concept drift.

Once we start examining training data, we (unfortunately) break its exchangeability with future data. Subsets that contain a lot of training data are no longer indistinguishable from subsets that don’t have training data (through the simple process of memorizing all of our training data). We attempt to measure the degree of damage by measuring performance on held-out test data. This is why generalization error is so important. Any data not looked at during model construction should be as exchangeable with future data as it ever was, so measuring performance on held-out data helps anticipate future performance. This is also why you don’t use test data for calibration (instead, you should further split your training data to do this); once you look at your test data, it’s less exchangeable with what will be seen in production in the future.

Another potentially huge loss of exchangeability in prediction is summarized in what’s called Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.” The point is this: factors that merely correlate with a prediction are good predictors, right up until you go too far in optimizing for them or until others react to your use of them. For example, email spammers can try to defeat a spam detection system by using more of the features and phrases that correlate highly with legitimate email, and changing phrases that the spam filter believes correlate highly with spam. This is an essential difference between actual causes (which do have an effect on outcome when altered) and mere correlations (which may be co-occurring with an outcome and are good predictors only through exchangeability of examples).

Bias variance decomposition

Many of the modeling tasks in this book are what are called regressions where, for data of the form y[i],x[i,], we try to find a model or function f() such that f(x[i,])~E[y[j]|x[j,]~x[i,]] (the expectation E[] being taken over all examples, where x[j,] is considered very close to x[i,]). Often this is done by picking f() to minimize E[(y[i]-f(x[i,]))^2].[2] Notable methods that fit closely to this formulation include regression, k-nearest neighbors (KNN), and neural nets.

2

That minimizing the squared error gets expected values right is an important fact, and it is used in method design again and again.

Obviously, minimizing square error is not always your direct modeling goal. But when you work in terms of square error, you have an explicit decomposition of error into meaningful components, called the bias/variance decomposition (see The Elements of Statistical Learning by T. Hastie, R. Tibshirani, and J. Friedman; Springer, 2009). The bias/variance decomposition says this:

E[(y[i] - f(x[i, ]))^2] = bias^2 + variance + irreducibleError

Model bias is the portion of the error that your chosen modeling technique will never get right, often because some aspect of the true process isn’t expressible within the assumptions of the chosen model. For example, if the relationship between the outcome and the input variables is curved or nonlinear, you can’t fully model it with linear regression, which only considers linear relationships. You can often reduce bias by moving to more complicated modeling ideas: kernelizing, GAMs, adding interactions, and so on. Many modeling methods can increase model complexity (to try to reduce bias) on their own, for example, decision trees, KNN, support vector machines, and neural nets. But until you have a lot of data, increasing model complexity has a good chance of increasing model variance.
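
As a small illustration (a sketch on made-up data, not one of the book's listings), a linear model cannot capture a quadratic relationship, but adding a squared term (a modest increase in model complexity) removes that source of bias:

set.seed(1234)
d <- data.frame(x = runif(200, min = -2, max = 2))      # synthetic inputs
d$y <- d$x^2 + 0.1 * rnorm(nrow(d))                     # truly quadratic relationship
summary(lm(y ~ x, data = d))$r.squared                  # low: the linear family is biased here
summary(lm(y ~ x + I(x^2), data = d))$r.squared         # near 1: the added term removes the bias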

Model variance is the portion of the error that your modeling technique gets wrong due to incidental relations in the data. The idea is this: a retraining of the model on new data might make different errors (this is how variance differs from bias). An example would be running KNN with k = 1. When you do this, each test example is scored by matching to a single nearest training example. If that example happened to be positive, your classification will be positive. This is one reason we tend to run KNN with larger k values: it gives us the chance to get more reliable estimates of the nature of the neighborhood (by including more examples) at the expense of making neighborhoods a bit less local or specific. More data and averaging ideas (like bagging) greatly reduce model variance.
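
The following sketch (not one of the book's listings; it assumes the class package, one of R's standard recommended packages, for its knn() implementation, and the data is made up) shows the k = 1 effect: refitting on fresh training samples changes the prediction for the same test point much more often than it does with a larger k:

library(class)                                   # for knn()
set.seed(4622)
testPoint <- data.frame(x = 0.8)                 # a single query point
predictOnce <- function(k) {                     # retrain on fresh data, then predict
   train <- data.frame(x = runif(20))
   labels <- factor(train$x + rnorm(20, sd = 0.3) >= 0.5)
   as.character(knn(train, testPoint, labels, k = k))
}
table(replicate(100, predictOnce(1)))            # k = 1: the prediction flips fairly often
table(replicate(100, predictOnce(10)))           # k = 10: the prediction is more stable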

Irreducible error is the truly unmodelable portion of the problem (given the current variables). If we have two data points x[i, ], y[i] and x[j, ], y[j] such that x[i, ] == x[j, ], then (y[i] - y[j])^2 contributes to the irreducible error. We emphasize that irreducible error is measured with respect to a given set of variables; add more variables, and you have a new situation that may have its own lower irreducible error.

The point is that you can always think of modeling error as coming from three sources: bias, variance, and irreducible error. When you’re trying to increase model performance, you can choose what to try based on which of these you are trying to reduce.

Averaging is a powerful tool

Under fairly mild assumptions, averaging reduces variance. For example, for data with identically distributed independent values, averages of groups of size n have an expected variance of 1/n of the variance of individual values. This is one of the reasons why you can build models that accurately forecast population or group rates even when predicting individual events is difficult. So although it may be easy to forecast the number of murders per year in San Francisco, you can’t predict who will be killed. In addition to shrinking variances, averaging also reshapes distributions to look more and more like the normal distribution (this is the central limit theorem, and it is related to the law of large numbers).
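
A quick simulation (a sketch with made-up numbers, not one of the book's listings) shows the 1/n effect directly:

set.seed(2019)
x <- rnorm(100000)                              # individual values, variance near 1
groupMeans <- colMeans(matrix(x, nrow = 100))   # means of 1,000 groups of size 100
var(x)                                          # close to 1
var(groupMeans)                                 # close to 1/100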

Statistical efficiency

The efficiency of an unbiased statistical procedure is defined as how much variance there is in the procedure for a given dataset size: that is, how much the estimates produced by that procedure will vary, when run on datasets of the same size and drawn from the same distribution. More efficient procedures require less data to get below a given amount of variance. This differs from computational efficiency, which is about how much work is needed to produce an estimate.
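
For a concrete (if simplified) example, the following sketch (not one of the book's listings) compares two unbiased estimates of the center of normally distributed data: at the same sample size, the sample mean varies less than the sample median, so the mean is the more statistically efficient procedure in this setting:

set.seed(2020)
ests <- replicate(10000, {          # re-estimate the center on many datasets of size 100
   x <- rnorm(100)
   c(mean = mean(x), median = median(x))
})
apply(ests, 1, var)                 # the mean's estimates vary less (roughly 2/3 the median's variance)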

When you have a lot of data, statistical efficiency becomes less critical (which is why we don’t emphasize it in this book). But when it’s expensive to produce more data (such as in drug trials), statistical efficiency is your primary concern. In this book, we take the approach that we usually have a lot of data, so we can prefer general methods that are somewhat statistically inefficient (such as using a test holdout set, and so on) over more specialized, statistically efficient methods (such as specific ready-made parametric tests like the Wald test and others).

Remember: it’s a luxury, not a right, to ignore statistical efficiency. If your project has such a need, you’ll want to consult with expert statisticians to get the advantages of best practices.

B.2.2. A/B tests

Hard statistical problems usually arise from poor experimental design. This section describes a simple, good, statistical design philosophy called A/B testing that has very simple theory. The ideal experiment is one where you have two groups—control (A) and treatment (B)—and the following holds:

  • Each group is big enough that you get a reliable measurement (this drives significance).
  • Each group is (up to a single factor) distributed exactly like populations you expect in the future (this drives relevance). In particular, both samples are run in parallel at the same time.
  • The two groups differ only with respect to the single factor you’re trying to test.

In an A/B test, a new idea, treatment, or improvement is proposed and then tested for effect. A common example is a proposed change to a retail website that it is hoped will improve the rate of conversion from browsers to purchasers. Usually, the treatment group is called B and an untreated or control group is called A. As a reference, we recommend “Practical Guide to Controlled Experiments on the Web” (R. Kohavi, R. Henne, and D. Sommerfield; KDD, 2007).

Setting up A/B tests

Some care must be taken in running an A/B test. It’s important that the A and B groups be run at the same time. This helps defend the test from any potential confounding effects that might be driving their own changes in conversion rate (hourly effects, source-of-traffic effects, day-of-week effects, and so on). Also, you need to know that differences you’re measuring are in fact due to the change you’re proposing and not due to differences in the control and test infrastructures. To control for infrastructure, you should run a few A/A tests (tests where you run the same experiment in both A and B).

Randomization is the key tool in designing A/B tests. But the split into A and B needs to be made in a sensible manner. For example, for user testing, you don’t want to split raw clicks from the same user session into A/B, because then A/B would both have clicks from users that may have seen either treatment site. Instead, you’d maintain per-user records and assign users permanently to either the A or the B group when they arrive. One trick to avoid a lot of record keeping between different servers is to compute a hash of the user information and assign a user to A or B depending on whether the hash comes out even or odd (thus, all servers make the same decision without having to communicate).
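
A minimal sketch of that trick (not one of the book's listings; it assumes the digest package is installed, and the user ID shown is made up):

library(digest)                                      # for a stable hash of the user information

assignGroup <- function(userId) {                    # the same input always yields the same group
   h <- digest(userId, algo = "md5")                 # hash the user identifier to a hex string
   if (strtoi(substr(h, 1, 1), base = 16L) %% 2 == 0) "A" else "B"
}

assignGroup("user_1234")                             # every server computes the same assignment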

Evaluating A/B tests

The key measurements in an A/B test are the size of effect measured and the significance of the measurement. The natural alternative (or null hypothesis) to B being a good treatment is that B makes no difference, or B even makes things worse. Unfortunately, a typical failed A/B test often doesn’t look like certain defeat. It usually looks like the positive effect you’re looking for is there and you just need a slightly larger follow-up sample size to achieve significance. Because of issues like this, it’s critical to reason through acceptance/rejection conditions before running tests.

Let’s work an example A/B test. Suppose we’ve run an A/B test about conversion rate and collected the following data.

Listing B.12. Building simulated A/B test data
set.seed(123515)
d <- rbind(                                                                 1
   data.frame(group = 'A', converted = rbinom(100000, size = 1, p = 0.05)), 2
   data.frame(group = 'B', converted = rbinom(10000, size = 1, p = 0.055))  3
)

  • 1 Builds a data frame to store simulated examples
  • 2 Adds 100,000 examples from the A group simulating a conversion rate of 5%
  • 3 Adds 10,000 examples from the B group simulating a conversion rate of 5.5%

Once we have the data, we summarize it into the essential counts using a data structure called a contingency table.[3]

3

The confusion matrices we used in section 6.2.3 are also examples of contingency tables.

Listing B.13. Summarizing the A/B test into a contingency table
tab <- table(d)
print(tab)
##      converted
## group     0     1
##     A 94979  5021
##     B  9398   602

The contingency table is what statisticians call a sufficient statistic: it contains all we need to know about the experiment outcome. We can print the observed conversion rates of the A and B groups.

Listing B.14. Calculating the observed A and B conversion rates
aConversionRate <- tab['A','1']/sum(tab['A',])
print(aConversionRate)
## [1] 0.05021

bConversionRate <- tab['B', '1'] / sum(tab['B', ])
print(bConversionRate)
## [1] 0.0602

commonRate <- sum(tab[, '1']) / sum(tab)
print(commonRate)
## [1] 0.05111818

We see that the A group was measured at near 5%, and the B group was measured at near 6%. What we want to know is this: can we trust this difference? Could such a difference be likely for this sample size due to mere chance and measurement noise? We need to calculate a significance to see if we ran a large enough experiment (obviously, we’d want to design an experiment that was large enough; this is what we call test power, which we’ll discuss in section B.2.3). What follows are a few good tests that are quick to run.

Fisher’s test for independence

The first test we can run is Fisher’s contingency table test. In the Fisher test, the null hypothesis that we’re hoping to reject is that conversion is independent of group, or that the A and B groups are exactly identical. The Fisher test gives a probability of seeing an independent dataset (A=B) show a departure from independence as large as what we observed. We run the test as shown in the next listing.

Listing B.15. Calculating the significance of the observed difference in rates
fisher.test(tab)

##     Fisher's Exact Test for Count Data
##
## data:  tab
## p-value = 2.469e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  1.108716 1.322464
## sample estimates:
## odds ratio
##   1.211706

This is a great result. The p-value (which in this case is the probability of observing a difference this large if we in fact had A=B) is 2.469e-05, which is very small. This is considered a significant result. The other thing to look for is the odds ratio: the practical importance of the claimed effect (sometimes also called clinical significance, which is not the same thing as statistical significance). An odds ratio of 1.2 says that we’re measuring a 20% relative improvement in conversion rate between the A and B groups. Whether you consider this large or small (typically, 20% is considered large) is an important business question.
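
If you want to see where that number comes from, the simple sample odds ratio can be read directly off the contingency table (a quick check, not one of the book's listings; fisher.test() reports a conditional estimate that is very close to, though not computed exactly as, this ratio):

oddsA <- tab['A', '1'] / tab['A', '0']    # odds of conversion in the A group
oddsB <- tab['B', '1'] / tab['B', '0']    # odds of conversion in the B group
oddsB / oddsA                             # about 1.21, in line with the reported odds ratio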

Frequentist significance test

Another way to estimate significance is to again temporarily assume that A and B come from an identical distribution with a common conversion rate, and see how likely it would be that the B group scores as high as it did by mere chance. If we consider a binomial distribution centered at the common conversion rate, we’d like to see that there’s not a lot of probability mass for conversion rates at or above B’s level. This would mean the observed difference is unlikely if A=B. We’ll work through the calculation in the following listing.

Listing B.16. Computing frequentist significance
print(pbinom(                  1
   lower.tail = FALSE,         2
   q = tab['B', '1'] - 1,      3
   size = sum(tab['B', ]),     4
   prob = commonRate           5
   ))
## [1] 3.153319e-05

  • 1 Uses the pbinom() call to calculate how likely different observed counts are
  • 2 Signals that we want the probability of being greater than a given q
  • 3 Asks for the probability of seeing at least as many conversions as our observed B group did. We subtract one to make the comparison inclusive (greater than or equal to tab['B', '1']).
  • 4 Specifies the total number of trials as equal to what we saw in our B group
  • 5 Specifies the conversion probability at the estimated common rate

This is again a great result. The calculated probability is small, meaning such a difference is hard to observe by chance if A = B.

B.2.3. Power of tests

To have reliable A/B test results, you must first design and run good A/B tests. We need to defend against two types of errors: failing to detect a difference that is actually there (this is governed by test power), and seeing a difference where there is none (this is governed by significance). The smaller the difference between the A and B rates we are trying to measure, the harder it is to have a good probability of getting a correct measurement. Our only tools are to design experiments where we hope A and B are far apart, or to increase experiment size. A power calculator lets us choose experiment size.

Example Designing a test to see if a new advertisement has a higher conversion rate

Suppose we’re running a travel site that has 6,000 unique visitors per day and a 4% conversion rate[4] from page views to purchase inquiries (our measurable goal). We’d like to test a new advertisement on the site to see if it increases our conversion rate. This is exactly the kind of problem A/B tests are made for! But we have one more question: how many users do we have to route to the new advertisement to get a reliable measurement? How long will it take us to collect enough data? We’re allowed to route no more than 10% of the visitors to the new advertisement.

4

We’re taking the 4% rate from http://mng.bz/7pT3.

In this experiment, we’ll route 90% of our traffic to the old advertisement and 10% to the new advertisement. There is uncertainty in estimating the conversion rate of the old advertisement going forward, but for simplicity of example (and because nine times more traffic is going to the old advertisement) we will ignore that. So our problem is this: how much traffic should we route to the new advertisement?

To solve this, we need some criteria for our experimental design:

  • What is our estimate of the old advertisement’s conversion rate? Let’s say this is 0.04 or 4%.
  • What is a lower bound on what we consider a big enough improvement for the new advertisement? For the test to work, this must be larger than the old conversion rate. Let’s say this is 0.045 or 4.5%, representing a slightly larger-than-10% relative improvement in conversion to sale.
  • With what probability are we willing to be wrong if the new ad was no better? That is, if the new ad is in fact no better than the old ad, how often are we willing to “cry wolf” and claim there is an improvement (when there is in fact no such thing)? Let’s say we are willing to be wrong in this way 5% of the time. Let’s call this the significance level.
  • With what probability do we want to be right when the new ad was substantially better? That is, if the new ad is in fact converting at a rate of at least 4.5%, how often do we want to detect this? This is called power (and related to sensitivity, which we saw when discussing classification models). Let’s say we want the power to be 0.8 or 80%. When there is an improvement, we want to find it 80% of the time.

Obviously, what we want is to be able to detect improvements at sizes close to zero, at a significance level of zero, and at a power of 1. However, if we insist on any of these parameters being at their “if wishes were horses value” (near zero for improvement size, near zero for significance level, and near 1 for power), the required test size to ensure these guarantees becomes enormous (or even infinite!). So as part of setting expectations before a project (always a good practice), we must first negotiate these “asks” to more achievable values such as those we just described.

When trying to determine sample size or experiment duration, the important concept is statistical test power. Statistical test power is the probability of rejecting the null hypothesis when the null hypothesis is false.[5] Think of statistical test power as 1 minus a p-value. The idea is this: you can’t pick out useful treatments if you can’t even identify which treatments are useless. So you want to design your tests to have test power near 1, which means p-values near 0.

5

See B. S. Everitt, The Cambridge Dictionary of Statistics (Cambridge University Press, 2010).

The standard way to estimate the number of visitors we want to direct to the new advertisement is called a power calculation and is supplied by the R package pwr. Here is how we use R to get the answer:

library(pwr)
pwr.p.test(h = ES.h(p1 = 0.045, p2 = 0.04),
           sig.level = 0.05,
           power = 0.8,
           alternative = "greater")

#     proportion power calculation for binomial distribution (arcsine transformation)
#
#              h = 0.02479642
#              n = 10055.18
#      sig.level = 0.05
#          power = 0.8
#    alternative = greater

Notice that all we did was copy our asks into the pwr.p.test method, though we did put the two assumed rates we are trying to distinguish through the ES.h() method, which converts the difference of rates into a Cohen-style “effect size.” In this case, ES.h(p1 = 0.045, p2 = 0.04) is 0.025, which is considered quite small (and therefore hard to measure). Effect sizes are very roughly how big an effect you are trying to measure relative to the natural variation of individuals. So we are trying to measure a change in the likelihood of a sale that is 1/0.025 or 40 times smaller than the individual variation in likelihood of a sale. This is unobservable for any small set of individuals, but observable with a large enough sample.[6]

6

Effect sizes are a nice idea, and there is a rule of thumb that 0.2 is small, 0.5 is medium, and 1.0 is large. See https://en.wikipedia.org/wiki/Effect_size.

The n = 10056 is the amount of traffic we would have to send to the new advertisement to get a test result with at least the specified quality parameters (significance level and power). So we would need to serve the new advertisement to 10056 visitors to achieve our A/B test measurement. Our site receives 6,000 visitors a day, and we are only allowed to send 10% of them, or 600, to the new advertisement each day. So it would take us 10056/600 or 16.8 days to complete this test.[7]

7

This is in fact one of the dirty secrets of A/B tests: measuring small improvements of rare events such as conversion of an advertisement to a sale (often called “conversion to sale”) takes a lot of data, and acquiring a lot of data can take a lot of time.

Venue shopping reduces test power

We’ve discussed test power and significance under the assumption you’re running one large test. In practice, you may run multiple tests trying many treatments to see if any treatment delivers an improvement. This reduces your test power. If you run 20 treatments, each with a p-value goal of 0.05, you would expect one test to appear to show significant improvement, even if all 20 treatments are useless. Testing multiple treatments or even reinspecting the same treatment many times is a form of “venue shopping” (you keep asking at different venues until you get a ruling in your favor). Calculating the loss of test power is formally called “applying the Bonferroni correction” and is as simple as multiplying your significance estimates by your number of tests (remember, large values are bad for significances or p-values). To compensate for this loss of test power, you can run each of the underlying tests at a tighter p cutoff: p divided by the number of tests you intend to run.
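
A quick back-of-the-envelope calculation (a sketch, not one of the book's listings) shows why this matters: with 20 useless treatments each tested at a 0.05 cutoff, the chance that at least one of them looks significant is already around 64%, and the Bonferroni-style fix is to tighten the per-test cutoff:

1 - (1 - 0.05)^20    # chance at least one of 20 useless tests "passes": about 0.64
0.05 / 20            # Bonferroni-adjusted per-test cutoff: 0.0025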

B.2.4. Specialized statistical tests

Throughout this book, we concentrate on building predictive models and evaluating significance, either through the modeling tool’s built-in diagnostics or through empirical resampling (such as bootstrap tests or permutation tests). In statistics, there’s an efficient correct test for the significance of just about anything you commonly calculate. Choosing the right standard test gives you a good implementation of the test and access to literature that explains the context and implications of the test. Let’s work on calculating a simple correlation and finding the matching correct test.

We’ll work with a synthetic example that should remind you a bit of our PUMS Census work in chapter 8. Suppose we’ve measured both earned income (money earned in the form of salary) and capital gains (money received from investments) for 100 individuals. Further suppose that there’s no relation between the two for our individuals (in the real world, there’s a correlation, but we need to make sure our tools don’t report one even when there’s none). We’ll set up a simple dataset representing this situation with some lognormally distributed data.

Listing B.17. Building synthetic uncorrelated income
set.seed(235236)                                        1
d <- data.frame(EarnedIncome = 100000 * rlnorm(100),
                 CapitalGains = 100000 * rlnorm(100))   2
print(with(d, cor(EarnedIncome, CapitalGains)))         3

# [1] -0.01066116

  • 1 Sets the pseudo-random seed to a known value so the demonstration is repeatable
  • 2 Generates our synthetic data
  • 3 The correlation is –0.01, which is very near 0—indicating (as designed) no relation.

We claim the observed correlation of -0.01 is statistically indistinguishable from 0 (or no effect). This is something we should quantify. A little research tells us the common correlation is called a Pearson coefficient, and the significance test for a Pearson coefficient for normally distributed data is a Student’s t-test (with the number of degrees of freedom equal to the number of items minus 2). We know our data is not normally distributed (it is, in fact, lognormally distributed), so we research further and find the preferred solution is to compare the data by rank (instead of by value) and use a test like Spearman’s rho or Kendall’s tau. We’ll use Spearman’s rho, as it can track both positive and negative correlations (whereas Kendall’s tau tracks degree of agreement).

A fair question is, how do we know which is the exact right test to use? The answer is, by studying statistics. Be aware that there are a lot of tests, giving rise to books like 100 Statistical Tests in R by N. D. Lewis (Heather Hills Press, 2013). We also suggest that if you know the name of a test, consult B. S. Everitt and A. Skrondal, The Cambridge Dictionary of Statistics, Fourth Edition (Cambridge University Press, 2010).

Another way to find the right test is using R’s help system. help(cor) tells us that cor() implements three different calculations (Pearson, Spearman, and Kendall) and that there’s a matching function called cor.test() that performs the appropriate significance test. Since we weren’t too far off the beaten path, we only need to read up on these three tests and settle on the one we’re interested in (in this case, Spearman). So let’s redo our correlation with the chosen test and check the significance.

Listing B.18. Calculating the (non)significance of the observed correlation
with(d, cor(EarnedIncome, CapitalGains, method = 'spearman'))

# [1] 0.03083108

(ctest <- with(d, cor.test(EarnedIncome, CapitalGains, method = 'spearman')))
#
#       Spearman's rank correlation rho
#
#data:  EarnedIncome and CapitalGains
#S = 161512, p-value = 0.7604
#alternative hypothesis: true rho is not equal to 0
#sample estimates:
#       rho
#0.03083108

We see the Spearman correlation is 0.03 with a p-value of 0.7604, which means truly uncorrelated data would show a coefficient this large about 76% of the time. So there’s no significant effect (which is exactly how we designed our synthetic example).

In our own work, we use the sigr package to wrap up these test results for more succinct formal presentation. The format is similar to the APA (American Psychological Association) style, and n.s. means “not significant.”

sigr::wrapCorTest(ctest)

# [1] "Spearman's rank correlation rho: (r=0.03083, p=n.s.)."