B.3. Examples of the statistical view of data

Compared to statistics, machine learning and data science have an optimistic view of working with data. In data science, you quickly pounce on noncausal relations in the hope that they'll hold up and help with future prediction. Much of statistics is about how data can lie to you and how such relations can mislead you. We only have space for a couple of examples, so we'll concentrate on two of the most common issues: sampling bias and omitted variable bias.

B.3.1. Sampling bias

Sampling bias is any process that systematically alters the distribution of observed data.[8] The data scientist must be aware of the possibility of sampling bias and be prepared to detect and fix it. The most effective remedy is to fix your data collection methodology.

8

We would have liked to use the common term “censored” for this issue, but in statistics the phrase censored observations is reserved for variables that have only been recorded up to a limit or bound. So it would be potentially confusing to use the term to describe missing observations.

For our sampling bias example, we'll continue with the income example we started earlier in this appendix. Suppose through some happenstance we were studying only a high-earning subset of our original population (perhaps we polled them at some exclusive event). The following listing shows how, when we restrict to this high-earning set, it appears that earned income and capital gains are strongly anticorrelated. We get a correlation of -0.86 (so think of the anticorrelation as explaining about (-0.86)^2 ≈ 0.74, or 74% of the variance; see http://mng.bz/ndYf) and a p-value very near 0 (so it's unlikely the unknown true correlation of more data produced in this manner is in fact 0).

Listing B.19. Misleading significance result from biased observations
veryHighIncome <- subset(d, EarnedIncome+CapitalGains>=500000)
print(with(veryHighIncome,cor.test(EarnedIncome,CapitalGains,
    method='spearman')))
#
#       Spearman's rank correlation rho
#
#data:  EarnedIncome and CapitalGains
#S = 1046, p-value < 2.2e-16
#alternative hypothesis: true rho is not equal to 0
#sample estimates:
#       rho
#-0.8678571
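
The data frame d here carries over from the earlier income example. If you're working through this section on its own, a stand-in such as the following reproduces the effect (a sketch that assumes two independent lognormal income columns; not necessarily how the original d was built):

set.seed(235236)                # hypothetical seed, for reproducibility only
d <- data.frame(EarnedIncome = 100000 * rlnorm(100),
                CapitalGains = 100000 * rlnorm(100))
# The two columns are drawn independently, so the full data shows essentially
# no correlation; the strong anticorrelation appears only after restricting
# to EarnedIncome + CapitalGains >= 500000.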

Some plots help to show what's going on. Figure B.8 shows the original dataset with the best-fit linear trend line run through it. Note that the line is nearly flat (indicating that change in x doesn't predict change in y).

Figure B.8. Earned income versus capital gains

Figure B.9 shows the best-fit trend line run through the high-income dataset. It also shows how cutting out the points below the line x + y = 500000 leaves a smattering of rare high-value events arranged in a direction that crudely approximates the slope of our cut line (-0.8678571 being a crude approximation of -1). It's also interesting to note that the points we suppressed aren't correlated among themselves, so the effect wasn't a matter of suppressing a correlated group out of an uncorrelated cloud to get a negative correlation.

Figure B.9. Biased earned income vs. capital gains

The code to produce figures B.8 and B.9 and calculate the correlation between suppressed points is shown in the following listing.

Listing B.20. Plotting biased view of income and capital gains
library(ggplot2)
ggplot(data=d,aes(x=EarnedIncome,y=CapitalGains)) +
   geom_point() + geom_smooth(method='lm') +
   coord_cartesian(xlim=c(0,max(d)),ylim=c(0,max(d)))             1
ggplot(data=veryHighIncome,aes(x=EarnedIncome,y=CapitalGains)) +
   geom_point() + geom_smooth(method='lm') +
   geom_point(data=subset(d,EarnedIncome+CapitalGains<500000),
         aes(x=EarnedIncome,y=CapitalGains),
      shape=4,alpha=0.5,color='red') +
   geom_segment(x=0,xend=500000,y=500000,yend=0,
      linetype=2,alpha=0.5,color='red') +
   coord_cartesian(xlim=c(0,max(d)),ylim=c(0,max(d)))             2
print(with(subset(d,EarnedIncome+CapitalGains<500000),
    cor.test(EarnedIncome,CapitalGains,method='spearman')))       3
#
#        Spearman's rank correlation rho
#
#data:  EarnedIncome and CapitalGains
#S = 107664, p-value = 0.6357
#alternative hypothesis: true rho is not equal to 0
#sample estimates:
#        rho
#-0.05202267

  • 1 Plots all of the income data with linear trend line (and uncertainty band)
  • 2 Plots the very high income data and linear trend line (also includes cut-off and portrayal of suppressed data)
  • 3 Computes correlation of suppressed data

B.3.2. Omitted variable bias

Many data science clients expect data science to be a quick process, where every convenient variable is thrown in at once and the best possible result is immediately obtained. Statisticians are rightfully wary of such an approach due to various negative effects such as omitted variable bias, collinear variables, confounding variables, and nuisance variables. In this section, we'll discuss one of the more general issues: omitted variable bias.

What is omitted variable bias?

In its simplest form, omitted variable bias occurs when a variable that isn’t included in the model is both correlated with what we’re trying to predict and correlated with a variable that’s included in our model. When this effect is strong, it causes problems, as the model-fitting procedure attempts to use the variables in the model to both directly predict the desired outcome and to stand in for the effects of the missing variable. This can introduce biases, create models that don’t quite make sense, and result in poor generalization performance.

The effect of omitted variable bias is easiest to see in a regression example, but it can affect any type of model.
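
To see the mechanism in miniature, consider a linear regression where a driver z of the outcome is omitted (a minimal self-contained sketch; the variables x, y, and z are hypothetical and not part of the book's dataset):

set.seed(123)
z <- rnorm(1000)                     # the soon-to-be-omitted variable
x <- z + rnorm(1000)                 # x is correlated with z
y <- 2 * x - 3 * z + rnorm(1000)     # z also drives the outcome y
coef(lm(y ~ x))       # omits z: slope is biased away from 2 (about 0.5 in expectation)
coef(lm(y ~ x + z))   # includes z: slope is recovered near 2

Because x must stand in for the missing z, the short regression's slope is pulled toward cov(x, y)/var(x) = 0.5 rather than the true direct effect of 2.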

An example of omitted variable bias

We’ve prepared a synthetic dataset called synth.RData (download from https://github.com/WinVector/PDSwR2/tree/master/bioavailability) that has an omitted variable problem typical of a data science project. To start, please download synth.RData and load it into R, as the next listing shows.

Listing B.21. Summarizing our synthetic biological data
load('synth.RData')
print(summary(s))
##       week         Caco2A2BPapp       FractionHumanAbsorption
##  Min.   :  1.00   Min.   :6.994e-08   Min.   :0.09347
##  1st Qu.: 25.75   1st Qu.:7.312e-07   1st Qu.:0.50343
##  Median : 50.50   Median :1.378e-05   Median :0.86937
##  Mean   : 50.50   Mean   :2.006e-05   Mean   :0.71492
##  3rd Qu.: 75.25   3rd Qu.:4.238e-05   3rd Qu.:0.93908
##  Max.   :100.00   Max.   :6.062e-05   Max.   :0.99170
head(s)
##   week Caco2A2BPapp FractionHumanAbsorption
## 1    1 6.061924e-05              0.11568186
## 2    2 6.061924e-05              0.11732401
## 3    3 6.061924e-05              0.09347046
## 4    4 6.061924e-05              0.12893540
## 5    5 5.461941e-05              0.19021858
## 6    6 5.370623e-05              0.14892154
# View(s)                                       1

  • 1 Displays the data in a spreadsheet-like window. View is one of the commands that has a much better implementation in RStudio than in basic R.

This loads synthetic data that’s supposed to represent a simplified view of the kind of data that might be collected over the history of a pharmaceutical ADME[9] or bioavailability project. RStudio’s View() spreadsheet is shown in figure B.10. The columns of this dataset are described in table B.1.

9

ADME stands for absorption, distribution, metabolism, excretion; it helps determine which molecules make it into the human body through ingestion and thus could even be viable candidates for orally delivered drugs.

Figure B.10. View of rows from the bioavailability dataset

Table B.1. Bioavailability columns

  • week: In this project, we suppose that a research group submits a new drug candidate molecule for assay each week. To keep things simple, we use the week number (in terms of weeks since the start of the project) as the identifier for the molecule and the data row. This is an optimization project, which means each proposed molecule is made using lessons learned from all of the previous molecules. This is typical of many projects, but it means the data rows aren't mutually exchangeable (an important assumption that we often use to justify statistical and machine learning techniques).
  • Caco2A2BPapp: This is the first assay run (and the "cheap" one). The Caco2 test measures how fast the candidate molecule passes through a membrane of cells derived from a specific large intestine carcinoma (cancers are often used for tests, as noncancerous human cells usually can't be cultured indefinitely). The Caco2 test is a stand-in or analogy test: it's thought to simulate one layer of the small intestine, which it is morphologically similar to (though it lacks a number of forms and mechanisms found in the actual small intestine). Think of Caco2 as a cheap test to evaluate a factor that correlates with bioavailability (the actual goal of the project).
  • FractionHumanAbsorption: This is the second assay run: the fraction of the drug candidate that is absorbed by human test subjects. Obviously, these tests would be expensive to run and subject to a lot of safety protocols. For this example, optimizing absorption is the actual end goal of the project.

We’ve constructed this synthetic data to represent a project that’s trying to optimize human absorption by working through small variations of a candidate drug molecule. At the start of the project, the group has a molecule that’s highly optimized for the stand-in criterion Caco2 (which does correlate with human absorption), and through the history of the project, actual human absorption is greatly increased by altering factors that we’re not tracking in this simplistic model. During drug optimization, it’s common to have formerly dominant stand-in criteria revert to ostensibly less desirable values as other inputs start to dominate the outcome. So for our example project, the human absorption rate is rising (as the scientists successfully optimize for it) and the Caco2 rate is falling (as it started high, and we’re no longer optimizing for it, even though it is a useful feature).

One of the advantages of using synthetic data for these problem examples is that we can design the data to have a given structure: we then know the model is correct if it picks this structure up and incorrect if it misses it. In particular, this dataset was designed so that Caco2 always makes a positive contribution to fraction of absorption throughout the entire dataset. The data was generated by taking a random non-increasing sequence of plausible Caco2 measurements and then generating fictional absorption numbers (the data frame d, which you also loaded from synth.RData, is the published data we base our synthetic example on). The next listing produces this synthetic data, which is known by construction to improve over time.

Listing B.22. Building data that improves over time
set.seed(2535251)
s <- data.frame(week = 1:100)
s$Caco2A2BPapp <- sort(sample(d$Caco2A2BPapp,100,replace=T),
   decreasing=T)
sigmoid <- function(x) {1/(1 + exp(-x))}
s$FractionHumanAbsorption <-               1
 sigmoid(
   7.5 + 0.5 * log(s$Caco2A2BPapp) +       2
   s$week / 10 - mean(s$week / 10) +       3
   rnorm(100) / 3                          4
   )
write.table(s, 'synth.csv', sep=',',
   quote = FALSE, row.names = FALSE)

  • 1 Builds synthetic examples
  • 2 Adds in the Caco2-to-absorption relation learned from the original dataset. Note that the relation is positive: better Caco2 always drives better absorption in our synthetic dataset. We log-transform Caco2, as it spans more than three decades (powers of 10) of range.
  • 3 Adds in a mean-0 term that depends on time to simulate the effects of improvements as the project moves forward
  • 4 Adds in a mean-0 noise term

The design of this data is as follows: Caco2 always has a positive effect (identical to the source data we started with), but this effect gets hidden by the week factor (Caco2 is negatively correlated with week, because week is increasing and Caco2 is sorted in decreasing order). Time is not a variable we at first wish to model (it isn't something we usefully control), but analyses that omit time suffer from omitted variable bias. For the complete details, consult our GitHub example documentation (https://github.com/WinVector/PDSwR2/tree/master/bioavailability).
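
You can check that negative association directly (a quick sanity check, not one of the book's listings):

cor(s$week, s$Caco2A2BPapp, method = 'spearman')
# Close to -1: Caco2 was sorted in decreasing order against an increasing week.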

A spoiled analysis

In this situation, the true relationship between Caco2 and FractionHumanAbsorption is hidden because the variable week is positively correlated with FractionHumanAbsorption (as absorption is being improved over time) and negatively correlated with Caco2 (as Caco2 is falling over time). week is a stand-in variable for all the other molecular factors driving human absorption that we're not recording or modeling. Listing B.23 shows what happens when we try to model the relation between Caco2 and FractionHumanAbsorption without using the week variable or any other factors.

Listing B.23. A bad model (due to omitted variable bias)
print(summary(glm(data = s,
   FractionHumanAbsorption ~ log(Caco2A2BPapp),
   family = binomial(link = 'logit'))))
## Warning: non-integer #successes in a binomial glm!
##
## Call:
## glm(formula = FractionHumanAbsorption ~ log(Caco2A2BPapp),
##    family = binomial(link = "logit"),
##     data = s)
##
## Deviance Residuals:
##    Min      1Q  Median      3Q     Max
## -0.609  -0.246  -0.118   0.202   0.557
##
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)        -10.003      2.752   -3.64  0.00028 ***
## log(Caco2A2BPapp)   -0.969      0.257   -3.77  0.00016 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 43.7821  on 99  degrees of freedom
## Residual deviance:  9.4621  on 98  degrees of freedom
## AIC: 64.7
##
## Number of Fisher Scoring iterations: 6

For details on how to read the glm() summary, please see section 7.2. Note that the sign of the Caco2 coefficient is negative, which is neither plausible nor what we expected going in. This is because the Caco2 coefficient isn't just recording the relation of Caco2 to FractionHumanAbsorption; it's also forced to stand in for any relations that come through omitted correlated variables.

Working around omitted variable bias

There are a number of ways to deal with omitted variable bias, the best being better experimental design and collecting more variables. Other methods include fixed-effects models and hierarchical models. We'll demonstrate one of the simplest remedies: adding in possibly important omitted variables. In the following listing, we redo the analysis with week included.

Listing B.24. A better model
print(summary(glm(data=s,
   FractionHumanAbsorption~week+log(Caco2A2BPapp),
   family=binomial(link='logit'))))
## Warning: non-integer #successes in a binomial glm!
##
## Call:
## glm(formula = FractionHumanAbsorption ~ week + log(Caco2A2BPapp),
##     family = binomial(link = "logit"), data = s)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.3474  -0.0568  -0.0010   0.0709   0.3038
##
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)         3.1413     4.6837    0.67   0.5024
## week                0.1033     0.0386    2.68   0.0074 **
## log(Caco2A2BPapp)   0.5689     0.5419    1.05   0.2938
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 43.7821  on 99  degrees of freedom
## Residual deviance:  1.2595  on 97  degrees of freedom
## AIC: 47.82
##
## Number of Fisher Scoring iterations: 6

We recovered decent estimates of both the Caco2 and week coefficients, but we didn’t achieve statistical significance on the effect of Caco2. Note that fixing omitted variable bias requires (even in our synthetic example) some domain knowledge to propose important omitted variables and the ability to measure the additional variables (and to try to remove their impact through the use of an offset; see help('offset')).
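
For instance, our synthetic generator added a known week/10 drift on the logit scale, so we could remove the time trend with an offset instead of a fitted coefficient. This is a minimal sketch under that assumption (with real data you'd need an outside estimate of the rate):

print(summary(glm(data = s,
   FractionHumanAbsorption ~ log(Caco2A2BPapp) + offset(week / 10),
   family = binomial(link = 'logit'))))
# With the time trend supplied as an offset, the log(Caco2A2BPapp)
# coefficient should land near its true generating value of 0.5.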

At this point, you should have a more detailed view of the roles variables can play. There are, at the least, variables you can control (explanatory variables), important variables you can't control (nuisance variables), and important variables you don't know about (omitted variables). Your knowledge of all of these variable types should affect your experimental design and analysis.
