A distribution is a description of the likelihoods of the possible values in a set of data. For example, it could describe the plausible heights of an adult, American, 18-year-old male. For a simple numeric value, one standard description is this: for each value b, give the probability of seeing a value x with x <= b. This function of b is called the cumulative distribution function (CDF).
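As a concrete sketch of this definition, base R's ecdf() builds the empirical CDF of a sample (the height values below are made up purely for illustration):

```r
# hypothetical adult heights, in inches (not real data)
heights <- c(64, 66, 67, 68, 68, 69, 70, 70, 71, 74)

# ecdf() returns the empirical CDF as a function:
# for a value b, it gives the fraction of data points x with x <= b
cdf <- ecdf(heights)
cdf(68)   # 5 of the 10 values are <= 68, so this is 0.5
```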
We can often summarize a set of possible outcomes by naming a distribution and some summary statistics. For example, we can say that if we flip a fair coin 10 times, the number of heads we observe should be binomially distributed (defined in section B.5.7) with an expected mean of 5 heads. In all cases, we are concerned with how values are generated, and with getting a bit more detail than a characterization by mean and standard deviation alone, such as the name and shape of the distribution.
In this section, we’ll outline a few important distributions: the normal distribution, the lognormal distribution, and the binomial distribution. As you work further, you’ll also want to learn many other key distributions (such as Poisson, beta, negative binomial, and many more), but the ideas we’ll present here should be enough to get you started.
The normal or Gaussian distribution is the classic symmetric bell-shaped curve, as shown in figure B.1. Many measured quantities, such as test scores from a group of students, or the age or height of a particular population, can often be approximated by the normal. Repeated measurements will tend to fall into a normal distribution. For example, if a doctor weighs a patient multiple times, using a properly calibrated scale, the measurements (if enough of them are taken) will fall into a normal distribution around the patient’s true weight. The variation will be due to measurement error (the variability of the scale). The normal distribution is defined over all real numbers.
In addition, the central limit theorem says that when you’re observing the sum (or mean) of many independent, bounded variance random variables, the distribution of your observations will approach the normal as you collect more data. For example, suppose you want to measure how many people visit your website every day between 9 a.m. and 10 a.m. The proper distribution for modeling the number of visitors is the Poisson distribution; but if you have a high enough volume of traffic, and you observe long enough, the distribution of observed visitors will approach the normal distribution, and you can make acceptable estimates about your traffic by treating the number of visitors as if it were normally distributed.
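As an illustrative sketch (the traffic rate below is hypothetical), we can check this with simulated Poisson counts. A Poisson count with a large rate is itself the sum of many small independent contributions, so its distribution is close to a normal with mean lambda and standard deviation sqrt(lambda):

```r
set.seed(5)    # for reproducibility
rate <- 100    # hypothetical average visitors per hour
days <- 1000   # number of observed days
visitors <- rpois(days, lambda = rate)

# for a Poisson, mean = variance = lambda, so the normal approximation
# should have mean close to 100 and sd close to sqrt(100) = 10
mean(visitors)
sd(visitors)
```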
Many real-world distributions are approximately “normal”—in particular, any measurement where the notion of “close” tends to be additive. An example would be adult heights: a 6-inch difference in height is large both for people who are 5'6" and for those who are 6'0".
The normal is described by two parameters: the mean m and the standard deviation s (or, alternatively, the variance, which is the square of s). The mean represents the distribution’s center (and also its peak); the standard deviation represents the distribution’s “natural unit of length”—you can estimate how rare an observation is by how many standard deviations it is from the mean. As we mention in chapter 4, for a normally distributed variable:

- About 68% of observations fall in the interval (m - s, m + s).
- About 95% of observations fall in the interval (m - 2*s, m + 2*s).
- About 99.7% of observations fall in the interval (m - 3*s, m + 3*s).
So an observation more than three standard deviations away from the mean can be considered quite rare, in most applications.
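These multiples of the standard deviation translate into concrete probabilities, which you can compute directly from R's normal CDF, pnorm():

```r
# P(|X - mean| < k standard deviations) for k = 1, 2, 3
k <- 1:3
within_k <- pnorm(k) - pnorm(-k)
round(within_k, 4)
# [1] 0.6827 0.9545 0.9973
```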
Many machine learning algorithms and statistical methods (for example, linear regression) assume that the unmodeled errors are distributed normally. Linear regression is fairly robust to violations of this assumption; still, for continuous variables, you should at least check if the variable distribution is unimodal and somewhat symmetric. When this isn’t the case, you may wish to consider using a variable transformation, such as the log transformations that we discuss in chapter 4.
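As a quick sketch of such a check (the data here is simulated, not real income data): for right-skewed positive data the mean sits well above the median, and a log transform largely restores symmetry:

```r
set.seed(9)
# simulated right-skewed, income-like data (hypothetical parameters)
income <- rlnorm(1000, meanlog = 10, sdlog = 1)

# strongly right-skewed: the mean is well above the median
mean(income) > median(income)

# after a log transform, mean and median nearly coincide
log_income <- log(income)
abs(mean(log_income) - median(log_income))
```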
In R the function dnorm(x, mean = m, sd = s) is the normal probability density function: it returns the value of the density at x for a normal distribution with mean m and standard deviation s (for a continuous distribution, this is a density, not a probability). By default, dnorm() assumes that mean = 0 and sd = 1 (as do all the functions related to the normal distribution that we discuss here). Let’s use dnorm() to draw figure B.1.
library(ggplot2)

x <- seq(from = -5, to = 5, length.out = 100)   # the interval [-5 5]
f <- dnorm(x)                                   # normal with mean 0 and sd 1
ggplot(data.frame(x = x, y = f), aes(x = x, y = y)) + geom_line()
The function rnorm(n, mean = m, sd = s) will generate n points drawn from a normal distribution with mean m and standard deviation s.
library(ggplot2)

# draw 1000 points from a normal with mean 0, sd 1
u <- rnorm(1000)

# plot the distribution of points,
# compared to normal curve as computed by dnorm() (dashed line)
ggplot(data.frame(x = u), aes(x = x)) +
  geom_density() +
  geom_line(data = data.frame(x = x, y = f), aes(x = x, y = y), linetype = 2)
As you can see in figure B.2, the empirical distribution of the points produced by rnorm(1000) is quite close to the theoretical normal. Distributions observed from finite datasets can never exactly match theoretical continuous distributions like the normal; and, as with all things statistical, there is a well-defined distribution for how far off you expect to be for a given sample size.
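One way to quantify that discrepancy (a sketch; many goodness-of-fit measures exist) is the Kolmogorov-Smirnov statistic, the largest gap between the empirical and theoretical CDFs:

```r
set.seed(14)
u <- rnorm(1000)

# ks.test compares the sample's empirical CDF to the theoretical normal CDF;
# the statistic is the largest vertical gap between the two curves
res <- ks.test(u, "pnorm")
res$statistic   # small: the empirical CDF tracks the theoretical one closely
```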
The function pnorm(x, mean = m, sd = s) is what R calls the normal probability function, otherwise called the normal cumulative distribution function: it returns the probability of observing a data point of value less than x from a normal with mean m and standard deviation s. In other words, it’s the area under the distribution curve that falls to the left of x (recall that a distribution has unit area under the curve). This is shown in listing B.3.
# --- estimate probabilities (areas) under the curve ---

# 50% of the observations will be less than the mean
pnorm(0)
# [1] 0.5

# about 2.3% of all observations are more than 2 standard
# deviations below the mean
pnorm(-2)
# [1] 0.02275013

# about 95.4% of all observations are within 2 standard deviations
# of the mean
pnorm(2) - pnorm(-2)
# [1] 0.9544997
The function qnorm(p, mean = m, sd = s) is the quantile function for the normal distribution with mean m and standard deviation s. It’s the inverse of pnorm(), in that qnorm(p, mean = m, sd = s) returns the value x such that pnorm(x, mean = m, sd = s) == p.
Figure B.3 illustrates the use of qnorm(): the vertical line intercepts the x axis at x = qnorm(0.75); the shaded area to the left of the vertical line represents the area 0.75, or 75% of the area under the normal curve.
The code to create figure B.3 (along with a few other examples of using qnorm()) is shown in the following listing.
# --- return the quantiles corresponding to specific probabilities ---

# the median (50th percentile) of a normal is also the mean
qnorm(0.5)
# [1] 0

# calculate the 75th percentile
qnorm(0.75)
# [1] 0.6744898

pnorm(0.6744898)
# [1] 0.75

# --- illustrate the 75th percentile ---

# create a graph of the normal distribution with mean 0, sd 1
x <- seq(from = -5, to = 5, length.out = 100)
f <- dnorm(x)
nframe <- data.frame(x = x, y = f)

# calculate the 75th percentile
line <- qnorm(0.75)
xstr <- sprintf("qnorm(0.75) = %1.3f", line)

# the part of the normal distribution to the left
# of the 75th percentile
nframe75 <- subset(nframe, nframe$x < line)

# Plot it.
# The shaded area is 75% of the area under the normal curve
ggplot(nframe, aes(x = x, y = y)) +
  geom_line() +
  geom_area(data = nframe75, aes(x = x, y = y), fill = "gray") +
  geom_vline(aes(xintercept = line), linetype = 2) +
  geom_text(x = line, y = 0, label = xstr, vjust = 1)
Now that we’ve shown some concrete examples, we can summarize how R names the different functions associated with a given probability distribution. Suppose the probability distribution is called DIST. Then the following are true:

- dDIST(x) is the distribution function, which returns the probability density at the value x (for discrete distributions, this is the probability of observing x).
- pDIST(x) is the cumulative distribution function, which returns the probability of observing a value less than x.
- rDIST(n) is the random number generation function, which returns n values drawn from the distribution DIST.
- qDIST(p) is the quantile function, which returns the value x corresponding to the pth quantile of the distribution.
For some reason, R refers to the cumulative distribution function (or CDF) by the shorter term distribution function. Be careful to check whether you want the probability density function or the CDF when working with R.
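The d/p/q/r naming convention applies uniformly across distributions. For example, for the uniform distribution on [0, 1), DIST is unif:

```r
dunif(0.5)   # density at 0.5; the uniform density on [0, 1) is constant 1
punif(0.25)  # CDF: P(X <= 0.25) = 0.25
qunif(0.75)  # quantile: the x with punif(x) == 0.75, which is 0.75
set.seed(22)
runif(3)     # three random draws from [0, 1)
```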
The lognormal distribution is the distribution of a random variable X whose natural log log(X) is normally distributed. The distribution of highly skewed positive data, like the value of profitable customers, incomes, sales, or stock prices, can often be modeled as a lognormal distribution. A lognormal distribution is defined over all non-negative real numbers; as shown in figure B.4 (top), it’s asymmetric, with a long tail out toward positive infinity. The distribution of log(X) (figure B.4, bottom) is a normal distribution centered at mean(log(X)). For lognormal populations, the mean is generally much higher than the median, and the bulk of the contribution toward the mean value is due to a small population of highest-valued data points.
For a population that’s approximately normally distributed, you can use the mean value of the population as a rough stand-in value for a typical member of the population. If you use the mean as a stand-in value for a lognormal population, you’ll overstate the value of the majority of your data.
Intuitively, if variations in the data are expressed naturally as percentages or relative differences, rather than as absolute differences, then the data is a candidate to be modeled lognormally. For example, a typical sack of potatoes in your grocery store might weigh about five pounds, plus or minus half a pound. The distance that a specific type of bullet will fly when fired from a specific type of handgun might be about 2,100 meters, plus or minus 100 meters. The variations in these observations are naturally represented in absolute units, and the distributions can be modeled as normals. On the other hand, differences in monetary quantities are often best expressed as percentages: a population of workers might all get a 5% increase in salary (not an increase of $5,000/year across the board); you might want to project next quarter’s revenue to within 10% (not to within plus or minus $1,000). Hence, these quantities are often best modeled as having lognormal distributions.
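A small simulation sketch of this intuition (the step sizes are arbitrary): a quantity that accumulates many small independent percentage changes ends up right-skewed, with its logarithm looking roughly normal; this is just the central limit theorem applied to log(X):

```r
set.seed(25)
n <- 1000    # number of simulated trajectories
steps <- 50  # number of small relative changes per trajectory

# each step multiplies the value by a factor near 1 (a few percent either way)
final_value <- replicate(n, prod(1 + rnorm(steps, mean = 0, sd = 0.03)))

# the final values are right-skewed: the mean exceeds the median
mean(final_value) > median(final_value)
```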
Let’s look at the functions for working with the lognormal distribution in R (see also section B.5.3). We’ll start with dlnorm() and rlnorm():

- dlnorm(x, meanlog = m, sdlog = s) is the probability density function, which returns the density at the value x.
- rlnorm(n, meanlog = m, sdlog = s) is the random number generation function, which returns n values drawn from a lognormal distribution whose logarithm has mean m and standard deviation s.
We can use dlnorm() and rlnorm() to produce figure B.4, shown earlier. The following listing demonstrates some properties of the lognormal distribution.
# draw 1001 samples from a lognormal with meanlog 0, sdlog 1
u <- rlnorm(1001)

# the mean of u is higher than the median
mean(u)
# [1] 1.638628
median(u)
# [1] 1.001051

# the mean of log(u) is approx meanlog = 0
mean(log(u))
# [1] -0.002942916

# the sd of log(u) is approx sdlog = 1
sd(log(u))
# [1] 0.9820357

# generate the lognormal with meanlog = 0, sdlog = 1
x <- seq(from = 0, to = 25, length.out = 500)
f <- dlnorm(x)

# generate a normal with mean = 0, sd = 1
x2 <- seq(from = -5, to = 5, length.out = 500)
f2 <- dnorm(x2)

# make data frames
lnormframe <- data.frame(x = x, y = f)
normframe <- data.frame(x = x2, y = f2)
dframe <- data.frame(u = u)

# plot density plots with theoretical curves superimposed
p1 <- ggplot(dframe, aes(x = u)) + geom_density() +
  geom_line(data = lnormframe, aes(x = x, y = y), linetype = 2)

p2 <- ggplot(dframe, aes(x = log(u))) + geom_density() +
  geom_line(data = normframe, aes(x = x, y = y), linetype = 2)

# function to plot multiple plots on one page
library(grid)
nplot <- function(plist) {
  n <- length(plist)
  grid.newpage()
  pushViewport(viewport(layout = grid.layout(n, 1)))
  vplayout <- function(x, y) {
    viewport(layout.pos.row = x, layout.pos.col = y)
  }
  for(i in 1:n) {
    print(plist[[i]], vp = vplayout(i, 1))
  }
}

# this is the plot that leads this section
nplot(list(p1, p2))
The remaining two functions are the CDF plnorm() and the quantile function qlnorm():

- plnorm(x, meanlog = m, sdlog = s) returns the probability of observing a value less than x.
- qlnorm(p, meanlog = m, sdlog = s) returns the value x such that plnorm(x, meanlog = m, sdlog = s) == p.
The following listing demonstrates plnorm() and qlnorm(). It uses the data frame lnormframe from the previous listing.
# the 50th percentile (or median) of the lognormal with
# meanlog = 0 and sdlog = 1
qlnorm(0.5)
# [1] 1

# the probability of seeing a value x less than 1
plnorm(1)
# [1] 0.5

# the probability of observing a value x less than 10
plnorm(10)
# [1] 0.9893489

# -- show the 75th percentile of the lognormal

# use lnormframe from the previous example: the
# theoretical lognormal curve
line <- qlnorm(0.75)
xstr <- sprintf("qlnorm(0.75) = %1.3f", line)
lnormframe75 <- subset(lnormframe, lnormframe$x < line)

# Plot it.
# The shaded area is 75% of the area under the lognormal curve
ggplot(lnormframe, aes(x = x, y = y)) + geom_line() +
  geom_area(data = lnormframe75, aes(x = x, y = y), fill = "gray") +
  geom_vline(aes(xintercept = line), linetype = 2) +
  geom_text(x = line, y = 0, label = xstr, hjust = 0, vjust = 1)
As you can see in figure B.5, the majority of the data is concentrated on the left side of the distribution, with the remaining quarter of the data spread out over a very long tail.
Suppose you have a coin that has a probability p of landing on heads when you flip it (so for a fair coin, p = 0.5). In this case, the binomial distribution models the probability of observing k heads when you flip that coin N times. It’s used to model binary classification problems (as we discuss in relation to logistic regression in chapter 8), where the positive examples can be considered “heads.”
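The probability of seeing exactly k heads follows the familiar binomial formula, which R's dbinom() computes directly. As a quick sanity check:

```r
N <- 10; k <- 4; p <- 0.5

# P(k heads in N flips) = choose(N, k) * p^k * (1 - p)^(N - k)
choose(N, k) * p^k * (1 - p)^(N - k)
# [1] 0.2050781

dbinom(k, N, p)   # same value
```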
Figure B.6 shows the shape of the binomial distribution for coins of varying fairness, each flipped 50 times. Note that the binomial distribution is discrete; it’s only defined for non-negative integer values of k.
Let’s look at the functions for working with the binomial distribution in R (see also section B.5.3). We’ll start with the probability mass function dbinom() (which R, as usual, calls a density) and the random number generator rbinom():

- dbinom(k, N, p) returns the probability of observing exactly k heads in N flips of a coin with probability p of landing heads.
- rbinom(n, N, p) returns n draws, each the number of heads observed in N flips of such a coin.
You can use dbinom() (as in the following listing) to produce figure B.6.
library(ggplot2)
#
# use dbinom to produce the theoretical curves
#
numflips <- 50

# x is the number of heads that we see
x <- 0:numflips

# probability of heads for several different coins
p <- c(0.05, 0.15, 0.5, 0.75)
plabels <- paste("p =", p)

# calculate the probability of seeing x heads in numflips flips
# for all the coins. This probably isn't the most elegant
# way to do this, but at least it's easy to read
flips <- NULL
for(i in 1:length(p)) {
  coin <- p[i]
  label <- plabels[i]
  tmp <- data.frame(number_of_heads = x,
                    probability = dbinom(x, numflips, coin),
                    coin_type = label)
  flips <- rbind(flips, tmp)
}

# plot it
# this is the plot that leads this section
ggplot(flips, aes(x = number_of_heads, y = probability)) +
  geom_point(aes(color = coin_type, shape = coin_type)) +
  geom_line(aes(color = coin_type))
You can use rbinom() to simulate a coin-flipping-style experiment. For example, suppose you have a large population of students that’s 50% female. If students are assigned to classrooms at random, and you visit 100 classrooms with 20 students each, then how many girls might you expect to see in each classroom? A plausible outcome is shown in figure B.7, with the theoretical distribution superimposed.
Let’s write the code to produce figure B.7.
p <- 0.5          # the fraction of females in this student population
class_size <- 20  # size of a classroom
numclasses <- 100 # how many classrooms we observe

# what might a typical outcome look like?
numFemales <- rbinom(numclasses, class_size, p)

# the theoretical counts (not necessarily integral)
probs <- dbinom(0:class_size, class_size, p)
tcount <- numclasses * probs

# the obvious way to plot this is with a histogram or geom_bar,
# but this might just look better
zero <- function(x) { 0 }  # a dummy function that returns only 0

ggplot(data.frame(number_of_girls = numFemales, dummy = 1),
       aes(x = number_of_girls, y = dummy)) +
  # count the number of times you see x heads
  stat_summary(fun.y = "sum", geom = "point", size = 2) +
  stat_summary(fun.ymax = "sum", fun.ymin = "zero", geom = "linerange") +
  # superimpose the theoretical number of times you see x heads
  geom_line(data = data.frame(x = 0:class_size, y = tcount),
            aes(x = x, y = y), linetype = 2) +
  scale_x_continuous(breaks = 0:class_size, labels = 0:class_size) +
  scale_y_continuous("number of classrooms")
As you can see, even classrooms with as few as 4 or as many as 16 girls aren’t completely unheard of when students from this population are randomly assigned to classrooms. But if you observe too many such classrooms—or if you observe classes with fewer than 4 or more than 16 girls—you’d want to investigate whether student selection for those classes is biased in some way.
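You can put a number on “too many” with the binomial CDF pbinom() (demonstrated in more detail below). Under random assignment, the chance that a single classroom of 20 is that extreme is:

```r
low  <- pbinom(4, 20, 0.5)                       # P(4 or fewer girls)
high <- pbinom(15, 20, 0.5, lower.tail = FALSE)  # P(16 or more girls)
low + high   # about 0.012, or roughly 1 classroom in 85
```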
You can also use rbinom() to simulate flipping a single coin.
# use rbinom to simulate flipping a coin of probability p N times
p75 <- 0.75  # a very unfair coin (mostly heads)
N <- 1000    # flip it several times
flips_v1 <- rbinom(N, 1, p75)

# Another way to generate unfair flips is to use runif:
# the probability that a uniform random number from [0 1)
# is less than p is exactly p. So "less than p" is "heads".
flips_v2 <- as.numeric(runif(N) < p75)

prettyprint_flips <- function(flips) {
  outcome <- ifelse(flips == 1, "heads", "tails")
  table(outcome)
}

prettyprint_flips(flips_v1)
# outcome
# heads tails
#   756   244

prettyprint_flips(flips_v2)
# outcome
# heads tails
#   743   257
The final two functions are the CDF pbinom() and the quantile function qbinom():

- pbinom(k, N, p) returns the probability of observing at most k heads in N flips of a coin with heads probability p; with lower.tail = FALSE, it returns the probability of observing more than k heads.
- qbinom(q, N, p) returns the smallest number of heads k such that pbinom(k, N, p) >= q.
The next listing shows some examples of using pbinom() and qbinom().
# pbinom example
nflips <- 100
nheads <- c(25, 45, 50, 60)  # number of heads

# what are the probabilities of observing at most that
# number of heads on a fair coin?
left.tail <- pbinom(nheads, nflips, 0.5)
sprintf("%2.2f", left.tail)
# [1] "0.00" "0.18" "0.54" "0.98"

# the probabilities of observing more than that
# number of heads on a fair coin?
right.tail <- pbinom(nheads, nflips, 0.5, lower.tail = FALSE)
sprintf("%2.2f", right.tail)
# [1] "1.00" "0.82" "0.46" "0.02"

# as expected:
left.tail + right.tail
# [1] 1 1 1 1

# so if you flip a fair coin 100 times,
# you are guaranteed to see more than 10 heads,
# almost guaranteed to see fewer than 60, and
# probably more than 45.

# qbinom example
nflips <- 100

# what's the 95% "central" interval of heads that you
# would expect to observe on 100 flips of a fair coin?
left.edge <- qbinom(0.025, nflips, 0.5)
right.edge <- qbinom(0.025, nflips, 0.5, lower.tail = FALSE)

c(left.edge, right.edge)
# [1] 40 60

# so with 95% probability you should see between 40 and 60 heads
One thing to keep in mind is that because the binomial distribution is discrete, pbinom() and qbinom() won’t be perfect inverses of each other, as is the case with continuous distributions like the normal.
# because this is a discrete probability distribution,
# pbinom and qbinom are not exact inverses of each other

# this direction works
pbinom(45, nflips, 0.5)
# [1] 0.1841008
qbinom(0.1841008, nflips, 0.5)
# [1] 45

# this direction won't be exact
qbinom(0.75, nflips, 0.5)
# [1] 53
pbinom(53, nflips, 0.5)
# [1] 0.7579408
R has many more tools for working with distributions beyond the PDF, CDF, and generation tools we’ve demonstrated. In particular, for fitting distributions, you may want to try the fitdistr() function from the MASS package.
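As a minimal sketch of what such a fit looks like on simulated data (this assumes the MASS package is installed):

```r
library(MASS)

set.seed(49)
u <- rnorm(1000, mean = 5, sd = 2)   # simulated data with known parameters

# fitdistr returns maximum-likelihood parameter estimates
fit <- fitdistr(u, "normal")
fit$estimate   # mean and sd estimates, close to the true 5 and 2
```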