B.1. Distributions

A distribution is a description of the likelihoods of the possible values in a set of data. For example, it could describe the plausible heights of an adult, American, 18-year-old male. For a simple numeric value, the distribution is defined as follows: for each value b, the distribution gives the probability of observing a value x with x <= b. This function of b is called the cumulative distribution function (CDF).
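You can compute this cumulative view directly from data with R’s built-in ecdf() function. Here’s a minimal sketch (the height sample is made up purely for illustration):

# a hypothetical sample of heights, in inches
heights <- c(64, 66, 67, 68, 68, 69, 70, 70, 71, 74)
cdf <- ecdf(heights)  # build the empirical cumulative distribution function
cdf(68)               # fraction of the sample with height <= 68: here 0.5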

We can often summarize a set of possible outcomes by naming a distribution and some summary statistics. For example, we can say that if we flip a fair coin 10 times, the number of heads we observe should be binomially distributed (defined in section B.1.4) with an expected mean of 5 heads. In all cases, we are concerned with how values are generated, and with getting a bit more detail than a characterization by mean and standard deviation alone, such as the name and shape of the distribution.
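As a quick sanity check, we can simulate this claim with R’s rbinom() function (introduced in section B.1.4); a minimal sketch:

# simulate 100,000 runs of 10 fair coin flips, and confirm that the
# average number of heads observed is close to the expected 5
heads <- rbinom(100000, size = 10, prob = 0.5)
mean(heads)  # should be close to 5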

In this section, we’ll outline a few important distributions: the normal distribution, the lognormal distribution, and the binomial distribution. As you work further, you’ll also want to learn many other key distributions (such as Poisson, beta, negative binomial, and many more), but the ideas we’ll present here should be enough to get you started.

B.1.1. Normal distribution

The normal or Gaussian distribution is the classic symmetric bell-shaped curve, as shown in figure B.1. Many measured quantities, such as test scores from a group of students, or the age or height of a particular population, can often be approximated by the normal. Repeated measurements will tend to fall into a normal distribution. For example, if a doctor weighs a patient multiple times, using a properly calibrated scale, the measurements (if enough of them are taken) will fall into a normal distribution around the patient’s true weight. The variation will be due to measurement error (the variability of the scale). The normal distribution is defined over all real numbers.

Figure B.1. The normal distribution with mean 0 and standard deviation 1

In addition, the central limit theorem says that when you’re observing the sum (or mean) of many independent, bounded variance random variables, the distribution of your observations will approach the normal as you collect more data. For example, suppose you want to measure how many people visit your website every day between 9 a.m. and 10 a.m. The proper distribution for modeling the number of visitors is the Poisson distribution; but if you have a high enough volume of traffic, and you observe long enough, the distribution of observed visitors will approach the normal distribution, and you can make acceptable estimates about your traffic by treating the number of visitors as if it were normally distributed.
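As a rough illustration, the following sketch simulates a year of morning visitor counts (the traffic rate of 500 visitors per hour is made up):

# daily 9 a.m.-10 a.m. visitor counts: Poisson with (made-up) rate 500
lambda <- 500
visitors <- rpois(365, lambda)
# at this volume the counts look approximately normal, with
# mean lambda and standard deviation sqrt(lambda)
hist(visitors, breaks = 30)
c(mean(visitors), sd(visitors), sqrt(lambda))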

Many real-world distributions are approximately “normal”—in particular, any measurement where the notion of “close” tends to be additive. An example would be adult heights: a 6-inch difference in height is large both for people who are 5'6" and for those who are 6'.

The normal is described by two parameters: the mean m and the standard deviation s (or, alternatively, the variance, which is the square of s). The mean represents the distribution’s center (and also its peak); the standard deviation represents the distribution’s “natural unit of length”—you can estimate how rare an observation is by how many standard deviations it is from the mean. As we mention in chapter 4, for a normally distributed variable

  • About 68% of observations will fall in the interval (m-s,m+s).
  • About 95% of observations will fall in the interval (m-2*s,m+2*s).
  • About 99.7% of observations will fall in the interval (m-3*s,m+3*s).

So an observation more than three standard deviations away from the mean can be considered quite rare, in most applications.

Many machine learning algorithms and statistical methods (for example, linear regression) assume that the unmodeled errors are distributed normally. Linear regression is fairly robust to violations of this assumption; still, for continuous variables, you should at least check if the variable distribution is unimodal and somewhat symmetric. When this isn’t the case, you may wish to consider using a variable transformation, such as the log transformations that we discuss in chapter 4.
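One quick way to check is to plot the variable’s density before and after a log transform, as in the following sketch (the income data is simulated with rlnorm(), covered in section B.1.3):

library(ggplot2)

# a simulated skewed positive variable, standing in for something like income
income <- rlnorm(1000, meanlog = 10, sdlog = 1)

# heavily skewed: consider a transformation before modeling
ggplot(data.frame(income = income), aes(x = income)) + geom_density()

# roughly unimodal and symmetric after the log transform
ggplot(data.frame(log_income = log(income)), aes(x = log_income)) +
  geom_density()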

Using the normal distribution in R

In R the function dnorm(x, mean = m, sd = s) is the normal probability density function: it returns the probability density at x for a normal distribution with mean m and standard deviation s. By default, dnorm assumes that mean = 0 and sd = 1 (as do all the functions related to the normal distribution that we discuss here). Let’s use dnorm() to draw figure B.1.

Listing B.1. Plotting the theoretical normal density
library(ggplot2)

x <- seq(from=-5, to=5, length.out=100) # the interval [-5 5]
f <- dnorm(x)                           # normal with mean 0 and sd 1
ggplot(data.frame(x=x,y=f), aes(x=x,y=y)) + geom_line()

The function rnorm(n, mean = m, sd = s) will generate n points drawn from a normal distribution with mean m and standard deviation s.

Listing B.2. Plotting an empirical normal density
library(ggplot2)

# draw 1000 points from a normal with mean 0, sd 1
u <- rnorm(1000)

# plot the distribution of points,
# compared to normal curve as computed by dnorm() (dashed line)
ggplot(data.frame(x=u), aes(x=x)) + geom_density() +
   geom_line(data=data.frame(x=x,y=f), aes(x=x,y=y), linetype=2)

As you can see in figure B.2, the empirical distribution of the points produced by rnorm(1000) is quite close to the theoretical normal. Distributions observed from finite datasets can never exactly match theoretical continuous distributions like the normal; and, as with all things statistical, there is a well-defined distribution for how far off you expect to be for a given sample size.

Figure B.2. The empirical distribution of points drawn from a normal with mean 0 and standard deviation 1. The dashed line represents the theoretical normal distribution.

The function pnorm(x, mean = m, sd = s) is what R calls the normal probability function, otherwise called the normal cumulative distribution function: it returns the probability of observing a data point of value less than x from a normal with mean m and standard deviation s. In other words, it’s the area under the distribution curve that falls to the left of x (recall that a distribution has unit area under the curve). This is shown in listing B.3.

Listing B.3. Working with the normal CDF
# --- estimate probabilities (areas) under the curve ---

# 50% of the observations will be less than the mean
pnorm(0)
# [1] 0.5

# about 2.3% of all observations are more than 2 standard
# deviations below the mean
pnorm(-2)
# [1] 0.02275013

# about 95.4% of all observations are within 2 standard deviations
# from the mean
pnorm(2) - pnorm(-2)
# [1] 0.9544997

The function qnorm(p, mean = m, sd = s) is the quantile function for the normal distribution with mean m and standard deviation s. It’s the inverse of pnorm(), in that qnorm(p, mean = m, sd = s) returns the value x such that pnorm(x, mean = m, sd = s) == p.

Figure B.3 illustrates the use of qnorm(): the vertical line intercepts the x axis at x = qnorm(0.75); the shaded area to the left of the vertical line represents the area 0.75, or 75% of the area under the normal curve.

Figure B.3. Illustrating x < qnorm(0.75)

The code to create figure B.3 (along with a few other examples of using qnorm()) is shown in the following listing.

Listing B.4. Plotting x < qnorm(0.75)
# --- return the quantiles corresponding to specific probabilities ---

# the median (50th percentile) of a normal is also the mean
qnorm(0.5)
# [1] 0

# calculate the 75th percentile
qnorm(0.75)
# [1] 0.6744898
pnorm(0.6744898)
# [1] 0.75

# --- Illustrate the 75th percentile ---

# create a graph of the normal distribution with mean 0, sd 1
x <- seq(from=-5, to=5, length.out=100)
f <- dnorm(x)
nframe <- data.frame(x=x,y=f)
# calculate the 75th percentile
line <- qnorm(0.75)
xstr <- sprintf("qnorm(0.75) = %1.3f", line)

# the part of the normal distribution to the left
# of the 75th percentile
nframe75 <- subset(nframe, nframe$x < line)

# Plot it.
# The shaded area is 75% of the area under the normal curve
ggplot(nframe, aes(x=x,y=y)) + geom_line() +
  geom_area(data=nframe75, aes(x=x,y=y), fill="gray") +
  geom_vline(aes(xintercept=line), linetype=2) +
  geom_text(x=line, y=0, label=xstr, vjust=1)

B.1.2. Summarizing R’s distribution naming conventions

Now that we’ve shown some concrete examples, we can summarize how R names the different functions associated with a given probability distribution. Suppose the probability distribution is called DIST. Then the following are true:

  • dDIST(x, ...) is the density function (or PDF, see the next callout) that returns the probability density at x (or, for a discrete distribution, the probability of observing the value x).
  • pDIST(x, ...) is the cumulative distribution function that returns the probability of observing a value less than x. The flag lower.tail = FALSE will cause pDIST(x, ...) to return the probability of observing a value greater than x (the area under the right tail, rather than the left).
  • rDIST(n, ...) is the random number generator that returns n values drawn from the distribution DIST.
  • qDIST(p, ...) is the quantile function that returns the x corresponding to the pth percentile of DIST. The flag lower.tail = FALSE will cause qDIST(p, ...) to return the x that corresponds to the 1 - pth percentile of DIST.

R’s confusing naming convention

For some reason, R refers to the cumulative distribution function (or CDF) by the shorter term distribution function. Be careful to check whether you want the probability density function or the CDF when working with R.
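The same four-function pattern holds for every distribution that ships with R. For example, for the exponential distribution:

dexp(1, rate = 1)      # density at x = 1
pexp(1, rate = 1)      # CDF: P(X <= 1), about 0.632
qexp(0.632, rate = 1)  # quantile: roughly 1, inverting pexp
rexp(5, rate = 1)      # five random draws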

B.1.3. Lognormal distribution

The lognormal distribution is the distribution of a random variable X whose natural log log(X) is normally distributed. The distribution of highly skewed positive data, like the value of profitable customers, incomes, sales, or stock prices, can often be modeled as a lognormal distribution. A lognormal distribution is defined over all positive real numbers; as shown in figure B.4 (top), it’s asymmetric, with a long tail out toward positive infinity. The distribution of log(X) (figure B.4, bottom) is a normal distribution centered at mean(log(X)). For lognormal populations, the mean is generally much higher than the median, and the bulk of the contribution toward the mean value comes from a small number of the highest-valued data points.
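In fact, the lognormal mean and median have simple closed forms: if log(X) has mean m and standard deviation s, then the median of X is exp(m) and the mean is exp(m + s^2/2). A quick check for m = 0 and s = 1 (compare to the empirical values in listing B.5):

m <- 0; s <- 1
exp(m)            # theoretical median: 1
exp(m + s^2 / 2)  # theoretical mean: about 1.6487, well above the median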

Figure B.4. Top: The lognormal distribution X such that mean(log(X)) = 0 and sd(log(X)) = 1. The dashed line is the theoretical distribution, and the solid line is the distribution of a random lognormal sample. Bottom: The solid line is the distribution of log(X).

Don’t use the mean as a “typical” value for a lognormal population

For a population that’s approximately normally distributed, you can use the mean value of the population as a rough stand-in value for a typical member of the population. If you use the mean as a stand-in value for a lognormal population, you’ll overstate the value of the majority of your data.

Intuitively, if variations in the data are expressed naturally as percentages or relative differences, rather than as absolute differences, then the data is a candidate to be modeled lognormally. For example, a typical sack of potatoes in your grocery store might weigh about five pounds, plus or minus half a pound. The distance that a specific type of bullet will fly when fired from a specific type of handgun might be about 2,100 meters, plus or minus 100 meters. The variations in these observations are naturally represented in absolute units, and the distributions can be modeled as normals. On the other hand, differences in monetary quantities are often best expressed as percentages: a population of workers might all get a 5% increase in salary (not an increase of $5,000/year across the board); you might want to project next quarter’s revenue to within 10% (not to within plus or minus $1,000). Hence, these quantities are often best modeled as having lognormal distributions.
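To see why percentage-style variation leads to the lognormal, note that compounding many small relative changes multiplies them together, so the log of the result is a sum of many small terms, which by the central limit theorem is approximately normal. A minimal simulation (all the rates here are made up):

# 10,000 histories, each compounding 50 small relative changes of about +/- 2%
n_steps <- 50
n_histories <- 10000
changes <- matrix(1 + rnorm(n_steps * n_histories, mean = 0, sd = 0.02),
                  nrow = n_steps)
final <- 100 * apply(changes, 2, prod)  # compound each history
hist(log(final), breaks = 50)           # the logs look approximately normal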

Using the lognormal distribution in R

Let’s look at the functions for working with the lognormal distribution in R (see also section B.5.3). We’ll start with dlnorm() and rlnorm():

  • dlnorm(x, meanlog = m, sdlog = s) is the probability density function (PDF) that returns the probability of observing the value x when it’s drawn from a lognormal distribution X such that mean(log(X)) = m and sd(log(X)) = s. By default, meanlog = 0 and sdlog = 1 for all the functions discussed in this section.
  • rlnorm(n, meanlog = m, sdlog = s) is the random number generator that returns n values drawn from a lognormal distribution with mean(log(X)) = m and sd(log(X)) = s.

We can use dlnorm() and rlnorm() to produce figure B.4, shown earlier. The following listing demonstrates some properties of the lognormal distribution.

Listing B.5. Demonstrating some properties of the lognormal distribution
# draw 1001 samples from a lognormal with meanlog 0, sdlog 1
u <- rlnorm(1001)

# the mean of u is higher than the median
mean(u)
# [1] 1.638628
median(u)
# [1] 1.001051

# the mean of log(u) is approx meanlog=0
mean(log(u))
# [1] -0.002942916

# the sd of log(u) is approx sdlog=1
sd(log(u))
# [1] 0.9820357

# generate the lognormal with meanlog = 0, sdlog = 1
x <- seq(from = 0, to = 25, length.out = 500)
f <- dlnorm(x)

# generate a normal with mean = 0, sd = 1
x2 <- seq(from = -5, to = 5, length.out = 500)
f2 <- dnorm(x2)

# make data frames
lnormframe <- data.frame(x = x, y = f)
normframe <- data.frame(x = x2, y = f2)
dframe <- data.frame(u=u)

# plot densityplots with theoretical curves superimposed
p1 <- ggplot(dframe, aes(x = u)) + geom_density() +
  geom_line(data = lnormframe, aes(x = x, y = y), linetype = 2)

p2 <- ggplot(dframe, aes(x = log(u))) + geom_density() +
  geom_line(data = normframe, aes(x = x,y = y), linetype = 2)

# functions to plot multiple plots on one page
library(grid)
nplot <- function(plist) {
  n <- length(plist)
  grid.newpage()
  pushViewport(viewport(layout=grid.layout(n, 1)))
  vplayout <- function(x, y) {
     viewport(layout.pos.row = x, layout.pos.col = y)
  }
  for(i in 1:n) {
    print(plist[[i]], vp = vplayout(i, 1))
  }
}

# this is the plot that leads this section.
nplot(list(p1, p2))

The remaining two functions are the CDF plnorm() and the quantile function qlnorm():

  • plnorm(x, meanlog = m, sdlog = s) is the cumulative distribution function that returns the probability of observing a value less than x from a lognormal distribution with mean(log(X)) = m and sd(log(X)) = s.
  • qlnorm(p, meanlog = m, sdlog = s) is the quantile function that returns the x corresponding to the pth percentile of a lognormal distribution with mean(log(X)) = m and sd(log(X)) = s. It’s the inverse of plnorm().

The following listing demonstrates plnorm() and qlnorm(). It uses the data frame lnormframe from the previous listing.

Listing B.6. Plotting the lognormal distribution
# the 50th percentile (or median) of the lognormal with
# meanlog=0 and sdlog=1
qlnorm(0.5)
# [1] 1
# the probability of seeing a value x less than 1
plnorm(1)
# [1] 0.5

# the probability of observing a value x less than 10:
plnorm(10)
# [1] 0.9893489

# -- show the 75th percentile of the lognormal

# use lnormframe from previous example: the
# theoretical lognormal curve

line <- qlnorm(0.75)
xstr <- sprintf("qlnorm(0.75) = %1.3f", line)

lnormframe75 <- subset(lnormframe, lnormframe$x < line)

# Plot it
# The shaded area is 75% of the area under the lognormal curve
ggplot(lnormframe, aes(x = x, y = y)) + geom_line() +
  geom_area(data=lnormframe75, aes(x = x, y = y), fill = "gray") +
  geom_vline(aes(xintercept = line), linetype = 2) +
  geom_text(x = line, y = 0, label = xstr, hjust = 0, vjust = 1)

As you can see in figure B.5, the majority of the data is concentrated on the left side of the distribution, with the remaining quarter of the data spread out over a very long tail.

Figure B.5. The 75th percentile of the lognormal distribution with meanlog = 0, sdlog = 1

B.1.4. Binomial distribution

Suppose you have a coin that has a probability p of landing on heads when you flip it (so for a fair coin, p = 0.5). In this case, the binomial distribution models the probability of observing k heads when you flip that coin N times. It’s used to model binary classification problems (as we discuss in relation to logistic regression in chapter 8), where the positive examples can be considered “heads.”
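The probability of seeing exactly k heads in N flips has a closed form: choose(N, k) * p^k * (1-p)^(N-k). R’s dbinom(), introduced below, computes exactly this quantity:

# probability of exactly 5 heads in 10 flips of a fair coin
choose(10, 5) * 0.5^5 * 0.5^5  # [1] 0.2460938
dbinom(5, 10, 0.5)             # [1] 0.2460938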

Figure B.6 shows the shape of the binomial distribution for coins of different fairnesses, when flipped 50 times. Note that the binomial distribution is discrete; it’s only defined for (non-negative) integer values of k.

Figure B.6. The binomial distributions for 50 coin tosses, with coins of various fairnesses (probability of landing on heads)

Using the binomial distribution in R

Let’s look at the functions for working with the binomial distribution in R (see also section B.5.3). We’ll start with the probability density function dbinom() and the random number generator rbinom():

  • dbinom(k, nflips, p) is the PDF that returns the probability of observing exactly k heads in nflips flips of a coin with heads probability p.
  • rbinom(N, nflips, p) is the random number generator that returns N values drawn from the binomial distribution corresponding to nflips flips of a coin with heads probability p.

You can use dbinom() (as in the following listing) to produce figure B.6.

Listing B.7. Plotting the binomial distribution
library(ggplot2)
#
# use dbinom to produce the theoretical curves
#

numflips <- 50
# x is the number of heads that we see
x <- 0:numflips

# probability of heads for several different coins
p <- c(0.05, 0.15, 0.5, 0.75)
plabels <- paste("p =", p)

# calculate the probability of seeing x heads in numflips flips
# for all the coins. This probably isn't the most elegant
# way to do this, but at least it's easy to read

flips <- NULL
for(i in 1:length(p)) {
  coin <- p[i]
  label <- plabels[i]
  tmp <- data.frame(number_of_heads=x,
                    probability = dbinom(x, numflips, coin),
                    coin_type = label)
  flips <- rbind(flips, tmp)
}

# plot it
# this is the plot that leads this section
ggplot(flips, aes(x = number_of_heads, y = probability)) +
  geom_point(aes(color = coin_type, shape = coin_type)) +
  geom_line(aes(color = coin_type))

You can use rbinom() to simulate a coin-flipping-style experiment. For example, suppose you have a large population of students that’s 50% female. If students are assigned to classrooms at random, and you visit 100 classrooms with 20 students each, then how many girls might you expect to see in each classroom? A plausible outcome is shown in figure B.7, with the theoretical distribution superimposed.

Figure B.7. The observed distribution of the count of girls in 100 classrooms of size 20, when the population is 50% female. The theoretical distribution is shown with the dashed line.

Let’s write the code to produce figure B.7.

Listing B.8. Working with the theoretical binomial distribution
p <- 0.5 # the proportion of females in this student population
class_size <- 20 # size of a classroom
numclasses <- 100 # how many classrooms we observe

# what might a typical outcome look like?
# (because we don't call set.seed, results will differ on each run)
numFemales <- rbinom(numclasses, class_size, p)

# the theoretical counts (not necessarily integral)
probs <- dbinom(0:class_size, class_size, p)
tcount <- numclasses*probs

# the obvious way to plot this is with histogram or geom_bar
# but this might just look better

zero <- function(x) {0} # a dummy function that returns only 0

ggplot(data.frame(number_of_girls = numFemales, dummy = 1),
  aes(x = number_of_girls, y = dummy)) +
  # count the number of times you see x heads;
  # stat_summary controls data aggregation during plotting: here it
  # places the dots and bars, measured from the empirical data,
  # alongside the theoretical density curve
  # (note: ggplot2 3.3.0 and later prefer fun, fun.min, and fun.max
  # over fun.y, fun.ymin, and fun.ymax)
  stat_summary(fun.y = "sum", geom = "point", size = 2) +
  stat_summary(fun.ymax = "sum", fun.ymin = "zero", geom = "linerange") +
  # superimpose the theoretical number of times you see x heads
  geom_line(data = data.frame(x = 0:class_size, y = tcount),
            aes(x = x, y = y), linetype = 2) +
  scale_x_continuous(breaks = 0:class_size, labels = 0:class_size) +
  scale_y_continuous("number of classrooms")


As you can see, even classrooms with as few as 4 or as many as 16 girls aren’t completely unheard of when students from this population are randomly assigned to classrooms. But if you observe too many such classrooms—or if you observe classes with fewer than 4 or more than 16 girls—you’d want to investigate whether student selection for those classes is biased in some way.
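You can quantify how rare such a classroom is with pbinom() (described below): the chance that a single randomly assembled classroom is that extreme is only about a quarter of a percent.

# probability of fewer than 4 or more than 16 girls in a random
# classroom of 20, drawn from a 50% female population
p_low  <- pbinom(3, 20, 0.5)                       # P(girls <= 3)
p_high <- pbinom(16, 20, 0.5, lower.tail = FALSE)  # P(girls > 16)
p_low + p_high
# [1] 0.002576828
# so you'd expect roughly 0.26 such classrooms per 100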

You can also use rbinom() to simulate flipping a single coin.

Listing B.9. Simulating a binomial distribution
# use rbinom to simulate flipping a coin of probability p N times

p75 <- 0.75 # a very unfair coin (mostly heads)
N <- 1000  # flip it several times
flips_v1 <- rbinom(N, 1, p75)

# Another way to generate unfair flips is to use runif:
# the probability that a uniform random number from [0 1)
# is less than p is exactly p. So "less than p" is "heads".
flips_v2 <- as.numeric(runif(N) < p75)

prettyprint_flips <- function(flips) {
  outcome <- ifelse(flips==1, "heads", "tails")
  table(outcome)
}

prettyprint_flips(flips_v1)
# outcome
# heads tails
# 756   244
prettyprint_flips(flips_v2)
# outcome
# heads tails
# 743   257

The final two functions are the CDF pbinom() and the quantile function qbinom():

  • pbinom(k, nflips, p) is the CDF that returns the probability of observing k heads or fewer from nflips of a coin with heads probability p. pbinom(k, nflips, p, lower.tail = FALSE) returns the probability of observing more than k heads from nflips of a coin with heads probability p. Note that the left tail probability is calculated over the inclusive interval numheads <= k, while the right tail probability is calculated over the exclusive interval numheads > k.
  • qbinom(q, nflips, p) is the quantile function that returns the number of heads k that corresponds to the qth percentile of the binomial distribution corresponding to nflips of a coin with heads probability p.

The next listing shows some examples of using pbinom() and qbinom().

Listing B.10. Working with the binomial distribution
# pbinom example

nflips <- 100
nheads <- c(25, 45, 50, 60)  # number of heads

# what are the probabilities of observing at most that
# number of heads on a fair coin?
left.tail <- pbinom(nheads, nflips, 0.5)
sprintf("%2.2f", left.tail)
# [1] "0.00" "0.18" "0.54" "0.98"

# the probabilities of observing more than that
# number of heads on a fair coin?
right.tail <- pbinom(nheads, nflips, 0.5, lower.tail = FALSE)
sprintf("%2.2f", right.tail)
# [1] "1.00" "0.82" "0.46" "0.02"

# as expected:
left.tail+right.tail
#  [1] 1 1 1 1

# so if you flip a fair coin 100 times, you are
# all but guaranteed to see more than 25 heads,
# almost guaranteed to see fewer than 60, and
# likely to see more than 45.

# qbinom example

nflips <- 100

# what's the 95% "central" interval of heads that you
# would expect to observe on 100 flips of a fair coin?

left.edge <- qbinom(0.025, nflips, 0.5)
right.edge <- qbinom(0.025, nflips, 0.5, lower.tail = FALSE)
c(left.edge, right.edge)
# [1] 40 60

# so with 95% probability you should see between 40 and 60 heads

One thing to keep in mind is that because the binomial distribution is discrete, pbinom() and qbinom() won’t be perfect inverses of each other, as is the case with continuous distributions like the normal.

Listing B.11. Working with the binomial CDF
# because this is a discrete probability distribution,
# pbinom and qbinom are not exact inverses of each other

# this direction works
pbinom(45, nflips, 0.5)
# [1] 0.1841008
qbinom(0.1841008, nflips, 0.5)
# [1] 45

# this direction won't be exact
qbinom(0.75, nflips, 0.5)
# [1] 53
pbinom(53, nflips, 0.5)
# [1] 0.7579408

B.1.5. More R tools for distributions

R has many more tools for working with distributions beyond the PDF, CDF, and generation tools we’ve demonstrated. In particular, for fitting distributions, you may want to try the fitdistr() function from the MASS package.
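For example, here’s a minimal sketch that fits a lognormal distribution to simulated data by maximum likelihood:

library(MASS)

# simulate lognormal data, then recover the parameters;
# the estimates should land near meanlog = 0 and sdlog = 1
u <- rlnorm(1000, meanlog = 0, sdlog = 1)
fitdistr(u, densfun = "lognormal")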
