CHAPTER 7 Probabilistic Applications of Integration

Figure 7.1 Albert Einstein (1879–1955), best known for his formula E = mc² and for his statement “God does not play dice with the world,” was awarded the Nobel Prize in Physics in 1921.

Preview

“Lest men suspect your tale untrue, Keep probability in view.”*

John Gay, English poet and dramatist (1685–1732)

At the beginning of the nineteenth century, the prevailing scientific view was a clockwork universe in which everything was determinable, if not actually determined. An asteroid hitting Earth was not a chance event; rather, it could have been anticipated if the position and velocity of all asteroids in the solar system were known. This was the view of Pierre-Simon Laplace (1749–1827) whose methods on how to compute future positions of planets and comets from observations of their past positions were published in a five-volume treatise titled Méchanique céleste. A century later, the clockwork view of classical mechanics was shattered by twentieth century quantum mechanics, built on Werner Heisenberg's (1901–1976) uncertainty principle. This principle implies that the precision with which both the position and velocity of any object can be known at a given point in time is limited. Einstein's difficulty in accepting this principle is encapsulated in one of his most quoted statements (translated from the German): “God does not play dice with the world.” Whether we believe that the time of death of an individual is preordained by God or is subject to the throw of a cosmic die is not relevant. In either case, we know that many kinds of computations in biology are intrinsically predictions of the frequencies of the occurrence of particular events. Think of flipping coins: heads can be predicted to occur only 50% of the time.

The mathematical theory of chance started when a French writer, Chevalier de Méré, with a penchant for gambling and an interest in mathematics, challenged the French mathematician Blaise Pascal (1623–1662) to solve a betting problem. Pascal teamed up with another French mathematician, Pierre de Fermat (1601–1665), to solve the problem; in the process, they laid the foundations of probability theory. It is no exaggeration to say that without probability theory, the biological sciences would not exist as we know them today. All concepts, ideas, and calculations relating to the theory and practice of mathematical statistics and its indispensable application to experimental biology would not exist. Thus, students in the biological sciences need to become immersed in ideas that relate to chance and probabilities as early as possible and have these ideas reinforced as often as possible. Certain foundational aspects of probability theory rely on integration; this chapter provides an elementary introduction to these aspects of probability.

We begin with organizing data into histograms, a topic reminiscent of how we used discrete rectangles to approximate the area under a continuous curve. Integration allows us to make the transition from histograms to continuous probability density functions and cumulative distribution functions. These functions often have infinite domains (e.g., heights theoretically can take on any positive value) and, consequently, require improper integrals. Next we consider means and variances of these functions, fundamental concepts for all sciences. We then focus on applications: the probability of getting infected as a disease sweeps through a population, identification of underweight children, time to extinction of marine species, life expectancy of humans and dinosaurs, and probabilities of tumor regrowth after medical treatment.

7.1 Histograms, PDFs, and CDFs

Histograms and probabilities

In previous chapters we saw how gathering data can lead to uncovering and ultimately understanding many interesting phenomena. In the simplest case, data represent single values, each associated with an object or an individual. Some examples of sets of data we considered (or, in the case of a coin or die, familiar to us all) are listed in Table 7.1.

Table 7.1 Examples of random variables with expected ranges

images

One way of visualizing large data sets is to organize values by the proportion that fall into subintervals, known as bins. For example, consider the fourth data set listed in Table 7.1: If we have data on 100 animals no larger than a cow, we can ask what proportion of these 100 individuals require, respectively, (0, 1], (1, 2], ..., and (9, 10] thousand kilocalories per day. We can then plot the proportions as a bar graph, referred to as a histogram.

Histograms

A histogram is a graphical representation of a data set shown as adjacent rectangles, erected over nonoverlapping intervals (bins), with the area of each rectangle equal to the fraction of data points in the interval.

Given a data set, various types of histograms can be created, depending on how we divide up the interval over which the data are defined. The most common type is obtained by splitting the range of the data into equal-sized bins, as illustrated in the following example.

Example 1 Bird diversity in oak woodlands

In spring 1994, the number of bird species was recorded at each of forty oak woodland sites in California. Each site was about 5 hectares in size (roughly 12 acres, or 0.019 square mile), and the sites were situated in a relatively homogeneous habitat. The numbers of bird species found at these sites are listed here:

images

Construct a histogram with two intervals corresponding to 0 to 25 species and 25 to 50 species.
Construct a histogram with intervals of width 10 over the domain [0, 50).
Use technology to construct a histogram with intervals of width 5 over the domain [0, 50).
Determine the units on the vertical axis of the histograms.

For this problem, assume the intervals include the left endpoint but not the right endpoint.

Solution

Since 13 of the 40 data points are between 0 and 24, the fraction of data points in the first interval is 13/40 = 0.325. Since the remaining 27 data points are in the second interval, the fraction of data points in the second interval is 27/40 = 0.675. To draw the histogram, we sketch a rectangle over the right interval that is approximately twice as high as the rectangle over the left interval. More precisely, we want the area of the left rectangle to equal 0.325. Since the base of the rectangle is of length 25, its height must be 0.325/25 = 0.013. We want the area of the right rectangle to equal 0.675. Therefore, the height of this rectangle is 0.675/25 = 0.027. The resulting histogram is illustrated in Figure 7.2a. Note that, by construction, the total shaded area is 1.

Figure 7.2 Frequencies y of species richness x in oaklands binned into histograms with a. two bins, b. five bins (first is empty), and c. 10 bins (first three are empty).
Intervals of width 10 correspond to [0, 10), [10, 20), [20, 30), [30, 40), and [40, 50). The number of data points in the [0, 10) interval is 0. Hence, we draw no rectangle over this interval. The number of data points in the [10, 20) interval is 3. Hence, the fraction of data in this interval is 3/40 = 0.075. Since the width of the interval is 10, the height of the rectangle over the [10, 20) interval should be 0.075/10 = 0.0075. Similarly, the number of data points in the [20, 30) interval is 22, implying that the fraction of data in this interval is 22/40 = 0.55 and that the height of the rectangle over this interval should be 0.55/10 = 0.055. For the interval [30, 40), the number of data points is 13; the fraction of data is thus 13/40 = 0.325, so the height of the rectangle over this interval should be 0.325/10 = 0.0325. Finally, for the interval [40, 50) the number of data points is 2; the fraction of data is thus 2/40 = 0.05, so the height of the rectangle over this interval should be 0.05/10 = 0.005. The resulting histogram is illustrated in Figure 7.2b. Again, by construction, the total shaded area is 1.
Many programs exist (e.g., most spreadsheet and statistical software) that create histograms for which one can specify the size of the intervals to be plotted. Specifying intervals of width 5 in one of these programs yields Figure 7.2c.
Since the areas of the rectangles are unitless, the units on the vertical axes of the histogram have to be the reciprocal of the units on the horizontal axes. For the histograms in Figure 7.2, the units on the horizontal axis are number of species. Hence, the units on the vertical axes are 1/(number of species), that is, “per species.”
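The height calculation used in Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `histogram_heights` is hypothetical, and the bin counts are the ones quoted in the solution for bins of width 10.

```python
def histogram_heights(counts, width):
    """Heights chosen so that each rectangle's area equals the fraction of data in its bin."""
    total = sum(counts)
    return [count / total / width for count in counts]

# Counts for the bins [0,10), [10,20), [20,30), [30,40), [40,50) quoted in Example 1
counts = [0, 3, 22, 13, 2]
width = 10

heights = histogram_heights(counts, width)
total_area = sum(height * width for height in heights)

print(heights)     # one height per bin, in units of 1/(number of species)
print(total_area)  # total shaded area of the histogram
```

Dividing by the bin width is what distinguishes this area-normalized histogram from a plain bar chart of proportions, and it is why the total area comes out to one regardless of the bin width chosen.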

As we have seen in Example 1, histograms provide a sense of whether there is a center to a data set (i.e., where most of the data points lie), how much spread there is to the data set (i.e., is it even over the whole range, peaked in more than one region, or concentrated around a center), and how skewed the data set is (i.e., whether there are more data points to the left or right of the center). For example, in Figure 7.2c, the data are centered around the bin of 25–30 species and the histogram is slightly skewed to the right.

One interpretation of the area of a rectangle in a histogram is the proportion of the data in the interval. An alternative interpretation is in terms of random variables. A random variable, usually written X, is a variable whose possible values are outcomes of a random phenomenon, such as the rolling of a die, this evening's winning lottery numbers, or the height of a randomly chosen individual. Two important types of random variables are discrete random variables, which we discuss next, and continuous random variables, which we examine later in this section. In the discussion of random variables, we provide only simple, intuitive definitions to facilitate our presentation of probabilistic applications of integration. Introductory statistics and probability classes provide more in-depth studies of these concepts.

Discrete Random Variables

A discrete random variable X takes on a countable number of values, say x₁, x₂, x₃,..., with probabilities p₁, p₂, p₃,.... These probabilities are called the probability distribution for X, are nonnegative numbers, and sum to one (i.e., p₁ + p₂ + p₃ + ··· = 1) as the random variable always takes on some value. We write

$$P(X = x_i) = p_i$$

to denote “X = x_i with probability p_i.”

When discrete random variables take on an infinite number of values, the infinite sum p₁ + p₂ + p₃ + ··· corresponds to an infinite series, which is well-defined only if the limit $\lim_{n \to \infty} (p_1 + p_2 + \cdots + p_n)$ exists. As with the improper integrals discussed in Section 7.2, understanding when these limits are well-defined is a delicate issue. However, all of our examples deal with only a finite number of values.

Example 2 Playing roulette

Roulette is a casino game named after the French diminutive for “little wheel.” In American casinos, the roulette wheel has 38 colored and numbered pockets of which 18 are black, 18 are red, and 2 are green; a small ball moves as the wheel rotates. A player betting one dollar on red wins one dollar if the ball lands in a red pocket and loses otherwise. Let X be the earning of a player betting one dollar on red. Find the probability distribution for X.

Solution Assuming that the ball is equally likely to land in each pocket (i.e., the wheel is fair), the probability of the ball landing in a red pocket is 18/38 = 9/19 and the probability of landing in a black or green pocket is 20/38 = 10/19. Hence, X equals 1 with probability 9/19 and −1 with probability 10/19. As with all casino games, betting on red slightly favors the house.
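As a quick check of Example 2, the distribution can be written out with exact fractions. This is a minimal sketch; the dictionary names `pockets` and `distribution` are illustrative choices, not notation from the text.

```python
from fractions import Fraction

# Pocket counts on an American roulette wheel, as stated in Example 2
pockets = {"red": 18, "black": 18, "green": 2}
total = sum(pockets.values())

# Earnings X for a one-dollar bet on red: +1 on red, -1 on black or green
distribution = {
    1: Fraction(pockets["red"], total),
    -1: Fraction(pockets["black"] + pockets["green"], total),
}

# A probability distribution must be nonnegative and sum to one
is_valid = all(p >= 0 for p in distribution.values()) and sum(distribution.values()) == 1

print(distribution)
print(is_valid)
```

Using `Fraction` rather than floating-point numbers keeps the probabilities exact, so the check that they sum to one involves no rounding.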

To see how random variables relate to data sets and their histograms, imagine writing down each data value on a slip of paper, putting all the slips in a hat, drawing one slip at random from the hat, and calling this value X. For an interval of data values [a, b), we define

$$P(a \le X < b) = \frac{\text{number of data values in } [a, b)}{\text{total number of data values}}.$$

For a histogram of this data set, the area of the rectangle above the interval [a, b) equals P(a ≤ X < b).

Example 3 Species diversity in a randomly chosen site

An oak woodland site is randomly selected from the forty sites presented in Example 1. Let X denote the number of bird species in that site.

Find and interpret the probability P(0 ≤ X < 25).
Find P(20 ≤ X < 30).
Find P(20 ≤ X < 40).

Solution

Since the proportion of sites with fewer than 25 bird species is 13/40 = 0.325, P(0 ≤ X < 25) = 0.325. In other words, there is a 32.5% chance that a randomly chosen site has fewer than 25 bird species. This value corresponds to the area over the interval [0, 25) in Figure 7.2a.
Since the proportion of sites with at least 20 species and fewer than 30 species is 22/40 = 0.55, P(20 ≤ X < 30) = 0.55. In other words, there is a 55% chance that a randomly chosen site has at least 20 and fewer than 30 bird species. This corresponds to the area over the interval [20, 30) in Figure 7.2b.
Since the proportion of sites with at least 20 species and fewer than 40 species is (22 + 13)/40 = 0.875, P(20 ≤ X < 40) = 0.875. In other words, there is an 87.5% chance that a randomly chosen site has at least 20 and fewer than 40 bird species. This corresponds to the total area over the intervals [20, 30) and [30, 40) in Figure 7.2b.
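The computations in Example 3 amount to adding bin counts and dividing by the total. A minimal sketch follows, assuming the width-10 bin counts quoted in Example 1; the helper `interval_probability` is hypothetical and gives the right answer only when a and b coincide with bin edges.

```python
def interval_probability(bin_counts, a, b):
    """P(a <= X < b) for data summarized as counts over half-open bins [lo, hi).

    Only bins lying entirely inside [a, b) are counted, so a and b
    should be bin edges for the answer to be exact.
    """
    total = sum(bin_counts.values())
    inside = sum(n for (lo, hi), n in bin_counts.items() if a <= lo and hi <= b)
    return inside / total

# Width-10 bin counts from Example 1
bin_counts = {(0, 10): 0, (10, 20): 3, (20, 30): 22, (30, 40): 13, (40, 50): 2}

p_20_30 = interval_probability(bin_counts, 20, 30)
p_20_40 = interval_probability(bin_counts, 20, 40)

print(p_20_30, p_20_40)
```

Note that P(0 ≤ X < 25) cannot be recovered from these width-10 counts, because 25 is not a bin edge; that probability requires the raw data or the two-bin summary.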

Example 4 From histograms to probabilities

In a study involving 252 men, Garth Fisher estimated the percentage of body fat by weighing individuals underwater and taking various body circumference measurements. A histogram for the data is shown in Figure 7.3. Assume a man is randomly selected from this study. Let X denote the percentage of body fat of this randomly selected man.

Figure 7.3 Frequencies y of percentage body fat x in a study of 252 men

Estimate P(X < 10)
Estimate P(10 ≤ X < 30)
Estimate P(X ≥ 30)

Solution

The area of the rectangle above the interval [0, 10] of the histogram depicted in Figure 7.3 is approximately 0.015 × 10 = 0.15. Hence, we approximate P(X < 10) = 0.15. Equivalently, we estimate that 15% of the men have less than 10% body fat.
The area of the rectangle above the interval [10, 20] of the histogram depicted in Figure 7.3 is approximately 0.037 × 10 = 0.37. The area of the rectangle above the interval [20, 30] is approximately 0.038 × 10 = 0.38. Hence, we approximate P(10 ≤ X < 30) = 0.37 + 0.38 = 0.75. Equivalently, we estimate that 75% of the men have between 10% and 30% body fat.
Since the sum of the areas of the rectangles must be one (i.e., the total fraction of data is one), the area of the rectangles over the intervals [30, 40] and [40, 50] must equal 1 − 0.75 − 0.15 = 0.10. Hence, we approximate P(X ≥ 30) ≈ 0.10 or, in words, we estimate that 10% of the men in the study have at least 30% body fat.

Probability density functions

Some data sets are naturally discrete, such as the distribution of litter size among female cats of a particular age. Other data sets involving physical measurements—such as height, weight, or time—can take on a continuum of values. In the latter case, when sufficiently many measurements are taken, the histogram may be well approximated by a continuous function, as seen in the next example.

Example 5 Like father, like son

A demographer decides he wants to understand the heights of fathers and their sons. Discuss how he might conduct this study.
Karl Pearson (1857–1936), one of the founders of mathematical statistics, collected data on the heights of 1078 father-son pairs. Use his data (readily available on the Internet) and compare the histograms of heights for fathers and sons using 5, 10, and 20 bins. Discuss what you find.

Solution

The demographer would first identify the population for which he wants to understand the heights of fathers and sons. It is well known that individuals from different nationalities and groups differ on average with respect to height. For example, Tutsi men of Burundi and Rwanda are regarded as the tallest humans, averaging over 6 feet, whereas Pygmy men and women of central Africa are the shortest, averaging 4 feet 5 inches and 4 feet 6 inches, respectively. After identifying the population (e.g., Pygmies in central Africa), the demographer would take a large, randomized sample of fathers and their sons within the target population. A large sample ensures that he is unlikely to get a misleading result due to chance. For example, if he only chooses two fathers, he might get by chance two fathers who are much taller than their sons, even though this does not accurately reflect the population. Choosing individuals randomly (e.g., via a telephone directory or census data) prevents biases in data collecting. For instance, a demographer who selects only the tallest fathers (e.g., only professional basketball players) in a population is likely to find their sons are typically shorter.
In 1903, Pearson measured the heights of 1,078 fathers and their sons in England. Histograms of these data with 5, 10, and 20 bins are shown in Figure 7.4. Despite sons having a slightly wider range of heights, these histograms suggest that fathers and sons have quite similar distributions of heights. Later we will see that the sons are about an inch taller on average than their fathers. However, this slight difference is not readily apparent in the histograms. Figure 7.4 illustrates, quite unexpectedly, that as the number of bins is increased, the histogram is well approximated by a continuous curve.

Figure 7.4 Histograms of heights x of fathers (top row) and sons (bottom row) with 5, 10, and 20 equally sized bins. The red curve is a probability density function that approximates the histogram.

The continuous function approximating the histogram in Figure 7.4 is an example of a probability density function. The graphs of these functions are the continuous analogues of histograms. Since probability density functions are used to describe distributions from data sets, they need to satisfy certain natural properties.

Probability Density Function (PDF)

A probability density function (PDF) is a piecewise continuous function f(x) such that

f(x) ≥ 0 for all x; that is, probabilities are nonnegative
total area under f(x) equals one; that is, the area under the histogram equals one

Example 6 Constructing a PDF from a nonnegative function

Let a be a constant. Consider the function defined by f(x) = ax for 0 ≤ x ≤ 5 and f(x) = 0 otherwise. Determine for what value of a the function f is a PDF.

Solution In order for f to be a PDF, f needs to be nonnegative. Hence, a must be nonnegative. The area under f must equal one. Since f(x) = 0 outside the interval [0, 5], the area under f is given by

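$$\int_0^5 ax\,dx = \left.\frac{a x^2}{2}\right|_0^5 = \frac{25a}{2}.$$

Setting this area equal to one and solving for a yields a = 2/25 = 0.08.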
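The value of a in Example 6 can also be checked numerically. The sketch below assumes a simple midpoint rule (the helper `midpoint_integral` is a hypothetical name): it integrates the unscaled function x over [0, 5] and takes the reciprocal of that area as the normalizing constant.

```python
def midpoint_integral(g, lower, upper, n=10_000):
    """Approximate the integral of g over [lower, upper] with the midpoint rule."""
    step = (upper - lower) / n
    return step * sum(g(lower + (i + 0.5) * step) for i in range(n))

# Area under the unscaled function g(x) = x on [0, 5];
# the area under f(x) = a*x is a times this value.
unscaled_area = midpoint_integral(lambda x: x, 0.0, 5.0)

# Choose a so that the total area under f equals one
a = 1.0 / unscaled_area

print(unscaled_area, a)
```

The same recipe (integrate the unnormalized nonnegative function, then divide by the area) applies to any function with finite positive area.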

Since PDFs are used to approximate histograms of data sets, we can use them to estimate the fraction of data lying in any interval.

Area Under a PDF

Let f(x) be a PDF describing the distribution of a data set. Then, the fraction of data lying in the interval a ≤ x ≤ b can be approximated as shown:

$$\text{fraction of data in } [a, b] \approx \int_a^b f(x)\,dx$$

In words, the fraction of data in an interval is approximately the area of the PDF above this interval.

In our approximation of the data in the interval [a, b], we included the right endpoint b, contrary to our convention with histograms. For data sets whose distributions are well approximated by a PDF, the fraction of data taking on exactly the value b is 0 or very small. Hence, the effect of including or excluding an endpoint b is negligible.

Example 7 Heights of fathers

The distribution of fathers' heights from Example 5 is approximated by this PDF:

Use numerical integration to approximate the fraction of fathers whose heights are between 6 and 7 feet.

Solution Since 6 and 7 feet correspond to 72 and 84 inches, we can estimate the fraction of fathers with heights between 6 and 7 feet by the integral

$$\int_{72}^{84} f(x)\,dx$$

which is given approximately by 0.058. Hence, approximately 6% of the fathers in Pearson's study were between 6 and 7 feet. A graphical representation of this fraction is shown in the blue shaded region in the graph at the left.
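A numerical integration of this kind can be sketched with Simpson's rule. The sketch rests on a loudly stated assumption: it takes f to be a normal density with mean 67.7 inches and standard deviation 2.75 inches. These parameter values are illustrative placeholders, not the values used in the example, and the helper names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def simpson(g, lower, upper, n=1000):
    """Composite Simpson's rule with an even number n of subintervals."""
    if n % 2:
        raise ValueError("n must be even")
    step = (upper - lower) / n
    total = g(lower) + g(upper)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(lower + i * step)
    return total * step / 3.0

# ASSUMED parameters (inches); placeholders for illustration only
ASSUMED_MEAN = 67.7
ASSUMED_SD = 2.75

# Fraction of heights between 72 and 84 inches under the assumed density
fraction = simpson(lambda x: normal_pdf(x, ASSUMED_MEAN, ASSUMED_SD), 72.0, 84.0)
print(fraction)
```

With different parameters the computed fraction changes, so the printed value should be read as a property of the assumed density rather than of Pearson's data.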

An alternative interpretation of the area under a PDF is in terms of continuous random variables. Imagine we have a hat that contains an infinite number of (infinitely thin) slips of paper, each with different numbers such that the proportion of slips with numbers in the interval [a, b] is given by

$$\int_a^b f(x)\,dx$$

where f is a PDF. Now shake this hat and, with your eyes closed, grab a slip of paper. Let X denote the value on this slip. Since X can assume any real value, it is called a continuous random variable with PDF f(x), where the probability that X takes on a value in the interval [a, b] equals $\int_a^b f(x)\,dx$.

Continuous Random Variables

A continuous random variable X with PDF f(x) satisfies

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

In other words, the probability of lying in an interval equals the area of the PDF above this interval.

The simplest continuous random variable is a uniform random variable that places equal weight on all values in a given interval. For example, one might expect the birth time of babies to be approximately uniformly distributed over the year, as illustrated in Figure 7.5. However, some interesting patterns emerge, including high peaks in September and late December (could the latter be a possible synergism of tax considerations and induced/Cesarean births?) and a low trough in late April/early May. The peaks and troughs are surely climate and culture dependent.

Figure 7.5 Fraction of births each week in 1978 in the United States

Example 8 Birth times

In this example, ignore the subtleties of birth dates indicated in Figure 7.5. Let's assume that the time of birth X in days after January 1st for a randomly chosen individual is equally likely to be any time of the year. Ignoring leap years, our birth-time distribution has the following PDF:

$$f(x) = \begin{cases} \dfrac{1}{365} & \text{if } 0 \le x \le 365 \\ 0 & \text{otherwise} \end{cases}$$

Show that f is a PDF.
Compute the probability of a randomly chosen individual having a birth date in January.

Solution

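Since f(x) = 0 outside of the interval [0, 365], the area under f(x) is

$$\int_0^{365} \frac{1}{365}\,dx = \frac{365}{365} = 1.$$

Since, in addition, f(x) ≥ 0 for all x, f is a PDF.
Since January comprises the first thirty-one days of the year, we obtain

$$P(0 \le X \le 31) = \int_0^{31} \frac{1}{365}\,dx = \frac{31}{365} \approx 0.085.$$

In other words, there is approximately an 8.5% chance that a randomly chosen student from your calculus class was born in January.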

The PDF in Example 8 is an example of the following general class of density functions:

Uniform PDF

The uniform PDF on the interval [a, b] is given by

$$f(x) = \begin{cases} \dfrac{1}{b - a} & \text{if } a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$

Cumulative distribution functions

Another way of describing a discrete or continuous random variable is with a cumulative distribution function.

Cumulative Distribution Function (CDF)

The cumulative distribution function (CDF) of a random variable X is the function F defined by

$$F(x) = P(X \le x).$$

If X describes a data set, then F(x) equals the fraction of data in the interval (−∞, x].

If X is a continuous random variable with PDF f(x), then F(x) corresponds to the area under f over the interval (−∞, x]. Formally, we write this as follows:

$$F(x) = \int_{-\infty}^{x} f(t)\,dt.$$

If there exists an a such that f(t) = 0 for t ≤ a, then this expression simplifies to

$$F(x) = \int_{a}^{x} f(t)\,dt.$$

The case for which there is no such a results in an improper integral (i.e., an integral over an infinite range), which is discussed in Section 7.2.

There are several nice properties of CDFs (as opposed to PDFs). For example, if X is a random variable with a CDF F, then

$$P(a < X \le b) = F(b) - F(a).$$

Thus, when you are given a CDF, computing probabilities is much easier, as no integration needs to be performed. If you are given only the PDF, you must carry out the integration one way or another.

Example 9 From PDF to CDF

Consider the random variable X corresponding to the birth time of a randomly chosen individual with the PDF

$$f(x) = \begin{cases} \dfrac{1}{365} & \text{if } 0 \le x \le 365 \\ 0 & \text{otherwise} \end{cases}$$

Find and plot the CDF for X.
Use the CDF to find the probability of X lying in January. Compare your answer to what was found in Example 8.

Solution

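Since f(x) = 0 for x ≤ 0, we obtain F(x) = P(X ≤ x) = 0 whenever x ≤ 0. For 0 < x < 365, we obtain

$$F(x) = \int_0^x \frac{1}{365}\,dt = \frac{x}{365}.$$

Finally, we have

$$F(x) = \int_0^{365} \frac{1}{365}\,dt = 1$$

for x ≥ 365. Thus,

$$F(x) = \begin{cases} 0 & \text{if } x \le 0 \\ \dfrac{x}{365} & \text{if } 0 < x < 365 \\ 1 & \text{if } x \ge 365 \end{cases}$$

Plotting the CDF yields Figure 7.6.

Figure 7.6 CDF for the birth time distribution

January corresponds to the interval [0, 31]. The fraction of data in this interval is given by F(31) − F(0) = 31/365 − 0 ≈ 0.0849315. This answer agrees with what was found in Example 8.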
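The CDF of Example 9 translates directly into a small function, and the January probability becomes a subtraction rather than an integration. The function name `uniform_cdf` and its default arguments are illustrative choices.

```python
def uniform_cdf(x, a=0.0, b=365.0):
    """CDF of the uniform distribution on [a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

# Probability of a birth time in January, i.e. in the interval [0, 31]
p_january = uniform_cdf(31) - uniform_cdf(0)

print(p_january)
```

The three branches mirror the three cases of the piecewise formula, and the same function handles any interval [a, b] through its optional arguments.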

A CDF F(x) for a random variable X is characterized by the following properties.

CDF Properties

A nonnegative function F(x) is a CDF if and only if it has these properties:

F(y) ≥ F(x) whenever y ≥ x; that is, F is a nondecreasing function
$\lim_{x \to \infty} F(x) = 1$ and $\lim_{x \to -\infty} F(x) = 0$
F(x) is right continuous for all x; that is, $\lim_{y \to x^+} F(y) = F(x)$ for all x

The first property requires that the fraction of data with values at most x is nondecreasing in x. Intuitively, as x increases, one is only including more data in the interval (−∞, x]; hence, the fraction cannot get smaller. The second property ensures that all the data lie somewhere on the real line. The third property ensures that there can be no jump discontinuities in F(x) from the right. For discrete random variables, there always are jump discontinuities from the left. For example, if X = 0 with probability p and X = 1 with probability 1 − p, where 0 < p < 1, then the CDF for X is

$$F(x) = \begin{cases} 0 & \text{if } x < 0 \\ p & \text{if } 0 \le x < 1 \\ 1 & \text{if } x \ge 1 \end{cases}$$

Clearly, F(x) has jump discontinuities at x = 0, 1. However, as expected, $\lim_{x \to 0^+} F(x) = p = F(0)$ and $\lim_{x \to 1^+} F(x) = 1 = F(1)$. It is worth noting, as we saw in Example 9, that the CDFs for continuous random variables are continuous and, therefore, easily satisfy the third property.

CDFs can arise quite naturally from differential equation models, as the following example illustrates.

Example 10 Drug decay and the exponential CDF

Lidocaine is a common local anesthetic and antiarrhythmic drug. The elimination rate constant for lidocaine is c = 0.43 (per hour) for most patients. If y is the amount of drug in the body and there is no further input of drug into the body, we can model the drug dynamics by

$$\frac{dy}{dt} = -0.43\,y, \qquad y(0) = y_0,$$

where t denotes time in hours and y₀ is the initial amount of lidocaine in the body.

Solve for y(t).
Write an expression, call it F(t), that represents the fraction of drug that has left the body by time t ≥ 0.
If we define F(t) = 0 for t ≤ 0, verify that F(t) satisfies the three properties of a CDF.
What is the probability that a randomly chosen molecule of drug leaves the body in the first two hours? What is the probability that a randomly chosen molecule of drug leaves the body between the second hour and fourth hour?

Solution

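Separating and integrating yields

$$\int \frac{dy}{y} = -\int 0.43\,dt \quad\Longrightarrow\quad \ln|y| = -0.43\,t + C_1 \quad\Longrightarrow\quad y = C_2 e^{-0.43t}.$$

Since y₀ = y(0) = C₂, we obtain y(t) = y₀e^−0.43t.
The fraction of drug in the body at time t is y(t)/y₀ = y₀e^−0.43t/y₀ = e^−0.43t. Hence, the fraction that has left by time t is F(t) = 1 − e^−0.43t for t ≥ 0.
Let F(t) = 0 for t ≤ 0. Since F′(t) = 0.43e^−0.43t > 0 for t > 0 and F is constant for t ≤ 0, F is nondecreasing for all t. Since F(t) = 0 for t ≤ 0, $\lim_{t \to -\infty} F(t) = 0$. Since $\lim_{t \to \infty} e^{-0.43t} = 0$, $\lim_{t \to \infty} F(t) = 1$. Finally, since

$$\lim_{t \to 0^+} F(t) = \lim_{t \to 0^+} \left(1 - e^{-0.43t}\right) = 0$$

and F(t) = 0 for all t ≤ 0, F(t) is continuous at t = 0 and, therefore, continuous for all t. Hence, F is a CDF.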
The likelihood that a particular molecule of drug is eliminated in the first two hours is given by F(2) ≈ 0.58. The likelihood that a particular molecule of drug is eliminated between the second and fourth hour is F(4) − F(2) ≈ 0.24. Hence, a randomly chosen molecule of drug is much more likely to be eliminated in the first two hours than in the second two hours.
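The probabilities in Example 10 can be reproduced with a short function for the exponential CDF. This is a minimal sketch; the names `exponential_cdf` and `RATE` are illustrative.

```python
import math

def exponential_cdf(t, rate):
    """F(t) = 1 - exp(-rate * t) for t > 0, and 0 otherwise."""
    return 1.0 - math.exp(-rate * t) if t > 0 else 0.0

RATE = 0.43  # elimination rate constant for lidocaine (per hour), from Example 10

# Probability a molecule leaves in the first two hours
p_first_two = exponential_cdf(2, RATE)

# Probability a molecule leaves between the second and fourth hours
p_second_two = exponential_cdf(4, RATE) - exponential_cdf(2, RATE)

print(round(p_first_two, 2), round(p_second_two, 2))
```

Because the CDF is available in closed form, each probability is a difference of two function values and no numerical integration is needed.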

Example 10 is a particular instance of the exponential distribution that arises in many applications. The general exponential distribution and additional applications are discussed in this chapter's problem sets, but to emphasize the importance of exponential distribution, we define it here.

Exponential PDF and CDF

The exponential PDF on the interval [0, ∞), with rate parameter c > 0, is given by

$$f(x) = \begin{cases} c\,e^{-cx} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

The corresponding exponential CDF is

$$F(x) = \begin{cases} 1 - e^{-cx} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

Since the CDFs of continuous random variables are nondecreasing continuous functions, it is often easier to fit functions to empirically derived CDFs than to empirically derived PDFs, which might be less smooth. When fitting CDFs, however, we need to be careful, as the following example illustrates.

Example 11 Survivorship histograms and CDFs for the Mediterranean fruit fly

The Mediterranean fruit fly is one of the world's most destructive pests of deciduous fruits—such as apples, pears, and peaches—and of citrus fruits as well. Adults of both sexes may live six months or more under favorable conditions. University of California scientist James Carey and his colleagues reared Mediterranean fruit flies under laboratory conditions and daily recorded the number of adults surviving a given number of days after emerging from the pupal stage. This resulted in the following data:

images

The histogram associated with the data is illustrated in Figure 7.7a. The cumulative proportion of dead individuals at times 0, 10, 20, ..., 80 is shown in Figure 7.7b. The experiment was stopped after 85 days, when 3% of individuals were still alive, so we do not know when all the individuals died. Using technology to fit a quartic polynomial F(x) to the cumulative data yields (each coefficient rounded to five significant figures)

images

Figure 7.7 Mortality histogram and cumulative mortality distribution for the Mediterranean fruit fly

Use the fitted CDF F(x) to estimate the probability that an individual dies before age 18 days.
What is the probability that an individual survives at least until age 46 days?
What is the probability that an individual lives beyond 100 days?

Solution

The probability that an individual dies before reaching 18 days is, by definition of the CDF,

$$P(X \le 18) = F(18) \approx 0.156.$$

Hence, there is a 15.6% chance that a randomly chosen fruit fly dies before its 18th day of life.
The probability that an individual survives at least until age 46 days is equal to 1 minus the probability that the individual dies before reaching age 46 days:

$$P(X \ge 46) = 1 - F(46) \approx 0.533.$$

In other words, 53.3% of the fruit flies survived at least 46 days.
The probability that an individual lives beyond 100 days is 1 − F(101) (i.e., 1 minus the probability of dying by age 101). However, F(101) = 2.335, which clearly violates the requirement that F(x) ≤ 1. The reason is that we only fitted the data up to day 80. We do not have sufficient data to know how to construct F(x) beyond 80 days because, in the original data set, not all individuals had died at the termination of the experiment.

Example 11 suggests that for some populations, we could have a problem constructing F(x) if we do not have an estimate of the maximum life span of individuals. At the beginning of this millennium, for example, the Guinness Book of Records reported that the oldest fully authenticated age to which any human has ever lived is attributed to a French woman, Jeanne-Louise Calment, who was born on February 21, 1875 and died at age 122 years and 164 days. Individuals who appear to be older than this are alive today, but authentication of their birth date is required for them to be listed in the Guinness Book of Records.

Percentiles

Using the CDF, we can define quantities called percentiles, which play a special role in statistics and probability.

Medians, Quartiles, and Percentiles

Let F(x) be a CDF for a continuous random variable. The value of x such that F(x) = p is called the p × 100th percentile of the random variable. The 25th, 50th, and 75th percentiles are known as the first quartile, the median, and the third quartile, respectively.

Example 12 Drug decay percentiles

In Example 10, we found the CDF

$$F(t) = 1 - e^{-0.43t} \quad \text{for } t \ge 0,$$

which describes the fraction of lidocaine that has left the body after t hours. Find the median and 90th percentile for this CDF. Discuss what these numbers mean.

Solution To find the median, we need to solve F(t) = 0.50 as follows:

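$$\begin{aligned} 1 - e^{-0.43t} &= 0.5 \\ e^{-0.43t} &= 0.5 \\ -0.43\,t &= \ln 0.5 \\ t &= \frac{\ln 2}{0.43} \approx 1.61 \end{aligned}$$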

The median of 1.61 hours corresponds to the time when 50% of the drug has left the body.

To find the 90th percentile, we need to solve F(t) = 0.9 as follows:

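$$1 - e^{-0.43t} = 0.9 \;\Longrightarrow\; e^{-0.43t} = 0.1 \;\Longrightarrow\; t = \frac{\ln 10}{0.43} \approx 5.35$$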

The 90th percentile of 5.35 hours corresponds to the time when 90% of the drug has left the body.
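The two percentile calculations in Example 12 can be checked numerically. The following is a minimal sketch, assuming the CDF F(t) = 1 − e^−0.43t for t ≥ 0; the function names `cdf`, `percentile`, and `percentile_bisect` are illustrative and not part of the text. Bisection is included only as one generic way to solve F(t) = p when no closed form is available.

```python
import math

RATE = 0.43  # elimination constant per hour (assumed from the lidocaine model)

def cdf(t, rate=RATE):
    """Fraction of drug that has left the body by time t (hours)."""
    return 1.0 - math.exp(-rate * t) if t > 0 else 0.0

def percentile(p, rate=RATE):
    """Closed-form solution of F(t) = p for 0 < p < 1."""
    return -math.log(1.0 - p) / rate

def percentile_bisect(p, lo=0.0, hi=100.0, tol=1e-10):
    """Solve F(t) = p by bisection, assuming F(lo) <= p <= F(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if cdf(mid) < p:
            lo = mid   # the percentile lies to the right of mid
        else:
            hi = mid   # the percentile lies at or to the left of mid
    return (lo + hi) / 2.0

median = percentile(0.5)
p90 = percentile(0.9)
print(round(median, 2), round(p90, 2))
```

Because this CDF is strictly increasing for t > 0, each percentile is unique, which is why bisection and the closed form agree.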

Example 13 Birth time quartiles

In Example 9, we presented

$$f(x) = \begin{cases} \dfrac{1}{365} & \text{if } 0 \le x \le 365 \\ 0 & \text{otherwise} \end{cases}$$

as the PDF of the birth time X (in days) of a randomly chosen individual in a population where births are equally likely at any time of the year. In such a population, compute the birthdays of individuals falling on the median and first and third quartiles of f.

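Solution The median and first and third quartiles of f are the values of c that solve, respectively,

$$\int_0^c \frac{1}{365}\,dx = 0.5, \qquad \int_0^c \frac{1}{365}\,dx = 0.25, \qquad \int_0^c \frac{1}{365}\,dx = 0.75,$$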

which are c = 182.5, 91.25, and 273.75. For a non-leap year, these fall on the 183rd, 92nd, and 274th days of the year, which correspond to the second of July, the second of April, and the first of October.

Example 14 An overweight baby

A medical practitioner examines a young boy of 24 months and finds that the child is 83 cm tall and weighs 14.1 kg. Use the percentile charts of the Centers for Disease Control and Prevention (CDC) to decide if the boy is heavier than normal for his height and how his height and weight relate to boys of other ages.

Solution Reading off the CDC percentile charts presented in Figure 7.8 for length and weight of boys aged 0 to 24 months, we see that 83 cm corresponds to the 10th percentile for height of a 24-month-old boy, while 14.1 kg corresponds to the 90th percentile for weight. Thus, the boy is well above the median weight for his height. His height equals the median for boys aged 19 months; the median age for his weight is not on the 0 to 24 month chart, but his weight is about the 98th percentile for a 19-month-old boy.

images

Figure 7.8 CDC length and weight percentile charts for boys aged 0 to 24 months

PROBLEM SET 7.1

Level 1 DRILL PROBLEMS

In Problems 1 to 4 construct a histogram for the given data sets.

images

5. If X denotes a score in Problem 1, find

P(50 ≤ X ≤ 59)
P(50 ≤ X < 69)
P(70 ≤ X ≤ 89)
P(90 ≤ X < 100)

6. If X denotes a score in Problem 2, find

P(1 ≤ X ≤ 10)
P(1 ≤ X < 21)
P(31 ≤ X < 41)
P(51 ≤ X ≤ 60)

7. If X denotes a score in Problem 3, find

P(X < 71)
P(1 ≤ X < 141)

8. If X denotes a score in Problem 4, find

P(X < 500)
P(X ≥ 500)

In Problems 9 to 12, find a constant a so that the given function is a PDF and find the values of x that correspond to the median, the first quartile, and the third quartile.

9. f(x) = 2ax, 0 ≤ x ≤ 2

10. f(x) = 5ax, 1 ≤ x ≤ 5

11. f(x) = ax², 0 ≤ x ≤ 1

12. f(x) = 3ax², 1 ≤ x ≤ 4

In Problems 13 to 16, use the CDC charts in Figure 7.8 to estimate the length for age and weight for age percentiles for the following boys of age a months, w kg, and l cm.

13. a = 21, w = 12.5, l = 85 cm

14. a = 20, w = 12.2, l = 87 cm

15. a = 15, w = 13.4, l = 87 cm

16. a = 18, w = 10.9, l = 83 cm

images

20. If $\frac{dy}{dt} = -0.15y$, y(0) = y₀

Find y(t).
If we define F(t) = 0 for t ≤ 0 and F(t) = 1 − y(t)/y₀, verify that F(t) is a CDF.
If X is a random variable whose CDF is given by F(t), find P(0 < X ≤ 1).

21. Consider the function g(x) whose graph is shown below:

images

For what value of c is f(x) = cg(x) a PDF?
For a continuous random variable with PDF f(x), find P(2 ≤ X ≤ 3).

22. Consider the function g(x) whose graph is shown below:

images

For what value of c is f(x) = cg(x) a PDF?
For a continuous random variable with PDF f(x), find P(3 ≤ X ≤ 12).

23. For Problem 21, find an expression for the CDF and plot it.

24. For Problem 22, find an expression for the CDF and plot it.

25. Consider

images

Verify that F(x) is a CDF.
Assume X is a continuous random variable with CDF F(x). Find P(0 ≤ X ≤ 1), P(2 ≤ X ≤ 10).

26. Consider

images

Verify that F(x) is a CDF.
Assume X is a continuous random variable with CDF F(x). Find P(0 ≤ X ≤ 1) and P(2 ≤ X ≤ 10).

27. The distribution of fathers' heights from Example 5 is approximated by the PDF

Use numerical integration to approximate:

Fraction of fathers whose heights are between 5 and 6 feet
Fraction of fathers whose heights are greater than 7 feet or less than 5 feet

28. The distribution of sons' heights from Example 5 is approximated by the PDF

Use numerical integration to approximate:

Fraction of sons whose heights are between 5 and 6 feet
Fraction of sons whose heights are greater than 7 feet or less than 5 feet

Level 2 APPLIED AND THEORY PROBLEMS

29. The following distribution table gives the distribution of cholesterol level for 6,000 children, 4 to 19 years old. Cholesterol level is measured in milligrams per 100 milliliters of blood. The class intervals include the left endpoint but exclude the right endpoint.

images

Sketch the histogram for the given intervals.
Find the probability that a randomly selected child in this group has a cholesterol level of ≥ 140.
Find the probability that a randomly selected child in this group has a cholesterol level between 100 and 220.

30. A study of grand juries compared the demographic characteristics of jurors with those of the general population to see if the jury panels were representative. Here are the results for age. Only persons 21 and older are considered; countywide age distribution is known from public health department data.

images

Sketch the histogram for countywide percentage and the number of jurors. What do you notice? For simplicity, assume that the last bin is [60, 70).

31. According to the US Census Bureau's International Data Base, these were the life expectancies in 2000 for the following countries:

images

Sketch a histogram with the following bins: less than 50 (plot as if on the interval [45, 50)), [50, 55), [55, 60), [60, 65), [65, 70), [70, 75), [75, 80), and greater than 80 (plot as if on the interval [80, 85)).
Say you are selecting one of the countries at random (i.e., each country is equally likely to be selected). Give the probability of getting a country with a life expectancy of

32. Let f(x) represent the PDF for the weight of a field mouse in Williamsburg, Virginia, where x is measured in grams. Express the following probabilities as integrals:

A randomly chosen field mouse weighs between 20 and 30 grams.
A randomly chosen field mouse weighs less than 40 grams.

33. Let f(x) represent the PDF for the weight of a pigeon in New York City where x is measured in ounces. Express the following probabilities as integrals:

A randomly chosen pigeon does not weigh between 13 and 14 ounces.
A randomly chosen pigeon is in the weight class 12–15 ounces, but does not weigh between 13 and 14 ounces.

34. If you are really bad at darts, then the PDF for the distance x (in inches) that your dart is from the center of a 12-inch dartboard may be given by

images

Verify that f(x) is a PDF.
Compute the probability that your dart is more than 9 inches from the center.
Compute the probability that your dart is less than 3 inches from the center.

Note: This PDF assumes that you are equally likely to hit any point on the dartboard (a fact that you are asked to verify in Problem 25 of Problem Set 7.3).

35. Suppose you are a champion dart player with a PDF for the distance x (in inches) that your dart is from the center of a 12-inch dartboard given by

images

Verify that f(x) is a PDF.
Compute the probability that your dart is more than 1 inch from the center.
Compute the probability that your dart is between 1/4 and 1/2 inch from the center.

36. According to Thomson and colleagues,* the elimination constant for lidocaine for patients with hepatic impairment is 0.12 per hour. Hence, for a patient who has received an initial dose of y₀ mg, the lidocaine level y(t) in the body can be modeled by the differential equation

$$\frac{dy}{dt} = -0.12y, \qquad y(0) = y_0$$

Solve for y(t).
Write an expression, call it F(t), that represents the fraction of drug that has left the body by time t ≥ 0.
If F(t) = 0 for t ≤ 0, verify that F(t) is a CDF.
What is the probability that a randomly chosen molecule of drug leaves the body in the first two hours?
What is the probability that a randomly chosen molecule of drug leaves the body between the second hour and fourth hour?

37. Consider a drug that has an elimination rate constant of c. If y is the amount of drug in the body and there is no further input of drug into the body, we can model the drug dynamics by

$$\frac{dy}{dt} = -cy, \qquad y(0) = y_0$$

where t denotes time in hours and y₀ is the initial amount of drug in the body.

Solve for y(t).
Write an expression, call it F(t), that represents the fraction of drug that has left the body by time t ≥ 0.
If F(t) = 0 for t ≤ 0, verify that F(t) is a CDF.
Find an expression that allows one to calculate for any value c > 0 and times 0 < r < s what proportion of the drug is removed on the interval [r, s].

38. In the early 1960s, Robert MacArthur of Princeton University and Edward O. Wilson of Harvard University developed a theory to explain why big islands generally have more species than smaller islands, and why the numbers of species on islands of similar sizes are inversely related to island distance from continental landmasses. They argued that the number of species on an island represents a dynamic balance between the rate at which new species arrive at that island and the rate at which species on the island go extinct. The simplest model of island biodiversity assumes that the rate of change of the number N of species is given by a constant rate I of immigration of new species from the mainland and that species on the island go extinct at a rate proportional to N.

If the proportionality constant is c, then we obtain

$$\frac{dN}{dt} = I - cN$$

where t denotes time in years. To know what the species immigration rate I might be for a particular island, we need to know the number of species on the mainland that serve as a source for the colonization process. On the other hand, the extinction rate c on each island is a characteristic of the island alone, rather than of the surrounding mainlands and the distance of the island to these mainlands. To understand the likelihood a species already on the island has gone extinct by time t, we can ignore the immigration process (i.e., keep track only of the species currently on the island) and consider the model

$$\frac{dN}{dt} = -cN$$

Solve for N(t).
Write an expression, F(t), for the fraction of species that has gone extinct by year t.
Donald Levin, a botany professor at the University of Texas, Austin, was quoted in Science Daily: “Roughly 20 of the 297 known mussel and clam species and 40 of about 950 fishes have perished in North America in the last century.”* Use these data to approximate the extinction constants c for mussel and clam species and for fish species.
Using your estimates from part c, estimate the probability that a specific clam or mussel species goes extinct in the next decade.
Using your estimates from part c, estimate the probability that a specific fish species goes extinct in the next decade.

7.2 Improper Integrals

As we saw in the previous section in defining cumulative distribution functions, we often encounter integrals in which the limits of integration are not finite. These improper integrals come in three varieties:

$$\int_a^{\infty} f(x)\,dx, \qquad \int_{-\infty}^{a} f(x)\,dx, \qquad \int_{-\infty}^{\infty} f(x)\,dx,$$

where a is a real number. In this section, we discuss when these integrals are well defined.

One-sided improper integrals

What is the area under the graph of y = e^−x for x ≥ 0? At first, one might reason this way: Since the region under the curve goes on forever, the area is infinite. To evaluate this statement, define A(t) to be the area under e^−x from x = 0 to x = t, as illustrated in Figure 7.9. In other words,

$$A(t) = \int_0^t e^{-x}\,dx$$

images

Figure 7.9 Area under e^−x from x = 0 to x = t

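Computing A(t) yields

$$A(t) = \int_0^t e^{-x}\,dx = \left[-e^{-x}\right]_0^t = 1 - e^{-t}.$$

A(t) is always less than 1 for any t > 0. Therefore, the area under e^−x for x ≥ 0 cannot be infinite. In fact, it is natural to define the area under e^−x for x ≥ 0 to be

$$\int_0^{\infty} e^{-x}\,dx = \lim_{t \to \infty} A(t) = \lim_{t \to \infty} \left(1 - e^{-t}\right) = 1.$$

Thus, even though the curve is of infinite length, the area under this curve is finite. Our first guess was wrong!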
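The limiting behavior of A(t) is easy to observe numerically. The sketch below compares the closed form 1 − e^−t with a trapezoid-rule approximation of the same integral; the function names and the choice of 10,000 subintervals are assumptions made for illustration.

```python
import math

def area_exact(t):
    """A(t): area under e^(-x) from x = 0 to x = t, in closed form."""
    return 1.0 - math.exp(-t)

def area_trapezoid(t, n=10_000):
    """Trapezoid-rule approximation of the same area with n subintervals."""
    h = t / n
    total = 0.5 * (math.exp(0.0) + math.exp(-t))  # endpoints get weight 1/2
    for i in range(1, n):
        total += math.exp(-i * h)                 # interior points get weight 1
    return total * h

for t in (1, 5, 10, 20):
    print(t, area_exact(t), area_trapezoid(t))
```

As t grows, both columns approach 1 without ever exceeding it, which is the behavior the limit describes.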

Inspired by this example, we propose the following definition.

Convergent and Divergent Improper Integrals

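For any given real number a define

$$\int_a^{\infty} f(x)\,dx = \lim_{t \to \infty} \int_a^t f(x)\,dx.$$

When the limit exists and is finite, $\int_a^{\infty} f(x)\,dx$ is convergent; otherwise it is divergent. Similarly, define

$$\int_{-\infty}^{a} f(x)\,dx = \lim_{t \to -\infty} \int_t^a f(x)\,dx.$$

When the limit exists and is finite, $\int_{-\infty}^{a} f(x)\,dx$ is convergent; otherwise it is divergent.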

Example 1 Convergent versus divergent

Determine whether the following integrals are convergent or divergent. If convergent, determine their value.

Solution

For any t ≥ 2,

Taking the limit yields

Hence, is convergent and equals .
For any t ≥ 2,

Since

is divergent.
For any t ≥ 0,

$$\int_0^t \sin x\,dx = 1 - \cos t.$$

Since

$$\lim_{t \to \infty} (1 - \cos t)$$

does not exist (i.e., the values oscillate between 0 and 2), $\int_0^{\infty} \sin x\,dx$ is divergent.

Example 1 shows that while the curves and are similar (i.e., both decreasing to zero as x goes to ∞), the areas under these curves are infinitely different: encloses an infinite area for x ≥ 2, while encloses a finite area for x ≥ 2. Figure 7.10 shows that decreases to zero much slower than . This observation suggests the following question: How fast does the function have to approach zero to ensure convergence? The following example formulates a precise answer to this question for p-integrals.

images

Figure 7.10 Area under curves that go to zero at different rates

Example 2 p-integrals

Determine for which p > 0 the integral

$$\int_1^{\infty} \frac{dx}{x^p}$$

is convergent and for which it is divergent.

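Solution Example 1 dealt with the case of p = 1 and found the integral to be divergent. Assume that p ≠ 1, in which case

$$\int_1^t \frac{dx}{x^p} = \frac{t^{1-p} - 1}{1 - p}.$$

When p > 1, it follows that $t^{1-p}$ has a negative exponent and

$$\lim_{t \to \infty} \frac{t^{1-p} - 1}{1 - p} = \frac{1}{p - 1}.$$

Hence, $\int_1^{\infty} \frac{dx}{x^p}$ is convergent if p > 1. When p < 1, it follows that $t^{1-p}$ has a positive exponent and

$$\lim_{t \to \infty} \frac{t^{1-p} - 1}{1 - p} = \infty.$$

Hence, $\int_1^{\infty} \frac{dx}{x^p}$ is divergent if p < 1.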

Example 2 illustrates that convergence depends subtly on the speed at which f(x) approaches zero as x approaches ∞. For instance, while $1/x^{1.0001}$ seems to go to zero only slightly faster than $1/x$, the integral $\int_1^{\infty} \frac{dx}{x^{1.0001}}$ is convergent (i.e., p = 1.0001 > 1) while the integral $\int_1^{\infty} \frac{dx}{x}$ is divergent. Although this might appear shocking at first, notice that the former integral converges to a very large value: $\int_1^{\infty} \frac{dx}{x^{1.0001}} = \frac{1}{0.0001} = 10{,}000$. More generally, as p > 1 approaches 1 from above, the area under $1/x^p$ approaches ∞ because $\lim_{p \to 1^+} \frac{1}{p-1} = \infty$.
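The dependence on p can be explored with a few lines of code. This is a minimal sketch, assuming the p-integral is taken over [1, ∞) so that the partial integral from 1 to t equals (t^(1−p) − 1)/(1 − p); the function names are illustrative.

```python
def p_integral(p, t):
    """Integral of x^(-p) from x = 1 to x = t, valid for p != 1."""
    return (t ** (1.0 - p) - 1.0) / (1.0 - p)

def p_integral_limit(p):
    """Value of the improper integral over [1, infinity) when p > 1."""
    if p <= 1:
        raise ValueError("the p-integral diverges for p <= 1")
    return 1.0 / (p - 1.0)

# Partial integrals settle down for p = 2 but keep growing for p = 0.5.
for t in (10.0, 1e3, 1e6):
    print(t, p_integral(2.0, t), p_integral(0.5, t))

# The borderline-looking case p = 1.0001 converges, but to a large value.
print(p_integral_limit(1.0001))
```

Note how slowly the case p = 1.0001 approaches its limit: the partial integral up to t = 10⁶ is still far below 10,000, so numerical evidence alone can be misleading near p = 1.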

The p-integrals are related to the Pareto distribution, named after the Italian economist Vilfredo Pareto (1848–1923). Pareto originally used this power distribution to describe the allocation of wealth among individuals; it also has been used to describe social, scientific, geophysical, and many other types of observable phenomena. In the next example, we examine the Pareto distribution and its use to describe the frequency of individuals visiting websites.

Example 3 The Pareto Distribution

The PDF for the Pareto distribution is of the form

$$f(x) = \begin{cases} \dfrac{C}{x^p} & \text{if } x \ge 1 \\ 0 & \text{otherwise} \end{cases}$$

where p > 1 and C is a constant that you will determine.

Determine for what value of C, f(x) is a PDF. Your answer will depend on p > 1.
Find the CDF for the Pareto distribution.
A scientist at Hewlett Packard's Information Dynamics Lab used the Pareto distribution to describe how many AOL users visited various websites on one day in 1997 (which is “ancient time” for the Internet). The data are shown in Figure 7.11 (from L. A. Adamic and B. A. Huberman (2002) Zipf's law and the Internet. Glottometrics 3: 143–150) and conform to a Pareto distribution with p = 2.07. Estimate the fraction of websites that received visits from ten or fewer AOL users.

Figure 7.11 Number of websites visited by different numbers of AOL users.

Solution

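We need to compute the area under f(x). Since f(x) = 0 for x < 1, the area under f is given by

$$\int_{-\infty}^{\infty} f(x)\,dx = \int_1^{\infty} \frac{C}{x^p}\,dx.$$

Since p > 1, Example 2 implies that

$$\int_1^{\infty} \frac{C}{x^p}\,dx = \frac{C}{p - 1}.$$

Hence, in order for f to be a PDF, we need that C = p − 1 and we get

$$f(x) = \frac{p - 1}{x^p}$$

for x ≥ 1.
The CDF is given by

$$F(x) = \int_{-\infty}^{x} f(s)\,ds.$$

Since f(x) = 0 for x < 1, we get F(x) = 0 for x < 1 and for x ≥ 1,

$$F(x) = \int_1^x \frac{p - 1}{s^p}\,ds = 1 - x^{1-p}.$$

Thus, the CDF is given by

$$F(x) = \begin{cases} 0 & \text{if } x < 1 \\ 1 - x^{1-p} & \text{if } x \ge 1 \end{cases}$$

If p = 2.07, then the fraction of websites visited by ≤ 10 AOL users can be approximated by

$$F(10) = 1 - 10^{-1.07} \approx 0.915.$$

Hence, approximately 91.5% of the websites were visited by ten or fewer users.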
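The computation in part c can be reproduced with a short sketch. It assumes the normalized Pareto PDF (p − 1)/x^p on x ≥ 1 and the CDF 1 − x^(1−p); the function names and the midpoint-rule check are illustrative additions.

```python
def pareto_pdf(x, p):
    """Pareto density with parameter p > 1: (p - 1) / x**p for x >= 1."""
    if p <= 1:
        raise ValueError("p must exceed 1 for the density to be normalizable")
    return (p - 1.0) * x ** (-p) if x >= 1 else 0.0

def pareto_cdf(x, p):
    """Pareto CDF: 1 - x**(1 - p) for x >= 1 and 0 otherwise."""
    if p <= 1:
        raise ValueError("p must exceed 1")
    return 1.0 - x ** (1.0 - p) if x >= 1 else 0.0

def midpoint_integral(func, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

fraction = pareto_cdf(10, 2.07)
numeric = midpoint_integral(lambda x: pareto_pdf(x, 2.07), 1.0, 10.0)
print(fraction, numeric)
```

The two printed values should agree closely, since the CDF at 10 is by definition the area under the PDF from 1 to 10.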

Example 3 leads us to the following definition.

Pareto PDF and CDF

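The PDF of the Pareto distribution for parameter p > 1 is defined to be

$$f(x) = \begin{cases} \dfrac{p - 1}{x^p} & \text{if } x \ge 1 \\ 0 & \text{otherwise} \end{cases}$$

Its CDF is

$$F(x) = \begin{cases} 1 - x^{1-p} & \text{if } x \ge 1 \\ 0 & \text{otherwise} \end{cases}$$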

Example 3 also illustrates how we go from PDFs to CDFs by integrating over the interval (−∞, x). Conversely, suppose that you are given a CDF for a continuous random variable. How do you find the associated PDF? The following theorem says all you have to do is differentiate. Hence, integrate to go from PDF to CDF and differentiate to go from CDF to PDF.

Theorem 7.1 Fundamental theorem of PDFs

Suppose that f is a probability density function. Then the CDF

$$F(x) = \int_{-\infty}^{x} f(s)\,ds$$

satisfies

$$F'(x) = f(x)$$

at every point x where f is continuous.

Proof. The proof of this theorem follows from the fundamental theorem of calculus presented in Section 5.4 by letting a → −∞ in the statement of that theorem.

Example 4 Exponential distribution revisited

Recall that in Example 10 of Section 7.1 we considered a model of the decay of lidocaine in the human body. We found that the fraction of molecules of this drug that have been eliminated by t ≥ 0 hours is given by

$$F(t) = 1 - e^{-0.43t}$$

and F(t) = 0 for t ≤ 0.

If F(t) represents the CDF of the random variable of how long a randomly chosen lidocaine molecule spends in the body, then find the corresponding PDF of this random variable.
Use the PDF in part a to find the probability that a randomly chosen molecule of this drug is eliminated in the first two hours. Compare your answer to what was found in Example 10 of Section 7.1.

Solution

The derivative of F(t) for t > 0 is F′(t) = 0.43e^−0.43t. The derivative of F(t) for t < 0 is F′(t) = 0. Hence, the PDF is given by

$$f(t) = \begin{cases} 0.43\,e^{-0.43t} & \text{if } t > 0 \\ 0 & \text{if } t \le 0 \end{cases}$$

Let X be the random variable whose CDF is given by F(t). X corresponds to the time at which a randomly chosen molecule of drug gets eliminated. Using the PDF, we obtain

$$P(0 \le X \le 2) = \int_0^2 0.43\,e^{-0.43t}\,dt = 1 - e^{-0.86} \approx 0.58.$$

Hence, there is a 58% chance that a randomly chosen molecule of drug gets eliminated in the first two hours. This is the same answer we found in Example 10 of Section 7.1.
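Both directions of Theorem 7.1 can be checked numerically for this example. The sketch below assumes F(t) = 1 − e^−0.43t for t > 0; the central-difference step size, the midpoint rule, and all function names are illustrative choices rather than anything prescribed by the text.

```python
import math

RATE = 0.43  # elimination constant per hour

def cdf(t):
    """F(t): fraction of molecules eliminated by time t."""
    return 1.0 - math.exp(-RATE * t) if t > 0 else 0.0

def pdf(t):
    """f(t) = F'(t) for t > 0, and 0 otherwise."""
    return RATE * math.exp(-RATE * t) if t > 0 else 0.0

def central_difference(func, t, h=1e-5):
    """Numerical derivative of func at t."""
    return (func(t + h) - func(t - h)) / (2.0 * h)

def midpoint_integral(func, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

# Differentiating the CDF recovers the PDF.
print(central_difference(cdf, 1.5), pdf(1.5))

# Integrating the PDF over [0, 2] recovers F(2) - F(0).
prob = midpoint_integral(pdf, 0.0, 2.0)
print(prob, cdf(2.0) - cdf(0.0))
```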

Convergence tests

As we have seen, the integral of a function cannot always be expressed in terms of elementary functions (e.g., $f(x) = e^{-x^2}$). One way to get around this issue is to numerically estimate $\int_a^b f(x)\,dx$. However, if the upper limit is +∞, then numerical estimates only make sense if the integral converges because, informally,

$$\int_a^{\infty} f(x)\,dx \approx \int_a^b f(x)\,dx$$

if the left-hand side is convergent and b is sufficiently large. Consequently, it is important to have methods that determine whether an improper integral is convergent or not. A powerful yet simple test for convergence is the comparison test. The basic idea of this test is to compare the integral in question (the one for which convergence is not understood) to an integral for which convergence is understood.

Theorem 7.2 Comparison test

Suppose that f(x) ≥ g(x) ≥ 0 for x ≥ a. Then:

Convergence If $\int_a^{\infty} f(x)\,dx$ is convergent, then $\int_a^{\infty} g(x)\,dx$ is convergent.

Divergence If $\int_a^{\infty} g(x)\,dx$ is divergent, then $\int_a^{\infty} f(x)\,dx$ is divergent.

The idea behind this theorem, as illustrated in Figure 7.12, is intuitive. If the area under f is finite and f ≥ g ≥ 0, then the area under g is finite. Conversely, if the area under g is infinite, then the area under f is infinite.

images

Figure 7.12 Comparing areas under f(x) (in blue) and g(x) (in red) when f(x) ≥ g(x) ≥ 0
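The convergence half of the comparison test can be illustrated numerically. In the sketch below, the pair f(x) = e^−x and g(x) = e^−x² on x ≥ 1 is chosen purely for illustration, and the midpoint rule and function names are assumptions. Since 0 ≤ g ≤ f on this range and the area under f from 1 to ∞ equals e^−1, every partial area under g must stay below e^−1.

```python
import math

def midpoint_integral(func, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

def f(x):
    """Larger function, whose integral over [1, infinity) converges to e^(-1)."""
    return math.exp(-x)

def g(x):
    """Smaller function: 0 <= g(x) <= f(x) whenever x >= 1."""
    return math.exp(-x * x)

bound = math.exp(-1.0)  # exact area under f on [1, infinity)
partial_areas = []
for t in (2.0, 5.0, 10.0, 20.0):
    area_g = midpoint_integral(g, 1.0, t)
    area_f = midpoint_integral(f, 1.0, t)
    partial_areas.append((t, area_g, area_f))
    print(t, area_g, area_f, bound)
```

The partial areas under g increase with t but are trapped below a finite bound, which is exactly why the improper integral of g must converge.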

Example 5 Using comparison test

Use the comparison test to determine whether the following integrals are convergent or divergent.

images

Solution

Since 1 ≤ 2 + sin x ≤ 3 for all x,

for all x > 0. Moreover, since is convergent (i.e., a p-integral with p > 1), the comparison test implies that dx is convergent.
Since x ≥ for all x ≥ 1, we have x + ≤ 2x for x ≥ 1. Hence,

for x ≥ 1. Since is divergent (i.e., a p-integral with p = 1), the comparison test implies that is divergent.
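Since x² ≥ x for all x ≥ 1, $e^{-x^2} \le e^{-x}$ for all x ≥ 1. Since $\int_1^{\infty} e^{-x}\,dx$ is convergent, the comparison test implies that $\int_1^{\infty} e^{-x^2}\,dx$ is convergent. Moreover, as $e^{-x^2} \le 1$, we have that $\int_0^1 e^{-x^2}\,dx$ is finite. Hence, $\int_0^{\infty} e^{-x^2}\,dx$ is convergent.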

Improper integrals can lead to maddening paradoxes, as the following example illustrates.

Example 6 Torricelli's trumpet (or Gabriel's horn)

Consider the surface created by rotating the curve y = 1/x for x ≥ 1 about the x axis as illustrated in Figure 7.13.

This surface is sometimes called Torricelli's trumpet, named after the Italian mathematician Evangelista Torricelli (1608–1647). It can be shown that the volume of this infinite trumpet is given by the expression

$$\pi \int_1^{\infty} \frac{1}{x^2}\,dx$$

and the surface area is given by the expression

$$2\pi \int_1^{\infty} \frac{1}{x}\sqrt{1 + \frac{1}{x^4}}\,dx$$

images

Figure 7.13 y = 1/x for x ≥ 1 rotated about the x axis

images

Determine whether the surface area and volume are convergent or divergent.
Discuss how much paint it would take to paint the surface versus how much paint it would take to fill the trumpet.

Solution

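Since the volume is determined by a p-integral with p = 2 > 1, we can say the volume is finite. In fact,

$$\pi \int_1^{\infty} \frac{1}{x^2}\,dx = \pi.$$

For the surface area, we have $\frac{2\pi}{x}\sqrt{1 + \frac{1}{x^4}} \ge \frac{2\pi}{x}$ for x > 0. Since $\int_1^{\infty} \frac{2\pi}{x}\,dx$ is a p-integral with p = 1, the comparison test implies that $2\pi \int_1^{\infty} \frac{1}{x}\sqrt{1 + \frac{1}{x^4}}\,dx$ is divergent. Therefore, the surface area is infinite!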
As the surface area is infinite, it would take an infinite amount of paint to paint the surface. On the other hand, if we plugged up the hole at the end of trumpet, then we could fill the trumpet with a finite amount of paint. After being poured out the same paint would cover the interior surface of the trumpet. How can this be? Paraphrasing the words of Thomas Hobbes: To make sense of this conundrum, one needs to be mad rather than a geometrician or a logician.
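The contrast between the two quantities shows up clearly when the trumpet is cut off at x = t. The following sketch assumes the standard formulas for the curve y = 1/x: volume π∫(1/x²)dx and surface area 2π∫(1/x)√(1 + 1/x⁴)dx, both taken from 1 to t. The substitution x = e^u used to evaluate the surface integral, and the function names, are illustrative choices.

```python
import math

def volume_up_to(t):
    """Volume of the trumpet between x = 1 and x = t: pi * (1 - 1/t)."""
    return math.pi * (1.0 - 1.0 / t)

def surface_up_to(t, n=20_000):
    """Surface area between x = 1 and x = t.

    With x = e^u, the integral of (1/x) * sqrt(1 + 1/x^4) dx from 1 to t
    becomes the integral of sqrt(1 + e^(-4u)) du from 0 to ln(t), which is
    approximated here by the midpoint rule.
    """
    upper = math.log(t)
    h = upper / n
    total = sum(math.sqrt(1.0 + math.exp(-4.0 * (i + 0.5) * h)) for i in range(n))
    return 2.0 * math.pi * h * total

for t in (10.0, 1e3, 1e6):
    print(t, volume_up_to(t), surface_up_to(t), 2.0 * math.pi * math.log(t))
```

The volume column levels off just below π, while the surface area column keeps pace with 2π ln t and therefore grows without bound.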

Two-sided improper integrals

We conclude this section by defining $\int_{-\infty}^{\infty} f(x)\,dx$.

A first attempt at this definition might be

images

Unfortunately this definition is flawed, as the following example illustrates.

Example 7 When definitions go wrong

Compute the integral using the definition

images

for any value of k and discuss any anomalies that arise.

Solution

images

Hence, if we believe that is well defined, we must conclude that equals 0 (when k = 0), ∞ (when k > 0), and −∞ (when k < 0) all at the same time! Since this is clearly impossible, we have shown that our naïve definition of a doubly infinite integral is flawed.

To avoid the faulty path taken in Example 7, we need to avoid simultaneously confounding two computations involving ∞ (one in the positive direction and one in the negative direction). We avoid this by separating out the two computations, as presented in the following property box:

Doubly Infinite Integrals

$\int_{-\infty}^{\infty} f(x)\,dx$ is convergent if these limits exist:

images

otherwise, $\int_{-\infty}^{\infty} f(x)\,dx$ is divergent. If convergent, we define

images

In Problem Set 7.2, you will be asked to show that for convergent integrals

images

for any a. Hence, for convergent integrals, we cannot make the infinite from nothing as in Example 7.

Example 8 Convergence of doubly improper integrals

Determine whether the following integrals are convergent or divergent.

The signed area for each of these curves is shown in Figure 7.14.

images

Figure 7.14 Signed areas for Example 8

Solution

Since is divergent. Hence, is divergent.
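To compute $\int x e^{-x^2}\,dx$, we use the substitution u = x², du = 2x dx. Then,

$$\int x e^{-x^2}\,dx = \frac{1}{2}\int e^{-u}\,du = -\frac{1}{2} e^{-x^2} + C.$$

Therefore,

$$\int_0^{\infty} x e^{-x^2}\,dx = \lim_{t \to \infty} \left(\frac{1}{2} - \frac{1}{2} e^{-t^2}\right) = \frac{1}{2}.$$

Similarly,

$$\int_{-\infty}^{0} x e^{-x^2}\,dx = \lim_{t \to -\infty} \left(\frac{1}{2} e^{-t^2} - \frac{1}{2}\right) = -\frac{1}{2}.$$

Hence,

$$\int_{-\infty}^{\infty} x e^{-x^2}\,dx = \frac{1}{2} - \frac{1}{2} = 0.$$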

Many PDFs have bi-infinite tails (i.e., f(x) > 0 for all x in (−∞, ∞)). One such example is the Laplace distribution.

Example 9 The Laplace distribution

An important distribution discovered by the French mathematician and astronomer Pierre-Simon Laplace (1749–1827) is the double exponential or Laplace distribution whose probability density function is given by f(x) = ae^−b|x| where b > 0 is a parameter and a > 0 is a constant that you will determine. As the Laplace distribution describes the random motion of a particle in a liquid with a constant settling rate, it has been used to describe dispersal of marine larvae along a coastline. Let X denote the distance (say, in kilometers) a larva has traveled northward from its birthplace. If X is negative, then the larva has traveled south.

Determine what a needs to be to ensure that f is a probability density function.
Suppose for one marine species b = 1. Determine the probability that a randomly chosen larva from this population travels more than 1 km north from its birthplace.
Suppose for another marine species b = 2. Determine the probability that a randomly chosen larva from this population travels more than 1 km north from its birthplace.
In light of your answers to parts b and c, provide an interpretation of the parameter b.

Solution

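We need that $\int_{-\infty}^{\infty} a e^{-b|x|}\,dx = 1$. Computing the first half of this improper integral leads to

$$\int_0^{\infty} a e^{-bx}\,dx = \lim_{t \to \infty} \frac{a}{b}\left(1 - e^{-bt}\right) = \frac{a}{b}.$$

Computing the second half of this improper integral leads to

$$\int_{-\infty}^{0} a e^{bx}\,dx = \lim_{t \to -\infty} \frac{a}{b}\left(1 - e^{bt}\right) = \frac{a}{b}.$$

Hence, $\int_{-\infty}^{\infty} a e^{-b|x|}\,dx = \frac{2a}{b}$. Since we need that $\frac{2a}{b} = 1$, we obtain $a = \frac{b}{2}$.
If b = 1 (the units of b are km⁻¹), then the fraction of larvae that travel at least 1 km north is given by

$$\int_1^{\infty} \frac{1}{2} e^{-x}\,dx = \frac{1}{2e} \approx 0.184.$$

Hence, there is approximately an 18% chance that a randomly chosen larva travels at least 1 km north. The area corresponding to this integral is illustrated in the following figure:
If b = 2, then the fraction of larvae that travel at least 1 km north is given by

$$\int_1^{\infty} e^{-2x}\,dx = \frac{1}{2e^2} \approx 0.068.$$

Hence, there is approximately a 7% chance that a randomly chosen larva travels at least 1 km north.
The larger b is, the more likely it is that a randomly chosen larva travels a shorter distance before settling. In fact, redoing parts b and c with an arbitrary b, we find that the chance of a randomly chosen larva moving at least 1 km north is $\frac{1}{2}e^{-b}$.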
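The calculations in Example 9 can be packaged into a small sketch. It assumes the normalized Laplace density (b/2)e^(−b|x|), so that the chance of traveling at least d km north is e^(−bd)/2; the function names and the midpoint-rule check of the total area are illustrative.

```python
import math

def laplace_pdf(x, b):
    """Laplace density with parameter b > 0: (b / 2) * exp(-b * |x|)."""
    return 0.5 * b * math.exp(-b * abs(x))

def prob_north_at_least(d, b):
    """P(X >= d) for d >= 0, i.e. the area under the density to the right of d."""
    if d < 0:
        raise ValueError("this closed form assumes d >= 0")
    return 0.5 * math.exp(-b * d)

def midpoint_integral(func, a, b_upper, n=80_000):
    """Midpoint-rule approximation of the integral of func over [a, b_upper]."""
    h = (b_upper - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

# Parts b and c: chance of traveling at least 1 km north for b = 1 and b = 2.
print(prob_north_at_least(1.0, 1.0), prob_north_at_least(1.0, 2.0))

# Sanity check for part a: the density should enclose a total area of 1.
total_area = midpoint_integral(lambda x: laplace_pdf(x, 1.0), -40.0, 40.0)
print(total_area)
```

Increasing b shrinks the tail probability for every fixed distance d, which matches the interpretation of b as a settling rate.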

Given the importance of the Laplace distribution, we conclude by providing a general definition for this distribution.

Laplacian PDF and CDF

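The Laplacian PDF on the interval (−∞, ∞), with distance parameter b > 0, is given by

$$f(x) = \frac{b}{2} e^{-b|x|}.$$

The corresponding Laplacian CDF is

$$F(x) = \begin{cases} \dfrac{1}{2} e^{bx} & \text{if } x < 0 \\ 1 - \dfrac{1}{2} e^{-bx} & \text{if } x \ge 0 \end{cases}$$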

The derivation of the Laplace CDF is left to Problem 32 in Problem Set 7.2.

PROBLEM SET 7.2

Level 1 DRILL PROBLEMS

Determine whether the integrals in Problems 1 to 10 are convergent or divergent. If convergent, determine their value.

images

For Problems 11 to 14, use the comparison test to determine whether the integrals are convergent or divergent.

images

For Problems 15 to 18, find the CDF of the given PDF.

images

For Problems 19 to 22, find the PDF of the given CDF.

images

In many species of birds, the age at which individuals die is assumed to follow an exponential distribution. In Problems 23 to 28, estimate the age of death corresponding to the first, second (median), and third quartiles of the distribution for the named species and given rate-of-death parameter b (units are per year rates: data from D. B. Botkin and B. S. Miller, “Mortality Rates and Survival of Birds,” American Naturalist, 108(1974): 181–192).

23. Blue tit, b = 0.72

24. Starling, b = 0.52

25. Grey heron, b = 0.31

26. Alpine swift, b = 0.18

27. Yellow-eyed penguin, b = 0.10

28. Royal albatross, b = 0.06 (an average of two different estimates)

Level 2 APPLIED AND THEORY PROBLEMS

29. Estimate the numerical value of $\int_0^{\infty} e^{-x^2}\,dx$ by writing it as the sum of $\int_0^4 e^{-x^2}\,dx$ and $\int_4^{\infty} e^{-x^2}\,dx$. Approximate the first integral using Simpson's rule with n = 8. Show that the second integral is smaller than 0.0000001. Hint: Compare to $\int_4^{\infty} e^{-4x}\,dx$.

30. Determine how large a needs to be to ensure that

Hint: Compare to .

31. If $\int_{-\infty}^{\infty} f(x)\,dx$ is convergent, show that

images

for all a.

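32. Show that

$$F(x) = \begin{cases} \dfrac{1}{2} e^{bx} & \text{if } x < 0 \\ 1 - \dfrac{1}{2} e^{-bx} & \text{if } x \ge 0 \end{cases}$$

is the CDF to PDF $f(x) = \frac{b}{2} e^{-b|x|}$.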

33. Consider a marine species whose larvae disperse northward and southward according to the Laplace distribution f(x) = e^−2|x|. For all the individuals that travel in both directions:

Determine the fraction of individuals that travel north at least 2 km.
Determine the fraction of individuals that travel south at least 2 km.
Determine the fraction of individuals that travel at most 2 km north.

34. Consider a large pine tree situated in a long, narrow valley. Suppose the seeds of this pine tree disperse up and down the valley according to the Laplace distribution $f(x) = \frac{1}{2} e^{-|x|}$.

What proportion of all seeds disperse up the valley at least 1 km?
What proportion of seeds disperse down the valley at least 3 km?
Determine the fraction of seeds that disperse at most 2 km in either direction of the tree.

35. The following graph shows a plot on a log-log scale of the frequency of trips of different durations (in hours) made by albatrosses (data from A. M. Edwards et al., “Revisiting Levy Flight Search Patterns of Wandering Albatrosses, Bumblebees and Deer,” Nature 449 (2007): 1044–1048).

images

From this plot deduce the proportion of trips that lie between 5 and 20 hours.

36. Journal Problem College Mathematics Journal 24 (September 1993): 343. Peter Lindstrom reported that a student handled an ∞/∞ form as follows:

images

What is wrong, if anything, with this student's solution?

37.

images

Evangelista Torricelli studied in Galileo's home near Florence. When Galileo died, Torricelli succeeded his teacher as mathematician and philosopher for the Grand Duke of Tuscany, their friend and patron.

Torricelli described his amazement at discovering an infinitely long solid with a surface that calculates to have an infinite area, but a finite volume: “It may seem incredible that although this solid has an infinite length, nevertheless none of the cylindrical surfaces we considered has an infinite length but all of them are finite.” (See http://curvebank.calstatela.edu/torricelli/torricelli.htm.)

In Example 6, we introduced Torricelli's trumpet and paraphrased Thomas Hobbes, that is, to solve Torricelli's conundrum one needs to be mad rather than a geometrician or a logician. Present an argument that resolves this paradox.

38.

images

Newton and Leibniz have been credited with the discovery of calculus, but much of its development was due to the mathematicians Pierre-Simon Laplace, Joseph-Louis Lagrange, and Karl Gauss.

These three great mathematicians of calculus were contrasted by W. W. Rouse Ball*:

“The great masters of modern analysis are Lagrange, Laplace, and Gauss, who were contemporaries. It is interesting to note the marked contrast in their styles. Lagrange is perfect both in form and matter, he is careful to explain his procedure, and though his arguments are general they are easy to follow. Laplace, on the other hand, explains nothing, is indifferent to style, and, if satisfied that his results are correct, is content to leave them either with no proof or with a faulty one. Gauss is exact and as elegant as Lagrange, but even more difficult to follow than Laplace, for he removes every trace of the analysis by which he reached his results, and strives to give a proof which while rigorous will be as concise and synthetical as possible.”

Pierre-Simon Laplace taught Napoleon Bonaparte, who appointed him for a short time as France's Minister of Interior. Today, Laplace is best known as a major contributor to probability, taking it from gambling to a true branch of mathematics. He was one of the earliest to evaluate the improper integral

images

which plays an important role in the theory of probability. Use the web to find the value of this improper integral and its applications in mathematics, particularly probability theory.

7.3 Mean and Variance

As we have seen in Section 7.1, histograms provide visual summaries of large data sets. Sometimes these histograms are nicely approximated by the graph of a continuous function, the probability density function (PDF). When this occurs, a scientist can describe concisely his or her data set to another scientist by describing the PDF. Many important PDFs can be expressed in terms of families of functions whose parameters provide some basic information about the shape of the PDF. These parameters are often related to the mean and variance of the PDF. The mean is a measurement of the centrality of a data set. The variance, on the other hand, describes the spread of the data set around the mean; the greater the variance, the greater the spread in the data.

Means

There are numerous ways to characterize the central tendency in a set of data, including several ways of defining the concept of a mean, the most important being the arithmetic mean. For a data set {x₁,..., x_n} of real numbers, the arithmetic mean, or average, is given by the sum of the data values divided by the number of data values; namely, $\frac{x_1 + x_2 + \cdots + x_n}{n}$. It is useful to express this well-known expression in a slightly different manner, as illustrated in the following example.

Example 1 Computing the mean

In its May 1995 issue, the journal Condor published a study of competition for nest holes among collared flycatchers, a species of bird. Researchers collected data by periodically inspecting nest boxes located on the island of Gotland in Sweden. The accompanying data give the number of flycatchers breeding at fourteen distinct plots.

Find the arithmetic mean of this data set.
Consider the random variable X given by randomly selecting a data point. Find the probability distribution of X, that is, p_i such that P(X = x_i) = p_i for x₁ = 0, x₂ = 1, x₃ = 2, x₄ = 3, x₅ = 4, and x₆ = 5.
Express the arithmetic mean in terms of the p_i and x_i in part b.

Solution

The arithmetic mean is given by

Hence, on average there are approximately 1.4 flycatchers breeding in a plot.
The data values are x₁ = 0, x₂ = 1, x₃ = 2, x₄ = 3, x₅ = 4, and x₆ = 5. The fraction of zeros is . The fractions of ones and twos, respectively, are and . The fractions of 3s, 4s, and 5s are all . Hence, p₁ = , p₂ = , p₃ = , and p₄ = p₅ = p₆ = .
From part a, we rewrite the arithmetic mean as

From our work in part b, we get that the arithmetic mean equals

Example 1 motivates the following more general definition.

Mean for a Discrete Random Variable

Consider a discrete random variable X that takes on the values x₁, x₂, x₃,..., x_k with probabilities p₁, p₂, p₃,..., p_k. The mean of X equals

$$\mu = \sum_{i=1}^{k} p_i x_i = p_1 x_1 + p_2 x_2 + \cdots + p_k x_k.$$

One inspiration for this definition is Blaise Pascal's statement that the excitement felt by a gambler is equal to the amount he might win times the probability of winning it. From a gambling perspective, if x₁,..., x_k are the amounts you can win and p₁,..., p_k are the likelihoods of winning these amounts, then the mean is what you expect to win. Each term in the sum corresponds to the “amount you might win times the probability of winning it.”

Example 2 How much do you expect to win playing roulette?

Recall from Example 2 of Section 7.1 that the roulette wheel in American casinos has 38 colored and numbered pockets, of which 18 are black, 18 are red, and 2 are green. A player betting one dollar on red wins one dollar if the ball lands in a red pocket and loses the dollar otherwise. Let X be the earnings of a player betting one dollar on red. Find the mean of X and discuss its meaning.

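Solution We have X taking on the values x₁ = 1 and x₂ = −1 with probabilities p₁ = 18/38 and p₂ = 20/38, respectively. Hence, the mean of X equals

$$\mu = 1 \cdot \frac{18}{38} + (-1) \cdot \frac{20}{38} = -\frac{2}{38} \approx -0.0526.$$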

Thus, an individual betting one dollar on red loses, on average, about a nickel per bet. One way to interpret this statement is to say that if 1,000 individuals each bet 1 dollar on red, the casino can expect to earn about 1,000 · 0.0526 ≈ 53 dollars.
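
The roulette computation can be checked with a few lines of Python. This is only a sketch: the helper name `discrete_mean` is hypothetical, and exact fractions are used so that no rounding enters the calculation.

```python
from fractions import Fraction


def discrete_mean(values, probabilities):
    """Return the mean sum(p_i * x_i) of a discrete random variable."""
    if len(values) != len(probabilities):
        raise ValueError("values and probabilities must have the same length")
    if sum(probabilities) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for x, p in zip(values, probabilities))


# One-dollar bet on red: win $1 on 18 of 38 pockets, lose $1 on the other 20.
roulette_mean = discrete_mean([1, -1], [Fraction(18, 38), Fraction(20, 38)])

# Expected casino earnings when 1,000 players each bet one dollar on red.
casino_earnings = -1000 * roulette_mean

print(roulette_mean)           # exact mean as a fraction
print(float(casino_earnings))  # expected earnings in dollars
```

Working with `Fraction` keeps the mean exact (−2/38 reduces to −1/19); converting to a float at the end gives the dollar figure.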

Now suppose X is a continuous random variable with PDF f(x). To find the mean of X, we may approximate X by a discrete random variable by dividing the real line into intervals of length Δx with endpoints

$$\ldots,\ -2\Delta x,\ -\Delta x,\ 0,\ \Delta x,\ 2\Delta x,\ \ldots$$

The probability of X taking on a value in the interval [x, x + Δx) is approximately f(x)Δx. Using the definition of the mean for a discrete random variable as motivation, the mean of X should be approximately given by the sum of the values weighted by their probabilities, that is,

$$\sum_{k=-\infty}^{\infty} (k\,\Delta x)\, f(k\,\Delta x)\,\Delta x.$$

Taking the limit as Δx goes to zero yields the integral ∫_{−∞}^{∞} x f(x) dx, which suggests the following definition.

Mean for a Continuous Random Variable

For a continuous random variable X with PDF f(x), the mean of X is given by

$$\mu = \int_{-\infty}^{\infty} x f(x)\,dx$$

provided that the improper integral is convergent.

Example 3 Throwing darts

Sebastian is a terrible dart player. In his honor, the local pub has created a large dartboard with a radius of 2 feet. With this dartboard, Sebastian always hits the board, but his dart is equally likely to hit any point on the board. Let X be the distance from the center that the dart lands. In Problem 25 of Problem Set 7.3, you are asked to show that the PDF for X is given by

$$f(x) = \begin{cases} \dfrac{x}{2} & \text{if } 0 \le x \le 2 \\ 0 & \text{elsewhere.} \end{cases}$$

Find the mean distance that Sebastian's darts land from the center.
Find the probability that a dart lands less than the mean distance from the center.

Solution

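To find the mean, we compute

$$\int_{-\infty}^{\infty} x f(x)\,dx = \int_0^2 \frac{x^2}{2}\,dx = \frac{x^3}{6}\bigg|_0^2 = \frac{4}{3}.$$

So on average, a dart thrown by Sebastian lands 4/3 feet from the center.
To find P(X ≤ 4/3), we compute

$$P\left(X \le \tfrac{4}{3}\right) = \int_0^{4/3} \frac{x}{2}\,dx = \frac{x^2}{4}\bigg|_0^{4/3} = \frac{4}{9} \approx 0.44.$$

Hence, there is less than a 50% chance that Sebastian's dart will land within 4/3 feet of the center, even though 4/3 feet is the mean distance of all shots from the center of the dartboard.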
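
As a numerical cross-check of Example 3, the following sketch approximates both integrals with the midpoint rule. It assumes the PDF f(x) = x/2 on [0, 2] (the density of the distance from the center for a uniformly distributed hit on a board of radius 2); the function names are hypothetical.

```python
def midpoint_integral(g, lower, upper, steps=100_000):
    """Approximate the integral of g over [lower, upper] with the midpoint rule."""
    width = (upper - lower) / steps
    return width * sum(g(lower + (k + 0.5) * width) for k in range(steps))


def dart_pdf(x):
    """Assumed PDF of the distance from the center on a board of radius 2."""
    return x / 2 if 0 <= x <= 2 else 0.0


# Mean distance: integral of x * f(x) over the support [0, 2].
mean_distance = midpoint_integral(lambda x: x * dart_pdf(x), 0, 2)

# Probability of landing closer to the center than the mean distance.
prob_below_mean = midpoint_integral(dart_pdf, 0, mean_distance)

print(round(mean_distance, 4), round(prob_below_mean, 4))
```

The two printed numbers should agree with 4/3 and 4/9 to the displayed precision, illustrating that the mean need not split the probability in half.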

Example 4 Exponential means

Consider a drug with elimination constant c. The time, in hours, that it takes a molecule of the drug to be eliminated has an exponential distribution with parameter c. As illustrated in Example 10 of Section 7.1, the PDF for this distribution is given by

$$f(x) = \begin{cases} c\,e^{-cx} & \text{if } x \ge 0 \\ 0 & \text{elsewhere.} \end{cases}$$

Find the mean of the exponential distribution. What is its interpretation in the context of drug decay?
For the typical patient, lidocaine has an elimination constant of 0.43 per hour. What is the mean time for a molecule to leave? What is the half-life in the body of a lidocaine molecule?

Solution

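The mean of the exponential distribution is given by

$$\int_0^{\infty} x\,c\,e^{-cx}\,dx = \lim_{b\to\infty}\left[-x e^{-cx} - \frac{1}{c}e^{-cx}\right]_0^b = \frac{1}{c},$$

where we have used integration by parts. If c has units “per hour,” then 1/c has units “hours” and corresponds to the mean number of hours it takes for a molecule of drug to be cleared from the body.
The mean elimination time for lidocaine is 1/0.43 ≈ 2.33 hours. On the other hand, the half-life is given by the solution to

$$\int_0^{t} 0.43\,e^{-0.43x}\,dx = 1 - e^{-0.43t} = \frac{1}{2},$$

that is, t = ln 2/0.43 ≈ 1.61 hours. Hence, half of the molecules are eliminated before the mean time to elimination.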
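
The relationship between the mean 1/c and the half-life ln 2/c can be sketched in Python as follows. The function names are hypothetical, and the elimination constant is taken to be 0.43 per hour, as stated in part b of the example.

```python
import math


def exponential_mean(c):
    """Mean of the exponential distribution with rate constant c."""
    return 1.0 / c


def exponential_half_life(c):
    """Time t at which 1 - exp(-c * t) equals 1/2."""
    return math.log(2) / c


c_lidocaine = 0.43  # per hour, from the example statement

mean_time = exponential_mean(c_lidocaine)
half_life = exponential_half_life(c_lidocaine)

# Fraction of molecules already eliminated by the mean time: 1 - exp(-1).
fraction_gone_by_mean = 1.0 - math.exp(-c_lidocaine * mean_time)

print(f"mean = {mean_time:.2f} h, half-life = {half_life:.2f} h")
print(f"fraction eliminated by the mean time = {fraction_gone_by_mean:.3f}")
```

Note that the fraction eliminated by the mean time equals 1 − e⁻¹ regardless of c, which is why the half-life always falls before the mean for an exponential distribution.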

As discussed in Problem Set 7.1, the exponential distribution can be used to model extinction times for species.

Example 5 Extinction rates

In their article, “Extinction Rates of North American Freshwater Fauna” (Conservation Biology 13 (1999): 1220–1222), Anthony Ricciardi and Joseph B. Rasmussen showed that time to extinction of a species is exponentially distributed, with 0.1% of terrestrial and marine animals going extinct per decade.

What is the elimination constant c for this data set? What is the mean extinction time?
How long is it expected to take for half the species to go extinct?
Ricciardi and Rasmussen estimated future extinction rates by assuming all currently imperiled species (i.e., endangered or threatened) will not survive this century. Under this assumption, 0.8% of species would be going extinct per decade. Determine how this alters the answers to parts a and b.

Solution

If species extinctions are exponentially distributed and time x is measured in years, then the data of Ricciardi and Rasmussen tell us that

$$\int_0^{10} c\,e^{-cx}\,dx = 1 - e^{-10c} = 0.001.$$

Solving for c yields

$$c = -\frac{\ln 0.999}{10} \approx 0.00010005.$$

From part a of Example 4, we get that the mean time to extinction for a species is 1/c ≈ 9,995 years.
To determine the half-life of the extinction process, we need to solve

$$1 - e^{-ct} = \frac{1}{2}$$

for t, which yields t = ln 2/c ≈ 6,928.01 years.
Solving

$$1 - e^{-10c} = 0.008$$

for c yields c ≈ 0.000803217. The mean time to extinction shrinks by a factor of approximately 8 to 1/c ≈ 1,245 years. Solving

$$1 - e^{-ct} = \frac{1}{2}$$

for t yields a half-life of approximately 863 years, which is the expected time it will take for half the currently extant species to go extinct.
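
The calculations in Example 5 follow one pattern: convert a fraction lost per decade into a rate constant, then read off the mean and half-life. A minimal sketch, with hypothetical function names and time measured in years:

```python
import math


def rate_from_fraction(fraction, interval):
    """Solve 1 - exp(-c * interval) = fraction for the rate constant c."""
    return -math.log(1.0 - fraction) / interval


def summarize(fraction_per_decade):
    """Return (c, mean time, half-life), with time measured in years."""
    c = rate_from_fraction(fraction_per_decade, 10.0)
    return c, 1.0 / c, math.log(2) / c


current = summarize(0.001)    # 0.1% of species lost per decade
projected = summarize(0.008)  # 0.8% of species lost per decade

for label, (c, mean_time, half_life) in [("current", current), ("projected", projected)]:
    print(f"{label}: c = {c:.9f}, mean = {mean_time:.0f} years, "
          f"half-life = {half_life:.0f} years")
```

Because both the mean and the half-life are inversely proportional to c, an eightfold increase in the extinction rate shortens both by roughly a factor of eight.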

Examples 3 and 4 illustrate that the fraction of data to the left of the mean can differ noticeably from 50%: it is less than 50% in Example 3 and greater than 50% in Example 4. This raises two questions. First, what is the geometrical interpretation of the mean? To answer this question, imagine that we take an (infinitely) long board and cut out the area lying under the PDF. If we placed this wooden PDF as shown in Figure 7.15 on a fulcrum at the mean, then the PDF would balance perfectly. Second, for what type of PDFs is 50% of the area to the left (and to the right) of the mean? A partial answer to this question is provided in the following example, using the concept of a symmetrical function and an odd function. Recall from Chapter 1 that an odd function g(x) satisfies g(x) = −g(−x) for all x.

images

Figure 7.15 PDF with fulcrum at the mean

Example 6 Symmetrical PDFs

Let f(x) be a PDF that is symmetric around x = a. In other words, f(x) is a PDF and f(a + x) = f(a − x) for all x as illustrated in Figure 7.16. If the mean associated with f(x) is well defined, we expect it to equal a, as the PDF should balance at this point. To verify this assertion analytically, assume the mean is well defined (i.e., ∫_{−∞}^{∞} x f(x) dx is convergent) and do the following:

images

Figure 7.16 A symmetric PDF f(x) around a satisfies f(a + x) = f(a − x) for all x.

Show that the mean is zero if a = 0.
Show that g(x) = f(x + a) is a PDF.
For a ≠ 0, use parts a and b and the change of variables u = x − a to find the mean.

Solution

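Assume a = 0. Then, f is symmetric around 0; namely, f(−x) = f(x) for all x. For any b > 0,

$$\int_{-b}^{b} x f(x)\,dx = \int_{-b}^{0} x f(x)\,dx + \int_{0}^{b} x f(x)\,dx = -\int_{0}^{b} u f(u)\,du + \int_{0}^{b} x f(x)\,dx = 0,$$

where the substitution u = −x and the symmetry f(−u) = f(u) were used in the first integral. Since we have assumed that ∫_{−∞}^{∞} x f(x) dx is convergent, the mean equals

$$\int_{-\infty}^{\infty} x f(x)\,dx = \lim_{b\to\infty}\int_{-b}^{b} x f(x)\,dx = 0.$$

Let g(x) = f(a + x). We have that g(x) is nonnegative for all x as f is nonnegative for all x. Furthermore, using the change of variables u = a + x,

$$\int_{-\infty}^{\infty} g(x)\,dx = \int_{-\infty}^{\infty} f(a + x)\,dx = \int_{-\infty}^{\infty} f(u)\,du = 1.$$

Therefore, g(x) is a PDF. In Problem Set 7.3, you are asked to show that if X is a random variable with PDF f(x), then the PDF of X − a is g(x).
Let g(x) = f(a + x). By our assumption of symmetry around x = a, g(−x) = f(a − x) = f(a + x) = g(x). In other words, g is symmetric around x = 0. Using the change of variables u = x − a, we get the mean of the PDF f(x):

$$\int_{-\infty}^{\infty} x f(x)\,dx = \int_{-\infty}^{\infty} (u + a)\, g(u)\,du = \int_{-\infty}^{\infty} u\, g(u)\,du + a\int_{-\infty}^{\infty} g(u)\,du = 0 + a \cdot 1 = a,$$

where the first integral vanishes by part a and the second equals 1 by part b.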

In summary, the preceding example proves that for symmetric PDFs with a convergent mean, the mean corresponds to the point of symmetry of the PDF. Furthermore, by symmetry, the area under the PDF below the point of symmetry is equal to the area under the PDF above the point of symmetry. Each contains half of the total area, so that, by definition, the point of symmetry is the 50th percentile, or median. Thus, we have the following result:

Means and Medians of Symmetrical Random Variables

A continuous random variable X with PDF f(x) is symmetric around x = a if f(a − x) = f(a + x) for all x. A discrete random variable X is symmetric around x = a if P(X = a − x) = P(X = a + x) for all x. In either case, if the mean of X is well defined, then the mean of X and the median of X equal a.

Example 7 Means of symmetric PDFs

Assuming the means are well defined, find the means of the following probability densities.

Birthday PDF:
Triangular PDF:
Laplacian PDF:

Solution

Since f(x) is symmetric around x = 365/2, the mean of the birthday distribution is 365/2 = 182.5.
Since the triangular distribution is symmetric around x = 1, x = 1 is the mean.
Since the Laplacian distribution is symmetric around x = 0, 0 is the mean.

Sometimes the mean of a random variable can be infinite or not well defined, as the following example illustrates.

Example 8 Pareto distribution revisited

In Section 7.2, we introduced the Pareto PDF

$$f(x) = \begin{cases} (a-1)\,x^{-a} & \text{if } x \ge 1 \\ 0 & \text{elsewhere} \end{cases}$$

with parameter a > 1.

Find for what values of a > 1 the Pareto PDF has a finite mean; and when it is finite, find the mean.
In an article in Contemporary Physics, Michael Newman found that the distribution of population sizes (ten thousands) of cities in the United States in the year 2000 was well approximated by a Pareto PDF with a = 2.18471, as illustrated in Figure 7.17. Estimate the mean population size of a city.

Figure 7.17 Distribution of population sizes for cities in the United States in the year 2000. In the right panel, the distribution is plotted on a log–log scale. The best-fitting Pareto PDF is shown as a solid line.

Data Source: M. W. Newman, “Power Laws, Pareto Distributions, and Zipf's Law,” Contemporary Physics 46 (2005): 323–351.

Solution

Provided the integral is convergent, the mean of the Pareto PDF is given by

$$\int_1^{\infty} x\,(a-1)\,x^{-a}\,dx = (a-1)\int_1^{\infty} \frac{dx}{x^{a-1}}.$$

Since ∫_1^∞ dx/x^{a−1} is a p-integral with p = a − 1 (see Example 2 of Section 7.2), it is convergent if a > 2 and divergent if a ≤ 2. For a > 2, Example 2 of Section 7.2 with p = a − 1 implies that

$$(a-1)\int_1^{\infty} \frac{dx}{x^{a-1}} = \frac{a-1}{a-2}$$

is the mean of the Pareto distribution.
Since a = 2.18471 > 2, the mean is well defined and given by

$$\frac{a-1}{a-2} = \frac{1.18471}{0.18471} \approx 6.41.$$

Hence, according to the Pareto distribution approximation, the average population size in a U.S. city in the year 2000 was approximately sixty-four thousand individuals.
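
The following sketch illustrates how slowly the Pareto mean converges when a is only slightly larger than 2. It assumes the PDF f(x) = (a − 1)x^(−a) for x ≥ 1; both function names are hypothetical, and `truncated_mean` uses the exact antiderivative rather than numerical integration.

```python
import math


def pareto_mean(a):
    """Mean of the PDF f(x) = (a - 1) * x**(-a) on x >= 1; infinite when a <= 2."""
    if a <= 1:
        raise ValueError("the Pareto parameter must satisfy a > 1")
    return (a - 1) / (a - 2) if a > 2 else math.inf


def truncated_mean(a, upper):
    """Exact integral of x * f(x) from 1 to upper (a partial mean)."""
    if a == 2:
        return math.log(upper)
    return (a - 1) / (2 - a) * (upper ** (2 - a) - 1)


a_cities = 2.18471  # parameter reported for U.S. city sizes (units of 10,000 people)
print("mean:", pareto_mean(a_cities))

# Partial means approach the full mean slowly because the tail is heavy.
for upper in (1e2, 1e4, 1e6):
    print(upper, truncated_mean(a_cities, upper))
```

Even after integrating out to one million (in units of ten thousand people), the partial mean is still visibly below the limiting value, a reminder that a few very large cities contribute substantially to the average.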

Variance and standard deviation

The importance of going beyond the mean was captured by the English mathematician Sir Francis Galton (1822–1911). He famously queried why statisticians of his day typically limited their enquiries to computing averages; he commented (pp. 62–63 in Galton, F., 1889, Natural Inheritance. Macmillan & Co., London), that souls of these statisticians “seem as dull to the charm of variety as that of the native of one of our flat English counties, whose retrospect of Switzerland was that, if its mountains could be thrown into its lakes, two nuisances would be got rid of at once.” One method to calculate the spread of data around the mean is to compute variance.

Variance for Discrete Random Variables

Let X be a random variable taking on the values x₁, x₂, x₃,..., x_k with probabilities p₁, p₂, p₃,..., p_k. Let μ be the mean of X. The variance, which we denote σ², is defined by

$$\sigma^2 = \sum_{i=1}^{k} (x_i - \mu)^2\, p_i$$

while its square root σ, which has the same units as the mean, is referred to as the standard deviation.

For a data set taking on the distinct values x₁, x₂, x₃,..., x_k with relative frequencies p₁, p₂, p₃,..., p_k and mean μ, the variance and standard deviation for the data set are defined exactly as for the discrete random variable. Some technologies employ a correction factor for calculating the variance associated with a sample of size n by multiplying the variance, as defined above, by the factor n/(n − 1) to get what statisticians call an unbiased estimate of the variance. If the value you obtain in solving a problem does not agree with the value we provide as solution, it may be that your technology employed this correction factor.
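
To see the effect of the n/(n − 1) correction factor, the following sketch uses Python's standard `statistics` module, in which `pvariance` divides by n (the definition used in this section) and `variance` divides by n − 1. The data set is hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import statistics

# Hypothetical data set with mean 5 and squared deviations summing to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

population_variance = statistics.pvariance(data)  # divides by n
sample_variance = statistics.variance(data)       # divides by n - 1

print(population_variance, sample_variance)

# The two values differ exactly by the correction factor n / (n - 1).
print(population_variance * n / (n - 1))
```

If a calculator or software package reports the larger of the two values, it is applying the correction factor described above.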

In describing the spread of data around the mean, scientists typically use the standard deviation rather than the variance. Unlike the variance, the standard deviation and the mean have the same units as the data. We will show, at the end of this section, that at least three-quarters of a data set always lies within two standard deviations of its mean. Equivalently, the probability that a random variable takes on a value within two standard deviations of its mean is at least three-quarters. In the next section, we will show that for certain bell-shaped distributions, approximately “two-thirds” of a data set lies within one standard deviation of its mean.

Example 9 Computing variances and standard deviations

Recall Example 1 which referred to a study of competition for nest holes among collared flycatchers. Researchers collected data by periodically inspecting nest boxes located on the island of Gotland in Sweden. The accompanying data give the number of flycatchers breeding at fourteen distinct plots:

Let X be the number of flycatchers in a randomly chosen plot. Find the variance and standard deviation of X.

Solution In Example 1 we found that μ ≈ 1.43. Hence, the variance is given by

images

and the standard deviation is σ ≈ 1.55.

The following example illustrates that standard deviations measure the spread of the data around its mean.

Example 10 Seeing the spread

A person places multiple bets on three “fair” games. Her winnings for the games are as follows:

Game A −1, 0, 0, 0, 1 (dollar values)
Game B −1, −1, 0, 1, 1 (dollar values)
Game C −2, −1, 0, 1, 2 (dollar values)

Plot the histograms of the frequencies of observed values for each of these data sets.
Compute the variances for each of these data sets.
Discuss what you find.

Solution

Plotting the histograms yields:
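Since all the histograms balance at 0, the mean for all data sets is 0. The variances are as follows:

$$\text{Game A: } \sigma^2 = \tfrac{1}{5}\left[(-1)^2 + 0^2 + 0^2 + 0^2 + 1^2\right] = 0.4$$
$$\text{Game B: } \sigma^2 = \tfrac{1}{5}\left[(-1)^2 + (-1)^2 + 0^2 + 1^2 + 1^2\right] = 0.8$$
$$\text{Game C: } \sigma^2 = \tfrac{1}{5}\left[(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2\right] = 2$$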
For game A, the variance is 0.4 as there is some variation about the mean 0. Since game B has more data points away from the mean than game A, the variance for this game is greater than that for game A. Since game C has the greatest spread in winnings, it has the largest variance.
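
The three variances can be reproduced directly from the definition. The sketch below uses the winnings listed in the example; the helper name `mean_and_variance` is hypothetical, and the variance divides by n, with no correction factor.

```python
def mean_and_variance(data):
    """Return (mean, variance) of a data set, dividing by n as in this section."""
    n = len(data)
    mu = sum(data) / n
    variance = sum((x - mu) ** 2 for x in data) / n
    return mu, variance


games = {
    "A": [-1, 0, 0, 0, 1],
    "B": [-1, -1, 0, 1, 1],
    "C": [-2, -1, 0, 1, 2],
}

results = {name: mean_and_variance(winnings) for name, winnings in games.items()}

for name, (mu, variance) in results.items():
    print(f"Game {name}: mean = {mu}, variance = {variance}")
```

All three games have mean 0, so the ordering of the variances reflects only how far the winnings are spread from 0.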

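Consider a continuous random variable X with PDF f(x). Assume the mean μ = ∫_{−∞}^{∞} x f(x) dx is well defined. To define the variance of X, we can approximate X by a discrete random variable by dividing the real line into intervals of length Δx with endpoints

$$\ldots,\ -2\Delta x,\ -\Delta x,\ 0,\ \Delta x,\ 2\Delta x,\ \ldots$$

Since the probability X takes on a value between x and x + Δx is approximately f(x)Δx, the variance should approximately equal

$$\cdots + (-\Delta x - \mu)^2 f(-\Delta x)\,\Delta x + (0 - \mu)^2 f(0)\,\Delta x + (\Delta x - \mu)^2 f(\Delta x)\,\Delta x + \cdots$$

Equivalently,

$$\sum_{k=-\infty}^{\infty} (k\,\Delta x - \mu)^2 f(k\,\Delta x)\,\Delta x.$$

Taking the limit as Δx goes to zero yields ∫_{−∞}^{∞} (x − μ)² f(x) dx.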

Variance and Standard Deviation for a Continuous Random Variable

For a continuous random variable X with PDF f(x) and mean μ, the variance of X is given by

$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$$

provided the improper integral converges. The standard deviation of X is given by σ, the square root of the variance.

Example 11 Integrating to get a variance

Find the variances of the following PDFs.

Birthday distribution (non leap years):
A Triangular distribution:

Solution

Earlier, we found the mean of the non-leap-year birthday PDF is μ = 365/2. Hence, the variance is given by
Earlier we found that the mean of the triangular distribution is μ = 1. Hence, the variance is given by

The following example studies the effect of the standard deviation on the shape of the distribution.

Example 12 Laplacian variance

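Recall from Example 9 in Section 7.2 the Laplacian PDF f(x) = (b/2)e^{−b|x|}, where b > 0. Since this distribution is symmetric around 0, its mean is 0.

Find the standard deviation of this PDF.
Use technology to plot the PDF for different b values and discuss how the standard deviation affects the shape of the PDF.

Solution

We need to compute

$$\sigma^2 = \int_{-\infty}^{\infty} x^2\,\frac{b}{2}\,e^{-b|x|}\,dx = b\int_0^{\infty} x^2 e^{-bx}\,dx,$$

where the second equality uses the symmetry of the integrand. Applying integration by parts twice yields

$$\int x^2 e^{-bx}\,dx = -\left(\frac{x^2}{b} + \frac{2x}{b^2} + \frac{2}{b^3}\right)e^{-bx} + C.$$

Hence,

$$\sigma^2 = b\lim_{t\to\infty}\left[-\left(\frac{x^2}{b} + \frac{2x}{b^2} + \frac{2}{b^3}\right)e^{-bx}\right]_0^t = b\cdot\frac{2}{b^3} = \frac{2}{b^2}.$$

Therefore, σ = √2/b.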
Plotting the PDF for b = 1, 5, 10 yields this graph:

For larger b values, the standard deviation is smaller. The PDF tends to concentrate more around the mean of 0 when the standard deviation is smaller.
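
The inverse relationship between b and the standard deviation can be checked numerically. This sketch assumes the normalized Laplacian PDF (b/2)e^(−b|x|), so that the density integrates to 1, and approximates the variance integral with the midpoint rule on a truncated interval; the function names and the truncation at ±40/b are choices made here for illustration.

```python
import math


def laplace_pdf(x, b):
    """Assumed Laplacian PDF (b / 2) * exp(-b * |x|)."""
    return 0.5 * b * math.exp(-b * abs(x))


def laplace_std_numeric(b, steps=200_000):
    """Midpoint-rule estimate of the standard deviation (the mean is 0)."""
    upper = 40.0 / b  # the tail beyond this point is negligible
    width = 2.0 * upper / steps
    total = 0.0
    for k in range(steps):
        x = -upper + (k + 0.5) * width
        total += x * x * laplace_pdf(x, b) * width
    return math.sqrt(total)


for b in (1, 5, 10):
    print(b, laplace_std_numeric(b), math.sqrt(2) / b)
```

For each b, the numerical estimate and the closed form √2/b should agree to several decimal places, and both shrink as b grows.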

The following example provides an easier way of computing variances.

Example 13 Variance: mean-squared property

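Let f be a PDF with mean μ. Assuming σ² is well defined, show that

$$\sigma^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2.$$

Solution The definition of variance and rules of integration imply

$$\begin{aligned}
\sigma^2 &= \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx \\
&= \int_{-\infty}^{\infty} x^2 f(x)\,dx - 2\mu\int_{-\infty}^{\infty} x f(x)\,dx + \mu^2\int_{-\infty}^{\infty} f(x)\,dx \\
&= \int_{-\infty}^{\infty} x^2 f(x)\,dx - 2\mu^2 + \mu^2 \\
&= \int_{-\infty}^{\infty} x^2 f(x)\,dx - \mu^2.
\end{aligned}$$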
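
The mean-squared property can be illustrated numerically by computing a variance both ways. The PDF used below, f(x) = 2x on [0, 1], is a hypothetical example chosen for simplicity and does not come from the text; the integrals are approximated with the midpoint rule.

```python
def midpoint_integral(g, lower, upper, steps=100_000):
    """Approximate the integral of g over [lower, upper] with the midpoint rule."""
    width = (upper - lower) / steps
    return width * sum(g(lower + (k + 0.5) * width) for k in range(steps))


def pdf(x):
    """Hypothetical PDF f(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


mu = midpoint_integral(lambda x: x * pdf(x), 0.0, 1.0)

# Variance from the definition: integral of (x - mu)^2 f(x).
variance_by_definition = midpoint_integral(lambda x: (x - mu) ** 2 * pdf(x), 0.0, 1.0)

# Variance from the mean-squared property: integral of x^2 f(x), minus mu^2.
variance_by_shortcut = midpoint_integral(lambda x: x ** 2 * pdf(x), 0.0, 1.0) - mu ** 2

print(mu, variance_by_definition, variance_by_shortcut)
```

For this PDF the exact values are μ = 2/3 and σ² = 1/2 − 4/9 = 1/18, and the two numerical variances agree up to rounding.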

Example 14 Variance for the Pareto distribution

In Example 8, we found that the Pareto distribution with parameter a > 1 has a finite mean μ = (a − 1)/(a − 2) provided that a > 2. Determine when the variance for the Pareto distribution is finite and find the variance when it is finite.

Solution Recall that the Pareto PDF is given by

$$f(x) = \begin{cases} (a-1)\,x^{-a} & \text{if } x \ge 1 \\ 0 & \text{elsewhere.} \end{cases}$$

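Assume that a > 2, in which case the mean equals μ = (a − 1)/(a − 2). Example 13 implies that the variance of this PDF is convergent only if ∫_{−∞}^{∞} x² f(x) dx is convergent. We have that

$$\int_{-\infty}^{\infty} x^2 f(x)\,dx = (a-1)\int_1^{\infty} \frac{dx}{x^{a-2}}$$

is a p-integral (see Example 2 of Section 7.2) with p = a − 2. Therefore, this integral converges only if a > 3, in which case

$$(a-1)\int_1^{\infty} \frac{dx}{x^{a-2}} = \frac{a-1}{a-3}.$$

When a > 3, Example 13 implies that the variance of the Pareto distribution equals

$$\sigma^2 = \frac{a-1}{a-3} - \left(\frac{a-1}{a-2}\right)^2 = \frac{a-1}{(a-3)(a-2)^2}.$$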

Chebyshev's inequality

As we saw in the previous example, through its square root (i.e., standard deviation), the variance provides both a measurement and a sense of the spread around the mean. Larger standard deviations suggest greater spread around the mean. A basic inequality from probability theory provides a general method of estimating what fraction of the data is within a certain number of standard deviations of the mean. This inequality is Chebyshev's inequality, named after the mathematician Pafnuty Chebyshev (1821–1894), who first proved it.

Theorem 7.3 Chebyshev's inequality

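Let X be a random variable with mean μ and standard deviation σ. Then, for any k > 0,

$$P(|X - \mu| < k\sigma) \ge 1 - \frac{1}{k^2}.$$

Proof. We provide a proof in the case of a continuous random variable with PDF f(x). In this case, we obtain

$$\begin{aligned}
\sigma^2 &= \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx \\
&\ge \int_{-\infty}^{\mu - k\sigma} (x - \mu)^2 f(x)\,dx + \int_{\mu + k\sigma}^{\infty} (x - \mu)^2 f(x)\,dx \\
&\ge k^2\sigma^2\left(\int_{-\infty}^{\mu - k\sigma} f(x)\,dx + \int_{\mu + k\sigma}^{\infty} f(x)\,dx\right) \\
&= k^2\sigma^2\, P(|X - \mu| \ge k\sigma).
\end{aligned}$$

Thus, we have shown that

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}, \quad\text{or equivalently}\quad P(|X - \mu| < k\sigma) \ge 1 - \frac{1}{k^2}.$$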

Since this theorem holds for all PDFs, no matter how peaked or how flat, the bounds in some cases can be rather weak.

Example 15 Using Chebyshev's inequality

In 1998 in Hong Kong, there were 52,955 newborns, with a mean birth weight of 3.2 kg and a standard deviation of 0.5 kg.* Using only these data, estimate the following quantities.

Fraction of newborns weighing between 2.2 and 4.2 kg
Fraction of newborns weighing between 1.7 and 4.7 kg

Solution To apply Chebyshev's inequality, let X be the weight of a randomly selected newborn.

Since μ = 3.2 and σ = 0.5, we find that 2.2 = μ − 2σ and 4.2 = μ + 2σ. By Chebyshev's inequality with k = 2, we find that at least 1 − 1/2² = 3/4 of the newborns weighed between 2.2 and 4.2 kg.
Since μ = 3.2 and σ = 0.5, we find that 1.7 = μ − 3σ and 4.7 = μ + 3σ. By Chebyshev's inequality with k = 3, we find that at least 1 − 1/3² = 8/9 of the newborns weighed between 1.7 and 4.7 kg.

Example 15 illustrates what Chebyshev's inequality states: at least 3/4 of the data (k = 2) are within two standard deviations of the mean, at least 8/9 of the data (k = 3) are within three standard deviations of the mean, and so on.
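
The bound 1 − 1/k² and the fraction actually observed in a data set can be compared with a short sketch. The function names are hypothetical, the standard deviation divides by n as in this section, and the Game C winnings from Example 10 are reused purely as sample data.

```python
import math


def chebyshev_lower_bound(k):
    """Minimum fraction of data strictly within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return max(0.0, 1.0 - 1.0 / k**2)


def fraction_within(data, k):
    """Observed fraction of data strictly within k standard deviations of the mean.

    Assumes the data are not all equal, so that the standard deviation is positive.
    """
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return sum(1 for x in data if abs(x - mu) < k * sigma) / n


bound_2 = chebyshev_lower_bound(2)  # bound used in part a of Example 15
bound_3 = chebyshev_lower_bound(3)  # bound used in part b of Example 15

game_c = [-2, -1, 0, 1, 2]
observed_2 = fraction_within(game_c, 2)

print(bound_2, bound_3, observed_2)
```

The observed fraction is never smaller than the bound, but it can be much larger, which is the sense in which Chebyshev's bounds can be rather weak.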

PROBLEM SET 7.3

Level 1 DRILL PROBLEMS

Compute the mean, variance, and standard deviation of the data set given in Problems 1 to 8.

1. 1, 1, 0, 1, 1

2. 2, 0, 2

3. 1, 1, 1, 1, 1

4. 1, 2, 3, 4, 5, 6 (a die)

5. 1, 5, 7

6. −1, −2, 1, 4

7. The set of numbers that contains 2 zeros, 6 ones, 17 twos, and 8 threes.

8. The set of numbers that contains 7 negative twos, 5 negative ones, 3 zeros, 8 ones, and 12 twos.

Compute the mean of the random variable with the indicated PDF in Problems 9 to 16.

9. f(x) = for 0 ≤ x ≤ 2 and f(x) = 0 elsewhere

10. f(x) = for x ≥ 1 and f(x) = 0 elsewhere

11. f(x) = for |x| ≥ 1 and f(x) = 0 elsewhere

12. f(x) = e^−x for x ≥ 0 and f(x) = 0 elsewhere

13. f(x) = images

14. f(x) =

15. f(x) = xe^−x for x ≥ 0 and f(x) = 0 elsewhere

16. f(x) = images for x ≥ 0 and f(x) = 0 elsewhere (Hint: See Example 5 of Section 5.6)

Compute the variance of the PDFs in Problems 17 to 20.

17. f(x) = for 0 ≤ x ≤ 2 and f(x) = 0 elsewhere

18. f(x) = for x ≥ 1 and f(x) = 0 elsewhere

19. f(x) = for |x| ≥ 1 and f(x) = 0 elsewhere

20. f(x) = xe^−x for x ≥ 0 and f(x) = 0 elsewhere

21. Consider the following data set:

Find the mean and standard deviation.
According to Chebyshev's inequality, what fraction of data (at the bare minimum) has to lie in the interval [μ −2σ, μ + 2σ]? What fraction of the data does lie in this interval?

22. Consider the following data set:

Find the mean and standard deviation.
According to Chebyshev's inequality, what fraction of data (at the bare minimum) has to lie in the interval [μ −2σ, μ + 2σ]? What fraction of the data does lie in this interval?

23. Suppose that a random variable X has a PDF of

Find P(0.2 ≤ X ≤ 0.6). Sketch this probability on a graph of the PDF.
Find and graph the CDF.
Find the mean value.

24. Suppose that a random variable X has a PDF of

Find P(1.2 ≤ X ≤ 2.4). Sketch this probability on a graph of the PDF.
Find and graph the CDF.
Find the mean value.

Level 2 APPLIED AND THEORY PROBLEMS

25. Let X be the distance from the center that a dart lands on a dartboard with a radius of 2 feet.

Show that the PDF for X is given by f(x) = x/2 for 0 ≤ x ≤ 2 and f(x) = 0 elsewhere.
Find the mean and variance of this PDF.

26. In spring 1994, the number of bird species in forty different oak woodland sites in California were collected. Each site was around 5 hectares in size—the equivalent of about 12 acres, or 0.019 square miles— and the sites were situated in a relatively homogeneous habitat. The numbers of bird species found in these sites are listed below:

images

Let X be the number of species in a randomly chosen site. Using technology, compute the mean and standard deviation of X.

27. The following numbers are length in seconds of scenes showing tobacco use in six animation movies from a film studio:

Let X be the length of a scene from a randomly chosen movie.

Compute the mean for X.
Compute the standard deviation of X.

28. The following numbers are belly bristles per fruit fly in a sample size of six:

Find the relative frequency of 30.
Compute the mean number of belly bristles in this sample.
Compute the standard deviation of this data set.
According to Chebyshev's inequality, what is the minimum fraction of the data taking on values between 26.5 and 34.5? What is the actual fraction of data taking on values between 26.5 and 34.5?

29. Consider the exponential random variable X with PDF given by

images

Find the variance and standard deviation of X. Compare these numbers to the mean of the exponential distribution. What do you notice?

30. According to Thomson et al. (see Problem 36 in Problem Set 7.1), the elimination constant for lidocaine for patients with hepatic impairment is 0.12 per hour.

Determine the mean time μ for a lidocaine molecule to be eliminated.
Determine the fraction of lidocaine eliminated by time t = μ.

31. Donald Levin was quoted in Science Daily (see Problem 38 in Problem Set 7.1) as stating: “Roughly 20 of the 297 known mussel and clam species and 40 of about 950 fishes have perished in North America in the last century.”

Use these data to approximate the mean time to extinction constants for mussel and clam species and for fish species.
Determine the fraction of mussel and clam species and fish species that will be lost in the next century.

32. In Example 3 of Section 7.2, the following Pareto PDF was used to describe how many AOL users visited certain websites on one day in 1997:

images

Find the mean of this PDF.
Compute the variance for this PDF. What does the variance suggest about the variability in the number of hits that a website can experience?

33. Let X denote the number of years a patient lives after receiving treatment for an acute disease such as cancer. Under appropriate conditions, X is exponentially distributed. Suppose that the probability that a patient will live at least five years after treatment is 0.85.

Find the mean value of X.
Find the probability a patient will live at least ten years.

34. Based on data from 1974 to 2000 in Humboldt and Del Norte counties in California, the mean time to the next earthquake of magnitude ≥ 4 is approximately 2.5 weeks. Assuming that the time to the next earthquake is exponentially distributed, find the probability there will be an earthquake of magnitude ≥ 4 in the next week.

35. According to a newspaper article titled “Babies by the Dozen for Christmas: 24-Hour Baby Boom,” a record forty-four babies were born in one 24-hour period at the Mater Mothers' Hospital, Brisbane, Australia, on December 18, 1997. The article listed the times of birth for all of the babies.* The histogram of the times between births is as follows:

images

If this histogram has proportions {0.35, 0.25, 0.16, 0.12, 0.08, 0.02, 0.0, 0.0, 0.0, 0.02} centered on the values 7.5, 22.5, and so on every 15 minutes up to 142.5, then what is the mean time between births?
If this histogram is approximately exponentially distributed with a mean of 33.26 minutes between births, then what fraction of the times between births were less than 30 minutes? According to the histogram, what fraction of times between births were less than 30 minutes?
According to the exponential distribution, what fraction of the times between births were more than 75 minutes? Compare this with the actual fraction of times between births that are more than 80 minutes, as depicted in the histogram.

36. Here is a fun but challenging exercise: Construct a data set so that only 75.1% of the data lie in the interval [μ −2σ, μ + 2σ].

37. Let X be a continuous random variable with PDF f(x). Show that g(x) = f(x + a) is a PDF for the continuous random variable Y = X − a.

7.4 Bell-Shaped Distributions

An important class of PDFs has graphs that are bell-shaped (see center panel in Figure 7.18). In this section, we investigate two PDFs with this property: the logistic PDF and the normal PDF. Both are symmetric about their means and nonzero on the interval (−∞, ∞). Although logistic and normal PDFs have similar shapes, each is used to represent biological data arising from different types of processes. In this section, we also study random variables that are normal on a log scale. This lognormal distribution is right skewed or right tailed (also referred to as positively skewed—see Figure 7.18) because it is zero on (−∞, 0].

images

Figure 7.18 Symmetric and skewed PDFs

The logistic distribution

The logistic growth equation studied in Chapter 6 describes how populations change over time. In the next example, we show that solutions to the logistic equation can lead to a CDF for the logistic distribution.

Example 1 Logistic spread of diseases

Consider a population of individuals in which a disease is spreading, and individuals once infected remain infected (e.g., HIV/AIDS). Let y denote the fraction of infected individuals (also known as the prevalence of the disease), and let t denote time in months. If the rate of increase of infected individuals is proportional to the product of the fraction of infected individuals and the fraction of uninfected individuals, then

$$\frac{dy}{dt} = r\,y\,(1 - y)$$

where r is a constant that describes how rapidly the disease spreads in the population.

Assuming that r = 1 and y(0) = 0.5, solve for y(t).
Verify that y(t) is a CDF on −∞ < t < ∞.
Determine the probability that a randomly chosen individual from this population will be infected within the next two months.
Find the PDF associated with the CDF y(t) and show that it is symmetric.

Solution

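Separating and integrating yields

$$\int \frac{dy}{y(1-y)} = \int dt, \quad\text{so}\quad \ln\frac{y}{1-y} = t + C_1 \quad\text{and}\quad \frac{y}{1-y} = C_2 e^{t}, \text{ where } C_2 = e^{C_1}.$$

Using the initial condition y(0) = 0.5, we solve for C₂:

$$C_2 = \frac{0.5}{1 - 0.5} = 1.$$

Hence, y(t) = e^t/(1 + e^t).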
To verify that y(t) is a CDF for −∞ < t < ∞, we need to check three things (see Section 7.1). First, to see that y(t) is non-decreasing, we can take the derivative using the quotient rule

$$y'(t) = \frac{e^t(1 + e^t) - e^t \cdot e^t}{(1 + e^t)^2} = \frac{e^t}{(1 + e^t)^2}.$$

Since y′(t) > 0 for all t, y(t) is increasing. Second, we need to verify that lim_{t→∞} y(t) = 1 and lim_{t→−∞} y(t) = 0. Indeed, dividing the numerator and denominator of y(t) by e^t yields

$$\lim_{t\to\infty} y(t) = \lim_{t\to\infty} \frac{1}{e^{-t} + 1} = 1.$$

Similarly, lim_{t→−∞} y(t) = 0. Finally, since y(t) is continuous for all t, it is right continuous for all t.
The probability that a randomly chosen individual will get infected before the second month is y(2) = e²/(1 + e²) ≈ 0.88. The probability that a randomly chosen individual will get infected before t = 0 is y(0) = 0.5. Hence, the probability that a randomly chosen individual will get infected between t = 0 and t = 2 is y(2) − y(0) ≈ 0.38. In other words, there is an approximately 38% chance that a randomly chosen individual will get infected within two months.
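To find the PDF, we can use the fundamental theorem of PDFs; namely, the PDF f(t) is given by the derivative of the CDF:

$$f(t) = y'(t) = \frac{e^t}{(1 + e^t)^2}.$$

It follows that

$$f(-t) = \frac{e^{-t}}{(1 + e^{-t})^2} = \frac{e^{-t} \cdot e^{2t}}{(e^{t} + 1)^2} = \frac{e^{t}}{(1 + e^{t})^2} = f(t).$$

Hence, the PDF is symmetric around t = 0.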
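
The results of this example can be checked numerically. The sketch below assumes the solution y(t) = e^t/(1 + e^t) obtained for r = 1 and y(0) = 0.5; the function names are hypothetical, and the derivative is approximated with a central difference.

```python
import math


def logistic_cdf(t):
    """CDF y(t) = e^t / (1 + e^t) for r = 1 and y(0) = 0.5."""
    return math.exp(t) / (1.0 + math.exp(t))


def logistic_pdf(t):
    """PDF f(t) = e^t / (1 + e^t)^2, the derivative of the CDF."""
    return math.exp(t) / (1.0 + math.exp(t)) ** 2


# Probability of becoming infected between t = 0 and t = 2 months.
prob_two_months = logistic_cdf(2.0) - logistic_cdf(0.0)

# Central-difference check that the PDF is the derivative of the CDF at t = 1.
h = 1e-6
derivative_at_1 = (logistic_cdf(1.0 + h) - logistic_cdf(1.0 - h)) / (2.0 * h)

print(round(prob_two_months, 4))
print(derivative_at_1, logistic_pdf(1.0))
print(logistic_pdf(-3.0), logistic_pdf(3.0))  # symmetry around t = 0
```

The first printed value reproduces the roughly 38% chance found in part c, and the remaining lines illustrate parts b and d.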

In the previous example, we selected y(0) = 0.5, resulting in a symmetric PDF around zero. Thus, the mean is zero, provided that the associated improper integral is convergent. More generally, we can derive a logistic PDF for any initial condition y(0), as well as for any arbitrary r > 0, in which case the PDF is symmetric around a value other than zero. In particular, Problem Set 7.4 asks you to show the following:

Logistic PDF and CDF

A solution y(t) to dy/dt = ry(1 − y) with y(0) ∈ (0, 1) and r > 0 gives a CDF of the following form:

$$y(t) = \frac{1}{1 + e^{a - rt}}$$

where a = ln(1/y(0) − 1). This CDF corresponds to the logistic distribution. The associated PDF is

$$f(t) = \frac{r\,e^{a - rt}}{\left(1 + e^{a - rt}\right)^2}$$

Note that a > 0 if y(0) ∈ (0, 0.5) and a < 0 if y(0) ∈ (0.5, 1).

Example 2 Playing with the logistic PDF

Assume that the logistic PDF describes the distribution of infection times. Let r be the intrinsic rate of growth of the disease and y(0) the fraction of individuals infected during week t = 0.

Consider a disease for which r = 1. Determine the fraction of people infected by the disease in the next two weeks if y(0) = 0.25 or y(0) = 0.75.
Use technology to plot the PDF for r = 1 and y(0) = 0.25, 0.5, and 0.75. Discuss what you find.
Consider a disease for which y(0) = 0.1. Determine the fraction of people infected by the disease in the next two weeks if the intrinsic rate of growth is r = 0.5 or r = 5.
Use technology to plot the PDF for y(0) = 0.1 and r = 0.5, 1, and 5. Discuss what you find.

Solution

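If r = 1 and y(0) = 0.25, then a = ln(1/y(0) − 1) ≈ 1.1 and the CDF is given by

$$y(t) = \frac{1}{1 + e^{1.1 - t}}.$$

The fraction of people infected in the next two weeks is given by

$$y(2) - y(0) = \frac{1}{1 + e^{-0.9}} - 0.25 \approx 0.71 - 0.25 = 0.46.$$

Hence, 46% are infected in the next two weeks. If r = 1 and y(0) = 0.75, then a = ln(1/y(0) − 1) ≈ −1.1 and the CDF is given by

$$y(t) = \frac{1}{1 + e^{-1.1 - t}}.$$

The fraction of people infected in the next two weeks is given by

$$y(2) - y(0) = \frac{1}{1 + e^{-3.1}} - 0.75 \approx 0.96 - 0.75 = 0.21.$$

Hence, only 21% are infected in the next two weeks.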
Using technology, we obtain the PDFs plotted in Figure 7.19. These plots illustrate that as we increase y(0), the “center” of the PDF tends to move to the left. In other words, as the fraction of individuals infected at week 0 increases, the time-to-infection for all individuals decreases.

Figure 7.19 Plots of logistic PDFs for the case r = 1 and a = ln(1/y(0) − 1), where y(0) = 0.75 (red curve), y(0) = 0.5 (green curve), and y(0) = 0.25 (blue curve)
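If y(0) = 0.1, then a = ln(1/y(0) − 1) ≈ 2.2. Hence, if r = 0.5, then the CDF is given by

$$y(t) = \frac{1}{1 + e^{2.2 - 0.5t}}.$$

The fraction of individuals infected in the next two weeks is given by

$$y(2) - y(0) = \frac{1}{1 + e^{1.2}} - 0.1 \approx 0.23 - 0.1 = 0.13.$$

Hence, 13% are infected in the next two weeks. Alternatively, if r = 5, then the CDF is given by

$$y(t) = \frac{1}{1 + e^{2.2 - 5t}}.$$

The fraction of individuals infected in the next two weeks is given by

$$y(2) - y(0) = \frac{1}{1 + e^{-7.8}} - 0.1 \approx 1.00 - 0.1 = 0.90.$$

Hence, 90% are infected in the next two weeks.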
Using technology, we obtain the PDFs plotted in Figure 7.20.

Figure 7.20 Plots of logistic PDFs for the case a = 2.197 (corresponding to y(0) = 0.1), and r = 0.5 (blue curve), r = 1.0 (green curve) and r = 5.0 (red curve)

These plots illustrate that as we increase r, the “center” of the PDF tends to move to the left and the spread around the center decreases. In other words, for diseases that spread quickly (i.e., r is larger), most people catch the disease quickly and around the same time. For diseases that spread slowly (i.e., r is smaller), there is greater variability in the time it takes for a person to get infected, although in the logistic model everyone gets infected in the end.
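The two-week fractions in Example 2 can be checked with a few lines of code. The following is a minimal sketch in Python using only the standard library; the function names logistic_cdf and fraction_infected are our own choices for illustration and are not part of the text.

```python
import math

def logistic_cdf(t, r, y0):
    """Logistic CDF y(t) = 1 / (1 + exp(a - r t)), with a = ln(1/y0 - 1)."""
    a = math.log(1.0 / y0 - 1.0)
    return 1.0 / (1.0 + math.exp(a - r * t))

def fraction_infected(r, y0, weeks=2.0):
    """Fraction newly infected between week 0 and week `weeks`: y(weeks) - y(0)."""
    return logistic_cdf(weeks, r, y0) - logistic_cdf(0.0, r, y0)

# The four parameter combinations considered in Example 2
for r, y0 in [(1.0, 0.25), (1.0, 0.75), (0.5, 0.1), (5.0, 0.1)]:
    print(f"r = {r}, y(0) = {y0}: {fraction_infected(r, y0):.2f}")
```

Because the code uses the unrounded value of a, its results may differ slightly in the third decimal place from hand calculations that round a to 1.1 or 2.2.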

Example 2 illustrated how the mean and variance of the logistic distribution are affected by the parameters r and y(0). The following example determines the mean of the logistic distribution.

Example 3 Mean of the logistic PDF

Let

$$y(t) = \frac{1}{1 + e^{a - rt}}$$

and

$$f(t) = \frac{r e^{a - rt}}{\left(1 + e^{a - rt}\right)^2}$$

be the CDF and PDF of the logistic distribution. Do the following:

Find t = T such that y(T) = 0.5. In other words, find T such that half of the data lie to the left of T and half of the data lie to the right of T.
Verify that f(t) is symmetric around t = T.
Assuming the mean is well defined, find the mean.

In Problem Set 7.4, you will be asked to verify that the mean is well defined, that is, that the improper integral $\int_{-\infty}^{\infty} t f(t)\,dt$ is convergent.

Solution

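Solving y(T) = 1/2 yields 1 + e^(a − rT) = 2, so e^(a − rT) = 1 and T = a/r.
To check symmetry of f(t) about t = T, we need to verify that f(a/r + t) = f(a/r − t) for all t. Indeed, multiplying the numerator and denominator by e^(2rt), we have

$$f\left(\frac{a}{r} + t\right) = \frac{r e^{-rt}}{\left(1 + e^{-rt}\right)^2} = \frac{r e^{rt}}{\left(e^{rt} + 1\right)^2} = f\left(\frac{a}{r} - t\right)$$

Since f is symmetric around a/r, the mean is given by μ = a/r provided that the integral $\int_{-\infty}^{\infty} t f(t)\,dt$ is convergent.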

In addition to determining the mean, it is possible—but more challenging—to compute the variance of the logistic PDF.

Mean and Variance of the Logistic PDF

The mean of the logistic PDF f(t) = r e^(a − rt)/(1 + e^(a − rt))² is

$$\mu = \frac{a}{r}$$

and the variance is

$$\sigma^2 = \frac{\pi^2}{3r^2}$$

The logistic distribution can also describe the spread of an organism across a landscape.

Example 4 Organismal spread

Pyura praeputialis is a large tunicate (a species of sea squirt reaching lengths of up to 35 cm) that, in Chile, is distributed exclusively along 60 to 70 km of coastline in and around the bay of Antofagasta. This tunicate is a sessile, dominant species, capable of forming extensive beds of barrel-like individuals tightly cemented together in rocky intertidal and shallow subtidal zones. Using experimental quadrats, biologist Jorge Alvarado and colleagues* investigated recolonization dynamics of P. praeputialis in Chile after removal of adult individuals. Alvarado and colleagues found that the fraction of occupied habitat is approximately

$$y(t) = \frac{1}{1 + e^{4 - 1.7t}}$$

where t is measured in hundreds of days. For a randomly chosen point in the habitat, what is the mean time for it to be occupied?

Solution Since r = 1.7 and a = 4, the mean time for a location to become occupied is a/r = 4/1.7 ≈ 2.35 hundred days. Therefore, on average, it takes 235 days for a randomly chosen location to become occupied.
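The formulas for the mean and variance of the logistic PDF can be checked numerically. The sketch below is an illustration only: it assumes the parameter values r = 1.7 and a = 4 from Example 4, approximates the improper integrals with a midpoint rule on the truncated interval [−40, 40] (a truncation we chose because the tails of the PDF are negligible there), and uses helper names of our own invention.

```python
import math

def logistic_pdf(t, r, a):
    """Logistic PDF f(t) = r e^(a - r t) / (1 + e^(a - r t))^2."""
    e = math.exp(a - r * t)
    return r * e / (1.0 + e) ** 2

def expectation(g, r, a, lo=-40.0, hi=40.0, n=20_000):
    """Midpoint-rule estimate of the integral of g(t) f(t) over [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        t = lo + (i + 0.5) * h
        total += g(t) * logistic_pdf(t, r, a)
    return total * h

r, a = 1.7, 4.0  # values from Example 4 (t in hundreds of days)
mean = expectation(lambda t: t, r, a)
variance = expectation(lambda t: (t - mean) ** 2, r, a)

print(f"numerical mean     = {mean:.4f}, a/r          = {a / r:.4f}")
print(f"numerical variance = {variance:.4f}, pi^2/(3 r^2) = {math.pi ** 2 / (3 * r ** 2):.4f}")
```

The numerical mean agrees with a/r ≈ 2.35 (hundreds of days), and the numerical variance agrees with π²/(3r²).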

An important type of regression analysis is associated with the logistic PDF. In Chapter 1, we demonstrated fitting linear models y = ax + b to data and inferring values for y from values of x. Suppose we want to infer the probability p of a certain event occurring associated with some measurement t. For example, t could be the age of a healthy cow and p could be the probability that this cow will die during the next year. If we have data on the proportion p(t) of cows that have died at or before t, then we may consider fitting the logistic CDF to data. A method for fitting the logistic CDF to data is illustrated in the following two examples.

Example 5 Transforming the logistic into a linear equation

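Show the function y = ln(p(t)/(1 − p(t))) is linear in t when p(t) is a logistic CDF.

Solution Since p(t) = 1/(1 + e^(a − rt)), we get the following:

$$1 - p(t) = \frac{e^{a - rt}}{1 + e^{a - rt}}, \qquad \frac{p(t)}{1 - p(t)} = e^{rt - a}, \qquad y = \ln\frac{p(t)}{1 - p(t)} = rt - a$$

Example 5 implies that if we have a set of n data points, (t₁, p₁),...,(t_n, p_n), whose distribution we want to describe with the logistic CDF, it suffices to find the best-fitting line through the transformed data, (t_i, ln(p_i/(1 − p_i))) for i = 1,..., n. The slope of this line estimates r and its intercept estimates −a. Finding the best-fitting parameters r and a in this manner is called logistic regression.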
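The fitting procedure just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the method used to produce any table in the text: the data are synthetic, generated without noise from a logistic CDF with r = 0.76 and a = 4.9, and the function names logit and fit_logistic are our own. The fit is an ordinary least-squares line through the transformed points; every p value must lie strictly between 0 and 1 for the transformation to be defined.

```python
import math

def logit(p):
    """Transformation y = ln(p / (1 - p)); requires 0 < p < 1."""
    return math.log(p / (1.0 - p))

def fit_logistic(ts, ps):
    """Least-squares line y = r t - a through the points (t_i, logit(p_i)).

    Returns the pair (r, a)."""
    ys = [logit(p) for p in ps]
    n = len(ts)
    t_bar = sum(ts) / n
    y_bar = sum(ys) / n
    slope = (sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys))
             / sum((t - t_bar) ** 2 for t in ts))
    intercept = y_bar - slope * t_bar
    return slope, -intercept

# Synthetic, noise-free data (hypothetical, for illustration only)
true_r, true_a = 0.76, 4.9
ts = [0.5 + i for i in range(11)]  # 0.5, 1.5, ..., 10.5
ps = [1.0 / (1.0 + math.exp(true_a - true_r * t)) for t in ts]

r_hat, a_hat = fit_logistic(ts, ps)
print(f"r = {r_hat:.2f}, a = {a_hat:.2f}")
```

With noise-free data the fitted line recovers the generating parameters; with real data the transformed points scatter around the line, and proportions very close to 0 or 1 can have a large influence on the fit.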

Example 6 Logistic regression

A medical researcher used chemicals to induce the growth of prostate tumors in several hundred male rats. He surgically removed the resulting tumors after 150 days. Then he measured the cumulative proportion of individuals that had the tumors return within 90 days as a function of the size of the original tumor that he removed. The results are given in the first two columns in Table 7.2. Find the best-fitting logistic equation to this data set.

Table 7.2 Cumulative proportion p of rats growing new prostate tumors as a function of weight t (grams) of original tumor removed

Solution We use the transformation y = ln(p/(1 − p)) of the p values in the second column of Table 7.2 to obtain the third column of this table. For the t values, we select the midpoint values t₁ = 0.5, t₂ = 1.5,..., t₁₀ = 9.5, and for the last bin we use t₁₁ = 10.5, even though it represents all weights ≥ 10. Using technology to find the best-fitting line, we get y = 0.76t − 4.9; that is, r = 0.76 and a = 4.9. The transformed data and regression line are illustrated in Figure 7.21; we obtain a very good fit, except at the highest weight of 10 g.

Figure 7.21 Linear regression on transformed logistic data

Normal distribution

The most ubiquitous probability distribution in the natural and behavioral sciences is the normal distribution. For example, the normal distribution describes the distributions of heights (see Example 5 of Section 7.1) and weights of many organisms, crop yields (see Example 7), human IQs (see Example 9), and much more. Its ubiquity stems from this fact: if each data point is under the influence of many, independent, additive effects, then one can prove that the distribution of the data is well approximated by the normal distribution. The normal distribution is also known as the Gaussian distribution, after the German mathematician Karl Friedrich Gauss (1777–1855), who is shown in Figure 7.22 with the normal distribution in the background.

Figure 7.22 Deutsche mark showing Karl Gauss and the normal distribution in the background

PDF of the Normal Distribution

The PDF of the normal distribution is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2/(2\sigma^2)}$$

where μ is the mean of the distribution and σ is the standard deviation.

The effect of increasing μ on this distribution is to move the graph to the right. The standard deviation σ, on the other hand, controls the spread of the distribution about its center. For small σ, the distribution is more peaked or concentrated around the mean; for larger σ, the distribution is fatter with a lower, broader peak, as illustrated in Figure 7.23. As we discussed in Chapter 5, there is no elementary representation of the antiderivative of f(x). Hence, we use numerical estimates.

Figure 7.23 Normal distributions, with mean 0 and standard deviations of 1 (blue) and 2 (red)

Example 7 Wheat yields

In 1910, W. B. Mercer and A. D. Hall conducted a wheat yield experiment at Rothamsted Experimental Station in Great Britain. In 500 identical plots, wheat was grown and yield (in bushels) was recorded. The resulting histogram of the data is approximately normal, as illustrated in Figure 7.24.

Figure 7.24 Histogram for the Rothamsted experiment

Data Source: Mercer, W. B. and Hall, A. D. (1911). The Experimental Error of Field Trials. Journal of Agricultural Science 4, 107–132.

The mean of the data is 3.95 bushels and the standard deviation is 0.45 bushels. Use numerical integration to approximate the following quantities:

The likelihood that the yield in a randomly chosen plot is between 3.5 and 4.5 bushels
The likelihood that the yield in a randomly chosen plot is at least 5 bushels

Solution For this problem, we have

$$f(x) = \frac{1}{0.45\sqrt{2\pi}}\, e^{-(x - 3.95)^2/(2 \cdot 0.45^2)}$$

Integrating $\int_{3.5}^{4.5} f(x)\,dx$ numerically using technology yields 0.731 (to three decimal places). Hence, there is approximately a 73% chance the yield will be between 3.5 and 4.5 bushels.
Integrating $\int_{5}^{\infty} f(x)\,dx$ numerically using technology yields 0.00982 (to three significant digits). Hence, there is slightly less than a 1% chance the yield will be at least 5 bushels.
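One way to carry out these numerical integrations is sketched below in Python. The text does not say which technology was used, so this is only an illustration: the helper simpson is our own implementation of the composite Simpson's rule, and NormalDist is the normal-distribution class from Python's standard statistics module, whose cdf method provides an independent check.

```python
from statistics import NormalDist

def simpson(f, a, b, n=200):
    """Composite Simpson's rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

# Normal model for the wheat yields in Example 7
yield_dist = NormalDist(mu=3.95, sigma=0.45)

# Part a: integrate the PDF from 3.5 to 4.5
p_between = simpson(yield_dist.pdf, 3.5, 4.5)

# Part b: the upper tail is 1 minus the CDF at 5
p_at_least_5 = 1.0 - yield_dist.cdf(5.0)

print(f"P(3.5 <= X <= 4.5) = {p_between:.3f}")
print(f"P(X >= 5)          = {p_at_least_5:.5f}")
```

The two values agree with the 0.731 and 0.00982 reported in the solution. For the infinite interval in part b, using the CDF avoids having to choose a finite cutoff for the integration.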

Aside from using numerical integrators, we can use tables to estimate areas under normal densities. At first, you might think that we need an infinite number of tables to deal with all possible values of μ and σ. However, this is not the case. Using a simple substitution, we can reduce everything to a question about one normal distribution, the standard normal distribution.

Standard Normal Distribution

A random variable Z has a standard normal distribution if it is normally distributed with mean 0 and standard deviation 1; that is, it has the PDF

$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$

The following example illustrates how all questions about normal distributions can be reformulated as a question about the standard normal distribution.

Example 8 From arbitrary normal to standard normal distributions

Let X be normally distributed with mean μ and standard deviation σ. Let Z be normally distributed with mean 0 and standard deviation 1. Show that for any a,

$$P(X \le a) = P\left(Z \le \frac{a - \mu}{\sigma}\right)$$

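Solution Since X has a normal distribution with mean μ and standard deviation σ, we have that

$$P(X \le a) = \int_{-\infty}^{a} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2/(2\sigma^2)}\,dx$$

Consider the change of variables z = (x − μ)/σ. Then dz = dx/σ, z = (a − μ)/σ when x = a, z → −∞ as x → −∞, and

$$P(X \le a) = \int_{-\infty}^{(a - \mu)/\sigma} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\,dz = P\left(Z \le \frac{a - \mu}{\sigma}\right)$$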

z Scores

Since questions about normally distributed random variables or data can be converted to questions about the standard normal distribution, we can use z scores to determine what fraction of normally distributed data lies between the data's mean and z standard deviations above the mean. Table 7.3 reports z scores where rows determine the first two digits of z and columns the third digit of z. For example, in the row labeled (at the left) 1.0 and in the column headed 0.00, the entry is 0.3413. Therefore, for data with a standard normal distribution, the fraction of data lying in the interval [0, 1] is 34.13%. Alternatively, suppose the data have a standard normal distribution and we want to know what fraction of the data lies in the interval (−∞, 1.68]. The z table tells us the fraction of data lying in the interval [0, 1.68] is 0.4535. Since 50% of the data lies in (−∞, 0], the fraction in the interval (−∞, 1.68] is 0.4535 + 0.5 = 0.9535, as shown in the top panel of Figure 7.25. By contrast, for z < 0, the symmetry of the standard normal distribution around 0 implies that subtraction, rather than addition, is required. For example, if z = −1.2, as shown in the bottom panel of Figure 7.25, the table entry for 1.20 is 0.3849, so 50% − 38.49% = 11.51% of the area under the standard normal PDF lies on the interval (−∞, −1.2].

Figure 7.25 Using z scores to calculate area under a standard normal PDF
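The table lookups described above can be mimicked in software. The sketch below is an illustration only; it assumes Python's standard statistics.NormalDist class for the standard normal CDF, and the names table_entry and left_tail are our own.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

def table_entry(z):
    """Area under the standard normal PDF between 0 and |z| (the kind of value a z table lists)."""
    return Z.cdf(abs(z)) - 0.5

def left_tail(z):
    """P(Z <= z): add the table entry to 0.5 when z >= 0, subtract it when z < 0."""
    if z >= 0:
        return 0.5 + table_entry(z)
    return 0.5 - table_entry(z)

for z in (1.0, 1.68, -1.2):
    print(f"z = {z:5.2f}: table entry = {table_entry(z):.4f}, P(Z <= z) = {left_tail(z):.4f}")
```

The function left_tail is written with an explicit branch only to mirror the add-or-subtract rule used with the table; calling Z.cdf(z) directly gives the same number.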

Table 7.3 Standard normal distribution

Example 9 IQ score in socially disparate communities

Psychologists and sociologists use scores on standardized intelligence quotient (IQ) tests to predict performance outcomes of individuals in different parts of society. In a study conducted by Naomi Breslau and colleagues, subjects from communities in southeastern Michigan and the city of Detroit had their IQ tested at age 6 and then again five years later at age 11. The summary statistics are given in Table 7.4.

Assume the distribution of IQs in each of the categories can be reasonably well approximated by a normal distribution. Determine the proportion of six-year-olds who have an IQ less than or equal to 110 in each of the two normal birth-weight groups.

Table 7.4 Mean score with standard deviation in parentheses of IQ measurements at age 6 and 5 years later at age 11 for children in Michigan stratified by birth weight and home location into eight cases

Solution Let X be the (approximately) normally distributed IQ score of a randomly chosen six-year-old from the urban community. To find P(X ≤ 110), Example 8 tells us it is sufficient to find P(Z ≤ z) where z = (110 − μ)/σ, μ is the mean of X, and σ² is the variance of X. From Table 7.4, we have μ = 99.1 and σ = 14.0. Therefore,

$$z = \frac{110 - 99.1}{14.0} \approx 0.78$$

From Table 7.3, we find that z = 0.78 corresponds to the probability 0.2823. To this we have to add 0.5 for the area to the left of 0. Thus, the desired probability is

$$P(X \le 110) \approx 0.5 + 0.2823 = 0.7823$$

or 78% of the population have an IQ of 110 or less.

To answer this question for the suburban community, Table 7.4 tells us that μ = 113.3 and σ = 15.4. Therefore, for this community

$$z = \frac{110 - 113.3}{15.4} \approx -0.21$$

Using Table 7.3 and the symmetry of the standard normal distribution, we get

$$P(X \le 110) \approx 0.5 - 0.0832 = 0.4168$$

or 42% of the population have an IQ of 110 or less.

Thus, 78% of normal-birth-weight urban six-year-olds, but only 42% of normal-birth-weight suburban six-year-olds, have an IQ of 110 or less.
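The standardization step of Example 8, applied to the two groups in Example 9, can be sketched as follows. This is an illustration in Python with a helper name (prob_at_most) of our own; it uses the means and standard deviations quoted in the solution and compares the z-score route with a direct evaluation of the normal CDF.

```python
from statistics import NormalDist

def prob_at_most(a, mu, sigma):
    """Return (z, P(X <= a)) for X normal with mean mu and standard deviation sigma,
    computed through the z score z = (a - mu) / sigma."""
    z = (a - mu) / sigma
    return z, NormalDist().cdf(z)

# Means and standard deviations quoted in the solution of Example 9
groups = [("urban", 99.1, 14.0), ("suburban", 113.3, 15.4)]

for label, mu, sigma in groups:
    z, p = prob_at_most(110, mu, sigma)
    direct = NormalDist(mu, sigma).cdf(110)  # same probability without standardizing
    print(f"{label:9s} z = {z:6.3f}  P(X <= 110) = {p:.3f}  (direct: {direct:.3f})")
```

Small differences from the table-based answers arise because the table requires rounding z to two decimal places.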

Lognormal distribution

One potential problem with a normally distributed random variable X is that it can be negative; typically, however, biological data only assume positive values (e.g., height or weight). This issue may not be a problem if only the extreme tail of the distribution is associated with a negative value of X, as in the wheat yield example illustrated in Figure 7.24. Sometimes, however, the “normality” of a data set is not apparent until it has been appropriately transformed to take values on (−∞, ∞). For example, a common transformation for data sets of positive values is taking the natural logarithm of all the data values. Data that exhibit a normal distribution after such a transformation are said to be lognormally distributed.

Lognormal Distribution I

A random variable X is lognormally distributed if ln X is normally distributed. In other words, there exist parameters μ and σ > 0 such that

$$P(\ln X \le a) = \int_{-\infty}^{a} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2/(2\sigma^2)}\,dx$$

for any real number a.

Example 10 Strep throat incubation periods

The incubation period is the time elapsed between exposure to a pathogen and the time when symptoms and signs are first apparent. For streptococcal sore throat, this period (in hours) has been found to be lognormally distributed; the parameters of the log-transformed distribution are estimated to be μ = 4.07 and σ = 0.42 (see Figure 7.26). Using these estimates, find the following quantities:

The fraction of individuals who start exhibiting symptoms within the first two days
The fraction of individuals who start exhibiting symptoms after three days

Figure 7.26 Incubation times in hours for a streptococcal sore throat.

Data Source: P. E. Sartwell, “The Distribution of Incubation Periods of Infectious Diseases,” American Journal of Epidemiology 141 (1995): 386–394.

Solution

Let X be the number of hours for a randomly chosen incubation period. For this kind of data, it is usually reasonable to assume (an assumption that can be checked) that ln X is approximately normal. We want to know P(X ≤ 48) or, equivalently, P(ln X ≤ ln 48). The desired z value for this probability is

$$z = \frac{\ln 48 - 4.07}{0.42} \approx -0.47$$

Using the z table (Table 7.3), we get

$$P(Z \le -0.47) = 0.5 - 0.1808 = 0.3192$$

Hence, 32% of the people exhibited symptoms in the first two days.
We want to know P(X > 72) or, equivalently, P(ln X > ln 72). The desired z value for this probability is

$$z = \frac{\ln 72 - 4.07}{0.42} \approx 0.49$$

Using the z table (Table 7.3), we get

$$P(Z > 0.49) = 0.5 - 0.1879 = 0.3121$$

Hence, 31% of the people exhibited symptoms after three days.
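Because ln X is normal, lognormal probabilities reduce to normal CDF evaluations at logarithms. A minimal sketch in Python, assuming the estimates μ = 4.07 and σ = 0.42 from Example 10 and using the standard statistics.NormalDist class (the variable names are our own):

```python
import math
from statistics import NormalDist

# Distribution of ln X, where X is the incubation period in hours
log_incubation = NormalDist(mu=4.07, sigma=0.42)

# P(X <= 48) = P(ln X <= ln 48): symptoms within the first two days
p_within_two_days = log_incubation.cdf(math.log(48))

# P(X > 72) = 1 - P(ln X <= ln 72): symptoms after three days
p_after_three_days = 1.0 - log_incubation.cdf(math.log(72))

print(f"P(X <= 48 hours) = {p_within_two_days:.3f}")
print(f"P(X > 72 hours)  = {p_after_three_days:.3f}")
```

Both values round to the percentages found with the z table, 32% and 31%.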

The following example determines the PDF for the lognormal distribution and explores the effects of the parameters μ and σ on the shape of the distribution.

Example 11 Lognormal PDF

Let X be a random variable such that ln X has a normal distribution with mean μ and standard deviation σ.

Use a change of variables to find the PDF for X.
For σ = 1, plot the PDF of X with μ = −1, 0, and 1. Discuss how μ influences the shape of X's PDF.
For μ = 1, plot the PDF of X with σ = 0.5, 1, and 1.5. Discuss how σ influences the shape of X's PDF.

Solution

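Since ln X is normally distributed, the PDF of ln X is given by

$$g(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y - \mu)^2/(2\sigma^2)}$$

To determine the PDF of X, let's begin by finding an expression for P(X ≤ a) for any positive real number a. Since X ≤ a if and only if ln X ≤ ln a, we obtain

$$P(X \le a) = P(\ln X \le \ln a) = \int_{-\infty}^{\ln a} g(y)\,dy$$

By the fundamental theorem of PDFs, the PDF of X is given by

$$f(x) = \frac{d}{dx}\int_{-\infty}^{\ln x} g(y)\,dy = \frac{g(\ln x)}{x} = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-(\ln x - \mu)^2/(2\sigma^2)}$$

for x > 0.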
Using technology to plot the PDF of X with μ = −1, 0, and 1 and σ = 1 yields the following graph:

Increasing μ moves the center of the distribution to the right and increases the spread of the distribution about the center.
Using technology to plot the PDF of X with σ = 0.5, 1, and 1.5 and μ = 0 yields the following graph:

Increasing σ moves the center of the distribution to the left but still increases the spread of the distribution, as represented by the size of the tails (i.e., the area under the curve from x = 2 to 3).
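The lognormal PDF found in Example 11 is easy to evaluate directly, which gives one way to explore how μ and σ move its peak. The sketch below is an illustration in Python with function names of our own. It relies on the fact, obtained by setting f′(x) = 0, that the PDF peaks at x = e^(μ − σ²); the choice μ = 1 in the second loop follows the problem statement and is otherwise arbitrary.

```python
import math

def lognormal_pdf(x, mu, sigma):
    """PDF of X when ln X is normal with mean mu and standard deviation sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def peak_location(mu, sigma):
    """Value of x at which the lognormal PDF is largest: exp(mu - sigma^2)."""
    return math.exp(mu - sigma ** 2)

# Effect of mu with sigma = 1: the peak moves to the right as mu increases
for mu in (-1.0, 0.0, 1.0):
    print(f"sigma = 1, mu = {mu:4.1f}: peak at x = {peak_location(mu, 1.0):.3f}")

# Effect of sigma with mu fixed: the peak moves to the left as sigma increases
for sigma in (0.5, 1.0, 1.5):
    print(f"mu = 1, sigma = {sigma:3.1f}: peak at x = {peak_location(1.0, sigma):.3f}")
```

Evaluating lognormal_pdf on a grid of x values and passing the results to any plotting tool produces curves of the kind discussed in the example.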

In Problem 37 in Problem Set 7.4, you will be asked to show that the mean and variance of the lognormal distribution satisfy the relationships given below.

Lognormal Distribution II

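The PDF of the lognormal distribution is defined in terms of two parameters, μ and σ > 0, by the function

$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-(\ln x - \mu)^2/(2\sigma^2)}, \qquad x > 0$$

The mean m and variance v of this distribution are given by

$$m = e^{\mu + \sigma^2/2}$$

and

$$v = \left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2}$$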

Example 12 Survival of moths

An entomologist needs adult moths for her wind tunnel studies on how moths navigate their way in flight using pheromones in an odor plume. In a pilot study, she reared the moths from eggs until they eclosed from their pupal stage; then she selected 194 of the healthiest looking individuals for her flight studies. In Table 7.5, the number of moths dying each week is given until the last moth died in the 28th week. Now do the following:

Calculate the proportion of moths dying each week.
Calculate the mean and variance of age at death.
Calculate the parameters for the lognormal distribution using the estimates from part b and plot the normal and lognormal distributions based on these estimates against the data.

Table 7.5 Number of moths dying each week (values rounded to three decimal places for presentation purpose)*

Solution

Since the total number of moths at the beginning of the first week is 194, the proportion dying in week i (i = 1,..., 28) is the number dying in that week divided by 194. See column 3 of Table 7.5.
The mean age at death m is obtained from the calculation

$$m = \sum_{i=1}^{28} \left(i - \tfrac{1}{2}\right) p_i$$

where p_i is the proportion of moths dying in week i. Note that we selected the midpoint of each week to represent the point at which all individuals die during the week. This, of course, is an approximation, but some approximation must be used because of the discrete nature of the problem. The answer obtained using the given formula is 8.42. The variance v associated with age at death is obtained from the calculation

$$v = \sum_{i=1}^{28} \left(i - \tfrac{1}{2} - m\right)^2 p_i$$

The answer obtained is 25.06.
If the observed mean and variance are m = 8.46 and v = 25.15, then we need to use the relationships m = e^(μ + σ²/2) and v = (e^(σ²) − 1)e^(2μ + σ²) to find parameters μ and σ for a lognormal distribution with the observed mean m and variance v. In Problem 38 of Problem Set 7.4, you are asked to show that the resulting equations are

$$\mu = \ln m - \frac{1}{2}\ln\left(1 + \frac{v}{m^2}\right)$$

and

$$\sigma^2 = \ln\left(1 + \frac{v}{m^2}\right)$$

Solving these yields μ = 1.98 and σ² = 0.30. The lognormal distribution generated by these parameters is plotted in red in Figure 7.27. In contrast, the normal distribution with mean 8.46 and variance 25.15 is plotted in black in Figure 7.27. Clearly, the lognormal distribution provides a much better fit to the data.

Figure 7.27 Fraction of moths dying each week is plotted over the 28-week period for the actual data (closed circles), as well as the lognormal (red curve) and normal (black curve) distributions that have the same mean and variance as the data.
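The conversion from an observed mean m and variance v to lognormal parameters can be sketched in a few lines of Python. This is an illustration with function names of our own. It uses the fact that dividing v by m² in the two moment relationships gives v/m² = e^(σ²) − 1, so σ² = ln(1 + v/m²) and then μ = ln m − σ²/2.

```python
import math

def lognormal_params(m, v):
    """Parameters (mu, sigma^2) of the lognormal distribution with mean m and variance v."""
    sigma2 = math.log(1.0 + v / m ** 2)
    mu = math.log(m) - sigma2 / 2.0
    return mu, sigma2

def lognormal_moments(mu, sigma2):
    """Mean and variance of the lognormal distribution with parameters mu and sigma^2."""
    m = math.exp(mu + sigma2 / 2.0)
    v = (math.exp(sigma2) - 1.0) * math.exp(2.0 * mu + sigma2)
    return m, v

# Observed mean and variance used in part c of Example 12
mu, sigma2 = lognormal_params(8.46, 25.15)
print(f"mu = {mu:.2f}, sigma^2 = {sigma2:.2f}")

# Round trip: the fitted parameters reproduce the observed moments
print(lognormal_moments(mu, sigma2))
```

The round trip through lognormal_moments is a useful check that the fitted parameters reproduce the observed mean and variance.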

PROBLEM SET 7.4

Level 1 DRILL PROBLEMS

Assume that a data set is normally distributed with a mean of 0 and a standard deviation of 1. A value X is randomly selected. Find the probability requested in Problems 1 to 4.

1. P(0 ≤ X < 0.85)

2. P(X ≤ 0)

3. P(X ≥ 0.55)

4. P(−1.00 < X < 0.75)

5. In the PDF of the standard normal distribution, find the area under the PDF bounded by the lines z = 1.20 and z = 1.90 and compare this with the value of z = 1.90 − 1.20 = 0.70 in Table 7.3.

6. For X normally distributed with mean μ = −1 and standard deviation σ = 1, calculate P(X ≥ 0).

7. For X normally distributed with mean μ = 1 and standard deviation σ = 2, calculate P(X > 0).

8. For X normally distributed with mean μ = −2 and standard deviation σ = 2, calculate P(−3.00 < X < −1.00).

9. For X lognormally distributed with log mean μ = −2 and log standard deviation σ = 2, calculate P(e⁻³ < X < e⁻¹).

10. For X lognormally distributed with log mean μ = 0, calculate P(0 < X < 1).

11. For X lognormally distributed with log mean μ = 0 and log standard deviation σ = 1, calculate P(0 < X < 0.5).

12. For X lognormally distributed with log mean μ = 1 and log standard deviation σ = 2, calculate P(1 < X < 4).

13. Example 4 discussed the spatial spread of the large tunicate Pyura praeputialis on the Chilean coast. In this example, the fraction of habitat occupied by this tunicate species at time t equals

$$y(t) = \frac{1}{1 + e^{4 - 1.7t}}$$

where t is measured in hundreds of days.

What fraction of habitat was occupied on day t = 0?
What fraction of habitat was occupied at 100 days?
At what point in time will 95% of the habitat be covered?

14. Suppose the fraction of habitat occupied by a tunicate species at time t equals

where t is measured in hundreds of days. Let T be the time a randomly chosen location becomes occupied. Find the mean, median, and variance of T.

15. Consider Example 1 with r = 0.1 (units per month) and y(0) = 0.5 (a relatively slow-spreading disease).

Solve the differential equation for y(t).
Verify that y(t) is a CDF.
Find the probability that a randomly chosen individual is infected with the disease in the next two months.

16. Consider Example 1 with r = 3 (units per month) and y(0) = 0.5 (a relatively fast-spreading disease).

Solve the differential equation for y(t).
Verify that y(t) is a CDF.
Find the probability that a randomly chosen individual is infected with the disease in fifteen days (i.e., 0.5 months).

17. Consider Example 1 with r = 1 (units per month) and y(0) = 0.1 (i.e., 10% of the population have the disease).

Solve the differential equation for y(t).
Verify that y(t) is a CDF.
Find the probability that a randomly chosen individual is infected with the disease in one month.

18. Consider Example 1 with r = 0.5 (units per month) and y(0) = 0.3.

Solve the differential equation for y(t).
Verify that y(t) is a CDF.
Find the probability that a randomly chosen individual is infected with the disease in the next 1.5 months.

Use logistic regression to find the best-fitting functions p(t) to the data in the sets D = {(t₁, p₁),...,(t_n, p_n)} given in Problems 19 to 22.

19. D = {(1, 0.10), (2, 0.15), (3, 0.30), (4, 0.49), (5, 0.58), (6, 0.76), (7, 0.87), (8, 0.95), (9, 0.93), (10, 0.98)}

20. D = {(1, 0.03), (2, 0.02), (3, 0.08), (4, 0.09), (5, 0.21), (6, 0.30), (7, 0.52), (8, 0.61), (9, 0.84), (10, 0.88)}

21. D = {(1, 0.01), (3, 0.01), (5, 0.03), (7, 0.03), (9, 0.10), (11, 0.18), (13, 0.29), (15, 0.48), (17, 0.73), (19, 0.85), (21, 0.87)}

22. D = {(1, 0.16), (3, 0.17), (5, 0.27), (7, 0.34), (9, 0.44), (11, 0.58), (13, 0.63), (15, 0.77), (17, 0.78), (19, 0.85), (21, 0.92)}

Level 2 APPLIED AND THEORY PROBLEMS

23. In a large study, human birth weights were found to be approximately normally distributed with mean of 120 ounces and standard deviation of 18 ounces (1 pound = 16 ounces; 1 ounce = 28.35 grams).

Find the probability that a randomly chosen baby has a birth weight of 8 pounds or less.
Find the probability that a randomly chosen baby weighs between 6 and 8 pounds at birth.
Find the probability that a randomly chosen baby weighs more than 9 pounds at birth.

24. A patient is said to be hyperkalemic (high levels of potassium in the blood) if the measured level of potassium is 5.0 milliequivalents per liter (meq/L) or more. In a population of students at Ozark University, the distribution of potassium levels is normally distributed with mean 4.5 meq/L and standard deviation 0.4 meq/L. Estimate the proportion of students who are hyperkalemic.

25. The gestation period of a pregnant woman is normally distributed with mean of 279 days and standard deviation of 16 days.

Find the probability that the gestation period is between 263 and 295 days.
Find the probability that the gestation period is greater than 303 days.

26. Answer the following questions for the data in Example 9.

What is the IQ value that corresponds to the 95th percentile for each of the two six-year-old low-birth-weight groups in Table 7.4?
In the normal-birth-weight urban and suburban communities, what is the change from age 6 to age 11 in the estimated proportion of individuals who have an IQ of 140 and above?
In the two eleven-year-old low-birth-weight communities, the 50th percentile of the suburban community corresponds to which percentile in the urban community?

In Problems 27 to 30, we emphasize that we are dealing with the lognormal distribution and note that the value e^μ is not the mean but the median (see Problem 35) and that the dispersion parameter σ is not the square root of the variance v of the distribution.

27. The latent period of disease is the time from a person initially getting infected to the moment the person exhibits first symptoms. In a paper by Sartwell, as referenced in Figure 7.26, Sartwell reported that the latency period (measured in days) of salmonellosis was approximately lognormally distributed. Taking the natural logs of the latency periods that are measured in days, he estimated that μ = ln(2.4) and σ = ln(1.47). Using these estimates, find the following quantities:

The fraction of individuals who start exhibiting symptoms within the first three days
The fraction of individuals who start exhibiting symptoms after four days
The fraction of individuals who start exhibiting symptoms between the start of the second day and end of the third day

28. The latent period of disease is the time from a person initially getting infected to the moment the person exhibits first symptoms. Sartwell found that the latency period (measured in days) of poliomyelitis was approximately lognormally distributed. Taking the natural logs of the latency periods that are measured in days, he estimated that μ = ln(12.6) and σ = ln(1.5). Using these estimates, find the following quantities:

The fraction of individuals who start exhibiting symptoms within the first two weeks
The fraction of individuals who start exhibiting symptoms after ten days
The fraction of individuals who start exhibiting symptoms between the start of the twelfth day and the end of the fifteenth day.

29. The survival time after cancer diagnosis is the number of days a patient lives after being diagnosed with cancer. In an article titled “Variation in the Duration of Survival of Patients with the Chronic Leukemias,” Blood. 1960 Mar; 15: 332–349, M. Feinleib and B. McMahon reported that the survival time for female patients diagnosed with lymphatic leukemia (measured in months) was approximately lognormally distributed. Taking the natural logs of the survival times, they estimated that μ = ln(17.2) and σ = ln(3.21). Using these estimates, find the following quantities:

The fraction of individuals who survived less than one year
The fraction of individuals who survived at least two years
The fraction of individuals who survived between 1 and 1.5 years

30. The survival time after cancer diagnosis is the number of days a patient lives after being diagnosed with cancer. Feinleib and McMahon, in a study cited in the previous problem, found that the survival time for female patients diagnosed with myelocytic leukemia (measured in months) was approximately lognormally distributed. Taking the natural logs of the survival times, they estimated that μ = ln(15.9) and that σ = ln(2.80). Using these estimates, find the following quantities:

The fraction of individuals who survived less than one year
The fraction of individuals who survived at least two years
The fraction of individuals who survived between the start of thirteen months and end of eighteen months (i.e., between 1 and 1.5 years)

31. In looking over her data, the entomologist mentioned in Example 12 found that she had transposed the number of individuals dying in weeks 7 and 8. After fixing this mistake, redo all the calculations covered in Example 12. Note differences to the estimates of the mean and variance associated with the actual data and the log mean μ and log variance σ² for the associated log normal PDF.

32. Linguist G. Herdan (see “The Relation Between the Dictionary Distribution and the Occurrence Distribution of Word Length and Its Importance for the Study of Quantitative Linguistics,” Biometrika 45 (1958): 222–228) found that length of spoken words in n = 738 phone conversations was lognormally distributed, with mean m = 5.05 letters and variance v = 2.16 letters. Find the probability that a randomly chosen word had six or more letters. Hint: See Problem 38.

33. The Gompertz growth equation normalized so that the variable y has the interpretation of a proportion (i.e., y = 1 is an equilibrium and upper bound) is given by

This equation can be used to model a variety of population processes, including tumor growth (y is proportion of maximum size), population growth (y is proportion of environmental carrying capacity), and acquisition of new technologies, as illustrated in the following example.

The Gompertz equation has been used to model mobile phone uptake, where y(t) is the fraction of individuals who have a mobile phone by time t (say, in years) and r is a parameter that can be fitted to the actual data. Using this model, we can derive a probability density function that represents the time at which an individual acquired her first mobile phone. To illustrate this idea, assume that y(0) = 1/e (i.e., currently 36.79% of people have mobile phones) and r = 1.

Solve the differential equation for r = 1 and y(0) = 1/e.
Verify that F(t) = 1 − y(t), where y(t) is the solution found in part a, is a CDF.
Find the PDF for your CDF.
Compute the probability that a randomly chosen individual acquires a mobile phone two years from now.

34. Consider Example 1 with r > 0 and y(0) = y₀ ∈ (0, 1).

Solve the differential equation for y(t).
Verify that F(t) = 1 − y(t) is a CDF.

35. Show that the PDF of a normal curve has its maximum (i.e., its mode, which for the normal distribution coincides with the median) at x = μ and points of inflection at x = μ + σ and x = μ − σ.

36. Consider Example 1 with r > 0 and y(0) = y₀ ∈ (0, 1).

Verify that y(t) can be written as y(t) = 1/(1 + e^(a − rt)) where a = ln(1/y₀ − 1).
Find the PDF for this CDF.

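37. For the lognormal distribution defined by

$$f(x) = \frac{1}{x\sigma\sqrt{2\pi}}\, e^{-(\ln x - \mu)^2/(2\sigma^2)}, \qquad x > 0$$

show that the mean m and variance v are given by

$$m = e^{\mu + \sigma^2/2}$$

and

$$v = \left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2}$$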

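38. If ln X is a normally distributed random variable with mean μ and variance σ², and X is a lognormally distributed random variable with mean m and variance v, then show that

$$\mu = \ln m - \frac{1}{2}\ln\left(1 + \frac{v}{m^2}\right)$$

and

$$\sigma^2 = \ln\left(1 + \frac{v}{m^2}\right)$$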

7.5 Life Tables

In Section 6.1 we introduced the simplest differential equation model of population growth:

This model implicitly assumes that all individuals, whether young or old, have the same mortality and fecundity rates. Although this assumption is a useful first approximation, mortality and fecundity are often age dependent. For instance, many animals become sexually mature only after they have reached a particular age. Additionally, the risk of mortality is often higher at younger and older ages. In this section, we consider models that account for age-specific mortality and reproduction.

Survivorship functions

Biology professor Gregory Erickson and colleagues studied fossils of four North American tyrannosaurs—Albertosaurus, Tyrannosaurus, Gorgosaurus, and Daspletosaurus. Using the femur bones of these fossils, the scientists estimated that the life spans of the dinosaurs ranged from birth to 28 years. Based on these estimates, the scientists created a life table for each of the dinosaurs. These life tables keep track of what fraction l(t) of individuals survived to age t. For example, the life table for Albertosaurus sarcophagus (see Figure 7.28) is reported in Table 7.6.

Survivorship Function

A function l: [0, ∞) → [0, 1] is a survivorship function if

l(0) = 1; that is, all individuals survive to age 0
l(t) is nonincreasing; that is, if an individual survived to age t, then it survived to all earlier ages
lim_{t→∞} l(t) = 0; that is, all individuals eventually die


Figure 7.28 Albertosaurus sarcophagus

Table 7.6 Life table for A. sarcophagus

images

Example 1 Aging dinosaurs

Use Table 7.6 to do the following:

Determine what fraction of dinosaurs died between ages 4 and 6.
Determine what fraction of dinosaurs died between ages 11 and 14.
Plot l(t) and discuss its shape.

Solution

Since l(4) = 0.56, 56% of the dinosaurs survived to age 4. Similarly, since l(6) = 0.54, 54% of the dinosaurs survived to age 6. Since l(4) − l(6) = 0.02, it follows that 2% of dinosaurs died between ages 4 and 6.
Since l(11) = 0.46, 46% of the dinosaurs survived to age 11, and since l(14) = 0.38, 38% survived to age 14. Hence, l(11) − l(14) = 0.08; that is, 8% of the dinosaurs died between ages 11 and 14.
Plotting l(t) with technology yields Figure 7.29. As we expect, l(t) is a decreasing function of t; the fraction of individuals surviving decreased with age. Figure 7.29 shows that l(t) decreases sharply at age 1 and is concave down for ages between 2 and 20 years. Hence, survivorship decreased at an increasing rate during ages 2 to 20 years. In contrast, survivorship decreased at a slower rate at the oldest ages.

Figure 7.29 Plot of the entries in Table 7.6 showing proportion l(t) of individual Albertosaurus sarcophagus that survived to age t

Survivorship functions have a natural relationship to CDFs of an appropriate random variable, as the following example shows.

Example 2 From survivorship to CDFs and PDFs

Let l(t) be the survivorship function for A. sarcophagus and let X be the age at which a randomly chosen A. sarcophagus dies. If F is the CDF for X, then determine the relationship between F and l. If X is a continuous random variable, what is the PDF for X?

Solution Since l(t) is the fraction of individuals that die after age t, l(t) = P(X > t). Alternatively, F(t) = P(X ≤ t). Since P(X > t) = 1 − P(X ≤ t) = 1 − F(t), we have that l(t) = 1 − F(t) and F(t) = 1 − l(t). If X is a continuous random variable, then by the fundamental theorem of PDFs in Section 7.2, the PDF of X is given by F′(t) = −l′(t).

Using Table 7.6, we can determine how the mortality rates of A. sarcophagus vary with age. In particular, imagine (as did a famous movie!) that on a remote island scientists were able to create 100 A. sarcophagus babies. The life table implies that of these 100, 60 survive to age 2 and 56 survive to age 4. Hence, 4 of 60 individuals die from age 2 to 4. Thus, the mortality rate over this two-year period is 4/60 = 6.7% and the annual mortality rate is approximately 3.3% per year. Equivalently, we estimate the mortality rate as

m(2) ≈ [l(2) − l(4)] / [2 · l(2)] = (0.60 − 0.56) / (2 × 0.60) ≈ 0.033 per year.

In the following example, we compute and interpret the mortality rates for the remaining age classes.

Example 3 Dinosaur mortality rates

Refer to the life table for A. sarcophagus.

Determine age-specific mortality rates.
Discuss which ages were most susceptible and least susceptible to mortality.

Solution

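For the mortality rate from age 0 to age 2, we have

m(0) ≈ [l(0) − l(2)] / [2 · l(0)] = (1 − 0.60) / (2 × 1) = 0.20.

We already found that the mortality rate at age 2 is 0.033. To determine the mortality rate at age 4, we can compute

m(4) ≈ [l(4) − l(6)] / [2 · l(4)] = (0.56 − 0.54) / (2 × 0.56) ≈ 0.018.

Computing the remaining mortality rates m(t) yields this table: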
This table suggests the mortality rate in the first year is greatest. For individuals surviving after the first year, mortality risk tends to increase with age and then decrease in the last few years.
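The age-specific rates in this example can be checked with a short script. The following is a minimal sketch, not the authors' procedure: the function name `mortality_rates` is hypothetical, and the data are only the entries of Table 7.6 that are quoted in the text (ages 0, 2, 4, and 6), not the full table.

```python
def mortality_rates(ages, survivorship):
    """Estimate m(t) ≈ (l(t) - l(t + Δt)) / (Δt · l(t)) for each age class."""
    rates = []
    for i in range(len(ages) - 1):
        dt = ages[i + 1] - ages[i]                      # step size Δt
        died = survivorship[i] - survivorship[i + 1]    # fraction of cohort lost
        rates.append(died / (dt * survivorship[i]))     # per-year rate among survivors
    return rates


# Only the entries of Table 7.6 quoted in the text (assumption: partial table).
ages = [0, 2, 4, 6]
l_values = [1.00, 0.60, 0.56, 0.54]

rates = mortality_rates(ages, l_values)
for age, m in zip(ages, rates):
    print(f"m({age}) ≈ {m:.3f} per year")
```

Dividing by l(t) rather than by 1 is what turns a fraction of the original cohort into a rate among those still alive at age t.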

In Example 3, we computed mortality rates using the relationship

m(t) ≈ [l(t) − l(t + Δt)] / [l(t) Δt]

where Δt is the step size between measurements. Multiplying both sides of this equation by −l(t) yields

[l(t + Δt) − l(t)] / Δt ≈ −m(t) l(t).

Taking the limit as Δt approaches 0 provides the following result:

Survivorship-Mortality Equation

If l(t) is the fraction of individuals that survive to age t and m(t) is the mortality rate at age t, then l(t) and m(t) satisfy the equation

l′(t) = −m(t) l(t).

Equivalently,

m(t) = −l′(t) / l(t).
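Solving l′(t) = −m(t)l(t) with l(0) = 1 by separation of variables gives l(t) = exp(−∫₀ᵗ m(s) ds). The sketch below evaluates that formula numerically with the trapezoid rule. The function name and the two mortality functions are hypothetical choices made only for illustration.

```python
import math


def survivorship_from_mortality(m, t, steps=10_000):
    """Return l(t) = exp(-∫_0^t m(s) ds), with the integral from the trapezoid rule."""
    if t == 0:
        return 1.0
    h = t / steps
    total = 0.5 * (m(0.0) + m(t))       # endpoint terms of the trapezoid rule
    for k in range(1, steps):
        total += m(k * h)               # interior terms
    return math.exp(-h * total)


# Hypothetical constant mortality of 0.1 per year: expect l(5) = e^(-0.5).
l_const = survivorship_from_mortality(lambda s: 0.1, 5.0)

# Hypothetical mortality increasing linearly with age, m(s) = 0.02 s:
# the integral from 0 to 10 is 1, so expect l(10) = e^(-1).
l_linear = survivorship_from_mortality(lambda s: 0.02 * s, 10.0)

print(l_const, math.exp(-0.5))
print(l_linear, math.exp(-1.0))
```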

Example 4 Constant mortality rates

For many short-lived mammals and birds, the mortality rate m(t) is approximately constant*. Assuming that m(t) = m for all t, determine l(t) and the CDF associated with this survival function. Do they look familiar?

Solution If m(t) = m is constant, then l′(t) = −ml(t). The general solution to this equation is l(t) = l(0)e^−mt. Since all individuals survive to age 0, l(0) = 1 and l(t) = e^−mt. In Example 2, we noted that F(t) = 1 − l(t) = 1 − e^−mt for t ≥ 0 is the CDF for the distribution of ages at death. This CDF corresponds to the exponential distribution with mean 1/m. Hence, for individuals with a constant mortality rate m per year, life expectancy is 1/m years.

Example 5 Mortality rates for humans in the United States

Life tables are an ancient and important tool for human demography. They are widely used for descriptive and analytic purposes in public health, health insurance, epidemiology, and population geography. In recognition of the importance of these life tables, the Max Planck Institute for Demographic Research, the University of California at Berkeley, and the Institut national d'études démographiques developed The Human Life-Table Database, an online resource for human life tables at http://www.lifetable.de/. Data for male and female survivorship functions are illustrated in Figure 7.30. The survivorship function of males is well approximated by the function

where t is measured in years. Compute and interpret the mortality rate m(t) for l(t).


Figure 7.30 Life tables for females (in red) and males (in blue) for the United States in 2005. Black curves correspond to the best-fitting curves to the data.

Data Source: www.lifetable.de

Solution The mortality rate is given by m(t) = −l′(t)/l(t). By the chain rule,

images

Therefore,

images

Hence, the instantaneous mortality rate is initially low (approximately 0.02% per year at t = 0) and increases super-exponentially with age (reaching approximately 4.5% per year for 75-year-olds), as shown in the accompanying plot.

images

Life expectancy

Given a survival function l(t) for a population, we can ask: What is the life expectancy of an individual? To answer this question, let X be the age at which a randomly chosen individual dies. The mean of X is the mean life span of an individual in the population. To compute this mean, recall that the CDF for X is given by F(t) = 1 − l(t) for t ≥ 0 and 0 otherwise. Hence, the PDF for X (assuming l is differentiable!) is −l′(t) for t ≥ 0 and 0 otherwise. The mean of X is given by

∫₀^∞ t · (−l′(t)) dt = −∫₀^∞ t l′(t) dt,

provided the improper integral is convergent. Let's assume it is convergent. We can simplify the integral by integrating by parts. Define u = t and dv = −l′(t)dt, so that du = dt and v = −l(t). This yields

∫ −t l′(t) dt = −t l(t) + ∫ l(t) dt.

Evaluating this integral from 0 to b and taking the limit as b → ∞ yields

−∫₀^∞ t l′(t) dt = lim_{b→∞} [−b l(b) + ∫₀^b l(t) dt].

If we assume lim_{b→∞} b l(b) = 0, then

−∫₀^∞ t l′(t) dt = ∫₀^∞ l(t) dt.

Hence, we proved the result shown in Theorem 7.4.

Theorem 7.4 Life expectancy theorem

Let l(t) be a continuously differentiable survivorship function satisfying lim_{b→∞} b l(b) = 0. Let X be the random variable whose CDF is given by 1 − l(t) for t ≥ 0 and 0 otherwise. Then, the mean of X, which is the life expectancy of an individual, equals

∫₀^∞ l(t) dt

provided that the integral is convergent.

All the survivorship functions l(t) we use have the property that lim_{b→∞} b l(b) = 0. This limit condition is, of course, much stronger than the requirement that l(b) approach 0 as b increases. This condition, however, is not sufficient to ensure convergence. For example, if l(t) = 1/(t ln t) for t ≥ 2, then b l(b) = 1/ln b approaches 0, yet ∫₂^∞ l(t)dt is divergent.
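As a numerical check of the life expectancy theorem, the sketch below integrates the constant-mortality survivorship function l(t) = e^−mt from Example 4 and compares the result with 1/m. The rate m = 0.25 per year, the cutoff of 200 years, and the function name `life_expectancy` are assumptions chosen for illustration.

```python
import math


def life_expectancy(l, upper, steps=100_000):
    """Approximate ∫_0^upper l(t) dt with the trapezoid rule."""
    h = upper / steps
    total = 0.5 * (l(0.0) + l(upper))
    for k in range(1, steps):
        total += l(k * h)
    return h * total


m = 0.25  # hypothetical constant mortality rate per year
# The upper limit 200 stands in for infinity; e^(-0.25 * 200) is negligible.
estimate = life_expectancy(lambda t: math.exp(-m * t), upper=200.0)
print(estimate, 1 / m)
```

Truncating the improper integral is reasonable here only because the tail of l(t) decays exponentially; a heavy-tailed survivorship function would need a more careful treatment.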

Example 6 Life expectancy of A. sarcophagus

Estimate the mean age of A. sarcophagus using Table 7.6.

Solution Using the right endpoint rule and assuming a maximum life span of 30 years, we get

images

Hence, the life expectancy of A. sarcophagus is 8.59 years. In Problem Set 7.5, you are asked to verify that the left endpoint rule provides a more optimistic life expectancy of 10.23 years.
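The two endpoint rules can be compared with a few lines of code. This is a sketch under stated assumptions: the function name is hypothetical, and the data are only the entries of Table 7.6 quoted in the text (ages 0 to 6), so the output covers the first six years of life and does not reproduce the values 8.59 and 10.23.

```python
def endpoint_estimates(ages, survivorship):
    """Left and right endpoint Riemann sums for ∫ l(t) dt over the tabulated ages."""
    left = 0.0
    right = 0.0
    for i in range(len(ages) - 1):
        width = ages[i + 1] - ages[i]
        left += survivorship[i] * width        # uses l at the start of the interval
        right += survivorship[i + 1] * width   # uses l at the end of the interval
    return left, right


# Partial data: only the Table 7.6 entries quoted in the text.
ages = [0, 2, 4, 6]
l_values = [1.00, 0.60, 0.56, 0.54]

left, right = endpoint_estimates(ages, l_values)
print(f"left endpoint sum:  {left:.2f}")
print(f"right endpoint sum: {right:.2f}")
```

Because l(t) is nonincreasing, the left endpoint sum is never smaller than the right endpoint sum, which is why the left endpoint rule gives the more optimistic estimate.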

Example 7 Men versus Women

For the data presented in Example 5, the survival function for males is well approximated by

and the survival function for females is well approximated by

Use numerical integration to estimate the life expectancy of males and females in the United States.

Solution Using technology, we get

Using technology, we get

Hence, females were expected to live five years longer than males.

Example 8 Older is better

Consider a hypothetical population whose mortality rate is

Determine the life expectancy of this population.

Solution To determine the life expectancy, we need to find l(t). Since l(t) must satisfy l′(t) = −m(t)l(t) and l(0) = 1, we can use separation of variables to solve for l(t):

images

Since l(0) = 1 = e^C, we get l(t) = .

To find the life expectancy, we need to compute ∫₀^∞ l(t)dt. Using the substitution u = 1 + t, we get

images

Therefore,

images

Reproductive success

So far we have only considered the likelihood of an individual surviving until a certain age. To better understand the dynamics of a population, we also need to know how the reproductive success of individuals depends on their ages. In other words, how many progeny do individuals of a particular age produce on average? In developing the models, we let b(t) denote the average number of progeny produced by an individual of age t. The likelihood l(t) of an individual surviving to age t in conjunction with b(t) provides a considerable amount of information about the demography of a population, as the following example illustrates.


Figure 7.31 The vole Microtus agrestis

Example 9 Vole life history

Table 7.7 Life table for Microtus agrestis where t is measured in weeks, l(t) is the fraction of females surviving to age t, and b(t) is the average number of female offspring produced per week by an individual of age t

images

In their classic text, The Distribution and Abundance of Animals, ecologists H. G. Andrewartha and L. C. Birch created the life table, Table 7.7, for females of the vole species Microtus agrestis. Use this table to answer the following question: If you were given 100 female voles of age 0, and you placed them in your backyard, how many female progeny would they produce during their lifetime? Assume that no individuals live beyond 72 weeks and that the entries b(t) in Table 7.7 apply to all females surviving each of the eight-week periods over which the data are discretized.

Solution Of the 100 female voles, we expect 83% will survive to week 8. Each of these 83 will produce on average 0.08 daughters per week. Hence, in the interval [0, 8], we expect 83 × 0.08 × 8 = 53.12 daughters to be produced. Then, 73% of the female voles survive to week 16. Each of these surviving females will produce on average 0.3 daughters per week from week 8 to week 16. Hence, in the interval [8, 16], we expect 73 × 0.3 × 8 = 175.2 daughters to be produced. Continuing in this manner, we get Table 7.8.

Adding all these daughters yields 588 daughters expected to be produced by 100 female voles. Equivalently, each female vole will produce on average 5.88 daughters.

Table 7.8 Number of daughters (rounded to nearest integer) produced by 100 female voles as they pass through each of the specified age categories and a proportion drop out of each category according to the survival schedule (function) l(t)

Age categories	Daughters
[0, 8]	53
[8, 16]	175
[16, 24]	175
[24, 32]	107
[32, 40]	49
[40, 48]	20
[48, 56]	6
[56, 64]	2
[64, 72]	1
Total	588
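The bookkeeping behind Table 7.8 can be sketched in code. The function name is hypothetical, and only the first two age categories are computed because those are the only values of l(t) and b(t) from Table 7.7 that are quoted in the solution.

```python
def daughters_per_interval(cohort, survivorship, births_per_week, width):
    """Expected daughters per age category: N · l(t_i) · b(t_i) · Δt."""
    return [cohort * l * b * width
            for l, b in zip(survivorship, births_per_week)]


# Values quoted in the solution for the categories [0, 8] and [8, 16].
l_values = [0.83, 0.73]   # fraction surviving to week 8 and to week 16
b_values = [0.08, 0.3]    # daughters per female per week in each category

counts = daughters_per_interval(100, l_values, b_values, width=8)
for label, count in zip(["[0, 8]", "[8, 16]"], counts):
    print(f"{label}: {count:.2f} daughters")
```

Supplying the full columns of Table 7.7 and summing the resulting list would give the total number of daughters; dividing by the cohort size then gives the average per female.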

Example 9 illustrates how to use a life table to determine the average number of daughters produced by a female during her lifetime. To generalize the computations in Example 9 to an arbitrary survival function l(t) and an arbitrary birth function b(t) ≥ 0, assume that initially there are N females (e.g., N = 100 in Example 9) and that Δt is the width of the time intervals for the life table (e.g., Δt = 8 in Example 9). The number of females that survive to age t₁ = Δt is Nl(t₁). Each of these females produces approximately b(t₁)Δt daughters. Hence, by time t₁, there are approximately Nl(t₁)b(t₁)Δt daughters. The number of females that survive to age t₂ = 2Δt is Nl(t₂). Each of these females produces approximately b(t₂)Δt daughters in the time interval [t₁, t₂]. Hence, by time t₂, there are approximately

N l(t₁) b(t₁) Δt + N l(t₂) b(t₂) Δt

daughters produced. Continuing in this manner, by time tₙ = nΔt there are approximately

N l(t₁) b(t₁) Δt + N l(t₂) b(t₂) Δt + ⋯ + N l(tₙ) b(tₙ) Δt

daughters produced. Taking the limit as Δt → 0 yields the expected number of daughters D to be

D = N ∫₀^∞ l(t) b(t) dt.

If we now define the reproductive number R₀ to be the number of daughters that each individual female is expected to produce in her lifetime—that is R₀ = D/N—then we obtain the following relationship:

Reproductive Number

Let l(t) be a survival function and b(t) be a reproduction function. The reproductive number for the population, defined to be the average number of daughters produced by a female, is given by

R₀ = ∫₀^∞ l(t) b(t) dt

whenever the improper integral is well defined.

Ignoring the role of males, if R₀ > 1, then each female more than replaces herself in each generation and the population grows. On the other hand, if R₀ < 1, then each female fails to fully replace herself in each generation and the population declines.

Example 10 Reproductive number for painted turtles

Painted turtles are found in Iowa where their favorite pastime is basking in the sun on warm March days. At night, they retire to the bottom of a wetland. Females lay their eggs in late May or June. Using a mark-recapture study, biology professor Henry Wilbur estimated the survival and reproductive functions of painted turtles. He found that l(t) ≈ 0.243e^−0.273t for t ≥ 1 and l(t) ≈ e^−1.69t for t < 1. Moreover, he assumed that female turtles are reproductively mature at age 7 and that mature females produce on average 6.6 daughters per year. Using this information, do the following:

Estimate the life expectancy of a female painted turtle.
Estimate the reproductive number of the painted turtle. Based on this estimate, discuss whether you think the painted turtle population is increasing or decreasing.

Solution

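To estimate the life expectancy, we need to compute ∫₀^∞ l(t)dt. By the splitting property for integrals, ∫₀^∞ l(t)dt = ∫₀¹ l(t)dt + ∫₁^∞ l(t)dt. The first integral equals

∫₀¹ e^−1.69t dt = (1 − e^−1.69)/1.69 ≈ 0.4825.

Since, ignoring the constant of integration, ∫ 0.243e^−0.273t dt ≈ −0.890e^−0.273t, we obtain

∫₁^∞ l(t)dt = lim_{b→∞} (0.890e^−0.273 − 0.890e^−0.273b) ≈ 0.6774.

Therefore, the life expectancy is approximately 0.4825 + 0.6774 ≈ 1.16 years. Hence, the average female turtle is not expected to live to a reproductively mature age, though some do and they reproduce. But do they reproduce enough? The answer lies in the next part.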
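The reproductive number is given by R₀ = ∫₀^∞ l(t)b(t)dt. Since b(t) = 0 for t < 7 and b(t) = 6.6 for t ≥ 7,

R₀ = ∫₇^∞ 0.243e^−0.273t · 6.6 dt.

Since, ignoring the constant of integration, ∫ 0.243e^−0.273t · 6.6 dt ≈ −5.875e^−0.273t, we find

R₀ ≈ 5.875e^(−0.273 × 7) ≈ 0.87.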

So a female painted turtle is expected to produce less than one daughter during her lifetime. This suggests that the population of painted turtles is in decline, as individuals are not replacing themselves over their lifetime.
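Both parts of this example can be checked numerically. The sketch below uses the survival and reproduction functions stated in the example; the function names, the trapezoid rule, and the cutoff of 200 years standing in for infinity are illustrative assumptions.

```python
import math


def survivorship(t):
    """Painted turtle survivorship l(t) as given in Example 10."""
    return math.exp(-1.69 * t) if t < 1 else 0.243 * math.exp(-0.273 * t)


def births(t):
    """Daughters per year: 6.6 once a female is mature at age 7, else 0."""
    return 6.6 if t >= 7 else 0.0


def integrate(f, a, b, steps=20_000):
    """Trapezoid rule for the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for k in range(1, steps):
        total += f(a + k * h)
    return h * total


T = 200.0  # stand-in for infinity; the tail beyond 200 years is negligible

# Split at t = 1, where the formula for l(t) changes.
life = integrate(survivorship, 0.0, 1.0) + integrate(survivorship, 1.0, T)

# b(t) vanishes before age 7, so the integral for R0 starts there.
r0 = integrate(lambda t: survivorship(t) * births(t), 7.0, T)

print(f"life expectancy ≈ {life:.2f} years")
print(f"R0 ≈ {r0:.2f}")
```

Splitting each integral at the points where the formulas change keeps the trapezoid rule from straddling a kink or a jump.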

Example 11 Reproductive number in the United States

According to the United Nations' online data website, http://data.un.org/, the reproductive number for women in the United States in the period 2000–2005 is one; on average, a woman produces one daughter during her lifetime. What does this tell us about b(t)? Recall, the survival function for women in the United States in 2005 is well approximated by

Assume that women have a constant birthrate b during their “childbearing years,” which in conventional international statistical usage is from age 15 to 49. Estimate b.

Solution From the definition of R₀ and the reproductive number of a woman in the United States, we get

1 = R₀ = ∫₁₅⁴⁹ b · l(t) dt = b ∫₁₅⁴⁹ l(t) dt.

Using numerical integration, we get

∫₁₅⁴⁹ l(t) dt ≈ 33.33.

Hence, we get 1 = b · 33.33, or b = 1/33.33 ≈ 0.03 daughters per year.

In addition to applications in demography, life tables can be used to understand the spread of disease in a population. As a striking parallel to the demographic process of survivorship and reproduction, consider the following: An individual who contracts a disease will be subject to a maturation process known as a latent period and then will become infective, which is akin to reaching sexual maturity. Then, in each period, the infected individual may or may not infect another individual, which is akin to reproduction. Along the way, of course, the infected individual may either recover from the disease or die, which is akin to mortality.

Example 12 Measles epidemics

Measles is a highly infectious viral disease (genus Morbillivirus of the family Paramyxoviridae) that infects, in particular, human infants and adults. An individual infected with measles will become infectious at anywhere from seven to eighteen days and remain infectious for about eight days. Let l(t) be the fraction of individuals still infected with measles t days after getting infected (see second column of Table 7.9). The number of new infections that arise from an infected individual (these new infections are equivalent to “births” in the context of the growth of the infected population) depends on many factors, including the rate at which individuals contact other individuals on public transport, at the workplace, and so on. However, in the population of concern, public health officials have determined the number of new cases that infected individuals can be expected to give rise to before they themselves are cured or die; see the third column in Table 7.9.

Table 7.9 Life table for a measles epidemic

images

If several infectious individuals are introduced into the population to which these data apply, is an epidemic expected to occur (i.e., is the population of infectious individuals expected to grow)?
If the proportion of individuals vaccinated in a population reduces the expected number of individuals infected per infectious individual by this same proportion, then what proportion of the population should be vaccinated to ensure that the disease will not spread?

Solution

Since we have cast this problem in terms of life table analysis, whether a measles epidemic will occur depends on the value of R₀ being greater or less than 1. From Table 7.9, it follows that

R₀ ≈ 2.29.

Hence, an infected individual infects, on average, 2.29 other individuals and the population of infected individuals will grow. An epidemic is likely.
If a proportion y of individuals is vaccinated, the proportion available to spread the disease is 1 − y. To control the epidemic, we need to select y to ensure that R₀ < 1; that is, we need to solve R₀ = 2.29(1 − y) < 1 for y. This implies that 2.29y > 2.29 − 1 or y > 1.29/2.29 ≈ 0.56. Hence, more than about 56% of the population should be vaccinated to ensure that measles does not spread in the population.
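The threshold in the second part generalizes to any reproductive number greater than 1: solving R₀(1 − y) = 1 gives y = 1 − 1/R₀. The following sketch, with a hypothetical function name, evaluates this threshold for the value R₀ = 2.29 found above.

```python
def critical_vaccination_fraction(r0):
    """Fraction y solving r0 * (1 - y) = 1; vaccinating more than y gives R0 < 1."""
    if r0 <= 1:
        return 0.0   # the disease already fails to spread without vaccination
    return 1 - 1 / r0


y_star = critical_vaccination_fraction(2.29)
print(f"vaccinate more than {y_star:.1%} of the population")
```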

PROBLEM SET 7.5

Level 1 DRILL PROBLEMS

Use Life Table 7.6 for Albertosaurus sarcophagus to compute the quantities in Problems 1 to 4.

1. The fraction of A. sarcophagus that died between 14 and 20 years

2. The fraction of A. sarcophagus that died between 20 and 28 years

3. The fraction of A. sarcophagus that lived at least 6 years

4. The fraction of A. sarcophagus that lived at least 8 years

Use Life Table 7.7 for Microtus agrestis to compute the quantities in Problems 5 to 8.

5. The fraction of female voles that lived fewer than 24 weeks

6. The fraction of female voles that lived fewer than 40 weeks

7. The fraction of female voles that lived between 24 and 48 weeks

8. The fraction of female voles that lived between 40 and 64 weeks

Use the survivorship curves for men and women in Example 7 to compute the quantities in Problems 9 to 12.

9. The fraction of women who lived at least 75 years

10. The fraction of men who lived at least 75 years

11. The fraction of women who lived between 25 and 75 years

12. The fraction of men who lived between 25 and 75 years

13. Find the survivorship function l(t) when m(t) = a + bt with a > 0 and b > 0.

14. Find the survivorship function l(t) when m(t) = with a > 0 and b > 0. Discuss how a and b influence the shape of the survivorship function.

15. Use Life Table 7.7 for Microtus agrestis to approximate the mortality rates for all age classes of the female vole. Discuss any pattern in the mortality rates that you observe.

16. Use Life Table 7.7 for Microtus agrestis to compute the life expectancy of the female vole.

Compute the life expectancy for populations with the (hypothetical) survivorship functions in Problems 17 to 22. Assume t is measured in years. Don't be surprised if one of them turns out to be infinite.

images

Compute R₀ for populations with the (hypothetical) survivorship and reproduction functions in Problems 23 to 28. Assume t is measured in years.

images

Level 2 APPLIED AND THEORY PROBLEMS

29. Show that m(t) = −d/dt ln[l(t)] provided that l(t) is differentiable.

30. If ∫₀^∞ l(t)dt is convergent and 0 ≤ b(t) ≤ B for all t, show that R₀ = ∫₀^∞ l(t)b(t)dt is convergent.

31. According to the work of Erickson and colleagues reported in Table 7.6, the survivorship function l(t) for the dinosaur species Albertosaurus is well approximated by

for t ≥ 2. This function plotted against the data is shown here:

images

Find and plot the mortality function m(t) for t ≥ 2.

32. Using l(t) from Example 7, find the mortality function m(t) for females in the United States in 2005.

33. According to the work of Erickson and colleagues, the mortality rate for the dinosaur species Gorgosaurus is given by

Find and plot the survivorship function l(t).

34. According to the work of Erickson and colleagues, the mortality rate for the dinosaur species Daspletosaurus is given by

Find and plot the survivorship function l(t).

35. According to the work of Erickson and colleagues, the survivorship function for the dinosaur species Tyrannosaurus is

Find and plot the mortality rate m(t).

36. According to a National Vital Statistics Report (volume 54, number 14), the life table for people in the United States in 2003 was as follows:

images

Using right endpoints, estimate the life expectancy of a human.

37. During an outbreak of a SARS-like coronavirus, data were collected that resulted in the construction of the following table:

images

Use this table to answer the following questions:

If several infectious individuals are introduced into another population, is the epidemic expected to spread?
If the proportion of individuals vaccinated in a population reduces the expected number of individuals infected per infectious individual by this same proportion, then what proportion of the population should be vaccinated to ensure that the disease will not spread?

38. Communicable diseases often have at least two stages: a latent stage in which the individual is infected but not infectious, and an infectious stage in which the individual can infect others. For a deadly disease where the time to death is exponentially distributed with mean 1/q days, the fraction of individuals surviving t days with the disease is l(t) = e^−qt. Using differential equations to model the infection with two stages, latent and infectious, the infectiousness of an average infected individual (i.e., the number of people infected per day) is given by

where 1/a is the mean duration of the latent period, 1/c is the mean duration of the infectious period, and k is the rate an infectious individual infects others. For this model find R₀.

39. The parameters of the HIV epidemic vary considerably from country to country. The following table shows survival (including both death and drop-out rates) for treated and untreated segments of the sexually promiscuous population. The numbers reflect the fact that we expect all individuals to die within ten years if they are infected, unless they are treated. In the latter case, we assume that the individuals leave or drop out of the sexually promiscuous population after being part of it for twenty years. Also their infectivity is less for some of the infectivity period because the levels of virus in their body fluids are reduced by treatment. Infectivity comes back later as the efficacy of treatment is reduced over time.

Compare the R₀ for the treated and untreated segments of the population. What do you conclude?
What levels of condom use in the two subpopulations are needed to control the epidemic, assuming that condom use reduces the probability of transmission by 95%?

images

40. Botswana is a midsized country in southern Africa. With a population of just over two million people, it is one of the most sparsely populated countries in the world. Using 2006 data from the Human Life-Table Database (see Example 5), one can approximate the survival functions of males and females as

These approximations work fairly well as shown below, with males in blue and females in red.

images

Use numerical integration to estimate the life expectancy for males and females in Botswana.

CHAPTER 7 REVIEW QUESTIONS

Consider the following data set corresponding to scores on a test. Let X denote a randomly chosen test score.

Score	Frequency
50–59	6
60–69	14
70–79	26
80–89	10
90–99	4
1. Construct a histogram.
2. Find P(0 ≤ X ≤ 89).
3. Find P(X > 79).
Let f(x) = ax³, 0 ≤ x ≤ 4.
1. Find a constant a so that it is a PDF.
2. Find the mean of this PDF.
Consider the hyperbolic function

for any k > 0.
1. Show that F(x) is a CDF.
2. Let X be a random variable with a CDF F(x). Find P(1 ≤ X ≤ 2).
Use the comparison test to prove that the integral

is convergent for any r > 0.
Determine for which p > 0 values the integral is convergent.
Use the convergence test to determine whether the given integrals converge or diverge.
Show that f(x) = is a PDF on [1, 2] and find its CDF.
Compute the mean, variance, and standard deviation for a pair of dice; that is, this data set:
2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 11, 11, 12
Compute the mean and variance of the random variable with PDF f(x) = for x ≥ 1 and f(x) = 0 elsewhere.
According to Thomson et al. (see Problem 36 in Problem Set 7.1), the elimination constant for lidocaine for patients with congestive heart failure is 0.31 per hour. Hence, for a patient who has received an initial dosage of y₀ mg, the lidocaine level y(t) in the body can be modeled by the differential equation dy/dt = −0.31y with y(0) = y₀.
1. Solve for y(t).
2. Write an expression, call it F(t), that represents the fraction of drug that has left the body by time t ≥ 0.
3. If F(t) = 0 for t ≤ 0, verify that F(t) is a CDF.
4. What is the probability that a randomly chosen molecule of drug leaves the body in the first two hours?
5. What is the probability that a randomly chosen molecule of drug leaves the body between the start of the second and start of the fourth hour?
The 1999 American Academy of Physician Assistants (AAPA) Physician Assistant Census Survey found that the mean income for a clinically practicing physician's assistant (PA) working full-time was $68,164, with a standard deviation $17,408. Using Chebyshev's inequality, determine a lower bound for the fraction of PAs with an income between $42,052 and $94,276.
The time for a mosquito to mature from larva to pupa is approximately exponentially distributed with a mean of fourteen days. Find the probability that a mosquito has matured from larva to pupa in ten days or less. Find the probability a mosquito has taken at least fourteen days to mature from larva to pupa.
According to Alexei A. Sharov, Department of Entomology at Virginia Tech, mortality depends on many factors such as temperature, population, and density. Sharov also said that when life tables are built, the effect of these factors is averaged and only age is considered as a factor that determines mortality. Sharov developed a life table for a sheep population in which females are counted once a year, immediately after breeding season:

Use this table to compute the life expectancy of a female sheep.
Consider a random variable X with the Pareto distribution with parameter p = 5.
1. According to Chebyshev's inequality, what is a lower bound for the probability of X being within two standard deviations of its mean?
2. Find the probability that X is within two standard deviations of its mean.
Using 2006 data from Human Life-Table Database (see Example 5 of Section 7.5) one can approximate the survival functions of females in Botswana as
1. Find the fraction of women who live at least forty years.
2. Find the mortality rate at age 40.
For loggerhead turtles, females become reproductively active around age 21 years. The fraction of individuals that live to age 21 is 0.0023, and the mortality rate for individuals older than 21 is approximately 0.2 per year. Reproductively active females produce, on average, 160 eggs per year of which half are daughters. Estimate the reproductive number R₀ for this population. What is the fate of the population? How much does one need to reduce the mortality rate of individuals ≥ 21 to reverse this fate?
Given a continuously differentiable survivorship function l(t), let X correspond to the lifetime of a randomly chosen individual. What is the PDF and CDF for X?
Find parameters a and r in the logistic PDF

such that the mean of this PDF is 1 and the variance is π².
You are told that a set of data is normally distributed with mean and variance equal to 21 and 64, respectively. Estimate the proportion of these data that have a value greater than or equal to 29.
In an experiment involving the rearing of cohorts of the human louse, Pediculus humanus, researchers Francis Evans and Fredrick Smith at the University of Michigan obtained data on the proportion of individuals that survive over time. From their data, one can calculate the proportion of adults dying in each week of their experiment as follows*: {0.124, 0.301, 0.273, 0.210, 0.083, 0.009}. Since these six values sum to one, they imply that all individuals are dead by the end of the sixth week. In the following figure, the cumulative proportion of adults dying each week is plotted along with the CDF of the best-fitting lognormal distribution.

If the equation of the best-fitting lognormal PDF is

then what is the mean survival time for these lice? How does this value compare to the value obtained by calculating the mean survival directly from the data, if the six values correspond to survival midway between each week? What accounts for the difference in these two approaches to estimating the mean?


GROUP PROJECTS

Seeing a project through on your own, or working in a small group to complete a project, teaches important skills. The following projects provide opportunities to develop such skills.

Project 7A Fitting Distributions

Search the Web for a data set consisting of at least several hundred data points. Explore your data as outlined next; provide illustrations to enhance the presentation of your analysis.

Draw histograms for several different bin sizes. Select the histogram that results in the smoothest looking probability distribution in terms of being approximated by some curve. Note that if the bin size is too large, the histogram will look like a few big blocks. If the bin size is too small, the histogram will look like a picket fence with lots of missing staves.
Calculate the mean and variance from the histogram. Compare these values to the values you get when calculating the mean and variance directly from the data.
Calculate the expected fraction of data in each bin of a theoretical histogram obtained from a uniform, logistic, normal, and lognormal distribution that has the same mean and variance as the histogram you constructed from the data.
Use a sum-of-squares measure to compare how well the four distributions in step 3 fit the data and discuss your results.
Bonus: Search the Web or books for other distributions not dealt with in this chapter and repeat steps 2 and 3 for these distributions.

Project 7B Play with Logistic Regression

Use an appropriate computer technology to generate a set of data that conforms to the logistic distribution

p(x) = 1 / (1 + e^(a − rx))

for the case a = 5 and r = 1 as follows.

First verify that p(0.5) = 0.011 and p(10) = 0.993. Thus, x ∈ [0.5, 10] covers more than 98% of the range of values that p(x) can assume.
Use your technology to generate 100 values x_i, i = 1,..., 100, of a random variable X that is uniformly distributed on [0.5, 10]. Make sure that the mean and variance of these 100 values conform to the theoretically expected values.
For each x_i, calculate the corresponding . Now for each i generate a value z_i from the uniform distribution on [0, 1]. (Most technologies refer to this as generating a value at random between 0 and 1.) If z_i > p_i set y_i = 0; otherwise, set y_i = 1. Once you have done this for all i = 1,..., 100, you will have a data set D = {(x_i, y_i)|i = 1,..., 100)} with values of x_i between 0.5 and 10 and a value of y_i either 0 or 1.
Construct a histogram for these data using six equal bin sizes and the proportion of data points in the bin that have a y_i value equal to 1.
Use logistic regression to estimate the parameters â and from the best-fitting linear model of the transformed data from the histogram. How close are â and to the values 5 and 1, respectively?
Now repeat the exercise with 300 points and again with 1000 points. In each case, how close are â and to the values 5 and 1, respectively? What do you notice?
Write a report that contains your results, and in the concluding section interpret the basic concepts behind this exercise.
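The simulation and fitting steps of Project 7B might be sketched as follows. Several things here are assumptions rather than the text's prescribed method: the logistic curve is taken to be p(x) = 1/(1 + e^(a - rx)), which reproduces the check values p(0.5) = 0.011 and p(10) = 0.993 for a = 5 and r = 1. The function names are hypothetical. Bins whose observed proportion is exactly 0 or 1 are skipped because the transformation ln(q/(1 - q)) is undefined there. Since ln(p/(1 - p)) = rx - a under the assumed form, the slope of the fitted line estimates r and the negative of the intercept estimates a.

```python
import math
import random
from statistics import fmean, pvariance

A_TRUE, R_TRUE = 5.0, 1.0


def p(x, a=A_TRUE, r=R_TRUE):
    """Logistic curve in the assumed form p(x) = 1 / (1 + exp(a - r*x))."""
    return 1.0 / (1.0 + math.exp(a - r * x))


def simulate(n, rng, lo=0.5, hi=10.0):
    """Draw x_i uniformly on [lo, hi]; set y_i = 0 if z_i > p(x_i), else y_i = 1."""
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    ys = [0 if rng.random() > p(x) else 1 for x in xs]
    return xs, ys


def binned_proportions(xs, ys, n_bins=6, lo=0.5, hi=10.0):
    """Return bin midpoints, points per bin, and number of y_i = 1 per bin."""
    width = (hi - lo) / n_bins
    totals = [0] * n_bins
    ones = [0] * n_bins
    for x, y in zip(xs, ys):
        k = min(int((x - lo) / width), n_bins - 1)
        totals[k] += 1
        ones[k] += y
    mids = [lo + (k + 0.5) * width for k in range(n_bins)]
    return mids, totals, ones


def fit_logit_line(mids, totals, ones):
    """Least-squares line through (x, ln(q/(1-q))); returns (a_hat, r_hat)."""
    pts = []
    for m, n, s in zip(mids, totals, ones):
        if 0 < s < n:  # skip bins where the proportion is 0 or 1
            q = s / n
            pts.append((m, math.log(q / (1 - q))))
    if len(pts) < 2:
        raise ValueError("need at least two bins with a proportion strictly between 0 and 1")
    mx = fmean(x for x, _ in pts)
    my = fmean(y for _, y in pts)
    slope = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
    intercept = my - slope * mx
    return -intercept, slope  # a_hat = -intercept, r_hat = slope


print("p(0.5) =", round(p(0.5), 3), " p(10) =", round(p(10), 3))
print("theoretical mean and variance of X:", (0.5 + 10) / 2, (10 - 0.5) ** 2 / 12)

rng = random.Random(0)
for n in (100, 300, 1000):
    xs, ys = simulate(n, rng)
    a_hat, r_hat = fit_logit_line(*binned_proportions(xs, ys))
    print(n, "points: sample mean", fmean(xs), "variance", pvariance(xs),
          "a_hat", a_hat, "r_hat", r_hat)
```

Skipping the all-0 and all-1 bins is one simple design choice among several. With only 100 points it can leave just a few bins for the fit, which is part of what the comparison across 100, 300, and 1000 points is meant to reveal.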

*This version of the statement can be found in William Hermann's published conversations with Einstein (W. Hermann, 1983. Einstein and the Poet: In Search of Cosmic Man. Branden Press, Inc., Brookline Village, MA 02147, p. 58.).

*P. D. Thomson, K. L. Melmon, J. A. Richardson, et al., “Lidocaine Pharmacokinetics in Advanced Heart Failure, Liver Disease, and Renal Failure in Humans,” Ann Intern Med 78 (1973): 499–508.

*Posted January 10, 2002 at http://www.sciencedaily.com/releases/2002/01/020109074801.htm

*W. W. Rouse Ball, A Short Account of the History of Mathematics, London: Macmillan, 1893.

*See www.dh.gov.hk.

*See Peter K. Dunn, “A Simple Dataset for Demonstrating Common Distributions,” Journal of Statistics Education 7 (no. 3, 1999).

*J. L. Alvarado et al., “Patch Recolonization by the Tunicate Pyura praeputialis in the Rocky Intertidal of the Bay of Antofagasta, Chile: Evidence for Self-Facilitation Mechanisms,” Marine Ecology Progress Series 224 (2001): 93–101.

*T. A. Ebert, Plant and Animal Populations: Methods in Demography (San Diego: Academic Press, 1999).

*These values apply to the group of individuals that had already survived the egg and three larval stages, and these data are adapted by us from Francis C. Evans and Frederick E. Smith, “The Intrinsic Rate of Natural Increase for the Human Louse, Pediculus humanus L.,” American Naturalist 86 (No. 830)(1952): 299–310.
