The probability density function

So far, we have considered the cumulative distribution function as the main way to describe a random variable. However, for a large class of important models, the probability density function (pdf) provides a useful alternative characterization.

To understand the distinction between the cdf and pdf, we need the notion of probability. In the context of random variables, probability simply means the likelihood that the random outcome falls within a certain range of values, expressed as a number between 0 and 1. For example, recall the women's heights discussed in the previous section. We concluded that 42.8% of women have a height between 63 inches and 68 inches. An alternative way to express this is to say that, for the random variable representing women's heights, the probability that the outcome is between 63 and 68 is 0.428.
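As a quick sketch, the same kind of probability can be computed in Python with SciPy. The mean and standard deviation below are placeholders rather than the values from the previous section, so the result will only reproduce 0.428 if the actual parameters of that model are substituted:

```python
# Sketch: probability of a height between 63 and 68 inches under a Normal model.
# The mean and standard deviation are placeholder assumptions; replace them with
# the parameters of the women's heights model from the previous section.
from scipy.stats import norm

mu, sigma = 65.0, 3.5          # placeholder parameters, not from the text
heights = norm(loc=mu, scale=sigma)

prob = heights.cdf(68) - heights.cdf(63)   # P(63 <= height <= 68)
print(prob)
```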

The main distinction between the cdf and pdf is the way probabilities are represented by each of them:

  • For a cdf, the probability that the outcome is in a range is computed as the difference between the values of the cdf at the endpoints of the range
  • For a pdf, the probability that the outcome is in a range is computed as the area under the curve determined by the range

To clarify these concepts, let's consider the following figure:

[Figure: The cdf and pdf of a standard Normal distribution, with a range from a to b marked on the horizontal axis]

This figure displays the cdf and pdf of a standard Normal distribution. In both plots, a range of values on the horizontal axis is defined by the values a and b, and the figure illustrates graphically how the probability that the outcome falls in this range is represented in each case:

  • In the case of the cdf, the probability is given by F(b)-F(a), which corresponds to the length of the highlighted segment on the y axis
  • In the case of the pdf, the probability is given by the area bounded by the curve between the values a and b

This observation explains why the cdf is more useful computationally. To compute a probability with the cdf, all we need is the difference between two of its values, whereas the same calculation with the pdf requires finding the value of an area. Since the region under the curve is not a simple geometric shape, computing that area requires calculus. In fact, in the case of the Normal distribution, the area cannot be expressed in closed form using the methods usually taught in calculus courses! Of course, this complexity still exists when we do the computations in Python, but the details are fortunately hidden from us.
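The following sketch illustrates this point for a standard Normal variable: the cdf approach is a simple subtraction, while the pdf approach delegates the area computation to a numerical integration routine. The endpoints a and b are arbitrary choices for illustration:

```python
# Two equivalent ways to compute P(a <= X <= b) for a standard Normal variable.
from scipy.stats import norm
from scipy.integrate import quad

a, b = -1.0, 1.5               # arbitrary endpoints for illustration

# Using the cdf: a simple difference, F(b) - F(a)
prob_from_cdf = norm.cdf(b) - norm.cdf(a)

# Using the pdf: the area under the curve, found by numerical integration
prob_from_pdf, _ = quad(norm.pdf, a, b)

print(prob_from_cdf, prob_from_pdf)   # the two values agree up to numerical error
```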

This is a good time to mention briefly how the pdf is related to significant characteristics of a distribution such as mean and standard deviation. When talking about a random variable, the notion of average is technically called the expected value or mean of the random variable. Intuitively, this is the average of the values that we expect to see if a large number of trials with the same distribution is observed. Likewise, the variance of the random variable is the average squared deviation from the mean.

Finally, the standard deviation is the square root of the variance. Unfortunately, to give a mathematical definition of these notions for a continuous random variable, we again need calculus. As this book concentrates on the practical application of Python to data analysis, we will be content with the intuitive meaning of these concepts and let the computer do the dirty computational work under the hood.
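As an illustration of letting the computer do this work, the sketch below asks SciPy for the mean, variance, and standard deviation of a Normal distribution and then checks that a large simulated sample gives approximately the same values. The parameters are arbitrary and chosen only for this example:

```python
# Mean, variance, and standard deviation of a Normal distribution in SciPy.
from scipy.stats import norm

dist = norm(loc=10.0, scale=2.0)             # arbitrary illustrative parameters
print(dist.mean(), dist.var(), dist.std())   # exact values: 10.0, 4.0, 2.0

# A large number of simulated trials gives approximately the same values:
# the sample mean approximates the expected value, and the sample variance
# and standard deviation approximate those of the random variable.
sample = dist.rvs(size=100_000, random_state=0)
print(sample.mean(), sample.var(), sample.std())
```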

We finalize by pointing out that we have, in this section, concentrated on a continuous distribution characterized by a smooth cdf. On the other extreme are discrete distributions, which have a cdf that looks like a staircase, much like the examples seen in the previous section. Discrete random variables cannot be represented by a pdf. Instead, they are defined in terms of a probability mass function (pmf). An example of a discrete random variable is considered in the next section.
