CHAPTER
5

Calculating Descriptive Statistics: Measures of Dispersion

In This Chapter

  • Calculating the range of a sample
  • Calculating the variance and standard deviation of a sample and population
  • Calculating the variance and standard deviation of grouped data
  • Using measures of relative position to identify outliers
  • Using Excel to calculate measures of dispersion

In Chapter 4, we calculated measures of central tendency by summarizing our data set into a single value. But in doing so, we lost information that could be useful. For the video game example, if the only information I provided you was that the mean of my sample was 6.6 hours, you would not know whether all the values were between 6 and 7 hours or whether the values varied between 1 and 12 hours. As you will see later, this distinction can be very important.

To address this issue, we rely on the second major category of descriptive statistics, measures of dispersion, which describe how far the individual data values have strayed from the mean. So let’s look closely at them and at the different ways we can measure dispersion.

Measures of dispersion tell me how the data are spread out around the mean. The smaller the value for the measure of dispersion is, the closer the data values are to the mean, whereas the larger the value for the measure of dispersion is, the more spread out the data values are.

To see it clearly, let’s look at these two data sets:

1            3              5              7            9

              3      4      5      6      7

Each data set has 5 observations and each has the same mean, which is 5. But the values in the first data set are more spread out, while the values in the second data set are closer to the mean. The standard deviation (which is a measure of dispersion) captures this difference: it is larger for the first data set than for the second.
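To check this claim for yourself, here is a minimal Python sketch. Python is an assumption on my part (the chapter itself relies on Excel); the variable names `wide` and `narrow` are just illustrative labels for the two data sets above.

```python
from statistics import mean, stdev

wide = [1, 3, 5, 7, 9]      # first data set: values spread far from the mean
narrow = [3, 4, 5, 6, 7]    # second data set: values close to the mean

for name, data in (("wide", wide), ("narrow", narrow)):
    # stdev() treats the data as a sample (it divides by n - 1)
    print(f"{name}: mean = {mean(data)}, standard deviation = {stdev(data):.2f}")
```

Both means come out as 5, while the sample standard deviation is about 3.16 for the first data set and about 1.58 for the second.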

In this chapter, we will look at measures of dispersion for ungrouped data, measures of dispersion for grouped data, and measures of relative position. So, let’s get started.

Measures of Dispersion for Ungrouped Data

These measures include the range, the variance, and the standard deviation. Let’s see what each one means.

Range

The range is the simplest measure of dispersion and is calculated by finding the difference between the highest value and the lowest value in the data set. To demonstrate how to calculate the range, I’ll use the following example.

DEFINITION

Obtain the range of a sample by subtracting the smallest value from the largest value.

One of Debbie’s special qualities is that she is a dedicated grill-a-holic when it comes to barbequing in the backyard. The following data set represents the number of meals each month that Debbie cranks out on the turbo-charged grill:

7      9      8      11      4

The range of this sample would be:

Range = 11 – 4 = 7 meals

As you can see, range is a very simple measure of dispersion. Yet it has a limitation—it relies on only two data values to describe the variation in the sample. No other values between the highest and lowest points are part of the range calculation. Variance and standard deviation, on the other hand, don’t suffer from this limitation.
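The range calculation, and its blind spot, can be sketched in a few lines of Python. The helper name `sample_range` and the second data set are hypothetical, made up only to show that values between the extremes do not affect the result.

```python
def sample_range(values):
    """Return the largest value minus the smallest value."""
    return max(values) - min(values)

meals = [7, 9, 8, 11, 4]        # Debbie's grilling data from the text
clustered = [4, 4, 4, 4, 11]    # hypothetical data: same extremes, different middle

print(sample_range(meals))      # 11 - 4 = 7
print(sample_range(clustered))  # also 7, even though the data look very different
```

Because only the two extreme values enter the calculation, both data sets report the same range.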

Variance

One of the most common measures of dispersion in statistics is the variance, which summarizes the squared deviation of each data value from the mean. The formula for the sample variance is:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

where:

$s^2$ = the variance of the sample

$\bar{x}$ = the sample mean

$n$ = the size of the sample

$(x_i - \bar{x})$ = the deviation of each data value from the sample mean

DEFINITION

The variance is a measure of dispersion that describes the relative distance between the data points in the set and the mean of the data set. This measure is widely used in inferential statistics.

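The first step in calculating the sample variance is to determine the mean of the data set, which in the grilling example is 7.8 meals per month. The rest of the calculations can be facilitated by the following table.

$x_i$      $\bar{x}$      $x_i - \bar{x}$      $(x_i - \bar{x})^2$
7          7.8            -0.8                 0.64
9          7.8            1.2                  1.44
8          7.8            0.2                  0.04
11         7.8            3.2                  10.24
4          7.8            -3.8                 14.44
                          Total                26.80

The final sample variance calculation becomes this:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{26.8}{5 - 1} = 6.7$$

For those of us who like to do things in one step, we can also do the entire variance calculation in the following equation:

$$s^2 = \frac{(7 - 7.8)^2 + (9 - 7.8)^2 + (8 - 7.8)^2 + (11 - 7.8)^2 + (4 - 7.8)^2}{5 - 1} = \frac{26.8}{4} = 6.7$$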
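The same deviation-by-deviation calculation can be sketched in Python. This is only an illustration under my own naming (`x_bar`, `squared_devs`, and `s2` are not from the text); it squares each deviation from the mean, adds the squares, and divides by n - 1.

```python
meals = [7, 9, 8, 11, 4]          # Debbie's grilling data
n = len(meals)
x_bar = sum(meals) / n            # sample mean: 39 / 5 = 7.8

# One squared deviation per data value
squared_devs = [(x - x_bar) ** 2 for x in meals]

for x, d2 in zip(meals, squared_devs):
    print(f"{x:>3} {x - x_bar:>6.1f} {d2:>8.2f}")

# Sample variance: sum of squared deviations over n - 1
s2 = sum(squared_devs) / (n - 1)
print(f"sample variance = {s2:.1f}")
```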

Using the Raw Score Method (When Grilling)

A more efficient way to calculate the variance of a data set is known as the raw score method. Even though at first glance this equation may look more imposing, its bark is much worse than its bite. Check it out and decide for yourself what works best for you.

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$$

where:

$\sum_{i=1}^{n} x_i^2$ = the sum of each data value after it has been squared

$\left(\sum_{i=1}^{n} x_i\right)^2$ = the square of the sum of all the data values

Okay, don’t have heart failure just yet. Let me lay this out in the following table to prove to you there are fewer calculations than with the previous method.

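$x_i$      $x_i^2$
7          49
9          81
8          64
11         121
4          16

$\sum x_i = 39$      $\sum x_i^2 = 331$

$$s^2 = \frac{331 - \frac{(39)^2}{5}}{5 - 1} = \frac{331 - 304.2}{4} = \frac{26.8}{4} = 6.7$$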

As you can see, the results are the same regardless of the method used. The benefits of the raw score method become more obvious as the size of the sample (n) gets larger.
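For readers who want to confirm that the two methods agree, here is a hedged Python sketch. The function name `raw_score_variance` is my own; `statistics.variance` is the standard library's sample variance and serves only as a cross-check.

```python
from statistics import variance

def raw_score_variance(values):
    """Sample variance via the raw score method."""
    n = len(values)
    sum_x = sum(values)                    # sum of the data values
    sum_x2 = sum(x * x for x in values)    # sum of the squared data values
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

meals = [7, 9, 8, 11, 4]
print(f"raw score method: {raw_score_variance(meals):.1f}")
print(f"library check:    {variance(meals):.1f}")
```

One caution: the raw score method is convenient by hand, but in floating-point arithmetic it subtracts two large, nearly equal numbers, so it can lose precision when the values are very large.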

BOB’S BASICS

If you are calculating the variance by hand, my advice is to do your fingers and calculator battery a favor and use the raw score method.

Variance of a Population

So far, we have discussed the variance in the context of samples. The good news is the variance of a population is calculated in much the same manner as the sample variance. The bad news is we need to introduce another funny-looking Greek symbol: lowercase sigma. The equation for the variance of a population is as follows:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

where:

$\sigma^2$ = the variance of the population (pronounced “sigma squared”)

$(x_i - \mu)$ = the deviation of each data value from the population mean, $\mu$

$N$ = the size of the population

WRONG NUMBER

Notice that the denominator for the population variance equation is N, whereas the denominator for the sample variance is n – 1.

The raw score version of this equation is:

$$\sigma^2 = \frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N}$$

Even though this procedure closely follows the one for the sample variance, let me demonstrate with another example. Let’s say I am considering my statistics class as my population and the following ages are the measurement of interest. (Can you guess which one is me? My age adds a little spice to the variance.)

21      23      28      47      20      19      25      23

I’ll use the raw score method for this calculation with the population size (N) equal to 8. (I’d love to see a class this size.)

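$x_i$      $x_i^2$
21         441
23         529
28         784
47         2,209
20         400
19         361
25         625
23         529

$\sum x_i = 206$      $\sum x_i^2 = 5{,}878$

$$\sigma^2 = \frac{5{,}878 - \frac{(206)^2}{8}}{8} = \frac{5{,}878 - 5{,}304.5}{8} = \frac{573.5}{8} \approx 71.7$$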
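The population calculation for the class ages can be sketched the same way in Python. The variable names are my own; `statistics.pvariance` is the standard library's population variance (it divides by N) and is used here only as a cross-check.

```python
from statistics import pvariance

ages = [21, 23, 28, 47, 20, 19, 25, 23]   # the class, treated as a population
N = len(ages)

sum_x = sum(ages)                   # 206
sum_x2 = sum(x * x for x in ages)   # 5878

# Raw score method with N (not N - 1) in the denominator
sigma2 = (sum_x2 - sum_x ** 2 / N) / N
print(f"population variance = {sigma2:.4f}")
print(f"library check       = {pvariance(ages):.4f}")
```

Both lines report 71.6875, which rounds to the 71.7 used in the rest of the chapter.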

Standard Deviation

This measure is pretty straightforward. The standard deviation is simply the square root of the variance. Just as with the variance, there is a standard deviation for both the sample and population, as shown in the following equations.

DEFINITION

A standard deviation is the square root of a variance.

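Sample standard deviation:

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Population standard deviation:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$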

To calculate the standard deviation, you must first calculate the variance and then take the square root of the result. Recall from the previous sections that the variance from the sample of the number of meals Debbie grilled per month was 6.7. The standard deviation of this sample is as follows:

$$s = \sqrt{6.7} \approx 2.59 \text{ meals}$$

Also recall the variance for the age of my class was 71.7. The standard deviation of the age of this population is as follows:

$$\sigma = \sqrt{71.7} \approx 8.47 \text{ years}$$

The standard deviation is actually a more useful measure than the variance because the standard deviation is in the units of the original data set. In comparison, the variance for the grill example would be 6.7 “meals squared,” and the variance for the age example would be 71.7 “years squared.” I don’t know about you, but I’m not too thrilled having my age reported as 2,209 years squared. I’ll take the standard deviation over the variance any day.
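As a quick check of both standard deviations, here is a small Python sketch. It assumes the standard library's `stdev` (sample, n - 1 in the denominator) and `pstdev` (population, N in the denominator); the variable names are mine.

```python
from math import sqrt
from statistics import stdev, pstdev

meals = [7, 9, 8, 11, 4]                  # sample
ages = [21, 23, 28, 47, 20, 19, 25, 23]   # population

s = stdev(meals)        # square root of the sample variance (6.7)
sigma = pstdev(ages)    # square root of the population variance (71.6875)

print(f"s = {s:.2f} meals")          # about 2.59
print(f"sigma = {sigma:.2f} years")  # about 8.47
```

Notice that the results come back in meals and years, the units of the original data.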

Measures of Dispersion for Grouped Data

We start by calculating the variance of grouped data. (Remember, grouped data are data presented as a frequency distribution instead of as raw, ungrouped values.) Then to find the standard deviation, just take the square root of the variance.

There are different versions of the formula to calculate the variance for grouped data. This is the one I prefer:

$$s^2 = \frac{\sum_{i=1}^{k} f_i M_i^2 - n\bar{x}^2}{n - 1}$$

Where:

$f_i$ = the frequency in each class

$M_i$ = the midpoint of each class

$k$ = the number of classes

$\bar{x}$ = the sample mean

$n$ = the number of observations

I like this formula because I think it’s the simplest: just calculate the $f_i M_i^2$ column, as shown below, by multiplying the frequency by the square of the midpoint for each class. Add them up and you get $\sum_{i=1}^{k} f_i M_i^2$. To illustrate, let’s apply the formula to our grade example from the last chapter.

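In this example, k = 5 classes and n = 20 observations. We calculated the mean in Chapter 4, and it was 77.5 points. We just calculated $\sum_{i=1}^{k} f_i M_i^2$ in the table above and got 123,645. Substitute in the formula to get the variance as follows:

$$s^2 = \frac{123{,}645 - 20(77.5)^2}{20 - 1} = \frac{123{,}645 - 120{,}125}{19} = \frac{3{,}520}{19} \approx 185.26$$

Then take the square root of the variance to find the standard deviation, shown below.

The standard deviation, $s = \sqrt{185.26} \approx 13.61$ points.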
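The grouped-data calculation can be sketched in Python as follows. The function name `grouped_variance` is hypothetical, and it assumes the formula that subtracts n times the squared mean from the sum of the frequency-times-squared-midpoint column, then divides by n - 1. The first call uses only the totals quoted above; the small frequency table after it is made-up data (NOT the grade data) that shows how those totals are built.

```python
from math import sqrt

def grouped_variance(sum_fm2, n, mean):
    """Sample variance of grouped data from the f*M^2 total, n, and the mean."""
    return (sum_fm2 - n * mean ** 2) / (n - 1)

# Totals quoted in the text for the grade example
s2 = grouped_variance(sum_fm2=123_645, n=20, mean=77.5)
print(f"variance = {s2:.2f}, standard deviation = {sqrt(s2):.2f}")

# Hypothetical frequency distribution, for illustration only
freqs = [3, 5, 2]       # f: frequency of each class
mids = [10, 20, 30]     # M: midpoint of each class
n = sum(freqs)
mean = sum(f * m for f, m in zip(freqs, mids)) / n
sum_fm2 = sum(f * m ** 2 for f, m in zip(freqs, mids))
print(f"hypothetical variance = {grouped_variance(sum_fm2, n, mean):.2f}")
```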

Measures of Relative Position

Another way of looking at dispersion of data is through measures of relative position, which describe the percentage of the data below a certain point. This technique includes quartile and interquartile measurements.

Quartiles

Quartiles divide the data set into four equal segments after it has been arranged in ascending order. Approximately 25 percent of the data points will fall below the first quartile, Q1. Approximately 50 percent of the data points will fall below the second quartile, Q2. And, you guessed it, 75 percent should fall below the third quartile, Q3. To demonstrate how to calculate Q1, Q2, and Q3, let’s look at the following data showing the estimated population of the 10 largest cities in the United States from the 2010 Census.

DEFINITION

Quartiles measure the relative position of the data values by dividing the data set into four equal segments.

City                  Population (in millions)
New York, NY          8.18
Los Angeles, CA       3.79
Chicago, IL           2.70
Houston, TX           2.10
Philadelphia, PA      1.53
Phoenix, AZ           1.45
San Antonio, TX       1.33
San Diego, CA         1.31
Dallas, TX            1.20
San Jose, CA          0.95

Source: U.S. Census Bureau census.gov/2010census/popmap

1. Arrange the population data in ascending order:

0.95      1.20      1.31      1.33      1.45      1.53      2.10      2.70      3.79      8.18

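2. Find the median of the data set. This is Q2.

Median position = $\frac{n + 1}{2} = \frac{10 + 1}{2} = 5.5$

So Q2 = $\frac{1.45 + 1.53}{2} = 1.49$, the average of the fifth and sixth values.

3. Find the median of the lower half of the data (in parentheses). This is Q1.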

(0.95      1.20      1.31      1.33      1.45)      1.53      2.10      2.70      3.79      8.18

Q1 = 1.31

4. Find the median of the upper half of the data (in parentheses), which is Q3.

0.95      1.20      1.31      1.33      1.45      (1.53      2.10      2.70      3.79      8.18)

Q3 = 2.70

Now we have all quartiles.

Q1 = 1.31

Q2 = 1.49

Q3 = 2.70
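The four steps above can be sketched in Python. The function name `quartiles` is my own, and the handling of an odd number of values (leaving the middle value out of both halves) is an assumption, since the example in the text has an even count.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves steps."""
    data = sorted(values)              # step 1: ascending order
    n = len(data)
    lower = data[: n // 2]             # lower half
    # Assumption: with an odd count, the middle value joins neither half
    upper = data[(n + 1) // 2 :]       # upper half
    return median(lower), median(data), median(upper)

populations = [8.18, 3.79, 2.70, 2.10, 1.53, 1.45, 1.33, 1.31, 1.20, 0.95]
q1, q2, q3 = quartiles(populations)
print(f"Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.2f}")
```

Be aware that software packages often compute quartiles by interpolation instead, so their answers may differ slightly from this hand method.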

RANDOM THOUGHTS

California and Texas each have three cities among the 10 most populated cities in the United States according to the 2010 Census!

Interquartile Range (IQR)

When you have established the quartiles, you can easily calculate the interquartile range (IQR); the IQR measures the spread of the center half of our data set. It is simply the difference between the third and first quartiles, as follows:

IQR = Q3 – Q1

DEFINITION

The interquartile range measures the spread of the center half of the data set and identifies potential outliers. Outliers are observations with extreme values in the upper or lower quartiles and should be examined before you use them in analysis.

In our previous example of the top 10 most populated cities in the United States, IQR = 2.70 – 1.31 = 1.39. This is the width of the interval that contains the middle 50 percent of our data.

The interquartile range is used to identify outliers, which are the “black sheep” of our data sets. Outliers are observations with extreme values on either the upper or lower end of the data. Examine these outliers to determine how and why they appeared and whether similar values may continue to appear. You only need to discard outliers if they are mistakes and don’t belong to the data set. For example, if you are looking at a GPA data set on a 4.0 scale and you see an observation of 6.0, then it’s definitely a mistake and does not belong. However, if the outliers belong to the data set, keep them even if they have extreme values; they may help you learn something valuable about the data under investigation. John Tukey defined an outlier as any value that falls outside the range bounded by these two cutoffs:

Q1 – 1.5 (IQR) and Q3 + 1.5 (IQR)

So any value less than Q1 - 1.5 (IQR) or any value greater than Q3 + 1.5 (IQR) is considered an outlier. Let’s apply this to our example of the top 10 most populated cities in the United States.

Q1 – 1.5 (IQR) = 1.31 – 1.5 (1.39) = -0.775

Q3 + 1.5 (IQR) = 2.70 + 1.5 (1.39) = 4.785

Looking at our data, we see that only one of the 10 cities falls outside that range: New York City, with its 8.18 million people, is considered an outlier. However, it does belong to the data set; it’s just an observation with an extremely large value compared to the rest of the data. If you have ever been to New York City, this won’t surprise you!
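Tukey's cutoffs are easy to sketch in Python. The function name `tukey_fences` and the parameter `k` are my own labels; the quartiles are the ones calculated by hand earlier in this section.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

cities = {
    "New York, NY": 8.18, "Los Angeles, CA": 3.79, "Chicago, IL": 2.70,
    "Houston, TX": 2.10, "Philadelphia, PA": 1.53, "Phoenix, AZ": 1.45,
    "San Antonio, TX": 1.33, "San Diego, CA": 1.31, "Dallas, TX": 1.20,
    "San Jose, CA": 0.95,
}

lower, upper = tukey_fences(q1=1.31, q3=2.70)

# Flag any city whose population falls outside the two cutoffs
outliers = [city for city, pop in cities.items() if pop < lower or pop > upper]
print(f"cutoffs: {lower:.3f} to {upper:.3f}")
print("outliers:", outliers)
```

Flagging a value this way only marks it for a closer look; as noted above, whether to keep it depends on whether it genuinely belongs to the data set.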

TEST YOUR KNOWLEDGE

The IQR method for checking outliers was introduced by John Tukey (1915–2000), a famous American statistician. When asked why he uses the factor 1.5 in the outlier formula, Tukey answered, “because 1 is too small and 2 is too large.”

Excel to the Rescue

I know that calculating the variance and standard deviation using the formula can be tedious and time consuming. Here comes Excel to help. Excel can calculate the range, variance, and standard deviation for you, and you don’t need to complete any additional steps! Not even one! If we look at the Chapter 4 example from the section “Using Excel to Calculate Central Tendency” and look at the descriptive statistics table Excel provided us (shown again below, Figure 5.1), you will see that the table includes the range, variance, and standard deviation. So repeating those same steps, you can create a descriptive statistics table for any data set to analyze measures of dispersion.

As we can see from the previous figure, the range is $280 million, the sample variance is 10,194.72 (in millions of dollars, squared), and the standard deviation is $100.97 million. Piece of cake!

This wraps up our discussion on the different ways to describe measures of dispersion.

WRONG NUMBER

The values for variance and standard deviation reported by Excel are for a sample. If your data set represents a population, you need to recalculate the results using N in the denominator rather than n – 1, either manually or with Excel’s VAR.P and STDEV.P worksheet functions.

Figure 5.1

Measures of dispersion for the celebrity net worth example.

Practice Problems

1. Calculate the variance, standard deviation, and the range for the following sample data set:
20, 15, 24, 10, 8, 19, 24

2. Calculate the variance, standard deviation, and the range for the following population data set:
84, 82, 90, 77, 75, 77, 82, 86, 82

3. Calculate the variance, standard deviation, and the range for the following sample data set:
36, 27, 50, 42, 27, 36, 25, 40

4. Calculate the quartiles and the cutoffs for the outliers for the following data set:
8, 11, 6, 2, 11, 6, 5, 6, 10, 15

5. A company counted the number of their employees in each of the age classes as follows. According to this distribution, what is the standard deviation for the age of the employees in the company?

Age Range      Number of Employees
20–24          8
25–29          37
30–34          25
35–39          48
40–44          27
45–49          10

The Least You Need to Know

  • The range of a data set is the difference between the largest value and smallest value.
  • The variance of a data set summarizes the squared deviation of each data value from the mean.
  • The standard deviation of a data set is the square root of the variance and is expressed in the same units as the original data values.
  • The interquartile range measures the spread of the center half of the data set and identifies outliers, which are extreme values that need to be examined before using them in your analysis.