Chapter 5

Estimation

Preview: Estimation is an important concept to understand no matter what field you work in. Estimating outcomes may seem like a simple matter, but it is actually a complicated process. To create an accurate estimate, it is important not only to gather the right information but also to make the proper inferences from that data. The sample size plays a role here as well and can have a profound impact on the accuracy of the estimate: a large sample generally yields a more precise estimate than a small one. We will provide our estimates in the form of confidence intervals, which in turn come with a margin of error. If you have ever reviewed polling data for a presidential election, you know that all of those polls report a margin of error, which varies from poll to poll. The margin of error is the amount by which a reported figure may plausibly differ from the true value in the population. If a poll reports that candidate A has the support of 52 percent of the population while candidate B enjoys the support of 48 percent and the margin of error is 3 percentage points, candidate A could have the support of between 49 percent and 55 percent of the population, while support for candidate B could be between 45 percent and 51 percent. These deviations may be small, but they can mean the difference between one candidate and the other prevailing on election day. We will provide estimates for the population mean, the difference of population means, and for proportions.

Learning Objectives: At the conclusion of this chapter, you should be able to:

  1. Construct and interpret confidence interval estimates for the mean and the proportion
  2. Determine the sample size necessary to develop a confidence interval estimate for the mean or proportion
  3. Use confidence interval estimates in solving business problems

Introduction

Up to now we discussed descriptive statistics (Chapters 1 to 3) and we developed some basic understanding of probability theory (Chapter 4). Beginning with this chapter we will talk about inferential statistics, which will be based on the theory we developed in Chapter 4. The principal idea is that we want to estimate parameters of a population based on information gathered from a random sample.

Confidence Intervals for Means

Let us start by estimating the mean of a population, given that we know the mean of a particular sample. In other words, if a sample of size, say, 100 is selected at random from some population, it is easy to compute the mean of that sample. It is equally easy to then use that sample mean as an estimate for the unknown population mean. But just because it is easy to do does not necessarily mean it is right.

For example, suppose we randomly select 100 people from the population of all people in the United States, measure their height, and compute the average height of our sample to be, say, 164.432 cm. If we now wanted to know the average height of everyone in our population (i.e., everyone in the United States), it seems reasonable to estimate that the average height of everyone is 164.432 cm as well. However, if we think about it, it is of course highly unlikely that the average for the entire population comes out exactly the same as the average for our random sample of just 100 people. It is much more likely that our sample mean of 164.432 cm is only approximately equal to the (unknown) population mean. It is the purpose of this section to clarify, using probability theory, what exactly we mean by “approximately equal.”

Example: Consider the following data set for approximately 400 cars, assumed to be collected at random. We would like to make predictions about all automobiles, based on that random sample. In particular, the data set lists miles per gallon, engine size, and weight of 400 cars, but we would like to know the average miles per gallon, engine size, and weight of all cars, based on this sample.

www.betterbusinessdecisions.org/data/cars.xls

It is of course simple to compute the mean of the various variables of the sample, starting with gas mileage. We find that the sample mean gas mileage is $\bar{x} = 23.5$ miles per gallon (mpg), the sample standard deviation s = 7.82 mpg, and the sample size $n = 398$. We need to know how well this sample mean $\bar{x}$ predicts the actual and unknown population mean µ. Our best guess is clearly that the average mpg for all cars is 23.5 mpg (it is after all pretty much the only number we have), but how good is that estimate?

In fact, we know more than just the sample mean: we also know that all sample means are distributed normally, according to the Central Limit Theorem, and that the distribution of all sample means (of which ours is just one) is normal with the same mean as the population mean and a standard deviation of $\sigma/\sqrt{n}$, where σ is the population standard deviation, which we estimate by the sample standard deviation s.

Let us say we want to estimate an (unknown) population mean so that we are, say, 95 percent certain that the estimate is correct (or 90 percent, or 99 percent, or any other predetermined notion of certainty we might want to have). In other words, we need to compute a lower limit a and an upper limit b in such a way as to be 95 percent sure that our (unknown) population mean is between a and b. That interval (a, b) is known as a 95 percent confidence interval for the unknown mean. Using standard probability notation we can rephrase this: we want to find a and b so that $P(a < \mu < b) = 0.95$ (see Figure 5.1).

Using symmetry and focusing on the part of the distribution that we can compute with Excel, this is equivalent to finding a value of a such that P(x < a) = 0.025, since $P(x < a) + P(x > b) = 1 - 0.95 = 0.05$ and, by symmetry, $P(x < a) = P(x > b)$ (see Figure 5.2).

But that is an inverse normal problem as described in Chapter 4, and the Excel function NORMINV will come to the rescue. The standard deviation of the sample means is estimated by $s/\sqrt{n} = 7.82/\sqrt{398} \approx 0.3920$, so a = NORMINV(0.025, 23.5, 0.3920) = 22.7317.

Similarly, b must satisfy $P(x < b) = 0.975$, which implies that b = NORMINV(0.975, 23.5, 0.3920) = 24.2683. Thus, we conclude that the unknown population mean is between 22.7317 and 24.2683 with 95 percent certainty. Alternatively, we say that the 95 percent confidence interval for the average gas mileage of all cars is (22.7317, 24.2683).

fig5.1.jpg

Figure 5.1 P(a < µ < b) = 0.95 and related probabilities

fig5.2.jpg

Figure 5.2 P(x < a) = 0.025 and related probabilities

Example: Again using our data on cars, as earlier, find the 95 percent confidence interval for the engine size of all cars.

We compute the sample mean to be 193 in³ with a standard deviation of 104.55 in³ and the sample size n = 398. To find a 95 percent confidence interval, we need to find a such that P(x < a) = 0.025 and b such that P(x > b) = 0.025, assuming that x is $N(193,\ 104.55/\sqrt{398})$. Thus, since $104.55/\sqrt{398} \approx 5.2406$:

  • a = NORMINV(0.025,193,5.2406) = 182.7286
  • b = NORMINV(0.975,193,5.2406) = 203.2714

Therefore, our 95 percent confidence interval is (182.73, 203.27) or, in other words, we are 95 percent certain that the unknown population mean (in this case the average engine size of all cars) is between 182.73 and 203.27, with the sample mean 193 right in the middle.
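For readers working outside Excel, the same inverse-normal computation can be sketched in Python. This is only an illustration: it assumes that `statistics.NormalDist.inv_cdf` from the Python standard library plays the role of NORMINV, and the variable names are our own.

```python
from math import sqrt
from statistics import NormalDist

# Engine-size sample from the example above (summary statistics only)
sample_mean = 193.0   # cubic inches
sample_sd = 104.55    # cubic inches
n = 398

# Standard deviation of the sample means (Central Limit Theorem)
std_error = sample_sd / sqrt(n)

# Distribution of the sample means, centered at our best guess
sampling_dist = NormalDist(mu=sample_mean, sigma=std_error)

# inv_cdf(p) is the counterpart of NORMINV(p, mean, standard deviation)
a = sampling_dist.inv_cdf(0.025)
b = sampling_dist.inv_cdf(0.975)

print(f"standard error = {std_error:.4f}")
print(f"95 percent confidence interval: ({a:.2f}, {b:.2f})")
```

The two cut-off probabilities 0.025 and 0.975 leave 2.5 percent in each tail, which is what makes the interval a 95 percent one.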

Note that the preceding discussion is based on the Central Limit Theorem. It lets us use a normal distribution for the sample means even if the underlying distribution of the original data is unknown. However, that approximation works best for large sample sizes n, so for small n we need to employ a slightly different procedure. Thus, we will summarize our procedure for finding a confidence interval for a population mean µ, depending on the sample size n.

Large Sample Size

Here we will summarize the procedure to find a confidence interval for a large sample size n. In fact, we will consider n to be large if n > 30.

Confidence Interval for Mean, n > 30 (with Excel): Suppose we have selected a random sample with size n > 30 and with sample mean m and sample standard deviation s. Then the confidence interval for the unknown population mean µ at confidence level p, with p written as a decimal (for example, p = 0.95), is the interval (a, b), where

  • a = NORMINV((1 − p)/2, m, $s/\sqrt{n}$)
  • b = NORMINV((1 + p)/2, m, $s/\sqrt{n}$)

Note: the sample mean m will always be half way between a and b, that is, (a + b)/2 = m. You could use this relationship to quickly check your calculations.

Example: We want to know the average weight of cheese that comes in 8 oz packages. We select a random sample of 100 packages, weigh them, and find that the sample mean is 7.8 oz with a standard deviation of 0.8 oz. Estimate the average weight of all 8 oz packages, using a 95 percent confidence interval.

We have n = 100 > 30, so our preceding procedure for a large sample size is valid. Thus, we find:

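  • a = NORMINV((1 − 0.95)/2, 7.8, $0.8/\sqrt{100}$) = NORMINV(0.025, 7.8, 0.08) = 7.6432
  • b = NORMINV((1 + 0.95)/2, 7.8, $0.8/\sqrt{100}$) = NORMINV(0.975, 7.8, 0.08) = 7.9568

Note that (7.6432 + 7.9568)/2 = 7.8 so that our calculations check out.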

We have found that the average weight of all 8 oz packages of cheese is between 7.6432 oz and 7.9568 oz and we are 95 percent certain of this estimate.
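The large-sample procedure can also be wrapped in a small reusable function. The following is a minimal sketch in Python, assuming the standard library's `NormalDist.inv_cdf` as a stand-in for NORMINV; the function name `mean_confidence_interval` is hypothetical and chosen only for this illustration.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(sample_mean, sample_sd, n, level):
    """Large-sample (n > 30) confidence interval for a population mean.

    `level` is the confidence level as a decimal, for example 0.95.
    """
    if n <= 30:
        raise ValueError("this procedure assumes n > 30")
    dist = NormalDist(mu=sample_mean, sigma=sample_sd / sqrt(n))
    a = dist.inv_cdf((1 - level) / 2)   # lower limit
    b = dist.inv_cdf((1 + level) / 2)   # upper limit
    return a, b

# Cheese example: 100 packages, mean 7.8 oz, standard deviation 0.8 oz
a, b = mean_confidence_interval(7.8, 0.8, 100, 0.95)
print(f"({a:.4f}, {b:.4f})")

# Quick check: the sample mean sits halfway between a and b
midpoint = (a + b) / 2
print(f"midpoint = {midpoint:.4f}")
```

Calling the function with a different `level` gives other intervals for the same sample without repeating the arithmetic.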

Incidentally, this would mean that it is likely that these 8 oz packages of cheese are mislabeled! They should of course contain 8 oz of cheese on average, but we found that the (unknown) average is less than 7.9568 with 95 percent certainty. In other words, there is a small chance that the average weight of all cheeses is indeed 8 oz, but that chance is less than 5 percent. Thus, we are relatively certain that the cheese manufacturer is indeed mislabeling their cheese. This kind of argument, in fact, will be formalized in the next chapter on hypothesis testing. For now we will be content with obtaining estimates together with their certainty.

By the way, do you think a 99 percent confidence interval would be wider or narrower than the 95 percent interval (7.6432, 7.9568)? You will find the answer later in this section but you should try to think about it already.

The preceding summary works well but it relies on Excel. We can, however, use the z-score introduced in “The Normal Distribution” section to compute some confidence intervals without resorting to Excel at all.

According to our preceding summary, a standard normal distribution has a 95 percent confidence interval (a, b) where a = NORMINV(0.025, 0, 1) = −1.96 and b = NORMINV(0.975, 0, 1) = 1.96. Thus, if we start with a normal distribution with mean µ and standard deviation σ, then (x − µ)/σ is standard normal and has a 95 percent confidence interval (−1.96, 1.96). In other words, the confidence interval goes from (x − µ)/σ = −1.96 to (x − µ)/σ = 1.96, or equivalently, solving for x, from x = µ − 1.96 σ to x = µ + 1.96 σ. Thus, the 95 percent confidence interval for x is (µ − 1.96 σ, µ + 1.96 σ).

Now we can finish up this discussion by resorting to the Central Limit Theorem: if x has a distribution with mean µ and standard deviation σ, then the sample means are normal with mean µ and standard deviation $\sigma/\sqrt{n}$. But then the sample means have a 95 percent confidence interval from $\mu - 1.96\,\sigma/\sqrt{n}$ to $\mu + 1.96\,\sigma/\sqrt{n}$. Putting everything together, we can summarize an alternative method of computing confidence intervals, at least for the ones most commonly used.

Confidence Interval for Mean, N > 30 (Alternate Version): Suppose we have selected a random sample with size N > 30 and with sample mean m and sample standard deviation s. Then:

  • A 90 percent confidence interval goes from $m - 1.645\, s/\sqrt{N}$ to $m + 1.645\, s/\sqrt{N}$.
  • A 95 percent confidence interval goes from $m - 1.96\, s/\sqrt{N}$ to $m + 1.96\, s/\sqrt{N}$.
  • A 99 percent confidence interval goes from $m - 2.58\, s/\sqrt{N}$ to $m + 2.58\, s/\sqrt{N}$.

Each of the intervals is centered at the mean m. The term $s/\sqrt{N}$ is known as the standard error.

Note: As you could have (hopefully) guessed from our preceding discussion, the 90 percent and 99 percent intervals use the constants NORMINV(0.95, 0, 1) = 1.645 and NORMINV(0.995, 0, 1) = 2.58, respectively. The advantage of this alternate version is that you do not need Excel to compute any of these constants; a simple calculator will do just fine (unless you want an interval other than 90 percent, 95 percent, or 99 percent).

Example: In an earlier example we analyzed the gasoline efficiency of cars. Recall that we looked at a sample of size N = 398 with m = 23.5 and s = 7.82. Find a 90 percent, 95 percent, and 99 percent confidence interval for the average mpg of all cars.

Using our alternate version the problem is pretty easy:

  • 90 percent confidence: from $23.5 - 1.645 \cdot 7.82/\sqrt{398}$ to $23.5 + 1.645 \cdot 7.82/\sqrt{398}$, or in interval notation (22.86, 24.14)
  • 95 percent confidence: from $23.5 - 1.96 \cdot 7.82/\sqrt{398}$ to $23.5 + 1.96 \cdot 7.82/\sqrt{398}$, or (22.73, 24.27)
  • 99 percent confidence: from $23.5 - 2.58 \cdot 7.82/\sqrt{398}$ to $23.5 + 2.58 \cdot 7.82/\sqrt{398}$, or (22.49, 24.51)

Thus, we are 90 percent certain that the average mpg for all cars is between 22.86 and 24.14; we are 95 percent certain that it is between 22.73 and 24.27, and we are 99 percent certain it is between 22.49 and 24.51.
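The multipliers themselves can be recovered from the standard normal distribution, so levels other than 90, 95, and 99 percent are no harder. A short Python sketch of the alternate version follows, assuming `NormalDist().inv_cdf` as the inverse standard normal; it uses the unrounded multipliers, so the last digits may differ slightly from a hand calculation with 1.645, 1.96, and 2.58.

```python
from math import sqrt
from statistics import NormalDist

# Gas mileage sample from the example above
sample_mean, sample_sd, n = 23.5, 7.82, 398
std_error = sample_sd / sqrt(n)

standard_normal = NormalDist()  # mean 0, standard deviation 1

intervals = {}
for level in (0.90, 0.95, 0.99):
    # Multiplier z that leaves (1 - level)/2 in each tail
    z = standard_normal.inv_cdf((1 + level) / 2)
    margin = z * std_error
    intervals[level] = (sample_mean - margin, sample_mean + margin)
    low, high = intervals[level]
    print(f"{level:.0%}: z = {z:.3f}, from {low:.2f} to {high:.2f}")
```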

Usually you only need to compute one of these intervals, depending on your preferred level of certainty. There is a price to pay, though: the more certain you want to be, the bigger your interval will turn out. Eventually, a 100 percent confidence interval would be (−∞, +∞): you are clearly 100 percent certain that the unknown population mean is in that interval (after all, any number is) but that answer, while correct, does not help at all.

Small Sample Size

As we mentioned, the derivation of the formulas for confidence intervals uses the Central Limit Theorem, which works better for larger values of N (the sample size). For N > 30 we declare the sample size big enough but for smaller N we need to be more careful. In this case the method is based on the Student’s t-distribution, but is otherwise similar to before. Since all t-distributions have a mean of zero, we cannot generalize our first procedure to find confidence intervals. However, the alternate method works fine.

Confidence Interval for Mean, N ≤ 30 (Alternate Version): Suppose we have selected a random sample with size N ≤ 30 and with sample mean m and sample standard deviation s. Compute the number tp = TINV(1 − p, N − 1), where TINV is the inverse of the t-distribution with degrees of freedom df = N − 1 and p is the confidence level of the interval to compute, written as a decimal. Then:

  • The confidence interval at level p goes from $m - t_p\, s/\sqrt{N}$ to $m + t_p\, s/\sqrt{N}$.

This is similar to the alternate method for large N, but the multiplier tp depends not only on which percentage interval we want to find but also on the sample size N. It turns out, however, that the multiplier tp is always greater than the corresponding multiplier zp for large N (e.g., zp = 1.96 for a 95 percent confidence interval) so that a confidence interval using the small sample size procedure is always bigger than the one using the large sample size procedure.

Example: Suppose you want to measure the efficacy of a new blood pressure drug. Since trials involving human beings are expensive, you test the new drug on only 10 randomly selected patients. You find that the average decrease of blood pressure in the group tested was 15 mmHg with a standard deviation of 2 mmHg. Since we are dealing with medication given to humans, we want to be very sure about our results, so we want to know a 99 percent confidence interval.

The sample size is N = 10, which is considered small, so that the “small size” procedure applies. We need to find t0.99 using N − 1 = 9 as degrees of freedom: t0.99 = TINV(0.01, 9) = 3.2498. Thus, our 99 percent confidence interval goes from $15 - 3.2498 \cdot 2/\sqrt{10} = 12.94$ to $15 + 3.2498 \cdot 2/\sqrt{10} = 17.06$, or in interval notation (12.94, 17.06).
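Python's standard library has no built-in inverse t-distribution, so the sketch below approximates one numerically. It is an illustration only: the helper `tinv` is our own stand-in for Excel's two-tailed TINV, built from Simpson's rule and bisection, and is not a library function. Where SciPy is available, a dedicated routine would be the more usual choice.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """P(T <= t): 0.5 plus the Simpson's-rule integral of the density on [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def tinv(alpha, df):
    """Hypothetical stand-in for TINV(alpha, df): t with P(|T| > t) = alpha."""
    target = 1 - alpha / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:   # widen the bracket until it contains t
        hi *= 2
    for _ in range(60):             # bisection on the cumulative probability
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Blood pressure example: N = 10, mean 15 mmHg, standard deviation 2 mmHg
n, sample_mean, sample_sd = 10, 15.0, 2.0
t_mult = tinv(0.01, n - 1)                  # 99 percent confidence
margin = t_mult * sample_sd / sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin
print(f"multiplier = {t_mult:.4f}, interval = ({lower:.2f}, {upper:.2f})")
```

The interval itself is just the mean plus or minus the multiplier times the standard error; only the multiplier requires the numerical work.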

Using Excel to Compute Confidence Intervals

We have used Excel to find various confidence intervals using the mean m and standard deviation s of a data set. This works fine even if we do not have access to the entire data, as long as we know N, m, and s. If we do know the complete data set, Excel offers yet another method to compute confidence intervals and a host of other parameters all at once and for several variables simultaneously.

Example: Consider the following data set for approximately 400 cars that we analyzed before. Find 95 percent confidence intervals for the average miles per gallon, engine size, and weight of cars using this data.

www.betterbusinessdecisions.org/data/cars.xls

fig5.3.jpg

Figure 5.3 Descriptive statistics procedure options

We will use the “Descriptive Statistics” tool of Excel’s Analysis ToolPak. Load the data set specified, then choose “Data Analysis …” from the “Data” ribbon, and select “Descriptive Statistics” from the choice of available procedures (see Figure 5.3).

Select as input range the first few columns, including “Miles per Gallon,” “Engine Size,” “Horse Powers,” and “Weight in Pounds” and make sure to check the “Labels in First Row” box, the “Summary Statistics” box, and the “Confidence Level for Mean,” as shown in Figure 5.3. We also need to specify the level of confidence for the “Confidence Level for Mean”: enter 90 percent and then click “OK.”

You should see a number of parameters for our data set, including the familiar mean, median, variance, and standard deviation as well as our new descriptors standard error and confidence level (90 percent)—see Figure 5.4. To find the actual confidence intervals in the form we are used to, we need to add/subtract the Confidence Level to/from the Mean. In our case we have:

  • 90 percent confidence interval of average miles per gallon: from 23.4957 − 0.6468 = 22.8489 to 23.4957 + 0.6468 = 24.1425
  • 90 percent confidence interval of average engine size: from 192.8577 − 8.6568 = 184.2009 to 192.8577 + 8.6568 = 201.5145
  • 90 percent confidence interval of average weight (in pounds): from 2960.9799 − 70.3841 = 2890.5958 to 2960.9799 + 70.3841 = 3031.3640

fig5.4.jpg

Figure 5.4 Output of the descriptive statistics procedure

Comparing Confidence Intervals

To finish up our discussion of estimating the mean we want to investigate the relation between the various intervals we could compute.

Example: Suppose we compute, for the same sample data, both a 90 percent and a 99 percent confidence interval. Which one is larger?

To answer this question, let us first look at an example: we compute both a 90 percent and a 99 percent confidence interval for the “Horse Power” in the preceding data set about cars, using Excel. The procedure of computing the numbers is similar to the one earlier; here are the answers:

  • The sample mean for the “Horse Power” is 104.27.
  • The 90 percent confidence level results in 3.19, so that the 90 percent confidence interval goes from 104.27 − 3.19 to 104.27 + 3.19, or from 101.08 to 107.46.
  • The 99 percent confidence level results in 5.01, so that the 99 percent confidence interval goes from 104.27 − 5.01 to 104.27 + 5.01, or from 99.26 to 109.28.

Since the 99 percent interval (99.26, 109.28) includes the 90 percent interval (101.08, 107.46), we conjecture that in general a 99 percent confidence interval is always larger than a 90 percent confidence interval (see Figure 5.5).

That makes sense: If we want to be more certain that we have captured the true (unknown) population mean correctly, we need to make our interval larger; the larger the interval, the better our chance of capturing the unknown population mean. Hence, a 99 percent confidence interval must be wider than a 90 percent confidence interval.

Another way to argue is that every 99 percent interval is automatically also a 90 percent interval, because if we are 99 percent certain to include the mean, we are in particular 90 percent certain to include it. The other way around is not true. Thus, the 99 percent interval must contain the 90 percent interval and must therefore be wider.

Last, we want to compare the two methods for computing confidence intervals: the one based on a normal distribution (for N > 30) and the other one based on the t-distribution (for N ≤ 30). First, we will consider an example.

Example: A crime scene investigator finds an unknown liquid at a crime scene. To help identify it, she decides to determine its boiling point. Heating up the entire liquid would destroy the evidence, so instead she takes nine small samples and determines their boiling points. It turns out that the sample mean is 86.5°C with a sample standard deviation of 0.6°C. Find a 95 percent confidence interval for the boiling point of the substance, using the small and the large sample size method (even though only one of the methods is appropriate, technically speaking).

fig5.5.jpg

Figure 5.5 90 percent (top) versus 99 percent (bottom) confidence interval

fig5.6.jpg

Figure 5.6 Small sample size (bottom) versus large sample size (top) confidence interval

Small sample method: We need to find the multiplier t0.95 = TINV(0.05, 8) = 2.3060 and the standard error $s/\sqrt{N} = 0.6/\sqrt{9} = 0.2$. Then the 95 percent confidence interval goes from 86.5 − 2.306 · 0.2 = 86.0388 to 86.5 + 2.306 · 0.2 = 86.9612 (see Figure 5.6).

Large sample method: The multiplier for a 95 percent confidence interval is always 1.96 while the standard error is the same as before. Thus, the interval goes from 86.5 − 1.96 · 0.2 = 86.108 to 86.5 + 1.96 · 0.2 = 86.892 (see Figure 5.6).

Our conjecture is, therefore, that the small sample size method yields a wider interval than the large sample size method. This is indeed true because the t-distribution has “thicker” tails than the standard normal distribution; hence the cut-off value x0 where $P(x > x_0) = 0.025$ will be larger for the t-distribution than for the standard normal one.

Therefore using the t-distribution will yield a more conservative, that is, wider, confidence interval for any sample size; we could do away with the large sample size procedure entirely and not be wrong. However, doing so might make our interval unnecessarily wide and our estimate for the population mean would not be as sharp as it could be.
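The comparison for the boiling point example can be reproduced in a few lines of Python. This sketch takes both multipliers as given constants quoted from the text (2.3060 for the t-distribution with 8 degrees of freedom and 1.96 for the normal distribution) rather than computing them; the variable names are our own.

```python
from math import sqrt

# Boiling point example: nine samples, mean 86.5 C, standard deviation 0.6 C
sample_mean, sample_sd, n = 86.5, 0.6, 9
std_error = sample_sd / sqrt(n)   # 0.6 / 3 = 0.2

# Multipliers for a 95 percent interval, as quoted in the text
t_mult = 2.3060   # TINV(0.05, 8), small sample method
z_mult = 1.96     # large sample method

small = (sample_mean - t_mult * std_error, sample_mean + t_mult * std_error)
large = (sample_mean - z_mult * std_error, sample_mean + z_mult * std_error)

small_width = small[1] - small[0]
large_width = large[1] - large[0]

print(f"small sample method: ({small[0]:.4f}, {small[1]:.4f}), width {small_width:.4f}")
print(f"large sample method: ({large[0]:.4f}, {large[1]:.4f}), width {large_width:.4f}")
```

Because both intervals share the same center and standard error, the difference in width comes entirely from the multiplier.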

Confidence Intervals for Difference of Means

Now we will discuss confidence intervals for the difference of two means. This procedure applies if we have two samples whose means we want to compare. There are many situations where this is useful; perhaps the most important one relates to medical trials, where people are frequently divided at random into two groups: one group called treatment that receives a new medical treatment and a second group called control that will receive a placebo instead. We can then determine the efficacy of the treatment by comparing the means of treatment and control.

We distinguish between two cases, that of equal variance and of unequal variances of the two populations.

Equal Variances

Suppose we have two populations with (unknown) means µ1 and µ2 and standard deviations σ1 and σ2, respectively. Moreover, assume that the two variances are approximately equal. If we select independent samples of sizes n1 and n2, respectively, from these populations, with sample means $\bar{x}_1$ and $\bar{x}_2$ and sample standard deviations $s_1$ and $s_2$, then the distribution of the difference of the sample means $\bar{x}_2 - \bar{x}_1$ has joint mean $\mu_2 - \mu_1$ and pooled standard deviation $S_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$ as well as joint standard error $SE = S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$. The confidence interval for $\mu_2 - \mu_1$ goes from $(\bar{x}_2 - \bar{x}_1) - t_p \cdot SE$ to $(\bar{x}_2 - \bar{x}_1) + t_p \cdot SE$, where SE is the joint standard error and the multiplier is $t_p = \mathrm{TINV}(1 - p, df)$. Here TINV is the inverse of the t-distribution with degrees of freedom $df = n_1 + n_2 - 2$ and p is the confidence level of the interval to compute.

These formulas apply if the variances are approximately equal. How can you tell? As a guideline, compute the ratio of the sample variances $s_1^2/s_2^2$. If that ratio is between 0.5 and 2.0, we will assume that the population variances are approximately the same and the formulas listed earlier apply. In fact, there are quite a number of formulas here, but as the next example shows, if you compute them one at a time, things are not as bad as they look.

Example: We want to determine if there is a difference in the spending habits between men and women at soccer games. A randomly selected sample of 45 men spent an average of $21 with a standard deviation of $4. A randomly selected sample of 57 women, on the other hand, spent an average of $19 with a standard deviation of $3. Find a 95 percent confidence interval for the difference of means and interpret the answer.

First we check the ratio of the sample variances to test the assumption of equal variances:

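$s_1^2/s_2^2 = 4^2/3^2 = 16/9 \approx 1.78$.

The ratio is between 0.5 and 2, so that according to our guidelines our formulas apply. Next, we compute the pooled standard deviation Sp:

$S_p = \sqrt{\frac{(45 - 1) \cdot 4^2 + (57 - 1) \cdot 3^2}{45 + 57 - 2}} = \sqrt{\frac{704 + 504}{100}} = \sqrt{12.08} \approx 3.4756$.

Note that Sp is between the two original standard deviations. Next, we go for the joint standard error SE:

$SE = S_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = 3.4756\sqrt{\frac{1}{45} + \frac{1}{57}} \approx 0.6931$.

Finally we need to find the multiplier tp = TINV(1 − p, df), where $df = n_1 + n_2 - 2 = 45 + 57 - 2 = 100$:

$t_p = \mathrm{TINV}(1 - 0.95, 100) = \mathrm{TINV}(0.05, 100) = 1.984$.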

But now we have all ingredients in place to compute the given confidence interval. It goes from (19 − 21) − 1.984 · 0.6931 = −3.3751 to (19 − 21) + 1.984 · 0.6931 = −0.62489.

Thus, our 95 percent confidence interval for the difference of means is (−3.3751, −0.62489). In particular, we can say with 95 percent certainty that the difference of population means µ2 − µ1 is negative, which implies that µ2 < µ1. In other words, we are 95 percent sure that women spend less money, on average, than men at soccer games. This seems obvious, considering the sample means for men and women, but just because one sample mean is less than another does not imply that the first population mean is necessarily less than the other population mean. The point of the preceding example is that in this case we can infer from the sample about the population, and we can even specify the degree of certainty.
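The chain of computations in this example (variance ratio, pooled standard deviation, joint standard error, interval) can be sketched in Python as follows. The sketch assumes the usual pooled-variance formulas and takes the multiplier 1.984 = TINV(0.05, 100) from the text as a constant; the variable names are our own.

```python
from math import sqrt

# Sample 1: men, sample 2: women (summary statistics from the example)
n1, mean1, sd1 = 45, 21.0, 4.0
n2, mean2, sd2 = 57, 19.0, 3.0

# Guideline check: ratio of sample variances should lie between 0.5 and 2
variance_ratio = sd1**2 / sd2**2

# Pooled standard deviation and joint standard error
pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
std_error = pooled_sd * sqrt(1 / n1 + 1 / n2)

df = n1 + n2 - 2
t_mult = 1.984            # TINV(0.05, 100), quoted from the text

diff = mean2 - mean1      # estimate of the difference of population means
lower = diff - t_mult * std_error
upper = diff + t_mult * std_error

print(f"variance ratio = {variance_ratio:.2f}, df = {df}")
print(f"pooled sd = {pooled_sd:.4f}, standard error = {std_error:.4f}")
print(f"95 percent interval for the difference: ({lower:.4f}, {upper:.4f})")
```

Since both limits are negative, zero lies outside the interval, which is what supports the conclusion about spending habits.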

Note that tp = 1.9840 is close to 1.96, the multiplier we used for large sample size confidence interval for a single mean. This makes sense, since for large sample sizes the t-distribution is close to the normal distribution. However, for simplicity we will stick to using the t-distribution for difference of means regardless of the sample sizes.

Unequal Variances

The procedure in the case of unequal variances is similar except that the formula to compute the degrees of freedom for the t-distribution is different (and much more complicated).

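If the variances for the two populations are different, the confidence interval for µ2 − µ1 as before goes from $(\bar{x}_2 - \bar{x}_1) - t_p \cdot SE$ to $(\bar{x}_2 - \bar{x}_1) + t_p \cdot SE$, where $SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$ and $t_p = \mathrm{TINV}(1 - p, df)$. Here TINV is the inverse of the t-distribution with degrees of freedom $df = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$ (use the nearest integer) and p is the confidence level of the interval to compute.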

This procedure is similar to the one before but the degree of freedom for the t-distribution is a lot more complicated.

Example: A female student did poorly in a class and suspects the teacher is biased against women. She complains to the department chair who investigates the situation. The chair selects a random sample of 21 women and 9 men who have previously taken a class with the teacher. It turns out that the average grade for the men is 3.4 with standard deviation 0.9 and for the women it is 2.9 with a standard deviation of 1.5. What can you conclude?

First, we need to decide on a confidence level. We choose a 90 percent confidence interval (which means that we are somewhat favorably disposed toward the student). Next, we check the ratio of the two sample variances to determine which of our procedures applies, taking the women as sample 1 and the men as sample 2:

$s_1^2/s_2^2 = 1.5^2/0.9^2 = 2.25/0.81 \approx 2.78$.

Since that ratio is not between 0.5 and 2, we need to assume unequal variances. We compute the standard error SE:

$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{1.5^2}{21} + \frac{0.9^2}{9}} \approx 0.4440$.

The multiplier is $t_p = \mathrm{TINV}(1 - 0.90, df)$, but the degrees of freedom are hard to compute:

$df = \frac{\left(1.5^2/21 + 0.9^2/9\right)^2}{\frac{(1.5^2/21)^2}{20} + \frac{(0.9^2/9)^2}{8}} \approx 24.498$

The nearest integer to that value is 24, so we use df = 24. Thus

$t_p = \mathrm{TINV}(0.10, 24) = 1.7109$.

Now we have all the ingredients so that the 90 percent confidence interval goes from

(3.4 − 2.9) − 0.4440 · 1.7109 = −0.2596 to (3.4 − 2.9) + 0.4440 · 1.7109 = 1.2596.

This means, in particular, that our 90 percent confidence interval includes 0, and if the difference of means was indeed 0, there would be no difference in the scores of men and women. Thus, it is perfectly possible, based on our calculations, that on average the instructor in question shows no bias toward men or women. Therefore, the department chair will dismiss the accusation. Note that this does not mean that there truly is no bias; it is just that based on the available data we cannot conclude that there is.
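The degrees-of-freedom computation is the error-prone part of this example, so a short Python sketch may help. It assumes the standard unequal-variance (Welch) formula for the degrees of freedom, which reproduces the value df = 24 used above, and it takes the multiplier 1.7109 = TINV(0.10, 24) from the text as a constant; the variable names are our own.

```python
from math import sqrt

# Sample 1: women, sample 2: men (summary statistics from the example)
n1, mean1, sd1 = 21, 2.9, 1.5
n2, mean2, sd2 = 9, 3.4, 0.9

# Per-sample contributions to the variance of the difference of means
v1 = sd1**2 / n1
v2 = sd2**2 / n2

std_error = sqrt(v1 + v2)

# Degrees of freedom for unequal variances, rounded to the nearest integer
df_exact = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
df = round(df_exact)

t_mult = 1.7109           # TINV(0.10, 24), quoted from the text

diff = mean2 - mean1
lower = diff - t_mult * std_error
upper = diff + t_mult * std_error
contains_zero = lower < 0 < upper

print(f"standard error = {std_error:.4f}, df = {df_exact:.3f} -> {df}")
print(f"90 percent interval: ({lower:.4f}, {upper:.4f}), contains 0: {contains_zero}")
```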

The question of whether there is or is not a difference in the average score is actually better suited for a “test of hypothesis,” which we will tackle in the next chapter. But first we want to apply the concept of estimation to proportion.

Note that another difference of means situation applies in the case of paired differences. This situation arises if we take two measurements from each member of a sample. For example, we might be interested in figuring out whether consumption of wine or beer has a different impact on a person’s concentration. We could divide our participants at random into two groups, give wine to one and beer to the other, and measure the level of concentration for each group. This would be a difference of means situation, just as we covered already. Alternatively, we could give every participant wine, measure their level of concentration, then (after waiting an appropriate time) give everyone beer, and again measure their concentration. This is advantageous, for example, if the total available sample size is small. We will not develop this situation here but refer the reader, for example, to www.real-statistics.com/students-t-distribution/paired-sample-t-test/ for a nice discussion of this situation.

Estimating Proportions

As we mentioned in the previous chapter, many variables are categorical not numerical, so that the idea of estimating the mean does not even apply (since there is no mean). In special cases, however, we can resort to the idea of proportions and we could try to estimate them. This section will explain how that works.

Example: The General Social Survey (GSS) from 2008 includes data provided by a random sample of adults in the United States. You will find, among many other variables, answers to the question: Did humans develop from animals? Based on that sample data, provide an estimate of the proportion of people in the entire United States who think that way.

www.betterbusinessdecisions.org/data/gss2008-short-2.xls

Looking at the file we find the data for that question in column AH. The data is categorical, so the first thing we need to do is count all the “true” and all “false” values. As we learned in Chapter 2, Excel’s Pivot tool will do that for us. Here are the steps, in case you forgot:

  1. Click on the “Insert” ribbon and select the “Pivot” menu choice.
  2. Make sure the entire table is selected; then click “OK.”
  3. Drag the field “SCI: HUMANS DEVELOPED FROM ANIMALS” onto the Row Fields; then drag the same field also onto the Value Fields.

This should finish the pivot table (see Figure 5.7) and show that 651 out of 1,316 answered “false” while 665 out of 1,316 answered “true.”

Incidentally, the total number of people participating in this survey was 2,023 but only 1,316 answered this particular question. Thus, as far as analyzing this question is concerned, the sample size is n = 1,316. To phrase this as a proportion problem, we need to define success (and failure): We call it a success if someone believes that humans developed from animals, since then the probability of success p is exactly what we want to know. What we do know is the ratio of success for our sample, which is $\hat{p} = 665/1316 \approx 0.505$. To finish the problem, we need to know the standard error SE and the multiplier m, as is usual for all confidence intervals.

Definition: Suppose x is a binomial random variable with probability of success p. If we take a random sample of size n, we can compute the sample proportion of success as $\hat{p} = x/n$, where x counts the number of successes. Then the standard error SE of the sample is $SE = \sqrt{\hat{p}(1 - \hat{p})/n}$ and we can compute a confidence interval for p from $\hat{p} - z_p \cdot SE$ to $\hat{p} + z_p \cdot SE$, where zp = 1.645 for a 90 percent confidence interval, zp = 1.96 for a 95 percent confidence interval, and zp = 2.58 for a 99 percent confidence interval. This procedure is valid as long as there are at least 10 successes and 10 failures.

Figure 5.7 Counts for the question: did humans develop from animals?

Now we can complete our example. Note that there were over 600 successes and over 600 failures, so the assumptions of our procedure are satisfied. We know that p̂ = 665/1,316 ≈ 0.5053, so that the standard error is SE = √(0.5053 × 0.4947/1,316) ≈ 0.0138. Finally, suppose we want to compute a 95 percent confidence interval. It goes from p̂ − 1.96 × SE = 0.5053 − 1.96 × 0.0138 = 0.4783 to p̂ + 1.96 × SE = 0.5053 + 1.96 × 0.0138 = 0.5323.
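The calculation for this example can also be sketched in Python. The function name `proportion_ci` is our own choice for illustration, and the multiplier is taken from the standard normal distribution through the standard library's `NormalDist` rather than from a table.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion.

    Illustrative helper (hypothetical name), following the definition above.
    """
    failures = n - successes
    # The procedure is only valid with at least 10 successes and 10 failures.
    if successes < 10 or failures < 10:
        raise ValueError("need at least 10 successes and 10 failures")
    p_hat = successes / n                      # sample proportion
    se = sqrt(p_hat * (1 - p_hat) / n)         # standard error
    # Two-sided multiplier, about 1.96 for a 95 percent interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return p_hat - z * se, p_hat + z * se


low, high = proportion_ci(665, 1316)
print(f"{low:.4f} to {high:.4f}")
```

With the survey counts, this gives an interval of roughly 0.478 to 0.532, matching the hand calculation up to rounding.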

Summary

Instead of providing a single point estimate for an unknown population parameter, we provide an interval, called a confidence interval. The interval is based on a random sample of size n. Three particular confidence intervals are most common: a 90 percent, a 95 percent, or a 99 percent confidence interval (other intervals are possible). Each interval has the form:

from P − m × SE to P + m × SE, or P ± m × SE,

where P is a point estimator for the parameter being estimated, SE is the standard error, and m is a multiplier based on the standard normal or the t-distribution. The quantity m × SE is known as the margin of error.

  • Population mean µ, large sample: point estimator x̄, standard error SE = s/√n, multiplier m = 1.645, 1.96, or 2.576
  • Population mean µ, small sample: point estimator x̄, standard error SE = s/√n, multiplier m = TINV(0.10, df), TINV(0.05, df), or TINV(0.01, df) with df = n − 1 (Excel’s TINV takes the two-tailed probability)
  • Difference of means µ2 − µ1, equal variances (if 97559.jpg): point estimator x̄2 − x̄1, pooled standard deviation sp = √(((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2)), standard error SE = sp × √(1/n1 + 1/n2), multiplier m = TINV(0.10, df), TINV(0.05, df), or TINV(0.01, df) with df = n1 + n2 − 2
  • Difference of means µ2 − µ1, unequal variances (if 97599.jpg or 97606.jpg): point estimator x̄2 − x̄1, standard error SE = √(s1²/n1 + s2²/n2), multiplier m = TINV(0.10, df), TINV(0.05, df), or TINV(0.01, df) with df = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)] (use the closest integer)
  • Probability of success p: point estimator p̂ = x/n, standard error SE = √(p̂(1 − p̂)/n), multiplier m = 1.645, 1.96, or 2.576

The preceding multipliers refer, in that order, to a 90 percent, 95 percent, and 99 percent interval. The certainty of the estimate is denoted by its confidence level.
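As a cross-check on the summary, the following Python sketch recomputes the normal multipliers and the standard errors and degrees of freedom for the two difference-of-means cases. It assumes the standard pooled and Welch-Satterthwaite formulas; the function names `pooled_se_df` and `welch_se_df` are hypothetical, and the inputs are summary statistics (sample standard deviations and sample sizes) rather than raw data.

```python
from math import sqrt
from statistics import NormalDist

# Two-sided standard normal multipliers for 90, 95, and 99 percent intervals.
multipliers = {level: NormalDist().inv_cdf(0.5 + level / 2)
               for level in (0.90, 0.95, 0.99)}
for level, z in multipliers.items():
    print(f"{level:.0%}: {z:.3f}")


def pooled_se_df(s1, n1, s2, n2):
    """Standard error and df for the equal-variances case (pooled formula)."""
    df = n1 + n2 - 2
    sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)  # pooled std. dev.
    return sp * sqrt(1 / n1 + 1 / n2), df


def welch_se_df(s1, n1, s2, n2):
    """Standard error and df for the unequal-variances case (Welch)."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return sqrt(v1 + v2), round(df)  # df rounded to the closest integer
```

The t-multipliers themselves are not computed here because the Python standard library has no inverse t-distribution; in Excel they come from TINV, as listed above.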

Excel Demonstration

Company P wants to estimate the mean sales volume for all sales representatives within the company. The sales director selects a sample of representatives and wants to use it to construct a confidence interval for the average sales of the entire population of sales representatives in the company. The director decides that a 95 percent confidence level is sufficient and pulls the average sales for a sample of 14 representatives:

Rep 1: 24,000 Rep 2: 22,000 Rep 3: 23,000
Rep 4: 21,000 Rep 5: 22,000 Rep 6: 22,000
Rep 7: 18,000 Rep 8: 19,000 Rep 9: 21,000
Rep 10: 21,000 Rep 11: 18,000 Rep 12: 19,000
Rep 13: 21,000 Rep 14: 17,000

Step 1: Insert the data into Excel with the labels in column A and the numbers in column B.

Step 2: Run the descriptive statistics: Go to “Data | Data Analysis | Descriptive Statistics.” For the “Input Range,” select cells B1 through B14. Check the “Summary Statistics” and the “Confidence Level for Mean” boxes and type 95 into the “Confidence Level” field. Compare your input with Figure 5.8; then click OK.

Figure 5.8 Parameters for the descriptive statistics procedure

After you click OK, the descriptive statistics are provided, which are as follows:

Column1

Mean                        20,571.42857
Standard Error              551.8628365
Median                      21,000
Mode                        21,000
Standard Deviation          2,064.881659
Sample Variance             4,263,736.264
Kurtosis                    −0.82560218
Skewness                    −0.240896459
Range                       7,000
Minimum                     17,000
Maximum                     24,000
Sum                         288,000
Count                       14
Confidence Level (95.0%)    1,192.227175

Step 3: Construct the lower and upper limits of the confidence interval. Recall that the value Excel labels “Confidence Level (95.0%)” is the margin of error, that is, the product of the appropriate multiplier and the standard error.

  • Lower Limit: Mean (20,571.43) − Confidence Level (1,192.23) = 19,379.20
  • Upper Limit: Mean (20,571.43) + Confidence Level (1,192.23) = 21,763.66

You could write the 95 percent confidence interval as: $19,379.20 ≤ µ ≤ $21,763.66. In other words, you can be 95 percent confident that the average sales per representative is somewhere between $19,379.20 and $21,763.66.
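The Excel output can be reproduced outside the spreadsheet as a sanity check. The sketch below uses only the Python standard library; because that library has no inverse t-distribution, the multiplier for 13 degrees of freedom is entered by hand as an assumed constant (about 2.160, the value Excel returns for TINV(0.05, 13)).

```python
from math import sqrt
from statistics import mean, stdev

# Sales figures for the 14 sampled representatives, as listed above.
sales = [24000, 22000, 23000, 21000, 22000, 22000, 18000,
         19000, 21000, 21000, 18000, 19000, 21000, 17000]

n = len(sales)
x_bar = mean(sales)              # point estimate of the population mean
se = stdev(sales) / sqrt(n)      # sample standard deviation over root n

# Assumption: two-tailed t-multiplier for 95 percent with df = n - 1 = 13,
# typed in by hand rather than computed.
t_mult = 2.160369
margin = t_mult * se             # Excel's "Confidence Level (95.0%)" value

lower, upper = x_bar - margin, x_bar + margin
print(f"mean = {x_bar:.2f}, SE = {se:.2f}, margin = {margin:.2f}")
print(f"95 percent interval: {lower:.2f} to {upper:.2f}")
```

The mean, standard error, and margin of error should agree with the Excel table to within rounding of the hand-entered multiplier.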
