Suppose building specifications in a certain city require that the average breaking strength of residential sewer pipe be more than 2,400 pounds per foot of length (i.e., per linear foot). Each manufacturer who wants to sell pipe in that city must demonstrate that its product meets the specification. Note that we are interested in making an inference about the mean $\mu$ of a population. However, in this example we are less interested in estimating the value of $\mu$ than we are in testing a hypothesis about its value; that is, we want to decide whether the mean breaking strength of the pipe exceeds 2,400 pounds per linear foot.
A statistical hypothesis is a statement about the numerical value of a population parameter.
The method used to reach a decision is based on the rare-event concept explained in earlier chapters. We define two hypotheses: (1) The null hypothesis represents the status quo to the party performing the sampling experiment—the hypothesis that will be assumed to be true unless the data provide convincing evidence that it is false. (2) The alternative, or research, hypothesis is that which will be accepted only if the data provide convincing evidence of its truth. From the point of view of the city conducting the tests, the null hypothesis is that the manufacturer’s pipe does not meet specifications unless the tests provide convincing evidence otherwise. The null and alternative hypotheses are therefore
Null hypothesis (H0): $\mu \le 2{,}400$ (i.e., the manufacturer’s pipe does not meet specifications)
Alternative (research) hypothesis (Ha): $\mu > 2{,}400$ (i.e., the manufacturer’s pipe meets specifications)
The null hypothesis, denoted H0, represents the hypothesis that will be assumed to be true unless the data provide convincing evidence that it is false. This usually represents the “status quo” or some statement about the population parameter that the researcher wants to test.
The alternative (research) hypothesis, denoted Ha, represents the hypothesis that will be accepted only if the data provide convincing evidence of its truth. This usually represents the values of a population parameter that the researcher wants to gather evidence to support.
How can the city decide when enough evidence exists to conclude that the manufacturer’s pipe meets specifications? Because the hypotheses concern the value of the population mean $\mu$, it is reasonable to use the sample mean $\bar{x}$ to make the inference, just as we did when we formed confidence intervals for $\mu$ in Sections 5.2 and 5.3. The city will conclude that the pipe meets specifications only when the sample mean $\bar{x}$ convincingly indicates that the population mean exceeds 2,400 pounds per linear foot.
“Convincing” evidence in favor of the alternative hypothesis will exist when the value of $\bar{x}$ exceeds 2,400 by an amount that cannot be readily attributed to sampling variability. To decide, we compute a test statistic, i.e., a numerical value computed from the sample. Here, the test statistic is the z-value that measures the distance between the value of $\bar{x}$ and the value of $\mu$ specified in the null hypothesis. When the null hypothesis contains more than one value of $\mu$, as in this case (H0: $\mu \le 2{,}400$), we use the value of $\mu$ closest to the values specified in the alternative hypothesis. The idea is that if the hypothesis that $\mu$ equals 2,400 can be rejected in favor of $\mu > 2{,}400$, then $\mu$ less than or equal to 2,400 can certainly be rejected. Thus, the test statistic is
$$z = \frac{\bar{x} - 2{,}400}{\sigma_{\bar{x}}} = \frac{\bar{x} - 2{,}400}{\sigma/\sqrt{n}}$$
Note that a value of $z = 1$ means that $\bar{x}$ is 1 standard deviation above $\mu = 2{,}400$, a value of $z = 1.5$ means that $\bar{x}$ is 1.5 standard deviations above $\mu = 2{,}400$, and so on. How large must z be before the city can be convinced that the null hypothesis can be rejected in favor of the alternative and conclude that the pipe meets specifications?
The test statistic is a sample statistic, computed from information provided in the sample, that the researcher uses to decide between the null and alternative hypotheses.
If you examine Figure 6.1, you will note that the chance of observing $\bar{x}$ more than 1.645 standard deviations above 2,400 is only .05, if in fact the true mean $\mu$ is 2,400. Thus, if the sample mean is more than 1.645 standard deviations above 2,400, either H0 is true and a relatively rare event has occurred (.05 probability), or Ha is true and the population mean exceeds 2,400. Because we would most likely reject the notion that a rare event has occurred, we would reject the null hypothesis and conclude that the alternative hypothesis is true. What is the probability that this procedure will lead us to an incorrect decision?
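Such an incorrect decision (deciding that the null hypothesis is false when in fact it is true) is called a Type I error. As indicated in Figure 6.1, the risk of making a Type I error is denoted by the symbol $\alpha$; that is,
$$\alpha = P(\text{Type I error}) = P(\text{rejecting the null hypothesis when in fact the null hypothesis is true})$$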
A Type I error occurs if the researcher rejects the null hypothesis in favor of the alternative hypothesis when, in fact, H0 is true. The probability of committing a Type I error is denoted by $\alpha$.
In our example,
$$\alpha = P(z > 1.645 \text{ when in fact } \mu = 2{,}400) = .05$$
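The link between the cutoff 1.645 and a Type I error probability of .05 can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the variable names are illustrative and not part of the text.

```python
from statistics import NormalDist

# Standard normal distribution, the sampling distribution of z when mu = 2,400.
z_dist = NormalDist(mu=0.0, sigma=1.0)

alpha = 0.05
# Upper-tail critical value: the z with area alpha to its right.
z_critical = z_dist.inv_cdf(1 - alpha)

# Probability that z lands above 1.645 when the null value of mu is true.
tail_area = 1 - z_dist.cdf(1.645)

print(f"critical z for alpha = {alpha}: {z_critical:.3f}")  # 1.645
print(f"P(z > 1.645) = {tail_area:.4f}")                    # 0.0500
```

Changing `alpha` to .01 or .10 in this sketch gives the cutoffs for the other commonly used significance levels.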
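We now summarize the elements of the test:
H0: $\mu \le 2{,}400$ (pipe does not meet specifications)
Ha: $\mu > 2{,}400$ (pipe meets specifications)
Test statistic: $z = \dfrac{\bar{x} - 2{,}400}{\sigma_{\bar{x}}}$
Rejection region: $z > 1.645$, which corresponds to $\alpha = .05$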
Note that the rejection region refers to the values of the test statistic for which we will reject the null hypothesis.
The rejection region of a statistical test is the set of possible values of the test statistic for which the researcher will reject H0 in favor of Ha.
To illustrate the use of the test, suppose we test 50 sections of sewer pipe and find the mean and standard deviation for these 50 measurements to be
$$\bar{x} = 2{,}460 \text{ pounds per linear foot} \qquad s = 200 \text{ pounds per linear foot}$$
As in the case of estimation, we can use s to approximate $\sigma$ when s is calculated from a large set of sample measurements.
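The test statistic is
$$z = \frac{\bar{x} - 2{,}400}{\sigma_{\bar{x}}} = \frac{\bar{x} - 2{,}400}{\sigma/\sqrt{n}} \approx \frac{\bar{x} - 2{,}400}{s/\sqrt{n}}$$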
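Substituting $\bar{x} = 2{,}460$, $n = 50$, and $s = 200$, we have
$$z \approx \frac{2{,}460 - 2{,}400}{200/\sqrt{50}} = \frac{60}{28.28} = 2.12$$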
Therefore, the sample mean lies $2.12\sigma_{\bar{x}}$ above the hypothesized value of $\mu = 2{,}400$, as shown in Figure 6.2. Because this value of z exceeds 1.645, it falls into the rejection region. That is, we reject the null hypothesis that $\mu = 2{,}400$ and conclude that $\mu > 2{,}400$. Thus, it appears that the company’s pipe has a mean strength that exceeds 2,400 pounds per linear foot.
How much faith can be placed in this conclusion? What is the probability that our statistical test could lead us to reject the null hypothesis (and conclude that the company’s pipe meets the city’s specifications) when in fact the null hypothesis is true? The answer is $\alpha = .05$; that is, we selected the level of risk, $\alpha$, of making a Type I error when we constructed the test. Thus, the chance is only 1 in 20 that our test would lead us to conclude the manufacturer’s pipe satisfies the city’s specifications when in fact the pipe does not meet specifications.
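The 1-in-20 risk can also be illustrated by simulation. The sketch below repeatedly draws samples of 50 from a population whose mean is exactly 2,400 and counts how often the test wrongly rejects the null hypothesis. It assumes, purely for illustration, a normally distributed population with standard deviation 200; neither assumption comes from the text, and the function name and seed are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

def type_i_error_rate(trials=10_000, n=50, mu0=2400.0, sigma=200.0,
                      z_critical=1.645, seed=1):
    """Estimate how often the test rejects H0 when mu really equals mu0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Assumed population: normal with mean mu0 (so H0 is true).
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        # Large-sample z statistic, with s standing in for sigma.
        z = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
        if z > z_critical:
            rejections += 1
    return rejections / trials

rate = type_i_error_rate()
print(f"estimated Type I error rate: {rate:.3f}")
```

The estimated rate should land close to .05. It will not match exactly, both because of simulation noise and because s only approximates the population standard deviation.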
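Now, suppose the sample mean breaking strength for the 50 sections of sewer pipe turned out to be $\bar{x} = 2{,}430$ pounds per linear foot. Assuming that the sample standard deviation is still $s = 200$, the test statistic is
$$z = \frac{2{,}430 - 2{,}400}{200/\sqrt{50}} = \frac{30}{28.28} = 1.06$$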
Therefore, the sample mean is only 1.06 standard deviations above the null hypothesized value of $\mu = 2{,}400$. As shown in Figure 6.3, this value does not fall into the rejection region ($z > 1.645$). Therefore, we know that we cannot reject H0 using $\alpha = .05$. Even though the sample mean exceeds the city’s specification of 2,400 by 30 pounds per linear foot, it does not exceed the specification by enough to provide convincing evidence that the population mean exceeds 2,400.
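The two outcomes above follow the same decision rule, which can be written as a short function. This is a sketch only: the function name is hypothetical, and the inputs (sample means of 2,460 and 2,430 with s = 200 and n = 50) are taken as the sample values for the two cases; treat them as assumptions wherever the text does not display them.

```python
from math import sqrt

def upper_tail_z_test(x_bar, s, n, mu0=2400.0, z_critical=1.645):
    """Return (z, reject) for H0: mu <= mu0 versus Ha: mu > mu0.

    Uses s in place of sigma, which is reasonable for a large sample.
    """
    standard_error = s / sqrt(n)
    z = (x_bar - mu0) / standard_error
    return z, z > z_critical

for x_bar in (2460.0, 2430.0):
    z, reject = upper_tail_z_test(x_bar, s=200.0, n=50)
    decision = "reject H0" if reject else "insufficient evidence to reject H0"
    print(f"x_bar = {x_bar:.0f}: z = {z:.2f} -> {decision}")
```

With these inputs the first sample gives z = 2.12 (reject H0) and the second gives z = 1.06 (insufficient evidence to reject H0).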
The Neyman-Pearson Lemma
Egon Pearson was the only son of noteworthy British statistician Karl Pearson (see Biography, p. 463). As you might expect, Egon developed an interest in the statistical methods developed by his father and, upon completing graduate school, accepted a position to work for Karl in the Department of Applied Statistics at University College, London. Egon is best known for his collaboration with Jerzy Neyman (see Biography) on the development of the theory of hypothesis testing. One of the basic concepts in the Neyman-Pearson approach was that of the “null” and “alternative” hypotheses. Their famous Neyman-Pearson lemma was published in Biometrika in 1928. Egon Pearson had numerous other contributions to statistics and was known as an excellent teacher and lecturer. In his last major work, Egon fulfilled a promise made to his father by publishing an annotated version of Karl Pearson’s lectures on the early history of statistics.
A Type II error occurs if the researcher accepts the null hypothesis when, in fact, H0 is false. The probability of committing a Type II error is denoted by $\beta$.
Should we accept the null hypothesis and conclude that the manufacturer’s pipe does not meet specifications? To do so would be to risk a Type II error: concluding that the null hypothesis is true (the pipe does not meet specifications) when in fact it is false (the pipe does meet specifications). We denote the probability of committing a Type II error by $\beta$. It is well known that $\beta$ is often difficult to determine precisely. Rather than make a decision (accept H0) for which the probability of error is unknown, we avoid the potential Type II error by avoiding the conclusion that the null hypothesis is true. Instead, we will simply state that the sample evidence is insufficient to reject H0 at $\alpha = .05$. Because the null hypothesis is the “status-quo” hypothesis, the effect of not rejecting H0 is to maintain the status quo. In our pipe-testing example, the effect of having insufficient evidence to reject the null hypothesis that the pipe does not meet specifications is probably to prohibit the use of the manufacturer’s pipe unless and until there is sufficient evidence that the pipe does meet specifications. That is, until the data indicate convincingly that the null hypothesis is false, we usually maintain the status quo implied by its truth.
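One reason the Type II error probability is hard to pin down is that it depends on the true, unknown value of the population mean. The sketch below computes it for a few hypothetical true means under the upper-tailed test with cutoff 1.645 and n = 50. The true means (2,410, 2,450, 2,500), the population standard deviation of 200, and the function name are all assumptions made for illustration, not values from the text.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu_true, mu0=2400.0, sigma=200.0, n=50, z_critical=1.645):
    """P(test statistic misses the rejection region) when the mean is mu_true."""
    standard_error = sigma / sqrt(n)
    # Largest sample mean that still falls outside the rejection region.
    x_bar_cutoff = mu0 + z_critical * standard_error
    # Sampling distribution of the sample mean under the assumed true mean.
    return NormalDist(mu_true, standard_error).cdf(x_bar_cutoff)

for mu_true in (2410.0, 2450.0, 2500.0):
    print(f"true mean {mu_true:.0f}: beta = {type_ii_error(mu_true):.3f}")
```

The probability shrinks as the assumed true mean moves farther above 2,400, so no single value can be quoted without knowing where the true mean actually lies.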
Table 6.1 summarizes the four possible outcomes (i.e., conclusions) of a test of hypothesis. The “true state of nature” columns in Table 6.1 refer to the fact that either the null hypothesis is true or the alternative hypothesis is true. Note that the true state of nature is unknown to the researcher conducting the test. The “decision” rows in Table 6.1 refer to the action of the researcher, assuming that he or she will conclude either that H0 is true or that Ha is true, based on the results of the sampling experiment. Note that a Type I error can be made only when the null hypothesis is rejected in favor of the alternative hypothesis, and a Type II error can be made only when the null hypothesis is accepted. Our policy will be to make a decision only when we know the probability of making the error that corresponds to that decision. Because $\alpha$ is usually specified by the analyst, we will generally be able to reject H0 (accept Ha) when the sample evidence supports that decision. However, because $\beta$ is usually not specified, we will generally avoid the decision to accept H0, preferring instead to state that the sample evidence is insufficient to reject H0 when the test statistic is not in the rejection region.
 | True State of Nature | |
---|---|---|
Conclusion | H0 True | Ha True |
Accept H0 (Assume H0 True) | Correct decision | Type II error (probability $\beta$) |
Reject H0 (Assume Ha True) | Type I error (probability $\alpha$) | Correct decision |
Be careful not to “accept H0” when conducting a test of hypothesis because the measure of reliability, $\beta = P(\text{Type II error})$, is almost always unknown. If the test statistic does not fall into the rejection region, it is better to state the conclusion as “insufficient evidence to reject H0.”
The elements of a test of hypothesis are summarized in the following box. Note that the first four elements are all specified before the sampling experiment is performed. In no case will the results of the sample be used to determine the hypotheses; the data are collected to test the predetermined hypotheses, not to formulate them.
1. Null hypothesis (H0): A theory about the specific values of one or more population parameters. The theory generally represents the status quo, which we adopt until it is proven false. The theory is always stated as H0: parameter = value.
2. Alternative (research) hypothesis (Ha): A theory that contradicts the null hypothesis. The theory generally represents that which we will adopt only when sufficient evidence exists to establish its truth.
3. Test statistic: A sample statistic used to decide whether to reject the null hypothesis.
4. Rejection region: The numerical values of the test statistic for which the null hypothesis will be rejected. The rejection region is chosen so that the probability is $\alpha$ that it will contain the test statistic when the null hypothesis is true, thereby leading to a Type I error. The value of $\alpha$ is usually chosen to be small (e.g., .01, .05, or .10) and is referred to as the level of significance of the test.
5. Assumptions: Clear statement(s) of any assumptions made about the population(s) being sampled.
6. Experiment and calculation of test statistic: Performance of the sampling experiment and determination of the numerical value of the test statistic.
7. Conclusion:
   a. If the numerical value of the test statistic falls in the rejection region, we reject the null hypothesis and conclude that the alternative hypothesis is true. We know that the hypothesis-testing process will lead to this conclusion incorrectly (Type I error) only $100\alpha\%$ of the time when H0 is true.
   b. If the test statistic does not fall in the rejection region, we do not reject H0. Thus, we reserve judgment about which hypothesis is true. We do not conclude that the null hypothesis is true because we do not (in general) know the probability $\beta$ that our test procedure will lead to an incorrect acceptance of H0 (Type II error).
As with confidence intervals, the methodology for testing hypotheses varies depending on the target population parameter. In this chapter, we develop methods for testing a population mean, a population proportion, and (optionally) a population variance. As a reminder, the key words and the type of data associated with these target parameters are again listed in the accompanying box.
Parameter | Key Words or Phrases | Type of Data |
---|---|---|
$\mu$ | Mean; average | Quantitative |
p | Proportion; percentage; fraction; rate | Qualitative |
$\sigma^2$ | Variance; variability; spread | Quantitative |