4.1 Hypothesis testing
4.1.1 Introduction
If preferred, the reader may begin with the example at the end of this section, then return to the general theory at the beginning.
4.1.2 Classical hypothesis testing
Most simple problems in which tests of hypotheses arise are of the following general form. There is one unknown parameter θ which is known to be from a set Θ, and you want to know whether θ ∈ Θ0 or θ ∈ Θ1, where

Θ0 ∪ Θ1 = Θ and Θ0 ∩ Θ1 = ∅,

so that Θ1 is the complement of Θ0.
Usually, you are able to make use of a vector x = (x1, x2, …, xn) of observations whose density p(x|θ) depends on θ. It is convenient to denote the set of all possible observations by X.
In the language of classical statistics, it is usual to refer to

H0: θ ∈ Θ0

as the null hypothesis and to

H1: θ ∈ Θ1

as the alternative hypothesis, and to say that if you decide to reject H0 when it is true then you have made a Type I error, while if you decide not to reject H0 when it is false then you have made a Type II error.
A test is decided by a rejection region R, where

R = {x : observing x leads to rejection of H0}.
Classical statisticians then say that decisions between tests should be based on the probabilities of Type I errors, that is,

α(θ) = P(x ∈ R | θ) for θ ∈ Θ0,

and of Type II errors, that is,

β(θ) = P(x ∉ R | θ) for θ ∈ Θ1.
In general, the smaller the probability of Type I error, the larger the probability of Type II error, and vice versa. Consequently, classical statisticians recommend a choice of R which in some sense represents an optimal balance between the two types of error. Very often R is chosen so that the probability of a Type II error is as small as possible subject to the requirement that the probability of a Type I error is always less than or equal to some fixed value α, known as the size of the test. This theory, which is largely due to Neyman and Pearson, appears in most books on statistical inference and is found in its fullest form in Lehmann (1986).
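As a concrete sketch of the Neyman–Pearson recipe (the hypotheses θ = 0 versus θ = 2, the unit variance, and the size α = 0.05 are illustrative choices, not taken from the text), the following fixes a one-sided rejection region of size α for a single observation x ~ N(θ, 1) and then evaluates the resulting probability of a Type II error:

```python
from math import erf, sqrt

def norm_cdf(x):
    # Distribution function of the standard normal, Phi(x)
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def norm_ppf(p, lo=-10.0, hi=10.0):
    # Inverse of norm_cdf by bisection (accurate enough for illustration)
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if norm_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

alpha = 0.05               # size of the test (bound on the Type I error probability)
c = norm_ppf(1.0 - alpha)  # rejection region R = {x : x > c}; c is about 1.645
beta = norm_cdf(c - 2.0)   # Type II error probability when theta = 2
```

Shrinking α pushes c to the right and so increases β, which is the trade-off described above.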
4.1.3 Difficulties with the classical approach
Other points will be made later about the comparison between the classical and the Bayesian approaches, but one thing to note at the outset is that, in the classical approach, we consider the probability (for various values of θ) of a set R to which the vector x of observations does, or does not, belong. Consequently, we are concerned not merely with the single vector of observations we actually made but also with others we might have made but did not. Thus, classically, if we suppose that x ~ N(θ, 1) and we wish to test whether θ = 0 or θ > 0 is true (negative values being supposed impossible), then we reject H0 on the basis of a single observation x = 3 because the probability that an N(0, 1) random variable is 3 or greater is 0.001 350, even though we certainly did not make an observation greater than 3. This aspect of the classical approach led Jeffreys (1961, Section 7.2) to remark:
What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.
Note, however, that the form of the model, in this case the assumption of normally distributed observations of unit variance, does depend on an assumption about the whole distribution of all possible observations.
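The tail probability quoted above is easy to check numerically; this sketch uses only the standard library (the helper name norm_cdf is our own):

```python
from math import erf, sqrt

def norm_cdf(x):
    # Distribution function of the standard normal, Phi(x)
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Classical P-value for a single observation x = 3 under H0: theta = 0,
# i.e. the probability that an N(0, 1) variable is 3 or greater
p_value = 1.0 - norm_cdf(3.0)   # approximately 0.001 350
```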
4.1.4 The Bayesian approach
The Bayesian approach is in many ways more straightforward. All we need to do is to calculate the posterior probabilities

p0 = P(θ ∈ Θ0 | x) and p1 = P(θ ∈ Θ1 | x)

and decide between H0 and H1 accordingly. (We note that p0 + p1 = 1 as Θ0 ∪ Θ1 = Θ and Θ0 ∩ Θ1 = ∅.)
Although posterior probabilities of hypotheses are our ultimate goal, we also need prior probabilities

π0 = P(θ ∈ Θ0) and π1 = P(θ ∈ Θ1)

to find them. (We note that π0 + π1 = 1 just as p0 + p1 = 1.) It is also useful to consider the prior odds on H0 against H1, namely

π0/π1,

and the posterior odds on H0 against H1, namely

p0/p1.
(The notion of odds was originally introduced in the very first section of this book.) Observe that if your prior odds are close to 1, then you regard H0 as more or less as likely as H1 a priori, while if the ratio is large you regard H0 as relatively likely, and when it is small you regard it as relatively unlikely. Similar remarks apply to the interpretation of the posterior odds.
It is also useful to define the Bayes factor B in favour of H0 against H1 as

B = (p0/p1) / (π0/π1) = (p0 π1) / (p1 π0).
The interest in the Bayes factor is that it can sometimes be interpreted as the ‘odds in favour of H0 against H1 that are given by the data’. It is worth noting that because p0/p1 = B(π0/π1) and p1 = 1 − p0 (while π1 = 1 − π0), we can find the posterior probability p0 of H0 from its prior probability and the Bayes factor by

p0 = 1 / (1 + (π1/π0)(1/B)).
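The passage from prior probability and Bayes factor to posterior probability can be coded directly; a minimal sketch (the numbers in the usage line are illustrative):

```python
def posterior_prob_H0(pi0, B):
    # p0 = 1 / (1 + (pi1/pi0)(1/B)), with pi1 = 1 - pi0; equivalently,
    # posterior odds = B * prior odds, then convert odds back to a probability
    prior_odds = pi0 / (1.0 - pi0)
    posterior_odds = B * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

p0 = posterior_prob_H0(0.5, 9.0)   # even prior odds with B = 9 gives p0 = 0.9
```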
The aforementioned interpretation is clearly valid when the hypotheses are simple, that is,

Θ0 = {θ0} and Θ1 = {θ1}

for some θ0 and θ1. For if so, then p0 ∝ π0 p(x|θ0) and p1 ∝ π1 p(x|θ1), so that

p0/p1 = (π0/π1) p(x|θ0)/p(x|θ1),

and hence, the Bayes factor is

B = p(x|θ0) / p(x|θ1).
It follows that B is the likelihood ratio of H0 against H1 which most statisticians (whether Bayesian or not) view as the odds in favour of H0 against H1 that are given by the data.
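For simple hypotheses, then, computing B amounts to evaluating a likelihood ratio. A minimal sketch, assuming x ~ N(θ, 1) and an illustrative pair of hypotheses and observed value:

```python
from math import exp, pi, sqrt

def norm_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) at x
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

# Simple hypotheses H0: theta = 0 versus H1: theta = 1, with x ~ N(theta, 1)
x = 2.0                                             # illustrative observation
B = norm_pdf(x, 0.0, 1.0) / norm_pdf(x, 1.0, 1.0)   # likelihood ratio
# B = exp(-1.5), about 0.22: these data tell in favour of H1
```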
However, the interpretation is not quite as simple when H0 and H1 are composite, that is, contain more than one member. In such a case, it is convenient to write

ρ0(θ) = p(θ)/π0 (for θ ∈ Θ0)

and

ρ1(θ) = p(θ)/π1 (for θ ∈ Θ1),

where p(θ) is the prior density of θ, so that ρ0 is the restriction of p(θ) to Θ0 renormalized to give a probability density over Θ0, and similarly for ρ1. We then have

p0 = P(θ ∈ Θ0 | x) ∝ ∫Θ0 p(x|θ) p(θ) dθ = π0 ∫Θ0 p(x|θ) ρ0(θ) dθ,

the constant of proportionality depending solely on x. Similarly,

p1 ∝ π1 ∫Θ1 p(x|θ) ρ1(θ) dθ,

and hence, the Bayes factor is

B = (p0/p1) / (π0/π1) = ∫Θ0 p(x|θ) ρ0(θ) dθ / ∫Θ1 p(x|θ) ρ1(θ) dθ,

which is the ratio of ‘weighted’ (by ρ0 and ρ1) likelihoods of H0 and H1.
Because this expression for the Bayes factor involves ρ0 and ρ1 as well as the likelihood function itself, the Bayes factor cannot be regarded as a measure of the relative support for the hypotheses provided solely by the data. Sometimes, however, B will be relatively little affected, within reasonable limits, by the choice of ρ0 and ρ1, and then we can regard B as a measure of relative support for the hypotheses provided by the data. When this is so, the Bayes factor is reasonably objective and might, for example, be included in a scientific report, so that different users of the data could determine their personal posterior odds by multiplying their personal prior odds by the factor.
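For composite hypotheses the weighted likelihoods are integrals, which can be approximated numerically. In this sketch (all choices illustrative, not from the text) the prior for θ is N(0, 1), the hypotheses are Θ0 = {θ : θ < 0} and Θ1 = {θ : θ > 0}, x ~ N(θ, 1), and ρ0, ρ1 are the renormalized restrictions of the prior:

```python
from math import exp, pi, sqrt

def norm_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def weighted_likelihood(x, lo, hi, n=20_000):
    # Midpoint approximation to the integral over (lo, hi) of
    # p(x|theta) * rho(theta) dtheta, where rho is the N(0, 1) prior
    # restricted to (lo, hi) and renormalized to a probability density
    h = (hi - lo) / n
    thetas = [lo + (i + 0.5) * h for i in range(n)]
    num = sum(norm_pdf(x, t, 1.0) * norm_pdf(t, 0.0, 1.0) for t in thetas) * h
    den = sum(norm_pdf(t, 0.0, 1.0) for t in thetas) * h  # prior mass of the set
    return num / den

x = 1.5                                                   # illustrative observation
B = weighted_likelihood(x, -8.0, 0.0) / weighted_likelihood(x, 0.0, 8.0)
# B is well below 1: an observation of 1.5 tells against theta < 0
```

(The range is truncated at ±8 because the N(0, 1) prior puts negligible mass beyond that.)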
It may be noted that the Bayes factor is referred to by a few authors simply as the factor. Jeffreys (1961) denoted it by K, but did not give it a name. A number of authors, most notably Peirce (1878) and (independently) Good (1950, 1983 and elsewhere), refer to the logarithm of the Bayes factor as the weight of evidence. The point of taking the logarithm is, of course, that if you have several experiments about two simple hypotheses, then the Bayes factors multiply, and so the weight of evidence adds.
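The point about logarithms is easy to verify: for independent experiments bearing on the same pair of simple hypotheses, Bayes factors multiply, so their logarithms (the weights of evidence) add. The factor values below are illustrative:

```python
from math import log

# Bayes factors from two independent experiments about the same
# pair of simple hypotheses (illustrative values)
B1, B2 = 3.0, 4.0
combined_B = B1 * B2                  # Bayes factors multiply: 12.0
combined_weight = log(B1) + log(B2)   # weights of evidence add: log 12
```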
4.1.5 Example
According to Watkins (1986, Section 13.3), the electroweak theory predicted the existence of a new particle, the W particle, of a mass m of 82.4 ± 1.1 GeV. Experimental results showed that such a particle existed and had a mass of 82.1 ± 1.7 GeV. If we take the mass to have a normal prior and likelihood and assume that the values after the ± signs represent known standard deviations, and if we are prepared to take both the theory and the experiment into account, then we can conclude that the posterior for the mass is N(θ1, φ1), where

φ1 = (1/1.1² + 1/1.7²)⁻¹ ≅ 0.92², θ1 = φ1 (82.4/1.1² + 82.1/1.7²) ≅ 82.3

(following the procedure of Section 2.2 on ‘Normal Prior and Likelihood’). Suppose that for some reason it was important to know whether or not this mass was less than 83.0 GeV. Then, since the prior distribution is N(82.4, 1.1²), the prior probability of this hypothesis is given by

π0 = P(m < 83.0) = Φ((83.0 − 82.4)/1.1) = Φ(0.55),

where Φ is the distribution function of the standard normal distribution. From tables of the normal distribution, it follows that π0 ≅ 0.71, so that the prior odds are

π0/π1 ≅ 0.71/0.29 ≅ 2.4.

Similarly, the posterior probability of the hypothesis that m < 83.0 is p0 = Φ((83.0 − 82.3)/0.92) = Φ(0.76) ≅ 0.78, and hence the posterior odds are

p0/p1 ≅ 0.78/0.22 ≅ 3.5.

Thus, the Bayes factor is

B = (p0/p1) / (π0/π1) ≅ 3.5/2.4 ≅ 1.4.
In this case, the experiment has not much altered beliefs about the hypothesis under discussion, and this is represented by the nearness of B to 1.
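The whole calculation can be reproduced in a few lines, assuming, with Watkins (1986), a theoretical value of 82.4 ± 1.1 GeV and an experimental value of 82.1 ± 1.7 GeV (the normal distribution function is coded from the standard error function):

```python
from math import erf, sqrt

def norm_cdf(x):
    # Distribution function of the standard normal, Phi(x)
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Theory (prior): N(82.4, 1.1^2); experiment (likelihood): 82.1 with sd 1.7
prior_mean, prior_sd = 82.4, 1.1
obs, obs_sd = 82.1, 1.7

# Normal prior and likelihood: precisions add, means combine precision-weighted
post_prec = 1.0 / prior_sd**2 + 1.0 / obs_sd**2
post_var = 1.0 / post_prec
post_mean = post_var * (prior_mean / prior_sd**2 + obs / obs_sd**2)

# Hypothesis H0: m < 83.0 GeV
pi0 = norm_cdf((83.0 - prior_mean) / prior_sd)       # prior probability
p0 = norm_cdf((83.0 - post_mean) / sqrt(post_var))   # posterior probability

prior_odds = pi0 / (1.0 - pi0)
posterior_odds = p0 / (1.0 - p0)
B = posterior_odds / prior_odds                      # Bayes factor, about 1.4
```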
4.1.6 Comment
A point about hypothesis tests well worth making is that they ‘are traditionally used as a method for testing between two terminal acts [but that] in actual practice [they] are far more commonly used [when we are] given the outcome of a sample [to decide whether] any final or terminal decision [should] be reached or should judgement be suspended until more sample evidence is available’ (Schlaifer, 1961, Section 13.2).