9.2 The EM algorithm
9.2.1 The idea of the EM algorithm
A useful numerical technique which finds the posterior mode, that is, the value $\hat\eta$ at which $p(\eta\mid\boldsymbol{x})$, or equivalently $\log p(\eta\mid\boldsymbol{x})$, is a maximum, but which does not provide full information on the posterior distribution, is the EM algorithm. We can exemplify this by an example on genetic linkage due to Rao (1973, Section 5g) quoted by Dempster et al. (1977), by Gelfand and Smith (1990) and by Tanner (1996, Section 4.1). We have observations $\boldsymbol{x} = (x_1, x_2, x_3, x_4)$ with cell probabilities
$$\left(\tfrac{1}{2} + \tfrac{\eta}{4},\ \tfrac{1}{4}(1-\eta),\ \tfrac{1}{4}(1-\eta),\ \tfrac{\eta}{4}\right)$$
and we want to estimate η. The likelihood is then
$$l(\eta \mid \boldsymbol{x}) \propto (2+\eta)^{x_1}(1-\eta)^{x_2+x_3}\eta^{x_4}.$$
What we do is to augment the data by adding further data to produce augmented data $\boldsymbol{y} = (y_0, y_1, x_2, x_3, x_4)$, where $y_0 + y_1 = x_1$. It should be noted that to a considerable extent the distinction between parameters of a model and augmentations to the data is artificial (i.e. fictitious augmentations to the data can be regarded as parameters of a model). In the example under consideration, the augmentation consists simply in splitting the first cell into two cells with probabilities $\tfrac{1}{2}$ and $\tfrac{\eta}{4}$. The advantage of this is that the likelihood then takes the much simpler form
$$l(\eta \mid \boldsymbol{y}) \propto \eta^{y_1+x_4}(1-\eta)^{x_2+x_3},$$
and if we take the standard reference prior Be(0, 0), then the posterior is a beta distribution
$$\eta \mid \boldsymbol{y} \sim \mathrm{Be}(y_1 + x_4,\ x_2 + x_3).$$
Not much in life is free, and in this case we have to pay for the greater simplicity of the model by estimating the split of $x_1$ into $y_0$ and $y_1$. The EM algorithm for finding the posterior mode is an iterative method starting from some plausible guess $\eta^{(0)}$ for the value of η. At stage t, we suppose that the current guess is $\eta^{(t)}$. Each stage has two steps. In the first, the E-step (expectation step), we compute
$$Q(\eta, \eta^{(t)}) = \mathrm{E}\{\log p(\eta \mid \boldsymbol{y}) \mid \boldsymbol{x}, \eta^{(t)}\},$$
that is, the expectation of the log of the augmented-data posterior density, the expectation being computed over $\boldsymbol{y}$ at $\eta = \eta^{(t)}$, so that
$$Q(\eta, \eta^{(t)}) = \int \log\{p(\eta \mid \boldsymbol{y})\}\, p(\boldsymbol{y} \mid \boldsymbol{x}, \eta^{(t)})\, \mathrm{d}\boldsymbol{y}.$$
In the second step, the M-step (the maximization step), we find that value of η which maximizes $Q(\eta, \eta^{(t)})$. In this particular example, as $y_i = x_i$ for $i > 1$,
$$Q(\eta, \eta^{(t)}) = \{\mathrm{E}(y_1 \mid \boldsymbol{x}, \eta^{(t)}) + x_4 - 1\}\log\eta + (x_2 + x_3 - 1)\log(1-\eta)$$
up to an additive constant.
For the M-step, we note that
$$\frac{\partial Q}{\partial \eta} = \frac{\mathrm{E}(y_1 \mid \boldsymbol{x}, \eta^{(t)}) + x_4 - 1}{\eta} - \frac{x_2 + x_3 - 1}{1 - \eta},$$
and equating this to zero we deduce that
$$\eta^{(t+1)} = \frac{\mathrm{E}(y_1 \mid \boldsymbol{x}, \eta^{(t)}) + x_4 - 1}{\mathrm{E}(y_1 \mid \boldsymbol{x}, \eta^{(t)}) + x_2 + x_3 + x_4 - 2}.$$
Since $y_1$ has a binomial distribution $\mathrm{B}(x_1, \pi)$ with
$$\pi = \frac{\eta/4}{\tfrac{1}{2} + \eta/4} = \frac{\eta}{2+\eta},$$
so that
$$\mathrm{E}(y_1 \mid \boldsymbol{x}, \eta^{(t)}) = \frac{x_1\eta^{(t)}}{2+\eta^{(t)}},$$
the iteration becomes
$$\eta^{(t+1)} = \frac{x_1\eta^{(t)}/(2+\eta^{(t)}) + x_4 - 1}{x_1\eta^{(t)}/(2+\eta^{(t)}) + x_2 + x_3 + x_4 - 2}.$$
The values actually observed were $x_1 = 125$, $x_2 = 18$, $x_3 = 20$, $x_4 = 34$. We can then estimate η by iteration starting from, for example, $\eta^{(0)} = 0.5$ and using
$$\eta^{(t+1)} = \frac{125\eta^{(t)}/(2+\eta^{(t)}) + 33}{125\eta^{(t)}/(2+\eta^{(t)}) + 70}.$$
In fact, the iteration will converge to the positive root of $65\eta^2 - 6\eta - 22 = 0$, which is 0.630.
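The iteration is easy to program. The following sketch (plain Python; the function name `em_step` is ours) runs the scheme from $\eta^{(0)} = 0.5$ with the observed counts:

```python
# EM iteration for the genetic linkage example:
# posterior mode of eta under the Be(0, 0) reference prior.

x1, x2, x3, x4 = 125, 18, 20, 34  # observed cell counts

def em_step(eta):
    # E-step: expected count in the (eta/4) part of the first cell,
    # since y1 | x, eta ~ B(x1, eta / (2 + eta)).
    e_y1 = x1 * eta / (2 + eta)
    # M-step: mode of Be(y1 + x4, x2 + x3) with y1 replaced by its expectation.
    return (e_y1 + x4 - 1) / (e_y1 + x2 + x3 + x4 - 2)

eta = 0.5  # plausible starting guess
for _ in range(100):
    eta_new = em_step(eta)
    if abs(eta_new - eta) < 1e-10:
        break
    eta = eta_new

print(round(eta, 3))  # 0.63
```

The converged value satisfies the quadratic $65\eta^2 - 6\eta - 22 = 0$ quoted above, since a fixed point of the iteration is exactly a root of that equation.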
9.2.2 Why the EM algorithm works
We will first show that $\log p(\eta^{(t)}\mid\boldsymbol{x})$ increases with t and, presuming that $\eta^{(t)}$ converges to some limit $\hat\eta$, then show that $\partial \log p(\eta\mid\boldsymbol{x})/\partial\eta = 0$ at $\eta = \hat\eta$, so that $\hat\eta$ will normally be the posterior mode.
From the obvious equation
$$p(\eta \mid \boldsymbol{y}, \boldsymbol{x})\, p(\boldsymbol{y} \mid \boldsymbol{x}) = p(\boldsymbol{y} \mid \eta, \boldsymbol{x})\, p(\eta \mid \boldsymbol{x})$$
considered conditional on $\boldsymbol{x}$, it follows, on noting that $\boldsymbol{y}$ includes $\boldsymbol{x}$ and hence $p(\eta \mid \boldsymbol{y}, \boldsymbol{x}) = p(\eta \mid \boldsymbol{y})$, that
$$\log p(\eta \mid \boldsymbol{x}) = \log p(\eta \mid \boldsymbol{y}) - \log p(\boldsymbol{y} \mid \eta, \boldsymbol{x}) + \log p(\boldsymbol{y} \mid \boldsymbol{x}).$$
Multiplying both sides by the density $p(\boldsymbol{y} \mid \eta^{(t)}, \boldsymbol{x})$ and integrating over $\boldsymbol{y}$ (noting that the left-hand side does not depend on $\boldsymbol{y}$), we get for any fixed $\eta^{(t)}$
$$\log p(\eta \mid \boldsymbol{x}) = Q(\eta, \eta^{(t)}) - H(\eta, \eta^{(t)}) + K,$$
where
$$H(\eta, \eta^{(t)}) = \int \log\{p(\boldsymbol{y} \mid \eta, \boldsymbol{x})\}\, p(\boldsymbol{y} \mid \eta^{(t)}, \boldsymbol{x})\, \mathrm{d}\boldsymbol{y}, \qquad K = \int \log\{p(\boldsymbol{y} \mid \boldsymbol{x})\}\, p(\boldsymbol{y} \mid \eta^{(t)}, \boldsymbol{x})\, \mathrm{d}\boldsymbol{y}$$
(we note that K clearly does not depend on η). Taking $\eta = \eta^{(t+1)}$ and $\eta = \eta^{(t)}$ in turn and subtracting, we get
$$\log p(\eta^{(t+1)} \mid \boldsymbol{x}) - \log p(\eta^{(t)} \mid \boldsymbol{x}) = \{Q(\eta^{(t+1)}, \eta^{(t)}) - Q(\eta^{(t)}, \eta^{(t)})\} - \{H(\eta^{(t+1)}, \eta^{(t)}) - H(\eta^{(t)}, \eta^{(t)})\}.$$
Now $Q(\eta^{(t+1)}, \eta^{(t)}) \geq Q(\eta^{(t)}, \eta^{(t)})$ because of the way in which we chose $\eta^{(t+1)}$. Moreover,
$$H(\eta^{(t+1)}, \eta^{(t)}) - H(\eta^{(t)}, \eta^{(t)}) = \int \log(u)\, p(\boldsymbol{y} \mid \eta^{(t)}, \boldsymbol{x})\, \mathrm{d}\boldsymbol{y},$$
where
$$u = \frac{p(\boldsymbol{y} \mid \eta^{(t+1)}, \boldsymbol{x})}{p(\boldsymbol{y} \mid \eta^{(t)}, \boldsymbol{x})}.$$
Because $u - 1 - \log u$ (as is easily seen by differentiation) takes a minimum value of 0 when $u = 1$, so that $\log u \leq u - 1$, we see that
$$H(\eta^{(t+1)}, \eta^{(t)}) - H(\eta^{(t)}, \eta^{(t)}) \leq \int (u - 1)\, p(\boldsymbol{y} \mid \eta^{(t)}, \boldsymbol{x})\, \mathrm{d}\boldsymbol{y} = 1 - 1 = 0$$
for all t. Consequently,
$$\log p(\eta^{(t+1)} \mid \boldsymbol{x}) \geq \log p(\eta^{(t)} \mid \boldsymbol{x}),$$
so that $p(\eta^{(t)} \mid \boldsymbol{x})$ increases with t.
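The monotone increase of the log posterior is easy to verify numerically in the linkage example. A minimal check (plain Python; the function names are ours) records $\log p(\eta^{(t)}\mid\boldsymbol{x})$, up to a constant, along the EM iterates:

```python
from math import log

# Observed counts for the genetic linkage example
x1, x2, x3, x4 = 125, 18, 20, 34

def log_posterior(eta):
    # log p(eta | x) up to an additive constant, with the Be(0, 0) prior:
    # p(eta | x) propto (2 + eta)^x1 (1 - eta)^(x2 + x3 - 1) eta^(x4 - 1)
    return x1 * log(2 + eta) + (x2 + x3 - 1) * log(1 - eta) + (x4 - 1) * log(eta)

def em_step(eta):
    e_y1 = x1 * eta / (2 + eta)                          # E-step
    return (e_y1 + x4 - 1) / (e_y1 + x2 + x3 + x4 - 2)   # M-step

eta, trace = 0.5, []
for _ in range(20):
    trace.append(log_posterior(eta))
    eta = em_step(eta)

# The sequence of log posterior values never decreases (to rounding error)
assert all(b >= a - 1e-9 for a, b in zip(trace, trace[1:]))
```

Each EM step improves (or at worst leaves unchanged) the marginal log posterior, exactly as the argument above guarantees.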
We now show that the limit $\hat\eta$ is a stationary point. As $\eta^{(t+1)}$ maximizes $Q(\eta, \eta^{(t)})$,
$$\frac{\partial Q(\eta, \eta^{(t)})}{\partial \eta}\bigg|_{\eta = \eta^{(t+1)}} = 0,$$
we see that, provided Q is a reasonably smooth function, on letting $t \to \infty$,
$$\frac{\partial Q(\eta, \hat\eta)}{\partial \eta}\bigg|_{\eta = \hat\eta} = 0.$$
Further,
$$\frac{\partial H(\eta, \eta^{(t)})}{\partial \eta}\bigg|_{\eta = \eta^{(t)}} = \int \frac{\partial p(\boldsymbol{y} \mid \eta, \boldsymbol{x})/\partial \eta\,\big|_{\eta = \eta^{(t)}}}{p(\boldsymbol{y} \mid \eta^{(t)}, \boldsymbol{x})}\, p(\boldsymbol{y} \mid \eta^{(t)}, \boldsymbol{x})\, \mathrm{d}\boldsymbol{y} = \frac{\partial}{\partial \eta} \int p(\boldsymbol{y} \mid \eta, \boldsymbol{x})\, \mathrm{d}\boldsymbol{y}\,\bigg|_{\eta = \eta^{(t)}} = 0,$$
and in particular, taking $\eta^{(t)} = \hat\eta$,
$$\frac{\partial H(\eta, \hat\eta)}{\partial \eta}\bigg|_{\eta = \hat\eta} = 0.$$
Since $\log p(\eta \mid \boldsymbol{x}) = Q(\eta, \eta^{(t)}) - H(\eta, \eta^{(t)}) + K$ for any fixed $\eta^{(t)}$, and in particular for $\eta^{(t)} = \hat\eta$, it now follows that
$$\frac{\partial \log p(\eta \mid \boldsymbol{x})}{\partial \eta}\bigg|_{\eta = \hat\eta} = 0,$$
so that $\hat\eta$ is a stationary point, which will usually be the posterior mode.
More information about the EM algorithm can be found in Dempster et al. (1977) or Tanner (1996).
9.2.3 Semi-conjugate prior with a normal likelihood
Suppose that we have independent normally distributed observations $x_1, x_2, \dots, x_n \sim N(\theta, \phi)$. We use the notation
$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i, \qquad S = \sum_{i=1}^n (x_i - \bar{x})^2.$$
In Section 2.13, we noted that the conjugate joint prior for the normal mean and variance is not a product of a function of θ and a function of φ, so that θ and φ are not independent a priori. Nevertheless, an assumption of prior independence would seem appropriate for many problems, and so we are sometimes led to consider a situation in which
$$\theta \sim N(\theta_0, \phi_0), \qquad \phi \sim S_0\chi_{\nu_0}^{-2},$$
independently of one another. Such a prior is described as semi-conjugate by Gelman et al. (2004, Section 3.4) and as conditionally conjugate by O’Hagan (2004, Section 6.33). The latter term arises because, conditional on knowledge of θ, it is a conjugate prior for φ, and vice versa.
With such a prior, we know that, conditional on knowledge of φ, the posterior of θ is
$$\theta \mid \boldsymbol{x}, \phi \sim N(\theta_1, \phi_1),$$
where
$$\phi_1 = \left(\frac{1}{\phi_0} + \frac{n}{\phi}\right)^{-1}, \qquad \theta_1 = \phi_1\left(\frac{\theta_0}{\phi_0} + \frac{n\bar{x}}{\phi}\right)$$
(cf. Section 2.3), whereas, conditional on knowledge of θ, the posterior of φ is
$$\phi \mid \boldsymbol{x}, \theta \sim S_1\chi_{\nu_1}^{-2},$$
where
$$\nu_1 = \nu_0 + n, \qquad S_1 = S_0 + \sum (x_i - \theta)^2 = S_0 + S + n(\bar{x} - \theta)^2$$
(cf. Section 2.7).
In such a case, we can use the EM algorithm to estimate the posterior mode of θ, effectively augmenting the data with the variance φ. We begin by observing that (ignoring terms which do not depend on θ) the log posterior density is
$$\log p(\theta, \phi \mid \boldsymbol{x}) = -\frac{(\theta - \theta_0)^2}{2\phi_0} - \frac{\sum (x_i - \theta)^2}{2\phi} + \text{const}.$$
To carry out the E-step we first note that since, conditional on $\boldsymbol{x}$ and $\theta^{(t)}$, we have $\phi \sim S_1^{(t)}\chi_{n_1}^{-2}$, so that $S_1^{(t)}/\phi$ is a chi-squared variate on $n_1$ degrees of freedom, and the expectation of a chi-squared variate is equal to its number of degrees of freedom, we know that
$$\mathrm{E}(1/\phi \mid \boldsymbol{x}, \theta^{(t)}) = n_1 / S_1^{(t)},$$
so that
$$Q(\theta, \theta^{(t)}) = -\frac{(\theta - \theta_0)^2}{2\phi_0} - \frac{n_1}{2S_1^{(t)}}\sum (x_i - \theta)^2,$$
where $n_1 = \nu_0 + n$ and $S_1^{(t)} = S_0 + S + n(\bar{x} - \theta^{(t)})^2$. For the M-step we must find that value $\theta^{(t+1)}$ which maximizes $Q(\theta, \theta^{(t)})$. But $Q(\theta, \theta^{(t)})$ is of the form of the log of the posterior density we would have got for θ had we had the same prior ($\theta \sim N(\theta_0, \phi_0)$) and $n$ observations with mean θ and known variance $S_1^{(t)}/n_1$. Now it follows from Section 2.3 on ‘Several normal observations with a normal prior’ that the posterior mean of θ in this case is at
$$\theta^{(t+1)} = \frac{\theta_0/\phi_0 + \bar{x}\, n\, n_1/S_1^{(t)}}{1/\phi_0 + n\, n_1/S_1^{(t)}}$$
(the right-hand side depends on $\theta^{(t)}$ through $S_1^{(t)}$), and so we see that this is the value of θ giving the required maximum.
To illustrate this case, we return to the data on wheat yield considered in the example towards the end of Section 2.13, in which n=12, $\bar{x} = 119$ and S=13 045. We will take the prior for the variance considered in that Section, namely $\phi \sim S_0\chi_{\nu_0}^{-2}$ with $S_0 = 2700$ and $\nu_0 = 11$. For the mean, we will take a prior which is $N(\theta_0, \phi_0)$, with $\theta_0$ and $\phi_0$ chosen so that it approximates well to the marginal distribution of θ found in that Section. The resulting marginal densities are nearly the same, but because we now assume prior independence, the joint distribution is different.
It seems that a reasonable starting point for the iteration is $\theta^{(0)} = \bar{x}$. Successive iterates then change very little, and the sequence $\theta^{(t)}$ settles down to the posterior mode after only a few steps.
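The fixed-point iteration is straightforward to code. The sketch below (plain Python) uses the data values quoted above; the prior constants `theta0` and `phi0` are illustrative assumptions of ours, not the values used in the text, and the sample mean $\bar{x} = 119$ is likewise stated here as an assumption about the Section 2.13 example:

```python
# EM iteration for the posterior mode of theta under the semi-conjugate prior
# theta ~ N(theta0, phi0) independently of phi ~ S0 chi^{-2} on nu0 d.f.
n, xbar, S = 12, 119, 13045   # wheat yield data (xbar assumed; see lead-in)
S0, nu0 = 2700, 11            # prior for the variance phi
theta0, phi0 = 110, 100       # ILLUSTRATIVE prior for theta

n1 = nu0 + n
theta = xbar                  # starting point for the iteration
for _ in range(100):
    # E-step: phi | x, theta ~ S1 chi^{-2} on n1 d.f., so E(1/phi) = n1/S1
    S1 = S0 + S + n * (xbar - theta) ** 2
    w = n * n1 / S1           # effective total precision of the observations
    # M-step: posterior mean for a normal mean with known variance S1/n1
    theta_new = (theta0 / phi0 + xbar * w) / (1 / phi0 + w)
    if abs(theta_new - theta) < 1e-12:
        break
    theta = theta_new
```

Whatever prior values are used, the mode lands between the prior mean and the sample mean, with the iteration converging in a handful of steps.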
9.2.4 The EM algorithm for the hierarchical normal model
The EM algorithm is well suited to the analysis of hierarchical models, and we can see this by examining the important case of the hierarchical normal model. Although the formulae we shall derive do not look pretty, they are very easy to use. Suppose that
$$x_{ij} \sim N(\theta_i, \phi) \qquad (i = 1, 2, \dots, r;\ j = 1, 2, \dots, n_i)$$
with $N = \sum_i n_i$, where
$$\theta_i \sim N(\mu, \psi)$$
as in Section 8.5, and that we wish to estimate the hyperparameters μ, φ and ψ. In this case $\eta = (\mu, \phi, \psi)$ takes the place of η in the example on genetic linkage, while we augment the data by $\boldsymbol{\theta} = (\theta_1, \dots, \theta_r)$ to give augmented data $\boldsymbol{y} = (\boldsymbol{x}, \boldsymbol{\theta})$. We can use a reference prior for φ, but (as mentioned in Section 8.5) this will not work for ψ. Accordingly, we will adopt the prior
$$p(\mu, \phi, \psi) \propto \frac{1}{\phi}\, p(\psi), \qquad \psi \sim S_0\chi_{\nu_0}^{-2}.$$
It is then easily seen that (up to an additive constant)
$$\log p(\eta \mid \boldsymbol{y}) = -\left(\frac{N}{2} + 1\right)\log\phi - \frac{1}{2\phi}\sum_i\sum_j (x_{ij} - \theta_i)^2 - \left(\frac{r + \nu_0}{2} + 1\right)\log\psi - \frac{1}{2\psi}\left\{S_0 + \sum_i (\theta_i - \mu)^2\right\}.$$
To carry out the E-step, observe that, conditional on the values of the hyperparameters, $\theta_i \sim N(\theta_i^{(t)}, V_i^{(t)})$ where
$$V_i^{(t)} = \left(\frac{1}{\psi^{(t)}} + \frac{n_i}{\phi^{(t)}}\right)^{-1}, \qquad \theta_i^{(t)} = V_i^{(t)}\left(\frac{\mu^{(t)}}{\psi^{(t)}} + \frac{n_i\bar{x}_{i\cdot}}{\phi^{(t)}}\right),$$
using the notation introduced in Section 6.5 for averaging over a suffix (cf. Section 2.3). We can now see that
$$\mathrm{E}\{(x_{ij} - \theta_i)^2 \mid \boldsymbol{x}, \eta^{(t)}\} = (x_{ij} - \theta_i^{(t)})^2 + V_i^{(t)},$$
and similarly
$$\mathrm{E}\{(\theta_i - \mu)^2 \mid \boldsymbol{x}, \eta^{(t)}\} = (\theta_i^{(t)} - \mu)^2 + V_i^{(t)},$$
so that
$$Q(\eta, \eta^{(t)}) = -\left(\frac{N}{2}+1\right)\log\phi - \frac{1}{2\phi}\sum_i\sum_j\left\{(x_{ij} - \theta_i^{(t)})^2 + V_i^{(t)}\right\} - \left(\frac{r+\nu_0}{2}+1\right)\log\psi - \frac{1}{2\psi}\left[S_0 + \sum_i\left\{(\theta_i^{(t)} - \mu)^2 + V_i^{(t)}\right\}\right].$$
It is now clear that the M-step gives
$$\mu^{(t+1)} = \frac{1}{r}\sum_i \theta_i^{(t)}, \qquad \phi^{(t+1)} = \frac{\sum_i\sum_j\left\{(x_{ij} - \theta_i^{(t)})^2 + V_i^{(t)}\right\}}{N + 2}, \qquad \psi^{(t+1)} = \frac{S_0 + \sum_i\left\{(\theta_i^{(t)} - \mu^{(t+1)})^2 + V_i^{(t)}\right\}}{r + \nu_0 + 2}.$$
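The E- and M-steps translate directly into code. The sketch below (plain Python) applies them to data grouped as in the next subsection; the prior constants `S0` and `nu0` for ψ and the starting values are illustrative assumptions of ours:

```python
# EM for the hyperparameters (mu, phi, psi) of the hierarchical normal model,
# augmenting the data with the group means theta_i.
data = [
    [62, 60, 63, 59],                  # group 1
    [63, 67, 71, 64, 65, 66],          # group 2
    [68, 66, 71, 67, 68, 68],          # group 3
    [56, 62, 60, 61, 63, 64, 63, 59],  # group 4
]
r = len(data)
n = [len(g) for g in data]
N = sum(n)
xbar = [sum(g) / len(g) for g in data]
S0, nu0 = 10.0, 4.0              # ILLUSTRATIVE prior for psi

mu, phi, psi = 64.0, 5.6, 12.7   # starting values
for _ in range(1000):
    # E-step: theta_i | x, eta is N(t[i], V[i])
    V = [1 / (1 / psi + n[i] / phi) for i in range(r)]
    t = [V[i] * (mu / psi + n[i] * xbar[i] / phi) for i in range(r)]
    # M-step: maximize Q over mu, phi, psi in turn
    mu_new = sum(t) / r
    phi_new = sum(sum((x - t[i]) ** 2 for x in data[i]) + n[i] * V[i]
                  for i in range(r)) / (N + 2)
    psi_new = (S0 + sum((t[i] - mu_new) ** 2 + V[i] for i in range(r))) / (r + nu0 + 2)
    if max(abs(mu_new - mu), abs(phi_new - phi), abs(psi_new - psi)) < 1e-10:
        break
    mu, phi, psi = mu_new, phi_new, psi_new
```

The shrinkage is visible in the output: each fitted $\theta_i$ lies between its group mean $\bar{x}_{i\cdot}$ and the overall estimate μ.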
9.2.5 A particular case of the hierarchical normal model
Gelman et al. (2004, Table 11.2) quote the following data from Box et al. (1978, Table 6.1) on the coagulation time in seconds for blood drawn from N=24 animals randomly allocated to four different diets:

Diet A: 62, 60, 63, 59 (mean 61)
Diet B: 63, 67, 71, 64, 65, 66 (mean 66)
Diet C: 68, 66, 71, 67, 68, 68 (mean 68)
Diet D: 56, 62, 60, 61, 63, 64, 63, 59 (mean 61)
The data have been slightly adjusted, so that the averages come out to be whole numbers. It should perhaps be noted that the prior adopted in this and the preceding subsections differs slightly from that adopted by Gelman et al. The within-groups sum of squares is $S = \sum_i\sum_j(x_{ij} - \bar{x}_{i\cdot})^2 = 112$ and the overall mean is $\bar{x}_{\cdot\cdot} = 64$. It therefore seems reasonable to start iterations from $\mu^{(0)} = 64$ and $\phi^{(0)} = S/(N-r) = 112/20 = 5.6$. As for ψ, we can take as starting value the variance of the group means,
$$\psi^{(0)} = \frac{1}{r-1}\sum_i (\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})^2 = \frac{38}{3} \approx 12.7.$$
This results in
$$V_1^{(0)} = \left(\frac{1}{12.7} + \frac{4}{5.6}\right)^{-1} = 1.26, \qquad \theta_1^{(0)} = V_1^{(0)}\left(\frac{64}{12.7} + \frac{4 \times 61}{5.6}\right) = 61.3.$$
Similarly, we get $V_2^{(0)} = 0.87$, $\theta_2^{(0)} = 65.9$, $V_3^{(0)} = 0.87$, $\theta_3^{(0)} = 67.7$, $V_4^{(0)} = 0.66$ and $\theta_4^{(0)} = 61.2$.
We can now feed these values into the M-step formulae to obtain $\mu^{(1)}$, $\phi^{(1)}$ and $\psi^{(1)}$, and commence the iteration. Continuing the iteration, we get rapid convergence of μ, φ, ψ and the $\theta_i$.