10.3 Variational Bayesian methods: general case

The variational Bayes approach can also be used in cases where we have more than two parameters or sets of parameters. Suppose that the parameters split into $m$ blocks $\theta = (\theta_1, \theta_2, \dots, \theta_m)$ and that we seek an approximation to the posterior density $p(\theta \mid x)$ of the form

\[ q(\theta) = \prod_{i=1}^{m} q_i(\theta_i). \]

We need to write

\[ E_{i \ne j}\,\log p(x, \theta) = \int \log p(x, \theta) \prod_{i \ne j} q_i(\theta_i)\,\mathrm{d}\theta_i \]

for the expectation of the log of the joint density taken over all of the factors other than the $j$th. With this notation, a straightforward generalization of the argument in the previous section leads to

\[ \log q_j(\theta_j) = E_{i \ne j}\,\log p(x, \theta) + \text{constant} \]

and so to

\[ q_j(\theta_j) \propto \exp\bigl\{ E_{i \ne j}\,\log p(x, \theta) \bigr\}. \]
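As a concrete illustration of these updates, the following sketch (in Python, using only NumPy; the function name and hyperparameter values are ours, chosen purely for illustration) applies the factorized scheme $q(\mu, \tau) = q_1(\mu)\,q_2(\tau)$ to the simplest non-trivial case: a normal sample with unknown mean $\mu$ and precision $\tau$ and conjugate normal–gamma priors.

```python
import numpy as np

def vb_normal(x, mu0=0.0, lam0=1.0, a0=1e-3, b0=1e-3, iters=100):
    """Mean-field VB for x_i ~ N(mu, 1/tau), with priors
    mu | tau ~ N(mu0, 1/(lam0*tau)) and tau ~ Gamma(a0, b0).
    q1(mu) is normal and q2(tau) is gamma; each update applies
    q_j proportional to exp(E_{i != j} log p(x, mu, tau))."""
    x = np.asarray(x, dtype=float)
    N, xbar = len(x), x.mean()
    E_tau = a0 / b0  # initial guess for E[tau] under q2
    for _ in range(iters):
        # Update q1(mu) = N(mu_N, 1/lam_N)
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau
        # Update q2(tau) = Gamma(a_N, b_N), using the fact that under q1
        # E[(x_i - mu)^2] = (x_i - mu_N)^2 + 1/lam_N
        a_N = a0 + 0.5 * (N + 1)
        b_N = b0 + 0.5 * (np.sum((x - mu_N) ** 2) + N / lam_N
                          + lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N))
        E_tau = a_N / b_N
    return mu_N, lam_N, a_N, b_N
```

Each pass applies $q_j \propto \exp\{E_{i \ne j} \log p\}$ to one factor in turn; since each update can only increase the lower bound, the iteration converges, and here it does so after only a few passes.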

10.3.1 A mixture of multivariate normals

We sometimes encounter data in two or more dimensions which is clearly not adequately modelled by a multivariate normal distribution but which might well be modelled by a mixture of two or more such distributions. An example is the data on waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA taken from Härdle (1991), which is illustrated in Figure 10.3. A more complicated example of variational Bayes techniques can be illustrated by outlining a treatment of this problem.

Figure 10.3 Härdle’s old faithful data.


We shall write $\pi = (\pi_1, \dots, \pi_M)$ for a vector of probabilities summing to one, $\mu = (\mu_1, \dots, \mu_M)$ for a set of $M$ mean vectors of dimension $k$, and $\Sigma = (\Sigma_1, \dots, \Sigma_M)$ for a set of $M$ variance–covariance ($k \times k$) matrices.

Suppose that we have a data set $X$ comprising $N$ observations $x_i$ ($i = 1, 2, \dots, N$) which come from a mixture of $M$ sub-populations, each of which has a $k$-dimensional multivariate normal distribution, so that with probability $\pi_j$ (for $j = 1, 2, \dots, M$) their distribution is multivariate normal with mean $\mu_j$ and variance–covariance matrix $\Sigma_j$, that is $N(\mu_j, \Sigma_j)$. Thus,

\[ x_i \sim \sum_{j=1}^{M} \pi_j\, N(\mu_j, \Sigma_j) \]

(where by abuse of notation the expression on the right-hand side means a variable which with probability $\pi_j$ has an $N(\mu_j, \Sigma_j)$ distribution). We suppose that all the parameters are unknown, and we seek maximum likelihood estimates of them, so that we can find what the component distributions are and which observations come from which component.
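The two-stage reading of this mixture, first choose a sub-population $j$ with probability $\pi_j$, then draw from the corresponding normal, can be made concrete in a few lines of Python; the particular values of $\pi$, $\mu_j$ and $\Sigma_j$ below are purely illustrative and are not the Old Faithful estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-component bivariate mixture (M = 2, k = 2)
pi = np.array([0.65, 0.35])                 # mixing probabilities pi_j
mus = np.array([[4.3, 80.0], [2.0, 54.0]])  # component means mu_j
Sigmas = np.array([np.diag([0.15, 35.0]),   # component covariances Sigma_j
                   np.diag([0.10, 30.0])])

# Draw N observations: pick sub-population j with probability pi_j,
# then draw x_i from N(mu_j, Sigma_j)
N = 500
j = rng.choice(len(pi), size=N, p=pi)
X = np.array([rng.multivariate_normal(mus[c], Sigmas[c]) for c in j])
```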

In order to deal with this problem, it is helpful to augment the data with indicator variables $s_{ij}$ denoting the sub-population to which each observation belongs, so that

\[ s_{ij} = \begin{cases} 1 & \text{if observation } i \text{ comes from sub-population } j, \\ 0 & \text{otherwise}, \end{cases} \]

and the values for different values of $i$ are independent of one another. We write $s$ for the complete set of the $s_{ij}$. Estimating the values of the $s_{ij}$ amounts to determining which sub-population each observation comes from.
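Given values of the parameters, the posterior probability that $s_{ij} = 1$, that is, that observation $i$ belongs to sub-population $j$, follows from Bayes' theorem by weighting each component density by its mixing probability and normalizing. A sketch in Python (using `scipy.stats.multivariate_normal`; the function name is ours) might look like:

```python
import numpy as np
from scipy.stats import multivariate_normal

def membership_probs(X, pi, mus, Sigmas):
    """Posterior probability that observation i comes from sub-population j,
    given the parameters: p_ij proportional to pi_j * N(x_i | mu_j, Sigma_j),
    normalized so that each row sums to one."""
    dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mus[j], Sigmas[j])
                            for j in range(len(pi))])
    return dens / dens.sum(axis=1, keepdims=True)
```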

In dealing with this problem, we begin by working as if the value of $\pi$ were known. We then take independent multivariate normal priors of mean $\mu_{0j}$ and variance–covariance matrix $\Sigma_{0j}$ for the $\mu_j$. For the variance–covariance matrices $\Sigma_j$, we take independent inverse Wishart distributions; since we are only outlining a treatment of this problem, it is not important exactly what this means, and you can take it as given that this is the appropriate multidimensional analogue of the (multiples of) inverse chi-squared distributions we use as a prior for the variance in the one-dimensional case.

We can specify the prior distribution of $s$ for a given value of $\pi$ by saying that an individual $i$ (for $i = 1, 2, \dots, N$) belongs to sub-population $j$ (for $j = 1, 2, \dots, M$) with probability $\pi_j$.

It then follows that

\[ p(X, s \mid \pi, \mu, \Sigma) = \prod_{i=1}^{N} \prod_{j=1}^{M} \bigl\{ \pi_j\, N(x_i \mid \mu_j, \Sigma_j) \bigr\}^{s_{ij}}. \]

In order to evaluate $p(X \mid \pi, \mu, \Sigma)$ it is necessary to marginalize this expression with respect to $s$, so that we seek

\[ p(X \mid \pi, \mu, \Sigma) = \sum_{s} p(X, s \mid \pi, \mu, \Sigma) = \prod_{i=1}^{N} \sum_{j=1}^{M} \pi_j\, N(x_i \mid \mu_j, \Sigma_j). \]

Now, as we found earlier, for any density $q(s, \mu, \Sigma)$ we have

\[ \log p(X \mid \pi) \geqslant \mathcal{L}(q) = \sum_{s} \iint q(s, \mu, \Sigma) \log \frac{p(X, s \mid \pi, \mu, \Sigma)\, p(\mu)\, p(\Sigma)}{q(s, \mu, \Sigma)}\,\mathrm{d}\mu\,\mathrm{d}\Sigma, \]

with equality if and only if $q$ is the true joint posterior distribution of $s$, $\mu$ and $\Sigma$.

We then seek a variational posterior distribution of the form

\[ q(s, \mu, \Sigma) = q_1(s)\, q_2(\mu)\, q_3(\Sigma). \]

Since we used appropriate conjugate priors, we end up with a posterior for $\mu$ which is multivariate normal and one for $\Sigma$ which is inverse Wishart. The posterior distribution of $s$ for a given value of $\pi$ is determined by a matrix $(p_{ij})$, where $p_{ij}$ is the probability that individual $i$ belongs to sub-population $j$.

In this way we obtain a variational lower bound $\mathcal{L}(q)$ which approximates the true marginal log-likelihood $\log p(X \mid \pi)$. This bound is, of course, dependent on $\pi$, and by maximizing it with respect to $\pi$ (subject to the constraint that the $\pi_j$ sum to one) we obtain our required estimates of the mixing coefficients, namely $\pi_j = N^{-1} \sum_i p_{ij}$. But, of course, the value of $q$, and hence the value of the lower bound, will depend on $\pi$. We therefore adopt an EM-like procedure in which we alternately maximize $\mathcal{L}(q)$ with respect to $\pi$ (maximization step) and then optimize $q$ by updating the variational solutions for $q_1$, $q_2$ and $q_3$ (expectation step). We have so far assumed that $M$ is fixed and known, but it is possible to try the procedure for various values of $M$ and choose the value which maximizes the variational likelihood.
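The alternating scheme can be sketched as follows. To keep the sketch short it uses point estimates of $\mu_j$ and $\Sigma_j$ in the maximization step (i.e. ordinary EM) rather than the full variational updates of $q_2$ and $q_3$, but the structure, update the membership probabilities $p_{ij}$ and then re-estimate $\pi$, is the same; the function and variable names are ours, not Corduneanu and Bishop's.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_mixture(X, M, iters=200):
    """EM-style sketch of the alternating scheme for an M-component
    multivariate normal mixture.  The 'expectation' step updates the
    membership probabilities p_ij (playing the role of q1); the
    'maximization' step re-estimates pi (and, in this simplified
    point-estimate version, mu_j and Sigma_j in place of q2 and q3)."""
    N, k = X.shape
    pi = np.full(M, 1.0 / M)
    # Spread the initial means along the first coordinate (simple deterministic init)
    order = np.argsort(X[:, 0])
    mus = X[order[np.linspace(0, N - 1, M).astype(int)]].astype(float)
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(k) for _ in range(M)])
    for _ in range(iters):
        # E-step: p_ij proportional to pi_j N(x_i | mu_j, Sigma_j)
        p = np.column_stack([pi[j] * multivariate_normal.pdf(X, mus[j], Sigmas[j])
                             for j in range(M)])
        p /= p.sum(axis=1, keepdims=True)
        # M-step: pi_j = (1/N) sum_i p_ij, then weighted means and covariances
        Nj = p.sum(axis=0)
        pi = Nj / N
        mus = (p.T @ X) / Nj[:, None]
        for j in range(M):
            d = X - mus[j]
            Sigmas[j] = (p[:, j, None] * d).T @ d / Nj[j] + 1e-6 * np.eye(k)
    return pi, mus, Sigmas
```

Trying the procedure for several values of $M$ and comparing the resulting (variational) likelihoods then gives a way of choosing the number of components.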

We omit the details, which can be found in Corduneanu and Bishop (2001). They find that the Old Faithful geyser data is well fitted by a three-component mixture of bivariate normals, one component of which has only a small mixing coefficient. This corresponds to the visual impression given by Figure 10.3, in which it appears that there is one sub-population at the top right, one at the bottom left and possibly a small sub-population between the two.


Some further applications of variational Bayesian methods can be found in Ormerod and Wand (2010). In their words, ‘Variational methods are a much faster alternative to MCMC, especially for large models’.
