10.3 Variational Bayesian methods: general case
The variational Bayes approach can also be used in cases where we have more than two parameters or sets of parameters. In such cases we seek an approximation to the density of the form
$$q(\theta) = \prod_{j=1}^{m} q_j(\theta_j).$$
We need to write $E_{i\neq j}$ for an expectation taken over all the factors $q_i(\theta_i)$ with $i \neq j$. With this notation, a straightforward generalization of the argument in the previous section leads to
$$\log \widetilde{q}_j(\theta_j) = E_{i\neq j}\log p(x, \theta) + \text{constant}$$
and so to
$$\widetilde{q}_j(\theta_j) \propto \exp\bigl(E_{i\neq j}\log p(x, \theta)\bigr).$$
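To see these coordinate-wise updates in action, the following sketch runs the classical mean-field scheme for a univariate normal sample with unknown mean $\mu$ and precision $\tau$, using the standard conjugate choices $\mu \sim N(\mu_0, (\lambda_0\tau)^{-1})$ and $\tau \sim \mathrm{Gamma}(a_0, b_0)$, for which both updates have closed form. The hyperparameter values and the synthetic data are made up for illustration.

```python
import numpy as np

# Synthetic data: 500 draws from N(2, 1); all hyperparameters below are
# assumed values chosen purely for illustration.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=500)
N, xbar = len(x), x.mean()

mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = 1.0  # initial guess for E[tau] under q
for _ in range(50):
    # Update q(mu) = N(mu_N, 1/lam_N); it involves q(tau) only through E[tau]
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # Update q(tau) = Gamma(a_N, b_N); it involves q(mu) through mu_N, lam_N
    a_N = a0 + (N + 1) / 2
    E_ss = ((x - mu_N) ** 2).sum() + N / lam_N        # E_q[sum_i (x_i - mu)^2]
    E_pr = lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N)   # E_q[lam0 (mu - mu0)^2]
    b_N = b0 + 0.5 * (E_ss + E_pr)
    E_tau = a_N / b_N

print(mu_N, E_tau)  # should sit close to the sample mean and to 1/variance
```

Each pass alternately freezes one factor of $q$ and refreshes the other, exactly as the general update formula prescribes; the scheme typically converges in a handful of iterations here.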
10.3.1 A mixture of multivariate normals
We sometimes encounter data in two or more dimensions which is clearly not adequately modelled by a multivariate normal distribution but which might well be modelled by a mixture of two or more such distributions. An example is the data on waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA taken from Härdle (1991), which is illustrated in Figure 10.3. A more complicated example of variational Bayes techniques can be illustrated by outlining a treatment of this problem.
We shall write $\pi = (\pi_1, \dots, \pi_M)$ for a vector of probabilities, so that $\sum_{j=1}^{M} \pi_j = 1$, $\mu = (\mu_1, \dots, \mu_M)$ for a set of M mean vectors of dimension k, and $\Sigma = (\Sigma_1, \dots, \Sigma_M)$ for a set of M variance–covariance ($k \times k$) matrices.
Suppose that we have a data set X comprising N observations $x_i$ (for $i = 1, \dots, N$) which come from a mixture of M sub-populations, each of which has a k-dimensional multivariate normal distribution, so that with probability $\pi_j$ (for $j = 1, \dots, M$) their distribution is multivariate normal with mean $\mu_j$ and variance–covariance matrix $\Sigma_j$, that is $N(\mu_j, \Sigma_j)$. Thus,
$$x_i \sim \sum_{j=1}^{M} \pi_j\, N(\mu_j, \Sigma_j)$$
(where by abuse of notation the expression on the right-hand side means a variable which with probability $\pi_j$ has an $N(\mu_j, \Sigma_j)$ distribution). We suppose that all the parameters are unknown, and we seek maximum likelihood estimates of them, so that we can find what the component distributions are and which observations come from which component.
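To make the generative description concrete, the following numpy sketch draws from such a mixture exactly as the abuse of notation suggests: first pick a sub-population j with probability $\pi_j$, then draw from the corresponding $N(\mu_j, \Sigma_j)$. The component parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Invented parameters for a two-component mixture in k = 2 dimensions
pis = np.array([0.3, 0.7])
mus = [np.array([0.0, 0.0]), np.array([4.0, 4.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

# First choose the sub-population with probabilities pi_j, then draw from
# the corresponding multivariate normal N(mu_j, Sigma_j).
labels = rng.choice(len(pis), size=1000, p=pis)
X = np.array([rng.multivariate_normal(mus[j], Sigmas[j]) for j in labels])
```

A scatter plot of `X` would show two overlapping clouds, much like the two clusters visible in the Old Faithful data of Figure 10.3.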
In order to deal with this problem, it is helpful to augment the data with variables $s_{ij}$ which are indicator variables denoting the sub-population to which each observation belongs, so that
$$s_{ij} = \begin{cases} 1 & \text{if observation } i \text{ comes from sub-population } j, \\ 0 & \text{otherwise,} \end{cases}$$
and the values $s_i = (s_{i1}, \dots, s_{iM})$ for different values of i are independent of one another. We write $s$ for the complete set of the $s_{ij}$. Estimating the values of the $s_{ij}$ amounts to determining which sub-population each observation comes from.
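The indicator variables $s_{ij}$ are simply a one-hot encoding of the component labels, as this tiny sketch (with hypothetical labels) shows.

```python
import numpy as np

labels = np.array([0, 2, 1, 2])  # hypothetical labels for N = 4, M = 3
# s[i, j] = 1 if observation i comes from sub-population j, 0 otherwise;
# indexing the identity matrix by the labels produces the one-hot rows.
s = np.eye(3, dtype=int)[labels]
```

Each row of `s` contains a single 1, reflecting the fact that every observation belongs to exactly one sub-population.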
In dealing with this problem, we begin by working as if the value of $\pi$ were known. We then take independent multivariate normal priors, of mean $m_{0j}$ and variance–covariance matrix $\Sigma_{0j}$, say, for the $\mu_j$. For the variance–covariance matrices $\Sigma_j$ we take independent inverse Wishart distributions; since we are only outlining a treatment of this problem it is not important exactly what this means, and you can take it as given that this is the appropriate multidimensional analogue of the (multiples of) inverse chi-squared distributions we use as a prior for the variance in the one-dimensional case.
We can specify the prior distribution of $s$ for a given value of $\pi$ by saying that an individual i (for $i = 1, \dots, N$) belongs to sub-population j (for $j = 1, \dots, M$) with probability $\pi_j$. It then follows that
$$p(X, s \mid \pi, \mu, \Sigma) = \prod_{i=1}^{N} \prod_{j=1}^{M} \bigl[\pi_j\, N(x_i \mid \mu_j, \Sigma_j)\bigr]^{s_{ij}}.$$
In order to evaluate $p(X \mid \pi)$ it is necessary to marginalize this expression with respect to $s$, $\mu$ and $\Sigma$, so that we seek
$$p(X \mid \pi) = \sum_{s} \iint p(X, s \mid \pi, \mu, \Sigma)\, p(\mu)\, p(\Sigma)\, \mathrm{d}\mu\, \mathrm{d}\Sigma.$$
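The complete-data log-likelihood $\log p(X, s \mid \pi, \mu, \Sigma) = \sum_i \sum_j s_{ij}\bigl[\log \pi_j + \log N(x_i \mid \mu_j, \Sigma_j)\bigr]$ is the quantity whose expectations the variational scheme repeatedly takes, and it is straightforward to evaluate directly; the sketch below does so in plain numpy, with a hand-rolled multivariate normal log density.

```python
import numpy as np

def mvn_logpdf(x, mu, Sigma):
    """Log density of a k-dimensional N(mu, Sigma) evaluated at x."""
    k = len(mu)
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (k * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(Sigma, d))

def complete_loglik(X, s, pis, mus, Sigmas):
    # sum_i sum_j s_ij * (log pi_j + log N(x_i | mu_j, Sigma_j))
    return sum(s[i, j] * (np.log(pis[j]) + mvn_logpdf(X[i], mus[j], Sigmas[j]))
               for i in range(len(X)) for j in range(len(pis)))
```

Because each $s_{ij}$ is 0 or 1, only one term per observation survives the inner sum, so the expression reduces to the log density of each point under its own component plus the log mixing probability.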
Now, as we found earlier,
$$\log p(X \mid \pi) = \mathcal{L}(q) + \mathrm{KL}\bigl(q \,\big\|\, p(s, \mu, \Sigma \mid X, \pi)\bigr) \geq \mathcal{L}(q),$$
where $\mathcal{L}(q)$ is the variational lower bound
$$\mathcal{L}(q) = \sum_{s} \iint q(s, \mu, \Sigma) \log \frac{p(X, s, \mu, \Sigma \mid \pi)}{q(s, \mu, \Sigma)}\, \mathrm{d}\mu\, \mathrm{d}\Sigma.$$
We then seek a variational posterior distribution of the form
$$q(s, \mu, \Sigma) = q_1(s)\, q_2(\mu)\, q_3(\Sigma).$$
Since we used appropriate conjugate priors, we end up with a posterior for $\mu_j$ which is multivariate normal and a posterior for $\Sigma_j$ which is inverse Wishart. The posterior distribution of $s$ for a given value of $\pi$ is determined by a matrix $(p_{ij})$, where $p_{ij}$ is the probability that individual i belongs to sub-population j.
In this way we obtain a variational lower bound $\mathcal{L}(q)$ which approximates the true marginal log-likelihood $\log p(X \mid \pi)$. This bound is, of course, dependent on $\pi$ through the $q_1$ component of $q$. By maximizing it with respect to $\pi$ we obtain our required estimates of the mixing coefficients. But, of course, the value of $q$, and hence the value of the lower bound, will in turn depend on $\pi$. We therefore adopt an EM-like procedure in which we alternately maximize the bound with respect to $\pi$ (maximization step) and then update the variational solutions for $q_1$, $q_2$ and $q_3$ (expectation step). We have so far assumed that M is fixed and known, but it is possible to try the procedure for various values of M and choose the value which maximizes the variational likelihood.
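One half of the alternation is particularly simple: given the responsibilities $p_{ij}$ produced by the expectation step, maximizing the bound over $\pi$ subject to $\sum_j \pi_j = 1$ gives $\pi_j = \frac{1}{N}\sum_i p_{ij}$, a column average of the responsibility matrix. A toy sketch with invented responsibilities:

```python
import numpy as np

# Invented responsibilities p[i, j] = Pr(observation i is in component j),
# as they would come out of the variational expectation step for N = 3, M = 2.
p = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.4, 0.6]])

# Maximizing the bound over pi subject to sum_j pi_j = 1 gives
# pi_j = (1/N) * sum_i p_ij, i.e. a column average of the responsibilities.
pis = p.mean(axis=0)
```

The updated mixing coefficients automatically sum to one because each row of the responsibility matrix does.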
We omit the details, which can be found in Corduneanu and Bishop (2001). They find that the Old Faithful geyser data are well fitted by a three-component mixture of bivariate normals. This corresponds to the visual impression given by Figure 10.3, in which it appears that there is one sub-population at the top right, one at the bottom left and possibly a small sub-population between the two.
Some further applications of variational Bayesian methods can be found in Ormerod and Wand (2010). In their words, ‘Variational methods are a much faster alternative to MCMC, especially for large models’.