Chapter 6
Coalescence on Continuous and
Unbounded State Spaces
The self is not something ready-made, but something in continuous forma-
tion through choice of action.
John Dewey
The algorithms of Chapters 3 and 5 are not restricted in theory to discrete state
spaces, but the examples presented therein were all discrete. This is because the
problem of designing a chain that coalesces down to a single state can be trickier
when Ω is a continuous state space. Trickier, but not impossible, and in this chapter
several methods for dealing with continuous state spaces are considered.
6.1 Splitting chains
Suppose that X_0, X_1, X_2, ... is a Markov chain over a continuous state space, and that given X_t = x_t, the distribution of the next state X_{t+1} is given by the transition kernel K(x_t, ·).
The transition kernel usually depends on the starting state x_t. That is, the distribution K(x_t, ·) varies as x_t varies. However, it can sometimes be the case that there is an ε > 0 and kernels K_1 and K_2, such that

K(x_t, ·) = K_1(x_t, ·)ε + K_2(x_t, ·)(1 − ε)

with the property that K_1(x_t, ·) does not depend on x_t!
For example, suppose that X_t is a Markov chain conditioned to lie in [0,1] that updates as follows. With probability 1/2, X_{t+1} given X_t is uniform over [0,1], and with probability 1/2, [X_{t+1} | X_t] ~ Unif([X_t − δ, X_t + δ]), where δ = min{1 − X_t, X_t}. Figure 6.1 shows the density of X_{t+1} given that X_t = 0.3.
The idea then is that to take a move given X_t, draw Y_1 ~ Unif([0,1]) and [Y_2 | X_t] ~ Unif([X_t − δ, X_t + δ]) (where δ = min{1 − X_t, X_t}). Last, draw B ~ Bern(1/2). Set

X_{t+1} = Y_1 B + Y_2 (1 − B)    (6.1)
so that X_{t+1} = Y_1 if B = 1 and X_{t+1} = Y_2 otherwise. Note that Y_1 did not depend on X_t, and so no matter what the value of X_t, the next state X_{t+1} will be Y_1 if B = 1. So the chain couples if B = 1.
Since the chain has been split into a random variable that depends on X_t and one that does not, this is called a splitting chain [109]. If B ~ Bern(p), on average 1/p steps are needed before the chain coalesces by choosing the random variable that does not depend on the current state.

Figure 6.1 The density of X_{t+1} given that X_t = 0.3, which is (1/2)1(x ∈ [0,1]) + (1/2)(1/0.6)1(x ∈ [0,0.6]).
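As a quick illustration (a minimal Python sketch, not part of the text; the function names are invented), update (6.1) can be run on two copies of the chain that share the same Y_1, B, and uniform draws. The copies become equal the first time B = 1 and stay equal afterwards.

```python
import random

def split_update(x, y1, b, u):
    """One step of the splitting chain (6.1) on [0,1].

    y1 ~ Unif([0,1]) is shared by every copy of the chain,
    b  ~ Bern(1/2) decides which component is used, and
    u  ~ Unif([0,1]) drives the state-dependent component."""
    if b == 1:
        return y1                            # does not depend on x: chains couple
    delta = min(1.0 - x, x)
    return (x - delta) + 2.0 * delta * u     # Unif([x - delta, x + delta])

def run_pair(x, y, steps, rng):
    """Run two copies with shared randomness; report when B = 1 first occurs."""
    coupled_at = None
    for step in range(steps):
        y1 = rng.random()
        b = 1 if rng.random() < 0.5 else 0
        u = rng.random()
        x = split_update(x, y1, b, u)
        y = split_update(y, y1, b, u)
        if coupled_at is None and b == 1:
            coupled_at = step
    return x, y, coupled_at

rng = random.Random(42)
xf, yf, t = run_pair(0.1, 0.9, 20, rng)
print(xf == yf, t)
```

Since B ~ Bern(1/2) here, on average 1/p = 2 steps are needed before the coalescing branch fires, matching the discussion above.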
6.2 Multigamma coupling
The multigamma coupler of Murdoch and Green [107] takes this splitting chain idea and uses it to build an update function. Previously, the splitting chain [109] had been used by Lindvall [85] to create a coupling for two processes to bound the mixing time. Since an update function couples together multiple processes (usually an infinite number), Murdoch and Green called this process the multigamma coupler.
The idea is to explicitly write the update function as a mixture of a piece that couples completely, and a piece that does not. That is, suppose that for all x ∈ Ω, the chain can be written as

φ(x,U,B) = φ_1(U)B + φ_2(x,U)(1 − B)    (6.2)

where U ~ Unif([0,1]) and B ~ Bern(ρ) for some ρ > 0.
At each step of this update function, there is a ρ chance that φ(Ω,U,B) = φ_1(U). Hence such a chain quickly coalesces. But can this be done in practice?
Suppose the update function φ(x,U,B) has unnormalized density f(y|x), and there exists f_1 so that 0 ≤ f_1(y) ≤ f(y|x) for all x and y. Then let f_2(y|x) = f(y|x) − f_1(y).
With probability ∫ f_1(y) dy / ∫ f(y|x) dy, draw the next state from f_1, otherwise draw from f_2.
Multigamma update    Input: current state x, Output: next state y
1) Draw C ~ Bern(∫ f_1(y) dy / ∫ f(y|x) dy)
2) If C = 1
3)   Draw y from unnormalized density f_1
4) Else
5)   Draw y from unnormalized density f_2(·|x)
To employ multigamma coupling in density form, it is necessary to obtain a uniform lower bound on the density of the next state.
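The pseudocode above can be instantiated for the chain of Section 6.1, whose transition density f(y|x) = (1/2)1(y ∈ [0,1]) + (1/2)(1/(2δ))1(y ∈ [x − δ, x + δ]) is bounded below by f_1(y) = (1/2)1(y ∈ [0,1]). The following Python sketch (not from the text; names are invented) carries out the two branches:

```python
import random

def multigamma_update(x, rng):
    """One Multigamma update step for the [0,1] chain of Section 6.1.

    Here integral(f_1) = 1/2 and integral(f(.|x)) = 1, so C ~ Bern(1/2).
    The remainder f_2(.|x) is proportional to f - f_1, which is the
    uniform density on [x - delta, x + delta]."""
    if rng.random() < 0.5:                        # C = 1: draw from f_1
        return rng.uniform(0.0, 1.0)              # the coalescing piece
    delta = min(x, 1.0 - x)                       # C = 0: draw from f_2(.|x)
    return rng.uniform(x - delta, x + delta)

rng = random.Random(1)
state = 0.3
for _ in range(5):
    state = multigamma_update(state, rng)
print(0.0 <= state <= 1.0)
```

Note that this recovers exactly the splitting update (6.1): the design choice is the same, only phrased through densities rather than through an explicit mixture of random variables.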
6.2.1 Example: hierarchical Poisson/gamma model
In statistical inference, the goal is to learn from the data more about parameters of a statistical model. In Bayesian statistical inference, the unknown parameter is treated as a random variable. That means that it is modeled initially using a distribution called the prior that is set before any data is taken.
Then given the data and the statistical model, the distribution of the parameters can be updated using Bayes' Rule. The resulting distribution is called the posterior distribution.
In a hierarchical model, the parameter of interest has a prior that itself depends on unknown parameters that usually are given by a prior with high variance.
As an example, consider a dataset on pump reliability, with a model originally given by Gelfand and Smith [39], and studied further by Reutter and Johnson [113].
In this case the data is the number of failures in ten pump systems at a nuclear power plant. Let s_i denote the number of failures of pump i, and t_i be the time that pump i has been operating.
A common model is to treat the failures of pump i as occurring according to a Poisson process with failure rate λ_i. In such a model, the number of failures has a Poisson distribution with mean equal to the failure rate times the length of time of operation. That is, s_i ~ Pois(λ_i t_i) for each i ∈ {1,...,10}.
In order for the prior to not overwhelm the data, each failure rate is given a high-variance distribution for its prior, say [λ_i | α, β] ~ Gamma(α, β). The value of α = 1.802 comes from a method of moments argument. The inverse scale β is taken from another high-variance prior β ~ Gamma(γ, δ), where the values γ = 0.01 and δ = 1 were used in [113].
The reason for employing gamma distributions is their special nature as a conjugate prior. When the λ_i have gamma distributions and are used to generate a Poisson random variable, then conditioned on the Poisson random variable, they still have a gamma distribution with slightly altered parameters. To be precise:

[λ_i | β, s_i] ~ Gamma(α + s_i, β + t_i).    (6.3)
Similarly, the distribution of β given the failure parameters is also gamma:

[β | λ_1,...,λ_10, s_1,...,s_10] ~ Gamma(γ + 10α, δ + Σ_i λ_i).    (6.4)
It is important to note here that for this particular model, perfect simulation is unnecessary. The variable β is an example of a linchpin variable, meaning that once it is known, the distribution of the remaining variables is also known. So what initially looks like a ten-dimensional problem actually reduces to a one-dimensional problem that can be integrated numerically.
Still, there are many hierarchical models that do not split apart so easily, and this
particular model serves as a nice illustration of the multigamma coupling method.
In order to apply the method, lower bounds on gamma densities are needed.
Lemma 6.1. Say X ~ Gamma(a,b) if X has density f_X(x) = x^{a−1} b^a exp(−xb) 1(x > 0)/Γ(a). Then if b ∈ [b_0, b_1], and r(x) = x^{a−1} b_0^a exp(−x b_1) 1(x > 0)/Γ(a), then r(x) ≤ f_X(x) for all x, and ∫_{x∈R} r(x) dx = (b_0/b_1)^a.
Proof. Since a > 0 and x > 0, b_0^a ≤ b^a and exp(−xb) ≥ exp(−x b_1), so r(x) ≤ f_X(x). Also r(x)(b_1/b_0)^a is a gamma density, and so integrates to 1.
To use this for the pump model, consider taking one step in the Gibbs sampler Markov chain, where first β is sampled conditioned on the λ_i, and then the λ_i are sampled conditioned on β.
Since [β | λ_1,...,λ_10] ~ Gamma(18.03, 1 + Σ_i λ_i), if Σ_i λ_i were bounded, it would be possible to use the previous lemma to write the distribution as a mixture of a draw from density r and a draw from whatever remained.
But in this case, the λ_i are unbounded! There are two ways to solve this issue:
1. Alter the prior slightly to impose the restriction Σ_i λ_i < L. When L is large enough, this alters the prior very little.
2. Employ a technique developed by Kendall and Møller [82] known either as dominated coupling from the past or coupling into and from the past. This idea is explained in Section 6.7.
The first approach is much easier. Murdoch and Green found that after a 10,000-step run of the Markov chain, the largest value of Σ_i λ_i that was seen was 13.6, and so they employed L = 20 as their maximum value [107]. Plugging into the lemma gives [b_0, b_1] = [1, 21], and ∫ r(x) dx = (1/21)^{18.03} < 10^{−23}. Therefore, the chance that the multigamma coupler coalesces across the entire space simultaneously is very small.
One solution is not to try to couple the entire space simultaneously, but rather to partition the possible Σ_i λ_i values into intervals 0 = L_0 < L_1 < L_2 < ··· < L_m = L. Then all the values of Σ_i λ_i ∈ [L_k, L_{k+1}] can be updated simultaneously using the multigamma coupler.
If every one of these couplers coalesces, then there is now a finite set of β values to bring forward. The problem has been reduced to updating a finite set. By setting the L_k so that [(1 + L_k)/(1 + L_{k+1})]^{18.03} is always the same number, the chance of coalescence in each interval will be the same. By making this ratio sufficiently close to 1, the overall chance of every interval coupling will be high.
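Requiring [(1 + L_k)/(1 + L_{k+1})]^{18.03} to be constant forces the 1 + L_k to form a geometric sequence, 1 + L_k = (1 + L)^{k/m}. A small Python sketch of this choice (the function name is invented):

```python
def partition_levels(L, m):
    """Choose 0 = L_0 < ... < L_m = L so that (1 + L_k)/(1 + L_{k+1}) is
    the same for every k: take 1 + L_k = (1 + L)^(k/m)."""
    return [(1.0 + L) ** (k / m) - 1.0 for k in range(m + 1)]

levels = partition_levels(20.0, 100)
ratios = [(1 + levels[k]) / (1 + levels[k + 1]) for k in range(100)]
coalesce_prob = ratios[0] ** 18.03   # per-interval chance from Lemma 6.1
print(round(coalesce_prob, 3))
```

With m = 100 intervals each ratio is 21^{−1/100} ≈ 0.97, giving a per-interval coalescence chance of roughly 0.58; larger m pushes this chance as close to 1 as desired.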
A more effective approach for this particular problem was developed by Møller [96]. This approach takes advantage of the following well-known property of the gamma distribution.
Lemma 6.2. Say X ~ Gamma(a,b). Then for c > 0, X/c ~ Gamma(a, cb).
Therefore, given 1 + Σ_i λ_i ∈ [1, 21], it is possible to generate X ~ Gamma(18.03, 1), and then say β ∈ [X/21, X/1]. These bounds can then be used to bound each λ_i in a similar fashion, and then keep going back and forth until the upper and lower bounds on λ_i are close enough together that there is a reasonable chance that the multigamma coupler coalesces.
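One round of this back-and-forth bounding can be sketched in Python as follows (not from the text; helper names are invented, and the s_i, t_i values are illustrative placeholders, not necessarily the exact pump data):

```python
import random

ALPHA, A_BETA = 1.802, 18.03   # shape for each lambda_i; shape for beta

def bound_beta(rng):
    """Given 1 + sum(lambda_i) in [1, 21], draw X ~ Gamma(18.03, 1);
    by Lemma 6.2, beta = X/(1 + sum lambda_i) lies in [X/21, X/1]."""
    x = rng.gammavariate(A_BETA, 1.0)      # gammavariate(shape, scale)
    return x / 21.0, x / 1.0

def bound_lambdas(beta_lo, beta_hi, s, t, rng):
    """Given beta in [beta_lo, beta_hi], the same trick bounds each
    lambda_i ~ Gamma(alpha + s_i, beta + t_i): draw Y_i ~ Gamma(alpha + s_i, 1)
    so that lambda_i in [Y_i/(beta_hi + t_i), Y_i/(beta_lo + t_i)]."""
    bounds = []
    for si, ti in zip(s, t):
        y = rng.gammavariate(ALPHA + si, 1.0)
        bounds.append((y / (beta_hi + ti), y / (beta_lo + ti)))
    return bounds

rng = random.Random(0)
s = [5, 1, 14]                 # placeholder failure counts
t = [94.3, 15.7, 126.0]        # placeholder operating times
blo, bhi = bound_beta(rng)
lam_bounds = bound_lambdas(blo, bhi, s, t, rng)
print(all(lo <= hi for lo, hi in lam_bounds))
```

Iterating the two bounding steps shrinks the intervals, which is what gives the multigamma coupler a reasonable chance to fire.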
6.3 Multishift coupling
The partitioned multigamma coupler reduces a continuous space down to a finite number of states. Wilson developed another method called the layered multishift coupler [130] that accomplishes this for a wide variety of target distributions. Moreover, the multishift coupler is monotone, thus allowing the use of monotonic CFTP.
To illustrate how multishift coupling works, consider the Markov process that is simple symmetric random walk on the real line, where if the current state is x, the next state is chosen uniformly from [x − 1, x + 1].
One update function is φ_1(x,U) = x + 2U − 1. For U ~ Unif([0,1]), 2U − 1 ~ Unif([−1,1]), and so this update function has the correct kernel. It is also monotone, since for any u ∈ [0,1] and x ≤ y, φ_1(x,u) ≤ φ_1(y,u).
It is, however, extraordinarily bad at coalescence. If x ≠ y, then φ_1(x,u) ≠ φ_1(y,u) for all u ∈ [0,1]. A different update is needed in order to make states come together.
Multishift coupling does the following. Consider the set of numbers S = {...,−6,−4,−2,0,2,4,6,...}. For any real α, let S + α denote the set {s + α : s ∈ S}. Hence S + 0.5 = {...,−5.5,−3.5,−1.5,0.5,...}. Again let U ~ Unif([0,1]), and consider the set S + 2U.
Then for any x ∈ R, exactly one point of S + 2U falls in the interval [x − 1, x + 1). (Note that the point x + 1 has been removed from the interval, but this does not change the kernel since this only occurs with probability 0 anyway.)
So the new update function is

φ_multishift(x,U) is the unique element of (S + 2U) ∩ [x − 1, x + 1).    (6.5)

Some pertinent facts about this update function:
Fact 6.1. Let φ_multishift(x,U) = min((S + 2U) ∩ [x − 1, x + 1)). Then for U ~ Unif([0,1]),
1. The update function has φ_multishift(x,U) ~ Unif([x − 1, x + 1)).
2. The update function is monotonic:

(∀x)(∀y ≥ x)(∀u ∈ [0,1])(φ_multishift(x,u) ≤ φ_multishift(y,u)).    (6.6)
Proof. Let a ∈ [x − 1, x + 1). The event that φ_multishift(x,U) ≤ a is just the event that a point of S + 2U falls in [x − 1, a]. Since the points of S + 2U are spaced 2 apart and a − (x − 1) < 2, this occurs with probability (a − (x − 1))/2. Hence φ_multishift(x,U) ~ Unif([x − 1, x + 1)).
Let x < y and u ∈ [0,1]. Then if an element of S + 2u falls in [x − 1, x + 1) ∩ [y − 1, y + 1), then φ_multishift(x,u) = φ_multishift(y,u). Otherwise φ_multishift(x,u) < y − 1 and φ_multishift(y,u) ≥ x + 1, so monotonicity is trivially obtained.
Suppose that the current state of the chain is bounded between x_min and x_max. Then after taking one step using φ_multishift, the number of points to be coupled will have been reduced to at most ⌈(x_max − x_min)/2⌉ + 1.
There was nothing special about the width 2 on the interval. In general, given a width w between elements of S and upper and lower bounds x_min and x_max, after one step in the Markov chain there will remain at most ⌈(x_max − x_min)/w⌉ + 1 points.