Chapter 6
Coalescence on Continuous and
Unbounded State Spaces
The self is not something ready-made, but something in continuous forma-
tion through choice of action.
John Dewey
The algorithms of Chapters 3 and 5 are not restricted in theory to discrete state
spaces, but the examples presented therein were all discrete. This is because the
problem of designing a chain that coalesces down to a single state can be trickier
when Ω is a continuous state space. Trickier, but not impossible, and in this chapter
several methods for dealing with continuous state spaces are considered.
6.1 Splitting chains
Suppose that X_0, X_1, X_2, ... is a Markov chain over a continuous state space, and that given X_t = x_t, the distribution of the next state X_{t+1} is given by the transition kernel K(x_t, ·).
The transition kernel usually depends on the starting state x_t. That is, the distribution K(x_t, ·) varies as x_t varies. However, it can sometimes be the case that there is an ε > 0 and kernels K_1 and K_2, such that

K(x_t, ·) = K_1(x_t, ·)ε + K_2(x_t, ·)(1 − ε)

with the property that K_1(x_t, ·) does not depend on x_t!
For example, suppose that X_t is a Markov chain conditioned to lie in [0,1] that updates as follows. With probability 1/2, X_{t+1} given X_t is uniform over [0,1], and with probability 1/2, [X_{t+1} | X_t] ~ Unif([X_t − δ, X_t + δ]), where δ = min{1 − X_t, X_t}. Figure 6.1 shows the density of X_{t+1} given that X_t = 0.3.
The idea then is that to take a move given X_t, draw Y_1 ~ Unif([0,1]) and [Y_2 | X_t] ~ Unif([X_t − δ, X_t + δ]) (where δ = min{1 − X_t, X_t}). Last, draw B ~ Bern(1/2). Set

X_{t+1} = Y_1 B + Y_2 (1 − B)    (6.1)
so that X_{t+1} = Y_1 if B = 1 and X_{t+1} = Y_2 otherwise. Note that Y_1 did not depend on X_t, and so no matter what the value of X_t, the next state X_{t+1} will be Y_1 if B = 1. So the chain couples if B = 1.
Since the chain has been split into a random variable that depends on X_t and one that does not, this is called a splitting chain [109]. If B ~ Bern(p), on average 1/p steps are needed before the chain coalesces by choosing the random variable that does not depend on the current state.

Figure 6.1 The density of X_{t+1} given that X_t = 0.3, which is (1/2)1(x ∈ [0,1]) + (1/2)(1/0.6)1(x ∈ [0,0.6]).
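As a quick illustration (a minimal Python sketch, not part of the text; the function names are invented), update (6.1) can be run on two copies of the chain that share the same Y_1, B, and uniform draws. The copies become equal the first time B = 1 and stay equal afterwards.

```python
import random

def split_update(x, y1, b, u):
    """One step of the splitting chain (6.1) on [0,1].

    y1 ~ Unif([0,1]) is shared by every copy of the chain,
    b  ~ Bern(1/2) decides which component is used, and
    u  ~ Unif([0,1]) drives the state-dependent component."""
    if b == 1:
        return y1                            # does not depend on x: chains couple
    delta = min(1.0 - x, x)
    return (x - delta) + 2.0 * delta * u     # Unif([x - delta, x + delta])

def run_pair(x, y, steps, rng):
    """Run two copies with shared randomness; report when B = 1 first occurs."""
    coupled_at = None
    for step in range(steps):
        y1 = rng.random()
        b = 1 if rng.random() < 0.5 else 0
        u = rng.random()
        x = split_update(x, y1, b, u)
        y = split_update(y, y1, b, u)
        if coupled_at is None and b == 1:
            coupled_at = step
    return x, y, coupled_at

rng = random.Random(42)
xf, yf, t = run_pair(0.1, 0.9, 20, rng)
print(xf == yf, t)
```

Since B ~ Bern(1/2) here, on average 1/p = 2 steps are needed before the coalescing branch fires, matching the discussion above.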
6.2 Multigamma coupling
The multigamma coupler of Murdoch and Green [107] takes this splitting chain idea and uses it to build an update function. Previously, the splitting chain [109] had been used by Lindvall [85] to create a coupling for two processes to bound the mixing time. Since an update function couples together multiple processes (usually an infinite number), Murdoch and Green called this process the multigamma coupler.
The idea is to explicitly write the update function as a mixture of a piece that couples completely, and a piece that does not. That is, suppose that for all x ∈ Ω, the chain can be written as

φ(x,U,B) = φ_1(U)B + φ_2(x,U)(1 − B)    (6.2)

where U ~ Unif([0,1]) and B ~ Bern(ρ) for some ρ > 0.
At each step of this update function, there is a ρ chance that φ(Ω,U,B) = φ_1(U). Hence such a chain quickly coalesces. But can this be done in practice?
Suppose the update function φ(x,U,B) has unnormalized density f(y|x), and there exists f_1 so that 0 ≤ f_1(y) ≤ f(y|x) for all x and y. Then let f_2(y|x) = f(y|x) − f_1(y).
With probability ∫ f_1(y) dy / ∫ f(y|x) dy, draw the next state from f_1, otherwise draw from f_2.
Multigamma update    Input: current state x, Output: next state y
1) Draw C ~ Bern(∫ f_1(y) dy / ∫ f(y|x) dy)
2) If C = 1
3)   Draw y from unnormalized density f_1
4) Else
5)   Draw y from unnormalized density f_2(·|x)
To employ multigamma coupling in density form, it is necessary to obtain a uniform lower bound on the density of the next state.
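The pseudocode above can be instantiated for the chain of Section 6.1, whose transition density f(y|x) = (1/2)1(y ∈ [0,1]) + (1/2)(1/(2δ))1(y ∈ [x − δ, x + δ]) is bounded below by f_1(y) = (1/2)1(y ∈ [0,1]). The following Python sketch (not from the text; names are invented) carries out the two branches:

```python
import random

def multigamma_update(x, rng):
    """One Multigamma update step for the [0,1] chain of Section 6.1.

    Here integral(f_1) = 1/2 and integral(f(.|x)) = 1, so C ~ Bern(1/2).
    The remainder f_2(.|x) is proportional to f - f_1, which is the
    uniform density on [x - delta, x + delta]."""
    if rng.random() < 0.5:                        # C = 1: draw from f_1
        return rng.uniform(0.0, 1.0)              # the coalescing piece
    delta = min(x, 1.0 - x)                       # C = 0: draw from f_2(.|x)
    return rng.uniform(x - delta, x + delta)

rng = random.Random(1)
state = 0.3
for _ in range(5):
    state = multigamma_update(state, rng)
print(0.0 <= state <= 1.0)
```

Note that this recovers exactly the splitting update (6.1): the design choice is the same, only phrased through densities rather than through an explicit mixture of random variables.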
6.2.1 Example: hierarchical Poisson/gamma model
In statistical inference, the goal is to learn from the data more about parameters of a statistical model. In Bayesian statistical inference, the unknown parameter is treated as a random variable. That means that it is modeled initially using a distribution called the prior that is set before any data is taken.
Then given the data and the statistical model, the distribution of the parameters can be updated using Bayes' Rule. The resulting distribution is called the posterior distribution.
In a hierarchical model, the parameter of interest has a prior that itself depends on unknown parameters that usually are given by a prior with high variance.
As an example, consider a dataset on pump reliability, with a model originally given by Gelfand and Smith [39], and studied further by Reutter and Johnson [113].
In this case the data is the number of failures in ten pump systems at a nuclear power plant. Let s_i denote the number of failures of pump i, and t_i be the time that pump i has been operating.
A common model is to treat the failures of pump i as occurring according to a Poisson process with failure rate λ_i. In such a model, the number of failures has a Poisson distribution with mean equal to the failure rate times the length of time of operation. That is, s_i ~ Pois(λ_i t_i) for each i ∈ {1,...,10}.
In order for the prior to not overwhelm the data, each failure rate is given a high-variance distribution for its prior, say [λ_i | α, β] ~ Gamma(α, β). The value of α = 1.802 comes from a method of moments argument. The inverse scale β is taken from another high-variance prior β ~ Gamma(γ, δ), where the values γ = 0.01 and δ = 1 were used in [113].
The reason for employing gamma distributions is their special nature as a conjugate prior. When the λ_i have gamma distributions and are used to generate a Poisson random variable, then conditioned on the Poisson random variable, they still have a gamma distribution with slightly altered parameters. To be precise:

[λ_i | β, s_i] ~ Gamma(α + s_i, β + t_i).    (6.3)
Similarly, the distribution of β given the failure parameters is also gamma:

[β | λ_1,...,λ_10, s_1,...,s_10] ~ Gamma(γ + 10α, δ + Σ_i λ_i).    (6.4)
It is important to note here that for this particular model, perfect simulation is unnecessary. The variable β is an example of a linchpin variable, meaning that once it is known, the distribution of the remaining variables is also known. So what initially looks like a ten-dimensional problem actually reduces to a one-dimensional problem that can be integrated numerically.
Still, there are many hierarchical models that do not split apart so easily, and this
particular model serves as a nice illustration of the multigamma coupling method.
In order to apply the method, lower bounds on gamma densities are needed.
Lemma 6.1. Say X ~ Gamma(a,b) if X has density f_X(x) = x^{a−1} b^a exp(−xb) 1(x > 0)/Γ(a). Then if b ∈ [b_0, b_1], and r(x) = x^{a−1} b_0^a exp(−x b_1) 1(x > 0)/Γ(a), then r(x) ≤ f_X(x) for all x, and ∫_{x∈R} r(x) dx = (b_0/b_1)^a.
Proof. Since a > 0 and x > 0, b_0^a ≤ b^a and exp(−xb) ≥ exp(−x b_1), so r(x) ≤ f_X(x). Also r(x)(b_1/b_0)^a is a gamma density, and so integrates to 1.
To use this for the pump model, consider taking one step in the Gibbs sampler Markov chain, where first β is sampled conditioned on the λ_i, and then the λ_i are sampled conditioned on β.
Since [β | λ_1,...,λ_10] ~ Gamma(18.03, 1 + Σ_i λ_i), if Σ_i λ_i were bounded, it would be possible to use the previous lemma to write the distribution as a mixture of a draw from density r and a draw from whatever remained.
But in this case, the λ_i are unbounded! There are two ways to solve this issue:
1. Alter the prior slightly to impose the restriction Σ_i λ_i < L. When L is large enough, this alters the prior very little.
2. Employ a technique developed by Kendall and Møller [82] known either as dominated coupling from the past or coupling into and from the past. This idea is explained in Section 6.7.
The first approach is much easier. Murdoch and Green found that after a 10,000-step run of the Markov chain, the largest value of Σ_i λ_i that was seen was 13.6, and so they employed L = 20 as their maximum value [107]. Plugging into the lemma gives [b_0, b_1] = [1, 21], and ∫ r(x) dx = (1/21)^{18.03} < 10^{−23}. Therefore, the chance that the multigamma coupler coalesces across the entire space simultaneously is very small.
One solution is not to try to couple the entire space simultaneously, but rather to partition the possible Σ_i λ_i values into intervals 0 = L_0 < L_1 < L_2 < ··· < L_m = L. Then all the values of Σ_i λ_i ∈ [L_k, L_{k+1}] can be updated simultaneously using the multigamma coupler.
If every one of these couplers coalesces, then there is now a finite set of β values to bring forward. The problem has been reduced to updating a finite set. By setting the L_k so that [(1 + L_k)/(1 + L_{k+1})]^{18.03} is always the same number, the chance of coalescence in each interval will be the same. By making this ratio sufficiently close to 1, the overall chance of every interval coupling will be high.
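Requiring [(1 + L_k)/(1 + L_{k+1})]^{18.03} to be constant forces the 1 + L_k to form a geometric sequence, 1 + L_k = (1 + L)^{k/m}. A small Python sketch of this choice (the function name is invented):

```python
def partition_levels(L, m):
    """Choose 0 = L_0 < ... < L_m = L so that (1 + L_k)/(1 + L_{k+1}) is
    the same for every k: take 1 + L_k = (1 + L)^(k/m)."""
    return [(1.0 + L) ** (k / m) - 1.0 for k in range(m + 1)]

levels = partition_levels(20.0, 100)
ratios = [(1 + levels[k]) / (1 + levels[k + 1]) for k in range(100)]
coalesce_prob = ratios[0] ** 18.03   # per-interval chance from Lemma 6.1
print(round(coalesce_prob, 3))
```

With m = 100 intervals each ratio is 21^{−1/100} ≈ 0.97, giving a per-interval coalescence chance of roughly 0.58; larger m pushes this chance as close to 1 as desired.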
A more effective approach for this particular problem was developed by Møller [96]. This approach takes advantage of the following well-known property of the gamma distribution.
Lemma 6.2. Say X ~ Gamma(a,b). Then for c > 0, X/c ~ Gamma(a, cb).
Therefore, given 1 + Σ_i λ_i ∈ [1, 21], it is possible to generate X ~ Gamma(18.03, 1), and then say β ∈ [X/21, X/1]. These bounds can then be used to bound each λ_i in a similar fashion, and then keep going back and forth until the upper and lower bounds on λ_i are close enough together that there is a reasonable chance that the multigamma coupler coalesces.
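One round of this back-and-forth bounding can be sketched in Python as follows (not from the text; helper names are invented, and the s_i, t_i values are illustrative placeholders, not necessarily the exact pump data):

```python
import random

ALPHA, A_BETA = 1.802, 18.03   # shape for each lambda_i; shape for beta

def bound_beta(rng):
    """Given 1 + sum(lambda_i) in [1, 21], draw X ~ Gamma(18.03, 1);
    by Lemma 6.2, beta = X/(1 + sum lambda_i) lies in [X/21, X/1]."""
    x = rng.gammavariate(A_BETA, 1.0)      # gammavariate(shape, scale)
    return x / 21.0, x / 1.0

def bound_lambdas(beta_lo, beta_hi, s, t, rng):
    """Given beta in [beta_lo, beta_hi], the same trick bounds each
    lambda_i ~ Gamma(alpha + s_i, beta + t_i): draw Y_i ~ Gamma(alpha + s_i, 1)
    so that lambda_i in [Y_i/(beta_hi + t_i), Y_i/(beta_lo + t_i)]."""
    bounds = []
    for si, ti in zip(s, t):
        y = rng.gammavariate(ALPHA + si, 1.0)
        bounds.append((y / (beta_hi + ti), y / (beta_lo + ti)))
    return bounds

rng = random.Random(0)
s = [5, 1, 14]                 # placeholder failure counts
t = [94.3, 15.7, 126.0]        # placeholder operating times
blo, bhi = bound_beta(rng)
lam_bounds = bound_lambdas(blo, bhi, s, t, rng)
print(all(lo <= hi for lo, hi in lam_bounds))
```

Iterating the two bounding steps shrinks the intervals, which is what gives the multigamma coupler a reasonable chance to fire.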
6.3 Multishift coupling
The partitioned multigamma coupler reduces a continuous space down to a finite number of states. Wilson developed another method called the layered multishift coupler [130] that accomplishes this for a wide variety of target distributions. Moreover, the multishift coupler is monotone, thus allowing the use of monotonic CFTP.
To illustrate how multishift coupling works, consider the Markov process that is simple symmetric random walk on the real line, where if the current state is x, the next state is chosen uniformly from [x − 1, x + 1].
One update function is φ_1(x,U) = x + 2U − 1. For U ~ Unif([0,1]), 2U − 1 ~ Unif([−1,1]), and so this update function has the correct kernel. It is also monotone, since for any u ∈ [0,1] and x ≤ y, φ_1(x,u) ≤ φ_1(y,u).
It is, however, extraordinarily bad at coalescence. If x ≠ y, then φ_1(x,u) ≠ φ_1(y,u) for all u ∈ [0,1]. A different update is needed in order to make states come together.
Multishift coupling does the following. Consider the set of numbers S = {...,−6,−4,−2,0,2,4,6,...}. For any real α, let S + α denote the set {s + α : s ∈ S}. Hence S + 0.5 = {...,−5.5,−3.5,−1.5,0.5,...}. Again let U ~ Unif([0,1]), and consider the set S + 2U.
Then for any x ∈ R, exactly one point of S + 2U falls in the interval [x − 1, x + 1). (Note that the point x + 1 has been removed from the interval, but this does not change the kernel since this only occurs with probability 0 anyway.)
So the new update function is

φ_multishift(x,U) is the unique element of (S + 2U) ∩ [x − 1, x + 1).    (6.5)

Some pertinent facts about this update function:
Fact 6.1. Let φ_multishift(x,U) = min((S + 2U) ∩ [x − 1, x + 1)). Then for U ~ Unif([0,1]),
1. The update function has φ_multishift(x,U) ~ Unif([x − 1, x + 1)).
2. The update function is monotonic:

(∀x)(∀y ≥ x)(∀u ∈ [0,1])(φ_multishift(x,u) ≤ φ_multishift(y,u)).    (6.6)
Proof. Let a ∈ [x − 1, x + 1). The event that φ_multishift(x,U) ≤ a is just the event that a point of S + 2U falls in [x − 1, a]. Since the points of S + 2U are spaced 2 apart and a − (x − 1) < 2, this occurs with probability (a − (x − 1))/2. Hence φ_multishift(x,U) ~ Unif([x − 1, x + 1)).
Let x < y and u ∈ [0,1]. Then if an element of S + 2u falls in [x − 1, x + 1) ∩ [y − 1, y + 1), then φ_multishift(x,u) = φ_multishift(y,u). Otherwise φ_multishift(x,u) < y − 1 and φ_multishift(y,u) ≥ x + 1, so monotonicity is trivially obtained.
Suppose that the current state of the chain is bounded between x_min and x_max. Then after taking one step using φ_multishift, the number of points to be coupled will have been reduced to at most ⌈(x_max − x_min)/2⌉ + 1.
There was nothing special about the width 2 on the interval. In general, given a width w between elements of S and upper and lower bounds x_min and x_max, after one step in the Markov chain there will remain at most ⌈(x_max − x_min)/w⌉ + 1 points.