Definition 1.33. If $X_t = Y_t$, then say that the processes have coupled or coalesced.
Coupling is important for assessing convergence because of the following result.
Theorem 1.4 (The Coupling Lemma [29, 1]). Let $Y_0 \sim \pi$ and $X_0 = x_0$ both evolve according to a coupled Markov chain. Then for all measurable $C$,
\[
|\mathbb{P}(X_t \in C \mid X_0 = x_0) - \pi(C)| \le \mathbb{P}(X_t \ne Y_t).
\]
In other words, the distance between the distribution of $X_t$ and the stationary distribution is bounded above by the probability that the process $X_t$ has not coupled at time $t$ with a stationary process $Y_t$.
Even if it is possible to use the coupling lemma to determine how close the law of the current state is to the target distribution, at best this can only give approximately correct samples. Still, it is a start!
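To make the coupling lemma concrete, here is a minimal numerical sketch; the two-state transition matrix and all names are illustrative assumptions, not from the text. The two copies of the chain share their uniform random variables, so once they meet they move together, and the estimated $\mathbb{P}(X_t \ne Y_t)$ bounds the distance to stationarity.

import random

# Transition matrix and stationary distribution of a toy two-state chain
# (illustrative values, not from the text).
P = [[0.7, 0.3], [0.4, 0.6]]
pi = [4/7, 3/7]

def step(state, u):
    # Advance one step using the shared uniform u; sharing u between the
    # two copies is what makes this a coupling.
    return 0 if u < P[state][0] else 1

def estimate_uncoupled(t, trials=100000):
    # Estimate P(X_t != Y_t) with X_0 = 0 and Y_0 drawn from pi.
    count = 0
    for _ in range(trials):
        x = 0
        y = 0 if random.random() < pi[0] else 1
        for _ in range(t):
            u = random.random()  # the same u drives both chains
            x, y = step(x, u), step(y, u)
        count += (x != y)
    return count / trials

# By the coupling lemma, |P(X_t in C | X_0 = 0) - pi(C)| is at most this.
print(estimate_uncoupled(5))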
1.8 Designing Markov chains
In order to use Markov chain Monte Carlo, it is necessary to build a Harris chain for which $\pi$ is a stationary distribution. It is usually better to require more than just stationarity, and to build a chain that is reversible. To describe reversibility, it is helpful to have a differential notation for measures. When $\pi$ has a density $f(x)$, then let $\pi(dx) = f(x)\,dx$. This differential notation means that for all measurable sets $A$,
\[
\pi(A) = \int_{x \in A} \pi(dx) = \int_{x \in A} f(x)\,dx.
\]
Using this differential notation, reversibility works as follows.
Definition 1.34. A distribution $\pi$ is reversible with respect to a Markov chain if
\[
\pi(dx)\,\mathbb{P}(X_{t+1} \in dy \mid X_t = x) = \pi(dy)\,\mathbb{P}(X_{t+1} \in dx \mid X_t = y).
\]
Lemma 1.2. If $\pi$ is reversible, then it is also stationary.
Proof. Let $\Omega$ be the state space of the Markov chain, and $\pi$ be reversible for the chain. Then let $X_t \sim \pi$, and $C$ be any measurable set. Then
\begin{align*}
\mathbb{P}(X_{t+1} \in C) &= \mathbb{E}[\mathbb{1}(X_{t+1} \in C)] \\
&= \mathbb{E}[\mathbb{E}[\mathbb{1}(X_{t+1} \in C) \mid X_t]] \\
&= \int_{x \in \Omega} \mathbb{E}[\mathbb{1}(X_{t+1} \in C) \mid X_t = x]\,\pi(dx) \\
&= \int_{x \in \Omega} \mathbb{P}(X_{t+1} \in C \mid X_t = x)\,\pi(dx) \\
&= \int_{x \in \Omega} \int_{y \in C} \mathbb{P}(X_{t+1} \in dy \mid X_t = x)\,\pi(dx) \\
&= \int_{y \in C} \int_{x \in \Omega} \mathbb{P}(X_{t+1} \in dx \mid X_t = y)\,\pi(dy) \\
&= \int_{y \in C} \mathbb{P}(X_{t+1} \in \Omega \mid X_t = y)\,\pi(dy) \\
&= \int_{y \in C} \pi(dy) \\
&= \pi(C).
\end{align*}
Hence $\pi$ is stationary. The key step uses reversibility to replace $\pi(dx)\,\mathbb{P}(X_{t+1} \in dy \mid X_t = x)$ with $\pi(dy)\,\mathbb{P}(X_{t+1} \in dx \mid X_t = y)$; the swapping of the order of the iterated integrals is then allowed by Tonelli's theorem since the integrand is nonnegative; see for instance [31].
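For a finite state space, Lemma 1.2 can be checked numerically. The following sketch, with an illustrative three-state walk of my own choosing, verifies that detailed balance forces $\pi P = \pi$.

import numpy as np

# A reversible walk on a three-point path (illustrative numbers).
P = np.array([[0.50, 0.50, 0.00],
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
pi = np.array([0.25, 0.50, 0.25])

# Detailed balance: pi(x) P(x,y) = pi(y) P(y,x) entrywise ...
flow = pi[:, None] * P
assert np.allclose(flow, flow.T)
# ... which forces stationarity: pi P = pi.
assert np.allclose(pi @ P, pi)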
The main reversible Markov chain types are Gibbs samplers, Metropolis-Hastings, and shift-move chains. Auxiliary variable chains can be used together with these ideas to expand their range of applicability. Each of these methods is now discussed in turn.
1.8.1 Gibbs sampling
There are many varieties of Gibbs sampler. The simplest operates over a state space of the form $C^V$. Call $v \in V$ a dimension of the problem. Given a current state $X_t = x$, the Gibbs sampler chooses a dimension $v$ uniformly from $V$, then lets $L(x,v)$ be all the states that match the label of $x$ everywhere except possibly at dimension $v$. So $L(x,v) = \{y : (\forall w \in V \setminus \{v\})(y(w) = x(w))\}$. Given $X_t = x$, the next state $X_{t+1}$ is chosen from $\pi$ conditioned to lie in $L(x,v)$.
As an example, consider the Ising model when $\mu = 0$. In this case, a dimension of the problem consists of a single node $v$. So select a node $v$ uniformly at random from $V$, and consider the two states that equal $x$ at every node $w \ne v$. The value at node $v$ can be 1 or $-1$. Call these two states $x_{v \to 1}$ and $x_{v \to -1}$, respectively.

Then make the next state either $x_{v \to 1}$ or $x_{v \to -1}$, where the probability of choosing $x_{v \to 1}$ is proportional to $\pi(\{x_{v \to 1}\})$. So
\[
\mathbb{P}(X_{t+1} = x_{v \to 1}) = \frac{\pi(\{x_{v \to 1}\})}{\pi(\{x_{v \to 1}\}) + \pi(\{x_{v \to -1}\})}. \tag{1.12}
\]
Recall that for the Ising model with $\mu = 0$,
\[
\pi(\{x\}) = \frac{1}{Z} \prod_{\{i,j\} \in E} \exp(\beta \mathbb{1}(x(i) = x(j))). \tag{1.13}
\]
In (1.12), the normalizing constant $Z$ cancels out. So do the $\exp(\beta \mathbb{1}(x(i) = x(j)))$ factors, unless either $i$ or $j$ equals $v$. So the fraction reduces to
\[
\mathbb{P}(X_{t+1} = x_{v \to 1})
= \frac{\prod_{j : \{v,j\} \in E} \exp(\beta \mathbb{1}(x(j) = 1))}{\prod_{j : \{v,j\} \in E} \exp(\beta \mathbb{1}(x(j) = 1)) + \prod_{j : \{v,j\} \in E} \exp(\beta \mathbb{1}(x(j) = -1))}
= \frac{\exp(\beta n_1)}{\exp(\beta n_1) + \exp(\beta n_{-1})},
\]
where $n_c$ is the number of neighbors of $v$ that have label $c$.
For instance, suppose the node $v$ is adjacent to three nodes labeled 1, and one node labeled $-1$. Then the chance that $v$ should be labeled 1 is $\exp(3\beta)/[\exp(\beta) + \exp(3\beta)]$.
This gives the following pseudocode for taking one step in the Markov chain.
Ising Gibbs   Input: old state x   Output: new state x
1) Draw v ← Unif(V), U ← Unif([0,1])
2) Let n_1 be the number of neighbors of v labeled 1 in x
3) Let n_{-1} be the number of neighbors of v labeled -1 in x
4) If U < exp(n_1 β)/[exp(n_1 β) + exp(n_{-1} β)]
5)   x(v) ← 1
6) Else
7)   x(v) ← -1
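The pseudocode translates directly into runnable code. Below is one possible sketch in Python, where the dictionary-based graph representation and the function name are my own assumptions; the step numbers in the comments refer to the pseudocode above.

import math
import random

def ising_gibbs_step(x, neighbors, beta):
    # x: dict mapping node -> label in {-1, 1}
    # neighbors: dict mapping node -> list of adjacent nodes
    v = random.choice(list(x))                            # step 1: v ~ Unif(V)
    n_plus = sum(1 for w in neighbors[v] if x[w] == 1)    # step 2
    n_minus = sum(1 for w in neighbors[v] if x[w] == -1)  # step 3
    p = math.exp(beta * n_plus) / (
        math.exp(beta * n_plus) + math.exp(beta * n_minus))
    x[v] = 1 if random.random() < p else -1               # steps 4-7
    return x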
Note here that the node $v$ is being chosen uniformly at random from among the nodes of the graph. This is known as using random scan. Another popular choice with Gibbs is to run through all the nodes one at a time in either a specified or a random order. This is known as deterministic scan.
More generally, suppose that the current state is a vector $(X(1), X(2), \ldots, X(n))$. Then choose some subset of indices $I$ according to some distribution. Then the next state chooses values for $X(i)$ for all $i \in I$ from $\pi$ conditioned on the values of $X(i')$ for all $i' \notin I$. For instance, in the Ising model, an edge $\{i,j\}$ could be chosen uniformly from the set of edges, and the values of $X(i)$ and $X(j)$ could be chosen from $\pi$ conditioned on all of the remaining states.
In random scan, the set $I$ is chosen uniformly at random. It is easy to verify that as long as the choice of $I$ is independent of the current state, the resulting chain is stationary with respect to $\pi$. It is usually not difficult to show that it is possible to reach any state from any starting state in a finite number of moves.
1.8.2 Metropolis-Hastings
With Metropolis-Hastings, for each state $x$ of the chain, there is a density $q_x$. Choose a proposal state $y$ according to $q_x$. Then the next state of the chain will be either $x$ or $y$. The probability the next state is $y$ is chosen in such a way that reversibility holds with respect to the target distribution. If the next state is $y$, say that the chain accepts the move to $y$. Otherwise, the chain rejects the move, and stays at $x$.
For this to work, it is necessary that if $q_x(y) > 0$, then $q_y(x) > 0$ as well. Then the move to state $y$ occurs with probability
\[
\min\left(1, \frac{f_\pi(y)\,q_y(x)}{f_\pi(x)\,q_x(y)}\right). \tag{1.14}
\]
Note that $f_\pi$ can be either a normalized or unnormalized density for the target distribution.
To see this in action, consider again the Ising model with $\mu = 0$. Then suppose at each step a node $v$ is selected uniformly at random, and then a candidate color for $y(v)$ is chosen uniformly from $\{-1, 1\}$. Then $q_y(x) = q_x(y) = 1/2$, so those factors in (1.14) cancel out. In fact, the original Metropolis et al. method [91] just considered chains where $q_y(x) = q_x(y)$. Only later did Hastings [47] consider $q_x(y) \ne q_y(x)$.
Again we need to know $n_1$ (the number of neighbors of $v$ labeled 1) and $n_{-1}$ (the number of neighbors labeled $-1$) to calculate $f_\pi(y)/f_\pi(x)$, but again the remainder of the factors in $f_\pi$ simply cancel out. The resulting algorithm looks like this.
Ising MH   Input: old state x   Output: new state x
1) Draw v ← Unif(V), U ← Unif([0,1]), c ← Unif({-1,1})
2) Let n_1 be the number of neighbors of v labeled 1 in x
3) Let n_{-1} be the number of neighbors of v labeled -1 in x
4) If U < exp(n_c β)/exp(n_{x(v)} β)
5)   x(v) ← c
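A runnable sketch of this step follows, using the same hypothetical dictionary-based graph representation as the Gibbs sketch above.

import math
import random

def ising_mh_step(x, neighbors, beta):
    # One Metropolis-Hastings update for the Ising model with mu = 0.
    v = random.choice(list(x))     # v ~ Unif(V)
    c = random.choice([-1, 1])     # proposed color; here q_x(y) = q_y(x)
    n = {1: 0, -1: 0}
    for w in neighbors[v]:
        n[x[w]] += 1
    # Accept with probability min(1, exp(n_c beta) / exp(n_{x(v)} beta)).
    if random.random() < math.exp(beta * n[c]) / math.exp(beta * n[x[v]]):
        x[v] = c
    return x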
Now consider trying to build a Markov chain whose stationary distribution has a density with respect to the uniform distribution over permutations. It is impossible to change only one component of a permutation, so at least two values of the permutation vector must be altered.
Given a permutation $\sigma$, let
\[
\sigma_{i \leftrightarrow j}(k) =
\begin{cases}
\sigma(j) & \text{when } k = i, \\
\sigma(i) & \text{when } k = j, \\
\sigma(k) & \text{when } k \notin \{i, j\}
\end{cases}
\]
denote the move that transposes (swaps) the items at positions $i$ and $j$.
Given current state $x$, a common way to construct a chain on permutations is to choose $i$ and $j$ uniformly at random, and let the proposed state be $y = x_{i \leftrightarrow j}$. Then $q_x(y) = q_y(x)$, and the chance of accepting the move is the minimum of 1 and $f(y)/f(x)$. Since only two elements are being considered, this is usually easy to calculate.
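As a sketch, the transposition chain might be implemented as follows; the density f is a placeholder argument supplied by the application, and the chain is assumed to start at a state with f(x) > 0.

import random

def transposition_step(sigma, f):
    # One step of the transposition chain: propose y = sigma with positions
    # i and j swapped, accept with probability min(1, f(y)/f(sigma)).
    n = len(sigma)
    i, j = random.randrange(n), random.randrange(n)
    y = list(sigma)
    y[i], y[j] = y[j], y[i]
    if random.random() < min(1.0, f(y) / f(sigma)):
        return y
    return sigma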
As an example, consider the uniform distribution over linear extensions of a poset. With respect to the uniform distribution on permutations, the density of a permutation where two elements are out of order is 0, while the density of any valid linear extension is 1.
A transposition is called adjacent if $j = i + 1$. By choosing only adjacent transpositions, it is easy to check if the partial order is still satisfied. For $i \in \{1, 2, \ldots, n-1\}$ and $j = i + 1$, let $\sigma(i) = a$ and $\sigma(i+1) = b$. Then do not transpose if $a \prec b$; otherwise, the transposition always occurs.
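For linear extensions this becomes the following sketch, where prec(a, b) is a hypothetical predicate of my own naming that encodes $a \prec b$ in the partial order.

import random

def linear_extension_step(sigma, prec):
    # Propose swapping adjacent positions i and i+1; the swap is skipped
    # exactly when it would put an element after one it must precede.
    i = random.randrange(len(sigma) - 1)
    a, b = sigma[i], sigma[i + 1]
    if not prec(a, b):
        sigma[i], sigma[i + 1] = b, a
    return sigma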
1.8.3 Auxiliary random variables
In most of the applications of MCMC, the target density of $X$ has a multiplicative structure. For these densities, it is possible to add a vector $Y$ of auxiliary (extra) random variables such that $(X, Y)$ has a simpler joint distribution.

As an example, consider once again the Ising model with $\mu = 0$. Then for each edge $\{i,j\}$, create an auxiliary random variable $Y(\{i,j\})$ whose distribution conditioned on $X$ is uniform over $[0,1]$ if $X(i) \ne X(j)$, and uniform over $[0, \exp(\beta)]$ if $X(i) = X(j)$.
The density of a uniform over $[a,b]$ is $f(s) = (b-a)^{-1}\mathbb{1}(s \in [a,b])$, so this choice for $Y$ makes the joint density of $(X,Y)$ uniform over the set of pairs with $x \in \{-1,1\}^V$ and $y$ satisfying $(\forall \{i,j\} \in E)(y(\{i,j\}) \in [0, \exp(\beta \cdot \mathbb{1}(x(i) = x(j)))])$.
The Markov chain then is simple: given the state $x$, choose a new value of $y$ conditioned on $x$, and then choose a new $x$ conditioned on $y$. (This can be viewed as a deterministic scan Gibbs sampler applied to the new joint distribution.)
Auxiliary chain   Input: old state (x,y)   Output: new state (x,y)
1) Draw y conditioned on x
2) Draw x conditioned on y
From the construction, drawing $y$ conditioned on $x$ is straightforward. Drawing $x$ conditioned on $y$ can be more complex. Consider again the Ising model with $\beta > 0$. Then $\exp(\beta) > 1$. When $y(\{i,j\}) \in [0,1]$, then it could be true that $x(i) = x(j)$ or $x(i) \ne x(j)$. But when $y(\{i,j\}) > 1$, it must be true that $x(i) = x(j)$. The edges with $y(\{i,j\}) > 1$ break the graph into clusters of connected components. Each component must have the same $x$ value across the component, but otherwise is uniform.

Therefore, break the graph into connected components using the edges with $y(\{i,j\}) > 1$, then for each connected component, choose a label uniformly from $\{-1,1\}$ and apply that label to every node in the component.
This example was one of the first examples of an auxiliary variable chain, and is due to Swendsen and Wang [121]. For this reason, auxiliary variable chains are sometimes known as Swendsen-Wang chains.
Swendsen Wang chain   Input: old state (x,y)   Output: new state (x,y)
1) For each edge {i,j} ∈ E
2)   Draw y({i,j}) ← Unif([0, exp(β · 1(x(i) = x(j)))])
3) Let A ← {{i,j} : y({i,j}) > 1}
4) For each connected component C in the graph (V,A)
5)   Draw B ← Unif({-1,1})
6)   For each v ∈ C, set x(v) ← B
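Below is one possible runnable version of this step; the union-find search for connected components is an implementation choice of mine, and the step numbers in the comments refer to the pseudocode above.

import math
import random

def swendsen_wang_step(x, edges, beta):
    # x: dict mapping node -> label in {-1, 1}; edges: list of (i, j) pairs.
    # Steps 1-2: draw the auxiliary edge variables; only y > 1 matters,
    # and y > 1 is possible only when x(i) = x(j).
    active = []
    for (i, j) in edges:
        top = math.exp(beta) if x[i] == x[j] else 1.0
        if random.uniform(0.0, top) > 1.0:
            active.append((i, j))
    # Steps 3-4: connected components of (V, active) via union-find.
    parent = {v: v for v in x}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v
    for (i, j) in active:
        parent[find(i)] = find(j)
    # Steps 5-6: give every node in a component the same uniform label.
    label = {}
    for v in x:
        r = find(v)
        if r not in label:
            label[r] = random.choice([-1, 1])
        x[v] = label[r]
    return x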
1.8.4 Slice sampler
The slice sampler (see [95] for more details) is a special case of auxiliary variables that uses just a single component $Y$ with conditional distribution $[Y \mid X = x] \sim \text{Unif}([0, f_X(x)])$. Then $[X \mid Y = y] \sim \text{Unif}(\Omega \cap \{x : f_X(x) \ge y\})$.

This last set $\Omega \cap \{x : f_X(x) \ge y\}$ is a slice through the density $f_X$ at height $y$, hence the name slice sampler.
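For a bounded one-dimensional density on an interval, the slice sampler step can be sketched as below; drawing uniformly from the slice by rejection is a simple choice of mine, not necessarily an efficient one.

import random

def slice_sampler_step(x, f, lo, hi):
    # Draw Y | X = x uniformly from [0, f(x)], then draw X | Y = y
    # uniformly from the slice {x in [lo, hi] : f(x) >= y} by rejection.
    y = random.uniform(0.0, f(x))
    while True:
        x_new = random.uniform(lo, hi)
        if f(x_new) >= y:
            return x_new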
1.8.5 Shift chains
Although Gibbs, Metropolis-Hastings, and auxiliary random variables are the most widely used types of Markov chains, they are not the only types of chains.

An important type of chain that is useful for repulsive processes is the shift move. The canonical example is permutations: they are repulsive in the sense that two components of the permutation cannot both be assigned the same value. The solution to