The first two on this list, Markov random fields and permutation problems, serve as excellent examples for understanding how approximate sampling works, and so a detailed description of these models is presented here first.
1.6.1 Markov random fields
Consider a graph G =(V,E).
Definition 1.13. Say that a subset of nodes S separates i and j if every path in the graph from i to j passes through a node in S.
A Markov random field is a probability distribution on the labellings of a graph with the property that if i and j are separated by S, then conditioned on the values of the nodes in S, the labels of i and j are independent.
Definition 1.14. Given a graph G = (V,E) with a set of node labellings Ω, say that the distribution of X over Ω is a Markov random field (MRF) if for all nodes i and j, and all sets S separating i and j, [X(i) | X(j), X(S)] ~ [X(i) | X(S)]. A state x ∈ Ω is called a configuration.
To characterize distributions that are Markov random fields, it helps to have the notion of neighbors and cliques.
Definition 1.15. The neighbors of a node i are all nodes j such that {i, j} ∈ E.
Definition 1.16. A clique is a subset of nodes such that every two vertices in the
subset are connected by an edge.
The Hammersley–Clifford Theorem states that for Markov random fields with positive density, the density can be factorized over cliques.
Theorem 1.2 (Hammersley–Clifford Theorem). Given a finite graph G = (V,E), the distribution π is an MRF if it has density f_X and there exist functions φ_C for all cliques C such that f_X can be written as

f_X(x) = (1/Z) ∏_{C ∈ cliques(G)} φ_C(x).  (1.6)

Here Z is the normalizing constant known as the partition function.
Following Besag [8], consider an important class of MRFs.
Definition 1.17. Say that an MRF is an auto model if there exist functions f_i and g_{i,j} so that the density of X ~ π can be written

f_X(x) = (1/Z) ∏_{i ∈ V} f_i(X(i)) ∏_{{i,j} ∈ E} g_{i,j}(X(i), X(j)).  (1.7)
In other words, auto models are MRFs where only the cliques of size one and two affect the distribution. These types of models are used for image restoration [9] and more generally anywhere there is local spatial dependence in the model [8]. Arguably the most famous auto model is the Ising model.
Definition 1.18. The Ising model is an auto model with parameters μ and β, Ω = {−1,1}^V, and functions f(c) = exp(μc), g(c_1, c_2) = exp(β · 1(c_1 = c_2)). The parameter μ is known as the magnetization, and β as the inverse temperature. When β > 0, call the model ferromagnetic; when μ = 0, say the model has no magnetic field.
Most of the terminology comes from the initial use of the Ising model as a tool for understanding the magnetic behavior of materials. Note that when μ > 0 the distribution favors 1's at a node, and when β > 0 higher probability is given to configurations where adjacent nodes have the same color.
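To make the definition concrete, here is a minimal Python sketch of the unnormalized Ising density; the function name ising_weight and the edge-list representation of the graph are illustrative choices, not notation from the text.

```python
import math

def ising_weight(x, edges, mu, beta):
    """Unnormalized Ising density: the product over nodes of exp(mu*x(i))
    times the product over edges of exp(beta*1(x(i) = x(j)))."""
    node_part = sum(mu * x[i] for i in x)                      # sum of mu*c over nodes
    edge_part = sum(beta for (i, j) in edges if x[i] == x[j])  # beta for each agreeing edge
    return math.exp(node_part + edge_part)

# A 2x2 grid: nodes 0..3, edges of the square.
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
x = {0: 1, 1: 1, 2: -1, 3: 1}
print(ising_weight(x, edges, mu=0.0, beta=1.0))  # two agreeing edges -> exp(2)
```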
The definition can be easily extended in several ways. Different values of μ_i for each node and β_e for each edge could be used. The algorithms given in later chapters easily handle this type of generalization, but for clarity of presentation are only given here for the simpler case where the μ and β values are constant over the graph.
Another way to generalize is to allow more than two different labels of the nodes, giving what is commonly known as the Potts model. This can be used to model alloys where there are k different types of metals that can occupy a site, or discrete grayscale images where each pixel is assigned a number from 0 to 255 [9].
Definition 1.19. The Potts model is an auto model with parameter β, where Ω = {1,2,...,k}^V, f(c) = 1, g(c_1, c_2) = exp(1(c_1 = c_2)β).
Another model is known as the hard-core gas model, because it envisioned gas molecules as behaving like billiard balls with a hard core, so that two adjacent sites in a lattice could not both be occupied.
Definition 1.20. The hard-core gas model is an auto model over {0,1}^V with parameter λ > 0 where f(c) = λ^c and g(c_1, c_2) = 1 − c_1c_2. When x(v) = 1, say that node v is occupied; otherwise it is unoccupied.
Note that if x(i) = x(j) = 1 for adjacent nodes, then 1 − 1 · 1 = 0 and the chance of having the configuration immediately drops to 0. Because the occupation of a site lowers the chance (in the previous case, to 0) of having occupation at a neighboring site, this is an example of a repulsive process. A generalization known as the Strauss model [119] allows for adjacent labels of 1, but assigns a penalty factor to the density each time this occurs.
Definition 1.21. The Strauss model is an MRF over {0,1}^V with parameters λ > 0, γ ∈ [0,1] where f(c) = λ^c and g(c_1, c_2) = 1 + (γ − 1)c_1c_2. When x(v) = 1, say that node v is occupied; otherwise it is unoccupied.
Another way to put this: each time there is an edge {i, j} with x(i) = x(j) = 1, the density acquires an extra factor of γ.
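As another sketch (again with illustrative names and an edge-list graph), the Strauss weight follows directly from Definition 1.21; setting gamma = 0 recovers the hard-core gas model, since any occupied adjacent pair forces the weight to 0.

```python
def strauss_weight(x, edges, lam, gamma):
    """Unnormalized Strauss density: lam^(# occupied nodes) times
    gamma^(# edges whose endpoints are both occupied)."""
    occupied = sum(x[v] for v in x)
    both = sum(1 for (i, j) in edges if x[i] == 1 and x[j] == 1)
    return (lam ** occupied) * (gamma ** both)

edges = [(0, 1), (1, 2)]
x = {0: 1, 1: 1, 2: 0}
print(strauss_weight(x, edges, lam=2.0, gamma=0.5))  # 2^2 * 0.5 = 2.0
print(strauss_weight(x, edges, lam=2.0, gamma=0.0))  # hard-core case: 0.0
```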
1.6.2 Permutations
A permutation can be viewed as an ordering of a set of objects. In order to keep the
notation consistent it will be useful to write a permutation using a vector.
Definition 1.22. A permutation on n positions is a vector in {1,2,...,n}^n where each component is labeled a different value; that is, a vector x where i ≠ j ⇒ x(i) ≠ x(j). The set of permutations on n objects is also known as the symmetric group, and can be denoted by S_n.
Typically σ will be used to denote a permutation. Different distributions on permutations arise across mathematical disciplines. A permutation is also known as an assignment because it can be viewed as assigning each of n workers to n different jobs. When different workers have different aptitudes for the jobs, the result is a weighted permutation, which in turn gives a distribution on permutations. Another source of distributions on permutations is nonparametric testing.
A classic distribution problem on permutations is related to finding the permanent of a nonnegative matrix.
Definition 1.23. Given nonnegative numbers w(i, j), let

w(σ) = ∏_{i=1}^{n} w(i, σ(i)).  (1.8)
As long as there exists at least one σ with w(i, σ(i)) > 0 for all i, this gives an unnormalized density on the set of permutations. The normalizing constant for this density is called the permanent of the matrix (w(i, j)). (If no such σ exists then the permanent is 0.)
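For small n the permanent can be computed by brute force, summing w(σ) over all n! permutations; this sketch (not an algorithm from the text) makes the definition concrete and shows why exact computation is infeasible for large n.

```python
from itertools import permutations

def permanent(w):
    """Sum of w(sigma) = prod_i w[i][sigma(i)] over all permutations sigma."""
    n = len(w)
    total = 0.0
    for sigma in permutations(range(n)):
        prod = 1.0
        for i in range(n):
            prod *= w[i][sigma[i]]
        total += prod
    return total

w = [[1.0, 2.0],
     [3.0, 4.0]]
print(permanent(w))  # 1*4 + 2*3 = 10.0
```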
Another distribution of interest that arises in the context of nonparametric testing ([106]) is the uniform distribution over permutations where certain items are required to be put in a lower position than other items.
Definition 1.24. Given a set P, a partial order ⪯ on P is a binary relation such that for all a, b, c ∈ P:
1. (Reflexive) a ⪯ a.
2. (Antisymmetric) If a ⪯ b and b ⪯ a, then a = b.
3. (Transitive) If a ⪯ b and b ⪯ c, then a ⪯ c.
A partially ordered set is also known as a poset.
Then a linear extension of a partial order can be viewed as a permutation that respects the partial order.
Definition 1.25. A linear extension of a partial order ⪯ on {1,...,n} is a permutation σ where if i and j are distinct and σ(i) ⪯ σ(j), then i < j.
For example, in the partial order on 4 elements where 2 ⪯ 4, (1,2,4,3) is a linear extension, while (1,4,2,3) is not because items 2 and 4 are out of order.
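A hedged sketch of the check implied by Definition 1.25: representing the partial order as a list of pairs (a, b) with a ⪯ b (an assumption about representation, not from the text), a permutation is a linear extension exactly when every a appears at an earlier position than its b.

```python
def is_linear_extension(sigma, relations):
    """Return True if, for every pair (a, b) with a preceding b in the
    partial order, a appears at an earlier position of sigma than b."""
    pos = {v: i for i, v in enumerate(sigma)}
    return all(pos[a] < pos[b] for (a, b) in relations)

relations = [(2, 4)]  # the partial order where 2 precedes 4
print(is_linear_extension((1, 2, 4, 3), relations))  # True
print(is_linear_extension((1, 4, 2, 3), relations))  # False
```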
1.7 Markov chains and approximate simulation
Although there is a great need for the ability to sample from high-dimensional distributions, until the development of perfect simulation methods only approximate methods existed, many of which utilized Markov chains.
The many protocols for approximate sampling using Markov chains are collec-
tively known as Markov chain Monte Carlo. A Markov chain is a stochastic process
that is memoryless, so the next state of the chain only depends on the current state,
and not on any of the past history.
Definition 1.26. A stochastic process {X_t} is a Markov chain if for all t,

[X_t | X_0, X_1, ..., X_{t−1}] ~ [X_t | X_{t−1}].  (1.9)
For each state x in the state space Ω, the transition kernel of the Markov chain gives the probabilities that X_{t+1} falls into a measurable set C given that X_t = x. Formally, Markov kernels have two properties.
Definition 1.27. Consider a measurable space (Ω, F). A Markov kernel K : Ω × F → [0,1] satisfies:
• For all A ∈ F and a ∈ [0,1], the set of x such that K(x, A) ≤ a is measurable.
• For all x ∈ Ω, the map A ↦ K(x, A) is a probability measure.
If for a particular Markov chain {X_t}, it holds that P(X_{t+1} ∈ A | X_t = x) = K(x, A) for all x ∈ Ω and A ∈ F, say that K is the kernel of the Markov chain.
Note that it is possible to use a different Markov kernel K_t for each step of the chain. Such a Markov chain is called time heterogeneous. A Markov chain where the kernel is the same throughout is time homogeneous.
Common examples of Markov chains include shuffling a deck of cards, or mixing
up a Rubik’s cube by randomly twisting sides. As long as the next state only depends
on the current state, and not on any states further back in time, it is a Markov chain.
In both these examples, the goal is to achieve a state that is close to uniformly distributed over the set of possible states. Say that a state of a Rubik's cube is reachable if there exists a finite sequence of moves leading from the initial state, where each side of the cube is a solid color, to the desired state. When a Rubik's cube starts in a state that is uniformly distributed over reachable states, and a random move is made, the state is still uniform over the reachable states. A distribution where taking a move in the Markov chain leaves the distribution of the state unchanged is called stationary.
Definition 1.28. A distribution π for a Markov chain is stationary if X_t ~ π ⇒ X_{t+1} ~ π.
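For a finite state space the definition reduces to the matrix equation πP = π, where P is the transition matrix; the following small check (the two-state kernel and its numbers are illustrative) verifies stationarity directly.

```python
# Two-state transition matrix P and candidate stationary distribution pi.
P = [[0.9, 0.1],
     [0.3, 0.7]]
pi = [0.75, 0.25]

# The distribution of X_{t+1} when X_t ~ pi is pi times P.
pi_next = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]
print(pi_next)  # [0.75, 0.25], so pi is stationary
```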
Many protocols exist to build a Markov chain whose stationary distribution matches a particular target distribution π. Examples of such methods include Metropolis–Hastings [91, 47] and Gibbs sampling.
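As a hedged illustration of what such a protocol can look like (a generic single-site Metropolis sketch for the Ising model of Definition 1.18, not the exact construction of [91, 47]), the chain below proposes flipping one spin and accepts with probability min(1, new weight / old weight), which leaves the Ising distribution stationary.

```python
import math
import random

def metropolis_step(x, neighbors, mu, beta):
    """One Metropolis step: propose flipping a uniformly chosen spin,
    accept with probability min(1, exp(change in log weight))."""
    v = random.choice(list(x))
    agree = sum(1 for u in neighbors[v] if x[u] == x[v])
    # Flipping v negates mu*x(v) and swaps agreeing/disagreeing edges at v.
    delta = -2 * mu * x[v] + beta * (len(neighbors[v]) - 2 * agree)
    if random.random() < math.exp(min(0.0, delta)):
        x[v] = -x[v]
    return x

# A 2x2 grid as an adjacency dictionary; run many steps from all +1.
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
x = {v: 1 for v in neighbors}
for _ in range(10000):
    x = metropolis_step(x, neighbors, mu=0.0, beta=0.4)
print(x)  # an approximate draw from the Ising distribution
```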
Under mild conditions, a Markov chain will have a limiting distribution that
equals the stationary distribution. There are a variety of conditions that guarantee
this; here the notion of a Harris chain is used since it applies to both discrete and
continuous state space Markov chains.
Definition 1.29. A Markov chain {X_t} over Ω is a Harris chain if there exist measurable A, B ⊆ Ω and ε > 0 for x ∈ A, y ∈ B, and a probability measure ρ with ρ(B) = 1 where
1. For T_A = inf{t ≥ 0 : X_t ∈ A}, (∀z ∈ Ω)(P(T_A < ∞ | X_0 = z) > 0).
2. If x ∈ A and C ⊆ B, then P(X_1 ∈ C | X_0 = x) ≥ ε ρ(C).
In other words, a chain is a Harris chain if there is a subset of states A that the chain returns to with positive probability after a finite number of steps from any starting state. Moreover, if the state is somewhere in A, then with probability ε the next state does not depend on exactly which element of A the state is at.
The notions of recurrence and aperiodicity are needed to make the stationary distribution limiting as well. A Harris chain always has positive probability of returning to A. When the chance of returning is 1, then the chain is recurrent.
Definition 1.30. Let R = inf{n > 0 : X_n ∈ A}. A Harris chain is recurrent if for all x ∈ A, P(R < ∞ | X_0 = x) = 1. A Harris chain that is not recurrent is transient.
Given a recurrent Harris chain, suppose the state space can be partitioned into P_1, P_2, ..., P_k so that for a state in P_i, the chance of moving to P_{i+1} is 1. For a state in P_k, the next state is back in P_1. Then if k is the smallest number for which there is such a partition, call k the period of the chain. Intuitively, when k is 1, the chain is aperiodic. The following definition of aperiodicity is equivalent to this notion.
Definition 1.31. A recurrent Harris chain is aperiodic if for all x ∈ Ω, there exists n such that for all n′ ≥ n,

P(X_{n′} ∈ A | X_0 = x) > 0.  (1.10)
Theorem 1.3 (Ergodic Theorem for Harris chains). Let {X_t} be an aperiodic recurrent Harris chain with stationary distribution π. If P(R < ∞ | X_0 = x) = 1 for all x, then as t → ∞, for all measurable sets C and for all x,

|P(X_t ∈ C | X_0 = x) − π(C)| → 0.  (1.11)
This theorem is the heart of Markov chain Monte Carlo: for a given target distribution, build a recurrent, aperiodic Harris chain whose stationary distribution equals the target distribution. Then the limiting distribution of the Harris chain will be the target distribution. So by running the chain an infinite number of steps, a draw from the target distribution is made.
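A small numerical illustration of (1.11), reusing the two-state kernel from the stationarity sketch above: the rows of P^t, which hold P(X_t = j | X_0 = i), approach π as t grows.

```python
# Rows of P^t give P(X_t = j | X_0 = i); both rows approach pi = [0.75, 0.25].
P = [[0.9, 0.1],
     [0.3, 0.7]]

def matmul(A, B):
    """Multiply two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

Pt = P
for _ in range(50):
    Pt = matmul(Pt, P)
print(Pt)  # both rows close to [0.75, 0.25]
```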
Since very few practitioners actually have an infinite amount of time to spare, instead a finite number of steps is taken, and the user hopes that this is sufficient to make the state close to the limiting distribution. However, while it is often known that the distribution of the state of the chain converges to the stationary distribution, it is rare to be able to calculate how quickly that convergence occurs.
One of the few methods that exist for doing so is known as coupling.
Definition 1.32. Suppose {X_t} ~ ν_X and {Y_t} ~ ν_Y. Then a coupling of {X_t} and {Y_t} is a bivariate process {(X_t, Y_t)} such that {X_t} ~ ν_X and {Y_t} ~ ν_Y.
The idea here is that X_t and Y_t both have a specified distribution; perhaps both are Markov chains with a specified transition kernel, for instance. Then a coupling is a way of advancing both X_t and Y_t forward into the future at the same time while each, when viewed individually, is following the transition probabilities of the appropriate Markov chain. Say that the two Markov chains are coupled.
The word couple comes from the Latin for “to fasten together,” and this sense of fastening together remains the key to using coupled processes. It is important to recognize when the two processes cross paths; that is, when they are equal to one another.
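A minimal sketch of a coupling (an illustrative construction, not one from the text): two walks on {0,...,n} driven by the same ±1 moves, each sticking at a boundary when the move would leave the state space. Each walk alone follows the walk's transition probabilities, and once the two meet they stay together forever.

```python
import random

def coupled_step(x, y, n):
    """Advance both walks on {0,...,n} using the SAME random move; each
    walk alone moves +1 or -1 with probability 1/2, sticking at 0 and n."""
    move = random.choice([-1, 1])
    x = min(max(x + move, 0), n)
    y = min(max(y + move, 0), n)
    return x, y

x, y, n, steps = 0, 4, 4, 0
while x != y:
    x, y = coupled_step(x, y, n)
    steps += 1
print("walks coupled after", steps, "steps")
```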