The first two on this list, Markov random fields and permutation problems, serve as excellent examples for understanding how approximate sampling works, and so a detailed description of these models is presented here first.
1.6.1 Markov random fields
Consider a graph G =(V,E).
Definition 1.13. Say that a subset of nodes S separates i and j if every path in the graph from i to j passes through a node in S.
A Markov random field is a probability distribution on the labellings of a graph with the property that if i and j are separated by S, then conditioned on the values of the nodes in S, the labels of i and j are independent.
Definition 1.14. Given a graph G = (V,E) with a set of node labellings Ω, say that the distribution of X over Ω is a Markov random field (MRF) if for all nodes i and j, and all sets S separating i and j, [X(i) | X(j), X(S)] ~ [X(i) | X(S)]. A state x ∈ Ω is called a configuration.
To characterize distributions that are Markov random fields, it helps to have the notion of neighbors and cliques.
Definition 1.15. The neighbors of a node i are all nodes j such that {i, j} ∈ E.
Definition 1.16. A clique is a subset of nodes such that every two vertices in the
subset are connected by an edge.
The Hammersley–Clifford Theorem states that for Markov random fields with positive density, the density can be factorized over cliques.
Theorem 1.2 (Hammersley–Clifford Theorem). Given a finite graph G = (V,E), the distribution π is an MRF if it has density f_X and there exist functions φ_C for all cliques C such that f_X can be written as

f_X(x) = (1/Z) ∏_{C ∈ cliques(G)} φ_C(x).  (1.6)

Here Z is the normalizing constant known as the partition function.
Following Besag [8], consider an important class of MRFs.
Definition 1.17. Say that an MRF is an auto model if there exist functions f_i and g_{i,j} so that the density of X ~ π can be written

f_X(x) = (1/Z) ∏_{i ∈ V} f_i(X(i)) ∏_{{i,j} ∈ E} g_{i,j}(X(i), X(j)).  (1.7)
In other words, auto models are MRFs where only the cliques of size one and two affect the distribution. These types of models are used for image restoration [9] and more generally anywhere there is local spatial dependence in the model [8]. Arguably the most famous auto model is the Ising model.
Definition 1.18. The Ising model is an auto model with parameters μ and β, Ω = {−1,1}^V, and functions f(c) = exp(μc), g(c_1, c_2) = exp(β · 1(c_1 = c_2)). The parameter μ is known as the magnetization, and β as the inverse temperature. When β > 0, call the model ferromagnetic; when μ = 0, say the model has no magnetic field.
Most of the terminology comes from the initial use of the Ising model as a tool for understanding the magnetic behavior of materials. Note that when μ > 0 the distribution favors 1's at a node, and when β > 0 higher probability is given to configurations where adjacent nodes have the same color.
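To make the definition concrete, here is a minimal Python sketch of the unnormalized Ising density; the function name ising_weight and the edge-list representation of the graph are illustrative choices, not notation from the text.

```python
import math

def ising_weight(x, edges, mu, beta):
    """Unnormalized Ising density: the product over nodes of exp(mu*x(i))
    times the product over edges of exp(beta*1(x(i) = x(j)))."""
    node_part = sum(mu * x[i] for i in x)                      # sum of mu*c over nodes
    edge_part = sum(beta for (i, j) in edges if x[i] == x[j])  # beta for each agreeing edge
    return math.exp(node_part + edge_part)

# A 2x2 grid: nodes 0..3, edges of the square.
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
x = {0: 1, 1: 1, 2: -1, 3: 1}
print(ising_weight(x, edges, mu=0.0, beta=1.0))  # two agreeing edges -> exp(2)
```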
The definition can be easily extended in several ways. Different values of μ_i for each node and β_e for each edge could be used. The algorithms given in later chapters easily handle this type of generalization, but for clarity of presentation are only given here for the simpler case where the μ and β values are constant over the graph.
Another way to generalize is to allow more than two different labels of the nodes, giving what is commonly known as the Potts model. This can be used to model alloys where there are k different types of metals that can occupy a site, or discrete grayscale images where each pixel is assigned a number from 0 to 255 [9].
Definition 1.19. The Potts model is an auto model with parameter β, where Ω = {1,2,...,k}^V, f(c) = 1, g(c_1, c_2) = exp(1(c_1 = c_2)β).
Another model is known as the hard-core gas model, because it envisioned gas molecules as behaving like billiard balls with a hard core, so that two adjacent sites in a lattice could not both be occupied.
Definition 1.20. The hard-core gas model is an auto model over {0,1}^V with parameter λ > 0 where f(c) = λ^c and g(c_1, c_2) = 1 − c_1c_2. When x(v) = 1, say that node v is occupied; otherwise it is unoccupied.
Note that if x(i) = x(j) = 1 for adjacent nodes, then 1 − 1 · 1 = 0 and the chance of having the configuration immediately drops to 0. Because the occupation of a site lowers the chance (in the previous case, to 0) of having occupation at a neighboring site, this is an example of a repulsive process. A generalization known as the Strauss model [119] allows for adjacent labels of 1, but assigns a penalty factor to the density each time this occurs.
Definition 1.21. The Strauss model is an MRF over {0,1}^V with parameters λ > 0, γ ∈ [0,1] where f(c) = λ^c and g(c_1, c_2) = 1 + (γ − 1)c_1c_2. When x(v) = 1, say that node v is occupied; otherwise it is unoccupied.
Another way to put this: each time there is an edge {i, j} with x(i) = x(j) = 1, the density acquires an extra factor of γ.
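As another sketch (again with illustrative names and an edge-list graph), the Strauss weight follows directly from Definition 1.21; setting gamma = 0 recovers the hard-core gas model, since any occupied adjacent pair forces the weight to 0.

```python
def strauss_weight(x, edges, lam, gamma):
    """Unnormalized Strauss density: lam^(# occupied nodes) times
    gamma^(# edges whose endpoints are both occupied)."""
    occupied = sum(x[v] for v in x)
    both = sum(1 for (i, j) in edges if x[i] == 1 and x[j] == 1)
    return (lam ** occupied) * (gamma ** both)

edges = [(0, 1), (1, 2)]
x = {0: 1, 1: 1, 2: 0}
print(strauss_weight(x, edges, lam=2.0, gamma=0.5))  # 2^2 * 0.5 = 2.0
print(strauss_weight(x, edges, lam=2.0, gamma=0.0))  # hard-core case: 0.0
```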
1.6.2 Permutations
A permutation can be viewed as an ordering of a set of objects. In order to keep the
notation consistent it will be useful to write a permutation using a vector.
Definition 1.22. A permutation on n positions is a vector in {1,2,...,n}^n where each component is labeled a different value; that is, a vector x where i ≠ j ⇒ x(i) ≠ x(j). The set of permutations on n objects is also known as the symmetric group, and can be denoted by S_n.
Typically σ will be used to denote a permutation. Different distributions on permutations arise across mathematical disciplines. A permutation is also known as an assignment because it can be viewed as assigning each of n workers to n different jobs. When different workers have different aptitudes for the jobs, the result is a weighted permutation, which in turn gives a distribution on permutations. Another source of distributions on permutations is nonparametric testing.
A classic distribution problem on permutations is related to finding the permanent of a nonnegative matrix.
Definition 1.23. Given nonnegative numbers w(i, j), let

w(σ) = ∏_{i=1}^{n} w(i, σ(i)).  (1.8)
As long as there exists at least one σ with w(i, σ(i)) > 0 for all i, this gives an unnormalized density on the set of permutations. The normalizing constant for this density is called the permanent of the matrix (w(i, j)). (If no such σ exists then the permanent is 0.)
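For small n the permanent can be computed by brute force, summing w(σ) over all n! permutations; this sketch (not an algorithm from the text) makes the definition concrete and shows why exact computation is infeasible for large n.

```python
from itertools import permutations

def permanent(w):
    """Sum of w(sigma) = prod_i w[i][sigma(i)] over all permutations sigma."""
    n = len(w)
    total = 0.0
    for sigma in permutations(range(n)):
        prod = 1.0
        for i in range(n):
            prod *= w[i][sigma[i]]
        total += prod
    return total

w = [[1.0, 2.0],
     [3.0, 4.0]]
print(permanent(w))  # 1*4 + 2*3 = 10.0
```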
Another distribution of interest that arises in the context of nonparametric testing ([106]) is the uniform distribution over permutations where certain items are required to be put in a lower position than other items.
Definition 1.24. Given a set P, a partial order ⪯ on P is a binary relation such that for all a, b, c ∈ P:
1. (Reflexive) a ⪯ a.
2. (Antisymmetric) If a ⪯ b and b ⪯ a, then a = b.
3. (Transitive) If a ⪯ b and b ⪯ c, then a ⪯ c.
A partially ordered set is also known as a poset.
Then a linear extension of a partial order can be viewed as a permutation that respects the partial order.
Definition 1.25. A linear extension of a partial order ⪯ on {1,...,n} is a permutation σ where if i and j are distinct and σ(i) ⪯ σ(j), then i < j.
For example, in the partial order on 4 elements where 2 ⪯ 4, (1,2,4,3) is a linear extension, while (1,4,2,3) is not because items 2 and 4 are out of order.
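A hedged sketch of the check implied by Definition 1.25: representing the partial order as a list of pairs (a, b) with a ⪯ b (an assumption about representation, not from the text), a permutation is a linear extension exactly when every a appears at an earlier position than its b.

```python
def is_linear_extension(sigma, relations):
    """Return True if, for every pair (a, b) with a preceding b in the
    partial order, a appears at an earlier position of sigma than b."""
    pos = {v: i for i, v in enumerate(sigma)}
    return all(pos[a] < pos[b] for (a, b) in relations)

relations = [(2, 4)]  # the partial order where 2 precedes 4
print(is_linear_extension((1, 2, 4, 3), relations))  # True
print(is_linear_extension((1, 4, 2, 3), relations))  # False
```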
1.7 Markov chains and approximate simulation
Although there is a great need for the ability to sample from high-dimensional distributions, until the development of perfect simulation methods only approximate methods existed, many of which utilized Markov chains.
The many protocols for approximate sampling using Markov chains are collec-
tively known as Markov chain Monte Carlo. A Markov chain is a stochastic process
that is memoryless, so the next state of the chain only depends on the current state,
and not on any of the past history.
Definition 1.26. A stochastic process {X_t} is a Markov chain if for all t,

[X_t | X_0, X_1, ..., X_{t−1}] ~ [X_t | X_{t−1}].  (1.9)
For each state x in the state space Ω, the transition kernel of the Markov chain gives the probabilities that X_{t+1} falls into a measurable set C given that X_t = x. Formally, Markov kernels have two properties.
Definition 1.27. Consider a measurable space (Ω, F). A Markov kernel K : Ω × F → [0,1] satisfies:
• For all A ∈ F and a ∈ [0,1], the set of x such that K(x, A) ≤ a is measurable.
• For all x ∈ Ω, the map A ↦ K(x, A) is a probability measure.
If for a particular Markov chain {X_t}, it holds that P(X_{t+1} ∈ A | X_t = x) = K(x, A) for all x ∈ Ω and A ∈ F, say that K is the kernel of the Markov chain.
Note that it is possible to use a different Markov kernel K_t for each step of the chain. Such a Markov chain is called time heterogeneous. A Markov chain where the kernel is the same throughout is time homogeneous.
Common examples of Markov chains include shuffling a deck of cards, or mixing
up a Rubik’s cube by randomly twisting sides. As long as the next state only depends
on the current state, and not on any states further back in time, it is a Markov chain.
In both these examples, the goal is to achieve a state that is close to uniformly distributed over the set of possible states. Say that a state of a Rubik's cube is reachable if there exists a finite sequence of moves leading from the initial state, where each side of the cube is a solid color, to the desired state. When a Rubik's cube starts in a state that is uniformly distributed over reachable states, and a random move is made, the state is still uniform over the reachable states. A distribution where taking a move in the Markov chain leaves the distribution of the state unchanged is called stationary.
Definition 1.28. A distribution π for a Markov chain is stationary if X_t ~ π ⇒ X_{t+1} ~ π.
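For a finite state space the definition reduces to the matrix equation πP = π, where P is the transition matrix; the following small check (the two-state kernel and its numbers are illustrative) verifies stationarity directly.

```python
# Two-state transition matrix P and candidate stationary distribution pi.
P = [[0.9, 0.1],
     [0.3, 0.7]]
pi = [0.75, 0.25]

# The distribution of X_{t+1} when X_t ~ pi is pi times P.
pi_next = [sum(pi[i] * P[i][j] for i in range(2)) for j in range(2)]
print(pi_next)  # [0.75, 0.25], so pi is stationary
```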
Many protocols exist to build a Markov chain whose stationary distribution matches a particular target distribution π. Examples of such methods include Metropolis–Hastings [91, 47] and Gibbs sampling.
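As a hedged illustration of what such a protocol can look like (a generic single-site Metropolis sketch for the Ising model of Definition 1.18, not the exact construction of [91, 47]), the chain below proposes flipping one spin and accepts with probability min(1, new weight / old weight), which leaves the Ising distribution stationary.

```python
import math
import random

def metropolis_step(x, neighbors, mu, beta):
    """One Metropolis step: propose flipping a uniformly chosen spin,
    accept with probability min(1, exp(change in log weight))."""
    v = random.choice(list(x))
    agree = sum(1 for u in neighbors[v] if x[u] == x[v])
    # Flipping v negates mu*x(v) and swaps agreeing/disagreeing edges at v.
    delta = -2 * mu * x[v] + beta * (len(neighbors[v]) - 2 * agree)
    if random.random() < math.exp(min(0.0, delta)):
        x[v] = -x[v]
    return x

# A 2x2 grid as an adjacency dictionary; run many steps from all +1.
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
x = {v: 1 for v in neighbors}
for _ in range(10000):
    x = metropolis_step(x, neighbors, mu=0.0, beta=0.4)
print(x)  # an approximate draw from the Ising distribution
```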
Under mild conditions, a Markov chain will have a limiting distribution that
equals the stationary distribution. There are a variety of conditions that guarantee
this; here the notion of a Harris chain is used since it applies to both discrete and
continuous state space Markov chains.
Definition 1.29. A Markov chain {X_t} over Ω is a Harris chain if there exist measurable A, B ⊆ Ω and ε > 0 for x ∈ A, y ∈ B, and a probability measure ρ with ρ(B) = 1 where
1. For T_A = inf{t ≥ 0 : X_t ∈ A}, (∀z ∈ Ω)(P(T_A < ∞ | X_0 = z) > 0).
2. If x ∈ A and C ⊆ B, then P(X_1 ∈ C | X_0 = x) ≥ ε ρ(C).
In other words, a chain is a Harris chain if there is a subset of states A that the chain returns to with positive probability after a finite number of steps from any starting state. Moreover, if the state is somewhere in A, then with probability ε the next state does not depend on exactly which element of A the state is at.
The notions of recurrence and aperiodicity are needed to make the stationary distribution limiting as well. A Harris chain always has positive probability of returning to A. When the chance of returning is 1, then the chain is recurrent.
Definition 1.30. Let R = inf{n > 0 : X_n ∈ A}. A Harris chain is recurrent if for all x ∈ A, P(R < ∞ | X_0 = x) = 1. A Harris chain that is not recurrent is transient.
Given a recurrent Harris chain, suppose the state space can be partitioned into P_1, P_2, ..., P_k so that for a state in P_i, the chance of moving to P_{i+1} is 1. For a state in P_k, the next state is back in P_1. Then if k is the smallest number for which there is such a partition, call k the period of the chain. Intuitively, when k is 1, the chain is aperiodic. The following definition of aperiodicity is equivalent to this notion.
Definition 1.31. A recurrent Harris chain is aperiodic if for all x ∈ Ω, there exists n such that for all n′ ≥ n,

P(X_{n′} ∈ A | X_0 = x) > 0.  (1.10)
Theorem 1.3 (Ergodic Theorem for Harris chains). Let {X_t} be an aperiodic recurrent Harris chain with stationary distribution π. If P(R < ∞ | X_0 = x) = 1 for all x, then as t → ∞, for all measurable sets C and for all x,

|P(X_t ∈ C | X_0 = x) − π(C)| → 0.  (1.11)
This theorem is the heart of Markov chain Monte Carlo: for a given target distribution, build a recurrent, aperiodic Harris chain whose stationary distribution equals the target distribution. Then the limiting distribution of the Harris chain will be the target distribution. So by running the chain an infinite number of steps, a draw from the target distribution is made.
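A small numerical illustration of (1.11), reusing the two-state kernel from the stationarity sketch above: the rows of P^t, which hold P(X_t = j | X_0 = i), approach π as t grows.

```python
# Rows of P^t give P(X_t = j | X_0 = i); both rows approach pi = [0.75, 0.25].
P = [[0.9, 0.1],
     [0.3, 0.7]]

def matmul(A, B):
    """Multiply two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

Pt = P
for _ in range(50):
    Pt = matmul(Pt, P)
print(Pt)  # both rows close to [0.75, 0.25]
```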
Since very few practitioners actually have an infinite amount of time to spare, instead a finite number of steps is taken, and the user hopes that this is sufficient to make the state close to the limiting distribution. However, while it is often known that the distribution of the state of the chain converges to the stationary distribution, it is rare to be able to calculate how quickly that convergence occurs.
One of the few methods that exist for doing so is known as coupling.
Definition 1.32. Suppose {X_t} ~ ν_X and {Y_t} ~ ν_Y. Then a coupling of {X_t} and {Y_t} is a bivariate process {(X_t, Y_t)} such that {X_t} ~ ν_X and {Y_t} ~ ν_Y.
The idea here is that X_t and Y_t both have a specified distribution; perhaps both are Markov chains with a specified transition kernel, for instance. Then a coupling is a way of advancing both X_t and Y_t forward into the future at the same time while each, when viewed individually, is following the transition probabilities of the appropriate Markov chain. Say that the two Markov chains are coupled.
The word couple comes from the Latin for “to fasten together,” and this sense of fastening together remains the key to using coupled processes. It is important to recognize when the two processes cross paths; that is, when they are equal to one another.
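A minimal sketch of a coupling (an illustrative construction, not one from the text): two walks on {0,...,n} driven by the same ±1 moves, each sticking at a boundary when the move would leave the state space. Each walk alone follows the walk's transition probabilities, and once the two meet they stay together forever.

```python
import random

def coupled_step(x, y, n):
    """Advance both walks on {0,...,n} using the SAME random move; each
    walk alone moves +1 or -1 with probability 1/2, sticking at 0 and n."""
    move = random.choice([-1, 1])
    x = min(max(x + move, 0), n)
    y = min(max(y + move, 0), n)
    return x, y

x, y, n, steps = 0, 4, 4, 0
while x != y:
    x, y = coupled_step(x, y, n)
    steps += 1
print("walks coupled after", steps, "steps")
```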