2 Graph Cuts—Combinatorial Optimization in Vision

Many problems in computer vision, image processing, and computer graphics can be put into labeling problems [1]. In such a problem, an undirected graph is given as an abstraction of locations and their neighborhood structure, along with a set of labels. Then, the solutions to the problem is identified with labelings, or assignments of a label to each vertex in the graph. The problem is then to find the best labeling according to the criteria in the problem’s requirements. An energy is a translation of the criteria into a function that evaluates how good the given labeling is, so that smaller energy for a labeling means a better corresponding solution to the problem. Thus, the problem becomes an “energy minimization problem”. This separates the problem and the technique to solve it in a useful way by formulating the problem as an energy, it tends to make the problem more clearly defined, and also, once the problem is translated into an energy minimization problem, it can be solved using general algorithms.

In this chapter, we describe a class of combinatorial optimization methods for energy minimization called the graph cuts. It has become very popular in computer vision, image processing, and computer graphics, used for various tasks such as segmentation [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], motion segmentation [14, 15, 16, 17], stereo [18, 19, 20, 21, 22, 23, 24], texture and image synthesis [25, 26], image stitching [27, 28, 29, 30], object recognition [31, 32, 33], computing geodesics and modeling gradient flows of contours and surfaces [34, 35], computational photography [36], and motion magnification [37].

Minimization of such energies in general is known to be NP-hard [38]. Stochastic approximation methods, such as the iterated conditional modes (ICM) [39] and simulated annealing [40] have been used, but the former is prone to falling into local minima, while the latter, although the convergence to global minima is theoretically guaranteed, it is very slow in practice.

Graph cuts are methods utilizing the s-t mincut algorithms. Known in operations research for a long time [41], it was first introduced into image processing in the late 1980s [42, 43], when it was shown capable of exactly minimizing an energy devised to denoise binary images. At the time, the exact solution was compared to the results by annealing, which turned out to oversmooth the images [44, 43].

In late 1990s, graph cuts were introduced to the vision community [19, 45, 6, 46, 23]. This time the method was used for multilabel cases, and it was shown [19, 6] that energies with real numbers as labels can be exactly minimized if the prior is linear in terms of the absolute value of the difference between the labels at neighboring sites. This was later generalized to the construction for the case when the prior is any convex function of the difference [47]. Also, approximation algorithms for the energies with for arbitrary label sets and Potts priors were introduced [19, 38], which lead to the current popularity.

At first, there was some duplication of efforts in the vision community. The condition for the binary energy to be exactly solved, long known in the OR community as the submodularity condition, was rediscovered [48]. The “linear” case mentioned above was also devised without the knowledge that a similar method had been used for the problem of task assignment to multiple processors [49]. Now, methods long known in the OR community have been introduced to vision, such as the BHS (a.k.a. QPBO / roof duality) algorithm [50, 51, 52, 53] that proved very useful.

Other new minimization methods such as belief propagation (BP) [54, 55, 56] and the tree-reweighted message passing (TRW) [57, 58]) have also been introduced to vision. An experimental comparison [59] of these methods with graph cuts were conducted using stereo and segmentation as examples. It showed that these new methods are much superior to ICM, which was used as the representative of the older methods. When the neighborhood structure was relatively sparse, such as the 4-neighbors, TRW performed better than graph cuts. However, according to another study [60], when the connectivity is larger, as in the case of stereo models with occlusion, graph cuts considerably outperformed both BP and TRW in both the minimum energy and the error compared to the ground truth.

2.1.1 Chapter Summary

We first introduce the energy minimization formalisms used in the image-related areas (§2.2). Next (§2.3), we describe the graph cut algorithm in the case of binary labels, which is the most fundamental case where the energy-minimization problem directly translates to the minimum-cut problem, and global minima can be computed in a polynomial time under certain condition called submodularity. For the binary case, we also describe the BHS algorithm, which can partially minimize even the energies that do not satisfy the submodularity condition, as well as the reduction of higher-order energies to first order. Finally (§2.4), we describe graph cuts to minimize energies when there are more than two labels. One approach to this is to translate multilabel energies into binary ones. In some cases, the resulting problem is globally optimizable. Another way is the approximate algorithms to solve the multilabel problems by iteratively solving binary problems with graph cuts.

2.2 Markov Random Field

2.2.1 Labeling Problem: A Simple Example

As the simplest example, we consider the denoising problem of binary (black-and-white) images. Such an image consists of pixels that can be black or white, arranged in a rectangle. This is an example of labeling. Each pixel in the image has a color; we consider this as assigning a label—0 for white and 1 for black—to each pixel. A labeling is thus a map from the set of pixels to the set of labels. Therefore, every image is a labeling. Thus, the denoising problem is that of finding the “least noisy” labeling.

FIGURE 2.1
A labeling is an assignment of a label to each pixel. A binary image is a labeling with two labels. The image on the left is a labeling, as shown on the right.

More formally, let us denote the set of pixels by $P$ and the set of labels by $ℒ$ . For example, in the case of the binary image denoising, $P = {(i, j) | i = 1, \dots, n; j = 1, \dots, m}$ and $ℒ = {0, 1}$ , assuming that the image has n × m pixels. Then a labeling is a map

$X : P \to ℒ .$

(2.1)

Since the actual coordinates are not needed in the discussion, in the following we denote the pixels by single letters such as $p, q \in P$ . The label that is assigned to pixel p by a labeling X is denoted by X_p:

$X : P ∋ p \mapsto X_{p} \in ℒ .$

(2.2)

Figure 2.1 depicts an example of binary image and the labeling representing it.

For a set $P$ of pixels and a set $ℒ$ of labels, we denote the set of all labelings by $ℒ^{P}$ . There are ${| ℒ |}^{| P |}$ possible such labelings; thus, the number of possible solutions to the denoising problem is exponential in terms of the number of pixels. For a labeling $X \in ℒ^{P}$ and any subset $Q \subset P$ , the labeling on $Q$ defined by restricting X to $Q$ is denoted by $X_{Q} \in ℒ^{Q}$ .

FIGURE 2.2
(a) Counting the number of pixels that has different colors for X and Y. (b) Counting neighboring pairs of pixels in X that have different colors.

2.2.1.1 Denoising Problem

In the denoising problem, we are given one labeling, the noisy image, and somehow find another labeling, the denoised image. The input and the output are related as follows:

1. They are the “same” image.

2. The output has less noise than the input.

To denoise an image by energy minimization, we need to know what these mean in terms of the labeling. We translate these notions into an energy

$E : ℒ^{P} \to ℝ,$

(2.3)

which is a function on the set $ℒ^{P}$ of labelings that gives a real number to each labeling. The real number E(X) that is given a labeling X is called its energy. The energy evaluates how well the conditions above are met by a labeling. According to the problem at hand, we define the energy so that it gives a smaller value to a labeling when it is better according to some criteria that we consider reasonable or that we have observed to produce good results. Then we use an algorithm, such as graph cuts, to minimize the energy, i.e., the algorithm finds a labeling with as small an energy as possible. With graph cuts, sometimes the global minimum of the energy can be found; that is, we can actually find the labeling that gives the smallest energy among all the exponential number of possible labelings in $ℒ^{P}$ , or at least one of such minima.

2.2.1.2 Image Similarity

Suppose we are given a noisy image, or a labeling, Y, and that we would like to find the denoised version X. First, we consider the condition (1) above. That is, the noisy image Y and the denoised image X are supposed to be the same image. Since we cannot take this as meaning they are exactly the same (then we cannot reduce the noise), we quantify the difference between the two images and assume the condition means that the difference is small. How do we quantify the difference between two images? A popular method is to take the square of the L² norm:

$\sum_{p \in P} {(X_{p} - Y_{p})}^{2} .$

(2.4)

Another oft-used measure of the difference between two images is the sum of absolute difference:

$\sum_{p \in P} | X_{p} - Y_{p} |,$

(2.5)

In our case, the two measures coincide, since the label is either 0 or 1. They both mean the number of the pixels that have different labels between X and Y, as illustrated in Figure 2.2(a).

2.2.1.3 Denoising Criterion

If we want to find the labeling X that minimize (2.4) or (2.5), it is easy: we take X = Y. This is natural, since we do not have any reason to arrive at anything else so far. To make X the denoised version of Y, we need to have some criterion to say some images are noisier than others. If we have the original image before noise was introduced, we can say that the closer the image is to the original, the less noisier it is. However, if we had the original, we don’t need to denoise it but that is not usually the case. So we need a way to tell the level of noise in an image without comparing it with the original. If it is in the sense above, the closeness to the original, that is clearly impossible, since the original might be very “noisy” to start with. Nevertheless, we have a sense that some images are noisy and others are not, which presumably comes from the statistics of images that we encounter in life. Thus, knowing the statistics of images helps. Also, there are different kinds of noises that are likely to be introduced in practice whose statistics we know. If we know what caused the noise in the image, we can use the information to reduce the noise.

Here, however, we are using the denoising problem only as an example; so we simply assume that the noisier the image is, the rougher it is, that is, there are more changes in color between neighboring pixels.

Accordingly, to evaluate the roughness of image X, we count the number of changes in color between neighboring pixels (see Figure 2.2(b)) as follows:

$\sum_{(p, q) \in N} {(X_{p} - X_{q})}^{2} .$

(2.6)

Here, $N \subset P \times P$ is the set of neighboring pairs of pixels. We assume that $N$ is symmetric, i.e., $(q, p) \in N$ if $(p, q) \in N$ . We also assume that $(p, p) \notin N$ for any $p \in P$ .

The image that satisfies the “less noise” condition (2) the most, which minimizes the quantity (2.6), is a constant image, either all black or all white. On the other hand, the condition (1) is perfectly satisfied if we set X = Y, as mentioned above. Thus, the two conditions conflict unless Y is a constant image, leading to a need of a tradeoff. The heart of the energy minimization methods is that we attempt to treat such a tradeoff in a principled way by defining it as minimization of an energy. That is, we translate various factors in the tradeoff as terms which are added up to form an energy. The relative importance of each factor can be controlled by multiplying it by a weight.

FIGURE 2.3
Denoising a binary image. Given the noisy image Y on the left, find the image X that is close to Y and has as few change as possible of labels between neighboring pixels. Since the two conditions usually conflict, we need to find a trade off. We encode it as the minimization of the energy (2.7).

In our case, we define the energy as a weighted sum of (2.4) and (2.6):

$E (X) = \sum_{p \in P} λ {(X_{p} - Y_{p})}^{2} + \sum_{(p, q) \in N} κ {(X_{p} - X_{q})}^{2}$

(2.7)

We consider the energy a function of X, supposing Y is fixed. Here, λ and κ are positive weights. Although the two weights are fixed globally in this example, it can be varied from pixel to pixel and pair to pair, depending on the data Y and other factors; for instance, the smoothing factor κ is often changed according to the contrast in Y between the neighboring pixels. Figure 2.3 illustrates the tradeoff represented by the energy.

Assume that X minimizes E(X). For each pixel p, X_p = Y_p makes the first sum smaller. On the other hand, a smoother X without too many changes between neighboring pixels leads to a lesser second sum. Let us assume that λ = κ. If, for instance, there is an “isolated” pixel p such that Y assigns all its neighbors the opposite value from Y_p, the total energy would be smaller when the value X_p at p is the same with the neighbors, rather than choosing X_p = Y_p. Thus, minimizing E(X) represents an effort to make X as close as possible to Y while fulfilling the condition that the values between neighboring pixels do not change too much. The tradeoff between the two conditions are controlled by the relative magnitude of the parameters λ and κ.

2.2.2 Markov Random Field

The energy minimization problem as above can be considered a maximum a posteriori (MAP) estimation problem of a Markov random field (MRF).

A Markov random field is a stochastic model with a number of random variables that are arranged in an undirected graph, so that each node has one variable. Let $P$ and $N$ be the sets of pixels and neighbors, as before. We will call the elements of $P$ “sites”; they represent the notion of location in the problem. Since we are assuming $N$ is symmetric, we can consider $(P, N)$ an undirected graph. The sites most commonly correspond to the pixels in the images, especially in low-level problems such as the denoising example above. However, they sometimes represent other objects such as blocks in the images and segments (so-called superpixels) obtained by a quick oversegmentation of images.

A set of random variables $X = {(X_{p})}_{p \in P}$ is a Markov random field if its probability density function can be written as:

$P (X) = \prod_{C \in C} φ_{C} (X_{C}),$

(2.8)

where $C$ is the set of cliques of the undirected graph $(P, N)$ , and each function φ_C depends only on the vector of variables X_C = (X_p)_p∈C.

Assume that the random variables in X all take values in the same set $ℒ$ . Then, we can think the set of random variables to be a random variable taking a value in the set $ℒ^{P}$ of labelings. Thus, an MRF can be thought of as a random variable with values in the set of labelings whose probability density function (2.8) satisfies the condition above.

2.2.2.1 Order of an MRF

When the probability density function P(X) in (2.8) can be written as

$P (X) = \prod_{C \in C, | C \leq k + 1 |} φ_{C} (X_{C}),$

(2.9)

with only the functions depending on the variables for cliques with at most k + 1 sites, it is called a k’th order MRF or an MRF of order k. For instance, an MRF of first order can be written as:

$P (X) = \prod_{p \in P} φ_{p} (X_{p}) \prod_{(p, q) \in N} φ_{{p, q}} (X_{{p, q}}),$

(2.10)

that depends only on the functions for cliques of size 1 and 2, i.e., sites and neighboring pairs of sites. An MRF of second order can be written as:

$P (X) = \prod_{p \in P} φ_{p} (X_{p}) \prod_{(p, q) \in N} φ_{{p, q}} (X_{{p, q}}) \prod_{{p, q, r} clique} φ_{{p, q, r}} (X_{{p, q, r}}) .$

(2.11)

Remember that the notation X_{{p, q, r}} denotes the vector (X_p, X_q, X_r) of variables.

Note that there is an ambiguity in the factorization of the density function, since lower-order functions can be a part of a higher-order function. We can reduce this ambiguity somewhat by restricting $C$ to be the set of maximal cliques. However, in practice we define the functions concretely; so there is no problem of ambiguity, and we would like to define them in a natural way. Therefore, we leave $C$ to mean the set of all cliques.

2.2.2.2 Energy of an MRF

If we define in (2.8)

$f_{C} (X_{C}) = - \log φ_{C} (X_{C}),$

(2.12)

then maximizing P(X) in (2.8) is equivalent to minimizing

$E (X) = \sum_{C \in C} f_{C} (X_{C}),$

(2.13)

This function that assigns a real number E(X) to a labeling X is called the energy of the MRF. An MRF is often defined with an energy instead of a density function. Given the energy (2.13), the density function (2.8) is obtained as

$p (X) = e^{- E (X)} = \prod_{C \in C} e^{- f_{C} (X_{C})} .$

(2.14)

This form of probability density function is called the Gibbs distribution.

The first-order MRF in (2.10) has the energy

$E (X) = \sum_{p \in P} g_{p} (X_{p}) + \sum_{(p, q) \in N} h_{p q} (X_{p}, X_{q}),$

(2.15)

where

$g_{p} (X_{p}) = - \log φ_{p} (X_{p}),$

(2.16)

$h_{p q} (X_{p}, X_{q}) = h_{q p} (X_{q}, X_{p}) = - \log φ_{{p, q}} (X_{{p, q}}) .$

(2.17)

Most of the use of graph cuts have been for minimizing the energy of the form (2.15). Each term g_p(X_p) in the first sum in (2.15) depends only on the label assigned to each site. The first sum is called the data term, because the most direct influence of data such as the given image, in deciding the label to assign to each site, often manifests itself through g_p(X_p). For instance, in the case of the denoising problem in the previous section, this term influences the labeling X in the direction of making it closer to Y through the definition:

$g_{p} (X_{p}) = λ {(X_{p} - Y_{p})}^{2} .$

(2.18)

Thus, the first term of (2.15) represents the data influence on the optimization process.

The second sum, on the other hand, reflects the prior assumption on the wanted outcome, such as less noise. In particular, it dictates the desirable property of the labels given to neighboring sites. Because this often means that there is less label change between neighboring sites, this term is called the smoothing term. In the case of the previous section, it is defined by:

$h_{p q} (X_{p}, X_{q}) = κ {(X_{p} - X_{q})}^{2} .$

(2.19)

Sometimes, it is also called the prior term. The reason is that it often encodes the prior probability distribution, as will be explained next.

2.2.2.3 Bayesian Inference

Assume that we wish to estimate an unobserved parameter X, on the basis of observed data Y. The maximum a posteriori (MAP) estimate of X is its value that maximizes the posterior probability P(X|Y), which is the conditional probability of X given Y. It is often used in the area of statistical inference, where some hidden variable is inferred from an observation. For instance, the denoising problem in the previous section can be thought of as such an inference.

The posterior probability P(X|Y) is often derived from the prior probability P(X) and the likelihood P(Y|X). The prior P(X) is the probability of X when there is no other condition or information. The likelihood P(Y|X) is the conditional probability that gives the data probability under the condition that the unobserved variable in fact has value X. From these, the posterior is obtained by Bayes’ rule:

$P (X | Y) = \frac{P (Y | X) P (X)}{P (Y)} .$

(2.20)

Suppose that we have an undirected graph $(P, N)$ and a set $L$ of labels. Also suppose that the unobserved parameter X takes as value a labeling, and we have a simple probabilistic model given by the following:

$P (Y | X) = \frac{1}{Z_{1} (Y)} \prod_{p \in P} φ_{p}^{Y} (X_{p}),$

(2.21)

$P (X) = \frac{1}{Z_{2}} \prod_{(p, q) \in N} φ_{p q} (X_{p}, X_{q}),$

(2.22)

where

$Z_{1} (Y) = \sum_{X \in ℒ^{P}} \prod_{p \in P} φ_{p}^{Y} (X_{p}),$

(2.23)

$Z_{2} = \sum_{X \in ℒ^{P}} \prod_{(p, q) \in N} φ_{p q} (X_{p}, X_{q}) .$

(2.24)

That is, the likelihood P(Y|X) of the data Y given the parameter X is given in the form of the product of independent local distributions $φ_{p}^{Y} (X_{p})$ for each site p that depends only on the label X_p, while the prior probability P(X) is defined as a product of local joint distributions φ_pq(X_p, X_q) depending on the labels X_p and X_q that the neighboring sites p and q are given.

By (2.20), it follows

$P (X | Y) = \frac{1}{Z_{1} (Y) Z_{2}} \prod_{p \in P} φ_{p}^{Y} (X_{p}) \prod_{(p, q) \in N} φ_{p q} (X_{p}, X_{q}) .$

(2.25)

If we fix the observed data Y, this makes it a first-order MRF. Thus, maximizing the posterior probability P(X|Y) with a fixed Y is equivalent to the energy minimization problem of an MRF. For instance, our example energy minimization formulation of the denoising problem can be considered a MAP estimation problem with the likelihood and the prior:

$P (Y | X) = \frac{1}{Z_{1} (Y)} \prod_{p \in P} e^{- λ {(X_{p} - Y_{p})}^{2}},$

(2.26)

$P (X) = \frac{1}{Z_{2}} \prod_{(p, q) \in N} e^{- κ {(X_{p} - X_{q})}^{2}},$

(2.27)

which is consistent with a Gaussian noise model.

2.3 Basic Graph Cuts: Binary Labels

In this section, we discuss the basics of graph-cut algorithms, namely the case where the set of labels is binary: $ℒ = B = {0, 1}$ . The multiple label graph cuts depend on this basic case. As it is based on the algorithms to solve the s-t minimum cut problem, we start with a discussion on them. For details on the minimum cut and maximum flow algorithms, see textbooks such as [61, 62, 63].

2.3.1 Minimum-Cut Algorithms

Consider a directed graph $G = (V, ℰ)$ . Each edge e_i→j from υ_i to υ_j has weight w_ij. For the sake of notational simplicity, in the following we assume every pair of nodes $υ_{i}, υ_{j} \in V$ has weight w_ij but w_ij = 0 if there is no edge e_i→j.

Let us choose two nodes $s, t \in V$ ; a cut of $G$ with respect to the pair (s, t) is a partition of $V$ into two sets $S \subset V$ and $T = V S$ such that $s \in S, t \in T$ (Figure 2.4). In the following, we denote this cut by $(S, T)$ . Then, the $c o s t c (S, T)$ of cut $(S, T)$ is the sum of the weights of the edges that go from $S$ to $T$ :

FIGURE 2.4
(a) A cut of $G$ with respect to (s, t) divides the nodes into two sides. The cost of a cut is the total sum of the weights of the directed edges going from the s side $(S)$ to the t side $(T)$ . The cost of the depicted cut is 10, the sum of the weights of the edges in the cut, indicated by thick arrows. (b) Another example: the cost of the cut is 3.

$c (S, T) = \sum_{υ_{i} \in S, υ_{j} \in T} w_{i j} .$

(2.28)

Note that the directed edges going the other direction, from $T$ to $S$ , do not count. An edge that contributes to the cost is said to be in the cut. A minimum cut is a cut with the minimum cost, and the minimum cut problem seeks to find such a cut for given the triple $(G, s, t)$ .

The minimum cut problem is well known to be equivalent to the maximum flow problem [64, 65]. When all edge weights are nonnegative, these problems are known to be tractable, which is the source of the usefulness of graph cuts. There are three groups of algorithms known: the ‘augmenting path’ algorithms [64, 66], the ‘push-relabel’ algorithms [67, 68, 69], and the ‘network simplex’ algorithms [70]. Although asymptotically the push-relabel algorithms are the fastest, for typical image-related applications, actual performance seems to indicate that the augmenting path algorithms are the fastest. Boykov and Kolmogorov [71] experimentally compared the Dinic algorithm [72], which is a kind of augmenting path algorithm, a push-relabel algorithm, and their own augmenting path algorithm (the BK algorithm) in image restoration, stereo, and image segmentation applications to find the BK algorithm to be the fastest. This is the maximum-flow algorithm most used for graph cuts in image-related applications. There is also a push-relabel algorithm that was specifically developed for large regular grid graphs [73] used in higher-dimensional image processing problems, as well as algorithms to speed up the repeated minimization of an energy with small modifications [74, 75, 76], as commonly happens in movie segmentation.

FIGURE 2.5
(a) The graph for binary MRF minimization. The edges in the cut are depicted as thick arrows. Each node other than υ_s and υ_t corresponds to a site. If a cut $(S, T)$ places a node in $S$ , the corresponding site is labeled 0; if it is in $T$ , the site is labeled 1. The 0’s and 1’s at the bottom indicate the label each site is assigned. Here, the sites are arranged in 1D; but according to the neighborhood structure this can be any dimension as shown in (b) and (c).

2.3.2 The Graph Cuts

The graph-cut algorithms utilize the minimum cut (maximum flow) algorithms to minimize MRF energies. We construct a graph with special nodes s and t, as well as a node for each site. Then for a cut $(S, T)$ of the graph with respect to the pair (s, t), each node must be either in $S$ or $T$ ; we interpret this as the site being labeled 0 or 1. That is, we define a one to one correspondence between the set of all labeling and that of all cuts.

Consider the first-order MRF energy (2.15), shown here again:

$E (X) = \sum_{p \in P} g_{p} (X_{p}) + \sum_{(p, q) \in N} h_{p q} (X_{p}, X_{q})$

(2.29)

We construct a directed graph $G = (V, ℰ)$ corresponding to this energy (Figure 2.5). The node set $V$ contains the special nodes υ_s and υ_t (for consistency with the notation for graph edges in this book, we denote what we called s and t by υ_s and υ_t, respectively), as well as a node corresponding to each site p in $P$ :

$V = {υ_{s}, υ_{t}} \cup {υ_{p} | p \in P} .$

(2.30)

From and to each node υ_p corresponding to a site $p \in P$ , we add two edges, and also edges between nodes corresponding to neighboring sites:

$ℰ = {e_{s \to p} | p \in P} \cup {e_{p \to t} | p \in P} \cup {e_{p \to q} | (p, q) \in N} .$

(2.31)

FIGURE 2.6
Weights for binary MRF minimization for the energy in the example in Section 2.2.1. The numbers at the bottom indicates the labels assigned to the sites above. The thick arrows indicate the edges in the cut. The cut separates the nodes υ_p and υ_t; thus, the site p is assigned the label 0. The weight is exactly g_p(0), which is added to the cost of the cut. Similarly, the site q is assigned 1 and the cost is added g_q(1). When two neighboring sites are assigned different labels, an edge between the two is cut. Here, the labels differ between p and q; thus, the weight h_pq(0, 1) of the edge e_p→q is added to the cost of the cut.

Now, we define a correspondence between a cut $(S, T)$ of $G$ with respect to (υ_s, υ_t) and a labeling $X \in ℒ^{P}$ as follows:

$X_{p} = {\begin{array}{l} 0 & (if υ_{p} \in S) \\ 1 & (if υ_{p} \in T) . \end{array}$

(2.32)

That is X assigns 0 to site p if the node υ_p corresponding to p is on the υ_s side, and 1 if it is on the υ_t side. This way, there is a one-to-one correspondence between cuts of $G$ with respect to (υ_s, υ_t) and labelings in $ℒ^{P}$ .

The minimum-cut algorithm can find a cut with the minimum cost efficiently; thus, in order to find a labeling that minimizes the energy (2.15), we want to define a weight of $G$ so that the minimum cut corresponds to the minimum energy.

We consider the example in Section 2.2.1, with the energy terms (2.18) and (2.19), and define:

$w_{s p} = g_{p} (1),$

(2.33)

$w_{p t} = g_{p} (0),$

(2.34)

$w_{p q} = h_{p q} (0, 1) .$

(2.35)

Consider the labeling X corresponding to a cut $(S, T)$ of $G$ . First, note the data term that depends only on one site. If $υ_{p} \in S$ , then by (2.32) X_p = 0 and, as the edge e_p→t is in the cut, by (2.34) w_pt = g_p(0) = g_p(X_p) is added to the cost of the cut. If $υ_{p} \in T$ , then X_p = 1 and, as the edge e_s→p is in the cut, by (2.33) w_sp = g_p(1) = g_p(X_p) is added to the cost. Either way, g_p(X_p) is added to the cost of the cut.

FIGURE 2.7
Smoothing term and cost.

Next, we consider the smoothing term depending on two neighboring sites. For a neighboring pair of sites p and q, if $υ_{p} \in S, υ_{q} \in T$ , then the edge e_p→q is in the cut. Thus, by (2.35) h_pq(0, 1) = κ is added to the cut cost. By (2.32), it corresponds to (X_p, X_q) = (0, 1) and the added weight to the cost coincides with the smoothing term that should be added to the energy in such combination of assignment. When both υ_p and υ_q belong to either of $S$ and $T$ , the increase in cost is zero, corresponding to the energy h_pq(0, 0) = h_pq(1, 1) = 0.

This way, the energy for the labeling X precisely coincides with the cost of the corresponding cut. This construction is due to [43].

2.3.3 Submodularity

Thus, by finding the minimum-cut, we can find the global minimum of the energy (2.7) for our image denoising example, as well as the labeling that attains it. Can we generalize this to more general form of energy? The limiting factor is that the edge weights must be nonnegative in order to benefit from the polynomial-time minimum-cut algorithms: with the above construction, g_p(X_p) and h_pq(X_p, X_q) must all be nonnegative in (2.15). Also, it assumes h_pq(0, 0) = h_pq(1, 1) = 0.

However, there is still a little leeway in choosing edge weights. For instance, each υ_p must belong to either $S$ or $T$ ; so one of the edges e_s→p and e_p→t is always in the cut. Therefore, adding the same value to w_sp and w_pt does not affect the choice of which edge to cut. This means that g_p(X_p) can take any number because, if we have negative weight after defining w_sp and w_pt according to (2.33) and (2.34), we can add − min{w_sp, w_pt} to w_sp and w_pt to make both nonnegative. In fact, one of w_sp and w_pt can always be set to zero this way.

So the data term is arbitrary. How about the smoothing term? For a pair p, q of sites, let us define

$\begin{matrix} w_{q t} = a, & w_{p q} = b, & w_{s p} = c, \end{matrix}$

(2.36)

as shown in Figure 2.7. Then the cut costs for the four possible combinations of (X_p, X_q) are:

$\begin{matrix} h_{p q} (0, 0) = a, & h_{p q} (0, 1) = b, & h_{p q} (1, 0) = a + c, & h_{p q} (1, 1) = c . \end{matrix}$

(2.37)

Then we have

$h_{p q} (0, 1) + h_{p q} (1, 0) - h_{p q} (0, 0) - h_{p q} (1, 1) = b .$

(2.38)

Since the right-hand side is an edge weight, it must be nonnegative for the polynomial-time algorithms to be applicable. This is known as the submodularity condition:

$h_{p q} (0, 1) + h_{p q} (1, 0) - h_{p q} (0, 0) - h_{p q} (1, 1) \geq 0.$

(2.39)

Conversely, if this condition is met, we can set

$a = h_{p q} (1, 0) - h_{p q} (1, 1),$

(2.40)

$b = h_{p q} (0, 1) + h_{p q} (1, 0) - h_{p q} (0, 0) - h_{p q} (1, 1),$

(2.41)

$c = h_{p q} (1, 0) - h_{p q} (0, 0) .$

(2.42)

Then we have the cut cost for each combination as follows:

$(X_{p}, X_{q}) = (0, 0) : h_{p q} (1, 0) - h_{p q} (1, 1),$

(2.43)

$(X_{p}, X_{q}) = (0, 1) : h_{p q} (0, 1) + h_{p q} (1, 0) - h_{p q} (0, 0) - h_{p q} (1, 1),$

(2.44)

$(X_{p}, X_{q}) = (1, 0) : h_{p q} (1, 0) - h_{p q} (1, 1) + h_{p q} (1, 0) - h_{p q} (0, 0),$

(2.45)

$(X_{p}, X_{q}) = (1, 1) : h_{p q} (1, 0) - h_{p q} (0, 0) .$

(2.46)

If we add h_pq (0, 0) + h_pq (1, 1) − h_pq (1, 0) to the four cost values, we see each equals h_pq(X_p, X_q). Since we are adding the same value to the four possible outcome of the pair, it does not affect the relative advantage among the alternatives. Thus, the minimum cut still corresponds to the labeling with the minimum energy.

We can do this for each neighboring pair. If all such pair p, q satisfies (2.39), we define:

$w_{s q} = g_{q} (1) + \sum_{(p, q) \in N} {h_{p q} (1, 0) - h_{p q} (0, 0)},$

(2.47)

$w_{q t} = g_{q} (0) + \sum_{(p, q) \in N} {h_{p q} (1, 0) - h_{p q} (1, 1)},$

(2.48)

$w_{p q} = h_{p q} (0, 1) + h_{p q} (1, 0) - h_{p q} (0, 0) - h_{p q} (1, 1) .$

(2.49)

After that, if any edge weight w_sq or w_qt is negative, we can make it nonnegative as explained above.

In this way, we can make all the edge weights nonnegative. And from the construction, the minimum cut of this graph corresponds, by (2.32), to a labeling that minimizes the energy (2.15). The basic construction above for submodular first-order case has been known for at least 45 years [41].

2.3.4 Nonsubmodular Case: The BHS Algorithm

When the energy is not submodular, there are also known [53, 50] algorithms that give a partial solution, which were introduced to the image-related community, variously called as the QPBO algorithm [77] or roof duality [78]. Here, we describe the more efficient of the two algorithms, the BHS algorithm [50].

The solution of the algorithms has the following property:

1. The solution is a partial labeling, i.e., a labeling $X \in B^{Q}$ on a subset $Q$ of $P$ .

2. Given an arbitrary labeling $Y \in B^{P}$ , the output labeling $X \in B^{Q}$ has the property that, if we “overwrite” Y with X, the energy for the resulting labeling is not higher than that for the original labeling. Let us denote by Y ⊳ X the labeling obtained by overwriting Y by X:

${(Y ᐊ X)}_{p} = {\begin{array}{l} X_{p} & if p \in Q \\ Y_{p} & otherwise . \end{array}$

(2.50)

Then, this autarky property means E(Y ⊳ X) ≥ E(Y). If we take as Y a global minimum, then Y ⊳ X is also a global minimum. Thus, we see that the partial labeling X is always a part of a global minimum labeling, i.e., we can assign 0 or 1 to each site outside of $Q$ to make it into the global minimum labeling Y ⊳ X. It also means that if $Q = P$ , X is a global minimum labeling.

3. If the energy is submodular, all sites are labeled, giving a global minimum solution.

How many of the sites are labeled when the energy is not submodular is up to the individual problem.

Here, we give the graph construction for the BHS algorithm, following [77]. First, we reparametrize the given energy, i.e., modify it without changing the minimization problem, into the normal form. The energy (2.15) is said to be in normal form if

1. For each site p, min{g_p(0), g_p(1)} = 0,

2. For each $(p, q) \in N$ and x = 0, 1, min{h_pq(0, x), h_pq(1, x)} = 0.

Note that the normal form is not unique. Beginning with any energy, we can reparametrize it into the normal form with the following algorithm:

1. While there is any combination of $(p, q) \in N$ and x ∈ {0, 1} that does not satisfy the second condition, repeatedly redefine

$h_{p q} (0, x) \leftarrow h_{p q} (0, x) - δ,$

(2.51)

$h_{p q} (1, x) \leftarrow h_{p q} (1, x) - δ,$

(2.52)

$g_{q} (x) \leftarrow g_{q} (x) + δ,$

(2.53)

where δ = min{h_pq(0, x), h_pq(1, x)}.

FIGURE 2.8
The graph construction for the BHS algorithm. The thick arrows indicate the edges in the cut. The graph has, in addition to υ_s and υ_t, two nodes υ_p and $υ_{\bar{p}}$ for each site p. We define a correspondence between the cuts of this graph and the output labeling as in (2.61). The assignment is shown at the bottom.

2. For each $p \in P$ , redefine

$g_{p} (0) \leftarrow g_{p} (0) - δ,$

(2.54)

$g_{p} (1) \leftarrow g_{p} (1) - δ,$

(2.55)

where δ = min{g_p(0), g_p(1)}.

Next, we build a graph $G = (V, ℰ)$ as follows (Figure 2.8). Define two nodes for each site plus the two special nodes:

$V = {υ_{s}, υ_{t}} \cup {υ_{p}, υ_{\bar{p}} | p \in P} .$

(2.56)

There are four edges for each site, and four edges for each neighboring pair of sites:

$ℰ = {e_{s \to p}, e_{s \to \bar{p}}, e_{p \to t}, e_{\bar{p} \to t} | p \in P} \cup {e_{p \to q}, e_{p \to \bar{q}}, e_{\bar{p} \to q}, e_{\bar{p} \to \bar{q}} | (p, q) \in N} .$

(2.57)

Edge weights are defined as follows:

$\begin{matrix} w_{s p} = w_{\bar{p} t} = \frac{1}{2} g_{p} (1), & w_{p t} = w_{s \bar{p}} = \frac{1}{2} g_{p} (0), \end{matrix}$

(2.58)

$\begin{matrix} w_{p q} = w_{\bar{q} \bar{p}} = \frac{1}{2} h_{p q} (0, 1), & w_{q p} = w_{\bar{p} \bar{q}} = \frac{1}{2} h_{p q} (1, 0), \end{matrix}$

(2.59)

$\begin{matrix} w_{p \bar{q}} = w_{q \bar{p}} = \frac{1}{2} h_{p q} (0, 0), & w_{\bar{q} p} = w_{\bar{p} q} = \frac{1}{2} h_{p q} (1, 1) . \end{matrix}$

(2.60)

For this graph, we find a minimum cut $(S, T)$ and interpret it as follows:

${\begin{array}{l} p \in Q and X_{p} = 0 & if υ_{p} \in S, υ_{\bar{p}} \in T \\ p \in Q and X_{p} = 1 & if υ_{p} \in T, υ_{\bar{p}} \in S, \\ p \notin Q & otherwise . \end{array}$

(2.61)

As mentioned above, it depends on the actual problem to what extent this algorithm is effective. However, it is worth trying when dealing with nonsubmodular energies, as the construction is straightforward as above. There are also some techniques dealing with the sites that are unlabeled after the use of the algorithm [78].

2.3.5 Reduction of Higher-Order Energies

So far, the energies we dealt with have been of first order (2.15). Also in the vision literature, until recently most problems have been represented in terms of first-order energies, with a few exceptions that consider second-order ones [79, 48, 80]. Limiting the order to one restricts the representational power of the models, the rich statistics of natural scenes cannot be captured by such limited potentials. Higher-order energies can model more complex interactions and reflect the natural statistics better. There are also other reasons, such as enforcing connectivity [81] or histogram [82] in segmentation, for the need of optimizing higher-order energies. This has long been realized [83, 84, 85], and recent developments [86, 87, 88, 89, 90, 91], including various useful special cases of energies are discussed in the next chapter.

Here, we discuss the general binary case and introduce a method [92, 93] to transform general binary MRF minimization problems to the first-order case, for which the preceding methods can be used, at the cost of increasing the number of the binary variables. It is a generalization of earlier methods [48, 94] that allowed the reduction of second-order energies to first order.

An MRF energy that has labels in $B = {0, 1}$ is a function of binary variables

$f : B^{n} \to ℝ,$

(2.62)

where $n = | P |$ is the number of sites/variables. Such functions are called pseudo-Boolean functions (PBFs.) Any PBF can be uniquely represented as a polynomial of the form

$f (x_{1}, \dots, x_{n}) = \sum_{S \subset J} c_{S} \prod_{i \in S} x_{i},$

(2.63)

where $J = {1, \dots, n}$ and $c_{S} \in ℝ$ . We refer the readers to [95, 96] for proof. Combined with the definition of the order, this implies that any binary-labeled MRF of order d − 1 can be represented as a polynomial of degree d. Thus, the problem of reducing higher-order binary-labeled MRFs to first order is equivalent to that of reducing general pseudo-Boolean functions to quadratic ones.

Consider a cubic PBF of variables x, y, z, with a coefficient $a \in ℝ$ :

$f (x, y, z) = a x y z .$

(2.64)

The reduction by [48, 94] is based on the following identity:

$x y z = \max_{w \in B} w (x + y + z - 2) .$

(2.65)

If a < 0,

$a x y z = \min_{w \in B} a w (x + y + z - 2) .$

(2.66)

Thus, whenever axyz appears in a minimization problem with a < 0, it can be replaced by aw(x + y + z − 2).

If a > 0, we flip the variables (i.e., replace x by 1 − x, y by 1 − y, and z by 1 − z) of (2.65) and consider

$(1 - x) (1 - y) (1 - z) = \max_{w \in B} w (1 - x + 1 - y + 1 - z - 2) .$

(2.67)

This is simplified to

$x y z = \min_{w \in B} w (x + y + z - 1) + (x y + y z + z x) - (x + y + z) + 1.$

(2.68)

Therefore, if axyz appears in a minimization problem with a > 0, it can be replaced by

$a {w (x + y + z - 1) + (x y + y z + z x) - (x + y + z) + 1} .$

(2.69)

Thus, either case, the cubic term can be replaced by quadratic terms. As any binary MRF of second order can be written as a cubic polynomial, each cubic monomial in the expanded polynomial can be converted to a quadratic polynomial using one of the formulae above, making the whole energy quadratic, i.e., of the first order. Of course, it does not come free; as we can see, the number of the variables increases.

This reduction works in the same way for higher order if a < 0:

$a x_{1} \dots x_{d} = \min_{w \in B} a w {x_{1} + \dots + x_{d} - (d - 1)} .$

(2.70)

If a > 0, a different formula can be used [92, 93]:

$a x_{1} \dots x_{d} = a \min_{w_{1}, \dots, w_{n_{d}} \in B} \sum_{i = 1}^{n_{d}} w_{i} (c_{i, d} (- S_{1} + 2 i) - 1) + a S_{2},$

(2.71)

where

$\begin{matrix} S_{1} = \sum_{i = 1}^{d} x_{i}, & S_{2} = \sum_{i = 1}^{d - 1} \sum_{j = i + 1}^{d} x_{i} x_{j} = \frac{S_{1} (S_{1} - 1)}{2} . \end{matrix}$

(2.72)

are the elementary symmetric polynomials in these variables and

$\begin{matrix} n_{d} = ⌊ \frac{d - 1}{2} ⌋, & c_{i, d} = {\begin{array}{l} 1 & if d is odd and i = n_{d}, \\ 2 & otherwise . \end{array} \end{matrix}$

(2.73)

Also, both formulas (2.70) and (2.71) can be modified by “flipping” some of the variables before and after transforming the higher-order term to a minimum or maximum [93]. For instance, another reduction for the quartic term can be obtained using the technique: if we define $\bar{x} = 1 - x$ , we have $x y z t = (1 - \bar{x}) y z t = y z t - \bar{x} y z t$ . The right-hand side consists of a cubic term and a quartic term with a negative coefficient, which can be reduced using (2.70). This generalization gives rise to an exponential number (in terms of the occurrences of the variables in the function) of possible reductions.

2.4 Multi-Label Minimization

In this section, we deal with the case where there are more than two labels. First we describe the case when global optimization is possible, after which we introduce the move-making algorithms that can approximately minimize more general class of energies.

2.4.1 Globally Minimizable Case: Energies with Convex Prior

Assume that the labels have a linear order:

$\begin{matrix} ℒ = {l_{0}, \dots, l_{k}}, & k \geq 2. \end{matrix}$

(2.74)

Also assume that, in the energy (2.15), the pairwise term h_pq(X_p, X_q) is a convex function in terms of the difference of the indices of the labels, i.e., it can be written

$h_{p q} (l_{i}, l_{j}) = {\tilde{h}}_{p q} (i - j)$

(2.75)

with some function ${\tilde{h}}_{p q} (x)$ that satisfies

$\begin{matrix} {\tilde{h}}_{p q} (i + 1) - 2 {\tilde{h}}_{p q} (i) + {\tilde{h}}_{p q} (i - 1) \geq 0, & i = 1, \dots, k - 1. \end{matrix}$

(2.76)

In this case, the energy can be minimized globally using the minimum-cut algorithm by constructing a graph similar to the binary case we described in the preceding section [47].

As before, we construct a graph $G = (V, ℰ)$ with special nodes υ_s and υ_t (Figure 2.9). For each site $p \in P$ , we define a series of nodes ${υ_{p}^{1}, \dots, υ_{p}^{k}}$ :

$V = {υ_{s}, υ_{t}} \cup {υ_{p}^{i} | p \in P, i = 1, \dots, k} .$

(2.77)

For simplicity of notation, let us mean υ_s by $υ_{p}^{0}$ and υ_t by $υ_{p}^{k + 1}$ for any p.

We denote by e_{(p,i)→(q,j)} the directed edge going from $υ_{p}^{i}$ to $υ_{q}^{j}$ and by w_{(p, i)(q, j)} its weight. Now we connect the nodes for the same site as follows. The first kind,

$ℰ_{1} = {e_{(p, i) \to (p, i + 1)} | p \in P, i = 0, \dots, k}$

(2.78)

starts with $υ_{s} = υ_{p}^{0}$ and goes $υ_{p}^{1}, υ_{p}^{2}, \dots$ in turn and ends at $υ_{t} = υ_{p}^{k + 1}$ .

To the opposite direction, the second kind of edges

$ℰ_{2} = {e_{(p, i + 1) \to (p, i)} | p \in P, i = 1, \dots, k - 1}$

(2.79)

have infinite weights. In Figure 2.9, they are depicted as dotted arrows. None of them can be in any cut $(S, T)$ with finite cost, as in the case shown in the figure; no dotted arrow goes from the $S$ side to the $T$ side. In practice, the weight can be some large-enough value. Because of these edges, among the column of edges

FIGURE 2.9
The graph for multilabel MRF minimization. There is a column of nodes and edges for each site. The thick arrows indicate the edges in the cut. We give an infinite weight to the downward dotted arrows, so that exactly one upward edge is in the cut for each column. The label assigned to each site is determined by which one of the upward edge is in the cut (the number on the left). The numbers at the bottom show the labels assigned to each site (column).

$υ_{s} = υ_{p}^{0} \to υ_{p}^{1} \to \dots \to υ_{p}^{k} \to υ_{p}^{k + 1} = υ_{t}$

(2.80)

in $ℰ_{1}$ over each site p, exactly one falls in the cut. This can be seen as follows: (1) the first node $υ_{s} = υ_{p}^{0}$ must be in $S$ ; (2) the last node $υ_{t} = υ_{p}^{k + 1}$ must be in $T$ ; (3) so there must be at least one edge in the column going from $S$ to $T$ ; (4) but there cannot be two, since to go from $S$ to $T$ twice, there must be an edge in the column going from $T$ to $S$ , but then the opposite edge with infinite weight would be in the cut.

Because of this, we can make a one-to-one correspondence between the cuts and the labelings as follows:

$X_{p} = l_{i} \Leftrightarrow υ_{p}^{i} \in S, υ_{p}^{i + 1} \in T$

(2.81)

Therefore, we can transfer the data terms into cut cost in the similar way as the binary case by defining the weight as follows:

$\begin{matrix} w_{(p, i) (p, i + 1)} = g_{p} (l_{i}), & i = 0, \dots, k . \end{matrix}$

(2.82)

As in the binary case, g_p can be any function, since we can add the same value to all the edges in the column if there is any with negative weight.

For the pairwise term h_pq(X_p, X_q), we first examine the case shown in Figure 2.9. Here, there are horizontal edges between $υ_{p}^{i}$ and $υ_{q}^{i}$ , with the same “height” i. With these edges

$ℰ_{3} = {e_{(p, i) \to (q, i)} | (p, q) \in N, i = 1, \dots, k},$

(2.83)

we complete the construction of the graph by defining the set of edges $ℰ = ℰ_{1} \cup ℰ_{2} \cup ℰ_{3}$ .

With this graph, if we give a constant positive number κ as weight to edges in $ℰ_{3}$ , we have the smoothing term proportional to the difference between i and j:

$h_{p q} (l_{i}, l_{j}) = κ | i - j | .$

(2.84)

Now, assume that (2.75) and (2.76) are satisfied. For simplicity, we assume that l_i = i, (i = 0, ⋯, k). Then the energy can be rewritten:

$E (X) = \sum_{p \in P} g_{p} (X_{p}) + \sum_{(p, q) \in N} {\tilde{h}}_{p q} (X_{p} - X_{q}) .$

(2.85)

To minimize more general energies than (2.84), we need edges that are not horizontal (Figure 2.10). That is, for neighboring sites p and q, $ℰ_{3}$ would in general include edges e_{(p, i)→(q, j)} for all combination of i and j. We give them the weight

$w_{(p, i) (q, j)} = \frac{1}{2} ({\tilde{h}}_{p q} (i - j + 1) - 2 {\tilde{h}}_{p q} (i - j) + {\tilde{h}}_{p q} (i - j - 1)) .$

(2.86)

Because of the convexity condition (2.76), this is nonnegative. Without loss of generality, we can assume ${\tilde{h}}_{p q} (x) = {\tilde{h}}_{q p} (- x)$ , because in the energy (2.85), ${\tilde{h}}_{p q} (X_{p} - X_{q})$ and ${\tilde{h}}_{q p} (X_{q} - X_{p})$ appear in pairs and thus we can redefine

${\tilde{h}}^{'}_{p q} (x) = \frac{{\tilde{h}}_{p q} (x) + {\tilde{h}}_{q p} (- x)}{2}$

(2.87)

if necessary.

Now, suppose edges e_{(p, i)→(p, i+1)} and e_{(q,j)→(q, j+1)} are in the cut $(S, T)$ , i.e., that $υ_{p}^{i}, υ_{q}^{j} \in S$ and $υ_{p}^{i + 1}, υ_{q}^{j + 1} \in T$ . According to (2.81), this means X_p = l_i = i, X_q = l_j = j. Then we see that exactly the edges

${e_{(p, l) \to (q, m)} | 0 \leq l \leq i, j + 1 \leq m \leq k + 1}$

(2.88)

${e_{(q, l) \to (p, m)} | 0 \leq l \leq j, i + 1 \leq m \leq k + 1}$

(2.89)

are in the cut. We calculate the sum of the weights of these edges. First, summing up the weights of the edges in (2.88) (thick arrows in Figure 2.10) according to (2.86), we see most terms cancel out and

FIGURE 2.10
General smoothing edges for multilabel MRF minimization

$\begin{array}{l} \sum_{l = 0}^{i} \sum_{m = j + 1}^{k + 1} w_{(p, l) (q, m)} \\ = \frac{1}{2} \sum_{l = 0}^{i} \sum_{m = j + 1}^{k + 1} ({\tilde{h}}_{p q} (l - m + 1) - 2 {\tilde{h}}_{p q} (l - m) + {\tilde{h}}_{p q} (l - m - 1)) \\ = \frac{1}{2} \sum_{l = 0}^{i} \sum_{n = l - k - 1}^{l - j - 1} ({\tilde{h}}_{p q} (n + 1) - 2 {\tilde{h}}_{p q} (n) + {\tilde{h}}_{p q} (n - 1)) \\ = \frac{1}{2} \sum_{l = 0}^{i} [\sum_{n = l - k}^{l - j} ({\tilde{h}}_{p q} (n) - {\tilde{h}}_{p q} (n - 1)) - \sum_{n = l - k - 1}^{l - j - 1} ({\tilde{h}}_{p q} (n) - {\tilde{h}}_{p q} (n - 1))] \\ = \frac{1}{2} \sum_{l = 0}^{i} ({\tilde{h}}_{p q} (l - j) - {\tilde{h}}_{p q} (l - k - 1) - {\tilde{h}}_{p q} (l - j - 1) + {\tilde{h}}_{p q} (l - k - 2)) \\ = \frac{1}{2} ({\tilde{h}}_{p q} (i - j) - {\tilde{h}}_{p q} (- j - 1) - {\tilde{h}}_{p q} (i - k - 1) + {\tilde{h}}_{p q} (- k - 2)) . \end{array}$

(2.90)

Similarly, the edges in (2.89) add up to:

$\frac{1}{2} ({\tilde{h}}_{q p} (j - i) - {\tilde{h}}_{q p} (- i - 1) - {\tilde{h}}_{q p} (j - k - 1) + {\tilde{h}}_{q p} (- k - 2)) .$

(2.91)

Adding the two and using ${\tilde{h}}_{p q} (x) = {\tilde{h}}_{q p} (- x)$ , we have the total sum of

${\tilde{h}}_{p q} (i - j) + r_{p q} (i) + r_{q p} (j),$

(2.92)

where

$r_{p q} (i) = \frac{{\tilde{h}}_{p q} (k + 2) - {\tilde{h}}_{p q} (i - k - 1) - {\tilde{h}}_{p q} (i + 1)}{2} .$

(2.93)

Since r_pq(i) and r_qp(j) depend only on X_p = i and X_q = j, respectively, we can charge them to the data term. That is, instead of (2.82), we use

$w_{(p, i) (p, i + 1)} = g_{p} (i) - \sum_{(p, q) \in N} r_{p q} (i) .$

(2.94)

and we have the sum above exactly ${\tilde{h}}_{p q} (i - j)$ which is h_pq(l_i, l_j). Thus the cost of the cut is exactly the same as the energy of the corresponding labeling up to a constant; and we can use the minimum-cut algorithm to obtain the global minimum labeling.

This construction, since it uses the minimum-cut algorithms, in effect encodes the multi-label energy into a binary one and the convexity condition is the result of the submodularity condition. Generalizations have been made in this respect [97, 98, 99, 100], as well as a spatially continuous formulation [101, 102], which improves metrication error and efficiency. There has also been other improvement in time and memory efficiency [103, 104, 105].

2.4.2 Move-Making Algorithms

The above construction requires that the labels have a linear order, as well as that the energy pairwise term be convex. Since many problems do not satisfy the condition, iterative approximation algorithms are used more often in practice.

Move-making algorithms are iterative algorithms that repeatedly moves in the space of labelings so that the energy decreases. In general, the move can go to anywhere in the space, i.e., each site can change its label to any label; so, the best move would be to a global minimum. Since finding the global minimum in the general multi-label case is known to be NP-hard [106], we must restrict the range of possible moves so that finding the best move among them can be done in a polynomial time. The graph-cut optimization in each iteration of move-making algorithms can be thought of as prohibiting some of the labels at each site [107]. Depending on the restriction, this makes the optimization submodular or at least makes the graph much smaller than the full graph in the exact case.

Below, we introduce some of the approximation methods. There are also other algorithms that generalize move-making algorithms [108, 109].

2.4.2.1 α-β Swap and α-Expansion

Here, we first describe α-β swap and α-expansion moves, the two most popular algorithms in vision literature (Figure 2.11).

FIGURE 2.11
Three moves from the original labeling (a). (b) In the conventional one-pixel move (ICM, annealing), one pixel at a time is changed. (c) In α-β swap, two labels α and β are fixed and the sites currently labeled with one of the two are allowed to swap the current label to the other. (d) In α-expansion, one label α is fixed and each site is allowed to change its label to α.

For $α, β \in ℒ$ , an α-β swap from a labeling X is a move that allows only those sites labeled α or β by X to change their labels to α or β. That is, a move X → X′ is an α-β swap if

$\begin{matrix} X_{p} \neq X_{p}^{'} & \Rightarrow & X_{p}, X_{p}^{'} \in {α, β} \end{matrix}$

(2.95)

is satisfied. Similarly, an α-expansion is a move that allows each site to either switch to label α or remain unchanged, i.e., a move X → X′ is an α-expansion if

$\begin{matrix} X_{p} \neq X_{p}^{'} & \Rightarrow & X_{p}^{'} = α \end{matrix}$

(2.96)

is satisfied.

What these two moves have in common is that they both can be parametrized by a binary labeling. To see this, define label functions $ℒ \times {0, 1} \to ℒ$ by:

$x^{α β} (l, b) = {\begin{array}{l} l & if l \notin {α, β} \\ α & if l \in {α, β} and b = 0 \\ β & if l \in {α, β} and b = 1, \end{array}$

(2.97)

$x^{α} (l, b) = {\begin{array}{l} l & if b = 0 \\ α & if b = 1. \end{array}$

(2.98)

Then the α-β swap and α-expansion moves are defined by site-wise combination of the current labeling X and a binary labeling Y. Fixing X, we evaluate the energy (2.15) after the move:

$E^{α β} (Y) = \sum_{p \in P} g_{p} (x^{α β} (X_{p}, Y_{p})) + \sum_{(p, q) \in N} h_{p q} (x^{α β} (X_{p}, Y_{p}), x^{α β} (X_{q}, Y_{q}))$

(2.99)

$E^{α} (Y) = \sum_{p \in P} g_{p} (x^{α} (X_{p}, Y_{p})) + \sum_{(p, q) \in N} h_{p q} (x^{α} (X_{p}, Y_{p}), x^{α} (X_{q}, Y_{q}))$

(2.100)

What makes these moves important is that these energies can be shown to be sub-modular under simple criteria. Let us check the submodularity condition (2.39) for energy (2.99). We only have to check the case both X_p and X_q are in {α, β}, as otherwise there is no real pairwise term depending on both Y_p and Y_q. Assuming X_p, X_q ∈ {α, β}, the condition simply translates to:

$h_{p q} (α, β) + h_{p q} (β, α) - h_{p q} (α, α) - h_{p q} (β, β) \geq 0.$

(2.101)

As for energy (2.100), let us write β = X_p, γ = X_q; then the condition is:

$h_{p q} (β, α) + h_{p q} (α, γ) - h_{p q} (β, γ) - h_{p q} (α, α) \geq 0.$

(2.102)

Thus, if the energy satisfies (2.101) for all combination of $α, β \in ℒ$ , the optimal α-β swap move can always be found using the basic graph-cut algorithm. The algorithm iterate rotating through combinations of α and β in some order, until the energy stops lowering. Similarly, if the energy satisfies (2.102) for all combinations of $α, β, γ \in ℒ$ , the optimal α-expansion move can always be found. The algorithm rotates through α till the energy ceases to improve.

An important special case of both (2.101) and (2.102) is when h_pq is a metric, i.e., when conditions

$h_{p q} (α, β) \geq 0,$

(2.103)

$\begin{matrix} h_{p q} (α, β) = 0 & \Leftrightarrow & α = β \end{matrix},$

(2.104)

$h_{p q} (α, β) + h_{p q} (β, γ) \geq h_{p q} (α, γ),$

(2.105)

$h_{p q} (α, β) = h_{p q} (β, α)$

(2.106)

are satisfied. In fact, only the first two are needed to satisfy (2.101); and only the first three would imply (2.102), when it is called a semimetric.

2.4.2.2 Fusion Moves

The fusion move [110, 111] is the most general form of move-making algorithm using binary MRF optimization. In each iteration, define the binary problem as the pixelwise choice between two arbitrary labelings, instead of between the current label and the fixed label α as in the case of α-expansion.

Let X, P be two multilabel labelings; then, the fusion of the two according to a binary labeling Y is the labeling defined by

$F {(X, P; Y)}_{p} = {\begin{array}{l} X_{p} & if Y_{p} = 0 \\ P_{p} & if Y_{p} = 1, \end{array} p \in P .$

(2.107)

In each iteration, a binary-labeled MRF energy is defined as a function of Y, with X and P fixed:

$\hat{E} (Y) = E (F (X, P; Y)) .$

(2.108)

The Y that minimizes this energy is used to find the optimal fusion. As a move-making algorithm, the resulting fusion is often used as X in the next iteration; thus X is treated as the optimization variable, whereas P, called the proposal, is generated in some problem-specific way in each iteration.

Binary labeling problem defined this way is rarely submodular. So it became really useful only after the BHS (QPBO/roof duality) algorithms (Section 2.3.4), which give a partial labeling $Y \in B^{Q} (Q \subset P)$ to minimization problems of a nonsubmodular energy, was introduced. For fusion, partial solution must be filled to make a full binary labeling Y. The autarky property guarantees that the energy of the fusion labeling F(X, P; 0 ⊳ Y) is not more than that of X, i.e.,

$E (F (X, P; 0 ⊳ Y)) = \hat{E} (0 ⊳ Y) \leq \hat{E} (0) = E (X) .$

(2.109)

where $0 \in B^{P}$ is a labeling that assigns 0 to all sites. In other words, if we leave the labels unchanged at the sites where Y does not assign any label, the energy does not increase.

In fusion moves, it is crucial for the success and efficiency of the optimization to provide proposals P that fits the energies being optimized. Presumably, α expansion worked so well with the Potts energy because its proposal is the constant labeling. Using the outputs of other algorithms as proposals allows a principled integration of various existing algorithms [80]. In [112], a simple technique based on the gradient of the energy is proposed for generating proposal labelings that makes the algorithm much more efficient.

It is also worth noting that the set of labels can be very large, or even infinite, in fusion move, since the actual algorithm just chooses between two possible labels for each site.

2.4.2.3 α-β Range Moves

The exact minimization algorithm we described in Section 2.4.1 can be used for a class of move-making algorithms [113, 114], called α-β range move algorithms. When the set of labels has a linear order

$ℒ = {l_{0}, \dots, l_{k}},$

(2.110)

they can efficiently minimize first-order energies of the form (2.15) with truncated-convex priors:

$h_{p q} (l_{i}, l_{j}) = \min ({\tilde{h}}_{p q} (i - j), θ),$

(2.111)

where ${\tilde{h}}_{p q}$ is a convex function, which satisfies (2.76), and θ a constant. This kind of prior limits the penalty for very large gaps in labels, thereby encouraging such gaps compared to the convex priors, which is sometimes desirable in segmenting images into objects. Here, we only describe the basic algorithm introduced in [113]; for other variations, see [114].

For two labels α = l_i, β = l_j, i < j, the algorithm use the exact algorithm in Section 2.4.1 to find the optimal move among those that allow changing the labels in the range between the two labels at once. A move X → X′ is called an α-β range move if it satisfies

$\begin{matrix} X_{p} \neq X_{p}^{'} & \Rightarrow & X_{p}, X_{p}^{'} \in {l_{i}, \dots, l_{j}} \end{matrix} .$

(2.112)

Let d be the maximum integer that satisfies ${\tilde{h}}_{p q} (d) \leq θ$ . The algorithm iterates α-β range moves with α = l_i, β = l_j, j − i = d. Within this range, the prior (2.111) is convex, allowing the use of the exact algorithm. Let X be the current labeling and define

$P_{α - β} = {p \in P | X_{p} \in {l_{i}, \dots, l_{j}}} .$

(2.113)

Define a graph $G_{α - β} = (V_{α - β}, ℰ_{α - β})$ as follows. The node set $V_{α - β}$ contains two special nodes υ_s, υ_t as well as a series of nodes ${υ_{p}^{1}, \dots, υ_{p}^{d}}$ for each site p in $P_{α - β}$ . As in Section 2.4.1, let $υ_{p}^{0}$ be an alias for υ_s, and $υ_{p}^{d + 1}$ for υ_t and denote a directed edge from $υ_{p}^{n}$ to $υ_{q}^{m}$ by e_{(p,n)→(q,m)}. The edge set $ℰ_{α - β}$ contains a directed edge e_{(p,n)→(p,n+1)} for each pair of $p \in P_{α - β}$ and n = 0, 1, …, d, as well as the opposite edge e_{(p, n+1)→(p, n)} with an infinite weight for n = 1, …, d − 1. This, as before, guarantees that exactly one edge among the series

$υ_{s} = υ_{p}^{0} \to υ_{p}^{1} \to \dots \to υ_{p}^{d} \to υ_{p}^{d + 1} = υ_{t}$

(2.114)

over each site p ∈ P_α-β is in any cut with finite cost, establishing a one-to-one correspondence between the cut $(S, T)$ and the new labeling:

$X_{p}^{'} = {\begin{array}{l} l_{i + n} & (if p \in P_{α - β}, υ_{p}^{n} \in S, υ_{p}^{n + 1} \in T) \\ X_{p} & (if p \notin P_{α - β}) . \end{array}$

(2.115)

Thus, by defining the weights by

$w_{(p, n) (p, n + 1)} = g_{p} (l_{i + n})$

(2.116)

the data term is exactly reflected in the cut cost.

As for the smoothing term, any convex ${\tilde{h}}_{p q}$ can be realized by defining edges as in Section 2.4.1. However, that is only if both p and q belong to $P_{α - β}$ . Although it can be ignored if neither belongs to $P_{α - β}$ , it cannot be if only one does. This is, however, a unary term, as one of the pair has a fixed label. Thus, adding the effect to (2.116), we redefine

$w_{(p, n) (p, n + 1)} = g_{p} (l_{i + n}) + \sum_{\begin{matrix} (p, q) \in N \\ q \in P P_{α - β} \end{matrix}} \min ({\tilde{h}}_{p q} (l_{i + n} - X_{q}), θ) .$

(2.117)

This way, the cost of a cut of $(G_{α - β}, υ_{s}, υ_{t})$ is the energy of the corresponding labeling by (2.115) up to a constant. Thus, the minimum cut gives the optimal α-β range move.

2.4.2.4 Higher-Order Graph Cuts

Combining the fusion move and the reduction of general higher-order binary energy explained in Section 2.3.5, multiple-label higher-order energies can be approximately minimized [92, 93].

The energy is of the general form (2.13), shown here again:

$E (X) = \sum_{C \in C} f_{C} (X_{C}),$

(2.118)

where $C$ is a set of cliques (subsets) of the set $P$ of sites, and X_C is the restriction of X to C.

The algorithm maintains the current labeling $X \in ℒ^{P}$ . In each iteration, the algorithm fuses X and a proposed labeling $P \in ℒ^{P}$ by minimizing a binary energy. How P is prepared is problem specific, as is how X is initialized at the beginning of the algorithm.

The result of the merge according to a binary labeling $Y \in B^{P}$ is defined by (2.107) and its energy by (2.108). For the energy E(X) of the form (2.13), the polynomial expression of $\hat{E} (Y)$ is

$\hat{E} (Y) = \sum_{C \in C} \sum_{γ \in B^{C}} f_{C} (F (X_{C}, P_{C}; γ)) θ_{C}^{γ} (Y_{C}),$

(2.119)

where $F (X_{C}, P_{C}; γ) \in ℒ^{C}$ is the fusion labeling on the clique C determined by a binary vector $γ \in B^{C}$ :

$F {(X_{C}, P_{C}; γ)}_{p} = {\begin{matrix} X_{p} & if γ_{p} = 0 \\ P_{p} & if γ_{p} = 1, \end{matrix} p \in C,$

(2.120)

and $θ_{C}^{γ} (Y_{C})$ is a polynomial of degree |C| defined by

$\begin{matrix} θ_{C}^{γ} (Y_{C}) = \prod_{p \in C} Y_{p}^{(γ)}, & Y_{p}^{(γ)} \end{matrix} = {\begin{array}{l} Y_{p} & if γ_{p} = 1 \\ 1 - Y_{p} & if γ_{p} = 0 \end{array}$

(2.121)

which is 1 if Y_C = γ and 0 otherwise.

The polynomial $\hat{E} (Y)$ is then reduced into a quadratic polynomial using the technique described in Section 2.3.5. The result is a quadratic polynomial $\tilde{E} (\tilde{Y})$ in terms of the labeling $\tilde{Y} \in B^{P^{'}}$ , where $P^{'} \supset P$ . We use the BHS algorithm to minimize $\tilde{E} (\tilde{Y})$ and obtain a partial labeling $Z \in B^{Q^{'}}$ on a subset $Q^{'}$ of $P^{'}$ . Overwriting the zero labeling 0 by the restriction $Z_{P \cap Q^{'}}$ of Z to the original sites, we obtain $0 ᐊ Z_{P \cap Q^{'}}$ , which we denote by Y_Z. Then we have

$E (F (X, P; Y_{Z})) = \hat{E} (Y_{Z}) \leq \hat{E} (0) = E (X) .$

(2.122)

To see why, let us define

$\begin{matrix} y_{0} = {\tilde{Y} \in B^{P^{'}} | {\tilde{Y}}_{P} = 0}, & {\hat{Y}}_{0}^{\min} = \arg \min_{\tilde{Y} \in y_{0}} \tilde{E} (\tilde{Y}), \end{matrix}$

(2.123)

$\begin{matrix} y_{Z} = {\tilde{Y} \in B^{P^{'}} | {\tilde{Y}}_{P} = Y_{Z}}, & {\hat{Y}}_{Z}^{\min} = \arg \min_{\tilde{Y} \in y_{Z}} \tilde{E} (\tilde{Y}) . \end{matrix}$

(2.124)

Then, by the property of the reduction, we have $\hat{E} (0) = \tilde{E} ({\tilde{Y}}_{0}^{\min})$ and $\hat{E} (Y_{Z}) = \tilde{E} ({\tilde{Y}}_{Z}^{\min})$ . Also, if we define $\tilde{Z} = {\tilde{Y}}_{0}^{\min} ᐊ Z$ , we have $\tilde{E} (\tilde{Z}) \leq \tilde{E} ({\tilde{Y}}_{0}^{\min})$ by the autarky property. Since $\tilde{Z} \in y_{Z}$ , we also have $\tilde{E} ({\tilde{Y}}_{Z}^{\min}) \leq \tilde{E} (\tilde{Z})$ . Thus, $\hat{E} (Y_{Z}) = \tilde{E} ({\tilde{Y}}_{Z}^{\min}) \leq \tilde{E} (\tilde{Z}) \leq \tilde{E} ({\tilde{Y}}_{0}^{\min}) = \hat{E} (0)$ .

Therefore, we update X to F(X, P; Y_Z) and (often) decrease the energy. We iterate the process until some convergence criterion is met.

2.5 Examples

Here, we show denoising examples using the multilabel energy minimization techniques described in the previous section. Figure 2.12 shows the results of denoising an image with graph cuts. The original 8-bit grayscale image (a) was added an i. i. d. Gaussian additive noise with standard deviation σ = 50, resulting in the noisy image Y shown in (b).

The result of denoising it by finding X that globally minimizes the “total variation” energy:

$E (X) = \sum_{p \in P} {(Y_{p} - X_{p})}^{2} + \sum_{(p, q) \in N} κ | X_{p} - X_{q} |$

(2.125)

with κ = 8.75, using the technique described in Section 2.4.1 and [47], is shown in (c).

The result of denoising Y with the higher-order graph cuts described in Section 2.4.2.4 is shown in (d). For the higher-order energy, we used the recent image statistical model called the fields of experts (FoE) [85], which represents the prior probability of an image X as the product of several Student’s t-distributions:

$P (X) \propto \prod_{C} \prod_{i = 1}^{K} {(1 + \frac{1}{2} {(J_{i} \cdot X_{C})}^{2})}^{- α_{i}},$

(2.126)

where C runs over the set of all n × n patches in the image, and J_i is an n × n filter. The parameters J_i and α_i are learned from a database of natural images. Here, we used the energy

$E (X) = \sum_{C \in C} f_{C} (X_{C}),$

(2.127)

FIGURE 2.12
Denoising example. (a) Original image, (b) noise-added image (σ = 50), (c) denoised using the total variation energy, (d) denoised using the higher-order FoE energy.

where the set $C$ of cliques consists of those cliques formed by singleton pixels and 2 × 2 patches. For the two kinds of cliques, the function f_C is defined by

$\begin{matrix} f_{C} (X_{C}) = \frac{{(Y_{p} - X_{p})}^{2}}{2 σ^{2}}, & (C = {p}) \end{matrix}$

(2.128)

$\begin{matrix} f_{C} (X_{C}) = \sum_{i = 1}^{3} α_{i} \log (1 + \frac{1}{2} {(J_{i} \cdot X_{C})}^{2}) . & (C is a 2 \times 2 patch) \end{matrix}$

(2.129)

The FoE parameters J_i and α_i for the experiment were kindly provided by Stefan Roth. The energy was approximately minimized using the higher-order graph cuts. We initialized X by Y and then used the following two proposals in alternating iterations: (i) a uniform random image created each iteration, and (ii) a blurred image, which is made every 30 iterations by blurring the current image X with a Gaussian kernel (σ = 0.5625).

2.6 Conclusion

In this chapter, we have provided an overview of the main ideas behind graph cuts and introduced the basic constructions that are most widely used as well as the ones the author is most familiar with.

Table of Contents for 2 Graph Cuts—Combinatorial Optimization in Vision

Create new playlist

Sign In

Sign Up

Table of Contents for
2 Graph Cuts—Combinatorial Optimization in Vision