5.7 Top-down Modelling in the Bayesian Framework

In Section 3.7, the SUN saliency model, which is built on more comprehensive natural-image statistics, was considered only for its bottom-up part. However, the framework of the model includes both bottom-up and top-down parts. For the sake of completeness, we briefly present its top-down part here, as proposed in [19].

5.7.1 Review of Basic Framework

The SUN model assumes that the salience of a location is closely related to the probability of a target being present there: a location with a higher probability of containing the target has a larger saliency value. Let z be a point in the visual field (or a pixel in the input image) and let the binary random variable C denote whether or not the point belongs to a target class, $C \in \{0, 1\}$. The random variable L and the random vector F denote the location (pixel coordinates) and the visual features of a point, respectively. The saliency of a point z in the visual field is directly proportional to the probability $p(C = 1 \mid F = \mathbf{f}_z, L = l_z)$, where $\mathbf{f}_z$ is the feature vector observed at point z (written in bold because it is a vector) and $l_z$ is the coordinate of z. The saliency at point z can be calculated using Bayes' rule as

(5.34) $\quad s_z = p(C = 1 \mid F = \mathbf{f}_z, L = l_z) = \dfrac{p(F = \mathbf{f}_z, L = l_z \mid C = 1)\, p(C = 1)}{p(F = \mathbf{f}_z, L = l_z)}$

Since the logarithm is a monotonically increasing function that does not affect the ranking of salience across locations in an image (as in the derivation in Section 3.7), and assuming that the features and the location are independent and conditionally independent given $C = 1$, Equation 5.34 can be rewritten in the following form:

(5.35) $\quad \log s_z = -\log p(F = \mathbf{f}_z) + \log p(F = \mathbf{f}_z \mid C = 1) + \log p(C = 1 \mid L = l_z)$

In the above equation, the first term is the self-information at point z, as introduced in Chapter 3; it represents the bottom-up salience when the feature vector F takes the value $\mathbf{f}_z$, the features observed at point z. When the probability of one or more features decreases, the salience at point z increases, because the joint probability of all the features is the product of the individual feature probabilities under the independence assumption used in the probability density estimation [19, 71]. The second term is a log-likelihood term that favours feature values consistent with knowledge of the target's presence at point z. If one or several features of the target are already known (e.g., we know that the target is red), then the log-likelihood term is much larger at points carrying that feature (red points) than at points of other colours. The third term is related to prior knowledge of where the target is likely to appear and is independent of the features. For instance, a tiger often appears in places where it frequently finds prey.
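To make the roles of the three terms concrete, here is a toy sketch of Equation 5.35 in Python; all probability values are invented purely for illustration.

```python
import numpy as np

# Toy illustration of Equation 5.35 (all numbers are made up):
# log s_z = -log p(f_z) + log p(f_z | C=1) + log p(C=1 | l_z)

def log_saliency(p_f, p_f_given_target, p_target_given_loc):
    """Combine the three terms of Eq. 5.35 for a single point z."""
    self_information = -np.log(p_f)                # bottom-up: rarer features -> larger
    log_likelihood = np.log(p_f_given_target)      # top-down: target-like features -> larger
    location_prior = np.log(p_target_given_loc)    # prior knowledge of likely target locations
    return self_information + log_likelihood + location_prior

# A point with a rare, target-like feature ...
print(log_saliency(p_f=0.01, p_f_given_target=0.6, p_target_given_loc=0.3))
# ... scores higher than a point with a common, non-target-like feature.
print(log_saliency(p_f=0.20, p_f_given_target=0.05, p_target_given_loc=0.3))
```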

If the location prior is not considered, for simplicity (setting the third term to zero, or taking the conditional probability to be uniform over locations so that it contributes only a constant), the first two terms combine into the pointwise mutual information between the features and the presence of a target, and Equation 5.35 can be written as

(5.36) $\quad \log s_z = -\log p(F = \mathbf{f}_z) + \log p(F = \mathbf{f}_z \mid C = 1) = \log \dfrac{p(F = \mathbf{f}_z, C = 1)}{p(F = \mathbf{f}_z)\, p(C = 1)}$

The SUN model with both bottom-up and top-down parts therefore looks for the salient regions of an image that are most likely to contain the target by maximizing the pointwise mutual information in Equation 5.36. When searching for a single target class, $p(C = 1)$ can be treated as a constant, so the right-hand side of Equation 5.36 can be rewritten as

(5.37) $\quad \log s_z = \log \dfrac{p(F = \mathbf{f}_z, C = 1)}{p(F = \mathbf{f}_z)\, p(C = 1)} = \log p(C = 1 \mid F = \mathbf{f}_z) - \log p(C = 1)$

From Equation 5.37, since $\log p(C = 1)$ is a constant, the saliency computation of the SUN model with the top-down part reduces to estimating the conditional probability $p(C = 1 \mid F = \mathbf{f}_z)$.
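This reduction can be checked numerically: with $p(C = 1)$ held constant, ranking points by the pointwise mutual information of Equation 5.36 gives the same order as ranking them by the posterior $p(C = 1 \mid F = \mathbf{f}_z)$. The probabilities below are again invented for illustration.

```python
import numpy as np

# Toy check of Equations 5.36-5.37: with p(C=1) fixed, ranking by pointwise
# mutual information equals ranking by the posterior p(C=1 | F=f_z).
p_target = 0.1                                   # p(C = 1), constant for one target class
posteriors = np.array([0.05, 0.40, 0.85, 0.20])  # p(C=1 | F=f_z) at four points (made up)

pmi = np.log(posteriors) - np.log(p_target)      # Eq. 5.37: log p(C=1|f_z) - log p(C=1)

# The two rankings coincide, so estimating p(C=1 | F=f_z) is enough.
assert (np.argsort(pmi) == np.argsort(posteriors)).all()
print(pmi)
```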

5.7.2 The Estimation of Conditional Probability Density

Since the conditional probability involves both the class (target/background) and the features at each point, a probabilistic classifier, the support vector machine (SVM), is adopted in [19] to estimate it. The SVM is an effective classifier in pattern recognition; with the LIBSVM library [72] it can perform pattern classification, regression (data fitting) and probability estimation. For the features at each point z, the feature filters mentioned in Section 3.7 are considered. Section 3.7 adopts two kinds of feature filters: differences of Gaussians at multiple scales, and ICA filters. Since the ICA filters (a bank of basis functions) generate nearly independent responses, which makes the estimation of the joint probability density in the self-information term more accurate, they are preferred for the SUN computation with top-down information [19].
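As a rough illustration of the probability estimation, the sketch below trains an RBF-kernel SVM with probability outputs on random stand-in feature vectors; scikit-learn's SVC (which is backed by LIBSVM) is used here in place of a direct LIBSVM setup, and the data are not the ICA responses used in [19].

```python
import numpy as np
from sklearn.svm import SVC

# Minimal sketch of the probabilistic classifier estimating p(C=1 | F=f_z).
# With probability=True, SVC returns class probabilities instead of hard labels.
rng = np.random.default_rng(0)
X_target = rng.normal(loc=1.0, size=(50, 20))      # stand-in features of target patches
X_background = rng.normal(loc=0.0, size=(50, 20))  # stand-in features of background patches
X = np.vstack([X_target, X_background])
y = np.array([1] * 50 + [0] * 50)                  # C = 1 for target, C = 0 otherwise

svm = SVC(kernel="rbf", probability=True).fit(X, y)
target_idx = list(svm.classes_).index(1)
p_c1 = svm.predict_proba(X[:5])[:, target_idx]     # estimates of p(C=1 | F=f_z)
print(p_c1)
```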

The bank of ICA basis functions is first estimated from a large database of image patches, each normalized to have zero mean and represented as an 11 × 11 × 3-dimensional vector, where 3 is the number of colour channels and 11 × 11 pixels is the patch size. One of the ICA learning algorithms [73–75] is applied to this database, and the bank of ICA basis functions (filters) is obtained after learning. The details of ICA filter computation were discussed in Sections 3.5 and 3.7. Once the ICA filters are available, the implementation of SUN has the following three steps.
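Before detailing these steps, a minimal sketch of the filter-learning stage just described is given below; scikit-learn's FastICA and the randomly generated patches are stand-ins for the ICA algorithms in [73–75] and for the natural-image database.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Sketch of the ICA filter-learning stage. Random patches replace the
# natural-image database, and FastICA stands in for the algorithms of [73-75].
rng = np.random.default_rng(0)
patch_size, n_channels, n_filters = 11, 3, 64
patches = rng.normal(size=(10000, patch_size * patch_size * n_channels))
patches -= patches.mean(axis=1, keepdims=True)        # zero-mean each 11 x 11 x 3 patch

ica = FastICA(n_components=n_filters, whiten="unit-variance", random_state=0)
ica.fit(patches)
# Each row of components_ is one ICA filter, reshaped to 11 x 11 x 3 so it can
# be used as a convolution kernel on colour images.
ica_filters = ica.components_.reshape(n_filters, patch_size, patch_size, n_channels)
print(ica_filters.shape)                              # (64, 11, 11, 3)
```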

1. Create a training set for the SVM
Images from a large dataset containing the target of interest are used as the training set. Each image is normalized to have zero mean and unit standard deviation. A square mask of d × d pixels (over the three colour channels) is used to crop the target of interest from each image, forming the positive candidate samples. The size d is chosen so that the cropped patch contains the entire target; because the target size varies from image to image, d is not the same for every image. Random square patches of the same size d are also collected from the background of the same image to form the negative candidate samples.
The ICA filters are resized to match each candidate sample (since d differs across images), and each candidate sample (positive or negative patch) is projected onto these resized ICA filters to obtain its responses. To keep the representation invariant to the different values of d, each response is multiplied by a d-dependent normalization factor, and the absolute values of the responses for one candidate sample are taken as the features of that image patch. The feature vector of each image patch, together with its positive or negative label, is a training sample; when all the image patches cut from the original images are represented by their feature vectors (ICA responses), the training set with positive and negative labels is complete. If the dimension of the feature vector is too large, a dimensionality reduction method such as PCA can be used to simplify the training set. (A sketch of steps 1 and 2 is given after this list.)
2. SVM learning from the training set
An SVM with a Gaussian kernel can be regarded as a network whose hidden units compute Gaussian kernel functions of the input, and whose output neuron computes a weighted summation of these hidden-unit outputs. Training maximizes the discrimination between target and background by choosing the number of hidden units, the variance of each Gaussian kernel and the weights.
When the SVM is used as a classifier, the output is a binary value: one for a positive sample and zero for a negative sample; this use will be introduced in Chapter 7 for pattern classification applications. For estimating a regression function, the output is the weighted summation itself, and when this output is suitably normalized it provides the probability estimate required here.
The size of the image patches depends on the target sizes in different images, so [19] adapts the scale to a new image by clustering the resizing factors obtained while building the training set. Three patch sizes are retained in [19] for the test stage (step 3).
3. Calculating the conditional probability $p(C = 1 \mid F = \mathbf{f}_z)$
A test image is normalized to have zero mean and unit variance, as in step 1. The ICA filters are enlarged to match the patch size at each of the three scales, each ICA filter is convolved with the image, and the absolute value of the response is taken. The SVM for class C then provides an estimate of $p(C = 1 \mid F = \mathbf{f}_z, s)$, where s denotes the scale. The resulting map for each scale is smoothed, and the maps for all scales are combined by averaging the estimates at each point. (A sketch of this test stage is given after the summary below.)
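The following sketch puts steps 1 and 2 together under several assumptions: the training images, the target bounding boxes, the number of background patches per image and the (11/d)² scale normalization are hypothetical stand-ins (the actual normalization factor of [19] is not reproduced here), and scikit-learn's SVC, which is backed by LIBSVM, again plays the role of the LIBSVM setup.

```python
import numpy as np
from scipy.ndimage import zoom
from sklearn.svm import SVC

PATCH = 11  # side length of the learned ICA filters (11 x 11 x 3)

def patch_features(patch, ica_filters):
    """Project one d x d x 3 patch onto the ICA filters resized to d x d x 3.

    The d-dependent normalization of [19] is not reproduced here; the
    (PATCH / d) ** 2 factor below is a hypothetical stand-in for it.
    """
    d = patch.shape[0]
    resized = np.stack([zoom(f, (d / PATCH, d / PATCH, 1), order=1)
                        for f in ica_filters])                  # (n_filters, d, d, 3)
    responses = resized.reshape(len(ica_filters), -1) @ patch.reshape(-1)
    return np.abs(responses) * (PATCH / d) ** 2                 # feature vector of the patch

def build_training_set(images, boxes, ica_filters, n_neg=5, seed=0):
    """Step 1: positive patches from target boxes, negatives from the background."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for img, (row, col, d) in zip(images, boxes):               # (row, col, size) of the target
        img = (img - img.mean()) / (img.std() + 1e-8)           # zero mean, unit std deviation
        X.append(patch_features(img[row:row + d, col:col + d], ica_filters))
        y.append(1)                                             # positive (target) sample
        for _ in range(n_neg):                                  # background patches, same d
            r = rng.integers(0, img.shape[0] - d)
            c = rng.integers(0, img.shape[1] - d)
            X.append(patch_features(img[r:r + d, c:c + d], ica_filters))
            y.append(0)                                         # negative (background) sample
    return np.array(X), np.array(y)

# Step 2: probabilistic SVM trained on the feature vectors.
rng = np.random.default_rng(1)
images = [rng.normal(size=(120, 160, 3)) for _ in range(8)]     # stand-in training images
boxes = [(20, 30, 40)] * 8                                      # hypothetical target boxes
ica_filters = rng.normal(size=(64, PATCH, PATCH, 3))            # or the learned filters above
X, y = build_training_set(images, boxes, ica_filters)
svm = SVC(kernel="rbf", probability=True).fit(X, y)
```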
In summary, the saliency map of the SUN model with the top-down part is based on estimating the conditional probability $p(C = 1 \mid F = \mathbf{f}_z)$ at each point, and ICA filters and an SVM are used to perform this estimation. Because the SUN model relies on more comprehensive statistics, the SVM must be trained on a large database for each target class of interest. When no top-down target is specified, the self-information estimated at each point determines the bottom-up saliency map, as described in Section 3.7.
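Finally, a sketch of the test stage (step 3), reusing the `svm` and `ica_filters` from the previous sketch; the three filter sizes in `scales` and the smoothing width `smooth_sigma` are hypothetical choices rather than the values used in [19].

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom
from scipy.signal import fftconvolve

def saliency_map(image, ica_filters, svm, scales=(9, 11, 15), smooth_sigma=2.0):
    """Step 3 (sketch): multi-scale top-down SUN saliency via p(C = 1 | F = f_z)."""
    image = (image - image.mean()) / (image.std() + 1e-8)        # zero mean, unit variance
    h, w, _ = image.shape
    target_idx = list(svm.classes_).index(1)                     # column holding p(C = 1 | f)
    maps = []
    for d in scales:
        feats = []
        for f in ica_filters:                                    # resize each 11 x 11 x 3 filter
            fd = zoom(f, (d / f.shape[0], d / f.shape[1], 1), order=1)
            resp = fftconvolve(image, fd, mode="same", axes=(0, 1)).sum(axis=2)
            # Same hypothetical scale normalization as in the training sketch.
            feats.append(np.abs(resp) * (f.shape[0] / d) ** 2)
        feats = np.stack(feats, axis=-1).reshape(-1, len(ica_filters))
        p_map = svm.predict_proba(feats)[:, target_idx].reshape(h, w)
        maps.append(gaussian_filter(p_map, smooth_sigma))        # smooth the map at this scale
    return np.mean(maps, axis=0)                                 # average over the three scales

# Usage with the svm and ica_filters from the previous sketch:
# sal = saliency_map(np.random.default_rng(2).normal(size=(120, 160, 3)), ica_filters, svm)
```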