4.2 Spectral Residual Approach

Apart from the early spectral attention model based on a DCT block mentioned in the literature of Itti et al. [2], the first pure bottom-up model in the frequency domain was proposed in [3] and called spectral residual, SR for short. This method only needs to compute the residual amplitude spectrum; the saliency map is then simply the image recovered by an inverse FFT. There is no biological evidence for spectral processing in the brain, but the simple SR model yields almost the same saliency results as the BS model. In addition, its computational cost is very low. The C++ implementation of the BS model, named NVT or NVT+ [22], is already fast, but the SR model is so parsimonious, even in its MATLAB® version, that it is well suited to real-time applications. Since the SR model was introduced, several improved or simpler computational models in the frequency domain have been proposed [5–10]. Setting aside for the moment why frequency domain models perform well, this section first introduces the idea of SR, its algorithm and some simulation results. Alternative frequency domain models are presented in the following six sections. A biologically plausible model in the frequency domain is proposed in Section 4.6, which can partly explain the rationale of these spectral models. It is worth noting that, up to now, all visual attention models in the frequency domain have been pure bottom-up models, since it is not easy to integrate top-down information into frequency domain computation. Finally, the advantages and limitations of computational models in the frequency domain are discussed in the final section to help users select a model for their particular applications.

4.2.1 Idea of the Spectral Residual Model

As mentioned in Chapters 2 and 3, the focuses of human visual attention in a scene often deviate from the normal structure or colour. Our brain is able to suppress redundant information and process unusual signals first [23, 24]. The SR model suggests that the whole information in an image or scene is divided into two parts: the novel part, which includes unexpected signals, and the redundant or inherent information. The overall information is a simple summation of the two parts:

$\mathrm{IN}(\text{image}) = \mathrm{IN}(\text{innovation}) + \mathrm{IN}(\text{inherent information})$

where IN represents information. Attention should suppress IN(inherent information) and extract IN(innovation). How do we distinguish the two kinds of information? An intuitive expectation is that pixels that belong to the innovation have less statistical dependence than pixels from the inherent information. Consider the amplitude spectrum of a natural image, which is also regarded as the sum of two amplitude spectra (innovation and inherent information). From Section 4.1.2 (Equations 4.4 and 4.5), the average amplitude spectrum $E\{A(f)\}$ of natural images satisfies a 1/f law (frequency power index α = 1); that is, the power spectrum in Equation 4.4 is rewritten as an amplitude spectrum:

(4.6) $E\{A(f)\} \propto 1/f$

Note that Equation 4.6 is not suited to analysing individual images, because the amplitude spectra of different images differ slightly in shape although the trend is similar. The SR model [3] assumes that the smooth 1/f curve can be regarded as the inherent information, so the difference (residua) between the original and smoothed amplitude spectra is the innovation. Therefore, the image recovered from the innovation by the inverse Fourier transform will show the salient information. The SR model adopts the log amplitude spectrum [3], $L(f) = \log A(f)$. The average log amplitude spectrum, which has been widely used in the literature [12, 17, 25], is often locally linear. The SR model relies on the assumption that similarity of amplitude spectra implies redundancy of information in the image; thus the residual log amplitude spectrum at frequency f is

(4.7) $R(f) = L(f) - h_n(f) * L(f)$

where $L(f)$ is the value of the original log amplitude spectrum at frequency f and $h_n(f) * L(f)$ is the value after a smoothing operation that approximates the part the image shares with the statistics of natural images. $h_n(f)$ is a 2D local smoothing filter in the frequency domain, in which the subscript n is the template size of the filter, for example an averaging filter over an n × n neighbourhood. The symbol * denotes convolution. The filter approximates the statistical average of the amplitude spectra because it removes individual differences and keeps the overall trend. $R(f)$ is the innovation part in frequency space, called the spectral residua.
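The residual computation of Equation 4.7 can be illustrated with a short NumPy sketch. This is only a minimal illustration under stated assumptions: the function names `box_smooth` and `spectral_residual` are hypothetical, the template size n is taken to be odd, borders of the spectrum are handled by edge replication, and a small constant is added before the logarithm to avoid log(0). None of these choices is prescribed by the text.

```python
import numpy as np

def box_smooth(spectrum, n=3):
    """Convolve a 2D array with an n x n averaging template (weights 1/n^2).

    Assumptions of this sketch: n is odd, and borders are padded by
    replicating the edge values.
    """
    pad = n // 2
    padded = np.pad(spectrum, pad, mode="edge")
    rows, cols = spectrum.shape
    out = np.zeros((rows, cols), dtype=float)
    # Sum the n*n shifted copies of the padded array, then average.
    for dy in range(n):
        for dx in range(n):
            out += padded[dy:dy + rows, dx:dx + cols]
    return out / (n * n)

def spectral_residual(log_amplitude, n=3):
    """Log amplitude spectrum minus its locally smoothed version (Equation 4.7)."""
    return log_amplitude - box_smooth(log_amplitude, n)

# Example on a random 64 x 64 test image (synthetic data, for illustration only).
rng = np.random.default_rng(0)
image = rng.random((64, 64))
log_amp = np.log(np.abs(np.fft.fft2(image)) + 1e-12)  # epsilon avoids log(0)
residual = spectral_residual(log_amp)
```

A perfectly flat log amplitude spectrum gives a zero residual, whereas an isolated peak survives the subtraction almost unchanged; this is the sense in which the smoothing removes the shared trend and keeps the individual deviations.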

4.2.2 Realization of Spectral Residual Model

The SR model is easy to realize in five steps: (1) resize the input image to a standard size; (2) perform the discrete Fourier transform of the standard image and take the natural logarithm of the amplitude spectrum; (3) calculate the spectral residua of the log amplitude spectrum with the aid of Equation 4.7; (4) perform the inverse discrete Fourier transform, keeping the phase spectrum of the standard image and substituting the spectral residua for the amplitude spectrum; (5) post-process the recovered image with a low-pass Gaussian filter; the resulting image is the saliency map of the original image.

Given an image I with M × N pixels, that is, the array I(x, y) for x = 0, …, N − 1 and y = 0, …, M − 1, the SR algorithm is realized as follows:

1. Resize the input image
In the SR model the input image is resized to 64 pixels in width (or height). This relatively coarse resolution is chosen for two reasons: first, bottom-up attention is a fast parallel process in the pre-attention stage, in which the details of the image may not be observed; second, it suppresses noise in the high frequency region. The choice is consistent with the BS model, which selects a mid resolution as the size of the final saliency map. The standard image resizing step is

(4.8) $I'(x, y) = \mathrm{resize}(I(x, y))$

where the standard image is represented as the array $I'(x, y)$, in which the smaller of the width (x) and height (y) is 64 pixels.
2. Calculate the log amplitude and phase spectra of the standard image from Equations 4.9–4.11

(4.9) $A(f) = \left|\mathcal{F}\left(I'(x, y)\right)\right|$

(4.10) $P(f) = \mathrm{ph}\left(\mathcal{F}\left(I'(x, y)\right)\right)$

(4.11) $L(f) = \log A(f)$

where $\mathcal{F}(\cdot)$ is the Fourier transform calculated from Equation 4.1, ph(·) is a function for computing the phase spectrum from Equation 4.3, $P(f)$ and $A(f)$ are the phase and amplitude spectra, respectively, and $L(f)$ denotes the log amplitude spectrum.
3. Calculate the spectral residua
Without loss of generality, take the parameter of the smoothing filter as n = 3, so that $h_n$ is the 3 × 3 averaging filter template with a value of 1/9 for each element. Then, according to Equation 4.7, the smoothed log amplitude spectrum is $h_3(f) * L(f)$ and the spectral residua is $R(f) = L(f) - h_3(f) * L(f)$.
4. Do an inverse Fourier transform for spectral residua by keeping the phase spectrum using Equation 4.2

(4.12) $\tilde{I}(x, y) = \mathcal{F}^{-1}\left[\exp\left(R(f) + jP(f)\right)\right]$

where $\tilde{I}(x, y)$ is the recovered resultant map after the inverse Fourier transform.
5. Post-process the recovered image by using a low-pass Gaussian filter with σ = 8

(4.13) $SM(x, y) = g(x, y) * \left|\tilde{I}(x, y)\right|^2$

where each value in $\tilde{I}(x, y)$ is squared in order to enhance the contrast, and the symbol g denotes a low-pass Gaussian filter. Finally, the array SM(x, y) forms the saliency map.
The SR model is very simple and convenient to realize. It does not need to compute multiscale pyramids or use a centre–surround process as in the original BS model and its spatial domain variations, and there is no need to estimate probability densities as in the AIM, DISC, SUN and Bayesian surprise models. Since FFT and inverse FFT routines are ready-made in MATLAB®, the algorithm takes only a few lines of MATLAB® code. Readers can program it themselves.
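The five steps above can be put together in a compact sketch, written here in Python with NumPy rather than MATLAB®. It is a sketch under stated assumptions, not a reference implementation: the helper names are hypothetical, the resize uses nearest-neighbour sampling, borders are handled by edge replication, the Gaussian kernel is truncated at 3σ, a small constant guards the logarithm, and the residual is exponentiated before the inverse transform so that it plays the role of a log amplitude, as in the usual formulation of SR.

```python
import numpy as np

def resize_nearest(image, short_side=64):
    """Step 1: nearest-neighbour resize so the smaller dimension is short_side."""
    rows, cols = image.shape
    scale = short_side / min(rows, cols)
    new_rows = max(1, int(round(rows * scale)))
    new_cols = max(1, int(round(cols * scale)))
    row_idx = np.minimum((np.arange(new_rows) / scale).astype(int), rows - 1)
    col_idx = np.minimum((np.arange(new_cols) / scale).astype(int), cols - 1)
    return image[np.ix_(row_idx, col_idx)]

def box_smooth(spectrum, n=3):
    """n x n averaging filter (n odd) with edge-replicated borders."""
    pad = n // 2
    padded = np.pad(spectrum, pad, mode="edge")
    rows, cols = spectrum.shape
    out = np.zeros((rows, cols), dtype=float)
    for dy in range(n):
        for dx in range(n):
            out += padded[dy:dy + rows, dx:dx + cols]
    return out / (n * n)

def gaussian_blur(image, sigma=8.0):
    """Separable Gaussian low-pass filter, kernel truncated at 3 * sigma."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    out = image.astype(float)
    for axis in (0, 1):
        pad = [(0, 0), (0, 0)]
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")
        # "valid" convolution of the padded signal restores the original length.
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)
    return out

def sr_saliency(image, n=3, sigma=8.0):
    """Saliency map of a single-channel image following the five SR steps."""
    standard = resize_nearest(np.asarray(image, dtype=float))   # step 1
    spectrum = np.fft.fft2(standard)                            # step 2
    log_amp = np.log(np.abs(spectrum) + 1e-12)                  # epsilon avoids log(0)
    phase = np.angle(spectrum)
    residual = log_amp - box_smooth(log_amp, n)                 # step 3
    recovered = np.fft.ifft2(np.exp(residual + 1j * phase))     # step 4
    return gaussian_blur(np.abs(recovered) ** 2, sigma)         # step 5

# Synthetic test image: noisy background with one small bright patch.
rng = np.random.default_rng(1)
test_image = 0.1 * rng.random((128, 128))
test_image[40:48, 80:88] += 1.0
saliency = sr_saliency(test_image)
```

The function works on one channel at a time, so `saliency` here is a 64 × 64 non-negative map. Note how little machinery is involved: one forward FFT, one local average, one inverse FFT and one blur.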

4.2.3 Performance of SR Approach

For colour natural images, the SR model processes two colour channels and one intensity channel independently, as defined in the BS model, and then integrates the three individual conspicuity maps into the final saliency map. To compare the SR approach with the BS model, both models were applied in [3] to the same data set of 62 natural images, with the BS model using the code from [22]. The comparison reported that the SR model performs better than the original BS model in terms of object hit ratio and intuitive visual results [3]; in particular, the SR model, although it lacks biological evidence, is very efficient computationally. Figure 4.4(a) shows black and white versions of two colour natural images: a small white house near a mountain on a green sward (top) and two trees far from the observer in a field (bottom). In Figure 4.4(c), the BS model cannot detect the objects of interest clearly; by contrast, the SR model behaves quite well, as in Figure 4.4(b). These example images show that the saliency maps derived from the SR model locate the objects more accurately than those of the BS model.
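The channel-wise treatment of colour images described at the start of this subsection can be sketched as follows. Several points are assumptions of this sketch rather than statements from the text: the opponent channels use the broadly tuned red, green, blue and yellow definitions commonly associated with the BS model (without its intensity normalization), each map is normalized by its maximum, and the three maps are integrated by a plain average. The function names are hypothetical, and `channel_saliency` stands for any single-channel saliency function, such as an SR implementation.

```python
import numpy as np

def opponent_channels(rgb):
    """Split an RGB image (floats, rows x cols x 3) into intensity, RG and BY.

    Simplified broadly tuned colour definitions; an assumption of this sketch.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    intensity = (r + g + b) / 3.0
    red = r - (g + b) / 2.0
    green = g - (r + b) / 2.0
    blue = b - (r + g) / 2.0
    yellow = (r + g) / 2.0 - np.abs(r - g) / 2.0 - b
    return intensity, red - green, blue - yellow

def colour_saliency(rgb, channel_saliency):
    """Apply channel_saliency to each channel independently and average the maps."""
    maps = [channel_saliency(channel) for channel in opponent_channels(rgb)]
    # Normalize each map to a peak of 1 so that no channel dominates by scale.
    normalised = [m / m.max() if m.max() > 0 else m for m in maps]
    return sum(normalised) / len(normalised)

# Toy example: a single red pixel on a black background. The stand-in saliency
# function (absolute deviation from the channel mean) is NOT the SR model; it
# only keeps this example short.
rgb = np.zeros((8, 8, 3))
rgb[2, 5] = [1.0, 0.0, 0.0]
combined = colour_saliency(rgb, lambda ch: np.abs(ch - ch.mean()))
peak = np.unravel_index(np.argmax(combined), combined.shape)
```

Passing the per-channel function as an argument keeps the colour handling separate from the saliency computation itself, so the same wrapper can be reused with different single-channel models.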

Figure 4.4 Comparison of SR and BS models: (a) original images; (b) resultant saliency maps from the SR model for original images; (c) saliency maps for the BS model [3]. © 2007 IEEE. Reprinted, with permission, from X. Hou, L. Zhang, ‘Saliency Detection: A Spectral Residual Approach’, IEEE Conference on Computer Vision and Pattern Recognition, June 2007


The quantitative comparison of the SR and BS models in [3] also shows the superiority of the SR model over the BS model; we do not discuss it further here. The interesting issue is why the SR model can make the object pop out in the image. We analyse this step by step in the next few sections.
