Apart from the early spectral attention model based on a DCT block mentioned by Itti et al. [2], the first pure bottom-up model in the frequency domain was proposed in [3] and is called spectral residual, SR for short. The method only needs to compute the residual amplitude spectrum; the saliency map is then simply the image recovered by the inverse FFT. No biological evidence confirms spectral processing in the brain, yet the simple SR model yields almost the same saliency results as the BS model, and its computational cost is very low. Recall that the C++ implementation of the BS model, named NVT or NVT+ [22], is already fast; the SR model is so parsimonious that even its MATLAB® version is suitable for real-time applications. Since the SR model was introduced, several improved or simpler computational models in the frequency domain have been proposed [5–10]. Setting aside for now the reason why frequency domain models perform well, this section first introduces the idea of SR, its algorithm and some simulation results. Some alternative frequency domain models will be illustrated in the following six sections. A biologically plausible model in the frequency domain is proposed in Section 4.6, which can partly explain why these spectral models work. It is worth noting that, up to now, all visual attention models in the frequency domain are pure bottom-up models, because top-down information is not easy to integrate into frequency domain computation. Finally, the advantages and limitations of computational models in the frequency domain are discussed in the final section, to help users select a model for their particular applications.
As mentioned in Chapters 2 and 3, the attention focuses of human vision in a scene often fall on regions that deviate from the normal structure or colour. Our brain has the ability to suppress redundant information and to process unusual signals first [23, 24]. The SR model suggests that the whole information in an image or scene is divided into two parts: one is the novel part that includes unexpected signals, and the other is the redundant or inherent information. The overall information is a simple summation of the two parts:

$$\mathrm{IN}(\text{image}) = \mathrm{IN}(\text{innovation}) + \mathrm{IN}(\text{inherent information})$$
where IN represents information. Attention should suppress IN(inherent information) and extract IN(innovation). How do we distinguish the two kinds of information? An intuitive expectation is that pixels belonging to the innovation will have less statistical dependence than pixels from the inherent information. Consider the amplitude spectrum of a natural image, which can also be regarded as the sum of two amplitude spectra (innovation and inherent information). From Section 4.1.2 (Equations 4.4 and 4.5), the average amplitude spectrum of natural images satisfies a 1/f law (the frequency power index α = 1); that is, replacing the power spectrum in Equation 4.4 with the amplitude spectrum gives

$$E\{A(f)\} \propto \frac{1}{f} \qquad (4.6)$$

where $A(f)$ denotes the amplitude spectrum at frequency f and $E\{\cdot\}$ the average over natural images.
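The 1/f trend of the amplitude spectrum can be checked numerically. The following is a minimal sketch, not part of the SR model itself: it averages the amplitude spectrum over rings of equal radius and fits a straight line in log-log coordinates. The function names, the synthetic test image and the fitting range (`r_min` up to half the image size) are assumptions made for illustration.

```python
import numpy as np

def radial_amplitude_profile(image):
    """Mean amplitude spectrum on rings of rounded integer radius around DC."""
    amplitude = np.abs(np.fft.fftshift(np.fft.fft2(image)))
    h, w = image.shape
    y, x = np.indices((h, w))
    # After fftshift the zero frequency sits at index (h // 2, w // 2).
    radius = np.rint(np.hypot(y - h // 2, x - w // 2)).astype(int)
    sums = np.bincount(radius.ravel(), weights=amplitude.ravel())
    counts = np.bincount(radius.ravel())
    return sums / counts  # profile[r] = mean amplitude at radius r

def amplitude_slope(image, r_min=2):
    """Slope of log(mean amplitude) against log(radius); about -1 for a 1/f law."""
    profile = radial_amplitude_profile(image)
    r = np.arange(r_min, min(image.shape) // 2)  # complete rings only
    slope, _ = np.polyfit(np.log(r), np.log(profile[r]), 1)
    return slope

def synthetic_one_over_f(n=128, seed=0):
    """Hypothetical test image: random phase with an amplitude spectrum of 1/f."""
    rng = np.random.default_rng(seed)
    f = np.hypot(np.fft.fftfreq(n)[:, None], np.fft.fftfreq(n)[None, :])
    f[0, 0] = 1.0  # avoid division by zero at the DC term
    phase = rng.uniform(0.0, 2.0 * np.pi, size=(n, n))
    return np.real(np.fft.ifft2(np.exp(1j * phase) / f))

slope = amplitude_slope(synthetic_one_over_f())
print(f"fitted log-log slope: {slope:.2f}")
```

Applied to a single natural image rather than to an average over many, the fitted slope would be expected to scatter around −1, which is the individual variation that motivates the smoothing step of the SR model.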
Note that Equation 4.6 is not well suited to analysing individual images, because the amplitude spectra of different images differ slightly in shape although the trend is similar. The SR model [3] assumes that the smooth 1/f curve can be regarded as inherent information, so the difference (residual) between the original and smoothed amplitude spectra is the innovation. Therefore, the image recovered from the innovation by the inverse Fourier transform will show the salient information. The SR model adopts the log amplitude spectrum [3], $L(f) = \log A(f)$, where $A(f)$ is the amplitude spectrum. The average log amplitude spectrum, which has been widely used in the literature [12, 17, 25], often appears locally linear. The SR model relies on the assumption that similarity of amplitude spectra implies redundancy of information in the image; thus the residual log amplitude spectrum at frequency f is

$$R(f) = L(f) - h_n(f) * L(f) \qquad (4.7)$$

where $L(f)$ is the value of the original log amplitude spectrum at frequency f and $h_n(f) * L(f)$ is its value after a smoothing operation that approximates the part the image shares with the statistics of natural images. $h_n(f)$ is a 2D local smoothing filter in the frequency domain, where the subscript n is the template size of the filter; for example, it may average the values in an n × n neighbourhood. The symbol * denotes convolution. The filter approximates the statistical average of the amplitude spectra because it removes individual differences while keeping the overall trend. $R(f)$ is the innovation part in frequency space, called the spectral residual.
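The residual computation can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the local filter is taken to be an n × n averaging kernel, n = 3 is only an example value, and the spectrum is wrapped at its borders because the DFT is periodic. The function names are hypothetical.

```python
import numpy as np

def local_mean(spectrum, n=3):
    """Average over an n x n neighbourhood (n odd), wrapping at the borders."""
    pad = n // 2
    padded = np.pad(spectrum, pad, mode="wrap")
    h, w = spectrum.shape
    out = np.zeros((h, w), dtype=float)
    # Sum the n * n shifted copies of the spectrum, then divide by their count.
    for dy in range(n):
        for dx in range(n):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (n * n)

def spectral_residual(log_amplitude, n=3):
    """Original log amplitude spectrum minus its locally smoothed version."""
    return log_amplitude - local_mean(log_amplitude, n)

# Toy log spectrum: flat everywhere except one isolated peak.
toy = np.zeros((5, 5))
toy[2, 2] = 9.0
print(spectral_residual(toy))
```

In the toy spectrum the flat region far from the peak gives a zero residual, while the isolated peak survives almost unchanged; this is the sense in which smoothing removes the shared trend and keeps the individual deviation.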
The SR model is easy to implement in five steps: (1) resize the input image to a standard size; (2) apply the discrete Fourier transform to the standard image and take the natural logarithm of the amplitude spectrum; (3) calculate the spectral residual of the log amplitude spectrum with the aid of Equation 4.7; (4) apply the inverse discrete Fourier transform, keeping the phase spectrum of the standard image and substituting the exponential of the spectral residual for the amplitude spectrum; (5) post-process the recovered image with a low-pass Gaussian filter; the resulting image is the saliency map of the original image.
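Given an image I with M × N pixels, that is, an array I(x, y) with x ∈ {0, …, N − 1} and y ∈ {0, …, M − 1}, the SR algorithm is realized as follows:

$$
\begin{aligned}
F(u, v) &= \mathcal{F}[I(x, y)] \\
A(u, v) &= |F(u, v)|, \qquad \varphi(u, v) = \angle F(u, v) \\
L(u, v) &= \log A(u, v) \\
R(u, v) &= L(u, v) - h_n(u, v) * L(u, v) \\
SM(x, y) &= G(x, y) * \left| \mathcal{F}^{-1}\left[ \exp\big( R(u, v) + j\varphi(u, v) \big) \right] \right|^2
\end{aligned}
\qquad (4.8)
$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the discrete Fourier transform and its inverse applied to the resized image, A and φ are the amplitude and phase spectra, G is the low-pass Gaussian filter and SM is the saliency map; the squared magnitude of the recovered image follows [3].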
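The five steps can be put together in a short sketch. It is a minimal illustration rather than the reference implementation of [3]: the standard size of 64 × 64, the 3 × 3 averaging filter, the Gaussian width `sigma`, nearest-neighbour resizing, the squaring of the recovered magnitude and the final scaling to a maximum of 1 are all assumptions chosen for illustration, and the helper names are hypothetical.

```python
import numpy as np

def resize_nearest(image, height, width):
    """Step 1: nearest-neighbour resize to a standard size."""
    rows = np.arange(height) * image.shape[0] // height
    cols = np.arange(width) * image.shape[1] // width
    return image[np.ix_(rows, cols)]

def local_mean(spectrum, n=3):
    """n x n averaging filter (n odd) with wrap-around borders."""
    pad = n // 2
    padded = np.pad(spectrum, pad, mode="wrap")
    h, w = spectrum.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(n):
        for dx in range(n):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (n * n)

def gaussian_blur(image, sigma):
    """Separable low-pass Gaussian filter (zero padding at the borders)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    image = np.apply_along_axis(np.convolve, 0, image, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 1, image, kernel, mode="same")

def sr_saliency(image, size=64, n=3, sigma=2.5, eps=1e-12):
    standard = resize_nearest(np.asarray(image, dtype=float), size, size)
    spectrum = np.fft.fft2(standard)                      # step 2: DFT
    log_amplitude = np.log(np.abs(spectrum) + eps)        # eps avoids log(0)
    phase = np.angle(spectrum)
    residual = log_amplitude - local_mean(log_amplitude, n)   # step 3
    recovered = np.fft.ifft2(np.exp(residual + 1j * phase))   # step 4
    saliency = gaussian_blur(np.abs(recovered) ** 2, sigma)   # step 5
    return saliency / saliency.max()                      # scale to [0, 1]

# Hypothetical test scene: weak noise with one small bright square.
rng = np.random.default_rng(0)
scene = 0.05 * rng.standard_normal((64, 64))
scene[40:48, 16:24] += 1.0
saliency_map = sr_saliency(scene)
print(np.unravel_index(np.argmax(saliency_map), saliency_map.shape))
```

Only two FFTs, one small convolution in the frequency domain and one Gaussian blur are needed, which is consistent with the low computational cost attributed to the SR model.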
For colour natural images, the SR model processes two colour channels and one intensity channel independently, as defined in the BS model, and then integrates the three individual conspicuity maps into the final saliency map. To compare the SR approach with the BS model, both models are applied in [3] to the same data set of 62 natural images, where the BS model uses the code from [22]. The comparison reported that the SR model performs better than the original BS model in terms of object hit ratio and visual results [3]; in particular, the SR model, although it has less biological support, is computationally very efficient. Figure 4.4(a) shows the black and white versions of two colour natural images: a small white house near a mountain on a green sward (top) and two trees far from the observer in a field (bottom). In Figure 4.4(c), the BS model cannot detect the objects of interest clearly; by contrast, the SR model behaves quite well, as shown in Figure 4.4(b). These example images show that the saliency maps derived from the SR model locate the objects more accurately than those of the BS model.
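The channel handling for colour images can be sketched as follows. This is a simplified illustration under several assumptions: the intensity and broadly tuned red-green and blue-yellow opponent channels follow the usual BS-style definitions without clamping or intensity normalisation, the three maps are rescaled to [0, 1] and averaged (the source does not specify the integration rule), and `saliency_fn` is a hypothetical placeholder for any single-channel saliency function such as an SR implementation.

```python
import numpy as np

def opponent_channels(rgb):
    """Intensity plus red-green and blue-yellow opponent channels."""
    r, g, b = (rgb[..., i].astype(float) for i in range(3))
    intensity = (r + g + b) / 3.0
    red = r - (g + b) / 2.0
    green = g - (r + b) / 2.0
    blue = b - (r + g) / 2.0
    yellow = (r + g) / 2.0 - np.abs(r - g) / 2.0 - b
    return intensity, red - green, blue - yellow

def integrate_maps(maps):
    """Rescale each map to [0, 1] and average them (assumed integration rule)."""
    rescaled = []
    for m in maps:
        span = m.max() - m.min()
        rescaled.append((m - m.min()) / span if span > 0 else np.zeros_like(m))
    return sum(rescaled) / len(rescaled)

def colour_saliency(rgb, saliency_fn):
    """Apply saliency_fn to each channel independently, then integrate."""
    return integrate_maps([saliency_fn(c) for c in opponent_channels(rgb)])

# Stand-in single-channel saliency (NOT the SR model): deviation from the mean.
def deviation(channel):
    return np.abs(channel - channel.mean())

image = np.full((8, 8, 3), 0.5)      # grey background
image[3, 4] = (1.0, 0.0, 0.0)        # one pure red pixel
result = colour_saliency(image, deviation)
print(np.unravel_index(np.argmax(result), result.shape))
```

Keeping the per-channel saliency function as a parameter makes the independence of the three channels explicit: each channel is processed on its own, and the channels only meet in the integration step.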
The quantitative comparison of the SR and BS models in [3] also shows the superiority of the SR model over the BS model; we do not discuss it further here. The interesting question is why the SR model makes the object pop out of the image. The next few sections analyse this step by step.