4.9 Further Discussion of Frequency Domain Approaches

This chapter has introduced several visual attention models in the frequency domain: SR, PFT, PQFT, pulsed PCA, PCT, FDN, PFDN and AQFT, and has also discussed modelling directly from the bit-stream of compressed images. These models compute the saliency map with the help of the FFT and the DCT, standard tools of image processing, and are therefore fast enough to meet the requirements of real-time processing applications, which spatial domain models cannot satisfy. The fastest models are PFT and PCT, which take about the same time and perform almost equally well; in increasing order of time cost they are followed by FDN, SR, PQFT, PFDN and pulsed PCA. AQFT and the compressed domain models are not included in this ranking because no such comparison is given in [9, 10]. Pulsed PCA mainly serves to give a reasonable explanation of why frequency models work; in practice it is rarely used, because PCT achieves the same performance faster.
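To make the speed argument concrete, the sketch below implements PCT in NumPy for a single-channel image. It is a minimal illustration under stated assumptions, not a reference implementation: the orthonormal DCT-II is built as an explicit matrix so that nothing beyond NumPy is needed, the smoothing step is a circular Gaussian blur whose width `sigma` is chosen arbitrarily, and the function names (`dct2`, `idct2`, `gaussian_smooth`, `pct_saliency`) are hypothetical. The whole pipeline is one forward transform, a sign operation, one inverse transform and a blur.

```python
import numpy as np


def dct_matrix(n):
    # Orthonormal DCT-II matrix: row k is the k-th cosine basis vector.
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c


def dct2(x):
    # Separable 2-D DCT-II: transform along columns, then along rows.
    return dct_matrix(x.shape[0]) @ x @ dct_matrix(x.shape[1]).T


def idct2(y):
    # The DCT matrix is orthonormal, so its transpose is its inverse.
    return dct_matrix(y.shape[0]).T @ y @ dct_matrix(y.shape[1])


def gaussian_smooth(x, sigma):
    # Circular Gaussian blur, applied as a multiplication in the frequency domain.
    u = np.fft.fftfreq(x.shape[0])[:, None]
    v = np.fft.fftfreq(x.shape[1])[None, :]
    transfer = np.exp(-2.0 * np.pi ** 2 * sigma ** 2 * (u ** 2 + v ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(x) * transfer))


def pct_saliency(img, sigma=2.0):
    # PCT: keep only the sign of every DCT coefficient ("pulses"),
    # transform back, square, blur, then min-max normalize to [0, 1].
    recon = idct2(np.sign(dct2(img)))
    sal = gaussian_smooth(recon ** 2, sigma)
    sal = sal - sal.min()
    return sal / sal.max() if sal.max() > 0 else sal


rng = np.random.default_rng(0)
img = rng.random((32, 48))
sal = pct_saliency(img)
print(sal.shape)  # (32, 48)
```

For larger images, `scipy.fft.dctn(x, norm="ortho")` and `scipy.fft.idctn(y, norm="ortho")` compute the same orthonormal transforms more efficiently than the explicit matrices used here.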

On most image databases, frequency domain approaches agree with psychophysical results as well as spatial domain models do, but they lack a biological basis. PFT and PQFT suggest that the phase spectrum carries the local edge information of the input image: once the amplitude spectrum is flattened, what remains is mainly high-frequency edge information, which is generally where visual attention is focused. The SR and PCT models appear to follow the same principle.
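The phase-only idea behind PFT can be illustrated with a short NumPy sketch for a grayscale image. The smoothing filter (a circular Gaussian blur with an illustrative width `sigma`) and the names `gaussian_smooth` and `pft_saliency` are assumptions of this sketch. In the demo, a weak impulse sits on a strong periodic grating: the grating's energy is concentrated in two spectral coefficients, so once the amplitude spectrum is flattened almost all of the remaining phase information describes the impulse, even though the impulse is far from being the brightest pixel.

```python
import numpy as np


def gaussian_smooth(x, sigma):
    # Circular Gaussian blur, applied as a multiplication in the frequency domain.
    u = np.fft.fftfreq(x.shape[0])[:, None]
    v = np.fft.fftfreq(x.shape[1])[None, :]
    transfer = np.exp(-2.0 * np.pi ** 2 * sigma ** 2 * (u ** 2 + v ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(x) * transfer))


def pft_saliency(img, sigma=2.0):
    spectrum = np.fft.fft2(img)
    # Flatten the amplitude spectrum: every coefficient keeps only its phase.
    phase_only = np.exp(1j * np.angle(spectrum))
    recon = np.fft.ifft2(phase_only)
    sal = gaussian_smooth(np.abs(recon) ** 2, sigma)
    sal = sal - sal.min()
    return sal / sal.max() if sal.max() > 0 else sal


# A strong vertical grating (background) plus one weak impulse (target).
n = 64
grating = 3.0 * np.cos(2 * np.pi * 4 * np.arange(n) / n)  # 4 cycles per row
img = np.tile(grating, (n, 1))
img[20, 4] += 1.0  # the grating is ~0 in column 4, so this pixel is only ~1.0

raw_peak = tuple(int(i) for i in np.unravel_index(np.argmax(img), img.shape))
sal = pft_saliency(img)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(sal), sal.shape))
print(raw_peak, peak)  # (0, 0) (20, 4)
```

The brightest raw pixel lies on the grating, while the PFT saliency peak lands on the impulse.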

PQFT treats all the features at each pixel as a single entity, and AQFT likewise combines colour and intensity features into one entity for each image patch; in some cases this gives different results from computing each channel separately. For some images or videos, information that is only apparent when the features are considered jointly is lost when the whole is projected onto separate channels. The quaternion and its Fourier transform provide a mathematical tool that solves this problem. Processing the entire feature set jointly is also consistent with the distribution of simple cells in the visual system. When no such jointly encoded information is present, multichannel PFT (or PCT) gives almost the same results as PQFT.
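One way to compute PQFT with ordinary complex FFTs is the symplectic decomposition of a quaternion image, q = f1 + f2·μ2, where f1 and f2 are complex images with μ1 as the imaginary unit; a one-sided quaternion Fourier transform whose axis is μ1 then splits into two complex FFTs, and the quaternion modulus of each coefficient is sqrt(|F1|² + |F2|²). The NumPy sketch below assumes this formulation and assumes the channel grouping f1 = M + RG·μ1, f2 = BY + I·μ1 (motion, red-green, blue-yellow and intensity channels); the smoothing filter and the name `pqft_saliency` are hypothetical. In the demo, a dot that exists only in the intensity channel is recovered as the most salient location even though a much stronger stripe pattern occupies the red-green channel.

```python
import numpy as np


def gaussian_smooth(x, sigma):
    # Circular Gaussian blur, applied as a multiplication in the frequency domain.
    u = np.fft.fftfreq(x.shape[0])[:, None]
    v = np.fft.fftfreq(x.shape[1])[None, :]
    transfer = np.exp(-2.0 * np.pi ** 2 * sigma ** 2 * (u ** 2 + v ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(x) * transfer))


def pqft_saliency(motion, rg, by, intensity, sigma=2.0):
    # Symplectic form q = f1 + f2 * mu2 of the quaternion image; f1 and f2 are
    # ordinary complex images in which mu1 plays the role of the imaginary unit.
    f1 = motion + 1j * rg
    f2 = by + 1j * intensity
    # With the transform axis equal to mu1, the quaternion spectrum is
    # F1 + F2 * mu2, where F1 and F2 are plain complex FFTs.
    F1, F2 = np.fft.fft2(f1), np.fft.fft2(f2)
    modulus = np.sqrt(np.abs(F1) ** 2 + np.abs(F2) ** 2)  # quaternion modulus
    safe = np.where(modulus > 0, modulus, 1.0)           # avoid 0 / 0
    # Phase-only quaternion spectrum, inverted part by part.
    q1 = np.fft.ifft2(F1 / safe)
    q2 = np.fft.ifft2(F2 / safe)
    # Squared quaternion modulus of the reconstruction, then blur and normalize.
    sal = gaussian_smooth(np.abs(q1) ** 2 + np.abs(q2) ** 2, sigma)
    sal = sal - sal.min()
    return sal / sal.max() if sal.max() > 0 else sal


n = 64
zeros = np.zeros((n, n))
rg = np.tile(3.0 * np.cos(2 * np.pi * 4 * np.arange(n) / n), (n, 1))  # colour stripes
intensity = zeros.copy()
intensity[40, 12] = 1.0  # one bright dot in a different feature channel
sal = pqft_saliency(zeros, rg, zeros, intensity)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(sal), sal.shape))
print(peak)  # (40, 12)
```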

The FDN and PFDN approaches are the most biologically plausible of the frequency domain models, since they simulate each step of the typical spatial model: feature extraction in the spatial model corresponds to grouping the amplitude spectrum; divisive normalization within each sub-band simulates lateral inhibition among simple cells tuned to a particular feature (orientation); and feature integration in the spatial model corresponds to combining the normalized sub-bands and applying the inverse Fourier transform. FDN and PFDN retain the phase spectrum, and when the group size of the amplitude spectrum shrinks to one pixel, FDN degenerates into PFT. FDN also considers more orientations than the spatial domain models. For example, the BS model uses only four orientation features, whereas in FDN or PFT the number of orientation features depends on the number of coefficient groups. FDN has 16 orientations at high resolution, and PFT has even more, because each pixel location in the spectrum represents an orientation. FDN or PFT may therefore perform better, and some experiments showed that, for some images, frequency domain models outperform spatial domain models [6, 8].
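The degeneration of FDN into PFT can be checked numerically. The sketch below is a simplified FDN under explicit assumptions: the coefficient groups are square `group` × `group` tiles of the unshifted FFT spectrum, and each amplitude is divided by the L2 energy of its tile. The grouping used in the original FDN (the one that yields 16 orientations at high resolution) is not reproduced, and the function names are hypothetical. With `group=1` every amplitude is divided by itself, which is exactly the flattened amplitude spectrum of PFT.

```python
import numpy as np


def gaussian_smooth(x, sigma):
    # Circular Gaussian blur, applied as a multiplication in the frequency domain.
    u = np.fft.fftfreq(x.shape[0])[:, None]
    v = np.fft.fftfreq(x.shape[1])[None, :]
    transfer = np.exp(-2.0 * np.pi ** 2 * sigma ** 2 * (u ** 2 + v ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(x) * transfer))


def normalize(sal):
    # Min-max normalization to [0, 1].
    sal = sal - sal.min()
    return sal / sal.max() if sal.max() > 0 else sal


def pft_saliency(img, sigma=2.0):
    spectrum = np.fft.fft2(img)
    recon = np.fft.ifft2(np.exp(1j * np.angle(spectrum)))  # phase only
    return normalize(gaussian_smooth(np.abs(recon) ** 2, sigma))


def fdn_saliency(img, group=4, sigma=2.0):
    spectrum = np.fft.fft2(img)
    amp, phase = np.abs(spectrum), np.angle(spectrum)
    h, w = amp.shape
    assert h % group == 0 and w % group == 0, "image size must be a multiple of group"
    # Energy (L2 norm) of each group x group tile of the amplitude spectrum.
    tiles = amp.reshape(h // group, group, w // group, group)
    energy = np.sqrt((tiles ** 2).sum(axis=(1, 3)))
    energy = np.repeat(np.repeat(energy, group, axis=0), group, axis=1)
    # Divisive normalization: each amplitude divided by the energy of its own tile.
    norm_amp = np.divide(amp, energy, out=np.zeros_like(amp), where=energy > 0)
    # Keep the original phase, invert, square, blur.
    recon = np.fft.ifft2(norm_amp * np.exp(1j * phase))
    return normalize(gaussian_smooth(np.abs(recon) ** 2, sigma))


rng = np.random.default_rng(1)
img = rng.random((32, 32))
print(np.allclose(fdn_saliency(img, group=1), pft_saliency(img)))  # True
print(fdn_saliency(img, group=4).shape)                            # (32, 32)
```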

Amplitude spectrum methods based on image patches compute the difference in amplitude spectrum between each patch and all other patches, and weight these differences by human visual sensitivity to obtain the saliency of each patch. Based on this idea, another model has been proposed that works directly on the bit-stream of compressed images: it obtains the saliency map from the bit-stream without performing an inverse discrete cosine transform (IDCT).
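A patch-based amplitude spectrum scheme can be sketched as follows. Each non-overlapping patch is summarized by the amplitude spectrum of its FFT, and its saliency is the weighted sum of its spectral distances to all other patches. The human-visual-sensitivity weighting of the original method is replaced here by a Gaussian of the distance between patch positions, which is purely an assumption, and `patch_amplitude_saliency` is a hypothetical name. In the demo all patches contain the same set of pixel values arranged as vertical stripes, except one patch whose stripes are horizontal, so only the spectra distinguish it. The compressed domain model mentioned above works from the DCT coefficients in the bit-stream instead; that variant is not sketched here.

```python
import numpy as np


def patch_amplitude_saliency(img, patch=8, spatial_sigma=None):
    rows, cols = img.shape[0] // patch, img.shape[1] // patch
    # Cut the image into non-overlapping patch x patch blocks, in row-major order.
    blocks = img[: rows * patch, : cols * patch].reshape(rows, patch, cols, patch)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(rows * cols, patch, patch)
    # Amplitude spectrum of every patch, flattened to one vector per patch.
    amps = np.abs(np.fft.fft2(blocks)).reshape(rows * cols, -1)
    # Euclidean distance between the amplitude spectra of every pair of patches.
    diff = np.linalg.norm(amps[:, None, :] - amps[None, :, :], axis=2)
    # Assumed weighting: Gaussian of the distance between patch positions
    # (a stand-in for the human-visual-sensitivity weighting of the original).
    r, c = np.divmod(np.arange(rows * cols), cols)
    dist2 = (r[:, None] - r[None, :]) ** 2 + (c[:, None] - c[None, :]) ** 2
    if spatial_sigma is None:
        spatial_sigma = float(max(rows, cols))
    weight = np.exp(-dist2 / (2.0 * spatial_sigma ** 2))
    np.fill_diagonal(weight, 0.0)  # a patch is not compared with itself
    return (weight * diff).sum(axis=1).reshape(rows, cols)


patch = 8
stripes = np.tile(np.cos(2 * np.pi * np.arange(patch) / 4), (patch, 1))  # vertical stripes
img = np.tile(stripes, (8, 8))                                           # 64 x 64 image
img[3 * patch:4 * patch, 5 * patch:6 * patch] = stripes.T                # one horizontal patch
sal = patch_amplitude_saliency(img, patch)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(sal), sal.shape))
print(sal.shape, peak)  # (8, 8) (3, 5)
```

The result is one saliency value per patch, which illustrates why the patch size (and, in overlapping variants, the overlap) determines the size of the final saliency map.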

More recently, a modified PQFT model referred to as the hypercomplex Fourier transform (HFT) model has been proposed [75]; it filters the log amplitude spectrum with Gaussian functions of different variances while keeping the phase spectrum. PQFT can be regarded as a special case of HFT in which the variance of the Gaussian function approaches infinity, which flattens the amplitude spectrum. This model appeared too recently to be covered in this chapter. We believe that more and more computational models in the frequency domain will be developed for engineering applications in the future.
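The limiting case can be illustrated with a grayscale simplification. The sketch below smooths the log amplitude spectrum with a single Gaussian of width `spectrum_sigma` (treating the spectrum as a 2-D signal) and keeps the phase. Unlike HFT, it works on one channel instead of the hypercomplex spectrum and uses one variance instead of several; the name `hft_like_saliency` is hypothetical. As `spectrum_sigma` grows, the smoothed log amplitude tends to its mean, that is, to a flat amplitude spectrum, so the result coincides with PFT, the single-channel counterpart of PQFT.

```python
import numpy as np


def gaussian_smooth(x, sigma):
    # Circular Gaussian blur, applied as a multiplication in the frequency domain.
    u = np.fft.fftfreq(x.shape[0])[:, None]
    v = np.fft.fftfreq(x.shape[1])[None, :]
    transfer = np.exp(-2.0 * np.pi ** 2 * sigma ** 2 * (u ** 2 + v ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(x) * transfer))


def normalize(sal):
    # Min-max normalization to [0, 1].
    sal = sal - sal.min()
    return sal / sal.max() if sal.max() > 0 else sal


def pft_saliency(img, sigma=2.0):
    spectrum = np.fft.fft2(img)
    recon = np.fft.ifft2(np.exp(1j * np.angle(spectrum)))  # phase only
    return normalize(gaussian_smooth(np.abs(recon) ** 2, sigma))


def hft_like_saliency(img, spectrum_sigma, sigma=2.0):
    spectrum = np.fft.fft2(img)
    log_amp = np.log(np.abs(spectrum) + 1e-12)
    # Gaussian filtering of the log amplitude spectrum, treated as a 2-D signal.
    amp = np.exp(gaussian_smooth(log_amp, spectrum_sigma))
    # The phase spectrum is kept unchanged.
    recon = np.fft.ifft2(amp * np.exp(1j * np.angle(spectrum)))
    return normalize(gaussian_smooth(np.abs(recon) ** 2, sigma))


rng = np.random.default_rng(2)
img = rng.random((32, 32))
wide = hft_like_saliency(img, spectrum_sigma=1e3)    # very large variance
narrow = hft_like_saliency(img, spectrum_sigma=1.0)  # moderate smoothing
print(np.allclose(wide, pft_saliency(img)))  # True: flattened amplitude, as in PFT
```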

The deficiencies of frequency domain models are also obvious.

(1) Fourier transform methods applied to the whole image (SR, PFT, PQFT, PCT and FDN) consider only the global properties of the image; for instance, the divisive normalization in FDN simulates lateral inhibition over the whole image extent. However, salient objects often pop out within a limited surround, which is why, for some images, a frequency model is not as good as a spatial domain model. The strategy of the PFDN approach can be applied to any frequency domain model to overcome this deficiency, but the improvement is limited because the strategy is not flexible enough.

(2) Frequency models require few parameters, which makes them superior to spatial models in this respect, but the image must be resized in a pre-processing stage. The reason is that in several frequency domain models (SR, PFT, PQFT, PCT and FDN) flattening the amplitude spectrum enhances high frequencies, which may highlight noise in the image; choosing a suitable image scale and smoothing filter reduces this noise. For models based on the amplitude spectrum of image patches, the patch size and the overlap between patches must be selected, because they determine the size of the final saliency map. A frequency model is well suited to cases in which the object contains many high-frequency components while the background contains mainly low-frequency components, or in which the scene is sparse enough. For the psychological patterns in Chapter 2 and for natural images with an object on a simple background (a boat on a blue sea, a white sheep on a green lawn), frequency models work well. When an object with low-frequency components lies within a complex background (see Figure 3.10(a): a small homogeneous region (the object) among many long bars of various colours and random orientations (the background)), the image size in PFT, SR and PQFT, the patch size in AQFT and the variance of the Gaussian function in HFT must all be chosen very carefully.

(3) Top-down attention can easily be added as a new channel to a bottom-up spatial model, such as the Wolfe model introduced in Chapter 2 or the BS model mentioned in Chapter 3. Most frequency models, however, handle only bottom-up attention: they provide candidate object regions in the scene, and object recognition based on top-down attention needs to be reconsidered, because local information is not available in the frequency domain.
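The noise argument in point (2) can be checked numerically. The sketch below measures how much spectral energy lies above a radial frequency cutoff before and after the amplitude spectrum is flattened, and shows that block averaging (one simple way to reduce the image scale) lowers the noise level. The cutoff of 0.25 cycles per pixel, the noise level, the blob size and the helper names `block_downsample` and `high_freq_energy_fraction` are illustrative assumptions.

```python
import numpy as np


def block_downsample(img, factor):
    # Average non-overlapping factor x factor blocks: a crude low-pass filter plus resize.
    h = (img.shape[0] // factor) * factor
    w = (img.shape[1] // factor) * factor
    blocks = img[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))


def high_freq_energy_fraction(spectrum, cutoff=0.25):
    # Share of spectral energy at radial frequencies above `cutoff` (cycles/pixel).
    u = np.fft.fftfreq(spectrum.shape[0])[:, None]
    v = np.fft.fftfreq(spectrum.shape[1])[None, :]
    energy = np.abs(spectrum) ** 2
    return energy[np.hypot(u, v) > cutoff].sum() / energy.sum()


rng = np.random.default_rng(0)
n = 128
y, x = np.mgrid[0:n, 0:n]
blob = np.exp(-((x - 64) ** 2 + (y - 64) ** 2) / (2 * 12.0 ** 2))  # smooth object
noise = 0.02 * rng.standard_normal((n, n))                         # weak noise
noisy = blob + noise

spectrum = np.fft.fft2(noisy)
flattened = np.exp(1j * np.angle(spectrum))  # amplitude spectrum flattened, as in PFT
print(high_freq_energy_fraction(spectrum))   # about 0.01: the smooth object dominates
print(high_freq_energy_fraction(flattened))  # about 0.8: every bin now weighs the same

small = block_downsample(noisy, 4)
print(small.shape)                                    # (32, 32)
print(noise.std(), block_downsample(noise, 4).std())  # about 0.02 and 0.005
```

After flattening, the share of high-frequency energy simply equals the share of high-frequency bins, whatever the image content, so weak noise gets as much weight as the object; averaging 4 × 4 blocks divides the standard deviation of independent noise by 4 before that happens.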
