1.4 Visual Attention Model Development

The history of research and development toward selective visual attention can be divided into three phases. The first phase began in the era of William James, more than a century ago; we refer to it as the biological study phase. In this phase, many neurophysiologists and psychologists made valuable discoveries and developed theories of visual attention.

Although Aristotle had already noted the phenomenon of attention in humans and animals in the ancient era (c. 300 BC), systematic research on visual attention began with James's book, The Principles of Psychology [21]. The concepts of pre-attention and attention, two-stage models, competition and normalization in the neuronal system, the feature integration theory of attention and others were proposed and examined in this first phase [4, 5, 25, 26], supported by many psychological and physiological experiments. The theories and methodologies devised in this phase later became the basis for building computational attention models.

The next phase started in the 1980s. A two-dimensional map (i.e., the saliency map) that encodes the conspicuity of visual stimuli was put forward by Koch and Ullman in 1985 [35]. Over the past 30 years, various computational models that automatically generate the saliency map, using both spatial domain and frequency domain approaches, have been proposed to simulate bottom-up or top-down visual attention [9–13, 43–45]. These computational models reproduce the attention phenomena observed in physiological and psychological experiments. Many scientists and engineers in computer vision, artificial intelligence and computer engineering joined the studies in this second phase, so methods for measuring performance and comparing different models have also appeared.

The third phase began at the end of the 1990s, after many computational models had been built. Many applications of visual attention to object detection, image and video coding, image segmentation, quality assessment of image and video, and so on have been proposed. It is now recognized that visual attention plays a central role not only in the study of biological perception but also in computer vision and other engineering areas. It should be noted that although the three phases started at different times, they now run concurrently, because work in all three areas is still ongoing.

1.4.1 First Phase: Biological Studies

The studies in the first phase of visual attention were based on evidence from psychology and physiology, whose contributions alternated. As mentioned above, this phase started with the book The Principles of Psychology [21], published in 1890 by W. James, who was the first to publish a body of facts related to brain functions and activities. Visual attention is discussed in a chapter of this book. Two-component attention and covert attention (attention without eye movements) were mentioned there, although they were not yet named or formally defined.

Over half a century later, in the 1960s, physiologists Hubel and Wiesel recorded the activity of single cells in the primary visual cortex of cats, and reported that some cells responded preferentially to input stimuli with particular spatial orientations in their receptive fields [6, 31]. Many electrophysiological experiments then showed that some basic neurons in the early visual cortex respond to features in their receptive fields other than orientation, such as colour contrast, motion direction, spatial frequency and so on [7, 32, 33]. This physiological evidence suggests that the visual scene is analysed and selected in the early visual cortex, and that these features are then mapped onto different regions of the brain [34]. In the same decade, in 1967, the Russian biophysicist Yarbus developed a novel set of devices and a related method to accurately record eye movement tracks while observers watched scenes with or without cue guidance [24]. His studies on eye movement have had a significant influence on visual attention research, especially on overt attention. In the same year, further contributions came from psychology; for example, Neisser [25] suggested two-stage attention: pre-attention and attention. Pre-attention is parallel processing over the whole visual field at one time. Attention is limited-capacity processing, restricted at any moment to a smaller area related to the object or event of interest in the visual field. Hoffman later proposed a two-stage processing model (pre-attention and attention stages) in 1975 [26]. Extending the two-stage concept 21 years later, Wolfe [28] proposed in his book a post-attention stage to supplement the system after the attention stage.

In the 1980s, psychologists Treisman and Gelade proposed the feature integration theory of visual attention [4], based on physiological evidence that single cells extract features from the visual field in parallel. How can these separate features of an object in the visual field be combined? Their theory suggests that the features obtained from parallel perception need focal attention in order to form a single object. Several experimental paradigms have confirmed the feature integration hypothesis [4]. Treisman and Gelade found that searching for a target defined by a single feature is very easy, since it does not require considering the relations among different features; by contrast, searching for a conjunction of more than one feature is slower, because it involves a serial search. The feature integration theory has become the foundation of subsequent attention studies.

An implementation of the feature integration theory, the guided search model, was proposed by Wolfe et al. in 1989, 1994 and 1996 [5, 28, 30]. In the original guided search model, a parallel process over several separable features guides attention during search for an object defined by the conjunction of multiple features [5]. In the revised version, the guided search 2.0 model [30], each feature consists of several channels (e.g., red, yellow, green and blue channels for the colour feature), and three features (colour, orientation, and another such as size or shape) are synthesized from their respective channels in a bottom-up process. The features are extracted from the visual field in parallel to form three feature maps. Information from the top-down process guides which locations in the feature maps are activated, and the three feature maps are then integrated into a two-dimensional topographic activation map. The attention focus is located at the higher values of the activation map, as sketched below. We shall see that the well-known computational models proposed by Koch and Ullman [35] and by Itti et al. [43] are very close to the guided search 2.0 model. Since the simulation results and conclusions of the guided search model are grounded in psychology, in this book we label it as a model from psychological studies.
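To make this data flow concrete, the following minimal sketch (Python with NumPy; the array sizes, channel choices and gain values are illustrative assumptions, not parameters of the published model) combines per-channel feature maps with task-dependent top-down gains into a single activation map:

```python
import numpy as np

# Illustrative sketch of a guided-search-style activation map.
# Each feature has several channels, extracted in parallel as
# 2-D maps over the visual field (random stand-ins here).
H, W = 64, 64
rng = np.random.default_rng(0)

feature_channels = {
    "colour":      rng.random((4, H, W)),  # e.g. red/yellow/green/blue
    "orientation": rng.random((4, H, W)),  # e.g. 0/45/90/135 degrees
    "size":        rng.random((2, H, W)),
}

# Hypothetical top-down gains: the current task (say, "find the red
# bar") boosts the channels that describe the target.
top_down_gain = {"colour": 2.0, "orientation": 1.0, "size": 0.5}

# Bottom-up: each feature map is the sum of its channels;
# top-down: a task-dependent gain weights each feature map.
activation = np.zeros((H, W))
for name, channels in feature_channels.items():
    feature_map = channels.sum(axis=0)
    activation += top_down_gain[name] * feature_map

# Attention is deployed to the peak of the activation map.
focus = np.unravel_index(np.argmax(activation), activation.shape)
print("attended location (row, col):", focus)
```

In guided search 2.0 the gains and combination rules are considerably richer; the point here is only the parallel feature maps, the top-down weighting and the peak selection.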

Lateral inhibitory interaction among cells of the early visual system was discovered by physiologists in the 1950s [46]. In the 1960s and 70s, physiologists found that the receptive field of a retinal ganglion cell in cats shows central enhancement with surround inhibition (on/off) or the reverse (off/on) [47, 48]. Later experiments in the extrastriate cortex of the macaque found a competing mechanism in other areas of the visual cortex [49–51]. If a local region of the cortex receives input from two stimuli, the neuronal response in that region is generated through competitive interaction with a mutually suppressive effect [52, 53]. The cell with the strongest response can suppress the responses of its surrounding cells, which leads to the winner-takes-all (WTA) strategy. The phenomenon of competition had been discovered earlier in a psychological experiment: when two objects were presented in the visual field, the subject focused on only one object at a time because of the WTA strategy; competition and attention are thus closely linked. In the 1990s, Desimone discussed the relation between attention and competition: neurons representing different stimulus components compete with each other, and attention operates by biasing the competition in favour of the neurons that encode the attended stimulus [53]. Because of the competitive nature of visual selection, most attention models are based on WTA networks, such as the one proposed by Lee et al. in 1999 [54]. Through the neurons' computations, a WTA network selects the winning neuron, and the winner's location becomes the fixated focus [54].

In the same period, a normalization model of attention was proposed based on the non-linear response of simple cells in the primary visual cortex [55]. This idea stems from physiological investigations of simple cells in cats [56, 57], which depart from the longstanding view of a linear response. In 1994, Carandini and Heeger, and in 2009 Reynolds and Heeger, proposed that the non-linear response of a simple cell can be represented as the linear response of each cell divided by the pooled activity of all cells, which is called the normalization model [58, 59] (a generic form is given after this paragraph). Normalization can also be explained by suppressive phenomena resulting from the inhibition of neighbouring cells. In the late 1980s, stimulus-related neuronal oscillations were discovered in the primary visual cortex of cats and monkeys [60, 61]. These findings supported the hypothesis that neuronal pulse synchronization might be a mechanism for linking local visual features into a coherent global percept. Based on these pulse-synchronized oscillations, many spiking neural networks were proposed [62–64]. Since spiking neural networks consider the connected context and pulse transfer between neurons, they can simulate visual attention phenomena well [65, 66].
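For reference, one commonly cited generic form of the divisive normalization equation (our notation; the exact formulations in [58] and [59] differ in detail) is

$$R_i = \frac{\gamma\, L_i^{\,n}}{\sigma^{n} + \sum_{j} L_j^{\,n}},$$

where $L_i$ is the linear (stimulus-driven) drive of cell $i$, the sum runs over the normalization pool of neighbouring cells, $\sigma$ keeps the response finite at low contrast, $n$ sets the strength of the non-linearity and $\gamma$ is a scaling constant. A cell's output is thus suppressed whenever its neighbours are strongly driven.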

Many physiological and psychological experiments have shown that top-down attention plays a critical role in object search [67–69]. In Wolfe's guided search, bottom-up activations are modulated by top-down gains that specify the contribution of each feature map to the current task [30]. In the review article by Itti and Koch [23], top-down processing is represented as a hierarchical decision tree that can learn object knowledge, with signals from the decision tree controlling the salient location in the input visual field. Some single-cell recording studies suggested that top-down control signals from the working memory of object representations can modulate neural responses, so top-down models in which working memory biases selection in favour of the object were proposed [37, 70, 71].

It should be noted that the studies in this first phase of research and development mostly aimed to reveal attention phenomena and find the related principles, so most of the models proposed in this phase are formulated in principle, to verify physiological and psychological experiments. Nevertheless, these research results from physiology and psychology are the foundation for building the computational models of visual attention in the next phase. We should also note that this phase is not over yet, since the related research still continues. Every new finding on visual attention from physiology and psychology promotes the development of computational models and applications in computer engineering. The reader is encouraged to keep track of the latest progress in the related scientific and technical literature.

1.4.2 Second Phase: Computational Models

We classify computational models of visual attention into two genres: one is biologically plausible (i.e., consistent with physiological facts) and also incorporates the relevant psychophysical facts; the other is not based explicitly on biology or psychophysics. Biologically plausible models come in two types: function block form and neuron-connective form. Models not explicitly based on biology also come in two types: one adheres to the hypothesis that our sensory system develops in response to the statistical properties of the signals to which it is exposed, and the other works in the frequency domain with fast computational speed. Because bottom-up attention is purely stimulus-driven and therefore easily validated, most computational models simulate bottom-up attention. The major computational models are listed and discussed below, in chronological order and by genre.

The first biologically plausible computational model based on the feature integration theory was proposed by Koch and Ullman in 1985 [35]; it is a pure bottom-up model. It attempts to account for attention phenomena including selective attention shifting. The concept of the saliency map was first proposed in this work [35]. In the pre-attention stage, a number of elementary features of the visual field are extracted in parallel and represented as topographical maps called feature maps. Another topographical map, named the saliency map, then combines the information of the individual feature maps into one global measure. A competition rule, winner-take-all (WTA) [35, 72], is employed on the neural network of the saliency map. The most conspicuous location in the saliency map is mapped into a central representation, so that only a single location in the visual field is selected. After that, an inhibition signal is added at the selected location, which makes the attention focus shift to the second most conspicuous location, and so on. The saliency map is evidently similar to the activation map proposed by Wolfe [30], as discussed above. Saliency maps have been validated by electrophysiological and psychological experiments [73, 74].
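This select–inhibit–shift loop can be sketched in a few lines (illustrative Python; the random saliency map, the inhibition radius and the number of shifts are assumptions for demonstration only):

```python
import numpy as np

def attention_shifts(saliency, n_fixations=3, inhibit_radius=5):
    """Winner-take-all selection with inhibition of return:
    repeatedly pick the most salient location, then suppress
    its neighbourhood so attention shifts to the next winner."""
    s = saliency.copy()
    fixations = []
    rows, cols = np.indices(s.shape)
    for _ in range(n_fixations):
        winner = np.unravel_index(np.argmax(s), s.shape)  # WTA step
        fixations.append(winner)
        # Inhibition of return: zero out a disc around the winner.
        mask = ((rows - winner[0]) ** 2
                + (cols - winner[1]) ** 2) <= inhibit_radius ** 2
        s[mask] = 0.0
    return fixations

saliency = np.random.default_rng(1).random((32, 32))
print(attention_shifts(saliency))  # sequence of attended locations
```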

One of the most influential computational models of bottom-up saliency detection was proposed by Itti et al. in 1998 [43]. In this model, features in four orientations, two antagonistic colour pairs and one intensity channel are extracted from the input visual field at multiple resolutions, and the centre–surround differences between resolutions form conspicuity maps for the orientation, colour and intensity channels. Finally, a two-dimensional saliency map is computed from these conspicuity maps, and the location with the highest value of the saliency map is taken as the fixated focus. The model is described in detail in their paper, and the C++ program implementing it is freely available on the NVT website (http://ilab.usc.edu/toolkit/). Most subsequent function-block models [14, 75–77] and applications [78–80] are based on Itti et al.'s computational model.
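A highly simplified rendering of the centre–surround step, for a single intensity channel, might look as follows (Python with NumPy/SciPy; the pyramid depth, blur width and scale pairs are assumptions, and the published model uses specific scale combinations and a normalization operator that are omitted here):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def center_surround(intensity, center=2, deltas=(3, 4)):
    """Sketch of the centre-surround step in an Itti-style model:
    build a Gaussian pyramid from one channel, then take absolute
    differences between a fine 'centre' level and coarser 'surround'
    levels upsampled back to the centre's resolution."""
    pyramid = [intensity.astype(float)]
    for _ in range(center + max(deltas)):
        # Each level: blur, then downsample by a factor of two.
        pyramid.append(gaussian_filter(pyramid[-1], sigma=1.0)[::2, ::2])

    c = pyramid[center]
    feature_map = np.zeros_like(c)
    for d in deltas:
        s = pyramid[center + d]
        s_up = zoom(s, (c.shape[0] / s.shape[0],
                        c.shape[1] / s.shape[1]), order=1)
        feature_map += np.abs(c - s_up[:c.shape[0], :c.shape[1]])
    return feature_map

img = np.random.default_rng(2).random((256, 256))
conspicuity = center_surround(img)  # one conspicuity map; the full
                                    # model repeats this per channel
                                    # and combines the results
```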

Biologically plausible models at the neuronal level consider the cells' temporal or spatial activities. Models of temporal activity suggest that attention binds together those neurons whose activities relate to the temporal features of a single object at a time [81]; typical binding models implementing visual attention were proposed in [66, 82]. Neuronal-level models of spatial activity are single-layer or multilayer neural networks that exploit the cells' context, such as the WTA strategies of [72, 83] and the neuronal linear threshold units in the improved model of Walther et al. [77].

In a 1961 article, Barlow addressed the issue of optimal coding in the brain, which can reduce the redundancy of the incoming data from sensors such as the eyes or ears [84]. Optimal coding with the brain's limited resources is analogous to complexity considerations in information theory. In 1990, Tsotsos framed visual attention from this standpoint: visual attention serves to compress and abstract the data arriving from the retina [85]. Since then, the view that 'without attention, general-purpose vision is not possible' has become widely accepted [35], and it has become clear that visual attention plays an important role in computer vision and other engineering applications.

In the 2000s, some models were proposed that are not explicitly biologically based but adhere to the hypothesis that our sensory system develops in response to the statistical properties of the signals on the retina [84, 86]. In 2004, Gao and Vasconcelos introduced decision theory into the computation of the saliency map [12]. As in Itti et al.'s model, the visual field is first projected into several feature maps such as colour, intensity and orientation. Central and surrounding windows are then analysed at each location, with the mutual information between centre and surround serving as the saliency measure, and the final saliency map is the sum of the saliency over all features. This model is called the discriminant centre–surround model (DISC) [12–14]. It is not clear whether a formula for mutual information exists in the brain, but the feature extraction and centre–surround operations accord with biological facts. Bruce and Tsotsos advocated information maximization for attention modelling, using Shannon's self-information criterion to define saliency; this is referred to as the attention of information maximization (AIM) model [87]. In AIM, a sparse representation of image statistics, analogous to simple-cell receptive field properties and based on independent component bases, is first learned from many natural image patches. Saliency is then determined by quantifying the self-information of each local image patch in the input scene, so that the localized saliency computation maximizes the information sampled from the environment [10, 87, 88].
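The self-information idea is easy to demonstrate with a crude stand-in for the learned sparse basis (illustrative Python; AIM itself scores learned independent-component coefficients, whereas this sketch simply histograms raw feature values):

```python
import numpy as np

def self_information_saliency(feature_map, n_bins=32):
    """Sketch of the self-information criterion: estimate the
    probability of each pixel's feature value from a global
    histogram, and score saliency as Shannon self-information,
    -log p(feature). Rare feature values are salient."""
    hist, edges = np.histogram(feature_map, bins=n_bins)
    p = hist / hist.sum()                        # empirical distribution
    idx = np.clip(np.digitize(feature_map, edges[1:-1]), 0, n_bins - 1)
    return -np.log(p[idx] + 1e-12)               # per-pixel saliency

sal = self_information_saliency(np.random.default_rng(5).random((64, 64)))
```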

Itti and Baldi [9] suggested that the concept of surprise drives the subject's visual attention. Surprise is related to the uncertainty and prior experience of the observer, which fits the Bayesian approach, so they proposed the distance between the posterior and prior probability densities as a measure of attention [9, 89]; a generic form is given after this paragraph. This surprise measure can be applied to attention for both still images and video. Saliency detection with DISC, AIM and surprise all require estimating the probability of the input data. Along the same lines, Zhang et al. [11] developed a computational model named saliency using natural statistics (SUN), in which the measure of saliency depends on statistics collected from natural images, rather than being computed from the current input image as in other methods.
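In generic notation, the surprise elicited by data $D$ under a family of models $M$ with prior beliefs $P(M)$ is the divergence between the posterior and the prior:

$$S(D) = \mathrm{KL}\big(P(M \mid D)\,\|\,P(M)\big) = \int P(M \mid D)\,\log\frac{P(M \mid D)}{P(M)}\,dM .$$

Data that leave the observer's beliefs unchanged carry zero surprise, however improbable they were a priori.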

In 2007, Harel et al. proposed a graph-based approach to visual saliency (GBVS), using dissimilarity weights as neuronal connection weights [44]. As mentioned above, this model supplements Itti et al.'s original model with neuronal connection weights and graph theory.
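The graph-theoretic idea can be sketched as follows (illustrative Python; the edge-weight formula and the power-iteration solver are simplifications of, not quotations from, the construction in [44]):

```python
import numpy as np

def gbvs_like(feature_map, sigma=8.0, n_iter=200):
    """Sketch of the graph-based idea: nodes are map locations,
    edge weights grow with feature dissimilarity and decay with
    distance, and the stationary distribution of the resulting
    Markov chain concentrates mass on salient (dissimilar) nodes."""
    h, w = feature_map.shape
    vals = feature_map.ravel()
    ys, xs = np.indices((h, w))
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)

    dissim = np.abs(vals[:, None] - vals[None, :])
    dist2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    weights = dissim * np.exp(-dist2 / (2 * sigma ** 2))

    # Column-stochastic transition matrix; power iteration finds
    # the equilibrium distribution of the chain.
    trans = weights / (weights.sum(axis=0, keepdims=True) + 1e-12)
    p = np.full(h * w, 1.0 / (h * w))
    for _ in range(n_iter):
        p = trans @ p
    return p.reshape(h, w)

saliency = gbvs_like(np.random.default_rng(3).random((16, 16)))
```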

The methods guided by information, decision and graph theory show better consistency with physiological and psychophysical data than the purely biologically plausible models. They also have a biological basis, such as the centre–surround mechanism [12], redundancy reduction by sparse representation [9, 10, 87, 88] and neuronal connections modelled by graph theory [44]; however, some of the computational rules embedded in these models are computationally expensive, making them unsuitable for real-time applications.

More recently, frequency domain approaches to bottom-up attention have gained popularity owing to their fast computation for real-time applications and their good consistency with psychophysics in most situations. These include the algorithms based on the spectral residual (SR), proposed by Hou and Zhang [45]; the phase spectrum of the (quaternion) Fourier transform (PFT, PQFT), proposed by Guo et al. [90, 91]; and the pulsed cosine transform (PCT), proposed by Yu et al. [92]. In 2010, Bian and Zhang verified biological predictions and comparisons with spatial-domain biological models in their frequency divisive normalization (FDN) model [93]. The amplitude spectrum of quaternion Fourier transform (AQFT) approach and modelling from the compressed bitstream were proposed by Fang et al. [94, 95]. These models open a new path to real-time applications of visual attention in image and video compression, image and video quality assessment, robot vision, object detection and compressive sampling of images.
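Of these, the spectral residual algorithm is compact enough to sketch in full (Python with NumPy/SciPy, in the spirit of [45]; the filter sizes are assumptions):

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def spectral_residual(image):
    """Spectral-residual saliency: the 'residual' is the
    log-amplitude spectrum minus its local average; transforming
    it back with the original phase highlights the parts of the
    image that deviate from the expected spectral statistics."""
    f = np.fft.fft2(image)
    log_amp = np.log(np.abs(f) + 1e-12)
    phase = np.angle(f)
    residual = log_amp - uniform_filter(log_amp, size=3)  # 3x3 mean
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    return gaussian_filter(saliency, sigma=3)             # smooth map

sal = spectral_residual(np.random.default_rng(4).random((128, 128)))
```

Because the whole computation is a pair of FFTs and two small filters, it runs orders of magnitude faster than pyramid-based spatial models, which is what makes this family attractive for real-time use.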

In the past few years, further engineering methods for image segmentation and object detection using the saliency concept have been proposed [96–98]. These computational models target specific computer vision applications without regard to biological facts, and we will discuss them in the next subsection.

Object search in the visual field needs top-down guidance, since pure bottom-up attention models can only provide candidate object regions. However, because top-down attention models need to capture domain knowledge, task information and even the user's requirements or preferences, many different kinds of top-down computational models have been reported. Most available top-down computational models are tied to a specific application such as object detection and recognition, robot navigation or image quality assessment. Top-down models often combine with bottom-up ones to modulate or enhance the saliency map, and often need human cues or control signals from other modules, including object information learned from the environment. An early top-down computational model was proposed by Grossberg et al. [99, 100]: a neural network for visual attention in which both the bottom-up and top-down parts contain weights modified by experience. This approach has been adopted in further neuronal attention models [54, 101, 102]. Tsotsos et al. used local WTA networks and top-down mechanisms to selectively tune neurons at the attended location in their selective tuning (ST) model [72, 83]. Deco and Schurmann proposed a model that modulates the spatial resolution of the image according to a top-down attentional control signal [101]. Sun et al. [103] proposed a computational model of hierarchical object-based attention using 'groupings'; it suggests that attention selection from the coarsest to the finest level interacts with top-down attentive biasing. Other top-down attention models built for specific applications have appeared since the end of the 1990s, such as top-down models based on learning and memory [36, 37, 104] and a model based on image statistics [105]. Some computational models have been implemented on general-purpose computers or dedicated hardware, so they can be applied readily to engineering problems, especially computer vision applications.

1.4.3 Third Phase: Visual Attention Applications

This phase started at the end of the 1990s. The first application was in robot vision. The challenge is that the robot's eye, usually a video camera, receives continuous signals from its environment and often encounters data overflow because of limited memory and processing ability. Visual attention can extract the important information from the mass of visual data coming from the robot's camera, directly addressing this limitation. With many computational attention models developed in the second phase and the challenge of robot vision identified, many studies were directed at robot applications using both bottom-up and top-down attention. A vast number of papers related to this topic have been published in conferences and journals [106, 107], too many to list individually here. Robot applications often involve object detection and recognition, with similar considerations regarding top-down modulation [108, 109]. In 2006, Frintrop proposed in her PhD thesis a model called the visual object detection with computational attention system (VOCUS), a successful implementation for robot vision that was subsequently published as a book [78]. In the same period, Walther et al. relied on bottom-up saliency-based attention alone to accomplish object recognition [110]: the regions of interest on the bottom-up saliency map serve as object candidates that can be further searched and recognized by an effective recognition algorithm proposed by Lowe [111]. Other applications in object detection and recognition were proposed in [96–98, 112–115], and applications in image retrieval were suggested in [116]. It is worth noting that some attention models incorporated into applications may have no clear biological basis, or may only use the concept of visual attention, but they still demonstrate the importance of visual attention in applications.

Applications of attention models in image processing started around 2000. Visual attention combined with the just noticeable difference (JND), another property of human vision, is considered in [117]. In image/video coding standards, the regions of interest in the coded image are deemed worthy of higher bit rates than uninteresting areas, which is consistent with human subjective perception. In the classical encoding standards the assignment of interesting areas is prearranged; a visual attention model makes it straightforward to select these areas automatically (a toy illustration follows). There are similar applications in video coding [79, 118–121]. Since image and video coding require low computational complexity, attention models in the frequency domain may be more effective than spatial domain models [91].
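As a toy illustration of this idea (the block size and the offset rule below are our assumptions, not part of any coding standard), a saliency map can be turned into per-block quantization-parameter offsets, so that salient blocks are quantized more finely and receive more bits:

```python
import numpy as np

def qp_offsets_from_saliency(saliency, block=16, max_offset=6):
    """Toy mapping from a saliency map to per-block quantization-
    parameter offsets: salient blocks get a negative offset (finer
    quantization, more bits), non-salient blocks a positive one."""
    h, w = saliency.shape
    bh, bw = h // block, w // block
    blocks = saliency[:bh * block, :bw * block]
    block_sal = blocks.reshape(bh, block, bw, block).mean(axis=(1, 3))
    # Normalize block saliency, then map high saliency to low QP.
    norm = (block_sal - block_sal.mean()) / (block_sal.std() + 1e-12)
    offsets = np.clip(np.round(-norm * max_offset), -max_offset, max_offset)
    return offsets.astype(int)
```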

Subjective quality assessment of images and video, in which human observers score distorted images, is expensive, tedious and time-consuming. Since attention reflects subjective perception, applications to quality assessment have recently been considered in [122–126]. Other applications of visual attention modelling, apart from the aforementioned aspects, have also been developed, such as image retargeting [127] and compressive sampling [128]. We believe that visual attention models will be used in ever more areas of computer science, image and signal processing, and engineering in general in the near future.
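A minimal sketch of how a saliency map might enter an objective quality metric (an illustrative assumption on our part; the cited works use considerably more elaborate weighting schemes):

```python
import numpy as np

def saliency_weighted_mse(ref, dist, saliency):
    """Toy saliency-weighted distortion measure: errors in salient
    regions are weighted more heavily, mimicking the observation
    that attended regions dominate subjective quality judgements."""
    w = saliency / saliency.sum()          # weights sum to one
    return float((w * (ref - dist) ** 2).sum())
```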
