5.5 Top-down Computation in the Visual Attention System: VOCUS

The visual attention system VOCUS takes its name from 'Visual Object detection with CompUtational attention System', proposed in [16, 17, 57]. The book [57] is rooted in the author's PhD thesis, which introduced a goal-directed search model integrating both data-driven (bottom-up) and task-driven (top-down) features. Since the model is designed for real-time object detection on robots, its top-down computation and its use in object detection are simpler and more practical than those of other top-down models. There are two stages: a training stage and an object search stage. In the training stage, a weight for each extracted feature map is calculated from the data-driven features. In the object search stage, excitation and inhibition biasing is used to create a top-down saliency map, and a global saliency map is obtained by integrating the bottom-up and top-down maps with weights. The relative weights of the bottom-up and top-down contributions simulate the extent to which a human concentrates on the demanded task. Although the model is introduced in detail in [57], as a distinctive top-down computation we still present it in this section.

5.5.1 Bottom-up Features and Bottom-up Saliency Map

The framework of bottom-up feature extraction and bottom-up saliency computation in this model is similar to the BS model described in Chapter 3. However, the bottom-up part contains some variations.

Firstly, the input colour image is converted into the LAB colour space. As in the BS model, several pyramids at different scales are generated by filtering and down-sampling. Although the pyramids are computed in the same way as in the BS model, their content differs slightly. This model includes an intensity pyramid, four colour pyramids (R, G, B and Y) derived from the LAB space by discarding the luminance component, which is already covered by the intensity channel, and an edge pyramid produced by a Laplacian filter.
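To make the channel decomposition concrete, a minimal sketch in Python follows, assuming OpenCV and NumPy are available. The mapping of the LAB chromatic axes onto the R, G, B and Y channels is one plausible reading of the description above, and the function and variable names are ours, not those of [57].

```python
import cv2
import numpy as np

def build_pyramids(bgr, levels=4):
    """Intensity, colour (R, G, B, Y) and edge pyramids, as described above."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    L, a, b = cv2.split(lab)          # OpenCV stores a and b offset by 128
    base = {
        'intensity': L,
        'R': np.maximum(a - 128, 0),  # +a axis of LAB: redness
        'G': np.maximum(128 - a, 0),  # -a axis: greenness
        'Y': np.maximum(b - 128, 0),  # +b axis: yellowness
        'B': np.maximum(128 - b, 0),  # -b axis: blueness
        'edge': cv2.Laplacian(L, cv2.CV_32F),
    }
    pyramids = {name: [m] for name, m in base.items()}
    for _ in range(levels - 1):
        for pyr in pyramids.values():
            pyr.append(cv2.pyrDown(pyr[-1]))
    return pyramids
```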

Secondly, the centre–surround computation is performed at the same scale with two surround sizes, unlike the BS model, where the centre–surround difference is taken between two different scales. Each centre–surround map records the difference between a centre pixel and the average of its surrounding pixels, computed for the two surround sizes at the same scale to give two contrast values. Two centre–surround types are considered in the intensity and colour pyramids: on/off (centre on and surround off) and off/on (centre off and surround on). In the intensity channel, combining the two surround sizes across the different scales yields two feature maps (on/off and off/on). In the colour channel, four colour opponents, R/G (on/off and off/on) and B/Y (on/off and off/on), are computed for the two surround sizes; combining across the different scales and surround sizes yields four feature maps. The edge pyramid is filtered by Gabor orientation filters to obtain scale maps at four orientations (0°, 45°, 90°, 135°); combining these maps across scales in the same manner gives the four feature maps of the orientation channel. Note that no centre–surround processing is applied to the orientation pyramid. In total, ten feature maps are created: two maps (on/off and off/on) for intensity, four maps (on/off and off/on for the opponents R/G and B/Y) for colour and four maps (0°, 45°, 90°, 135°) for orientation.
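The per-scale on/off and off/on contrasts can be sketched as follows, continuing the Python setting above. The surround average is approximated with a box filter, and the two surround sizes (7 and 13) are illustrative placeholders, not the values used in [57].

```python
import cv2
import numpy as np

def center_surround(fmap, surround_sizes=(7, 13)):
    """On/off and off/on contrasts at one scale for two surround sizes."""
    on_off = np.zeros_like(fmap)
    off_on = np.zeros_like(fmap)
    for k in surround_sizes:
        surround = cv2.boxFilter(fmap, -1, (k, k))  # mean over the surround
        diff = fmap - surround
        on_off += np.maximum(diff, 0)    # centre stronger than surround
        off_on += np.maximum(-diff, 0)   # surround stronger than centre
    return on_off, off_on
```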

The ten feature maps are combined into three conspicuity maps, one each for intensity, colour and orientation. Finally, the three conspicuity maps are summed to form the global bottom-up saliency map \(SM_{bu}\).
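The combination step reduces to a normalized sum, sketched below under the assumption that all maps have already been resized to a common resolution. The plain sum is a simplification: [57] also applies a uniqueness weighting to each map before combining, which we omit here.

```python
import numpy as np

def combine(maps):
    """Sum equally sized maps and normalize the result to [0, 1]."""
    total = np.sum(maps, axis=0)
    return total / (total.max() + 1e-9)

# Ten feature maps -> three conspicuity maps -> bottom-up saliency map:
# ci = combine(intensity_maps)
# cc = combine(colour_maps)
# co = combine(orientation_maps)
# SM_bu = combine([ci, cc, co])
```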

5.5.2 Top-down Weights and Top-down Saliency Map

The computation of the top-down saliency map includes two stages: a training stage and a computation stage. In the training stage, bottom-up processing of training images, each containing a specified target, is used to learn the biases (weights) of the target-relevant features. Thirteen maps need biases: the ten feature maps and the three conspicuity maps mentioned above. In the computation stage, the feature maps and conspicuity maps are multiplied by their learned biases. Each weighted map carries both exciting information for the required target and inhibiting information for its background. Finally, the biased feature maps are combined into conspicuity maps, and the conspicuity maps, multiplied by their respective biases, are integrated into a target-dependent top-down saliency map.

5.5.2.1 Processing in the Training Mode

Suppose that a training image with a specified target is available and that a rectangular region of interest (ROI) containing the target is drawn manually by the user on the training image. The size of the rectangle need not match the target exactly. The training image is input to the VOCUS model to compute bottom-up saliency, and the most salient area (MSA) is determined within the rectangle. It is worth noting that any salient area outside the drawn rectangle is disregarded at the training stage. Figure 5.8 shows a picture of a dolphin performance with two dolphins and a woman in the water; the task is to find the woman in the water. The white rectangle in the water is the ROI and the white ellipse inside it is the MSA. Although there are many salient objects outside the ROI, they are not considered as the MSA. The only MSA within the white rectangle is the woman in the water.

Figure 5.8 ROI and MSA in the dolphin performance (image not reproduced)

The weight of each feature map or conspicuity map is based on the MSA and on the background, that is, the whole map under consideration excluding the MSA. For simplicity, the feature maps are denoted \(p_i\), i = 1, 2, ..., 10, for the two intensity features, four colour features and four orientation features, and the conspicuity maps are denoted \(p_i\), i = 11, 12, 13, for intensity, colour and orientation, respectively. The next step is to compute the weights \(w_i\), i = 1, 2, ..., 13, one for each map \(p_i\). The weight \(w_i\) for the map \(p_i\) is the ratio of the mean MSA saliency to the mean background saliency:

(5.27) \( w_i = \dfrac{\bar{m}_{\mathrm{MSA},i}}{\bar{m}_{\mathrm{bg},i}} \)

where \(\bar{m}_{\mathrm{MSA},i}\) denotes the average value of the pixels within the MSA in the map \(p_i\), and \(\bar{m}_{\mathrm{bg},i}\) is the mean value of the pixels in the rest of the map \(p_i\). The weight \(w_i\) indicates the importance of map i for detecting the specified target. In Figure 5.8, the colour of the woman's skin and hat differs from the colour of the water (blue) and of the surroundings, so the colour feature may receive a higher weight than the others. Note that the size of the manually drawn ROI does not affect the result of Equation 5.27; however, if the features found in the MSA also occur in the background, the computed weights will change. It is therefore necessary to choose the training image carefully. For example, a selected training image should satisfy the following: (1) the features of the specified target should overlap with those of the background as little as possible; (2) the specified target is the MSA, and the unique MSA, within the ROI; (3) the specified target is not occluded by other objects, so that it can be extracted completely; and so on. Selecting good training images takes experience. It is of course advisable to choose several training images with different backgrounds for one target and to take the average of their weights as the final weights.
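The weight computation of Equation 5.27 then takes only a few lines, assuming the thirteen maps share a common resolution and the MSA is given as a boolean mask; the names below are ours. For several training images, the resulting weight vectors are simply averaged, as suggested above.

```python
import numpy as np

def learn_weights(maps, msa_mask, eps=1e-9):
    """Eq. (5.27): mean saliency inside the MSA divided by the mean
    saliency over the rest of the map, for each of the 13 maps."""
    weights = []
    for p in maps:                   # 10 feature maps + 3 conspicuity maps
        m_msa = p[msa_mask].mean()   # average value inside the MSA
        m_bg = p[~msa_mask].mean()   # average value over the background
        weights.append(m_msa / (m_bg + eps))
    return np.array(weights)
```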

5.5.2.2 Top-down Saliency Map

The top-down saliency map is produced with the aid of the learned weights, which excite or inhibit the feature maps and conspicuity maps. Let \(E_p\) and \(I_p\) be the excitation map and the inhibition map of a test input image, respectively; the top-down saliency map is a combination of the two. The excitation map is the weighted sum of all maps, feature maps and conspicuity maps alike, whose weights are greater than one:

(5.28) \( E_p = \sum_{i:\, w_i > 1} w_i\, p_i \)

It contributes to the top-down saliency by enhancing the areas related to the task-relevant target. The inhibition map collects the maps whose weights are less than one:

(5.29) \( I_p = \sum_{i:\, w_i < 1} \frac{1}{w_i}\, p_i \)

where a weight less than one means that the feature response in the background is higher than in the target area. Note that maps with weights equal to one are ignored, because they contribute nothing to the top-down saliency map. Finally, the top-down saliency map is obtained as the difference between the excitation map and the inhibition map:

(5.30) \( SM_{top} = E_p - I_p \)

The top-down saliency map highlights the area containing the specified target and suppresses the rest of the background in the test image, which speeds up target detection.
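Equations 5.28 to 5.30 can be sketched together as below. Weighting each inhibitory map by \(1/w_i\) in (5.29) follows our reading of the formulation in [57]; the variable names are ours.

```python
import numpy as np

def top_down_saliency(maps, weights):
    """Eqs. (5.28)-(5.30): excite maps with w_i > 1, inhibit maps with w_i < 1."""
    E = np.zeros_like(maps[0])
    I = np.zeros_like(maps[0])
    for p, w in zip(maps, weights):
        if w > 1:
            E += w * p          # (5.28) excitation map
        elif w < 1:
            I += p / w          # (5.29) inhibition map, inverted weight
    return E - I                # (5.30) top-down saliency map
```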

5.5.3 Global Saliency Map

Suppose that the bottom-up saliency map \(SM_{bu}\) and the top-down saliency map \(SM_{top}\) have been calculated as in Sections 5.5.1 and 5.5.2.2. The global saliency map is the weighted sum of \(SM_{bu}\) and \(SM_{top}\), somewhat like Guided Search 2.0 [6, 58], the psychological model described in Chapter 2:

(5.31) \( SM = \alpha_c\, SM_{top} + (1 - \alpha_c)\, SM_{bu} \)

where the factor \(\alpha_c\) is a real number, \(\alpha_c \in [0, 1]\). When \(\alpha_c = 1\), top-down saliency alone drives the target search, and when \(\alpha_c = 0\), Equation 5.31 degenerates to a pure bottom-up saliency map. In most cases the factor is chosen so that \(0 < \alpha_c < 1\). The global saliency map guides the attention focus; in other words, the most salient location in \(SM\) is the focus of attention and is often the location of the target.
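Equation 5.31 is a one-line blend; in the sketch below the value \(\alpha_c = 0.7\) is an arbitrary illustration of the usual case \(0 < \alpha_c < 1\).

```python
import numpy as np

def global_saliency(sm_bu, sm_top, alpha_c=0.7):
    """Eq. (5.31): weighted blend of top-down and bottom-up saliency."""
    return alpha_c * sm_top + (1.0 - alpha_c) * sm_bu

# The focus of attention is the most salient location in SM:
# y, x = np.unravel_index(np.argmax(sm), sm.shape)
```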

It is very interesting that the factor \(\alpha_c\) can simulate the extent of human concentration while viewing a scene. For instance, when you search for your lost son in a complex scene, you ignore bottom-up salience and engage fully in the most important task; in that case the system sets \(\alpha_c\) near one. When you are simply enjoying the scenery on a tour, bottom-up salience is primary, so \(\alpha_c\) is near zero.

The top-down computation in VOCUS is a very simple one in which both excitatory and inhibitory influences are incorporated into the top-down salience. It can usefully be applied to object search, object recognition, robot navigation and so on. Many experimental results on target search with the VOCUS system under top-down guidance are given in [16, 17, 57], to which interested readers are referred.
