2.3 Guided Search Theory

Another influential idea, the guided search (GS) theory proposed by Wolfe et al., was developed as an extension of the FIT. It remedies deficiencies of the standard FIT and captures a wide range of search behaviours [55–61]. As before, the term ‘search’ refers to scanning the environment in order to find a target in a scene, whether or not the target is actually present. Hence, the GS theory is formulated from the perspective of target search in the input visual field. The main differences between the standard FIT and GS can be stated as follows. (1) In FIT, the two stages, parallel pre-attention (or feature registration) and serial attention (or feature integration), are processed almost separately. Very little information from the parallel pre-attention stage influences the subsequent serial attention stage in conjunction cases. Two of the main results are: 1) searching for a conjunction target is very slow and its RT increases with the distractor set size; 2) target-absent (negative) search requires double the RT of target-present (positive) search. In the standard FIT experiments, the subjects did not know what to look for, so there was no guidance from the parallel pre-attention stage for the serial search stage. In fact, the linearly increasing RT vs. set size function of conjunction search and the 2 : 1 ratio in slopes between negative and positive searches do not hold in some trials. GS theory provides experimental results confirming that the pre-attention stage can transfer information to the attention stage, so that searching for a conjunction target becomes very fast and the RT no longer increases with the distractor set size, since subjects can indeed get some guidance from the parallel search stage, or from the summarization of their experience in the parallel search stage, to find the target.
(2) In the second GS version (GS2), top-down knowledge is added to the model, so bottom-up and top-down activations are summed to form an activation map in which locations with high values are regarded as candidate target regions. The activation map guides the subject to search these candidate regions. It is worth noting that in the conceptual model of FIT, the master map only records the focus location in order to complete the integration of features extracted from the input visual field; it is not concerned with guiding the target search in the serial attention stage. (3) A threshold is set on the activation map in the GS2 model, so that the serial search in the conjunction case can stop when the values at the locations still to be visited fall below the threshold, which avoids an exhaustive search. The threshold aims to simulate the physiological phenomenon of just noticeable difference (JND) in the pre-attention stage when the target is in a noisy or poorly illuminated scene. The differences between FIT and GS are described in more detail in the following subsections.

The GS theory is based on facts that cannot be predicted or explained by the standard FIT. The early guided search theory, called GS1, was proposed by Wolfe et al. [55] and Cave and Wolfe [56]. In the GS1 model, parallel feature extraction from the input visual field is similar to that in FIT, but the parallel process divides the set of all items into distractors and candidate targets, although it cannot determine the location of a conjunction target. Therefore, in the serial process that follows, the search covers only these candidate targets. Afterwards, the early GS1 model was developed into several improved versions [57–61]. The most widely used is GS2 [58]. In GS2, the input is filtered in parallel by broadly tuned channels for each feature map, and the summation of the filters' outputs generates the bottom-up activation in each feature map. Top-down activation is obtained by weighting the feature maps and is combined with the bottom-up activation to create an activation map, that is, a topographic map that combines the bottom-up and top-down activations of all feature maps. The location with the highest value in the activation map is the first attention focus. If a subject does not find the target at the first maximum of the activation map, the eyes move to the second maximum, and so on. This appears to agree better with reality. Versions GS3 and GS4 were proposed later [60, 61]; they further improve GS2 to cover more psychological facts and to capture a wider range of human search behaviours. Since GS2 has been the most influential version for computational models, this section mainly introduces the GS2 model.

2.3.1 Experiments: Parallel Process Guides Serial Search

Many trials showed that the detection of a target defined by a feature conjunction could depart from the predictions of the standard FIT in some situations. For some subjects, a number of conjunction cases produced very small slopes of the linear function (RT vs. distractor set size) [55, 62–65], or even nearly flat slopes like those published for single feature search in the FIT (i.e., RT is independent of set size). Wolfe's group designed experiments to explore the cause, and a modified FIT model was proposed in 1989–1990. Some representative early experiments [55, 56] are summarized below in order to explain the idea.

a. Simple conjunction target search
Simple conjunction target search experiments included conjunctions of colour and form, colour and orientation, and colour and size. Twenty subjects participated in the experiments, and each subject was tested with 260 trials of conjunction search for colour and form (a red ‘O’ among green ‘O’s and red ‘X’s), similar to the work of Treisman and Gelade in 1980 [30]. Very small slopes were obtained by these naive subjects with minimal practice (20 of the 260 trials were for practice). For some (not all) subjects with shallow slopes, the 2 : 1 slope ratio for negative and positive searches did not hold. In the colour and orientation condition (a green horizontal bar among red horizontal and green vertical bars), 22 subjects were tested with different distractor set sizes, with 520–620 test trials per subject. Again, subjects could detect the target more efficiently than predicted by the standard FIT. As the number of distractors increased, the RT increased slowly and followed a non-linear function, which also differs from the linear change predicted by the standard FIT. Tests of a colour and size conjunction (a small red square among small green and large red squares) gave similar results. For some (not all) subjects, RTs for size-colour conjunction searches were independent of the set size.
These tests reflect the fact that the parallel process can deliver information to the serial search. For example, when the target is a red ‘O’ among distractors (green ‘O’s and red ‘X’s), the parallel process can discriminate the colours, and all green items are discarded so that the serial process acts only on the red items. Subjects thus serially search for the conjunction target within a small subset of items rather than the whole display.
b. Search for ‘T's among ‘L's
Another experiment, a conjunction with the same basic features in both target and distractors, was to search for a target letter ‘T’ among distractor letters ‘L’. Two groups of nine and six subjects were tested for 200 and 500 trials respectively. The target and distractors were presented in different set sizes and in any of four rotations (0°, 90°, 180° and 270°). Because both letters consist of a horizontal segment and a vertical segment, the parallel process cannot guide the serial search in this case. The search results were therefore consistent with a serial self-terminating search as in the published work on the FIT, showing a steep linear function for RT vs. set size and the 2 : 1 slope ratio between target-absent and target-present cases. In this situation, the FIT still holds.
c. Effect of practice before trials
In the experiments of Wolfe et al., all subjects were given 20 practice trials before they started the test. However, it was questioned whether the practice helped these subjects to adopt a cleverer search strategy. The tests of conjunction search on colour and form (a red ‘O’ among green ‘O’s and red ‘X’s) and on ‘T’ among ‘L’s were therefore repeated with five naive, unpractised subjects for 200 trials. For most of the five subjects the results did not change over the trials; only one subject changed from a serial, self-terminating search to a more efficient search, and only for the colour and form conjunction.
The results confirmed that for a ‘T' among ‘L's (i.e., a non-conjunctional case) subjects tend to use serial, self-terminating searches, but for conjunctions of colour and form subjects tend to produce different results. That means that the effect of practice before trials seemed to be minor and that the testing methodology of Wolfe et al. did not affect their conclusions.
d. Stimulus salience
In the earlier work of Treisman et al., less striking stimulus colours were used on a white background, whereas Wolfe's group used saturated red and green stimuli on a black background. When Wolfe's group used the same stimuli as Treisman, that is, less salient stimuli on a white background, the slopes of the RT vs. set size function were steeper. This means that stimulus salience plays an important role in the search for conjunction targets. Less salient stimuli might be treated as noise; in such cases subjects use a serial, self-terminating search.
e. Triple conjunctions
If the target was defined by a triple conjunction of colour, form and size, according to the standard FIT, the RT of subjects should be slower or at least equal to double conjunction cases. In the experiments of Wolfe's group, the RT of subjects for the triple conjunction targets appeared to be faster than simple conjunction cases, and also in some trials, the RT was independent of set size. This can be explained by the fact that in a triple conjunction, the extra information from the three different parallel feature maps helps subjects to speed up their search.
All these experiments indicate that the serial search in conjunction cases is sometimes guided by the parallel processes. Although the parallel feature maps cannot locate the position of the target, they can divide the set of stimuli into distractors and candidate targets. In this manner, subjects can search for the conjunction target more efficiently. However, FIT does not consider all the possible cases mentioned in experiments (a) and (e).

2.3.2 Guided Search Model (GS1)

The experiments described in Section 2.3.1 show that the original FIT cannot explain some psychological facts. The modified version was first proposed in [55, 56], considering guidance from the parallel process. There are two different ideas in [55, 56], which are related to the relevant physiological and psychological facts. In terms of physiology, input stimuli projected onto the retina from the visual field are processed in detail only at the fovea, and the information in the periphery is coarsely sampled. Therefore, uninteresting input information is discarded. The related idea of psychology is that the parallel process inhibits non-target locations and discards some distractors' information; the results will guide the serial process in the second stage for the conjunction search [56], that is the parallel process can give some selections to guide the later serial search.

The guided search model GS1 [55] in a simple conjunction case is demonstrated in Figure 2.11. The conjunction target (red vertical bar) is put in several red horizontal bars and several green vertical bars. In the parallel search stage, the locations of input stimuli which have the features of horizontal and green are discarded as distractor information. Each parallel feature map excites the spatial locations related to all the candidate targets; for instance, items in feature map (red colour or vertical orientation) are considered as candidate targets first. The conjunction target will excite two different dimension feature maps (both colour and form), but the distractors only excite a feature map (colour or form). An activation map is the pixel-by-pixel summation of all the feature maps. The spotlight of attention is pointed to the target at the maximal location of the activation map since the location of that target receives double excitation. In Figure 2.11, input stimuli include colour and orientation bars. All the red bar locations and all the vertical locations are excited in colour and orientation maps. Finally, the location of the activation map with the maximum value indicates the location of the target. In that case, subjects can sometimes find the target quickly, and the RT is near to the single feature case.
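The summation described above can be illustrated with a minimal sketch. It assumes a one-dimensional row of five items and binary feature maps (1 where an item carries the target's feature, 0 elsewhere); the item list and the helper name `feature_map` are hypothetical and are not taken from [55].

```python
# Each item is (colour, orientation); the target is the red vertical bar.
items = [
    ("red", "horizontal"),
    ("green", "vertical"),
    ("red", "vertical"),    # conjunction target
    ("green", "vertical"),
    ("red", "horizontal"),
]

def feature_map(items, index, wanted):
    """Return 1 at every location whose feature matches the wanted value, else 0."""
    return [1 if item[index] == wanted else 0 for item in items]

# Parallel stage: one map per feature dimension (assumed binary excitation).
colour_map = feature_map(items, 0, "red")
orientation_map = feature_map(items, 1, "vertical")

# Activation map: location-by-location summation of the feature maps.
activation_map = [c + o for c, o in zip(colour_map, orientation_map)]

# The spotlight of attention goes to the location with the maximal activation.
focus = max(range(len(activation_map)), key=activation_map.__getitem__)
print(activation_map, focus)
```

Only the conjunction target receives excitation from both maps, so its location holds the single maximum of the summed map.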

Figure 2.11 The guided search model for simple conjunction search. Adapted from Wolfe, J.M., Cave, K.R., Franzel, S.L. (1989) ‘Guided search: an alternative to the feature integration model for visual search’ Journal of Experimental Psychology: Human Perception and Performance, 15, page 528, Figure 5

This serial search may be faster than that of the case without guidance of the location map (master map) in the standard FIT (Figure 2.3). In a target-absent case, the serial search may continue at random until every item is checked. Consequently, the 2 : 1 slope ratio for the searching of negative vs. positive does not hold in the experiments above because in a target-present case the RT vs. size function has a small slope. Note that the GS1 model implies top-down instructions. Let us again use the example above. If in the parallel stage a subject first considers the green feature map, there will be no different orientation associated with it, so it will be discarded rapidly and the red feature map is to be considered next. In red feature map the item with unique orientation is detected easily. This model can explain all the facts in the simple conjunction case shown in experiment (a) of Section 2.3.1.

To search for a ‘T’ among ‘L’s, both target and distractors contain the same simple features, vertical and horizontal segments, in the orientation feature map. The same features are seen at all locations in the parallel stage, so the feature map is unable to guide the serial search stage. Subjects have to use a serial self-terminating search to find the target. The results are the same as those of the standard FIT; that is, the result of experiment (b) in Section 2.3.1 accords with the conclusion of the standard FIT.

In a noisy or less salient stimulus case, many positions in feature maps will be excited because of noise influence, and the resultant searches will be serial, as mentioned in Section 2.3.1 (experiments (d)). In that case, the first maximum value on the activation map may not be the target location. Eyes need to search for targets serially according to the order of excited values on the activation map.

In a triple conjunction case, the target shares one or two features with each distractor, and the target excites three different feature maps simultaneously. This should lead to a more efficient search, especially when the target shares just one feature with each of three different kinds of distractor, because a very large value on the activation map is then obtained at the target location. Experiment (e) of Section 2.3.1 can thus also be explained by the model in Figure 2.11.

In summary, the GS1 model can explain facts that differ from the standard FIT as well as facts that agree with it. It highlights the important idea that the serial search for conjunctions can be guided by the parallel processes if top-down information is implied.

2.3.3 Revised Guided Search Model (GS2)

Although GS1 can explain more psychological facts than the FIT, not all of the new data or phenomena are completely consistent with GS1. Hence, the model needs adaptation or modifications to make it coincide with even more psychological facts that are relevant to human visual attention. For this reason, a revised version of GS1 was proposed in 1994, called GS2 [58]. In GS2, each feature map of Figure 2.11 is separated into categorical channels by several filters. Top-down and bottom-up activations are combined to an activation map by weighted summation. The activation of bottom-up information is sensitive to the difference and distance between each local item and its neighbours. A threshold is explicitly added in GS2, which can stop the serial search when the excited value at the location of a hill point is less than the threshold on the activation map. As mentioned at the beginning of this Section 2.3, the threshold aims to simulate just noticeable difference in a noisy or poorly illuminated scene. In addition, the inhibition between the same items in a feature map is considered in GS2; for example, in feature search for the target (a red bar in many green bars), the non-targets (green bars) are inhibited by each other, so the corresponding values in the green feature map are lower. However, the unique red target in the red feature map appears with a highly excited value because there are no inhibited signals from its neighbours, which will cause the red target to stand out more.

GS2 is the most widely used model and is readily accepted by researchers in the relevant engineering areas, even in comparison with the further modified versions, GS3 and GS4, which were developed later. It is therefore essential to describe GS2 in detail. Figure 2.12 shows the architecture of the GS2 model.

Figure 2.12 The architecture of GS2 [58]. With kind permission from Springer Science+Business Media: Psychonomic Bulletin & Review, ‘Guided Search 2.0 A revised model of visual search,’ 1, no. 2, ©1994, 202–238, Jeremy M. Wolfe

The model includes several feature dimensions (or feature maps), several separate filters in each feature map, both bottom-up activation and top-down activation for each feature map, and a final activation map. The bottom-up process in each feature map depends on several separate filtered channels. The top-down knowledge drives the top-down activation of a feature map and guides attention to a desired target.

According to the final activation map, the information located in high activations is directed to the limited resources in the brain for further processing (in the post-attention stage).

a. Bottom-up process
Input stimuli are first processed in parallel by several feature maps (colour, orientation, size and so on) like the feature dimensions in the first stage of the FIT model or GS1 model, but there are some differences as follows.
The first difference from FIT is that the input of each feature map consists of filtered information from several broadly tuned ‘categorical’ channels. Figure 2.12 shows only the channels of the colour and orientation dimensions. In reality, there may be many basic features, such as size, depth and so on, each with its related channels. These tuned channels act like simple cells in the primary visual cortex, responding to a preferred feature such as a specific angle in the orientation feature map or red, green and so on in the colour feature map.
The orientation feature is divided into several channels. For simplicity, five-channel filters are defined as steep, right, left, shallow right and shallow left, in the range of −90° to +90°; for instance, 0° is vertical or steep, positive angles are tilted to the right of vertical, and negative angles are tilted to the left of vertical. Each channel is arranged in a wide range around the centre angle. The outputs of the orientation channels are defined as

(2.1) equation

where θ is the angle of the stimulus in the orientation feature map. For an input bar with arbitrary orientation, at least one channel is tuned and outputs a corresponding value, and in most cases two channels are activated, unless the orientation of the bar is exactly 0°, ±45° or ±90°. Note that the choice of five filters is not fixed; the number of channels for each feature can be chosen freely.
The colour feature is likewise divided into several channels. The ordering of these channels in Figure 2.12, from left to right, follows the spectrum; for example, red, with the longest wavelength, is at the far left and blue, with the shortest wavelength, is at the far right.
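The behaviour of the five orientation channels can be sketched in code. The channel formulas below are an assumption, not a transcription of Equation 2.1: they are chosen to agree with the statements in this section that a 20° bar responds with cos40° in the steep channel and sin40° in the right channel, and that only one channel is active at exactly 0°, ±45° or ±90°. The function names are hypothetical.

```python
import math

def orientation_channels(theta_deg):
    """Assumed responses of five broadly tuned channels for an angle in [-90, 90]."""
    # Rounding removes floating-point residue such as sin(180 deg) = 1.2e-16.
    c = round(math.cos(math.radians(2 * theta_deg)), 12)
    s = round(math.sin(math.radians(2 * theta_deg)), 12)
    return {
        "steep": max(c, 0.0),                                     # peaks at 0 deg
        "right": max(s, 0.0),                                     # peaks at +45 deg
        "left": max(-s, 0.0),                                     # peaks at -45 deg
        "shallow right": max(-c, 0.0) if theta_deg > 0 else 0.0,  # peaks at +90 deg
        "shallow left": max(-c, 0.0) if theta_deg < 0 else 0.0,   # peaks at -90 deg
    }

def active(theta_deg):
    """Names of the channels with a non-zero response."""
    responses = orientation_channels(theta_deg)
    return sorted(name for name, value in responses.items() if value > 0)

for angle in (0, 20, 45, -70, 90):
    print(angle, active(angle))
```

Under these assumed formulas a bar at 20° activates the steep and right channels, whereas bars at 0°, 45° and 90° each activate a single channel.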
The second difference of GS2 from FIT, also from GS1, in the parallel stage is that the process of each bottom-up activation is local (e.g., 8 × 8 or 5 × 5 windows) in each locus for the tuned channels on colour, orientation or other feature maps, after filtering by these tuned filters. The strength of bottom-up activation for one location depends on the differences between the values of the location and its neighbours on the output of each broadly tuned channel, and then the difference is multiplied by the response of the tuned filter. A threshold called the pre-attentive just noticeable difference (PJND) is set for the differences, so that small differences in the output of colour or orientation channels do not contribute to bottom-up attention.
The distant neighbours have a weaker effect than those nearby for the strength of activation. In general, the neighbourhood weighting function can be Gaussian or a linear descending function. The outside of the window does not influence the bottom-up activation of the locus. The resulting response of bottom-up activation for each feature and each location is averaged for all neighbouring responses, with a ceiling of 200 (arbitrary) units of activation.
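The local bottom-up computation described above can be sketched as follows. This is a simplified illustration rather than the GS2 implementation: the window radius, the PJND value, the linear descending weight and the `gain` that converts responses into activation units are all assumed parameters, and only the ceiling of 200 units comes from the text.

```python
def bottom_up_activation(channel, radius=2, pjnd=0.1, gain=400.0, ceiling=200.0):
    """Local bottom-up activation for one channel output (a 2-D list of responses)."""
    rows, cols = len(channel), len(channel[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total, weight_sum = 0.0, 0.0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if (dr == 0 and dc == 0) or not (0 <= rr < rows and 0 <= cc < cols):
                        continue  # skip the locus itself and anything outside the map
                    dist = max(abs(dr), abs(dc))
                    weight = 1.0 - (dist - 1) / radius  # nearer neighbours count more
                    diff = abs(channel[r][c] - channel[rr][cc])
                    if diff < pjnd:
                        diff = 0.0  # below the pre-attentive just noticeable difference
                    # The difference is multiplied by the response of the tuned filter.
                    total += weight * diff * channel[r][c]
                    weight_sum += weight
            if weight_sum > 0:
                # Weighted average over the neighbours, with a ceiling on the result.
                out[r][c] = min(ceiling, gain * total / weight_sum)
    return out

# A single item responding in this channel, surrounded by items that do not.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 0.643
act = bottom_up_activation(grid)
print(round(act[2][2], 1))
```

With these assumed parameters the unique item receives a high activation, while items with no response in the channel receive none, because their differences are multiplied by a zero filter response.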
b. Top-down process
Top-down or user-driven activation guides attention to a desired item that cannot pop out through bottom-up activation alone. In GS2, top-down activation is performed by selecting the output of one broadly tuned channel per feature map. The channel that best differentiates the target from the distractors is given the largest weight. For instance, if the desired target is a red segment tilted 10° to the right of vertical among many segments of other colours and orientations, larger weights are given to the red channel of the colour feature map and the steep channel of the orientation feature map. Consequently, the target can pop out in the top-down activation. Note that the weighting in top-down activation depends on the difference between target and distractors in the bottom-up channels, not on the strength of the channels. For instance, consider searching for a 20° tilted segment among many vertical segments, with five broadly tuned channels in the orientation feature map. According to Equation 2.1, the 20° tilted segment produces a larger response in the steep channel than in the right channel (cos40° > sin40°), but since the vertical distractors do not contribute to the right channel, the channel that best distinguishes target from distractors is the right channel. A larger weight should therefore be assigned to the right channel, because the target is unique in that channel. For channels that are irrelevant to the desired task, the weights are reduced to weaken their contribution to top-down activation.
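The channel selection in the 20° example can be sketched numerically. The sketch keeps only the steep and right channels, with the responses cos 2θ and sin 2θ implied by the comparison cos40° > sin40°. Scoring each channel by the target response minus the mean distractor response is an assumed selection rule, and `best_channel` is a hypothetical helper name.

```python
import math

def responses(theta_deg):
    """Steep and right channel responses (two of the five channels, for brevity)."""
    t = math.radians(2 * theta_deg)
    return {"steep": max(math.cos(t), 0.0), "right": max(math.sin(t), 0.0)}

def best_channel(target_deg, distractor_degs):
    """Pick the channel in which the target stands out most from the distractors."""
    target = responses(target_deg)
    scores = {}
    for name in target:
        mean_distractor = sum(responses(d)[name] for d in distractor_degs) / len(distractor_degs)
        # Assumed rule: the weight follows the difference, not the response strength.
        scores[name] = target[name] - mean_distractor
    return max(scores, key=scores.get), scores

channel, scores = best_channel(20, [0] * 10)
print(channel, {name: round(value, 3) for name, value in scores.items()})
```

Although the target responds more strongly in the steep channel, the vertical distractors respond even more strongly there, so the selection falls on the right channel.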
c. Final activation map
Both bottom-up and top-down activations are finally summed to create an activation map, as represented in Figure 2.12. The final activation map is a topographic map with a number of hills at locations where the bottom-up and top-down activations are high. The topographic map directs attention from the highest locus to the next highest locus and so on, until the target is found. That is to say, the subject performs a serial search. If the highest locus happens to be the position of the target, no further search time is needed. The search is then nearly parallel, because the subject finds the target in the first step of the serial search regardless of the number of distractors, as in single feature cases. In a target-absent case, the hills of the activation map are searched serially in order of strength, so the RT of subjects is very long in both single feature and conjunction cases. The threshold in GS2 prevents the search from continuing at random through every item, since the search stops once the remaining hills of the activation map are below the defined threshold.
Apart from explaining single feature and conjunction experiments in target-present and target-absent cases, GS2 can explain more psychological phenomena and allows for computational modelling in a more biologically plausible manner, especially for the bottom-up process (illustrated next). Recall the difficult searches shown in Figure 2.8, where the target and distractors are similar or the distractors are heterogeneous. FIT and GS1 cannot clearly interpret these hard searches, but GS2 can, because local computation is introduced in the bottom-up process. As the similarity between the target and distractors increases, the difference between the target and its neighbours decreases, which reduces the target activation and makes the search hard. When the distractors are heterogeneous, the local differences between distractors activate many bottom-up channels at many positions, producing many false high hills in the final activation map; this explains why such searches are hard.
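The serial visiting order and the stopping threshold can be sketched as a short routine. The activation values, the threshold of 50 and the function name `guided_serial_search` are illustrative assumptions; the number of visited loci stands in for the reaction time.

```python
def guided_serial_search(activation, is_target, threshold):
    """Visit loci in decreasing order of activation; stop at the target or the threshold."""
    order = sorted(range(len(activation)), key=lambda i: activation[i], reverse=True)
    visited = []
    for locus in order:
        if activation[locus] < threshold:
            break  # the remaining hills are too weak to be worth visiting
        visited.append(locus)
        if is_target(locus):
            return locus, visited
    return None, visited  # target judged absent

activation = [30, 120, 80, 10, 95]  # hypothetical hill heights

# Target present at locus 4: found on the second visit.
found, visits = guided_serial_search(activation, lambda i: i == 4, threshold=50)
print(found, visits)

# Target absent: only the hills above the threshold are checked.
found_none, visits_none = guided_serial_search(activation, lambda i: False, threshold=50)
print(found_none, visits_none)
```

In the target-absent case the search ends after three of the five loci, which is the sense in which the threshold avoids checking every item.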
Search asymmetries are another kind of psychological phenomenon. That is, a search for an item ‘x' among ‘y's does not yield the same results as a search for a ‘y' among ‘x's [59]. There are a lot of examples for search asymmetries. Search for a 20° tilted bar among vertical bars is easier than search of a vertical bar among many 20° tilted bars. Another known example was given in Figure 2.7: searching a plus sign among vertical segments is easier than searching a vertical segment among plus signs. The broadly tuned channels of a feature map in GS2 support the asymmetrical search facts. The following explanations of the two instances above are all related to the orientation characteristics; for convenience, in the next paragraph we only consider the orientation map in Figure 2.12, but the conclusion can be easily extended to other feature maps.
In the first instance, the target, a 20° tilted bar, produces responses in both the right channel and the steep channel, whereas the distractors, vertical bars (at 0°), produce responses only in the steep channel and none in the right channel. The parallel search in GS2 depends on the large difference between the target location and its neighbouring region in the right channel, so the unique target is found quickly in the right channel. Conversely, when the target is a vertical bar, the many 20° bars (distractors) produce signals in both the steep and right channels. The search is then difficult, since the difference between the signals of the 20° and 0° bars in the steep channel is small.
In the second instance, Figure 2.7(a), the vertical line segment (0°) of the plus sign (target) and all the distractors (vertical line segments) produce responses in the steep channel, so there is little difference between the responses of target and distractors in that channel. However, since the horizontal line segment of the plus sign (target) is unique in the shallow channel, the largest difference between target and distractors is found in the shallow channel, and this enables subjects to find the target easily. By contrast, in Figure 2.7(b), the target is a vertical line segment that excites the steep channel, and the distractors (the plus signs) excite both the steep and shallow channels. In no channel does the target stand out with the largest difference from its neighbouring region, and this accounts for the search difficulty.
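The asymmetry in the first instance can be checked with a rough numerical sketch. It assumes the steep and right responses cos 2θ and sin 2θ (consistent with the cos40° and sin40° values quoted earlier), homogeneous distractors, and the bottom-up rule that the target-distractor difference is multiplied by the target's own channel response. The helper names are hypothetical.

```python
import math

def channels(theta_deg):
    """Assumed steep and right channel responses for an angle in degrees."""
    t = math.radians(2 * theta_deg)
    return {"steep": max(math.cos(t), 0.0), "right": max(math.sin(t), 0.0)}

def target_activation(target_deg, distractor_deg):
    """Per-channel activation at the target: |difference| times the target's response."""
    target, distractor = channels(target_deg), channels(distractor_deg)
    return {name: abs(target[name] - distractor[name]) * target[name] for name in target}

tilted_among_vertical = target_activation(20, 0)
vertical_among_tilted = target_activation(0, 20)

print({k: round(v, 3) for k, v in tilted_among_vertical.items()})
print({k: round(v, 3) for k, v in vertical_among_tilted.items()})
```

Under these assumptions the tilted target obtains its largest activation in the right channel, and that value exceeds anything the vertical target obtains in either channel, which matches the direction of the asymmetry described above.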

2.3.4 Other Modified Versions: (GS3, GS4)

GS1 suggests that the parallel process can guide the attention in the serial stage and this explains the efficient search in conjunction cases; GS2 separates parallel feature maps to several broadly tuned ‘categorical' channels and considers the activation map as a combination of bottom-up and top-down information to guide the target search. In other words, GS2 extends the search guided model GS1 and can account for more laboratory findings.

In the real world, target search involves eye movement. The fovea of the retina samples input stimuli with much greater detail. Searching for an object in the visual field, subjects often make the fovea move towards regions of interest, and the fovea is often moved to the central coordinate of visual field. GS3 [60] incorporates eye movement and eccentricity effects in GS2. In GS3, the activation map is a winner-take-all neural network. All the units on the activation map compete with each other. The unit located at the maximum hill in the activation map will win the competition, and this results in the attention focus being fixated by the eyes. Attention focus on the activation map acts like a gate, and at a given time only one object can be conveyed for higher level processes and recognition that corresponds to the post-attention stage. If the selected object is not the desired target, a feedback signal from higher level of the brain will return to the activation map and inhibit the winner unit. In succession, the competition continues and creates a new winner as a new fixation point that the eyes' fovea is pointing towards. The inhibition of return is necessary in the eye movement and attention deployment in order to find the next attention focus. In GS3, the eye movement is represented by a trace of serial searches by an eye saccade map. Obviously, GS3 can cover more examples and phenomena in the laboratory and the natural environment. However, in GS3 the functions of eye movement, attention gate, object recognition, saccade generation and so on, are represented as blocks that cannot be described in detail. Thus, these functions are hard to implement in engineering.

GS4 is one recent version proposed in 2007 [61]. It describes a parallel-serial hybrid model. The top-down guidance is based on the match between a stimulus and the desired properties of a target. If the processing from input to activation map is a path from input to output, in GS4 an added path from input to output is considered to deploy the attention in addition to the path already described in GS1–3. Hence, the version can not only model simple search tasks in the laboratory, but can also capture a wide range of human search behaviours.

In GS4, one object or a group of objects can be selected and passed through a bottleneck at a time, and the parallel and serial stages are combined to form the attention that controls the bottleneck. Most parts of GS1–GS3 are incorporated in GS4. It is known from the FIT that attention can bind features produced in the parallel process to represent the early vision of an object. Without attention, people cannot correctly glue more than one feature into a conjunction item. Another view is that the limited resources of the brain cannot process all the information from early vision. The attention mechanism selects the input of interest, and then controls and deploys the limited resources to process it effectively. Based on these two considerations, GS4 suggests two parallel paths: one path is like that in GS1–GS3 and the early FIT models, in which input stimuli are processed by parallel channels and then selected via the attention mechanism; the other path analyses the statistical characteristics of the input image, based on [66, 67]. The deployment of selection in the latter path can be guided by statistical properties extracted from the input scene, and this is another capability of GS4. The two paths work independently and in parallel, but their outputs are finally fed simultaneously into another bottleneck (the second bottleneck) to make the final decision. The inhibition of return comes from the output of the second bottleneck. GS4 is a more complicated system that includes many parameter controls and timing considerations, so it is not discussed here in detail. The interested reader is referred to [61] for GS4, and should be able to understand the model after the introduction of the various GS models in this section. As with GS3, there are no detailed descriptions of the added path in [61].

The GS model can be developed into even more advanced versions with the progress of psychology. New processing modules can be added in the GS model. For many scientists and engineers, GS2 is sufficient for developing meaningful, effective and efficient computational models in many engineering applications.
