3.8 Saliency Based on Bayesian Surprise

Surprising events and targets occur frequently in daily life, and they tend to attract human attention and leave a deep impression in the brain. In films and television dramas, for instance, the audience pays more attention to an unexpected outcome after long suspense, or fixes its eyes on surprising locations and scenes. Surprise arises from stimuli or events that violate expectation, so it connotes uncertainty: if everything were deterministic there would be no surprise in the world, and its highlights and delights would be lost. Surprise is also subjective; for the same visual data, different individuals experience different degrees of surprise. Surprise often represents novelty in the temporal or spatial scope, which relates it to saliency: experiments on many video clips, covering thousands of frames, showed that about 72% of gaze shifts were directed towards locations that were more surprising than average [10, 54].

But how do you quantify surprise? Itti and Baldi proposed a Bayesian definition of surprise in [10, 54, 55], called Bayesian surprise theory. In this theory, a prior probability density is assigned over the space of possible models, and when new stimuli arrive the density over models is updated by the Bayesian rule. Surprise is then measured by the KL divergence between the prior and the posterior probability densities; this KL distance or divergence was mentioned in Sections 2.6.2 and 3.6.1. In Bayesian surprise theory, a greater change of probability density after the arrival of new stimuli produces a larger KL divergence, which means a larger degree of surprise. Conversely, if the new data do not influence the prior density at all, the KL divergence is zero and there is no surprise. Because the prior and posterior densities evolve over time as data arrive, surprise theory is well suited to applications in video saliency.

3.8.1 Bayesian Surprise

Suppose the prior beliefs of an observer are described by a probability density {p(M)} over the models M in a model space M; here the italic M denotes one of the hypotheses or models, and the upright M denotes the whole model space. Given this prior probability density of beliefs, the effect of new data D (or X if the data lie in a feature space) is to change the prior density {p(M)} into a posterior density by the Bayesian rule, written as a rewritten form of Equation 2.8 (i.e., X of Equation 2.8 is replaced by D):

(3.61) p(M|D) = [ p(D|M) p(M) ] / p(D)

The surprise degree is quantified by the distance or KL divergence between the posterior and prior probability densities:

(3.62) S(D, M) = KL( p(M|D), p(M) ) = ∫_M p(M|D) log [ p(M|D) / p(M) ] dM

Equation 3.62 carries two meanings. (1) The posterior probability density p(M|D) is updated over time through the Bayesian rule, because the surprise degree considers only the effect of the current data in changing the observer's existing model. An example is given in [54]: while one is watching TV, snowflake noise suddenly appears on the screen owing to an electronic malfunction or a communication failure. The posterior probability density under the snowflake frames is very dissimilar to the prior density built up from ordinary content (news, advertisements, dramas etc.), and this results in large surprise. As the random snowflake interference continues, however, the probability density comes to favour a snowflake model, which reduces the observer's surprise. In fact, the long and bothersome random interference no longer attracts attention, even though every snowflake frame differs from every other (the difference between two successive frames is not zero). (2) The surprise measurement (Equation 3.62) integrates over the model space M, rather than over the data space as in other models (Equation 3.41).

The model space includes all the hypotheses about an event. A lively and simple example in [54] illustrates the computation of surprise over the model space. When a new television channel starts its programmes, there is uncertainty about its content. Parents may hold two hypotheses (models) over the model space M: one, denoted M1, is that the channel is suitable for children; the other, denoted M2, is the opposite (i.e., the channel may contain some scenes of nudity). Assume the two models have equal prior probability, p(M1) = p(M2) = 0.5, and that two possible observations are defined: data D1 containing some nudity and data D2 without nudity. Suppose the observer first watches several frames of the new channel that contain data D1. By Bayes' theorem the posterior probabilities of the two models given D1 are p(M1|D1) and p(M2|D1), and the surprises for the two models are described as

S(D1, M1) = log2 [ p(M1|D1) / p(M1) ],   S(D1, M2) = log2 [ p(M2|D1) / p(M2) ]

The total surprise experienced by the observer is the posterior-weighted average of these values over the model family M = {M1, M2}:

(3.63) Sp(D1) = p(M1|D1) S(D1, M1) + p(M2|D1) S(D1, M2) = Σ_{i=1,2} p(Mi|D1) log2 [ p(Mi|D1) / p(Mi) ]

Equation 3.63 extends directly to model families with more than two hypotheses. Since Sp is taken to be directly proportional to video saliency, it provides a new way of estimating saliency.
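The channel example can be worked through numerically; a minimal sketch, in which the likelihood values are illustrative assumptions rather than figures from [54]:

```python
import math

def posterior(prior, likelihood):
    """Bayes' rule (Equation 3.61) over a discrete model space."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    z = sum(joint)
    return [j / z for j in joint]

def total_surprise(prior, post):
    """Equation 3.63: posterior-weighted average of the per-model surprises
    log2[p(Mi|D)/p(Mi)] -- the discrete KL divergence, in 'wows'."""
    return sum(q * math.log2(q / p) for q, p in zip(post, prior) if q > 0)

prior = [0.5, 0.5]       # p(M1) = p(M2) = 0.5: child-safe versus not
lik_D1 = [0.1, 0.9]      # assumed likelihoods p(D1|M1), p(D1|M2) for a frame with nudity
post = posterior(prior, lik_D1)
print(post)              # [0.1, 0.9]
print(total_surprise(prior, post))

# Repeating the same observation (the 'snowflake' effect of Section 3.8.1):
belief = prior
for _ in range(3):
    new = posterior(belief, lik_D1)
    print(round(total_surprise(belief, new), 3))   # 0.531, 0.096, 0.011: surprise decays
    belief = new         # the posterior becomes the next frame's prior
```

The final loop shows the adaptation effect: once the posterior is carried forward as the next prior, repeated identical evidence changes the beliefs less and less, and the surprise shrinks towards zero.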

It is known that the unit of entropy is the bit when a base-2 logarithm is used. The KL divergence is a relative entropy; in surprise theory, the unit of Equation 3.62 is named the 'wow' [10, 54, 55] when the base-2 logarithm is used.

3.8.2 Saliency Computation Based on Surprise Theory

Under the surprise theoretical framework, bottom-up saliency computation has four steps: low-level feature extraction as in the BS model, augmented with motion and temporal flicker channels; temporal surprise computation over the model family at each location of each feature map; spatial surprise computation; and a combination of temporal and spatial surprise into the final saliency map [55]. Figure 3.17 gives the block diagram of saliency computation based on surprise.

Figure 3.17 The block diagram of saliency computation based on surprise [55]. © 2005 IEEE. Reprinted, with permission, from L. Itti, P. Baldi, ‘A principled approach to detecting surprising events in video’, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2005

1. Low-level features extraction
This step is similar to the BS model with motion feature detection, as mentioned in Section 3.2. The input video is decomposed into five channels: intensity, colour opponency (red/green and blue/yellow), orientation (four angles), motion (four directions) and flicker. Centre–surround differences between scales then produce 72 feature maps from the five channels: 6 intensity contrast maps, 12 colour maps, 24 orientation contrast maps, 6 temporal onset/offset (flicker) maps and 24 motion maps. Low-level feature extraction is shown in the top-left part of Figure 3.17.
2. Local temporal surprise computation
The local temporal surprise at a given location depends on successive frames: surprise is estimated over time as the KL divergence between the prior probability density accumulated from previous frames and the posterior probability density obtained by the Bayesian rule.
Firstly, the 72 raw feature maps, taken without competition or normalization, are resized to a 40 × 30 lattice for input images of 640 × 480 pixels in [55]. Each location of each feature map is assigned a local surprise detector that, like a biological neuron, learns a probability density from the input data. The probability family (model space) is taken to be a unimodal distribution. In [55] five models with different time scales are used at each location, so the attention model contains a total of 72 × (40 × 30) × 5 = 432,000 surprise detectors (neurons).
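The bookkeeping above can be checked directly:

```python
# Centre-surround feature maps per channel, as enumerated in step 1:
maps = {"intensity": 6, "colour": 12, "orientation": 24, "flicker": 6, "motion": 24}
total_maps = sum(maps.values())
print(total_maps)                          # 72 raw feature maps

# One surprise detector per map location per timescale model [55]:
lattice = 40 * 30                          # maps resized from 640 x 480 input
timescales = 5
print(total_maps * lattice * timescales)   # 432000 detectors
```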
Consider a neuron at a given location in one of the 72 feature maps (e.g., the kth map), receiving as input from the low-level feature spiking signals with a Poisson distribution. The temporal surprise can be calculated independently in that map and at that location by using a family of models M(λ), each satisfying a one-dimensional Poisson distribution over the possible firing rates λ > 0. Note that in the following equations we rename the input data D of Equations 3.61–3.63 as x, to distinguish data in the feature maps of a video frame. The parameter λ of the Poisson distribution is estimated from each feature map over the duration of a frame (each video frame lasts 33 ms in [55]). However, precisely fitting the probability density function to the data samples at each location of each feature map is not crucial, since the statistics of the data themselves do not determine the surprise measurement. The main issue, as already mentioned, is the change in the probability density over models, described by the prior density p(M). In [55] the posterior and prior probability densities are assumed to belong to the same functional family, so that the posterior density for one frame can serve directly as the prior density for the next frame in Bayesian estimation. It is known that if p(M) and p(M|xk) are to have the same form when the data xk are Poisson distributed, then p(M(λ)) must satisfy the conjugate gamma probability density:

(3.64) p(M(λ)) = γ(λ; α, β) = [ β^α λ^(α−1) e^(−βλ) ] / Γ(α)

where the parameters satisfy α, β > 0 and Γ(·) is the gamma function. In order to detect surprise at different timescales, several cascaded surprise detectors are implemented at each pixel of each feature map. Given an observation xk = λc at one of the detectors of feature k, the posterior probability density γ(λ; α′, β′) obtained by the Bayesian rule is also a gamma density, with α′ = ζα + λc and β′ = ζβ + 1, where ζ < 1 is a decay factor [55]. The local temporal surprise [54, 55] can then be calculated by

(3.65) S_T = KL( γ(λ; α′, β′), γ(λ; α, β) ) = α log (β′/β) + log [ Γ(α)/Γ(α′) ] + α′ (β − β′)/β′ + (α′ − α) ψ(α′)

where ψ(·) is the digamma function. The local temporal surprise can thus be updated by Equation 3.65 for any detector at any location of a feature map. An illustration of local temporal surprise for the flicker feature map is shown in the square block with round corners in Figure 3.17.
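A sketch of a single temporal surprise detector, assuming the conjugate update α′ = ζα + λc, β′ = ζβ + 1 of [55]; the prior parameters, the decay value 0.7 and the sample rates below are illustrative assumptions:

```python
import math

def digamma(x):
    """psi(x): upward recurrence into x >= 6, then an asymptotic series."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def kl_gamma(a_post, b_post, a_pri, b_pri):
    """Equation 3.65: KL(gamma(a',b'), gamma(a,b)), returned in base-2 'wows'."""
    nats = (a_pri * math.log(b_post / b_pri)
            + math.lgamma(a_pri) - math.lgamma(a_post)
            + a_post * (b_pri - b_post) / b_post
            + (a_post - a_pri) * digamma(a_post))
    return nats / math.log(2.0)

def update(a, b, x, zeta=0.7):
    """Conjugate gamma update for one Poisson sample x; zeta < 1 is the decay
    factor of [55] (its value 0.7 is an assumption here)."""
    return zeta * a + x, zeta * b + 1.0

# A detector whose prior gamma(3, 2) expects a firing rate of 3/2 = 1.5:
a, b = 3.0, 2.0
a2, b2 = update(a, b, x=12.0)          # strong, unexpected response
print(kl_gamma(a2, b2, a, b))          # large temporal surprise
a3, b3 = update(a, b, x=1.5)           # response matching the expectation
print(kl_gamma(a3, b3, a, b))          # near-zero surprise
```

The unexpected sample moves the posterior far from the prior and yields a large surprise, while a sample at the expected rate barely changes the density, so its surprise is close to zero.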
3. Spatial surprise computation
For the detector at each location (i, j) of each feature map at time t, a gamma distribution is fitted over its spatial neighbourhood. When new data arrive, the spatial surprise S_S is the KL divergence between the prior neighbourhood probability density and the posterior density after updating with local samples from the neighbourhood's centre. Figure 3.17 (the square block with round corners) also shows the spatial surprise for the flicker feature map.
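One way to sketch this step numerically; the moment-matched neighbourhood prior, the uniform pooling and the plain conjugate update are simplifying assumptions here ([55] constructs the neighbourhood distribution in more detail), and the KL is integrated numerically rather than in closed form:

```python
import math

def gamma_logpdf(x, a, b):
    """Log of the gamma density of Equation 3.64, computed in log-space."""
    return a * math.log(b) + (a - 1) * math.log(x) - b * x - math.lgamma(a)

def kl_numeric(a1, b1, a2, b2, n=4000, hi=50.0):
    """KL(gamma(a1,b1), gamma(a2,b2)) in 'wows', by trapezoidal integration."""
    h = hi / n
    s = 0.0
    for i in range(1, n):
        x = i * h
        lp = gamma_logpdf(x, a1, b1)
        lq = gamma_logpdf(x, a2, b2)
        if lp > -700.0 and lq > -700.0:        # skip negligible tails
            s += math.exp(lp) * (lp - lq)
    return s * h / math.log(2.0)

def spatial_surprise(neigh_rates, centre_rate):
    """Moment-match a gamma 'neighbourhood prior' to surrounding firing rates,
    update it with the centre sample, and take the KL divergence."""
    m = sum(neigh_rates) / len(neigh_rates)
    v = max(sum((r - m) ** 2 for r in neigh_rates) / len(neigh_rates), 1e-6)
    b = m / v
    a = m * b                                  # prior gamma matched to the neighbourhood
    return kl_numeric(a + centre_rate, b + 1.0, a, b)

# A centre matching its neighbourhood is unsurprising; an outlier pops out:
print(spatial_surprise([2.0, 2.2, 1.8, 2.1], 2.0))
print(spatial_surprise([2.0, 2.2, 1.8, 2.1], 15.0))
```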
4. Combination of temporal and spatial surprise
A non-linear combination of the temporal surprise S_T and the spatial surprise S_S at each location of each feature map is proposed in [55]; its form was chosen to fit biological data from V1 complex cells of the macaque. The empirical equation is

(3.66) S = ( S_T + S_S / 20 )^(1/3)

Finally, the surprise at a location is the sum over the five low-level features, that is, the surprises arising from colour, intensity, motion, orientation and flicker. The saliency map is obtained by passing the summed surprise through a saturating sigmoid function. The whole procedure is depicted in Figure 3.17.
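The last two steps can be sketched end to end; the per-feature surprise values below and the choice of tanh (with unit gain) as the saturating function are illustrative assumptions:

```python
import math

def combined_surprise(s_t, s_s):
    """Equation 3.66: non-linear combination of temporal and spatial surprise."""
    return (s_t + s_s / 20.0) ** (1.0 / 3.0)

def saliency(per_feature_surprise, gain=1.0):
    """Sum surprise over the five feature channels and pass the total through a
    saturating function (tanh here; the exact sigmoid and gain are assumptions)."""
    return math.tanh(gain * sum(per_feature_surprise))

# Illustrative surprise pairs (temporal, spatial) for colour, intensity,
# motion, orientation and flicker at one location:
pairs = [(0.4, 1.0), (0.1, 0.2), (2.5, 3.0), (0.2, 0.1), (1.2, 0.6)]
per_feature = [combined_surprise(st, ss) for st, ss in pairs]
print(saliency(per_feature))               # a value in (0, 1)
```

The cube root compresses large surprise values, and the saturating output keeps the saliency of any single location bounded, mirroring the saturation of neural responses.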
The performance of the surprise-based attention model has been validated against many state-of-the-art models on six computational metrics [10, 54], over different video clips comprising thousands of frames. The experimental results show that the surprise-based model predicts human eye-fixation locations more closely than the other models.