Surprising events or targets occur frequently in our lives, and they often attract human attention or leave a deep impression in the human brain. For instance, when watching films or television plays, the audience pays more attention to an unexpected sequel after long suspense, or fixes their eyes on surprising locations or scenes. Surprise is an out-of-expectation stimulus or event, and it connotes uncertainty: if everything were deterministic, there would be no surprise in the world, and life would lose its highlights and delights. Surprise may also be subjective; for the same visual data, the degree of surprise differs between individuals. Surprise sometimes represents novelty in the temporal or spatial scope, which relates it to saliency, because experiments on many video clips showed that, over thousands of frames, about 72% of audiences shifted their gaze directly towards locations more surprising than the average [10, 54].
But how can surprise be quantified? Itti and Baldi proposed a Bayesian definition of surprise in [10, 54, 55], called Bayesian surprise theory. In this theory, a prior probability density is assigned over the possible models, and when new stimuli arrive, the probability over the possible models is updated by the Bayesian rule. Surprise is then measured by the KL divergence between the prior and the posterior probability densities; this KL distance, or divergence, was introduced in Sections 2.6.2 and 3.6.1. In Bayesian surprise theory, a greater change of the probability density after the arrival of new stimuli leads to a larger KL divergence, which means a higher degree of surprise. Conversely, if the new data do not alter the prior probability density at all, there is no surprise and the KL divergence is zero. Since each data update changes the prior and posterior probabilities over time, surprise theory is well suited to applications in video saliency.
Suppose the prior probability density of an observer is described as {p(M)}, M ∈ 𝕄, over the models in a model space; here the italic M denotes one of the hypotheses or models, and 𝕄 denotes the whole model space. Given the prior probability density of beliefs, the effect of new data D (or X if the data are in a feature space) is to change the prior probability density {p(M)} into the posterior probability density by the Bayesian rule, as a rewritten form of Equation 2.8 (i.e., X of Equation 2.8 is replaced by D):

p(M|D) = p(D|M) p(M) / p(D)
The surprise degree is quantified by the distance, or KL divergence, between the posterior and prior probability densities:

S(D, 𝕄) = KL( p(M|D), p(M) ) = ∫𝕄 p(M|D) log[ p(M|D) / p(M) ] dM      (3.62)
Equation 3.62 carries two meanings. (1) The posterior probability density p(M|D) is updated over time through the Bayesian rule, because the surprise degree only considers the effect of the current data that change the original model. An example of this is given in [54]: while one is watching TV, snowflakes suddenly appear on the television screen due to an electronic malfunction or communication failure. The posterior probability density for the snowflake data is greatly dissimilar to the prior probability density built from ordinary content (TV news, advertisements, plays etc.), and this results in large surprise. When the random snowflake interference continues, however, the probability density comes to favour a snowflake model, which reduces the observer's surprise. In fact, the long and bothersome random interference no longer attracts attention, even though each snowflake frame differs from the other frames (the difference between two successive frames is not zero). (2) The surprise measurement (Equation 3.62) integrates over the model space 𝕄; it does not integrate over the data space as in other models (Equation 3.41).
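The snowflake example above can be sketched numerically for a discrete two-model family, where the integral in Equation 3.62 becomes a sum. The per-frame likelihood values below are assumptions chosen only for illustration; the sketch shows the surprise of the first snowflake frame being large and then decaying as the snowflake model takes over the belief.

```python
import math

def kl_bits(posterior, prior):
    # KL divergence (base 2, i.e. in 'wows') between discrete model beliefs
    return sum(q * math.log2(q / p) for q, p in zip(posterior, prior) if q > 0)

# Two hypotheses: M_tv (normal programme) and M_snow (random interference).
# Assumed per-frame likelihoods of observing a snowflake frame under each model.
lik = (0.01, 0.95)            # (p(snow frame | M_tv), p(snow frame | M_snow))

belief = [0.99, 0.01]         # strong prior belief in a normal programme
surprises = []
for _ in range(5):            # five successive snowflake frames arrive
    evidence = sum(b * l for b, l in zip(belief, lik))
    posterior = [b * l / evidence for b, l in zip(belief, lik)]
    surprises.append(kl_bits(posterior, belief))
    belief = posterior        # today's posterior is tomorrow's prior

# Surprise shrinks frame by frame once the snowflake model dominates
print([round(s, 3) for s in surprises])
```

The first frame of interference yields over two wows of surprise; by the third frame the belief already favours the snowflake model so strongly that further identical-looking frames are nearly surprise-free, matching meaning (1) above.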
The model space includes all the hypotheses about an event. A lively and simple example in [54] illustrates the computation of surprise over the model space. When a new television channel starts its programmes, there is uncertainty about its content. The parents have two hypotheses (models) in the model space 𝕄: one hypothesis is that the channel is suitable for children, denoted M1; the other hypothesis is the opposite (i.e., the channel may contain some scenes of nudity), denoted M2. Assume that the prior probabilities of the two models are equal, p(M1) = p(M2) = 0.5, and that the possible observations are data D1 with some nudity and data D2 without nudity. Suppose the observer first watches several frames of the new channel that include data D1. By the Bayesian theorem, the posterior probabilities of M1 and M2 given data D1 are p(M1|D1) = p(D1|M1)p(M1)/p(D1) and p(M2|D1) = p(D1|M2)p(M2)/p(D1), respectively. The surprises for the two models are described as

S(D1, M1) = log[ p(M1|D1) / p(M1) ],    S(D1, M2) = log[ p(M2|D1) / p(M2) ]
The total surprise experienced by the observer is the average of these values over the model family 𝕄 = {M1, M2}, weighted by the posterior probabilities:

Sp(D1) = p(M1|D1) log[ p(M1|D1) / p(M1) ] + p(M2|D1) log[ p(M2|D1) / p(M2) ]      (3.63)
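The television-channel example can be worked through with concrete numbers. The equal priors come from the text; the likelihoods p(D1|M1) and p(D1|M2) are assumed values introduced only for this illustration, since the text does not specify them.

```python
import math

# Equal priors for the two hypotheses about the new channel (from the text);
# the likelihoods of seeing frames with nudity (D1) are assumed values.
prior = {"M1": 0.5, "M2": 0.5}
lik_D1 = {"M1": 0.1, "M2": 0.9}       # p(D1 | M1), p(D1 | M2): assumptions

p_D1 = sum(prior[m] * lik_D1[m] for m in prior)              # p(D1)
posterior = {m: prior[m] * lik_D1[m] / p_D1 for m in prior}  # Bayesian rule

# Per-model surprises and the posterior-weighted total (Equation 3.63),
# with log base 2 so the result is measured in 'wows'
per_model = {m: math.log2(posterior[m] / prior[m]) for m in prior}
total = sum(posterior[m] * per_model[m] for m in prior)

print(posterior, round(total, 3))    # posterior favours M2; total ≈ 0.531 wows
```

Seeing D1 pushes the posterior of M2 from 0.5 up to 0.9, and the observer experiences about half a wow of total surprise under these assumed likelihoods.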
Equation 3.63 can be extended to model families with more than two hypotheses. Since Sp is directly proportional to video saliency, it provides a new concept for estimating saliency.
It is known that the unit of entropy is the bit when the logarithm with base 2 is used. KL divergence is a relative entropy; in surprise theory, the unit of Equation 3.62 is called the ‘wow’ [10, 54, 55] in the case of a logarithm with base 2.
Under the surprise theoretical framework, bottom-up saliency computation has four steps: low-level feature extraction, as in the BS model but with motion and temporal flicker channels added; temporal surprise computation over the model family at each location and for each feature; spatial surprise detection; and finally a combination of the temporal and spatial surprises into the final saliency map [55]. Figure 3.17 gives the block diagram of saliency computation based on surprise.
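The temporal-surprise step can be sketched at the per-location level. The sketch below is a much-simplified stand-in for the Poisson/Gamma surprise units of the original model [55]: it keeps a Gaussian belief about each location's feature value (with the belief variance held fixed between frames for simplicity), updates the mean with each new frame, and scores KL(posterior, prior) in wows. All parameter values and the toy feature maps are assumptions for illustration.

```python
import numpy as np

def temporal_surprise(prior_mean, frame, s2_prior=1.0, s2_obs=1.0):
    """Per-location temporal surprise under a toy Gaussian model.

    Conjugate update of the Gaussian mean at each location, followed by the
    closed-form KL divergence between posterior and prior, in wows.
    """
    k = s2_prior / (s2_prior + s2_obs)              # Kalman-like gain
    post_mean = prior_mean + k * (frame - prior_mean)
    s2_post = (1.0 - k) * s2_prior
    # KL( N(post_mean, s2_post) || N(prior_mean, s2_prior) ), nats -> base 2
    kl = 0.5 * (s2_post / s2_prior
                + (post_mean - prior_mean) ** 2 / s2_prior
                - 1.0 + np.log(s2_prior / s2_post))
    return post_mean, kl / np.log(2.0)

rng = np.random.default_rng(0)
frames = rng.normal(0.0, 1.0, size=(4, 8, 8))   # toy feature maps per frame
frames[2] += 5.0                                # a sudden change in frame 2

mean = np.zeros((8, 8))
mean_surprise = []
for f in frames:
    mean, s_map = temporal_surprise(mean, f)    # s_map is a surprise map
    mean_surprise.append(float(s_map.mean()))
print([round(s, 2) for s in mean_surprise])     # frame 2 dominates
```

The surprise map of the abruptly changed frame stands out strongly, while the steady noise frames produce only modest surprise, which is the behaviour the temporal surprise step contributes to the final saliency map.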