Hrishikesh Bhaumik*, Siddhartha Bhattacharyya, Surangam Sur, and Susanta Chakraborty

2 Keyframe selection for video indexing using an approximate minimal spanning tree

*Corresponding author: Hrishikesh Bhaumik, RCC Institute of Information Technology, Kolkata, India

Siddhartha Bhattacharyya, RCC Institute of Information Technology, Kolkata, India

Surangam Sur, Jadavpur University, Kolkata, India

Susanta Chakraborty, Indian Institute of Engineering Science and Technology, Shibpur, Howrah, India

Abstract: Video summarization is an important field of research in content-based video retrieval. One of the major objectives in this domain has been to generate the summary of a given video in the shortest possible time. In this work, the primary aim is to rapidly select keyframes from the component shots of a video to generate a storyboard in minimal time. The time taken to produce the storyboard is directly proportional to the number of correlations to be computed between the frames of a shot. To reduce this time, user input is obtained regarding the percentage of correlations to be computed exactly. The remaining correlation values are predicted by interpolation from the correlation values actually computed for the purpose. Keyframes are selected from each shot by generating an approximate minimal spanning tree (AMST) and computing the density around each frame of the shot by means of an automatic threshold based on the statistical distribution of the correlation values. From the set of keyframes obtained, redundant frames are eliminated using the SURF feature descriptor to arrive at the final summary, which depicts the storyboard of the video. A comparison between the summaries produced by the AMST and the MST shows encouraging results and demonstrates the usefulness of the proposed method.

Keywords: Storyboard generation, Approximate minimal spanning tree, Interpolation, Redundancy elimination

2.1 Introduction

Content-based retrieval systems (CBRS) have been the focal point of research for the past two decades. CBRS is a broad research field that encompasses various other domains like content-based video retrieval systems (CBVRS) [1], content-based image retrieval systems (CBIRS) [2] and content-based audio retrieval systems (CBARS) [3]. ‘Content-based’ refers to the semantic content of the media type under consideration and ‘retrieval’ refers to the act of extracting the relevant data (audio, video, or image) required by the user. Out of all the types of CBRS, CBVRS is the most challenging one because it encompasses the analysis of all other media types like text, image and audio. Even video clips embedded within a video are quite common these days. The field of CBVRSs has drawn much attention and contributions from researchers and application developers in several domains. Research in CBVRSs has been propelled by the unrestrained uploading of videos on several online repositories, augmented by the ever-decreasing cost of storage devices.

In the early stages, CBVRS development occurred in the form of academic research projects such as the Informedia Digital Video Library from Carnegie Mellon University and Fischlar Digital Video Suite from Dublin City University. These systems operated over thousands of hours of video repositories. The important aspects for ensuring the efficacy of such systems are as follows:

  1. Acquiring a concise and complete representation of video sequences
  2. Providing the user with different search techniques depending upon the type of search required

The essence of research in this field is to develop different search techniques for efficient mining of videos from a huge digital repository. The extraction of relevant items from such a digital library is based on search by semantic content, rather than meta-data tagged to the videos. The mining process is facilitated by the extraction of features obtained from the component units of such video data. The component elements considered for CBVR activity may consist of scenes, shots or frames of the video under consideration. Temporal decomposition at different granularity levels gives rise to one of the three components of a video, i.e., scenes, shots or frames, as depicted in Figure 2.1. While frames represent the lowest granularity level, shots are a collection of time-sequenced frames that are part of an uninterrupted sequence taken by a single camera. At the highest level of video decomposition, scenes represent a conglomeration of a number of temporally consecutive shots that are semantically similar. Semantic similarity arises when different shots share the same background or objects composing a scene.

Fig. 2.1: Structural hierarchy of a video

It is imperative that the videos stored in large digital repositories be indexed in order to enable the efficient retrieval of relevant items from the database. An indexed digital library allows easy searching, retrieval and browsing of relevant videos in accordance with user requirements. The process adds flexibility and ensures that the unstructured multimedia content is both accessible and searchable [4]. The indexing process may start as soon as a video is stored in the digital library. Video indexing may be done using any of the three component units. The type of features required for indexing videos grows in complexity (low level to high level) as the level of granularity decreases. Low-level features in the form of pixel statistics like mean, variance, standard deviation of values [5], histogram of pixel values [6], entropy [7], luminance, fuzzy hostility index of pixels [8] and so forth may be extracted from high-granularity units (frames). High-level features such as optical flow in successive images [9], shape of video objects [10], edges in objects [11], motion vector [12], event detection [13], action modelling [14] and so forth are more appropriate for representing the semantic content of a shot or scene. This includes the identification or tracking of objects [15], action recognition [16], emotion detection [17] and so forth.

Video indexing using frames gives rise to a static representation of contents, which is referred to as static video summarization or storyboard. In a storyboard representation, a subset of frames from the entire set of component frames in a video is chosen to portray the contents. The process of selecting the keyframes is based on algorithms formulated for the purpose. The two aspects considered in a summarization activity are conciseness and coverage. These two aspects are in conflict with each other since conciseness affects coverage and vice versa. Indexing videos in a repository using shots or scenes allows for a dynamic representation, also known as video skimming. In such a representation, a few shots are chosen from the entire set of shots that make up the video. However, any visual abstract of a video must satisfy the following three postulates:

  1. The video summary must contain the high-priority entities and events of interest
  2. It should present rational or logical points of flow or progression
  3. It should be free of recapitulation (redundancy-free summarization)

The present approach deals with keyframe selection from videos of a large repository in order to facilitate easy indexing, retrieval and browsing. Over the years, keyframe selection has been a frequently used technique for video indexing. Efficient selection of keyframes from a video helps to portray the contents in a summarized form (also referred to as static summary) to the user. This form of video summarization is particularly useful in bandwidth-constrained scenarios where the user needs to retrieve relevant videos in the minimum possible amount of time.

In the present work, an investigation is carried out to improve the performance of the method based on minimal spanning trees (MSTs), developed by Bhaumik et al. in [18]. The main motivation of this work is to reduce the time complexity for generating MSTs pertaining to shots having a large number of frames. Constructing an MST over the complete graph of pairwise frame distances requires O(n²) distance computations for a shot of n frames. The focus of this work is to develop a method for generating an approximate MST (AMST) based on interpolation. Given a subset of actual distances, the remaining distances needed to generate the AMST are predicted by interpolation. The number of actual distances computed for the purpose is based on an input percentage specified by the user. A high input percentage reduces the error induced in the AMST but increases the time needed to generate it. An input percentage of 100 generates an AMST identical to the original MST, with zero induced error. The objective here is to generate the summaries obtained using both the MST and the AMST and compare them with user-generated summaries in terms of well-known metrics such as precision and recall.

The remainder of this chapter is organized as follows. A survey of related works on static and dynamic video summarizations is presented in Section 2.2. Basic concepts and definitions are given in Section 2.3. The proposed methodology for static video summarization, along with an explanation of the different steps of the algorithm, is enumerated in Section 2.4. The experimental results of the summarization process using the proposed methodology are elaborated in Section 2.5. Finally, some conclusions and remarks are given in Section 2.6.

2.2 Survey of related works

Video summarization is one of the most important approaches to speedy browsing and accessing of videos from large digital repositories. It is the process of constructing a short summary from a large digital video to give users a useful visual abstract about the contents of the video. The component units of a video can be defined as scenes, shots and frames depending upon the granularity level used. Thus, the temporal decomposition of a video gives rise to different coarseness levels corresponding to the stages of video hierarchy. As shown in Figure 2.1, several units at a lower stage are combined to form a unit at a higher stage. The decision to combine a set of units at a lower level of the hierarchy to form a unit of the higher level depends upon the similarity of such units, given the features extracted from each unit. This results in the different domains of video analysis such as shot boundary detection or shot transition detection, scene detection and so forth. The sequential ordering of the temporally related units in a video is of prime importance because any disordering will cause a deformation of the content [19]. Video summarization may be classified into two broad categories: static video summarization and dynamic video summarization.

2.2.1 Static video summarization

The term ‘static’ literally means stationary, stable or fixed. The process of static video summarization involves the selection of a certain number of frames from a video. The process comprises the following basic steps (a rough code sketch of the pipeline is given after the list):

  1. Decompose video into images using a codec.
  2. Extract features from each frame composing the video.
  3. Select a set of representative frames based on features extracted from each frame.
  4. Reduce redundancy on set of representative frames to eliminate superfluous frames.
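
As a rough illustration, the Python skeleton below walks through these four steps using OpenCV; the colour-histogram feature, the correlation threshold and the duplicate test are placeholder choices made for this sketch, not the AMST-based method proposed later in this chapter.

```python
import cv2

def summarize(video_path, sim_threshold=0.9):
    """Four-step static summarization skeleton (placeholder features and thresholds)."""
    # Step 1: decompose the video into frames using a codec (here via OpenCV).
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()

    # Step 2: extract a feature from each frame (a coarse colour histogram here).
    def feature(frame):
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        return cv2.normalize(hist, hist).flatten()

    feats = [feature(f) for f in frames]

    def similar(i, j):
        return cv2.compareHist(feats[i], feats[j], cv2.HISTCMP_CORREL) >= sim_threshold

    # Step 3: select a representative frame whenever the content drifts away
    # from the last selected representative.
    representatives = [0] if frames else []
    for i in range(1, len(frames)):
        if not similar(representatives[-1], i):
            representatives.append(i)

    # Step 4: redundancy reduction -- drop representatives that are
    # near-duplicates of an already retained keyframe.
    keyframes = []
    for i in representatives:
        if not any(similar(i, j) for j in keyframes):
            keyframes.append(i)
    return keyframes  # indices of the selected keyframes
```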

The static summary reflects the basic underlying content of a video. The summary generated may be a set of representative frames or a collage formed from such frames. In the literature, static video summarization has also been referred to as storyboard generation, which has attracted the attention of many researchers, who identified a need to represent the contents of a given video in concise form. The two basic approaches used in video summarization and detailed in later subsections are:

  1. Classical approaches
  2. Soft computing–based approaches

2.2.1.1 Classical approaches

Clustering has been used by many researchers as an effective tool for generating video summaries. Mundur et al. [20] applied the concept of Delaunay clustering to produce effective static summaries of videos. In this method, Delaunay triangulation is used to cluster frames, which produces good-quality summaries with few frames and reduced redundancy. In this method no user intervention is required, and the method is well suited for batch processing. In [21], De Avila et al. focused on the extraction of colour features from video frames for clustering the same using a k-means clustering algorithm. This algorithm was called VSUMM and the generated summary was compared with a manually generated summary by human users. The comparison revealed that the system had produced a high-quality summary. In [22], a static video summarization algorithm was presented that was based on a k-medoid clustering method. The distance measure used for the purpose is computed by a block matching mechanism in the frequency domain. The approach is computationally viable and is able to extract dissimilar frames in large motion scenarios. The keyframes extracted by the process have reduced visual redundancy. In [23], Hanjalic proposed a scheme for producing a static and dynamic summary by applying multiple partitional clustering to all frames of a video. After cluster validity analysis, centroids from the optimal clusters thus obtained form a set of representative frames or storyboard. The shots corresponding to the selected keyframes are coalesced together to form a dynamic summary.

Cahuina and Chavez proposed a method [24] for video summarization taking into consideration local descriptors and temporal video segmentation to produce static video summaries. The evaluation was carried out on more than 100 videos, and the approach yielded good results compared to other state-of-the-art methods. The experiments conducted also conveyed the importance of colour information. Bhaumik et al. in a paper [25] introduced a three-phase keyframe extraction approach to generate a static summary of a video. It is one of the few works that focus on redundancy reduction of detected keyframes. The algorithm works in three stages, where the first stage focuses on the selection of representative frames and the other two stages focus on the intrashot and intershot redundancy reduction of the keyframes selected in the first stage. After every stage, a comparison of the system generated and user-generated summaries is done. In [26], the authors take a set-theoretic approach to the problem of keyframe extraction from a video. The extraction algorithms proposed in the work enable the presentation of a summary in the form of a multilevel abstract. For a given degree of fidelity, a tree-structured hierarchy is generated that enables well-organized content retrieval using a depth-first search. The work by DeMenthon et al. [27] on generating representative frames of a video focuses on a curve simplification method. The idea behind this work is to represent a video as a trajectory in high-dimensional feature space, where each of the dimensions represents a particular feature of the video frame. The trajectory represents a temporal variation of the features in the feature space. To select the representative frames, the curve is broken down into several component arcs. The frames at the junction of the arcs are selected as representative frames. The approach is versatile in that the level of decomposition can be controlled, thereby decreasing or increasing the number of keyframes for summarizing the video.

The work by Yeung and Yao [28] is one of the early papers on video summarization that speaks about the objectives of the summarization task and presents a technique for computing the dominance of contents in various segments of a video and thereby generating a pictorial summary from the chosen video frames. Franti et al. introduced two methods based on a Hough transform for image matching in [29]. In the first one, a feature vector is formed by adding the most important coefficients in each column of the accumulator matrix. Thus, this method represents an easy approach to matching that is invariant to geometric transformation. The second approach takes into account the positional data of lines to provide a more accurate depiction of images. Thus, this method provides an effective way to match two images, which is a fundamental requirement for video indexing systems. In a different approach, a Gabor filter was used by Zhang et al. [2] for extracting the textural features of an image. Rotational normalization is accomplished by circular shifting, so that the images have the same dominant direction. With this approach, images can be indexed for retrieval of similar images from a large digital repository. Gygli et al. [30] used a supervised learning approach for the characterization of a summary that optimized multiple objectives for producing a good summary. In another work, Xu et al. [31] adopted gaze tracking information for summarizing personal videos. Itazuri et al. [32] proposed a video summarization scheme for volleyball videos using a rally scene detection method based on court changeover information. A multimodal video database and movie summarization system was introduced by Zlatintsi et al. [33] for the evaluation of video summarization algorithms. Graph modularity clustering was used in [34] for the extraction of keyframes. Sharghi et al. [35] focused on the aspect of user subjectivity to create a video summary. In this approach, a memory network parameterized sequential determinantal point process was used to address user queries on the frames and shots of a video. A two-stage model was proposed by Lin et al. [36] for the online summarization of videos for the Internet of Video Things (IoVT), captured at the sensor nodes, in which a two-layer k-nearest neighbour clustering algorithm is used for object clustering. A keyframe recommender system was designed by Chen et al. [37] that could model both textual and visual features in a unified framework. In this approach, the personalization provided to the user was achieved by encoding several user interests into a unified multimodal space. Sahoo et al. [38] devised a video summarization system that could produce static, dynamic or entity summaries. The system is able to summarize videos on the fly. On-the-fly summarization of large videos was taken up in the work by Mirzasoleiman et al. [39] using a single-pass streaming algorithm designed for the purpose. In the approach by Choi et al. [40], context-driven video summaries are generated by learning the semantic embeddings of video frames by means of captions present in frames. The system is able to produce a meaningful summary by selecting relevant video segments, given a textual description of user needs. The system developed by Otani et al. [41] takes into account user intentions for posting a blog video by focusing on the supporting texts in the blog. The summaries generated by the system reflect the blogger’s perspective. In [42], Mitra et al. 
used sequences of temporal segments called tracklets for discovering entities in a video. The tracklets are clustered based on parameters like clustering accuracy, cluster purity and entity coverage. The method is applicable for streaming videos and is also able to reject false alarms. The approach used by Mademlis et al. [43] focuses on summarizing human activity in videos by identifying visual building blocks and forming basis vectors. A joint optimization model is extracted from a column subset selection problem (CSSP) to provide content coverage and inclusion of outliers. In [44], a half-quadratic minimization algorithm is used to solve a collaborative sparse optimization problem for clustering large collections of videos based on the given topic. In this work, techniques from information retrieval and natural language processing are used for collaborative summarization. A technique for real-time video summarization was proposed by Abdelali et al. in [45] that used an MPEG-7 colour structure descriptor. The system was implemented on a chip using a Xilinx Virtex5 field-programmable gate array.

A detailed survey of video summarization made with the help of wearable cameras is presented in [46] by Del Molino et al. A supervised video summarization framework was developed by Ji et al. [47] in which an attentive encoder-decoder network was developed for automated video summarization. Bidirectional long short-term memory (BiLSTM) was used to encode the contextual information in input video frames. The decoder uses two attention-based LSTM networks for additive and multiplicative objective functions. Video summarization using an ensemble of deep learned features was tested in [48]. In [49], Panda et al. focus on the summarization of a collection of videos rather than only one video. The method explores the complementarities in a given set of videos by means of a separate minimization algorithm over an objective function. In a separate work [50], the authors took up the problem of summarizing surveillance videos taken from multiple cameras. The work focuses on multi-view correlations that serve to represent a diverse set of representative shots. Also, sparseness is modelled to select representative shots in order to produce a summary. In [51], a fast iterative shrinkage thresholding algorithm (FISTA) was used by Meng et al. for category-specific video summarization using a multi-view sparse dictionary selection with centroid co-regularization (MSDS-CC) method. An approach for object-level video summarization was formulated by Zhang et al. in [52] by means of an online motion Auto–Encoder (online motion-AE) framework that executes on the segmented object motion snippets of the video. The encoder imitates the online dictionary learning by memorizing previous states of object motion; it updates itself through a recurrent auto-encoder network. Instead of using a single-source media type, Rudinac et al. in [53] focus on using dynamic multisource media that may be more effective at summarizing content from modern social media. Tejero-de-Pablos et al. in [54] deal with the problem of automated sports video summarization using a deep neural network to extract action-related features and to categorize video fragments into interesting and uninteresting segments. A method for video summarization was developed in [55] by Plummer et al. that uses a joint vision-language approach to evaluate the representativeness and interestingness objectives.

In the next section, video summarization approaches using soft computing are discussed.

2.2.1.2 Soft computing–based approaches

1. Approaches based on fuzzy logic

Researchers in the field of video summarization have frequently used fuzzy logic to enhance the performance of systems developed for the purpose. In an interesting work by Fang et al. [56], shot boundaries were detected using fuzzy logic in order to integrate hybrid features constructed from basic features like pixel differences, colour histogram, motion of objects and edge features. The approach uses a mode selector to detect hard cuts and gradual transitions in two different processing modes. A data set from Carleton University is used to evaluate the system. In another approach, Das et al. [57] used low-level fuzzy features to generate a fuzzy model for detecting shot transitions. The method is useful for detecting abrupt transitions as well as different types of gradual transitions like dissolves, wipes and fades. The method achieves high precision and recall values on test videos, especially for gradual transitions. Anderson et al. [58] proposed a method based on fuzzy logic for event detection. The work deals with the problem of fall detection of humans using a 3D object generated from multiple cameras. The states of a person in every image are taken, and a linguistic summarization of the person’s state is given as output using a fuzzy rule base. The framework is flexible since the existing rules can be modified or removed and new rules added. In [59], Cahuina et al. presented a method for static video summarization based on colour histograms generated from a video. Principal component analysis is used for dimensionality reduction of feature vectors. Thereafter, the shots in the video are detected using the Fuzzy-ART algorithm [60]. The keyframes are extracted using the Fuzzy C-Means [61] algorithm. A multidimensional fuzzy histogram–based approach to video summarization and retrieval is discussed in [62] by Daulamis et al., in which the visual content is portrayed for a video. In this approach, suitable features are extracted to produce a multidimensional fuzzy histogram. Video summarization is achieved by eliminating shots or frames with similar content. During the retrieval process, images that visually match a query are extracted from video frames. An interesting approach to home video summarization that employed a fuzzy fusion method for combining the attention and emotional aspects of a user was adopted by Peng et al. in [63]. The system used an Interest Meter (IM) that estimated the attention aspect of a user by tracking head motion, eye movement and blinking. To estimate the emotional state of a user, facial expressions are captured and classified as positive or neutral. A quantitative interest score is generated for different parts of the video. The summarized version contains those parts of the video that have a high interest score. A video surveillance system for managing elderly and disabled people was proposed by Elbouz et al. [64]; it used fuzzy logic to detect a person’s behaviour. In this system, the Vander-Lugt correlator (VLC) and joint-transfer correlator (JTC) techniques are integrated in order to make a decision about the 3D poses of a person. Since no extra devices are involved, the system is cost effective. Model-based prediction and optical correlation are used to track the person by locating the position of the person’s eyes in the image. The object in the image is detected using HSV colour space. In another work with a similar motivation, aimed at monitoring elderly and disabled people, Zhou et al. 
[65] developed a system for activity recognition and video monitoring. The physical location and speed of a person are estimated using a learning model developed on the basis of features extracted from a video. A human detection and tracking algorithm is developed at the object level. For action recognition, a hierarchical decision tree and dimension reduction methods were used. Thus the functional assessment of a person is estimated by extracting the statistics related to activities of daily living. The approach was tested on more than 200 h of video and shown to be effective for this type of application.

Keyframe selection using a stochastic framework was presented by Avrithis et al. [66] in which feature-based representation is obtained through colour and motion segmentation on each video shot. The recursive shortest spanning tree algorithm is used in the approach. A fuzzy multidimensional histogram is used to assemble all segment features so that misclassification of segments can be decreased. The work illustrates two approaches to keyframe detection. The trajectory pertaining to the temporal variation is analysed in the first method. The second approach aimed at decreasing cross-correlations. For the second method, a stochastic adaptation of logarithmic searches and a genetic algorithm were designed. A similar kind of work was carried out by Doulamis et al. [67] in which an effective algorithm for video summarization was developed. The approach took into consideration the temporal variations of video content. Important frames or shots were extracted by approximating suitable points on the feature vector curve and the curve formed by interpolation using the estimated points.

2. Approaches based on neural networks

Neural networks have been effectively used by researchers for the summarization of video content. Convolutional neural networks (CNNs) have been found to be particularly useful in dealing with image recognition problems. Karpathy et al. applied CNNs to the problem of large-scale video classification in [68]. To take advantage of local spatiotemporal information, multiple approaches are used to extend the connectivity of a CNN in the time domain. A large data set consisting of one million YouTube videos having 487 class labels was used for experimentation, and a significant improvement in performance was achieved. Recurrent neural networks (RNNs) have also been used effectively for image recognition problems. Yao et al. [69] used local and global temporal structures to produce a description of videos. Short temporal dynamics are modelled with a 3D CNN trained on human action recognition tasks. Also, a text-generating RNN is used to produce a natural language description of video segments using a temporal attention mechanism. Venugopalan et al. have also used CNNs and RNNs in the description of videos in natural language in [70]. The approach uses deep neural networks to create natural language descriptions of open-domain videos with large vocabularies. Training for the neural networks is carried out on a large data set of images having category labels as well as other images bearing captions.

An adjustable neural network model was proposed by Doulamis et al. in [71] for video object (VO) segmentation. In this approach an adaptive neural network classifier is used for object tracking. This is seen to be more effective than conventional object motion tracking algorithms. A cost-efficient algorithm is used for weight updating of networks without degrading the knowledge acquired by the network. The detection of the body and face of human objects is attained by Gaussian distributions for video conferencing applications. To detect generic VOs, colour and depth information is extracted for segmentation fusion. Effective results are obtained for scenarios where there is object occlusion or blending.

A six-phase method based on machine learning was proposed by Zawbaa et al. [72] for event detection and summarization of football (American soccer) matches. In this approach, both support vector machines (SVMs) and neural networks are used for the detection of important video segments. The system performs scoreboard detection, replay detection, exciting event detection and logo detection for the summarization of football videos.

In [73], Panda et al. modelled a 3D CNN that is able to automatically learn a parametric model for video classification. The model is used to detect important segments of a video in order to produce a summary. RNNs were used by Sah et al. [74] to generate textual summaries using deep learning models. The significant portions of videos are converted to textual summaries using recurrent networks. In [75], Sanghavi and Pangaonkar adopt a self-organizing map (SOM) to detect indoor and outdoor scenes for efficient video summarization.

3. Approaches based on genetic algorithms

The need to summarize large videos has posed a long-standing challenge in the field of video indexing and retrieval. Most traditional methods fail on account of issues related to time complexity. In the analysis of large videos, fast browsing and access to meaningful content are integral parts of the research in the field. In this respect, Chiu et al. [76] devised a genetic segmentation algorithm that executes iteratively on the different segments of representative strings. The method facilitates segmentation in an incremental manner due to the evolutionary nature of genetic algorithms. This enables summarization of long videos in an efficient manner for faster indexing, browsing and retrieval. The effectiveness of content-based retrieval systems can be enhanced by combining multiple features extracted from images. However, the combination of features may not always prove effective, and the combined function may be far more complex than simple weight-based functions. To address this problem, Torres et al. [77] proposed an effective way to achieve a non-linear combination of image features in a genetic programming framework. In this method, images are retrieved based on object shapes. Lu et al. in [78] presented an approach in which image feature descriptors for colour, texture and shape are selected as low-level features of the image. Three types of chromosomes, i.e., real coded, binary and bi-coded chromosomes, along with k-nearest neighbour classification accuracy, are used as fitness functions to optimize the weights and selection of descriptor subsets. A data set of 2000 annotated Corel images was created for experimentation, and it was found that the annotation accuracy increased by 7%, 9% and 13.6% for the real coded, binary and bi-coded chromosomes respectively.

A scene retrieval algorithm was proposed by Yoo and Cho [79] in which five video features, i.e., average colour histogram, average brightness, average edge histogram, average shot duration and gradual change rate, were extracted from videos and mapped to the user’s emotional space using an interactive genetic algorithm. The emotion categories considered here are action, excitement, suspense, quietness, relaxation and happiness. An initial population is selected by the algorithm based on the emotions given by the user. The feature vectors from this initial set are treated as chromosomes, and genetic crossover is applied to produce a new chromosome set. Using a similarity function, videos corresponding to the new chromosomes are retrieved. This process is repeated until the videos that are relevant to the user are retrieved. A method for selecting keyframes based on CSSP was formulated by Mademlis et al. [80]. In this case, the problem is viewed as an optimization problem and solved by means of a genetic algorithm.

4. Support vector machines

SVMs have been an important tool for researchers in most machine learning approaches. SVMs can serve to memorize knowledge gathered from past facts and use it for future computation. SVMs have been found to be effective for the classification of both linear and non-linear data. In this context, an interesting approach was adopted by Ng et al. [81] for modelling the relationship between brainwaves and audio/video features using an SVM model. In this approach, summarization of personal videos, taken with wearable video cameras, was done using low-level video and audio features. A different approach to video summarization was adopted by Li et al. [82] in which a mapping of video features to their corresponding textual information was attempted. The learning mechanism used transferable structure learning using a structural SVM as a backbone to enhance the summarization of videos and boost performance by exploiting the interchange between video features and equivalent textual information. To summarize a new video, only video features need to be extracted.

Potapov et al. stressed the need for category-specific video summarization in [83]. An SVM classifier was used to allocate scores to the different video fragments obtained after video segmentation. The segments having the highest scores are coalesced to produce a video summary. Experiments with TRECVID 2011 data show the efficacy of the proposed mechanism. Video segmentation using SVMs was tested in [84], which describes a fast method for shot boundary detection. Initially, the grey variance is extracted from video frames. The frames in sequences of smooth variation are deleted, and a new ordering of the frames, called a reordered frame sequence (RFS), is produced. Pattern vectors formed from three features, namely pixel-wise intensity differences, colour histogram differences in HSV space and edge histogram differences in the X and Y directions, are extracted from the RFS and given as input to the SVM to determine the shot boundaries. Gradual transitions are detected using temporal multiresolution on the frames of RFSs.

Producing a synopsis of sports videos is a specific and important research field in the domain of video summarization. In [85], a mechanism for detecting notable events in sports videos was devised by Sadlier and O’Connor that was based on audio-visual features. The selected features are characteristic of any sports video and are merged with the help of an SVM that serves to build a model during the training stage for the detection of important events in a given sports clip.

To overcome the problem of relevance feedback systems, where obtaining feedback from the user might be a problem due to time constraints, a method for pseudo-labelling of images was devised by Wu et al. [86] that was based on a fuzzy version of SVMs. Learning from only a few training samples prevents the system from performing well. To overcome this problem, the images were pseudo-labelled. However, this induces uncertainty in the class information. A fuzzy SVM (FSVM) is thus used to address this uncertainty through active learning using a combined model called a pseudo-label fuzzy support vector machine (PLFSVM) to perform content-based image retrieval. This method has proved to be effective on a large data set consisting of 10,000 images.

2.2.2 Dynamic video summarization

Due to the massive volume of video data available online, it is becoming increasingly important to devise tools and applications for summarizing video content. The need for video summarization is obvious from the fact that browsing and retrieval time must be minimized. However, in many situations a static representation of a video using keyframes may not be sufficient. In certain cases related to the summarization of sports, news and other videos, a better representation can be obtained if important shots are concatenated to produce a summary, leading to dynamic video summarization or video skimming. Techniques for video skimming encompass the following approaches.

1. Graph modelling

Video skimming was achieved through graph modelling in [87–89]. In [87], scene modelling was carried out using a normalized cut algorithm and temporal graph modelling, whereas highlight detection was achieved by motion attention modelling. In this approach, the video is represented as a complete undirected graph, and video clusters are formed by splitting the graph using a normalized cut algorithm, resulting in the formation of directed temporal graphs. Detection of video scenes is accomplished using a shortest path algorithm. In a temporal graph, attention values are computed and assigned to all video structures, i.e., scenes, clusters, shots and subshots. This helps to generate video summaries that rely on both content and perceptual value. In an earlier work, Ngo et al. [88] used temporal graph modelling for the detection of subshots to create a dynamic video summary. To evaluate the perceptual quality of shots and clusters, a motion attention model was used.

In another approach, Lu et al. [89] used graph optimization for dynamic video summarization. The method emphasizes visual coherence and content coverage. In this method, a dynamic programming algorithm for video skimming using spatio-temporal information to determine the key shots and scenes in a video was proposed. The method comprised three stages. In the first stage, the given video is fragmented into shots and a candidate shot is chosen. The spatio-temporal association between the shots is modelled as a directional graph. Finally, the longest path in the graph, determined by an algorithm based on dynamic programming, is used to represent the final video summary.

2. Motion modelling

Motion modelling is an alternative method for dynamic video skimming. In [90], video segmentation is achieved by identifying dominant motion changes in the frames of a video. A 2D affine model is generated to examine some of the coefficient changes with respect to time. A statistical representation is built from the residual motion content of the obtained temporal units. A set of labelled classes is available through offline learning from a training set of video clips. Maximum likelihood is used for the classification of the video segments. A visual motion modelling approach was also used in [91] to identify dynamic video contents for dominant image motions as well as residual image motions. An affine motion model and a low-level motion model were employed to quantify the motions in the video. The approach has produced good results for both low-level and high-level motion videos, especially sports videos.

3. User attention modelling

Ma et al. [92] proposed an alternate approach to video semantic analysis, in the form of an attention-based approach. Instead of understanding video content through semantic analysis, the method focuses on modelling the viewer’s attention. The generic framework presented in the work eliminates the need for complex rules based on heuristics and takes advantage of the attention models framed for the purpose.

4. Singular value decomposition

Gong et al. in [93] employed a technique for both static and dynamic video summarization that was based on singular value decomposition (SVD). A feature frame matrix was constructed to perform SVD in order to select the best discriminating features. The frames are clustered based on the reduced feature space, and a metric is devised to determine the amount of visual content in each cluster. The best possible set of keyframes is chosen from the clusters and a summarized video, based on length given by the user, is generated. In this method, the emphasis is on the visual content of the clusters, ensuring that redundancy is eliminated from the generated summary.

5. Expectation maximization

Orriols and Binefa used an approach based on an expectation maximization (EM) algorithm [94] for dynamic video summarization. Appearance-based frameworks, such as an appearance representation model, an appearance temporal evolution model and a maximum-likelihood estimation model, were applied to compute the time symmetry. A set of semantically meaningful frames was extracted to devise a generative model capable of representing sequence progressions in the video. The probabilistic approach used for this problem took into account the perceptual similarity taken from time progression models and learned appearance.

6. Relevance score

A method for selecting representative key shots or video clips from a full video for dynamic video summarization is discussed in [95]. Key shots with high relevance scores are identified to summarize event-driven web videos by localizing the tags connected with the shots of each video. The relevance of the shots is determined according to the event query by connecting the localized tags with the query. The method is tested on a data set consisting of more than 10,000 web videos from 60 classes.

2.3 Basic concepts

2.3.1 Shot boundary detection

A shot is a temporally contiguous sequence of frames taken by a single camera. In video clips, a shot is therefore a manifestation of continuous action without a break in sequence. The consecutive frames within a shot bear a high correlation. A change in sequence is denoted by a transition from one shot to another. The type of transition is determined at the editing phase, when a video is produced from the clips recorded during shooting. During video analysis, these shots must be separated again by determining the shot boundaries. Shot boundaries are mainly classified into two broad categories, gradual transitions and abrupt transitions [4]. Shot transition detection is the most basic step for any video analysis application and plays a major part in video processing [96]. The main objective of shot detection is to automatically locate the transition point between two shots within a video. Detection of gradual transitions is a more challenging task than finding abrupt transitions. A lot of research has been done in this field, and several methods for the detection of both abrupt [97–99] and gradual transitions [100–103] have been developed by researchers.

An abrupt shot change or hard cut [104–106] occurs when there is an instantaneous transition from one shot to another. Abrupt shot changes account for 95% of all shot transitions [107]. In this type of transition, the last frame of the previous shot is simply followed by the first frame of the next shot, with no intervening frames or effects.

Gradual shot change [108] occurs slowly over a few frames as one shot ends and another begins. The frames involved in a gradual transition are a combination of the same number of frames taken from both shots. The manner of combination of the frames determines the type of gradual transition, i.e., dissolve, fade-in, fade-out or a wipe effect, as portrayed in Figure 2.2. In this type of transition, frames from two shots may be combined using chromatic, spatial or spatio-chromatic effects, which causes a slow transition from one shot to another. A dissolve transition is a type of gradual shot change that occurs between two shots. There is no clearly demarcated boundary between shots. Therefore, in this type of transition, the first shot fades out and the second shot fades in, such that frames that form a part of the dissolve sequence are a linear combination of frames taken from the two shots. As a dissolve sequence progresses, the amount of contribution from pixels of one shot decreases, while the contribution of pixel values from the frames of another shot increases. Hence, a dissolve sequence can be modelled as a combination of fade-in and fade-out [109]. During a dissolve sequence, an overlapping of two or more different scenes occurs. Hence, detecting a dissolve within a video has always been a major challenge for researchers. Another category of gradual transition is the fade. Fade-in [110, 111] occurs when a mono-colour screen gives way to a scene such that the image shown on the screen progressively clears out. On the other hand, fade-outs occur when the scene being shown gradually brightens or darkens, giving way to a totally black or white screen. A wipe is another type of gradual shot change in which a line moves across the screen and a new scene appears [98]. There are a few specific types of wipes such as the iris slow, star wipe, heart wipe, matrix wipe, clock wipe and others.
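
To make the linear-combination view of a dissolve concrete, the short sketch below synthesizes a dissolve between the last frame of one shot and the first frame of the next; the number of transition frames and the blending schedule are illustrative assumptions.

```python
import numpy as np

def synthesize_dissolve(frame_a, frame_b, num_frames=20):
    """Model a dissolve as a linear combination of two frames: the contribution
    of shot A decreases while the contribution of shot B increases over time."""
    transition = []
    for k in range(1, num_frames + 1):
        alpha = k / (num_frames + 1)  # blending weight grows as the dissolve progresses
        mixed = (1.0 - alpha) * frame_a.astype(np.float32) + alpha * frame_b.astype(np.float32)
        transition.append(mixed.astype(np.uint8))
    return transition
```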

Fig. 2.2: Types of shot transitions in a video

2.3.2 Correlation between image frames

The amount of similarity between a pair of frames can be measured by comparing the corresponding pixel values of the images. The concept of image correlation was derived from wave correlations between two electronic signals. A popular method of finding correlations is by Pearson’s method in which a deviation from the mean value of pixels is taken into consideration. A preprocessing stage involves the conversion of the 3D matrix of values corresponding to the RGB plane [112] to a 2D matrix containing the intensity values computed from the RGB values of each pixel. The equation for calculating the intensity value (I) is as follows:

I = \alpha R + \beta G + \gamma B \qquad (2.1)

In Equation (2.1), α = 0.2126, β = 0.7152 and γ = 0.0722. Here, I denotes a 2D matrix of values representing a greyscale image obtained from a 3D colour image. Pearson’s correlation coefficient (ρxy) or a product-moment correlation coefficient is a real number whose value lies in the interval [−1, 1]. For two given images represented by the matrices X and Y, the amount of similarity is computed using the formula

\rho_{xy} = \frac{\sum_{i,j} (x_{ij} - \bar{x})(y_{ij} - \bar{y})}{\sqrt{\sum_{i,j} (x_{ij} - \bar{x})^2 \, \sum_{i,j} (y_{ij} - \bar{y})^2}} \qquad (2.2)

Here xij and yij are the elements in the i-th row and j-th column of the matrices X and Y respectively, x̄ is the mean value of the elements in X, ȳ is the mean value of the elements of Y and n is the total number of elements in each matrix under consideration. The standard deviation of elements in both matrices should be finite and non-zero for the correlation to be defined. The correlation is 1 if there is a perfect match of the values and −1 in the case of a decreasing linear relationship. In other words, the value of ρ denotes the degree of match between the matrices [8].
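
A direct translation of Equations (2.1) and (2.2) into Python might look as follows; the frames are assumed to be 8-bit RGB arrays of identical size, and the coefficients are the values given above.

```python
import numpy as np

def to_greyscale(rgb):
    """Equation (2.1): intensity image from an RGB frame."""
    r = rgb[..., 0].astype(np.float64)
    g = rgb[..., 1].astype(np.float64)
    b = rgb[..., 2].astype(np.float64)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def pearson_correlation(x, y):
    """Equation (2.2): product-moment correlation between two equal-sized greyscale frames."""
    xd = x - x.mean()
    yd = y - y.mean()
    denom = np.sqrt((xd ** 2).sum() * (yd ** 2).sum())
    if denom == 0:  # undefined when either frame has zero variance
        raise ValueError("correlation is undefined for constant frames")
    return float((xd * yd).sum() / denom)
```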

2.3.3 Three-sigma rule

The standard deviation (σ) of a data set or probability distribution denotes the variation or deviation from the arithmetic mean (M) or expected value. The three-sigma rule in statistics is used to signify the range in which the values of a normal distribution are expected to be present. Figure 2.3 illustrates that 68.2% of the values in a normal distribution lie in the range [M − σ, M + σ], 95.4% of the values in [M − 2σ, M + 2σ] and 99.7% in the range [M − 3σ, M + 3σ]. Thus, this empirical rule may be used to establish a threshold for determining the similarity between two images. In this work, two images are considered to be visually similar if the distance (determined from the correlation) is less than M + 3σ, where M is the mean distance and σ is the standard deviation, computed from all the distances in the AMST.
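
Applied to the edge weights of the spanning tree, the rule reduces to a single threshold; the helper below is a small sketch that assumes the AMST edge distances are collected in an array.

```python
import numpy as np

def similarity_threshold(amst_distances):
    """Two frames are treated as visually similar when their distance is below
    M + 3*sigma, computed over all edge distances of the AMST (three-sigma rule)."""
    d = np.asarray(amst_distances, dtype=np.float64)
    return d.mean() + 3.0 * d.std()
```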

Fig. 2.3: Three-sigma rule

2.3.4 Features considered for identifying keyframes

Keyframes are a subset of the frames composing a video. These frames carry the bulk information that is crucial for summarizing a video. The keyframe set of a video is composed of reference points on the basis of which a video can be indexed. In multimedia production, a keyframe or a group of keyframes holds special information that defines a shot and can be used to discriminate between shots in a set. Often, the intermediate frames of a shot can be interpolated over time to create the illusion of motion. The features that can be extracted from a set of frames can be classified into colour-based, texture-based and shape-based attributes.

Colour-based features may consist, for example, of colour histograms, colour moments, colour correlograms or a mixture of Gaussian models. The image may be split into k × k blocks to capture local colour information [4].

The texture-based features of an image convey information about the spatial arrangement of pixel intensities and colour in images. The texture is basically a set of measures that allows for the identification of the perceived visual layers in an image. Gabor wavelet filters may be utilized to extract texture-based data for a video search engine [2, 113].

Shape-based features are those that describe object shapes in an image and can extract object contours or regions. To represent the spatial distribution of edges for searching videos in TRECVID 2005, an edge histogram descriptor was applied in [113].

2.3.5 Basics of interpolation

Interpolation is a mathematical tool for estimating or predicting unknown values that exist between some known values. Thus, using interpolation, new data points may be created that lie within the range of a discrete set of known data points. Given a set of known points, represented by the set X, and a corresponding set of values Y, where Y = f(X), the task is to predict y_i = f(x_i) for some new point x_i that lies between two known points, i.e., x_a < x_i < x_b, where x_a, x_b ∈ X.

Fig. 2.4: Linear interpolation

In the graph shown in Figure 2.4, a straight line passes through two known points ‘a’ and ‘b’ that are involved in estimating the unknown point ‘i’ that exists between these known points. The interpolated value of the point ‘i’ can be estimated easily since all three points lie in a straight line. However, a typical problem that arises in interpolation is the approximation of a complicated function using a simple function. A case may arise where the formula for some given function is known but is too complex to evaluate efficiently. In such a case, the known values from the original function can be used to create an interpolation based on a simple function. However, errors may be induced if the complex function is approximated or represented by such a simple function. In reality, many types of interpolations exist, for example, linear, polynomial, spline and multivariate. In this work, linear and polynomial interpolations are examined for approximating the distances to be computed to generate an AMST.
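
As a rough sketch of how an input percentage limits the number of exact computations, the function below evaluates only a fraction of the distances from a reference frame to the other frames of a shot and predicts the rest by linear interpolation over the frame index; the sampling pattern and the use of NumPy's np.interp are assumptions made for illustration rather than the exact scheme used to build the AMST in this chapter.

```python
import numpy as np

def approximate_distance_row(frames, ref_idx, distance_fn, percent=25):
    """Approximate the distances from frames[ref_idx] to every frame of the shot:
    compute `percent` per cent of them exactly and interpolate the remainder."""
    n = len(frames)
    k = max(2, int(round(n * percent / 100.0)))  # number of exact computations
    sample_idx = np.linspace(0, n - 1, k).round().astype(int)
    exact = np.array([distance_fn(frames[ref_idx], frames[j]) for j in sample_idx])
    # Predict the remaining distances by linear interpolation over the frame index.
    return np.interp(np.arange(n), sample_idx, exact)
```

With an input percentage of 100 every distance is computed exactly and the interpolation error vanishes, mirroring the observation made in the introduction.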

2.3.5.1 Linear interpolation

Linear interpolation is a method of curve fitting using polynomials of degree one in order to generate new data points within the range of a discrete set of known data points. The formula for a straight line can be represented as P(x) = a_0 + a_1 x. A different way to characterize the same line is

P(x) = \frac{x - x_1}{x_0 - x_1}\, y_0 + \frac{x - x_0}{x_1 - x_0}\, y_1 \qquad (2.3)
= \frac{(x_1 - x)\, y_0 + (x - x_0)\, y_1}{x_1 - x_0} \qquad (2.4)
= y_0 + \frac{x - x_0}{x_1 - x_0}\, [\, y_1 - y_0 \,] \qquad (2.5)
= y_0 + \left( \frac{y_1 - y_0}{x_1 - x_0} \right) (x - x_0) \qquad (2.6)

To elaborate the concept of linear interpolation, the following example is considered:

Tab. 2.1: Table for tan x

Here, the problem is to evaluate the value of tan(1.47). Considering linear interpolation, the points of interest are x0 = 1.3, x1 = 1.5, with corresponding values for y0 and y1. Then,

\tan x \approx y_0 + \frac{x - x_0}{x_1 - x_0}\, [\, y_1 - y_0 \,] \qquad (2.7)
\tan(1.47) \approx 3.6021 + \frac{1.47 - 1.3}{1.5 - 1.3}\, [\, 14.1014 - 3.6021 \,] = 12.5265

The true value of tan(1.47) is 9.8874, so linear interpolation produces an error of 26.69% in the computed value.
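
The worked example can be reproduced numerically; the snippet below applies NumPy's one-dimensional linear interpolation to the two table points quoted above.

```python
import numpy as np

x_known = [1.3, 1.5]          # x0 and x1 from the tan table
y_known = [3.6021, 14.1014]   # tan(1.3) and tan(1.5)

estimate = np.interp(1.47, x_known, y_known)
true_value = np.tan(1.47)

print(round(estimate, 4))                                        # 12.5265
print(round(abs(estimate - true_value) / true_value * 100, 2))   # 26.69 (relative error in %)
```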

2.3.5.2 Polynomial interpolation

Polynomial interpolation is a method that estimates the values between known data points by fitting a polynomial curve of degree n. The maximum degree of the fitted curve is dictated by the number of known points. The values of unknown points can be estimated using polynomial interpolation when these points lie between known points. A polynomial of degree n may be represented as

P(x) = a_n x^n + a_{n-1} x^{n-1} + \dots + a_2 x^2 + a_1 x + a_0 \qquad (2.8)

It can be seen from Equation (2.8) that the number of known points has to be (n + 1) in order to fit a polynomial of degree n. Considering the same problem of computing tan(1.47), a polynomial of degree 3 may be fitted, resulting in the following polynomial:

P(x) = a_3 x^3 + a_2 x^2 + a_1 x + a_0 \qquad (2.9)
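
For comparison with the linear case, the sketch below fits the cubic of Equation (2.9) through four sample points of tan x and evaluates it at x = 1.47; the sample abscissae are assumed values, since Table 2.1 is not reproduced here.

```python
import numpy as np

# Four sample points of tan x bracketing 1.47 (assumed table values).
x_known = np.array([1.0, 1.3, 1.4, 1.5])
y_known = np.tan(x_known)                  # 1.5574, 3.6021, 5.7979, 14.1014

coeffs = np.polyfit(x_known, y_known, 3)   # a3, a2, a1, a0 of Equation (2.9)
estimate = np.polyval(coeffs, 1.47)

print(round(estimate, 4))      # roughly 10.78, closer to the true value than the linear estimate
print(round(np.tan(1.47), 4))  # true value, 9.8874
```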