3

Computer Vision–Aided Video Coding

Manoranjan Paul and Weisi Lin

CONTENTS

3.1    Introduction

3.2    Dynamic Background Modeling

3.2.1    McFIS Generation

3.2.2    Object Detection Using McFIS

3.2.3    Scene Change Detection Using McFIS

3.3    Efficient Video Coding Techniques

3.3.1    Pattern-Based Video Coding

3.3.2    Optimal Compression Plane for Video Coding

3.4    McFIS in Video Coding

3.4.1    McFIS as a Better I-Frame

3.4.2    McFIS as a Reference Frame

3.4.2.1    PVC Using McFIS

3.4.2.2    OCP Using McFIS

3.4.2.3    Hierarchical Bipredictive Pictures

3.4.2.4    Computational Time Reduction in Video Coding Using McFIS

3.4.2.5    Consistency of Image Quality and Bits per Frame Using McFIS

3.4.2.6    R-D Performance Improvement

3.5    Conclusions

References

3.1    Introduction

The latest advanced video coding standard, H.264/AVC [1], improves rate-distortion performance significantly compared to its predecessors (e.g., H.263) and competitors (e.g., MPEG-2/4) by introducing a number of innovative ideas in intra- and inter-frame coding [2,3]. Major performance improvements come from motion estimation (ME) and motion compensation (MC) using variable block sizes, sub-pixel search, and multiple reference frames (MRFs) [4,5,6,7,8]. It has been demonstrated that MRFs facilitate better prediction than a single reference frame for video with repetitive motion, uncovered background, noninteger pixel displacement, lighting change, etc. The requirements of index codes (to identify the particular reference frame used), computational time in ME and MC (which increases almost linearly with the number of reference frames), and memory buffer size (to store decoded frames in both encoder and decoder for referencing) limit the number of reference frames used in practical applications. The optimal number of MRFs depends on the content of the video sequence. Typically, the number of reference frames varies from one to five. If the cycle of repetitive motion, exposure of uncovered background, noninteger pixel displacement, or lighting change exceeds the number of reference frames used in the MRF coding system, there will not be any improvement, and the related computation (mainly that of ME) and the bits for index codes are wasted.

To tackle the major computational problem of MRFs, a number of techniques [5,6,7,8] have been developed to reduce the associated computation. Huang et al. [5] searched either the previous frame only or every reference frame, based on the results of intra-prediction and ME from the previous frame. This approach can reduce computational complexity by 76%–96% by avoiding unnecessary searches of reference frames. Moreover, this approach is orthogonal to conventional fast block matching algorithms, and the two can easily be combined for an even more efficient implementation. Shen et al. [6] proposed an adaptive and fast MRF selection algorithm based on the hypothesis that homogeneous areas of a video sequence probably belong to the same video object, move together, and thus have the same optimal reference frame. Simulation results show that this algorithm reduces ME computation time by 56%–74%. Kuo et al. [7] proposed a fast MRF selection algorithm based on initial search results using an 8 × 8-pixel block. Hachicha et al. [8] used a Markov random field algorithm relying on robust moving-pixel segmentation and saved 35% of coding time by reducing the number of reference frames from five to three without image quality loss.

Most fast MRF selection algorithms, including the aforementioned techniques, use one reference frame (in the best case) when their assumptions about the correlation in the MRF selection procedure are satisfied, or five reference frames (in the worst case) when their assumptions fail completely. Obviously, in terms of rate-distortion performance, these techniques cannot outperform the H.264 with five reference frames, which is considered optimal. Moreover, due to the limited number of reference frames (the maximum is five in practical implementations), uncovered background may not be encoded efficiently using the existing techniques. Some algorithms [9,10,11] determined and exploited uncovered background using preprocessing and/or postprocessing and computationally expensive video segmentation for coding. Uncovered background can also be encoded efficiently using sprite coding through object segmentation. However, most video coding applications cannot tolerate inaccurate video/object segmentation or the expensive computation incurred by segmentation algorithms. Ding et al. [12] used a background frame for video coding. The background frame is made up of blocks that remain unchanged (based on zero motion vectors) over a certain number of consecutive frames. Due to its dependency on block-based motion vectors and its lack of adaptability to multimodal backgrounds in dynamic environments, this background frame does not perform well.

Recently in the computer vision field, a number of dynamic background generation algorithms based on dynamic background modeling (DBM) [13,14,15,16] using the Gaussian mixture model (GMM) have been introduced for robust, real-time object detection in dynamic environments where a ground-truth background is unavailable. A static background model does not remain valid due to illumination variation over time, intentional or unintentional camera displacement, shadow/reflection of foreground objects, and intrinsic background motion (e.g., waving tree leaves, clouds) [15]. Objects can be detected more accurately by subtracting the background frame (generated from the background model) from the current frame. In the approach described in this chapter, a dynamic frame known as the most common frame in a scene (McFIS) is first generated from video frames using DBM and is then used to improve video coding efficiency. A McFIS can be effective in the following ways:

•  A McFIS can be used in video compression as a reference frame for static and uncovered areas because of its capability to capture a whole cycle of repetitive motion, uncovered background exposure, noninteger pixel displacement, or lighting change.

•  If a McFIS is generated from encoded frames, it intrinsically has better error recovery capacity for transmission over error-prone channels, as the McFIS model already contains the pixel intensity history of previously encoded frames.

•  A simple mechanism for adaptive group of pictures (AGOP) determination and scene change detection (SCD) is possible using the McFIS, as it represents the common background of a scene. Thus, SCD and AGOP determination by comparing the difference between the McFIS and the current frame is more effective. In fact, SCD (and therefore AGOP determination) is integrated with reference frame generation.

•  A computationally efficient video encoder can be developed using the McFIS as a long-term reference frame instead of multiple reference frames while achieving the same or better rate-distortion performance. Computational time can be reduced further by using a small search length for ME and MC when the McFIS is referenced for static areas that have little or no motion.

•  In video coding, an intra (I-) frame is used as an anchor frame for referencing subsequent frames, as well as for error propagation prevention, indexing, etc. For better rate-distortion performance, an I-frame should have the best similarity with the frames in its group of pictures (GOP), so that when it is used as a reference frame for a frame in the GOP, the fewest bits are needed to achieve the desired image quality, the temporal fluctuation of quality is minimized, and the bit count per frame is more consistent. The McFIS has these qualities.

The rest of the chapter is organized as follows. Section 3.2 describes the dynamic background modeling with McFIS generation strategies. Section 3.3 describes two advanced video coding techniques. Section 3.4 explains a number of coding strategies where McFIS can be used to improve rate-distortion efficiency, while Section 3.5 concludes the chapter.

3.2    Dynamic Background Modeling

Moving foreground consisting of people, vehicles, or other active objects is at the center of all interactions and events in the real world. From a coding point of view, active objects consume most of the encoding bits and are the focal point of quality judgments. Encoding the moving entities is thus the primary focus of any video coding technology aiming to construct high-quality video within constrained bit rates. Foreground detection is usually performed by maintaining a model of the scene background, which is subtracted from each incoming video frame; this is known as background subtraction. The simplest background model is a static background image captured without any moving entity [13]. However, this simple model is not suitable for real-world scenarios, as the background can change over time due to illumination variation, camera jitter, intrinsic background motion (e.g., waving tree leaves, clouds), and shadow/reflection. To deal with these challenges, adaptive background models, capable of updating themselves over time to reflect changing circumstances, have been proposed in [13,14]. Apart from the aforementioned challenges, a foreground detection technique is also expected to meet some operational requirements for video coding applications, such as (1) low complexity, (2) high responsiveness, (3) high stability, and (4) low specificity (environment invariance).

In the aforementioned GMM-based models, background and foreground are determined using different techniques. Stauffer and Grimson [13] used a user-defined threshold based on the background-to-foreground ratio. A predefined threshold does not perform well in object/background detection because the ratio of background to foreground varies from video to video. Lee [14] used two parameters of a sigmoid function (instead of the threshold used in Ref. [13]) by modeling the posterior probability of a Gaussian being background. This method also depends on the proportion of time for which a pixel is expected to be observed as background. Moreover, the generated background has a delayed response because the weighted mean of all the background models is used.

3.2.1    McFIS Generation

Traditional dynamic background modeling techniques [13,14] primarily focus on object detection; thus, they are less concerned with real-time processing and with rate-distortion optimization in video compression. Moreover, the generated background has a delayed response because the weighted mean of all the background models is used. To avoid the mean effect (the mean is an artificially generated value that can be far from the original value) and the delayed response, Haque et al. [15] used a parameter called recentVal, m, to store the recent pixel intensity value when a pixel satisfies a model in the Gaussian mixture. They used the classical background subtraction method, which identifies an object if the current intensity differs from the m of the best background model by more than a well-studied threshold. This method reduces not only the delayed response but also the learning rates, which are sometimes desirable properties for real-time object detection.

Generally, each pixel position of a scene is modeled independently by a mixture of K Gaussian distributions [13,14,15] in DBM techniques. A pixel position may be occupied by different objects and backgrounds in different frames. Each Gaussian model represents the intensity distribution of one of the different components, for example, objects, background, shadow, illumination, surround changes (like clouds in an outdoor scene), etc., observed by the pixel position in different frames. A Gaussian model is represented by the recent pixel intensity, mean of pixel intensity, pixel intensity variance, and weight (i.e., how many times this model is satisfied by the incoming pixel intensity). The system starts with an empty set of models and initial parameters. If the maximum number of models allowable for a pixel is three, we can get a maximum of 3 × H × W models for each video scene, where H × W is the resolution of a frame.

For a pixel, the McFIS generation algorithm is shown in Figure 3.1; it takes the recent pixel intensity (i.e., the pixel intensity at the current time, $X_t$) and the existing models for that pixel position (if any) as inputs and returns the background pixel intensity (i.e., the McFIS) and the background model. Every new observation $X_t$ at the current time $t$ is first matched against the existing models in order to find one such that the difference between the newly arrived pixel intensity and the mean of the model is within 2.5 times the standard deviation (STD) of that model. If such a model exists, its associated parameters are updated with a learning rate parameter, and the recent pixel intensity of the model is replaced by the newly arrived pixel intensity. If such a model does not exist, a new Gaussian is introduced with the new intensity as its mean (μ) and recent value (γ), a high STD (σ), and a low weight (ω), and the least probable model is evicted. The least probable model is the one with the lowest value of ω/σ among the models. The initial parameters in this implementation are fixed as follows: maximum number of models per pixel K = 3, learning rate α = 0.1, initial weight ω = 0.001, and initial STD σ = 30.

Image

FIGURE 3.1
Pseudocode for McFIS generation algorithm.

Assume that the $k$th Gaussian at time $t$ represents a pixel intensity with mean $\mu_{k,t}$, STD $\sigma_{k,t}$, recent value $\gamma_{k,t}$, and weight $\omega_{k,t}$ such that $\sum_k \omega_{k,t} = 1$. The learning parameter $\alpha$ is used to balance the contribution between the current and past values of parameters such as the weight, STD, mean, etc. After initialization, for every new observation $X_t$ at the current time $t$, it is first matched against the existing models in order to find one (e.g., the $k$th model) such that $|X_t - \mu_{k,t-1}| \le 2.5\,\sigma_{k,t-1}$. If such a model exists, the corresponding recent value parameter $\gamma_{k,t}$ is updated with $X_t$. The other associated parameters are updated with the learning rate as follows:

$\mu_{k,t} = (1 - \alpha)\,\mu_{k,t-1} + \alpha X_t$

(3.1)

$\sigma_{k,t}^2 = (1 - \alpha)\,\sigma_{k,t-1}^2 + \alpha (X_t - \mu_{k,t})^2$

(3.2)

$\omega_{k,t} = (1 - \alpha)\,\omega_{k,t-1} + \alpha$

(3.3a)

and the weights of the remaining Gaussians (i.e., the $l$th models, where $l \neq k$) are updated as

$\omega_{l,t} = (1 - \alpha)\,\omega_{l,t-1}$

(3.3b)

The weights are then renormalized. If such a model does not exist, a new model is introduced with $\gamma = \mu = X_t$, $\sigma = 30$, and $\omega = 0.001$ by evicting the $K$th model (i.e., the third model when ranked by $\omega/\sigma$ in descending order) if it exists.

To get the background pixel intensity from the DBM technique for a particular pixel, one can take the mean value of the background model that has the highest value of ω/σ. In this way, a background frame (comprising background pixels) can be formed as the McFIS. One example of a McFIS is shown in Figure 3.2, generated using the first 50 original frames of the Silent video sequence. Figure 3.2a shows the 50th frame of the video, and Figure 3.2b shows the McFIS. The circle in Figure 3.2b indicates the uncovered background captured by the McFIS. Capturing the uncovered background with any single frame is impossible unless the uncovered background is visible in that frame.
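The per-pixel model maintenance and background extraction described above can be summarized in a short sketch. The following Python fragment is a minimal, single-channel illustration rather than the authors' implementation; the class and function names are hypothetical, and the constants follow the values quoted earlier (K = 3, α = 0.1, initial ω = 0.001, initial σ = 30).

```python
from dataclasses import dataclass, field

K, ALPHA, INIT_WEIGHT, INIT_STD = 3, 0.1, 0.001, 30.0   # values quoted in the text

@dataclass
class Gaussian:
    mean: float      # mu
    var: float       # sigma squared
    weight: float    # omega
    recent: float    # gamma (recentVal)

@dataclass
class PixelModel:
    models: list = field(default_factory=list)

    def update(self, x):
        """Update the mixture for a new observation x (Equations 3.1 through 3.3b)."""
        for g in self.models:
            if abs(x - g.mean) <= 2.5 * g.var ** 0.5:                    # match within 2.5 STD
                g.recent = x                                             # replace recent value with X_t
                g.mean = (1 - ALPHA) * g.mean + ALPHA * x                # Eq. (3.1)
                g.var = (1 - ALPHA) * g.var + ALPHA * (x - g.mean) ** 2  # Eq. (3.2)
                g.weight = (1 - ALPHA) * g.weight + ALPHA                # Eq. (3.3a)
                for other in self.models:
                    if other is not g:
                        other.weight *= (1 - ALPHA)                      # Eq. (3.3b)
                total = sum(m.weight for m in self.models)
                for m in self.models:                                    # renormalize the weights
                    m.weight /= total
                return
        # No match: introduce a new Gaussian, evicting the least probable model if K already exist.
        if len(self.models) == K:
            self.models.remove(min(self.models, key=lambda m: m.weight / m.var ** 0.5))
        self.models.append(Gaussian(mean=x, var=INIT_STD ** 2, weight=INIT_WEIGHT, recent=x))

    def background_value(self, use_recent=False):
        """McFIS pixel value: taken from the model with the highest omega/sigma."""
        best = max(self.models, key=lambda m: m.weight / m.var ** 0.5)
        return best.recent if use_recent else best.mean
```

A McFIS is then assembled by calling background_value for every pixel position of the frame; the use_recent flag mirrors the choice between the mean and the recent intensity discussed next.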

Paul et al. [16] observed that the mean and the recent intensity value are two extreme choices for generating the true background intensity (i.e., the McFIS) for better video coding when encoded video frames are used in background generation. The mean is too generalized over time, while the recent intensity value is biased toward only the latest observation. Thus, a weighting between the mean and the recent intensity is a good compromise. It also reduces the delayed response (due to the mean) and speeds up the learning (due to the recent pixel intensity). Note that normally three models are used per pixel in DBM, as one represents foreground objects, another represents the static background, and the third represents transitional object/background states. Paul et al. [17] also observed that the recent pixel intensity can be effective in generating the McFIS when original frames (i.e., without any distortion due to quantization) are used.

Image

FIGURE 3.2
Examples of McFIS and uncovered background (inside the circle) using Silent video sequence: (a) original 50th frame and (b) corresponding McFIS using first 50 frames of Silent video sequence.

Image

FIGURE 3.3
Visual comparison results at a fast learning rate (α = 0.1) using three techniques with the test sequences: (a) PETS2000; (b) PETS2006-S7-T6-B1; (c) Moved Object; and (d) Waving Trees. (From Haque, M. et al., Improved Gaussian mixtures for robust object detection by adaptive multibackground generation, in IEEE International Conference on Pattern Recognition, 2008, pp. 1–4.)

3.2.2    Object Detection Using McFIS

Visual comparison results using various methods [13,14,15] for object detection are presented in Figure 3.3. Standard test video sequences such as PETS2000, PETS2006-S7-T6-B1, Moved Object, and Waving Trees are used for comparison. The results show that the background generation method using the recent value [15] outperforms the other methods, producing cleaner object detection at a faster rate.

3.2.3    Scene Change Detection Using McFIS

A number of algorithms [4,18,19,20,21] have been proposed in the literature for AGOP and SCD. Dimou et al. [18] used a dynamic threshold based on the mean and standard deviation of the previous frames for SCD; their reported accuracy is 94% on average. Alfonso et al. [19] used ME and MC to find scene changes; to avoid repetitive scene changes, they imposed a lower limit of four frames between scene changes. The success rate of this method is 96%, with 7.5%–15% more compression and 0.2 dB quality loss. Matsuoka et al. [20] proposed a combined SCD and AGOP method based on fixed thresholds applied to the accumulated difference of luminance pixel components. They used the number of intensive pixels (NIP) to characterize a frame: a pixel is considered intensive if its luminance difference between adjacent frames is more than 100. If the NIP between two frames exceeds a predefined threshold, an I-frame is inserted at that position assuming an SCD has occurred; otherwise the GOP size is restricted to either 8 or 32 based on the NIP and another threshold. Song et al. [21] proposed another scene change detection method focusing on the hierarchical B-picture structure [22]. Recently, Ding et al. [4] combined AGOP and SCD for better coding efficiency based on motion vectors and the sum of absolute transformed differences (SATD) using 4 × 4-pixel blocks. This method achieved 98% accuracy in SCD with 0.63 dB image quality improvement.

Most of the existing methods use metrics computed from already processed frames and the current frame. The McFIS is the frame most similar to the stable portion of the scene (mainly the background) compared to any individual frame in the scene. Thus, SCD can be determined by a simple metric computed from the McFIS and the current frame. For SCD using the McFIS, Paul et al. [16] randomly selected 50% of the pixel positions of a frame and determined the sum of absolute differences (SAD) between the McFIS and the current frame. If the SAD for the current frame is more than 1.7 times that of the previous frame, an SCD is declared and an I-frame is inserted; otherwise inter-coding continues (with no other AGOP decision). This is effective compared to the existing algorithms because the McFIS is equivalent to a group of already processed frames. Moreover, a scene change means a change of the background of the video sequence; as the McFIS carries the history of the scene, a rigorous process (like Ding's algorithm) is not needed for SCD.
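A minimal sketch of this test (not the authors' implementation) is given below; the 50% sampling of pixel positions and the 1.7 ratio follow the description above, while the function name and interface are assumptions made for illustration.

```python
import numpy as np

def scd_test(curr_frame, mcfis, prev_sad, ratio_thresh=1.7, sample_frac=0.5, rng=None):
    """Return (scene_changed, sad): SAD between the current frame and the McFIS over a
    random ~50% of pixel positions, compared against the previous frame's SAD."""
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(curr_frame.shape) < sample_frac        # random 50% of pixel positions
    sad = int(np.abs(curr_frame.astype(np.int32) - mcfis.astype(np.int32))[mask].sum())
    scene_changed = prev_sad is not None and sad > ratio_thresh * prev_sad
    return scene_changed, sad                                # caller keeps sad as prev_sad
```

When the test fires, an I-frame (a new McFIS) is inserted; otherwise inter-coding simply continues.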

To see the effectiveness of the McFIS in SCD, a mixed video sequence of 700 frames comprising 11 different standard video sequences has been created. The Mixed A video sequence comprises the first 50 or 100 frames of the specified QCIF videos, as shown in Table 3.1. From the table, it is clear that the mixed sequence contains 10 scene changes, at frames 101, 151, 201, 251, 351, 401, 501, 551, 601, and 651. The SCD technique using the McFIS is compared against two recent and effective AGOP and SCD algorithms [4,20] for efficient video coding.

TABLE 3.1

Mixed Video Sequences for SCD and AGOP

Mixed A (QCIF)    Frames    Frames in Mixed Sequence
Akiyo               100      1–100
Miss America         50      101–150
Claire               50      151–200
Car phone            50      201–250
Hall Monitor        100      251–350
News                 50      351–400
Salesman            100      401–500
Grandma              50      501–550
Mother               50      551–600
Suzie                50      601–650
Foreman              50      651–700

Image

FIGURE 3.4
Scene change detection using the McFIS-based, Ding's, and Matsuoka's methods for the mixed video comprising 11 QCIF video sequences: (a) SCD using the McFIS; (b) Ding's method; and (c) Matsuoka's method.

Figure 3.4 shows the SCD results of the McFIS-based, Ding's, and Matsuoka's approaches using three QPs (40, 28, and 20) for the Mixed video sequence. The SAD ratio (Figure 3.4a), SATD ratio (Figure 3.4b), and NIP (Figure 3.4c) are plotted for the McFIS-based, Ding's, and Matsuoka's algorithms, respectively. As mentioned earlier, for the McFIS an SCD occurs if the SAD ratio is above 1.7 (i.e., the SAD for the current frame is 70% greater than that of the previous frame). For Ding's algorithm an SCD occurs if the SATD ratio is more than 1.7 [4], and for Matsuoka's algorithm an SCD occurs if the NIP is more than 1000 [20] for QCIF sequences. The figure shows that at each of the SCD positions (i.e., frames 101, 151, 201, 251, 351, 401, 501, 551, 601, and 651), the McFIS-based technique and Ding's method successfully detect all scene changes. On the other hand, Matsuoka's method successfully detects all scene changes except at the 501st frame for QP = 40 and 28, due to the similarity in background between the Salesman and Grandma video sequences.

3.3    Efficient Video Coding Techniques

H.264 introduces variable block-size ME and MC, where a 16 × 16-pixel macroblock (MB) is partitioned into several smaller rectangular or square blocks. ME and MC are carried out for all possible combinations, and the ultimate block size is selected based on the Lagrangian optimization [23,24] using the bits and distortions of the corresponding blocks. Real-world objects, by nature, may have arbitrary shapes; thus, ME and MC using only rectangular/square blocks may only roughly approximate the real shape, and the coding gain may not be satisfactory. A number of research works use nonrectangular block partitioning [25,26,27], called geometric shape partitioning, motion-based implicit block partitioning, and L-shaped partitioning. The excessively high computational complexity of the segmentation process and the marginal improvement over H.264 make them less attractive for real-time applications. Moreover, the requirement of extra bits for encoding areas covering an almost static background makes the aforementioned algorithms inefficient in terms of rate-distortion performance [3]. Exploiting nonrectangular block partitioning and partial block skipping, an advanced video coding technique named pattern-based video coding (PVC) [3,28,29] outperforms the H.264. Details of the PVC techniques are discussed in the following subsection.

H.264 has one thing in common with all other video coding schemes: a video is encoded one image frame at a time, where each frame is formed in the spatial domain and carries physical meaning (e.g., a natural scene). This is reasonable since video is usually first (before coding) captured by a sensor (e.g., a camera) and finally (after decoding) displayed one image frame at a time. However, in terms of data structure, video is nothing more than a three-dimensional (3D) data matrix, in which the distinction between X, Y, and T is not essential, where X and Y are the spatial directions and T is the temporal direction of the video. One coding paradigm built on this view is 3D transform-based video coding [30,31]; however, this paradigm is not compatible with H.264 and therefore has not been adopted in existing video coding systems. Recently, Liu et al. [32] proposed a new framework of video coding that is H.264-compatible and takes advantage of both H.264 and 3D transform-based coding schemes. As with 3D transform-based video coding, they ignore the physical meaning of the X, Y, and T axes rather than explicitly treating the T-axis as the temporal axis, as H.264 does. The details of the method are discussed in Section 3.3.2.

3.3.1    Pattern-Based Video Coding

To exploit nonrectangular block partitioning and partial block skipping for the static background area in an MB, the PVC schemes [3,28,29] partition the MBs via a simplified segmentation process that avoids handling the exact shape of the moving objects, so that the popular MB-based ME can still be applied. The PVC algorithm focuses on the moving regions (MRs) of the MBs through the use of a set of 32 regular 64-pixel pattern templates (Figure 3.5). The pattern templates are designed with "1"s in 64 pixel positions and "0"s in the remaining 192 pixel positions of a 16 × 16-pixel MB. The MR of an MB is defined as the collection of pixel positions where the current MB differs from the colocated MB of the reference frame. Using a similarity measure, if the MR of an MB is found to be well covered by a particular pattern, the MB is classified as a region-active MB and coded by considering only the 64 pixels of the pattern, with the remaining 192 pixels skipped as static background. Embedding PVC in the H.264 standard as an extra mode provides higher compression, as a larger static background segment is coded with the partial skip mode.
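As a rough sketch of the classification step (not the authors' implementation), the following fragment derives the MR mask of an MB and picks the best-covering pattern from a codebook of 16 × 16 binary masks; the difference threshold and the coverage criterion are illustrative stand-ins for the similarity measures used in the PVC schemes [3,28,29].

```python
import numpy as np

def moving_region(curr_mb, ref_mb, diff_thresh=2):
    """Binary MR mask: pixel positions where the current MB differs from its reference MB."""
    return np.abs(curr_mb.astype(np.int32) - ref_mb.astype(np.int32)) > diff_thresh

def best_pattern(mr, codebook, coverage_thresh=0.9):
    """Return (index, pattern) of the codebook entry that covers the MR best, or None if the
    MB should not be coded as region-active (coverage below the threshold)."""
    mr_count = int(mr.sum())
    if mr_count == 0:
        return None                                    # static MB, candidate for the skip mode
    scores = [np.logical_and(mr, p).sum() / mr_count for p in codebook]
    idx = int(np.argmax(scores))
    return (idx, codebook[idx]) if scores[idx] >= coverage_thresh else None
```

A region-active MB is then motion-compensated using only the 64 pattern pixels, with the remaining 192 pixels skipped as static background.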

Image

FIGURE 3.5
The pattern codebook of 32 regular-shaped, 64-pixel patterns, defined in 16 × 16 blocks, where the white region represents “1” (motion) and the black region represents “0” (no motion). (From Paul, M. et al., IEEE Trans. Circ. Syst. Video Technol., 15(6), 753, 2005.)

3.3.2    Optimal Compression Plane for Video Coding

In the optimal compression plane (OCP) scheme, the frames are allowed to be formed in a non-XY plane. The OCP determination process is used as a preprocessing step prior to any standard video coding scheme. The essence of the scheme is to form the frames in the plane defined by the two axes (among X, Y, and T) selected according to a signal correlation evaluation, which enables better prediction (and therefore better compression). This preprocessing approach differs from the existing paradigms by exploiting information redundancy to a fuller extent. Rather than explicitly distinguishing the T-axis as a temporal axis, the OCP scheme ignores the physical meaning of the X, Y, and T axes (somewhat similar to the 3D transform) and focuses on the amount of video redundancy along each axis (more specifically, on the correlation coefficient [CC] along each axis). The OCP-based video coding framework consists of selection of an appropriate preprocessing unit (PPU; its size, i.e., the number of frames in the XY plane, is denoted as N), an adaptive OCP decision, and video coding with a standard compression method. In each PPU, it is possible to form frames in the XY, TX, or TY plane. It is obvious that 2 bits of overhead are needed to represent the three possible coding planes for each PPU, and this bit overhead is included in the rate calculations reported here.
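As a rough illustration of the plane-selection idea (not the exact criterion of Ref. [32]), the sketch below measures the average correlation between adjacent slices of a PPU along each axis and then forms the frames so that prediction runs along the most correlated axis; the array layout and the decision rule are assumptions made for this example.

```python
import numpy as np

def axis_correlation(ppu, axis):
    """Average Pearson correlation coefficient between adjacent slices of the PPU along one axis."""
    slices = np.moveaxis(ppu, axis, 0).astype(np.float64)
    ccs = [np.corrcoef(a.ravel(), b.ravel())[0, 1] for a, b in zip(slices[:-1], slices[1:])]
    return float(np.mean(ccs))

def choose_plane(ppu):
    """ppu: array of shape (T, Y, X). Returns the plane label and the frame stack for that plane."""
    cc = {"T": axis_correlation(ppu, 0), "Y": axis_correlation(ppu, 1), "X": axis_correlation(ppu, 2)}
    pred_axis = max(cc, key=cc.get)      # assumed rule: predict along the most correlated axis
    if pred_axis == "T":
        return "XY", ppu                               # usual XY frames, predicted along T
    if pred_axis == "Y":
        return "TX", np.transpose(ppu, (1, 0, 2))      # TX-plane frames, stacked along Y
    return "TY", np.transpose(ppu, (2, 0, 1))          # TY-plane frames, stacked along X
```

The returned frame stack is then fed to the standard encoder, and the 2-bit plane label is signaled for the PPU.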

Image

FIGURE 3.6
Block diagram of the proposed scheme (illustrated with XY and non-XY frames of “Mobile” video sequence for better visual impression). (From Liu, A. et al., IEEE Trans. Image Process., 20(10), 2788, 2011.)

The key observation behind the OCP framework is that the sampling process remains as usual, but the physical meaning of the video is ignored during coding for better coder performance. Video is usually regarded as a succession of natural image frames, formed in the spatial domain and with clear physical meaning (e.g., a natural scene). However, in terms of data structure, video is nothing more than a 3D data matrix, and the distinction among X (one spatial dimension), Y (the other spatial dimension), and T (the temporal dimension) is not strictly necessary from the viewpoint of compression. Adaptive plane selection according to the video content therefore makes sense, and the coding system benefits from it. A block diagram of the OCP scheme is shown in Figure 3.6.

3.4    McFIS in Video Coding

A computer vision tool such as dynamic background modeling, and the McFIS it produces, can be used for several purposes in video coding. The following sections explore these applications.

3.4.1    McFIS as a Better I-Frame

H.264, like other modern standards, uses intra (I-) and inter (predicted [P]- and bidirectional [B]-) frames for improved video coding. An I-frame is encoded using only its own information and thus can be used for error propagation prevention, fast backward/forward play, random access, indexing, etc. A P- or B-frame, on the other hand, is coded with the help of previously encoded I- or P-frame(s) for efficient coding. In the H.264 standard, frames are coded as a GOP comprising one I-frame followed by inter-frames. Fewer I-frames are used than inter-frames because an I-frame typically requires several times more bits than its inter-coded counterpart for the same image quality. An I-frame is used as an anchor frame for referencing the subsequent inter-frames of a GOP, directly or indirectly. Thus, the encoding error (due to quantization) of an I-frame is propagated and accumulated toward the end of the GOP. As a result, the image quality degrades and the bit requirement increases toward the end of the GOP. When another I-frame is inserted for the next GOP, better image quality is recovered (at the cost of more bits), and then quality again degrades toward the end of that GOP. As a result, the farther an inter-frame is from the I-frame, the lower its quality becomes. This fluctuation of image quality (or bits per frame) is undesirable for perceptual quality (or bit rate control). By selecting the first frame as an I-frame without verifying its suitability, we sacrifice (1) overall rate-distortion performance because of poor I-frame selection and (2) perceptual image quality by introducing quality fluctuation. Being the first frame of a GOP does not automatically make a frame the best I-frame. An ideal I-frame should have the best similarity with the frames in its GOP, so that when it is used as a reference frame for the inter-frames of the GOP, fewer bits are needed to achieve the desired image quality, giving better rate-distortion performance and perceptual image quality.

A McFIS can be generated using DBM from the first several original frames of a scene and encoded as an I-frame with finer quantization. All frames of the scene are then coded as inter-frames using two reference frames: the immediate previous frame and the McFIS, on the assumption that the moving regions of the current frame will be referenced from the immediate previous frame and the background regions from the McFIS. As all frames are coded as inter-frames with direct referencing from the McFIS, this provides less fluctuation in PSNR and bit count over the entire scene. The McFIS has higher similarity to all the frames of the scene and thus can be a better I-frame. The current McFIS continues to be used as the second reference frame unless an SCD occurs. If an SCD occurs, a new McFIS is generated from the first several frames of the new scene and encoded as an I-frame, and all frames of the new scene are again encoded as inter-frames until the next SCD. A joint SCD and AGOP technique can be developed to keep the McFIS relevant for referencing the inter-frames of each new scene [17].
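A minimal sketch of this control flow is given below, assuming hypothetical hooks (build_mcfis, encode_i_frame, encode_p_frame, scd_test) that stand in for the DBM of Section 3.2 and the standard H.264 coding operations; the number of frames used to build each McFIS is also an assumption.

```python
def encode_video(frames, build_mcfis, encode_i_frame, encode_p_frame, scd_test, n_init=25):
    """Sketch of McFIS-based coding: one McFIS per scene, regenerated when a scene change occurs."""
    i = 0
    while i < len(frames):
        scene_start = i
        mcfis = build_mcfis(frames[scene_start:scene_start + n_init])  # DBM over first frames of the scene
        encode_i_frame(mcfis, fine_quantization=True)                  # the McFIS is the scene's only I-frame
        prev = None
        while i < len(frames):
            frame = frames[i]
            if prev is not None and scd_test(frame, mcfis):
                break                                # scene change: start a new scene with a new McFIS
            refs = (prev, mcfis) if prev is not None else (mcfis,)
            encode_p_frame(frame, refs=refs)         # every frame is inter-coded; McFIS is the 2nd reference
            prev = frame
            i += 1
```

Moving regions tend to be referenced from the previous frame and background regions from the McFIS, with the Lagrangian mode decision choosing between them per block.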

Image

FIGURE 3.7
Effectiveness of the McFIS as an I-frame compared to the first frame: (a) mean square error of the first frame and the McFIS against the remaining 100 frames, as an indication of dissimilarity; (b) percentages of background captured by the first frame and the McFIS over the remaining 100 frames (frames 2 to 101). (From Paul, M. et al., IEEE Trans. Circ. Syst. Video Technol., 21(9), 1242, 2011.)

Figure 3.7 presents two pieces of evidence demonstrating the effectiveness of the McFIS compared to the first frame as an I-frame. As mentioned earlier, an I-frame should have high similarity with the rest of the frames. To check this, the mean square error (MSE) of each frame in a video sequence is calculated with respect to the first frame and to the McFIS, respectively; a higher MSE indicates greater dissimilarity. Figure 3.7a shows the average MSE over the first 100 frames of eight video sequences, namely, Hall Monitor, News, Salesman, Silent, Paris, Bridge close, Susie, and Popple. The figure shows that the McFIS yields lower MSE than the first frame, indicating that the McFIS is more similar to the rest of the frames. As a result, fewer bits are needed and better quality is achieved if the McFIS (instead of the first frame) is used as the I-frame and direct reference frame.

From another angle, Figure 3.7b also demonstrates the effectiveness of the McFIS for improving coding performance compared to the first frame as an I-frame. The subfigure shows the average percentage of "background" for those video sequences using the McFIS and the first frame, respectively. A pixel is defined as a background pixel if it differs by no more than one level (on the 0–255 scale) from the colocated pixel in the McFIS (or first frame). The subfigure shows that more pixels qualify as background with the McFIS than with the first frame. This confirms that the McFIS represents more background regions by capturing the most common features of the video. This leads to more referencing from the McFIS for uncovered/normal background areas, improving video coding performance. Note that there is a dip and a peak with the McFIS at the 25th frame in the two subfigures of Figure 3.7, respectively. These are due to the McFIS being most similar to the 25th frame, as the McFIS is generated using the first 25 frames and the latest (i.e., the 25th) frame has the highest impact (through the weight and recent pixel intensity) on the McFIS generation.

3.4.2    McFIS as a Reference Frame

The McFIS can be used as an extra reference frame for referencing static and uncovered background areas in different video coding schemes. The following sections discuss the application of the McFIS in different coding settings such as pattern-based video coding, the optimal compression plane, hierarchical bipredictive pictures, computational time reduction, and bit rate and PSNR fluctuation reduction.

3.4.2.1    PVC Using McFIS

For pattern matching in the PVC approach, an MR needs to be generated for the current MB. The MR generated from the difference between the current MB and the colocated MB of a traditional reference frame (i.e., the immediate previous frame or any previously encoded frame) may contain both the moving object and uncovered background (UCB) (Figure 3.8). ME and MC using a pattern-covered MR would not be accurate for the UCB if there is no similar region in the reference frames, so no coding gain can be achieved for the UCB using the PVC. Similar issues occur for the other H.264 variable-size block modes due to the lack of a suitable matching region in the reference frames. Thus, a reference frame is needed in which the UCB of the current MB can be found if that region has previously been visible. Only the true background of a scene is the best choice of reference frame for the UCB. Moreover, an MR generated from the true background against the current frame represents only the moving object rather than both the moving object and the UCB. Thus, the best matched pattern selected against the newly generated MR is the best approximation of the object/partial object in an MB. ME and MC using the best matched pattern carried out on the immediate previous frame provide a more accurate motion vector and thus minimal residual error for the object/partial object of the MB. The rest of the area (not covered by the pattern) is copied from the true background frame. The immediate previous frame is used for ME and MC on the assumption that the object is visible in the immediate previous frame. The other modes of H.264 can also use the true background as well as the immediate previous frame (in the multiple reference frames technique) as two separate reference frames, and the Lagrangian optimization will pick the optimal reference frame. The experimental results reveal that this approach improves the rate-distortion performance.

Image

FIGURE 3.8
Motion estimation and compensation problem using blocks or patterns when there is occlusion: (a) reference frame, (b) current frame, (c) MR without McFIS, (d) MR using McFIS, and (e) true background.

3.4.2.2    OCP Using McFIS

Treating a video as a 3D data tube and rearranging the video frames along the other two directions changes the way traditional video features such as background, motion, objects, panning, and zooming are treated. Due to the rearrangement of the traditional XY plane images into TX or TY plane images, object and/or camera motions of a traditional video can be transformed into simplified motion or a simple background; for example, horizontal and vertical motions can be transformed into a static background in the TX or TY plane images, respectively, and a heterogeneous object can be transformed into a smooth object in the TX or TY plane. Besides this, camera motions such as zooming and panning cannot be effectively estimated by the traditional translational motion estimation adopted in H.264. Camera motions can also be transformed into simplified motion/background in the TX or TY plane; thus, any existing coding technique can encode them more efficiently in the TX or TY plane.

The McFIS is used as a second reference frame for encoding the current frame, assuming that the moving parts of the current frame are referenced from the immediate previous frame and the static background parts from the McFIS. The ultimate reference is selected at the block and sub-block levels using the Lagrangian multiplier. The McFIS acts as a long-term reference frame in a dual-reference-frame arrangement, which is a subset of the MRF concept. Paul et al. [33] exploit the background newly exposed in the TX or TY plane through the McFIS. In their method, they first determine the OCP and then encode the video in the optimal plane with the McFIS. The experimental results reveal that the technique achieves a significant video quality improvement compared to the existing OCP technique and the H.264 video coding standard, with comparable computational complexity.

Using the aforementioned procedure, different McFISes are generated from the Silent video arranged as V_XYT, V_TXY, and V_TYX, respectively. Figure 3.9 shows (a) the original 50th frame, (b) the 50th McFIS using the XY plane, (c) the 50th McFIS using the TX plane, and (d) the 50th McFIS using the TY plane. Comparing the McFISes in Figure 3.9c and d against the McFIS in Figure 3.9b, they can easily be differentiated in terms of smoothness. As mentioned earlier, smoother objects or motion can be obtained after transforming the video into different planes. The McFISes in Figure 3.9c and d provide smoother images compared to the McFIS in Figure 3.9b. This also indicates that better rate-distortion performance can be achieved using the smoother McFISes (Figure 3.9c or d) compared to the rougher McFIS (Figure 3.9b), as the primary goal of the McFIS is to capture more background.

Image

FIGURE 3.9
McFISes in different planes: (a) original 50th frame, (b) McFIS at the 50th frame using the V_XYT video, (c) McFIS at the 50th frame using the V_TXY video, and (d) McFIS at the 50th frame using the V_TYX video, where the first 288 frames of the Silent video sequence are used. (From Paul, M. and Lin, W., Efficient video coding considering a video as 3D data cube, in IEEE International Conference on Digital Image Computing: Techniques and Applications (IEEE DICTA-11), 2011.)

3.4.2.3    Hierarchical Bipredictive Pictures

Figure 3.10a shows a popular dyadic hierarchical bipredictive picture (HBP) prediction structure with encoding picture types and the coding and display order of a GOP (comprising 16 frames), where two bidirectional reference frames (solid arrows only) are used. To obtain better coding performance from the HBP structure compared to other structures (e.g., IPPP or IBBP), different quantization parameters are used at different hierarchy levels. Normally, finer quantization is applied to the frames that are used more frequently, directly or indirectly, as reference frames for other frames. For example, Frame 9 (in display order) in Figure 3.10a is used more frequently as a reference frame than any other B-frame (four times directly, for frames 5, 7, 11, and 13, and 10 times indirectly, for frames 2, 3, 4, 6, 8, 10, 12, 14, 15, and 16).

The MRF technique can be applied to the HBP. Intuitively, a triple-frame referencing technique can be built on the dyadic HBP structure, as shown in Figure 3.10a, based on the closeness and the availability of reference frames for a given frame. To encode a frame, two solid arrows come from two frames and one dotted arrow comes from the third reference frame. For example, to encode Frame 5, Frame 1 and Frame 9 are used as the two bidirectional references, and Frame 17 is used as the third reference frame. On the other hand, to encode Frame 2, Frame 1 and Frame 3 are used as bidirectional references, and Frame 5 is used as the third reference frame. Both examples show that the MRF technique in the HBP is not uniform in terms of the distance between the encoding frame and its reference frames. Thus, it is sometimes difficult to exploit the advantages of MRFs from frames in the close vicinity. Moreover, it cannot ensure implicit background/foreground referencing, and thus the referencing has no physical meaning. To overcome this limitation of the MRF technique in the dyadic HBP structure (Figure 3.10a) and to exploit implicit background/foreground referencing, Paul et al. [22] proposed a new HBP scheme using the McFIS as the third reference frame. By this they assumed that motion areas of the current frame would be referenced from the two B-frames and static/uncovered background areas from the McFIS. Experimental results reveal that the technique improves the rate-distortion performance significantly.
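For illustration, the triple-referencing structure can be enumerated as below; the recursion lists a dyadic GOP (anchored at frames 1 and 17 for a 16-frame GOP) in a valid coding order and attaches the McFIS as the common third reference, with the hierarchy level available for assigning finer quantization to the more frequently referenced frames. The function name and tuple layout are illustrative assumptions.

```python
def dyadic_hbp_order(lo, hi, third_ref="McFIS", level=1):
    """List (frame, backward_ref, forward_ref, third_ref, level) for a dyadic HBP GOP, where
    lo and hi are the display numbers of two already-coded anchor frames (e.g., 1 and 17)."""
    if hi - lo < 2:
        return []
    mid = (lo + hi) // 2                         # bi-predicted from lo and hi; McFIS is the 3rd reference
    return ([(mid, lo, hi, third_ref, level)]
            + dyadic_hbp_order(lo, mid, third_ref, level + 1)
            + dyadic_hbp_order(mid, hi, third_ref, level + 1))

# For a 16-frame GOP anchored at frames 1 and 17 (display order), the first entries are
# (9, 1, 17, 'McFIS', 1), (5, 1, 9, 'McFIS', 2), (3, 1, 5, 'McFIS', 3), (2, 1, 3, 'McFIS', 4), ...
structure = dyadic_hbp_order(1, 17)
```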

Image

FIGURE 3.10
(See color insert.)
(a) Dyadic hierarchical B-picture prediction structure using two frames and three frames including the third frame (dotted arrows) and (b) proposed triple frame referencing with the McFIS as the third frame.

3.4.2.4    Computational Time Reduction in Video Coding Using McFIS

To see the amount of computational reduction of the McFIS-based approach, the H.264 with a fixed GOP and five reference frames is used as the benchmark. In video coding, a major portion of the entire computational time is spent on ME and MC. Although the McFIS-based scheme needs some extra computational time to generate the McFIS and encode it as an I-frame, this extra time is not significant in comparison with the ME reduction. To compare the experimental results, the McFIS-based and the H.264 schemes are implemented based on the JM 10.1 H.264/AVC reference software on a PC with an Intel(R) Core(TM)2 CPU (2.39 GHz) and 3.50 GB of RAM. Figure 3.11 shows the computational reduction of the McFIS-based scheme (where the McFIS is used as an I-frame and second reference frame) against the H.264 with five reference frames, using a number of video sequences (Mixed video, Silent, Paris, Bridge close, Hall Monitor, Salesman, News, Susie, and Popple) over different QPs, that is, 40, 36, 32, 28, 24, and 20. The computational complexity is calculated from the overall encoding time, including processing operations and data access. The figure confirms that the McFIS-based approach reduces computation by 61% on average.

Image

FIGURE 3.11
Average computational time reduction by the proposed scheme against the H.264 with fixed GOP and five reference frames using different standard video sequences (Silent, Bridge close, Paris, Hall Monitor, News, Salesman, Susie, and Popple). (From Paul, M. et al., IEEE Trans. Circ. Syst. Video Technol., 21(9), 1242, 2011.)

The McFIS-based technique requires extra computational time for the generation and encoding of the McFIS. This extra time is no more than 3% of the overall encoding time of a scene, assuming a scene length of 100 frames, an ME search length of 15, and a single reference frame. The experimental results suggest that the proposed scheme saves −43%, 17%, and 58% of computational time (i.e., it is slower than the single-reference case but faster otherwise), reduces bit rates by 22%, 20%, and 19%, and improves image quality by 1.53, 1.47, and 1.45 dB against the H.264 with one, two, and five reference frames, respectively, for the News video sequence on average. As can be seen, the proposed method is more efficient even in comparison with the H.264 using two reference frames. The McFIS generation and encoding time is fixed and does not depend on the number of reference frames. Thus, when a fast motion estimation method (such as the Unsymmetrical-cross Multi-Hexagon-grid Search, UMHexagonS [34,35]) is used, the percentage of time saving is lower than with exhaustive search. For example, when the UMHexagonS is turned on for both the proposed scheme and the H.264 with five reference frames, the computational time saving is around 50%, which is still significant. When the fast skip mode [3,29] is turned on for both the proposed scheme and the H.264 with five reference frames, the computational saving is even better for the proposed scheme, as it produces more skip modes using the McFIS.

3.4.2.5    Consistency of Image Quality and Bits per Frame Using McFIS

When the first frame is encoded as an I-frame and referenced in the conventional way, errors (due to quantization) are propagated and accumulated toward the end of the GOP. Figure 3.12 shows the bits per frame and the PSNR at the frame level for the H.264 and the McFIS-based scheme using the first 256 frames of the News video sequence. The figure demonstrates that the McFIS-based scheme provides not only better PSNR (39.83 dB at a 200 kbps bit rate) but also more consistent PSNR and bits per frame over the scene compared to the H.264 (39.15 dB at 214 kbps) and Ding's algorithm (39.41 dB at 200 kbps). Note that the McFIS bits are included in the bit rate calculations. Figure 3.12b shows that only one McFIS is required for the News video sequence, as there is no scene change and no significant drop in referencing from the McFIS within the 256 frames. Thus, for this sequence the fluctuation of bits is smaller than for the other methods. The standard deviations of the PSNR using the McFIS-based algorithm, Ding's algorithm, and the H.264 are 0.1122, 0.255, and 0.2343, respectively, and the PSNR fluctuations are 0.8, 2.0, and 1.5 dB, respectively.

Image

FIGURE 3.12
Fluctuations of PSNR (a) and bits per frame (b) by the H.264, Ding’s, and the McFIS-based schemes using first 256 frames of News sequence.

The McFIS-based scheme can provide more consistent image quality and bits per frame because a common frame, the McFIS, is used directly as a reference frame for all inter-frames in a scene (thus, there is no error propagation toward the end of the scene). As the McFIS is encoded with finer quantization, it carries less quantization error. Encoding an I-frame with that level of fine quantization in the conventional coding scheme requires an enormous number of bits, due to the regular insertion of an I-frame at the beginning of each fixed-size GOP, where the GOP size has to be small in order to cater for possible scene changes. Naturally, the generated McFIS enables lower PSNR fluctuation because it represents the most common and stable features in the video segment (the first frame, by contrast, only represents itself). A new McFIS needs to be generated and encoded if there is a significant drop in the PSNR of a frame, or if the percentage of referencing from the McFIS drops significantly compared to the other frames of the scene. Thresholds of 2.0 dB and 3%, respectively, are used in Ref. [16].
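The regeneration rule quoted above amounts to a simple check, sketched here with hypothetical argument names and the 2.0 dB and 3% thresholds reported in Ref. [16].

```python
def needs_new_mcfis(frame_psnr, scene_avg_psnr, mcfis_ref_percent, scene_avg_ref_percent,
                    psnr_drop_db=2.0, ref_drop_percent=3.0):
    """True if a new McFIS should be generated and encoded for the remaining frames of the scene."""
    psnr_drop = scene_avg_psnr - frame_psnr                   # quality drop relative to the scene so far
    ref_drop = scene_avg_ref_percent - mcfis_ref_percent      # drop in how often the McFIS is referenced
    return psnr_drop > psnr_drop_db or ref_drop > ref_drop_percent
```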

Figure 3.13 shows reference mapping using Silent and Paris video sequences by the McFIS-based scheme and Ding’s algorithm [4]. A scattered referencing takes place using Ding’s algorithm for the immediate previous and second previous frames. For the proposed method, moving object areas (black regions in Figure 3.13e and f) are referenced using the immediate previous frame, whereas the background regions are referenced using McFIS (normal area in Figure 3.13e and f). A large number of areas (normal regions in Figure 3.13e and f) are referenced by the McFIS, and this indicates the effectiveness of the McFIS for improving coding performance.

3.4.2.6    R-D Performance Improvement

Figure 3.14 shows the overall rate-distortion curves for the McFIS-based technique, Ding's and Matsuoka's algorithms, and the H.264 (with fixed GOP and five reference frames) for a number of video sequences. The experimental results confirm that the McFIS-based scheme outperforms the H.264 as well as the other two existing algorithms in most cases, even for video sequences with camera motion (e.g., Tennis, Trevor, Bus). The McFIS-based scheme, like the other two techniques (Ding's and Matsuoka's) with two reference frames, could not outperform the H.264 with five reference frames for some sequences with camera motion (e.g., Tempete, Mobile, and Foreman). This is because the McFIS-based, Ding's, and Matsuoka's techniques are not explicitly designed for camera motion. The performance improvement of the McFIS-based scheme is relatively high for the News, Salesman, Silent, and Hall Monitor video sequences compared to the other sequences. This is due to the relatively larger background areas in these cases, and hence a larger number of references are selected from the McFIS.

Image

FIGURE 3.13
Frame-level reference maps by the proposed and Ding's methods for the Silent and Paris video sequences: (a) and (b) are the decoded 31st frames of the Silent and Paris videos; (c) and (d) are the reference maps by Ding's algorithm; and (e) and (f) are the reference maps by the proposed algorithm, where black regions are referenced from the immediate previous frame while other regions are referenced from the McFIS (for the proposed method) or the second previous frame (for Ding's method).

3.5    Conclusions

In this chapter, the issues of effective dynamic I-frame insertion and reference frame generation (the reference frame being termed the most common frame in a scene, or McFIS for short) in video coding have been tackled simultaneously with a Gaussian mixture-based model of the dynamic background. To be more specific, the McFIS-based method uses the McFIS's inherent capability for scene change detection and adaptive GOP determination to make an integrated decision for efficient video coding. The McFIS is generated using a real-time Gaussian mixture model and can be used as the second reference frame for efficient encoding of the background. In essence, the scheme allows moving object areas to be referenced from the immediate previous frame while background regions are referenced from the McFIS.

Image

FIGURE 3.14
Rate-distortion performance comparison using different techniques such as H.264 with five reference frames, McFIS-based technique, Ding’s technique, and Matsuoka’s technique for different video sequences such as Salesman, Paris, Silent, Popple, Container, and Mobile.

Dynamic background modeling using decoded or distorted frames, instead of original frames, is also discussed. This widens the scope of use of dynamic background modeling, because raw video feeds (without any lossy compression) are usually not available and noise/error is inevitable, especially in the case of wireless transmission. Through foreground and background referencing, the McFIS can improve rate-distortion performance in uncovered background regions, which is almost impossible with traditional multiple-reference schemes. The McFIS-based scheme effectively reduces computational complexity by limiting the reference frames to only two without sacrificing rate-distortion performance (in fact, performance improves compared to the relevant existing algorithms). By introducing the McFIS as a reference frame, the complication of selecting a long-term reference frame is avoided.

Two advanced video coding techniques, namely, PVC and OCP, are also discussed in the chapter. The McFIS is used in PVC to determine the moving region used in the pattern-matching criteria. The McFIS is also used as a reference for the newly created static and smooth areas in the OCP technique. Improved rate-distortion performance can be achieved using the McFIS in both the PVC and OCP video coding techniques.

The video coding technique using the McFIS outperforms the existing relevant schemes in terms of rate-distortion performance and computational requirements. The experimental results show that the technique detects scene changes more effectively than the two state-of-the-art algorithms and outperforms them by 0.5–2.0 dB PSNR in coding quality. The technique also outperforms the H.264 with fixed GOP and five reference frames by 0.8–2.0 dB in PSNR while reducing computational time by around 60%.

References

1.  ITU-T Recommendation H.264: Advanced video coding for generic audiovisual services, 03/2009.

2.  Wiegand, T., G. J. Sullivan, G. Bjøntegaard, and A. Luthra, 2003. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7), 560–576.

3.  Paul, M. and M. Murshed, 2010. Video coding focusing on block partitioning and occlusions. IEEE Transactions on Image Processing, 19(3), 691–701.

4.  Ding J.-R. and J.-F. Yang, 2008. Adaptive group-of-pictures and scene change detection methods based on existing H.264 advanced video coding information. IET Image Processing, 2(2), 85–94.

5.  Huang, Y.-W., B.-Y. Hsieh, S.-Y. Chien, S.-Y. Ma, and L.-G. Chen, 2006. Analysis and complexity reduction of multiple reference frames motion estimation in H.264/AVC. IEEE Transactions on Circuits and Systems for Video Technology, 16(4), 507–522.

6.  Shen, L., Z. Liu, Z. Zhang, and G. Wang, 2007. An adaptive and fast multi frame selection algorithm for H.264 Video coding. IEEE Signal Processing Letters, 14(11), 836–839.

7.  Kuo, T.-Y. and H.-J. Lu, 2008. Efficient reference frame selector for H.264. IEEE Transactions on Circuits and Systems for Video Technology, 18(3), 400–405.

8.  Hachicha, K., D. Faura, O. Romain, and P. Garda, 2009. Accelerating the multiple reference frames compensation in the H.264 video coder. Journal of Real-time Image Processing, 4(1), 55–65.

9.  Hepper, D., 1990. Efficiency analysis and application of uncovered background prediction in a low bit rate image coder. IEEE Transactions on Communication, 38, 1578–1584.

10.  Chien, S.-Y., S.-Y. Ma, and L.-G. Chen, 2002. Efficient moving object segmentation algorithm using background registration technique. IEEE Transactions on Circuits and Systems for Video Technology, 12(7), 577–586.

11.  Totozafiny, T., O. Patrouix, F. Luthon, and J.-M. Coutellier, 2006. Dynamic background segmentation for remote reference image updating within motion detection JPEG 2000. IEEE International Symposium on Industrial Electronics, Montreal, Quebec, Canada, pp. 505–510.

12.  Ding, R., Q. Dai, W. Xu, D. Zhu, and H. Yin, 2004. Background-frame based motion compensation for video compression. IEEE International Conference on Multimedia and Expo (ICME), Taipei, Vol. 2, pp. 1487–1490.

13.  Stauffer C. and W. E. L. Grimson, 1999. Adaptive background mixture models for real-time tracking. IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 246–252.

14.  Lee, D.-S., 2005. Effective Gaussian mixture learning for video background subtraction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(5), 827–832.

15.  Haque, M., M. Murshed, and M. Paul, 2008. Improved Gaussian mixtures for robust object detection by adaptive multi-background generation. IEEE International Conference on Pattern Recognition, Tampa, FL, pp. 1–4.

16.  Paul, M., W. Lin, C. T. Lau, and B.-S. Lee, 2010. Video coding using the most common frame in scene. IEEE International Conference on Acoustics, Speech, and Signal Processing (IEEE ICASSP-10), Dallas, TX, pp. 734–737.

17.  Paul, M., W. Lin, C. T. Lau, and B.-S. Lee, 2011. Explore and model better I-frame for video coding. IEEE Transaction on Circuits and Systems for Video Technology, 21(9), 1242–1254.

18.  Dimou, A., O. Nemethova, and M. Rupp, 2005. Scene change detection for H.264 using dynamic threshold techniques. EURASIP Conference on Speech and Image Processing, Multimedia Communications and Service, Smolenice, Slovak Republic, pp. 222–227.

19.  Alfonso, D., B. Biffi, and L. Pezzoni, 2006. Adaptive GOP size control in H.264/AVC encoding based on scene change detection. Signal Processing Symposium, Reykjavik, Iceland, pp. 86–89.

20.  Matsuoka, S., Y. Morigami, T. Song, and T. Shimamoto, 2008. Coding efficiency improvement with adaptive GOP size selection for H.264/SVC. International Conference on Innovative Computing Information and Control (ICICIC), Dalian, Liaoning, pp. 356–359.

21.  Song, T., S. Matsuoka, Y. Morigami, and T. Shimamoto, 2009. Coding efficiency improvement with adaptive GOP selection for H.264/SVC. International Journal of Innovative Computing, Information and Control, 5(11), 4155–4165.

22.  Paul, M., W. Lin, C.T. Lau, and B.-S. Lee, 2011. McFIS in hierarchical bipredictive picture-based video coding for referencing the stable area in a scene. IEEE International Conference on Image Processing (IEEE ICIP-11), Brussels, Belgium, pp. 3521–3524.

23.  Paul, M., M. Frater, and J. Arnold, 2009. An efficient mode selection prior to the actual encoding for H.264/AVC encoder. IEEE Transactions on Multimedia, 11(4), 581–588.

24.  Paul, M., W. Lin, C.T. Lau, and B.-S. Lee, 2011. Direct intermode selection for H.264 video coding using phase correlation. IEEE Transaction on Image Processing, 20(2), 461–473.

25.  Escoda, O. D., P. Yin, C. Dai, and X. Li, 2007. Geometry-adaptive block partitioning for video coding. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-07), Honolulu, Hawaii, pp. I-657–660.

26.  Kim, J. H., A. Ortega, P. Yin, P. Pandit, and C. Gomila, 2008. Motion compensation based on implicit block segmentation, IEEE International Conference on Image Processing (ICIP-08), San Diego, CA, pp. 2452–2455.

27.  Chen, S., Q. Sun, X. Wu, and L. Yu, 2008. L-shaped segmentations in motion-compensated prediction of H.264. IEEE International Conference on Circuits and Systems, Seattle, WA, pp. 1620–1623.

28.  Paul, M., M. Murshed, and L. Dooley, 2005. A real-time pattern selection algorithm for very low bit-rate video coding using relevance and similarity metrics. IEEE Transactions on Circuits and Systems for Video Technology, 15(6), 753–761.

29.  Wong, K.-W., K.-M. Lam, and W.-C. Siu, 2001. An efficient low bit-rate video-coding algorithm focusing on moving regions. IEEE Transactions on Circuits and System for Video Technology, 11(10), 1128–1134.

30.  Liu, Y., F. Wu, and K. N. Ngan, 2007. 3-D object-based scalable wavelet video coding with boundary effect suppression. IEEE Transactions on Circuits and System for Video Technology, 17(5), 639–644.

31.  Seran, V. and L. P. Kondi, 2006. New scaling coefficients for bi-orthogonal filter to control distortion variation in 3D wavelet based video coding. In Proceedings of the International Conference on Image Processing, Atlanta, GA, pp. 1873–1876.

32.  Liu, A., W. Lin, M. Paul, and F. Zhang, 2011. Optimal compression plane determination for video coding. IEEE Transaction on Image Processing, 20(10), 2788–2799.

33.  Paul, M. and W. Lin, 2011. Efficient video coding considering a video as 3D data cube. IEEE International Conference on Digital Image Computing: Techniques and Applications (IEEE DICTA-11), Noosa, Queensland, Australia, pp. 170–174

34.  Chen, Z., P. Zhou, Y. He, and J. Zheng, 2006. Fast integer-PEL and fractional-PEL motion estimation for H.264/AVC. Journal of Visual Communication and Image Representation, 17(2), 264–290.

35.  Rahman, C. A. and W. Badawy, 2005. UMHexagonS algorithm based motion estimation architecture for H.264/AVC. Fifth International Workshop on System-on-Chip for Real-time Applications (IWSOC'05), Banff, Alberta, Canada, pp. 207–210.
