4.8 Modelling from a Bit-stream

As described in Section 4.5, the DCT can be used in building computational models of visual attention. This section introduces a computational model of visual attention in the compressed domain [10]. Most existing saliency detection models are built in the image (uncompressed) domain; however, images in storage and over the internet are typically held in a compressed format such as JPEG. A novel saliency detection model in the compressed domain is proposed in [10]. The intensity, colour and texture features of the image are extracted from the DCT coefficients of a JPEG bit-stream. The saliency value of each DCT block is obtained through Hausdorff distance calculation and feature map fusion. Since JPEG applies the DCT at the level of 8 × 8-px blocks, the DCT coefficients are used to extract intensity, colour and texture features for each such block. Although the minimum coded unit (MCU) can be as large as 16 × 16 px (for the 4:2:0 subsampling format), saliency detection in this model is performed at the 8 × 8 block level for each DCT block. The saliency map for an image is calculated from weighted feature differences between DCT blocks.

4.8.1 Feature Extraction from a JPEG Bit-stream

The Baseline method of JPEG, which is built on the DCT, is the most widely used image compression method [70]. Entropy decoding is applied to the JPEG bit-stream to obtain the quantized DCT coefficients. As Huffman coding (one kind of entropy coding algorithm) is used to encode the quantized DCT coefficients in the Baseline method of JPEG [70, 71], the JPEG bit-stream can be decoded into quantized DCT coefficients according to the two sets of Huffman tables (an AC table and a DC table per set). The dequantization operation is then applied to these quantized DCT coefficients to obtain the DCT coefficients.
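To make this step concrete, the following is a minimal sketch in Python of the dequantization of a single block, assuming the entropy-decoding stage has already produced the 8 × 8 array of quantized coefficients; the flat quantization table is a placeholder rather than a table from an actual DQT segment.

```python
import numpy as np

def dequantize_block(quantized: np.ndarray, quant_table: np.ndarray) -> np.ndarray:
    """Recover the DCT coefficients of one 8 x 8 block.

    JPEG quantizes each coefficient by element-wise division with a
    quantization table; dequantization is the element-wise product with
    the table selected by the frame-header parameter Tq.
    """
    assert quantized.shape == (8, 8) and quant_table.shape == (8, 8)
    return quantized.astype(np.int32) * quant_table.astype(np.int32)

# Toy example: a block with only three non-zero quantized coefficients and
# a flat placeholder table (a real table comes from the DQT segment).
quantized = np.zeros((8, 8), dtype=np.int16)
quantized[0, 0] = -26                      # quantized DC coefficient
quantized[0, 1], quantized[1, 0] = -3, 1   # two low-frequency AC coefficients
quant_table = np.full((8, 8), 16, dtype=np.uint16)
print(dequantize_block(quantized, quant_table)[0, 0])   # -416
```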

The syntax for DCT-based modes of operation in the JPEG standard is shown in Figure 4.19. In the JPEG standard, markers identify the structural parts of the compressed data. The SOI marker indicates the start of a compressed image, while the EOI marker indicates its end. The frame header at the start of a frame (a JPEG image) specifies the source image characteristics, the components in the frame and the sampling factors for each component, and also specifies the destinations from which the quantization tables to be used with each component are retrieved. The parameter Tq in the frame header gives the quantization table destination (index), identifying the quantization table to be used for dequantization of the DCT coefficients of a component.

Figure 4.19 Syntax for DCT-based modes of operation in JPEG standard [71]. © 2012 IEEE. Reprinted, with permission, from Y. Fang, Z. Chen, W. Lin, C. Lin, ‘Saliency detection in the compressed domain for adaptive image retargeting’, IEEE Transactions on Image Processing, Sept. 2012


Following the frame header, the scan header specifies which components and which quantized DCT coefficients are contained in the scan. The parameters Tdj and Taj in the scan header specify the DC and AC entropy coding table destinations, respectively. The data following the scan header comprise entropy-coded segments (ECSs) and restart markers (RSTs). Each ECS is a sequence of entropy-coded MCUs. The RST is a conditional marker placed between two ECSs only when restart is enabled. Detailed information on the JPEG bit-stream can be found in [71].

Based on the above description, the JPEG bit-stream can be decoded into quantized DCT coefficients using the DC and AC entropy coding tables identified by Tdj and Taj in the scan header. The quantized DCT coefficients are then dequantized, according to the quantization table identified by Tq, to recover the DCT coefficients.
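As an illustration of this structure, the following sketch walks a baseline JPEG bit-stream and reports the markers discussed above. It only locates the segments; a real decoder would additionally parse the frame header, scan header and the DQT/DHT tables. The file name is hypothetical.

```python
def scan_jpeg_markers(data: bytes):
    """Yield (name, offset) pairs for structural markers in a baseline JPEG stream."""
    names = {0xD8: "SOI", 0xD9: "EOI", 0xDB: "DQT", 0xC4: "DHT",
             0xC0: "SOF0", 0xDA: "SOS"}
    i = 0
    while i + 1 < len(data):
        if data[i] != 0xFF:
            i += 1                      # inside entropy-coded data: keep scanning
            continue
        code = data[i + 1]
        if code == 0x00:                # 0xFF00 is a stuffed data byte, not a marker
            i += 2
            continue
        if code == 0xFF:                # fill byte before a marker
            i += 1
            continue
        name = names.get(code, "RST" if 0xD0 <= code <= 0xD7 else f"0x{code:02X}")
        yield name, i
        if code in (0xD8, 0xD9) or 0xD0 <= code <= 0xD7:
            i += 2                      # SOI, EOI and RSTn carry no length field
        else:
            seg_len = int.from_bytes(data[i + 2:i + 4], "big")
            i += 2 + seg_len            # skip the marker plus its segment payload

with open("photo.jpg", "rb") as f:      # hypothetical file name
    for name, offset in scan_jpeg_markers(f.read()):
        print(f"{name} at byte {offset}")
```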

Three features, namely the intensity, colour and texture, are extracted from the DCT coefficients to build the saliency detection model. The DCT coefficients in one 8 × 8-px block are shown in Figure 4.20. They comprise one DC coefficient and 63 AC coefficients. In each block, the DC coefficient is a measure of the average energy over all 8 × 8 pixels, while the 63 AC coefficients represent the detailed frequency properties of the block. JPEG compression exploits the fact that most of the energy is concentrated in the first few low-frequency coefficients, located in the upper-left corner of the block in Figure 4.20. The high-frequency coefficients towards the bottom-right of the block are close to zero and are largely discarded during quantization of the DCT coefficients.

Figure 4.20 DCT coefficients and the zig-zag scanning in one 8 × 8-px block


The AC coefficients are ordered by zig-zag scanning from low frequency to high frequency, as shown in Figure 4.20. The YCrCb colour space is used to encode colour images in the JPEG standard: the Y channel carries the luminance information, while the Cr and Cb channels carry the chrominance information. As discussed above, the DC coefficients represent the average energy of each 8 × 8-px block; they are first converted from the YCrCb colour space to the RGB colour space to extract the intensity and colour features. Let r, g and b denote the red, green and blue colour components obtained from the DC coefficients. Four broadly tuned colour channels are generated as: $R = r - (g + b)/2$ for the new red component, $G = g - (r + b)/2$ for the new green component, $B = b - (r + g)/2$ for the new blue component and $Y = (r + g)/2 - |r - g|/2 - b$ for the new yellow component. The intensity feature is calculated as $I = (r + g + b)/3$. The colour channels are then combined into red/green and blue/yellow double opponency, according to the related properties of the human primary visual cortex [66]: $C_{rg} = R - G$ and $C_{by} = B - Y$.

$I$, $C_{rg}$ and $C_{by}$ are the three extracted intensity and colour features for an 8 × 8 block in the JPEG image. Note that a 16 × 16 MCU consists of four 8 × 8 luminance blocks and two 8 × 8 chrominance blocks (one for Cb and the other for Cr); thus, four luminance blocks share the same chrominance blocks in a typical 4:2:0 component subsampling JPEG encoding system.
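A minimal sketch of this feature extraction for a single block follows; the conversion from YCrCb to RGB uses the standard BT.601 equations assumed by JFIF, and the DC values are assumed to have already been rescaled to block means in the 0–255 range (i.e. the DCT scaling and the +128 level shift have been undone).

```python
def dc_colour_features(y_dc: float, cb_dc: float, cr_dc: float):
    """Intensity and colour features of one block from its DC values.

    Inputs are assumed to be block means in the 0..255 range, i.e. the DC
    coefficients with the DCT scaling and the +128 level shift undone.
    """
    # Standard BT.601 YCbCr -> RGB conversion (as assumed by JFIF).
    r = y_dc + 1.402 * (cr_dc - 128.0)
    g = y_dc - 0.344136 * (cb_dc - 128.0) - 0.714136 * (cr_dc - 128.0)
    b = y_dc + 1.772 * (cb_dc - 128.0)

    # Four broadly tuned colour channels.
    R = r - (g + b) / 2.0
    G = g - (r + b) / 2.0
    B = b - (r + g) / 2.0
    Y = (r + g) / 2.0 - abs(r - g) / 2.0 - b

    intensity = (r + g + b) / 3.0   # intensity feature I
    c_rg = R - G                    # red/green opponency
    c_by = B - Y                    # blue/yellow opponency
    return intensity, c_rg, c_by

print(dc_colour_features(120.0, 110.0, 150.0))
```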

The AC coefficients carry the detailed frequency information of each image block, and previous studies have shown that they can be used to represent the texture information of image blocks. In this model, the AC coefficients in the YCrCb colour space are used to extract the texture feature for each 8 × 8 block. In the YCrCb colour space, the Cr and Cb components represent the colour information, and their AC coefficients provide little information about texture. In addition, a 16 × 16 MCU consists of more luminance blocks than chrominance ones in a typical 4:2:0 scheme. Thus, the model uses the AC coefficients of the Y component only to extract the texture feature T. Following the studies in [72, 73], the AC coefficients are classified into three parts: low-frequency (LF), medium-frequency (MF) and high-frequency (HF), as shown in Figure 4.21. The coefficients in each part are summed into a single value, giving three elements ($t_{LF}$, $t_{MF}$ and $t_{HF}$) that represent the texture feature of each DCT block. Therefore, the texture feature T of each DCT block can be expressed as:

Figure 4.21 Different types of DCT coefficients in one 8 × 8-px block


(4.80) $T = \{t_{LF},\, t_{MF},\, t_{HF}\}$

where $t_{LF}$, $t_{MF}$ and $t_{HF}$ are the sums of all the coefficients in the LF, MF and HF parts, respectively, in Figure 4.21.
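The texture feature of Equation 4.80 can be sketched as follows; the zig-zag order is generated rather than hardcoded, and the LF/MF/HF split points are illustrative assumptions standing in for the exact partition of Figure 4.21.

```python
import numpy as np

def zigzag_indices(n: int = 8):
    """(row, col) pairs of an n x n block in JPEG zig-zag order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[1] if (rc[0] + rc[1]) % 2 == 0 else rc[0]))

def texture_feature(block: np.ndarray, lf_end: int = 14, mf_end: int = 35):
    """Equation 4.80: T = {t_LF, t_MF, t_HF} for one 8 x 8 block of
    Y-component DCT coefficients.

    lf_end and mf_end are illustrative split points along the zig-zag
    order of the 63 AC coefficients; the actual LF/MF/HF partition is
    the one defined in Figure 4.21.
    """
    zz = zigzag_indices(8)[1:]                      # skip the DC coefficient
    ac = np.array([block[r, c] for r, c in zz], dtype=np.float64)
    return np.array([ac[:lf_end].sum(),             # t_LF
                     ac[lf_end:mf_end].sum(),       # t_MF
                     ac[mf_end:].sum()])            # t_HF

# Toy example: random coefficients for one block.
rng = np.random.default_rng(0)
print(texture_feature(rng.integers(-5, 6, size=(8, 8)).astype(np.float64)))
```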

4.8.2 Saliency Detection in the Compressed Domain

In this model, the four extracted features – one intensity feature, two colour features and one texture feature – are used to calculate four feature maps, respectively. The coherent normalization based fusion method is then adopted to combine these four feature maps into the final saliency map for JPEG images.

1. Feature differences between DCT blocks: for the intensity and colour features ($I$, $C_{rg}$ and $C_{by}$), letting $F_i^{k}$ denote the value of the $k$th feature at block $i$, the feature difference between blocks $i$ and $j$ can be computed as

(4.81) $D_{ij}^{k} = \left| F_i^{k} - F_j^{k} \right|$

where $k \in \{1, 2, 3\}$ indexes the intensity and colour features (one intensity feature and two colour features).
The vector T from Equation 4.80, comprising three elements, represents the texture feature of each DCT block in the JPEG image. The Hausdorff distance [74] is used to calculate the difference between the texture vectors of two different blocks. The Hausdorff distance is widely used to measure the dissimilarity between two point sets, by examining how near each point of one set lies to the other set (and vice versa). The texture difference $D_{ij}^{4}$ between two blocks $i$ and $j$ can be computed as follows (a sketch of this computation is given after this list):

(4.82) $D_{ij}^{4} = \max\left( d(T_i, T_j),\; d(T_j, T_i) \right)$

where the superscript 4 indicates that the texture feature is the fourth feature (the first three being one intensity and two colour features, as described above); $T_i$ and $T_j$ are the texture feature vectors of blocks $i$ and $j$, respectively, and $d(T_i, T_j)$ is calculated as

(4.83) $d(T_i, T_j) = \max_{\alpha \in T_i} \min_{\beta \in T_j} \left\| \alpha - \beta \right\|$

where $\|\cdot\|$ is the $L_2$ norm.
2. Feature maps in the compressed domain: in this model, the saliency value of each DCT block in each feature map is determined by two factors: the block differences between this DCT block and all other DCT blocks of the input image, and the weighting of these block differences. The larger the differences between a DCT block and all other DCT blocks, the larger its saliency value. A Gaussian model of the Euclidean distances between DCT blocks is used to weight the block differences, chosen for its generality. Denoting by $S_i^{k}$ the saliency value calculated from the $k$th feature for DCT block $i$, we have:

(4.84) $S_i^{k} = \sum_{j \neq i} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-l_{ij}^{2} / (2\sigma^{2})}\, D_{ij}^{k}$

where $\sigma$ is the parameter of the Gaussian model, $l_{ij}$ is the Euclidean distance between DCT blocks $i$ and $j$, and $D_{ij}^{k}$ is calculated as in Equations 4.81 and 4.82; $\sigma$ is set to a fixed value in [10].
From Equation 4.84, the saliency value of DCT block $i$ takes into account the block differences between this block and all other DCT blocks in the image: the greater these differences, the larger the saliency value of block $i$. The Gaussian model of the Euclidean distances between DCT blocks weights the block differences, so that differences with respect to nearer blocks contribute more to the saliency value of block $i$ than differences with respect to farther blocks. According to Equation 4.84, four feature maps (one intensity, two colour and one texture feature map) can be calculated from the intensity, colour and texture features (a sketch of this computation is given after this list).
3. Final saliency map in the compressed domain: collecting the saliency values of all DCT blocks together yields the four feature maps $S^{k}$ ($k = 1, \ldots, 4$). The saliency map for the JPEG image is obtained by integrating these four feature maps. The coherent normalization based fusion method is used to combine them into the saliency map SM:

(4.85) $SM = \mu \sum_{k=1}^{4} N'(S^{k}) + \nu \prod_{k=1}^{4} N'(S^{k})$

where $N'(\cdot)$ is the normalization operation, and $\mu$ and $\nu$ are parameters determining the weights of the two components in Equation 4.85; both parameters are set to fixed values in [10]. The second term in Equation 4.85 represents those regions which all four feature maps detect as salient (a sketch of this fusion step is given after this list).
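The following is a minimal sketch of the Hausdorff-distance computation of Equations 4.82 and 4.83 from item 1, treating each texture feature T as a set of three scalar elements. It is an illustration under that assumption, not the implementation of [10].

```python
import numpy as np

def directed_distance(a, b) -> float:
    """max over elements of a of the L2 distance to the nearest element of b."""
    a = np.asarray(a, dtype=np.float64).reshape(-1, 1)
    b = np.asarray(b, dtype=np.float64).reshape(-1, 1)
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return float(pairwise.min(axis=1).max())

def texture_difference(t_i, t_j) -> float:
    """Equations 4.82-4.83: symmetric Hausdorff distance between texture features."""
    return max(directed_distance(t_i, t_j), directed_distance(t_j, t_i))

# Toy example with two texture vectors (t_LF, t_MF, t_HF).
print(texture_difference([3.0, 1.0, 0.2], [2.5, 0.8, 0.1]))   # 0.5
```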
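Next, a sketch of the per-feature saliency computation of Equation 4.84 from item 2. The Gaussian form of the weights and the value of σ here are illustrative assumptions; the constant factor $1/(\sqrt{2\pi}\,\sigma)$ is omitted since it rescales all blocks equally.

```python
import numpy as np

def feature_map(features: np.ndarray, positions: np.ndarray,
                sigma: float = 5.0) -> np.ndarray:
    """Equation 4.84 (sketch): Gaussian-weighted sum of block differences.

    features  : (n,) array, one scalar feature value per DCT block
    positions : (n, 2) array of block coordinates (in block units)
    sigma     : spread of the Gaussian weighting (illustrative value)
    """
    diff = np.abs(features[:, None] - features[None, :])               # D_ij
    l = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=2)
    w = np.exp(-l**2 / (2.0 * sigma**2))                               # Gaussian weights
    np.fill_diagonal(w, 0.0)                                           # exclude j == i
    return (w * diff).sum(axis=1)

# Toy example: a 4 x 4 grid of blocks, one block much brighter than the rest.
pos = np.array([(r, c) for r in range(4) for c in range(4)], dtype=np.float64)
feat = np.zeros(16)
feat[5] = 10.0
print(feature_map(feat, pos).argmax())   # block 5 gets the largest saliency value
```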
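Finally, a sketch of the coherent-normalization fusion of Equation 4.85 from item 3. Min-max normalization stands in for $N'(\cdot)$, and the weights mu and nu are placeholders for the values fixed in [10].

```python
import numpy as np

def fuse_feature_maps(maps, mu: float = 0.5, nu: float = 0.5) -> np.ndarray:
    """Equation 4.85 (sketch): combine four feature maps into one saliency map.

    maps   : list of per-block saliency maps of equal shape
    mu, nu : weights of the sum and product terms (illustrative values;
             the actual settings are given in [10])
    """
    def norm(m):
        m = np.asarray(m, dtype=np.float64)
        rng = m.max() - m.min()
        return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

    normed = [norm(m) for m in maps]
    # Sum term rewards any salient feature; product term keeps only regions
    # that all four feature maps agree are salient.
    sm = mu * np.sum(normed, axis=0) + nu * np.prod(normed, axis=0)
    return norm(sm)   # normalize the final map for display

# Toy example with four random 4 x 4 feature maps.
rng = np.random.default_rng(1)
print(fuse_feature_maps([rng.random((4, 4)) for _ in range(4)]))
```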
In sum, to obtain the saliency map for JPEG images in the compressed domain, the model in [10] extracts the intensity, colour and texture features from the DCT coefficients of the JPEG bit-stream and calculates the DCT block differences. Weighted by the Gaussian model of the Euclidean distances between DCT blocks, these block differences yield the saliency map for JPEG images. The experimental results in [10] show that this computational model of visual attention in the compressed domain outperforms existing models in salient object detection on a large public database.