13

2D-to-3D Video Conversion: Techniques and Applications in 3D Video Communications

Chunyu Lin, Jan De Cock, Jürgen Slowack, Peter Lambert, and Rik Van de Walle

CONTENTS

13.1  Introduction

13.2  Principle of 3D Display Technology

13.3  Depth Cues

13.3.1  Motion Cues

13.3.2  Focus/Defocus Cues

13.3.3  Geometric Cues

13.3.4  Atmospheric Scattering Cues

13.3.5  Other Cues

13.4  State-of-the-Art Scheme for 2D-to-3D Conversion

13.4.1  Image Classification Techniques

13.4.2  Bundle Optimization Techniques

13.4.3  Machine Learning Techniques

13.4.4  Schemes Using Surrogate Depth Maps

13.5  Application of 2D-to-3D Conversion in 3D Video Communications

13.6  Conclusion and Future Trends

References

13.1  Introduction

Three-dimensional (3D) video can be regarded as the next revolution for many image and video applications such as television, movies, video games, video conferencing, and remote video classrooms. It provides a dynamic, realistic, and immersive feeling for viewers, which enhances the sense of presence. However, 3D has not been widely adopted yet. This is because the success of 3D depends not only on technological advances in 3D displays and capturing devices but also on the active deployment of 3D video communication systems and the wide availability of 3D video content. With only a limited number of programs available in 3D format, there is little reason to buy a more expensive 3D device. At the same time, the tremendous production cost and the complicated production process of 3D videos result in a lack of 3D content, which hinders the development of 3D technology. To resolve this problem, one effective alternative is to develop new techniques to convert 2D videos into 3D. An efficient 2D-to-3D conversion system can generate more 3D content from existing 2D videos. Most important of all, it can push the popularization of 3D among the general public and consequently stimulate the development of 3D technology. Generally, 2D-to-3D conversion techniques comprise automatic methods and methods requiring human intervention (semiautomatic). In the semiautomatic methods, certain key frames of the video sequence are assigned depth information by human operators [3], and the other frames are converted automatically. Since the semiautomatic methods involve human interaction, their results are generally better. However, human participation is impractical in many scenarios because it is slow, not real time, and costly. Hence, this chapter will mainly focus on automatic 2D-to-3D conversion technology.

Besides 3D content creation, 2D-to-3D conversion can also help to reduce the bit rate and enhance the error resilience and error concealment of a 3D video communication system when the 2D-plus-depth (2D+Z) or multiview video plus depth (MVD) format is employed. This is very important for the transmission of 3D content over wireless networks, because their error-prone nature and packet losses can greatly degrade the visual experience. This chapter proposes such a communication system as an application of 2D-to-3D conversion.

The objective of this chapter is to provide a comprehensive overview of 2D-to-3D conversion technology and its applications in 3D video communication. Firstly, the principle of 3D display technology is presented in Section 13.2. Then, in Section 13.3, the depth cues that can be used to extract depth are introduced, including the advantages and disadvantages of each cue. Section 13.4 presents state-of-the-art schemes for 2D-to-3D conversion. An application of 2D-to-3D conversion in 3D video communication is proposed in Section 13.5. Finally, conclusions and future trends of 2D-to-3D conversion are given in Section 13.6.

13.2  Principle of 3D Display Technology

Because our eyes are spaced apart, the left and right retinas receive slightly different images. This difference between the left and right images is called binocular disparity [26]. The brain integrates these two images into a single 3D image, allowing us to perceive depth information. In fact, this is the basic idea behind 3DTV, where a 3D effect is achieved by presenting a slightly different image to each eye.


FIGURE 13.1
The relationship between depth map and disparity value.

As Figure 13.1 shows, two viewpoints (cameras/eyes) are located at positions c_l and c_r, where f is the focal length of the cameras and t_x is the baseline distance between the two cameras. We assume that the two cameras are on the same horizontal line, just like our two eyes. Under this assumption, the left and right cameras produce images of the object p (at distance Z) on their image planes. From the geometry shown in the figure, we have

x_l = \frac{t_x}{2}\frac{f}{Z}, \qquad x_r = -\frac{t_x}{2}\frac{f}{Z}

(13.1)

The disparity between the two images is d = x_l - x_r. After simple calculations, (13.1) can be rewritten as

\frac{d}{t_x} = \frac{f}{Z}

(13.2)

Formula (13.2) shows that depth is inversely proportional to disparity. Based on this relation, current 3D devices can support the left-and-right view format or the 2D+Z format. Figure 13.2 shows a general 2D-to-3D conversion system, where depth image–based rendering is the process of synthesizing virtual views of a scene from texture images and their associated per-pixel depth information [7]. The depth recovery part is the most important component of 2D-to-3D conversion, and it will be the focus of this chapter.
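As a simple numerical illustration of the relation in (13.2), the sketch below converts a depth map into a disparity map; the focal length and baseline values are hypothetical and would come from the actual camera setup in practice.

```python
import numpy as np

def depth_to_disparity(depth_map, focal_length, baseline):
    """Convert per-pixel depth Z into disparity d via d = f * t_x / Z,
    which follows directly from Equation (13.2)."""
    # Guard against division by zero for invalid (zero) depth samples.
    safe_depth = np.where(depth_map > 0, depth_map, np.inf)
    return focal_length * baseline / safe_depth

# Example: a synthetic 2x2 depth map in metres with hypothetical camera parameters.
depth = np.array([[2.0, 4.0],
                  [8.0, 16.0]])
print(depth_to_disparity(depth, focal_length=700.0, baseline=0.065))
# Nearer pixels (smaller Z) yield larger disparities, as Figure 13.1 suggests.
```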


FIGURE 13.2
A general 2D-to-3D conversion system.

13.3  Depth Cues

To generate a 3D version from a 2D image, the basic idea is to recover the corresponding depth value for each pixel. As we all know, we can still get a sense of depth even when watching 2D images or videos. For example, the farther away an object is, the smaller it appears in the picture. This type of experience is caused by the depth cues in the images/videos. To obtain the depth information, discovering and understanding these depth cues is a very important step. Hence, the depth cues that can be employed for 2D-to-3D conversion are detailed in this section.

13.3.1  Motion Cues

Imagine taking the train and looking outside: nearby buildings move faster across the retina than distant objects do. The relative motion between the viewing camera and the observed scene provides an important cue for depth perception. Most important of all, the relative motion may be seen as a form of “disparity over time.” Generally, the methods using motion cues try to extract motion information from the scene, and they can be classified into two groups: motion estimation methods and structure from motion (SFM).

Motion estimation is extensively used in video coding to obtain the motion vectors between macroblocks. To a certain extent, a motion vector can be directly interpreted as disparity. Hence, the simplest approach is to extract the motion vectors from the video codec (MPEG or H.264/AVC). In Ref. [16], the depth map is obtained from an H.264-encoded sequence, in which the different coding modes (inter, intra, and skip) are processed differently. Its advantages are low complexity and compatibility with broadcasting networks. However, the motion vectors from a video codec are based on blockwise motion estimation, so block artifacts cannot be avoided in the generated depth map. Even though deblocking filters can be used to reduce block artifacts, the artifacts along object edges remain, which seriously degrades the 3D experience. Optical flow is another way to estimate the motion vectors and performs well at object boundaries. However, optical flow is complex, and it cannot handle objects that move too fast or have large displacements. Energy minimization is employed frequently in computer vision. It can also be used to estimate the motion information by modeling the displacement field as a Markov random field (MRF) and formulating an energy function over the motion information. In the energy minimization function, two constraints are generally used. The first is the color similarity between two corresponding pixels in two different images. The second is the smoothness of the depth values in a neighborhood. Typically, such energy functions have many local minima and incur enormous computational costs. However, many efficient algorithms, such as those based on max-flow/min-cut, graph cut, and belief propagation [19], can be used to estimate the disparity. For example, with graph cuts, α-expansion moves produce a solution within a known factor of the global minimum of the energy [2]. However, energy minimization can only deal with scenes with sufficient camera movement. Moreover, if there are complex scene changes or multiple independently moving objects, it is difficult to obtain a good disparity estimate. In addition, it is difficult to obtain correct depth values for occluded and textureless regions in the scene.
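As a rough illustration of the motion cue, the sketch below uses dense optical flow (OpenCV's Farneback estimator rather than the codec motion vectors or energy minimization methods discussed above) and treats larger apparent motion as an indication of a closer object. This is only a crude proxy that ignores independently moving objects; the parameter values are assumptions.

```python
import cv2
import numpy as np

def motion_cue_depth(prev_frame, next_frame):
    """Rough depth proxy from dense optical flow: under a translating camera,
    larger apparent motion is assumed to indicate a closer object ("disparity
    over time"). Real schemes add segmentation, temporal smoothing, and
    handling of independently moving objects."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)
    # Map motion magnitude to an 8-bit proxy: 255 = assumed nearest, 0 = farthest.
    return cv2.normalize(magnitude, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```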

SFM [8] is often employed to recover the camera parameters and the 3D scene structure by identifying points in two or more images that are projections of the same point in space. In SFM, it is impossible to compare every pixel of one image with every pixel of the other image(s) because of the combinatorial complexity and ambiguous matching. In any case, not all points are equally well suited for matching, especially over a long sequence of images. Hence, SFM first extracts feature points, such as Harris corners, which are located at the maxima of the local autocorrelation function of the image. Secondly, camera motion and scene structure are estimated based on the detected feature correspondences. Since only feature correspondences are used, these steps produce a sparse depth map. For a dense depth map, Delaunay triangulation or other methods are required [13]. The performance of SFM depends on the feature detection and matching process. In this context, efficient feature extraction and matching algorithms such as the scale-invariant feature transform can be used to improve the performance of SFM. The depth map from SFM is of good quality because of the pinhole camera model and the epipolar geometry employed. However, SFM assumes that objects do not deform and that their movements are linear. It cannot handle scenarios containing degenerate motion (e.g., a rotation-only camera) or degenerate structure (e.g., a coplanar scene) [13].
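The sketch below illustrates the sparse part of such a pipeline in OpenCV, under several simplifying assumptions: only two views, a known intrinsic matrix K, ORB features instead of Harris corners or SIFT, and no dense interpolation step.

```python
import cv2
import numpy as np

def sparse_depth_from_two_views(img1, img2, K):
    """Two-view SFM sketch: match sparse features, estimate the relative
    camera pose, and triangulate a sparse point cloud. K is the 3x3 intrinsic
    matrix (assumed known); a dense depth map would still require
    interpolation, e.g., Delaunay triangulation as mentioned in the text."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # Relative pose from the essential matrix (RANSAC rejects bad matches).
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    # Triangulate matched points against the two camera projection matrices.
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([R, t])
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    depths = pts4d[2] / pts4d[3]  # Z coordinate of each sparse 3D point
    return pts1, depths           # sparse image points and their depths
```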

In conclusion, the advantage of using motion cues is that more than one frame is considered, so the depth maps of different frames have a certain consistency, for example, less flickering between consecutive depth maps. However, motion cues are not effective for static scenes or more complex movements. In addition, motion cues require more than one frame to obtain the depth information, which causes extra delay compared with other depth cues.

13.3.2  Focus/Defocus Cues

Depth-from-defocus (DFD) methods generate a depth map based on the amount of blurring present in the images. In a thin lens system, objects in focus are pictured sharply while objects at other distances are defocused, that is, blurred, as shown in Figure 13.3. The farther an object is from the focal plane, the more blurred it appears. There are two kinds of schemes for DFD. The first requires a pair of images of the same scene taken with different focus settings [9]. Provided the camera settings are known, the depth is recovered by estimating the degree of defocus. If the camera settings are unknown, more images are required to eliminate the ambiguity in the blur radius estimation. This approach is reliable and provides a good depth estimate. However, only a small amount of 2D video material satisfies the aforementioned conditions, that is, images taken from a fixed camera position and object position but with different focus settings. The second scheme extracts the blur information from a single image by measuring the amount of blur. In Ref. [28], a re-blurring scheme is proposed to recover the relative depth from a single defocused image captured by an uncalibrated camera. Firstly, it models a blurred edge in the image as a 2D Gaussian blur. Then, the input image is re-blurred using a known Gaussian blur kernel, and the gradient ratio between the input and the re-blurred image is calculated. Finally, from this ratio the amount of blur at edge locations can be derived, and blur propagation is solved as an optimization problem. Compared with SFM, DFD avoids matching (correspondence) ambiguity problems, and its complexity is lower. However, it cannot deal with blur already present in the original image, nor with textureless regions. Another disadvantage of DFD is that it is only feasible for small depth-of-field images and it generates low-precision depth maps.
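A minimal sketch of the re-blurring idea follows, assuming an 8-bit grayscale input and the single-Gaussian edge model described above; the propagation step that turns these sparse edge estimates into a dense relative depth map is omitted, and sigma0 is an assumed parameter.

```python
import cv2
import numpy as np

def edge_blur_estimate(image_gray, sigma0=1.0, eps=1e-6):
    """Re-blurring sketch: blur the input with a known Gaussian (sigma0),
    take the gradient-magnitude ratio R at edge pixels, and solve for the
    unknown edge blur sigma = sigma0 / sqrt(R^2 - 1). Expects an 8-bit
    grayscale image; propagation to a dense relative depth map is omitted."""
    img = image_gray.astype(np.float64)
    reblurred = cv2.GaussianBlur(img, (0, 0), sigma0)

    def grad_mag(x):
        gx = cv2.Sobel(x, cv2.CV_64F, 1, 0, ksize=3)
        gy = cv2.Sobel(x, cv2.CV_64F, 0, 1, ksize=3)
        return np.sqrt(gx ** 2 + gy ** 2)

    ratio = grad_mag(img) / (grad_mag(reblurred) + eps)
    edges = cv2.Canny(image_gray, 50, 150) > 0   # estimate blur only at edges
    sigma = np.zeros_like(img)
    valid = edges & (ratio > 1.0 + eps)          # ratio > 1 expected at true edges
    sigma[valid] = sigma0 / np.sqrt(ratio[valid] ** 2 - 1.0)
    return sigma   # larger sigma = more defocus = farther from the focal plane
```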


FIGURE 13.3
(See color insert.)
Focus and defocus for the thin lens.

Depth from focus, an alternative technique, requires a series of images of the scene taken at different focus levels, obtained by varying and registering the distance between the camera and the scene [23]. This requirement is not easy to meet in practice, and the technique is not suitable for existing monocular images or videos.

13.3.3  Geometric Cues

Geometric cues include the familiar size/height of objects in the image, vanishing objects, vanishing lines, and vanishing points. In particular, vanishing lines and vanishing points are often employed to estimate depth. Vanishing lines refer to the fact that parallel lines, such as railroad tracks, appear to converge with distance, eventually reaching a vanishing point at the horizon. The more the lines converge, the farther away they appear to be. Based on the slope and origin of the vanishing lines in the image plane, a set of gradient planes can be assigned, each corresponding to a single depth level. Pixels closer to the vanishing point are assigned a larger depth value, and the density of the gradient planes is also higher there. Figure 13.4 shows an example of vanishing lines in reality [1].

There are two steps to extract the vanishing point. Firstly, edge detection is employed to locate the predominant lines in the image. Then, the intersection point of each pair of detected lines is calculated. If the intersection points do not all coincide, a vanishing region covering most of the intersection points is defined. Figure 13.5 gives an example of detected vanishing points. The complexity of this kind of scheme depends mainly on the complexity of detecting the vanishing lines, which can be performed in real time. The problem with this method is that it can only be applied to a particular class of images containing geometric objects. In addition, the depth generated from vanishing lines is a qualitative one. It does not take the prior knowledge of different areas into consideration. Accordingly, points at the same distance from the vanishing point are assigned the same depth value, regardless of the semantic areas they belong to. In Figure 13.4, for example, the detected vanishing point will be at the convergence of the highway edges. Using the highway edges as vanishing lines, the same depth level will be assigned to the far mountain and the sky (which have the same gradient level relative to the vanishing point) without differentiating their semantics.
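A compact sketch of these two steps is given below, assuming a scene with strong straight lines: probabilistic Hough line detection, pairwise line intersections, the median intersection as a robust vanishing point, and a purely qualitative depth gradient toward it (larger values meaning farther, following the convention above). All thresholds are assumptions, and the semantic refinement of Section 13.4.1 is not included.

```python
import cv2
import numpy as np
from itertools import combinations

def vanishing_point_depth(image_bgr):
    """Two-step sketch: detect predominant lines, intersect them pairwise,
    take the median intersection as a robust vanishing point, and assign a
    qualitative depth that grows toward it."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=60, maxLineGap=10)
    points = []
    for (l1,), (l2,) in combinations(lines[:50], 2):   # assumes >= 2 lines found
        x1, y1, x2, y2 = map(float, l1)
        x3, y3, x4, y4 = map(float, l2)
        denom = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
        if abs(denom) < 1e-6:
            continue  # near-parallel pair: no stable intersection
        px = ((x1 * y2 - y1 * x2) * (x3 - x4) - (x1 - x2) * (x3 * y4 - y3 * x4)) / denom
        py = ((x1 * y2 - y1 * x2) * (y3 - y4) - (y1 - y2) * (x3 * y4 - y3 * x4)) / denom
        points.append((px, py))
    vx, vy = np.median(np.array(points), axis=0)       # robust vanishing point
    h, w = gray.shape
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt((xx - vx) ** 2 + (yy - vy) ** 2)
    # Pixels nearer the vanishing point get larger depth values (farther away).
    depth = cv2.normalize(dist.max() - dist, None, 0, 255, cv2.NORM_MINMAX)
    return depth.astype(np.uint8)
```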


FIGURE 13.4
An example of vanishing line and vanishing point.


FIGURE 13.5
An example of detected vanishing points. (From Battiato, S. et al., Depth map generation by image classification, in Proceedings of SPIE 5302, pp. 95–104, 2004.)

13.3.4  Atmospheric Scattering Cues

As light propagates through the atmosphere, its direction and power are altered by the diffusion of radiation by small particles in the atmosphere. This leads to the phenomenon called atmospheric scattering, also known as haze. Distant objects appear less distinct and more bluish than nearby objects. In Ref. [6], a physical model is presented to describe the relationship between the radiance of an image and the distance between the object and the viewer. The model used is

I = J e^{-\beta d} + A\left(1 - e^{-\beta d}\right)

(13.3)

Here, I is the observed pixel intensity, J is the scene radiance, A is the global atmospheric light, β is the scattering coefficient, and d denotes the depth. In most cases, J is unknown, β takes some experimental value, and A can be measured from the sky region in the image. In Ref. [10], a dark channel prior is introduced to provide a good estimate of J. It is given by

J^{\mathrm{dark}}(p) = \min_{c \in \{r,g,b\}} \left( \min_{q \in \Omega(p)} J^{c}(q) \right)

(13.4)

where

J^{c}(q) is the intensity of pixel q in color channel c of the scene radiance

Ω(p) represents a block region surrounding pixel p

c denotes a particular (RGB) color channel

The dark channel prior states that J^{\mathrm{dark}}(p) has a very low intensity value for an outdoor haze-free image [10]. Applying this prior to a hazy input image, J^{\mathrm{dark}}(p) can be assumed to be zero, and the depth-related term becomes [10]

e^{-\beta d} = 1 - \min_{c \in \{r,g,b\}} \left( \min_{q \in \Omega(p)} \frac{I^{c}(q)}{A^{c}} \right)

(13.5)

The atmospheric light A can be estimated from the most haze-opaque pixel [6], for example, the pixel with the highest intensity.

Using the atmospheric scattering cue can provide a good estimate of the depth map for an outdoor image, and the complexity of this algorithm depends on the estimation of J and A, which can meet real-time requirements. However, the estimation of A is not easy for an indoor image, which limits the applicability of atmospheric cues. In addition, there is generally no big difference between the observed scene I and the scene radiance J unless haze, fog, or smoke is strong in the capturing environment.
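A minimal sketch of Equations (13.4) and (13.5) on an 8-bit BGR image is given below; the patch size, the top-0.1% rule for estimating A, and the scattering coefficient beta are assumptions, and the transmission refinement used in Ref. [10] is omitted.

```python
import cv2
import numpy as np

def atmospheric_depth(image_bgr, patch=15, beta=1.0):
    """Estimate A from the brightest dark-channel pixels, compute the
    transmission e^{-beta*d} from the dark channel of I/A as in (13.5),
    and convert it to a relative depth d = -ln(t)/beta."""
    img = image_bgr.astype(np.float64) / 255.0
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (patch, patch))
    dark = cv2.erode(img.min(axis=2), kernel)      # min over channels, then over the patch
    # Atmospheric light A: mean colour of the most haze-opaque pixels.
    idx = np.argsort(dark.ravel())[-max(1, dark.size // 1000):]
    A = img.reshape(-1, 3)[idx].mean(axis=0)
    # Equation (13.5): transmission from the dark channel of the normalized image.
    transmission = 1.0 - cv2.erode((img / A).min(axis=2), kernel)
    transmission = np.clip(transmission, 0.05, 1.0)
    depth = -np.log(transmission) / beta           # relative depth, larger = farther
    return cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
```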

13.3.5  Other Cues

In the previous sections, we discussed the most important cues, but there are many other cues that can be exploited to estimate depth information. These cues are briefly summarized in this section. The gradual variation of surface shading in an image provides information about the shape of the objects in the image; these are referred to as shading cues. Shape-from-shading (SFS) reconstructs 3D shapes from intensity images using the relationship between surface geometry and image brightness [27]. However, SFS is a well-known ill-posed problem, just like SFM, in the sense that a solution may not exist or may not be unique without additional constraints.

Shape-from-texture tries to recover a 3D surface in a scene by analyzing the distortion of its texture as projected in an image [4]. The texture cue offers a good 3D impression because of two key ingredients: the distortion of individual texels and the rate of change of texel distortion across the texture region [23]. The latter is also known as the texture gradient. To recover the 3D surface from the texture variations in the image, the texture must be assumed to have some form of spatial homogeneity on the surface. Then, the texture variations are produced only by the projective geometry. In general, the output of shape-from-texture algorithms is a dense map of normal vectors. This is feasible for recovering the 3D shape under the assumption of a smooth textured surface. As a rule, the shape of a surface at any point is completely specified by the surface's orientation and curvatures. However, it is an under-constrained problem, and most algorithms are designed to tackle a specific group of textures, which limits their application.

Occlusion cues imply a depth relation between objects: an object that occludes another object is considered to be closer. In Ref. [15], specific points, such as T-junctions, are detected to infer the depth relationships between objects in a scene. The results in Ref. [15] indicate that occlusion cues are a good feature for relative depth ordering. However, in some cases they are insufficient to classify depth planes, and a lack of T-junctions may result in an incorrect depth order.

13.4  State-of-the-Art Scheme for 2D-to-3D Conversion

From the previous section, we know that each of the discussed cues is only suitable for a certain group of images. Generally, one particular cue alone does not produce a good depth map for all kinds of images or video sequences. Normally, effective algorithms fuse different cues together to recover the depth map. In this section, some classical 2D-to-3D conversion schemes are presented.

13.4.1  Image Classification Techniques

In Ref. [1], images are first under-segmented into color regions using the mean-shift technique [5]. Then, semantic regions are detected by comparing groups of columns of these regions with a set of typical region sequences for landscape images. The typical semantic regions are characterized as sky, farther mountain, far mountain, near mountain, land, and others. A qualitative depth map can be generated by assigning different depths to these semantic regions. After that, the image is further classified as landscape, outdoor with geometric elements, or indoor, and the vanishing lines and vanishing points are detected accordingly. For a landscape without geometric elements, the lowest point is located at the boundary between the land and the other regions, and the vanishing point is obtained from such a boundary point. For indoor scenes and outdoor scenes with geometric elements, an edge or line detection scheme is first employed to obtain the vanishing lines; then the intersections of the vanishing lines are located as vanishing points. After vanishing point detection, the gradient planes are generated and the depth is assigned based on the gradient. Finally, the qualitative and geometric depth maps are combined to provide a more accurate depth map. The whole scheme can be regarded as an approach using semantic information and geometric cues together. With limited computation, it can generate a relative depth map from a single image. However, the accuracy of this approach depends on the detection of the vanishing lines and points as well as the identification of the semantic regions. When the scene is more complex, the detection and identification become difficult, which degrades the performance of the depth extraction.

13.4.2  Bundle Optimization Techniques

In Refs. [24,25], a bundle optimization model exploiting color constancy and geometric coherence constraints is proposed for estimating the depth map. It solves problems such as image noise, textureless pixels, and occlusions in an implicit way through an energy minimization model.

The method is composed of four steps. In step 1, the camera parameters for each frame are obtained via epipolar geometry with SFM. In step 2, the disparity is initialized for each frame independently by minimizing an initial energy function that considers the color similarity of corresponding pixels and a spatial smoothness constraint in flat regions. To account for possible occlusions, a temporal selection method is used so that only frames in which the pixels are visible are selected for matching. In order to deal with textureless regions, mean-shift color segmentation [5] is used to generate segments for each frame, and this segmentation information is used during the initialization; each disparity segment is then modeled with a 3D plane. In step 3, the depth information is refined by minimizing the energy function (13.6) frame by frame.

E(\hat{D}; \hat{I}) = \sum_{t=1}^{n} \left( E_d(D_t; \hat{I}, \hat{D} \setminus D_t) + E_s(D_t) \right)

(13.6)

where D_t is the disparity map at time t. The variables with a hat are vectors that collect the corresponding variable over time, for example, \hat{D} = \{D_t \mid t = 1, \ldots, n\} and \hat{I} = \{I_t \mid t = 1, \ldots, n\}. The term E_d represents the color and geometric constraints, and E_s enforces spatial smoothness. This process is called bundle optimization. It is a typical energy minimization formulation that considers both the geometric constraints and spatial smoothness. Finally, in step 4, space–time fusion is employed to further reduce the noise in the depth. The fusion considers spatial continuity, temporal coherence, and the sparse feature correspondences (computed from SFM) together to provide a more accurate and stable depth map.
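The sketch below evaluates a greatly simplified, single-frame version of this energy under strong assumptions: rectified frames so that a disparity is a pure horizontal shift, a plain color-difference data term, and an L1 smoothness term. The actual bundle optimization instead warps pixels with the recovered camera geometry and minimizes the energy jointly over all frames.

```python
import numpy as np

def frame_energy(disparity, frame_t, other_frames, lam=1.0):
    """Toy per-frame version of Equation (13.6): a data term measuring colour
    constancy against other frames plus an L1 smoothness term on neighbouring
    disparities. Frames are assumed rectified, so a disparity acts as a pure
    horizontal shift."""
    h, w = disparity.shape
    yy, xx = np.mgrid[0:h, 0:w]
    data = 0.0
    for other in other_frames:
        xs = np.clip((xx + np.rint(disparity)).astype(int), 0, w - 1)
        diff = frame_t.astype(np.float64) - other[yy, xs].astype(np.float64)
        data += np.sum(np.linalg.norm(diff, axis=2))           # colour-constancy E_d
    smooth = (np.abs(np.diff(disparity, axis=0)).sum()
              + np.abs(np.diff(disparity, axis=1)).sum())       # spatial smoothness E_s
    return data + lam * smooth
```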

The good performance of this scheme is explained by its joint consideration of the different constraints. Since it takes the geometric coherence constraint across multiple frames into account and refines the results iteratively, the depth maps obtained over the sequence are stable. However, it also introduces a large delay because multiple frames are processed together. Therefore, its complexity is not suitable for real-time applications unless some simplifications are adopted. In addition, if there is not sufficient camera movement, the obtained depth is probably less accurate. Even when the segmentation information is employed, the depth values for textureless regions could still be incorrect due to the ambiguity of the depth inference.

13.4.3  Machine Learning Techniques

In Refs. [17,18], an MRF is used to model monocular cues and the relations between various parts of the image. For each small homogeneous patch in an image, the MRF is used to infer a set of plane parameters capturing both the 3D location and the 3D orientation of the patch. The MRF, trained via supervised learning, models depth cues and the relationships between different parts of the image. Since it does not make any explicit assumptions about the image structure, it can capture more details. Firstly, the input image is oversegmented into superpixels (patches). Then, image features are calculated for each superpixel. The features include texture-based summary statistic features as well as patch shape and location-based features. These schemes use 17 filters at 3 different spatial scales to obtain the features. Local features alone are not enough; hence, the method also attempts to capture global information by including features from neighboring patches as well as from neighbors at different scales. The model considers properties such as connected structure, coplanar structure, and colinearity. All these features are used to derive the plane parameters for each patch. The fractional depth errors between the estimated depth and the ground-truth depth are used to train the model. Since semantic context, such as sky or a grass field, has to be recovered by the machine learning to predict the depth map accurately, an enormous burden is put on the learning process. The parameter learning process is very complex; however, it can be performed offline. Apart from this step, obtaining the 524-dimensional features is also computationally intensive. This scheme creates 3D models that are both quantitatively accurate and visually pleasing. On 588 images downloaded from the Internet, it provides 64.9% qualitatively correct 3D structure. Since this scheme is designed for a single image, no motion information in the sequence is considered. If some scene structure is missed during the training phase, the results will be unreliable. In addition, if this scheme is applied to all frames in a sequence, there could be some flickering in the depth maps even though the depth of each frame may be good enough. A solution is to use temporal filtering to reduce the flickering.

13.4.4  Schemes Using Surrogate Depth Maps

Apart from schemes using the aforementioned cues, other simple algorithms can also be used to obtain a disparity/depth map for this application. In Refs. [21,22], one color component is adopted as a surrogate depth map, after some adjusting and scaling of the pixel values of that color component. This surrogate depth map provides the depth information needed for 2D-to-3D conversion. Specifically, the Cr chroma component of the YCbCr color space is used as the surrogate in Ref. [22] because this component has properties that make it suitable as a depth map. For example, skin has a strong Cr component and will therefore be placed in the foreground, while blue sky and green grass are placed in the background, which is appropriate for most scenes. This kind of technique can be implemented in real time and does not depend strongly on the presence of particular cues. However, its performance is affected by the color content of the image. Bright red regions and spots will be indiscriminately rendered at the front of the scene and are very visible, which provides wrong depth information and an annoying visual experience.
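A minimal sketch of a colour-based surrogate depth map in the spirit of Ref. [22] is shown below, assuming an 8-bit BGR input; the smoothing amount and the plain min–max rescaling are assumptions rather than the specific adjustments described in the original work.

```python
import cv2
import numpy as np

def cr_surrogate_depth(image_bgr):
    """Colour-based surrogate depth sketch: take the Cr chroma component,
    lightly smooth it, and rescale to 8 bits so that reddish regions (often
    foreground, e.g., skin) receive larger values than bluish/greenish
    regions (sky, grass)."""
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    cr = ycrcb[:, :, 1].astype(np.float64)
    cr = cv2.GaussianBlur(cr, (0, 0), 3)   # slightly suppress isolated red spots
    surrogate = cv2.normalize(cr, None, 0, 255, cv2.NORM_MINMAX)
    return surrogate.astype(np.uint8)
```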

13.5  Application of 2D-to-3D Conversion in 3D Video Communications

3DTV is one of the most important application cases for 2D-to-3D conversion. Since there is not yet enough 3D content, 2D-to-3D conversion is especially important. In fact, almost all 3DTV manufacturers, such as Philips, Sony, Samsung, and LG, provide a 2D-to-3D conversion function in their products. Besides 3DTV, 2D-to-3D conversion is also employed in 3D smartphones and 3D video games. 2D-to-3D conversion will play an important role in providing enough 3D content before the 3D era truly arrives. Moreover, since much existing 2D content cannot be recaptured in 3D, this technology will remain useful even when sufficient new 3D content is available.

As more and more 3D content becomes available, efficient 3D compression and transmission are required. In this context, an error-resilient 3D wireless communication system assisted by 2D-to-3D conversion is proposed here. For stereoscopic 3D video, the left and right views must be provided and encoded. Even when inter-view prediction is used, the bit rate is still quite high compared with 2D video coding. As an alternative, the 2D+Z format stores the texture frame and its corresponding depth map, which requires less than 10% additional bandwidth compared to the 2D video format. In addition, this format decouples the content creation and visualization processes. Most important of all, it offers flexibility and compatibility with existing production equipment and compression tools. With so many advantages, this format has been standardized by MPEG as an extension for 3D under ISO/IEC 23002-3:2007 [11]. Given the limited bandwidth of wireless networks, the 2D+Z format is a better candidate than the stereo view format, and we will focus on it in this chapter.

With 2D-to-3D conversion technology, the depth map coding efficiency and the error concealment ability can be improved, which is very important for the transmission of 3D content over wireless networks. In the 2D+Z format, the general input is a texture sequence and a depth sequence. This pair of sequences can be encoded separately with existing video codecs, such as MPEG-2 or H.264/AVC. Since our scheme only affects the handling of the depth information, the texture sequence will not be discussed further here. For the depth map encoding, motion prediction between the current depth map and previous depth maps is used to exploit the temporal redundancy. The coded stream is transmitted over the network. If packets containing one depth map are lost, errors occur and propagate to the following depth maps because of the motion prediction. A previous study [14] has shown that coding artifacts on depth data can dramatically influence the quality of the synthesized view. Hence, error robustness of the whole system is necessary. Generally, error-resilience tools such as flexible macroblock ordering and data partitioning can be used to mitigate this problem [12]. However, these tools degrade the compression performance. In the proposed communication system, the compression efficiency can be improved while the error robustness is also enhanced.

In Figure 13.6, the new 3D video communication system with 2D-to-3D conversion technology is shown. At the encoding end, the texture video frames are first encoded, and the reconstructed texture frames are used to extract an estimated depth map. Then, the estimated depth map is combined with the original depth map to provide more efficient depth map coding. With the estimated depth map, the current depth map X_n can select either the previously decoded depth map \hat{X}_{n-1} or its estimated depth map \tilde{X}_n as its prediction according to a rate-distortion criterion. From information theory, it is reasonable to expect that more efficient compression can be achieved, since conditioning on additional information cannot increase the entropy, that is, H(X_n \mid \hat{X}_{n-1}, \tilde{X}_n) \le H(X_n \mid \hat{X}_{n-1}). Here, X denotes the original depth map, \hat{X} the reconstructed (decoded) depth map, and \tilde{X} the depth map estimated from the texture images. We use the reconstructed texture to estimate the depth because it is available at both the encoder and the decoder.
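A hedged sketch of this per-block decision is given below; the rate values and the Lagrange multiplier are hypothetical placeholders, since a real encoder would measure the actual bits and use its own rate-distortion model.

```python
import numpy as np

def choose_depth_predictor(block_orig, block_prev_decoded, block_estimated,
                           lambda_rd=10.0, rate_prev=1, rate_est=1):
    """Per-block decision sketch: predict the current depth block from either
    the previously decoded depth map or the depth map estimated from the
    reconstructed texture, whichever minimises a simple D + lambda * R cost."""
    def rd_cost(pred, rate):
        distortion = np.sum((block_orig.astype(np.float64)
                             - pred.astype(np.float64)) ** 2)   # SSD distortion
        return distortion + lambda_rd * rate

    cost_prev = rd_cost(block_prev_decoded, rate_prev)
    cost_est = rd_cost(block_estimated, rate_est)
    return ("temporal", cost_prev) if cost_prev <= cost_est else ("estimated", cost_est)
```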


FIGURE 13.6
3D video communication system with 2D-to-3D conversion.

Most important of all, the estimated depth map can contribute to the error resilience and error concealment of the whole system. Generally, error robustness is in conflict with compression efficiency. However, with the help of the estimated depth map, both error robustness and compression efficiency can be achieved for the depth map coding. In fact, the estimated depth map is available at both the encoder and the decoder side, as shown in Figure 13.6. Hence, when a packet containing depth map \hat{X}_{n-1} is lost, the depth map \hat{X}_n that uses \hat{X}_{n-1} as prediction will be affected. However, if some macroblocks in depth map \hat{X}_n use the estimated depth map \tilde{X}_n as prediction, these macroblocks will not be affected, because their prediction, the estimated depth map, can still be obtained at the decoder. In addition, the lost content in depth map \hat{X}_{n-1} can be concealed using \tilde{X}_{n-1}, which helps to mitigate the error propagation to a certain extent. The only assumption here is that the encoded texture sequence can be obtained at the decoder. However, if the encoded texture sequence is lost, the depth map is useless because it will not be displayed. Hence, the dependence on the encoded texture sequence is reasonable.

Besides the 2D+Z format, 2D-to-3D conversion can also be applied to the MVD format to provide a more robust and efficient coding scheme. Moreover, there is research on novel prediction modes for depth map coding, such as edge-aware prediction [20]. Since depth maps typically consist of smooth regions separated by edges, schemes that can describe the edge information efficiently will reduce the bit rate and the artifacts caused by quantization. With the estimated depth map, some coarse edge information can be obtained. Together with the original depth map, this coarse information could be used to reduce the bit rate and represent the edge information efficiently.

It should be noted that the aforementioned applications rely on the correctness of the estimated depth map. The more accurate the estimated depth map is, the more compression gain and robustness it can provide for the aforementioned systems. In order to obtain more accurate estimated depth maps, some information from the original depth map could be used, such as the maximum and minimum depth values. This information could be transmitted as parameters if it is not available at the decoder end.

13.6  Conclusion and Future Trends

This chapter provides an overview of the main techniques for 2D-to-3D conversion and its application in 3D video communications. Currently, most 2D-to-3D conversion techniques focus on particular cues in the images/videos, which are not easy to generalize. Even though some algorithms combine several cues, they are often computationally intensive and still limited to certain kinds of images/videos. Hence, 2D-to-3D conversion remains a difficult task. To make it more effective, other information should be considered in future research. Such information includes the camera parameters, the capturing environment, the known context of the scene, video subtitles, and the sound accompanying the images/videos. Current 2D-to-3D conversion does not rely on this information; however, it could help improve performance and greatly reduce computation. For example, if the context of the scene is known to be a landscape, there is no need to classify the scene as outdoor or indoor. Especially for recently produced 2D material, most of the aforementioned information can be obtained. Hence, future work should focus on how to employ this information.

While 3D content is still scarce, 2D-to-3D conversion is very important for 3D content creation. With the development of 3D technology, there will be more and more 3D content. With such a large amount of data, efficient 3D video coding and transmission are required. In this case, 2D-to-3D conversion can be applied to reduce the bit rate and enhance the error resilience and error concealment of the whole 3D video communication system when the 2D+Z or MVD format is employed. Such a system is proposed in this chapter. How to extract the depth map with the help of the original depth map and how to exploit the estimated depth map in 3D video communications will be another hot topic in the future.

References

1.  Battiato S., S. Curti, M. La Cascia, M. Tortora, and E. Scordato. Depth map generation by image classification. In Proceedings of SPIE Electronic Imaging-Three-Dimensional Image Capture and Applications VI, San Jose, CA, pp.95–104, 2004.

2.  Boykov Y., O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, November 2001.

3.  Cao X., Z. Li, and Q. Dai. Semi-automatic 2D-to-3D conversion using disparity propagation. IEEE Transactions on Broadcasting, 57(2):491–499, June 2011.

4.  Clerc M. and S. Mallat. The texture gradient equation for recovering shape from texture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4):536–549, April 2002.

5.  Comaniciu D. and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5): 603–619, May 2002.

6.  Cozman F. and E. Krotkov. Depth from scattering. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, pp.801–806, June 1997.

7.  Fehn C. Depth-image-based rendering (DIBR), compression, and transmission for a new approach on 3d-tv. In Proceedings of SPIE Stereoscopic Displays and Virtual Reality Systems XI, San Jose, CA, pp.93–104, January 2004.

8.  Forsyth D.A. and J. Ponce. Computer Vision: A Modern Approach. Prentice Hall, Upper Saddle River, NJ, 2002.

9.  Hasinoff S. W. and K.N. Kutulakos. Confocal stereo. International Journal of Computer Vision, 81(1):82–104, January 2009.

10.  He K., J. Sun, and X. Tang. Single image haze removal using dark channel prior. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2341–2353, December 2011.

11.  ISO/IEC JTC1/SC29/WG11. Text of ISO/IEC FDIS 23002-3 Representation of Auxiliary Video and Supplemental Information. Doc. N8768, Marrakech, Morocco, January 2007.

12.  Kumar S., L. Xu, M. K. Mandal, and S. Panchanathan. Error resiliency schemes in h.264/avc standard. Journal of Visual Communication and Image Representation, 17(2):425–450, 2006.

13.  Li P., D. Farin, R.K. Gunnewiek, and P.H.N. de With. On creating depth maps from monoscopic video using structure from motion. In Proceedings of IEEE Workshop on Content Generation and Coding for 3D-television, Eindhoven, Netherlands, pp. 85–92, 2006.

14.  Merkle P., A. Smolic, K. Muller, and T. Wiegand. Multi-view video plus depth representation and coding. In IEEE International Conference on Image Processing (ICIP), San Antonio, TX, Vol.1, pp. I201–I204, October 2007.

15.  Palou G. and P. Salembier. Occlusion-based depth ordering on monocular images with binary partition tree. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague Congress Centre, Prague, Czech Republic, pp.1093–1096, May 2011.

16.  Pourazad M.T., P. Nasiopoulos, and R. K. Ward. Generating the depth map from the motion information of H.264-encoded 2D video sequence. Journal of Image and Video Processing, 2010:4:1–4:13, 2010.

17.  Saxena A., M. Sun, and A.Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):824–840, May 2009.

18.  Saxena A., S.H. Chung, and A. Y. Ng. 3-D depth reconstruction from a single still image. International Journal of Computer Vision, 76(1):53–69, 2008.

19.  Scharstein D. and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47: 7–42, April 2002.

20.  Shen G., W.-S. Kim, A. Ortega, J. Lee, and H. Wey. Edge-aware intra prediction for depth-map coding. In 17th IEEE International Conference on Image Processing (ICIP), The Hong Kong Convention and Exhibition Center, Hong Kong, pp.3393–3396, September 2010.

21.  Tam W.J. and C. Vazquez. Generation of a depth map from a monoscopic color image for rendering stereoscopic still and video images, United States Patent Application 12/060,978 Filed April 2, 2008.

22.  Tam W. J. and C. Vazquez. CRC-CSDM: 2D to 3D conversion using colour-based surrogate depth maps. In International Conference on 3D Systems and Applications (3DSA 2010), Tokyo, Japan, pp.1194–1205, 2010.

23.  Wei Q. Converting 2D to 3D: A survey. TU Delft, the Netherlands, 2005.

24.  Zhang G., J. Jia, T.-T. Wong, and H. Bao. Recovering consistent video depth maps via bundle optimization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Anchorage, AK, pp.1–8, June 2008.

25.  Zhang G., J. Jia, T.-T. Wong, and H. Bao. Consistent depth maps recovery from a video sequence. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31:974–988, June 2009.

26.  Zhang L., C. Vazquez, and S. Knorr. 3D-TV content creation: Automatic 2D-to-3D video conversion. IEEE Transactions on Broadcasting, 57(2):372–383, June 2011.

27.  Zhang R., P.-S. Tsai, J.E. Cryer, and M. Shah. Shape-from-shading: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(8):690–706, August 1999.

28.  Zhuo S. and T. Sim. On the recovery of depth from a single defocused image. In Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns, CAIP’09, Münster, Germany, pp.889–897, 2009.
