Chapter 2

Deep Learning for Multimodal Data Fusion

Asako Kanezaki, Ryohei Kuga, Yusuke Sugano, Yasuyuki Matsushita
National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
Graduate School of Information Science and Technology, Osaka University, Osaka, Japan

Abstract

Recent advances in deep learning have enabled realistic image-to-image translation of multimodal data. Along with this development, auto-encoders and generative adversarial networks (GANs) have been extended to deal with multimodal input and output. At the same time, multitask learning has been shown to efficiently and effectively address multiple mutually related recognition tasks. Various scene understanding tasks, such as semantic segmentation and depth prediction, can be viewed as cross-modal encoding / decoding, and hence most prior work uses multimodal (various types of input) datasets for multitask (various types of output) learning. Inter-modal commonalities, such as those across RGB images, depth, and semantic labels, are beginning to be exploited, although this line of study is still at an early stage. In this chapter, we introduce several state-of-the-art encoder–decoder methods for multimodal learning as well as a new approach to cross-modal networks. In particular, we detail multimodal encoder–decoder networks that harness the multimodal nature of multitask scene recognition. In addition to the shared latent representation among encoder–decoder pairs, the model also has shared skip connections from different encoders. By combining these two representation-sharing mechanisms, the model is shown to efficiently learn a shared feature representation among all modalities in the training data.

Keywords

Encoder–decoder networks; Semi-supervised learning; Semantic segmentation; Depth estimation

2.1 Introduction

Scene understanding is one of the most important tasks for various applications, including robotics and autonomous driving, and has long been an active research area in computer vision. The goal of scene understanding can be divided into several different tasks, such as depth reconstruction and semantic segmentation. Traditionally, these tasks have been studied independently, resulting in their own tailored methods. Recently, there has been a growing demand for a single unified framework that, unlike previous approaches, addresses multiple tasks at a time. By sharing a part of the learned estimator, such a multitask learning framework is expected to achieve better performance with a compact representation.

In most of the prior work, multitask learning is formulated with a motivation to train a shared feature representation among different tasks for efficient feature encoding [13]. Accordingly, in recent convolutional neural network (CNN)-based methods, multitask learning often employs an encoder–decoder network architecture [1,2,4]. If, for example, the target tasks are semantic segmentation and depth estimation from RGB images, multitask networks encode the input image to a shared low-dimensional feature representation and then estimate depth and semantic labels with two distinct decoder networks.

While such a shared encoder architecture can constrain the network to extract a common feature for different tasks, one limitation is that it cannot fully exploit the multimodal nature of the training dataset. The representation capability of the shared representation in the above example is not limited to image-to-label and image-to-depth conversion tasks, but it can also represent the common feature for all of the cross-modal conversion tasks such as depth-to-label as well as within-modal dimensionality reduction tasks such as image-to-image. By incorporating these additional conversion tasks during the training phase, the multitask network is expected to learn more efficient shared feature representation for the diverse target tasks.

In this chapter, we introduce a recent method named the multimodal encoder–decoder networks method [5] for multitask scene recognition. The model consists of encoders and decoders for each modality, and the whole network is trained in an end-to-end manner taking into account all conversion paths—both cross-modal encoder–decoder pairs and within-modal self-encoders. As illustrated in Fig. 2.1, all encoder–decoder pairs are connected via a single shared latent representation in the method. In addition, inspired by the U-net architecture [6,7], the decoders for pixel-wise image conversion tasks such as semantic segmentation also take a shared skipped representation from all encoders. Since the whole network is jointly trained using multitask losses, these two shared representations are trained to extract the common feature representation among all modalities. Unlike multimodal auto-encoders [1], this method can further utilize auxiliary unpaired data to train self-encoding paths and consequently improve the cross-modal conversion performance. In the experiments using two public datasets, we show that the multimodal encoder–decoder networks perform significantly better on cross-modal conversion tasks.

Figure 2.1 Overview of the multimodal encoder–decoder networks. The model takes data in multiple modalities, such as RGB images, depth, and semantic labels, as input, and generates multimodal outputs in a multitask learning framework.

The remainder of this chapter is organized as follows. In Sect. 2.2, we give an overview of various methods for multimodal data fusion. Next, we describe the basics of multimodal deep learning techniques in Sect. 2.3 and the latest work based on those techniques in Sect. 2.4. We then introduce the details of the multimodal encoder–decoder networks in Sect. 2.5. In Sect. 2.6, we show experimental results and discuss the performance of the multimodal encoder–decoder networks on several benchmark datasets. Finally, we conclude this chapter in Sect. 2.7.

2.2 Related Work

Multitask learning is motivated by the finding that the feature representation learned for one particular task can be useful for other tasks [8]. In prior work, multiple tasks, such as scene classification, semantic segmentation [9], character recognition [10], and depth estimation [11,12], have been addressed with a single RGB image as input, which is referred to as single-modal multitask learning. Hand et al. [13] demonstrated that multitask learning of gender and facial parts from one facial image leads to better accuracy than learning each task individually. Hoffman et al. [14] proposed a modality hallucination architecture based on CNNs, which boosts the performance of RGB object detection using depth information available only in the training phase. Teichmann et al. [15] presented neural networks for scene classification, object detection, and segmentation of street-view images. Uhrig et al. [16] proposed an instance-level segmentation method via simultaneous estimation of semantic labels, depth, and instance center direction. Li et al. [17] proposed fully convolutional neural networks for segmentation and saliency tasks. In these previous approaches, the feature representation of a single input modality is shared in an intermediate layer for solving multiple tasks. In contrast, the multimodal encoder–decoder networks [5] described in Sect. 2.5 fully utilize the multimodal training data by learning cross-modal shared representations through joint multitask training.

There have been several prior attempts to utilize multimodal inputs in deep neural networks. These methods use multimodal input data, such as RGB and depth images [18], visual and textual features [19], audio and video [2], and multiple sensor data [20], for single-task neural networks. In contrast to such multimodal single-task learning methods, relatively few studies have addressed multimodal and multitask learning. Ehrlich et al. [21] presented a method to identify a person's gender and whether the person is smiling based on two feature modalities extracted from face images. Cadena et al. [1] proposed auto-encoder-based neural networks for multitask estimation of semantic labels and depth.

Both the single-task and multitask learning methods with multimodal data focus on obtaining a better shared representation from multimodal data. Since straightforward concatenation of features extracted from different modalities often results in lower estimation accuracy, some prior methods tried to improve the shared representation by singular value decomposition [22], encoder–decoder networks [23], auto-encoders [1,2,24], and supervised mapping [25]. While the multimodal encoder–decoder networks are also based on the encoder–decoder approach, they employ the U-net architecture to further improve the learned shared representation, particularly in high-resolution convolutional layers.

Most prior works also assume that all modalities are available in both the training and test phases. One approach to dealing with missing modal data is zero-filling, which fills the missing elements of the input vector with zeros [2,1]. Although such approaches allow multimodal networks to handle missing modalities and cross-modal conversion tasks, it has not been fully discussed whether zero-filling can also be applied to recent CNN-based architectures. Sohn et al. [19] explicitly estimated missing modal data from the available modalities using deep neural networks. For a difficult task, such as semantic segmentation with many classes, the missing modal data is estimated inaccurately, which negatively affects the performance of the whole network. In the multimodal encoder–decoder networks, each encoder–decoder path works individually at test time even when some modalities are missing. Furthermore, the model can perform conversions between all modalities in the training set, and it can utilize single-modal data to improve within-modal self-encoding paths during training.

Recently, many image-to-image translation methods based on deep neural networks have been developed [7,26–32]. While these methods address image-to-image translation between two different modalities, StarGAN [33] was recently proposed to efficiently learn translation among more than two domains. The multimodal encoder–decoder networks are also applicable to translation among more than two modalities. We describe the details of these works in Sect. 2.4 and the basic methods behind them in Sect. 2.3.

2.3 Basics of Multimodal Deep Learning: VAEs and GANs

This section introduces the basics of multimodal deep learning for multimodal image translation. We first describe the auto-encoder, the most basic neural network consisting of an encoder and a decoder. Then we introduce an important extension of the auto-encoder named the variational auto-encoder (VAE) [34,35]. VAEs impose a standard normal distribution on the latent variables and thus are useful for generative modeling. Next, we describe the generative adversarial network (GAN) [36], which is the most well-known framework for learning deep neural networks for multimodal data generation. The concepts of VAEs and GANs have been combined in various ways to improve the latent-space distribution for image generation, e.g., in VAE-GAN [37], the adversarial auto-encoder (AAE) [38], and adversarial variational Bayes (AVB) [39], which are described later in this section. We also introduce adversarially learned inference (ALI) [40] and the bidirectional GAN (BiGAN) [41], which combine the GAN framework with the inference of latent representations.

2.3.1 Auto-Encoder

An auto-encoder is a neural network that consists of an encoder network and a decoder network (Fig. 2.2). Letting $x \in \mathbb{R}^d$ be an input variable, which is a concatenation of pixel values when dealing with images, an encoder maps it to a latent variable $z \in \mathbb{R}^r$, where $r$ is usually much smaller than $d$. A decoder maps $z$ to the output $x' \in \mathbb{R}^d$, which is the reconstruction of the input $x$. The encoder and decoder are trained so as to minimize a reconstruction error such as the following squared error:

$$\mathcal{L}_{\mathrm{AE}} = \mathbb{E}_{x}\!\left[\,\|x' - x\|^{2}\,\right]. \tag{2.1}$$

The purpose of an auto-encoder is typically dimensionality reduction or, in other words, unsupervised feature / representation learning. Recently, in addition to the encoding process, more attention has been given to the decoding process, which enables data generation from latent variables.

Figure 2.2 Architecture of Auto-encoder.
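
To make the encoder–decoder structure concrete, the following PyTorch sketch implements a fully connected auto-encoder trained with the squared reconstruction error of Eq. (2.1). The layer widths, the latent dimension, and the toy input batch are illustrative assumptions rather than a prescribed configuration.

```python
# Minimal auto-encoder sketch in PyTorch; all sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: maps x in R^d to a latent variable z in R^r (r << d).
        self.encoder = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        # Decoder: maps z back to a reconstruction x' in R^d.
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, input_dim))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = AutoEncoder()
x = torch.randn(16, 784)                      # toy batch of flattened inputs
x_rec = model(x)
loss = ((x_rec - x) ** 2).sum(dim=1).mean()   # squared reconstruction error, Eq. (2.1)
loss.backward()
```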

2.3.2 Variational Auto-Encoder (VAE)

The variational auto-encoder (VAE) [34,35] regards an auto-encoder as a generative model in which the data is generated from a conditional distribution $p(x|z)$. Letting $\phi$ and $\theta$ denote the parameters of the encoder and the decoder, respectively, the encoder is described as a recognition model $q_\phi(z|x)$ that approximates the intractable true posterior $p_\theta(z|x)$, whereas the decoder corresponds to the generative model $p_\theta(x|z)$. The marginal likelihood of an individual data point $x_i$ is written as follows:

$$\log p_\theta(x_i) = D_{\mathrm{KL}}\!\left(q_\phi(z|x_i)\,\|\,p_\theta(z|x_i)\right) + \mathcal{L}(\theta, \phi; x_i). \tag{2.2}$$

Here, $D_{\mathrm{KL}}$ stands for the Kullback–Leibler divergence. The second term in this equation is called the (variational) lower bound on the marginal likelihood of data point $i$, which can be written as

$$\mathcal{L}(\theta, \phi; x_i) = -D_{\mathrm{KL}}\!\left(q_\phi(z|x_i)\,\|\,p_\theta(z)\right) + \mathbb{E}_{z \sim q_\phi(z|x_i)}\!\left[\log p_\theta(x_i|z)\right]. \tag{2.3}$$

In the training process, the parameters $\phi$ and $\theta$ are optimized so as to minimize the total loss, i.e., the negated lower bound summed over all data points, $\sum_i -\mathcal{L}(\theta, \phi; x_i)$, which for a single data point can be written as

$$\mathcal{L}_{\mathrm{VAE}} = D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right) - \mathbb{E}_{z \sim q_\phi(z|x)}\!\left[\log p_\theta(x|z)\right]. \tag{2.4}$$

Unlike the original auto-encoder, which does not consider the distribution of the latent variables, VAE assumes a prior over the latent variables such as the centered isotropic multivariate Gaussian $p_\theta(z) = \mathcal{N}(z; 0, I)$. In this case, we can let the variational approximate posterior be a multivariate Gaussian with a diagonal covariance structure:

$$\log q_\phi(z|x_i) = \log \mathcal{N}\!\left(z;\, \mu_i,\, \sigma_i^2 I\right), \tag{2.5}$$

where the mean $\mu_i$ and the standard deviation $\sigma_i$ of the approximate posterior are the outputs of the encoder (Fig. 2.3). In practice, the latent variable $z_i$ for data point $i$ is calculated as follows:

$$z_i = \mu_i + \sigma_i \odot \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I). \tag{2.6}$$

Figure 2.3 Architecture of VAE [34,35].
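
The reparameterization of Eq. (2.6) and the loss of Eq. (2.4) can be sketched in PyTorch as follows. The network sizes and the squared-error (Gaussian) reconstruction term are assumptions made for illustration; they are not the only valid choice for $p_\theta(x|z)$.

```python
# Minimal VAE sketch with a Gaussian prior and posterior; sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.enc = nn.Linear(input_dim, 400)
        self.mu = nn.Linear(400, latent_dim)        # encoder outputs the mean mu_i ...
        self.logvar = nn.Linear(400, latent_dim)    # ... and log(sigma_i^2)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 400), nn.ReLU(),
                                 nn.Linear(400, input_dim))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                  # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps      # reparameterization, Eq. (2.6)
        return self.dec(z), mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    # Reconstruction term (squared error stands in for -E_q[log p(x|z)]) plus the
    # closed-form KL divergence to the N(0, I) prior, cf. Eq. (2.4).
    rec = F.mse_loss(x_rec, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

model = VAE()
x = torch.rand(16, 784)
x_rec, mu, logvar = model(x)
vae_loss(x, x_rec, mu, logvar).backward()
```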

2.3.3 Generative Adversarial Network (GAN)

The generative adversarial network (GAN) [36] is one of the most successful frameworks for data generation. It consists of two networks, a generator $G$ and a discriminator $D$ (shown in Fig. 2.4), which are jointly optimized in a competitive manner. Intuitively, the aim of the generator is to generate from a random noise vector $z$ a sample $x'$ that fools the discriminator, i.e., makes it believe $x'$ is real. The aim of the discriminator is to distinguish a fake sample $x'$ from a real sample $x$. They are simultaneously optimized via the following two-player minimax game:

$$\min_{G} \max_{D} \mathcal{L}_{\mathrm{GAN}}, \quad \text{where } \mathcal{L}_{\mathrm{GAN}} = \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\left(1 - D(G(z))\right)\right]. \tag{2.7}$$

From the perspective of the discriminator $D$, the objective function is a simple cross-entropy loss for a binary classification problem. The generator $G$ is trained so as to minimize $\log\left(1 - D(G(z))\right)$, where the gradients of the parameters of $G$ are back-propagated through the outputs of the (fixed) $D$. In spite of its simplicity, a GAN is able to train a reasonable generator that outputs realistic data samples.

Figure 2.4 Architecture of GAN [36].
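
A single alternating update of the two-player game in Eq. (2.7) might look as follows; the toy generator and discriminator, the learning rates, and the placeholder "real" batch are all assumptions made for illustration.

```python
# One alternating GAN update following Eq. (2.7); networks and data are toy stand-ins.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
eps = 1e-8                                   # numerical safety inside log()

x_real = torch.randn(64, 784)                # stands in for a batch of real data
z = torch.randn(64, 100)                     # random noise vectors

# Discriminator step: maximize log D(x) + log(1 - D(G(z))).
d_loss = -(torch.log(D(x_real) + eps).mean()
           + torch.log(1 - D(G(z).detach()) + eps).mean())
opt_d.zero_grad()
d_loss.backward()
opt_d.step()

# Generator step: minimize log(1 - D(G(z))), back-propagating through the (fixed) D.
g_loss = torch.log(1 - D(G(z)) + eps).mean()
opt_g.zero_grad()
g_loss.backward()
opt_g.step()
```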

The deep convolutional GAN (DCGAN) [42] was proposed to utilize GANs for generating natural images. The DCGAN generator consists of a series of four fractionally-strided convolutions that convert a 100-dimensional random vector (drawn from a uniform distribution) into a 64 × 64 pixel image. The main characteristics of the proposed CNN architecture are threefold. First, it uses the all-convolutional net [43], which replaces deterministic spatial pooling functions (such as max-pooling) with strided convolutions. Second, fully connected layers on top of convolutional features are eliminated. Finally, batch normalization [44], which normalizes the input to each unit to have zero mean and unit variance, is used to stabilize learning.

DCGAN is still limited in the resolution of the generated images. StackGAN [45] was proposed to generate high-resolution (e.g., 256 × 256) images by using two-stage GANs. The Stage-I generator in StackGAN generates low-resolution (64 × 64) images, which are input into the Stage-II generator that outputs high-resolution (256 × 256) images. The discriminators in Stage-I and Stage-II distinguish output images from real images of the corresponding resolutions, respectively. The more stable StackGAN++ [46] was proposed later, which consists of multiple generators and multiple discriminators arranged in a tree-like structure.

2.3.4 VAE-GAN

VAE-GAN [37] is a combination of VAE [34,35] (Sect. 2.3.2) and GAN [36] (Sect. 2.3.3) as shown in Fig. 2.5. The VAE-GAN model is trained with the following criterion:

$$\mathcal{L}_{\mathrm{VAE\text{-}GAN}} = \mathcal{L}_{\mathrm{VAE}} + \mathcal{L}_{\mathrm{GAN}}, \tag{2.8}$$

$$\mathcal{L}_{\mathrm{VAE}} = D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p_\theta(z)\right) - \mathbb{E}_{z \sim q_\phi(z|x)}\!\left[\log p_\theta\!\left(D_l(x)\,|\,z\right)\right], \tag{2.9}$$

where $\mathcal{L}_{\mathrm{GAN}}$ is the same as that in Eq. (2.7). Note that VAE-GAN replaces the element-wise error measure in VAE (i.e., the second term in Eq. (2.4)) by a similarity measure learned through the GAN discriminator (i.e., the second term in Eq. (2.9)), where $D_l(x)$ denotes the hidden representation of the $l$th layer of the discriminator. Here, a Gaussian observation model for $D_l(x)$ with mean $D_l(\tilde{x})$ and identity covariance is introduced, where $\tilde{x}$ is the sample decoded from $z$:

$$p_\theta\!\left(D_l(x)\,|\,z\right) = \mathcal{N}\!\left(D_l(x)\,\middle|\,D_l(\tilde{x}),\, I\right). \tag{2.10}$$

In this way, VAE-GAN improves the sharpness of output images, in comparison to VAE.

Figure 2.5 Architecture of VAE-GAN [37].

It has been observed that GANs can suffer from mode collapse, in which the generator learns to produce samples with extremely low variety. VAE-GAN mitigates this problem by introducing the prior distribution on $z$. The learned latent space is therefore continuous and is able to produce meaningful latent vectors, e.g., visual attribute vectors corresponding to eyeglasses and smiles in face images.
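
As a rough sketch of the idea behind Eqs. (2.8)–(2.10), the snippet below measures the reconstruction error in an intermediate feature space of the discriminator instead of in pixel space; the module sizes and the way the discriminator exposes its features are assumptions made for illustration.

```python
# VAE-GAN-style loss sketch: reconstruction is measured in discriminator feature space.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, input_dim=784):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(input_dim, 256), nn.LeakyReLU(0.2))
        self.classify = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, x):
        f = self.features(x)                  # intermediate representation D_l(x)
        return self.classify(f), f

enc_mu, enc_logvar = nn.Linear(784, 20), nn.Linear(784, 20)
dec = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 784))
D = Discriminator()

x = torch.rand(16, 784)
mu, logvar = enc_mu(x), enc_logvar(x)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized latent code
x_tilde = dec(z)                                          # sample decoded from z

p_real, f_real = D(x)
p_fake, f_fake = D(x_tilde)

kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
rec_feat = ((f_real - f_fake) ** 2).sum(dim=1).mean()     # Gaussian model of Eq. (2.10)
gan = -(torch.log(p_real + 1e-8).mean() + torch.log(1 - p_fake + 1e-8).mean())
loss = kl + rec_feat + gan                                # cf. Eqs. (2.8)-(2.9)
```

In the actual VAE-GAN training procedure, the individual terms are distributed differently across the encoder, decoder, and discriminator updates; the sketch only shows how the terms are formed.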

2.3.5 Adversarial Auto-Encoder (AAE)

The adversarial auto-encoder (AAE) [38] introduces the GAN framework into auto-encoders to perform variational inference. Similarly to VAE [34,35] (Sect. 2.3.2), AAE is a probabilistic auto-encoder that assumes a prior over the latent variables. Let $p(z)$ be the prior distribution we want to impose, $q(z|x)$ be the encoding distribution, $p(x|z)$ be the decoding distribution, and $p_{\mathrm{data}}(x)$ be the data distribution. The aggregated posterior distribution $q(z)$ is defined as follows:

$$q(z) = \int_x q(z|x)\, p_{\mathrm{data}}(x)\, dx. \tag{2.11}$$

The regularization of AAE matches the aggregated posterior $q(z)$ to an arbitrary prior $p(z)$. While the auto-encoder attempts to minimize the reconstruction error, an adversarial network guides $q(z)$ to match $p(z)$ (Fig. 2.6). The cost function of AAE can be written as

$$\mathcal{L}_{\mathrm{AAE}} = \mathcal{L}_{\mathrm{AE}} + \mathbb{E}_{q_\phi(z|x)}\!\left[\log D(z)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\left(1 - D(z)\right)\right], \tag{2.12}$$

where $\mathcal{L}_{\mathrm{AE}}$ represents the reconstruction error defined in Eq. (2.1).

Figure 2.6 Architecture of AAE [38].

In contrast to VAE, which uses a KL divergence penalty to impose a prior distribution on the latent variables, AAE uses an adversarial training procedure to encourage $q(z)$ to match $p(z)$. An important difference between VAEs and AAEs lies in the way the gradients are calculated. VAE approximates the gradients of the variational lower bound through the KL divergence by Monte Carlo sampling, which requires access to the exact functional form of the prior distribution. On the other hand, AAEs only need to be able to sample from the prior distribution in order to induce $q(z)$ to match $p(z)$. AAEs can therefore impose arbitrarily complicated distributions (e.g., a Swiss roll distribution) as well as black-box distributions.
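
The following sketch illustrates the AAE regularization: a small discriminator operating on latent codes pushes the aggregated posterior toward a prior that only needs to be sampled. The mixture-of-Gaussians prior and all layer sizes are arbitrary assumptions chosen for illustration.

```python
# AAE sketch: a latent-space discriminator pushes q(z) toward a sampleable prior p(z).
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 8))
decoder = nn.Sequential(nn.Linear(8, 256), nn.ReLU(), nn.Linear(256, 784))
d_latent = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

def sample_prior(n, dim=8):
    # Any distribution we can sample from works, even a black-box one; here a simple
    # two-component Gaussian mixture stands in for an arbitrary prior p(z).
    centers = torch.sign(torch.randn(n, dim)) * 2.0
    return centers + 0.5 * torch.randn(n, dim)

x = torch.rand(32, 784)
z_q = encoder(x)                       # samples from the aggregated posterior q(z)
z_p = sample_prior(32)                 # samples from the imposed prior p(z)

rec_loss = ((decoder(z_q) - x) ** 2).mean()                   # reconstruction term, Eq. (2.1)
d_loss = -(torch.log(d_latent(z_p) + 1e-8).mean()             # discriminator: prior = "real"
           + torch.log(1 - d_latent(z_q.detach()) + 1e-8).mean())
g_loss = -torch.log(d_latent(z_q) + 1e-8).mean()              # encoder tries to fool it
```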

2.3.6 Adversarial Variational Bayes (AVB)

Adversarial variational Bayes (AVB) [39] also uses adversarial training for VAE, which enables the use of arbitrarily complex inference models. As shown in Fig. 2.7, the inference model (i.e., the encoder) in AVB takes the noise $\epsilon$ as additional input. For its derivation, the optimization problem of VAE in Eq. (2.4) is rewritten as follows:

$$\max_\theta \max_\phi \mathbb{E}_{p_{\mathrm{data}}(x)}\!\left[-D_{\mathrm{KL}}\!\left(q_\phi(z|x)\,\|\,p(z)\right) + \mathbb{E}_{q_\phi(z|x)} \log p_\theta(x|z)\right] = \max_\theta \max_\phi \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}\!\left(\log p(z) - \log q_\phi(z|x) + \log p_\theta(x|z)\right). \tag{2.13}$$

This can be optimized by stochastic gradient descent (SGD) using the reparameterization trick [34,35] when an explicit $q_\phi(z|x)$, such as a centered isotropic multivariate Gaussian, is considered. For a black-box representation of $q_\phi(z|x)$, the following objective function using the discriminator $D(x, z)$ is considered:

$$\max_D \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)} \log \sigma\!\left(D(x, z)\right) + \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{p(z)} \log\left(1 - \sigma\!\left(D(x, z)\right)\right), \tag{2.14}$$

where $\sigma(t) \triangleq (1 + e^{-t})^{-1}$ denotes the sigmoid function. For $p_\theta(x|z)$ and $q_\phi(z|x)$ fixed, the optimal discriminator $D^*(x, z)$ according to Eq. (2.14) is given by

$$D^*(x, z) = \log q_\phi(z|x) - \log p(z). \tag{2.15}$$

The optimization objective in Eq. (2.13) is rewritten as

$$\max_\theta \max_\phi \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_\phi(z|x)}\!\left(-D^*(x, z) + \log p_\theta(x|z)\right). \tag{2.16}$$

Using the reparameterization trick [34,35], this can be rewritten with a suitable function $z_\phi(x, \epsilon)$ in the form

$$\max_\theta \max_\phi \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{\epsilon}\!\left(-D^*\!\left(x, z_\phi(x, \epsilon)\right) + \log p_\theta\!\left(x \mid z_\phi(x, \epsilon)\right)\right). \tag{2.17}$$

The gradients of Eq. (2.17) w.r.t. $\phi$ and $\theta$, as well as the gradient of Eq. (2.14) w.r.t. the parameters of $D$, are computed to apply SGD updates. Note that AAE [38] (Sect. 2.3.5) can be regarded as an approximation to AVB in which $D(x, z)$ is restricted to the class of functions that do not depend on $x$ (i.e., $D(x, z) \equiv D(z)$).

Figure 2.7 Architecture of AVB [39].

2.3.7 ALI and BiGAN

Adversarially learned inference (ALI) [40] and the bidirectional GAN (BiGAN) [41], which were proposed at almost the same time, have the same model structure, shown in Fig. 2.8. The model consists of two generators $G_z(x)$ and $G_x(z)$, which correspond to an encoder $q(z|x)$ and a decoder $p(x|z)$, respectively, and a discriminator $D(x, z)$. In contrast to the original GAN [36], which lacks the ability to infer $z$ from $x$, ALI and BiGAN can induce a latent variable mapping, similarly to an auto-encoder. Instead of the explicit reconstruction loop of an auto-encoder, a GAN-like adversarial training framework is used for training the networks. Owing to this, the learned features become insensitive to trivial factors of variation in the input. The objective of ALI/BiGAN is to match the two joint distributions $q(x, z) = q(x) q(z|x)$ and $p(x, z) = p(z) p(x|z)$ by optimizing the following function:

$$\min_{G_x, G_z} \max_{D} \mathcal{L}_{\mathrm{BiGAN}}, \quad \text{where } \mathcal{L}_{\mathrm{BiGAN}} = \mathbb{E}_{q(x)}\!\left[\log D\!\left(x, G_z(x)\right)\right] + \mathbb{E}_{p(z)}\!\left[\log\left(1 - D\!\left(G_x(z), z\right)\right)\right]. \tag{2.18}$$

Here, the discriminator network learns to discriminate between a sample pair $(x, z)$ drawn from $q(x, z)$ and one drawn from $p(x, z)$. The concept of using a bidirectional mapping is also important for image-to-image translation methods, which are described in the next section.

Figure 2.8 Architecture of ALI [40] and BiGAN [41].
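
A minimal sketch of the ALI/BiGAN objective in Eq. (2.18) is given below; the discriminator simply receives concatenated $(x, z)$ pairs, and all network shapes are illustrative assumptions.

```python
# ALI/BiGAN sketch: the discriminator sees (x, z) pairs from the two joint distributions.
import torch
import torch.nn as nn

G_z = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 50))   # encoder x -> z
G_x = nn.Sequential(nn.Linear(50, 256), nn.ReLU(), nn.Linear(256, 784))   # decoder z -> x
D = nn.Sequential(nn.Linear(784 + 50, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

x = torch.rand(32, 784)                    # x ~ q(x)
z = torch.randn(32, 50)                    # z ~ p(z)

pair_enc = torch.cat([x, G_z(x)], dim=1)   # (x, G_z(x)) ~ q(x, z)
pair_dec = torch.cat([G_x(z), z], dim=1)   # (G_x(z), z) ~ p(x, z)

# Value of the minimax objective in Eq. (2.18); D maximizes it, G_x and G_z minimize it.
value = (torch.log(D(pair_enc) + 1e-8).mean()
         + torch.log(1 - D(pair_dec) + 1e-8).mean())
```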

2.4 Multimodal Image-to-Image Translation Networks

Recent advances in deep neural networks have enabled realistic translation of multimodal images. The basic ideas of such techniques are derived from VAEs and GANs. In this section, we introduce the latest works on this topic: image-to-image translation between two different domains or modalities using deep neural networks. A few recent methods, such as StarGAN [33], can handle image-to-image translation among more than two domains; however, we do not address this topic in this chapter.

2.4.1 Pix2pix and Pix2pixHD

Pix2pix [7] is one of the earliest works on image-to-image translation using a conditional GAN (cGAN) framework. The purpose of this work is to learn a generator that takes an image x (e.g., an edge map) as input and outputs an image y in a different modality (e.g., a color photo). Whereas the generator in a GAN is only dependent on a random vector z, the one in a cGAN also depends on another observed variable, which is the input image x in this case. In the proposed setting in [7], the discriminator also depends on x. The architecture of pix2pix is shown in Fig. 2.9. The objective of the cGAN is written as follows:

$$\mathcal{L}_{\mathrm{cGAN}}(G, D) = \mathbb{E}_{x, y}\!\left[\log D(x, y)\right] + \mathbb{E}_{x, z}\!\left[\log\left(1 - D\!\left(x, G(x, z)\right)\right)\right]. \tag{2.19}$$

The above objective is then mixed with the $\ell_1$ distance between the ground truth and the generated image:

$$\mathcal{L}_{\ell_1}(G) = \mathbb{E}_{x, y, z}\!\left[\,\|y - G(x, z)\|_1\,\right]. \tag{2.20}$$

Note that images generated using the $\ell_1$ distance tend to be less blurry than those generated using the $\ell_2$ distance. The final objective of pix2pix is

$$\min_G \max_D \mathcal{L}_{\mathrm{pix2pix}}(G, D), \quad \text{where } \mathcal{L}_{\mathrm{pix2pix}}(G, D) = \mathcal{L}_{\mathrm{cGAN}}(G, D) + \lambda \mathcal{L}_{\ell_1}(G), \tag{2.21}$$

where λ controls the importance of the two terms.

Figure 2.9 Architecture of pix2pix [7].
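
A hedged sketch of the pix2pix objective in Eq. (2.21) is shown below. It assumes a generator `G` and a pair discriminator `D` are given as callables, drops the explicit noise input $z$ (which pix2pix realizes mainly through dropout), and uses the commonly adopted $-\log D$ form for the generator term instead of the literal $\log(1 - D)$ of Eq. (2.19); the weight `lam` is likewise an assumed value.

```python
# pix2pix-style objective sketch (Eq. (2.21)): conditional GAN loss plus an L1 term.
import torch
import torch.nn.functional as F

def pix2pix_losses(G, D, x, y, lam=100.0):
    """x: input-modality image, y: ground-truth target image.
    G(x) -> translated image; D(x, y) -> probability that y is a real match for x."""
    y_fake = G(x)
    # Discriminator: real pairs should score 1, generated pairs 0.
    d_loss = -(torch.log(D(x, y) + 1e-8).mean()
               + torch.log(1 - D(x, y_fake.detach()) + 1e-8).mean())
    # Generator: fool the discriminator while staying close to the ground truth in L1.
    g_loss = (-torch.log(D(x, y_fake) + 1e-8).mean()
              + lam * F.l1_loss(y_fake, y))
    return d_loss, g_loss
```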

The generator of pix2pix adopts the U-Net architecture [6], which is an encoder–decoder network with skip connections between intermediate layers (see Fig. 2.10). Because the U-Net transfers low-level information to high-level layers, it improves the quality (e.g., the edge sharpness) of the output images.

Figure 2.10 Architecture of U-Net [6].
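
The following tiny U-Net-style sketch shows the essential mechanism: low-level encoder features are concatenated with upsampled deep features in the decoder. The depth and channel counts are toy assumptions and do not reproduce the configuration of [6].

```python
# Minimal U-Net-style sketch: a skip connection concatenates low-level encoder features
# into the decoder so that high-resolution detail reaches the output.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # Decoder input = upsampled deep features (32 ch) + skip features (16 ch).
        self.dec1 = nn.Sequential(nn.Conv2d(32 + 16, 16, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(16, out_ch, 1)

    def forward(self, x):
        f1 = self.enc1(x)                 # high-resolution, low-level features (skip source)
        f2 = self.enc2(self.pool(f1))     # lower-resolution, higher-level features
        d1 = self.dec1(torch.cat([self.up(f2), f1], dim=1))  # skip connection
        return self.out(d1)

y = TinyUNet()(torch.randn(1, 3, 64, 64))   # output keeps the input spatial size
```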

The more recent work named pix2pixHD [32] generates 2048 × 1024 high-resolution images from semantic label maps using the pix2pix framework. It improves the pix2pix framework by using a coarse-to-fine generator, a multiscale discriminator architecture, and a robust adversarial learning objective function. The generator $G$ consists of two sub-networks $\{G_1, G_2\}$, where $G_1$ generates 1024 × 512 images and $G_2$ generates 2048 × 1024 images using the outputs of $G_1$. Multiscale discriminators $\{D_1, D_2, D_3\}$, which have an identical network structure but operate at different image scales, are used in a multitask learning setting. The objective of pix2pixHD is

$$\min_G \left(\left(\max_{D_1, D_2, D_3} \sum_{k=1,2,3} \mathcal{L}_{\mathrm{cGAN}}(G, D_k)\right) + \lambda \sum_{k=1,2,3} \mathcal{L}_{\mathrm{FM}}(G, D_k)\right), \tag{2.22}$$

where $\mathcal{L}_{\mathrm{FM}}(G, D_k)$ represents the newly proposed feature matching loss, which is the summation of the $\ell_1$ distances between the discriminator's $i$th-layer outputs for the real and the synthesized images.

2.4.2 CycleGAN, DiscoGAN, and DualGAN

CycleGAN [26] is a recently designed image-to-image translation framework. Suppose that $x$ and $y$ denote samples in a source domain $X$ and a target domain $Y$, respectively. Previous approaches such as pix2pix [7] (described in Sect. 2.4.1) only learn a mapping $G_y: X \rightarrow Y$, whereas CycleGAN also learns a mapping $G_x: Y \rightarrow X$. Here, two generators $\{G_y, G_x\}$ and two discriminators $\{D_y, D_x\}$ are jointly optimized (Fig. 2.11A). $G_y$ translates $x$ to $y'$, which is then fed into $G_x$ to generate $x'$ (Fig. 2.11B). In the same manner, $G_x$ translates $y$ to $x'$, which is then fed into $G_y$ to generate $y'$ (Fig. 2.11C). The $\ell_1$ distances between the input and the output are summed up to calculate the following cycle-consistency loss:

$$\mathcal{L}_{\mathrm{cyc}}(G_x, G_y) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\,\|x - G_x(G_y(x))\|_1\,\right] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\!\left[\,\|y - G_y(G_x(y))\|_1\,\right]. \tag{2.23}$$

The most important property of the cycle-consistency loss is that paired training data is unnecessary. Whereas Eq. (2.20) in the pix2pix loss requires $x$ and $y$ corresponding to the same sample, Eq. (2.23) only needs to compare the input and the output in the respective domains. The full objective of CycleGAN is

$$\begin{aligned} \mathcal{L}_{\mathrm{cycleGAN}}(G_x, G_y, D_x, D_y) ={}& \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\!\left[\log D_y(y)\right] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log\left(1 - D_y(G_y(x))\right)\right] \\ &+ \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D_x(x)\right] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\!\left[\log\left(1 - D_x(G_x(y))\right)\right] \\ &+ \lambda \mathcal{L}_{\mathrm{cyc}}(G_x, G_y). \end{aligned} \tag{2.24}$$

Figure 2.11 Architecture of CycleGAN [26], DiscoGAN [27], and DualGAN [28]. (A) two generators {Gy,Gx} and two discriminators {Dy,Dx} are jointly optimized. (B) forward cycle-consistency and (C) backward cycle-consistency.
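
The cycle-consistency term of Eq. (2.23) reduces to a few lines; the sketch below assumes the two generators are given as callables and that `x` and `y` are unpaired batches from the two domains.

```python
# Cycle-consistency loss sketch (Eq. (2.23)); no paired samples are required.
import torch.nn.functional as F

def cycle_consistency_loss(G_x, G_y, x, y):
    """G_y: X -> Y, G_x: Y -> X; x and y are unpaired batches from the two domains."""
    x_rec = G_x(G_y(x))        # forward cycle  x -> y' -> x'
    y_rec = G_y(G_x(y))        # backward cycle y -> x' -> y'
    return F.l1_loss(x_rec, x) + F.l1_loss(y_rec, y)
```

This term is then added to the two adversarial losses with weight λ, as in Eq. (2.24).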

DiscoGAN [27] and DualGAN [28] were proposed in the same period as CycleGAN. The architectures of these methods are identical to that of CycleGAN (Fig. 2.11). DiscoGAN uses the $\ell_2$ loss instead of the $\ell_1$ loss for the cycle-consistency term. DualGAN replaces the sigmoid cross-entropy loss used in the original GAN with the Wasserstein GAN (WGAN) loss [47].

2.4.3 CoGAN

A coupled generative adversarial network (CoGAN) [29] was proposed to learn a joint distribution of multidomain images. CoGAN consists of a pair of GANs, each of which synthesizes an image in one domain. These two GANs share a subset of parameters as shown in Fig. 2.12. The proposed model is based on the idea that a pair of corresponding images in two domains share the same high-level concepts. The objective of CoGAN is a simple combination of the two GANs:

$$\begin{aligned} \mathcal{L}_{\mathrm{CoGAN}}(G_1, G_2, D_1, D_2) ={}& \mathbb{E}_{x_2 \sim p_{\mathrm{data}}(x_2)}\!\left[\log D_2(x_2)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\left(1 - D_2(G_2(z))\right)\right] \\ &+ \mathbb{E}_{x_1 \sim p_{\mathrm{data}}(x_1)}\!\left[\log D_1(x_1)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log\left(1 - D_1(G_1(z))\right)\right]. \end{aligned} \tag{2.25}$$

Similarly to the GAN frameworks introduced in Sect. 2.4.2, CoGAN does not require pairs of corresponding images as supervision.

Figure 2.12 Architecture of CoGAN [29].

2.4.4 UNIT

Unsupervised image-to-image translation (UNIT) networks [30] combine the VAE-GAN model [37] (Sect. 2.3.4) and the CoGAN model [29] (Sect. 2.4.3), as shown in Fig. 2.13. UNIT also does not require pairs of corresponding images as supervision. The sets of networks $\{E_1, G_1, D_1\}$ and $\{E_2, G_2, D_2\}$ correspond to VAE-GANs, whereas the set of networks $\{G_1, G_2, D_1, D_2\}$ corresponds to CoGAN. Because the two VAEs share the weights of the last few layers of the encoders as well as the first few layers of the decoders (i.e., generators), an image $x_1$ of one domain can be translated into an image $x_2$ in the other domain through $\{E_1, G_2\}$, and vice versa. Note that the weight-sharing constraint alone does not guarantee that corresponding images in the two domains will have the same latent code $z$. It was shown, however, that a pair of corresponding images in the two domains can be mapped by $E_1$ and $E_2$ to a common latent code, which is then mapped by $G_1$ and $G_2$ to a pair of corresponding images in the two domains. This benefits from the cycle-consistency constraint described below.

Figure 2.13 Architecture of UNIT [30].

The learning problems of the VAEs and GANs for the image reconstruction streams, the image translation streams, and the cycle-reconstruction streams are jointly solved:

$$\min_{E_1, E_2, G_1, G_2} \max_{D_1, D_2} \mathcal{L}_{\mathrm{UNIT}}, \tag{2.26}$$

where

$$\begin{aligned} \mathcal{L}_{\mathrm{UNIT}} ={}& \lambda_1 \mathcal{L}_{\mathrm{VAE}}(E_1, G_1) + \lambda_2 \mathcal{L}_{\mathrm{GAN}}(E_2, G_1, D_1) + \lambda_3 \mathcal{L}_{\mathrm{CC}}(E_1, G_1, E_2, G_2) \\ &+ \lambda_1 \mathcal{L}_{\mathrm{VAE}}(E_2, G_2) + \lambda_2 \mathcal{L}_{\mathrm{GAN}}(E_1, G_2, D_2) + \lambda_3 \mathcal{L}_{\mathrm{CC}}(E_2, G_2, E_1, G_1). \end{aligned} \tag{2.27}$$

Here, $\mathcal{L}_{\mathrm{VAE}}(\cdot)$ and $\mathcal{L}_{\mathrm{GAN}}(\cdot)$ are the same as those in Eq. (2.4) and Eq. (2.7), respectively. $\mathcal{L}_{\mathrm{CC}}(\cdot)$ represents the cycle-consistency constraint given by the following VAE-like objective function:

$$\mathcal{L}_{\mathrm{CC}}(E_a, G_a, E_b, G_b) = D_{\mathrm{KL}}\!\left(q_a(z_a|x_a)\,\|\,p_\theta(z)\right) + D_{\mathrm{KL}}\!\left(q_b(z_b|x_a^{a \to b})\,\|\,p_\theta(z)\right) - \mathbb{E}_{z_b \sim q_b(z_b|x_a^{a \to b})}\!\left[\log p_{G_a}(x_a|z_b)\right], \tag{2.28}$$

where $x_a^{a \to b}$ denotes the image $x_a$ translated into domain $b$. The third term in the above function ensures that a twice-translated image resembles the input image, whereas the KL terms penalize the latent variable $z$ for deviating from the prior distribution $p_\theta(z) = \mathcal{N}(z; 0, I)$ in the cycle-reconstruction stream.

2.4.5 Triangle GAN

The triangle GAN (Δ-GAN) [31] was developed for semi-supervised cross-domain distribution matching. This framework requires only a few paired samples in the two different domains as supervision. Δ-GAN consists of two generators $\{G_x, G_y\}$ and two discriminators $\{D_1, D_2\}$, as shown in Fig. 2.14. Suppose that $x'$ and $y'$ represent the images translated from $y$ and $x$ using $G_x$ and $G_y$, respectively. The fake data pair $(x', y)$ is sampled from the joint distribution $p_x(x, y) = p_x(x|y)p(y)$, and vice versa. The objective of Δ-GAN is to match the three joint distributions $p(x, y)$, $p_x(x, y)$, and $p_y(x, y)$. If this is achieved, the learned bidirectional mappings $p_x(x|y)$ and $p_y(y|x)$ are guaranteed to generate fake data pairs $(x', y)$ and $(x, y')$ that are indistinguishable from the true data pair $(x, y)$.

Figure 2.14 Architecture of Δ-GAN [31].

The objective function of Δ-GAN is given by

$$\begin{aligned} \min_{G_x, G_y} \max_{D_1, D_2}\;& \mathbb{E}_{(x, y) \sim p(x, y)}\!\left[\log D_1(x, y)\right] + \mathbb{E}_{y \sim p(y),\, x' \sim p_x(x|y)}\!\left[\log\left(\left(1 - D_1(x', y)\right) D_2(x', y)\right)\right] \\ &+ \mathbb{E}_{x \sim p(x),\, y' \sim p_y(y|x)}\!\left[\log\left(\left(1 - D_1(x, y')\right)\left(1 - D_2(x, y')\right)\right)\right]. \end{aligned} \tag{2.29}$$

The discriminator $D_1$ determines whether a sample pair is drawn from the true data distribution $p(x, y)$ or not. If it is not from $p(x, y)$, $D_2$ is used to determine whether the sample pair is from $p_x(x, y)$ or $p_y(x, y)$.

Δ-GAN can be regarded as a combination of cGAN and BiGAN [41]/ALI [40] (Sect. 2.3.7). Eq. (2.29) can be rewritten as

$$\min_{G_x, G_y} \max_{D_1, D_2} \mathcal{L}_{\Delta\text{-GAN}}, \tag{2.30}$$

where

$$\begin{aligned} \mathcal{L}_{\Delta\text{-GAN}} ={}& \mathcal{L}_{\mathrm{cGAN}} + \mathcal{L}_{\mathrm{BiGAN}}, \\ \mathcal{L}_{\mathrm{cGAN}} ={}& \mathbb{E}_{p(x, y)}\!\left[\log D_1(x, y)\right] + \mathbb{E}_{p_x(x, y)}\!\left[\log\left(1 - D_1(x, y)\right)\right] + \mathbb{E}_{p_y(x, y)}\!\left[\log\left(1 - D_1(x, y)\right)\right], \\ \mathcal{L}_{\mathrm{BiGAN}} ={}& \mathbb{E}_{p_x(x, y)}\!\left[\log D_2(x, y)\right] + \mathbb{E}_{p_y(x, y)}\!\left[\log\left(1 - D_2(x, y)\right)\right]. \end{aligned}$$


2.5 Multimodal Encoder–Decoder Networks

The architecture of the multimodal encoder–decoder networks [5] is illustrated in Fig. 2.15. To exploit the commonality among different tasks, all encoder / decoder pairs are connected with each other via the shared latent representation. In addition, if the decoding task is expected to benefit from high-resolution representations, the decoder is further connected with all encoders via shared skip connections in a similar manner to the U-net architecture [6]. Given one input modality, the encoder generates a single representation, which is then decoded through different decoders into all available modalities. The whole network is trained by taking into account all combinations of the conversion tasks among different modalities.

Figure 2.15 Model architecture of the multimodal encoder–decoder networks. This model consists of encoder–decoder networks with the shared latent representation. Depending on the task, a decoder also employs the U-net architecture and is connected with all encoders via shared skip connections. The network consists of Conv+Norm+ReLU modules except for the final layer, which is equivalent to Conv. For experiments described in Sect. 2.6, we use kernel size 3 × 3 with stride 1 and padding size 1 for all convolution layers, and kernel size 2 × 2 and stride 2 for max-pooling.

In what follows, we discuss the details of the architecture of the multimodal encoder–decoder networks for the task of depth and semantic label estimation from RGB images assuming a training dataset consisting of three modalities: RGB images, depth maps, and semantic labels. In this example, semantic segmentation is the task where the advantage of the skip connections has been already shown [7], while such high-resolution representations are not always beneficial to depth and RGB image decoding tasks. It is also worth noting that the task and the number of modalities are not limited to this particular example. More encoders and decoders can be added to the model, and the decoder can be trained with different tasks and loss functions within this framework.

2.5.1 Model Architecture

As illustrated in Fig. 2.15, each convolution layer (Conv) in the encoder is followed by a batch-normalization layer (Norm) and an activation function (ReLU). Two max-pooling operations are placed in the middle of the seven Conv+Norm+ReLU components, which reduces the dimension of the latent representation to 1/16 of the input. Similarly, the decoder network consists of seven Conv+Norm+ReLU components except for the final layer, while max-pooling operations are replaced by un-pooling operations for expanding a feature map. The max-pooling operation pools a feature map by taking its maximum values, and the un-pooling operation restores the pooled feature into an un-pooled feature map using switches in which the locations of the maxima are recorded. The final output of the decoder is then rescaled to the original input size. For a classification task, the rescaled output may be further fed into a softmax layer to yield a class probability distribution.

As discussed earlier, all encoder / decoder pairs are connected with the shared latent representation. Let $x_r$, $x_s$, and $x_d$ be the input modalities for each encoder, namely the RGB image, semantic label, and depth map, and let $E_r$, $E_s$, and $E_d$ be the corresponding encoder functions. The output of each encoder, i.e., the latent representation, is then defined by $r \in \{E_r(x_r), E_s(x_s), E_d(x_d)\}$. Here, the dimension of $r$ is $C \times H \times W$, where $C$, $H$, and $W$ are the number of channels, the height, and the width, respectively. Because the intermediate outputs $r$ from all the encoders $E \in \{E_r, E_s, E_d\}$ have the same shape $C \times H \times W$ owing to the convolution and pooling operations, we can obtain the output $y \in \{D_r(r), D_s(r), D_d(r)\}$ from any $r$, where $D_r$, $D_s$, and $D_d$ are the decoder functions. The latent representations between encoders and decoders are not distinguished among different modalities, i.e., the latent representation encoded by an encoder is fed into all decoders, and at the same time, each decoder has to be able to decode the latent representation from any of the encoders. In other words, the latent representation is shared by all combinations of encoder / decoder pairs.

For semantic segmentation, the U-net architecture with skip paths is also employed to propagate intermediate low-level features from the encoders to the decoders. Low-level feature maps in the encoder are concatenated with feature maps generated from the latent representation and then convolved in order to mix the features. Since we use 3 × 3 convolution kernels with 2 × 2 max-pooling operators for the encoder and 3 × 3 convolution kernels with 2 × 2 un-pooling operators for the decoder, the encoder and decoder networks are symmetric (U-shaped). The introduced model has skip paths among all combinations of the encoders and decoders, and it also shares the low-level features in a similar manner to the latent representation.
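
A condensed sketch of this design is given below. It keeps one convolutional stage per block and a single pooling level, and it attaches the shared skip features only to the label decoder; all channel counts and the modality set are illustrative assumptions and do not reproduce the exact architecture of Fig. 2.15 or [5].

```python
# Sketch of the shared-latent multimodal encoder-decoder design; sizes are illustrative.
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU())

class MultimodalEncDec(nn.Module):
    def __init__(self, channels=None):
        super().__init__()
        channels = channels or {"rgb": 3, "depth": 1, "label": 12}
        self.pool, self.up = nn.MaxPool2d(2), nn.Upsample(scale_factor=2)
        # One encoder per modality; every encoder yields a latent map of identical shape,
        # so the shared latent representation can be fed to any decoder.
        self.enc_lo = nn.ModuleDict({m: block(c, 32) for m, c in channels.items()})
        self.enc_hi = nn.ModuleDict({m: block(32, 64) for m in channels})
        # RGB and depth decoders read only the shared latent representation.
        self.dec = nn.ModuleDict({m: nn.Sequential(block(64, 32), nn.Conv2d(32, c, 1))
                                  for m, c in channels.items() if m != "label"})
        # The label decoder additionally consumes the shared skip features (U-net style).
        self.dec_label = nn.Sequential(block(64 + 32, 32),
                                       nn.Conv2d(32, channels["label"], 1))

    def forward(self, x, modality):
        skip = self.enc_lo[modality](x)                    # full-resolution skip features
        latent = self.enc_hi[modality](self.pool(skip))    # shared latent representation
        outputs = {m: self.up(d(latent)) for m, d in self.dec.items()}
        outputs["label"] = self.dec_label(torch.cat([self.up(latent), skip], dim=1))
        return outputs

net = MultimodalEncDec()
outs = net(torch.randn(2, 3, 96, 96), "rgb")   # decode an RGB input into all modalities
```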

2.5.2 Multitask Training

In the training phase, a batch of training data is passed through all forwarding paths for calculating losses. For example, given a batch of paired RGB image, depth, and semantic label images, three decoding losses from the RGB image / depth / label decoders are first computed for the RGB image encoder. The same procedure is then repeated for depth and label encoders, and the global loss is defined as the sum of all decoding losses from nine encoder–decoder pairs, i.e., three input modalities for three tasks. The gradients for the whole network are computed based on the global loss by back-propagation. If the training batch contains unpaired data, only within-modal self-encoding losses are computed. In the following, we describe details of the cost functions defined for decoders of RGB images, semantic labels, and depth maps.

RGB images. For RGB image decoding, we define the loss $\mathcal{L}_I$ as the $\ell_1$ distance of RGB values:

LI=1NxPr(x)fr(x)1,

Image(2.31)

where $r(x) \in \mathbb{R}_+^3$ and $f_r(x) \in \mathbb{R}^3$ are the ground-truth and predicted RGB values, respectively. If the goal of the network is realistic RGB image generation from depth and label images, the RGB image decoding loss may be further extended to DCGAN-based architectures [42].

Semantic labels. In this chapter, we define a label image as a map in which each pixel has a one-hot vector that represents the class the pixel belongs to. The number of input and output channels is thus equal to the number of classes. We define the loss function of semantic label decoding by the pixel-level cross-entropy. Letting $K$ be the number of classes, the softmax function is written as

$$p^{(k)}(x) = \frac{\exp\!\left(f_s^{(k)}(x)\right)}{\sum_{i=1}^{K} \exp\!\left(f_s^{(i)}(x)\right)}, \tag{2.32}$$

where $f_s^{(k)}(x) \in \mathbb{R}$ indicates the value at location $x$ in the $k$th channel ($k \in \{1, \ldots, K\}$) of the tensor given by the final layer output. Letting $P$ be the whole set of pixels in the output and $N$ be the number of pixels, the loss function $\mathcal{L}_s$ is defined as

$$\mathcal{L}_s = -\frac{1}{N} \sum_{x \in P} \sum_{k=1}^{K} t_k(x) \log p^{(k)}(x), \tag{2.33}$$

where $t_k(x) \in \{0, 1\}$ is the $k$th channel of the ground-truth label, which is one if the pixel belongs to the $k$th class and zero otherwise.

Depth maps. For the depth decoder, we use the $\ell_1$ distance between the ground-truth and predicted depth maps. The loss function $\mathcal{L}_d$ is defined as

$$\mathcal{L}_d = \frac{1}{N} \sum_{x \in P} \left|d(x) - f_d(x)\right|, \tag{2.34}$$

where $d(x) \in \mathbb{R}_+$ and $f_d(x) \in \mathbb{R}$ are the ground-truth and predicted depth values, respectively. We normalize the depth values so that they range in $[0, 255]$ by linear scaling. In the evaluation step, we revert the normalized depth map to the original scale.
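
A hedged sketch of these decoder losses and of the global multitask loss of Sect. 2.5.2 (the sum over all encoder–decoder pairs) is shown below. It reuses the `MultimodalEncDec` sketch from Sect. 2.5.1, assumes that labels are stored as per-pixel class indices (converted to one-hot maps only when they serve as encoder input), and uses 12 classes as an example.

```python
# Sketch of the decoder losses (Eqs. (2.31)-(2.34)) and the global multitask loss
# summed over all encoder-decoder pairs; `net` is the MultimodalEncDec sketch above.
import torch
import torch.nn.functional as F

def decoding_loss(modality, pred, target):
    if modality == "label":
        # Pixel-wise cross-entropy, Eq. (2.33); target holds a class index per pixel.
        return F.cross_entropy(pred, target)
    # L1 losses for RGB (Eq. (2.31)) and depth (Eq. (2.34)).
    return F.l1_loss(pred, target)

def global_multitask_loss(net, batch):
    """batch: dict with 'rgb' (B,3,H,W), 'depth' (B,1,H,W), and 'label' (B,H,W) tensors."""
    targets = {"rgb": batch["rgb"], "depth": batch["depth"], "label": batch["label"]}
    inputs = {"rgb": batch["rgb"], "depth": batch["depth"],
              # the label encoder takes the one-hot label map as input
              "label": F.one_hot(batch["label"], num_classes=12).permute(0, 3, 1, 2).float()}
    total = 0.0
    for in_mod, x in inputs.items():                 # every modality as input ...
        outputs = net(x, in_mod)
        for out_mod, pred in outputs.items():        # ... decoded into every modality
            total = total + decoding_loss(out_mod, pred, targets[out_mod])
    return total

batch = {"rgb": torch.rand(2, 3, 96, 96),
         "depth": torch.rand(2, 1, 96, 96),
         "label": torch.randint(0, 12, (2, 96, 96))}
loss = global_multitask_loss(net, batch)             # nine decoding terms in total
loss.backward()
```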

2.5.3 Implementation Details

In the multimodal encoder–decoder networks, the learnable parameters are initialized from a normal distribution. We set the learning rate to 0.001 and the momentum to 0.9 for all layers, with a weight decay of 0.0005. The input image size is fixed to 96 × 96. We use both paired and unpaired data as training data, which are randomly mixed in every epoch. When a triplet of the RGB image, semantic label, and depth map of a scene is available, training begins with the RGB image as input and the corresponding semantic label, depth map, and the RGB image itself as outputs. In the next step, the semantic label is used as input, and the other modalities as well as the semantic label itself are used as outputs for computing individual losses. These steps are repeated over all combinations of modalities and input / output. A loss is calculated for each combination and used for updating the parameters. When the triplet of training data is unavailable (for example, when the dataset contains extra unlabeled RGB images), the network is updated only with the reconstruction loss in a similar manner to auto-encoders. We train the network with mini-batches, each of which includes at least one RGB image paired with either semantic labels or depth maps when available, because such paired data is empirically found to contribute to the convergence of training.
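
A minimal training-schedule sketch under these settings might look as follows; it reuses the `net`, `global_multitask_loss`, and `decoding_loss` sketches from Sects. 2.5.1–2.5.2, and the batch dictionaries are placeholders for a real data loader.

```python
# Training-step sketch for Sect. 2.5.3: paired triplets update all conversion paths,
# while extra unpaired RGB images update only the RGB self-encoding path.
import torch

optimizer = torch.optim.SGD(net.parameters(), lr=0.001, momentum=0.9, weight_decay=0.0005)

def training_step(batch):
    optimizer.zero_grad()
    if "label" in batch and "depth" in batch:        # paired triplet available
        loss = global_multitask_loss(net, batch)
    else:                                            # unpaired RGB image: self-encoding only
        rgb = batch["rgb"]
        loss = decoding_loss("rgb", net(rgb, "rgb")["rgb"], rgb)
    loss.backward()
    optimizer.step()
    return loss.item()
```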

2.6 Experiments

In this section, we evaluate the multimodal encoder–decoder networks [5] for semantic segmentation and depth estimation using two public datasets: NYUDv2 [48] and Cityscape [49]. The baseline models are the single-task encoder–decoder network (enc-dec) and the single-modal (RGB image) multitask encoder–decoder network (enc-decs), which have the same architecture as the multimodal encoder–decoder networks. We also compare the multimodal encoder–decoder networks to the multimodal auto-encoder (MAE) [1], which concatenates the latent representations of auto-encoders for different modalities. Since the shared representation in MAE is the concatenation of the latent representations of all modalities, zero-filled pseudo signals must be explicitly input to estimate the missing modalities. Also, MAE uses fully connected layers instead of convolutional layers, so input images are flattened when fed into the first layer.

For semantic segmentation, we use the mean intersection over union (MIOU) scores for the evaluation. IOU is defined as

$$\mathrm{IOU} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \tag{2.35}$$

where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively, determined over the whole test set. MIOU is the mean of the IOU over all classes.
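
A minimal sketch of MIOU computation via a confusion matrix accumulated over the test set might look as follows; the class count and the random toy inputs are placeholders.

```python
# IOU / MIOU sketch: accumulate a confusion matrix over the test set, then apply Eq. (2.35).
import numpy as np

def update_confusion(conf, pred, gt, num_classes):
    """pred, gt: integer class maps of the same shape; conf: (C, C) accumulator."""
    mask = (gt >= 0) & (gt < num_classes)            # ignore unlabeled pixels if any
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    conf += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    return conf

def miou(conf):
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp                       # predicted as class k but actually not
    fn = conf.sum(axis=1) - tp                       # class k pixels missed by the prediction
    iou = tp / np.maximum(tp + fp + fn, 1)           # Eq. (2.35), guarded against empty classes
    return iou.mean()

conf = np.zeros((12, 12), dtype=np.int64)
conf = update_confusion(conf, np.random.randint(0, 12, (96, 96)),
                        np.random.randint(0, 12, (96, 96)), 12)
print(miou(conf))
```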

For depth estimation, we use several evaluation measures commonly used in prior work [50,11,12]:

  •  Root mean squared error: $\sqrt{\frac{1}{N} \sum_{x \in P} \left(d(x) - \hat{d}(x)\right)^2}$
  •  Average relative error: $\frac{1}{N} \sum_{x \in P} \frac{|d(x) - \hat{d}(x)|}{d(x)}$
  •  Average $\log_{10}$ error: $\frac{1}{N} \sum_{x \in P} \left|\log_{10} d(x) - \log_{10} \hat{d}(x)\right|$
  •  Accuracy with threshold: percentage of $x \in P$ s.t. $\max\!\left(\frac{d(x)}{\hat{d}(x)}, \frac{\hat{d}(x)}{d(x)}\right) < \delta$,

where $d(x)$ and $\hat{d}(x)$ are the ground-truth and predicted depth at pixel $x$, $P$ is the whole set of pixels in an image, and $N$ is the number of pixels in $P$; a small implementation sketch of these measures follows below.
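
A sketch of these measures, under the assumption that the depth maps are given as NumPy arrays with strictly positive values:

```python
# Depth evaluation sketch; d and d_hat are ground-truth and predicted depth maps.
import numpy as np

def depth_metrics(d, d_hat, delta=1.25):
    d, d_hat = d.ravel(), d_hat.ravel()
    rmse = np.sqrt(np.mean((d - d_hat) ** 2))                       # root mean squared error
    rel = np.mean(np.abs(d - d_hat) / d)                            # average relative error
    log10 = np.mean(np.abs(np.log10(d) - np.log10(d_hat)))          # average log10 error
    ratio = np.maximum(d / d_hat, d_hat / d)
    acc = {f"delta<{delta ** k:.3f}": np.mean(ratio < delta ** k) for k in (1, 2, 3)}
    return {"rmse": rmse, "rel": rel, "log10": log10, **acc}

d = np.random.uniform(0.5, 10.0, (96, 96))
d_hat = np.clip(d + np.random.normal(0, 0.2, d.shape), 0.1, None)
print(depth_metrics(d, d_hat))
```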

2.6.1 Results on NYUDv2 Dataset

The NYUDv2 dataset has 1448 images annotated with semantic labels and measured depth values. This dataset contains 868 images for training and a set of 580 images for testing. We divided the test data into 290 images for validation and 290 for testing, and used the validation data for early stopping. The dataset also has 407,024 extra unpaired RGB images, from which we randomly selected 10,459 images as unpaired training data. For semantic segmentation, following prior work [51,52,12], we evaluate the performance on 12 classes out of all available classes; the "other" class was not considered in either training or evaluation. We trained and evaluated the multimodal encoder–decoder networks on two different image resolutions, 96 × 96 and 320 × 240 pixels.

Table 2.1 shows the depth estimation and semantic segmentation results of all methods. The first six rows show results on 96 × 96 input images, and the rows correspond to MAE [1], the single-task encoder–decoder with and without skip connections, the single-modal multitask encoder–decoder, and the multimodal encoder–decoder networks without and with extra RGB training data. The first six columns show performance metrics for depth estimation, and the last column shows the semantic segmentation performance. The performance of the multimodal encoder–decoder networks was better than that of the single-task network (enc-dec) and the single-modal multitask encoder–decoder network (enc-decs) on all metrics even without the extra data, showing the effectiveness of the multimodal architecture. The performance was further improved with the extra data, achieving the best results on all evaluation metrics. This shows the benefit of using unpaired training data and multiple modalities to learn more effective representations.

Table 2.1

Performance comparison on the NYUDv2 dataset. The first six rows show results on 96 × 96 input images, and the rows correspond to MAE [1], the single-task encoder–decoder with and without the U-net architecture, the single-modal multitask encoder–decoder, and the multimodal encoder–decoder networks without and with extra RGB training data. The next seven rows show results on 320 × 240 input images in comparison with baseline depth estimation methods [53–55,11,50].

Columns Rel, log10, and RMSE are depth estimation error measures; the δ columns are depth estimation accuracy measures; MIOU is the semantic segmentation measure.

| Resolution | Method | Rel | log10 | RMSE | δ < 1.25 | δ < 1.25² | δ < 1.25³ | MIOU |
|---|---|---|---|---|---|---|---|---|
| 96 × 96 | MAE [1] | 1.147 | 0.290 | 2.311 | 0.098 | 0.293 | 0.491 | 0.018 |
| 96 × 96 | enc-dec (U) | – | – | – | – | – | – | 0.357 |
| 96 × 96 | enc-dec | 0.340 | 0.149 | 1.216 | 0.396 | 0.699 | 0.732 | – |
| 96 × 96 | enc-decs | 0.321 | 0.150 | 1.201 | 0.398 | 0.687 | 0.718 | 0.352 |
| 96 × 96 | multimodal enc-decs | 0.296 | 0.120 | 1.046 | 0.450 | 0.775 | 0.810 | 0.411 |
| 96 × 96 | multimodal enc-decs (+extra) | 0.283 | 0.119 | 1.042 | 0.461 | 0.778 | 0.810 | 0.420 |
| 320 × 240 | Make3d [53] | 0.349 | – | 1.214 | 0.447 | 0.745 | 0.897 | – |
| 320 × 240 | DepthTransfer [54] | 0.350 | 0.131 | 1.200 | – | – | – | – |
| 320 × 240 | Discrete-continuous CRF [55] | 0.335 | 0.127 | 1.060 | – | – | – | – |
| 320 × 240 | Eigen et al. [11] | 0.215 | – | 0.907 | 0.601 | 0.887 | 0.971 | – |
| 320 × 240 | Liu et al. [50] | 0.230 | 0.095 | 0.824 | 0.614 | 0.883 | 0.971 | – |
| 320 × 240 | multimodal enc-decs | 0.228 | 0.088 | 0.823 | 0.576 | 0.849 | 0.867 | 0.543 |
| 320 × 240 | multimodal enc-decs (+extra) | 0.221 | 0.087 | 0.819 | 0.579 | 0.853 | 0.872 | 0.548 |


In addition, the next seven rows show results on 320 × 240 input images in comparison with baseline depth estimation methods including Make3d [53], DepthTransfer [54], Discrete-continuous CRF [55], Eigen et al. [11], and Liu et al. [50]. The performance improved when the model was trained on higher-resolution images, in terms of both depth estimation and semantic segmentation. The multimodal encoder–decoder networks achieved better performance than the baseline methods, and performance comparable to methods requiring CRF-based optimization [55] or a large amount of labeled training data [11], even without the extra data.

More detailed semantic segmentation results on 96 × 96 images are shown in Table 2.2. Each column shows class-specific IOU scores for all models. The multimodal encoder–decoder networks with extra training data outperform the baseline models on 10 out of the 12 classes and achieve a 0.063-point improvement in MIOU.

Table 2.2

Detailed IOU on the NYUDv2 dataset. Each column shows class-specific IOU scores for all models.

| Class | MAE | enc-dec (U) | enc-decs | multimodal enc-decs | multimodal enc-decs (+extra) |
|---|---|---|---|---|---|
| book | 0.002 | 0.055 | 0.071 | 0.096 | 0.072 |
| cabinet | 0.033 | 0.371 | 0.382 | 0.480 | 0.507 |
| ceiling | 0.000 | 0.472 | 0.414 | 0.529 | 0.534 |
| floor | 0.101 | 0.648 | 0.659 | 0.704 | 0.736 |
| table | 0.020 | 0.197 | 0.222 | 0.237 | 0.299 |
| wall | 0.023 | 0.711 | 0.706 | 0.745 | 0.749 |
| window | 0.022 | 0.334 | 0.363 | 0.321 | 0.320 |
| picture | 0.005 | 0.361 | 0.336 | 0.414 | 0.422 |
| blinds | 0.001 | 0.274 | 0.234 | 0.303 | 0.304 |
| sofa | 0.004 | 0.302 | 0.300 | 0.365 | 0.375 |
| bed | 0.006 | 0.370 | 0.320 | 0.455 | 0.413 |
| tv | 0.000 | 0.192 | 0.220 | 0.285 | 0.307 |
| mean | 0.018 | 0.357 | 0.352 | 0.411 | 0.420 |


2.6.2 Results on Cityscape Dataset

The Cityscape dataset consists of 2975 images for training and 500 images for validation, which are provided together with semantic labels and disparity maps. We divide the validation data into 250 images for validation and 250 for testing, and use the validation data for early stopping. This dataset has 19,998 additional RGB images without annotations, and we also used them as extra training data. There are semantic labels for 19 object classes and a single background (unannotated) class. We used the 19 classes (excluding the background class) for evaluation. For depth estimation, we used the disparity maps provided with the dataset as the ground truth. Unlike NYUDv2, the raw data contained missing disparity values, and hence we adopted an image inpainting method [57] to interpolate the disparity maps for both training and testing. We used image resolutions of 96 × 96 and 512 × 256, while in the 512 × 256 case the multimodal encoder–decoder networks were trained on half-split 256 × 256 images.

The results are shown in Table 2.3, and the detailed comparison on semantic segmentation using 96 × 96 images is summarized in Table 2.4. The first six rows in Table 2.3 show a comparison between different architectures using 96 × 96 images. The multimodal encoder–decoder networks achieved improvement over both MAE [1] and the baseline networks in most of the target classes. While the multimodal encoder–decoder networks without extra data did not improve MIOU, they yielded a 0.043-point improvement with extra data. The multimodal encoder–decoder networks also achieved the best performance on the depth estimation task, and the performance gain from the extra data illustrates the generalization capability of the described training strategy. The next three rows show results using 512 × 256 images, and the multimodal encoder–decoder networks achieved better performance than the baseline method [56] on semantic segmentation.

Table 2.3

Performance comparison on the Cityscape dataset.

Columns Rel, log10, and RMSE are depth estimation error measures; the δ columns are depth estimation accuracy measures; MIOU is the semantic segmentation measure.

| Resolution | Method | Rel | log10 | RMSE | δ < 1.25 | δ < 1.25² | δ < 1.25³ | MIOU |
|---|---|---|---|---|---|---|---|---|
| 96 × 96 | MAE [1] | 3.675 | 0.441 | 34.583 | 0.213 | 0.395 | 0.471 | 0.099 |
| 96 × 96 | enc-dec (U) | – | – | – | – | – | – | 0.346 |
| 96 × 96 | enc-dec | 0.380 | 0.125 | 8.983 | 0.602 | 0.780 | 0.870 | – |
| 96 × 96 | enc-decs | 0.365 | 0.117 | 8.863 | 0.625 | 0.798 | 0.880 | 0.356 |
| 96 × 96 | multimodal enc-decs | 0.387 | 0.115 | 8.267 | 0.631 | 0.803 | 0.887 | 0.346 |
| 96 × 96 | multimodal enc-decs (+extra) | 0.290 | 0.100 | 7.759 | 0.667 | 0.837 | 0.908 | 0.389 |
| 512 × 256 | Segnet [56] | – | – | – | – | – | – | 0.561 |
| 512 × 256 | multimodal enc-decs | 0.201 | 0.076 | 5.528 | 0.759 | 0.908 | 0.949 | 0.575 |
| 512 × 256 | multimodal enc-decs (+extra) | 0.217 | 0.080 | 5.475 | 0.765 | 0.908 | 0.949 | 0.604 |


Table 2.4

Detailed IOU on the Cityscape dataset.

| Class | MAE | enc-dec | enc-decs | multimodal enc-decs | multimodal enc-decs (+extra) |
|---|---|---|---|---|---|
| road | 0.688 | 0.931 | 0.936 | 0.925 | 0.950 |
| sidewalk | 0.159 | 0.556 | 0.551 | 0.529 | 0.640 |
| building | 0.372 | 0.757 | 0.769 | 0.770 | 0.793 |
| wall | 0.022 | 0.125 | 0.128 | 0.053 | 0.172 |
| fence | 0.000 | 0.054 | 0.051 | 0.036 | 0.062 |
| pole | 0.000 | 0.230 | 0.220 | 0.225 | 0.280 |
| traffic light | 0.000 | 0.100 | 0.074 | 0.049 | 0.109 |
| traffic sign | 0.000 | 0.164 | 0.203 | 0.189 | 0.231 |
| vegetation | 0.200 | 0.802 | 0.805 | 0.805 | 0.826 |
| terrain | 0.000 | 0.430 | 0.446 | 0.445 | 0.498 |
| sky | 0.295 | 0.869 | 0.887 | 0.867 | 0.890 |
| person | 0.000 | 0.309 | 0.318 | 0.325 | 0.365 |
| rider | 0.000 | 0.040 | 0.058 | 0.007 | 0.036 |
| car | 0.137 | 0.724 | 0.743 | 0.720 | 0.788 |
| truck | 0.000 | 0.062 | 0.051 | 0.075 | 0.035 |
| bus | 0.000 | 0.096 | 0.152 | 0.153 | 0.251 |
| train | 0.000 | 0.006 | 0.077 | 0.133 | 0.032 |
| motorcycle | 0.000 | 0.048 | 0.056 | 0.043 | 0.108 |
| bicycle | 0.000 | 0.270 | 0.241 | 0.218 | 0.329 |
| mean | 0.099 | 0.346 | 0.356 | 0.346 | 0.389 |


2.6.3 Auxiliary Tasks

Although the main goal of the described approach is semantic segmentation and depth estimation from RGB images, Fig. 2.16 shows other cross-modal conversion pairs, i.e., semantic segmentation from depth images and depth estimation from semantic labels on the Cityscape dataset. From left to right, each column corresponds to (A) the input RGB image, (B) the ground-truth semantic label image (top) and depth map (bottom), (C) the image-to-label (top) and image-to-depth (bottom) estimates, and (D) the depth-to-label (top) and label-to-depth (bottom) estimates. The ground-truth depth maps are shown after inpainting. As can be seen, the multimodal encoder–decoder networks can also reasonably perform these auxiliary tasks.

Figure 2.16 Example output images from the multimodal encoder–decoder networks on the Cityscape dataset. From left to right, each column corresponds to (A) input RGB image, (B) the ground-truth semantic label image (top) and depth map (bottom), (C) estimated label and depth from RGB image, (D) estimated label from depth image (top) and estimated depth from label image (bottom).

More detailed examples and evaluations on the NYUDv2 dataset are shown in Fig. 2.17 and Table 2.5. The left side of Fig. 2.17 shows example output images corresponding to all of the above-mentioned tasks. From top to bottom on the left side, each row corresponds to (A) the input RGB image, (B) the ground-truth semantic label image, semantic labels estimated by (C) the baseline enc-dec model, (D) the image-to-label path, and (E) the depth-to-label conversion path of the multimodal encoder–decoder networks, (F) the ground-truth depth map (normalized to [0, 255] for visualization), and depth maps estimated by (G) enc-dec, (H) the image-to-depth path, and (I) the label-to-depth path. Interestingly, these auxiliary tasks achieved better performance than the cases with RGB input. A clearer object boundary in the label and depth images is one possible reason for this improvement. In addition, the right side of Fig. 2.17 shows image decoding tasks, where each block corresponds to (A) the ground-truth RGB image, (B) the semantic label, (C) the depth map, (D) the label-to-image estimate, and (E) the depth-to-image estimate. Although the multimodal encoder–decoder networks could not correctly reconstruct the input colors, the object shapes are visible even with the simple image reconstruction loss.

Figure 2.17 Example outputs from the multimodal encoder–decoder networks on the NYUDv2 dataset. From top to bottom on the left side, each row corresponds to (A) the input RGB image, (B) the ground-truth semantic label image, (C) estimation by enc-dec, (D) image-to-label estimation by the multimodal encoder–decoder networks, (E) depth-to-label estimation by the multimodal encoder–decoder networks, (F) estimated depth map by the multimodal encoder–decoder networks (normalized to [0,255] for visualization), (G) estimation by enc-dec, (H) image-to-depth estimation by the multimodal encoder–decoder networks, and (I) label-to-depth estimation by the multimodal encoder–decoder networks. In addition, the right side shows image decoding tasks, where each block corresponds to (A) the ground-truth RGB image, (B) label-to-image estimate, and (C) depth-to-image estimate.

Table 2.5

Comparison of auxiliary task performance on the NYUDv2 dataset.

Columns Rel, log10, and RMSE are depth estimation error measures; the δ columns are depth estimation accuracy measures; MIOU is the semantic segmentation measure.

| Task | Rel | log10 | RMSE | δ < 1.25 | δ < 1.25² | δ < 1.25³ | MIOU |
|---|---|---|---|---|---|---|---|
| image-to-depth | 0.283 | 0.119 | 1.042 | 0.461 | 0.778 | 0.810 | – |
| label-to-depth | 0.258 | 0.128 | 1.114 | 0.452 | 0.741 | 0.779 | – |
| image-to-label | – | – | – | – | – | – | 0.420 |
| depth-to-label | – | – | – | – | – | – | 0.476 |


2.7 Conclusion

In this chapter, we introduced several state-of-the-art approaches on deep learning for multimodal data fusion as well as basic techniques behind those works. In particular, we described a new approach named multimodal encoder–decoder networks for efficient multitask learning with a shared feature representation. In the multimodal encoder–decoder networks, encoders and decoders are connected via the shared latent representation and shared skipped representations. Experiments showed the potential of shared representations from different modalities to improve the multitask performance.

One of the most important issues in future work is to investigate the effectiveness of the multimodal encoder–decoder networks on different tasks such as image captioning and DCGAN-based image translation. More detailed investigation on learned shared representations during multitask training is another important future direction to understand why and how the multimodal encoder–decoder architecture addresses the multimodal conversion tasks.

References

[1] C. Cadena, A. Dick, I.D. Reid, Multi-modal auto-encoders as joint estimators for robotics scene understanding, Proceedings of Robotics: Science and Systems. 2016.

[2] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, A.Y. Ng, Multimodal deep learning, Proceedings of International Conference on Machine Learning. ICML. 2011.

[3] N. Srivastava, R.R. Salakhutdinov, Multimodal learning with deep Boltzmann machines, Proceedings of Advances in Neural Information Processing Systems. NIPS. 2012.

[4] A. Kendall, Y. Gal, R. Cipolla, Multi-task learning using uncertainty to weigh losses for scene geometry and semantics, arXiv preprint arXiv:1705.07115.

[5] R. Kuga, A. Kanezaki, M. Samejima, Y. Sugano, Y. Matsushita, Multi-task learning using multi-modal encoder–decoder networks with shared skip connections, Proceedings of International Conference on Computer Vision Workshops. ICCV. 2017.

[6] O. Ronneberger, P. Fischer, T. Brox, U-net: convolutional networks for biomedical image segmentation, Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention. 2015.

[7] P. Isola, J.-Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. CVPR. 2017:5967–5976.

[8] R. Caruana, Multitask learning, Machine Learning 1997;28(1):41–75.

[9] Y. Liao, S. Kodagoda, Y. Wang, L. Shi, Y. Liu, Understand scene categories by objects: a semantic regularized scene classifier using convolutional neural networks, Proceedings of IEEE International Conference on Robotics and Automation. ICRA. 2016.

[10] Y. Yang, T. Hospedales, Deep multi-task representation learning: a tensor factorisation approach, Proceedings of International Conference on Learning Representations. ICLR. 2017.

[11] D. Eigen, C. Puhrsch, R. Fergus, Depth map prediction from a single image using a multi-scale deep network, Proceedings of Advances in Neural Information Processing Systems. NIPS. 2014.

[12] D. Eigen, R. Fergus, Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, Proceedings of International Conference on Computer Vision. ICCV. 2015.

[13] E.M. Hand, R. Chellappa, Attributes for improved attributes: a multi-task network for attribute classification, Proceedings of AAAI Conference on Artificial Intelligence. 2017.

[14] J. Hoffman, S. Gupta, T. Darrell, Learning with side information through modality hallucination, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. CVPR. 2016.

[15] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, R. Urtasun, Multinet: real-time joint semantic reasoning for autonomous driving, Proceedings of 2018 IEEE Intelligent Vehicles Symposium. IV. 2018.

[16] J. Uhrig, M. Cordts, U. Franke, T. Brox, Pixel-level encoding and depth layering for instance-level semantic labeling, Proceedings of German Conference on Pattern Recognition. 2016.

[17] X. Li, L. Zhao, L. Wei, M.H. Yang, F. Wu, Y. Zhuang, H. Ling, J. Wang, DeepSaliency: multi-task deep neural network model for salient object detection, IEEE Transactions on Image Processing 2016;25(8):3919–3930.

[18] A. Eitel, J.T. Springenberg, L. Spinello, M. Riedmiller, W. Burgard, Multimodal deep learning for robust RGB-D object recognition, Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems. IROS. 2015.

[19] K. Sohn, W. Shang, H. Lee, Improved multimodal deep learning with variation of information, Proceedings of Advances in Neural Information Processing Systems. NIPS. 2014:2141–2149.

[20] V. Radu, N.D. Lane, S. Bhattacharya, C. Mascolo, M.K. Marina, F. Kawsar, Towards multimodal deep learning for activity recognition on mobile devices, Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct. 2016:185–188.

[21] M. Ehrlich, T.J. Shields, T. Almaev, M.R. Amer, Facial attributes classification using multi-task representation learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2016.

[22] E. Bruni, N.K. Tran, M. Baroni, Multimodal distributional semantics, Journal of Artificial Intelligence Research 2014;49(1):1–47.

[23] I.V. Serban, A.G. Ororbia II, J. Pineau, A.C. Courville, Multi-modal variational encoder–decoders, arXiv preprint arXiv:1612.00377.

[24] C. Silberer, V. Ferrari, M. Lapata, Visually grounded meaning representations, IEEE Transactions on Pattern Analysis and Machine Intelligence 2017;39(11):2284–2297.

[25] G. Collell, T. Zhang, M. Moens, Imagined visual representations as multimodal embeddings, Proceedings of AAAI Conference on Artificial Intelligence. 2017.

[26] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, Proceedings of International Conference on Computer Vision. ICCV. 2017.

[27] T. Kim, M. Cha, H. Kim, J.K. Lee, J. Kim, Learning to discover cross-domain relations with generative adversarial networks, Proceedings of International Conference on Machine Learning. ICML. 2017.

[28] Z. Yi, H. Zhang, P. Tan, M. Gong, Dualgan: unsupervised dual learning for image-to-image translation, Proceedings of International Conference on Computer Vision. ICCV. 2017.

[29] M.-Y. Liu, O. Tuzel, Coupled generative adversarial networks, Proceedings of Advances in Neural Information Processing Systems. NIPS. 2016.

[30] M.-Y. Liu, T. Breuel, J. Kautz, Unsupervised image-to-image translation networks, Proceedings of Advances in Neural Information Processing Systems. NIPS. 2017.

[31] Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, L. Carin, Triangle generative adversarial networks, Proceedings of Advances in Neural Information Processing Systems. NIPS. 2017.

[32] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, B. Catanzaro, High-resolution image synthesis and semantic manipulation with conditional gans, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. CVPR. 2018.

[33] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, J. Choo, Stargan: unified generative adversarial networks for multi-domain image-to-image translation, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. CVPR. 2018.

[34] D.P. Kingma, M. Welling, Auto-encoding variational Bayes, Proceedings of International Conference on Learning Representations. ICLR. 2014.

[35] D.J. Rezende, S. Mohamed, D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, Proceedings of International Conference on Machine Learning. ICML. 2014.

[36] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Proceedings of Advances in Neural Information Processing Systems. NIPS. 2014.

[37] A.B.L. Larsen, S.K. Sønderby, H. Larochelle, O. Winther, Autoencoding beyond pixels using a learned similarity metric, Proceedings of International Conference on Machine Learning. ICML. 2016.

[38] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, Adversarial autoencoders, Proceedings of International Conference on Learning Representations. ICLR. 2016.

[39] L. Mescheder, S. Nowozin, A. Geiger, Adversarial variational Bayes: unifying variational autoencoders and generative adversarial networks, Proceedings of International Conference on Machine Learning. ICML. 2017.

[40] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, A. Courville, Adversarially learned inference, Proceedings of International Conference on Learning Representations. ICLR. 2017.

[41] J. Donahue, P. Krähenbühl, T. Darrell, Adversarial feature learning, Proceedings of International Conference on Learning Representations. ICLR. 2017.

[42] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, Proceedings of International Conference on Learning Representations. ICLR. 2016.

[43] J. Springenberg, A. Dosovitskiy, T. Brox, M. Riedmiller, Striving for simplicity: the all convolutional net, Proceedings of International Conference on Learning Representations (Workshop Track). ICLR. 2015.

[44] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, Proceedings of International Conference on Machine Learning. ICML. 2015.

[45] H. Zhang, T. Xu, H. Li, Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks, Proceedings of International Conference on Computer Vision. ICCV. 2017.

[46] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D. Metaxas, Stackgan++: realistic image synthesis with stacked generative adversarial networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 2019, https://doi.org/10.1109/TPAMI.2018.2856256.

[47] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, Proceedings of International Conference on Machine Learning. ICML. 2017.

[48] N. Silberman, D. Hoiem, P. Kohli, R. Fergus, Indoor segmentation and support inference from RGB-D images, Proceedings of European Conference on Computer Vision. ECCV. 2012:746–760.

[49] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, B. Schiele, The cityscapes dataset for semantic urban scene understanding, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. CVPR. 2016.

[50] F. Liu, C. Shen, G. Lin, Deep convolutional neural fields for depth estimation from a single image, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. CVPR. 2015.

[51] A. Hermans, G. Floros, B. Leibe, Dense 3D semantic mapping of indoor scenes from RGB-D images, Proceedings of IEEE International Conference on Robotics and Automation. ICRA. 2014.

[52] A. Wang, J. Lu, G. Wang, J. Cai, T.-J. Cham, Multi-modal unsupervised feature learning for RGB-D scene labeling, Proceedings of European Conference on Computer Vision. ECCV. 2014.

[53] A. Saxena, M. Sun, A.Y. Ng, Make3d: learning 3d scene structure from a single still image, IEEE Transactions on Pattern Analysis and Machine Intelligence 2009;31(5):824–840.

[54] K. Karsch, C. Liu, S.B. Kang, Depth transfer: depth extraction from video using non-parametric sampling, IEEE Transactions on Pattern Analysis and Machine Intelligence 2014;36(11):2144–2158.

[55] M. Liu, M. Salzmann, X. He, Discrete-continuous depth estimation from a single image, Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. CVPR. 2014:716–723.

[56] V. Badrinarayanan, A. Kendall, R. Cipolla, Segnet: a deep convolutional encoder–decoder architecture for image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 2017;39(12):2481–2495.

[57] A. Telea, An image inpainting technique based on the fast marching method, Journal of Graphics Tools 2004;9(1):23–34.
