Asako Kanezaki⁎; Ryohei Kuga†; Yusuke Sugano†; Yasuyuki Matsushita† ⁎National Institute of Advanced Industrial Science and Technology, Tokyo, Japan
†Graduate School of Information Science and Technology, Osaka University, Osaka, Japan
Recent advances in deep learning have enabled realistic image-to-image translation of multimodal data. Along with this development, auto-encoders and generative adversarial networks (GANs) have been extended to deal with multimodal input and output. At the same time, multitask learning has been shown to efficiently and effectively address multiple mutually related recognition tasks. Various scene understanding tasks, such as semantic segmentation and depth prediction, can be viewed as cross-modal encoding / decoding, and hence most of the prior work used multimodal (various types of input) datasets for multitask (various types of output) learning. Inter-modal commonalities, such as those across RGB images, depth, and semantic labels, are beginning to be exploited, although this line of study is still at an early stage. In this chapter, we introduce several state-of-the-art encoder–decoder methods for multimodal learning as well as a new approach to cross-modal networks. In particular, we detail multimodal encoder–decoder networks that harness the multimodal nature of multitask scene recognition. In addition to the shared latent representation among encoder–decoder pairs, the model also has shared skip connections from different encoders. By combining these two representation-sharing mechanisms, the model is shown to efficiently learn a shared feature representation among all modalities in the training data.
Encoder–decoder networks; Semi-supervised learning; Semantic segmentation; Depth estimation
Scene understanding is one of the most important tasks for various applications, including robotics and autonomous driving, and has long been an active research area in computer vision. The goal of scene understanding can be divided into several different tasks, such as depth reconstruction and semantic segmentation. Traditionally, these different tasks have been studied independently, resulting in their own tailored methods. Recently, there has been a growing demand for a single unified framework that, unlike previous approaches, achieves multiple tasks at a time. By sharing a part of the learned estimator, such a multitask learning framework is expected to achieve better performance with a compact representation.
In most of the prior work, multitask learning is formulated with a motivation to train a shared feature representation among different tasks for efficient feature encoding [1–3]. Accordingly, in recent convolutional neural network (CNN)-based methods, multitask learning often employs an encoder–decoder network architecture [1,2,4]. If, for example, the target tasks are semantic segmentation and depth estimation from RGB images, multitask networks encode the input image to a shared low-dimensional feature representation and then estimate depth and semantic labels with two distinct decoder networks.
While such a shared encoder architecture can constrain the network to extract a common feature for different tasks, one limitation is that it cannot fully exploit the multimodal nature of the training dataset. The representation capability of the shared representation in the above example is not limited to image-to-label and image-to-depth conversion tasks, but it can also represent the common feature for all of the cross-modal conversion tasks such as depth-to-label as well as within-modal dimensionality reduction tasks such as image-to-image. By incorporating these additional conversion tasks during the training phase, the multitask network is expected to learn more efficient shared feature representation for the diverse target tasks.
In this chapter, we introduce a recent method, the multimodal encoder–decoder networks [5], for multitask scene recognition. The model consists of encoders and decoders for each modality, and the whole network is trained in an end-to-end manner taking into account all conversion paths: both cross-modal encoder–decoder pairs and within-modal self-encoders. As illustrated in Fig. 2.1, all encoder–decoder pairs are connected via a single shared latent representation in this method. In addition, inspired by the U-net architecture [6,7], the decoders for pixel-wise image conversion tasks such as semantic segmentation also take a shared skip representation from all encoders. Since the whole network is jointly trained using multitask losses, these two shared representations are trained to extract the common feature representation among all modalities. Unlike multimodal auto-encoders [1], this method can further utilize auxiliary unpaired data to train the self-encoding paths and consequently improve the cross-modal conversion performance. In experiments on two public datasets, the multimodal encoder–decoder networks are shown to perform significantly better on cross-modal conversion tasks.
The remainder of this chapter is organized as follows. In Sect. 2.2, we summarize in an overview various methods on multimodal data fusion. Next, we describe the basics of multimodal deep learning techniques in Sect. 2.3 and the latest work based on those techniques in Sect. 2.4. We then introduce the details of multimodal encoder–decoder networks in Sect. 2.5. In Sect. 2.6, we show experimental results and discuss the performance of multimodal encoder–decoder networks on several benchmark datasets. Finally, we conclude this chapter in Sect. 2.7.
Multitask learning is motivated by the finding that the feature representation for one particular task could be useful for the other tasks [8]. In prior work, multiple tasks, such as scene classification, semantic segmentation [9], character recognition [10], and depth estimation [11,12], have been addressed with a single input of an RGB image, which is referred to as single-modal multitask learning. Hand et al. [13] demonstrated that multitask learning of gender and facial parts from one facial image leads to better accuracy than individual learning of each task. Hoffman et al. [14] proposed a modality hallucination architecture based on CNNs, which boosts the performance of RGB object detection using depth information only in the training phase. Teichmann et al. [15] presented neural networks for scene classification, object detection, and segmentation of a street view image. Uhrig et al. [16] proposed an instance-level segmentation method via simultaneous estimation of semantic labels, depth, and instance center direction. Li et al. [17] proposed fully convolutional neural networks for segmentation and saliency tasks. In these previous approaches, the feature representation of a single input modality is shared in an intermediate layer for solving multiple tasks. In contrast, the multimodal encoder–decoder networks [5] described in Sect. 2.5 fully utilize the multimodal training data by learning cross-modal shared representations through joint multitask training.
There have been several prior attempts to utilize multimodal inputs in deep neural networks. These methods use multimodal input data, such as RGB and depth images [18], visual and textual features [19], audio and video [2], and multiple sensor data [20], for single-task neural networks. In contrast to such multimodal single-task learning methods, relatively few studies have addressed multimodal and multitask learning. Ehrlich et al. [21] presented a method to identify a person's gender and smile based on two feature modalities extracted from face images. Cadena et al. [1] proposed auto-encoder-based neural networks for multitask estimation of semantic labels and depth.
Both of these single-task and multitask learning methods with multimodal data focus on obtaining a better shared representation from multimodal data. Since straightforward concatenation of features extracted from different modalities often results in lower estimation accuracy, some prior methods improve the shared representation by singular value decomposition [22], encoder–decoder networks [23], auto-encoders [2,1,24], or supervised mapping [25]. While the multimodal encoder–decoder networks are also based on the encoder–decoder approach, they employ the U-net architecture to further improve the learned shared representation, particularly in high-resolution convolutional layers.
Most of the prior work also assumes that all modalities are available in both the training and test phases. One approach for dealing with missing modal data is zero-filling, which fills the missing elements of the input vector with zeros [2,1]. Although such approaches allow multimodal networks to handle missing modalities and cross-modal conversion tasks, it has not been fully discussed whether zero-filling can also be applied to recent CNN-based architectures. Sohn et al. [19] explicitly estimated missing modal data from the available modalities with deep neural networks. In a difficult task, such as semantic segmentation with many classes, the missing modal data is estimated inaccurately, which negatively affects the performance of the whole network. In the multimodal encoder–decoder networks, each encoder–decoder path works individually at test time even when some modalities are missing. Furthermore, the model can perform conversions between all modalities in the training set, and it can utilize single-modal data to improve within-modal self-encoding paths during training.
Recently, many image-to-image translation methods based on deep neural networks have been developed [7,26–32]. Whereas these methods address image-to-image translation between two modalities, StarGAN [33] was recently proposed to efficiently learn translation across more than two domains. The multimodal encoder–decoder networks are also applicable to translation across more than two modalities. We describe the details of this line of work in Sect. 2.4 and the basic methods behind it in Sect. 2.3.
This section introduces the basics of multimodal deep learning for multimodal image translation. We first describe the auto-encoder, the most basic neural network consisting of an encoder and a decoder. Then we introduce an important extension of the auto-encoder, the variational auto-encoder (VAE) [34,35]. VAEs impose a standard normal distribution on the latent variables and are thus useful for generative modeling. Next, we describe the generative adversarial network (GAN) [36], which is the most well-known framework for learning deep neural networks for multimodal data generation. Concepts of VAEs and GANs have been combined in various ways to improve the distribution of the latent space for image generation, e.g., in VAE-GAN [37], the adversarial auto-encoder (AAE) [38], and adversarial variational Bayes (AVB) [39], which are described later in this section. We also introduce the adversarially learned inference (ALI) [40] and the bidirectional GAN (BiGAN) [41], which combine the GAN framework and the inference of latent representations.
An auto-encoder is a neural network that consists of an encoder network and a decoder network (Fig. 2.2). Letting x∈Rd be an input variable, which is a concatenation of pixel values when handling images, an encoder maps it to a latent variable z∈Rr, where r is usually much smaller than d. A decoder maps z to the output x′∈Rd, which is the reconstruction of the input x. The encoder and decoder are trained so as to minimize a reconstruction error such as the following squared error:
$$\mathcal{L}_{\mathrm{AE}} = \mathbb{E}_{x}\left[\|x - x'\|^{2}\right].$$
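The encoder / decoder structure and the squared reconstruction error can be sketched with toy linear maps; the sizes, the random weights, and the tanh non-linearity below are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                              # input and latent dimensions (toy sizes)
W_enc = rng.normal(size=(r, d)) * 0.1    # encoder weights (hypothetical)
W_dec = rng.normal(size=(d, r)) * 0.1    # decoder weights (hypothetical)

def encode(x):
    """Map input x in R^d to latent z in R^r (r much smaller than d)."""
    return np.tanh(W_enc @ x)

def decode(z):
    """Map latent z back to a reconstruction x' in R^d."""
    return W_dec @ z

x = rng.normal(size=d)
z = encode(x)
x_rec = decode(z)

# Single-sample estimate of the loss L_AE = E_x[||x - x'||^2].
loss_ae = np.sum((x - x_rec) ** 2)
```

In practice both maps are deep networks trained jointly by gradient descent on this loss.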
The purpose of an auto-encoder is typically dimensionality reduction or, in other words, unsupervised feature / representation learning. Recently, in addition to the encoding process, increasing attention has been given to the decoding process, which enables data generation from latent variables.
The variational auto-encoder (VAE) [34,35] regards an auto-encoder as a generative model where the data is generated from some conditional distribution p(x|z). Letting ϕ and θ denote the parameters of the encoder and the decoder, respectively, the encoder is described as a recognition model qϕ(z|x), which approximates the intractable true posterior pθ(z|x), whereas the decoder corresponds to the generative distribution pθ(x|z). The marginal likelihood of an individual data point xi is written as follows:
$$\log p_{\theta}(x_i) = D_{\mathrm{KL}}\!\left(q_{\phi}(z|x_i)\,\|\,p_{\theta}(z|x_i)\right) + \mathcal{L}(\theta, \phi; x_i).$$
Here, DKL stands for the Kullback–Leibler divergence. The second term in this equation is called the (variational) lower bound on the marginal likelihood of data point i, which can be written as
$$\mathcal{L}(\theta, \phi; x_i) = -D_{\mathrm{KL}}\!\left(q_{\phi}(z|x_i)\,\|\,p_{\theta}(z)\right) + \mathbb{E}_{z \sim q_{\phi}(z|x_i)}\!\left[\log p_{\theta}(x_i|z)\right].$$
In the training process, the parameters ϕ and θ are optimized so as to minimize the total loss −∑iL(θ,ϕ;xi), which can be written as
$$\mathcal{L}_{\mathrm{VAE}} = D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z)\right) - \mathbb{E}_{z \sim q_{\phi}(z|x)}\!\left[\log p_{\theta}(x|z)\right].$$
Unlike the original auto-encoder which does not consider the distribution of latent variables, VAE assumes the prior over the latent variables such as the centered isotropic multivariate Gaussian pθ(z)=N(z;0,I). In this case, we can let the variational approximate posterior be a multivariate Gaussian with a diagonal covariance structure:
$$\log q_{\phi}(z|x_i) = \log \mathcal{N}(z; \mu_i, \sigma_i^{2} I),$$
where the mean μi and the standard deviation σi of the approximate posterior are the outputs of the encoder (Fig. 2.3). In practice, the latent variable zi for data point i is calculated as follows:
$$z_i = \mu_i + \sigma_i \cdot \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I).$$
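This sampling step, known as the reparameterization trick, can be sketched together with the closed-form KL term against the prior N(0, I); the encoder outputs μ_i and σ_i below are hypothetical toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the encoder outputs these statistics for data point i (toy values).
mu_i = np.array([0.5, -1.0, 0.0, 2.0])
sigma_i = np.array([0.1, 0.2, 1.0, 0.5])

# Reparameterization trick: z_i = mu_i + sigma_i * eps with eps ~ N(0, I).
# All randomness is isolated in eps, so gradients can flow through mu_i, sigma_i.
eps = rng.standard_normal(mu_i.shape)
z_i = mu_i + sigma_i * eps

# KL divergence D_KL(N(mu, sigma^2 I) || N(0, I)) in closed form.
kl = 0.5 * np.sum(mu_i**2 + sigma_i**2 - 1.0 - np.log(sigma_i**2))
```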
The generative adversarial network (GAN) [36] is one of the most successful frameworks for data generation. It consists of two networks: a generator G and a discriminator D (shown in Fig. 2.4), which are jointly optimized in a competitive manner. Intuitively, the aim of the generator is to generate from a random noise vector z a sample x′ that can fool the discriminator, i.e., make it believe x′ is real. The aim of the discriminator is to distinguish a fake sample x′ from a real sample x. They are simultaneously optimized via the following two-player minimax game:
$$\min_{G} \max_{D} \mathcal{L}_{\mathrm{GAN}}, \quad \text{where } \mathcal{L}_{\mathrm{GAN}} = \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log(1 - D(G(z)))\right].$$
From the perspective of a discriminator D, the objective function is a simple cross-entropy loss function for the binary categorization problem. A generator G is trained so as to minimize log(1−D(G(z))), where the gradients of the parameters in G can be back-propagated through the outputs of (fixed) D. In spite of its simplicity, GAN is able to train a reasonable generator that can output realistic data samples.
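The two loss terms can be made concrete with toy fixed networks; the linear discriminator and generator below are hypothetical stand-ins used only to evaluate the objective, not trainable models:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

w_d = rng.normal(size=8)              # toy discriminator weights (hypothetical)
W_g = rng.normal(size=(8, 4)) * 0.5   # toy generator weights (hypothetical)

def D(x):
    """Discriminator: probability that x is a real sample."""
    return sigmoid(w_d @ x)

def G(z):
    """Generator: map a noise vector z to a fake sample."""
    return np.tanh(W_g @ z)

x_real = rng.normal(size=8)
z = rng.normal(size=4)
x_fake = G(z)

# D maximizes log D(x) + log(1 - D(G(z))); loss_d is its negation.
loss_d = -(np.log(D(x_real)) + np.log(1.0 - D(x_fake)))
# G minimizes log(1 - D(G(z))), with D held fixed.
loss_g = np.log(1.0 - D(x_fake))
```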
Deep convolutional GAN (DCGAN) [42] was proposed to utilize GANs for generating natural images. The DCGAN generator consists of a series of four fractionally-strided convolutions that convert a 100-dimensional random vector (drawn from a uniform distribution) into a 64×64 pixel image. The main characteristics of the proposed CNN architecture are threefold. First, they used the all convolutional net [43], which replaces deterministic spatial pooling functions (such as max-pooling) with strided convolutions. Second, fully connected layers on top of convolutional features were eliminated. Finally, batch normalization [44], which normalizes the input to each unit to have zero mean and unit variance, was used to stabilize learning.
DCGAN is still limited in the resolution of the generated images. StackGAN [45] was proposed to generate high-resolution (e.g., 256×256) images by using two-stage GANs. The Stage-I GAN generator in StackGAN generates low-resolution (64×64) images, which are fed into the Stage-II GAN generator that outputs high-resolution (256×256) images. The discriminators in Stage-I and Stage-II distinguish output images from real images at the corresponding resolutions. The more stable StackGAN++ [46] was later proposed, which consists of multiple generators and multiple discriminators arranged in a tree-like structure.
VAE-GAN [37] is a combination of VAE [34,35] (Sect. 2.3.2) and GAN [36] (Sect. 2.3.3) as shown in Fig. 2.5. The VAE-GAN model is trained with the following criterion:
$$\mathcal{L}_{\mathrm{VAE\mbox{-}GAN}} = \mathcal{L}_{\mathrm{VAE}}^{*} + \mathcal{L}_{\mathrm{GAN}},$$
$$\mathcal{L}_{\mathrm{VAE}}^{*} = D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p_{\theta}(z)\right) - \mathbb{E}_{z \sim q_{\phi}(z|x)}\!\left[\log p_{\theta}(D(x)|z)\right],$$
where LGAN is the same as that in Eq. (2.7). Note that VAE-GAN replaces the element-wise error measures in VAE (i.e., the second term in Eq. (2.4)) by the similarity measures trained through a GAN discriminator (i.e., the second term in Eq. (2.9)). Here, a Gaussian observation model for D(x) with mean D(x′) and identity covariance is introduced:
$$p_{\theta}(D(x)|z) = \mathcal{N}(D(x)\,|\,D(x'), I).$$
In this way, VAE-GAN improves the sharpness of output images, in comparison to VAE.
GANs are known to suffer from mode collapse, in which the generator learns to produce samples with extremely low variety. VAE-GAN mitigates this problem by introducing a prior distribution for z. The learned latent space is therefore continuous and is able to produce meaningful latent vectors, e.g., visual attribute vectors corresponding to eyeglasses and smiles in face images.
The adversarial auto-encoder (AAE) [38] introduces the GAN framework to perform variational inference with auto-encoders. Similarly to VAE [34,35] (Sect. 2.3.2), AAE is a probabilistic auto-encoder that assumes a prior over the latent variables. Let p(z) be the prior distribution we want to impose, q(z|x) the encoding distribution, p(x|z) the decoding distribution, and pdata(x) the data distribution. The aggregated posterior distribution q(z) is defined as follows:
$$q(z) = \int_{x} q(z|x)\, p_{\mathrm{data}}(x)\, dx.$$
The regularization of AAE is to match the aggregated posterior q(z) to an arbitrary prior p(z). While the auto-encoder attempts to minimize the reconstruction error, an adversarial network guides q(z) to match p(z) (Fig. 2.6). The cost function of AAE can be written as
$$\mathcal{L}_{\mathrm{AAE}} = \mathcal{L}_{\mathrm{AE}} + \mathbb{E}_{q_{\phi}(z|x)}\!\left[\log D(z)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log(1 - D(z))\right],$$
where LAE represents the reconstruction error defined in Eq. (2.1).
Whereas VAE uses a KL divergence penalty to impose a prior distribution on the latent variables, AAE uses an adversarial training procedure to encourage q(z) to match p(z). An important difference between VAEs and AAEs lies in how gradients are calculated. VAE approximates the gradients of the variational lower bound through the KL divergence by Monte Carlo sampling, which requires access to the exact functional form of the prior distribution. AAEs, on the other hand, only need to be able to sample from the prior distribution to induce q(z) to match p(z). AAEs can therefore impose arbitrarily complicated distributions (e.g., a Swiss roll distribution) as well as black-box distributions.
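This sampling-only requirement can be illustrated as follows; the toy encoder, the Swiss-roll-like prior, and the linear discriminator on z-space are all hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def encode(x):
    """Toy deterministic encoder producing a 2-D code (hypothetical)."""
    return 0.5 * x[:2]

def sample_prior(n):
    """Draw samples from p(z); only sampling is needed, not its density.

    A Swiss-roll-like 2-D distribution stands in for an arbitrarily
    complicated prior."""
    theta = rng.uniform(0.0, 4.0 * np.pi, size=n)
    return np.stack([theta * np.cos(theta), theta * np.sin(theta)], axis=1) / 10.0

w = rng.normal(size=2)                 # toy discriminator weights on z-space

def D(z):
    return sigmoid(z @ w)

x_batch = rng.normal(size=(16, 8))
z_fake = np.array([encode(x) for x in x_batch])   # codes from q(z)
z_real = sample_prior(16)                         # samples from p(z)

# Adversarial regularizer guiding q(z) toward p(z); here prior samples
# are treated as "real" and encoder codes as "fake".
adv_loss = -(np.mean(np.log(D(z_real))) + np.mean(np.log(1.0 - D(z_fake))))
```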
Adversarial variational Bayes (AVB) [39] also uses adversarial training for VAE, which enables the usage of arbitrarily complex inference models. As shown in Fig. 2.7, the inference model (i.e. encoder) in AVB takes the noise ϵ as additional input. For its derivation, the optimization problem of VAE in Eq. (2.4) is rewritten as follows:
$$\max_{\theta} \max_{\phi} \mathbb{E}_{p_{\mathrm{data}}(x)}\!\left[-D_{\mathrm{KL}}\!\left(q_{\phi}(z|x)\,\|\,p(z)\right) + \mathbb{E}_{q_{\phi}(z|x)} \log p_{\theta}(x|z)\right] = \max_{\theta} \max_{\phi} \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_{\phi}(z|x)}\!\left(\log p(z) - \log q_{\phi}(z|x) + \log p_{\theta}(x|z)\right).$$
This can be optimized by stochastic gradient descent (SGD) using the reparameterization trick [34,35], when considering an explicit qϕ(z|x) such as the centered isotropic multivariate Gaussian. For a black-box representation of qϕ(z|x), the following objective function using the discriminator D(x,z) is considered:
$$\max_{D} \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_{\phi}(z|x)} \log \sigma(D(x,z)) + \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{p(z)} \log(1 - \sigma(D(x,z))),$$
where σ(t)≡(1+e−t)−1 denotes the sigmoid function. For pθ(x|z) and qϕ(z|x) fixed, the optimal discriminator D⁎(x,z) according to Eq. (2.14) is given by
$$D^{*}(x,z) = \log q_{\phi}(z|x) - \log p(z).$$
The optimization objective in Eq. (2.13) is rewritten as
$$\max_{\theta} \max_{\phi} \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{q_{\phi}(z|x)}\!\left(-D^{*}(x,z) + \log p_{\theta}(x|z)\right).$$
Using the reparameterization trick [34,35], this can be rewritten with a suitable function zϕ(x,ϵ) in the form
$$\max_{\theta} \max_{\phi} \mathbb{E}_{p_{\mathrm{data}}(x)} \mathbb{E}_{\epsilon}\!\left(-D^{*}(x, z_{\phi}(x,\epsilon)) + \log p_{\theta}(x \,|\, z_{\phi}(x,\epsilon))\right).$$
The gradients of Eq. (2.17) w.r.t. ϕ and θ as well as the gradient of Eq. (2.14) w.r.t. the parameters for D are computed to apply SGD updates. Note that AAE [38] (Sect. 2.3.5) can be regarded as an approximation to AVB, where D(x,z) is restricted to the class of functions that do not depend on x (i.e., D(x,z)≡D(z)).
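The closed form of the optimal discriminator can be verified numerically; below, one-dimensional Gaussians (hypothetical choices) stand in for qϕ(z|x) and p(z), and we check that the sigmoid of D⁎ recovers the density ratio q/(q+p) at the adversarial optimum:

```python
import numpy as np

def log_gauss(z, mu, sigma):
    """Log density of a 1-D Gaussian N(mu, sigma^2)."""
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (z - mu)**2 / (2.0 * sigma**2)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

mu_q, sigma_q = 1.0, 0.8     # stand-in for q_phi(z|x)
mu_p, sigma_p = 0.0, 1.0     # prior p(z)

z = np.linspace(-3.0, 3.0, 7)

# Optimal discriminator: D*(x, z) = log q_phi(z|x) - log p(z).
d_star = log_gauss(z, mu_q, sigma_q) - log_gauss(z, mu_p, sigma_p)

# At the optimum of the adversarial objective, sigma(D*) = q / (q + p).
q = np.exp(log_gauss(z, mu_q, sigma_q))
p = np.exp(log_gauss(z, mu_p, sigma_p))
assert np.allclose(sigmoid(d_star), q / (q + p))
```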
The adversarially learned inference (ALI) [40] and the bidirectional GAN (BiGAN) [41], which were proposed at almost the same time, have the same model structure, shown in Fig. 2.8. The model consists of two generators Gz(x) and Gx(z), which correspond to an encoder q(z) and a decoder p(x), respectively, and a discriminator D(x,z). Whereas the original GAN [36] lacks the ability to infer z from x, ALI and BiGAN can learn the latent variable mapping, similarly to an auto-encoder. Instead of the explicit reconstruction loop of an auto-encoder, a GAN-like adversarial training framework is used for training the networks. Owing to this, the learned features become insensitive to trivial factors of variation in the input. The objective of ALI/BiGAN is to match the two joint distributions q(x,z)=q(x)q(z|x) and p(x,z)=p(z)p(x|z) by optimizing the following function:
$$\min_{G_x, G_z} \max_{D} \mathcal{L}_{\mathrm{BiGAN}}, \quad \text{where } \mathcal{L}_{\mathrm{BiGAN}} = \mathbb{E}_{q(x)}\!\left[\log D(x, G_z(x))\right] + \mathbb{E}_{p(z)}\!\left[\log(1 - D(G_x(z), z))\right].$$
Here, a discriminator network learns to discriminate between a sample pair (x,z) drawn from q(x,z) and one from p(x,z). The concept of using bidirectional mapping is also important for image-to-image translation methods, which are described in the next section.
Recent advances in deep neural networks have enabled realistic translation of multimodal images. The basic ideas of such techniques derive from VAEs and GANs. In this section, we introduce the latest work on this topic: image-to-image translation between two different domains/modalities using deep neural networks. A few recent methods, such as StarGAN [33], can manage image-to-image translation among more than two domains; however, we do not address this topic in this chapter.
Pix2pix [7] is one of the earliest works on image-to-image translation using a conditional GAN (cGAN) framework. The purpose of this work is to learn a generator that takes an image x (e.g., an edge map) as input and outputs an image y in a different modality (e.g., a color photo). Whereas the generator in a GAN is only dependent on a random vector z, the one in a cGAN also depends on another observed variable, which is the input image x in this case. In the proposed setting in [7], the discriminator also depends on x. The architecture of pix2pix is shown in Fig. 2.9. The objective of the cGAN is written as follows:
$$\mathcal{L}_{\mathrm{cGAN}}(G, D) = \mathbb{E}_{x,y}\!\left[\log D(x,y)\right] + \mathbb{E}_{x,z}\!\left[\log(1 - D(x, G(x,z)))\right].$$
The above objective is then mixed with ℓ1 distance between the ground truth and the generated image:
$$\mathcal{L}_{\ell_1}(G) = \mathbb{E}_{x,y,z}\!\left[\|y - G(x,z)\|_{1}\right].$$
Note that images generated using the ℓ1 distance tend to be less blurry than those generated using the ℓ2 distance. The final objective of pix2pix is
$$\min_{G} \max_{D} \mathcal{L}_{\mathrm{pix2pix}}(G, D), \quad \text{where } \mathcal{L}_{\mathrm{pix2pix}}(G, D) = \mathcal{L}_{\mathrm{cGAN}}(G, D) + \lambda \mathcal{L}_{\ell_1}(G),$$
where λ controls the importance of the two terms.
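The combined objective can be sketched with toy tensors; the generator, the conditional discriminator, and the value λ = 100 below are hypothetical stand-ins, not the actual pix2pix networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

x = rng.normal(size=(8, 8))   # input image (e.g., an edge map), toy size
y = rng.normal(size=(8, 8))   # ground-truth image in the target modality
z = rng.normal(size=(8, 8))   # noise

def G(x, z):
    """Toy generator: the output depends on the input image and the noise."""
    return 0.9 * x + 0.1 * z

def D(x, y):
    """Toy conditional discriminator: it sees the input x alongside y."""
    return sigmoid(np.mean(x * y))

y_fake = G(x, z)

# cGAN term: real pair (x, y) versus generated pair (x, G(x, z)).
loss_cgan = np.log(D(x, y)) + np.log(1.0 - D(x, y_fake))

# l1 term: pixel-wise distance to the ground truth.
loss_l1 = np.mean(np.abs(y - y_fake))

lam = 100.0                   # a typical weighting between the two terms
loss_pix2pix = loss_cgan + lam * loss_l1
```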
The generator of pix2pix takes the U-Net architecture [6], which is an encoder–decoder network with skip connections between intermediate layers (see Fig. 2.10). Because the U-Net transfers low-level information to high-level layers, it improves the quality (e.g., the edge sharpness) of output images.
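The effect of skip connections on feature shapes can be sketched with toy arrays; strided slicing and nearest-neighbor repetition below are hypothetical stand-ins for the convolution / pooling and up-sampling layers:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(3, 32, 32))   # input feature map as (channels, H, W)

def downsample(f):
    """Stand-in for conv + 2x2 pooling: halve the spatial resolution."""
    return f[:, ::2, ::2]

def upsample(f):
    """Stand-in for un-pooling / deconvolution: double the resolution."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

# Encoder: keep intermediate features for the skip connections.
e1 = downsample(x)     # (3, 16, 16)
e2 = downsample(e1)    # (3,  8,  8), the bottleneck

# Decoder: upsample, then concatenate the matching encoder feature along
# the channel axis; a convolution (omitted here) would then mix them.
d1 = np.concatenate([upsample(e2), e1], axis=0)       # (6, 16, 16)
d0 = np.concatenate([upsample(d1[:3]), x], axis=0)    # (6, 32, 32)
```

The concatenation is what lets low-level information (e.g., sharp edges) bypass the bottleneck and reach the high-resolution decoder layers.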
More recent work named pix2pixHD [32] generates 2048×1024 high-resolution images from semantic label maps using the pix2pix framework. It improves the pix2pix framework with a coarse-to-fine generator, a multiscale discriminator architecture, and a robust adversarial learning objective function. The generator G consists of two sub-networks {G1,G2}, where G1 generates 1024×512 images and G2 generates 2048×1024 images using the outputs of G1. Multiscale discriminators {D1,D2,D3}, which have an identical network structure but operate at different image scales, are used in a multitask learning setting. The objective of pix2pixHD is
$$\min_{G} \left( \left( \max_{D_1, D_2, D_3} \sum_{k=1,2,3} \mathcal{L}_{\mathrm{cGAN}}(G, D_k) \right) + \lambda \sum_{k=1,2,3} \mathcal{L}_{\mathrm{FM}}(G, D_k) \right),$$
where LFM(G,Dk) represents the newly proposed feature matching loss, which is the summation of the ℓ1 distances between the discriminator's i-th-layer outputs for the real and the synthesized images.
CycleGAN [26] is an image-to-image translation framework that, unlike pix2pix, does not require paired training data. Suppose that x and y denote samples in a source domain X and a target domain Y, respectively. Previous approaches such as pix2pix [7] (described in Sect. 2.4.1) only learn a mapping Gy : X→Y, whereas CycleGAN also learns a mapping Gx : Y→X. Two generators {Gy,Gx} and two discriminators {Dy,Dx} are jointly optimized (Fig. 2.11A). Gy translates x to y′, which is then fed into Gx to generate x′ (Fig. 2.11B). In the same manner, Gx translates y to x′, which is then fed into Gy to generate y′ (Fig. 2.11C). The ℓ1 distances between the inputs and the outputs are summed to form the following cycle-consistency loss:
$$\mathcal{L}_{\mathrm{cyc}}(G_x, G_y) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\|x - G_x(G_y(x))\|_{1}\right] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\!\left[\|y - G_y(G_x(y))\|_{1}\right].$$
The most important property of the cycle-consistency loss is that paired training data is unnecessary. Whereas Eq. (2.20) in the pix2pix loss requires x and y to correspond to the same sample, Eq. (2.23) only compares the input and the output within each domain. The full objective of CycleGAN is
$$\begin{aligned} \mathcal{L}_{\mathrm{CycleGAN}}(G_x, G_y, D_x, D_y) ={} & \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\!\left[\log D_y(y)\right] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log(1 - D_y(G_y(x)))\right] \\ &+ \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\!\left[\log D_x(x)\right] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\!\left[\log(1 - D_x(G_x(y)))\right] \\ &+ \lambda \mathcal{L}_{\mathrm{cyc}}(G_x, G_y). \end{aligned}$$
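The cycle-consistency term can be checked with a toy pair of translators; the linear map below and its inverse are hypothetical stand-ins for Gy and Gx, chosen so that the two cycles reconstruct their inputs:

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[1.0, 0.2],
              [0.0, 1.0]])    # toy translation X -> Y (hypothetical)

def G_y(x):
    """Translate a sample from domain X to domain Y."""
    return A @ x

def G_x(y):
    """Translate a sample from domain Y back to domain X."""
    return np.linalg.inv(A) @ y

x = rng.normal(size=2)        # unpaired sample from X
y = rng.normal(size=2)        # unpaired sample from Y

# Cycle-consistency: X -> Y -> X and Y -> X -> Y should reproduce the
# inputs; no paired (x, y) correspondence is ever needed.
loss_cyc = (np.sum(np.abs(x - G_x(G_y(x)))) +
            np.sum(np.abs(y - G_y(G_x(y)))))
```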
DiscoGAN [27] and DualGAN [28] were proposed around the same time as CycleGAN. The architectures of these methods are essentially identical to that of CycleGAN (Fig. 2.11). DiscoGAN uses an ℓ2 loss instead of the ℓ1 loss for the cycle-consistency loss. DualGAN replaces the sigmoid cross-entropy loss used in the original GAN with the Wasserstein GAN (WGAN) [47] loss.
A coupled generative adversarial network (CoGAN) [29] was proposed to learn a joint distribution of multidomain images. CoGAN consists of a pair of GANs, each of which synthesizes an image in one domain. These two GANs share a subset of parameters as shown in Fig. 2.12. The proposed model is based on the idea that a pair of corresponding images in two domains share the same high-level concepts. The objective of CoGAN is a simple combination of the two GANs:
$$\begin{aligned} \mathcal{L}_{\mathrm{CoGAN}}(G_1, G_2, D_1, D_2) ={} & \mathbb{E}_{x_2 \sim p_{\mathrm{data}}(x_2)}\!\left[\log D_2(x_2)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log(1 - D_2(G_2(z)))\right] \\ &+ \mathbb{E}_{x_1 \sim p_{\mathrm{data}}(x_1)}\!\left[\log D_1(x_1)\right] + \mathbb{E}_{z \sim p(z)}\!\left[\log(1 - D_1(G_1(z)))\right]. \end{aligned}$$
Similarly to the GAN frameworks introduced in Sect. 2.4.2, CoGAN does not require pairs of corresponding images as supervision.
Unsupervised Image-to-Image Translation (UNIT) Networks [30] combine the VAE-GAN [37] model (Sect. 2.3.4) and the CoGAN [29] model (Sect. 2.4.3) as shown in Fig. 2.13. UNIT also does not require pairs of corresponding images as supervision. The sets of networks {E1,G1,D1} and {E2,G2,D2} correspond to VAE-GANs, whereas the set of networks {G1,G2,D1,D2} corresponds to CoGAN. Because the two VAEs share the weights of the last few layers of the encoders as well as the first few layers of the decoders (i.e., generators), an image x1 in one domain can be translated to an image x2 in the other domain through {E1,G2}, and vice versa. Note that the weight-sharing constraint alone does not guarantee that corresponding images in the two domains have the same latent code z. It was shown, however, that E1 and E2 can map a pair of corresponding images in the two domains to a common latent code, which is then mapped back to a pair of corresponding images by G1 and G2. This benefits from the cycle-consistency constraint described below.
The learning problems of the VAEs and GANs for the image reconstruction streams, the image translation streams, and the cycle-reconstruction streams are jointly solved:
$$\min_{E_1, E_2, G_1, G_2} \max_{D_1, D_2} \mathcal{L}_{\mathrm{UNIT}},$$
where
$$\begin{aligned} \mathcal{L}_{\mathrm{UNIT}} ={} & \lambda_1 \mathcal{L}_{\mathrm{VAE}}(E_1, G_1) + \lambda_2 \mathcal{L}_{\mathrm{GAN}}(E_2, G_1, D_1) + \lambda_3 \mathcal{L}_{\mathrm{CC}}(E_1, G_1, E_2, G_2) \\ &+ \lambda_1 \mathcal{L}_{\mathrm{VAE}}(E_2, G_2) + \lambda_2 \mathcal{L}_{\mathrm{GAN}}(E_1, G_2, D_2) + \lambda_3 \mathcal{L}_{\mathrm{CC}}(E_2, G_2, E_1, G_1). \end{aligned}$$
Here, LVAE(⋅) and LGAN(⋅) are the same as those in Eq. (2.4) and Eq. (2.7), respectively. LCC(⋅) represents the cycle-consistency constraint given by the following VAE-like objective function:
$$\mathcal{L}_{\mathrm{CC}}(E_a, G_a, E_b, G_b) = D_{\mathrm{KL}}\!\left(q_a(z_a|x_a)\,\|\,p_{\theta}(z)\right) + D_{\mathrm{KL}}\!\left(q_b(z_b|x_a^{a \to b})\,\|\,p_{\theta}(z)\right) - \mathbb{E}_{z_b \sim q_b(z_b|x_a^{a \to b})}\!\left[\log p_{G_a}(x_a|z_b)\right].$$
The third term in the above function ensures that a twice-translated image resembles the input image, whereas the KL terms penalize deviations of the latent variable z from the prior distribution pθ(z)≡N(z|0,I) in the cycle-reconstruction stream.
A triangle GAN (Δ-GAN) [31] was developed for semi-supervised cross-domain distribution matching. This framework requires only a few paired samples in two different domains as supervision. Δ-GAN consists of two generators {Gx,Gy} and two discriminators {D1,D2}, as shown in Fig. 2.14. Suppose that x′ and y′ denote the images translated from y and x using Gx and Gy, respectively. The fake data pair (x′,y) is sampled from the joint distribution px(x,y)=px(x|y)p(y), and vice versa. The objective of Δ-GAN is to match the three joint distributions p(x,y), px(x,y), and py(x,y). If this is achieved, the learned bidirectional mappings px(x|y) and py(y|x) are guaranteed to generate fake data pairs (x′,y) and (x,y′) that are indistinguishable from the true data pairs (x,y).
The objective function of Δ-GAN is given by
$$\begin{aligned} \min_{G_x, G_y} \max_{D_1, D_2} \; & \mathbb{E}_{(x,y) \sim p(x,y)}\!\left[\log D_1(x,y)\right] \\ &+ \mathbb{E}_{y \sim p(y),\, x' \sim p_x(x|y)}\!\left[\log\!\left((1 - D_1(x',y)) \cdot D_2(x',y)\right)\right] \\ &+ \mathbb{E}_{x \sim p(x),\, y' \sim p_y(y|x)}\!\left[\log\!\left((1 - D_1(x,y')) \cdot (1 - D_2(x,y'))\right)\right]. \end{aligned}$$
The discriminator D1 distinguishes whether a sample pair is from the true data distribution p(x,y) or not. If this is not from p(x,y), D2 is used to distinguish whether a sample pair is from px(x,y) or py(x,y).
Δ-GAN can be regarded as a combination of cGAN and BiGAN [41]/ALI [40] (Sect. 2.3.7). Eq. (2.29) can be rewritten as
$$\min_{G_x, G_y} \max_{D_1, D_2} \mathcal{L}_{\Delta\mbox{-}\mathrm{GAN}},$$
where
$$\begin{aligned} \mathcal{L}_{\Delta\mbox{-}\mathrm{GAN}} &= \mathcal{L}_{\mathrm{cGAN}} + \mathcal{L}_{\mathrm{BiGAN}}, \\ \mathcal{L}_{\mathrm{cGAN}} &= \mathbb{E}_{p(x,y)}\!\left[\log D_1(x,y)\right] + \mathbb{E}_{p_x(x',y)}\!\left[\log(1 - D_1(x',y))\right] + \mathbb{E}_{p_y(x,y')}\!\left[\log(1 - D_1(x,y'))\right], \\ \mathcal{L}_{\mathrm{BiGAN}} &= \mathbb{E}_{p_x(x',y)}\!\left[\log D_2(x',y)\right] + \mathbb{E}_{p_y(x,y')}\!\left[\log(1 - D_2(x,y'))\right]. \end{aligned}$$
The architecture of the multimodal encoder–decoder networks [5] is illustrated in Fig. 2.15. To exploit the commonality among different tasks, all encoder / decoder pairs are connected with each other via the shared latent representation. In addition, if the decoding task is expected to benefit from high-resolution representations, the decoder is further connected with all encoders via shared skip connections in a similar manner to the U-net architecture [6]. Given one input modality, the encoder generates a single representation, which is then decoded through different decoders into all available modalities. The whole network is trained by taking into account all combinations of the conversion tasks among different modalities.
In what follows, we discuss the details of the architecture of the multimodal encoder–decoder networks for the task of depth and semantic label estimation from RGB images, assuming a training dataset consisting of three modalities: RGB images, depth maps, and semantic labels. In this example, semantic segmentation is the task for which the advantage of the skip connections has already been shown [7], while such high-resolution representations are not always beneficial to the depth and RGB image decoding tasks. It is also worth noting that the task and the number of modalities are not limited to this particular example. More encoders and decoders can be added to the model, and the decoders can be trained with different tasks and loss functions within this framework.
As illustrated in Fig. 2.15, each convolution layer (Conv) in the encoder is followed by a batch-normalization layer (Norm) and an activation function (ReLU). Two max-pooling operations are placed in the middle of seven Conv+Norm+ReLU components, which reduces the spatial size of the latent representation to 1/16 of the input (each 2×2 max-pooling halves both the height and the width). Similarly, the decoder network consists of seven Conv+Norm+ReLU components except for the final layer, while the max-pooling operations are replaced by un-pooling operations that expand the feature map. The max-pooling operation pools a feature map by taking its maximum values, and the un-pooling operation restores the pooled feature into an un-pooled feature map using switches in which the locations of the maxima are recorded. The final output of the decoder is then rescaled to the original input size. For a classification task, the rescaled output may be further fed into a softmax layer to yield a class probability distribution.
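The max-pooling / un-pooling pair with recorded switches can be sketched as follows; the 4×4 single-channel feature map and the helper functions are toy stand-ins for the actual layers:

```python
import numpy as np

x = np.arange(16.0).reshape(1, 4, 4)    # toy feature map (C, H, W)

def max_pool(f):
    """2x2 max-pooling that also records the switches (argmax positions)."""
    C, H, W = f.shape
    blocks = f.reshape(C, H // 2, 2, W // 2, 2).transpose(0, 1, 3, 2, 4)
    blocks = blocks.reshape(C, H // 2, W // 2, 4)
    switches = blocks.argmax(axis=-1)
    return blocks.max(axis=-1), switches

def unpool(p, switches):
    """Restore pooled values at the recorded switch locations, zeros elsewhere."""
    C, H, W = p.shape
    out = np.zeros((C, H, W, 4))
    np.put_along_axis(out, switches[..., None], p[..., None], axis=-1)
    out = out.reshape(C, H, W, 2, 2).transpose(0, 1, 3, 2, 4)
    return out.reshape(C, 2 * H, 2 * W)

p, sw = max_pool(x)   # each pooling halves H and W; two of them give 1/16 area
u = unpool(p, sw)     # maxima are restored at their original positions
```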
As discussed earlier, all encoder / decoder pairs are connected via the shared latent representation. Let $x_r$, $x_s$, and $x_d$ be the input modalities for each encoder (RGB image, semantic label, and depth map), and let $E_r$, $E_s$, and $E_d$ be the corresponding encoder functions. The latent representation $r$ output by an encoder is then given by $r \in \{E_r(x_r), E_s(x_s), E_d(x_d)\}$. Here, the dimension of $r$ is $C \times H \times W$, where $C$, $H$, and $W$ are the number of channels, height, and width, respectively. Because the convolution and pooling operations in all encoder functions $E \in \{E_r, E_s, E_d\}$ produce intermediate outputs $r$ of the same shape $C \times H \times W$, we can obtain the output $y \in \{D_r(r), D_s(r), D_d(r)\}$ from any $r$, where $D_r$, $D_s$, and $D_d$ are decoder functions. The latent representations are not distinguished among different modalities: the representation encoded by any encoder is fed into all decoders, and, at the same time, each decoder has to be able to decode the latent representation from any of the encoders. In other words, the latent representation is shared by all combinations of encoder / decoder pairs.
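The key property is that every encoder emits a latent tensor of the same shape, so any decoder can consume the latent of any encoder. The sketch below illustrates this with hypothetical stand-in encoders and decoders (a single linear channel mixing and a channel slice); the shapes and names are illustrative, not the actual networks.

```python
import numpy as np

C, H, W = 64, 6, 6  # shared latent shape (illustrative sizes)

def make_encoder(in_channels):
    """Stand-in encoder: mix modality-specific channels into the shared C channels."""
    rng = np.random.default_rng(0)
    Wm = rng.standard_normal((C, in_channels)) * 0.01
    return lambda x: np.einsum('ci,ihw->chw', Wm, x)

# one encoder per modality, each with its own input channel count
encoders = {'rgb': make_encoder(3), 'label': make_encoder(12), 'depth': make_encoder(1)}
# stand-in decoders: just slice out the modality's channel count
decoders = {'rgb': lambda r: r[:3], 'label': lambda r: r[:12], 'depth': lambda r: r[:1]}

inputs = {'rgb': np.zeros((3, H, W)),
          'label': np.zeros((12, H, W)),
          'depth': np.zeros((1, H, W))}

for src, enc in encoders.items():
    r = enc(inputs[src])
    assert r.shape == (C, H, W)    # latent shape is modality-independent
    for dst, dec in decoders.items():
        y = dec(r)                 # any of the 3 x 3 = 9 conversion paths
```

Because the latent shape is modality-independent, the 3 × 3 = 9 conversion paths all type-check without any modality-specific glue.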
For semantic segmentation, the U-net architecture with skip paths is also employed to propagate intermediate low-level features from the encoders to the decoders. Low-level feature maps in the encoder are concatenated with feature maps generated from the latent representation and then convolved in order to mix the features. Since we use 3×3 convolution kernels with 2×2 max-pooling operators for the encoder and 3×3 convolution kernels with 2×2 un-pooling operators for the decoder, the encoder and decoder networks are symmetric (U-shaped). The introduced model has skip paths among all combinations of the encoders and decoders, and also shares the low-level features in a similar manner to the latent representation.
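The skip connection itself is a channel-wise concatenation followed by a mixing convolution. A shape-only sketch (the convolution is omitted; all sizes are illustrative):

```python
import numpy as np

enc_feat = np.zeros((64, 48, 48))   # low-level feature map from some encoder
dec_feat = np.zeros((64, 48, 48))   # decoder feature map at the same resolution

# concatenate along the channel axis; a 3x3 convolution would then
# mix the 128 channels back down to the decoder's channel count
mixed = np.concatenate([enc_feat, dec_feat], axis=0)
assert mixed.shape == (128, 48, 48)
```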
In the training phase, a batch of training data is passed through all forwarding paths for calculating losses. For example, given a batch of paired RGB image, depth, and semantic label images, three decoding losses from the RGB image / depth / label decoders are first computed for the RGB image encoder. The same procedure is then repeated for depth and label encoders, and the global loss is defined as the sum of all decoding losses from nine encoder–decoder pairs, i.e., three input modalities for three tasks. The gradients for the whole network are computed based on the global loss by back-propagation. If the training batch contains unpaired data, only within-modal self-encoding losses are computed. In the following, we describe details of the cost functions defined for decoders of RGB images, semantic labels, and depth maps.
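The global loss described above can be sketched as a double loop over input and output modalities; `decoding_loss` is a placeholder for encoding `batch[src]`, decoding it into modality `dst`, and comparing with `batch[dst]` (names are illustrative).

```python
modalities = ['rgb', 'label', 'depth']

def decoding_loss(src, dst, batch):
    # placeholder: encode batch[src], decode into dst, compare with batch[dst]
    return 1.0

def global_loss(batch):
    """Sum of decoding losses over all nine encoder-decoder pairs."""
    total = 0.0
    for src in modalities:       # which encoder receives the input
        for dst in modalities:   # which decoder produces the output
            total += decoding_loss(src, dst, batch)
    return total
```

With three input modalities and three tasks, the double loop yields the nine decoding losses mentioned in the text; for unpaired data only the `src == dst` (self-encoding) terms would be computed.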
RGB images. For RGB image decoding, we define the loss $L_r$ as the ℓ1 distance of RGB values:
$$L_r = \frac{1}{N} \sum_{x \in P} \left\| r(x) - f_r(x) \right\|_1,$$
where $r(x) \in \mathbb{R}^3$ and $f_r(x) \in \mathbb{R}^3$ are the ground-truth and predicted RGB values at pixel $x$, $P$ is the set of output pixels, and $N$ is the number of pixels in $P$. If the goal of the network is realistic RGB image generation from depth and label images, the RGB image decoding loss may be further extended to DCGAN-based architectures [42].
Semantic labels. In this chapter, we define a label image as a map in which each pixel has a one-hot vector representing the class the pixel belongs to. The number of input and output channels is thus equal to the number of classes. We define the loss function of semantic label decoding by the pixel-level cross-entropy. Let $K$ be the number of classes; the softmax function is then written as
$$p^{(k)}(x) = \frac{\exp\left(f_s^{(k)}(x)\right)}{\sum_{i=1}^{K} \exp\left(f_s^{(i)}(x)\right)},$$
where $f_s^{(k)}(x) \in \mathbb{R}$ indicates the value at location $x$ in the $k$th channel ($k \in \{1, \ldots, K\}$) of the tensor given by the final layer output. Letting $P$ be the whole set of pixels in the output and $N$ be the number of pixels, the loss function $L_s$ is defined as
$$L_s = -\frac{1}{N} \sum_{x \in P} \sum_{k=1}^{K} t_k(x) \log p^{(k)}(x),$$
where $t_k(x) \in \{0, 1\}$ is the $k$th channel of the ground-truth label, which is one if the pixel belongs to the $k$th class and zero otherwise.
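The per-pixel softmax and cross-entropy above can be written compactly in NumPy; `f_s` has shape (K, H, W) with one channel per class, and `t` is the one-hot ground-truth map of the same shape. This is a minimal sketch, with the usual max-subtraction added for numerical stability.

```python
import numpy as np

def softmax_ce(f_s, t):
    """Pixel-level softmax cross-entropy: L_s = -(1/N) sum_x sum_k t_k(x) log p_k(x)."""
    f = f_s - f_s.max(axis=0, keepdims=True)           # numerical stability
    p = np.exp(f) / np.exp(f).sum(axis=0, keepdims=True)
    N = f_s.shape[1] * f_s.shape[2]                    # number of pixels
    return -(t * np.log(p)).sum() / N
```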
Depth maps. For the depth decoder, we use the ℓ1 distance between the ground-truth and predicted depth maps. The loss function $L_d$ is defined as
$$L_d = \frac{1}{N} \sum_{x \in P} \left| d(x) - f_d(x) \right|,$$
where $d(x) \in \mathbb{R}_+$ and $f_d(x) \in \mathbb{R}$ are the ground-truth and predicted depth values, respectively. We linearly scale the depth values so that they range in [0, 255]. In the evaluation step, we revert the normalized depth map to the original scale.
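A minimal sketch of the depth loss and the [0, 255] normalization described above; the function names and the explicit `d_min` / `d_max` arguments are illustrative.

```python
import numpy as np

def normalize_depth(d, d_min, d_max):
    """Linearly scale depth values into the [0, 255] range."""
    return (d - d_min) / (d_max - d_min) * 255.0

def denormalize_depth(d_norm, d_min, d_max):
    """Revert a normalized depth map to the original scale (for evaluation)."""
    return d_norm / 255.0 * (d_max - d_min) + d_min

def depth_loss(d_true, d_pred):
    """L_d = (1/N) sum_x |d(x) - f_d(x)|."""
    return np.abs(d_true - d_pred).mean()
```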
In the multimodal encoder–decoder networks, learnable parameters are initialized from a normal distribution. We set the learning rate to 0.001 and the momentum to 0.9 for all layers, with a weight decay of 0.0005. The input image size is fixed to 96×96. We use both paired and unpaired data as training data, randomly mixed in every epoch. When a triplet of the RGB image, semantic label, and depth map of a scene is available, training begins with the RGB image as input and the corresponding semantic label, depth map, and the RGB image itself as outputs. In the next step, the semantic label is used as input, and the other modalities as well as the semantic label itself are used as outputs for computing individual losses. These steps are repeated over all combinations of modalities and input / output; a loss is calculated for each combination and used to update the parameters. When the triplet of training data is unavailable (for example, when the dataset contains extra unlabeled RGB images), the network is updated only with the reconstruction loss, in a similar manner to auto-encoders. We train the network with mini-batches, each including at least one labeled RGB image paired with either semantic labels or depth maps if available, because such paired data are empirically found to contribute to the convergence of training.
In this section, we evaluate the multimodal encoder–decoder networks [5] for semantic segmentation and depth estimation using two public datasets: NYUDv2 [48] and Cityscapes [49]. The baseline models are the single-task encoder–decoder networks (enc-dec) and the single-modal (RGB image) multitask encoder–decoder networks (enc-decs), which have the same architecture as the multimodal encoder–decoder networks. We also compare the multimodal encoder–decoder networks to the multimodal auto-encoders (MAE) [1], which concatenate the latent representations of auto-encoders for different modalities. Since the shared representation in MAE is the concatenation of the latent representations of all modalities, zero-filled pseudo-signals must be explicitly input to estimate missing modalities. Also, MAE uses fully connected layers instead of convolutional layers, so input images are flattened before being fed into the first layer.
For semantic segmentation, we use the mean intersection over union (MIOU) scores for the evaluation. IOU is defined as
$$\mathrm{IOU} = \frac{TP}{TP + FP + FN},$$
where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels, respectively, determined over the whole test set. MIOU is the mean of the IOU on all classes.
For depth estimation, we use several evaluation measures commonly used in prior work [50,11,12]: the average relative error $\mathrm{Rel} = \frac{1}{N} \sum_{x \in P} |d(x) - \hat{d}(x)| / d(x)$, the average $\log_{10}$ error $\frac{1}{N} \sum_{x \in P} |\log_{10} d(x) - \log_{10} \hat{d}(x)|$, the root mean squared error $\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{x \in P} (d(x) - \hat{d}(x))^2}$, and the threshold accuracy, i.e., the percentage of pixels satisfying $\max(d(x)/\hat{d}(x), \hat{d}(x)/d(x)) < \delta$ for $\delta \in \{1.25, 1.25^2, 1.25^3\}$, where $d(x)$ and $\hat{d}(x)$ are the ground-truth and predicted depth at pixel $x$, $P$ is the whole set of pixels in an image, and $N$ is the number of pixels in $P$.
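These depth metrics, as commonly defined in the depth estimation literature, can be computed in a few lines; `d` and `d_hat` are ground-truth and predicted depth arrays over valid pixels, and the function name is illustrative.

```python
import numpy as np

def depth_metrics(d, d_hat):
    """Rel, log10, RMSE errors and threshold accuracies delta < 1.25^t."""
    rel = np.mean(np.abs(d - d_hat) / d)
    log10 = np.mean(np.abs(np.log10(d) - np.log10(d_hat)))
    rmse = np.sqrt(np.mean((d - d_hat) ** 2))
    ratio = np.maximum(d / d_hat, d_hat / d)
    acc = {t: float(np.mean(ratio < 1.25 ** t)) for t in (1, 2, 3)}
    return rel, log10, rmse, acc
```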
The NYUDv2 dataset has 1448 images annotated with semantic labels and measured depth values, containing 868 images for training and 580 images for testing. We divided the test data into 290 images for validation and 290 for testing, and used the validation data for early stopping. The dataset also has 407,024 extra unpaired RGB images, from which we randomly selected 10,459 images as unpaired training data. For semantic segmentation, following prior work [51,52,12], we evaluate the performance on 12 classes out of all available classes; the "other" class was not considered in either training or evaluation. We trained and evaluated the multimodal encoder–decoder networks on two image resolutions, 96×96 and 320×240 pixels.
Table 2.1 shows the depth estimation and semantic segmentation results of all methods. The first six rows show results on 96×96 input images, and each row corresponds to MAE [1], the single-task encoder–decoder with and without skip connections, the single-modal multitask encoder–decoder, and the multimodal encoder–decoder networks without and with extra RGB training data. The first six columns show performance metrics for depth estimation, and the last column shows semantic segmentation performance. The performance of the multimodal encoder–decoder networks was better than that of the single-task network (enc-dec) and the single-modal multitask encoder–decoder network (enc-decs) on all metrics even without the extra data, showing the effectiveness of the multimodal architecture. The performance was further improved with the extra data, achieving the best results on all evaluation metrics. This shows the benefit of using unpaired training data and multiple modalities to learn more effective representations.
Table 2.1
Performance comparison on the NYUDv2 dataset. The first six rows show results on 96 × 96 input images, and each row corresponds to MAE [1], single-task encoder–decoder with and without U-net architecture, single-modal multitask encoder–decoder, and the multimodal encoder–decoder networks without and with extra RGB training data. The next seven rows show results on 320 × 240 input images in comparison with baseline depth estimation methods [53–55,11,50].
Rel, log10, and RMSE are depth estimation errors; the three δ columns are depth estimation accuracies; MIOU is the semantic segmentation score.

| Resolution | Method | Rel | log10 | RMSE | δ < 1.25 | δ < 1.25² | δ < 1.25³ | MIOU |
|---|---|---|---|---|---|---|---|---|
| 96 × 96 | MAE [1] | 1.147 | 0.290 | 2.311 | 0.098 | 0.293 | 0.491 | 0.018 |
| | enc-dec (U) | – | – | – | – | – | – | 0.357 |
| | enc-dec | 0.340 | 0.149 | 1.216 | 0.396 | 0.699 | 0.732 | – |
| | enc-decs | 0.321 | 0.150 | 1.201 | 0.398 | 0.687 | 0.718 | 0.352 |
| | multimodal enc-decs | 0.296 | 0.120 | 1.046 | 0.450 | 0.775 | 0.810 | 0.411 |
| | multimodal enc-decs (+extra) | 0.283 | 0.119 | 1.042 | 0.461 | 0.778 | 0.810 | 0.420 |
| 320 × 240 | Make3d [53] | 0.349 | – | 1.214 | 0.447 | 0.745 | 0.897 | – |
| | DepthTransfer [54] | 0.350 | 0.131 | 1.200 | – | – | – | – |
| | Discrete-continuous CRF [55] | 0.335 | 0.127 | 1.060 | – | – | – | – |
| | Eigen et al. [11] | 0.215 | – | 0.907 | 0.601 | 0.887 | 0.971 | – |
| | Liu et al. [50] | 0.230 | 0.095 | 0.824 | 0.614 | 0.883 | 0.971 | – |
| | multimodal enc-decs | 0.228 | 0.088 | 0.823 | 0.576 | 0.849 | 0.867 | 0.543 |
| | multimodal enc-decs (+extra) | 0.221 | 0.087 | 0.819 | 0.579 | 0.853 | 0.872 | 0.548 |
In addition, the next seven rows show results on 320×240 input images in comparison with baseline depth estimation methods, including Make3d [53], DepthTransfer [54], discrete-continuous CRF [55], Eigen et al. [11], and Liu et al. [50]. Performance improved when trained on higher-resolution images, in terms of both depth estimation and semantic segmentation. The multimodal encoder–decoder networks achieved better performance than the baseline methods, and results comparable to methods requiring CRF-based optimization [55] or a large amount of labeled training data [11], even without the extra data.
More detailed results on semantic segmentation on 96×96 images are shown in Table 2.2. Each column shows class-specific IOU scores for all models. The multimodal encoder–decoder networks with extra training data outperformed the baseline models in 10 out of the 12 classes and achieved a 0.063-point improvement in MIOU.
Table 2.2
Detailed IOU on the NYUDv2 dataset. Each column shows class-specific IOU scores for all models.
| Class | MAE | enc-dec (U) | enc-decs | multimodal enc-decs | multimodal enc-decs (+extra) |
|---|---|---|---|---|---|
book | 0.002 | 0.055 | 0.071 | 0.096 | 0.072 |
cabinet | 0.033 | 0.371 | 0.382 | 0.480 | 0.507 |
ceiling | 0.000 | 0.472 | 0.414 | 0.529 | 0.534 |
floor | 0.101 | 0.648 | 0.659 | 0.704 | 0.736 |
table | 0.020 | 0.197 | 0.222 | 0.237 | 0.299 |
wall | 0.023 | 0.711 | 0.706 | 0.745 | 0.749 |
window | 0.022 | 0.334 | 0.363 | 0.321 | 0.320 |
picture | 0.005 | 0.361 | 0.336 | 0.414 | 0.422 |
blinds | 0.001 | 0.274 | 0.234 | 0.303 | 0.304 |
sofa | 0.004 | 0.302 | 0.300 | 0.365 | 0.375 |
bed | 0.006 | 0.370 | 0.320 | 0.455 | 0.413 |
tv | 0.000 | 0.192 | 0.220 | 0.285 | 0.307 |
mean | 0.018 | 0.357 | 0.352 | 0.411 | 0.420 |
The Cityscapes dataset consists of 2975 training images and 500 validation images, provided together with semantic labels and disparity maps. We divided the validation data into 250 images for validation and 250 for testing, and used the validation data for early stopping. The dataset has 19,998 additional RGB images without annotations, which we also used as extra training data. There are semantic labels for 19 object classes and a single background (unannotated) class; we used the 19 classes (excluding the background class) for evaluation. For depth estimation, we used the disparity maps provided with the dataset as ground truth. Since the raw data contain missing disparity values, unlike NYUDv2, we adopted an image inpainting method [57] to interpolate the disparity maps for both training and testing. We used image resolutions of 96×96 and 512×256, while in the 512×256 case the multimodal encoder–decoder networks were trained on half-split 256×256 images.
The results are shown in Table 2.3, and the detailed comparison on semantic segmentation using 96×96 images is summarized in Table 2.4. The first six rows in Table 2.3 show a comparison between different architectures using 96×96 images. The multimodal encoder–decoder networks achieved improvements over both MAE [1] and the baseline networks in most of the target classes. While the multimodal encoder–decoder networks without extra data did not improve MIOU, they achieved a 0.043-point improvement with extra data. The multimodal encoder–decoder networks also achieved the best performance on the depth estimation task, and the performance gain from extra data illustrates the generalization capability of the described training strategy. The next three rows show results using 512×256 images, where the multimodal encoder–decoder networks achieved better performance than the baseline method [56] on semantic segmentation.
Table 2.3
Performance comparison on the Cityscapes dataset.
Rel, log10, and RMSE are depth estimation errors; the three δ columns are depth estimation accuracies; MIOU is the semantic segmentation score.

| Resolution | Method | Rel | log10 | RMSE | δ < 1.25 | δ < 1.25² | δ < 1.25³ | MIOU |
|---|---|---|---|---|---|---|---|---|
| 96 × 96 | MAE [1] | 3.675 | 0.441 | 34.583 | 0.213 | 0.395 | 0.471 | 0.099 |
| | enc-dec (U) | – | – | – | – | – | – | 0.346 |
| | enc-dec | 0.380 | 0.125 | 8.983 | 0.602 | 0.780 | 0.870 | – |
| | enc-decs | 0.365 | 0.117 | 8.863 | 0.625 | 0.798 | 0.880 | 0.356 |
| | multimodal enc-decs | 0.387 | 0.115 | 8.267 | 0.631 | 0.803 | 0.887 | 0.346 |
| | multimodal enc-decs (+extra) | 0.290 | 0.100 | 7.759 | 0.667 | 0.837 | 0.908 | 0.389 |
| 512 × 256 | Segnet [56] | – | – | – | – | – | – | 0.561 |
| | multimodal enc-decs | 0.201 | 0.076 | 5.528 | 0.759 | 0.908 | 0.949 | 0.575 |
| | multimodal enc-decs (+extra) | 0.217 | 0.080 | 5.475 | 0.765 | 0.908 | 0.949 | 0.604 |
Table 2.4
Detailed IOU on the Cityscapes dataset.
| Class | MAE | enc-dec | enc-decs | multimodal enc-decs | multimodal enc-decs (+extra) |
|---|---|---|---|---|---|
road | 0.688 | 0.931 | 0.936 | 0.925 | 0.950 |
sidewalk | 0.159 | 0.556 | 0.551 | 0.529 | 0.640 |
building | 0.372 | 0.757 | 0.769 | 0.770 | 0.793 |
wall | 0.022 | 0.125 | 0.128 | 0.053 | 0.172 |
fence | 0.000 | 0.054 | 0.051 | 0.036 | 0.062 |
pole | 0.000 | 0.230 | 0.220 | 0.225 | 0.280 |
traffic light | 0.000 | 0.100 | 0.074 | 0.049 | 0.109 |
traffic sign | 0.000 | 0.164 | 0.203 | 0.189 | 0.231 |
vegetation | 0.200 | 0.802 | 0.805 | 0.805 | 0.826 |
terrain | 0.000 | 0.430 | 0.446 | 0.445 | 0.498 |
sky | 0.295 | 0.869 | 0.887 | 0.867 | 0.890 |
person | 0.000 | 0.309 | 0.318 | 0.325 | 0.365 |
rider | 0.000 | 0.040 | 0.058 | 0.007 | 0.036 |
car | 0.137 | 0.724 | 0.743 | 0.720 | 0.788 |
truck | 0.000 | 0.062 | 0.051 | 0.075 | 0.035 |
bus | 0.000 | 0.096 | 0.152 | 0.153 | 0.251 |
train | 0.000 | 0.006 | 0.077 | 0.133 | 0.032 |
motorcycle | 0.000 | 0.048 | 0.056 | 0.043 | 0.108 |
bicycle | 0.000 | 0.270 | 0.241 | 0.218 | 0.329 |
mean | 0.099 | 0.346 | 0.356 | 0.346 | 0.389 |
Although the main goal of the described approach is semantic segmentation and depth estimation from RGB images, Fig. 2.16 shows other cross-modal conversion pairs, i.e., semantic segmentation from depth images and depth estimation from semantic labels, on the Cityscapes dataset. From left to right, each column corresponds to (A) the ground-truth RGB image, (B) the ground-truth semantic label image (top) and depth map (bottom), (C) image-to-label (top) and image-to-depth (bottom) results, and (D) depth-to-label (top) and label-to-depth (bottom) results. The ground-truth depth maps are shown after inpainting. As can be seen, the multimodal encoder–decoder networks can also reasonably perform these auxiliary tasks.
More detailed examples and evaluations on the NYUDv2 dataset are shown in Fig. 2.17 and Table 2.5. The left side of Fig. 2.17 shows example outputs for all of the above-mentioned tasks. From top to bottom on the left side, each row corresponds to the ground-truth (A) RGB image and (B) semantic label image; estimated semantic labels from (C) the baseline enc-dec model, (D) the image-to-label path, and (E) the depth-to-label path of the multimodal encoder–decoder networks; (F) the ground-truth depth map (normalized to [0, 255] for visualization); and estimated depth maps from (G) enc-dec, (H) image-to-depth, and (I) label-to-depth. Interestingly, these auxiliary tasks achieved better performance than the RGB-input cases. Clearer object boundaries in the label and depth inputs are one potential reason for this improvement. In addition, the right side of Fig. 2.17 shows image decoding tasks; each block corresponds to (A) the ground-truth RGB image, (B) semantic label, (C) depth map, (D) label-to-image, and (E) depth-to-image. Although the multimodal encoder–decoder networks could not correctly reconstruct the input colors, object shapes are visible even with the simple image reconstruction loss.
Table 2.5
Comparison of auxiliary task performances on the NYUDv2.
| Conversion | Rel | log10 | RMSE | δ < 1.25 | δ < 1.25² | δ < 1.25³ | MIOU |
|---|---|---|---|---|---|---|---|
image-to-depth | 0.283 | 0.119 | 1.042 | 0.461 | 0.778 | 0.810 | – |
label-to-depth | 0.258 | 0.128 | 1.114 | 0.452 | 0.741 | 0.779 | – |
image-to-label | – | – | – | – | – | – | 0.420 |
depth-to-label | – | – | – | – | – | – | 0.476 |
In this chapter, we introduced several state-of-the-art approaches to deep learning for multimodal data fusion, as well as the basic techniques behind them. In particular, we described a new approach named multimodal encoder–decoder networks for efficient multitask learning with a shared feature representation. In the multimodal encoder–decoder networks, encoders and decoders are connected via the shared latent representation and shared skip connections. Experiments showed the potential of representations shared across different modalities to improve multitask performance.
One of the most important issues in future work is to investigate the effectiveness of the multimodal encoder–decoder networks on different tasks such as image captioning and DCGAN-based image translation. More detailed investigation on learned shared representations during multitask training is another important future direction to understand why and how the multimodal encoder–decoder architecture addresses the multimodal conversion tasks.