Chapter 12

Cross-modal Learning by Hallucinating Missing Modalities in RGB-D Vision

Nuno C. Garcia, Pietro Morerio, Vittorio Murino
Pattern Analysis & Computer Vision (PAVIS), Istituto Italiano di Tecnologia (IIT), Genova, Italy
Università degli Studi di Genova, Genova, Italy
Università degli Studi di Verona, Verona, Italy

Abstract

Diverse input data modalities can provide complementary cues for several tasks, usually leading to more robust algorithms and better performance. However, while a (training) dataset could be accurately designed to include a variety of sensory inputs, it is often the case that not all modalities are available in real life (testing) scenarios, when the model is to be deployed. This raises the challenge of how to learn robust representations leveraging multimodal data in the training stage, while considering limitations at test time, such as noisy or missing modalities. This chapter presents a new approach for multimodal video action recognition, developed within the unified frameworks of distillation and privileged information, named generalized distillation. We consider the particular case of learning representations from depth and RGB videos, while relying on RGB data only at test time. Our approach consists in training a hallucination network that learns to distill depth features through multiplicative connections of spatiotemporal representations, leveraging soft labels, hard labels, and the Euclidean distance between feature maps. We report state-of-the-art or comparable results on video action recognition on the largest multimodal dataset available for this task, the NTU RGB+D, as well as on the smaller UWA3DII and Northwestern-UCLA datasets. Our code is available at https://github.com/ncgarcia/modality-distillation.

Keywords

Action recognition; Deep multimodal learning; Distillation; Privileged information

12.1 Introduction

Depth perception refers to the ability to reason about the 3D world from visual information captured by the retinal surface. It is vital for the survival of many animal species, which use it for hunting or escaping, and an important skill for humans to understand and interact with the surrounding environment. Humans start to develop depth perception very early, when babies start to crawl [1]. Several mechanisms have been identified that jointly contribute, in different ways, to the sense of relative and absolute position of objects; these are usually called depth cues. They can be divided into oculomotor cues (e.g., the movement of the eyes converging or diverging) and, more interestingly for our purposes, visual cues, which can be binocular or monocular. Binocular cues are related to stereovision and how the brain calculates depth based on the disparity of the left and right eyes' images. Monocular cues, on the other hand, refer to a priori visual assumptions derived from 2D single images, often related to physical factors such as shadows, perspective, motion parallax, texture gradient, occlusion, and others. For example, the assumption that an object looks blurrier the farther away it is, or that an object must be closer if it occludes another one, are signals that we can acquire with one eye only, and that our brain uses to reason about relative depth [2]. Although using only monocular vision affects object distance estimation [3], we are still able to perform most of our vision-related tasks efficiently with one eye only, and, most importantly, to extract at least some depth information from 2D images without stereo mechanisms, using monocular cues, and make use of it to navigate the 3D world.

Similarly, depth information is often of paramount importance for many computer vision tasks related to robotics, autonomous driving, and scene understanding, to name a few. The emergence of cheap depth sensors and the need for big data to train larger models led to large multimodal datasets containing RGB, depth, infrared, and joint sequences [4], which stimulated multimodal deep learning approaches. Traditional computer vision tasks like action recognition, object detection, or instance segmentation have been shown to benefit from performance gains when the model considers other modalities, namely depth, instead of RGB only [5–8].

Even though depth information brings improvements over RGB-only approaches, it is unrealistic to expect total availability of such data modality when the model is deployed in the real world. RGB cameras are still much more ubiquitous than depth sensors which may be difficult to install everywhere, and, moreover, good quality depth data might be difficult to acquire due to far-distance or reflectance issues, not to mention sensor or communications failure, or other unpredictable events.

Considering this limitation, we would like to answer the following question: what is the best way of using all data available at training time, in order to learn robust representations, knowing that there are missing (or noisy) modalities at test time? In other words, is there any added value in training a model by exploiting multimodal data, even if only one modality is available at test time? Unsurprisingly, the simplest and most commonly adopted solution consists in training the model using only the modality in which it will be tested. However, a more interesting alternative is trying to exploit the potential of the available data and train the model using all modalities, being, however, aware of the fact that not all of them will be accessible at test time (Fig. 12.1). This learning paradigm, i.e., when the model is trained using extra information, is generally known as learning with privileged information [9] or learning with side information [10].

Figure 12.1 What is the best way of using all data available at training time, considering a missing (or noisy) modality at test time?

In this chapter, we present a multimodal stream framework that learns from different data modalities and can be deployed and tested on a subset of these [11]. We design a model able to learn from RGB and depth video sequences, which, due to its general structure, can also be used to manage other combinations of modalities as well. To show its potential, we evaluate the performance on the task of video action recognition. The goal of the new learning paradigm, depicted in Fig. 12.2, is to distill the information conveyed by depth into a hallucination network, which is meant to “mimic” the missing stream at test time. Distillation [12] [13] refers to any training procedure where knowledge is transferred from a previously trained complex model to a simpler one. Our learning procedure introduces a new loss function, inspired by the generalized distillation framework [14], which formally unifies distillation and privileged information learning theories.

Figure 12.2 Training procedure described in Sect. 12.3.3 (see also text therein). The first step refers to the separate (pre-)training of depth and RGB streams with standard cross-entropy classification loss, with both streams initialized with ImageNet weights. The second step, depicted in the scheme, represents the learning of the teacher network; both streams are initialized with the respective weights from step 1, and trained jointly with a cross-entropy loss as a traditional two-stream model, using RGB and depth data. The third step represents the learning of the student network: both streams are initialized with the depth stream weights from the previous step, but the actual depth stream is frozen; importantly, the input for the hallucination stream is RGB data; the model is trained using the loss proposed in Eq. (12.5). The fourth and last step refers to a fine-tuning step and also the test setup of our model, represented in the scheme; the hallucination stream is initialized from the respective weights from previous step, and the RGB stream with the respective weights from the second step; this model is fine-tuned using a cross-entropy loss, and importantly, using only RGB data as input for both streams.

Our model is inspired by the two-stream network introduced by Simonyan and Zisserman [15], which has been notably successful in the traditional setting for video action recognition task [16] [17]. Differently from these works, we are interested in a multimodal setting, where we train one stream for each modality (RGB and depth in our case) and use these in the framework of privileged information. The most related work to ours is the inspiring method of Hoffman et al. [10], which proposes a hallucination network to learn with side information. We build on this idea, extending it by devising a new mechanism to learn and use such hallucination stream through a more general loss function and inter-stream connections.

This book chapter is strongly based on our recent conference paper [11], whose main contributions and ideas are the following: (1) a multimodal stream network architecture able to exploit multiple data modalities at training time while using only one at test time; (2) a new training procedure to learn a hallucination network within a novel two-stream model; (3) a more general loss function, based on the generalized distillation framework; (4) results comparable to or achieving the state of the art – in the privileged information scenario – on the largest multimodal dataset for video action recognition, the NTU RGB+D [18], and on two other smaller ones, the UWA3DII [19] and the Northwestern-UCLA [20].

The rest of the chapter is organized as follows. Sect. 12.2 reviews similar approaches and discusses how they relate to the present work. Sect. 12.3 details the proposed architecture and the novel learning paradigm. Sect. 12.4 reports the results obtained on the various datasets, including a detailed ablation study performed on the NTU RGB+D dataset and a comparative performance with respect to the state of the art. Finally, we draw conclusions and future research directions in Sect. 12.5.

12.2 Related Work

Our work is at the intersection of three topics: privileged information [9], network distillation [12] [13], and multimodal video action recognition. However, Lopez-Paz et al. [14] noted that privileged information and network distillation are instances of the same, more inclusive theory, called generalized distillation.

12.2.1 Generalized Distillation

Within the generalized distillation framework, our model is both related to the privileged information theory [9], considering that the extra modality (depth, in this case) is only used at training time, and, mostly, to the distillation framework. Indeed the core mechanism that our model uses to learn the hallucination network is derived from a distillation loss. More specifically, the supervision information provided by the teacher network (in this case, the network processing the depth data stream) is distilled into the hallucination network leveraging teacher's soft predictions and hard ground-truth labels in the loss function.

In this context, the closest works to our proposal are [21] and [10]. Luo et al. [21] addressed a similar problem to ours, where the model is first trained on several modalities (RGB, depth, joints and infrared), but tested only on one. The authors propose a graph-based distillation method that is able to distill information from all modalities at training time, while also passing through a validation phase on a subset of modalities. This approach was shown to reach state-of-the-art results in action recognition and action detection tasks. Our work substantially differs from [21] since we benefit from a hallucination mechanism, consisting of an auxiliary network trained using the guidance distilled by the teacher network (which processes the depth data stream in our case). This mechanism allows the model to learn to emulate the presence of the missing modality at test time.

The work of Hoffman et al. [10] introduced a model to hallucinate depth features from RGB input for the object detection task. While the idea of using a hallucination stream is similar to the one presented here, the mechanism used to learn it is different. In [10], the authors use a Euclidean loss between depth and hallucinated feature maps that is part of a total loss comprising more than ten classification and localization terms, which makes its effectiveness very dependent on hyperparameter tuning to balance the different values, as the model is trained jointly in one step by optimizing the aforementioned composite loss. Differently, we propose a loss inspired by the distillation framework that not only uses the Euclidean distance between feature maps and the one-hot labels, but also leverages soft predictions from the depth network. Moreover, we encourage the hallucination learning by design, by using cross-stream connections (see Sect. 12.3). This is shown to largely improve the performance of our model with respect to the one-step learning process proposed in [10].

12.2.2 Multimodal Video Action Recognition

Video action recognition has a long and rich literature, spanning from classification methods using handcrafted features [22] [23] [24] [25] to modern deep learning approaches [26] [27] [28] [16], using either RGB-only or various multimodal data. Here, we focus on some of the most relevant works in multimodal video action recognition, including state-of-the-art methods on the NTU RGB+D dataset, as well as architectures related to our proposed model.

The two-stream model introduced by Simonyan and Zisserman [15] is a landmark in video analysis, and has since inspired a series of variants that achieved state-of-the-art performance on diverse datasets. This architecture is composed of an RGB and an optical flow stream, which are trained separately and then fused at the prediction layer. The current state of the art in video action recognition [16] is inspired by such a model, featuring 3D convolutions to deal with the temporal dimension instead of the original 2D ones. In [17], a further variation of the two-stream approach is proposed, which models spatiotemporal features by injecting the motion stream's signal into the residual unit of the appearance stream. The idea of combining the two streams had also been explored previously by the same authors in [29].

Instead, in [5], the authors explore the complementary properties of RGB and depth data, taking the NTU RGB+D dataset as a testbed. This work designed a deep autoencoder architecture and a structured sparsity learning machine, and was shown to achieve state-of-the-art results for action recognition. Liu et al. [6] also use RGB and depth complementary information to devise a method for viewpoint-invariant action recognition. Here, dense trajectories are first extracted from RGB data and then encoded into viewpoint-invariant deep features. The RGB and depth features are then used as a dictionary to predict the test label.

All these previous methods exploited the rich information conveyed by the multimodal data to improve recognition. Our work, instead, proposes a fully convolutional model that exploits RGB and depth data at training time only, and uses exclusively RGB data as input at test time, reaching performance comparable to those utilizing the complete set of modalities in both stages.

12.3 Generalized Distillation with Multiple Stream Networks

This section describes our approach in terms of its architecture, the losses used to learn the different networks, and the training procedure.

12.3.1 Cross-stream Multiplier Networks

Typically in two-stream architectures, the two streams are trained separately and the predictions are fused with a late fusion mechanism [15] [17]. Such models use as input appearance (RGB) and motion (optical flow) data, which are fed separately into each stream, both in training and testing. Instead, in this work we use RGB and depth frames as inputs for training, but only RGB at test time, as already discussed (Fig. 12.2).

We use the ResNet-50-based [30] [31] model proposed in [17] as the baseline architecture for each stream block of our model. In that paper, Feichtenhofer et al. proposed to connect the appearance and motion streams with multiplicative connections at several layers, as opposed to previous models, which would only interact at the prediction layer. Such connections are depicted in Fig. 12.2 with the blue arrows. Fig. 12.3 illustrates this mechanism at a given layer of the multiple stream architecture, while it is actually implemented at the four convolutional layers of the ResNet-50 model. The underlying intuition is that these connections enable the model to learn better spatiotemporal representations, and help to distinguish between visually similar actions that require the combination of appearance and motion features. Originally, the cross-stream connections consisted of the injection of the motion stream signal into the other stream's residual unit, without affecting the skip path. ResNet's residual units are formally expressed as:

$$x_{l+1} = f\big(h(x_l) + F(x_l, W_l)\big),$$

where $x_l$ and $x_{l+1}$ are the $l$th layer's input and output, respectively, $F$ represents the residual convolutional layers defined by weights $W_l$, $h(x_l)$ is an identity mapping, and $f$ is a ReLU non-linearity. The cross-stream connections are then defined as

$$x_{l+1}^{a} = f(x_l^{a}) + F\big(x_l^{a} \odot f(x_l^{m}),\, W_l\big),$$

where $x^{a}$ and $x^{m}$ are the appearance and motion streams, respectively, and $\odot$ is the element-wise multiplication operation. Such a mechanism implies a spatial alignment between both feature maps, and therefore between both modalities. This alignment comes for free when using RGB and optical flow, since the latter is computed from the former in a way that preserves the spatial arrangement. However, this is an assumption we cannot make in general. For instance, depth and RGB are often captured by different sensors, likely resulting in spatially misaligned frames. The alignment procedure, described in Sect. 12.4.2, is part of the pre-processing phase and refers uniquely to the NTU RGB+D dataset.
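
The following is a minimal PyTorch sketch of such a multiplicative cross-stream connection, assuming a simplified residual unit (the actual streams are ResNet-50 based); module and variable names are illustrative, not the authors' code.

```python
# Sketch of the cross-stream connection x_{l+1}^a = f(x_l^a) + F(x_l^a * f(x_l^m), W_l):
# the other stream's signal is injected multiplicatively into the residual branch only,
# leaving the skip path untouched.
import torch
import torch.nn as nn


class CrossStreamResidualUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(          # F(., W_l): residual conv branch
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.relu = nn.ReLU()                   # f: ReLU non-linearity

    def forward(self, x_app, x_other):
        # x_app: appearance (RGB) feature map; x_other: depth/motion feature map.
        # The element-wise product requires spatially aligned feature maps.
        gated = x_app * self.relu(x_other)
        return self.relu(x_app) + self.residual(gated)


# Usage: feature maps of both streams must share shape (B, C, H, W).
unit = CrossStreamResidualUnit(channels=64)
out = unit(torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56))
```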

Figure 12.3 Detail of the ResNet residual unit, showing the multiplicative connections and temporal convolutions [17]. In our architecture, the signal injection occurs before the second residual unit of each of the four ResNet blocks.

Temporal convolutions. In order to augment the model's temporal support, we implement 1D temporal convolutions in the second residual unit of each ResNet layer (as in [17]), as illustrated in Fig. 12.3. The weights $W_l \in \mathbb{R}^{1 \times 1 \times 3 \times C_l \times C_l}$ are convolutional filters initialized as identity mappings at feature level and centered in time, where $C_l$ is the number of channels in layer $l$.
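
A hedged sketch of such a temporal convolution follows, assuming a 5D tensor layout (batch, channels, time, height, width) in PyTorch; the initialization follows the $[0, 1, 0]$ identity-in-time scheme described in the text.

```python
# 3-tap temporal convolution (1x1 spatially), initialized as an identity mapping
# at feature level and centered in time.
import torch
import torch.nn as nn


def make_temporal_conv(channels: int) -> nn.Conv3d:
    conv = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                     padding=(1, 0, 0), bias=False)
    with torch.no_grad():
        conv.weight.zero_()
        for c in range(channels):
            # Temporal kernel [0, 1, 0]: only the central frame contributes at the
            # start of training; the kernel is free to change as training proceeds.
            conv.weight[c, c, 1, 0, 0] = 1.0
    return conv


# Sanity check: with the identity initialization, the layer initially acts as a no-op.
x = torch.randn(2, 64, 5, 28, 28)          # (batch, channels, time, H, W)
t_conv = make_temporal_conv(64)
assert torch.allclose(t_conv(x), x, atol=1e-6)
```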

Very recently, the authors of [32] explored various network configurations using temporal convolutions, comparing several different combinations for the task of video classification. This work suggests that decoupling 3D convolutions into 2D (spatial) and 1D (temporal) filters is the best setup for action recognition tasks, producing the best accuracies. The intuition behind the latter setup is that factorizing spatial and temporal convolutions into two consecutive convolutional layers eases the training of the spatial and temporal tasks (also in line with [33]).

12.3.2 Hallucination Stream

We also introduce and learn a hallucination network [10], using a new learning paradigm, loss function and interaction mechanism. The hallucination stream network has the same architecture as the appearance and depth stream models.

This network receives RGB as input, and is trained to “imitate” the depth stream at different levels, i.e. at feature and prediction layers. In this work, we explore several ways to implement such learning paradigm, including both the training procedure and the loss, and how they affect the overall performance of the model.

In [10], a regression loss between the hallucination and depth feature maps is proposed, defined as:

$$L_{hall}(l) = \lambda_l \left\| \sigma(A_l^{d}) - \sigma(A_l^{h}) \right\|_2^2, \tag{12.1}$$

where $\sigma$ is the sigmoid function, and $A_l^{d}$ and $A_l^{h}$ are the $l$-th layer activations of the depth and hallucination networks. This Euclidean loss forces both activation maps to be similar. In [10], this loss is weighted along with another ten classification and localization loss terms, making it hard to balance the total loss. One of the main motivations behind the proposed new staged learning paradigm, described in Sect. 12.3.3, is to avoid the inefficient, heuristic-based tweaking of so many loss weights, a.k.a. hyperparameter tuning.
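
As a concrete reference, the following is a minimal PyTorch sketch of the Euclidean hallucination loss of Eq. (12.1); the summation over elements (rather than a mean) and the default weight $\lambda_l$ are assumptions.

```python
# Euclidean hallucination loss, Eq. (12.1):
# L_hall(l) = lambda_l * || sigmoid(A_l^d) - sigmoid(A_l^h) ||_2^2
import torch


def hallucination_loss(act_depth, act_hall, lambda_l=1.0):
    diff = torch.sigmoid(act_depth) - torch.sigmoid(act_hall)
    return lambda_l * diff.pow(2).sum()
```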

Instead, we adopt an approach inspired by the generalized distillation framework [14], in which a student model $f_s \in \mathcal{F}_s$ distills the representation $f_t \in \mathcal{F}_t$ learned by the teacher model. This is formalized as

$$f_s = \operatorname*{arg\,min}_{f \in \mathcal{F}_s} \frac{1}{N} \sum_{i=1}^{N} L_{GD}(i), \tag{12.2}$$

where N is the number of examples in the dataset. The generalized distillation loss is then defined as:

$$L_{GD}(i) = (1-\lambda)\,\ell\big(y_i, \varsigma(f(x_i))\big) + \lambda\,\ell\big(s_i, \varsigma(f(x_i))\big), \qquad \lambda \in [0,1],\ f_s \in \mathcal{F}_s, \tag{12.3}$$

where $\ell$ denotes the classification loss (cross-entropy in our case), $\varsigma$ is the softmax operator, and $s_i$ is the soft prediction from the teacher network:

$$s_i = \varsigma\big(f_t(x_i)/T\big), \qquad T > 0. \tag{12.4}$$

The parameter $\lambda$ in Eq. (12.3) allows one to tune the loss by giving more importance either to imitating the hard ground-truth targets $y_i$ or the soft teacher targets $s_i$. This mechanism indeed allows the transfer of information from the depth (teacher) to the hallucination (student) network. The temperature parameter $T$ in Eq. (12.4) allows one to smooth the probability vector predicted by the teacher network. The intuition is that such smoothing may expose relations between classes that would not easily be revealed in raw predictions, further facilitating the distillation by the student network $f_s$.

We suggest that both the Euclidean and generalized distillation losses are useful in the learning process. In fact, the Euclidean term, by encouraging the network to decrease the distance between hallucinated and true depth feature maps, helps to distill the depth information encoded in the generalized distillation loss. Thus, we formalize our final loss function as follows:

$$L = (1-\alpha)\, L_{GD} + \alpha\, L_{hall}, \qquad \alpha \in [0,1], \tag{12.5}$$

where $\alpha$ is a parameter balancing the contributions of the two loss terms during training. The parameters $\lambda$, $\alpha$ and $T$ are estimated using a validation set, as discussed in Sect. 12.4.3.
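
The sketch below puts Eqs. (12.1)–(12.5) together in PyTorch. It is an illustrative reading of the loss, not the authors' implementation: tensor shapes, the use of cross-entropy for both terms of $L_{GD}$, and the single pair of intermediate activations are assumptions.

```python
import torch
import torch.nn.functional as F


def hallucination_loss(act_depth, act_hall, lambda_l=1.0):
    """Euclidean term of Eq. (12.1), as in the earlier sketch."""
    diff = torch.sigmoid(act_depth) - torch.sigmoid(act_hall)
    return lambda_l * diff.pow(2).sum()


def generalized_distillation_loss(student_logits, teacher_logits, labels, lam=0.5, T=10.0):
    """L_GD of Eq. (12.3), with soft teacher targets s_i of Eq. (12.4)."""
    hard = F.cross_entropy(student_logits, labels)            # term on hard labels y_i
    soft_targets = F.softmax(teacher_logits / T, dim=1)       # s_i = softmax(f_t(x_i) / T)
    log_p = F.log_softmax(student_logits, dim=1)
    soft = -(soft_targets * log_p).sum(dim=1).mean()          # term on soft targets s_i
    return (1.0 - lam) * hard + lam * soft


def total_loss(student_logits, teacher_logits, labels, act_depth, act_hall,
               alpha=0.5, lam=0.5, T=10.0):
    """L = (1 - alpha) * L_GD + alpha * L_hall, Eq. (12.5)."""
    l_gd = generalized_distillation_loss(student_logits, teacher_logits, labels, lam, T)
    return (1.0 - alpha) * l_gd + alpha * hallucination_loss(act_depth, act_hall)
```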

In summary, the generalized distillation framework proposes to use the student-teacher framework introduced in the distillation theory to extract knowledge from the privileged information source. We explore this idea by proposing a new learning paradigm to train a hallucination network using privileged information, which we will describe in the next section. In addition to the loss functions introduced above, we also allow the teacher network to share information with the student network by design, through the cross-stream multiplicative connections. We test how all these possibilities affect the model's performance in the experimental section through an extensive ablation study.

12.3.3 Training Paradigm

In general, the proposed training paradigm, illustrated in Fig. 12.2, is divided into two core parts: the first part (steps 1 and 2 in the figure) focuses on learning the teacher network $f_t$, leveraging RGB and depth data (the privileged information in this case); the second part (steps 3 and 4 in the figure) focuses on learning the hallucination network, referred to as the student network $f_s$ in the distillation framework, using the general hallucination loss defined in Eq. (12.5).

The first training step consists in training both streams separately, which is a common practice in two-stream architectures. Both depth and appearance streams are trained minimizing cross-entropy, after being initialized with a pre-trained ImageNet model for all experiments. Temporal kernels are initialized as $[0, 1, 0]$, i.e. only information on the central frame is used at the beginning—this eventually changes as the training continues. As in [34], depth frames are encoded into color images using a jet colormap.

The second training step is still focused on further training the teacher model. Since the model trained in this step has the architecture and capacity of the final one, and has access to both modalities, its performance represents an upper bound for the task we are addressing. This is one of the major differences between our approach and the one used in [10]: by decoupling the teacher learning phase from the hallucination learning, we are able to learn both a better teacher and a better student, as we will show in the experimental section.

In the third training step, we focus on learning the hallucination network from the teacher model, i.e., the depth stream network just trained. Here, the weights of the depth network are frozen, while it receives depth data as input. The hallucination network, which instead receives RGB data as input, is trained with the loss defined in Eq. (12.5), while also receiving feedback through the cross-stream connections from the depth network. We found that this helps the learning process.

In the fourth and last step, we carry out fine tuning of the whole model, composed of the RGB and the hallucination streams. This step uses RGB-only as input, and it also precisely resembles the setup used at test time. The cross-stream connections inject the hallucinated signal into the appearance RGB stream network, resulting in the multiplication of the hallucinated feature maps and the RGB feature maps. The intuition is that the hallucination network has learned to inform the RGB model where the action is taking place, similarly to what the depth model would do with real depth data.
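
For concreteness, the following is an illustrative sketch of the third training step, assuming two PyTorch stream networks that each return (logits, intermediate activations), the `total_loss` helper from the previous sketch, and a data loader yielding (RGB, depth, label) batches; the cross-stream feedback from the frozen depth stream into the hallucination stream is omitted for brevity.

```python
import torch


def train_step3(depth_stream, hall_stream, loader, optimizer, device="cuda"):
    depth_stream.eval()
    for p in depth_stream.parameters():          # freeze the depth (teacher) stream
        p.requires_grad_(False)

    hall_stream.train()
    for rgb, depth, labels in loader:
        rgb, depth, labels = rgb.to(device), depth.to(device), labels.to(device)
        with torch.no_grad():
            t_logits, t_act = depth_stream(depth)     # teacher sees real depth
        s_logits, s_act = hall_stream(rgb)            # student (hallucination) sees RGB only
        loss = total_loss(s_logits, t_logits, labels, t_act, s_act)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```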

12.4 Experiments

12.4.1 Datasets

We evaluate our method on three datasets, while the ablation study is performed only on the NTU RGB+D dataset. Our model is initialized with ImageNet pre-trained weights and trained and evaluated on the NTU RGB+D dataset. We later fine-tune this model on each of the two smaller datasets for the corresponding evaluation experiments.

NTU RGB+D [18]. This is the largest public dataset for multimodal video action recognition. It is composed of 56,880 videos, available in four modalities: RGB videos, depth sequences, infrared frames, and 3D skeleton data of 25 joints. It was acquired with a Kinect v2 sensor from 80 different viewpoints, and includes 40 subjects performing 60 distinct actions. We follow the two evaluation protocols originally proposed in [18], namely cross-subject and cross-view. As in the original paper, we use about 5% of the training data as a validation set for both protocols, in order to select the parameters λ, α and T. In this work, we use only RGB and depth data. The masked depth maps are converted to three-channel maps via a jet colormap, as in [34].

UWA3DII [19]. This dataset consists of 1,075 samples of RGB, depth and skeleton sequences. It features 10 subjects performing 30 actions captured from 5 different views.

Northwestern-UCLA [20]. Similarly to the other datasets, it provides RGB, depth and skeleton sequences for 1,475 samples. It features 10 subjects performing 10 actions captured from 3 different views.

12.4.2 Pre-processing and Alignment of RGB and Depth Frames

The multiplicative cross-stream connections require RGB and depth frames to be spatially aligned, since they are element-wise operations over the feature maps. The NTU RGB+D dataset is acquired using a Kinect sensor, which results in different dimensions and aspect ratios for RGB and depth frames. Fortunately, this dataset provides the joints' spatial coordinates in every RGB and depth frame, $rgb_{x,y}$ and $depth_{x,y}$, respectively, which we use to align both modalities. Because the other two smaller datasets do not provide such information, this alignment procedure is only applicable to the NTU RGB+D dataset. Consequently, experiments using the other datasets refer to the model with no cross-stream connections.

For every frame of a given video, we first compute the ratio $ratio_x^{A,B} = (rgb_x^{A} - rgb_x^{B}) / (depth_x^{A} - depth_x^{B})$ for all $A, B \in S$, using all depth and RGB x coordinates from the frame's set $S$ of well-tracked joints, and similarly for the y dimension. The video aspect ratio is then calculated as the mean between the median aspect ratio for the x dimension and the median aspect ratio for the y dimension. The RGB frames of a given video are scaled according to this ratio. Finally, RGB and depth frames are overlaid by aligning both skeletons, and the intersection is cropped in both modalities. The cropped sections are then re-scaled to the network's input dimension, in this case 224×224.
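
A rough NumPy sketch of the scale estimation is given below. The exact aggregation (per-pair ratios, medians over the x and y dimensions, then their mean) is our reading of the description above, and array layouts and names are illustrative.

```python
import itertools
import numpy as np


def video_aspect_ratio(per_frame_rgb, per_frame_depth):
    """per_frame_*: lists of (J, 2) arrays of well-tracked joint coordinates,
    one array per frame. Returns the mean of the median x- and y-ratios."""
    ratios = {0: [], 1: []}                              # 0: x dimension, 1: y dimension
    for rgb_xy, depth_xy in zip(per_frame_rgb, per_frame_depth):
        for a, b in itertools.combinations(range(len(rgb_xy)), 2):
            for dim in (0, 1):
                denom = depth_xy[a, dim] - depth_xy[b, dim]
                if abs(denom) > 1e-6:                    # skip degenerate joint pairs
                    ratios[dim].append((rgb_xy[a, dim] - rgb_xy[b, dim]) / denom)
    return 0.5 * (np.median(ratios[0]) + np.median(ratios[1]))
```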

Similarly to what was done in [17], we sample 5 frames evenly spaced in time for each video, both for training and testing. Sampling random frames for training yields similar accuracy values. For training, we also horizontally flip the video frames with probability $P = 0.5$.
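
A small sketch of this frame sampling and augmentation is shown below; the (T, H, W, C) array layout is an assumption.

```python
import numpy as np


def sample_frames(video, n_frames=5, train=True):
    """video: (T, H, W, C) array. Returns (n_frames, H, W, C)."""
    idx = np.linspace(0, len(video) - 1, n_frames).astype(int)   # evenly spaced in time
    clip = video[idx]
    if train and np.random.rand() < 0.5:
        clip = clip[:, :, ::-1, :]        # horizontally flip all sampled frames
    return clip
```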

12.4.3 Hyperparameters and Validation Set

After validation, we selected the following set of hyperparameters: α = 0.5, λ = 0.5, T = 10; slightly different values do not show a significant change in performance. The networks are optimized using ADAM [35] with the default learning rate, except for the fine-tuning steps, where the learning rate was decreased by a factor of 10.

Regarding the NTU RGB+D dataset, the validation set is not defined in the original paper presenting the dataset [18]. For the sake of reproducibility, we explain here how we defined the validation set. For the cross-subject protocol, we choose subject #1 (from the training set), which corresponds to around 5% of the training set. For the cross-view protocol, we do the following: 1) create a dictionary of sorted videos for each key=action (from the training set); 2) set the numpy random seed equal to 0; 3) sample 31 videos using numpy.random.choice for each action, which in the end corresponds to around 5% of the training set.
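
The snippet below is an illustrative rendering of the cross-view split procedure; `videos_by_action` (a mapping from each action label to the sorted list of its training videos), the iteration order, and the use of sampling without replacement are assumptions, so it is not guaranteed to reproduce the exact original split.

```python
import numpy as np


def cross_view_validation_split(videos_by_action, per_action=31, seed=0):
    np.random.seed(seed)                              # step 2: fix the numpy seed
    val_videos = []
    for action in sorted(videos_by_action):           # step 1: dictionary of sorted videos
        vids = sorted(videos_by_action[action])
        # step 3: sample 31 videos per action (~5% of the training set);
        # whether the original sampling used replacement is not specified.
        val_videos.extend(np.random.choice(vids, size=per_action, replace=False))
    return set(val_videos)
```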

12.4.4 Ablation Study

In this section we discuss the results of the experiments carried out to understand the contribution of each part of the model and of the training procedure. Table 12.1 reports performances for the several training steps, different losses, and model configurations.

Table 12.1

Ablation study. A full set of experiments is provided for the NTU cross-subject evaluation protocol. For the cross-view protocol, only the most important results are reported.

| # | Method | Test Modality | Loss | Cross-Subject | Cross-View |
|---|--------|---------------|------|---------------|------------|
| 1 | Ours – step 1, depth stream | Depth | x-entr | 70.44% | 75.16% |
| 2 | Ours – step 1, RGB stream | RGB | x-entr | 66.52% | 71.39% |
| 3 | Hoffman [10] w/o conn. | RGB | Eq. (12.1) | 64.64% | – |
| 4 | Hoffman [10] w/o conn. | RGB | Eq. (12.3) | 68.60% | – |
| 5 | Hoffman [10] w/o conn. | RGB | Eq. (12.5) | 70.70% | – |
| 6 | Ours – step 2, depth stream | Depth | x-entr | 71.09% | 77.30% |
| 7 | Ours – step 2, RGB stream | RGB | x-entr | 66.68% | 56.26% |
| 8 | Ours – step 2 | RGB & Depth | x-entr | 79.73% | 81.43% |
| 9 | Ours – step 2 w/o conn. | RGB & Depth | x-entr | 78.27% | 82.11% |
| 10 | Ours – step 3 w/o conn. | RGB (hall) | Eq. (12.1) | 69.93% | 70.64% |
| 11 | Ours – step 3 w/ conn. | RGB (hall) | Eq. (12.1) | 70.47% | – |
| 12 | Ours – step 3 w/ conn. | RGB (hall) | Eq. (12.3) | 71.52% | – |
| 13 | Ours – step 3 w/ conn. | RGB (hall) | Eq. (12.5) | 71.93% | 74.10% |
| 14 | Ours – step 3 w/o conn. | RGB (hall) | Eq. (12.5) | 71.10% | – |
| 15 | Ours – step 4 | RGB | x-entr | 73.42% | 77.21% |


Rows #1 and #2 refer to the first training step, where depth and RGB streams are trained separately. We note that the depth stream network provides better performance with respect to the RGB one, as expected.

The second part of the table (rows #3–5) shows the results using Hoffman et al.'s method [10], i.e., adopting a model initialized with the pre-trained networks from the first training step, and the hallucination network initialized using the depth network. Row #3 refers to the original paper [10] (i.e., using the loss $L_{hall}$, Eq. (12.1)), and rows #4 and #5 refer to training using the proposed losses $L_{GD}$ and $L$, in Eqs. (12.3) and (12.5), respectively. It can be noticed that the accuracies achieved using the proposed loss functions exceed those obtained in [10] by a significant margin (about 6% in the case of the total loss $L$).

The third part of the table (rows #6–9) reports performances after training step 2. Rows #6 and #7 refer to the accuracies provided by the depth and RGB stream networks belonging to the model of row #8, taken individually. The final model constitutes the upper bound for our hallucination model, since it uses RGB and depth for both training and testing. The performances obtained by the models in rows #8 and #9, with and without cross-stream connections, respectively, are the highest overall, since both modalities are used (around 78–79% for the cross-subject and 81–82% for the cross-view protocols, respectively), largely outperforming the accuracies obtained using only one modality (rows #6 and #7).

The fourth part of the table (rows #10–14) shows results for our hallucination network under several variations of the learning process, with different losses, and with and without cross-stream connections.

Finally, the last row, #15, reports results after the last fine-tuning step which further narrows the gap with the upper bound.

12.4.4.1 Contribution of the cross-stream connections

We claim that the signal injection provided by the cross-stream connections helps the learning of a better hallucination network. Rows #13 and #14 show the performances of the hallucination network learning process, starting from the same point and using the same loss. The hallucination network learned using multiplicative connections performs better than its counterpart, provided that depth and RGB frames are properly aligned. It is important to note, though, that this is not observed on the other two smaller datasets, due to the spatial misalignment between modalities, and consequently between feature maps.

12.4.4.2 Contributions of the proposed distillation loss (Eq. (12.5))

The distillation and Euclidean losses make complementary contributions to the learning of the hallucination network. This is observed by looking at the performances reported in rows #3, #4 and #5, and also #11, #12 and #13. In both the training procedure proposed by Hoffman et al. [10] and our staged training process, the distillation loss improves over the Euclidean loss, and the combination of both improves over either alone. This suggests that the Euclidean and distillation losses each play their own role and act differently to align the hallucination (student) and depth (teacher) feature maps and output distributions.

12.4.4.3 Contributions of the proposed training procedure

The intuition behind the staged training procedure proposed in this work can be ascribed to the divide et impera (divide-and-conquer) strategy. In our case, it means breaking the problem in two parts: learning the actual task we aim to solve, and learning the student network to face test-time limitations. Row #5 reports the accuracy for the architecture proposed by Hoffman et al., and row #15 reports the performance of our model with connections. Both use the same loss to learn the hallucination network, and both start from the same initialization. We observe that our method outperforms the one in row #5, which justifies the proposed staged training procedure.

12.4.5 Inference with Noisy Depth

Suppose that in a real test scenario we can only access unreliable sensors which produce noisy depth data. The question we now address is: to what extent can we trust such noisy data? In other words, at which level of noise does it become favorable to hallucinate the depth modality rather than using the full teacher model (step 2) with noisy depth data?

The depth sensor used in the NTU dataset (Kinect) is an IR emitter coupled with an IR camera, and has a very complex noise characterization comprising at least six different sources [36]. It is beyond the scope of this work to investigate noise models affecting the depth channel, so for our analysis we choose the most common one, i.e., multiplicative speckle noise. Hence, we inject Gaussian noise into the depth images $I$ in order to simulate speckle noise: $\tilde{I} = I \odot n$, with $n \sim \mathcal{N}(1, \sigma)$. Table 12.2 shows how the performance of the network degrades when depth is corrupted with such Gaussian noise of increasing variance (NTU cross-view protocol only). Results show that accuracy significantly decreases w.r.t. the one guaranteed by our hallucination model (77.21% – row #15 in Table 12.1), even with low noise variance. This means, in conclusion, that training a hallucination network is an effective way not only to obviate the problem of a missing modality, but also to deal with noise affecting the input data channel.
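
A brief sketch of the simulated speckle noise follows: multiplicative unit-mean Gaussian noise applied to a floating-point depth image (any clipping or re-quantization of the corrupted image is left out as an implementation choice).

```python
import numpy as np


def add_speckle_noise(depth_img, var=1e-2):
    """depth_img: float array; var is the noise variance sigma^2 in I_tilde = I * n."""
    noise = np.random.normal(loc=1.0, scale=np.sqrt(var), size=depth_img.shape)
    return depth_img * noise
```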

Table 12.2

Accuracy of the model tested with clean RGB and noisy depth data. Accuracy of the proposed hallucination model, i.e. with no depth at test time, is 77.21%.

| $\sigma^2$ | no noise | $10^{-3}$ | $10^{-2}$ | $10^{-1}$ | $10^{0}$ | $10^{1}$ | void |
|------------|----------|-----------|-----------|-----------|----------|----------|------|
| Accuracy | 81.43% | 81.34% | 81.12% | 76.85% | 62.47% | 51.43% | 14.24% |


12.4.6 Comparison with Other Methods

Table 12.3 compares performances of different methods on the various datasets. The standard performance measure used for this task and datasets is classification accuracy, estimated according to the protocols (training and testing splits) reported in the respective works we are comparing with.

Table 12.3

Classification accuracies and comparisons with the state of the art. Performances referred to the several steps of our approach (ours) are highlighted in bold. × refers to comparisons with unsupervised learning methods. △ refers to supervised methods: here train and test modalities coincide. □ refers to privileged information methods: here training exploits RGB+D data, while test relies on RGB data only. The second column refers to the modalities used at test time: R-RGB, D-Depth, and J-Joints. The third column refers to cross-subject and the fourth to the cross-view evaluation protocols on the NTU dataset. The results reported on the other two datasets are for the cross-view protocol.

| Method | Test Mods. | NTU (p1) | NTU (p2) | UWA3DII | NW-UCLA |
|--------|------------|----------|----------|---------|---------|
| × Luo [37] | D | – | 66.2% | – | – |
| × Luo [37] | R | – | 56.0% | – | – |
| △ Rahmani [38] | R | – | – | 67.4% | 78.1% |
| △ HOG-2 [39] | D | 32.4% | 22.3% | – | – |
| △ Action Tube [40] | R | – | – | 37.0% | 61.5% |
| △ Ours – step 1, depth | D | 70.44% | 75.16% | 75.28% | 72.38% |
| △ Ours – step 1, RGB | R | 66.52% | 71.39% | 63.67% | 85.22% |
| △ Deep RNN [18] | J | 56.3% | 64.1% | – | – |
| △ Deep LSTM [18] | J | 60.7% | 67.3% | – | – |
| △ Sharoudy [18] | J | 62.93% | 70.27% | – | – |
| △ Kim [41] | J | 74.3% | 83.1% | – | – |
| △ Sharoudy [5] | R+D | 74.86% | – | – | – |
| △ Liu [6] | R+D | 77.5% | 84.5% | – | – |
| △ Rahmani [42] | D+J | 75.2% | 83.1% | – | 84.2% |
| △ Ours – step 2 | R+D | 79.73% | 81.43% | 79.66% | 88.87% |
| □ Hoffman et al. [10] | R | 64.64% | 66.67% | – | 83.30% |
| □ ADMD [43] | R | 73.11% | 81.50% | – | 91.64% |
| □ Ours – step 3 | R | 71.93% | 74.10% | 71.54% | 76.30% |
| □ Ours – step 4 | R | 73.42% | 77.21% | 73.23% | 86.72% |


The first part of the table (indicated by × symbol) refers to unsupervised methods, which achieve surprisingly high results even without relying on labels in learning representations.

The second part refers to supervised methods (indicated by △), divided according to the modalities used for training and testing. Here, we list the performance of the separate RGB and depth streams trained in step 1, as a reference. We expect our final model to perform better than the one trained on RGB only, whose accuracy constitutes a lower bound for our student network. The values reported for our step 1 models on the UWA3DII and NW-UCLA datasets refer to the fine-tuning of our NTU model; we also experimented with training from pre-trained ImageNet weights only, which led to 20% to 30% lower accuracy. We also report our baseline, consisting of the teacher model trained in step 2. Its accuracy represents an upper bound for the final model, which does not rely on depth data at test time.

The last part of the table (indicated by □) reports our model's performances at two different stages, together with the other privileged information methods [10] [43]. For all datasets and protocols, we can see that our privileged information approach outperforms [10], which is the only fair direct comparison we can make (same training and test data). Besides, as expected, our final model performs better than "Ours – RGB, step 1", since it exploits more data at training time, and worse than "Ours – step 2", since it exploits less data at test time. Other RGB+D methods perform better (which is understandable, since they rely on RGB+D in both training and testing), but not by a large margin. The ADMD method by Garcia et al. [43] is similar to ours in the sense that it also uses a hallucination network to cope with the missing depth modality; however, it takes a different approach, learning the hallucination stream through an adversarial strategy.

12.4.7 Inverting Modalities – RGB Distillation

The results presented in Table 12.4 address the opposite case of what is studied in the rest of the chapter, i.e., the case when RGB data is missing. In this case, the hallucination stream distills knowledge from the RGB stream in step 3 (Fig. 12.2).

Table 12.4

RGB distillation (NTU RGB+D, cross-view protocol).

| # | Method | Test Modality | Loss | Cross-View |
|---|--------|---------------|------|------------|
| 13a | Ours – step 3 | Depth (hall) | Eq. (12.5) | 76.12% |
| 15a | Ours – step 4 | Depth | x-entr | 76.41% |


We observe that the performance of the final model degrades by almost 1%, 76.41% vs. 77.21% (cf. row #15 of Table 12.1). A more consistent setting would be to modify the model by inverting the cross-stream connections in steps 3 and 4, thus having information flowing again from depth to RGB.

12.5 Conclusions and Future Work

This chapter addresses the task of video action recognition in the context of privileged information. We describe a new learning paradigm to teach a hallucination network to mimic the depth stream. Our model outperforms many of the supervised methods recently evaluated on the NTU RGB+D dataset, as well as the original hallucination model proposed in [10]. We conducted an extensive ablation study to verify how the several parts composing our learning paradigm contribute to the model performance. As future work, we would like to extend this approach to deal with additional modalities that may be available at training time, such as skeleton joint data or infrared sequences. Finally, the current model cannot be applied to still images due to the presence of temporal convolutions; in principle, we could remove them and apply our method to still images and other tasks, such as object detection.

References

[1] E.J. Gibson, R.D. Walk, The “visual cliff”, Scientific American 1960;202(4):64–71.

[2] M.R. Watson, J.T. Enns, Encyclopedia of Human Behavior. Elsevier; 2012:690–696 Ch. Depth Perception.

[3] P. Servos, Distance estimation in the visual and visuomotor systems, Experimental Brain Research 2000;130(1):35–47.

[4] M. Firman, Rgbd datasets: past, present and future, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2016:19–31.

[5] A. Shahroudy, T.-T. Ng, Y. Gong, G. Wang, Deep multimodal feature analysis for action recognition in rgb+d videos, IEEE Transactions on Pattern Analysis and Machine Intelligence 2018;40(5):1045–1058.

[6] J. Liu, N. Akhtar, A. Mian, Viewpoint invariant action recognition using rgb-d videos, IEEE Access 2017 10.1109/DICTA.2017.8227505.

[7] S. Gupta, R. Girshick, P. Arbeláez, J. Malik, Learning rich features from rgb-d images for object detection and segmentation, European Conference on Computer Vision. 2014:345–360.

[8] C. Hazirbas, L. Ma, C. Domokos, D. Cremers, Fusenet: incorporating depth into semantic segmentation via fusion-based cnn architecture, Asian Conference on Computer Vision. Springer; 2016:213–228.

[9] V. Vapnik, A. Vashist, A new learning paradigm: learning using privileged information, Neural Networks 2009;22(5):544–557.

[10] J. Hoffman, S. Gupta, T. Darrell, Learning with side information through modality hallucination, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:826–834.

[11] N.C. Garcia, P. Morerio, V. Murino, Modality distillation with multiple stream networks for action recognition, The European Conference on Computer Vision. ECCV. 2018.

[12] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, NIPS Deep Learning and Representation Learning Workshop. 2014.

[13] J. Ba, R. Caruana, Do deep nets really need to be deep? Advances in Neural Information Processing Systems. 2014:2654–2662.

[14] D. Lopez-Paz, B. Schölkopf, L. Bottou, V. Vapnik, Unifying distillation and privileged information, International Conference on Learning Representations. ICLR. 2016.

[15] K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, Advances in Neural Information Processing Systems. 2014:568–576.

[16] J. Carreira, A. Zisserman, Quo vadis, action recognition? A new model and the kinetics dataset, Computer Vision and Pattern Recognition, 2017 IEEE Conference on. CVPR. IEEE; 2017:4724–4733.

[17] C. Feichtenhofer, A. Pinz, R.P. Wildes, Spatiotemporal multiplier networks for video action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:4768–4777.

[18] A. Shahroudy, J. Liu, T.-T. Ng, G. Wang, Ntu rgb+d: a large scale dataset for 3d human activity analysis, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:1010–1019.

[19] H. Rahmani, A. Mahmood, D. Huynh, A. Mian, Histogram of oriented principal components for cross-view action recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 2016;38(12):2430–2443.

[20] J. Wang, X. Nie, Y. Xia, Y. Wu, S.-C. Zhu, Cross-view action modeling, learning and recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014:2649–2656.

[21] Z. Luo, J.-T. Hsieh, L. Jiang, J. Carlos Niebles, L. Fei-Fei, Graph distillation for action detection with privileged modalities, The European Conference on Computer Vision. ECCV. 2018.

[22] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, Computer Vision and Pattern Recognition, 2005, IEEE Computer Society Conference on, vol. 1. CVPR 2005. IEEE; 2005:886–893.

[23] H. Wang, A. Kläser, C. Schmid, C.-L. Liu, Action recognition by dense trajectories, Computer Vision and Pattern Recognition, 2011 IEEE Conference on. CVPR. IEEE; 2011:3169–3176.

[24] H. Wang, C. Schmid, Action recognition with improved trajectories, Proceedings of the IEEE International Conference on Computer Vision. 2013:3551–3558.

[25] I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, Computer Vision and Pattern Recognition, 2008, IEEE Conference on. CVPR 2008. IEEE; 2008:1–8.

[26] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video classification with convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014:1725–1732.

[27] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3d convolutional networks, Proceedings of the IEEE International Conference on Computer Vision. 2015:4489–4497.

[28] X. Wang, R. Girshick, A. Gupta, K. He, Non-local neural networks, The IEEE Conference on Computer Vision and Pattern Recognition, vol. 1. CVPR. 2018:4.

[29] C. Feichtenhofer, A. Pinz, A. Zisserman, Convolutional two-stream network fusion for video action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:1933–1941.

[30] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:770–778.

[31] K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, European Conference on Computer Vision. Springer; 2016:630–645.

[32] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, M. Paluri, A closer look at spatiotemporal convolutions for action recognition, Proceedings of the IEEE International Conference on Computer Vision. 2018.

[33] L. Sun, K. Jia, D.-Y. Yeung, B.E. Shi, Human action recognition using factorized spatio-temporal convolutional networks, Proceedings of the IEEE International Conference on Computer Vision. 2015:4597–4605.

[34] A. Eitel, J.T. Springenberg, L. Spinello, M. Riedmiller, W. Burgard, Multimodal deep learning for robust rgb-d object recognition, Intelligent Robots and Systems, 2015 IEEE/RSJ International Conference on. IROS. IEEE; 2015:681–687.

[35] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, International Conference on Learning Representations. ICLR. 2015.

[36] T. Mallick, P.P. Das, A.K. Majumdar, Characterizations of noise in Kinect depth images: a review, IEEE Sensors Journal 2014;14(6):1731–1740.

[37] Z. Luo, B. Peng, D.-A. Huang, A. Alahi, L. Fei-Fei, Unsupervised learning of long-term motion dynamics for videos, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:2203–2212.

[38] H. Rahmani, A. Mian, M. Shah, Learning a deep model for human action recognition from novel viewpoints, IEEE Transactions on Pattern Analysis and Machine Intelligence 2018;40(3):667–681.

[39] E. Ohn-Bar, M.M. Trivedi, Joint angles similarities and hog2 for action recognition, Computer Vision and Pattern Recognition Workshops, 2013 IEEE Conference on. CVPRW. IEEE; 2013:465–470.

[40] G. Gkioxari, J. Malik, Finding action tubes, Computer Vision and Pattern Recognition, 2015 IEEE Conference on. CVPR. IEEE; 2015:759–768.

[41] T. Soo Kim, A. Reiter, Interpretable 3d human action analysis with temporal convolutional networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2017:20–28.

[42] H. Rahmani, M. Bennamoun, Learning action recognition model from depth and skeleton videos, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:5832–5841.

[43] N.C. Garcia, P. Morerio, V. Murino, Learning with privileged information via adversarial discriminative modality distillation, arXiv preprint arXiv:1810.08437.
