Chapter 12

Cross-modal Learning by Hallucinating Missing Modalities in RGB-D Vision

Nuno C. Garcia, Pietro Morerio, Vittorio Murino
Pattern Analysis & Computer Vision (PAVIS), Istituto Italiano di Tecnologia (IIT), Genova, Italy
Università degli Studi di Genova, Genova, Italy
Università degli Studi di Verona, Verona, Italy

Abstract

Diverse input data modalities can provide complementary cues for several tasks, usually leading to more robust algorithms and better performance. However, while a (training) dataset could be accurately designed to include a variety of sensory inputs, it is often the case that not all modalities are available in real life (testing) scenarios, when the model is to be deployed. This raises the challenge of how to learn robust representations leveraging multimodal data in the training stage, while considering limitations at test time, such as noisy or missing modalities. This chapter presents a new approach for multimodal video action recognition, developed within the unified frameworks of distillation and privileged information, named generalized distillation. We consider the particular case of learning representations from depth and RGB videos, while relying on RGB data only at test time. Our approach consists in training a hallucination network that learns to distill depth features through multiplicative connections of spatiotemporal representations, leveraging soft labels, hard labels, and the Euclidean distance between feature maps. We report state-of-the-art or comparable results on video action recognition on the largest multimodal dataset available for this task, the NTU RGB+D, as well as on the smaller UWA3DII and Northwestern-UCLA datasets. Our code is available at https://github.com/ncgarcia/modality-distillation.

Keywords

Action recognition; Deep multimodal learning; Distillation; Privileged information

12.1 Introduction

Depth perception refers to the ability to reason about the 3D world from visual information captured by the retinal surface. It is vital for the survival of many animal species, which use it for hunting or escaping, and an important skill for humans to understand and interact with the surrounding environment. Humans start to develop depth perception very early, when babies start to crawl [1]. Several mechanisms have been identified that jointly contribute, in different ways, to the sense of relative and absolute position of objects; these are usually called depth cues. They can be divided into oculomotor cues (e.g., the movement of the eyes converging or diverging) and, more interestingly for our purposes, visual cues, which can be binocular or monocular. Binocular cues are related to stereovision and how the brain calculates depth based on the disparity of the left and right eyes' images. Monocular cues, on the other hand, refer to a priori visual assumptions derived from 2D single images, often related to physical factors such as shadows, perspective, motion parallax, texture gradient, occlusion, and others. For example, the assumption that an object looks blurrier the farther away it is, or that an object must be closer if it occludes another one, are signals that we can acquire with one eye only, and that our brain uses to reason about relative depth [2]. Although using only monocular vision affects object distance estimation [3], we are still able to perform most of our vision-related tasks efficiently with one eye only, and, most importantly, to extract at least some depth information from 2D images without stereo mechanisms, using monocular cues, and make use of it to navigate the 3D world.

Similarly, depth information is often of paramount importance for many computer vision tasks related to robotics, autonomous driving, and scene understanding, to name a few. The emergence of cheap depth sensors and the need for big data to train larger models led to large multimodal datasets containing RGB, depth, infrared, and joint sequences [4], which stimulated multimodal deep learning approaches. Traditional computer vision tasks like action recognition, object detection, or instance segmentation have been shown to benefit from performance gains when the model considers other modalities, namely depth, instead of RGB only [5–8].

Even though depth information brings improvements over RGB-only approaches, it is unrealistic to expect total availability of such data modality when the model is deployed in the real world. RGB cameras are still much more ubiquitous than depth sensors which may be difficult to install everywhere, and, moreover, good quality depth data might be difficult to acquire due to far-distance or reflectance issues, not to mention sensor or communications failure, or other unpredictable events.

Considering this limitation, we would like to answer the following question: what is the best way of using all data available at training time, in order to learn robust representations, knowing that there are missing (or noisy) modalities at test time? In other words, is there any added value in training a model by exploiting multimodal data, even if only one modality is available at test time? Unsurprisingly, the simplest and most commonly adopted solution consists in training the model using only the modality in which it will be tested. However, a more interesting alternative is trying to exploit the potential of the available data and train the model using all modalities, being, however, aware of the fact that not all of them will be accessible at test time (Fig. 12.1). This learning paradigm, i.e., when the model is trained using extra information, is generally known as learning with privileged information [9] or learning with side information [10].

Figure 12.1 What is the best way of using all data available at training time, considering a missing (or noisy) modality at test time?

In this chapter, we present a multimodal stream framework that learns from different data modalities and can be deployed and tested on a subset of these [11]. We design a model able to learn from RGB and depth video sequences, which, due to its general structure, can also be used to manage other combinations of modalities as well. To show its potential, we evaluate the performance on the task of video action recognition. The goal of the new learning paradigm, depicted in Fig. 12.2, is to distill the information conveyed by depth into a hallucination network, which is meant to “mimic” the missing stream at test time. Distillation [12] [13] refers to any training procedure where knowledge is transferred from a previously trained complex model to a simpler one. Our learning procedure introduces a new loss function, inspired by the generalized distillation framework [14], which formally unifies distillation and privileged information learning theories.

Figure 12.2 Training procedure described in Sect. 12.3.3 (see also text therein). The first step refers to the separate (pre-)training of depth and RGB streams with standard cross-entropy classification loss, with both streams initialized with ImageNet weights. The second step, depicted in the scheme, represents the learning of the teacher network; both streams are initialized with the respective weights from step 1, and trained jointly with a cross-entropy loss as a traditional two-stream model, using RGB and depth data. The third step represents the learning of the student network: both streams are initialized with the depth stream weights from the previous step, but the actual depth stream is frozen; importantly, the input for the hallucination stream is RGB data; the model is trained using the loss proposed in Eq. (12.5). The fourth and last step refers to a fine-tuning step and also the test setup of our model, represented in the scheme; the hallucination stream is initialized from the respective weights from previous step, and the RGB stream with the respective weights from the second step; this model is fine-tuned using a cross-entropy loss, and importantly, using only RGB data as input for both streams.

Our model is inspired by the two-stream network introduced by Simonyan and Zisserman [15], which has been notably successful in the traditional setting for video action recognition task [16] [17]. Differently from these works, we are interested in a multimodal setting, where we train one stream for each modality (RGB and depth in our case) and use these in the framework of privileged information. The most related work to ours is the inspiring method of Hoffman et al. [10], which proposes a hallucination network to learn with side information. We build on this idea, extending it by devising a new mechanism to learn and use such hallucination stream through a more general loss function and inter-stream connections.

This book chapter is strongly based on our recent conference paper [11], whose main contributions and ideas are the following: (1) a multimodal stream network architecture able to exploit multiple data modalities at training time while using only one at test time; (2) a new training procedure to learn a hallucination network within a novel two-stream model; (3) a more general loss function, based on the generalized distillation framework; (4) results comparable to or achieving the state of the art – in the privileged information scenario – on the largest multimodal dataset for video action recognition, the NTU RGB+D [18], and on two other smaller ones, the UWA3DII [19] and the Northwestern-UCLA [20].

The rest of the chapter is organized as follows. Sect. 12.2 reviews similar approaches and discusses how they relate to the present work. Sect. 12.3 details the proposed architecture and the novel learning paradigm. Sect. 12.4 reports the results obtained on the various datasets, including a detailed ablation study performed on the NTU RGB+D dataset and a comparative performance with respect to the state of the art. Finally, we draw conclusions and future research directions in Sect. 12.5.

12.2 Related Work

Our work is at the intersection of three topics: privileged information [9], network distillation [12] [13], and multimodal video action recognition. However, Lopez-Paz et al. [14] noted that privileged information and network distillation are instances of the same, more inclusive theory, called generalized distillation.

12.2.1 Generalized Distillation

Within the generalized distillation framework, our model is both related to the privileged information theory [9], considering that the extra modality (depth, in this case) is only used at training time, and, mostly, to the distillation framework. Indeed the core mechanism that our model uses to learn the hallucination network is derived from a distillation loss. More specifically, the supervision information provided by the teacher network (in this case, the network processing the depth data stream) is distilled into the hallucination network leveraging teacher's soft predictions and hard ground-truth labels in the loss function.

In this context, the closest works to our proposal are [21] and [10]. Luo et al. [21] addressed a similar problem to ours, where the model is first trained on several modalities (RGB, depth, joints and infrared), but tested only on one. The authors propose a graph-based distillation method that is able to distill information from all modalities at training time, while also passing through a validation phase on a subset of modalities. This approach was shown to reach state-of-the-art results in action recognition and action detection tasks. Our work substantially differs from [21] since we benefit from a hallucination mechanism, consisting of an auxiliary network trained using the guidance distilled by the teacher network (which processes the depth data stream in our case). This mechanism allows the model to learn to emulate the presence of the missing modality at test time.

The work of Hoffman et al. [10] introduced a model to hallucinate depth features from RGB input for the object detection task. While the idea of using a hallucination stream is similar to the one presented here, the mechanism used to learn it is different. In [10], the authors use a Euclidean loss between depth and hallucinated feature maps that is part of a total loss comprising more than ten classification and localization terms, which makes its effectiveness very dependent on hyperparameter tuning to balance the different values, as the model is trained jointly in one step by optimizing the aforementioned composite loss. Differently, we propose a loss inspired by the distillation framework that not only uses the Euclidean distance between feature maps and the one-hot labels, but also leverages soft predictions from the depth network. Moreover, we encourage the hallucination learning by design, by using cross-stream connections (see Sect. 12.3). This is shown to largely improve the performance of our model with respect to the one-step learning process proposed in [10].

12.2.2 Multimodal Video Action Recognition

Video action recognition has a long and rich literature, spanning from classification methods using handcrafted features [22] [23] [24] [25] to modern deep learning approaches [26] [27] [28] [16], using either RGB-only or various multimodal data. Here, we focus on some of the most relevant works in multimodal video action recognition, including state-of-the-art methods on the NTU RGB+D dataset, as well as architectures related to our proposed model.

The two-stream model introduced by Simonyan and Zisserman [15] is a landmark in video analysis, and has since inspired a series of variants that achieved state-of-the-art performance on diverse datasets. This architecture is composed of an RGB and an optical flow stream, which are trained separately and then fused at the prediction layer. The current state of the art in video action recognition [16] is inspired by such a model, featuring 3D convolutions to deal with the temporal dimension instead of the original 2D ones. In [17], a further variation of the two-stream approach is proposed, which models spatiotemporal features by injecting the motion stream's signal into the residual unit of the appearance stream. The idea of combining the two streams had also been explored previously by the same authors in [29].

Instead, in [5], the authors explore the complementary properties of RGB and depth data, taking the NTU RGB+D dataset as a testbed. This work designed a deep autoencoder architecture and a structured sparsity learning machine, and was shown to achieve state-of-the-art results for action recognition. Liu et al. [6] also use RGB and depth complementary information to devise a method for viewpoint-invariant action recognition. Here, dense trajectories are first extracted from RGB data and then encoded into viewpoint-invariant deep features. The RGB and depth features are then used as a dictionary to predict the test label.

All these previous methods exploited the rich information conveyed by the multimodal data to improve recognition. Our work, instead, proposes a fully convolutional model that exploits RGB and depth data at training time only, and uses exclusively RGB data as input at test time, reaching performance comparable to those utilizing the complete set of modalities in both stages.

12.3 Generalized Distillation with Multiple Stream Networks

This section describes our approach in terms of its architecture, the losses used to learn the different networks, and the training procedure.

12.3.1 Cross-stream Multiplier Networks

Typically in two-stream architectures, the two streams are trained separately and the predictions are fused with a late fusion mechanism [15] [17]. Such models use as input appearance (RGB) and motion (optical flow) data, which are fed separately into each stream, both in training and testing. Instead, in this work we use RGB and depth frames as inputs for training, but only RGB at test time, as already discussed (Fig. 12.2).

We use the ResNet-50-based [30] [31] model proposed in [17] as the baseline architecture for each stream block of our model. In that paper, Feichtenhofer et al. proposed to connect the appearance and motion streams with multiplicative connections at several layers, as opposed to previous models, which would only interact at the prediction layer. Such connections are depicted in Fig. 12.2 with the blue arrows. Fig. 12.3 illustrates this mechanism at a given layer of the multiple stream architecture, while it is actually implemented at the four convolutional layers of the ResNet-50 model. The underlying intuition is that these connections enable the model to learn better spatiotemporal representations, and help to distinguish between visually similar actions that require the combination of appearance and motion features. Originally, the cross-stream connections consisted of the injection of the motion stream signal into the other stream's residual unit, without affecting the skip path. ResNet's residual units are formally expressed as:

$$x_{l+1} = f\big(h(x_l) + F(x_l, W_l)\big),$$

where $x_l$ and $x_{l+1}$ are the $l$th layer's input and output, respectively, $F$ represents the residual convolutional layers defined by weights $W_l$, $h(x_l)$ is an identity mapping, and $f$ is a ReLU non-linearity. The cross-stream connections are then defined as

$$x_{l+1}^{a} = f(x_l^{a}) + F\big(x_l^{a} \odot f(x_l^{m}),\, W_l\big),$$

where $x^{a}$ and $x^{m}$ are the appearance and motion streams, respectively, and $\odot$ is the element-wise multiplication operation. Such a mechanism implies a spatial alignment between both feature maps, and therefore between both modalities. This alignment comes for free when using RGB and optical flow, since the latter is computed from the former in a way that preserves the spatial arrangement. However, this is an assumption we cannot make in general. For instance, depth and RGB are often captured by different sensors, likely resulting in spatially misaligned frames. The alignment procedure, described in Sect. 12.4.2, is part of the pre-processing phase and refers uniquely to the NTU RGB+D dataset.
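
The following is a minimal PyTorch sketch of such a multiplicative cross-stream connection, assuming a simplified residual unit (the actual streams are ResNet-50 based); module and variable names are illustrative, not the authors' code.

```python
# Sketch of the cross-stream connection x_{l+1}^a = f(x_l^a) + F(x_l^a * f(x_l^m), W_l):
# the other stream's signal is injected multiplicatively into the residual branch only,
# leaving the skip path untouched.
import torch
import torch.nn as nn


class CrossStreamResidualUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.residual = nn.Sequential(          # F(., W_l): residual conv branch
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.relu = nn.ReLU()                   # f: ReLU non-linearity

    def forward(self, x_app, x_other):
        # x_app: appearance (RGB) feature map; x_other: depth/motion feature map.
        # The element-wise product requires spatially aligned feature maps.
        gated = x_app * self.relu(x_other)
        return self.relu(x_app) + self.residual(gated)


# Usage: feature maps of both streams must share shape (B, C, H, W).
unit = CrossStreamResidualUnit(channels=64)
out = unit(torch.randn(2, 64, 56, 56), torch.randn(2, 64, 56, 56))
```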

Figure 12.3 Detail of the ResNet residual unit, showing the multiplicative connections and temporal convolutions [17]. In our architecture, the signal injection occurs before the second residual unit of each of the four ResNet blocks.

Temporal convolutions. In order to augment the model's temporal support, we implement 1D temporal convolutions in the second residual unit of each ResNet layer (as in [17]), as illustrated in Fig. 12.3. The weights $W_l \in \mathbb{R}^{1 \times 1 \times 3 \times C_l \times C_l}$ are convolutional filters initialized as identity mappings at feature level and centered in time, where $C_l$ is the number of channels in layer $l$.
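
A hedged sketch of such a temporal convolution follows, assuming a 5D tensor layout (batch, channels, time, height, width) in PyTorch; the initialization follows the $[0, 1, 0]$ identity-in-time scheme described in the text.

```python
# 3-tap temporal convolution (1x1 spatially), initialized as an identity mapping
# at feature level and centered in time.
import torch
import torch.nn as nn


def make_temporal_conv(channels: int) -> nn.Conv3d:
    conv = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                     padding=(1, 0, 0), bias=False)
    with torch.no_grad():
        conv.weight.zero_()
        for c in range(channels):
            # Temporal kernel [0, 1, 0]: only the central frame contributes at the
            # start of training; the kernel is free to change as training proceeds.
            conv.weight[c, c, 1, 0, 0] = 1.0
    return conv


# Sanity check: with the identity initialization, the layer initially acts as a no-op.
x = torch.randn(2, 64, 5, 28, 28)          # (batch, channels, time, H, W)
t_conv = make_temporal_conv(64)
assert torch.allclose(t_conv(x), x, atol=1e-6)
```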

Very recently, the authors of [32] explored various network configurations using temporal convolutions, comparing several different combinations for the task of video classification. This work suggests that decoupling 3D convolutions into 2D (spatial) and 1D (temporal) filters is the best setup for action recognition tasks, producing the best accuracies. The intuition behind the latter setup is that factorizing spatial and temporal convolutions into two consecutive convolutional layers eases the training of the spatial and temporal tasks (also in line with [33]).

12.3.2 Hallucination Stream

We also introduce and learn a hallucination network [10], using a new learning paradigm, loss function and interaction mechanism. The hallucination stream network has the same architecture as the appearance and depth stream models.

This network receives RGB as input, and is trained to “imitate” the depth stream at different levels, i.e. at feature and prediction layers. In this work, we explore several ways to implement such learning paradigm, including both the training procedure and the loss, and how they affect the overall performance of the model.

In [10], a regression loss between the hallucination and depth feature maps is proposed, defined as:

$$L_{hall}(l) = \lambda_l \left\| \sigma(A_l^{d}) - \sigma(A_l^{h}) \right\|_2^2, \tag{12.1}$$

where $\sigma$ is the sigmoid function, and $A_l^{d}$ and $A_l^{h}$ are the $l$-th layer activations of the depth and hallucination networks. This Euclidean loss forces both activation maps to be similar. In [10], this loss is weighted along with another ten classification and localization loss terms, making it hard to balance the total loss. One of the main motivations behind the proposed new staged learning paradigm, described in Sect. 12.3.3, is to avoid the inefficient, heuristic-based tweaking of so many loss weights, a.k.a. hyperparameter tuning.
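
As a concrete reference, the following is a minimal PyTorch sketch of the Euclidean hallucination loss of Eq. (12.1); the summation over elements (rather than a mean) and the default weight $\lambda_l$ are assumptions.

```python
# Euclidean hallucination loss, Eq. (12.1):
# L_hall(l) = lambda_l * || sigmoid(A_l^d) - sigmoid(A_l^h) ||_2^2
import torch


def hallucination_loss(act_depth, act_hall, lambda_l=1.0):
    diff = torch.sigmoid(act_depth) - torch.sigmoid(act_hall)
    return lambda_l * diff.pow(2).sum()
```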

Instead, we adopt an approach inspired by the generalized distillation framework [14], in which a student model $f_s \in \mathcal{F}_s$ distills the representation $f_t \in \mathcal{F}_t$ learned by the teacher model. This is formalized as

$$f_s = \operatorname*{arg\,min}_{f \in \mathcal{F}_s} \frac{1}{N} \sum_{i=1}^{N} L_{GD}(i), \tag{12.2}$$

where N is the number of examples in the dataset. The generalized distillation loss is then defined as:

$$L_{GD}(i) = (1-\lambda)\,\ell\big(y_i, \varsigma(f(x_i))\big) + \lambda\,\ell\big(s_i, \varsigma(f(x_i))\big), \qquad \lambda \in [0,1],\ f_s \in \mathcal{F}_s, \tag{12.3}$$

where $\ell$ denotes the classification loss (cross-entropy in our case), $\varsigma$ is the softmax operator, and $s_i$ is the soft prediction from the teacher network:

$$s_i = \varsigma\big(f_t(x_i)/T\big), \qquad T > 0. \tag{12.4}$$

The parameter $\lambda$ in Eq. (12.3) allows one to tune the loss by giving more importance either to imitating the hard ground-truth targets $y_i$ or the soft teacher targets $s_i$. This mechanism indeed allows the transfer of information from the depth (teacher) to the hallucination (student) network. The temperature parameter $T$ in Eq. (12.4) allows one to smooth the probability vector predicted by the teacher network. The intuition is that such smoothing may expose relations between classes that would not easily be revealed in raw predictions, further facilitating the distillation by the student network $f_s$.

We suggest that both the Euclidean and generalized distillation losses are useful in the learning process. In fact, the Euclidean term, by encouraging the network to decrease the distance between hallucinated and true depth feature maps, helps to distill the depth information encoded in the generalized distillation loss. Thus, we formalize our final loss function as follows:

$$L = (1-\alpha)\, L_{GD} + \alpha\, L_{hall}, \qquad \alpha \in [0,1], \tag{12.5}$$

where $\alpha$ is a parameter balancing the contributions of the two loss terms during training. The parameters $\lambda$, $\alpha$ and $T$ are estimated using a validation set, as discussed in Sect. 12.4.3.
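
The sketch below puts Eqs. (12.1)–(12.5) together in PyTorch. It is an illustrative reading of the loss, not the authors' implementation: tensor shapes, the use of cross-entropy for both terms of $L_{GD}$, and the single pair of intermediate activations are assumptions.

```python
import torch
import torch.nn.functional as F


def hallucination_loss(act_depth, act_hall, lambda_l=1.0):
    """Euclidean term of Eq. (12.1), as in the earlier sketch."""
    diff = torch.sigmoid(act_depth) - torch.sigmoid(act_hall)
    return lambda_l * diff.pow(2).sum()


def generalized_distillation_loss(student_logits, teacher_logits, labels, lam=0.5, T=10.0):
    """L_GD of Eq. (12.3), with soft teacher targets s_i of Eq. (12.4)."""
    hard = F.cross_entropy(student_logits, labels)            # term on hard labels y_i
    soft_targets = F.softmax(teacher_logits / T, dim=1)       # s_i = softmax(f_t(x_i) / T)
    log_p = F.log_softmax(student_logits, dim=1)
    soft = -(soft_targets * log_p).sum(dim=1).mean()          # term on soft targets s_i
    return (1.0 - lam) * hard + lam * soft


def total_loss(student_logits, teacher_logits, labels, act_depth, act_hall,
               alpha=0.5, lam=0.5, T=10.0):
    """L = (1 - alpha) * L_GD + alpha * L_hall, Eq. (12.5)."""
    l_gd = generalized_distillation_loss(student_logits, teacher_logits, labels, lam, T)
    return (1.0 - alpha) * l_gd + alpha * hallucination_loss(act_depth, act_hall)
```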

In summary, the generalized distillation framework proposes to use the student-teacher framework introduced in the distillation theory to extract knowledge from the privileged information source. We explore this idea by proposing a new learning paradigm to train a hallucination network using privileged information, which we will describe in the next section. In addition to the loss functions introduced above, we also allow the teacher network to share information with the student network by design, through the cross-stream multiplicative connections. We test how all these possibilities affect the model's performance in the experimental section through an extensive ablation study.

12.3.3 Training Paradigm

In general, the proposed training paradigm, illustrated in Fig. 12.2, is divided into two core parts: the first part (steps 1 and 2 in the figure) focuses on learning the teacher network $f_t$, leveraging RGB and depth data (the privileged information in this case); the second part (steps 3 and 4 in the figure) focuses on learning the hallucination network, referred to as the student network $f_s$ in the distillation framework, using the general hallucination loss defined in Eq. (12.5).

The first training step consists in training both streams separately, which is a common practice in two-stream architectures. Both depth and appearance streams are trained minimizing cross-entropy, after being initialized with a pre-trained ImageNet model for all experiments. Temporal kernels are initialized as $[0, 1, 0]$, i.e. only information on the central frame is used at the beginning—this eventually changes as the training continues. As in [34], depth frames are encoded into color images using a jet colormap.

The second training step is still focused on further training the teacher model. Since the model trained in this step has the architecture and capacity of the final one, and has access to both modalities, its performance represents an upper bound for the task we are addressing. This is one of the major differences between our approach and the one used in [10]: by decoupling the teacher learning phase from the hallucination learning, we are able to learn both a better teacher and a better student, as we will show in the experimental section.

In the third training step, we focus on learning the hallucination network from the teacher model, i.e., the depth stream network just trained. Here, the weights of the depth network are frozen, while it receives depth data as input. The hallucination network, which instead receives RGB data as input, is trained with the loss defined in Eq. (12.5), while also receiving feedback through the cross-stream connections from the depth network. We found that this helps the learning process.

In the fourth and last step, we carry out fine tuning of the whole model, composed of the RGB and the hallucination streams. This step uses RGB-only as input, and it also precisely resembles the setup used at test time. The cross-stream connections inject the hallucinated signal into the appearance RGB stream network, resulting in the multiplication of the hallucinated feature maps and the RGB feature maps. The intuition is that the hallucination network has learned to inform the RGB model where the action is taking place, similarly to what the depth model would do with real depth data.
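
For concreteness, the following is an illustrative sketch of the third training step, assuming two PyTorch stream networks that each return (logits, intermediate activations), the `total_loss` helper from the previous sketch, and a data loader yielding (RGB, depth, label) batches; the cross-stream feedback from the frozen depth stream into the hallucination stream is omitted for brevity.

```python
import torch


def train_step3(depth_stream, hall_stream, loader, optimizer, device="cuda"):
    depth_stream.eval()
    for p in depth_stream.parameters():          # freeze the depth (teacher) stream
        p.requires_grad_(False)

    hall_stream.train()
    for rgb, depth, labels in loader:
        rgb, depth, labels = rgb.to(device), depth.to(device), labels.to(device)
        with torch.no_grad():
            t_logits, t_act = depth_stream(depth)     # teacher sees real depth
        s_logits, s_act = hall_stream(rgb)            # student (hallucination) sees RGB only
        loss = total_loss(s_logits, t_logits, labels, t_act, s_act)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```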

12.4 Experiments

12.4.1 Datasets

We evaluate our method on three datasets, while the ablation study is performed only on the NTU RGB+D dataset. Our model is initialized with ImageNet pre-trained weights and trained and evaluated on the NTU RGB+D dataset. We later fine-tune this model on each of the two smaller datasets for the corresponding evaluation experiments.

NTU RGB+D [18]. This is the largest public dataset for multimodal video action recognition. It is composed of 56,880 videos, available in four modalities: RGB videos, depth sequences, infrared frames, and 3D skeleton data of 25 joints. It was acquired with a Kinect v2 sensor from 80 different viewpoints, and includes 40 subjects performing 60 distinct actions. We follow the two evaluation protocols originally proposed in [18], namely cross-subject and cross-view. As in the original paper, we use about 5% of the training data as a validation set for both protocols, in order to select the parameters λ, α and T. In this work, we use only RGB and depth data. The masked depth maps are converted to three-channel maps via a jet colormap, as in [34].

UWA3DII [19]. This dataset consists of 1,075 samples of RGB, depth and skeleton sequences. It features 10 subjects performing 30 actions captured from 5 different views.

Northwestern-UCLA [20]. Similarly to the other datasets, it provides RGB, depth and skeleton sequences for 1,475 samples. It features 10 subjects performing 10 actions captured from 3 different views.

12.4.2 Pre-processing and Alignment of RGB and Depth Frames

The multiplicative cross-stream connections require RGB and depth frames to be spatially aligned, since they are element-wise operations over the feature maps. The NTU RGB+D dataset is acquired using a Kinect sensor, which results in different dimensions and aspect ratios for RGB and depth frames. Fortunately, this dataset provides the joints' spatial coordinates in every RGB and depth frame, $rgb_{x,y}$ and $depth_{x,y}$, respectively, which we use to align both modalities. Because the other two smaller datasets do not provide such information, this alignment procedure is only applicable to the NTU RGB+D dataset. Consequently, experiments using the other datasets refer to the model with no cross-stream connections.

For every frame of a given video, we first compute the ratio $ratio_x^{A,B} = (rgb_x^{A} - rgb_x^{B}) / (depth_x^{A} - depth_x^{B})$ for all $A, B \in S$, using all depth and RGB x coordinates from the frame's set $S$ of well-tracked joints, and similarly for the y dimension. The video aspect ratio is then calculated as the mean between the median aspect ratio for the x dimension and the median aspect ratio for the y dimension. The RGB frames of a given video are scaled according to this ratio. Finally, RGB and depth frames are overlaid by aligning both skeletons, and the intersection is cropped in both modalities. The cropped sections are then re-scaled to the network's input dimension, in this case 224×224.
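
A rough NumPy sketch of the scale estimation is given below. The exact aggregation (per-pair ratios, medians over the x and y dimensions, then their mean) is our reading of the description above, and array layouts and names are illustrative.

```python
import itertools
import numpy as np


def video_aspect_ratio(per_frame_rgb, per_frame_depth):
    """per_frame_*: lists of (J, 2) arrays of well-tracked joint coordinates,
    one array per frame. Returns the mean of the median x- and y-ratios."""
    ratios = {0: [], 1: []}                              # 0: x dimension, 1: y dimension
    for rgb_xy, depth_xy in zip(per_frame_rgb, per_frame_depth):
        for a, b in itertools.combinations(range(len(rgb_xy)), 2):
            for dim in (0, 1):
                denom = depth_xy[a, dim] - depth_xy[b, dim]
                if abs(denom) > 1e-6:                    # skip degenerate joint pairs
                    ratios[dim].append((rgb_xy[a, dim] - rgb_xy[b, dim]) / denom)
    return 0.5 * (np.median(ratios[0]) + np.median(ratios[1]))
```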

Similarly to what was done in [17], we sample 5 frames evenly spaced in time for each video, both for training and testing. Sampling random frames for training yields similar accuracy values. For training, we also horizontally flip the video frames with probability $P = 0.5$.
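
A small sketch of this frame sampling and augmentation is shown below; the (T, H, W, C) array layout is an assumption.

```python
import numpy as np


def sample_frames(video, n_frames=5, train=True):
    """video: (T, H, W, C) array. Returns (n_frames, H, W, C)."""
    idx = np.linspace(0, len(video) - 1, n_frames).astype(int)   # evenly spaced in time
    clip = video[idx]
    if train and np.random.rand() < 0.5:
        clip = clip[:, :, ::-1, :]        # horizontally flip all sampled frames
    return clip
```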

12.4.3 Hyperparameters and Validation Set

After validation, we selected the following set of hyperparameters: α = 0.5, λ = 0.5, T = 10; slightly different values do not show a significant change in performance. The networks are optimized using ADAM [35] with the default learning rate, except for the fine-tuning steps, where the learning rate was decreased by a factor of 10.

Regarding the NTU RGB+D dataset, the validation set is not defined in the original paper presenting the dataset [18]. For the sake of reproducibility, we explain here how we defined the validation set. For the cross-subject protocol, we choose subject #1 (from the training set), which corresponds to around 5% of the training set. For the cross-view protocol, we do the following: 1) create a dictionary of sorted videos for each key=action (from the training set); 2) set the numpy random seed equal to 0; 3) sample 31 videos using numpy.random.choice for each action, which in the end corresponds to around 5% of the training set.
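
The snippet below is an illustrative rendering of the cross-view split procedure; `videos_by_action` (a mapping from each action label to the sorted list of its training videos), the iteration order, and the use of sampling without replacement are assumptions, so it is not guaranteed to reproduce the exact original split.

```python
import numpy as np


def cross_view_validation_split(videos_by_action, per_action=31, seed=0):
    np.random.seed(seed)                              # step 2: fix the numpy seed
    val_videos = []
    for action in sorted(videos_by_action):           # step 1: dictionary of sorted videos
        vids = sorted(videos_by_action[action])
        # step 3: sample 31 videos per action (~5% of the training set);
        # whether the original sampling used replacement is not specified.
        val_videos.extend(np.random.choice(vids, size=per_action, replace=False))
    return set(val_videos)
```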

12.4.4 Ablation Study

In this section we discuss the results of the experiments carried out to understand the contribution of each part of the model and of the training procedure. Table 12.1 reports performances for the several training steps, different losses, and model configurations.

Table 12.1

Ablation study. A full set of experiments is provided for the NTU cross-subject evaluation protocol. For the cross-view protocol, only the most important results are reported.

| # | Method | Test Modality | Loss | Cross-Subject | Cross-View |
|---|--------|---------------|------|---------------|------------|
| 1 | Ours – step 1, depth stream | Depth | x-entr | 70.44% | 75.16% |
| 2 | Ours – step 1, RGB stream | RGB | x-entr | 66.52% | 71.39% |
| 3 | Hoffman [10] w/o conn. | RGB | Eq. (12.1) | 64.64% | – |
| 4 | Hoffman [10] w/o conn. | RGB | Eq. (12.3) | 68.60% | – |
| 5 | Hoffman [10] w/o conn. | RGB | Eq. (12.5) | 70.70% | – |
| 6 | Ours – step 2, depth stream | Depth | x-entr | 71.09% | 77.30% |
| 7 | Ours – step 2, RGB stream | RGB | x-entr | 66.68% | 56.26% |
| 8 | Ours – step 2 | RGB & Depth | x-entr | 79.73% | 81.43% |
| 9 | Ours – step 2 w/o conn. | RGB & Depth | x-entr | 78.27% | 82.11% |
| 10 | Ours – step 3 w/o conn. | RGB (hall) | Eq. (12.1) | 69.93% | 70.64% |
| 11 | Ours – step 3 w/ conn. | RGB (hall) | Eq. (12.1) | 70.47% | – |
| 12 | Ours – step 3 w/ conn. | RGB (hall) | Eq. (12.3) | 71.52% | – |
| 13 | Ours – step 3 w/ conn. | RGB (hall) | Eq. (12.5) | 71.93% | 74.10% |
| 14 | Ours – step 3 w/o conn. | RGB (hall) | Eq. (12.5) | 71.10% | – |
| 15 | Ours – step 4 | RGB | x-entr | 73.42% | 77.21% |


Rows #1 and #2 refer to the first training step, where depth and RGB streams are trained separately. We note that the depth stream network provides better performance with respect to the RGB one, as expected.

The second part of the table (rows #3–5) shows the results using Hoffman et al.'s method [10], i.e., adopting a model initialized with the pre-trained networks from the first training step, and the hallucination network initialized using the depth network. Row #3 refers to the original paper [10] (i.e., using the loss $L_{hall}$, Eq. (12.1)), and rows #4 and #5 refer to training using the proposed losses $L_{GD}$ and $L$, in Eqs. (12.3) and (12.5), respectively. It can be noticed that the accuracies achieved using the proposed loss functions exceed those obtained in [10] by a significant margin (about 6% in the case of the total loss $L$).

The third part of the table (rows #6–9) reports performances after training step 2. Rows #6 and #7 refer to the accuracies provided by the depth and RGB stream networks belonging to the model of row #8, taken individually. The final model constitutes the upper bound for our hallucination model, since it uses RGB and depth for both training and testing. The performances obtained by the models in rows #8 and #9, with and without cross-stream connections, respectively, are the highest overall, since both modalities are used (around 78–79% for the cross-subject and 81–82% for the cross-view protocols, respectively), largely outperforming the accuracies obtained using only one modality (rows #6 and #7).

The fourth part of the table (rows #10–14) shows results for our hallucination network under several variations of the learning process, with different losses, and with and without cross-stream connections.

Finally, the last row, #15, reports results after the last fine-tuning step which further narrows the gap with the upper bound.

12.4.4.1 Contribution of the cross-stream connections

We claim that the signal injection provided by the cross-stream connections helps the learning of a better hallucination network. Rows #13 and #14 show the performances of the hallucination network learning process, starting from the same point and using the same loss. The hallucination network learned using multiplicative connections performs better than its counterpart, provided that depth and RGB frames are properly aligned. It is important to note, though, that this is not observed on the other two smaller datasets, due to the spatial misalignment between modalities, and consequently between feature maps.

12.4.4.2 Contributions of the proposed distillation loss (Eq. (12.5))

The distillation and Euclidean losses make complementary contributions to the learning of the hallucination network. This is observed by looking at the performances reported in rows #3, #4 and #5, and also #11, #12 and #13. In both the training procedure proposed by Hoffman et al. [10] and our staged training process, the distillation loss improves over the Euclidean loss, and the combination of both improves over either alone. This suggests that the Euclidean and distillation losses each play their own role and act differently to align the hallucination (student) and depth (teacher) feature maps and output distributions.

12.4.4.3 Contributions of the proposed training procedure

The intuition behind the staged training procedure proposed in this work can be ascribed to the divide et impera (divide-and-conquer) strategy. In our case, it means breaking the problem in two parts: learning the actual task we aim to solve, and learning the student network to face test-time limitations. Row #5 reports the accuracy for the architecture proposed by Hoffman et al., and row #15 reports the performance of our model with connections. Both use the same loss to learn the hallucination network, and both start from the same initialization. We observe that our method outperforms the one in row #5, which justifies the proposed staged training procedure.

12.4.5 Inference with Noisy Depth

Suppose that in a real test scenario we can only access unreliable sensors which produce noisy depth data. The question we now address is: to what extent can we trust such noisy data? In other words, at which level of noise does it become favorable to hallucinate the depth modality rather than using the full teacher model (step 2) with noisy depth data?

The depth sensor used in the NTU dataset (Kinect) is an IR emitter coupled with an IR camera, and has a very complex noise characterization comprising at least six different sources [36]. It is beyond the scope of this work to investigate noise models affecting the depth channel, so for our analysis we choose the most common one, i.e., multiplicative speckle noise. Hence, we inject Gaussian noise into the depth images $I$ in order to simulate speckle noise: $\tilde{I} = I \odot n$, with $n \sim \mathcal{N}(1, \sigma)$. Table 12.2 shows how the performance of the network degrades when depth is corrupted with such Gaussian noise of increasing variance (NTU cross-view protocol only). Results show that accuracy significantly decreases w.r.t. the one guaranteed by our hallucination model (77.21% – row #15 in Table 12.1), even with low noise variance. This means, in conclusion, that training a hallucination network is an effective way not only to obviate the problem of a missing modality, but also to deal with noise affecting the input data channel.
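
A brief sketch of the simulated speckle noise follows: multiplicative unit-mean Gaussian noise applied to a floating-point depth image (any clipping or re-quantization of the corrupted image is left out as an implementation choice).

```python
import numpy as np


def add_speckle_noise(depth_img, var=1e-2):
    """depth_img: float array; var is the noise variance sigma^2 in I_tilde = I * n."""
    noise = np.random.normal(loc=1.0, scale=np.sqrt(var), size=depth_img.shape)
    return depth_img * noise
```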

Table 12.2

Accuracy of the model tested with clean RGB and noisy depth data. Accuracy of the proposed hallucination model, i.e. with no depth at test time, is 77.21%.

| $\sigma^2$ | no noise | $10^{-3}$ | $10^{-2}$ | $10^{-1}$ | $10^{0}$ | $10^{1}$ | void |
|------------|----------|-----------|-----------|-----------|----------|----------|------|
| Accuracy | 81.43% | 81.34% | 81.12% | 76.85% | 62.47% | 51.43% | 14.24% |


12.4.6 Comparison with Other Methods

Table 12.3 compares performances of different methods on the various datasets. The standard performance measure used for this task and datasets is classification accuracy, estimated according to the protocols (training and testing splits) reported in the respective works we are comparing with.

Table 12.3

Classification accuracies and comparisons with the state of the art. Performances referred to the several steps of our approach (ours) are highlighted in bold. × refers to comparisons with unsupervised learning methods. △ refers to supervised methods: here train and test modalities coincide. □ refers to privileged information methods: here training exploits RGB+D data, while test relies on RGB data only. The second column refers to the modalities used at test time: R-RGB, D-Depth, and J-Joints. The third column refers to cross-subject and the fourth to the cross-view evaluation protocols on the NTU dataset. The results reported on the other two datasets are for the cross-view protocol.

| Method | Test Mods. | NTU (p1) | NTU (p2) | UWA3DII | NW-UCLA |
|--------|------------|----------|----------|---------|---------|
| × Luo [37] | D | – | 66.2% | – | – |
| × Luo [37] | R | – | 56.0% | – | – |
| △ Rahmani [38] | R | – | – | 67.4% | 78.1% |
| △ HOG-2 [39] | D | 32.4% | 22.3% | – | – |
| △ Action Tube [40] | R | – | – | 37.0% | 61.5% |
| △ Ours – step 1, depth | D | 70.44% | 75.16% | 75.28% | 72.38% |
| △ Ours – step 1, RGB | R | 66.52% | 71.39% | 63.67% | 85.22% |
| △ Deep RNN [18] | J | 56.3% | 64.1% | – | – |
| △ Deep LSTM [18] | J | 60.7% | 67.3% | – | – |
| △ Sharoudy [18] | J | 62.93% | 70.27% | – | – |
| △ Kim [41] | J | 74.3% | 83.1% | – | – |
| △ Sharoudy [5] | R+D | 74.86% | – | – | – |
| △ Liu [6] | R+D | 77.5% | 84.5% | – | – |
| △ Rahmani [42] | D+J | 75.2% | 83.1% | – | 84.2% |
| △ Ours – step 2 | R+D | 79.73% | 81.43% | 79.66% | 88.87% |
| □ Hoffman et al. [10] | R | 64.64% | 66.67% | – | 83.30% |
| □ ADMD [43] | R | 73.11% | 81.50% | – | 91.64% |
| □ Ours – step 3 | R | 71.93% | 74.10% | 71.54% | 76.30% |
| □ Ours – step 4 | R | 73.42% | 77.21% | 73.23% | 86.72% |


The first part of the table (indicated by × symbol) refers to unsupervised methods, which achieve surprisingly high results even without relying on labels in learning representations.

The second part refers to supervised methods (indicated by △), divided according to the modalities used for training and testing. Here, we list the performance of the separate RGB and depth streams trained in step 1, as a reference. We expect our final model to perform better than the one trained on RGB only, whose accuracy constitutes a lower bound for our student network. The values reported for our step 1 models on the UWA3DII and NW-UCLA datasets refer to the fine-tuning of our NTU model; we also experimented with training from pre-trained ImageNet weights only, which led to 20% to 30% lower accuracy. We also report our baseline, consisting of the teacher model trained in step 2. Its accuracy represents an upper bound for the final model, which does not rely on depth data at test time.

The last part of the table (indicated by □) reports our model's performances at two different stages, together with the other privileged information methods [10] [43]. For all datasets and protocols, we can see that our privileged information approach outperforms [10], which is the only fair direct comparison we can make (same training and test data). Besides, as expected, our final model performs better than "Ours – RGB, step 1", since it exploits more data at training time, and worse than "Ours – step 2", since it exploits less data at test time. Other RGB+D methods perform better (which is understandable, since they rely on RGB+D in both training and testing), but not by a large margin. The ADMD method by Garcia et al. [43] is similar to ours in the sense that it also uses a hallucination network to cope with the missing depth modality; however, it takes a different approach, learning the hallucination stream through an adversarial strategy.

12.4.7 Inverting Modalities – RGB Distillation

The results presented in Table 12.4 address the opposite case of what is studied in the rest of the chapter, i.e., the case when RGB data is missing. In this case, the hallucination stream distills knowledge from the RGB stream in step 3 (Fig. 12.2).

Table 12.4

RGB distillation (NTU RGB+D, cross-view protocol).

| # | Method | Test Modality | Loss | Cross-View |
|---|--------|---------------|------|------------|
| 13a | Ours – step 3 | Depth (hall) | Eq. (12.5) | 76.12% |
| 15a | Ours – step 4 | Depth | x-entr | 76.41% |


We observe that the performance of the final model degrades by almost 1%, 76.41% vs. 77.21% (cf. row #15 of Table 12.1). A more consistent setting would be to modify the model by inverting the cross-stream connections in steps 3 and 4, thus having information flowing again from depth to RGB.

12.5 Conclusions and Future Work

This chapter addresses the task of video action recognition in the context of privileged information. We describe a new learning paradigm to teach a hallucination network to mimic the depth stream. Our model outperforms many of the supervised methods recently evaluated on the NTU RGB+D dataset, as well as the original hallucination model proposed in [10]. We conducted an extensive ablation study to verify how the several parts composing our learning paradigm contribute to the model performance. As future work, we would like to extend this approach to deal with additional modalities that may be available at training time, such as skeleton joint data or infrared sequences. Finally, the current model cannot be applied to still images due to the presence of temporal convolutions; in principle, we could remove them and apply our method to still images and other tasks, such as object detection.

References

[1] E.J. Gibson, R.D. Walk, The “visual cliff”, Scientific American 1960;202(4):64–71.

[2] M.R. Watson, J.T. Enns, Encyclopedia of Human Behavior. Elsevier; 2012:690–696 Ch. Depth Perception.

[3] P. Servos, Distance estimation in the visual and visuomotor systems, Experimental Brain Research 2000;130(1):35–47.

[4] M. Firman, Rgbd datasets: past, present and future, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2016:19–31.

[5] A. Shahroudy, T.-T. Ng, Y. Gong, G. Wang, Deep multimodal feature analysis for action recognition in rgb+d videos, IEEE Transactions on Pattern Analysis and Machine Intelligence 2018;40(5):1045–1058.

[6] J. Liu, N. Akhtar, A. Mian, Viewpoint invariant action recognition using rgb-d videos, IEEE Access 2017 10.1109/DICTA.2017.8227505.

[7] S. Gupta, R. Girshick, P. Arbeláez, J. Malik, Learning rich features from rgb-d images for object detection and segmentation, European Conference on Computer Vision. 2014:345–360.

[8] C. Hazirbas, L. Ma, C. Domokos, D. Cremers, Fusenet: incorporating depth into semantic segmentation via fusion-based cnn architecture, Asian Conference on Computer Vision. Springer; 2016:213–228.

[9] V. Vapnik, A. Vashist, A new learning paradigm: learning using privileged information, Neural Networks 2009;22(5):544–557.

[10] J. Hoffman, S. Gupta, T. Darrell, Learning with side information through modality hallucination, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:826–834.

[11] N.C. Garcia, P. Morerio, V. Murino, Modality distillation with multiple stream networks for action recognition, The European Conference on Computer Vision. ECCV. 2018.

[12] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, NIPS Deep Learning and Representation Learning Workshop. 2014.

[13] J. Ba, R. Caruana, Do deep nets really need to be deep? Advances in Neural Information Processing Systems. 2014:2654–2662.

[14] D. Lopez-Paz, B. Schölkopf, L. Bottou, V. Vapnik, Unifying distillation and privileged information, International Conference on Learning Representations. ICLR. 2016.

[15] K. Simonyan, A. Zisserman, Two-stream convolutional networks for action recognition in videos, Advances in Neural Information Processing Systems. 2014:568–576.

[16] J. Carreira, A. Zisserman, Quo vadis, action recognition? A new model and the kinetics dataset, Computer Vision and Pattern Recognition, 2017 IEEE Conference on. CVPR. IEEE; 2017:4724–4733.

[17] C. Feichtenhofer, A. Pinz, R.P. Wildes, Spatiotemporal multiplier networks for video action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:4768–4777.

[18] A. Shahroudy, J. Liu, T.-T. Ng, G. Wang, Ntu rgb+d: a large scale dataset for 3d human activity analysis, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:1010–1019.

[19] H. Rahmani, A. Mahmood, D. Huynh, A. Mian, Histogram of oriented principal components for cross-view action recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 2016;38(12):2430–2443.

[20] J. Wang, X. Nie, Y. Xia, Y. Wu, S.-C. Zhu, Cross-view action modeling, learning and recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014:2649–2656.

[21] Z. Luo, J.-T. Hsieh, L. Jiang, J. Carlos Niebles, L. Fei-Fei, Graph distillation for action detection with privileged modalities, The European Conference on Computer Vision. ECCV. 2018.

[22] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, Computer Vision and Pattern Recognition, 2005, IEEE Computer Society Conference on, vol. 1. CVPR 2005. IEEE; 2005:886–893.

[23] H. Wang, A. Kläser, C. Schmid, C.-L. Liu, Action recognition by dense trajectories, Computer Vision and Pattern Recognition, 2011 IEEE Conference on. CVPR. IEEE; 2011:3169–3176.

[24] H. Wang, C. Schmid, Action recognition with improved trajectories, Proceedings of the IEEE International Conference on Computer Vision. 2013:3551–3558.

[25] I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, Computer Vision and Pattern Recognition, 2008, IEEE Conference on. CVPR 2008. IEEE; 2008:1–8.

[26] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video classification with convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014:1725–1732.

[27] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3d convolutional networks, Proceedings of the IEEE International Conference on Computer Vision. 2015:4489–4497.

[28] X. Wang, R. Girshick, A. Gupta, K. He, Non-local neural networks, The IEEE Conference on Computer Vision and Pattern Recognition, vol. 1. CVPR. 2018:4.

[29] C. Feichtenhofer, A. Pinz, A. Zisserman, Convolutional two-stream network fusion for video action recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:1933–1941.

[30] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:770–778.

[31] K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in deep residual networks, European Conference on Computer Vision. Springer; 2016:630–645.

[32] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, M. Paluri, A closer look at spatiotemporal convolutions for action recognition, Proceedings of the IEEE International Conference on Computer Vision. 2018.

[33] L. Sun, K. Jia, D.-Y. Yeung, B.E. Shi, Human action recognition using factorized spatio-temporal convolutional networks, Proceedings of the IEEE International Conference on Computer Vision. 2015:4597–4605.

[34] A. Eitel, J.T. Springenberg, L. Spinello, M. Riedmiller, W. Burgard, Multimodal deep learning for robust rgb-d object recognition, Intelligent Robots and Systems, 2015 IEEE/RSJ International Conference on. IROS. IEEE; 2015:681–687.

[35] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, International Conference on Learning Representations. ICLR. 2015.

[36] T. Mallick, P.P. Das, A.K. Majumdar, Characterizations of noise in Kinect depth images: a review, IEEE Sensors Journal 2014;14(6):1731–1740.

[37] Z. Luo, B. Peng, D.-A. Huang, A. Alahi, L. Fei-Fei, Unsupervised learning of long-term motion dynamics for videos, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:2203–2212.

[38] H. Rahmani, A. Mian, M. Shah, Learning a deep model for human action recognition from novel viewpoints, IEEE Transactions on Pattern Analysis and Machine Intelligence 2018;40(3):667–681.

[39] E. Ohn-Bar, M.M. Trivedi, Joint angles similarities and hog2 for action recognition, Computer Vision and Pattern Recognition Workshops, 2013 IEEE Conference on. CVPRW. IEEE; 2013:465–470.

[40] G. Gkioxari, J. Malik, Finding action tubes, Computer Vision and Pattern Recognition, 2015 IEEE Conference on. CVPR. IEEE; 2015:759–768.

[41] T. Soo Kim, A. Reiter, Interpretable 3d human action analysis with temporal convolutional networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2017:20–28.

[42] H. Rahmani, M. Bennamoun, Learning action recognition model from depth and skeleton videos, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:5832–5841.

[43] N.C. Garcia, P. Morerio, V. Murino, Learning with privileged information via adversarial discriminative modality distillation, arXiv preprint arXiv:1810.08437.
