“Concert Hall.” Although the textual modality contains little useful information, our model can utilize the information from the visual and acoustic modalities to guide the estimation of the venue category. Similarly, from the bottom micro-video in Figure 4.12, it can be seen that many people are kayaking on a lake surrounded by a lot of trees. Therefore, it was captured at a “Garden.” Our proposed model successfully predicts the venues of these micro-videos, which demonstrates that it can learn discriminative representations that distinguish well among micro-videos of different venue categories.
4.6 MULTIMODAL COOPERATIVE LEARNING
Technically speaking, venue category estimation of micro-videos is usually treated as a multimodal fusion problem and solved by integrating the geographic cues from the visual, acoustic, and textual modalities of micro-videos. Several pioneering efforts have been dedicated to this task, such as [102, 192], and [122]. The current methods, however, are restricted to fusing only the common (a.k.a. consistent) cues across multiple modalities, or only the complementary cues. Moving one step forward, in this work, we shed light on the cooperative relations, comprising both the consistent and the complementary components. We refer to the consistent component as the same information appearing in more than one modality in different forms. As shown in Figure 4.13, the red candy displayed in the visual modality and the word “lollipop” in the text describe the consistency. By contrast, the complementary component represents the exclusive information appearing in only one modality. For instance, it is hard to find in the other modalities of Figure 4.13 the equivalent of the textual concept “girl” or the visual concept “grass.” To supercharge a multimodal prediction scheme with such cooperative relations, the multimodal cooperation shall be able to: (1) enhance the confidence of the same evidence from various views via consistent regularization;
Figure 4.13: Exemplar demonstration of the correlation between the visual modality and the textual modality, whose text reads, “A girl feeds the cute dog with a giant lollipop on the campus.” The blue and brown boxes show the consistent information, and the red dashed boxes show the complementary information.
and (2) provide a comprehensive representation from the exclusive perspective of the complementary component. Nevertheless, characterizing and modeling the multimodal cooperation is non-trivial due to the following challenges: (1) the consistent and complementary information is often mixed, and how to separate it from the different modalities remains largely unexplored; and (2) after separation, it is difficult to associate the two parts with each other, since they are orthogonal.
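To make requirement (1) above concrete, one simple way to instantiate a consistent regularizer (a minimal sketch in our own notation, not necessarily the exact form adopted by the model described below) is to penalize disagreement between the consistent parts separated from two modalities after projecting them into a shared space:

$$\mathcal{L}_{con} = \left\| g_v\left(x^{v}_{con}\right) - g_t\left(x^{t}_{con}\right) \right\|_2^2,$$

where $x^{v}_{con}$ and $x^{t}_{con}$ denote the consistent parts of the visual and textual modalities, and $g_v(\cdot)$ and $g_t(\cdot)$ are modality-specific projections. Minimizing $\mathcal{L}_{con}$ pushes the same evidence observed from different views toward agreement, which raises its confidence in the fused representation.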
To address the problems analyzed above, we present an end-to-end deep multimodal cooperative learning approach to estimating the venue categories of micro-videos. Notably, this approach is applicable to other multimodal cooperative scenarios. As illustrated in Figure 4.14, the features are first extracted from each modality and fed into three peer cooperative nets.
Figure 4.14: An illustration of our framework. It separates the consistent features from the complementary ones and enhances the expressiveness of each modality via the proposed cooperative net. Then, it selects the features to generate a discriminative representation in the attention network toward venue category estimation.
In each cooperative net, we treat one modality as the host and the remaining ones as the guests. We then obtain the augmented feature vectors as the output of the cooperative nets. Following that, each vector is fed into an attention net, followed by a late fusion over the prediction results from the different softmax functions. Stepping into the cooperative net, as demonstrated in Figure 4.15, its structure is symmetric. In particular, on the left-hand side, we first concatenate the guest modalities and estimate the relevance between each dimension of the combined vector and the host vector. For the combined vector, a gate with a learned threshold is used to separate its consistent part from its complementary part. An analogous process is applied on the right-hand side. Thereafter, the two consistent parts are fused with a deep neural network model, and the fusion result is ultimately concatenated with the two complementary parts.
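To make the data flow of a single cooperative net more tangible, the following PyTorch-style sketch follows the description above under our own assumptions: the layer sizes, the sigmoid-based relevance scores, and the fixed gating threshold tau are illustrative placeholders rather than the exact configuration of our model (where the threshold is learned).

import torch
import torch.nn as nn

class CooperativeNet(nn.Module):
    """Sketch of one cooperative net: host modality vs. concatenated guests."""

    def __init__(self, host_dim, guest_dim, hidden_dim=256, tau=0.5):
        super().__init__()
        self.tau = tau  # gating threshold (learned in the real model; fixed here)
        # Dimension-wise relevance of the guest vector w.r.t. the host, and vice versa.
        self.guest_relevance = nn.Sequential(
            nn.Linear(host_dim + guest_dim, guest_dim), nn.Sigmoid())
        self.host_relevance = nn.Sequential(
            nn.Linear(host_dim + guest_dim, host_dim), nn.Sigmoid())
        # Fusion network for the two consistent parts.
        self.fuse = nn.Sequential(
            nn.Linear(host_dim + guest_dim, hidden_dim), nn.ReLU())

    def forward(self, host, guests):
        # guests: list of guest-modality feature tensors -> concatenate them.
        guest = torch.cat(guests, dim=-1)
        pair = torch.cat([host, guest], dim=-1)

        # Relevance scores in [0, 1] for each dimension.
        r_guest = self.guest_relevance(pair)  # relevance of guest dims to the host
        r_host = self.host_relevance(pair)    # relevance of host dims to the guests

        # Gate: high-relevance dimensions are treated as consistent,
        # the remaining ones as complementary (hard masks for illustration).
        g_mask = (r_guest > self.tau).float()
        h_mask = (r_host > self.tau).float()
        guest_con, guest_comp = guest * g_mask, guest * (1 - g_mask)
        host_con, host_comp = host * h_mask, host * (1 - h_mask)

        # Fuse the two consistent parts, then append both complementary parts.
        fused = self.fuse(torch.cat([host_con, guest_con], dim=-1))
        return torch.cat([fused, host_comp, guest_comp], dim=-1)

With the visual features as the host and the acoustic and textual features as the guests, a forward pass would look like net(x_v, [x_a, x_t]); the returned augmented vector is what the subsequent attention net and softmax classifier in Figure 4.14 would consume. The sketch only mirrors the data flow described above; the actual relevance estimation and fusion layers of our model may differ.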
We first formally define the problem. Assume that we are given a set of N micro-videos $\mathcal{X} = \{x_i\}_{i=1}^{N}$. For each micro-video $x \in \mathcal{X}$, we segment it into three modalities $\{x^v, x^a, x^t\}$, where $v$, $a$, and $t$ denote the visual, acoustic, and textual modality indices, respectively.