4.6. MULTIMODAL COOPERATIVE LEARNING 91
and (2) provide a comprehensive representation from the exclusive perspective of the complementary components. Nevertheless, characterizing and modeling multimodal cooperation is non-trivial due to the following challenges: (1) consistent and complementary information is often mixed, and how to separate it across different modalities remains largely untapped; and (2) after separation, it is difficult to associate the two parts with each other, since they are orthogonal.
To address the problems analyzed above, we present an end-to-end deep multimodal co-
operative learning approach to estimating the venue categories of micro-videos. Notably, this
approach is applicable to other multimodal cooperative scenarios. As illustrated in Figure 4.14,
the features are first extracted from each modality and fed into three peer cooperative nets. In
[Figure 4.14 diagram: visual, acoustic, and textual modality features enter three host–guest cooperative nets, each followed by an attention net and a softmax; the softmax outputs are late-fused into venue-category predictions (e.g., Park, Concert, School, Store).]
Figure 4.14: An illustration of our framework. It separates the consistent features from the complementary ones and enhances the expressiveness of each modality via the proposed cooperative net. Then, it selects the features to generate a discriminative representation in the attention network toward venue category estimation.
each cooperative net, we respectively treat one modality as the host and the rest as the guests. Then we obtain the augmented feature vectors as the output of the cooperative nets. Following that, each vector is fed into an attention net, followed by a late fusion over the prediction results from the different softmax functions. Stepping into the cooperative net as demonstrated in Figure 4.15, the structure is symmetric. In particular, on the left-hand side, we first concatenate the guest modalities and estimate the relevance between each dimension of the combined vector and the host vector. For the combined vector, a gate with a learned threshold is used to separate its consistent part from its complementary part. An analogous process is applied to the right-hand side. Thereafter, the two consistent parts are fused with a deep neural network model, and the fusion result is ultimately concatenated with the two complementary parts.
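To make the data flow through one cooperative net concrete, the gating and fusion steps can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the per-dimension relevance scores and the gate threshold are learned in the actual model (here they are random placeholders and a fixed constant), the consistent-part fusion uses a deep network rather than a simple mean, and the function names (`separate`, `cooperative_net`) as well as the equal host/guest dimensionality are assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def separate(vec, relevance, threshold=0.5):
    """Gate each dimension into a consistent or a complementary part,
    based on its relevance to the other side of the cooperative net."""
    gate = relevance >= threshold          # learned threshold in the paper; fixed here
    return vec * gate, vec * ~gate

def cooperative_net(host, guest, threshold=0.5):
    """One symmetric host-guest pass (random relevance stands in for
    the learned attention; assumes host and guest have equal length)."""
    rel_guest = rng.random(guest.shape)    # relevance of guest dims to the host
    rel_host = rng.random(host.shape)      # relevance of host dims to the guest
    g_cons, g_comp = separate(guest, rel_guest, threshold)
    h_cons, h_comp = separate(host, rel_host, threshold)
    # Fuse the two consistent parts (paper: deep net; here: elementwise mean),
    # then concatenate the fusion with both complementary parts.
    fused = (g_cons + h_cons) / 2.0
    return np.concatenate([fused, h_comp, g_comp])

host = rng.random(8)       # e.g., visual features (host modality)
guest = rng.random(8)      # e.g., concatenated acoustic + textual features (guests)
augmented = cooperative_net(host, guest)
print(augmented.shape)     # (24,)
```

The augmented vector keeps the complementary dimensions of both sides intact while the shared information enters only once through the fused part, which is the intuition behind feeding it to the downstream attention net.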
We first formally define the problem. Assume that we are given a set of $N$ micro-videos $\mathcal{X} = \{x_i\}_{i=1}^{N}$. For each micro-video $x \in \mathcal{X}$, we segment it into three modalities $\{x^v, x^a, x^t\}$, where $v$, $a$, and $t$ denote the visual, acoustic, and textual modality indices, respectively. Let