the overall representation as the input. We term this generic solution multimodal early fusion.
Formally, for each micro-video $x$, we concatenate $x_v$, $x_a$, and $x_t$ into one vector as
$$x = [x_v; x_a; x_t], \qquad (4.39)$$
where $x$ is the multimodal representation obtained by fusing the features from the visual, acoustic, and textual modalities.
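To make the early fusion concrete, the following is a minimal sketch in PyTorch, assuming the three modality features have already been extracted as fixed-length vectors; the dimensionalities are illustrative assumptions, not values from this work:

```python
import torch

# Illustrative dimensionalities for the three modality features.
x_v = torch.randn(128)  # visual feature of a micro-video
x_a = torch.randn(64)   # acoustic feature
x_t = torch.randn(32)   # textual feature

# Early fusion (Eq. (4.39)): concatenate the modality features into
# a single multimodal representation x = [x_v; x_a; x_t].
x = torch.cat([x_v, x_a, x_t])
print(x.shape)  # torch.Size([224])
```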
In fact, early fusion implicitly assumes that the modalities are linearly independent, thereby overlooking the correlations among them. Hence, it fails to exploit the cross-modal correlations to strengthen the expressiveness of each modality and to further improve the capacity of the fusion method. In this work, we argue that the information across modalities can be categorized into two parts: consistent and complementary components. For example, let certain features of $x_v$ indicate the visual concepts of "sunshine" and "crowd," and some features of $x_a$ describe the acoustic concepts of "wind" and "crowd cheering." From the angle of consistency, the visual concept of "crowd" is consistent with the acoustic concept of "crowd cheering." For complementarity, the visual concept of "sunshine" provides exclusive signals, as compared to the acoustic one of "wind."
Uncovering the underlying modality relations in micro-videos is already challenging, not to mention relating the different types of relations to the final prediction. To the best of our knowledge, most existing efforts only implicitly model the modality relations during the learning process, leaving the explicit modeling of these relations untouched. Specifically, deep-learning-based methods, which feed multimodal features together into a black-box multi-layer neural network and output a joint representation, are widely used to characterize multimodal data. Within such a deep neural network, the correlations between different features are entangled in the new representations; the features responsible for them can be neither identified nor filtered out of the resulting vectors. Toward this end, we aim to propose a novel cooperative learning mechanism to leverage the uncovered relations and boost the prediction performance.
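As a concrete illustration of such black-box fusion, consider the minimal sketch below; the layer widths are illustrative assumptions, reusing the dimensionalities from the earlier early-fusion sketch:

```python
import torch
import torch.nn as nn

# Black-box joint fusion: the concatenated multimodal vector is pushed
# through a multi-layer network. Cross-feature correlations end up
# entangled in the hidden units, so the contribution of any individual
# modality feature cannot be identified or filtered from the output.
joint_encoder = nn.Sequential(
    nn.Linear(224, 128),  # 224 = 128 (visual) + 64 (acoustic) + 32 (textual)
    nn.ReLU(),
    nn.Linear(128, 64),   # 64-d joint multimodal representation
)

x = torch.randn(8, 224)   # a batch of 8 early-fused micro-video vectors
joint = joint_encoder(x)  # shape: (8, 64)
```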
4.6.2 COOPERATIVE NETWORKS
Our preliminary consideration is to explicitly model the relations comprising the consistent and complementary parts. A viable solution [1, 133] is to project the representations of different modalities into a common latent space. In this solution, the consistent cues should be close to each other, since they convey the same evidence, whereas the complementary cues should be distant in the common space, since they carry no overlapping information. However, when mapping the heterogeneous information extracted from a micro-video into the same coordinate space, some information, especially the modality-specific information, is likely to be lost during the projection. We term this the common-specific method. Hence, such a direct mapping will lead to suboptimal expressiveness. Although we can control the loss to a certain extent through careful parameter tuning, this requires extensive experiments and does not easily adapt to other applications.
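For reference, a minimal sketch of this common-specific baseline is given below; the module names, dimensionalities, and the pairwise distance loss are our own illustrative assumptions rather than the exact formulation of [1, 133]:

```python
import torch
import torch.nn as nn

class CommonSpaceProjector(nn.Module):
    """Projects each modality into a shared latent space (common-specific
    baseline sketch; all dimensions are illustrative assumptions)."""
    def __init__(self, dim_v=128, dim_a=64, dim_t=32, dim_c=64):
        super().__init__()
        self.proj_v = nn.Linear(dim_v, dim_c)  # visual -> common space
        self.proj_a = nn.Linear(dim_a, dim_c)  # acoustic -> common space
        self.proj_t = nn.Linear(dim_t, dim_c)  # textual -> common space

    def forward(self, x_v, x_a, x_t):
        return self.proj_v(x_v), self.proj_a(x_a), self.proj_t(x_t)

def consistency_loss(z_v, z_a, z_t):
    # Pull the projected modalities toward each other: consistent cues
    # should be close in the common space. Nothing in this objective
    # preserves modality-specific signals, which is exactly the
    # information such a direct mapping risks losing.
    return (((z_v - z_a) ** 2).mean()
            + ((z_v - z_t) ** 2).mean()
            + ((z_a - z_t) ** 2).mean())

# Usage sketch on a batch of 8 micro-videos.
proj = CommonSpaceProjector()
z_v, z_a, z_t = proj(torch.randn(8, 128), torch.randn(8, 64), torch.randn(8, 32))
loss = consistency_loss(z_v, z_a, z_t)
```

Minimizing such a consistency loss draws all projected modalities together, which illustrates why the common-specific method tends to sacrifice the complementary, modality-specific information discussed above.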