the overall representation as the input. We term this generic solution multimodal early fusion. Formally, for each micro-video, we concatenate $\mathbf{x}_v$, $\mathbf{x}_a$, and $\mathbf{x}_t$ into one vector as
$$\mathbf{x} = [\mathbf{x}_v;\, \mathbf{x}_a;\, \mathbf{x}_t], \qquad (4.39)$$
where $\mathbf{x}$ is the multimodal representation obtained by fusing the features from the visual, acoustic, and textual modalities.
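To make the operation concrete, here is a minimal sketch of such early fusion in Python; the feature dimensions are illustrative placeholders rather than values used in our experiments.

```python
# Minimal sketch of early fusion (Eq. 4.39): concatenate the visual,
# acoustic, and textual vectors of one micro-video into a single vector.
# The feature dimensions below are hypothetical.
import numpy as np

x_v = np.random.rand(128)   # visual features
x_a = np.random.rand(64)    # acoustic features
x_t = np.random.rand(32)    # textual features

x = np.concatenate([x_v, x_a, x_t])   # x = [x_v; x_a; x_t]
print(x.shape)                        # (224,) -- the joint multimodal vector
```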
In fact, early fusion implicitly assumes that the modalities are linearly independent, overlooking the correlations among them. Hence, it fails to exploit the modal correlations to strengthen the expressiveness of each modality and further improve the capacity of the fusion method. In this work, we argue that the information across modalities can be categorized into two parts: consistent and complementary components. For example, let certain features of $\mathbf{x}_v$ indicate the visual concepts of “sunshine” and “crowd,” and some features of $\mathbf{x}_a$ describe the acoustic concepts of “wind” and “crowd cheering.” From the angle of consistency, the visual concept of “crowd” is consistent with the acoustic concept of “crowd cheering.” For complementarity, the visual concept of “sunshine” provides exclusive signals compared to the acoustic one of “wind.”
Uncovering the underlying modality relations in micro-videos is already challenging, not to mention linking the different types of relations to the final prediction. To the best of our knowledge, most existing efforts only implicitly model the modality relations during the learning process, leaving the explicit exhibition of relations untouched. Specifically, deep learning based methods, which feed multimodal features together into a black-box multi-layer neural network and output a joint representation, are widely used to characterize multimodal data. With such deep neural networks, the correlations between different features are implicitly encoded in the new representations. However, the consistent and complementary features cannot be identified and separated from these vectors. Toward this end, we aim to propose a novel cooperative learning mechanism to leverage the uncovered relations and boost the prediction performance.
4.6.2 COOPERATIVE NETWORKS
Our preliminary consideration is to explicitly model the relations comprised of the consistent and complementary parts. A viable solution [1, 133] is to project the representations of different modalities into a common latent space. In this solution, the consistent cues should be close to each other since they show the same evidence, whereas the complementary cues should be distant in the common space because they share no overlapping information. However, when mapping the heterogeneous information extracted from a micro-video into the same coordinate system, some information, especially the modality-specific information, is probably lost during the projection. We term this the common-specific method. Hence, such direct mapping leads to suboptimal expressiveness. Although we can control the loss to a certain extent through careful parameter tuning, doing so requires extensive experiments and does not easily transfer to other applications.
To avoid such information loss, we devise a novel solution named the cooperative network, in which the information of each modality is retained in full and augmented by the other modalities. Specifically, this network assigns each feature dimension a relation score and consequently divides the features into consistent and complementary parts. Here the relation score of each feature reflects how consistently its information can be derived from the other modalities. The use of relation scores endows our model with strong expressiveness and benefits the subsequent cooperative learning. In what follows, we elaborate the key ingredients of the cooperative network.
Relation Score
The goal of the relation score is to select features from each modality whose underlying information is consistent among modalities. As shown in Figure 4.15, we treat one specific modality $m$ as the host, represented as $\mathbf{h}^m$, and the other modalities as the guests, denoted as $\mathbf{g}_1^m$ and $\mathbf{g}_2^m$, respectively. Intuitively, we can explicitly capture the varying consistency of the host and guest features by assigning an attentive weight to each feature dimension. These weights are considered as the relation scores. Therefore, given the representations of the host and guest modalities, we present a novel relation-aware attention mechanism to score each feature.
Considering that the consistency should capture the correlation between the host and the whole guest modality, we concatenate all the guest vectors together as follows:
$$\mathbf{g}^m = [\mathbf{g}_1^m;\, \mathbf{g}_2^m], \qquad (4.40)$$
where $\mathbf{g}^m$ encapsulates all the features from the guest modalities. Subsequently, we feed the guest vector $\mathbf{g}^m$ and the host vector $\mathbf{h}^m$ into the attention scoring function, which is a neural network composed of a single hidden layer and a softmax layer. The output of this function is a host score vector, where the value of each dimension reflects the degree to which a host feature can be derived from the whole set of guest features. The degree reaches its highest at 1 and its lowest at 0.
It is formally defined as
$$\mathbf{s}_h^m = \mathrm{softmax}\!\left(\mathbf{W}_h^m\,[\mathbf{h}^m;\, \mathbf{g}^m]\right), \qquad (4.41)$$
where $\mathbf{W}_h^m \in \mathbb{R}^{D_h \times D}$ and $\mathbf{s}_h^m \in \mathbb{R}^{D_h}$ denote the weight matrix and the relation score vector corresponding to each dimension of the host vector, respectively; $D_h$ denotes the dimension of the host vector, and $D$ is the dimension of the overall vector $[\mathbf{h}^m;\, \mathbf{g}^m]$. For simplicity, we omit the bias terms.
For the guest modality, we analogously score the feature dimensions to measure the degree to which a guest feature can be derived from the host features, defined as follows:
$$\mathbf{s}_g^m = \mathrm{softmax}\!\left(\mathbf{W}_g^m\,[\mathbf{g}^m;\, \mathbf{h}^m]\right), \qquad (4.42)$$
where $\mathbf{W}_g^m \in \mathbb{R}^{D_g \times D}$, $D_g$, and $\mathbf{s}_g^m \in \mathbb{R}^{D_g}$ denote the weight matrix, the dimension of the guest vector, and the relation score vector corresponding to each dimension of the guest vector, respectively.
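A compact sketch of the relation-aware scoring in Eqs. (4.41) and (4.42) is given below, written in PyTorch; the module name, layer sizes, and the use of a single linear layer before the softmax are illustrative assumptions rather than the exact architecture used in our experiments.

```python
# Sketch of the relation scores in Eqs. (4.41)-(4.42): each host (guest)
# dimension receives an attentive weight computed from the concatenated
# host and guest vectors. Names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationScore(nn.Module):
    def __init__(self, dim_h, dim_g):
        super().__init__()
        dim_all = dim_h + dim_g                            # D = D_h + D_g
        self.W_h = nn.Linear(dim_all, dim_h, bias=False)   # W_h^m in R^{D_h x D}
        self.W_g = nn.Linear(dim_all, dim_g, bias=False)   # W_g^m in R^{D_g x D}

    def forward(self, h_m, g_m):
        s_h = F.softmax(self.W_h(torch.cat([h_m, g_m], dim=-1)), dim=-1)  # Eq. (4.41)
        s_g = F.softmax(self.W_g(torch.cat([g_m, h_m], dim=-1)), dim=-1)  # Eq. (4.42)
        return s_h, s_g

# usage: visual host with concatenated acoustic + textual guests
scorer = RelationScore(dim_h=128, dim_g=96)
s_h, s_g = scorer(torch.rand(1, 128), torch.rand(1, 96))
```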
Consistent and Complementary Components
Having established the attentive relation scores, we can easily locate the consistent and complementary features of each modality. Toward this end, we set a trainable threshold denoted as $\tau_o^m$, where $o \in \mathcal{O} = \{h, g\}$ is the host and guest indicator. This threshold divides the relation score vector into two parts, a consistent weight vector and a complementary weight vector, namely $\boldsymbol{\gamma}_o^m$ and $\boldsymbol{\delta}_o^m$. The element of the complementary weight vector is defined as follows:
$$\delta_o^m[i] = \begin{cases} 1 - s_o^m[i], & \text{if } s_o^m[i] < \tau_o^m, \\ 0, & \text{otherwise,} \end{cases} \qquad (4.43)$$
where $\delta_o^m[i]$ is the value of the $i$-th dimension in the complementary weight vector $\boldsymbol{\delta}_o^m$, reflecting the degree of the complementary relation. For the consistent weight vector $\boldsymbol{\gamma}_o^m$, we formulate its element as
$$\gamma_o^m[i] = \begin{cases} s_o^m[i], & \text{if } s_o^m[i] \ge \tau_o^m, \\ 0, & \text{otherwise,} \end{cases} \qquad (4.44)$$
where $\gamma_o^m[i]$ is the value of the $i$-th dimension in the consistent weight vector $\boldsymbol{\gamma}_o^m$, indicating the degree of the consistency.
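As a quick illustration, the hard split of Eqs. (4.43) and (4.44) can be sketched as follows; the scores and threshold are made-up numbers.

```python
# Sketch of the hard split in Eqs. (4.43)-(4.44): scores below the threshold
# feed the complementary weights, the remaining ones the consistent weights.
import numpy as np

def split_scores(s, tau):
    """Return (consistent, complementary) weight vectors for relation scores s."""
    gamma = np.where(s >= tau, s, 0.0)        # consistent weights
    delta = np.where(s < tau, 1.0 - s, 0.0)   # complementary weights
    return gamma, delta

s = np.array([0.05, 0.45, 0.15, 0.35])        # e.g., softmax relation scores
gamma, delta = split_scores(s, tau=0.25)
# gamma -> [0., 0.45, 0., 0.35], delta -> [0.95, 0., 0.85, 0.]
```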
In particular, since the above functions are not continuous, we introduce a sigmoid function to make them differentiable, as follows:
$$\begin{cases} \gamma_o^m[i] = \dfrac{s_o^m[i]}{1 + e^{-w\,(s_o^m[i] - \tau_o^m)}}, \\[6pt] \delta_o^m[i] = 1 - \gamma_o^m[i], \end{cases} \qquad (4.45)$$
where $w$ denotes a scalar weighting the difference between $s_o^m[i]$ and $\tau_o^m$, making the output $\gamma_o^m[i]$ as close as possible to either 0 or $s_o^m[i]$. Through experiments, the best results are obtained with a weight of 50.
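Under the same assumptions as above, a small sketch of the differentiable split in Eq. (4.45) is shown below; with the steep slope $w = 50$, the sigmoid behaves as a soft indicator of whether a score exceeds the threshold.

```python
# Sketch of the differentiable split in Eq. (4.45): the sigmoid gate pushes
# gamma[i] toward either s[i] (score above threshold) or 0 (score below it).
import numpy as np

def soft_split(s, tau, w=50.0):
    gate = 1.0 / (1.0 + np.exp(-w * (s - tau)))   # ~1 if s >= tau, ~0 otherwise
    gamma = s * gate                              # consistent weights
    delta = 1.0 - gamma                           # complementary weights
    return gamma, delta

gamma, delta = soft_split(np.array([0.05, 0.45, 0.15, 0.35]), tau=0.25)
# gamma is close to [0, 0.45, 0, 0.35]; delta is its complement to 1
```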
After that, we obtain four correlation weight vectors from each host-guest pair, namely $\boldsymbol{\delta}_h^m$, $\boldsymbol{\delta}_g^m$, $\boldsymbol{\gamma}_h^m$, and $\boldsymbol{\gamma}_g^m$. Based on these weight vectors, we separate the consistent features and the complementary features from the mixed information by taking the element-wise products of the original feature vectors and the corresponding weight vectors:
$$\begin{cases} \boldsymbol{\alpha}_h^m = \mathbf{h}^m \otimes \boldsymbol{\delta}_h^m, \\ \boldsymbol{\alpha}_g^m = \mathbf{g}^m \otimes \boldsymbol{\delta}_g^m, \\ \boldsymbol{\beta}_h^m = \mathbf{h}^m \otimes \boldsymbol{\gamma}_h^m, \\ \boldsymbol{\beta}_g^m = \mathbf{g}^m \otimes \boldsymbol{\gamma}_g^m, \end{cases} \qquad (4.46)$$
where the two complementary vectors and the two consistent vectors of the host and guest modalities are denoted as $\boldsymbol{\alpha}_h^m$, $\boldsymbol{\alpha}_g^m$, $\boldsymbol{\beta}_h^m$, and $\boldsymbol{\beta}_g^m$, respectively.