the overall representation as the input. We term this generic solution multimodal early fusion.
Formally, for each micro-video $x$, we concatenate $x_v$, $x_a$, and $x_t$ into one vector as
$$x = [x_v; x_a; x_t], \qquad (4.39)$$
where $x$ is the multimodal representation obtained by fusing the features from the visual, acoustic, and textual modalities.
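To make the early fusion concrete, the following is a minimal sketch in PyTorch, assuming the three modality features have already been extracted as fixed-length vectors; the dimensionalities are illustrative assumptions, not values from this work:

```python
import torch

# Illustrative dimensionalities for the three modality features.
x_v = torch.randn(128)  # visual feature of a micro-video
x_a = torch.randn(64)   # acoustic feature
x_t = torch.randn(32)   # textual feature

# Early fusion (Eq. (4.39)): concatenate the modality features into
# a single multimodal representation x = [x_v; x_a; x_t].
x = torch.cat([x_v, x_a, x_t])
print(x.shape)  # torch.Size([224])
```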
In fact, early fusion implicitly assumes that the modalities are linearly independent, thereby overlooking the correlations among them. Hence, it fails to exploit the cross-modal correlations to strengthen the expressiveness of each modality and to further improve the capacity of the fusion method. In this work, we argue that the information across modalities can be categorized into two parts: consistent and complementary components. For example, let certain features of $x_v$ indicate the visual concepts of "sunshine" and "crowd," and some features of $x_a$ describe the acoustic concepts of "wind" and "crowd cheering." From the angle of consistency, the visual concept of "crowd" is consistent with the acoustic concept of "crowd cheering." For complementarity, the visual concept of "sunshine" provides exclusive signals, as compared to the acoustic one of "wind."
Uncovering the underlying modality relations in micro-videos is already challenging, not to mention relating the different types of relations to the final prediction. To the best of our knowledge, most existing efforts only implicitly model the modality relations during the learning process, leaving the explicit modeling of these relations untouched. Specifically, deep-learning-based methods, which feed multimodal features together into a black-box multi-layer neural network and output a joint representation, are widely used to characterize multimodal data. Within such a deep neural network, the correlations between different features are entangled in the new representations; the features responsible for them can be neither identified nor filtered out of the resulting vectors. Toward this end, we aim to propose a novel cooperative learning mechanism to leverage the uncovered relations and boost the prediction performance.
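As a concrete illustration of such black-box fusion, consider the minimal sketch below; the layer widths are illustrative assumptions, reusing the dimensionalities from the earlier early-fusion sketch:

```python
import torch
import torch.nn as nn

# Black-box joint fusion: the concatenated multimodal vector is pushed
# through a multi-layer network. Cross-feature correlations end up
# entangled in the hidden units, so the contribution of any individual
# modality feature cannot be identified or filtered from the output.
joint_encoder = nn.Sequential(
    nn.Linear(224, 128),  # 224 = 128 (visual) + 64 (acoustic) + 32 (textual)
    nn.ReLU(),
    nn.Linear(128, 64),   # 64-d joint multimodal representation
)

x = torch.randn(8, 224)   # a batch of 8 early-fused micro-video vectors
joint = joint_encoder(x)  # shape: (8, 64)
```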
4.6.2 COOPERATIVE NETWORKS
Our preliminary consideration is to explicitly model the relations comprising the consistent and complementary parts. A viable solution [1, 133] is to project the representations of different modalities into a common latent space. In this solution, the consistent cues should be close to each other, since they convey the same evidence, whereas the complementary cues should be distant in the common space, since they carry no overlapping information. However, when mapping the heterogeneous information extracted from a micro-video into the same coordinate space, some information, especially the modality-specific information, is likely to be lost during the projection. We term this the common-specific method. Hence, such a direct mapping will lead to suboptimal expressiveness. Although we can control the loss to a certain extent through careful parameter tuning, this requires extensive experiments and does not easily adapt to other applications.
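For reference, a minimal sketch of this common-specific baseline is given below; the module names, dimensionalities, and the pairwise distance loss are our own illustrative assumptions rather than the exact formulation of [1, 133]:

```python
import torch
import torch.nn as nn

class CommonSpaceProjector(nn.Module):
    """Projects each modality into a shared latent space (common-specific
    baseline sketch; all dimensions are illustrative assumptions)."""
    def __init__(self, dim_v=128, dim_a=64, dim_t=32, dim_c=64):
        super().__init__()
        self.proj_v = nn.Linear(dim_v, dim_c)  # visual -> common space
        self.proj_a = nn.Linear(dim_a, dim_c)  # acoustic -> common space
        self.proj_t = nn.Linear(dim_t, dim_c)  # textual -> common space

    def forward(self, x_v, x_a, x_t):
        return self.proj_v(x_v), self.proj_a(x_a), self.proj_t(x_t)

def consistency_loss(z_v, z_a, z_t):
    # Pull the projected modalities toward each other: consistent cues
    # should be close in the common space. Nothing in this objective
    # preserves modality-specific signals, which is exactly the
    # information such a direct mapping risks losing.
    return (((z_v - z_a) ** 2).mean()
            + ((z_v - z_t) ** 2).mean()
            + ((z_a - z_t) ** 2).mean())

# Usage sketch on a batch of 8 micro-videos.
proj = CommonSpaceProjector()
z_v, z_a, z_t = proj(torch.randn(8, 128), torch.randn(8, 64), torch.randn(8, 32))
loss = consistency_loss(z_v, z_a, z_t)
```

Minimizing such a consistency loss draws all projected modalities together, which illustrates why the common-specific method tends to sacrifice the complementary, modality-specific information discussed above.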