4. MULTIMODAL COOPERATIVE LEARNING
With the separated consistent and complementary components, we can reconstruct the
representations with better expressiveness. We employ different strategies on distinct compo-
nents. To adequately exploit the correlations between the consistent component pairs, we con-
catenate these vectors and feed them into a neural network to learn an enhanced consistent
vector,
$$\tilde{\beta}^m = \varphi\left(\mathbf{W}_\beta^m \left[\beta_h^m; \beta_g^m\right]\right), \tag{4.47}$$

where $\mathbf{W}_\beta^m$, $\varphi(\cdot)$, and $\tilde{\beta}^m$ denote the weight matrix, the activation function, and the enhanced consistent vector in the modality $m$, respectively.
To supplement the exclusive information from other modalities, we integrate the en-
hanced consistent components and the complementary components to generate a feature vector
with powerful expressiveness as
$$\hat{x}^m = \left[\alpha_h^m; \tilde{\beta}^m; \alpha_g^m\right]. \tag{4.48}$$
Meanwhile, to guarantee consistency, the divergence between the two vectors in each consistent component pair should be minimized. However, these vectors may differ in dimension, and the number of consistent features is dynamic, so their divergence cannot be measured directly. Toward this end, we propose to compute the probability distributions over the venue categories represented by the consistent vectors, and then leverage the Kullback–Leibler (KL) divergence to encourage these distributions to be close.
In particular, the probability distribution over the venue categories is defined as follows:

$$p_o^m = \operatorname{softmax}\left(\mathbf{U}_o^m \beta_o^m\right), \quad o \in \{h, g\}, \tag{4.49}$$

where $\mathbf{U}_o^m \in \mathbb{R}^{K \times D_o}$ and $p_o^m \in \mathbb{R}^{K}$, respectively, denote the weight matrix and the probability distribution over the $K$ venue categories represented by the consistent vector $\beta_o^m$.
Following that, we compute the KL divergence between the two probability distributions $p_h^m$ and $p_g^m$, formally as
$$\mathcal{L}_1^m = \sum_{x \in \mathcal{X}} \left( p_g^m \log \frac{p_g^m}{p_h^m} + p_h^m \log \frac{p_h^m}{p_g^m} \right), \tag{4.50}$$

where $p_h^m$ and $p_g^m$ both denote probability distributions over the venue categories, and $\mathcal{X}$ denotes the set of samples. Based upon
this, we calculate the sum of the KL divergences from all modalities as
$$\mathcal{L}_1 = \sum_{m \in \mathcal{M}} \mathcal{L}_1^m. \tag{4.51}$$
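The consistency loss can be sketched as below, using a symmetrized KL divergence between each pair of category distributions in the spirit of Eqs. (4.50)–(4.51); the toy distributions and the modality names are purely illustrative assumptions.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two category distributions."""
    p, q = p + eps, q + eps  # avoid log(0) for zero-probability categories
    return float(np.sum(p * np.log(p / q) + q * np.log(q / p)))

# Hypothetical per-modality pairs (p_h^m, p_g^m) over K = 3 venue categories.
pairs = {
    "visual":   (np.array([0.5, 0.3, 0.2]), np.array([0.4, 0.4, 0.2])),
    "acoustic": (np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])),
}

# Per-modality divergence, then summed over all modalities as in Eq. (4.51).
L1 = sum(sym_kl(p_h, p_g) for p_h, p_g in pairs.values())
print(L1 > 0.0)  # True: the loss vanishes only when each pair is identical
```

Minimizing this sum pushes the two distributions of every modality toward each other, which is exactly the consistency constraint motivated above.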
4.6.3 ATTENTION NETWORKS
Given the augmented representations above, a straightforward way to estimate the venue cate-
gory is to adopt a classifier. However, we argue that the rich information within the augmented
representations is redundant for the prediction task, and hence a simple classifier can hardly