With the separated consistent and complementary components, we can reconstruct the representations with better expressiveness. We employ different strategies on the distinct components. To adequately exploit the correlations between the consistent component pairs, we concatenate these vectors and feed them into a neural network to learn an enhanced consistent vector,
$$\tilde{\beta}^m = \varphi\left(\mathbf{W}_{\beta}^{m}\left[\beta_h^m;\, \beta_g^m\right]\right), \quad (4.47)$$
where $\mathbf{W}_{\beta}^{m}$, $\varphi(\cdot)$, and $\tilde{\beta}^m$ denote the weight matrix, the activation function, and the enhanced consistent vector in modality $m$, respectively.
To supplement the exclusive information from the other modalities, we integrate the enhanced consistent component and the complementary components to generate a more expressive feature vector as
$$\hat{x}^m = \left[\alpha_h^m;\, \tilde{\beta}^m;\, \alpha_g^m\right]. \quad (4.48)$$
Meanwhile, to guarantee consistency, the divergence between the consistent component pairs should be minimized. However, the dimension of each vector is different, and the number of consistent features is dynamic, so we cannot measure this divergence directly. Toward this end, we propose to compute the probability distributions over venue categories represented by the consistent vectors, and further leverage the Kullback–Leibler divergence (KL divergence) to encourage them to be close.
Particularly, the probability distribution over categories is defined as follows:
$$p_o^m = \mathrm{softmax}\left(\mathbf{U}_o^m \beta_o^m\right), \quad (4.49)$$
where $\mathbf{U}_o^m \in \mathbb{R}^{K \times D_o}$ and $p_o^m \in \mathbb{R}^{K}$, respectively, denote the weight matrix and the probability distribution of the venue categories represented by the consistent vector $\beta_o^m$, with $o \in \{h, g\}$.
Following that, we compute the KL divergence between the two probability distributions $p_h^m$ and $p_g^m$, formally as

$$\mathcal{L}_1^m = \sum_{x \in \mathcal{X}} \left( p_g^m \log \frac{p_g^m}{p_h^m} + p_h^m \log \frac{p_h^m}{p_g^m} \right), \quad (4.50)$$
where $p_h^m$ and $p_g^m$ both denote probability distributions over the venue categories. Based upon this, we calculate the sum of the KL divergences from all modalities as

$$\mathcal{L}_1 = \sum_{m \in \mathcal{M}} \mathcal{L}_1^m. \quad (4.51)$$
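A minimal sketch of this consistency objective, Eqs. (4.49)–(4.51), follows; the function and tensor names are illustrative assumptions, and a small epsilon is added for numerical stability.

```python
import torch
import torch.nn.functional as F

def consistency_loss(beta_h, beta_g, U_h, U_g, eps=1e-8):
    """Sketch of Eqs. (4.49)-(4.50) for one modality m."""
    # Eq. (4.49): category distributions induced by the consistent vectors.
    p_h = F.softmax(beta_h @ U_h.t(), dim=-1)  # U_h in R^{K x D_h}
    p_g = F.softmax(beta_g @ U_g.t(), dim=-1)  # U_g in R^{K x D_g}
    # Eq. (4.50): symmetric KL divergence between the two distributions.
    kl = (p_g * ((p_g + eps) / (p_h + eps)).log()
          + p_h * ((p_h + eps) / (p_g + eps)).log()).sum(dim=-1)
    return kl.mean()

# Eq. (4.51): sum the per-modality losses, e.g.,
# loss_1 = sum(consistency_loss(bh, bg, Uh, Ug)
#              for (bh, bg, Uh, Ug) in per_modality_inputs)
```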
4.6.3 ATTENTION NETWORKS
Given the augmented representations above, a straightforward way to estimate the venue category is to adopt a classifier. However, we argue that the rich information within the augmented representations is redundant for the prediction task, and hence a simple classifier can hardly select the discriminative features. Several efforts have been made to obtain a discriminative representation from massive features, such as PCA [159] and sparse representation [207]. These approaches, however, have many hyper-parameters to tune; retaining more principal components, for instance, can lead to suboptimal performance.
With the advance of the attention mechanism, we employ an attention network to evaluate the attention scores of the features toward different venue categories. These scores measure the relevance and significance of the features for each venue category. In addition, the continuous attention scores make the feature selection flexible. Thereafter, we obtain the scored features and leverage them to learn a discriminative representation to estimate the venue category in each modality.
Attention Score
Given a feature vector, we assign an attention score to each feature according to the venue category and obtain the scored feature to learn a discriminative representation.
Instead of computing the importance of each feature to the categories on the fly, we construct a trainable memory matrix to store the attention scores. The matrix is denoted as $\mathbf{\Omega}^m \in \mathbb{R}^{D^m \times K}$ in modality $m$, and the entry in row $i$ and column $j$ represents the importance of the $i$-th feature toward the $j$-th venue category. For each category, the scored feature vector is obtained by calculating the element-wise product of the feature vector and the corresponding column vector of $\mathbf{\Omega}^m$. It is formulated as

$$c_j^m = \omega_j^m \otimes \hat{x}^m, \quad (4.52)$$
where $\hat{x}^m \in \mathbb{R}^{D^m}$ is the augmented vector in modality $m$, $\omega_j^m \in \mathbb{R}^{D^m}$ denotes the feature attention scores of venue category $j$ (i.e., the $j$-th column of $\mathbf{\Omega}^m$), and $c_j^m \in \mathbb{R}^{D^m}$ denotes the scored feature vector toward venue category $j$.
To yield the discriminative representation, we feed the scored feature vector into a fully connected layer as follows:

$$e_j^m = \sigma\left(\mathbf{W}^m c_j^m\right), \quad (4.53)$$

where $\mathbf{W}^m$, $\sigma(\cdot)$, and $e_j^m$ denote the weight matrix, the activation function, and the discriminative representation of the $j$-th venue category in modality $m$, respectively.
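The following sketch wires Eqs. (4.52) and (4.53) together; the class name, the tanh stand-in for $\sigma(\cdot)$, and the initialization scale are assumptions of ours.

```python
import torch
import torch.nn as nn

class AttentionScoring(nn.Module):
    """Sketch of Eqs. (4.52)-(4.53) for one modality m."""

    def __init__(self, dim: int, num_categories: int, dim_rep: int):
        super().__init__()
        # Memory matrix Omega^m in R^{D^m x K}: entry (i, j) stores the
        # importance of feature i toward venue category j.
        self.omega = nn.Parameter(0.01 * torch.randn(dim, num_categories))
        self.W = nn.Linear(dim, dim_rep, bias=False)  # W^m in Eq. (4.53)
        self.act = nn.Tanh()  # stands in for the unspecified sigma(.)

    def forward(self, x_hat):
        # x_hat: (batch, D^m), the augmented vector from Eq. (4.48).
        # Eq. (4.52): element-wise product with each category's score column,
        # yielding one scored vector per category: (batch, K, D^m).
        scored = x_hat.unsqueeze(1) * self.omega.t().unsqueeze(0)
        # Eq. (4.53): per-category discriminative representations,
        # shape (batch, K, dim_rep).
        return self.act(self.W(scored))
```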
Multimodal Estimation
After obtaining the discriminative representations, we pass them into a fully connected softmax
layer. It computes the probability distributions over the venue category labels in each modality,
mathematically stated as
$$p\left(\hat{y}_k^m \mid e_k^m\right) = \frac{\exp\left(z_k^{\top} e_k^m\right)}{\sum_{k'=1}^{K} \exp\left(z_{k'}^{\top} e_{k'}^m\right)}, \quad (4.54)$$
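A sketch of this estimation layer is given below; collecting the per-category weight vectors $z_k$ into one matrix is our choice, and all names are illustrative.

```python
import torch
import torch.nn as nn

class MultimodalEstimation(nn.Module):
    """Sketch of the softmax layer in Eq. (4.54) for one modality m."""

    def __init__(self, num_categories: int, dim_rep: int):
        super().__init__()
        # One weight vector z_k per venue category, stacked into (K, dim_rep).
        self.z = nn.Parameter(0.01 * torch.randn(num_categories, dim_rep))

    def forward(self, reps):
        # reps: (batch, K, dim_rep), per-category representations e_k^m
        # from Eq. (4.53). The logit for category k is z_k^T e_k^m.
        logits = (reps * self.z.unsqueeze(0)).sum(dim=-1)  # (batch, K)
        # Softmax over the K per-category logits, as in Eq. (4.54).
        return torch.softmax(logits, dim=-1)
```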