4. MULTIMODAL COOPERATIVE LEARNING
With the separated consistent and complementary components, we can reconstruct the
representations with better expressiveness. We employ different strategies on distinct compo-
nents. To adequately exploit the correlations between the consistent component pairs, we con-
catenate these vectors and feed them into a neural network to learn an enhanced consistent
vector,
$$\tilde{\beta}^m = \varphi\left(\mathbf{W}_\beta^m \left[\beta_h^m; \beta_g^m\right]\right), \tag{4.47}$$

where $\mathbf{W}_\beta^m$, $\varphi(\cdot)$, and $\tilde{\beta}^m$ denote the weight matrix, the activation function, and the enhanced consistent vector in the modality $m$, respectively.
To supplement the exclusive information from other modalities, we integrate the en-
hanced consistent components and the complementary components to generate a feature vector
with powerful expressiveness as
$$\hat{x}^m = \left[\alpha_h^m; \tilde{\beta}^m; \alpha_g^m\right]. \tag{4.48}$$
Meanwhile, to guarantee consistency, the divergence between the two vectors in each consistent component pair should be minimized. However, these vectors may differ in dimension, and the number of consistent features is dynamic, so their divergence cannot be measured directly. Toward this end, we propose to compute the probability distributions over the venue categories represented by the consistent vectors, and then leverage the Kullback–Leibler (KL) divergence to encourage these distributions to be close.
In particular, the probability distribution over the venue categories is defined as follows:

$$p_o^m = \operatorname{softmax}\left(\mathbf{U}_o^m \beta_o^m\right), \quad o \in \{h, g\}, \tag{4.49}$$

where $\mathbf{U}_o^m \in \mathbb{R}^{K \times D_o}$ and $p_o^m \in \mathbb{R}^{K}$, respectively, denote the weight matrix and the probability distribution over the $K$ venue categories represented by the consistent vector $\beta_o^m$.
Following that, we compute the KL divergence between the two probability distributions $p_h^m$ and $p_g^m$, formally as
$$\mathcal{L}_1^m = \sum_{x \in \mathcal{X}} \left( p_g^m \log \frac{p_g^m}{p_h^m} + p_h^m \log \frac{p_h^m}{p_g^m} \right), \tag{4.50}$$

where $p_h^m$ and $p_g^m$ both denote probability distributions over the venue categories, and $\mathcal{X}$ denotes the set of samples. Based upon
this, we calculate the sum of the KL divergences from all modalities as
$$\mathcal{L}_1 = \sum_{m \in \mathcal{M}} \mathcal{L}_1^m. \tag{4.51}$$
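The consistency loss can be sketched as below, using a symmetrized KL divergence between each pair of category distributions in the spirit of Eqs. (4.50)–(4.51); the toy distributions and the modality names are purely illustrative assumptions.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two category distributions."""
    p, q = p + eps, q + eps  # avoid log(0) for zero-probability categories
    return float(np.sum(p * np.log(p / q) + q * np.log(q / p)))

# Hypothetical per-modality pairs (p_h^m, p_g^m) over K = 3 venue categories.
pairs = {
    "visual":   (np.array([0.5, 0.3, 0.2]), np.array([0.4, 0.4, 0.2])),
    "acoustic": (np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])),
}

# Per-modality divergence, then summed over all modalities as in Eq. (4.51).
L1 = sum(sym_kl(p_h, p_g) for p_h, p_g in pairs.values())
print(L1 > 0.0)  # True: the loss vanishes only when each pair is identical
```

Minimizing this sum pushes the two distributions of every modality toward each other, which is exactly the consistency constraint motivated above.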
4.6.3 ATTENTION NETWORKS
Given the augmented representations above, a straightforward way to estimate the venue cate-
gory is to adopt a classifier. However, we argue that the rich information within the augmented
representations is redundant for the prediction task, and hence a simple classifier can hardly