for each modality that is able to project the low-level features to concept-level representations.
Analogous to the dictionaries in dictionary learning paradigms, the mapping functions are the
concept-feature distributions.
Let $\widetilde{\mathcal{X}}^a = \{\widetilde{\mathbf{x}}_i^a\}_{i=1}^{N_0}$ be the dataset of external sounds. These sounds share the same low-level feature space with the acoustic modality in micro-videos (i.e., $\widetilde{\mathbf{x}}^a \in \mathbb{R}^{D_a}$). For each sound clip $\widetilde{\mathbf{x}}^a$, we denote its corresponding concept-wise representation as $\widetilde{\mathbf{a}}^a \in \mathbb{R}^{K_0}$ over the $K_0$ acoustic concepts, whereby $K_0$ equals the number of acoustic concepts in this work, i.e., 313. It is worth noting that $\widetilde{\mathbf{a}}^a$ is observable, since we know the associated tags (acoustic concepts) of each collected sound clip. During learning, we aim to use the concept space of the external real-life sounds to represent the acoustic modality in the given micro-video. This is accomplished by ensuring that $\mathbf{x}^a$ and $\widetilde{\mathbf{x}}^a$ share the same mapping function. Based upon this, our objective function $\mathcal{J}_1$ of sound knowledge transfer can be stated as
$$
\mathcal{J}_1 \;=\; \frac{1}{N}\sum_{\mathbf{x}\in\mathcal{X}}\sum_{m\in\mathcal{M}} \big\|\mathbf{D}^{m}\mathbf{x}^{m} - \mathbf{a}^{m}\big\|^{2} \;+\; \frac{1}{N_0}\sum_{\widetilde{\mathbf{x}}\in\widetilde{\mathcal{X}}} \big\|\mathbf{D}^{a}\widetilde{\mathbf{x}}^{a} - \widetilde{\mathbf{a}}^{a}\big\|^{2},
\tag{5.1}
$$
where $\mathbf{D}^a \in \mathbb{R}^{D_a \times K_0}$ is the shared mapping function, bridging the gap between the external sounds and the internal acoustic modality, whereinto its $i$-th column $\mathbf{d}_i^a$ represents the low-level feature of the $i$-th concept, such as footsteps or throat clearing; $\mathbf{a}^a \in \mathbb{R}^{K_0}$ is the desired concept-level representation of $\mathbf{x}$ over the $K_0$ acoustic concepts; and $\mathbf{D}^v$ and $\mathbf{a}^v$ ($\mathbf{D}^t$ and $\mathbf{a}^t$) are analogous to $\mathbf{D}^a$ and $\mathbf{a}^a$. Noticeably, $\mathbf{D}^v$ is an identity matrix, slightly different from the other two mapping functions, since the visual features extracted by AlexNet are already sufficiently abstract.
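To make the mechanics of Eq. (5.1) concrete, the following minimal NumPy sketch evaluates its two terms. The function name, the dict-of-modalities layout, and the matrix shapes are illustrative assumptions rather than the implementation used in this work; each mapping matrix is stored with one column per concept, as stated above, and is applied so that low-level features are projected onto the $K_0 = 313$ acoustic concepts.

```python
import numpy as np

def sound_transfer_loss(X, A, D, X_ext, A_ext):
    """Hedged sketch of the objective J_1 in Eq. (5.1).

    X, A, D  -- dicts keyed by modality ('v', 'a', 't'):
                X[m] is an (N, D_m) feature matrix, A[m] an (N, K_m) matrix of
                concept-level representations, and D[m] a (D_m, K_m) mapping
                whose columns are the low-level features of the K_m concepts.
    X_ext    -- (N0, D_a) low-level features of the external sound clips.
    A_ext    -- (N0, K0) observed concept labels of those clips (K0 = 313).
    """
    N = next(iter(X.values())).shape[0]
    N0 = X_ext.shape[0]

    # Internal term: every modality's features should map onto its
    # concept-level representation through the corresponding D[m].
    internal = sum(np.sum((X[m] @ D[m] - A[m]) ** 2) for m in X)

    # External term: the *same* acoustic mapping D['a'] must also explain the
    # labeled external sounds, which is what transfers the sound knowledge.
    external = np.sum((X_ext @ D['a'] - A_ext) ** 2)

    return internal / N + external / N0
```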
5.5.2 MULTI-MODAL FUSION
As aforementioned, the multiple modalities provide complementary cues. We thus argue that multi-modal fusion can provide a comprehensive and informative description for micro-videos. In our case, we adopt the early fusion strategy for simplicity. Formally, for each micro-video $\mathbf{x}$, we concatenate $\mathbf{a}^v$, $\mathbf{a}^a$, and $\mathbf{a}^t$ into one vector as
$$
\mathbf{a} \;=\; \big[\mathbf{a}^v;\; \mathbf{a}^a;\; \mathbf{a}^t\big],
\tag{5.2}
$$
where $\mathbf{a} \in \mathbb{R}^{D_v + K_0 + D_t}$ is the desired multi-modal representation for $\mathbf{x}$, whereinto $\mathbf{a}^v$, $\mathbf{a}^a$, and $\mathbf{a}^t$, respectively, denote the concept-level representations over the visual, acoustic, and textual modalities.
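Eq. (5.2) is a plain concatenation, as the short example below illustrates. The dimensionalities are placeholder assumptions for the sake of the example: 4096 for an AlexNet-style visual feature, 313 acoustic concepts, and a hypothetical 100-dimensional textual representation.

```python
import numpy as np

# Early fusion (Eq. 5.2): simply concatenate the per-modality representations.
a_v = np.random.rand(4096)   # visual: D^v is the identity, so a_v equals x_v
a_a = np.random.rand(313)    # acoustic: scores over the K_0 = 313 concepts
a_t = np.random.rand(100)    # textual: placeholder dimensionality

a = np.concatenate([a_v, a_a, a_t])   # shape: (D_v + K_0 + D_t,)
```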
To alleviate the problem of unbalanced training samples, we further regularize $\mathbf{a}_i$ for each micro-video $\mathbf{x}_i$ by similarity preservation. In particular, if two micro-videos are in the same venue category, they should have similar representations in the latent space; otherwise, they should have dissimilar ones. This suits well the paradigm of graph embedding [182], which injects the label information into the embeddings.
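The similarity-preservation idea can be sketched as a signed graph penalty over pairs of micro-videos: same-venue pairs are pulled together and different-venue pairs are pushed apart. The sketch below shows one common graph-embedding formulation of this kind; the exact regularizer adopted here follows [182] and may differ in its pairwise weighting, so the affinity definition and function name are assumptions for illustration only.

```python
import numpy as np

def similarity_preservation(A, venues):
    """Sketch of a similarity-preserving penalty on representations.

    A      -- (N, d) matrix whose i-th row is the representation a_i.
    venues -- (N,) array of venue-category labels.
    Returns sum_{i,j} S_ij * ||a_i - a_j||^2 with S_ij = +1 for same-venue
    pairs and -1 otherwise, so minimizing it pulls same-venue micro-videos
    together and pushes different-venue ones apart.
    """
    S = np.where(venues[:, None] == venues[None, :], 1.0, -1.0)
    np.fill_diagonal(S, 0.0)

    # Pairwise squared Euclidean distances, computed without explicit loops.
    sq = np.sum(A ** 2, axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * A @ A.T

    return float(np.sum(S * dists))
```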