Formally, we denote $(x_i, x_j)$ as the pair of the $i$-th and $j$-th samples, and define a pairwise class indicator as
$$
\ell_{ij} =
\begin{cases}
+1, & \text{if } x_i \text{ and } x_j \text{ share the same label}; \\
-1, & \text{otherwise}.
\end{cases}
\tag{5.3}
$$
To encode the similarity preservation, we minimize the cross-entropy loss of classifying all the pairs into a label $\ell_{ij}$:
$$
-\sum_{i,j=1}^{N} \left[ \mathbb{I}(\ell_{ij} = 1) \log \sigma\!\left(a_i^\top a_j\right) + \mathbb{I}(\ell_{ij} = -1) \log \sigma\!\left(-a_i^\top a_j\right) \right],
\tag{5.4}
$$
where $\mathbb{I}(\cdot)$ is a binary indicator function that outputs 1 when the argument is true, and 0 otherwise; and $\sigma(\cdot)$ is the sigmoid function. We can equivalently rewrite the above equation as
$$
J_2 = -\sum_{i=1}^{N} \sum_{j=1}^{N} \log \sigma\!\left(\ell_{ij}\, a_i^\top a_j\right).
\tag{5.5}
$$
Directly optimizing Eq. (5.5) is very time-consuming due to the huge number of instance pairs, i.e., $O(N^2)$ pairs w.r.t. $N$ samples.
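For illustration, a minimal NumPy sketch of the loss in Eq. (5.5) is given below; the array names and shapes are our own assumptions, as the book does not prescribe an implementation.

```python
import numpy as np

def pairwise_loss(A, labels):
    """Sketch of Eq. (5.5): J2 = -sum_{i,j} log sigma(l_ij * a_i^T a_j).

    A      : (N, d) array whose rows are the representations a_i.
    labels : (N,) integer class labels, from which l_ij is derived (Eq. 5.3).
    """
    # Pairwise class indicator l_ij in {+1, -1} (Eq. 5.3).
    L = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)
    # All pairwise inner products a_i^T a_j -- the O(N^2) cost noted above.
    S = A @ A.T
    # -log sigma(x) = log(1 + exp(-x)) = logaddexp(0, -x), computed stably.
    return np.sum(np.logaddexp(0.0, -L * S))
```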
To reduce the computing load, we turn to negative sampling [115]. In particular, for a given micro-video sample $x$, we respectively sample $S$ positive and $S$ negative micro-videos from $x$'s own category and from the other categories, following a distribution over triplets $(x_i, x_j, \ell_{ij})$. Formally, we uniformly sample the first instance $x_i$, and then sample the second instance $x_j$ with a probability $s_{ij}$ that represents the geometric closeness between $x_i$ and $x_j$. We calculate $s_{ij}$ with the radial basis function (RBF) kernel as
$$
s_{ij} = \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} \exp\!\left( -\frac{\left\| x_i^m - x_j^m \right\|^2}{\delta_m^2} \right),
\tag{5.6}
$$
where $\delta_m^2$ is a radius parameter that is set as the median of the Euclidean distances among all samples in modality $m$.
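To make the sampling scheme concrete, the following sketch computes the closeness $s_{ij}$ of Eq. (5.6) and draws $S$ positive and $S$ negative partners for an anchor; the function names and the per-modality data layout are illustrative assumptions, not the book's code.

```python
import numpy as np

def closeness(Xs, deltas, i, j):
    """Eq. (5.6): s_ij, the RBF kernel averaged over the modalities.

    Xs     : list of (N, d_m) arrays, one feature matrix per modality m.
    deltas : list of radius parameters delta_m^2, each set to the median
             Euclidean distance among all samples in that modality.
    """
    return np.mean([np.exp(-np.sum((X[i] - X[j]) ** 2) / d2)
                    for X, d2 in zip(Xs, deltas)])

def sample_partners(Xs, deltas, labels, i, S, seed=0):
    """Draw S positive and S negative partners for anchor x_i, each with
    probability proportional to s_ij (one reading of the scheme above)."""
    rng = np.random.default_rng(seed)
    N = len(labels)
    s = np.array([closeness(Xs, deltas, i, j) for j in range(N)])
    pos = np.flatnonzero((labels == labels[i]) & (np.arange(N) != i))
    neg = np.flatnonzero(labels != labels[i])
    pos_j = rng.choice(pos, size=S, p=s[pos] / s[pos].sum())
    neg_j = rng.choice(neg, size=S, p=s[neg] / s[neg].sum())
    return pos_j, neg_j
```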
5.5.3 DEEP NETWORK FOR VENUE ESTIMATION
After obtaining the multi-modal representations, we add a stack of fully connected layers, which enables us to capture the nonlinear and complex interactions among the visual, acoustic, and textual modalities.
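As a sketch of such a prediction head (the layer widths, depth, and number of venue categories below are illustrative assumptions, not the book's exact configuration):

```python
import torch.nn as nn

FUSED_DIM = 512    # assumed dimension of the fused multi-modal representation
NUM_VENUES = 188   # assumed number of venue categories

# A stack of fully connected layers mapping the fused representation
# to venue logits, capturing nonlinear cross-modal interactions.
venue_head = nn.Sequential(
    nn.Linear(FUSED_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, NUM_VENUES),
)
```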