5.5. DEEP MULTI-MODAL TRANSFER LEARNING 115
Formally, we denote $(\mathbf{x}_i, \mathbf{x}_j)$ as the pair of the $i$-th and $j$-th samples, and define a pairwise class indicator as
$$
\ell_{ij} =
\begin{cases}
+1, & \text{if } \mathbf{x}_i \text{ and } \mathbf{x}_j \text{ are with the same label;} \\
-1, & \text{otherwise.}
\end{cases}
\tag{5.3}
$$
To encode the similarity preservation, we minimize the cross-entropy loss of classifying all the pairs into a label $\ell_{ij}$,
$$
-\sum_{i,j=1}^{N} \Big[ \mathbb{I}(\ell_{ij} = 1)\,\log \sigma\big(\mathbf{a}_i^{\top}\mathbf{a}_j\big) + \mathbb{I}(\ell_{ij} = -1)\,\log \sigma\big(-\mathbf{a}_i^{\top}\mathbf{a}_j\big) \Big],
\tag{5.4}
$$
where $\mathbb{I}(\cdot)$ is a binary indicator function that outputs 1 when the argument is true, otherwise 0; and $\sigma(\cdot)$ is the sigmoid function. We can equivalently rewrite the above equation as
$$
\mathcal{J}_2 = -\sum_{i=1}^{N} \sum_{j=1}^{N} \log \sigma\big(\ell_{ij}\,\mathbf{a}_i^{\top}\mathbf{a}_j\big).
\tag{5.5}
$$
Directly optimizing Eq. (5.5) is very time-consuming due to the huge number of instance pairs, i.e., $O(N^2)$ w.r.t. $N$ samples.
To reduce the computing load, we turn to negative sampling [115]. In particular, for a given micro-video sample $\mathbf{x}$, we sample $S$ positive and $S$ negative micro-videos from $\mathbf{x}$'s own category and the other categories, respectively, following a distribution $p(\mathbf{x}_i, \mathbf{x}_j, \ell_{ij})$. Formally, we uniformly sample the first instance $\mathbf{x}_i$. We then sample the second instance $\mathbf{x}_j$ with a probability $s_{ij}$ that represents the geometric closeness between $\mathbf{x}_i$ and $\mathbf{x}_j$. We calculate $s_{ij}$ with the radial basis function kernel as
$$
s_{ij} = \frac{1}{|\mathcal{M}|} \sum_{m \in \mathcal{M}} \exp\left( -\frac{\big\| \mathbf{x}_i^m - \mathbf{x}_j^m \big\|^2}{\delta_m^2} \right),
\tag{5.6}
$$
where $\delta_m^2$ is a radius parameter that is set as the median of the Euclidean distances of all samples on the modality $m$.
5.5.3 DEEP NETWORK FOR VENUE ESTIMATION
After obtaining the multi-modal representations, we add a stack of fully connected layers, which
enables us to capture the nonlinear and complex interactions between the visual, acoustic, and