3.6. MULTIMODAL TRANSDUCTIVE LEARNING
modality can be represented as $\mathbf{X}^k \in \mathbb{R}^{(N+M) \times Z_k}$. The popularity of all the videos is denoted by $\mathbf{y} = \{y_1, y_2, \ldots, y_N\}^T \in \mathbb{R}^N$. Let $\mathbf{f} = \{f_1, f_2, \ldots, f_N, f_{N+1}, f_{N+2}, \ldots, f_{N+M}\}^T \in \mathbb{R}^{N+M}$ stand for the predicted popularity of all samples, including the labeled and unlabeled ones. We aim to jointly learn the common space $\mathbf{X}^0 \in \mathbb{R}^{(N+M) \times Z_0}$ shared by multiple modalities and the popularity of the $M$ unlabeled micro-videos.
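To make the notation concrete, the following NumPy sketch builds placeholder feature matrices $\mathbf{X}^k$, the label vector $\mathbf{y}$, and the prediction vector $\mathbf{f}$ with the shapes defined above; the values of $N$, $M$, and $Z_k$ are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 8, 4            # labeled and unlabeled micro-videos (illustrative)
Z = [16, 32, 10, 12]   # per-modality feature dimensions Z_k (illustrative)

# One feature matrix X^k in R^{(N+M) x Z_k} per modality.
X = [rng.standard_normal((N + M, Zk)) for Zk in Z]

# Popularity labels y in R^N exist only for the N labeled videos.
y = rng.random(N)

# Predictions f in R^{N+M} cover both labeled and unlabeled videos.
f = np.zeros(N + M)

print([Xk.shape for Xk in X], y.shape, f.shape)
```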
We present a novel Transductive Multi-modAL Learning approach, TMALL for short, to predict the popularity of micro-videos. As illustrated in Figure 3.1, we first crawl a representative micro-video dataset from Vine and develop a rich set of popularity-oriented features from multiple modalities. We then perform multi-modal learning to predict the popularity of micro-videos, which seamlessly takes the modality relatedness and modality limitation into account by utilizing a common space shared by all modalities. We assume that there exists an optimal common space, which maintains the intrinsic characteristics of micro-videos in their original spaces. In light of this, all modalities are forced to be correlated. Meanwhile, micro-videos with different popularity can be better separated in such an optimal common space than in each single modality, which in a sense alleviates the modality limitation problem. It is worth mentioning that, in this work, we aim to predict how popular a given micro-video will be once its propagation is stable, rather than when the given micro-video will become popular.
Figure 3.1: Micro-video popularity prediction via our proposed TMALL model.
3.6.1 OBJECTIVE FORMULATION
It is apparent that different modalities may contribute distinctive and complementary information about micro-videos. For example, the textual modality gives us hints about the topics of a given micro-video; the acoustic and visual modalities may convey the location and situation of micro-videos, respectively; and the user modality reflects the influence of the micro-video publisher. These clues jointly contribute to the popularity of a micro-video. Obviously, due to the noise and information insufficiency of each modality, it may be suboptimal to conduct learning directly
from each single modality separately. In contrast, we assume that there exists an optimal latent space in which micro-videos can be better described. Moreover, this optimal latent space should maintain the original intrinsic characteristics conveyed by the multiple modalities of the given micro-videos. Therefore, we penalize the disagreement of the normalized Laplacian matrices between the latent space and each modality. In particular, we formalize this assumption as follows. Let $\mathbf{S}^k \in \mathbb{R}^{(N+M) \times (N+M)}$ be the similarity matrix,$^7$ which is computed by the Gaussian similarity function as follows:
$$
\mathbf{S}^k(i,j) =
\begin{cases}
\exp\left(-\dfrac{\left\|\mathbf{x}_i^k - \mathbf{x}_j^k\right\|^2}{2\sigma_k^2}\right), & \text{if } i \neq j,\\[2mm]
0, & \text{if } i = j,
\end{cases}
\tag{3.5}
$$
where $\mathbf{x}_i^k$ and $\mathbf{x}_j^k$ are a pair of micro-videos in the $k$-th modality space. Therein, the radius parameter $\sigma_k$ is simply set as the median of the Euclidean distances over all video pairs in the $k$-th modality. We then derive the corresponding normalized Laplacian matrix as follows:
$$
\mathcal{L}(\mathbf{S}^k) = \mathbf{I} - \mathbf{D}_k^{-\frac{1}{2}} \mathbf{S}^k \mathbf{D}_k^{-\frac{1}{2}},
\tag{3.6}
$$
where $\mathbf{I}$ is an $(N+M) \times (N+M)$ identity matrix and $\mathbf{D}_k \in \mathbb{R}^{(N+M) \times (N+M)}$ is the diagonal degree matrix, whose $(u,u)$-th entry is the sum of the $u$-th row of $\mathbf{S}^k$. Since $\mathbf{S}^k(i,j) \geq 0$, we can derive that $\operatorname{tr}(\mathcal{L}(\mathbf{S}^k)) > 0$. We thus formulate the disagreement penalty between the latent space and the original modalities as
$$
\sum_{k=1}^{K} \left\| \frac{1}{\operatorname{tr}\left(\mathcal{L}(\mathbf{S}^0)\right)} \mathcal{L}(\mathbf{S}^0) - \frac{1}{\operatorname{tr}\left(\mathcal{L}(\mathbf{S}^k)\right)} \mathcal{L}(\mathbf{S}^k) \right\|_F^2,
\tag{3.7}
$$
where $\operatorname{tr}(\mathbf{A})$ is the trace of matrix $\mathbf{A}$ and $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. In addition, inspired by [164] and considering that similar micro-videos tend to have similar popularity in the latent common space, we adopt the following regularizer:
$$
\frac{1}{2}\sum_{m=1}^{N+M}\sum_{n=1}^{N+M}\left(\frac{f(\mathbf{x}_m^0)}{\sqrt{\mathbf{D}_0(\mathbf{x}_m^0)}} - \frac{f(\mathbf{x}_n^0)}{\sqrt{\mathbf{D}_0(\mathbf{x}_n^0)}}\right)^{2}\mathbf{S}^0(m,n) = \mathbf{f}^T \mathcal{L}(\mathbf{S}^0)\,\mathbf{f}.
\tag{3.8}
$$
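As a sanity check, the quantities in Eqs. (3.5)–(3.8) can be computed directly in NumPy, and the identity in Eq. (3.8) (the smoothness double sum equals the quadratic form) can be verified numerically on random data. This is a minimal sketch under illustrative sizes, not the authors' implementation:

```python
import numpy as np

def gaussian_similarity(Xk):
    """Eq. (3.5): S^k(i,j) = exp(-||x_i - x_j||^2 / (2 sigma_k^2)), zero diagonal.
    sigma_k is the median Euclidean distance over all pairs, as in the text."""
    n = Xk.shape[0]
    sq = ((Xk[:, None, :] - Xk[None, :, :]) ** 2).sum(-1)   # squared distances
    sigma = np.median(np.sqrt(sq[np.triu_indices(n, k=1)])) # median over pairs
    S = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(S, 0.0)
    return S

def normalized_laplacian(S):
    """Eq. (3.6): L(S) = I - D^{-1/2} S D^{-1/2}."""
    d_inv_sqrt = np.diag(1.0 / np.sqrt(S.sum(axis=1)))
    return np.eye(S.shape[0]) - d_inv_sqrt @ S @ d_inv_sqrt

def disagreement_penalty(L0, Ls):
    """Eq. (3.7): sum_k || L0/tr(L0) - Lk/tr(Lk) ||_F^2."""
    return sum(np.linalg.norm(L0 / np.trace(L0) - Lk / np.trace(Lk), 'fro') ** 2
               for Lk in Ls)

rng = np.random.default_rng(0)
n = 12                                                 # N + M (illustrative)
X0 = rng.standard_normal((n, 5))                       # latent common space
Xs = [rng.standard_normal((n, 8)) for _ in range(3)]   # K = 3 modalities

S0 = gaussian_similarity(X0)
L0 = normalized_laplacian(S0)
Ls = [normalized_laplacian(gaussian_similarity(Xk)) for Xk in Xs]
penalty = disagreement_penalty(L0, Ls)

# Eq. (3.8): the double sum over degree-normalized prediction differences
# equals the quadratic form f^T L(S^0) f.
f = rng.standard_normal(n)
g = f / np.sqrt(S0.sum(axis=1))                        # f(x_m) / sqrt(D_0(x_m))
double_sum = 0.5 * ((g[:, None] - g[None, :]) ** 2 * S0).sum()
quad_form = f @ L0 @ f
assert np.isclose(double_sum, quad_form)
```

Note that since the diagonal of $\mathbf{S}^0$ is zero, $\operatorname{tr}(\mathcal{L}(\mathbf{S}^0))$ equals $N+M$, which the trace normalization in Eq. (3.7) relies on being strictly positive.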
Based upon these formulations, we can define the loss function that measures the empirical error on the training samples. As reported in [123], the squared loss usually yields performance as good as that of other, more complex losses. We thus adopt the squared loss in our algorithm for simplicity and efficiency. In particular, since we do not have the labels for testing samples, we only
$^7$To facilitate the illustration, $k$ ranges from $0$ to $K$.