3.6. MULTIMODAL TRANSDUCTIVE LEARNING 29
modality can be represented as $X^k \in \mathbb{R}^{(N+M) \times Z_k}$. The popularity of all the videos is denoted
by $y = \{y_1, y_2, \ldots, y_N\}^T \in \mathbb{R}^{N}$. Let
$f = \{f_1, f_2, \ldots, f_N, f_{N+1}, f_{N+2}, \ldots, f_{N+M}\}^T \in \mathbb{R}^{N+M}$
stand for the predicted results regarding popularity for all samples, including the labeled and
unlabeled ones. We aim to jointly learn the common space $X^0 \in \mathbb{R}^{(N+M) \times Z_0}$ shared by multiple
modalities and the popularity for the $M$ unlabeled micro-videos.
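To make the notation concrete, the following minimal sketch lays out the objects defined above as plain Python containers and checks their dimensions. The sizes `N`, `M`, `Z`, and `Z0`, the helper `random_matrix`, and the random placeholder values are assumptions chosen purely for illustration; they are not taken from the dataset or from the TMALL implementation.

```python
import random

# Hypothetical sizes, chosen only for illustration.
N, M = 6, 4  # number of labeled and unlabeled micro-videos
Z = {"social": 3, "visual": 5, "acoustic": 4, "textual": 2}  # Z_k per modality
Z0 = 3       # dimension of the common space

rng = random.Random(0)


def random_matrix(rows, cols):
    """Return a rows x cols matrix (list of lists) of placeholder values."""
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]


# X^k in R^{(N+M) x Z_k}: one feature matrix per modality, covering all samples.
X = {k: random_matrix(N + M, z_k) for k, z_k in Z.items()}

# y in R^N: popularity scores, one per labeled sample.
y = [rng.random() for _ in range(N)]

# f in R^{N+M}: predicted popularity for every sample. The last M entries
# (the unlabeled micro-videos) are the quantities to be inferred; zeros here
# are placeholders, not predictions.
f = [0.0] * (N + M)

# X^0 in R^{(N+M) x Z_0}: the common space, learned jointly with f.
# Random values stand in for the learned representation.
X0 = random_matrix(N + M, Z0)

for name, matrix in X.items():
    print(name, (len(matrix), len(matrix[0])))
print("y", len(y), "f", len(f), "X0", (len(X0), len(X0[0])))
```

Every modality matrix and the common space share the same first dimension, $N+M$, because the transductive setting represents labeled and unlabeled samples together; only `y` is restricted to the $N$ labeled samples.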
We present a novel Transductive Multi-modAL Learning approach, TMALL for short,
for predicting the popularity of micro-videos. As illustrated in Figure 3.1, we first crawl a
representative micro-video dataset from Vine and develop a rich set of popularity-oriented
features from multiple modalities. We then perform multi-modal learning to predict the popularity
of micro-videos, which takes both modality relatedness and modality limitation into
account by utilizing a common space shared by all modalities. We assume that there exists an
optimal common space that maintains the intrinsic characteristics of micro-videos in
the original spaces. In light of this, all modalities are forced to be correlated. Meanwhile,
micro-videos with different popularity can be better separated in such an optimal common space
than in the space of any single modality. In this sense, we alleviate the modality limitation problem. It
is worth mentioning that, in this work, we aim to predict how popular a given micro-video will
be when its propagation is stable, rather than when the given micro-video will become popular.
Figure 3.1: Micro-video popularity prediction via our proposed TMALL model. (Stages shown: an
example micro-video; feature extraction from the social, visual, acoustic, and textual modalities;
the TMALL model with its optimal subspace; predicted popularity alongside the ground truth.)
3.6.1 OBJECTIVE FORMULATION
Different modalities may contribute distinctive and complementary information
about micro-videos. For example, the textual modality gives us hints about the topics of the
given micro-video; the acoustic and visual modalities may, respectively, convey the location and
situation of micro-videos; and the user modality demonstrates the influence of the micro-video publisher.
These clues jointly contribute to the popularity of a micro-video. However, due to the noise and
information insufficiency of each modality, it may be suboptimal to conduct learning directly