3.4. RELATED WORK
timents in video popularity propagation into account but also reveals more underlying factors
that determine the popularity of a video.
However, the aforementioned studies do not consider the combined impact of hetero-
geneous, interconnected, and noisy data. In contrast, our proposed scheme not only pursues a
solid fusion of heterogeneous multi-view features based on their complementary characteristics
but also exploits the advantages of the low-rank representation to learn robust features from
incomplete and noisy data. As a complement, we propose a novel multi-modal learning scheme
that predicts the popularity of a given micro-video in a timely fashion, even before it is published.
3.4.2 MULTI-VIEW LEARNING
Technically speaking, traditional multimodal fusion approaches consist of early fusion and late
fusion. Early fusion approaches, such as [42, 146], typically concatenate the unimodal features
extracted from each individual modality into a single representation to adapt to the learning set-
ting. One can then devise a classifier, such as a neural network, that treats the overall represen-
tation as its input. However, these approaches generally overlook the fact that each view has its
own specific statistical properties, and they ignore the structural relatedness among views.
Hence, they fail to explore the cross-modal correlations that would strengthen the expressiveness
of each modality and further improve the capacity of the fusion method. Late fusion performs the
learning directly over the unimodal features, and the prediction scores are then fused to predict
the venue category, e.g., by averaging [144], voting [118], or weighting [134]. Although this fusion
method is flexible and easy to implement, it overlooks the correlations in the mixed feature space.
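The contrast between the two paradigms can be sketched as follows (a minimal illustration only; the function names and the choice of three modalities are our assumptions, not taken from the cited works):

```python
import numpy as np

def early_fusion_features(visual, acoustic, textual):
    """Early fusion: concatenate per-modality feature vectors into a
    single representation, which is then fed to one classifier."""
    return np.concatenate([visual, acoustic, textual], axis=-1)

def late_fusion_scores(scores, weights=None):
    """Late fusion: combine the prediction scores produced by separate
    per-modality classifiers. weights=None gives simple averaging;
    otherwise a normalized weighted sum is used."""
    scores = np.asarray(scores, dtype=float)   # shape: (n_modalities, n_classes)
    if weights is None:
        return scores.mean(axis=0)             # averaging-style fusion
    w = np.asarray(weights, dtype=float)
    return (w[:, None] * scores).sum(axis=0) / w.sum()  # weighting-style fusion
```

Note that neither sketch models the cross-modal correlations discussed above: early fusion mixes the feature spaces blindly, while late fusion never lets the modalities interact before the score level.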
In contrast to early and late fusion, multi-view learning, as a newer paradigm, exploits the
correlations between the representations of the information from multiple modalities to improve
the learning performance. It can be classified into three categories: co-training, multiple kernel
learning, and subspace learning.
Co-training [31] is a semi-supervised learning technique that first learns a separate
classifier for each view using the labeled examples. It then maximizes the mutual agreement on
two distinct views of the unlabeled data via alternating training. Many variants have since been
developed. Instead of committing labels for the unlabeled examples, Nigam et al. [125] proposed
a co-EM approach that runs EM in each view and assigns probabilistic labels to the unlabeled
examples. To address regression problems, Zhou and Li [208] employed two k-nearest-neighbor
regressors to label the unknown instances during the learning process. More recently, Yu et
al. [186] proposed a Bayesian undirected graphical model for co-training based on Gaussian
processes. The success of co-training algorithms relies on three assumptions: (a) each view is
sufficient for prediction on its own; (b) the functions learned on the different views are likely to
predict the same labels; and (c) the views are conditionally independent given the label. However,
these assumptions are too strong to satisfy in practice, especially for micro-videos with dif-
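The alternating co-training loop described above can be sketched as follows (a minimal illustration under simplifying assumptions: two feature views, a toy nearest-centroid classifier standing in for the per-view learners, and centroid distance used as the confidence measure; all names are hypothetical):

```python
import numpy as np

class NearestCentroid:
    """Toy stand-in for a per-view classifier: predicts the class of
    the nearest class centroid."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(-1)
        return self.classes_[d.argmin(axis=1)]

def co_train(Xl1, Xl2, y, Xu1, Xu2, rounds=5, k=2):
    """Alternating co-training on two views. Each round, each view's
    classifier labels its k most confident unlabeled examples (here:
    smallest centroid distance) and adds them, with the predicted
    labels, to the OTHER view's labeled pool."""
    L1, L2, y1, y2 = Xl1, Xl2, y.copy(), y.copy()
    unlabeled = list(range(len(Xu1)))
    for _ in range(rounds):
        if not unlabeled:
            break
        c1 = NearestCentroid().fit(L1, y1)
        c2 = NearestCentroid().fit(L2, y2)
        for clf, Xu_self, from_view in ((c1, Xu1, 1), (c2, Xu2, 2)):
            if not unlabeled:
                break
            idx = np.array(unlabeled)
            d = ((Xu_self[idx][:, None, :] - clf.centroids_[None, :, :]) ** 2).sum(-1)
            chosen = idx[d.min(axis=1).argsort()[:k]]   # most confident examples
            labels = clf.predict(Xu_self[chosen])
            # one view's confident labels augment the other view's pool
            if from_view == 1:
                L2 = np.vstack([L2, Xu2[chosen]]); y2 = np.concatenate([y2, labels])
            else:
                L1 = np.vstack([L1, Xu1[chosen]]); y1 = np.concatenate([y1, labels])
            unlabeled = [i for i in unlabeled if i not in set(chosen.tolist())]
    return NearestCentroid().fit(L1, y1), NearestCentroid().fit(L2, y2)
```

The cross-view label exchange is exactly where assumption (b) bites: if the two views disagree on the unlabeled data, each classifier pollutes the other's training pool.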