8 1. INTRODUCTION
to that of each single modality. In this transductive framework, we integrate low-rank
constraints to alleviate the information sparseness and low-quality problems. Because the
formulated objective function is non-smooth and hard to solve, we design an effective algorithm
based on the augmented Lagrange multiplier to optimize it and ensure fast convergence.
As to the venue category estimation of micro-videos, we shed light on characterizing and
modeling the correlations between modalities, especially the consistent and complementary
relations. The consistent part strengthens the confidence, while the complementary one
supplements exclusive information. We argue that explicitly parsing these two kinds
of correlations and treating them separately within a unified model can boost the representation
discrimination for multimodal samples. Toward this goal, we devise a series of cooperative learn-
ing models, which split the consistent information from the complementary one, and leverage
them to strengthen the expressiveness of each modality. In addition, we regularize the hierar-
chical structure of micro-videos via the tree-guided group lasso, which characterizes the inter-
and intra-relatedness among venue categories. Meanwhile, we integrate the dictionary learn-
ing component to learn the concept-level sparse representation of micro-videos, whereby the
problem of information sparseness can be alleviated.
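
To make the tree-guided group lasso term more concrete, the following is a minimal sketch of how such a penalty is commonly computed: each node of the category tree defines a group containing the leaf categories beneath it, and the penalty sums a weighted L2 norm over all groups. The venue hierarchy, the coefficient values, and the uniform node weights below are hypothetical and chosen only for illustration; the models in this book may define the groups and weights differently.

```python
import math


def tree_group_lasso_penalty(coeffs, groups, node_weights):
    """Sum over tree nodes of (node weight) * (L2 norm of that node's group).

    coeffs       -- dict: leaf category -> coefficient (one scalar per leaf here)
    groups       -- dict: tree node -> list of leaf categories under that node
    node_weights -- dict: tree node -> non-negative weight
    """
    total = 0.0
    for node, members in groups.items():
        group_norm = math.sqrt(sum(coeffs[m] ** 2 for m in members))
        total += node_weights[node] * group_norm
    return total


# Hypothetical two-level venue hierarchy (illustrative names only).
groups = {
    "root": ["park", "beach", "cafe", "bar"],
    "outdoors": ["park", "beach"],
    "food": ["cafe", "bar"],
    "park": ["park"],
    "beach": ["beach"],
    "cafe": ["cafe"],
    "bar": ["bar"],
}
node_weights = {node: 1.0 for node in groups}  # assumption: uniform weights
coeffs = {"park": 3.0, "beach": 4.0, "cafe": 0.0, "bar": 0.0}

penalty = tree_group_lasso_penalty(coeffs, groups, node_weights)
print(penalty)
```

Because internal nodes group sibling categories together, the penalty encourages categories under the same parent to be selected or suppressed jointly, while the leaf-level groups still allow individual categories to differ.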
We also explore methods to address the low-quality problem in the venue category
estimation task, especially by compensating for the acoustic modality. Toward this goal, we
develop a deep multimodal transfer learning scheme, which transfers knowledge from external
high-quality sound clips to strengthen the description of the internal acoustic modality in
micro-videos. This is accomplished by enforcing the external sound clips and the internal
acoustic modality to share the same acoustic concept-level space. Notably, this scheme can also
be applied to enhance other modality representations, such as visual and textual ones.
In order to intelligently route micro-videos to the target users, we present a multimodal
sequential learning scheme. To capture the users’ dynamic and diverse interest, we encode their
historical complicated interaction sequences into a temporal graph and then design a novel tem-
poral graph-based long short-term memory (LSTM) network to model it. Afterwards, we esti-
mate the click probability via calculating the similarity between the users’ interest representation
and the embedding of the given micro-video. Considering that users’ interest is multi-level, we
introduce a user matrix to enhance the user interest modeling by incorporating their “like” and
“follow” information. At this step, we obtain a second click probability with respect to users’
more precise interest information. Analogously, since we know the sequence of users’ disliked
micro-videos, another temporal graph-based LSTM is built to characterize users’ uninterested
information, yielding a third click probability estimated from true negative samples. Finally, the
weighted sum of the above three probability scores is set as our final prediction result.
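
The final scoring step can be sketched as follows. This is an illustration under stated assumptions rather than the exact formulation used later in the book: the similarity is taken to be a dot product squashed by a sigmoid, and the embeddings and fusion weights are hypothetical placeholders. How the uninterested score enters the sum, including its sign, is left to the caller-supplied weights.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def click_probability(user_vec, video_vec):
    """Map a similarity score to (0, 1) with a sigmoid (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-dot(user_vec, video_vec)))


def fuse(probabilities, weights):
    """Weighted sum of the individual probability scores."""
    return sum(w * p for w, p in zip(weights, probabilities))


# Hypothetical embeddings; in practice they would come from the learned model.
video = [0.2, -0.1, 0.4]
interest = [0.5, 0.3, 0.1]        # from the interaction-sequence LSTM
precise_interest = [0.6, 0.0, 0.2]  # enhanced with "like"/"follow" information
uninterest = [-0.3, 0.2, -0.1]    # from the disliked-sequence LSTM

scores = [
    click_probability(interest, video),
    click_probability(precise_interest, video),
    click_probability(uninterest, video),
]
weights = [0.5, 0.3, 0.2]  # assumption: illustrative fusion weights
final_score = fuse(scores, weights)
print(final_score)
```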
1.5 BOOK STRUCTURE
In this book, we present an in-depth introduction to multimodal learning toward micro-video
understanding, and a comprehensive literature survey of all the important research topics and
latest state-of-the-art methods in the area. It is suitable for students, researchers, and
practitioners who are interested in multimodal learning. It is worth emphasizing that the
multimodal learning methods presented in this book are applicable to other fields involving
multi-aspect data, such as web image analysis, visual question answering, and user profiling
across multiple social networks.
The remainder of this book consists of six chapters. Chapter 2 introduces the three
micro-video benchmark datasets for three practical tasks. Chapter 3 describes a multimodal
transductive learning framework to tackle the information sparsity and low-quality problems. We
theoretically derive its closed-form solution and practically apply it to micro-video popularity
prediction. In Chapter 4, we present a series of multimodal cooperative learning methods
for characterizing the explicit correlations among different modalities, such as consistent
and complementary relationships. These methods are verified on the task of venue category
estimation of micro-videos. In Chapter 5, we devise a deep transfer learning model that harnesses
external sound knowledge to compensate for the acoustic modality in micro-videos. This
is a robust method to address the low-quality problem. Following that, in Chapter 6, we study
the multimodal sequential property of micro-videos and verify its effectiveness on the task of
micro-video routing. We finally conclude this book and outline future research directions
in Chapter 7.