Data Collection

1. INTRODUCTION

to that of each single modality. In this transductive framework, we integrate low-rank constraints to alleviate the information sparseness and low-quality problems. Because the formulated objective function is non-smooth and hard to solve, we design an effective algorithm based on the augmented Lagrange multiplier method to optimize it and ensure fast convergence.
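
For readers unfamiliar with the augmented Lagrange multiplier method, its generic form can be sketched as follows. This is the textbook scheme for an equality-constrained problem, not the specific objective developed in this book; the symbols f, h, Y, \mu, and \rho are placeholders introduced here for illustration only.

```latex
% Generic equality-constrained problem (placeholder symbols, not the book's objective)
\min_{X} \; f(X) \quad \text{s.t.} \quad h(X) = 0

% Augmented Lagrangian with multiplier Y and penalty parameter \mu > 0
\mathcal{L}_{\mu}(X, Y) = f(X) + \langle Y, h(X) \rangle + \frac{\mu}{2} \, \| h(X) \|_F^2

% Iteration k: minimize over X, update the multiplier, then grow the penalty
X_{k+1} = \arg\min_{X} \; \mathcal{L}_{\mu_k}(X, Y_k)
Y_{k+1} = Y_k + \mu_k \, h(X_{k+1})
\mu_{k+1} = \rho \, \mu_k, \qquad \rho \ge 1
```

When f contains a nuclear-norm term, as is common with low-rank constraints, the minimization over X is typically handled by singular value thresholding; whether the algorithm in this book takes exactly that route is not stated in this section.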

As to the venue category estimation of micro-videos, we focus on characterizing and modeling the correlations between modalities, especially the consistent and complementary relations. The consistent part strengthens the confidence, while the complementary part supplies exclusive information. We argue that explicitly parsing these two kinds of correlations and treating them separately within a unified model can boost the representation discrimination for multimodal samples. Toward this goal, we devise a series of cooperative learning models, which split the consistent information from the complementary information and leverage both to strengthen the expressiveness of each modality. In addition, we regularize the hierarchical structure of micro-videos via the tree-guided group lasso, which characterizes the inter- and intra-relatedness among venue categories. Meanwhile, we integrate a dictionary learning component to learn the concept-level sparse representation of micro-videos, whereby the problem of information sparseness can be alleviated.
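
To make the tree-guided group lasso more concrete, the following minimal sketch computes the penalty for a toy category hierarchy. It assumes the standard formulation in which each tree node defines a group of the categories beneath it and the penalty is a weighted sum of the group-wise L2 norms; the function name, the toy hierarchy, and the uniform node weights are hypothetical and are not taken from the book.

```python
import math


def tree_guided_group_lasso(W, groups, node_weights):
    """Weighted sum of L2 norms over tree-defined groups of columns.

    W            -- list of rows; each row holds one feature's weights,
                    one entry per venue category (assumed layout)
    groups       -- maps each tree node to the category indices beneath it
    node_weights -- maps each tree node to its non-negative weight
    """
    penalty = 0.0
    for row in W:
        for node, members in groups.items():
            group_norm = math.sqrt(sum(row[c] ** 2 for c in members))
            penalty += node_weights[node] * group_norm
    return penalty


# Hypothetical hierarchy: root -> outdoor {0, 1}, indoor {2}
groups = {"root": [0, 1, 2], "outdoor": [0, 1], "indoor": [2]}
node_weights = {"root": 1.0, "outdoor": 1.0, "indoor": 1.0}

# Two features, three venue categories
W = [
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 2.0],
]

penalty = tree_guided_group_lasso(W, groups, node_weights)
print(penalty)
```

Because categories under the same parent share a group, the penalty encourages them to select or discard a feature together, which is one way to read the "inter- and intra-relatedness" mentioned above.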

We also explore methods to address the low-quality problem in the venue category estimation task, especially by compensating for the acoustic modality. Toward this goal, we develop a deep multimodal transfer learning scheme, which transfers knowledge from external high-quality sound clips to strengthen the description of the internal acoustic modality in micro-videos. This is accomplished by enforcing the external sound clips and the internal acoustic modality to share the same acoustic concept-level space. Notably, this scheme can also be applied to enhance the representations of other modalities, such as the visual and textual ones.

In order to intelligently route micro-videos to the target users, we present a multimodal sequential learning scheme. To capture the users' dynamic and diverse interest, we encode their complicated historical interaction sequences into a temporal graph and then design a novel temporal graph-based long short-term memory (LSTM) network to model it. Afterwards, we estimate the click probability by calculating the similarity between the users' interest representation and the embedding of the given micro-video. Considering that users' interest is multi-level, we introduce a user matrix to enhance the user interest modeling by incorporating their "like" and "follow" information. At this step, we obtain a second click probability with respect to the users' more precise interest information. Analogously, since we know the sequence of micro-videos that users disliked, another temporal graph-based LSTM is built to characterize the users' uninterested information, and a third click probability is estimated based on these true negative samples. Finally, the weighted sum of the above three probability scores is set as our final prediction result.
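
The final fusion step can be illustrated with a minimal sketch. It assumes that similarity is an inner product squashed by a sigmoid and that the three fusion weights are given; neither choice is specified in this section, and the function names and numbers below are hypothetical.

```python
import math


def click_probability(interest, video):
    """Sigmoid of the inner product between a user interest vector and a
    micro-video embedding (assumed similarity measure)."""
    score = sum(u * v for u, v in zip(interest, video))
    return 1.0 / (1.0 + math.exp(-score))


def fuse(probabilities, weights):
    """Weighted sum of the per-branch click probabilities."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per probability")
    return sum(w * p for w, p in zip(weights, probabilities))


# Hypothetical embedding of one micro-video
video = [0.2, -0.1, 0.4]

# Hypothetical outputs of the three branches described above
p_interest = click_probability([0.5, 0.3, 0.1], video)    # dynamic interest
p_precise = click_probability([0.9, -0.2, 0.6], video)    # "like"/"follow" enhanced
p_negative = click_probability([-0.4, 0.1, -0.3], video)  # uninterested branch

final_score = fuse([p_interest, p_precise, p_negative], [0.4, 0.4, 0.2])
print(final_score)
```

In practice the fusion weights would be tuned or learned; they are fixed by hand here only to keep the example short.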

1.5 BOOK STRUCTURE

In this book, we present an in-depth introduction to multimodal learning toward micro-video understanding, and a comprehensive literature survey of the important research topics and latest state-of-the-art methods in the area. It is suitable for students, researchers, and practitioners who are interested in multimodal learning. It is worth emphasizing that the multimodal learning methods presented in this book are applicable to other fields involving multi-aspect data, like web image analysis, visual question answering, and user profiling across multiple social networks.

The remainder of this book consists of six chapters. Chapter 2 introduces three micro-video benchmark datasets for three practical tasks. Chapter 3 describes a multimodal transductive learning framework to tackle the information sparsity and low-quality problems. We theoretically derive its closed-form solution and practically apply it to micro-video popularity prediction. In Chapter 4, we present a series of multimodal cooperative learning methods for characterizing the explicit correlations among different modalities, such as consistent and complementary relationships. These methods are verified on the task of venue category estimation of micro-videos. In Chapter 5, we devise a deep transfer learning model that harnesses external sound knowledge to compensate for the acoustic modality in micro-videos. This is a robust method to address the low-quality problem. Following that, in Chapter 6, we study the multimodal sequential property of micro-videos and verify its effectiveness on the task of micro-video routing. We finally conclude this book and outline future research directions in Chapter 7.
