Acknowledgments


Preface

The unprecedented growth of portable devices contributes to the success of micro-video sharing platforms such as Vine, Kuaishou, and TikTok. These devices enable users to record and share their daily lives within a few seconds in the form of micro-videos at any time and any place. As a new media type, micro-videos have garnered great enthusiasm due to their brevity, authenticity, communicability, and low cost. The proliferation of micro-videos confirms the old saying that good things come in small packages.

Like traditional long videos, micro-videos are a combination of textual, acoustic, and visual modalities. These modalities are correlated rather than independent, and they essentially characterize the same micro-videos from distinct angles. Effectively fusing heterogeneous modalities toward video understanding has indeed been well studied in the past decade. Yet micro-videos have their unique characteristics and corresponding research challenges, including but not limited to the following.

(1) Information sparseness. Micro-videos are very short, lasting for 6–15 s, and hence usually convey only a few concepts. In light of this, we need to learn their sparse and conceptual representations for better discrimination. (2) Hierarchical structure. Micro-videos are implicitly organized into a four-layer hierarchical tree structure with respect to their recording venues. We should leverage such a structure to guide the organization of micro-videos by categorizing them into the leaf nodes of this tree. (3) Low quality. Most portable devices offer no video stabilization. Some recorded videos can thus be visually shaky or bumpy, which greatly hinders the visual expression. Furthermore, the audio track that comes with the video can suffer from varying degrees of distortion and noise, such as buzzing, hums, hisses, and whistling, probably caused by poor microphones or complex surrounding environments. We thus have to harness external visual or sound knowledge to compensate for these weaknesses. (4) Multimodal sequential data. Beyond the textual, acoustic, and visual modalities, micro-videos also have a social modality. In such a context, users can interact with micro-videos and other users via social actions, such as click, like, and follow. As time goes on, multiple sequential data in different forms emerge and reflect users' historical preferences. To strengthen micro-video understanding, we have to characterize and model these sequential patterns. (5) The last challenge we face is the lack of benchmark datasets to validate our ideas.

In this book, to tackle the aforementioned research challenges, we present some state-of-the-art multimodal learning theories and verify them over three practical tasks of micro-video understanding: popularity prediction, venue category estimation, and micro-video routing. In particular, we first construct three large-scale real-world micro-video datasets corresponding to the three practical tasks. We then propose a multimodal transductive learning framework to learn the micro-video representations in an optimal latent space via unifying and preserving information from different modalities. In this transductive framework, we integrate low-rank constraints to alleviate, to some extent, the information sparseness and low-quality problems. This framework is verified on the popularity prediction task. We next present a series of multimodal cooperative learning approaches, which explicitly model the consistent and complementary modality correlations. In these approaches, we make full use of the hierarchical structure via the tree-guided group lasso, and further address the information sparseness via dictionary learning. Following that, we work toward compensating for the low-quality acoustic modality by harnessing external sound knowledge. This is accomplished by a deep multimodal transfer learning scheme. The multimodal cooperative learning approaches and the multimodal transfer learning scheme are both validated on the task of venue category estimation. Thereafter, we develop a multimodal sequential learning approach, relying on temporal graph-based long short-term memory networks, to intelligently route micro-videos to the target users in a personalized manner. We ultimately summarize the book and outline future research directions in multimodal learning toward micro-video understanding.
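As a rough illustration of the tree-guided group lasso mentioned above, the following sketch computes a penalty of the general form sum over tree nodes v of omega_v * ||w_{G_v}||_2, where G_v is the set of leaf categories under node v. The toy venue tree, the category names, the unit node weights, and the function name are all hypothetical and chosen only for illustration; the formulation actually used in the book may differ.

```python
import math

# Hypothetical venue hierarchy (illustrative only, not from the book).
# Each tree node maps to the indices of the leaf venue categories beneath it.
TREE_GROUPS = {
    "root": [0, 1, 2, 3],
    "outdoors": [0, 1],
    "food": [2, 3],
    "park": [0],
    "beach": [1],
    "cafe": [2],
    "bakery": [3],
}


def tree_group_lasso_penalty(weights, groups, node_weights=None):
    """Return sum_v omega_v * ||w_{G_v}||_2 over all tree nodes v.

    weights      : coefficients of one feature across the leaf categories
    groups       : dict mapping node name -> list of leaf indices under it
    node_weights : optional dict of per-node omega_v (assumed 1.0 if absent)
    """
    total = 0.0
    for node, indices in groups.items():
        omega = 1.0 if node_weights is None else node_weights.get(node, 1.0)
        l2_norm = math.sqrt(sum(weights[i] ** 2 for i in indices))
        total += omega * l2_norm
    return total


# A feature that is only useful for the two "outdoors" leaves.
w = [3.0, 4.0, 0.0, 0.0]
penalty = tree_group_lasso_penalty(w, TREE_GROUPS)
print(penalty)
```

Because every internal node contributes the norm of all coefficients beneath it, a feature whose weights are concentrated in one subtree (as in `w` above) is penalized less than one spread across unrelated venues, which is the intuition behind using the venue hierarchy as a regularizer.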

This book represents preliminary research on learning from multiple correlated modalities of given micro-videos, and we anticipate that the lectures in this series will dramatically influence future thought on these subjects. If in this book we have been able to dream further than others have, it is because we are standing on the shoulders of giants.

Liqiang Nie, Meng Liu, and Xuemeng Song

July 2019
