Preface
The unprecedented growth of portable devices has contributed to the success of micro-video sharing
platforms such as Vine, Kuaishou, and TikTok. These devices enable users to record and share
their daily lives in the form of micro-videos, only a few seconds long, at any time and in any place. As
a new media type, micro-videos have garnered great enthusiasm due to their brevity, authenticity,
communicability, and low cost. The proliferation of micro-videos confirms the old saying that
good things come in small packages.
Like traditional long videos, micro-videos combine textual, acoustic, and visual
modalities. These modalities are correlated rather than independent, and they essentially characterize
the same micro-videos from distinct angles. Effectively fusing heterogeneous modalities
for video understanding has been well studied in the past decade. Yet micro-videos
have unique characteristics and corresponding research challenges, including but not limited
to the following.
(1) Information sparseness. Micro-videos are very short, lasting 6–15 seconds, and hence
usually convey only a few concepts. In light of this, we need to learn their sparse and conceptual
representations for better discrimination.
(2) Hierarchical structure. Micro-videos are implicitly
organized into a four-layer hierarchical tree structure with respect to their recording venues.
We should leverage such a structure to guide the organization of micro-videos by categorizing
them into the leaf nodes of this tree.
(3) Low quality. Most portable devices offer little in the way of
video stabilization. Some recorded videos can thus be visually shaky or bumpy, which
greatly hinders the visual expression. Furthermore, the audio track that accompanies the
video can suffer varying degrees of distortion and noise, such as buzzing, hums, hisses, and whistling,
probably caused by poor microphones or complex surrounding environments. We
thus have to harness external visual or sound knowledge to compensate for these weaknesses.
(4) Multimodal sequential data. Beyond the textual, acoustic, and visual modalities, micro-videos
also have a social modality, whereby a user can interact with micro-videos
and other users via social actions, such as clicking, liking, and following. As time goes on,
sequential data in multiple forms emerge and reflect users' historical preferences. To strengthen
micro-video understanding, we have to characterize and model these sequential patterns.
(5) The last challenge we face is the lack of benchmark datasets to justify our ideas.
In this book, to tackle the aforementioned research challenges, we present some state-of-
the-art multimodal learning theories and verify them on three practical micro-video
understanding tasks: popularity prediction, venue category estimation, and micro-video routing. In
particular, we first construct three large-scale real-world micro-video datasets corresponding
to the three tasks. We then propose a multimodal transductive learning framework
to learn micro-video representations in an optimal latent space by unifying and preserving
information from different modalities. In this transductive framework, we integrate
low-rank constraints to alleviate the information sparseness and low-quality problems.
This framework is verified on the popularity prediction task. We next present a series of
multimodal cooperative learning approaches, which explicitly model the consistent and complementary
modality correlations. In these approaches, we make
full use of the hierarchical structure through a tree-guided group lasso, and further address information
sparseness via dictionary learning. Following that, we work toward compensating for the
low-quality acoustic modality by harnessing external sound knowledge. This is accomplished
by a deep multimodal transfer learning scheme. The multimodal cooperative learning
approaches and the multimodal transfer learning scheme are both justified on the venue
category estimation task. Thereafter, we develop a multimodal sequential learning approach, relying
on temporal graph-based long short-term memory networks, to intelligently route micro-videos
to target users in a personalized manner. We ultimately summarize the book and outline
future research directions in multimodal learning toward micro-video understanding.
This book represents preliminary research on learning from the multiple correlated modalities
of micro-videos, and we anticipate that the lectures in this series will dramatically
influence future thought on these subjects. If in this book we have been able to dream further
than others have, it is because we are standing on the shoulders of giants.
Liqiang Nie, Meng Liu, and Xuemeng Song
July 2019