Preface
The unprecedented growth of portable devices contributes to the success of micro-video sharing
platforms such as Vine, Kuaishou, and TikTok. These devices enable users to record and share
their daily life within a few seconds in the form of micro-videos at any time and any place. As
a new media type, micro-videos have garnered great enthusiasm due to their brevity, authenticity,
communicability, and low cost. The proliferation of micro-videos confirms the old saying that
good things come in small packages.
Like traditional long videos, micro-videos are a combination of textual, acoustic, and visual
modalities. These modalities are correlated rather than independent, and they essentially characterize
the same micro-videos from distinct angles. Effectively fusing heterogeneous modalities
for video understanding has indeed been well studied in the past decade. Yet, micro-videos
have their unique characteristics and corresponding research challenges, including but not lim-
ited to the following.
(1) Information sparseness. Micro-videos are very short, lasting 6–15 s, and hence
usually convey only a few concepts. In light of this, we need to learn their sparse and conceptual
representations for better discrimination. (2) Hierarchical structure. Micro-videos are implicitly
organized into a four-layer hierarchical tree structure with respect to their recording venues.
We should leverage such a structure to guide the organization of micro-videos by categorizing
them into the leaf nodes of this tree. (3) Low quality. Most portable devices offer little in the
way of video stabilization. Some recorded videos can thus be visually shaky or bumpy, which
greatly hinders the visual expression. Furthermore, the accompanying audio track
can suffer from distortion and noise, such as buzzing, hums, hisses, and whistling,
probably caused by poor microphones or complex surrounding environments. We
thus have to harness external visual or sound knowledge to compensate for these weaknesses.
(4) Multimodal sequential data. Beyond textual, acoustic, and visual modalities, micro-videos
also have a social modality. In this context, users can interact with micro-videos
and other users via social actions, such as click, like, and follow. As time goes on, multiple
sequential data in different forms emerge and reflect users’ historical preferences. To strengthen
micro-video understanding, we have to characterize and model the sequential patterns. (5) The
last challenge we face is the lack of benchmark datasets to validate our ideas.
In this book, to tackle the aforementioned research challenges, we present some state-of-
the-art multimodal learning theories and verify them over three practical tasks of micro-video
understanding: popularity prediction, venue category estimation, and micro-video routing. In
particular, we first construct three large-scale real-world micro-video datasets corresponding
to the three practical tasks. We then propose a multimodal transductive learning framework