Multimodal Learning toward Micro-Video Understanding
Liqiang Nie
Meng Liu
Xuemeng Song
Series Editor: Alan C. Bovik, University of Texas, Austin
Liqiang Nie, Shandong University, Jinan, China
Meng Liu, Shandong University, Jinan, China
Xuemeng Song, Shandong University, Jinan, China
Micro-videos, a new form of user-generated content, have been spreading widely across various social platforms, such
as Vine, Kuaishou, and TikTok. Different from traditional long videos, micro-videos are usually recorded by smart
mobile devices at any place within a few seconds. Due to their brevity and low bandwidth cost, micro-videos are gaining
increasing user enthusiasm. The blossoming of micro-videos opens the door to the possibility of many promising
applications, ranging from network content caching to online advertising. Thus, it is highly desirable to develop an
effective scheme for high-order micro-video understanding.
Micro-video understanding is, however, non-trivial due to the following challenges: (1) how to represent
micro-videos that only convey one or few high-level themes or concepts; (2) how to utilize the hierarchical structure of venue
categories to guide micro-video analysis; (3) how to alleviate the influence of low quality caused by complex surrounding
environments and camera shake; (4) how to model multimodal sequential data, i.e., textual, acoustic, visual, and social
modalities, to enhance micro-video understanding; and (5) how to construct large-scale benchmark datasets for analysis.
These challenges have been largely unexplored to date.
In this book, we focus on addressing the challenges presented above by proposing several state-of-the-art multimodal
learning theories. To demonstrate the effectiveness of these models, we apply them to three practical tasks of micro-video
understanding: popularity prediction, venue category estimation, and micro-video routing. In particular, we first build
three large-scale real-world micro-video datasets for these practical tasks. We then present a multimodal transductive
learning framework for micro-video popularity prediction. Furthermore, we introduce several multimodal cooperative
learning approaches and a multimodal transfer learning scheme for micro-video venue category estimation. Meanwhile,
we develop a multimodal sequential learning approach for micro-video recommendation. Finally, we conclude the book
and outline future research directions in multimodal learning toward micro-video understanding.
store.morganclaypool.com
About SYNTHESIS
This volume is a printed version of a work that appears in the Synthesis
Digital Library of Engineering and Computer Science. Synthesis
books provide concise, original presentations of important research and
development topics, published quickly, in digital and print formats.
Synthesis Lectures on
Image, Video & Multimedia Processing
Series ISSN: 1559-8136