Multimodal Learning toward Micro-Video Understanding
Liqiang Nie, Shandong University, Jinan, China
Meng Liu, Shandong University, Jinan, China
Xuemeng Song, Shandong University, Jinan, China
Micro-videos, a new form of user-generated content, have been spreading widely across various social platforms, such
as Vine, Kuaishou, and TikTok. Different from traditional long videos, micro-videos are usually recorded by smart
mobile devices at any place within a few seconds. Due to their brevity and low bandwidth cost, micro-videos are gaining
increasing user enthusiasm. The blossoming of micro-videos opens the door to the possibility of many promising
applications, ranging from network content caching to online advertising. Thus, it is highly desirable to develop an
effective scheme for high-order micro-video understanding.
Micro-video understanding is, however, non-trivial due to the following challenges: (1) how to represent micro-
videos that only convey one or few high-level themes or concepts; (2) how to utilize the hierarchical structure of venue
categories to guide micro-video analysis; (3) how to alleviate the influence of low quality caused by complex surrounding
environments and camera shake; (4) how to model multimodal sequential data, i.e., textual, acoustic, visual, and social
modalities to enhance micro-video understanding; and (5) how to construct large-scale benchmark datasets for analysis.
These challenges have been largely unexplored to date.
In this book, we focus on addressing the challenges presented above by proposing some state-of-the-art multimodal
learning theories. To demonstrate the effectiveness of these models, we apply them to three practical tasks of micro-video
understanding: popularity prediction, venue category estimation, and micro-video routing. Particularly, we first build
three large-scale real-world micro-video datasets for these practical tasks. We then present a multimodal transductive
learning framework for micro-video popularity prediction. Furthermore, we introduce several multimodal cooperative
learning approaches and a multimodal transfer learning scheme for micro-video venue category estimation. Meanwhile,
we develop a multimodal sequential learning approach for micro-video recommendation. Finally, we conclude the book
and gure out the future research directions in multimodal learning toward micro-video understanding.
