Multimodal Learning toward Micro-Video Understanding
Liqiang Nie
Meng Liu
Xuemeng Song
Series Editor: Alan C. Bovik, University of Texas, Austin
Multimodal Learning toward Micro-Video Understanding
Liqiang Nie, Shandong University, Jinan, China
Meng Liu, Shandong University, Jinan, China
Xuemeng Song, Shandong University, Jinan, China
Micro-videos, a new form of user-generated content, have been spreading widely across various social platforms, such as Vine, Kuaishou, and TikTok. Different from traditional long videos, micro-videos are usually recorded anywhere by smart mobile devices within a few seconds. Due to their brevity and low bandwidth cost, micro-videos are gaining increasing user enthusiasm. The blossoming of micro-videos opens the door to many promising applications, ranging from network content caching to online advertising. Thus, it is highly desirable to develop an effective scheme for high-order micro-video understanding.
Micro-video understanding is, however, non-trivial due to the following challenges: (1) how to represent micro-videos that convey only one or a few high-level themes or concepts; (2) how to utilize the hierarchical structure of venue categories to guide micro-video analysis; (3) how to alleviate the influence of low quality caused by complex surrounding environments and camera shake; (4) how to model multimodal sequential data, i.e., textual, acoustic, visual, and social modalities, to enhance micro-video understanding; and (5) how to construct large-scale benchmark datasets for analysis. These challenges have been largely unexplored to date.
In this book, we focus on addressing the challenges presented above by proposing several state-of-the-art multimodal learning theories. To demonstrate the effectiveness of these models, we apply them to three practical tasks of micro-video understanding: popularity prediction, venue category estimation, and micro-video routing. In particular, we first build three large-scale real-world micro-video datasets for these tasks. We then present a multimodal transductive learning framework for micro-video popularity prediction. Furthermore, we introduce several multimodal cooperative learning approaches and a multimodal transfer learning scheme for micro-video venue category estimation. Meanwhile, we develop a multimodal sequential learning approach for micro-video recommendation. Finally, we conclude the book and outline future research directions in multimodal learning toward micro-video understanding.
store.morganclaypool.com
About SYNTHESIS
This volume is a printed version of a work that appears in the Synthesis
Digital Library of Engineering and Computer Science. Synthesis
books provide concise, original presentations of important research and
development topics, published quickly, in digital and print formats.
Synthesis Lectures on Image, Video & Multimedia Processing
Series ISSN: 1559-8136
Multimodal Learning toward Micro-Video Understanding
Synthesis Lectures on Image, Video, and Multimedia Processing
Editor
Alan C. Bovik, University of Texas, Austin
The Lectures on Image, Video and Multimedia Processing are intended to provide a unique and
groundbreaking forum for the world's experts in the field to express their knowledge in unique and
effective ways. It is our intention that the Series will contain Lectures of basic, intermediate, and
advanced material depending on the topical matter and the authors’ level of discourse. It is also
intended that these Lectures depart from the usual dry textbook format and instead give the author
the opportunity to speak more directly to the reader, and to unfold the subject matter from a more
personal point of view. The success of this candid approach to technical writing will rest on our
selection of exceptionally distinguished authors, who have been chosen for their noteworthy
leadership in developing new ideas in image, video, and multimedia processing research,
development, and education.
In terms of the subject matter for the series, there are few limitations that we will impose other
than the Lectures be related to aspects of the imaging sciences that are relevant to furthering our
understanding of the processes by which images, videos, and multimedia signals are formed,
processed for various tasks, and perceived by human viewers. These categories are naturally quite
broad, for two reasons: First, measuring, processing, and understanding perceptual signals involves
broad categories of scientific inquiry, including optics, surface physics, visual psychophysics and
neurophysiology, information theory, computer graphics, display and printing technology, artificial
intelligence, neural networks, harmonic analysis, and so on. Secondly, the domain of application of
these methods is limited only by the number of branches of science, engineering, and industry that
utilize audio, visual, and other perceptual signals to convey information. We anticipate that the
Lectures in this series will dramatically influence future thought on these subjects as the
Twenty-First Century unfolds.
Multimodal Learning toward Micro-Video Understanding
Liqiang Nie, Meng Liu, and Xuemeng Song
2019
Virtual Reality and Virtual Environments in 10 Lectures
Stanislav Stanković
2015