CHAPTER 7
Research Frontiers
In this book, we investigate several application-motivated research problems in micro-video understanding. To solve these problems, we design general principles, methodologies, and optimizations that jointly learn from multiple correlated modalities of the given micro-videos, including the textual, visual, acoustic, and social ones, and we empirically validate them on multiple real-world datasets. In particular, we first introduce the proliferation of micro-video services and identify three practical tasks of micro-video understanding: popularity prediction, venue category estimation, and micro-video routing. Based upon these tasks, we analyze the unique research challenges that distinguish micro-videos from traditional long videos, such as information sparseness, hierarchical structure, low quality, multimodal sequential data, and the lack of benchmark datasets. To address these problems, we present a
series of multimodal learning methods, consisting of multimodal transductive learning, multimodal cooperative learning, multimodal transfer learning, and multimodal sequential learning. These methods are verified on the three datasets we constructed. To facilitate other
researchers, we have released the code, parameter settings, and the three datasets. We emphasize that learning from multiple modalities of micro-videos is still a young and highly promising research field. There are many unexplored yet fruitful future directions and challenging research issues. We illustrate a few of them here.
7.1 MICRO-VIDEO ANNOTATION
Facing the exponentially growing number of micro-videos, it is important to help users quickly identify their desired ones. The hashtags associated with micro-videos are typically provided by uploaders to summarize the content of their posts and attract the attention of followers. Taking the popular social platform Instagram as an example, as shown in Figure 7.1, hashtags are prefixed with the symbol “#” to mark keywords or key topics of a post. Hashtags have proven useful in many applications, including micro-blog retrieval, event analysis, and sentiment analysis. Moreover, the tagging service can benefit all stakeholders of micro-video ecosystems. For users, hashtags make it easier to search for and locate their desired micro-videos. For post-sharers, concise and concrete hashtags increase the probability that their micro-videos will be discovered. For platforms, hashtags make the management of micro-videos (e.g., categorization) more convenient. Despite their importance, numerous micro-videos lack hashtags, or their hashtags are inaccurate or incomplete. In light of this, micro-video annotation,
which suggests a list of hashtags to a user when he or she wants to annotate a post, becomes a crucial research problem.

Figure 7.1: An example of a micro-video with hashtags on Instagram.
Although several models, such as collaborative filtering, generative models, and DNNs, have been adopted for hashtag recommendation and achieved some progress, they mainly focus on micro-blogs or social images. Limited research efforts have been devoted to micro-video annotation, for the following reasons. (1) Long-tail distribution. The hashtag distribution is heavily skewed toward a few frequent hashtags, with a long tail of less frequent tags, as shown in Figure 7.2. Although current studies note that many hashtags in the long tail are misspelled or “meaningless” words, we believe that some meaningful hashtags within the long tail have been overlooked. That is, how to build correlations between the frequent hashtags and their “related” long-tail hashtags to enhance the representations of the latter remains untapped. (2) Multimodal sequence modeling. Micro-videos consist of visual, acoustic, and textual modalities, which are encoded together with a sequential structure. On the one hand, different streams in one micro-video exhibit different temporal dynamics and thus should be modeled individually; for example, the objects in a micro-video may remain the same throughout its time span, while the motion and audio may change from time to time. On the other hand, different modalities depict the intrinsic content of micro-videos consistently and complementarily from different views. Therefore, how to capture the sequential and multimodal features is a key open problem (a minimal sketch of this idea is given at the end of this section). (3) Diverse annotation. We find that
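To make the per-modality sequential modeling described in challenge (2) concrete, the following is a minimal sketch rather than a method proposed in this book: each modality is encoded by its own GRU, and the resulting summaries are concatenated for multi-label hashtag prediction. The class name, feature dimensions, and hashtag vocabulary size are illustrative placeholders, and the plain concatenation merely stands in for whatever fusion scheme a future model might adopt.

import torch
import torch.nn as nn

class MultimodalHashtagSketch(nn.Module):
    """Hypothetical per-modality sequential encoder with simple late fusion."""
    def __init__(self, dims, hidden=256, num_hashtags=1000):
        super().__init__()
        # One GRU per modality, so each stream is modeled individually.
        self.encoders = nn.ModuleDict(
            {name: nn.GRU(dim, hidden, batch_first=True) for name, dim in dims.items()}
        )
        # Late fusion by concatenating the per-modality summaries.
        self.classifier = nn.Linear(hidden * len(dims), num_hashtags)

    def forward(self, inputs):
        # inputs: modality name -> tensor of shape (batch, seq_len, feature_dim);
        # sequence lengths may differ across modalities.
        summaries = []
        for name, encoder in self.encoders.items():
            _, h_last = encoder(inputs[name])      # final hidden state of the stream
            summaries.append(h_last.squeeze(0))    # (batch, hidden)
        fused = torch.cat(summaries, dim=-1)       # complementary views side by side
        return self.classifier(fused)              # multi-label hashtag logits

# Toy usage with random tensors standing in for pre-extracted frame, audio,
# and word embeddings of a small batch of micro-videos.
model = MultimodalHashtagSketch({"visual": 2048, "acoustic": 128, "textual": 300})
batch = {"visual": torch.randn(4, 30, 2048),
         "acoustic": torch.randn(4, 50, 128),
         "textual": torch.randn(4, 20, 300)}
logits = model(batch)                              # shape: (4, 1000)

In practice, the concatenation would likely be replaced by an attention-based or factorized fusion so that the consistent and complementary information across modalities can be weighted explicitly.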