CHAPTER 7
Research Frontiers
In this book, we investigate several application-motivated research problems in micro-video understanding. To solve these problems, we design general principles, methodologies, and optimizations that jointly learn from multiple correlated modalities of the given micro-videos, including the textual, visual, acoustic, and social ones, and we empirically validate them on multiple real-world datasets. In particular, we first introduce the proliferation of micro-video services and identify three practical tasks of micro-video understanding: popularity prediction, venue category estimation, and micro-video routing. Based upon these tasks, we analyze the unique research challenges that distinguish micro-videos from traditional long videos, such as information sparseness, hierarchical structure, low quality, multimodal sequential data, and the lack of benchmark datasets. To address these problems, we present a
series of multimodal learning methods, consisting of multimodal transductive learning, multimodal cooperative learning, multimodal transfer learning, and multimodal sequential learning. These methods are verified on the three datasets we constructed. To facilitate other
researchers, we have released the code, parameter settings, and the three datasets. We emphasize that learning from multiple modalities of micro-videos is still a young and highly promising research field. There are many unexplored yet fruitful future directions and challenging research issues. We illustrate a few of them here.
7.1 MICRO-VIDEO ANNOTATION
Facing the exponentially growing number of micro-videos, it is important to help users quickly identify their desired ones. The hashtags associated with micro-videos are typically provided by uploaders to summarize the content of their posts and attract the attention of followers. Taking the popular social platform Instagram as an example, as shown in Figure 7.1, hashtags are prefixed with the symbol “#” to mark keywords or key topics of a post. Hashtags have proven useful in many applications, including micro-blog retrieval, event analysis, and sentiment analysis. Moreover, the tagging service can benefit all stakeholders of micro-video ecosystems. For users, hashtags make it easier to search for and locate their desired micro-videos. For post-sharers, concise and concrete hashtags increase the probability that their micro-videos will be discovered. For platforms, hashtags make the management of micro-videos (e.g., categorization) more convenient. Despite their importance, numerous micro-videos lack hashtags, or their hashtags are inaccurate or incomplete. In light of this, micro-video annotation,
which suggests a list of hashtags to a user when he or she wants to annotate a post, becomes a crucial research problem.

Figure 7.1: An example of a micro-video with hashtags on Instagram.
Although several models, such as collaborative filtering, generative models, and DNNs, have been adopted for hashtag recommendation and achieved some progress, they mainly focus on micro-blogs or social images. Limited research efforts have been devoted to micro-video annotation, for the following reasons. (1) Long-tail distribution. The hashtag distribution is heavily skewed toward a few frequent hashtags, with a long tail of less frequent tags, as shown in Figure 7.2. Although current studies note that many hashtags in the long tail are misspelled or “meaningless” words, we believe that some meaningful hashtags within the long tail have been overlooked. That is, how to build correlations between the frequent hashtags and their “related” long-tail hashtags to enhance the representations of the latter remains untapped. (2) Multimodal sequence modeling. Micro-videos consist of visual, acoustic, and textual modalities, which are encoded together with a sequential structure. On the one hand, different streams in one micro-video exhibit different temporal dynamics and thus should be modeled individually; for example, the objects in a micro-video may remain the same throughout its time span, while the motion and audio may change from time to time. On the other hand, different modalities depict the intrinsic content of micro-videos consistently and complementarily from different views. Therefore, how to capture the sequential and multimodal features is a key open problem (a minimal sketch of this idea is given at the end of this section). (3) Diverse annotation. We find that
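To make the per-modality sequential modeling described in challenge (2) concrete, the following is a minimal sketch rather than a method proposed in this book: each modality is encoded by its own GRU, and the resulting summaries are concatenated for multi-label hashtag prediction. The class name, feature dimensions, and hashtag vocabulary size are illustrative placeholders, and the plain concatenation merely stands in for whatever fusion scheme a future model might adopt.

import torch
import torch.nn as nn

class MultimodalHashtagSketch(nn.Module):
    """Hypothetical per-modality sequential encoder with simple late fusion."""
    def __init__(self, dims, hidden=256, num_hashtags=1000):
        super().__init__()
        # One GRU per modality, so each stream is modeled individually.
        self.encoders = nn.ModuleDict(
            {name: nn.GRU(dim, hidden, batch_first=True) for name, dim in dims.items()}
        )
        # Late fusion by concatenating the per-modality summaries.
        self.classifier = nn.Linear(hidden * len(dims), num_hashtags)

    def forward(self, inputs):
        # inputs: modality name -> tensor of shape (batch, seq_len, feature_dim);
        # sequence lengths may differ across modalities.
        summaries = []
        for name, encoder in self.encoders.items():
            _, h_last = encoder(inputs[name])      # final hidden state of the stream
            summaries.append(h_last.squeeze(0))    # (batch, hidden)
        fused = torch.cat(summaries, dim=-1)       # complementary views side by side
        return self.classifier(fused)              # multi-label hashtag logits

# Toy usage with random tensors standing in for pre-extracted frame, audio,
# and word embeddings of a small batch of micro-videos.
model = MultimodalHashtagSketch({"visual": 2048, "acoustic": 128, "textual": 300})
batch = {"visual": torch.randn(4, 30, 2048),
         "acoustic": torch.randn(4, 50, 128),
         "textual": torch.randn(4, 20, 300)}
logits = model(batch)                              # shape: (4, 1000)

In practice, the concatenation would likely be replaced by an attention-based or factorized fusion so that the consistent and complementary information across modalities can be weighted explicitly.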