categories in the tree taxonomy are not independent, but hierarchically correlated (as shown in
Figure 1.2). In particular, the closer two venue categories are located in the tree, the more similar
concepts the associated micro-videos should convey. In a sense, we have to consider the inherent
structure of micro-videos when learning their representations. Also, we should leverage such a
structure to guide the organization of micro-videos by categorizing them into the leaf nodes of
this tree.
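As a rough illustration of what "closer in the tree" can mean, the sketch below counts the edges between two categories in a small taxonomy. The category names, the `PARENT` mapping, and the helper functions are hypothetical and chosen only for this example; edge count is one simple notion of closeness, not necessarily the measure used later in the book.

```python
# Hypothetical venue taxonomy stored as child -> parent.
# The category names are illustrative only, not taken from the dataset.
PARENT = {
    "Outdoors": "Root",
    "Food": "Root",
    "Park": "Outdoors",
    "Beach": "Outdoors",
    "Cafe": "Food",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def tree_distance(a, b):
    """Count the edges on the path between two categories in the tree."""
    # Map every ancestor of `a` (including `a`) to its distance from `a`.
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a))}
    # Walk up from `b` until we hit the first shared ancestor.
    for steps_from_b, n in enumerate(path_to_root(b)):
        if n in steps_from_a:
            return steps_from_a[n] + steps_from_b
    raise ValueError("categories are not in the same tree")


siblings = tree_distance("Park", "Beach")  # share the parent "Outdoors"
cousins = tree_distance("Park", "Cafe")    # meet only at "Root"
```

Under this toy taxonomy, sibling leaves are closer than leaves that meet only at the root, which matches the intuition that their micro-videos should convey more similar concepts.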
The third research challenge we face is low quality. Micro-videos are often captured by
users with hand-held mobile devices. Most portable devices offer no video
stabilization, which easily results in poor video quality, such as low resolution, wobbly frames,
constrained lighting conditions, and background noise. This greatly hinders visual expression.
Furthermore, the audio track that accompanies the video can suffer from different types of
distortion and noise, such as buzzing, hums, hisses, and whistling, which are probably caused
by poor microphones or complex surrounding environments. We thus have to harness
external high-quality visual or sound knowledge to compensate for the weakest links within
the micro-videos themselves. Alternatively, we should build robust models that explore the
intrinsic structure embedded in the data by inferring meaningful features and alleviating the
impact of noisy ones.
The fourth challenge is multimodal sequential data. Beyond the textual, acoustic, and
visual modalities, micro-videos have a new one: the social modality. In such a context, a user
can interact with micro-videos and other users via social actions, such as click, like, and
follow. As time goes on, multiple sequences of data in different forms emerge, and they reflect
users’ historical preferences. To strengthen micro-video understanding, we have to characterize
and model these sequential patterns.
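To make the notion of sequential social data concrete, the following minimal sketch groups a log of social actions into one time-ordered sequence per user. The log layout `(user, timestamp, action, target)`, the sample records, and the `build_sequences` helper are assumptions made purely for illustration; real interaction logs will differ.

```python
from collections import defaultdict

# Hypothetical interaction log: (user, timestamp, action, target).
# A target may be a micro-video (click, like) or another user (follow).
LOG = [
    ("u1", 30, "like", "v2"),
    ("u2", 10, "click", "v1"),
    ("u1", 10, "click", "v1"),
    ("u1", 20, "follow", "u2"),
]


def build_sequences(log):
    """Group interactions by user and sort each group by timestamp."""
    by_user = defaultdict(list)
    for user, timestamp, action, target in log:
        by_user[user].append((timestamp, action, target))
    # Tuples sort by their first element, so each list ends up time-ordered.
    return {user: sorted(events) for user, events in by_user.items()}


sequences = build_sequences(LOG)
```

Each resulting per-user sequence is the kind of input a sequential model could consume to characterize historical preferences.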
Last, but not least, the final research problem we are facing is the lack of benchmark
datasets to support our research on micro-video understanding.
1.4 OUR SOLUTIONS
In this book, to tackle the aforementioned research challenges, we present some state-of-the-art
multimodal learning theories and verify them over three practical tasks of micro-video under-
standing: popularity prediction, venue category estimation, and micro-video routing.
We first construct two large-scale, real-world micro-video datasets from Vine, which
correspond to the tasks of popularity prediction and venue category estimation. Considering
the large volume of micro-videos, we save human labor by establishing their ground truth using
some manually defined rules. As for the task of micro-video routing, we carry out our experiments
on public benchmark datasets.
For the popularity prediction task, we propose a multimodal transductive learning frame-
work to learn the micro-video representations in an optimal latent space. We assume that this
optimal latent space maintains the original intrinsic characteristics of micro-videos in the orig-
inal spaces. In light of this, all modalities are forced to be correlated. Meanwhile, micro-videos
with different popularity can be better separated in such an optimal common space, as compared