60 4. MULTIMODAL COOPERATIVE LEARNING
merely able to convey one or just a few high-level themes or concepts. Consequently, it is
necessary to learn high-level and sparse representations of micro-videos.
(3) Low-quality. Most portable devices offer no video stabilization. Some videos can
thus be shaky or bumpy, which greatly hinders their visual expression. Furthermore, the
audio track that accompanies the video can suffer from various types of distortion and noise,
such as buzzing, hums, hisses, and whistling, which are probably caused by poor microphones
or complex surrounding environments.
(4) Information loss. Apart from the acoustic and visual modalities, micro-videos are, more
often than not, uploaded with textual descriptions, which express useful cues that may
not be available in the other two modalities. However, the textual information may not be well
correlated with the visual and acoustic cues. Moreover, according to our statistics on 276,624
Vine videos, more than 11.4% of them do not have such text, probably as a result of users'
casual habits. This serious missing-information problem greatly reduces the usability of the
textual modality.
(5) Hierarchical structure. The venues of micro-videos are organized into hundreds of
categories, which are not independent but hierarchically correlated. Part of this structure is
shown in Figure 1.2. How to exploit such structure to guide venue category estimation remains
largely unexplored.
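To make the notion of hierarchically correlated categories concrete, the following is a minimal sketch of a venue tree stored as a child-to-parent map. The category names and the three-level depth are purely illustrative assumptions, not the actual venue taxonomy used in this chapter. The helper `tree_distance` is likewise a hypothetical measure, shown only to illustrate how two leaves under the same parent are closer than two leaves in different branches.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy (child -> parent); None marks the root.
# These names are illustrative only and are not taken from the dataset.
PARENT: Dict[str, Optional[str]] = {
    "Venue": None,
    "Food": "Venue",
    "Outdoors": "Venue",
    "Cafe": "Food",
    "Pizza Place": "Food",
    "Park": "Outdoors",
    "Beach": "Outdoors",
}


def path_to_root(node: str) -> List[str]:
    """Return the chain of categories from the root down to `node`."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path[::-1]


def leaf_nodes() -> List[str]:
    """Leaves are the categories that are never anyone's parent."""
    parents = set(PARENT.values())
    return sorted(n for n in PARENT if n not in parents)


def tree_distance(a: str, b: str) -> int:
    """Count the edges between two categories via their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    shared = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        shared += 1
    return (len(pa) - shared) + (len(pb) - shared)
```

In this toy tree, sibling leaves such as "Cafe" and "Pizza Place" are two edges apart, whereas "Cafe" and "Park" are four edges apart. A model that is aware of the hierarchy could, for instance, treat confusion between siblings as a milder error than confusion across branches.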
Moreover, when organizing micro-videos, we have to consider one indispensable factor,
i.e., online learning. On the one hand, micro-videos are often easily shot and instantly shared
at the mobile end, so timeliness is one of their highlights. In light of this, efficient online
operations deserve our attention. On the other hand, according to our statistics over 2 million
Vine videos, only 1.22% of them have been labeled with venue information, and the tree
structure of the venue categories holds 821 leaf nodes. Therefore, it is hard to acquire sufficient
training samples to build a robust model for micro-video categorization. Fortunately, micro-videos
are continuously uploaded, and we expect to incrementally strengthen our model by leveraging
the knowledge of incoming samples.
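The idea of incrementally strengthening a model as samples arrive can be sketched with a classic online learner. The example below uses a multi-class perceptron purely as a stand-in; it is not the scheme developed in this chapter. The class name `OnlinePerceptron`, the method `partial_fit`, and the three-dimensional feature vectors are all assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Optional, Sequence


class OnlinePerceptron:
    """Multi-class perceptron updated one labeled sample at a time.

    Illustrative stand-in for an online learner; not the chapter's method.
    """

    def __init__(self, n_features: int) -> None:
        self.n_features = n_features
        # One weight vector per label, created when the label first appears.
        self.weights: Dict[Hashable, List[float]] = {}

    def _score(self, label: Hashable, x: Sequence[float]) -> float:
        return sum(w * v for w, v in zip(self.weights[label], x))

    def predict(self, x: Sequence[float]) -> Optional[Hashable]:
        """Return the highest-scoring known label, or None if none seen yet."""
        if not self.weights:
            return None
        return max(self.weights, key=lambda label: self._score(label, x))

    def partial_fit(self, x: Sequence[float], y: Hashable) -> bool:
        """Update with a single sample; return True if the weights changed."""
        if y not in self.weights:
            self.weights[y] = [0.0] * self.n_features
        guess = self.predict(x)
        if guess == y:
            return False  # already correct, nothing to learn from this sample
        # Mistake-driven update: pull the true label toward x,
        # push the wrongly predicted label away from it.
        for i, v in enumerate(x):
            self.weights[y][i] += v
            self.weights[guess][i] -= v
        return True
```

Each call to `partial_fit` touches only the two weight vectors involved in a mistake, so the cost per incoming sample does not grow with the number of samples already seen. New labels can also appear mid-stream without retraining from scratch, which is the property that matters when labeled samples are scarce at the start and accumulate over time.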
To address the aforementioned challenges, in this chapter we develop three schemes from
different perspectives in order to organize micro-videos into a tree taxonomy.
4.3 RELATED WORK
Our work is related to a broad spectrum of multimedia location estimation, multi-modal multi-
task learning, and dictionary learning.
4.3.1 MULTIMEDIA VENUE ESTIMATION
Nowadays, it has become convenient to capture images and videos on the mobile end and asso-
ciate them with GPS tags. Such a hybrid data structure can benefit a wide variety of potential
multimedia applications, such as location recognition [58], landmark search [23], augmented
reality [15], and commercial recommendations [183]. It hence has attracted great attention from