4.3. RELATED WORK
the multimedia community. Generally speaking, prior efforts can be divided into two categories:
mono-modal venue estimation [15, 23, 27] and multi-modal venue estimation [29, 50, 58]. Ap-
proaches in the former category extract a rich set of visual features from images and leverage the
visual features to train either shallow or deep models to estimate the venues of the given images.
As reported in [58], landmark identification [27] and scene classification [15] of images are
the key factors in recognizing venues. The basic philosophy behind these approaches is that
certain visual features in images correlate strongly with certain geographies even if the relation-
ship is not strong enough to specifically pinpoint a location coordinate. Beyond the mono-modal
venue estimation which only takes the visual information into consideration, multi-modal venue
estimation works by inferring the geo-coordinates of the recording places of the given videos by
fusing the textual metadata and visual or acoustic cues [29, 49]. Friedland et al. [49] determined
the geo-coordinates of the Flickr videos based on both textual metadata and visual cues. Audio
tracks from the Placing Task 2011 dataset videos were also used to train a location estimation
model, which achieved reasonable performance [84]. The main idea is that the integration of
multiple modalities can lead to better results, consistent with the old saying that "two heads are
better than one." However, multi-modal venue estimation is still in its infancy, and more
efforts should be dedicated to improving this line of research.
Noticeably, the venue granularity of the targeted multimedia entities in the aforemen-
tioned literature varies significantly. Roughly, the spatial resolutions are in three levels: city-
level [50, 58], within-city-level [23, 89, 143], and close-to-exact GPS level [49]. City-level and
within-city-level location estimation can be applied to multimedia organization [33], location
visualization [23], and image classification [153]. However, their granularities are coarse, which
may make them unsuitable for some application scenarios, such as business venue discovery [21]. The
granularity of the close-to-exact GPS level is finer; nevertheless, it is hard to estimate precise
coordinates, especially in indoor cases. For example, it is challenging to distinguish an office
on the third floor from a coffee shop on the second floor of the same building, since GPS
is not available indoors.
Our work differs from the above methods in the following two aspects: (1) we focus on the
estimation of the venue category, which is neither city-level nor a precise location. This is because
a venue category is a more abstract concept than a single venue name, and it can support many
applications for personalized and location-based services/marketing [21]; and (2) micro-videos
are a medium between images and traditional long videos, which poses tough challenges.
4.3.2 MULTI-MODAL MULTI-TASK LEARNING
The literature on the multi-task problem with multi-modal data is relatively sparse. He et al. [59]
proposed a graph-based iterative framework for multi-view multi-task learning (i.e., IteM²) and
applied it to text classification. However, it can only deal with problems with non-negative
feature values. In addition, it is a transductive model. Hence, it is unable to generate predic-
tive models for independent and unseen testing samples. To address the intrinsic limitations