CHAPTER 5
Multimodal Transfer Learning
in Micro-Video Analysis
5.1 BACKGROUND
In Chapter 4, we introduced a series of multimodal cooperative learning methods for micro-
video understanding, especially the task of estimating venue categories. These methods exploit the
consensus information across the visual, acoustic, and textual modalities of micro-videos to recognize
venue information. This multimodal approach, however, overlooks that the descriptive power of
each modality may differ dramatically. According to the experiments in [191], the acoustic modal-
ity has the weakest capability of indicating venue information. This causes a substantial cask
effect, whereby the weakest modality limits venue estimation and further content understanding.
To gain deeper insight, we performed a user study to verify the acoustic influence on esti-
mating venue categories. Given 100 micro-videos randomly selected from Vine, 5 volunteers were
blinded to the visual and textual modalities, tried to infer the venue category of each micro-video
from its acoustic modality alone, and subsequently rated the level of acoustic importance with a
score from 1 to 5. As shown in Figure 5.1a, we have several observations: (1) the acoustic modal-
ity in 59% of micro-videos can benefit, more or less, the venue category estimation. This reveals
the potential impact of acoustic concepts. For example, recognizing bird chirps or crowd
cheering from the acoustic modality helps to estimate a park or a concert. However,
(2) the acoustic modality in 84% of micro-videos is insufficiently descriptive (rated below 4) to sup-
port venue category estimation, owing to its noise and low quality. This study shows
that detecting acoustic concepts is useful yet needs further enhancement.
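The two reported fractions can be read as simple threshold counts over the per-video importance scores. A minimal sketch of that arithmetic, using made-up scores (the actual study data are not reproduced here) and assuming "can benefit" means a score above 1 and "insufficiently descriptive" means a score below 4:

```python
# Hypothetical per-video acoustic-importance scores on the 1-5 scale
# (illustrative values only, not the actual study data).
scores = [1, 2, 3, 5, 2, 1, 4, 2, 3, 1]

n = len(scores)
# Fraction of videos whose audio helps at least somewhat (score above 1).
helpful = sum(s > 1 for s in scores) / n
# Fraction of videos whose audio is insufficiently descriptive (score below 4).
weak = sum(s < 4 for s in scores) / n
print(f"helpful: {helpful:.0%}, weak: {weak:.0%}")  # prints "helpful: 70%, weak: 80%"
```

Note that the two groups overlap: a video rated 2 or 3 counts as both somewhat helpful and insufficiently descriptive, which is why the study's 59% and 84% can sum to more than 100%.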
5.2 RESEARCH PROBLEMS
Leveraging external rich sound knowledge to compensate for the weak internal acoustic modality is an
intuitive thought. It is, however, non-trivial owing to the following challenges: (1) as micro-
videos often record events, it is desirable to detect high-level acoustic concepts, which are more
discriminative for event description [141]; we thus have to learn a conceptual representation
of micro-videos; (2) micro-videos are about users' daily activities, which restricts us to harnessing
external real-life sound clips; however, to the best of our knowledge, no such sound data
collection is available. Moreover, (3) the external sounds are mono-modal, whereas micro-
videos unify textual, visual, and acoustic modalities. The micro-videos and external