CHAPTER 5
Multimodal Transfer Learning
in Micro-Video Analysis
5.1 BACKGROUND
In Chapter 4, we introduced a series of multimodal cooperative learning methods for micro-
video understanding, especially the task of estimating venue categories. These methods exploit the
consensus information across the visual, acoustic, and textual modalities of micro-videos to recognize
venue information. This multimodal approach, however, overlooks that the descriptive power of
each modality may differ dramatically. According to the experiments in [191], the acoustic modal-
ity has the weakest capability of indicating venue information. This causes a substantial cask
effect, whereby the weakest modality limits venue estimation and further content understanding.
To gain deeper insight, we performed a user study to verify the acoustic influence on esti-
mating venue categories. Given 100 micro-videos randomly selected from Vine, 5 volunteers were
blinded to the visual and textual modalities, tried to infer the venue category of each micro-video
from its acoustic modality alone, and subsequently rated the level of acoustic importance with a
score from 1 to 5. As shown in Figure 5.1a, we have several observations: (1) the acoustic modal-
ity in 59% of micro-videos can benefit, more or less, the venue category estimation. This reveals
the potential impact of acoustic concepts. For example, recognizing bird chirps or crowd
cheering from the acoustic modality helps to estimate a park or a concert. However,
(2) the acoustic modality in 84% of micro-videos is insufficiently descriptive (rated below 4) to sup-
port venue category estimation, owing to its noise and low quality. This study shows
that detecting acoustic concepts is useful yet needs further enhancement.
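The two reported fractions can be read as simple threshold counts over the per-video importance scores. A minimal sketch of that arithmetic, using made-up scores (the actual study data are not reproduced here) and assuming "can benefit" means a score above 1 and "insufficiently descriptive" means a score below 4:

```python
# Hypothetical per-video acoustic-importance scores on the 1-5 scale
# (illustrative values only, not the actual study data).
scores = [1, 2, 3, 5, 2, 1, 4, 2, 3, 1]

n = len(scores)
# Fraction of videos whose audio helps at least somewhat (score above 1).
helpful = sum(s > 1 for s in scores) / n
# Fraction of videos whose audio is insufficiently descriptive (score below 4).
weak = sum(s < 4 for s in scores) / n
print(f"helpful: {helpful:.0%}, weak: {weak:.0%}")  # prints "helpful: 70%, weak: 80%"
```

Note that the two groups overlap: a video rated 2 or 3 counts as both somewhat helpful and insufficiently descriptive, which is why the study's 59% and 84% can sum to more than 100%.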
5.2 RESEARCH PROBLEMS
Leveraging external rich sound knowledge to compensate for the weak internal acoustic modality is an
intuitive thought. It is, however, non-trivial owing to the following challenges: (1) as micro-
videos often record events, it is desirable to detect high-level acoustic concepts, which are more
discriminative for event description [141]; we thus have to learn a conceptual representation
of micro-videos; (2) micro-videos are about users' daily activities, which restricts us to harnessing
external real-life sound clips; however, to the best of our knowledge, no such sound data
collection is available. Moreover, (3) the external sounds are mono-modal, whereas micro-
videos unify textual, visual, and acoustic modalities. The micro-videos and external