learning and regularizes the similarity preservation to alleviate the problem of unbalanced training samples. The scheme of our proposed DARE approach is illustrated in Figure 5.3. To be more specific, we first segment each micro-video into visual, textual, and acoustic modalities. We then project the low-level representations of each modality to conceptual ones by distinct mapping functions. To transfer the external sound knowledge and thereby strengthen the acoustic modality, we force the internal acoustic modality and the external sounds to share the same low-level feature space and mapping function. Following that, we propose a deep multi-modal fusion method, which utilizes the complementary information from each modality, encodes the category structure information by similarity preservation, and uncovers the nonlinear correlations between concepts. We ultimately feed the fused representations into a prediction function to estimate the venue categories.
Figure 5.3: Schematic illustration of our proposed deep transfer model. It transfers knowledge from external sound clips (e.g., applause and cheering) to strengthen the description of the internal acoustic modality in micro-videos. Meanwhile, it conducts a deep multi-modal fusion toward venue category estimation (e.g., school, park, and concert).
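To make the pipeline above concrete, the following is a minimal sketch of such an architecture in PyTorch. The module names, feature dimensions, ReLU activations, and concatenation-based fusion are illustrative assumptions rather than the exact DARE formulation; the sketch only mirrors the key structure, namely that each modality has its own mapping function, the acoustic mapping is reused for external sound clips, and hidden layers fuse the concept-level representations before venue prediction.

```python
# A minimal sketch (not the authors' code) of the architecture described above.
# All names, dimensions, and the ReLU/concatenation choices are assumptions.
import torch
import torch.nn as nn


class DARESketch(nn.Module):
    def __init__(self, d_vis, d_txt, d_aco, k_vis, k_txt, k_aco,
                 n_venues, d_hidden=256):
        super().__init__()
        # Per-modality mapping functions: low-level features -> concept-level ones.
        self.visual_map = nn.Linear(d_vis, k_vis)
        self.textual_map = nn.Linear(d_txt, k_txt)
        # The acoustic mapping is shared between micro-video audio tracks and
        # external sound clips, so both land in the same concept space.
        self.acoustic_map = nn.Linear(d_aco, k_aco)
        # Hidden layers fuse the concatenated concept vectors and capture
        # nonlinear correlations between concepts.
        self.fusion = nn.Sequential(
            nn.Linear(k_vis + k_txt + k_aco, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_hidden), nn.ReLU(),
        )
        self.venue_head = nn.Linear(d_hidden, n_venues)

    def forward(self, x_vis, x_txt, x_aco):
        c = torch.cat([self.visual_map(x_vis),
                       self.textual_map(x_txt),
                       self.acoustic_map(x_aco)], dim=-1)
        return self.venue_head(self.fusion(c))  # venue-category logits

    def embed_external_sound(self, x_sound):
        # External sound clips reuse the same acoustic mapping function,
        # which is how the external knowledge is transferred.
        return self.acoustic_map(x_sound)
```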
5.5.1 SOUND KNOWLEDGE TRANSFER
To leverage the external sound knowledge to enhance the acoustic modality in micro-videos, we make two assumptions: (1) concept-level representations are more discriminative in characterizing each modality of micro-videos and the external sounds; and (2) the natural correlation between the acoustic modality in micro-videos and real-life sounds motivates us to assume that they share the same acoustic concept space.
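As a rough illustration of assumption (2), the sketch below projects both external sound clips and micro-video audio through one shared mapping matrix into a common acoustic concept space; the external clips' concept annotations supervise that mapping. This is a hypothetical example under assumed dimensions and cross-entropy supervision, not the authors' exact objective.

```python
# Sketch of a shared acoustic concept space: one mapping matrix W_a serves both
# external sound clips and micro-video audio. Dimensions and the supervision
# used here are illustrative assumptions.
import torch
import torch.nn.functional as F

d_aco, k_aco = 128, 40                      # low-level / concept dimensions (assumed)
W_a = torch.randn(d_aco, k_aco, requires_grad=True)

def to_concepts(x):
    """Project low-level acoustic features into the shared concept space."""
    return x @ W_a

# External sound clips come with concept labels (e.g., "applause"),
# which supervise the shared mapping ...
x_ext = torch.randn(32, d_aco)              # batch of external sound features
y_ext = torch.randint(0, k_aco, (32,))      # their concept labels
loss_ext = F.cross_entropy(to_concepts(x_ext), y_ext)

# ... while micro-video audio reuses the same W_a, inheriting that knowledge.
x_video_audio = torch.randn(8, d_aco)
video_acoustic_concepts = to_concepts(x_video_audio)
```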
As to the concept-level representation, one intuitive thought is multi-modal dictionary learning, whereby the atoms in the dictionaries are treated as concepts. We, however, argue that the implicit assumption of multi-modal dictionary learning, namely that the dictionaries of distinct modalities share the same concept space, does not always hold in real-world scenarios. Taking micro-video analysis as an example, the acoustic modality may contain the concept of chirping birds, which is hardly expressed by the visual modality. The textual modality, in turn, may signal some atoms related to the sense of smell, which likewise cannot appear in the visual modality. Therefore, it is not necessary to enforce the dictionaries of different modalities to contain the same set of concepts. To avoid such a problem, we propose to learn a separate mapping function