4.6. MULTIMODAL COOPERATIVE LEARNING 105
sounds, musical instruments and genres, and everyday common sounds from the environment,
like “Speech,” “Laughter,” and “Guitar.”
To estimate the concepts in the audio, we employed a VGG-like model [62] and trained
it on AudioSet. To match the input format of the CNN model, we regenerated the
acoustic features of the micro-videos. The extracted audio tracks are divided into non-overlapping
960 ms frames, and the spectrogram computed from each frame is integrated into 64
mel-spaced frequency bins. Finally, we applied mean pooling over all the frames of the
micro-video to yield a new acoustic feature vector.
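The framing and pooling steps can be illustrated with a minimal NumPy sketch. It is not the original implementation: the function names, the 16 kHz sample rate in the usage note, and the choice to drop a trailing partial frame are assumptions made for illustration, and the per-frame features stand in for whatever the VGG-like model outputs.

```python
import numpy as np

FRAME_SECONDS = 0.96  # 960 ms frames, as stated in the text


def split_into_frames(waveform, sample_rate):
    """Cut a 1-D waveform into non-overlapping 960 ms frames.

    Assumption: a trailing partial frame is discarded.
    """
    waveform = np.asarray(waveform, dtype=float)
    frame_len = int(round(FRAME_SECONDS * sample_rate))
    n_frames = len(waveform) // frame_len
    if n_frames == 0:
        raise ValueError("audio is shorter than one 960 ms frame")
    return waveform[: n_frames * frame_len].reshape(n_frames, frame_len)


def mean_pool(frame_features):
    """Average per-frame feature vectors into one vector per micro-video."""
    return np.asarray(frame_features, dtype=float).mean(axis=0)
```

For example, a 3.5 s clip sampled at 16 kHz yields three full frames; the per-frame outputs of the concept model would then be passed to `mean_pool` to obtain a single vector for the micro-video.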
With the new acoustic conceptual features, we conducted experiments to shed some light
on the correlation between the acoustic modality and the other modalities. In addition, we
visualized the attention score matrix between the acoustic concepts and venue categories to
validate our proposed model intuitively.
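As a rough illustration of this kind of correlation analysis, the sketch below scores each acoustic concept feature against the features of the other modalities and splits the acoustic vector into a consistent part and a complementary part. The scoring formula is not given in this passage, so the bilinear form with a sigmoid, the weight matrix `W`, and the 0.5 threshold are all hypothetical choices made for the example, not the model's actual definition.

```python
import numpy as np


def correlation_scores(host, guest, W):
    """Score each host feature against the guest vector.

    host:  (d_h,) acoustic concept features
    guest: (d_g,) features from the other modalities
    W:     (d_h, d_g) hypothetical learned weights
    Returns scores in (0, 1); the sigmoid form is an assumption.
    """
    logits = host * (W @ guest)
    return 1.0 / (1.0 + np.exp(-logits))


def split_components(host, scores, threshold=0.5):
    """Split host features by score (the threshold is a hypothetical choice)."""
    consistent_mask = scores >= threshold
    consistent = np.where(consistent_mask, host, 0.0)
    complementary = np.where(consistent_mask, 0.0, host)
    return consistent, complementary
```

Plotting the scores for each concept as one row of a heat map gives a picture of the kind used to compare micro-videos, with high scores marking concepts that the other modalities also express.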
• To visualize the consistent and the complementary components, we selected two exemplary
micro-videos categorized as “Park” and “Piazza place,” as shown in Figures 4.18a and
4.18b. For these demonstrations, we treated the acoustic modality as the host and the
visual and textual modalities as the guests. We used a heat map to illustrate the
correlation between the host and guest feature pairs, where a darker color indicates that
the host feature is more consistent with the guest modalities and a lighter color indicates
that it is less so. From Figure 4.18a, we observe that several acoustic concepts are
consistent with the visual and textual modalities, such as “Music” and “Violin,” while
others are exclusive and hardly revealed by the other modalities, such as “Applause,”
“Noise,” and “Car alarm.” In contrast, in Figure 4.18b, the distribution of correlation
scores is markedly different.
The concepts, such as “Applause,” “Crowd,” and “Noise,” can be represented by the guest
[Figure 4.18 panels: (a) Park example, “Street music in Boston”; (b) Piazza Place example, “Football up high.” Both heat maps share the same rows of acoustic concepts: Music; Classical music; Song; Applause; Crowd; Children playing; Animal; Dog; Car alarm; Violin, fiddle; Electric guitar; Harmonica; Accordion; Shofar; Guitar; Bagpipes; Noise; Echo; Outside, urban or manmade; Outside, rural or natural.]
Figure 4.18: Visualization of the correlation scores between the same acoustic concept-level
features and different visual and textual features.