5. Comparing each row in Table 4.9 with the first three rows in Table 4.8, each modality enhanced by the other two modalities via our model performs better than the early fusion integrating the attention model. This indicates that our model can capture the correlation between different modalities.
6. Jointly analyzing the curves in Figure 4.17, we found that utilizing our proposed cooperative learning to seamlessly integrate multiple modalities boosts the performance effectively, which demonstrates the rationality of our model. Moreover, the performance tends to stabilize at around 30 iterations, which signals the convergence of our model and also indicates its efficiency.
We also list several variants based on our proposed cooperative net. These methods group the features of each modality into consistent and complementary parts. Afterwards, we adopted different fusing strategies to leverage the consistent and complementary features, including the following.
Variant-I: In this model, Eq. (4.47) is removed. In other words, we integrated the guest complementary information into the host modality without enhancing the consistent parts, while the guest consistent part is retained to calculate the KL-divergence for keeping the consistency.
Variant-II: is variant discards Eq. (4.48) and merely harnesses the consistent vector
pairs to learn an enhanced feature vector for each modality and categorize the venue with
these enhanced feature vectors.
Variant-III: After obtaining the consistent and complementary features from each host-guest modality pair, Eq. (4.47) is replaced, and a new enhanced consistent vector is learned by integrating all host consistent vectors. After that, the category is estimated by fusing the predictions of the newly enhanced consistent vector and each complementary vector.
Variant-IV: In this variant, we respectively concatenated all complementary parts and all consistent parts, instead of applying Eqs. (4.47) and (4.48). Finally, we estimated the venue category from each of the two concatenated parts and fused the two predictions to obtain the result (a minimal sketch of this variant is given after the list).
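To make the fusion strategies concrete, the following is a minimal sketch of Variant-IV, assuming each modality has already been split into a consistent part and a complementary part. The class name, the linear classifiers, and the averaging fusion rule are illustrative assumptions, not the exact formulation of Eqs. (4.47) and (4.48).

    import torch
    import torch.nn as nn

    class VariantIV(nn.Module):
        """Concatenate all consistent parts and all complementary parts,
        classify each concatenation separately, and fuse the two predictions."""

        def __init__(self, dim_consistent, dim_complementary, num_modalities, num_venues):
            super().__init__()
            # one classifier per concatenated group, as described for Variant-IV
            self.consistent_clf = nn.Linear(dim_consistent * num_modalities, num_venues)
            self.complementary_clf = nn.Linear(dim_complementary * num_modalities, num_venues)

        def forward(self, consistent_parts, complementary_parts):
            # each argument is a list of (batch, dim) tensors,
            # one per modality (visual, acoustic, textual)
            cons = torch.cat(consistent_parts, dim=-1)
            comp = torch.cat(complementary_parts, dim=-1)
            prob_cons = self.consistent_clf(cons).softmax(dim=-1)
            prob_comp = self.complementary_clf(comp).softmax(dim=-1)
            # simple late fusion by averaging the two categorical distributions
            return (prob_cons + prob_comp) / 2

Averaging the two softmax outputs is only one simple late-fusion choice; the precise fusion rule used in our experiments may differ.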
From Table 4.10, we have the following observations.
1. In terms of Macro-F1, Variant-I and Variant-IV outperform Variant-II and Variant-III, respectively. This may be because combining the complementary information brings in more information and strengthens the expressiveness of the representations.
2. The accuracy of the first two variants is higher than that of the other two. This benefits from capturing the correlation between the host and guest features, which is ignored by Variant-III and Variant-IV.
Table 4.10: Performance of variants

Method        Micro-F1          Macro-F1
Variant-I     39.17 ± 0.27%     25.05 ± 0.28%
Variant-II    39.01 ± 0.37%     23.70 ± 0.19%
Variant-III   38.11 ± 0.40%     22.78 ± 0.10%
Variant-IV    38.48 ± 0.33%     24.49 ± 0.18%
NMCL          40.04 ± 0.37%     26.78 ± 0.42%
3. Our proposed method outperforms all its variants, justifying the rationality and effectiveness of cooperative learning. Different from the variants, the original model considers the consistency between each host and guest modality pair and supplements it with the exclusive signals from the guest modalities.
4. We observe that Variant-I, which discards one of the consistent parts, does not suffer a significant reduction in accuracy. This shows that the information contained in the two consistent vectors is almost the same, and it also proves that our model can correctly distinguish and capture the consistent features.
5. Comparing the proposed method with Variant-II, we observe that the improvement in terms of Micro-F1 is not significant. Upon further analysis, we believe that the main reason is that the concepts contained in micro-videos are sparse. Moreover, the information contained in any single modality is almost covered by the other two modalities. In other words, the complementary parts contain little extra information. Therefore, the removal of the complementary parts barely affects the performance.
Visualization
Apart from achieving more accurate predictions, the key advantage of NMCL over other methods is that it explicitly exhibits the consistent and complementary features. Toward this end, we show examples drawn from our model to visualize the two representation components.
Since the acoustic modality is the hardest one to visualize among the multiple modalities, we utilized concept-level features to represent it. To extract the concepts from the fine-grained acoustic features, we leveraged an external dataset, namely AudioSet, a large-scale dataset released by Google (https://research.google.com/audioset/). The external audio dataset was used only for the visualization.
AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-s sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and everyday common sounds from the environment, such as "Speech," "Laughter," and "Guitar."
To estimate the concepts in the audio, we employed a VGG-like model [62] and trained it over AudioSet. According to the input format of this CNN model, we regenerated the acoustic features of the micro-videos. The extracted audio is divided into non-overlapping 960 ms frames, and the spectrogram computed from each frame is integrated into 64 mel-spaced frequency bins. Finally, we applied mean pooling over all the frames of the micro-video to yield a new acoustic feature vector.
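The preprocessing described above can be sketched as follows. This is a minimal illustration, assuming a 16 kHz sampling rate and a hypothetical concept_model callable standing in for the VGG-like network trained on AudioSet; it is not the exact pipeline used in our experiments.

    import numpy as np
    import librosa

    def regenerate_acoustic_features(audio_path, concept_model, sr=16000):
        """Split the audio into non-overlapping 960 ms frames, turn each frame
        into a 64-bin log-mel spectrogram, score it with the VGG-like concept
        model, and mean-pool the scores over all frames of the micro-video."""
        y, _ = librosa.load(audio_path, sr=sr, mono=True)
        frame_len = int(0.96 * sr)  # 960 ms, non-overlapping
        frame_scores = []
        for start in range(0, len(y) - frame_len + 1, frame_len):
            frame = y[start:start + frame_len]
            # 64 mel-spaced frequency bins per frame
            mel = librosa.feature.melspectrogram(y=frame, sr=sr, n_mels=64)
            log_mel = librosa.power_to_db(mel)
            frame_scores.append(np.asarray(concept_model(log_mel)).ravel())
        # mean pooling over all frames yields the new acoustic feature vector
        return np.mean(frame_scores, axis=0)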
With the new acoustic conceptual features, we conducted experiments to shed some light
on the correlation between the acoustic modality and the other modalities. In addition, we visu-
alized the attention score matrix between the acoustic concepts and venue categories to validate
our proposed model intuitively.
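Heat maps such as those in Figures 4.18 and 4.19 can be drawn once the correlation or attention scores are exported from the model as a venue-by-concept matrix. The helper below is a generic sketch; the function name, figure size, and color map are illustrative assumptions.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_attention_heatmap(scores, venue_names, concept_names):
        """scores: (num_venues, num_concepts) array of attention (or correlation)
        weights exported from the model; lighter cells denote weaker attention."""
        fig, ax = plt.subplots(figsize=(12, 6))
        im = ax.imshow(scores, cmap="Blues", aspect="auto")
        ax.set_xticks(np.arange(len(concept_names)))
        ax.set_xticklabels(concept_names, rotation=90)
        ax.set_yticks(np.arange(len(venue_names)))
        ax.set_yticklabels(venue_names)
        fig.colorbar(im, ax=ax, label="attention score")
        fig.tight_layout()
        plt.show()

Passing the selected venue categories and acoustic concepts produces a figure in the style of Figure 4.19.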
To visualize the consistent and the complementary components, we selected exemplary demonstrations of two micro-videos categorized as "Park" and "Piazza Place," as shown in Figures 4.18a and 4.18b. For these demonstrations, we treated the acoustic modality as the host part and the visual and textual modalities as the guests, and we showed a heat map to illustrate the correlation between the host and guest feature pairs, where a darker color indicates that the host feature is consistent with the guest modalities and vice versa. From Figure 4.18a, we observe that several acoustic concepts are consistent with the visual and textual modalities, such as "Music" and "Violin," while some are exclusive ones hardly revealed by the other modalities, such as "Applause," "Noise," and "Car alarm." In contrast, given Figure 4.18b, we find that the correlation score distribution is totally different. Concepts such as "Applause," "Crowd," and "Noise" can be represented by the guest features, while "Music" and "Violin" are barely captured in the other modalities. However, these lighter-colored features provide exclusive and discriminative information to predict the venue category. Our proposed model can explicitly capture this exclusive information as a supplement, rather than omitting it during the learning procedure. These observations verify the assumption that the information from different modalities is complementary and demonstrate that our proposed model can explicitly separate the consistent information from the complementary one.

Figure 4.18: Visualization of the correlation scores between the same acoustic concept-level features and different visual and textual features. (a) "Park" example (text: "Street music in Boston"); (b) "Piazza Place" example (text: "Football up high").
To save space, we presented part of the attention matrix via a heat map, where a lighter color indicates weaker attention and vice versa, as shown in Figure 4.19. We can see that every selected venue category relates differently to each acoustic concept. For instance, the micro-videos with the venue "Mall" have strong correlations with "Speech" and "Children shouting," whereas the correlation with "Babbling" is loose. In addition, for the
Figure 4.19: Visualization of the attention scores of acoustic concept-level features and venue category pairs.