5. Comparing each row in Table 4.9 with the first three rows in Table 4.8, each modality enhanced by the other two modalities via our model performs better than the early fusion integrating the attention model. This indicates that our model can capture the correlation between different modalities.
6. Jointly analyzing the curves in Figure 4.17, we found that utilizing our proposed cooperative learning to seamlessly integrate multiple modalities boosts the performance effectively, which demonstrates the rationality of our model. Moreover, the performance tends to stabilize at around 30 iterations, which signals the convergence of our model and also indicates its efficiency.
We also list several variants based on our proposed cooperative net. These methods group the features of each modality into consistent and complementary parts. Afterwards, we adopted different fusing strategies to leverage the consistent and complementary features, including the following.
Variant-I: In this model, Eq. (4.47) is removed. In other words, we integrated the guest complementary information into the host modality without enhancing the consistent parts, while the guest consistent part is retained to calculate the KL-divergence for keeping the consistency.
Variant-II: is variant discards Eq. (4.48) and merely harnesses the consistent vector
pairs to learn an enhanced feature vector for each modality and categorize the venue with
these enhanced feature vectors.
Variant-III: After obtaining the consistent and complementary features from each host-guest modality pair, Eq. (4.47) is replaced, and a new enhanced consistent vector is learned by integrating all host consistent vectors. After that, the category is estimated by fusing the predictions of the newly enhanced consistent vector and each complementary vector.
Variant-IV: In this variant, we respectively concatenated all complementary parts and all consistent parts, instead of applying Eqs. (4.47) and (4.48). Finally, we estimated the venue category from each of the two concatenated parts and fused the two predictions to obtain the result (a minimal sketch of this variant is given after the list).
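To make the fusion strategies concrete, the following is a minimal sketch of Variant-IV, assuming each modality has already been split into a consistent part and a complementary part. The class name, the linear classifiers, and the averaging fusion rule are illustrative assumptions, not the exact formulation of Eqs. (4.47) and (4.48).

    import torch
    import torch.nn as nn

    class VariantIV(nn.Module):
        """Concatenate all consistent parts and all complementary parts,
        classify each concatenation separately, and fuse the two predictions."""

        def __init__(self, dim_consistent, dim_complementary, num_modalities, num_venues):
            super().__init__()
            # one classifier per concatenated group, as described for Variant-IV
            self.consistent_clf = nn.Linear(dim_consistent * num_modalities, num_venues)
            self.complementary_clf = nn.Linear(dim_complementary * num_modalities, num_venues)

        def forward(self, consistent_parts, complementary_parts):
            # each argument is a list of (batch, dim) tensors,
            # one per modality (visual, acoustic, textual)
            cons = torch.cat(consistent_parts, dim=-1)
            comp = torch.cat(complementary_parts, dim=-1)
            prob_cons = self.consistent_clf(cons).softmax(dim=-1)
            prob_comp = self.complementary_clf(comp).softmax(dim=-1)
            # simple late fusion by averaging the two categorical distributions
            return (prob_cons + prob_comp) / 2

Averaging the two softmax outputs is only one simple late-fusion choice; the precise fusion rule used in our experiments may differ.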
From Table 4.10, we have the following observations.
1. In terms of Macro-F1, Variant-I and Variant-IV outperform Variant-II and Variant-III, respectively. This may be because combining the complementary information brings in more information and strengthens the expressiveness of the representations.
2. The accuracy of the first two variants is higher than that of the other two. This benefits from capturing the correlation between the host and guest features, which is ignored by Variant-III and Variant-IV.
Table 4.10: Performance of variants

Method        Micro-F1          Macro-F1
Variant-I     39.17 ± 0.27%     25.05 ± 0.28%
Variant-II    39.01 ± 0.37%     23.70 ± 0.19%
Variant-III   38.11 ± 0.40%     22.78 ± 0.10%
Variant-IV    38.48 ± 0.33%     24.49 ± 0.18%
NMCL          40.04 ± 0.37%     26.78 ± 0.42%
3. Our proposed method outperforms all its variants, justifying the rationality and effectiveness of cooperative learning. Different from the variants, the original model considers the consistency between each host and guest modality pair and supplements it with the exclusive signals from the guest modalities.
4. We observe that Variant-I, which discards one of the consistent parts, does not suffer a significant reduction in accuracy. This shows that the information contained in the two consistent vectors is almost the same, and it also proves that our model can correctly distinguish and capture the consistent features.
5. Comparing the proposed method with Variant-II, we observe that the improvement in terms of Micro-F1 is not significant. Upon further analysis, we believe that the main reason is that the concepts contained in micro-videos are sparse. Moreover, the information contained in any single modality is almost covered by the other two modalities. In other words, the complementary parts contain little extra information. Therefore, the removal of the complementary parts barely affects the performance.
Visualization
Apart from achieving more accurate predictions, the key advantage of NMCL over other methods is that it explicitly exhibits the consistent and complementary features. Toward this end, we show examples drawn from our model to visualize the two representation components.
Since the acoustic modality is the hardest one to visualize among the multiple modalities, we utilized concept-level features to represent it. To extract the concepts from the fine-grained acoustic features, we leveraged an external dataset, namely AudioSet, a large-scale dataset released by Google (https://research.google.com/audioset/). The external audio dataset was used only for the visualization.
AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-s sound clips drawn from YouTube videos. The ontology is specified as a hierarchical graph of event categories, covering a wide range of human and animal sounds, musical instruments and genres, and everyday common sounds from the environment, such as "Speech," "Laughter," and "Guitar."
To estimate the concepts in the audio, we employed a VGG-like model [62] and trained it over AudioSet. According to the input format of this CNN model, we regenerated the acoustic features of the micro-videos. The extracted audio is divided into non-overlapping 960 ms frames, and the spectrogram computed from each frame is integrated into 64 mel-spaced frequency bins. Finally, we applied mean pooling over all the frames of the micro-video to yield a new acoustic feature vector.
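The preprocessing described above can be sketched as follows. This is a minimal illustration, assuming a 16 kHz sampling rate and a hypothetical concept_model callable standing in for the VGG-like network trained on AudioSet; it is not the exact pipeline used in our experiments.

    import numpy as np
    import librosa

    def regenerate_acoustic_features(audio_path, concept_model, sr=16000):
        """Split the audio into non-overlapping 960 ms frames, turn each frame
        into a 64-bin log-mel spectrogram, score it with the VGG-like concept
        model, and mean-pool the scores over all frames of the micro-video."""
        y, _ = librosa.load(audio_path, sr=sr, mono=True)
        frame_len = int(0.96 * sr)  # 960 ms, non-overlapping
        frame_scores = []
        for start in range(0, len(y) - frame_len + 1, frame_len):
            frame = y[start:start + frame_len]
            # 64 mel-spaced frequency bins per frame
            mel = librosa.feature.melspectrogram(y=frame, sr=sr, n_mels=64)
            log_mel = librosa.power_to_db(mel)
            frame_scores.append(np.asarray(concept_model(log_mel)).ravel())
        # mean pooling over all frames yields the new acoustic feature vector
        return np.mean(frame_scores, axis=0)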
With the new acoustic conceptual features, we conducted experiments to shed some light
on the correlation between the acoustic modality and the other modalities. In addition, we visu-
alized the attention score matrix between the acoustic concepts and venue categories to validate
our proposed model intuitively.
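Heat maps such as those in Figures 4.18 and 4.19 can be drawn once the correlation or attention scores are exported from the model as a venue-by-concept matrix. The helper below is a generic sketch; the function name, figure size, and color map are illustrative assumptions.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_attention_heatmap(scores, venue_names, concept_names):
        """scores: (num_venues, num_concepts) array of attention (or correlation)
        weights exported from the model; lighter cells denote weaker attention."""
        fig, ax = plt.subplots(figsize=(12, 6))
        im = ax.imshow(scores, cmap="Blues", aspect="auto")
        ax.set_xticks(np.arange(len(concept_names)))
        ax.set_xticklabels(concept_names, rotation=90)
        ax.set_yticks(np.arange(len(venue_names)))
        ax.set_yticklabels(venue_names)
        fig.colorbar(im, ax=ax, label="attention score")
        fig.tight_layout()
        plt.show()

Passing the selected venue categories and acoustic concepts produces a figure in the style of Figure 4.19.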
To visualize the consistent and the complementary components, we selected exemplary demonstrations of two micro-videos categorized as "Park" and "Piazza Place," as shown in Figures 4.18a and 4.18b. For these demonstrations, we treated the acoustic modality as the host part and the visual and textual modalities as the guests, and we showed a heat map to illustrate the correlation between the host and guest feature pairs, where a darker color indicates that the host feature is consistent with the guest modalities and vice versa. From Figure 4.18a, we observe that several acoustic concepts are consistent with the visual and textual modalities, such as "Music" and "Violin," while some are exclusive ones hardly revealed by the other modalities, such as "Applause," "Noise," and "Car alarm." In contrast, given Figure 4.18b, we find that the correlation score distribution is totally different. Concepts such as "Applause," "Crowd," and "Noise" can be represented by the guest features, while "Music" and "Violin" are barely captured in the other modalities. However, these lighter-colored features provide exclusive and discriminative information to predict the venue category. Our proposed model can explicitly capture this exclusive information as a supplement, rather than omitting it during the learning procedure. These observations verify the assumption that the information from different modalities is complementary and demonstrate that our proposed model can explicitly separate the consistent information from the complementary one.

Figure 4.18: Visualization of the correlation scores between the same acoustic concept-level features and different visual and textual features. (a) "Park" example (text: "Street music in Boston"); (b) "Piazza Place" example (text: "Football up high").
To save space, we presented part of the attention matrix via a heat map, where a lighter color indicates weaker attention and vice versa, as shown in Figure 4.19. We can see that every selected venue category relates differently to each acoustic concept. For instance, the micro-videos with the venue "Mall" have strong correlations with "Speech" and "Children shouting," whereas the correlation with "Babbling" is loose. In addition, for the
Figure 4.19: Visualization of the attention scores of acoustic concept-level features and venue category pairs.