86 4. MULTIMODAL COOPERATIVE LEARNING
Table 4.4: Comparison between models with CNN-based dictionary learning and our dictionary learning for the venue category estimation (P-value*: p-value over accuracy)

Models      Accuracy        Micro-F1        P-value*
DPL         4.64 ± 0.24%    4.87 ± 0.28%    3.86e-08
INTIMATE    6.28 ± 0.08%    6.60 ± 0.09%
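The P-value column compares each baseline with INTIMATE over the repeated train/test rounds. The book does not spell out which significance test is used; as one hedged illustration, a two-sided permutation test over per-round accuracies could be sketched like this (the function name and inputs are hypothetical):

```python
import random

def permutation_p_value(acc_a, acc_b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference of mean accuracies.

    acc_a, acc_b: per-round accuracy lists (e.g., the ten
    training/testing rounds used in the experiments).
    Returns the fraction of label shuffles whose mean difference
    is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(acc_a) / len(acc_a) - sum(acc_b) / len(acc_b))
    pooled = list(acc_a) + list(acc_b)  # work on a copy; inputs stay intact
    n_a = len(acc_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            count += 1
    # add-one smoothing so the p-value is never exactly zero
    return (count + 1) / (n_perm + 1)
```

With clearly separated accuracy samples, such as the DPL and INTIMATE rows above, the returned p-value is very small; with identical samples it is 1.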
Effectiveness of the Tree Structure
We argued that encoding the tree structure to constrain the sparse representation learning can
strengthen the representation discrimination. In this part, we carried out experiments to verify
the effectiveness of the tree structure from quantitative and qualitative aspects.
Quantitative Analysis: To show the effect of the tree structure on the sparse representation learning, we compared it with a flat model without the tree structure, dubbed INTIMATE⁻:

\min_{\mathbf{D},\mathbf{A}} \; \frac{1}{2}\sum_{m=1}^{M}\left\|\mathbf{X}^{m}-\mathbf{D}^{m}\mathbf{A}^{m}\right\|_{F}^{2} + \frac{\lambda_{1}}{2}\sum_{m=1}^{M}\sum_{c\in\mathcal{C}}\left\|\mathbf{A}_{c}^{m}\right\|_{2,1} + \frac{\lambda_{2}}{2}\sum_{m=1}^{M}\left\|\mathbf{A}^{m}\right\|_{F}^{2}, \quad \text{s.t.}\;\left\|\mathbf{d}_{k}^{m}\right\|\le 1,\;\forall k,m, \tag{4.38}
where C is the set of categories. We only took the leaf nodes into consideration, and did not
consider the hierarchical tree structure to regularize the representation learning. To ensure a fair
comparison, we trained INTIMATE and INTIMATE⁻ over the same offline training set and reported the final results over the testing set. Analogous to the other experiments, we repeated this one over ten rounds of training/testing data sampling.
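For concreteness, the objective of the flat INTIMATE⁻ variant in Eq. (4.38) can be evaluated for a single modality with a short numpy sketch. The helper names, the group encoding, and the hyperparameter values lam1 and lam2 are illustrative assumptions, not the book's implementation:

```python
import numpy as np

def l21_norm(A):
    """ℓ2,1 norm: sum of the ℓ2 norms of the rows of A."""
    return np.sqrt((A ** 2).sum(axis=1)).sum()

def intimate_flat_objective(X, D, A, groups, lam1=0.1, lam2=0.01):
    """Evaluate the flat (tree-free) objective of Eq. (4.38) for one modality.

    X: feature matrix (d x n), D: dictionary (d x k), A: sparse codes (k x n).
    groups: dict mapping each leaf category c to the column indices of its
    micro-videos in A (a hypothetical encoding of the category sets).
    lam1, lam2: assumed trade-off hyperparameters.
    """
    recon = 0.5 * np.linalg.norm(X - D @ A, 'fro') ** 2
    group_sparsity = lam1 * sum(l21_norm(A[:, idx]) for idx in groups.values())
    ridge = 0.5 * lam2 * np.linalg.norm(A, 'fro') ** 2
    return recon + group_sparsity + ridge
```

The sum over modalities m in Eq. (4.38) would simply add one such term per modality; the ℓ2,1 term encourages the codes of micro-videos in the same leaf category to activate a shared subset of dictionary atoms.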
The experimental results are shown in Table 4.5. From this table, it can be seen that, compared to INTIMATE, the performance of INTIMATE⁻ drops significantly regarding the accuracy and micro-F1 metrics. This is because the INTIMATE⁻ baseline does take the class label information into consideration and learns category-aware sparse representations for each micro-video; however, it completely ignores the hierarchical relatedness among categories. This further justifies the usefulness of encoding the tree structure to learn the sparse representations.
Table 4.5: Comparison between models with and without structure information for the venue category estimation on Dataset II (P-value*: p-value over accuracy)

Models      Accuracy        Micro-F1        P-value*
INTIMATE⁻   3.00 ± 0.04%    3.18 ± 0.05%    6.86e-15
INTIMATE    6.28 ± 0.08%    6.60 ± 0.09%
4.5. MULTIMODAL COMPLEMENTARY LEARNING 87
Qualitative Analysis: Apart from the quantitative analysis, we also conducted the qual-
itative one to intuitively show the effects of the hierarchical tree. As illustrated in Figure 4.10,
we selected an internal node (i.e., “Outdoors & Recreation”) and descendants from the given
tree. It is worth noting that, to save display space, we deliberately selected an internal node close to the leaf ones. For each leaf node under the selected “Outdoors & Recreation” node, we randomly sampled a few testing micro-videos categorized by our model for demonstration.
[Figure 4.10 depicts the internal node “Outdoors & Recreation” with the leaf nodes “Basketball Court” and “Baseball Field” (under “Athletics & Sports”) and “Beach,” together with sampled micro-videos: v1 (NBA@LosAngeles) and v2 (NBA@Toronto) under “Basketball Court,” v3 (Baseball@Rogers Center) under “Baseball Field,” and v4 (Beach@Miami) under “Beach.”]
Figure 4.10: Qualitative analysis of the structure effectiveness. We can see that visually similar
micro-videos have geographically close categories.
In Figure 4.10, in comparison to v1 and v3, we can see that v1 and v2 are in the same leaf node “Basketball Court”; they share most of the visual concepts and are the most visually similar pair. Meanwhile, as compared to v1 and v4, we can see that v1 and v3 are visually closer. Therefore, we can conclude that the geographically closer the venue categories of two micro-videos are, the more visually similar the micro-videos will be. This observation strongly supports our assumption of hierarchical smoothness.
Justification of Modality Combination
We also studied the performance of our model with different modality combinations. The results are summarized in Table 4.6. It can be seen that: (1) the visual modality outperforms the acoustic and textual ones. This is because the visual modality conveys venue information more intuitively than the acoustic and textual ones; (2) combining the visual and acoustic modalities outperforms combining the visual and textual modalities. This reflects that the acoustic modality conveys more important cues on venue categories than the textual one does, since the textual descriptions are low-quality, noisy, incomplete, sparse, and sometimes even irrelevant to the venue categories; (3) a single modality is insufficient to estimate the venue category, but combining modalities largely enhances the performance; and (4) our proposed INTIMATE achieves the best performance with all three modalities. This further justifies that the modalities are complementary instead of conflicting.
Table 4.6: Performance of our proposed INTIMATE model with different modality combinations on Dataset II (P-value*: p-value over accuracy)

Modality        Accuracy        Micro-F1        P-value*
Visual          5.02 ± 0.14%    5.30 ± 0.18%    1.31e-07
Audio           4.78 ± 0.13%    5.08 ± 0.17%    8.96e-09
Text            4.61 ± 0.12%    4.96 ± 0.14%    8.68e-10
Visual + Audio  5.46 ± 0.18%    5.65 ± 0.18%    8.92e-05
Visual + Text   5.15 ± 0.24%    5.54 ± 0.18%    7.47e-06
Audio + Text    5.05 ± 0.14%    5.48 ± 0.24%    1.90e-07
All             6.28 ± 0.08%    6.60 ± 0.09%
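Each row of Table 4.6 can be read as an ablation over which per-modality representations are combined before venue classification. As a hedged sketch, with concatenation standing in for the book's actual multimodal combination mechanism:

```python
import numpy as np

def fuse_modalities(reps, selected):
    """Stack the representations of the selected modalities.

    reps: dict mapping a modality name to its (k x n) representation
    matrix (one column per micro-video); the names and shapes here are
    illustrative assumptions.
    selected: modality names to combine, e.g. ('visual', 'audio').
    """
    missing = [m for m in selected if m not in reps]
    if missing:
        raise KeyError(f'unknown modalities: {missing}')
    # Concatenate along the feature axis; columns (micro-videos) align.
    return np.vstack([reps[m] for m in selected])
```

Running the same downstream classifier on `fuse_modalities(reps, ('visual',))`, `('visual', 'audio')`, and all three keys would reproduce the kind of ablation grid shown in Table 4.6.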
[Figure 4.11: three heat maps of sparse codes, (a) DDL, (b) Offline, (c) Offline + Online.]
Figure 4.11: Visualizing the sparse representations of some micro-videos. The y-axis denotes the atom IDs and the x-axis stands for the micro-video exemplar IDs. Brighter bars refer to higher weights.
Visualization of Sparse Representation
We visualized the representations of some examples. In particular, we first randomly selected two internal nodes that are far apart in the tree. Following that, we randomly selected 10 micro-videos under each of the two nodes from the testing set. We then visualized their sparse representations of the visual modality, as shown in Figure 4.11. Video IDs 1–10 on the x-axis are from one venue category, and IDs 11–20 are from another. We can see that: (1) the sparse representations achieved by DDL in Figure 4.11a are somewhat independent and not very sparse; (2) the sparse representations based on our INTIMATE model trained on the offline training data are sparser, as shown in Figure 4.11b; and (3) after being enhanced by the online data, samples from the same node often sparsely share the same set of atoms, as Figure 4.11c shows.
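These qualitative observations can also be quantified. Below is a small numpy sketch, with a hypothetical zero-threshold, that measures how sparse a code matrix is and which atoms are shared by all samples of a node:

```python
import numpy as np

def sparsity(A, tol=1e-3):
    """Fraction of near-zero entries in the code matrix A (atoms x samples)."""
    return float((np.abs(A) < tol).mean())

def shared_support(A, tol=1e-3):
    """Indices of atoms (rows) that are active in every column of A,
    i.e., atoms shared by all samples of the node."""
    active = np.abs(A) >= tol
    return np.flatnonzero(active.all(axis=1))
```

A higher `sparsity` score and a non-empty `shared_support` for the columns of one node would correspond to the pattern seen in Figure 4.11c, where samples from the same node light up the same atom rows.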
Categorization Examples
To gain deeper insights into our proposed INTIMATE model, we illustrate two categorization results of micro-videos in Figure 4.12.
The top micro-video in Figure 4.12 contains a band and colorful lights, and we can hear music and singing from its acoustic channel. Obviously, the venue category of this one is “Concert Hall.”
[Figure 4.12: Top micro-video: text description “#mattyhealy #fillmoredetroit #the 1975 #detroit.”; ground truth: Concert Hall; predicted venue: Concert Hall. Bottom micro-video: text description “kayaking!”; ground truth: Garden; predicted venue: Garden.]
Figure 4.12: Exemplars of micro-video categorization.