86 4. MULTIMODAL COOPERATIVE LEARNING
Table 4.4: Comparison between models with CNN-based dictionary learning and our dictionary learning for the venue category estimation (P-value*: p-value over accuracy)

Models      Accuracy        Micro-F1        P-value*
DPL         4.64 ± 0.24%    4.87 ± 0.28%    3.86e-08
INTIMATE    6.28 ± 0.08%    6.60 ± 0.09%
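The P-value column compares each baseline with INTIMATE over the repeated train/test rounds. The book does not spell out which significance test is used; as one hedged illustration, a two-sided permutation test over per-round accuracies could be sketched like this (the function name and inputs are hypothetical):

```python
import random

def permutation_p_value(acc_a, acc_b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference of mean accuracies.

    acc_a, acc_b: per-round accuracy lists (e.g., the ten
    training/testing rounds used in the experiments).
    Returns the fraction of label shuffles whose mean difference
    is at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(acc_a) / len(acc_a) - sum(acc_b) / len(acc_b))
    pooled = list(acc_a) + list(acc_b)  # work on a copy; inputs stay intact
    n_a = len(acc_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            count += 1
    # add-one smoothing so the p-value is never exactly zero
    return (count + 1) / (n_perm + 1)
```

With clearly separated accuracy samples, such as the DPL and INTIMATE rows above, the returned p-value is very small; with identical samples it is 1.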
Effectiveness of the Tree Structure
We argued that encoding the tree structure to constrain the sparse representation learning can
strengthen the representation discrimination. In this part, we carried out experiments to verify
the effectiveness of the tree structure from quantitative and qualitative aspects.
Quantitative Analysis: To show the effect of the tree structure on the sparse representation learning, we compared it with a flat model without the tree structure, dubbed INTIMATE⁻:

\min_{\mathbf{D},\mathbf{A}} \; \frac{1}{2}\sum_{m=1}^{M}\left\|\mathbf{X}^{m}-\mathbf{D}^{m}\mathbf{A}^{m}\right\|_{F}^{2} + \frac{\lambda_{1}}{2}\sum_{m=1}^{M}\sum_{c\in\mathcal{C}}\left\|\mathbf{A}_{c}^{m}\right\|_{2,1} + \frac{\lambda_{2}}{2}\sum_{m=1}^{M}\left\|\mathbf{A}^{m}\right\|_{F}^{2}, \quad \text{s.t.}\;\left\|\mathbf{d}_{k}^{m}\right\|\le 1,\;\forall k,m, \tag{4.38}
where C is the set of categories. We only took the leaf nodes into consideration, and did not
consider the hierarchical tree structure to regularize the representation learning. To ensure a fair
comparison, we trained INTIMATE and INTIMATE⁻ over the same offline training set and reported the final results over the testing set. Analogous to the other experiments, we repeated this one over ten rounds of training/testing data sampling.
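For concreteness, the objective of the flat INTIMATE⁻ variant in Eq. (4.38) can be evaluated for a single modality with a short numpy sketch. The helper names, the group encoding, and the hyperparameter values lam1 and lam2 are illustrative assumptions, not the book's implementation:

```python
import numpy as np

def l21_norm(A):
    """ℓ2,1 norm: sum of the ℓ2 norms of the rows of A."""
    return np.sqrt((A ** 2).sum(axis=1)).sum()

def intimate_flat_objective(X, D, A, groups, lam1=0.1, lam2=0.01):
    """Evaluate the flat (tree-free) objective of Eq. (4.38) for one modality.

    X: feature matrix (d x n), D: dictionary (d x k), A: sparse codes (k x n).
    groups: dict mapping each leaf category c to the column indices of its
    micro-videos in A (a hypothetical encoding of the category sets).
    lam1, lam2: assumed trade-off hyperparameters.
    """
    recon = 0.5 * np.linalg.norm(X - D @ A, 'fro') ** 2
    group_sparsity = lam1 * sum(l21_norm(A[:, idx]) for idx in groups.values())
    ridge = 0.5 * lam2 * np.linalg.norm(A, 'fro') ** 2
    return recon + group_sparsity + ridge
```

The sum over modalities m in Eq. (4.38) would simply add one such term per modality; the ℓ2,1 term encourages the codes of micro-videos in the same leaf category to activate a shared subset of dictionary atoms.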
The experimental results are shown in Table 4.5. From this table, it can be seen that, compared to INTIMATE, the performance of INTIMATE⁻ drops significantly regarding the accuracy and micro-F1 metrics. This is because the INTIMATE⁻ baseline does take the class label information into consideration and learns category-aware sparse representations for each micro-video; however, it completely ignores the hierarchical relatedness among categories. This further justifies the usefulness of encoding the tree structure to learn the sparse representations.
Table 4.5: Comparison between models with and without structure information for the venue category estimation on Dataset II (P-value*: p-value over accuracy)

Models      Accuracy        Micro-F1        P-value*
INTIMATE⁻   3.00 ± 0.04%    3.18 ± 0.05%    6.86e-15
INTIMATE    6.28 ± 0.08%    6.60 ± 0.09%
4.5. MULTIMODAL COMPLEMENTARY LEARNING 87
Qualitative Analysis: Apart from the quantitative analysis, we also conducted the qual-
itative one to intuitively show the effects of the hierarchical tree. As illustrated in Figure 4.10,
we selected an internal node (i.e., “Outdoors & Recreation”) and descendants from the given
tree. It is worth noting that, to save display space, we deliberately selected an internal node close to the leaf ones. For each leaf node under the selected “Outdoors & Recreation” node, we randomly sampled a few testing micro-videos categorized by our model for demonstration.
[Figure 4.10 depicts the internal node “Outdoors & Recreation” with the leaf nodes “Basketball Court” and “Baseball Field” (under “Athletics & Sports”) and “Beach,” together with sampled micro-videos: v1 (NBA@LosAngeles) and v2 (NBA@Toronto) under “Basketball Court,” v3 (Baseball@Rogers Center) under “Baseball Field,” and v4 (Beach@Miami) under “Beach.”]
Figure 4.10: Qualitative analysis of the structure effectiveness. We can see that visually similar
micro-videos have geographically close categories.
In Figure 4.10, in comparison to v1 and v3, we can see that v1 and v2 are in the same leaf node “Basketball Court”; they share most of the visual concepts and are the most visually similar pair. Meanwhile, as compared to v1 and v4, we can see that v1 and v3 are visually closer. Therefore, we can conclude that the geographically closer the venue categories of two micro-videos are, the more visually similar the micro-videos will be. This observation strongly supports our assumption of hierarchical smoothness.
Justification of Modality Combination
We also studied the performance of our model with different modality combinations. The results are summarized in Table 4.6. It can be seen that: (1) the visual modality outperforms the acoustic and textual ones. This is because the visual modality conveys venue information more intuitively than the acoustic and textual ones; (2) combining the visual and acoustic modalities outperforms combining the visual and textual modalities. This reflects that the acoustic modality conveys more important cues on venue categories than the textual one does, since the textual descriptions are low-quality, noisy, incomplete, sparse, and sometimes even irrelevant to the venue categories; (3) a single modality is insufficient to estimate the venue category, but combining modalities largely enhances the performance; and (4) our proposed INTIMATE achieves the best performance with all three modalities. This further justifies that the modalities are complementary instead of conflicting.
Table 4.6: Performance of our proposed INTIMATE model with different modality combinations on Dataset II (P-value*: p-value over accuracy)

Modality        Accuracy        Micro-F1        P-value*
Visual          5.02 ± 0.14%    5.30 ± 0.18%    1.31e-07
Audio           4.78 ± 0.13%    5.08 ± 0.17%    8.96e-09
Text            4.61 ± 0.12%    4.96 ± 0.14%    8.68e-10
Visual + Audio  5.46 ± 0.18%    5.65 ± 0.18%    8.92e-05
Visual + Text   5.15 ± 0.24%    5.54 ± 0.18%    7.47e-06
Audio + Text    5.05 ± 0.14%    5.48 ± 0.24%    1.90e-07
All             6.28 ± 0.08%    6.60 ± 0.09%
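Each row of Table 4.6 can be read as an ablation over which per-modality representations are combined before venue classification. As a hedged sketch, with concatenation standing in for the book's actual multimodal combination mechanism:

```python
import numpy as np

def fuse_modalities(reps, selected):
    """Stack the representations of the selected modalities.

    reps: dict mapping a modality name to its (k x n) representation
    matrix (one column per micro-video); the names and shapes here are
    illustrative assumptions.
    selected: modality names to combine, e.g. ('visual', 'audio').
    """
    missing = [m for m in selected if m not in reps]
    if missing:
        raise KeyError(f'unknown modalities: {missing}')
    # Concatenate along the feature axis; columns (micro-videos) align.
    return np.vstack([reps[m] for m in selected])
```

Running the same downstream classifier on `fuse_modalities(reps, ('visual',))`, `('visual', 'audio')`, and all three keys would reproduce the kind of ablation grid shown in Table 4.6.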
[Figure 4.11: three heat maps of sparse codes, (a) DDL, (b) Offline, (c) Offline + Online.]
Figure 4.11: Visualizing the sparse representations of some micro-videos. The y-axis denotes the atom IDs and the x-axis stands for the micro-video exemplar IDs. Brighter bars refer to higher weights.
Visualization of Sparse Representation
We visualized the representations of some examples. In particular, we first randomly selected two internal nodes that are far apart in the tree. Following that, we randomly selected 10 micro-videos under each of the two nodes from the testing set. We then visualized their sparse representations of the visual modality, as shown in Figure 4.11. Video IDs 1–10 on the x-axis are from one venue category, and IDs 11–20 are from another. We can see that: (1) the sparse representations achieved by DDL in Figure 4.11a are somewhat independent and not very sparse; (2) the sparse representations based on our INTIMATE model trained on the offline training data are sparser, as shown in Figure 4.11b; and (3) after being enhanced by the online data, samples from the same node often sparsely share the same set of atoms, as Figure 4.11c shows.
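These qualitative observations can also be quantified. Below is a small numpy sketch, with a hypothetical zero-threshold, that measures how sparse a code matrix is and which atoms are shared by all samples of a node:

```python
import numpy as np

def sparsity(A, tol=1e-3):
    """Fraction of near-zero entries in the code matrix A (atoms x samples)."""
    return float((np.abs(A) < tol).mean())

def shared_support(A, tol=1e-3):
    """Indices of atoms (rows) that are active in every column of A,
    i.e., atoms shared by all samples of the node."""
    active = np.abs(A) >= tol
    return np.flatnonzero(active.all(axis=1))
```

A higher `sparsity` score and a non-empty `shared_support` for the columns of one node would correspond to the pattern seen in Figure 4.11c, where samples from the same node light up the same atom rows.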
Categorization Examples
To gain deeper insights into our proposed INTIMATE model, we illustrate two categorization results of micro-videos in Figure 4.12.
The top micro-video in Figure 4.12 contains a band and colorful lights, and we can hear music and singing from its acoustic channel. Obviously, the venue category of this one is “Concert Hall.”
[Figure 4.12: Top micro-video: text description “#mattyhealy #fillmoredetroit #the 1975 #detroit.”; ground truth: Concert Hall; predicted venue: Concert Hall. Bottom micro-video: text description “kayaking!”; ground truth: Garden; predicted venue: Garden.]
Figure 4.12: Exemplars of micro-video categorization.