where $\mathbf{z}_k$ is a weight vector of the $k$-th venue category, and $\mathbf{m}_k$ can be viewed as the discriminative representation of the $k$-th venue category in the modality $m$. Thereafter, we obtain the probabilistic label vector $\hat{\mathbf{y}}^m = [\hat{y}^m_1, \dots, \hat{y}^m_K]$ over the $K$ venue categories.
For multiple modalities, we fuse the probabilistic label vectors over the three modalities, defined as follows:
$$\hat{\mathbf{y}} = \sum_{m \in \mathcal{M}} \hat{\mathbf{y}}^m. \qquad (4.55)$$
Following that, we adopt a cross-entropy function to minimize the loss between the estimated label vector and its target values, as
$$\mathcal{L}_2 = -\sum_{x \in \mathcal{X}} \sum_{k=1}^{K} y_k \log(\hat{y}_k). \qquad (4.56)$$
Ultimately, this function and the KL divergence of consistent representation pairs are combined as the objective function of our proposed method, as follows:
$$\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2. \qquad (4.57)$$
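To make Eqs. (4.55)-(4.57) concrete, the following is a minimal NumPy sketch of the per-modality probabilistic label vectors, their fusion, and the cross-entropy term for a single micro-video. The variable shapes, the name `l1_kl` for the KL term over consistent representation pairs, and the optional normalization of the fused vector are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_and_score(reps, weights, y_true, l1_kl):
    """Sketch of Eqs. (4.55)-(4.57) for one micro-video.

    reps:    dict modality -> (K, d) discriminative representations m_k
    weights: dict modality -> (K, d) weight vectors z_k (shapes are assumptions)
    y_true:  (K,) one-hot target label vector
    l1_kl:   scalar KL-divergence loss L1 over consistent representation pairs,
             computed earlier in the pipeline
    """
    # Per-modality probabilistic label vector \hat{y}^m over the K venue categories.
    y_hat_m = {m: softmax(np.sum(weights[m] * reps[m], axis=-1)) for m in reps}
    # Eq. (4.55): fuse over modalities (a 1/|M| normalization could be applied
    # so that the fused vector remains a probability distribution).
    y_hat = sum(y_hat_m.values())
    # Eq. (4.56): cross-entropy between the target and the fused prediction.
    l2 = -np.sum(y_true * np.log(y_hat + 1e-12))
    # Eq. (4.57): overall objective.
    return l1_kl + l2
```

In the full model, the cross-entropy term is accumulated over all training micro-videos $x \in \mathcal{X}$ as written in Eq. (4.56).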
4.6.4 EXPERIMENTS
We validated our proposed NMCL model and its components on the micro-video understanding task.
Experiment Settings
In addition to Macro-F1 and Micro-F1, we provide the Receiver Operating Characteristic (ROC) curves of our method and four baselines and use the Area Under the Curve (AUC) score to evaluate the results. The AUC is the area under the ROC curve, which is created by plotting the true positive rate against the false positive rate. We divided our Dataset II into three chunks: 132,370 samples for training, 56,731 for validation, and 81,044 for testing. The training set is used to adjust the model parameters, the validation set provides an unbiased evaluation of the model fit on the training data and is used to tune the hyperparameters, and the testing set is used only to report the final results, confirming the actual predictive power of our model under the optimal parameter settings.
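As a reference for how these metrics can be obtained, the following is a small sketch using scikit-learn; the function name `evaluate` and the assumed shapes of the labels and scores are illustrative, not part of the original implementation.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import label_binarize

def evaluate(y_true, y_score, num_classes):
    """y_true: (n,) integer venue-category ids; y_score: (n, K) predicted scores."""
    y_pred = y_score.argmax(axis=1)
    y_bin = label_binarize(y_true, classes=np.arange(num_classes))
    return {
        "micro_f1": f1_score(y_true, y_pred, average="micro"),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        # Micro-/macro-average AUC from one-vs-rest ROC curves
        # (true positive rate vs. false positive rate).
        "micro_auc": roc_auc_score(y_bin, y_score, average="micro"),
        "macro_auc": roc_auc_score(y_bin, y_score, average="macro"),
    }
```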
We compared the performance of our proposed model with several state-of-the-art baselines.
Early Fusion [119]: For any given micro-video, we concatenated the multimodal features into one vector and then learned a model consisting of three fully connected layers to estimate the venue category over the concatenated vector.
Late Fusion [119]: To calculate the category distributions, we devised classifiers implemented by neural networks with one, two, and three hidden layers for the textual, acoustic, and visual modalities, respectively, and then fused these distributions to yield the final venue category prediction (a minimal sketch of both fusion schemes is given after this list).
EarlyCAtt [64]: This baseline combines early fusion with an attention model. In particular, the attention model assigns different attention weights to the features integrated from multiple modalities according to different venue categories. Here, the attention weights are calculated by a scoring function over the concatenated features and the venue category. After that, a neural network with three fully connected layers is devised to categorize the unseen micro-videos over the attended feature vectors.
LateCAtt [64]: For various venue categories, the features in each modality contribute differently to the final prediction. Therefore, this baseline introduces the attention mechanism into the classifier of each modality to obtain the venue category representations and then fuses these representations to yield the final venue category.
TRUMANN: This is the tree-guided multi-task multi-modal learning method introduced in Section 4.4, which is the first approach toward micro-video venue category estimation. The model jointly learns a common space from multiple modalities and leverages the predefined Foursquare hierarchical structure to regularize the relatedness among venue categories.
DARE [122]: This is a deep transfer model which harnesses external knowledge to enhance the acoustic modality and regularizes the representation learning of micro-videos of the same venue category to alleviate the sparsity problem of unpopular categories.
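For reference, the early- and late-fusion baselines can be sketched with Keras roughly as follows; the hidden-layer sizes, the averaging of the per-modality distributions, and the helper names are assumptions for illustration, not the baselines' exact configurations.

```python
import tensorflow as tf

def early_fusion(dims, num_classes):
    # Concatenate the textual, acoustic, and visual features, then apply
    # three fully connected layers (illustrative sizes).
    inputs = [tf.keras.Input(shape=(d,)) for d in dims]
    x = tf.keras.layers.Concatenate()(inputs)
    for units in (512, 256, 128):
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, out)

def late_fusion(dims, num_classes, depths=(1, 2, 3)):
    # One classifier per modality (1/2/3 hidden layers for the textual, acoustic,
    # and visual features), whose category distributions are averaged at the end.
    inputs, outs = [], []
    for d, depth in zip(dims, depths):
        inp = tf.keras.Input(shape=(d,))
        x = inp
        for _ in range(depth):
            x = tf.keras.layers.Dense(256, activation="relu")(x)
        outs.append(tf.keras.layers.Dense(num_classes, activation="softmax")(x))
        inputs.append(inp)
    fused = tf.keras.layers.Average()(outs)
    return tf.keras.Model(inputs, fused)
```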
We implemented our model with the help of TensorFlow (https://www.tensorflow.org). Particularly, we applied the Xavier approach to initialize the model parameters, which has been proven to be an excellent initialization method for neural network models. The mini-batch size and learning rate were searched in {128, 256, 512} and {0.001, 0.005, 0.01, 0.05, 0.1}, respectively. The optimizer is set as Adaptive Moment Estimation (Adam) [80]. Moreover, we empirically set the size of each hidden layer as 256 and the activation function as ReLU. Without special mention, all the models employ one hidden layer and one prediction layer. For a fair comparison, we initialized the other competitors with an analogous procedure. We report the average results over five rounds of prediction on the testing set.
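The setup above can be mirrored, for instance, by the following Keras sketch of the grid search over mini-batch size and learning rate with Xavier (Glorot) initialization, a 256-unit ReLU hidden layer, a softmax prediction layer, and the Adam optimizer. Here `train_data` and `val_data` are assumed to be `tf.data.Dataset` objects of (feature, one-hot label) pairs, and the epoch count is arbitrary.

```python
import itertools
import tensorflow as tf

BATCH_SIZES = [128, 256, 512]
LEARNING_RATES = [0.001, 0.005, 0.01, 0.05, 0.1]

def build_classifier(input_dim, num_classes):
    # One 256-unit ReLU hidden layer and one softmax prediction layer,
    # both initialized with the Xavier (Glorot) scheme.
    init = tf.keras.initializers.GlorotUniform()
    return tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(256, activation="relu", kernel_initializer=init),
        tf.keras.layers.Dense(num_classes, activation="softmax", kernel_initializer=init),
    ])

def grid_search(train_data, val_data, input_dim, num_classes):
    best_cfg, best_acc = None, -1.0
    for bs, lr in itertools.product(BATCH_SIZES, LEARNING_RATES):
        model = build_classifier(input_dim, num_classes)
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                      loss="categorical_crossentropy", metrics=["accuracy"])
        model.fit(train_data.batch(bs), epochs=10, verbose=0)
        _, acc = model.evaluate(val_data.batch(bs), verbose=0)
        if acc > best_acc:
            best_cfg, best_acc = (bs, lr), acc
    return best_cfg, best_acc
```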
Performance Comparison
The comparative results are shown in Table 4.7 and Figure 4.16. From the table and the figure, we have the following observations:

1. In terms of Micro-F1, Early Fusion and Late Fusion achieve the worst performance, since these standard fusion approaches rarely exploit the correlations between different modalities.
Table 4.7: Performance comparison between our model and the baselines (p-value*: p-value over Micro-F1)

                Micro-F1           Macro-F1           P-value*
Early Fusion    11.39 ± 0.01%       0.12 ± 0.01%      1.31e-8
Late Fusion     12.57 ± 0.23%       0.20 ± 0.04%      4.29e-9
Early + Att     31.24 ± 0.37%      14.03 ± 0.19%      2.48e-8
Late + Att      30.00 ± 0.31%      13.71 ± 0.51%      1.52e-7
TRUMANN         27.38 ± 0.21%      10.87 ± 0.05%      8.71e-8
DARE            34.40 ± 0.32%      20.21 ± 0.35%      5.94e-7
NMCL            40.04 ± 0.37%      26.78 ± 0.42%      —

* In statistical hypothesis testing, the p-value is the probability, for a given statistical model and under the null hypothesis, that the statistical summary would be greater than or equal to the actual observed results.
Figure 4.16: ROC curves (true positive rate vs. false positive rate) and AUC scores of the methods: (a) ROC_EarlyAtt (micro-average AUC = 0.9487, macro-average AUC = 0.8554); (b) ROC_LateAtt (0.9383, 0.8096); (c) ROC_TRUMANN (0.9438, 0.8339); (d) ROC_DARE (0.9512, 0.8622); (e) ROC_NMCL (0.9609, 0.8894).
2. Integrating the attention model into the standard fusion improves the performance noticeably. Taking advantage of the attention mechanism, EarlyCAtt and LateCAtt can dynamically select the discriminative features tailored to the prediction task. This verifies the feasibility of re-weighting each feature.
3. TRUMANN outperforms Early Fusion and Late Fusion on the estimation task. This is reasonable since it considers the hierarchical structure of venue categories and employs multi-task learning. Meanwhile, EarlyCAtt and LateCAtt outperform TRUMANN, which again confirms the effectiveness of assigning attentive weights to the features.
4. The performance of DARE exceeds that of the other baselines, indicating that DARE benefits from the acoustic modality enhanced via an external dataset and alleviates the sparsity problem of unpopular categories by regularizing the similarity among the categories.
5. Our proposed model achieves the best results w.r.t. Micro-F1 and Macro-F1. By exploiting the consistency and complementarity of features, our model achieves better expressiveness than all baselines. While DARE and TRUMANN treat all features linearly, independently, and equally, our model captures and leverages the correlation between different modalities, and employs the attention networks to identify the tailored attention weight of each feature. We further conducted a pairwise significance test to verify that all improvements are statistically significant with p-value < 0.05 (a sketch of such a test is given after this list).
6. As shown in Figure 4.16, NMCL achieves a micro-average AUC score of more than 96% and is superior to the baselines, further demonstrating the effectiveness of our proposed method. Although DARE yields an AUC score of 95.12% and ranks second-best among all the methods, our proposed method outperforms it by a gain of about 1%. Besides, in terms of the macro-average ROC curve, NMCL achieves an AUC score of about 89%, which is 3-10% higher than those of the baselines.
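The pairwise significance test mentioned in observation 5 could, for example, be a paired t-test over the per-round Micro-F1 scores, as sketched below with SciPy; the choice of the paired t-test and the example inputs are assumptions, since the text only reports the resulting p-values.

```python
from scipy import stats

def paired_significance(ours, baseline):
    """Paired t-test over matched evaluation scores (e.g., the five prediction rounds).

    Improvements are regarded as statistically significant when the returned
    p-value is below 0.05.
    """
    _, p_value = stats.ttest_rel(ours, baseline)
    return p_value

# Hypothetical usage with per-round Micro-F1 scores:
# p = paired_significance([0.401, 0.399, 0.404, 0.398, 0.400],
#                         [0.345, 0.342, 0.347, 0.341, 0.344])
```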
Study of NMCL Model
We studied the effectiveness of combining different modalities. Tables 4.8 and 4.9 show the performance of different modality pairs and of each enhanced modality with our proposed model, respectively. In addition, we plotted the Macro-F1 and Micro-F1 w.r.t. the number of iterations in Figure 4.17 to illustrate the convergence and efficiency of our model for each modality. From these tables and the figure, we observed the following.
1. Among the single modalities in Table 4.8, solely considering the visual modality achieves the best performance compared to the other mono-modal estimation methods. This is consistent with the findings in [122, 192], verifying the rich geographic information conveyed by the visual features. In addition, the CNN features are capable of capturing the prominent visual characteristics of the venue categories.
2. The acoustic modality and the textual modality perform similarly in estimating the venue categories, as also listed in Table 4.8.
Table 4.8: Representativeness of different modalities

                     Micro-F1           Macro-F1
Textual              13.40 ± 0.14%       2.23 ± 0.10%
Acoustic             14.21 ± 0.12%       3.40 ± 0.02%
Visual               28.16 ± 0.23%      11.22 ± 0.41%
Acoustic + Textual   20.57 ± 0.41%       7.08 ± 0.09%
Visual + Textual     38.45 ± 0.34%      23.83 ± 0.34%
Visual + Acoustic    37.07 ± 0.35%      23.34 ± 0.11%
All                  40.04 ± 0.37%      26.78 ± 0.42%
Table 4.9: Performance of each enhanced modality in different modality pairs (V-Micro-F1, A-Micro-F1, and T-Micro-F1 denote the Micro-F1 score on the visual, acoustic, and textual modality, respectively)

                     V-Micro-F1         A-Micro-F1         T-Micro-F1
Acoustic + Textual   —                  20.12 ± 0.15%      20.13 ± 0.14%
Visual + Textual     37.46 ± 0.26%      —                  37.75 ± 0.36%
Visual + Acoustic    35.09 ± 0.15%      34.80 ± 0.16%      —
All                  36.07 ± 0.28%      35.27 ± 0.17%      33.73 ± 0.51%
Figure 4.17: Convergence and effectiveness study of the NMCL: (a) Micro-F1 and (b) Macro-F1 performance w.r.t. the number of iterations for the All, Visual, Acoustic, and Textual settings.
Using only one modality, however, is insufficient to estimate the categories for most micro-videos, since the textual and acoustic information is noisy, sparse, and even irrelevant to the venue categories.
3. The more modalities we incorporate, the better performance we achieve, as shown in the lower rows of Tables 4.8 and 4.9. This implies that the information from one modality is insufficient and that multiple modalities complement rather than conflict with each other, echoing the old saying that "two heads are better than one."
4. Table 4.9 shows that the performance of each modality enhanced by our proposed approach improves noticeably, especially when the acoustic and textual modalities are combined with the visual modality. This improvement validates that each modality can be reinforced by the other modalities in our model.