where $\mathbf{z}_k$ is a weight vector of the $k$-th venue category, and $\mathbf{m}_k$ can be viewed as the discriminative representation of the $k$-th venue category in the modality $m$. Thereafter, we obtain the probabilistic label vector $\hat{\mathbf{y}}^m = [\hat{y}^m_1, \dots, \hat{y}^m_K]$ over the $K$ venue categories.
For multiple modalities, we fuse the probabilistic label vectors over the three modalities, defined as follows:
$$\hat{\mathbf{y}} = \sum_{m \in \mathcal{M}} \hat{\mathbf{y}}^m. \qquad (4.55)$$
Following that, we adopt a cross-entropy function to minimize the loss between the estimated label vector and its target values, as
$$\mathcal{L}_2 = -\sum_{x \in \mathcal{X}} \sum_{k=1}^{K} y_k \log(\hat{y}_k). \qquad (4.56)$$
Ultimately, this function and the KL divergence of consistent representation pairs are combined as the objective function of our proposed method, as follows:
$$\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2. \qquad (4.57)$$
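To make Eqs. (4.55)-(4.57) concrete, the following is a minimal NumPy sketch of the per-modality probabilistic label vectors, their fusion, and the cross-entropy term for a single micro-video. The variable shapes, the name `l1_kl` for the KL term over consistent representation pairs, and the optional normalization of the fused vector are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_and_score(reps, weights, y_true, l1_kl):
    """Sketch of Eqs. (4.55)-(4.57) for one micro-video.

    reps:    dict modality -> (K, d) discriminative representations m_k
    weights: dict modality -> (K, d) weight vectors z_k (shapes are assumptions)
    y_true:  (K,) one-hot target label vector
    l1_kl:   scalar KL-divergence loss L1 over consistent representation pairs,
             computed earlier in the pipeline
    """
    # Per-modality probabilistic label vector \hat{y}^m over the K venue categories.
    y_hat_m = {m: softmax(np.sum(weights[m] * reps[m], axis=-1)) for m in reps}
    # Eq. (4.55): fuse over modalities (a 1/|M| normalization could be applied
    # so that the fused vector remains a probability distribution).
    y_hat = sum(y_hat_m.values())
    # Eq. (4.56): cross-entropy between the target and the fused prediction.
    l2 = -np.sum(y_true * np.log(y_hat + 1e-12))
    # Eq. (4.57): overall objective.
    return l1_kl + l2
```

In the full model, the cross-entropy term is accumulated over all training micro-videos $x \in \mathcal{X}$ as written in Eq. (4.56).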
4.6.4 EXPERIMENTS
We validated our proposed NMCL model and its components on the micro-video understanding task.
Experiment Settings
In addition to Macro-F1 and Micro-F1, we provide the Receiver Operating Characteristic (ROC) curves of our method and four baselines and use the Area Under the Curve (AUC) score to evaluate the results. The AUC is the area under the ROC curve, which is created by plotting the true positive rate against the false positive rate. We divided our Dataset II into three chunks: 132,370 samples for training, 56,731 for validation, and 81,044 for testing. The training set is used to adjust the model parameters, the validation set provides an unbiased evaluation of the model fit on the training data and is used to tune the hyperparameters, and the testing set is used only to report the final results, confirming the actual predictive power of our model under the optimal parameter settings.
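As a reference for how these metrics can be obtained, the following is a small sketch using scikit-learn; the function name `evaluate` and the assumed shapes of the labels and scores are illustrative, not part of the original implementation.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import label_binarize

def evaluate(y_true, y_score, num_classes):
    """y_true: (n,) integer venue-category ids; y_score: (n, K) predicted scores."""
    y_pred = y_score.argmax(axis=1)
    y_bin = label_binarize(y_true, classes=np.arange(num_classes))
    return {
        "micro_f1": f1_score(y_true, y_pred, average="micro"),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        # Micro-/macro-average AUC from one-vs-rest ROC curves
        # (true positive rate vs. false positive rate).
        "micro_auc": roc_auc_score(y_bin, y_score, average="micro"),
        "macro_auc": roc_auc_score(y_bin, y_score, average="macro"),
    }
```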
We compared the performance of our proposed model with several state-of-the-art baselines.
Early Fusion [119]: For any given micro-video, we concatenated the multimodal features into one vector and then learned a model consisting of three fully connected layers to estimate the venue category over the concatenated vector.
Late Fusion [119]: To calculate the category distributions, we devised classifiers implemented by neural networks with one, two, and three hidden layers for the textual, acoustic, and visual modalities, respectively, and then fused these distributions to yield the final venue category prediction (a minimal sketch of both fusion schemes is given after this list).
EarlyCAtt [64]: This baseline combines early fusion with an attention model. In particular, the attention model assigns different attention weights to the features integrated from multiple modalities according to different venue categories. Here, the attention weights are calculated by a scoring function over the concatenated features and the venue category. After that, a neural network with three fully connected layers is devised to categorize the unseen micro-videos over the attended feature vectors.
LateCAtt [64]: For various venue categories, the features in each modality contribute differently to the final prediction. Therefore, this baseline introduces the attention mechanism into the classifier of each modality to obtain the venue category representations and then fuses these representations to yield the final venue category.
TRUMANN: This is the tree-guided multi-task multi-modal learning method introduced in Section 4.4, which is the first approach toward micro-video venue category estimation. The model jointly learns a common space from multiple modalities and leverages the predefined Foursquare hierarchical structure to regularize the relatedness among venue categories.
DARE [122]: This is a deep transfer model which harnesses external knowledge to enhance the acoustic modality and regularizes the representation learning of micro-videos of the same venue category to alleviate the sparsity problem of unpopular categories.
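For reference, the early- and late-fusion baselines can be sketched with Keras roughly as follows; the hidden-layer sizes, the averaging of the per-modality distributions, and the helper names are assumptions for illustration, not the baselines' exact configurations.

```python
import tensorflow as tf

def early_fusion(dims, num_classes):
    # Concatenate the textual, acoustic, and visual features, then apply
    # three fully connected layers (illustrative sizes).
    inputs = [tf.keras.Input(shape=(d,)) for d in dims]
    x = tf.keras.layers.Concatenate()(inputs)
    for units in (512, 256, 128):
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, out)

def late_fusion(dims, num_classes, depths=(1, 2, 3)):
    # One classifier per modality (1/2/3 hidden layers for the textual, acoustic,
    # and visual features), whose category distributions are averaged at the end.
    inputs, outs = [], []
    for d, depth in zip(dims, depths):
        inp = tf.keras.Input(shape=(d,))
        x = inp
        for _ in range(depth):
            x = tf.keras.layers.Dense(256, activation="relu")(x)
        outs.append(tf.keras.layers.Dense(num_classes, activation="softmax")(x))
        inputs.append(inp)
    fused = tf.keras.layers.Average()(outs)
    return tf.keras.Model(inputs, fused)
```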
We implemented our model with the help of TensorFlow (https://www.tensorflow.org). Particularly, we applied the Xavier approach to initialize the model parameters, which has been proven to be an excellent initialization method for neural network models. The mini-batch size and learning rate were searched in {128, 256, 512} and {0.001, 0.005, 0.01, 0.05, 0.1}, respectively. The optimizer is set as Adaptive Moment Estimation (Adam) [80]. Moreover, we empirically set the size of each hidden layer as 256 and the activation function as ReLU. Without special mention, all the models employ one hidden layer and one prediction layer. For a fair comparison, we initialized the other competitors with an analogous procedure. We report the average results over five rounds of prediction on the testing set.
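The setup above can be mirrored, for instance, by the following Keras sketch of the grid search over mini-batch size and learning rate with Xavier (Glorot) initialization, a 256-unit ReLU hidden layer, a softmax prediction layer, and the Adam optimizer. Here `train_data` and `val_data` are assumed to be `tf.data.Dataset` objects of (feature, one-hot label) pairs, and the epoch count is arbitrary.

```python
import itertools
import tensorflow as tf

BATCH_SIZES = [128, 256, 512]
LEARNING_RATES = [0.001, 0.005, 0.01, 0.05, 0.1]

def build_classifier(input_dim, num_classes):
    # One 256-unit ReLU hidden layer and one softmax prediction layer,
    # both initialized with the Xavier (Glorot) scheme.
    init = tf.keras.initializers.GlorotUniform()
    return tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(256, activation="relu", kernel_initializer=init),
        tf.keras.layers.Dense(num_classes, activation="softmax", kernel_initializer=init),
    ])

def grid_search(train_data, val_data, input_dim, num_classes):
    best_cfg, best_acc = None, -1.0
    for bs, lr in itertools.product(BATCH_SIZES, LEARNING_RATES):
        model = build_classifier(input_dim, num_classes)
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                      loss="categorical_crossentropy", metrics=["accuracy"])
        model.fit(train_data.batch(bs), epochs=10, verbose=0)
        _, acc = model.evaluate(val_data.batch(bs), verbose=0)
        if acc > best_acc:
            best_cfg, best_acc = (bs, lr), acc
    return best_cfg, best_acc
```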
Performance Comparison
The comparative results are shown in Table 4.7 and Figure 4.16. From the table and the figure, we have the following observations:

1. In terms of Micro-F1, Early Fusion and Late Fusion achieve the worst performance, since these standard fusion approaches rarely exploit the correlations between different modalities.
Table 4.7: Performance comparison between our model and the baselines (p-value*: p-value over Micro-F1)

                Micro-F1           Macro-F1           P-value*
Early Fusion    11.39 ± 0.01%       0.12 ± 0.01%      1.31e-8
Late Fusion     12.57 ± 0.23%       0.20 ± 0.04%      4.29e-9
Early + Att     31.24 ± 0.37%      14.03 ± 0.19%      2.48e-8
Late + Att      30.00 ± 0.31%      13.71 ± 0.51%      1.52e-7
TRUMANN         27.38 ± 0.21%      10.87 ± 0.05%      8.71e-8
DARE            34.40 ± 0.32%      20.21 ± 0.35%      5.94e-7
NMCL            40.04 ± 0.37%      26.78 ± 0.42%      —

* In statistical hypothesis testing, the p-value is the probability, for a given statistical model and under the null hypothesis, that the statistical summary would be greater than or equal to the actual observed results.
Figure 4.16: ROC curves (true positive rate vs. false positive rate) and AUC scores of the methods: (a) ROC_EarlyAtt (micro-average AUC = 0.9487, macro-average AUC = 0.8554); (b) ROC_LateAtt (0.9383, 0.8096); (c) ROC_TRUMANN (0.9438, 0.8339); (d) ROC_DARE (0.9512, 0.8622); (e) ROC_NMCL (0.9609, 0.8894).
2. Integrating the attention model into the standard fusion improves the performance noticeably. Taking advantage of the attention mechanism, EarlyCAtt and LateCAtt can dynamically select the discriminative features tailored to the prediction task. This verifies the feasibility of re-weighting each feature.
3. TRUMANN outperforms Early Fusion and Late Fusion on the estimation task. This is reasonable since it considers the hierarchical structure of venue categories and employs multi-task learning. Meanwhile, EarlyCAtt and LateCAtt outperform TRUMANN, which again confirms the effectiveness of assigning attentive weights to the features.
4. The performance of DARE exceeds that of the other baselines, indicating that DARE benefits from the acoustic modality enhanced via an external dataset and alleviates the sparsity problem of unpopular categories by regularizing the similarity among the categories.
5. Our proposed model achieves the best results w.r.t. Micro-F1 and Macro-F1. By exploiting the consistency and complementarity of features, our model achieves better expressiveness than all baselines. While DARE and TRUMANN treat all features linearly, independently, and equally, our model captures and leverages the correlation between different modalities, and employs the attention networks to identify the tailored attention weight of each feature. We further conducted a pairwise significance test to verify that all improvements are statistically significant with p-value < 0.05 (a sketch of such a test is given after this list).
6. As shown in Figure 4.16, NMCL achieves a micro-average AUC score of more than 96% and is superior to the baselines, further demonstrating the effectiveness of our proposed method. Although DARE yields an AUC score of 95.12% and ranks second-best among all the methods, our proposed method outperforms it by a gain of about 1%. Besides, in terms of the macro-average ROC curve, NMCL achieves an AUC score of about 89%, which is 3-10% higher than those of the baselines.
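The pairwise significance test mentioned in observation 5 could, for example, be a paired t-test over the per-round Micro-F1 scores, as sketched below with SciPy; the choice of the paired t-test and the example inputs are assumptions, since the text only reports the resulting p-values.

```python
from scipy import stats

def paired_significance(ours, baseline):
    """Paired t-test over matched evaluation scores (e.g., the five prediction rounds).

    Improvements are regarded as statistically significant when the returned
    p-value is below 0.05.
    """
    _, p_value = stats.ttest_rel(ours, baseline)
    return p_value

# Hypothetical usage with per-round Micro-F1 scores:
# p = paired_significance([0.401, 0.399, 0.404, 0.398, 0.400],
#                         [0.345, 0.342, 0.347, 0.341, 0.344])
```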
Study of NMCL Model
We studied the effectiveness of combining different modalities. Tables 4.8 and 4.9 show the performance of different modality pairs and of each enhanced modality with our proposed model, respectively. In addition, we plotted the Macro-F1 and Micro-F1 w.r.t. the number of iterations in Figure 4.17 to illustrate the convergence and efficiency of our model for each modality. From these tables and the figure, we observed the following.
1. Among the single modalities in Table 4.8, solely considering the visual modality achieves the best performance compared to the other mono-modal estimation methods. This is consistent with the findings in [122, 192], verifying the rich geographic information conveyed by the visual features. In addition, the CNN features are capable of capturing the prominent visual characteristics of the venue categories.
2. The acoustic modality and the textual modality perform similarly in estimating the venue categories, as also listed in Table 4.8.
Table 4.8: Representativeness of different modalities

                     Micro-F1           Macro-F1
Textual              13.40 ± 0.14%       2.23 ± 0.10%
Acoustic             14.21 ± 0.12%       3.40 ± 0.02%
Visual               28.16 ± 0.23%      11.22 ± 0.41%
Acoustic + Textual   20.57 ± 0.41%       7.08 ± 0.09%
Visual + Textual     38.45 ± 0.34%      23.83 ± 0.34%
Visual + Acoustic    37.07 ± 0.35%      23.34 ± 0.11%
All                  40.04 ± 0.37%      26.78 ± 0.42%
Table 4.9: Performance of each enhanced modality in different modality pairs (V-Micro-F1, A-Micro-F1, and T-Micro-F1 denote the Micro-F1 score on the visual, acoustic, and textual modality, respectively)

                     V-Micro-F1         A-Micro-F1         T-Micro-F1
Acoustic + Textual   —                  20.12 ± 0.15%      20.13 ± 0.14%
Visual + Textual     37.46 ± 0.26%      —                  37.75 ± 0.36%
Visual + Acoustic    35.09 ± 0.15%      34.80 ± 0.16%      —
All                  36.07 ± 0.28%      35.27 ± 0.17%      33.73 ± 0.51%
Figure 4.17: Convergence and effectiveness study of the NMCL: (a) Micro-F1 and (b) Macro-F1 performance w.r.t. the number of iterations for the All, Visual, Acoustic, and Textual settings.
Using only one modality, however, is insufficient to estimate the categories for most micro-videos, since the textual and acoustic information is noisy, sparse, and even irrelevant to the venue categories.
3. The more modalities we incorporate, the better performance we achieve, as shown in the lower rows of Tables 4.8 and 4.9. This implies that the information from one modality is insufficient and that multiple modalities complement rather than conflict with each other, echoing the old saying that "two heads are better than one."
4. Table 4.9 shows that the performance of each modality enhanced by our proposed approach improves noticeably, especially when the acoustic and textual modalities are combined with the visual modality. This improvement validates that each modality can be reinforced by the other modalities in our model.