where $a_n^{(0)}$ is the $n$-th column of $A^{(0)}$, and $x_n^{(0)}$ is the $n$-th column of the offline sample matrix $X$. We can then rewrite Eq. (4.33) as
$$ d_j^{(t)} = \frac{1}{U_{jj}^{(t)}} \left( f_j^{(t)} - D^{(t-1)} u_j^{(t)} \right) + d_j^{(t-1)}. \tag{4.35} $$
Because $\|d_j^{(t)}\|_2 \le 1$, we need to use
$$ d_j^{(t)} = \frac{1}{\max\left(\left\|d_j^{(t)}\right\|_2, 1\right)} \, d_j^{(t)}, \tag{4.36} $$
to normalize the atom $d_j^{(t)}$.
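For concreteness, the following is a minimal NumPy sketch of this block-coordinate atom update, assuming the sufficient statistics $U^{(t)}$ and $F^{(t)}$ have already been accumulated as in the online scheme of [109]; the function and variable names are illustrative rather than part of the original implementation.

```python
import numpy as np

def update_atoms(D, U, F, eps=1e-8):
    """Block-coordinate dictionary update of Eqs. (4.35)-(4.36).

    D: (d, K) current dictionary, one atom per column.
    U: (K, K) accumulated coefficient statistics at step t.
    F: (d, K) accumulated data statistics at step t.
    """
    K = D.shape[1]
    for j in range(K):
        # Eq. (4.35): move the j-th atom along the residual direction.
        d_j = (F[:, j] - D @ U[:, j]) / (U[j, j] + eps) + D[:, j]
        # Eq. (4.36): renormalize so that ||d_j||_2 <= 1.
        D[:, j] = d_j / max(np.linalg.norm(d_j), 1.0)
    return D
```

Each atom is updated in turn using the partially updated dictionary, which is the usual block-coordinate descent behavior of this type of online dictionary learning.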
An Incoming Unlabeled Sample: If the new sample $x_t$ does not have a label, we have to predict it. We first utilize the previously learned dictionaries to obtain the sparse representations of the sample $x_t$. In the offline stage, we utilized the sparse representations of the training samples to learn a Softmax model for venue category classification. We then use the trained classifiers to predict the label of the new sample $x_t$. Specifically, we judge the label of the sample with the following rule:
$$ y = \arg\min_{q_t,\; t \in \{1,\ldots,T\}} \sum_{m=1}^{M} \left\| q_t - y_m \right\|_2^2, \tag{4.37} $$
where $q_t$ is a binary vector for the $t$-th category, whereby its $t$-th coordinate is one and the remaining ones are zeros, and $y_m$ is the predicted label based on the $m$-th modality.
Since we adopt the optimization method proposed in [109], whose convergence has been proven in that paper, we omit the convergence proof here.
4.5.5 EXPERIMENTS
Since our proposed model is an online dictionary learning algorithm, it does not require a large amount of data to train the offline model. To demonstrate the robustness of our model, we selected only a small subset of Dataset II as our training data. Specifically, we randomly selected 5,396 micro-videos as our offline training data to learn the dictionaries, 10,807 micro-videos as our online training data, and 2,170 micro-videos as our testing data. We repeated the random selection ten times and report the average experimental results.
Baselines
To shed light on the effectiveness of our proposed approach, we compared it with the following state-of-the-art baselines.
Without dictionary learning (WDL): We trained the softmax classifier directly on the raw features, without any dictionary learning.
Sparse graph regularization multi-task learning (SRMTL): This is a multi-task learning model capturing the structural relatedness between task pairs [99]. This model is also trained on raw features.
Data-driven dictionary learning (DDL): This is a classic mono-modal unsupervised dictionary learning framework utilizing the elastic net [35].
Online dictionary learning for sparse coding (ODLSC): This is an online mono-modal dictionary learning method relying on stochastic approximations. It is capable of scaling up to millions of training samples [109].
Multi-modal unsupervised dictionary learning (MDL): It is a joint sparse representation model via the $\ell_{2,1}$ norm (as formulated in Eq. (4.19)).
Multi-modal task-driven dictionary learning (MTDL): This is a multi-modal task-driven dictionary learning algorithm [6] enabling modalities to collaborate at both the feature level, by using joint sparse representation, and the decision level, by using a sum of the decision scores.
Tree-guided multi-task multi-modal learning (TRUMANN): It learns a common space from the multi-modal data and utilizes it to represent each micro-video. Then, it leverages a multi-task learning model to predict the category of a micro-video [192].
Overall Performance Comparison
We trained our approach and the baselines over the offline training set and verified them over the testing set. The results are comparatively summarized in Table 4.3.

Table 4.3: Efficiency and effectiveness comparison between our model and the baselines. We report the average results over ten rounds of experiments together with a pairwise significance test (p-value*: p-value over accuracy).
Models      Accuracy         Micro-F1         Time (s)    p-value*
WDL         2.79 ± 0.07%     2.89 ± 0.04%        65.0     3.05e-16
SRMTL       1.79 ± 0.05%     1.47 ± 0.10%       140.7     8.40e-19
DDL         2.95 ± 0.09%     3.12 ± 0.10%     1,194.7     1.51e-15
ODLSC       2.92 ± 0.15%     3.04 ± 0.17%     1,046.2     1.97e-14
MDL         4.43 ± 0.06%     4.66 ± 0.07%     4,468.9     7.19e-12
MTDL        4.50 ± 0.13%     4.75 ± 0.15%     3,338.6     4.50e-10
TRUMANN     4.19 ± 0.02%     4.46 ± 0.03%        51.4     9.14e-14
INTIMATE    6.28 ± 0.08%     6.60 ± 0.09%       150.1     –
We observed that: (1) DDL and ODLSC outperformed WDL and SRMTL. The latter two methods leveraged the raw features to train the classifier directly, whereas the former two learned sparse representations of the micro-videos before training the classifiers. This justifies the necessity of dictionary learning and the discriminative power of sparse representations; (2) the multi-modal dictionary learning methods surpass the mono-modal ones. This demonstrates that there is indeed relatedness among the modalities, and appropriately capturing and modeling such relatedness can reinforce the dictionary learning and hence the discriminative power of the sparse representations; (3) the multi-modal dictionary learning methods achieve better results than TRUMANN. This signals that not all the modalities of micro-videos share the same space, and it thus may not be optimal to learn a common space via agreement constraints; (4) MTDL shows its superiority over MDL. This tells us that, in the supervised setting, jointly minimizing the misclassification and reconstruction errors results in dictionaries that are adapted to the desired tasks, and that supervised methods can lead to more accurate classification than unsupervised ones; (5) our proposed INTIMATE substantially outperforms all the other methods, including MTDL and MDL. This verifies that harvesting the prior knowledge encoded in the tree structure yields a more discriminative dictionary for venue classification; and (6) as to the efficiency of the offline part, the cost of our model is lower than that of the other dictionary learning baselines. In addition, we conducted a significance test between our model and each of the baselines based on the average results of the ten rounds of cross validation. All the p-values are substantially smaller than 0.05, indicating that the advantage of our model is statistically significant. Besides, the performance of our model hardly fluctuates across rounds, showing its stability.
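The text does not spell out which significance test is used; a paired t-test over the ten matched per-round accuracies is one common choice consistent with reporting a single p-value per baseline. A sketch with dummy numbers (the real test would use the recorded per-round accuracies) follows.

```python
import numpy as np
from scipy import stats

# Dummy per-round accuracies for illustration only; the real test uses the
# ten accuracies recorded for INTIMATE and for one baseline across the rounds.
rng = np.random.default_rng(0)
acc_intimate = rng.normal(loc=0.0628, scale=0.0008, size=10)
acc_baseline = rng.normal(loc=0.0450, scale=0.0013, size=10)

# Paired (dependent-samples) t-test over the matched rounds: one p-value per baseline.
t_stat, p_value = stats.ttest_rel(acc_intimate, acc_baseline)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```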
Parameter Tuning and Sensitivity Analysis
Our model has three key parameters: the number of dictionary atoms $K$ and the tradeoff parameters $\lambda$ and $\gamma$. In each of the ten rounds of experiments, we adopted a grid search strategy to carefully tune and select the optimal parameters on the training data [123]. Take one round as an example: we first performed a coarse grid search within the wide range $[0, 1000]$ using an adaptive step size; once we obtained the approximate scope of each parameter, we performed fine tuning within a narrow range using a small step size.
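As an illustration, a coarse-to-fine search for a single parameter might look as follows; the linear grids and fixed step counts are simplifications of the adaptive step size described above, and `evaluate` is a stand-in for training the model with a candidate value and scoring it on validation data.

```python
import numpy as np

def coarse_to_fine_search(evaluate, lo=0.0, hi=1000.0, coarse_steps=11, fine_steps=11):
    """Coarse-to-fine grid search for a single hyper-parameter.

    evaluate: callable mapping a candidate value to validation accuracy.
    """
    # Coarse pass over the wide range [lo, hi].
    coarse_grid = np.linspace(lo, hi, coarse_steps)
    best = max(coarse_grid, key=evaluate)
    # Fine pass with a small step size around the best coarse value.
    width = (hi - lo) / (coarse_steps - 1)
    fine_grid = np.linspace(max(lo, best - width), min(hi, best + width), fine_steps)
    return max(fine_grid, key=evaluate)
```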
Figure 4.7 shows the performance of our model with respect to the three parameters, obtained by varying one parameter while fixing the others. We can see that the performance of our model changes only within small ranges near the optimal settings. This justifies that our model is insensitive to the parameters around their optimal settings. We observed that the setting of $K = 150$, $\lambda = 0.85$, and $\gamma = 1$ works well for all of our experiments. The competitors were tuned with analogous procedures to ensure a fair comparison.
[Figure 4.7 plots accuracy (%) against (a) the number of dictionary atoms K, (b) the tradeoff parameter λ, and (c) the tradeoff parameter γ.]

Figure 4.7: Parameter tuning and sensitivity analysis. This is implemented by varying one parameter and fixing the rest.
[Figure 4.8 plots (a) accuracy (%) of INTIMATE vs. ODLSC against the offline training size, (b) accuracy (%) of INTIMATE vs. ODLSC against the online training size, and (c) offline training time (s) against the training data size.]

Figure 4.8: (a) Influence of the offline training size on the online update results. (b) Influence of the online training size on the accuracy of the online update stage. (c) Time consumption of the offline stage with different sizes of the training data.
Justification of Online Learning
In this experiment, we have three settings: (1) fixing the offline training data (5,396 videos), we studied the effectiveness of the online part by increasing the number of online samples from 1 to 10,807; (2) fixing the online training data (10,807 videos), we studied the influence of the offline training size by increasing the number of offline samples from 1 to 5,396; and (3) we studied the time consumption of the offline stage under different sizes of training data and compared it with that of the online stage.
The comparison results over the 2,170 testing samples between INTIMATE and ODLSC are illustrated in Figures 4.8a and 4.8b. We have the following observations: (1) the more online training samples used to train our model, the better the performance it achieves. This justifies the usefulness of leveraging the incoming samples to incrementally strengthen our model. Meanwhile, it stably outperforms ODLSC under the same experimental setting; and (2) the size of the initial training samples does affect the performance of our model, but not significantly, even when the online samples are insufficient. This shows the robustness of our model.
Figure 4.8c shows the time consumption of the offline stage with different sizes of the training data. From this figure, we find that the training time increases monotonically with the size of the training data. Namely, if a new labeled sample is given (so that the training set contains 10,001 samples), it would take at least 900 s to retrain the offline part before testing, whereas our online learning model needs less than 1 s to strengthen the model for testing. This certifies that online learning is indeed necessary.
Effectiveness of Parallel Dictionary Learning
In order to evaluate the effectiveness of our parallel multi-modal dictionary learning, we proposed a shared dictionary learning baseline, DPL. The work in [44] has justified that a deep convolutional neural network is equivalent to the sparse dictionary learning pipeline, whereby each convolutional filter can be seen as a dictionary atom to be learned, and the sparse coding can be seen as the activation values of the filtered results. Motivated by this work, we designed a deep shared dictionary learning model, as shown in Figure 4.9. Since all the modality features should utilize the same dictionary, we first embedded the three modality features into a common space. Suppose the dimension of the common space is $n$; we then constructed a dictionary using $n \times 1$ CNN kernels. Representation learning over a dictionary with $K$ atoms is equivalent to applying $K$ linear filters of size $n \times 1$ to each input feature vector (each column of the matrix). The sparse coding solver then iteratively processes the $K$ coefficients.
[Figure 4.9 illustrates the three modality inputs (visual, audio, text) passing through feature embedding, CNN-based dictionary learning, and sparse representation.]

Figure 4.9: The pipeline of the baseline model DPL. It first embeds the three modality features into a common space, and then uses a CNN-based dictionary learning method to learn sparse representations. Finally, it concatenates these representations and feeds them into a classifier.
The experimental results are shown in Table 4.4. From this table, we find that: (1) DPL performs better than the above baseline methods, including MTDL and MDL. This demonstrates the superior performance of deep learning-based models. (2) Although DPL is a deep learning model, our model still outperforms it. Because DPL projects all the modalities into a common space in order to share a single dictionary, it loses the complementary information among the modalities. This also verifies that the common space assumption proposed in [192] is invalid.