and O((2NK² + K³)T). Here, M is the number of iterations of the alternating optimization, which is a small value (less than 10 in the above analysis). N, T, S, K, and D, respectively, refer to the number of micro-videos, venue categories, and modalities, the latent dimension, and the total feature dimension over all the modalities. Usually, we consider only a few modalities, so S is very small. In our experimental settings, K and T are on the order of a few hundred, while the total feature dimension D is about 5,000. Therefore, D² is greater than K²T. In light of this, the time complexity reduces to O(ND²), which is faster than the O(N³) complexity of SVM.
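To make the magnitude argument concrete, here is a back-of-the-envelope check; the specific values K ≈ 200 and T ≈ 300 are illustrative assumptions standing in for "a few hundred," and D ≈ 5,000 follows the text.

```latex
% Illustrative magnitudes only (K and T assumed to be a few hundred, D about 5,000):
\begin{align*}
K^{2}T &\approx 200^{2} \times 300 = 1.2 \times 10^{7},\\
D^{2}  &\approx 5{,}000^{2} = 2.5 \times 10^{7},\\
\Rightarrow\; D^{2} &> K^{2}T,
\quad\text{so the overall cost is governed by the } ND^{2} \text{ term.}
\end{align*}
```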
4.4.4 EXPERIMENTS
To validate the effectiveness of the first model, TRUMANN, we conducted several experiments on a server equipped with an Intel(R) Core(TM) i7-4790 CPU at 3.6 GHz, 8 cores, 32 GB of RAM, and a 64-bit Windows 10 operating system. To thoroughly measure our model and the baselines, we employed multiple metrics, namely macro-F1 and micro-F1 [55]. Macro-F1 gives equal weight to each class label in the averaging process, whereas micro-F1 gives equal weight to each instance. Both metrics reach their best value at 1 and their worst at 0.
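As a concrete illustration, macro-F1 and micro-F1 can be computed as follows; this is a minimal sketch using scikit-learn rather than the evaluation code actually used in the experiments, and the labels are toy placeholders.

```python
# Minimal sketch (not the authors' evaluation code): macro-F1 vs. micro-F1
# for a multi-class venue-category predictor, using scikit-learn.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 2, 2, 2]   # toy ground-truth venue-category indices
y_pred = [0, 1, 1, 2, 2, 0]   # toy predicted venue-category indices

# Macro-F1: compute F1 per class, then average -> every class counts equally.
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Micro-F1: pool all instances before computing F1 -> every instance counts equally.
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(f"macro-F1 = {macro_f1:.3f}, micro-F1 = {micro_f1:.3f}")
```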
The experimental results reported here were based on 10-fold cross-validation. In particular, stratified cross-validation [130] was adopted to ensure that all categories contain approximately the same percentage of samples in the training and testing sets. In each round of the 10-fold cross-validation, we split Dataset II into three chunks: 80% of the micro-videos (i.e., 194,505 videos) were used for training, 10% (i.e., 24,313 videos) for validation, and the rest (i.e., 24,313 videos) were held out for testing. The training set was used to adjust the parameters, while the validation set was used to avoid overfitting, i.e., to verify that any performance increase on the training set actually yields an accuracy increase on data that has not been shown to the model before. The testing set was used only for evaluating the final solution, to confirm the actual predictive power of our model with the optimal parameters. Grid search with a small but adaptive step size was employed to select the optimal parameters.
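The splitting protocol can be sketched as below, assuming scikit-learn's StratifiedKFold for the stratified 10-fold loop and a stratified hold-out for validation; the toy data and sizes are placeholders, not Dataset II.

```python
# Minimal sketch (assumed workflow, not the authors' code) of stratified 10-fold
# cross-validation with an 80/10/10 train/validation/test split per round.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.random.rand(1000, 16)                 # toy features (placeholder for Dataset II)
y = np.random.randint(0, 5, size=1000)       # toy venue-category labels

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_val_idx, test_idx in folds.split(X, y):        # 90% train+val / 10% test
    # Carve a stratified validation set out of the 90%, leaving 80% for training.
    train_idx, val_idx = train_test_split(
        train_val_idx, test_size=1/9, stratify=y[train_val_idx], random_state=0)
    print(len(train_idx), len(val_idx), len(test_idx))   # roughly 800 / 100 / 100
    # Train on train_idx, tune hyper-parameters on val_idx, report on test_idx.
```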
Performance Comparison among Models
We carried out experiments on Dataset II to compare the overall effectiveness of our proposed TRUMANN model with that of several state-of-the-art baselines.
SRMTL: e Sparse Graph Regularization Multi-Task Learning method can capture the
relationship between task pairs and further impose a sparse graph regularization scheme
to enforce the related pairs close to each other [99].
regMVMT: is semi-supervised inductive multi-view multi-task learning model consid-
ers information from multiple views and learns multiple related tasks simultaneously [190].
Besides, we also compared our model with a variant of the regMVMT method, dubbed regMVMT+, which achieves better performance by modeling non-uniformly related tasks.
MvDA+RMTL: is baseline is the combination of Multi-view Discriminant Analy-
sis [75] and Robust Multi-Task Learning [26]. In particular, MvDA seeks for a single
discriminant common space for multiple views by jointly learning multiple view-specific
linear transforms. Meanwhile, the RMTL is able to capture the task relationships using a
low-rank structure via group-sparse lasso.
TRUMANN-: is baseline is the variant of our proposed model by setting all e
v
in
Eq. (4.3) to be 1. In other words, this baseline does not incorporate the knowledge of the
pre-defined hierarchical structure.
The comparative results are summarized in Table 4.1. From this table, we have the following observations: (1) TRUMANN achieves better performance than the other multi-task learning approaches, such as SRMTL. This is because SRMTL cannot capture the prior knowledge of task relatedness encoded in the tree structure. It also reflects that micro-videos are more separable in the learned common space; (2) multi-modal multi-task models, such as regMVMT and TRUMANN, remarkably outperform pure multi-task learning models, such as SRMTL. This again demonstrates that the relatedness among the multiple modalities can boost the learning performance; (3) the joint learning of multi-modal multi-task models, including regMVMT and TRUMANN, is superior to the sequential learning of the multi-view multi-task model MvDA+RMTL. This tells us that multi-modal learning and multi-task learning can mutually reinforce each other; (4) TRUMANN outperforms TRUMANN-, which demonstrates the usefulness of the pre-defined hierarchical structure and reveals the necessity of tree-guided multi-task learning; and (5) we examined the statistical significance of the improvements on micro-F1. In particular, we performed a paired t-test between our model and each of the competitors over the 10-fold cross-validation. The resulting p-values are substantially smaller than 0.05, which shows that the improvements of our proposed model are statistically significant.
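The per-baseline significance tests can be reproduced in spirit with a paired t-test over the ten per-fold micro-F1 scores; the sketch below uses SciPy with placeholder numbers, not the actual fold-level results.

```python
# Minimal sketch (placeholder numbers, not the reported fold-level results):
# paired t-test between TRUMANN and one baseline over 10 cross-validation folds.
from scipy.stats import ttest_rel

trumann_f1  = [0.251, 0.255, 0.249, 0.253, 0.256, 0.250, 0.254, 0.252, 0.248, 0.257]
baseline_f1 = [0.170, 0.173, 0.168, 0.175, 0.171, 0.169, 0.174, 0.172, 0.167, 0.176]

t_stat, p_value = ttest_rel(trumann_f1, baseline_f1)   # paired across the same folds
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")          # p < 0.05 -> significant improvement
```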
Representativeness of Modalities
We also studied the effectiveness of different modality combinations. Table 4.2 shows the results. From this table, we observed that: (1) the visual modality is the most discriminative one among the visual, textual, and acoustic modalities. This is because the visual modality contains more location-specific information than the acoustic and textual modalities. It also signals that the CNN features are capable of capturing the prominent visual characteristics of venue categories; (2) the acoustic modality provides more important cues for venue categories than the textual modality across both the micro-F1 and macro-F1 metrics. However, given only the acoustic modality, it is hard to estimate the venue categories for most of the videos, whereas the combination of the visual and acoustic modalities yields an improvement over the visual modality alone; (3) the textual modality is
Table 4.1: Performance comparison between our model and the baselines on venue category estimation over Dataset II (p-value*: p-value over micro-F1)

| Models      | Macro-F1     | Micro-F1      | P-value* |
|-------------|--------------|---------------|----------|
| SRMTL       | 2.61 ± 0.19% | 15.71 ± 0.21% | 1.1e-3   |
| regMVMT     | 4.33 ± 0.41% | 17.16 ± 0.28% | 7.0e-3   |
| regMVMT+    | 4.53 ± 0.31% | 18.35 ± 0.13% | 9.1e-3   |
| MvDA + RMTL | 2.46 ± 0.18% | 17.28 ± 1.67% | 1.0e-3   |
| TRUMANN-    | 3.75 ± 0.17% | 24.01 ± 0.35% | 1.0e-2   |
| TRUMANN     | 5.21 ± 0.29% | 25.27 ± 0.17% | –        |
Table 4.2: Representativeness of different modalities on Dataset II (p-value*: p-value over micro-F1)

| Modality           | Macro-F1     | Micro-F1      | P-value* |
|--------------------|--------------|---------------|----------|
| Visual             | 4.49 ± 0.09% | 22.56 ± 0.10% | 2.3e-2   |
| Acoustic           | 2.79 ± 0.01% | 16.25 ± 0.46% | 2.9e-4   |
| Textual            | 1.44 ± 0.29% | 12.36 ± 0.38% | 5.4e-4   |
| Acoustic + Textual | 2.87 ± 0.16% | 16.86 ± 0.06% | 6.4e-3   |
| Visual + Acoustic  | 4.61 ± 0.08% | 23.85 ± 0.20% | 1.8e-2   |
| Visual + Textual   | 4.52 ± 0.11% | 23.54 ± 0.17% | 1.1e-2   |
| ALL                | 5.21 ± 0.29% | 25.27 ± 0.17% | –        |
the least descriptive for venue category estimation. This is because the textual descriptions are noisy, sparse, often missing, and even irrelevant to the venue categories; and (4) the more modalities we incorporate, the better performance we achieve. This implies that the information from one modality is insufficient and that the modalities are complementary to each other rather than mutually conflicting, echoing the old saying that "two heads are better than one."
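A simple way to probe such modality combinations, independently of the TRUMANN model itself, is early fusion by feature concatenation; the sketch below is only an illustrative baseline with toy data and placeholder feature dimensions, not the fusion mechanism used in this chapter.

```python
# Illustrative early-fusion baseline (NOT the TRUMANN model): concatenate
# per-modality features and compare modality combinations with a linear classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500                                       # toy number of micro-videos
visual   = rng.random((n, 512))               # placeholder visual (e.g., CNN) features
acoustic = rng.random((n, 128))               # placeholder acoustic features
textual  = rng.random((n, 64))                # placeholder textual features
y = rng.integers(0, 10, size=n)               # toy venue-category labels

def combo_score(*modalities):
    X = np.hstack(modalities)                 # early fusion = feature concatenation
    clf = LogisticRegression(max_iter=500)
    return cross_val_score(clf, X, y, cv=3, scoring="f1_micro").mean()

print("visual            :", combo_score(visual))
print("visual + acoustic :", combo_score(visual, acoustic))
print("all modalities    :", combo_score(visual, acoustic, textual))
```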
Case Studies
In Figure 4.2, we list the top eight categories with the best performance using only the visual modality, only the acoustic modality, only the textual modality, and their combination, respectively. From this figure, we have the following observations: (1) for the visual modality, our model achieves stable and satisfactory performance on many venue categories, especially those with discriminative visual characteristics, such as the micro-videos related to "Zoo" and "Beach;" (2) regarding the acoustic modality, our model performs better on categories with regular or characteristic sounds, such as "Music Venue"
Figure 4.2: Categories with the best classification performance under the visual, acoustic, and textual modalities, and their combination, respectively. Experiments were conducted on Dataset II. (a) Visual modality: Baseball Stadium, Beach, Sculpture Garden, Aquarium, Music Venue, Basketball Stadium, City, Zoo. (b) Acoustic modality: Music Venue, City, Football Stadium, Theme Park, River, Concert Hall, Baseball Stadium, Nightclub. (c) Textual modality: Theme Park, City, Museum, Baseball Stadium, Beach, Park, Airport, Casino. (d) Modality combination: Aquarium, Beach, Baseball Stadium, Zoo, Basketball Stadium, Theme Park, Music Venue, Bridge. Each panel plots the micro-F1 scores of its top eight categories.
and "Concert Hall," which have discriminative acoustic signals as compared to other venue categories; (3) when it comes to the textual modality, we found that the top eight best-performing categories correspond to terms with high frequencies in the micro-video descriptions. For instance, the terms "Park" and "Beach" occur 2,992 and 3,882 times in our dataset, respectively. It is worth noting that not all textual descriptions are correlated with the actual venue category, which in fact decreases the performance. For example, the textual description of one micro-video is "I love my city," yet its venue category is "Park;" and (4) unsurprisingly, we obtained a significant improvement for the "Aquarium" category, which is hard to recognize with only one modality. Moreover, compared to the performance with the visual modality alone, the "Basketball Stadium" and "Zoo" categories are also improved by about 8% in micro-F1. Besides, the more training samples a venue category contains, the better it tends to perform, as with "Theme Park" and "City."
Parameter Tuning and Sensitivity
We have four key parameters, as shown in Eq. (4.3): K, λ₁, λ₂, and λ₃. The optimal values of these parameters were carefully tuned with 10-fold cross-validation on the training data. In particular, for each of the 10 folds, we chose the optimal parameters by grid search with a small but adaptive step size. The parameters were searched in the ranges of [50, 500], [0.01, 1], [0, 1], and [0, 1], respectively. The parameters corresponding to the best micro-F1 score were used to report the final results. For the other competitors, the parameter tuning procedures were analogous to ensure a fair comparison.
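A sketch of this grid search is given below, assuming a coarse pass over the stated ranges followed by a finer pass around the best point; the objective function is a stand-in placeholder, since in practice it would train the model of Eq. (4.3) and score micro-F1 on the validation split.

```python
# Minimal sketch (assumed, not the authors' code) of grid search over K and the
# trade-off parameters lambda_1..lambda_3 with a coarse-to-fine, adaptive step size.
import itertools

def validation_micro_f1(K, lam1, lam2, lam3):
    # Placeholder objective: stands in for "train the model of Eq. (4.3) with these
    # hyper-parameters and return micro-F1 on the validation split". Illustrative only.
    return -((K - 200) / 500) ** 2 - (lam1 - 0.7) ** 2 - (lam2 - 0.4) ** 2 - (lam3 - 0.3) ** 2

def grid_search():
    Ks    = range(50, 501, 50)                       # K searched in [50, 500]
    lam1s = [0.01, 0.25, 0.5, 0.75, 1.0]             # lambda_1 searched in [0.01, 1]
    lam2s = [0.0, 0.25, 0.5, 0.75, 1.0]              # lambda_2 searched in [0, 1]
    lam3s = [0.0, 0.25, 0.5, 0.75, 1.0]              # lambda_3 searched in [0, 1]
    best, best_score = None, float("-inf")
    for K, l1, l2, l3 in itertools.product(Ks, lam1s, lam2s, lam3s):
        score = validation_micro_f1(K, l1, l2, l3)
        if score > best_score:
            best, best_score = (K, l1, l2, l3), score
    # A second, finer pass with smaller steps around `best` would refine the choice.
    return best, best_score

print(grid_search())
```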
Take the parameter tuning in one of the 10 folds as an example. We observed that our model reached the optimal performance when K = 200, λ₁ = 0.7, λ₂ = 0.4, and λ₃ = 0.3. We then investigated the sensitivity of our model to these parameters by varying one of them while fixing the others. Figure 4.3 illustrates the performance of our model with respect to K, λ₁, λ₂, and λ₃. We can see that: (1) when fixing λ₁, λ₂, and λ₃ and tuning K, the micro-F1 score increases at first and then reaches its peak at K = 200; and (2) the micro-F1 score changes only within a small range when varying λ₁, λ₂, and λ₃ from 0–1. This slight change demonstrates that our model is insensitive to these parameters. At last, we recorded the value of micro-F1 along with the number of iterations under the optimal parameter settings. Figure 4.4 shows the convergence process
Figure 4.3: Performance of TRUMANN on Dataset II with regard to the varying parameters: (a) parameter K (micro-F1 (%) for K in [50, 500]); (b)–(d) parameters λ₁, λ₂, and λ₃ (micro-F1 (%) for values in [0, 1]).