and O((2NK² + K³)T). Here, M is the number of iterations of the alternating optimization, which is a small value (less than 10 in the above analysis). N, T, S, K, and D, respectively, refer to the number of micro-videos, venue categories, and modalities, the latent dimension, and the total feature dimension over all the modalities. Usually, we consider only a few modalities, so S is very small. In our experimental settings, K and T are on the order of a few hundred, while the total feature dimension D is about 5,000. Therefore, D² is greater than K²T. In light of this, the time complexity reduces to O(ND²), which is faster than the O(N³) complexity of SVM.
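To make the magnitude argument concrete, here is a back-of-the-envelope check; the specific values K ≈ 200 and T ≈ 300 are illustrative assumptions standing in for "a few hundred," and D ≈ 5,000 follows the text.

```latex
% Illustrative magnitudes only (K and T assumed to be a few hundred, D about 5,000):
\begin{align*}
K^{2}T &\approx 200^{2} \times 300 = 1.2 \times 10^{7},\\
D^{2}  &\approx 5{,}000^{2} = 2.5 \times 10^{7},\\
\Rightarrow\; D^{2} &> K^{2}T,
\quad\text{so the overall cost is governed by the } ND^{2} \text{ term.}
\end{align*}
```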
4.4.4 EXPERIMENTS
To validate the effectiveness of the first model, TRUMANN, we conducted several experiments on a server equipped with an Intel(R) Core(TM) i7-4790 CPU at 3.6 GHz, 8 cores, 32 GB of RAM, and a 64-bit Windows 10 operating system. To thoroughly measure our model and the baselines, we employed multiple metrics, namely macro-F1 and micro-F1 [55]. Macro-F1 gives equal weight to each class label in the averaging process, whereas micro-F1 gives equal weight to each instance. Both metrics reach their best value at 1 and their worst at 0.
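As a concrete illustration, macro-F1 and micro-F1 can be computed as follows; this is a minimal sketch using scikit-learn rather than the evaluation code actually used in the experiments, and the labels are toy placeholders.

```python
# Minimal sketch (not the authors' evaluation code): macro-F1 vs. micro-F1
# for a multi-class venue-category predictor, using scikit-learn.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 2, 2, 2]   # toy ground-truth venue-category indices
y_pred = [0, 1, 1, 2, 2, 0]   # toy predicted venue-category indices

# Macro-F1: compute F1 per class, then average -> every class counts equally.
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Micro-F1: pool all instances before computing F1 -> every instance counts equally.
micro_f1 = f1_score(y_true, y_pred, average="micro")

print(f"macro-F1 = {macro_f1:.3f}, micro-F1 = {micro_f1:.3f}")
```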
The experimental results reported here were based on 10-fold cross-validation. In particular, stratified cross-validation [130] was adopted to ensure that all categories contain approximately the same percentage of samples in the training and testing sets. In each round of the 10-fold cross-validation, we split Dataset II into three chunks: 80% of the micro-videos (i.e., 194,505 videos) were used for training, 10% (i.e., 24,313 videos) for validation, and the rest (i.e., 24,313 videos) were held out for testing. The training set was used to adjust the parameters, while the validation set was used to avoid overfitting, i.e., to verify that any performance increase on the training set actually yields an accuracy increase on data that has not been shown to the model before. The testing set was used only for evaluating the final solution, to confirm the actual predictive power of our model with the optimal parameters. Grid search with a small but adaptive step size was employed to select the optimal parameters.
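The splitting protocol can be sketched as below, assuming scikit-learn's StratifiedKFold for the stratified 10-fold loop and a stratified hold-out for validation; the toy data and sizes are placeholders, not Dataset II.

```python
# Minimal sketch (assumed workflow, not the authors' code) of stratified 10-fold
# cross-validation with an 80/10/10 train/validation/test split per round.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.random.rand(1000, 16)                 # toy features (placeholder for Dataset II)
y = np.random.randint(0, 5, size=1000)       # toy venue-category labels

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_val_idx, test_idx in folds.split(X, y):        # 90% train+val / 10% test
    # Carve a stratified validation set out of the 90%, leaving 80% for training.
    train_idx, val_idx = train_test_split(
        train_val_idx, test_size=1/9, stratify=y[train_val_idx], random_state=0)
    print(len(train_idx), len(val_idx), len(test_idx))   # roughly 800 / 100 / 100
    # Train on train_idx, tune hyper-parameters on val_idx, report on test_idx.
```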
Performance Comparison among Models
We carried out experiments on Dataset II to compare the overall effectiveness of our proposed TRUMANN model with that of several state-of-the-art baselines.
SRMTL: e Sparse Graph Regularization Multi-Task Learning method can capture the
relationship between task pairs and further impose a sparse graph regularization scheme
to enforce the related pairs close to each other [99].
regMVMT: is semi-supervised inductive multi-view multi-task learning model consid-
ers information from multiple views and learns multiple related tasks simultaneously [190].
Besides, we also compared our model with a variant of the regMVMT method, dubbed regMVMT+, which achieves better performance by modeling non-uniformly related tasks.
MvDA+RMTL: is baseline is the combination of Multi-view Discriminant Analy-
sis [75] and Robust Multi-Task Learning [26]. In particular, MvDA seeks for a single
discriminant common space for multiple views by jointly learning multiple view-specific
linear transforms. Meanwhile, the RMTL is able to capture the task relationships using a
low-rank structure via group-sparse lasso.
TRUMANN-: is baseline is the variant of our proposed model by setting all e
v
in
Eq. (4.3) to be 1. In other words, this baseline does not incorporate the knowledge of the
pre-defined hierarchical structure.
The comparative results are summarized in Table 4.1. From this table, we have the following observations: (1) TRUMANN achieves better performance than the other multi-task learning approaches, such as SRMTL. This is because SRMTL cannot capture the prior knowledge of task relatedness encoded in the tree structure. It also reflects that micro-videos are more separable in the learned common space; (2) multi-modal multi-task models, such as regMVMT and TRUMANN, remarkably outperform pure multi-task learning models, such as SRMTL. This again demonstrates that the relatedness among the multiple modalities can boost the learning performance; (3) the joint learning of multi-modal multi-task models, including regMVMT and TRUMANN, is superior to the sequential learning of the multi-view multi-task model MvDA+RMTL. This tells us that multi-modal learning and multi-task learning can mutually reinforce each other; (4) TRUMANN outperforms TRUMANN-, which demonstrates the usefulness of the pre-defined hierarchical structure and reveals the necessity of tree-guided multi-task learning; and (5) we examined the statistical significance of the improvements on micro-F1. In particular, we performed a paired t-test between our model and each of the competitors over the 10-fold cross-validation. The resulting p-values are substantially smaller than 0.05, which shows that the improvements of our proposed model are statistically significant.
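The per-baseline significance tests can be reproduced in spirit with a paired t-test over the ten per-fold micro-F1 scores; the sketch below uses SciPy with placeholder numbers, not the actual fold-level results.

```python
# Minimal sketch (placeholder numbers, not the reported fold-level results):
# paired t-test between TRUMANN and one baseline over 10 cross-validation folds.
from scipy.stats import ttest_rel

trumann_f1  = [0.251, 0.255, 0.249, 0.253, 0.256, 0.250, 0.254, 0.252, 0.248, 0.257]
baseline_f1 = [0.170, 0.173, 0.168, 0.175, 0.171, 0.169, 0.174, 0.172, 0.167, 0.176]

t_stat, p_value = ttest_rel(trumann_f1, baseline_f1)   # paired across the same folds
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")          # p < 0.05 -> significant improvement
```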
Representativeness of Modalities
We also studied the effectiveness of different modality combinations. Table 4.2 shows the results. From this table, we observed that: (1) the visual modality is the most discriminative one among the visual, textual, and acoustic modalities. This is because the visual modality contains more location-specific information than the acoustic and textual modalities. It also signals that the CNN features are capable of capturing the prominent visual characteristics of venue categories; (2) the acoustic modality provides more important cues for venue categories than the textual modality across both the micro-F1 and macro-F1 metrics. However, given only the acoustic modality, it is hard to estimate the venue categories for most of the videos, whereas the combination of the visual and acoustic modalities yields an improvement over the visual modality alone; (3) the textual modality is
Table 4.1: Performance comparison between our model and the baselines on venue category estimation over Dataset II (p-value*: p-value over micro-F1)

| Models      | Macro-F1     | Micro-F1      | P-value* |
|-------------|--------------|---------------|----------|
| SRMTL       | 2.61 ± 0.19% | 15.71 ± 0.21% | 1.1e-3   |
| regMVMT     | 4.33 ± 0.41% | 17.16 ± 0.28% | 7.0e-3   |
| regMVMT+    | 4.53 ± 0.31% | 18.35 ± 0.13% | 9.1e-3   |
| MvDA + RMTL | 2.46 ± 0.18% | 17.28 ± 1.67% | 1.0e-3   |
| TRUMANN-    | 3.75 ± 0.17% | 24.01 ± 0.35% | 1.0e-2   |
| TRUMANN     | 5.21 ± 0.29% | 25.27 ± 0.17% | –        |
Table 4.2: Representativeness of different modalities on Dataset II (p-value*: p-value over micro-F1)

| Modality           | Macro-F1     | Micro-F1      | P-value* |
|--------------------|--------------|---------------|----------|
| Visual             | 4.49 ± 0.09% | 22.56 ± 0.10% | 2.3e-2   |
| Acoustic           | 2.79 ± 0.01% | 16.25 ± 0.46% | 2.9e-4   |
| Textual            | 1.44 ± 0.29% | 12.36 ± 0.38% | 5.4e-4   |
| Acoustic + Textual | 2.87 ± 0.16% | 16.86 ± 0.06% | 6.4e-3   |
| Visual + Acoustic  | 4.61 ± 0.08% | 23.85 ± 0.20% | 1.8e-2   |
| Visual + Textual   | 4.52 ± 0.11% | 23.54 ± 0.17% | 1.1e-2   |
| ALL                | 5.21 ± 0.29% | 25.27 ± 0.17% | –        |
the least descriptive for venue category estimation. This is because the textual descriptions are noisy, sparse, often missing, and even irrelevant to the venue categories; and (4) the more modalities we incorporate, the better performance we achieve. This implies that the information from one modality is insufficient and that the modalities are complementary to each other rather than mutually conflicting, echoing the old saying that "two heads are better than one."
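A simple way to probe such modality combinations, independently of the TRUMANN model itself, is early fusion by feature concatenation; the sketch below is only an illustrative baseline with toy data and placeholder feature dimensions, not the fusion mechanism used in this chapter.

```python
# Illustrative early-fusion baseline (NOT the TRUMANN model): concatenate
# per-modality features and compare modality combinations with a linear classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500                                       # toy number of micro-videos
visual   = rng.random((n, 512))               # placeholder visual (e.g., CNN) features
acoustic = rng.random((n, 128))               # placeholder acoustic features
textual  = rng.random((n, 64))                # placeholder textual features
y = rng.integers(0, 10, size=n)               # toy venue-category labels

def combo_score(*modalities):
    X = np.hstack(modalities)                 # early fusion = feature concatenation
    clf = LogisticRegression(max_iter=500)
    return cross_val_score(clf, X, y, cv=3, scoring="f1_micro").mean()

print("visual            :", combo_score(visual))
print("visual + acoustic :", combo_score(visual, acoustic))
print("all modalities    :", combo_score(visual, acoustic, textual))
```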
Case Studies
In Figure 4.2, we list the top eight categories with the best performance using only the visual modality, only the acoustic modality, only the textual modality, and their combination, respectively. From this figure, we have the following observations: (1) for the visual modality, our model achieves stable and satisfactory performance on many venue categories, especially those with discriminative visual characteristics, such as the micro-videos related to "Zoo" and "Beach;" (2) regarding the acoustic modality, our model performs better on categories with regular or characteristic sounds, such as "Music Venue"
Figure 4.2: Categories with the best classification performance under the visual, acoustic, and textual modalities, and their combination, respectively. Experiments were conducted on Dataset II. (a) Visual modality: Baseball Stadium, Beach, Sculpture Garden, Aquarium, Music Venue, Basketball Stadium, City, Zoo. (b) Acoustic modality: Music Venue, City, Football Stadium, Theme Park, River, Concert Hall, Baseball Stadium, Nightclub. (c) Textual modality: Theme Park, City, Museum, Baseball Stadium, Beach, Park, Airport, Casino. (d) Modality combination: Aquarium, Beach, Baseball Stadium, Zoo, Basketball Stadium, Theme Park, Music Venue, Bridge. Each panel plots the micro-F1 scores of its top eight categories.
and "Concert Hall," which have discriminative acoustic signals as compared to other venue categories; (3) when it comes to the textual modality, we found that the top eight best-performing categories correspond to terms with high frequencies in the micro-video descriptions. For instance, the terms "Park" and "Beach" occur 2,992 and 3,882 times in our dataset, respectively. It is worth noting that not all textual descriptions are correlated with the actual venue category, which in fact decreases the performance. For example, the textual description of one micro-video is "I love my city," yet its venue category is "Park;" and (4) unsurprisingly, we obtained a significant improvement for the "Aquarium" category, which is hard to recognize with only one modality. Moreover, compared to the performance with the visual modality alone, the "Basketball Stadium" and "Zoo" categories are also improved by about 8% in micro-F1. Besides, the more training samples a venue category contains, the better it tends to perform, as with "Theme Park" and "City."
Parameter Tuning and Sensitivity
We have four key parameters, as shown in Eq. (4.3): K, λ₁, λ₂, and λ₃. The optimal values of these parameters were carefully tuned with 10-fold cross-validation on the training data. In particular, for each of the 10 folds, we chose the optimal parameters by grid search with a small but adaptive step size. The parameters were searched in the ranges of [50, 500], [0.01, 1], [0, 1], and [0, 1], respectively. The parameters corresponding to the best micro-F1 score were used to report the final results. For the other competitors, the parameter tuning procedures were analogous to ensure a fair comparison.
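A sketch of this grid search is given below, assuming a coarse pass over the stated ranges followed by a finer pass around the best point; the objective function is a stand-in placeholder, since in practice it would train the model of Eq. (4.3) and score micro-F1 on the validation split.

```python
# Minimal sketch (assumed, not the authors' code) of grid search over K and the
# trade-off parameters lambda_1..lambda_3 with a coarse-to-fine, adaptive step size.
import itertools

def validation_micro_f1(K, lam1, lam2, lam3):
    # Placeholder objective: stands in for "train the model of Eq. (4.3) with these
    # hyper-parameters and return micro-F1 on the validation split". Illustrative only.
    return -((K - 200) / 500) ** 2 - (lam1 - 0.7) ** 2 - (lam2 - 0.4) ** 2 - (lam3 - 0.3) ** 2

def grid_search():
    Ks    = range(50, 501, 50)                       # K searched in [50, 500]
    lam1s = [0.01, 0.25, 0.5, 0.75, 1.0]             # lambda_1 searched in [0.01, 1]
    lam2s = [0.0, 0.25, 0.5, 0.75, 1.0]              # lambda_2 searched in [0, 1]
    lam3s = [0.0, 0.25, 0.5, 0.75, 1.0]              # lambda_3 searched in [0, 1]
    best, best_score = None, float("-inf")
    for K, l1, l2, l3 in itertools.product(Ks, lam1s, lam2s, lam3s):
        score = validation_micro_f1(K, l1, l2, l3)
        if score > best_score:
            best, best_score = (K, l1, l2, l3), score
    # A second, finer pass with smaller steps around `best` would refine the choice.
    return best, best_score

print(grid_search())
```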
Take the parameter tuning in one of the 10 folds as an example. We observed that our model reached the optimal performance when K = 200, λ₁ = 0.7, λ₂ = 0.4, and λ₃ = 0.3. We then investigated the sensitivity of our model to these parameters by varying one of them while fixing the others. Figure 4.3 illustrates the performance of our model with respect to K, λ₁, λ₂, and λ₃. We can see that: (1) when fixing λ₁, λ₂, and λ₃ and tuning K, the micro-F1 score increases at first and then reaches its peak at K = 200; and (2) the micro-F1 score changes only within a small range when varying λ₁, λ₂, and λ₃ from 0–1. This slight change demonstrates that our model is insensitive to these parameters. At last, we recorded the value of micro-F1 along with the number of iterations under the optimal parameter settings. Figure 4.4 shows the convergence process
Figure 4.3: Performance of TRUMANN on Dataset II with regard to the varying parameters: (a) parameter K (micro-F1 (%) for K in [50, 500]); (b)–(d) parameters λ₁, λ₂, and λ₃ (micro-F1 (%) for values in [0, 1]).