where $a_n^{(0)}$ is the $n$-th column of $A^{(0)}$, and $x_n^{(0)}$ is the $n$-th column of the offline sample matrix $X$. We can then rewrite Eq. (4.33) as
$$ d_j^{(t)} = \frac{1}{U_{jj}^{(t)}} \left( f_j^{(t)} - D^{(t-1)} u_j^{(t)} \right) + d_j^{(t-1)}. \tag{4.35} $$
Because $\|d_j^{(t)}\|_2 \le 1$, we need to use
$$ d_j^{(t)} = \frac{1}{\max\left(\left\|d_j^{(t)}\right\|_2, 1\right)} \, d_j^{(t)}, \tag{4.36} $$
to normalize the atom $d_j^{(t)}$.
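For concreteness, the following is a minimal NumPy sketch of this block-coordinate atom update, assuming the sufficient statistics $U^{(t)}$ and $F^{(t)}$ have already been accumulated as in the online scheme of [109]; the function and variable names are illustrative rather than part of the original implementation.

```python
import numpy as np

def update_atoms(D, U, F, eps=1e-8):
    """Block-coordinate dictionary update of Eqs. (4.35)-(4.36).

    D: (d, K) current dictionary, one atom per column.
    U: (K, K) accumulated coefficient statistics at step t.
    F: (d, K) accumulated data statistics at step t.
    """
    K = D.shape[1]
    for j in range(K):
        # Eq. (4.35): move the j-th atom along the residual direction.
        d_j = (F[:, j] - D @ U[:, j]) / (U[j, j] + eps) + D[:, j]
        # Eq. (4.36): renormalize so that ||d_j||_2 <= 1.
        D[:, j] = d_j / max(np.linalg.norm(d_j), 1.0)
    return D
```

Each atom is updated in turn using the partially updated dictionary, which is the usual block-coordinate descent behavior of this type of online dictionary learning.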
An Incoming Unlabeled Sample: If the new sample $x_t$ does not have a label, we have to predict it. We first utilize the previously learned dictionaries to obtain the sparse representations of the sample $x_t$. In the offline stage, we utilized the sparse representations of the training samples to learn a Softmax model for venue category classification. We then use the trained classifiers to predict the label of the new sample $x_t$. Specifically, we judge the label of the sample with the following rule:
$$ y = \arg\min_{q_t,\; t \in \{1,\ldots,T\}} \sum_{m=1}^{M} \left\| q_t - y_m \right\|_2^2, \tag{4.37} $$
where $q_t$ is a binary vector for the $t$-th category, whereby its $t$-th coordinate is one and the remaining ones are zeros, and $y_m$ is the predicted label based on the $m$-th modality.
Since we adopt the optimization method proposed in [109], whose convergence has been proven in that paper, we omit the convergence proof here.
4.5.5 EXPERIMENTS
Since our proposed model is an online dictionary learning algorithm, it does not require a large amount of data to train the offline model. To demonstrate the robustness of our model, we selected only a small subset of Dataset II as our training data. Specifically, we randomly selected 5,396 micro-videos as our offline training data to learn the dictionaries, 10,807 micro-videos as our online training data, and 2,170 micro-videos as our testing data. We repeated the random selection ten times and report the average experimental results.
Baselines
To shed light on the effectiveness of our proposed approach, we compared it with the following state-of-the-art baselines.
Without dictionary learning (WDL): We trained the softmax classifier directly on the raw features, without any dictionary learning.
Sparse graph regularization multi-task learning (SRMTL): This is a multi-task learning model capturing the structural relatedness between task pairs [99]. This model is also trained on raw features.
Data-driven dictionary learning (DDL): This is a classic mono-modal unsupervised dictionary learning framework utilizing the elastic net [35].
Online dictionary learning for sparse coding (ODLSC): This is an online mono-modal dictionary learning method relying on stochastic approximations. It is capable of scaling up to millions of training samples [109].
Multi-modal unsupervised dictionary learning (MDL): It is a joint sparse representation model via the $\ell_{2,1}$ norm (as formulated in Eq. (4.19)).
Multi-modal task-driven dictionary learning (MTDL): This is a multi-modal task-driven dictionary learning algorithm [6] enabling modalities to collaborate at both the feature level, by using joint sparse representation, and the decision level, by using a sum of the decision scores.
Tree-guided multi-task multi-modal learning (TRUMANN): It learns a common space from the multi-modal data and utilizes it to represent each micro-video. Then, it leverages a multi-task learning model to predict the category of a micro-video [192].
Overall Performance Comparison
We trained our approach and the baselines over the offline training set and verified them over the testing set. The results are comparatively summarized in Table 4.3.

Table 4.3: Efficiency and effectiveness comparison between our model and the baselines. We report the average results over ten rounds of experiments together with a pairwise significance test (p-value*: p-value over accuracy).
Models      Accuracy         Micro-F1         Time (s)    p-value*
WDL         2.79 ± 0.07%     2.89 ± 0.04%        65.0     3.05e-16
SRMTL       1.79 ± 0.05%     1.47 ± 0.10%       140.7     8.40e-19
DDL         2.95 ± 0.09%     3.12 ± 0.10%     1,194.7     1.51e-15
ODLSC       2.92 ± 0.15%     3.04 ± 0.17%     1,046.2     1.97e-14
MDL         4.43 ± 0.06%     4.66 ± 0.07%     4,468.9     7.19e-12
MTDL        4.50 ± 0.13%     4.75 ± 0.15%     3,338.6     4.50e-10
TRUMANN     4.19 ± 0.02%     4.46 ± 0.03%        51.4     9.14e-14
INTIMATE    6.28 ± 0.08%     6.60 ± 0.09%       150.1     –
We observed that: (1) DDL and ODLSC outperformed WDL and SRMTL. The latter two methods leveraged the raw features to train the classifier directly, whereas the former two learned sparse representations of the micro-videos before training the classifiers. This justifies the necessity of dictionary learning and the discriminative power of sparse representations; (2) the multi-modal dictionary learning methods surpass the mono-modal ones. This demonstrates that there is indeed relatedness among the modalities, and appropriately capturing and modeling such relatedness can reinforce the dictionary learning and hence the discriminative power of the sparse representations; (3) the multi-modal dictionary learning methods achieve better results than TRUMANN. This signals that not all the modalities of micro-videos share the same space, and it thus may not be optimal to learn a common space via agreement constraints; (4) MTDL shows its superiority over MDL. This tells us that, in the supervised setting, jointly minimizing the misclassification and reconstruction errors results in dictionaries that are adapted to the desired tasks, and that supervised methods can lead to more accurate classification than unsupervised ones; (5) our proposed INTIMATE substantially outperforms all the other methods, including MTDL and MDL. This verifies that harvesting the prior knowledge encoded in the tree structure yields a more discriminative dictionary for venue classification; and (6) as to the efficiency of the offline part, the cost of our model is lower than that of the other dictionary learning baselines. In addition, we conducted a significance test between our model and each of the baselines based on the average results of the ten rounds of cross validation. All the p-values are substantially smaller than 0.05, indicating that the advantage of our model is statistically significant. Besides, the performance of our model hardly fluctuates across rounds, showing its stability.
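The text does not spell out which significance test is used; a paired t-test over the ten matched per-round accuracies is one common choice consistent with reporting a single p-value per baseline. A sketch with dummy numbers (the real test would use the recorded per-round accuracies) follows.

```python
import numpy as np
from scipy import stats

# Dummy per-round accuracies for illustration only; the real test uses the
# ten accuracies recorded for INTIMATE and for one baseline across the rounds.
rng = np.random.default_rng(0)
acc_intimate = rng.normal(loc=0.0628, scale=0.0008, size=10)
acc_baseline = rng.normal(loc=0.0450, scale=0.0013, size=10)

# Paired (dependent-samples) t-test over the matched rounds: one p-value per baseline.
t_stat, p_value = stats.ttest_rel(acc_intimate, acc_baseline)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```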
Parameter Tuning and Sensitivity Analysis
Our model has three key parameters: the number of dictionary atoms $K$ and the tradeoff parameters $\lambda$ and $\gamma$. In each of the ten rounds of experiments, we adopted a grid search strategy to carefully tune and select the optimal parameters on the training data [123]. Take one round as an example: we first performed a coarse grid search within the wide range $[0, 1000]$ using an adaptive step size; once we obtained the approximate scope of each parameter, we performed fine tuning within a narrow range using a small step size.
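As an illustration, a coarse-to-fine search for a single parameter might look as follows; the linear grids and fixed step counts are simplifications of the adaptive step size described above, and `evaluate` is a stand-in for training the model with a candidate value and scoring it on validation data.

```python
import numpy as np

def coarse_to_fine_search(evaluate, lo=0.0, hi=1000.0, coarse_steps=11, fine_steps=11):
    """Coarse-to-fine grid search for a single hyper-parameter.

    evaluate: callable mapping a candidate value to validation accuracy.
    """
    # Coarse pass over the wide range [lo, hi].
    coarse_grid = np.linspace(lo, hi, coarse_steps)
    best = max(coarse_grid, key=evaluate)
    # Fine pass with a small step size around the best coarse value.
    width = (hi - lo) / (coarse_steps - 1)
    fine_grid = np.linspace(max(lo, best - width), min(hi, best + width), fine_steps)
    return max(fine_grid, key=evaluate)
```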
Figure 4.7 shows the performance of our model with respect to the three parameters, obtained by varying one parameter while fixing the others. We can see that the performance of our model changes only within small ranges near the optimal settings. This justifies that our model is insensitive to the parameters around their optimal settings. We observed that the setting of $K = 150$, $\lambda = 0.85$, and $\gamma = 1$ works well for all of our experiments. The competitors were tuned with analogous procedures to ensure a fair comparison.
[Figure 4.7 plots accuracy (%) against (a) the number of dictionary atoms K, (b) the tradeoff parameter λ, and (c) the tradeoff parameter γ.]

Figure 4.7: Parameter tuning and sensitivity analysis. This is implemented by varying one parameter and fixing the rest.
[Figure 4.8 plots (a) accuracy (%) of INTIMATE vs. ODLSC against the offline training size, (b) accuracy (%) of INTIMATE vs. ODLSC against the online training size, and (c) offline training time (s) against the training data size.]

Figure 4.8: (a) Influence of the offline training size on the online update results. (b) Influence of the online training size on the accuracy of the online update stage. (c) Time consumption of the offline stage with different sizes of the training data.
Justification of Online Learning
In this experiment, we have three settings: (1) fixing the offline training data (5,396 videos), we studied the effectiveness of the online part by increasing the number of online samples from 1 to 10,807; (2) fixing the online training data (10,807 videos), we studied the influence of the offline training size by increasing the number of offline samples from 1 to 5,396; and (3) we studied the time consumption of the offline stage under different sizes of training data and compared it with that of the online stage.
The comparison results over the 2,170 testing samples between INTIMATE and ODLSC are illustrated in Figures 4.8a and 4.8b. We have the following observations: (1) the more online training samples used to train our model, the better the performance it achieves. This justifies the usefulness of leveraging the incoming samples to incrementally strengthen our model. Meanwhile, it stably outperforms ODLSC under the same experimental setting; and (2) the size of the initial training samples does affect the performance of our model, but not significantly, even when the online samples are insufficient. This shows the robustness of our model.
Figure 4.8c shows the time consumption of the offline stage with different sizes of the training data. From this figure, we find that the training time increases monotonically with the size of the training data. Namely, if a new labeled sample is given (so that the training set contains 10,001 samples), it would take at least 900 s to retrain the offline part before testing, whereas our online learning model needs less than 1 s to strengthen the model for testing. This certifies that online learning is indeed necessary.
Effectiveness of Parallel Dictionary Learning
In order to evaluate the effectiveness of our parallel multi-modal dictionary learning, we proposed a shared dictionary learning baseline, DPL. The work in [44] has justified that a deep convolutional neural network is equivalent to the sparse dictionary learning pipeline, whereby each convolutional filter can be seen as a dictionary atom to be learned, and the sparse coding can be seen as the activation values of the filtered results. Motivated by this work, we designed a deep shared dictionary learning model, as shown in Figure 4.9. Since all the modality features should utilize the same dictionary, we first embedded the three modality features into a common space. Suppose the dimension of the common space is $n$; we then constructed a dictionary using $n \times 1$ CNN kernels. Representation learning over a dictionary with $K$ atoms is equivalent to applying $K$ linear filters of size $n \times 1$ to each input feature vector (each column of the matrix). The sparse coding solver then iteratively processes the $K$ coefficients.
[Figure 4.9 illustrates the three modality inputs (visual, audio, text) passing through feature embedding, CNN-based dictionary learning, and sparse representation.]

Figure 4.9: The pipeline of the baseline model DPL. It first embeds the three modality features into a common space, and then uses a CNN-based dictionary learning method to learn sparse representations. Finally, it concatenates these representations and feeds them into a classifier.
The experimental results are shown in Table 4.4. From this table, we find that: (1) DPL performs better than the above baseline methods, including MTDL and MDL. This demonstrates the superior performance of deep learning-based models. (2) Although DPL is a deep learning model, our model still outperforms it. Because DPL projects all the modalities into a common space in order to share a single dictionary, it loses the complementary information among the modalities. This also verifies that the common space assumption proposed in [192] is invalid.