Substituting Eq. (3.18) into $\mathbf{e}^{T}\boldsymbol{\beta} = 1$, we have:

$$
\left\{
\begin{aligned}
\delta &= \frac{1 + \mathbf{e}^{T}\mathbf{H}^{-1}\mathbf{g}}{\mathbf{e}^{T}\mathbf{H}^{-1}\mathbf{e}},\\
\boldsymbol{\beta} &= \mathbf{H}^{-1}\left(\frac{\mathbf{e} + \mathbf{e}^{T}\mathbf{H}^{-1}\mathbf{g}\,\mathbf{e}}{\mathbf{e}^{T}\mathbf{H}^{-1}\mathbf{e}} - \mathbf{g}\right),
\end{aligned}
\right.
\tag{3.19}
$$
According to the definition of a positive-definite matrix, $\mathbf{H}$ is always positive definite and hence invertible. Therefore, $\mathbf{H}^{-1}$ is also positive definite, which ensures $\mathbf{e}^{T}\mathbf{H}^{-1}\mathbf{e} > 0$.
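As an illustration, the closed-form update in Eq. (3.19) could be implemented with the following minimal NumPy sketch. It assumes the matrix $\mathbf{H}$ and vector $\mathbf{g}$ from the preceding derivation are already constructed, uses the algebraically equivalent form $\boldsymbol{\beta} = \mathbf{H}^{-1}(\delta\mathbf{e} - \mathbf{g})$, and all function and variable names are illustrative.

```python
import numpy as np

def update_beta(H, g):
    """Closed-form update of Eq. (3.19): returns (beta, delta).

    H is the positive-definite matrix and g the vector from the preceding
    derivation; e is the all-ones vector of matching length.
    """
    K = H.shape[0]
    e = np.ones(K)
    H_inv = np.linalg.inv(H)                # H is positive definite, hence invertible
    denom = e @ H_inv @ e                   # e^T H^{-1} e > 0
    delta = (1.0 + e @ H_inv @ g) / denom   # first line of Eq. (3.19)
    beta = H_inv @ (delta * e - g)          # equivalent to the second line of Eq. (3.19)
    return beta, delta
```

One can verify numerically that the returned $\boldsymbol{\beta}$ satisfies the constraint $\mathbf{e}^{T}\boldsymbol{\beta} = 1$.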
Computing $\mathbf{f}$ with $\beta_j$ Fixed
With fixed $\beta_j$, the partial derivative of the objective with respect to $f_i$, where $1 \le i \le N$, is

$$
2\,(f_i - y_i) + 2\sum_{j=1}^{N+M} \tilde{L}(i,j)\, f_j.
\tag{3.20}
$$
For $f_i$ with $N+1 \le i \le N+M$, the partial derivative is

$$
2\sum_{j=1}^{N+M} \tilde{L}(i,j)\, f_j.
\tag{3.21}
$$
Setting these derivatives to zero, we restate the solution of $\mathbf{f}$ in a vector-wise form as follows:

$$
\mathbf{f} = \mathbf{G}^{-1}\hat{\mathbf{y}},
\tag{3.22}
$$

where $\mathbf{G} = \hat{\mathbf{I}} + \sum_{k=1}^{K}\beta_k \tilde{\mathbf{L}}_k$, $\hat{\mathbf{y}} = \{y_1, y_2, \ldots, y_N, 0, 0, \ldots, 0\}$, and $\hat{\mathbf{I}} \in \mathbb{R}^{(N+M)\times(N+M)}$ is defined as follows:

$$
\hat{\mathbf{I}}(i,j) =
\begin{cases}
1, & \text{if } i = j \text{ and } 1 \le i \le N;\\
0, & \text{otherwise.}
\end{cases}
\tag{3.23}
$$
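As a companion to the previous sketch, the update in Eq. (3.22) could be computed as follows, assuming the modality-specific Laplacians $\tilde{\mathbf{L}}_k$, the weights $\beta_k$, and the $N$ labeled popularity values are already available; names are illustrative.

```python
import numpy as np

def update_f(L_list, beta, y_labeled, M):
    """Solve Eq. (3.22): f = G^{-1} y_hat, with
    G = I_hat + sum_k beta_k * L_k and I_hat defined as in Eq. (3.23)."""
    N = len(y_labeled)
    size = N + M
    I_hat = np.zeros((size, size))
    I_hat[np.arange(N), np.arange(N)] = 1.0            # ones only for the N labeled videos
    G = I_hat + sum(b * L for b, L in zip(beta, L_list))
    y_hat = np.concatenate([np.asarray(y_labeled, dtype=float), np.zeros(M)])
    return np.linalg.solve(G, y_hat)                   # numerically preferable to an explicit inverse
```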
3.6.3 EXPERIMENTS AND RESULTS
In this section, we conducted extensive experiments to verify our proposed TMALL model on Dataset I.
Experimental Settings
The remaining experiments were conducted over a cluster of 50 servers, each equipped with two Intel Xeon E5-2620 v3 CPUs at 2.40 GHz, 64 GB RAM, 24 cores, and a 64-bit Linux operating system. For deep feature extraction, we deployed the Caffe framework [71] on a server equipped with an NVIDIA Titan Z GPU. The experimental results reported in this chapter
were based on 10-fold cross-validation. In each round of the 10-fold cross-validation, we split
Dataset I into two chunks: 90% of the micro-videos were used for training and 10% were used
for testing.
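Such a protocol could be sketched with scikit-learn's KFold as follows; the feature matrix X and popularity vector y below are random placeholders rather than the actual Dataset I.

```python
import numpy as np
from sklearn.model_selection import KFold

# Illustrative placeholders for the feature matrix and popularity scores.
X = np.random.rand(100, 8)
y = np.random.rand(100)

kf = KFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    X_train, y_train = X[train_idx], y[train_idx]   # 90% of the micro-videos for training
    X_test, y_test = X[test_idx], y[test_idx]       # remaining 10% for testing
    # ... fit the model on the training chunk and evaluate nMSE on the test chunk
```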
We report performance in terms of the normalized mean square error (nMSE) [123] between the predicted popularity and the actual popularity. The nMSE is an estimator of the overall deviations between the predicted and measured values. It is defined as

$$
\text{nMSE} = \frac{\sum_{i} (p_i - r_i)^2}{\sum_{i} r_i^2},
\tag{3.24}
$$

where $p_i$ is the predicted value and $r_i$ is the target value in the ground truth.
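For reference, a minimal NumPy implementation of Eq. (3.24) could look as follows; the function name is ours.

```python
import numpy as np

def nmse(predicted, target):
    """Normalized mean square error as defined in Eq. (3.24)."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.sum((predicted - target) ** 2) / np.sum(target ** 2)
```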
We have three key parameters, as shown in Eq. (3.10). The optimal values of these parameters were carefully tuned with the training data in each of the 10 folds. We employed a grid search strategy to obtain the optimal parameters between $10^{-5}$ and $10^{2}$ with small but adaptive step sizes. In particular, the step sizes were 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, and 10 for the ranges [0.00001, 0.0001], [0.0001, 0.001], [0.001, 0.01], [0.01, 0.1], [0.1, 1], [1, 10], and [10, 100], respectively. The parameters corresponding to the best nMSE were used to report the final results. For the other compared systems, the parameter-tuning procedures were analogous to ensure a fair comparison. Taking one fold as an example, we observed that our model reached its optimal performance when the three parameters were set to 1, 0.01, and 100, respectively.
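The adaptive grid described above could be generated and searched as in the following sketch; train_and_validate is a hypothetical placeholder standing in for fitting TMALL with a given parameter triple and returning its validation nMSE.

```python
import numpy as np
from itertools import product

def train_and_validate(p1, p2, p3):
    # Hypothetical placeholder: fit TMALL with these three parameters on the
    # training portion of the fold and return the validation nMSE.
    # Here it returns a dummy value so the sketch runs end to end.
    return float(np.random.rand())

# Candidate values between 1e-5 and 1e2; within each sub-range the step size
# equals its lower bound, matching the adaptive steps described above.
sub_ranges = [(1e-5, 1e-4), (1e-4, 1e-3), (1e-3, 1e-2), (1e-2, 1e-1),
              (1e-1, 1.0), (1.0, 10.0), (10.0, 100.0)]
candidates = sorted({round(float(v), 10)
                     for lo, hi in sub_ranges
                     for v in np.arange(lo, hi + lo / 2.0, lo)})

best_nmse, best_params = float("inf"), None
for params in product(candidates, repeat=3):   # the three key parameters of Eq. (3.10)
    score = train_and_validate(*params)
    if score < best_nmse:                      # smaller nMSE is better
        best_nmse, best_params = score, params
```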
On Model Comparison
To demonstrate the effectiveness of our proposed TMALL model, we carried out experiments
on Dataset I with several state-of-the-art multi-view learning approaches.
Early_Fusion. e first baseline concatenates the features extracted from the four modal-
ities into a single joint feature vector, on which traditional machine learning models can
be applied. In this work, we adopted the widely used regression model—SVR, and imple-
mented it with the help of scikit-learn [130].
Late_Fusion. e second baseline first separately predicts the popularity of micro-videos
from each modality via SVR model, and then linearly integrates them to obtain the final
results.
regMVMT. e third baseline is the regularized multi-view learning model [190]. is
model only regulates the relationships among different views within the original space.
MSNL. e fourth one is the multiple social network learning (MSNL) model proposed
in [149]. is model takes the source confidence and source consistency into consideration.
MvDA. e fifth baseline is a multi-view discriminant analysis (MvDA) model [75],
which aims to learn a single unified discriminant common space for multiple views by
jointly optimizing multiple view-specific transforms, one for each view. The model exploits both the intra-view and inter-view correlations.
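To illustrate, the Early_Fusion baseline could be sketched with scikit-learn's SVR as follows; the per-modality feature matrices below are randomly generated stand-ins for the actual Dataset I features, and all names and dimensionalities are illustrative.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative per-modality features for the same set of micro-videos.
n_videos = 200
X_textual = np.random.rand(n_videos, 20)
X_visual = np.random.rand(n_videos, 50)
X_acoustic = np.random.rand(n_videos, 30)
X_social = np.random.rand(n_videos, 10)
popularity = np.random.rand(n_videos)

# Early fusion: concatenate the four modalities into one joint feature vector.
X_joint = np.hstack([X_textual, X_visual, X_acoustic, X_social])

model = make_pipeline(StandardScaler(), SVR())
model.fit(X_joint, popularity)
predicted = model.predict(X_joint)
```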
Table 3.1 shows the performance comparison among the different models. From this table, we have the following observations. (1) TMALL outperforms Early_Fusion and Late_Fusion. Regarding Early_Fusion, features extracted from various sources may not fall into the same semantic space; simply appending all features brings in a certain amount of noise and ambiguity. Besides, Early_Fusion may lead to the curse of dimensionality, since the final feature vector can be of very high dimension. For Late_Fusion, the fused result might not be accurate for two reasons. First, a single modality might not be sufficiently descriptive to represent the complex semantics of the videos; the separate results would thus be suboptimal, and their integration may not yield the desired outcome. Second, it is labor-intensive to tune the fusion weights for the different modalities, and even worse, the optimal weights for one application cannot be directly applied to another. (2) TMALL achieves better performance than regMVMT and MSNL. This can be explained by the fact that linking different modalities via a unified latent space works better than imposing a disagreement penalty directly over the original spaces. (3) The less satisfactory performance of MvDA indicates that it is necessary to explore the consistency among different modalities when building the latent space. (4) Compared with the multi-view learning baselines, namely regMVMT, MSNL, and MvDA, our model stably demonstrates its advantage. This signals that the proposed transductive model can achieve higher performance than inductive models under the same experimental settings, which can be explained by the fact that TMALL leverages the knowledge of the testing samples.
Table 3.1: Performance comparison between our proposed TMALL model and several state-of-the-art baselines on Dataset I in terms of nMSE

Methods        nMSE                P-value
Early_Fusion   59.931 ± 41.09      9.91e-04
Late_Fusion    8.461 ± 5.34        3.25e-03
regMVMT        1.058 ± 0.05        1.88e-03
MSNL           1.098 ± 0.13        1.42e-02
MvDA           0.982 ± 7.00e-03    9.91e-04
TMALL          0.979 ± 9.42e-03    –
Moreover, we performed the paired t-test between TMALL and each baseline over the 10-fold cross-validation. We found that all the p-values are much smaller than 0.05, which shows that the performance improvements of our proposed model over the baselines are statistically significant.
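Such a paired test can be carried out with SciPy's ttest_rel function, as in the sketch below; the per-fold nMSE values are illustrative placeholders rather than the actual experimental numbers.

```python
from scipy import stats

# Illustrative per-fold nMSE values over the 10-fold cross-validation.
tmall_nmse = [0.97, 0.98, 0.99, 0.97, 0.98, 0.96, 0.99, 0.98, 0.97, 0.98]
baseline_nmse = [1.05, 1.07, 1.04, 1.06, 1.08, 1.05, 1.06, 1.07, 1.04, 1.09]

t_stat, p_value = stats.ttest_rel(tmall_nmse, baseline_nmse)  # paired t-test
print(f"p-value: {p_value:.2e}")  # improvements are significant when p < 0.05
```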
On Modality Comparison
To verify the effectiveness of multi-modal integration, we also conducted experiments over different combinations of the four modalities. Table 3.2 summarizes the multi-modal analysis and the paired t-test results. It is obvious that the more modalities we incorporated, the better performance we obtained. This implies complementary rather than mutually conflicting relationships among the different modalities. Moreover, we found that removing the features of any of the four modalities leads to a decrease in performance. In a sense, this is consistent with the old saying “two heads are better than one.” Additionally, as the performance obtained from the different combinations is not the same, this validates that incorporating $\boldsymbol{\beta}$, which controls the confidence of the different modalities, is reasonable. Interestingly, we observed that the combination without the social modality obtains the worst result, which indicates that the social modality plays a pivotal role in micro-video propagation, as compared to the visual, textual, and acoustic modalities. This also validates that the features developed from the social modality are highly discriminative, even though they are low-dimensional. On the other hand, the textual modality contributes the least among all modalities, as our model without the textual modality still achieves good performance. This may be caused by the sparse textual descriptions, which are usually given in a single short sentence.
Table 3.2: Performance comparison among different modality combinations on Dataset I with respect to nMSE. We denote T, V, A, and S as the textual, visual, acoustic, and social modality, respectively.

View Combinations   nMSE                P-value
T + V + A           0.996 ± 4.20e-03    2.62e-05
T + A + S           0.982 ± 4.27e-03    2.59e-05
T + V + S           0.982 ± 4.13e-03    3.05e-04
V + A + S           0.981 ± 5.16e-03    2.16e-05
T + V + A + S       0.979 ± 9.42e-03    –
On Visual Feature Comparison
To further examine the discriminative visual features we extracted, we conducted experiments over different kinds of visual features using TMALL. We also performed a significance test to validate the advantage of combining multiple features. Table 3.3 comparatively shows the performance of TMALL under different visual feature configurations. It can be seen that the object, visual sentiment, and aesthetic features achieve similar improvements in performance, as compared to the color histogram features. This reveals that micro-videos' popularity is better reflected by their content, sentiment, and design, including what objects they contain, which
emotion they convey, and what design standards they follow. This is highly consistent with our observations and also implies that micro-videos aiming for high popularity need to be well designed, with particular attention paid to the visual content.
Table 3.3: Performance comparison among different visual features on Dataset I with respect to nMSE

Features            nMSE                P-value
Color Histogram     0.996 ± 6.88e-03    1.94e-04
Object Feature      0.994 ± 6.71e-03    2.47e-04
Visual Sentiment    0.994 ± 6.72e-03    2.49e-04
Aesthetic Feature   0.984 ± 6.95e-03    4.44e-01
ALL                 0.979 ± 9.42e-03    –
Illustrative Examples
To gain insights into the influential factors in micro-video popularity prediction, we comparatively illustrate a few representative examples in Figure 3.2. From this figure, we have the following observations. (1) Figure 3.2 shows three micro-video pairs. Each of the three pairs describes similar semantics, i.e., animals, a football game, and a sunset, respectively, but the videos were published by different users. The publishers of the videos in the top row are much more famous than those in the bottom row. We found that the popularity of the micro-videos in the second row is much lower than that of those in the first row, although there is no significant difference in their video content, which clearly justifies the importance of the social modality. (2) Figure 3.2 illustrates three micro-video pairs, where each pair of micro-videos was published by the same user. However, the micro-videos in the first row achieve much higher popularity than those in the second row, which demonstrates that the content of micro-videos also contributes to their popularity. In particular, the comparisons in Figure 3.2, from left to right, are (i) “skillful piano music” compared with “noisy dance music,” (ii) “funny animals” compared with “motionless dog,” and (iii) “beautiful flowers” compared with “gloomy sky.” These examples indicate the necessity of developing acoustic, visual sentiment, and visual aesthetic features for the task of micro-video popularity prediction. (3) Figure 3.2 shows a group of micro-videos whose textual descriptions contain either superstar names, hot hashtags, or informative descriptions. These micro-videos received a large number of loops, comments, likes, and reposts. These examples thus reflect the value of the textual modality.
Complexity Analysis
To theoretically analyze the computational cost of our proposed TMALL model, we first compute the complexity of constructing $\mathbf{H}$ and $\mathbf{g}$, as well as of inverting the matrices $\mathbf{H}$ and $\mathbf{G}$.