3.7.3 EXPERIMENTS AND RESULTS
We verify our proposed TLRMVR model on Dataset I.
Experimental Settings
As mentioned in Section 2.2, because popularity is highly related to online social interactions, the mean values of four types of statistics, namely the numbers of comments, reposts, likes, and views/loops, are used to formulate the final popularity scores of micro-videos. Figure 3.4 shows sample micro-videos that span a wide range of popularity scores.
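To make this concrete, the following is a minimal sketch of how such a score could be assembled. The min-max normalization and the plain mean over the four statistics are our illustrative assumptions; the exact formulation is the one given in Section 2.2.

```python
import numpy as np

def popularity_scores(comments, reposts, likes, views):
    """Illustrative sketch only: normalize each of the four interaction
    statistics across the dataset and average them into one score."""
    stats = np.column_stack([comments, reposts, likes, views]).astype(float)
    lo, hi = stats.min(axis=0), stats.max(axis=0)
    normalized = (stats - lo) / (hi - lo + 1e-12)  # per-statistic min-max scaling
    return normalized.mean(axis=1)                 # mean over the four statistics
```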
We tested the prediction performance over 10 random splits of Dataset I and report the average results. In each round, we used 90% of the micro-videos for training and the remaining 10% for testing. We empirically set the adaptive parameters to α = 1, δ = 0.1, and λ = 0.01 by default. The trade-off parameters β and γ of the TLRMVR model were selected by a grid-search approach: we first performed a coarse grid search and, once an ideal region was identified, conducted a finer search over that region. Finally, we set γ = 0.1 and β = 0.5.
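This coarse-to-fine search can be sketched as follows. The grid values and the `evaluate` function, which would train TLRMVR on the training split and return the validation nMSE, are illustrative assumptions rather than the authors' exact protocol.

```python
import itertools

def coarse_to_fine_search(evaluate, coarse=(1e-3, 1e-2, 1e-1, 1.0, 10.0)):
    """Hypothetical coarse-to-fine grid search for beta and gamma;
    evaluate(beta=..., gamma=...) returns the validation nMSE (lower is better)."""
    # Coarse pass over a logarithmic grid.
    beta0, gamma0 = min(itertools.product(coarse, coarse),
                        key=lambda bg: evaluate(beta=bg[0], gamma=bg[1]))
    # Finer pass around the best coarse cell.
    fine_b = [beta0 * s for s in (0.25, 0.5, 1.0, 2.0, 4.0)]
    fine_g = [gamma0 * s for s in (0.25, 0.5, 1.0, 2.0, 4.0)]
    return min(itertools.product(fine_b, fine_g),
               key=lambda bg: evaluate(beta=bg[0], gamma=bg[1]))
```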
We denote X_1, X_2, X_3, and X_4 as the feature matrices corresponding to the visual, acoustic, textual, and social views, respectively. It is worth mentioning that, in these experiments, we did not consider the visual quality assessment features in the visual modality.
Results and Discussions
To comprehensively validate the proposed algorithm, in the following experiments we examined it from the following perspectives.
• Convergence analysis: we tested the convergence of the objective function under the proposed alternating algorithm.
• Component analysis: to verify the effectiveness of the different components of our proposed scheme, we compared the prediction performance obtained by removing each component from our method.
• Feature analysis: to evaluate how features contribute to micro-video popularity prediction, we considered two forms of evaluation: (i) performance comparison among different views and (ii) performance comparison among different visual features.
• Parameter sensitivity analysis: we conducted experiments to investigate the influence of various weighting parameters on the prediction accuracy.
• Comparison with state-of-the-art methods: performance comparisons with several state-of-the-art algorithms were conducted to demonstrate the effectiveness of our method.
Convergence Analysis In this part, we tested the convergence of our objective function under the proposed alternating algorithm and randomly selected one trial to report the results.
Figure 3.4: Micro-video examples sampled from Dataset I with various popularity scores, ranging from 0.2870 down to 0.0046. The micro-videos are sorted from more popular (left) to less popular (right).
Because Z is used for predicting the popularity of micro-videos, we would like to measure the variance between two sequential Zs by the following metric:

D(t) = ||Z_t − Z_{t-1}||_F.    (3.45)
This guarantees that the final feature representation does not change drastically. Figure 3.5 presents the absolute values of the variance over the iterations. As shown in the figure, the divergence values obtained by our proposed algorithm decrease rapidly as the number of iterations increases and converge after approximately 20 iterations. Based on the above analysis, appropriate stopping criteria are essential to guarantee the convergence of our objective function. Therefore, in this chapter, we used the relative change between two consecutive iterations falling below a threshold of 1e-3, together with a maximum of 30 iterations, as the stopping criteria for our proposed method.
Figure 3.5: The convergence curve of our proposed TLRMVR method. The horizontal axis represents the number of iterations (up to 50), and the vertical axis is the divergence (×10^-8) between two consecutive measured Zs.
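As a concrete illustration, the stopping rule just described can be sketched as follows. Here `update_Z`, standing for one round of the alternating updates, is a hypothetical placeholder for the actual optimization step.

```python
import numpy as np

def optimize(update_Z, Z0, tol=1e-3, max_iter=30):
    """Run the alternating updates until the relative change in Z falls
    below tol or max_iter iterations are reached."""
    Z_prev = Z0
    for t in range(max_iter):
        Z = update_Z(Z_prev)
        # D(t) = ||Z_t - Z_{t-1}||_F, the divergence of Eq. (3.45)
        diff = np.linalg.norm(Z - Z_prev, ord="fro")
        # Stop once the relative change drops below the threshold.
        if diff / max(np.linalg.norm(Z_prev, ord="fro"), 1e-12) < tol:
            return Z
        Z_prev = Z
    return Z_prev
```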
Component Analysis To validate the contributions of each component in our proposed
framework, we compared the prediction performance by removing the relevant components.
• noLR: we eliminated the influence of the low-rank constraint imposed on Z by replacing it with the Frobenius norm.
• noGR: we eliminated the influence of the graph regularization term by setting γ = 0.
• noMR: we eliminated the effect of multi-view embedding learning by setting δ = 0.
• noSP: we eliminated the influence of supervised information, i.e., view information and regression information, by discarding the learning of both the regression coefficients and the view-specific transformation matrices.
In the case of noSP, our algorithm degenerates to a typical unsupervised low-rank feature representation. To obtain comparable results, a least-squares regression model is then trained to predict popularity scores. Table 3.4 shows the prediction results of the different schemes. In this table, we selected the top 50, 100, and 200 micro-videos with the highest ground truth and the bottom 50, 100, and 200 micro-videos with the lowest ground truth and report the average popularity scores based on their predicted results. As shown in the table, the predicted popularity scores over the different ranges satisfy Top50 > Top100 > Top200 > Bottom200 > Bottom100 > Bottom50, illustrating that the behavior of the predicted results is reasonable. Moreover, we sorted the nMSE values of the different methods in descending order and found that noSP > noLR > noGR = noMR; thus, the following conclusions can be drawn. (1) Without supervised information, noSP performs the worst, indicating that the valuable supervised information is essential to learn a more robust prediction model. Moreover, noSP separates micro-video popularity prediction into two phases, which may lead to sub-optimal prediction results. (2) noMR and noLR impose similarly significant effects on the prediction results, which means that the low-rank representation and multi-view embedding learning are both important in reducing the heterogeneous gap among features and alleviating the influence of feature noise. (3) Our proposed TLRMVR outperforms noGR, which demonstrates that our method benefits from the use of graph regularization. This result further indicates that multi-graph regularization can indeed be employed to address the multi-view feature fusion problem.
Table 3.4: Performance comparison of the involved components in our proposed framework on Dataset I

             noLR     noGR     noMR     noSP     TLRMVR
Top50        0.296    0.347    0.326    0.204    0.309
Top100       0.291    0.317    0.311    0.201    0.280
Top200       0.276    0.296    0.285    0.192    0.276
Bottom200    0.253    0.271    0.269    0.172    0.265
Bottom100    0.249    0.258    0.254    0.161    0.256
Bottom50     0.246    0.251    0.251    0.157    0.249
nMSE         0.950    0.949    0.949    0.973    0.934
P-value      <0.05    <0.05    <0.05    <0.05
(4) The P-value [124] is adopted to assess whether the superiority of the TLRMVR method is statistically significant. We can see that the P-values are smaller than the significance level of 0.05, which indicates that the null hypothesis is clearly rejected and that the improvements achieved by TLRMVR are statistically significant.
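For reference, the two evaluation quantities can be computed as follows. The variance-normalized definition of nMSE and the paired t-test on per-sample squared errors are common conventions assumed here for illustration; the chapter defers the details of the significance test to [124].

```python
import numpy as np
from scipy import stats

def nmse(y_true, y_pred):
    # Normalized MSE: MSE divided by the variance of the ground truth
    # (a common convention; the chapter does not restate its definition).
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def paired_p_value(y_true, pred_full, pred_ablated):
    # Paired t-test on the per-sample squared errors of two models,
    # one way to obtain P-values like those in Table 3.4.
    err_full = (y_true - pred_full) ** 2
    err_ablated = (y_true - pred_ablated) ** 2
    return stats.ttest_rel(err_full, err_ablated).pvalue
```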
Feature Analysis Under our proposed framework, we investigated the influence of different features on micro-video popularity prediction from two perspectives: (i) performance comparison of different visual-level feature combinations and (ii) performance comparison of different view-level feature combinations.
We first selected one of the four visual features to represent the visual content of micro-videos and integrated it with the textual, social, and acoustic cues to conduct our experiments. Table 3.5 reports the average results over 10 random splits in terms of nMSE and P-value. From Table 3.5, we can observe the following: (1) object features perform the best among the visual features, indicating that object semantics encode important information about what makes a micro-video popular; (2) visual sentiment has a significant influence on prediction performance, illustrating that high-level sentiment semantics are helpful for micro-video popularity prediction; (3) aesthetic features exhibit better performance than the color histogram, since they capture the highly subjective nature of human perception; (4) the worst performance is achieved by the color histogram, even though it is effective in modeling the color perception of the human visual system; and (5) the best performance is achieved when all visual features are combined, illustrating the benefit of exploiting the complementary information offered by different visual representations.
Subsequently, we evaluated how various view-level feature combinations contribute to the popularity of micro-videos under our proposed framework. For simplicity, the features extracted from textual, visual, acoustic, and social cues are denoted as “T”, “V”, “A”, and “S”, respectively. Table 3.6 shows the average results in terms of nMSE and P-value. From Table 3.6, we can observe the following: (1) in line with other existing studies, “T+V+A” provides the most unsatisfactory results, which indicates that social cues facilitate popularity prediction to a much larger extent than the other types of cues; (2) the prediction performance of “T+A+S” decreases sharply after removing the visual cues. This result shows that the visual cues of micro-videos serve as an indispensable component for further improving the prediction performance; (3) “V+A+S” yields a good result of nMSE = 0.955 compared to the other combinations, indicating that textual cues have little effect on popularity. One possible reason for this phenomenon is that a fair number of micro-videos lack textual descriptions; the weak correlation between textual descriptions and micro-videos is another common cause; and (4) the best performance, with a minimum nMSE of 0.934, is achieved when all view features are combined. It can therefore be concluded that, sorted in descending order of importance, the cues rank as social > visual > acoustic > textual.
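The view-level ablation just described can be driven by a simple loop over the three-view subsets plus the full set. In this sketch, `views` maps each cue letter to its feature matrix and `evaluate` is a hypothetical function that trains the model on the given matrices and returns the nMSE.

```python
from itertools import combinations

def view_ablation(views, evaluate):
    """views: e.g. {"T": X3, "V": X1, "A": X2, "S": X4};
    evaluate(list_of_feature_matrices) -> nMSE."""
    results = {}
    # Every three-view combination (T+V+A, T+A+S, ...) plus all four views.
    for subset in list(combinations(sorted(views), 3)) + [tuple(sorted(views))]:
        results["+".join(subset)] = evaluate([views[v] for v in subset])
    return results
```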