Experiments and Results (2/2)

3.7. MULTI-MODAL TRANSDUCTIVE LOW-RANK LEARNING 47

3.7.3 EXPERIMENTS AND RESULTS

We verify our proposed TLRMVR model over Dataset I.

Experimental Settings

As mentioned in Section 2.2, popularity is closely related to online social interactions. We therefore formulated the final popularity score of each micro-video from the mean values of four types of statistics, namely, the numbers of comments, reposts, likes, and views/loops. Figure 3.4 shows sample micro-videos that span a wide range of popularity scores.

We tested the prediction performance over 10 random splits of Dataset I and report the average results. In each round, we used 90% of the micro-videos for training and the remaining 10% for testing. We empirically set the adaptive parameters to α = 1, δ = 0.1, and [?] = 0.01 by default. The trade-off parameters β and [?] in the TLRMVR model were selected by grid search: we first performed a coarse grid search and, once we had identified a promising region, conducted a finer grid search on that region. Finally, we set [?] = 0.1 and β = 0.5.
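The coarse-then-fine search described above can be sketched as follows. This is a minimal illustration, not the authors' tuning code: the grid ranges, the helper names (`grid_search`, `refine`, `coarse_to_fine_search`), and the stand-in objective `toy_error` are all assumptions, and the second trade-off parameter is simply called `weight`.

```python
import itertools


def grid_search(evaluate, beta_values, weight_values):
    """Return (best_error, best_beta, best_weight) over a rectangular grid."""
    best = None
    for beta, weight in itertools.product(beta_values, weight_values):
        error = evaluate(beta, weight)  # e.g. a validation nMSE; lower is better
        if best is None or error < best[0]:
            best = (error, beta, weight)
    return best


def refine(center, factor=10.0, steps=5):
    """Build a finer, log-spaced grid that spans one decade around `center`."""
    low = center / factor ** 0.5
    ratio = factor ** (1.0 / (steps - 1))
    return [low * ratio ** i for i in range(steps)]


def coarse_to_fine_search(evaluate):
    # Coarse pass: powers of ten (hypothetical range, not taken from the text).
    coarse = [10.0 ** p for p in range(-3, 3)]
    _, beta0, weight0 = grid_search(evaluate, coarse, coarse)
    # Fine pass: a denser grid centred on the best coarse cell.
    return grid_search(evaluate, refine(beta0), refine(weight0))


def toy_error(beta, weight):
    """Stand-in objective with its minimum at beta=0.5, weight=0.1."""
    return (beta - 0.5) ** 2 + (weight - 0.1) ** 2
```

In practice `evaluate` would train the model with the given parameters and return a validation error; the fine pass can be repeated on successively narrower grids if more precision is needed.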

We denote by X the feature matrices of the visual, acoustic, textual, and social views, one matrix per view. It is worth mentioning that in these experiments, we did not consider visual quality assessment features in the visual modality.
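The evaluation protocol in the Experimental Settings (10 random splits, 90% of the micro-videos for training, results averaged) might be organized as in the sketch below. The function names and the `fit_and_score` callback are hypothetical; the callback stands for whatever routine trains a model on the training indices and returns a test metric such as nMSE.

```python
import random
import statistics


def random_split(n_items, train_fraction=0.9, rng=random):
    """Shuffle item indices and cut them into train and test index lists."""
    indices = list(range(n_items))
    rng.shuffle(indices)
    cut = int(round(train_fraction * n_items))
    return indices[:cut], indices[cut:]


def average_over_splits(n_items, fit_and_score, n_splits=10, seed=0):
    """Average a test metric over several random train/test splits."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    scores = []
    for _ in range(n_splits):
        train_idx, test_idx = random_split(n_items, rng=rng)
        scores.append(fit_and_score(train_idx, test_idx))
    return statistics.mean(scores)
```

Seeding the generator keeps the ten splits identical across the compared methods, so differences in the averaged metric are not caused by different partitions.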

Results and Discussions

To comprehensively validate the proposed algorithm, we evaluated it from the following perspectives.

• Convergence analysis: We tested the convergence of the proposed alternating algorithm.

• Component analysis: To verify the effectiveness of the different components in our proposed scheme, we compared the prediction performance after removing each component from our method.

• Feature analysis: To evaluate how features contribute to micro-video popularity prediction, we considered two forms of evaluation: (i) performance comparison among different views and (ii) performance comparison among different visual features.

• Parameter sensitivity analysis: We conducted experiments to investigate the influence of various weighting parameters on the prediction accuracy.

• Comparison with state-of-the-art methods: Performance comparisons with several state-of-the-art algorithms were conducted to demonstrate the effectiveness of our method.

48 3. MULTIMODAL TRANSDUCTIVE LEARNING

Figure 3.4: Micro-video examples sampled from Dataset I with various popularity scores. The micro-videos are sorted from more popular (left) to less popular (right). Scores shown in the figure: 0.2870, 0.2463, 0.2429, 0.1934, 0.1215, 0.1364, 0.0404, 0.0315, 0.0277, 0.0087, 0.0094, 0.0046.

Convergence Analysis In this part, we tested the convergence of our objective function under the proposed alternating algorithm and report the results of a randomly selected trial. Because Z is used for predicting the popularity of micro-videos, we measured the variance between two sequential Zs by the following metric:

$$D(t) = \left\| Z_t - Z_{t-1} \right\|. \qquad (3.45)$$

Tracking this metric verifies that the final feature representation no longer changes drastically between iterations. Figure 3.5 presents the absolute values of the variance during the iterations. As shown in this figure, the divergence values obtained for our proposed algorithm decrease rapidly as the number of iterations increases and converge after approximately 20 iterations. Based on the above analysis, suitable stopping criteria are essential to guarantee the convergence of our objective function. Therefore, in this chapter, we stopped the proposed method when the relative change between two consecutive iterations fell below a threshold of 1e-3 or when a maximum of 30 iterations was reached.
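The convergence monitoring and stopping rule described above can be illustrated with a small sketch. Several points are assumptions rather than details from the text: the change between iterates is measured with the Frobenius norm, the relative change is taken with respect to the previous iterate, and `toy_update` is a hypothetical contraction standing in for one pass of the alternating optimization.

```python
import math


def frobenius_norm(matrix):
    """Frobenius norm of a matrix stored as a list of rows."""
    return math.sqrt(sum(value * value for row in matrix for value in row))


def difference(a, b):
    """Element-wise difference a - b of two equally shaped matrices."""
    return [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def iterate_until_converged(update, z0, tol=1e-3, max_iter=30):
    """Apply `update` until the relative change drops below `tol` or `max_iter` is hit.

    Returns the final iterate and the list of changes D(t), one per iteration.
    """
    z_prev, history = z0, []
    for _ in range(max_iter):
        z_next = update(z_prev)
        change = frobenius_norm(difference(z_next, z_prev))  # D(t), assumed Frobenius
        history.append(change)
        # Relative change; the small constant guards against division by zero.
        relative = change / (frobenius_norm(z_prev) + 1e-12)
        z_prev = z_next
        if relative < tol:
            break
    return z_prev, history


def toy_update(z):
    """Hypothetical contraction that moves Z halfway towards a fixed target."""
    target = [[1.0, 2.0], [3.0, 4.0]]
    return [[0.5 * (v + t) for v, t in zip(rz, rt)] for rz, rt in zip(z, target)]
```

Plotting `history` against the iteration index gives a curve of the same kind as the convergence plot: a rapid initial drop followed by a flat tail.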

Figure 3.5: The convergence curve of our proposed TLRMVR method. The horizontal axis represents the number of iterations (ticks from 5 to 50), and the vertical axis is the divergence between two consecutive measured Zs (axis scale × 10^-8).

Component Analysis To validate the contribution of each component in our proposed framework, we compared the prediction performance after removing the relevant components.

• noLR: We eliminated the influence of the low-rank constraint imposed on Z by replacing it with the Frobenius norm.

• noGR: We eliminated the influence of the graph regularization term by setting its weighting parameter to 0.

• noMR: We eliminated the effect of multi-view embedding learning by setting δ = 0.

• noSP: We eliminated the influence of supervised information, i.e., view information and regression information, by discarding the learning of both the regression coefficients and the view-specific transformation matrices.

In the case of noSP, our algorithm degenerates to a typical unsupervised low-rank feature representation. To obtain comparable results, a least squares regression model is trained on this representation to predict popularity scores. Table 3.4 shows the prediction results of the different schemes. In this table, we selected the top 50, 100, and 200 micro-videos with the highest ground truth and the bottom 50, 100, and 200 micro-videos with the lowest ground truth, and report the average popularity scores based on their predicted results. As shown in this table, the predicted popularity scores over the different ranges follow Top50 > Top100 > Top200 > Bottom200 > Bottom100 > Bottom50, illustrating that the behavior of the predicted results is reasonable. Moreover, we sorted the nMSE values of the different methods in descending order and found that noSP > noLR > noGR = noMR; thus, the following conclusions are obtained. (1) Without supervised information, noSP performs the worst, indicating that the supervised information is essential for learning a more robust prediction model. Moreover, noSP separates micro-video popularity prediction into two phases, which may lead to sub-optimal prediction results. (2) noMR and noLR have similarly significant effects on the prediction results, which means that the low-rank representation and multi-view embedding learning are important in reducing the heterogeneous gap among features and alleviating the influence of feature noise. (3) Our proposed TLRMVR outperforms noGR, which demonstrates that our proposed method benefits from the use of graph regularization. This result further indicates that multi-graph regularization can indeed be employed to address the multi-view feature fusion problem. (4) The P-value [124] is adopted to assess whether the superiority of the TLRMVR method is statistically significant. The P-values are all smaller than the significance level of 0.05, which indicates that the null hypothesis is clearly rejected and that the improvements of TLRMVR are statistically significant.

Table 3.4: Performance comparison of involved components in our proposed framework on Dataset I

Scheme       noLR     noGR     noMR     noSP     TLRMVR
Top50        0.296    0.347    0.326    0.204    0.309
Top100       0.291    0.317    0.311    0.201    0.280
Top200       0.276    0.296    0.285    0.192    0.276
Bottom200    0.253    0.271    0.269    0.172    0.265
Bottom100    0.249    0.258    0.254    0.161    0.256
Bottom50     0.246    0.251    0.251    0.157    0.249
nMSE         0.950    0.949    0.949    0.973    0.934
P-value      < 0.05   < 0.05   < 0.05   < 0.05   –
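The two kinds of rows reported for the component analysis, the average predicted score of the top-k and bottom-k items and the nMSE, can be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, and nMSE is taken here to be the mean squared error divided by the variance of the ground truth, which is one common definition and may differ from the one used earlier in the book.

```python
import statistics


def mean_prediction_of_extremes(y_true, y_pred, k, top=True):
    """Average predicted score of the k items with the highest (or lowest) ground truth."""
    order = sorted(range(len(y_true)), key=lambda i: y_true[i], reverse=top)
    return statistics.mean(y_pred[i] for i in order[:k])


def nmse(y_true, y_pred):
    """Mean squared error divided by the ground-truth variance (assumed definition)."""
    mse = statistics.mean((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return mse / statistics.pvariance(y_true)
```

Under this assumed definition, a predictor that always outputs the mean ground-truth score has an nMSE of 1, so values below 1 indicate an improvement over that trivial baseline.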

Feature Analysis Under our proposed framework, we investigated the influence of different features on micro-video popularity prediction from two perspectives: (i) performance comparison of different visual-level feature combinations and (ii) performance comparison of different view-level feature combinations.

We first selected one of the four visual features to represent the visual content of micro-videos and integrated it with the contextual, social, and acoustic cues to conduct our experiments. Table 3.5 reports the average results over 10 random splits in terms of nMSE and P-value. From Table 3.5, we can observe the following: (1) object features perform the best among the visual features, indicating that object semantics encode important information about what makes a micro-video popular; (2) visual sentiment has a significant influence on prediction performance, illustrating that high-level sentiment semantics are helpful for micro-video popularity prediction; (3) aesthetic features perform better than the color histogram, since aesthetic features capture the highly subjective nature of human perception; (4) the color histogram yields the worst performance, although it is effective in modeling the color perception of the human visual system; and (5) the best performance is achieved when all visual features are combined, illustrating the benefit of exploiting the complementary information offered by different visual representations.

Subsequently, we evaluated how various view-level feature combinations contribute to popularity prediction under our proposed framework. For simplicity, the features extracted from textual, visual, acoustic, and social cues are denoted as “T”, “V”, “A”, and “S”, respectively. Table 3.6 shows the average results in terms of nMSE and P-value. From Table 3.6, we can observe the following: (1) consistent with existing studies, “T+V+A”, which removes the social cues, provides the least satisfactory results, indicating that social cues facilitate popularity prediction more than the other types of cues; (2) the prediction performance of “T+A+S” decreases sharply after removing the visual cues, which shows that the visual cues of micro-videos are an indispensable component for improving prediction performance; (3) “V+A+S” yields a good result of nMSE = 0.955 compared with the other combinations, indicating that textual cues have little effect on popularity prediction. One possible reason is that a fair number of micro-videos lack textual descriptions; the weak correlation between textual descriptions and micro-video content is another common cause; and (4) when all view features are combined, the best performance is achieved, with a minimum nMSE of 0.934. We can therefore conclude that the cues, sorted in descending order of importance, are social > visual > acoustic > textual.
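The reasoning behind this ordering is a leave-one-view-out comparison: the more the nMSE rises when a view is removed, the more important that view is. A minimal sketch of that ranking step is given below; the function name is hypothetical, and the numbers in the usage note are placeholders, not values from Table 3.6.

```python
def rank_views_by_importance(nmse_all_views, nmse_without_view):
    """Sort views by how much nMSE rises when each one is left out (largest rise first)."""
    increase = {view: score - nmse_all_views for view, score in nmse_without_view.items()}
    return sorted(increase, key=increase.get, reverse=True)
```

For example, with placeholder inputs `rank_views_by_importance(0.90, {"view_a": 0.95, "view_b": 0.91, "view_c": 0.93})`, the function returns `["view_a", "view_c", "view_b"]`, because removing `view_a` hurts the most.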
