Substituting Eq. (3.18) into $\mathbf{e}^{T}\boldsymbol{\beta} = 1$, we have:

$$
\left\{
\begin{aligned}
\delta &= \frac{1 + \mathbf{e}^{T}\mathbf{H}^{-1}\mathbf{g}}{\mathbf{e}^{T}\mathbf{H}^{-1}\mathbf{e}},\\
\boldsymbol{\beta} &= \mathbf{H}^{-1}\left(\frac{\mathbf{e} + \mathbf{e}^{T}\mathbf{H}^{-1}\mathbf{g}\,\mathbf{e}}{\mathbf{e}^{T}\mathbf{H}^{-1}\mathbf{e}} - \mathbf{g}\right),
\end{aligned}
\right.
\tag{3.19}
$$
According to the definition of a positive-definite matrix, $\mathbf{H}$ is always positive definite and hence invertible. Therefore, $\mathbf{H}^{-1}$ is also positive definite, which ensures $\mathbf{e}^{T}\mathbf{H}^{-1}\mathbf{e} > 0$.
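As an illustration, the closed-form update in Eq. (3.19) could be implemented with the following minimal NumPy sketch. It assumes the matrix $\mathbf{H}$ and vector $\mathbf{g}$ from the preceding derivation are already constructed, uses the algebraically equivalent form $\boldsymbol{\beta} = \mathbf{H}^{-1}(\delta\mathbf{e} - \mathbf{g})$, and all function and variable names are illustrative.

```python
import numpy as np

def update_beta(H, g):
    """Closed-form update of Eq. (3.19): returns (beta, delta).

    H is the positive-definite matrix and g the vector from the preceding
    derivation; e is the all-ones vector of matching length.
    """
    K = H.shape[0]
    e = np.ones(K)
    H_inv = np.linalg.inv(H)                # H is positive definite, hence invertible
    denom = e @ H_inv @ e                   # e^T H^{-1} e > 0
    delta = (1.0 + e @ H_inv @ g) / denom   # first line of Eq. (3.19)
    beta = H_inv @ (delta * e - g)          # equivalent to the second line of Eq. (3.19)
    return beta, delta
```

One can verify numerically that the returned $\boldsymbol{\beta}$ satisfies the constraint $\mathbf{e}^{T}\boldsymbol{\beta} = 1$.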
Computing $\mathbf{f}$ with $\beta_j$ Fixed
With fixed $\beta_j$, the partial derivative of the objective with respect to $f_i$, where $1 \le i \le N$, is

$$
2\,(f_i - y_i) + 2\sum_{j=1}^{N+M} \tilde{L}(i,j)\, f_j.
\tag{3.20}
$$
For $f_i$ with $N+1 \le i \le N+M$, the partial derivative is

$$
2\sum_{j=1}^{N+M} \tilde{L}(i,j)\, f_j.
\tag{3.21}
$$
Setting these derivatives to zero, we restate the solution of $\mathbf{f}$ in a vector-wise form as follows:

$$
\mathbf{f} = \mathbf{G}^{-1}\hat{\mathbf{y}},
\tag{3.22}
$$

where $\mathbf{G} = \hat{\mathbf{I}} + \sum_{k=1}^{K}\beta_k \tilde{\mathbf{L}}_k$, $\hat{\mathbf{y}} = \{y_1, y_2, \ldots, y_N, 0, 0, \ldots, 0\}$, and $\hat{\mathbf{I}} \in \mathbb{R}^{(N+M)\times(N+M)}$ is defined as follows:

$$
\hat{\mathbf{I}}(i,j) =
\begin{cases}
1, & \text{if } i = j \text{ and } 1 \le i \le N;\\
0, & \text{otherwise.}
\end{cases}
\tag{3.23}
$$
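As a companion to the previous sketch, the update in Eq. (3.22) could be computed as follows, assuming the modality-specific Laplacians $\tilde{\mathbf{L}}_k$, the weights $\beta_k$, and the $N$ labeled popularity values are already available; names are illustrative.

```python
import numpy as np

def update_f(L_list, beta, y_labeled, M):
    """Solve Eq. (3.22): f = G^{-1} y_hat, with
    G = I_hat + sum_k beta_k * L_k and I_hat defined as in Eq. (3.23)."""
    N = len(y_labeled)
    size = N + M
    I_hat = np.zeros((size, size))
    I_hat[np.arange(N), np.arange(N)] = 1.0            # ones only for the N labeled videos
    G = I_hat + sum(b * L for b, L in zip(beta, L_list))
    y_hat = np.concatenate([np.asarray(y_labeled, dtype=float), np.zeros(M)])
    return np.linalg.solve(G, y_hat)                   # numerically preferable to an explicit inverse
```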
3.6.3 EXPERIMENTS AND RESULTS
In this section, we conducted extensive experiments to verify our proposed TMALL model on Dataset I.
Experimental Settings
The remaining experiments were conducted over a cluster of 50 servers, each equipped with two Intel Xeon E5-2620 v3 CPUs at 2.40 GHz, 64 GB RAM, 24 cores, and a 64-bit Linux operating system. For deep feature extraction, we deployed the Caffe framework [71] on a server equipped with an NVIDIA Titan Z GPU. The experimental results reported in this chapter
were based on 10-fold cross-validation. In each round of the 10-fold cross-validation, we split
Dataset I into two chunks: 90% of the micro-videos were used for training and 10% were used
for testing.
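Such a protocol could be sketched with scikit-learn's KFold as follows; the feature matrix X and popularity vector y below are random placeholders rather than the actual Dataset I.

```python
import numpy as np
from sklearn.model_selection import KFold

# Illustrative placeholders for the feature matrix and popularity scores.
X = np.random.rand(100, 8)
y = np.random.rand(100)

kf = KFold(n_splits=10, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X):
    X_train, y_train = X[train_idx], y[train_idx]   # 90% of the micro-videos for training
    X_test, y_test = X[test_idx], y[test_idx]       # remaining 10% for testing
    # ... fit the model on the training chunk and evaluate nMSE on the test chunk
```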
We report performance in terms of the normalized mean square error (nMSE) [123] between the predicted popularity and the actual popularity. The nMSE is an estimator of the overall deviations between the predicted and measured values. It is defined as

$$
\text{nMSE} = \frac{\sum_{i} (p_i - r_i)^2}{\sum_{i} r_i^2},
\tag{3.24}
$$

where $p_i$ is the predicted value and $r_i$ is the target value in the ground truth.
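For reference, a minimal NumPy implementation of Eq. (3.24) could look as follows; the function name is ours.

```python
import numpy as np

def nmse(predicted, target):
    """Normalized mean square error as defined in Eq. (3.24)."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.sum((predicted - target) ** 2) / np.sum(target ** 2)
```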
We have three key parameters, as shown in Eq. (3.10). The optimal values of these parameters were carefully tuned with the training data in each of the 10 folds. We employed a grid search strategy to obtain the optimal parameters between $10^{-5}$ and $10^{2}$ with small but adaptive step sizes. In particular, the step sizes were 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, and 10 for the ranges [0.00001, 0.0001], [0.0001, 0.001], [0.001, 0.01], [0.01, 0.1], [0.1, 1], [1, 10], and [10, 100], respectively. The parameters corresponding to the best nMSE were used to report the final results. For the other compared systems, the parameter-tuning procedures were analogous to ensure a fair comparison. Taking one fold as an example, we observed that our model reached its optimal performance when the three parameters were set to 1, 0.01, and 100, respectively.
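The adaptive grid described above could be generated and searched as in the following sketch; train_and_validate is a hypothetical placeholder standing in for fitting TMALL with a given parameter triple and returning its validation nMSE.

```python
import numpy as np
from itertools import product

def train_and_validate(p1, p2, p3):
    # Hypothetical placeholder: fit TMALL with these three parameters on the
    # training portion of the fold and return the validation nMSE.
    # Here it returns a dummy value so the sketch runs end to end.
    return float(np.random.rand())

# Candidate values between 1e-5 and 1e2; within each sub-range the step size
# equals its lower bound, matching the adaptive steps described above.
sub_ranges = [(1e-5, 1e-4), (1e-4, 1e-3), (1e-3, 1e-2), (1e-2, 1e-1),
              (1e-1, 1.0), (1.0, 10.0), (10.0, 100.0)]
candidates = sorted({round(float(v), 10)
                     for lo, hi in sub_ranges
                     for v in np.arange(lo, hi + lo / 2.0, lo)})

best_nmse, best_params = float("inf"), None
for params in product(candidates, repeat=3):   # the three key parameters of Eq. (3.10)
    score = train_and_validate(*params)
    if score < best_nmse:                      # smaller nMSE is better
        best_nmse, best_params = score, params
```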
On Model Comparison
To demonstrate the effectiveness of our proposed TMALL model, we carried out experiments
on Dataset I with several state-of-the-art multi-view learning approaches.
Early_Fusion. e first baseline concatenates the features extracted from the four modal-
ities into a single joint feature vector, on which traditional machine learning models can
be applied. In this work, we adopted the widely used regression model—SVR, and imple-
mented it with the help of scikit-learn [130].
Late_Fusion. e second baseline first separately predicts the popularity of micro-videos
from each modality via SVR model, and then linearly integrates them to obtain the final
results.
regMVMT. e third baseline is the regularized multi-view learning model [190]. is
model only regulates the relationships among different views within the original space.
MSNL. e fourth one is the multiple social network learning (MSNL) model proposed
in [149]. is model takes the source confidence and source consistency into consideration.
MvDA. e fifth baseline is a multi-view discriminant analysis (MvDA) model [75],
which aims to learn a single unified discriminant common space for multiple views by
jointly optimizing multiple view-specific transforms, one for each view. The model exploits both the intra-view and inter-view correlations.
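To illustrate, the Early_Fusion baseline could be sketched with scikit-learn's SVR as follows; the per-modality feature matrices below are randomly generated stand-ins for the actual Dataset I features, and all names and dimensionalities are illustrative.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative per-modality features for the same set of micro-videos.
n_videos = 200
X_textual = np.random.rand(n_videos, 20)
X_visual = np.random.rand(n_videos, 50)
X_acoustic = np.random.rand(n_videos, 30)
X_social = np.random.rand(n_videos, 10)
popularity = np.random.rand(n_videos)

# Early fusion: concatenate the four modalities into one joint feature vector.
X_joint = np.hstack([X_textual, X_visual, X_acoustic, X_social])

model = make_pipeline(StandardScaler(), SVR())
model.fit(X_joint, popularity)
predicted = model.predict(X_joint)
```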
Table 3.1 shows the performance comparison among the different models. From this table, we have the following observations. (1) TMALL outperforms Early_Fusion and Late_Fusion. Regarding Early_Fusion, features extracted from various sources may not fall into the same semantic space; simply appending all features brings in a certain amount of noise and ambiguity. Besides, Early_Fusion may lead to the curse of dimensionality, since the final feature vector can be of very high dimension. For Late_Fusion, the fused result might not be accurate for two reasons. First, a single modality might not be sufficiently descriptive to represent the complex semantics of the videos; the separate results would thus be suboptimal, and their integration may not yield the desired outcome. Second, it is labor-intensive to tune the fusion weights for the different modalities, and even worse, the optimal weights for one application cannot be directly applied to another. (2) TMALL achieves better performance than regMVMT and MSNL. This can be explained by the fact that linking different modalities via a unified latent space works better than imposing a disagreement penalty directly over the original spaces. (3) The less satisfactory performance of MvDA indicates that it is necessary to explore the consistency among different modalities when building the latent space. (4) Compared with the multi-view learning baselines, namely regMVMT, MSNL, and MvDA, our model stably demonstrates its advantage. This signals that the proposed transductive model can achieve higher performance than inductive models under the same experimental settings, which can be explained by the fact that TMALL leverages the knowledge of the testing samples.
Table 3.1: Performance comparison between our proposed TMALL model and several state-of-the-art baselines on Dataset I in terms of nMSE

Methods        nMSE                P-value
Early_Fusion   59.931 ± 41.09      9.91e-04
Late_Fusion    8.461 ± 5.34        3.25e-03
regMVMT        1.058 ± 0.05        1.88e-03
MSNL           1.098 ± 0.13        1.42e-02
MvDA           0.982 ± 7.00e-03    9.91e-04
TMALL          0.979 ± 9.42e-03    –
Moreover, we performed the paired t-test between TMALL and each baseline over the 10-fold cross-validation. We found that all the p-values are much smaller than 0.05, which shows that the performance improvements of our proposed model over the baselines are statistically significant.
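Such a paired test can be carried out with SciPy's ttest_rel function, as in the sketch below; the per-fold nMSE values are illustrative placeholders rather than the actual experimental numbers.

```python
from scipy import stats

# Illustrative per-fold nMSE values over the 10-fold cross-validation.
tmall_nmse = [0.97, 0.98, 0.99, 0.97, 0.98, 0.96, 0.99, 0.98, 0.97, 0.98]
baseline_nmse = [1.05, 1.07, 1.04, 1.06, 1.08, 1.05, 1.06, 1.07, 1.04, 1.09]

t_stat, p_value = stats.ttest_rel(tmall_nmse, baseline_nmse)  # paired t-test
print(f"p-value: {p_value:.2e}")  # improvements are significant when p < 0.05
```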
On Modality Comparison
To verify the effectiveness of multi-modal integration, we also conducted experiments over different combinations of the four modalities. Table 3.2 summarizes the multi-modal analysis and the paired t-test results. It is obvious that the more modalities we incorporated, the better performance we obtained. This implies complementary rather than mutually conflicting relationships among the different modalities. Moreover, we found that removing the features of any of the four modalities leads to a decrease in performance. In a sense, this is consistent with the old saying “two heads are better than one.” Additionally, as the performance obtained from the different combinations is not the same, this validates that incorporating $\boldsymbol{\beta}$, which controls the confidence of the different modalities, is reasonable. Interestingly, we observed that the combination without the social modality obtains the worst result, which indicates that the social modality plays a pivotal role in micro-video propagation, as compared to the visual, textual, and acoustic modalities. This also validates that the features developed from the social modality are highly discriminative, even though they are low-dimensional. On the other hand, the textual modality contributes the least among all modalities, as our model without the textual modality still achieves good performance. This may be caused by the sparse textual descriptions, which are usually given in a single short sentence.
Table 3.2: Performance comparison among different modality combinations on Dataset I with respect to nMSE. We denote T, V, A, and S as the textual, visual, acoustic, and social modality, respectively.

View Combinations   nMSE                P-value
T + V + A           0.996 ± 4.20e-03    2.62e-05
T + A + S           0.982 ± 4.27e-03    2.59e-05
T + V + S           0.982 ± 4.13e-03    3.05e-04
V + A + S           0.981 ± 5.16e-03    2.16e-05
T + V + A + S       0.979 ± 9.42e-03    –
On Visual Feature Comparison
To further examine the discriminative visual features we extracted, we conducted experiments over different kinds of visual features using TMALL. We also performed a significance test to validate the advantage of combining multiple features. Table 3.3 comparatively shows the performance of TMALL under different visual feature configurations. It can be seen that the object, visual sentiment, and aesthetic features achieve similar improvements in performance, as compared to the color histogram features. This reveals that micro-videos' popularity is better reflected by their content, sentiment, and design, including what objects they contain, which
emotion they convey, and what design standards they follow. This is highly consistent with our observations and also implies that micro-videos aiming for high popularity need to be well designed, with particular attention paid to the visual content.
Table 3.3: Performance comparison among different visual features on Dataset I with respect to nMSE

Features            nMSE                P-value
Color Histogram     0.996 ± 6.88e-03    1.94e-04
Object Feature      0.994 ± 6.71e-03    2.47e-04
Visual Sentiment    0.994 ± 6.72e-03    2.49e-04
Aesthetic Feature   0.984 ± 6.95e-03    4.44e-01
ALL                 0.979 ± 9.42e-03    –
Illustrative Examples
To gain insights into the influential factors in micro-video popularity prediction, we comparatively illustrate a few representative examples in Figure 3.2. From this figure, we have the following observations. (1) Figure 3.2 shows three micro-video pairs. Each of the three pairs describes similar semantics, i.e., animals, a football game, and a sunset, respectively, but the videos were published by different users. The publishers of the videos in the top row are much more famous than those in the bottom row. We found that the popularity of the micro-videos in the second row is much lower than that of those in the first row, although there is no significant difference in their video content, which clearly justifies the importance of the social modality. (2) Figure 3.2 illustrates three micro-video pairs, where each pair of micro-videos was published by the same user. However, the micro-videos in the first row achieve much higher popularity than those in the second row, which demonstrates that the content of micro-videos also contributes to their popularity. In particular, the comparisons in Figure 3.2, from left to right, are (i) “skillful piano music” compared with “noisy dance music,” (ii) “funny animals” compared with “motionless dog,” and (iii) “beautiful flowers” compared with “gloomy sky.” These examples indicate the necessity of developing acoustic, visual sentiment, and visual aesthetic features for the task of micro-video popularity prediction. (3) Figure 3.2 shows a group of micro-videos whose textual descriptions contain either superstar names, hot hashtags, or informative descriptions. These micro-videos received a large number of loops, comments, likes, and reposts. These examples thus reflect the value of the textual modality.
Complexity Analysis
To theoretically analyze the computational cost of our proposed TMALL model, we first compute the complexity of constructing $\mathbf{H}$ and $\mathbf{g}$, as well as of inverting the matrices $\mathbf{H}$ and $\mathbf{G}$.