3.7. MULTI-MODAL TRANSDUCTIVE LOW-RANK LEARNING 51
feature fusion problem. (4) The P-value [124] is adopted to assess whether the superiority of the
TLRMVR method is statistically significant. The P-values are smaller than the significance
level of 0.05, which indicates that the null hypothesis is rejected and that the improvements
of TLRMVR are statistically significant.
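To make the significance check concrete, the following is a minimal sketch of one way to obtain a P-value from paired per-split errors. The test used in [124] is not specified here, so the sketch swaps in an exact paired sign-flip permutation test as an assumption; the function name and its arguments are hypothetical.

```python
from itertools import product


def paired_sign_flip_p_value(errors_a, errors_b):
    """Two-sided exact sign-flip permutation test on paired per-split errors.

    errors_a, errors_b: one error score (e.g., nMSE) per random split for
    each of the two methods, in the same split order.
    NOTE: this test is an assumption for illustration, not necessarily the
    procedure behind the P-values reported in the text.
    """
    if len(errors_a) != len(errors_b) or not errors_a:
        raise ValueError("both methods need one score per split")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)

    # Under the null hypothesis the sign of each paired difference is
    # arbitrary, so enumerate all 2^n sign assignments (1024 for 10 splits).
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)) / n)
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total
```

With 10 splits, the smallest two-sided P-value this test can return is 2/1024 (about 0.002), reached when one method has the lower error on every split; a returned value below 0.05 would lead to rejecting the null hypothesis of no difference.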
Feature Analysis. Under our proposed framework, we investigated the influence of different
features on micro-video popularity prediction from two perspectives: (i) a performance
comparison of different visual-level feature combinations and (ii) a performance comparison of
different view-level feature combinations.
We first selected one of the four visual features to represent the visual content of micro-videos
and integrated it with the contextual, social, and acoustic cues to conduct our experiments.
Table 3.5 reports the average results over 10 random splits in terms of nMSE and P-value. From
Table 3.5, we can observe the following results: (1) object features perform the best among the
visual features, indicating that object semantics can encode important information that makes a
micro-video popular; (2) visual sentiment has a significant influence on prediction performance,
illustrating that high-level sentiment semantics are helpful for micro-video popularity
prediction; (3) aesthetic features perform better than the color histogram, since aesthetic
features capture the highly subjective nature of human perception; (4) the color histogram
performs the worst, although it is effective in modeling the color perception of the human
visual system; and (5) the best performance is achieved when all visual features are combined,
illustrating the benefit of exploiting the complementary information offered by different
visual representations.
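The averaged nMSE scores discussed above can be computed along the following lines. This is a minimal sketch that assumes one common definition of nMSE, the squared prediction error divided by the total squared deviation of the ground truth from its mean; the definition used for Table 3.5 may differ, and the function names are hypothetical.

```python
from statistics import mean


def nmse(y_true, y_pred):
    """Normalized mean squared error for one train/test split.

    ASSUMPTION: nMSE = sum((y - y_hat)^2) / sum((y - mean(y))^2).
    A score of 1.0 matches always predicting the mean popularity;
    lower is better.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    y_bar = mean(y_true)
    squared_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    total_deviation = sum((t - y_bar) ** 2 for t in y_true)
    if total_deviation == 0:
        raise ValueError("ground truth has zero variance; nMSE is undefined")
    return squared_error / total_deviation


def average_nmse(splits):
    """Average nMSE over several random splits.

    splits: iterable of (y_true, y_pred) pairs, one pair per split.
    """
    return mean(nmse(y_true, y_pred) for y_true, y_pred in splits)
```

Under this definition, a feature combination is only useful if its averaged nMSE falls below 1.0, which gives a reference point for reading the differences between the visual features.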
Subsequently, we evaluated how various view-level feature combinations contribute to
micro-video popularity prediction under our proposed framework. For simplicity, the features
extracted from textual, visual, acoustic, and social cues are denoted as “T”, “V”, “A”, and “S”,
respectively. Table 3.6 shows the average results in terms of nMSE and P-value. From Table 3.6,
we can observe the following results: (1) consistent with existing studies, “T+V+A”, which
removes the social cues, yields the worst results, which indicates that social cues contribute
more to popularity prediction than the other types of cues; (2) the prediction performance of
“T+A+S” decreases sharply after the visual cues are removed. This result shows that the visual
cues of micro-videos are an indispensable component for improving the prediction
performance; (3) “V+A+S” yields a good result of nMSE=0.955 compared to the other
combinations, indicating that textual cues have little effect on popularity prediction. One
possible reason is that a fair number of micro-videos lack textual descriptions. Moreover, the
weak correlation between textual descriptions and micro-videos also contributes to this
effect; and (4) when all view features are combined, the best performance is achieved with a
minimum nMSE of 0.934. We can therefore conclude that the cues, sorted in descending order
of importance, are social > visual > acoustic > textual.
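The importance ordering above can be read as a leave-one-view-out comparison: the more the nMSE rises when a cue is removed, the more important that cue is. The following is a minimal sketch of that reasoning; the function name is hypothetical, and only the two nMSE values quoted in the text are used, since the remaining entries of Table 3.6 are not reproduced here.

```python
def rank_cues_by_ablation(full_nmse, ablated_nmse):
    """Rank cues by how much the nMSE rises when each one is removed.

    full_nmse: nMSE with all view features combined.
    ablated_nmse: maps the removed cue to the nMSE of the remaining views.
    Returns (cue, nMSE increase) pairs, largest increase first.
    """
    increases = {cue: score - full_nmse for cue, score in ablated_nmse.items()}
    return sorted(increases.items(), key=lambda item: item[1], reverse=True)


# Values quoted in the text: all views combined reach nMSE = 0.934, and
# "V+A+S" (textual cues removed) reaches nMSE = 0.955. The results for
# "T+V+A", "T+A+S", and the remaining combination would be added from
# Table 3.6 in the same way.
full = 0.934
ablated = {"textual": 0.955}

for cue, increase in rank_cues_by_ablation(full, ablated):
    print(f"removing {cue} cues raises nMSE by {increase:.3f}")
```

Removing the textual cues costs only about 0.021 in nMSE, which is consistent with textual cues being placed last in the ordering.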