Acoustic modality usually works as an important complement to visual modality in many video-
related tasks, such as video classification [175]. In fact, audio channels embedded in the micro-
videos may also contribute to the popularity of micro-videos to a large extent. For example, the
audio channel may indicate the quality of a given micro-video and convey rich background infor-
mation about the emotion as well as the scene contained in the micro-video, which significantly
affects the popularity of a micro-video. The acoustic information is especially useful for the cases
where the visual features cannot carry enough information. Therefore, we adopted the following
widely used acoustic features: mel-frequency cepstral coefficients (MFCC) [88] and
Audio-Six (i.e., Energy Entropy, Signal Energy, Zero Crossing Rate, Spectral Rolloff, Spectral
Centroid, and Spectral Flux [171]). These features are frequently used in different audio-related
tasks, such as emotion detection and music recognition. We finally obtained a 36-d acoustic fea-
ture vector for each micro-video.
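
To make some of the Audio-Six descriptors concrete, the following minimal sketch computes four of them (Signal Energy, Zero Crossing Rate, Energy Entropy, and Spectral Centroid) for a single audio frame. It uses common textbook definitions, which are not necessarily the exact formulations of [171]. The function names, the number of sub-blocks, and the test tone are illustrative assumptions, and MFCC, Spectral Rolloff, and Spectral Flux are left out for brevity.

```python
import cmath
import math


def signal_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def energy_entropy(frame, n_blocks=4):
    """Shannon entropy (in bits) of the energy spread over sub-blocks.

    n_blocks is an assumed parameter, not a value from the text.
    """
    size = len(frame) // n_blocks
    energies = [
        sum(x * x for x in frame[i * size:(i + 1) * size])
        for i in range(n_blocks)
    ]
    total = sum(energies)
    if total == 0:
        return 0.0
    probs = [e / total for e in energies]
    return -sum(p * math.log2(p) for p in probs if p > 0)


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2, via a naive O(N^2) DFT."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of the frame, in Hz."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0
    n = len(frame)
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total


# Illustrative input: a 1000 Hz tone sampled at 8000 Hz (64 samples).
tone = [math.sin(2 * math.pi * 1000 * t / 8000 + math.pi / 8)
        for t in range(64)]
frame_features = [
    signal_energy(tone),
    zero_crossing_rate(tone),
    energy_entropy(tone),
    spectral_centroid(tone, 8000),
]
```

For this pure tone, the spectral centroid comes out close to 1000 Hz and the energy is spread evenly over the sub-blocks. In practice, such per-frame values are typically aggregated over all frames of a clip (for example, by averaging) to obtain a fixed-length descriptor. The aggregation behind the 36-d vector is not specified here.
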
Micro-videos are usually associated with textual modality in the form of descriptions, such as
“when Leo finally gets the Oscar” and “Puppy dog dreams,” which may precisely summarize the
micro-videos. Such summarization may depict the topics and sentiment information regard-
ing the micro-videos, which has been proven to be of significance in online article popularity
prediction [9].
Sentence2Vector We found that the popular micro-videos are sometimes related to the topics
of the textual descriptions. This observation motivated us to conduct content analysis over the
textual descriptions of micro-videos. Considering the short length of the descriptions, we
performed the content analysis with the state-of-the-art textual feature extraction tool Sentence2Vector,
which was developed on the basis of the word embedding algorithm Word2Vector [115]. In this
way, we extracted 100-d features for video descriptions.
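
The sketch below illustrates how a variable-length description can be mapped to a fixed-length vector. It is not the Sentence2Vector tool itself. It swaps in a much simpler baseline, averaging pre-trained word vectors, purely for illustration. The function name `average_embedding` and the 3-d toy vectors are hypothetical, whereas the features described above are 100-d.

```python
def average_embedding(description, word_vectors, dim):
    """Average the vectors of known words; return zeros if none are known."""
    tokens = description.lower().split()
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Hypothetical 3-d word vectors, made up for this example only.
toy_vectors = {
    "puppy": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "dreams": [0.0, 0.0, 1.0],
}

vec = average_embedding("Puppy dog dreams", toy_vectors, dim=3)
```

Whatever the description length, the output has `dim` entries, which is what allows the textual features to be combined with the other modalities.
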
Textual Sentiment We also analyzed the sentiment of the text, which has been proven to play
an important role in popularity prediction [8]. With the help of the Sentiment Analysis tool in
the Stanford CoreNLP tools, we assigned each micro-video a sentiment score ranging from 0 to 4,
where the values correspond to very negative, negative, neutral, positive, and very positive, respectively.
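
The correspondence between scores and sentiment classes can be written as a small lookup, sketched below. The score is assumed to have already been produced by the sentiment tool, and the helper name `sentiment_label` is hypothetical.

```python
SENTIMENT_LABELS = (
    "very negative",  # 0
    "negative",       # 1
    "neutral",        # 2
    "positive",       # 3
    "very positive",  # 4
)


def sentiment_label(score):
    """Map an integer sentiment score in 0..4 to its class name."""
    if not isinstance(score, int) or not 0 <= score <= 4:
        raise ValueError("sentiment score must be an integer from 0 to 4")
    return SENTIMENT_LABELS[score]
```
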
