of the micro-video, due to the following reasons: (1) the semantic ontology of micro-videos
is massive and its hierarchical structure is complicated, which makes it a great challenge to
automatically construct a hierarchical ontology suited to micro-video classification; (2) a
micro-video is composed of multiple modalities, such as the visual, acoustic, and textual
modalities. How to adaptively fuse these modalities for multi-level hierarchical semantic
ontology classification poses another challenge; and (3) although there are many public
datasets for the semantic ontology classification of micro-videos, a large-scale benchmark
dataset for hierarchical ontology classification is still lacking. Accordingly, how to construct
a large-scale benchmark dataset to facilitate the implementation and evaluation of the proposed
research problem constitutes a tough challenge.
Toward this end, we will address the aforementioned challenges from the following three
directions: (1) we plan to construct a structured hierarchical ontology from existing taxonomy
knowledge adapted to micro-videos; (2) we will introduce a modality-based and hierarchy-based
attention mechanism into the hierarchical semantic ontology classification of micro-videos;
and (3) we plan to construct a large-scale dataset from online micro-video communities to
facilitate the study of the proposed research problem.
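As a rough illustration of direction (2), the sketch below shows one way attention weights over modalities could be computed separately for each level of the hierarchy. It is a minimal sketch under stated assumptions, not the mechanism the authors propose: the function name fuse_modalities, the scaled dot-product scoring, and the per-level query vector are all hypothetical, and in practice the query would be learned rather than set by hand.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def fuse_modalities(features, query):
    """Attention-weighted fusion of per-modality feature vectors.

    features: dict mapping a modality name to a 1-D array of dimension d
              (assumed to be already projected into a shared space).
    query:    1-D array of dimension d; a hypothetical vector standing in
              for one level of the hierarchy (one query per level).
    Returns the fused vector and the attention weight of each modality.
    """
    names = sorted(features)
    stacked = np.stack([features[n] for n in names])       # shape (M, d)
    scores = stacked @ query / np.sqrt(stacked.shape[1])   # scaled dot product
    weights = softmax(scores)                              # sums to 1 over modalities
    fused = weights @ stacked                              # shape (d,)
    return fused, dict(zip(names, weights))
```

Calling fuse_modalities once per hierarchy level, each time with a different query, would let a coarse level and a fine level weight the visual, acoustic, and textual features differently.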
Though the exponentially growing number of micro-videos has brought prosperity to the new
industry of micro-video platforms, it has also brought the issue that anchorwomen propagate
pornographic content through these platforms in order to attract attention and earn more money.
Pornographic content pollutes the Internet, as juveniles are easily exposed to such
unwholesome content, which poses a threat to their physical and mental health. However,
different from traditional porn videos, the anchorwomen on micro-video platforms usually try
to fool the platform by teasing the audience with their voices while keeping their on-screen
activities normal. As a result, traditional methods for filtering out inappropriate materials
are inadequate for micro-video platforms. In this sense, further research on automatically
detecting and isolating porn content on micro-video platforms is of immediate
However, identifying pornographic videos among various micro-videos is non-trivial due to
the following challenges. (1) For micro-videos, identifying the voices of these anchorwomen
is a pivotal issue. However, due to the complexity of social platforms, the cocktail party
effect in complicated acoustic environments poses a challenge for automatic identification.
To be specific, humans can recognize what other people are saying in a noisy crowd at a
cocktail party because they automatically filter out unrelated sounds, whereas for a machine
it is hard to separate related from unrelated information. (2) The growing body of social
slang, such as abbreviations and Internet buzzwords, increases the difficulty of
identification. For example, "XSWL" stands for "laughing my ass off," while "Europe" denotes
someone who gets what he wants; such terms are challenging even for humans. (3) It is hard to
distinguish sexy voices from normal voices based on the audio alone. The modalities of
micro-videos consist of video, text, and audio. Therefore, how to jointly utilize the
different modalities for analysis is a considerable problem.
In the future, we plan to tackle this problem from the following three aspects. First, we plan
to simulate cocktail parties, namely, by mixing artificial noise with the audio of the videos.
Based on this, we will model the identification of individual voices and the assignment of a
voice track to each source. Second, we will introduce voiceprint recognition for the selected
audio to transform it into text descriptions and to separate out the audio clips that carry no
semantics. Based on the above efforts, we plan to model text anti-spam over the text
descriptions to identify pornographic words, and pure voice classification over the audio
without semantics to decide whether it is erotic. Third, we will introduce multi-layer
attention mechanisms into the cross-modal learning of video, audio, and text for adaptively
recognizing the sensitive and pornographic content of micro-videos.
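The first step above, simulating a cocktail party by mixing artificial noise with a video's audio track, might be sketched as follows. This is only an illustrative sketch: the text does not say how the mixture is built, so controlling it through a target signal-to-noise ratio (the snr_db parameter) and the helper name mix_at_snr are assumptions introduced here.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the speech-to-noise power ratio is snr_db.

    speech, noise: 1-D arrays of audio samples at the same sampling rate.
    snr_db:        target signal-to-noise ratio in decibels (an assumed
                   control knob; it is not specified in the text).
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)

    # Repeat the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[:len(speech)]

    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    if p_noise == 0:
        raise ValueError("noise signal has zero power")

    # Scale the noise so that p_speech / (scale**2 * p_noise) = 10**(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Sweeping snr_db over a range of values would yield mixtures of graded difficulty, and because the clean track is known, it can serve as the reference when evaluating voice identification on the mixture.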
