to that of each single modality. In this transductive framework, we integrate low-rank
constraints to alleviate the information sparseness and low-quality problems. Because the
formulated objective function is non-smooth and hard to solve, we design an effective algorithm
based on the augmented Lagrange multiplier method to optimize it and ensure fast convergence.
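To make the augmented Lagrange multiplier idea concrete, the following is a minimal sketch on a toy non-smooth problem, not the objective formulated in this book. It minimizes |x| + 0.5 (z - a)^2 subject to x = z by alternating closed-form updates of x and z with a multiplier update. The function names, the penalty parameter `mu`, and the iteration count are illustrative assumptions. For low-rank (nuclear-norm) terms, the scalar soft-thresholding step is commonly replaced by soft-thresholding of the singular values.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * |x|: shrink v toward zero by tau."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def alm_toy(a, mu=1.0, iters=200):
    """Minimize |x| + 0.5 * (z - a)**2 subject to x == z (toy problem).

    Augmented Lagrangian:
        |x| + 0.5*(z - a)**2 + lam*(x - z) + 0.5*mu*(x - z)**2
    """
    x = z = lam = 0.0
    for _ in range(iters):
        # x-step: the non-smooth term is handled in closed form
        x = soft_threshold(z - lam / mu, 1.0 / mu)
        # z-step: smooth quadratic, obtained by setting the derivative to zero
        z = (a + lam + mu * x) / (1.0 + mu)
        # multiplier step: ascend on the constraint violation x - z
        lam += mu * (x - z)
    return x, z, lam
```

For a = 3, the iterates approach x = z = 2, which agrees with the closed-form solution `soft_threshold(3, 1)` of the unconstrained toy problem.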
For venue category estimation of micro-videos, we focus on characterizing and
modeling the correlations between modalities, especially the consistent and complementary
relations. The consistent part strengthens confidence, whereas the complementary part
supplies additional exclusive information. We argue that explicitly parsing these two kinds
of correlations and treating them separately within a unified model can make the representations
of multimodal samples more discriminative. Toward this goal, we devise a series of cooperative
learning models, which split the consistent information from the complementary information and
leverage both to strengthen the expressiveness of each modality. In addition, we regularize the
hierarchical structure of micro-videos via the tree-guided group lasso, which characterizes the
inter- and intra-relatedness among venue categories. Meanwhile, we integrate a dictionary
learning component to learn the concept-level sparse representation of micro-videos, whereby the
problem of information sparseness can be alleviated.
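As a rough illustration of the tree-guided group lasso mentioned above, the sketch below evaluates the penalty as a sum, over the nodes of a category tree, of a node weight times the L2 norm of the model weights that fall under that node. The venue hierarchy, the category names, and the uniform node weights are hypothetical and only serve the example; the actual hierarchy and weighting used in this book may differ.

```python
import math

# Hypothetical venue hierarchy: every node maps to the leaf categories below it.
GROUPS = {
    "outdoors": ["park", "beach"],
    "food": ["cafe", "bar"],
    "park": ["park"],
    "beach": ["beach"],
    "cafe": ["cafe"],
    "bar": ["bar"],
}


def tree_group_lasso(weights, groups, node_weights=None):
    """Sum over tree nodes of node_weight * L2 norm of the weights in that node's group."""
    total = 0.0
    for node, leaves in groups.items():
        # Node weights default to 1.0 (an assumption made for this sketch).
        tau = 1.0 if node_weights is None else node_weights.get(node, 1.0)
        norm = math.sqrt(sum(weights[leaf] ** 2 for leaf in leaves))
        total += tau * norm
    return total
```

With weights of 3 and 4 on `park` and `beach` and zeros elsewhere, the penalty is 5 for the `outdoors` group plus 3 and 4 for the two leaves, giving 12. Groups at internal nodes tie together the categories that share a parent, while leaf groups act on each category individually.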
We also explore methods to alleviate the low-quality problem in the venue category
estimation task, especially by compensating for the acoustic modality. Toward this goal, we develop
a deep multimodal transfer learning scheme, which transfers knowledge from external
high-quality sound clips to strengthen the description of the internal acoustic modality in
micro-videos. This is accomplished by enforcing the external sound clips and the internal acoustic
modality to share the same acoustic concept-level space. Notably, this scheme is also applicable to
enhancing the representations of other modalities, such as the visual and textual ones.
In order to intelligently route micro-videos to the target users, we present a multimodal
sequential learning scheme. To capture users’ dynamic and diverse interests, we encode their
complicated historical interaction sequences into a temporal graph and then design a novel
temporal graph-based long short-term memory (LSTM) network to model it. Afterwards, we
estimate the click probability by calculating the similarity between the users’ interest representation
and the embedding of the given micro-video. Considering that users’ interest is multi-level, we
introduce a user matrix to enhance the user interest modeling by incorporating their “like” and
“follow” information. At this step, we obtain a second click probability with respect to users’
more precise interest information. Analogously, since we know the sequence of users’ disliked
micro-videos, we build another temporal graph-based LSTM to characterize users’ disinterest
and estimate a third click probability based on true negative samples. Finally, the
weighted sum of the above three probability scores is taken as our final prediction result.
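The final fusion step can be sketched as follows. This is a minimal illustration under stated assumptions, not the model described in this book: the similarity is taken to be a sigmoid of an inner product, the three user vectors stand in for the outputs of the two temporal graph-based LSTMs and the user matrix, and the weights (including the sign applied to the disinterest score) are hypothetical placeholders.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def click_probability(user_vec, video_vec):
    """Assumed similarity: sigmoid of the inner product, mapped to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-dot(user_vec, video_vec)))


def final_score(interest_vec, precise_vec, disinterest_vec, video_vec, weights):
    """Weighted sum of the three click probabilities (weights are placeholders)."""
    scores = (
        click_probability(interest_vec, video_vec),     # dynamic interest
        click_probability(precise_vec, video_vec),      # "like"/"follow" interest
        click_probability(disinterest_vec, video_vec),  # disinterest
    )
    return sum(w * s for w, s in zip(weights, scores))
```

In practice the three weights would be tuned or learned on validation data; here they are simply passed in by the caller.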