to the hybrid models, they aim to combine the above two methods within a unified frame-
work. For example, the recommendation model presented in [201] generates multiple ranking
lists by exploring different information sources in a multi-task framework. Because traditional
video recommendation models assume that users’ interest is static, they cannot be applied to
extract users’ dynamic interest.
Recently, many models have been proposed to characterize users’ dynamic preferences.
These methods fall into three categories: CNN-based methods [152, 157], recurrent neural
network (RNN)-based methods [52, 132], and self-attention-based methods [28, 203]. As a typical
example of the first category, Tuan et al. [157] utilized 3-D CNNs to combine session clicks
and content features to generate recommendations. As for RNN-based methods, Quadrana et
al. [132] proposed an RNN-based approach for session-based recommendation, which relays
and evolves the latent hidden states of the RNNs across user sessions. In [52], the authors
proposed a dynamic RNN to model users’ dynamic interest for personalized video recommendation.
Because such recurrent models are time-consuming and restricted in the sequence length they can
handle, the self-attention mechanism has been applied to recommender systems and has achieved
impressive performance. For example, Zhou et al. [203] proposed an attention-based user behavior
model that considers heterogeneous user behaviors in e-commerce. Although the aforementioned
methods have considered users’ dynamic interest and have been successfully applied to video
communities, they are inadequate for micro-video communities, whose characteristics differ.
In particular, micro-video communities continuously route micro-videos to users, and users
click the ones that interest them after previewing the thumbnails, whereas traditional video
communities tend to display videos of interest based on users’ query information. In addition,
users’ interest information in micro-video communities has a multi-level structure.
6.4 MULTIMODAL SEQUENTIAL LEARNING
To address the aforementioned problems, in this chapter, we develop an end-to-end temporAL
graPh-guIded recommeNdation systEm, dubbed ALPINE, to route micro-videos. The scheme
of our proposed approach is illustrated in Figure 6.2. Specifically, to model users’ diverse and
dynamic interest, we encode users’ click history information into a graph in which each node
represents a micro-video in the click history and each edge between two nodes represents their
temporal relationship. Based upon this graph, we design a novel long short-term memory (LSTM)
network to learn users’ interest representation. Afterward, we estimate the click probability via
calculating the similarity between the users’ interest representation and the embedding of the
given micro-video. Considering that users’ interest is multi-level, we introduce a user matrix to
enhance the user interest modeling by incorporating their “like” and “follow” information. At
this step, we also obtain a click probability with respect to users’ more precise interest
information. Analogously, since we know the sequence of users’ disliked micro-videos, another
temporal graph-based LSTM is built to characterize users’ uninterested information, so that a
third click probability, regarding users’ uninterested information, can be estimated based on
true negative samples. Finally, the weighted sum of the above three probability scores is set
as our final prediction result.

Figure 6.2: Illustration of our proposed ALPINE model.
(Labels in the figure: Interested Feature Sequence, Uninterested Feature Sequence, Temporal Graph
LSTM (shown twice), Enhanced Interest Representation, Multi-level Interest, Item Embedding,
Prediction Layer, and the output ŷ.)
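The final fusion step described above can be sketched in a few lines of Python. This is only a
minimal illustration under stated assumptions, not the ALPINE implementation: the text says only
that a click probability is obtained by calculating a similarity and that the three scores are
combined by a weighted sum. The sigmoid of an inner product used as the similarity, the function
names click_probability and fuse_scores, the example vectors, the placeholder value for the
uninterested branch, and the weights are all hypothetical.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Map a real-valued score to (0, 1) in a numerically stable way."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def click_probability(interest: Sequence[float], item: Sequence[float]) -> float:
    """ASSUMPTION: similarity is the sigmoid of the inner product between the
    user's interest representation and the micro-video embedding."""
    if len(interest) != len(item):
        raise ValueError("interest and item vectors must have the same length")
    return sigmoid(sum(a * b for a, b in zip(interest, item)))


def fuse_scores(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of probability scores.

    ASSUMPTION: weights are non-negative and sum to 1, so the fused value
    stays in [0, 1] whenever every input score does.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("weights must sum to 1")
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical toy vectors standing in for learned representations.
interest_repr = [0.2, -0.1, 0.4]   # from the "click" branch
enhanced_repr = [0.3, 0.0, 0.5]    # after adding "like"/"follow" information
item_embedding = [1.0, 0.5, 0.5]   # embedding of the candidate micro-video

p_interest = click_probability(interest_repr, item_embedding)
p_enhanced = click_probability(enhanced_repr, item_embedding)
# Placeholder only: the text does not state how the score of the
# uninterested branch is computed, so a fixed value is used here.
p_uninterested = 0.4

# Hypothetical weights; in practice they would be tuned or learned.
y_hat = fuse_scores([p_interest, p_enhanced, p_uninterested], [0.5, 0.3, 0.2])
print(f"fused click probability: {y_hat:.4f}")
```

Constraining the weights to be non-negative and to sum to one is a design choice made for this
sketch so that the fused score remains a valid probability; the text does not specify how the
weights are chosen.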
Let $v$ and $u$ denote a micro-video and a user, respectively. We present the user’s historical
information as a sequence of micro-videos $U = \{(u, v_j^{t})\}_{t=1}^{m}$, where
$j \in \{c, n, l, f\}$, respectively, represents the user’s “click,” “not click,” “like,” and
“follow” behaviors, and $m$ is the length of the sequence. As the user’s interest is multi-level,
the sequential behaviors can be segmented into four sub-sequences, namely the “click” sequence
$U_c = \{(u, v_c^{t_c})\}_{t_c=1}^{m_c}$, the “not click” sequence
$U_n = \{(u, v_n^{t_n})\}_{t_n=1}^{m_n}$, the “like” sequence
$U_l = \{(u, v_l^{t_l})\}_{t_l=1}^{m_l}$, and the “follow” sequence
$U_f = \{(u, v_f^{t_f})\}_{t_f=1}^{m_f}$, where $m_c + m_n + m_l + m_f = m$. As such, the
micro-video recommendation problem can be formally defined as: