6.4. MULTIMODAL SEQUENTIAL LEARNING 131
where $x_t$ is the micro-video embedding at time step $t$, $h_{t-1}$ and $c_{t-1}$ are, respectively, the hidden state and memory cell at time step $t-1$, linked by the edge $\langle v_{t_c-1}^{c}, v_{t_c}^{c} \rangle$, and $h'$ and $c'$ are the hidden state and memory cell at time step $t'$, linked by the edge $\langle v_{t'}^{c'}, v_{t_c}^{c} \rangle$. Therefore, our temporal graph-based LSTM network can simultaneously leverage the user's neighbor and cross-time interest context information to enhance the memorization of diverse interests and further strengthen the interest representation. We can then obtain the user's interested feature sequence $F_{in} = [h_{in,1}, h_{in,2}, \ldots, h_{in,m_c}] \in \mathbb{R}^{d_c \times m_c}$, where $d_c$ is the dimension of each hidden state in $F_{in}$.
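Setting the graph-specific edges aside, the bookkeeping of this layer can be sketched with a plain LSTM cell: each micro-video embedding in turn updates the hidden state and memory cell, and the stacked hidden states form $F_{in} \in \mathbb{R}^{d_c \times m_c}$. Note this is a generic LSTM, not the temporal graph-based variant described above (which additionally merges cross-time neighbor states), and all dimensions are illustrative.

```python
import numpy as np

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One standard LSTM step; the four gates are stacked as [i, f, o, g]."""
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = 1.0 / (1.0 + np.exp(-z[:d]))        # input gate
    f = 1.0 / (1.0 + np.exp(-z[d:2*d]))     # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2*d:3*d]))   # output gate
    g = np.tanh(z[3*d:])                    # candidate memory
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_in, d_c, m_c = 8, 4, 5   # embedding dim, hidden dim, sequence length (illustrative)
W = rng.normal(scale=0.1, size=(4 * d_c, d_in))
U = rng.normal(scale=0.1, size=(4 * d_c, d_c))
b = np.zeros(4 * d_c)

xs = [rng.normal(size=d_in) for _ in range(m_c)]  # micro-video embeddings x_1..x_{m_c}
h, c = np.zeros(d_c), np.zeros(d_c)
hidden_states = []
for x in xs:
    h, c = lstm_cell(x, h, c, W, U, b)
    hidden_states.append(h)

# Stack the hidden states column-wise: F_in has shape (d_c, m_c).
F_in = np.stack(hidden_states, axis=1)
print(F_in.shape)  # (4, 5)
```

The uninterested sequence $F_{un}$ below is produced the same way by a second layer of the same type, with its own parameters.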
As the user’s uninterested points are also dynamic and diverse, we build another tempo-
ral graph-based LSTM layer to model the user’s U
n
sequence and then obtain the uninterested
feature sequence of the user, i.e., F
un
D Œh
un;1
; h
un;2
; : : : ; h
un;m
n
2 R
d
n
m
n
, where d
n
is the di-
mension of each hidden state in F
un
.
6.4.2 THE MULTI-LEVEL INTEREST MODELING LAYER
Since there are multiple kinds of interactions between a user and a micro-video, and they reflect different degrees of the user's interest, we propose a multi-level interest modeling layer to further obtain an enhanced interest representation. As the "like" and "follow" behaviors indicate stronger interest than the "click" behavior, we utilize the "like" and "follow" information to enhance the interest representation. Particularly, for the user $u$, we set the weighted sum of the micro-video representations in $U_l$ and $U_f$ as the user's enhanced interest feature $f_{en}$, formulated as
$$f_{en} = w_l \sum_{t_l=1}^{m_l} x_{t_l}^{l} + w_f \sum_{t_f=1}^{m_f} x_{t_f}^{f}, \qquad (6.3)$$
where $x_{t_l}^{l}$ is the embedding of micro-video $v_{t_l}^{l}$ in $U_l$, $x_{t_f}^{f}$ is the embedding of micro-video $v_{t_f}^{f}$ in $U_f$, and $w_l$ and $w_f$ are the hyperparameters controlling the weights between "like" and "follow."
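Equation (6.3) is simply a weighted sum of the embeddings from the two behavior sequences. A minimal sketch, with illustrative dimensions and weight values:

```python
import numpy as np

rng = np.random.default_rng(1)
m_l, m_f, D = 3, 2, 4             # sizes of U_l and U_f, embedding dim (illustrative)
X_l = rng.normal(size=(m_l, D))   # embeddings x^l of micro-videos the user "liked"
X_f = rng.normal(size=(m_f, D))   # embeddings x^f of micro-videos the user "followed"
w_l, w_f = 0.6, 0.4               # hyperparameters weighting "like" vs. "follow"

# Eq. (6.3): f_en = w_l * sum_t x^l_t + w_f * sum_t x^f_t
f_en = w_l * X_l.sum(axis=0) + w_f * X_f.sum(axis=0)
```

Since $w_l$ and $w_f$ are hyperparameters rather than learned weights, they would typically be tuned on a validation set.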
With the enhanced interest representation $f_{en}$, we can construct an embedding matrix $U \in \mathbb{R}^{N \times D}$, i.e., the user matrix, where $N$ and $D$, respectively, denote the number of users and the dimension of the enhanced interest representations. As the user's "like" and "follow" information more precisely indicates the user's interest, we can obtain more accurate interest representations using the user matrix. The user matrix $U$ is updated in the training phase. Moreover, for each user, we utilize an embedding lookup strategy to retrieve the user's enhanced interest representation from the matrix $U$ during the training and testing phases.
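The embedding lookup amounts to gathering rows of the user matrix by user id. A small sketch with hypothetical sizes (in a real model $U$ would be a trainable parameter, e.g., an `nn.Embedding` in PyTorch, updated by backpropagation):

```python
import numpy as np

N, D = 6, 4                       # number of users, representation dim (illustrative)
rng = np.random.default_rng(2)
U_mat = rng.normal(size=(N, D))   # user matrix: row u holds user u's f_en

def lookup(user_ids):
    """Embedding lookup: gather the enhanced interest rows for a batch of users."""
    return U_mat[np.asarray(user_ids)]

batch = lookup([0, 3, 3])         # same id twice -> same row twice
print(batch.shape)  # (3, 4)
```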
6.4.3 THE PREDICTION LAYER
Standing on the shoulders of the user's interested feature sequence $F_{in}$, uninterested feature sequence $F_{un}$, and enhanced interest representation $f_{en}$, we place a prediction layer to obtain the click probability for the given micro-video $v_{new}$, as shown in Figure 6.4. Specifically, we first feed $F_{in}$ and the embedding of the given micro-video, $x_{new}$, into a vanilla attention layer to obtain the