where $W_1 \in \mathbb{R}^{d'_c \times (d_c + D)}$ and $W_2 \in \mathbb{R}^{1 \times d'_c}$ denote the weight matrices, $b_1 \in \mathbb{R}^{d'_c}$ and $b_2$, respectively, denote the bias vector and the bias value, and $\delta$ denotes the ReLU activation function. $\hat{y}_{in}$ is the click probability calculated from the improved interested representation $f_{in}$.
Similarly, we can obtain the improved uninterested representation $f_{un}$ based on $F_{un}$ and $x_{new}$ using another vanilla attention layer. Afterward, we feed the concatenation of the improved uninterested representation $f_{un}$ and the new micro-video embedding $x_{new}$ into two MLP layers, and obtain the click probability based on the improved uninterested representation, i.e., $\hat{y}_{un}$. Analogously, the click probability based on the enhanced interest representation, i.e., $\hat{y}_{en}$, can be obtained by feeding the concatenation of the enhanced interest representation $f_{en}$ and the new micro-video embedding $x_{new}$ into two MLP layers.
Finally, the weighted sum of the above three probability values is set as our prediction result,
$$\hat{y} = \alpha_1 \hat{y}_{in} + \alpha_2 \hat{y}_{un} + \alpha_3 \hat{y}_{en}, \tag{6.7}$$
where $\alpha_1$, $\alpha_2$, and $\alpha_3$ are the hyper-parameters controlling the weights of $\hat{y}_{in}$, $\hat{y}_{un}$, and $\hat{y}_{en}$, respectively, and $\hat{y}$ is the final output of our model, denoting the click probability of the given user on the given new micro-video.
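A one-line sketch of the fusion in Eq. (6.7); the concrete $\alpha$ values below are placeholders, since the chapter leaves them as tunable hyper-parameters, and the random tensors stand in for the outputs of the three heads above.

```python
import torch

# Placeholder weights; alpha_1..alpha_3 are tuned hyper-parameters in the chapter.
alpha_1, alpha_2, alpha_3 = 0.4, 0.3, 0.3

# y_in, y_un, y_en would come from the three MLP heads; random stand-ins here.
y_in, y_un, y_en = torch.rand(8), torch.rand(8), torch.rand(8)

y_hat = alpha_1 * y_in + alpha_2 * y_un + alpha_3 * y_en  # Eq. (6.7)
```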
Our method is trained as an end-to-end deep learning model equipped with the sigmoid cross-entropy loss:
$$\mathcal{L}(\hat{y}) = -\big( y \log\left(\sigma(\hat{y})\right) + (1-y)\log\left(1-\sigma(\hat{y})\right) \big), \tag{6.8}$$
where $\sigma$ denotes the sigmoid activation function and $y \in \{0, 1\}$ is the ground truth indicating whether the user clicked the micro-video or not. Besides, the back-propagation through time (BPTT) method is adopted to train our ALPINE model.
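Equation (6.8) is the standard binary cross-entropy with the sigmoid folded in. In PyTorch it corresponds to `binary_cross_entropy_with_logits`, which fuses the sigmoid and the log terms for numerical stability; a minimal sketch (the function name is ours):

```python
import torch
import torch.nn.functional as F

def sigmoid_cross_entropy(y_hat: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Eq. (6.8): -(y*log(sigmoid(y_hat)) + (1-y)*log(1-sigmoid(y_hat))), averaged."""
    return F.binary_cross_entropy_with_logits(y_hat, y.float())
```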
6.5 EXPERIMENTS
6.5.1 EXPERIMENTAL SETTINGS
Implementation Details. For Dataset III-1, we utilized the 64-d visual embedding to represent each micro-video. For Dataset III-2, the concatenation of the 64-d category embedding and the 64-d visual embedding is set as the micro-video embedding. The length of each user's historical sequence is set to 300: if it exceeds 300, we truncated it; otherwise, we padded it to 300 and masked the padding in the network. We optimized the parameters using Adam with an initial learning rate of 0.001 and a batch size of 2,048.
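A minimal sketch of the sequence preprocessing described above. The function name and the padding id are ours, and since the chapter does not say which end of an over-long history is cut, keeping the most recent items is an assumption.

```python
import torch

MAX_LEN = 300  # history length used in the chapter

def pad_or_truncate(history: list, max_len: int = MAX_LEN):
    """Return (ids, mask): ids padded/truncated to max_len; mask is 1 on real items."""
    history = history[-max_len:]  # assumption: keep the most recent clicks
    n_pad = max_len - len(history)
    ids = torch.tensor(list(history) + [0] * n_pad, dtype=torch.long)  # 0 = assumed pad id
    mask = torch.tensor([1] * len(history) + [0] * n_pad, dtype=torch.bool)
    return ids, mask
```

The stated optimizer settings would then correspond to, e.g., `torch.optim.Adam(model.parameters(), lr=0.001)` with batches of 2,048 such sequences.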
6.5.2 BASELINES
To demonstrate the effectiveness of our proposed ALPINE model, we compared it with the
following state-of-the-art methods.