Taking advantage of the back-propagation strategy, we first calculate $\partial\mathcal{L}_{bpr}/\partial\mathbf{W}_K^{tv}$, $\partial\mathcal{L}_{mod}/\partial\mathbf{W}_K^{tv}$, and $\partial\mathcal{L}_{rec}/\partial\hat{\mathbf{W}}_K^{tv}$ as follows:
\[
\begin{cases}
\dfrac{\partial\mathcal{L}_{bpr}}{\partial\mathbf{W}_K^{tv}} = -\sigma(-m_{ijk})\,\dfrac{\partial(\tilde{\mathbf{v}}_i^t)}{\partial\mathbf{W}_K^{tv}}\,(\tilde{\mathbf{v}}_j^b - \tilde{\mathbf{v}}_k^b),\\[2ex]
\dfrac{\partial\mathcal{L}_{mod}}{\partial\mathbf{W}_K^{tv}} = -\sigma(-z_i^t)\,\dfrac{\partial(\tilde{\mathbf{v}}_i^t)}{\partial\mathbf{W}_K^{tv}}\,\tilde{\mathbf{c}}_i^t,\\[2ex]
\dfrac{\partial\mathcal{L}_{rec}}{\partial\hat{\mathbf{W}}_K^{tv}} = (\hat{\mathbf{v}}_i^t - \mathbf{v}_i^t)\,\dfrac{\partial(\hat{\mathbf{v}}_i^t)}{\partial\hat{\mathbf{W}}_K^{tv}}.
\end{cases}
\tag{3.9}
\]
As $\partial(\tilde{\mathbf{v}}_i^t)/\partial\mathbf{W}_K^{tv}$ and $\partial(\hat{\mathbf{v}}_i^t)/\partial\hat{\mathbf{W}}_K^{tv}$ can be derived from $\hat{\mathbf{v}}_i^t = s(\hat{\mathbf{W}}_K^{tv}\hat{\mathbf{h}}_{K-1}^{tv} + \hat{\mathbf{b}}_K^{tv})$ and $\tilde{\mathbf{v}}_i^t = s(\mathbf{W}_K^{tv}\mathbf{h}_{K-1}^{tv} + \mathbf{b}_K^{tv})$, we can easily access $\partial\mathcal{L}_{bpr}/\partial\mathbf{W}_K^{tv}$, $\partial\mathcal{L}_{mod}/\partial\mathbf{W}_K^{tv}$, and $\partial\mathcal{L}_{rec}/\partial\hat{\mathbf{W}}_K^{tv}$. Then we can iteratively obtain $\partial\mathcal{L}_{bpr}/\partial\mathbf{W}_k^{tv}$ and $\partial\mathcal{L}_{mod}/\partial\mathbf{W}_k^{tv}$, $k = K, \ldots, 1$. Meanwhile, we can obtain $\partial\mathcal{L}_{rec}/\partial\hat{\mathbf{W}}_k^{tv}$ and $\partial\mathcal{L}_{rec}/\partial\mathbf{W}_k^{tv}$, $k = K, \ldots, 1$, in a similar manner. We then employ stochastic gradient descent to optimize the proposed model, where the network parameters can be updated as follows:
\[
\begin{cases}
\mathbf{W}_k^{tv} \leftarrow \mathbf{W}_k^{tv} - \eta\left(\dfrac{\partial\mathcal{L}_{bpr}}{\partial\mathbf{W}_k^{tv}} + \mu\dfrac{\partial\mathcal{L}_{mod}}{\partial\mathbf{W}_k^{tv}} + \gamma\dfrac{\partial\mathcal{L}_{rec}}{\partial\mathbf{W}_k^{tv}} + \lambda\mathbf{W}_k^{tv}\right),\\[2ex]
\hat{\mathbf{W}}_k^{tv} \leftarrow \hat{\mathbf{W}}_k^{tv} - \eta\left(\gamma\dfrac{\partial\mathcal{L}_{rec}}{\partial\hat{\mathbf{W}}_k^{tv}} + \lambda\hat{\mathbf{W}}_k^{tv}\right),
\end{cases}
\tag{3.10}
\]
where $\eta$ is the learning rate, and $\mu$, $\gamma$, and $\lambda$ are the trade-off and regularization coefficients of the overall objective.
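To make the layer-wise update in Eq. (3.10) concrete, here is a minimal NumPy sketch of one update step for a single layer; the gradient arrays, the coefficients mu, gamma, lam, and the learning rate eta stand in for the quantities defined above and are supplied by the caller.

```python
import numpy as np

def update_encoder(W, g_bpr, g_mod, g_rec, eta, mu, gamma, lam):
    """One step of Eq. (3.10) for an encoder matrix W_k^{tv}: all three
    losses plus the regularizer contribute to the gradient."""
    return W - eta * (g_bpr + mu * g_mod + gamma * g_rec + lam * W)

def update_decoder(W_hat, g_rec, eta, gamma, lam):
    """One step of Eq. (3.10) for a decoder matrix W_hat_k^{tv}: only the
    reconstruction loss and the regularizer touch the decoder weights."""
    return W_hat - eta * (gamma * g_rec + lam * W_hat)

# Example with random placeholder gradients:
rng = np.random.default_rng(0)
W = rng.normal(size=(512, 4096))
W = update_encoder(W, *(rng.normal(size=W.shape) for _ in range(3)),
                   eta=0.01, mu=0.1, gamma=0.1, lam=0.001)
```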
3.4 EXPERIMENT
In this part, we conducted extensive experiments to verify our proposed BPR-DAE model on Dataset I by answering the following research questions.
• Does BPR-DAE outperform the state-of-the-art methods?
• What is the contribution of each component of BPR-DAE?
• How does each modality contribute to the compatibility modeling?
3.4.1 EXPERIMENT SETTINGS
In this chapter, we extract the visual and contextual features of fashion items as follows.
Visual Modality. In this work, we took advantage of advanced deep CNNs, which have been proven to be the state-of-the-art models for image representation learning [10, 55, 134]. In particular, we chose the pre-trained ImageNet deep neural network provided by the Caffe software package [51], which consists of five convolutional layers followed by three fully connected layers. We fed the image of each fashion item to the CNN and adopted the fc7 layer output as the visual feature. Therefore, for each item, its visual modality is represented by a 4,096-D vector.
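The chapter's feature extraction was done with Caffe; as an illustrative stand-in, the following sketch obtains an analogous 4,096-D fc7 feature with the pre-trained AlexNet from torchvision, whose five convolutional and three fully connected layers mirror the architecture described above. The image file name is hypothetical.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Pre-trained AlexNet: five conv layers followed by fc6, fc7, fc8.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.eval()
# Drop the last classification layer (fc8) so the forward pass stops
# at the 4096-D fc7 activation.
model.classifier = torch.nn.Sequential(*list(model.classifier.children())[:-1])

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("fashion_item.jpg").convert("RGB")  # hypothetical file
with torch.no_grad():
    visual_feature = model(preprocess(image).unsqueeze(0)).squeeze(0)
print(visual_feature.shape)  # torch.Size([4096])
```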
Contextual Modality. Considering the short length of such contextual information, we utilized the bag-of-words scheme [50], which has been proven effective in encoding contextual metadata [26]. We first constructed a style vocabulary based on the categories and the words in all the titles in our dataset. As such user-generated metadata is inevitably noisy, we filtered out the categories and words that appeared in fewer than five items, as well as the words with fewer than three characters, which are more likely to be noise. We ultimately obtained a vocabulary of 3,529 phrases, and hence encoded the contextual modality of each fashion item as a 3,529-D Boolean vector.
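A minimal scikit-learn sketch of this encoding: binary=True produces the Boolean presence vector, the token pattern keeps only words of at least three characters, and min_df implements the frequency filter (it is lowered here so the toy corpus survives; the chapter uses a threshold of five items). The example metadata strings are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Each "document" concatenates an item's category and title (hypothetical).
item_metadata = [
    "coat wool double-breasted winter coat",
    "dress floral midi summer dress",
    "jeans slim-fit denim blue jeans",
]

vectorizer = CountVectorizer(
    binary=True,                      # Boolean presence/absence, not counts
    token_pattern=r"(?u)\b\w{3,}\b",  # drop words shorter than 3 characters
    min_df=1,                         # set to 5 on the full dataset
)
contextual_features = vectorizer.fit_transform(item_metadata)
print(contextual_features.shape)  # (3, vocabulary size)
```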
We separated the positive pair set $\mathcal{S}$ in Dataset I into three chunks: 80% of the pairs for training, 10% for validation, and 10% for testing, denoted as $\mathcal{S}_{train}$, $\mathcal{S}_{valid}$, and $\mathcal{S}_{test}$, respectively. Then we generated the triplet sets $\mathcal{D}_{\mathcal{S}_{train}}$, $\mathcal{D}_{\mathcal{S}_{valid}}$, and $\mathcal{D}_{\mathcal{S}_{test}}$ according to Eq. (3.5). In particular, for each positive top-bottom pair $(t_i, b_j)$, we randomly sampled $M$ bottoms $b_k$ to construct $M$ triplets $(i, j, k)$, where $b_k \notin \mathcal{B}_i^+$ and $M$ is set as 3.
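A minimal sketch of this negative sampling, assuming positive_pairs holds the (top, bottom) index pairs of S and all_bottoms lists every bottom in the inventory; all names are hypothetical.

```python
import random
from collections import defaultdict

def build_triplets(positive_pairs, all_bottoms, M=3, seed=0):
    """For each positive pair (i, j), sample M bottoms k outside B_i^+
    (the bottoms that have ever been matched with top i)."""
    rng = random.Random(seed)
    matched = defaultdict(set)  # B_i^+ per top i
    for i, j in positive_pairs:
        matched[i].add(j)
    triplets = []
    for i, j in positive_pairs:
        negatives = [k for k in all_bottoms if k not in matched[i]]
        for k in rng.sample(negatives, M):
            triplets.append((i, j, k))
    return triplets
```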
We then adopted the widely used metric AUC (Area Under the ROC Curve) [99], which is defined as
\[
\mathrm{AUC} = \frac{1}{|\mathcal{T}|}\sum_i \frac{1}{|E(i)|}\sum_{(j,k)\in E(i)} \delta(m_{ij} > m_{ik}),
\tag{3.11}
\]
where the evaluation pairs per top $i$ are defined as
\[
E(i) := \{(j,k) \mid (i,j) \in \mathcal{S}_{test} \wedge (i,k) \notin \mathcal{S}\}.
\tag{3.12}
\]
Here $\delta(b)$ is the indicator function that returns one if the argument $b$ is true and zero otherwise.
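The metric can be computed directly from a compatibility scorer; the sketch below assumes score(i, j) returns m_ij, test_pairs is S_test, and all_pairs is the full positive set S, mirroring Eqs. (3.11) and (3.12).

```python
def compute_auc(score, test_pairs, all_pairs, all_bottoms):
    """AUC per Eq. (3.11): for each top i, the fraction of evaluation
    pairs (j, k) in E(i) with m_ij > m_ik, averaged over all tops."""
    tops = {i for i, _ in test_pairs}
    total = 0.0
    for i in tops:
        # E(i) per Eq. (3.12): test positives j, paired with bottoms k
        # that never appear with top i anywhere in S.
        pos = [j for t, j in test_pairs if t == i]
        neg = [k for k in all_bottoms if (i, k) not in all_pairs]
        pairs = [(j, k) for j in pos for k in neg]
        total += sum(score(i, j) > score(i, k) for j, k in pairs) / len(pairs)
    return total / len(tops)
```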
For optimization, we employed stochastic gradient descent (SGD) [3] with the momentum factor set to 0.9. We adopted the grid search strategy to determine the optimal values of the regularization parameters (i.e., $\mu$, $\gamma$, and $\lambda$) among the values $\{10^r \mid r \in \{-5, \ldots, -1\}\}$. In addition, the mini-batch size, the number of hidden units, and the learning rate for all methods were searched in [32, 64, 128, 256, 512, 1024], [128, 256, 512, 1024], and [0.001, 0.01, 0.1], respectively. The proposed model was fine-tuned on the training and validation sets for 30 epochs, and the performance on the testing set was reported. We experimentally found that the proposed model achieves the optimal performance with $K = 1$ hidden layer of 512 hidden units. All the experiments were conducted on a server equipped with four NVIDIA Titan X GPUs.
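The hyperparameter grids above can be enumerated exhaustively; in this sketch, train_and_validate is a hypothetical stand-in for one 30-epoch fine-tuning run that returns the validation AUC.

```python
from itertools import product

def train_and_validate(mu, gamma, lam, batch_size, hidden_units, lr):
    """Hypothetical: train BPR-DAE with these hyperparameters and
    return the AUC on the validation set."""
    return 0.0  # placeholder

reg_grid = [10 ** r for r in range(-5, 0)]  # {10^r | r in {-5, ..., -1}}
batch_grid = [32, 64, 128, 256, 512, 1024]
hidden_grid = [128, 256, 512, 1024]
lr_grid = [0.001, 0.01, 0.1]

best_cfg, best_auc = None, -1.0
for cfg in product(reg_grid, reg_grid, reg_grid,
                   batch_grid, hidden_grid, lr_grid):
    val_auc = train_and_validate(*cfg)
    if val_auc > best_auc:
        best_cfg, best_auc = cfg, val_auc
```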
We first experimentally verified the convergence of the proposed learning algorithm. The changes of the objective function in Eq. (3.8) and the training AUC during one run of the training algorithm are illustrated in Figure 3.4. As we can see, both values change rapidly within the first few epochs and then level off, which demonstrates the convergence of our model.