4. KNOWLEDGE-GUIDED COMPATIBILITY MODELING
In this section, we first introduce the experimental settings and then present the experimental results
together with discussions of the research questions raised above.
4.4.1 EXPERIMENT SETTINGS
In this chapter, we extract the visual and contextual representations of fashion items as follows.
Visual Modality. Regarding the visual modality, similar to our previous work, we adopted
the deep convolutional neural network pre-trained on ImageNet, provided by the Caffe software
package [51], which consists of five convolutional layers followed by three fully connected layers.
We represented the visual modality of each item with the 4096-D output vector of the fc7 layer.
Contextual Modality. In this work, the contextual description of each fashion item refers to
its title and category labels at different granularities. To obtain an effective contextual represen-
tation, instead of traditional linguistic features [106, 107], we adopted the CNN architec-
ture [57], which has achieved compelling performance in various natural language processing
tasks [102]. In particular, we first represented each contextual description as a matrix of con-
catenated word vectors, where each row represents one constituent word, encoded by the
publicly available 300-D word2vec vector. We then deployed a single-channel CNN, con-
sisting of a convolutional layer on top of the concatenated word vectors and a max pooling layer.
In particular, we used four kernels with sizes of 2, 3, 4, and 5, with 100 feature maps each, and
the rectified linear unit (ReLU) as the activation function. Ultimately, we obtained a 400-D
contextual representation for each item.
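The single-channel CNN described above can be sketched as follows. This is a minimal PyTorch sketch under the stated configuration (kernel sizes 2–5, 100 feature maps each, ReLU, max pooling over time, yielding 4 × 100 = 400 dimensions); the class and variable names are ours, not from the original implementation.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Single-channel CNN over stacked word vectors, following the
    configuration described in the text (illustrative sketch)."""

    def __init__(self, embed_dim=300, num_maps=100, kernel_sizes=(2, 3, 4, 5)):
        super().__init__()
        # each kernel spans k consecutive words over the full 300-D embedding
        self.convs = nn.ModuleList(
            nn.Conv2d(1, num_maps, kernel_size=(k, embed_dim))
            for k in kernel_sizes
        )

    def forward(self, x):
        # x: (batch, seq_len, 300) -- the concatenated word2vec vectors
        x = x.unsqueeze(1)                       # add the single channel dim
        pooled = []
        for conv in self.convs:
            h = torch.relu(conv(x)).squeeze(3)   # (batch, 100, seq_len - k + 1)
            pooled.append(h.max(dim=2).values)   # max pool over time -> (batch, 100)
        return torch.cat(pooled, dim=1)          # (batch, 400)

# a title of 10 words, each a 300-D word vector (random stand-in for word2vec)
ctx = TextCNN()(torch.rand(4, 10, 300))          # 400-D contextual representation
```

Max pooling over the time axis makes the representation invariant to the title length, so descriptions of different lengths all map to the same 400-D space.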
We divided the positive pair set S into three chunks: 80% of triplets for training, 10% for
validation, and 10% for testing, denoted as S_train, S_valid, and S_test, respectively. We then
generated the triplet sets D_{S_train}, D_{S_valid}, and D_{S_test} according to Eq. (4.3). For
each positive pair of t_i and b_j, we randomly sampled M bottoms b_k's, and each b_k con-
tributes to a triplet (i, j, k), where b_k ∉ B_i^+ and M is set as 3. We adopted the area under
the ROC curve (AUC) [133] as the evaluation metric. For optimization, we employed stochastic
gradient descent (SGD) [3] with the momentum factor set as 0.9. We adopted the grid search
strategy to determine the optimal values of the two regularization parameters among the values
{10^r | r ∈ {-4, ..., -1}} and [2, 4, 6, 8], respectively. In addition, the mini-batch size, the
number of hidden units, and the learning rate were searched in [32, 64, 128, 256],
[128, 256, 512, 1024], and [0.01, 0.05, 0.1], respectively. The proposed model was fine-tuned
for 40 epochs, and the performance on the testing set was reported. We empirically found that
the proposed model achieves the optimal performance with K = 1 hidden layer of 1,024 hidden
units.
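The negative sampling procedure above can be sketched as follows: for each positive (top, bottom) pair, M = 3 bottoms are drawn from outside the top's positive set B_i^+ to form triplets. This is an illustrative sketch; the data structures and function name are hypothetical, not from the original implementation.

```python
import random

def sample_triplets(positive_pairs, all_bottoms, M=3, seed=0):
    """For each positive pair (top i, bottom j), sample M negative bottoms
    b_k with b_k not in B_i^+, each contributing a triplet (i, j, k).
    (Illustrative sketch with hypothetical data structures.)"""
    rng = random.Random(seed)
    # B_i^+ : the set of bottoms positively paired with top i
    positives = {}
    for i, j in positive_pairs:
        positives.setdefault(i, set()).add(j)
    triplets = []
    for i, j in positive_pairs:
        candidates = [b for b in all_bottoms if b not in positives[i]]
        for k in rng.sample(candidates, M):
            triplets.append((i, j, k))
    return triplets

# toy example: 3 positive pairs over 8 candidate bottoms
pairs = [("t1", "b1"), ("t1", "b2"), ("t2", "b3")]
bottoms = [f"b{n}" for n in range(1, 9)]
triplets = sample_triplets(pairs, bottoms, M=3)
# each positive pair yields M = 3 triplets, so 9 triplets in total
```

Excluding the whole positive set B_i^+ (rather than only the current b_j) ensures no sampled "negative" bottom is actually a known match for the top, which would otherwise inject label noise into the training triplets.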
We first experimentally verified the convergence of the proposed learning scheme. Fig-
ure 4.4 shows the changes of the objective function in Eq. (4.5) and the training AUC over the
iterations of our algorithm. As can be seen, both values change rapidly within the first few epochs
and then gradually stabilize, which demonstrates the convergence of our model.