Algorithm 6.3 Personalized Compatibility Modeling.
Input: Training set $\mathcal{D} = \{(m, i, j, k)\}$, learning rate $\eta$, regularization parameter $\lambda$, and the trade-off parameters.
Output: Parameters $\Theta_F$.
1: Initialize parameters $\Theta_F$.
2: repeat
3: Draw $(m, i, j, k)$ from $\mathcal{D}$.
4: Compute $p^m_{ij}$ according to Eq. (6.2).
5: for each parameter $\theta$ in $\Theta_F$ do
6: Update $\theta \leftarrow \theta + \eta \left( \sigma(-p^m_{ij}) \frac{\partial p^m_{ij}}{\partial \theta} - \lambda \theta \right)$.
7: end for
8: until convergence
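
Line 6 is the standard stochastic gradient step for a BPR-style objective. The following is a minimal PyTorch sketch of this loop, not the actual implementation: it assumes a hypothetical model.score(m, i, j) that realizes Eq. (6.2) and the usual BPR comparison between the preferred bottom j and the non-preferred bottom k; autograd supplies the partial derivatives, and weight decay supplies the $\lambda\theta$ term.

import torch
import torch.nn.functional as F

def train(model, D, eta=1e-3, lam=1e-4, epochs=10):
    # weight_decay realizes the -lambda * theta part of the update in line 6
    opt = torch.optim.SGD(model.parameters(), lr=eta, weight_decay=lam)
    for _ in range(epochs):
        for m, i, j, k in D:
            # model.score is hypothetical; it should return p^m_ij of Eq. (6.2)
            diff = model.score(m, i, j) - model.score(m, i, k)
            # minimizing -ln(sigma(diff)) performs the ascent step of line 6
            loss = F.softplus(-diff)
            opt.zero_grad()
            loss.backward()   # autograd computes the partial derivatives
            opt.step()
    return model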
6.4.1 IMPLEMENTATION
In this chapter, we extract the visual and contextual representations of fashion items as follows.
Visual Modality. Regarding the visual modality, we applied deep CNNs, which have proven to be the state-of-the-art models for image representation learning [10, 55, 68, 72]. In particular, we chose the 50-layer residual network (ResNet50) in [33]. We fed the image of each fashion item into the network and adopted the output of the last average pooling layer as the visual representation. Thereby, we represented the visual modality of each item with a 2,048-D vector.
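
As a concrete illustration, such a feature can be extracted with a few lines of PyTorch. The following is a minimal sketch, assuming a torchvision ResNet50 with ImageNet-pretrained weights and its standard preprocessing (these are assumptions for illustration, not necessarily the exact setup used here):

import torch
from torchvision import models, transforms
from PIL import Image

# Sketch: obtain the 2048-D output of ResNet50's last average pooling
# layer for a single item image (ImageNet weights assumed).
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()  # drop the classifier head; keep pooled features
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def visual_feature(image_path):
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return resnet(x).squeeze(0)  # shape: (2048,)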
Contextual Modality. As a pioneering attempt at personalized clothing matching, here we only considered the title description and category metadata as the contextual information of the fashion item. We first tokenized the text with the help of the Japanese morphological analyzer Kuromoji.³ To obtain an effective contextual representation, instead of the traditional linguistic features [106, 107], we adopted the CNN architecture [57], which has achieved compelling success in various natural language processing tasks [45, 102]. In particular, we first represented each contextual description as a matrix of concatenated word vectors, where each row represents one constituent word. To represent each word, we employed the 300-D vectors provided by the Japanese word2vec Nwjc2vec in the search mode, which was created from the NINJAL Web Japanese Corpus [103]. We then deployed a single-channel CNN, consisting of a convolutional layer on top of the concatenated word vectors and a max pooling layer. In particular, we used four kernels with sizes of 2, 3, 4, and 5, respectively. For each kernel, we had 100 feature maps, and we employed the rectified linear unit (ReLU) as the activation function. Ultimately, we obtained a 400-D contextual representation for each item.
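
A minimal sketch of this single-channel text CNN follows, assuming the 300-D Nwjc2vec embeddings have already been looked up into a per-description word matrix; the kernel sizes, 100 feature maps per kernel, ReLU activation, and max pooling are as described above:

import torch
import torch.nn as nn

class TextCNN(nn.Module):
    # Single-channel CNN over a (num_words, 300) word-vector matrix:
    # four kernel heights (2, 3, 4, 5), 100 feature maps each, ReLU,
    # and max pooling over word positions, yielding a 400-D vector.
    def __init__(self, emb_dim=300, n_maps=100, sizes=(2, 3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(1, n_maps, kernel_size=(h, emb_dim)) for h in sizes
        )

    def forward(self, words):             # words: (batch, num_words, 300)
        x = words.unsqueeze(1)            # add the single input channel
        pooled = [torch.relu(conv(x)).squeeze(3).max(dim=2).values
                  for conv in self.convs] # max pooling over word positions
        return torch.cat(pooled, dim=1)   # (batch, 400)

For instance, TextCNN()(torch.randn(8, 20, 300)) returns an (8, 400) tensor; descriptions shorter than the largest kernel height (five words) would need padding.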
³ http://www.atilika.org/