Algorithm 6.3 Personalized Compatibility Modeling.
Input: Training set $\mathcal{D} = \{(m, i, j, k)\}$, learning rate $\eta$, regularization parameter $\lambda$, and the trade-off parameters.
Output: Parameters $\Theta_F$.
1: Initialize parameters $\Theta_F$.
2: repeat
3: Draw $(m, i, j, k)$ from $\mathcal{D}$.
4: Compute $p^m_{ij}$ according to Eq. (6.2).
5: for each parameter $\theta$ in $\Theta_F$ do
6: Update $\theta \leftarrow \theta + \eta \left( \sigma(-p^m_{ij}) \frac{\partial p^m_{ij}}{\partial \theta} - \lambda \theta \right)$.
7: end for
8: until convergence
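
Line 6 is the standard stochastic gradient step for a BPR-style objective. The following is a minimal PyTorch sketch of this loop, not the actual implementation: it assumes a hypothetical model.score(m, i, j) that realizes Eq. (6.2) and the usual BPR comparison between the preferred bottom j and the non-preferred bottom k; autograd supplies the partial derivatives, and weight decay supplies the $\lambda\theta$ term.

import torch
import torch.nn.functional as F

def train(model, D, eta=1e-3, lam=1e-4, epochs=10):
    # weight_decay realizes the -lambda * theta part of the update in line 6
    opt = torch.optim.SGD(model.parameters(), lr=eta, weight_decay=lam)
    for _ in range(epochs):
        for m, i, j, k in D:
            # model.score is hypothetical; it should return p^m_ij of Eq. (6.2)
            diff = model.score(m, i, j) - model.score(m, i, k)
            # minimizing -ln(sigma(diff)) performs the ascent step of line 6
            loss = F.softplus(-diff)
            opt.zero_grad()
            loss.backward()   # autograd computes the partial derivatives
            opt.step()
    return model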
6.4.1 IMPLEMENTATION
In this chapter, we extract the visual and contextual representations of fashion items as follows.
Visual Modality. Regarding the visual modality, we applied deep CNNs, which have proven to be the state-of-the-art models for image representation learning [10, 55, 68, 72]. In particular, we chose the 50-layer residual network (ResNet50) in [33]. We fed the image of each fashion item into the network and adopted the output of the last average pooling layer as the visual representation. Thereby, we represented the visual modality of each item with a 2,048-D vector.
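
As a concrete illustration, such a feature can be extracted with a few lines of PyTorch. The following is a minimal sketch, assuming a torchvision ResNet50 with ImageNet-pretrained weights and its standard preprocessing (these are assumptions for illustration, not necessarily the exact setup used here):

import torch
from torchvision import models, transforms
from PIL import Image

# Sketch: obtain the 2048-D output of ResNet50's last average pooling
# layer for a single item image (ImageNet weights assumed).
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()  # drop the classifier head; keep pooled features
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def visual_feature(image_path):
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return resnet(x).squeeze(0)  # shape: (2048,)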
Contextual Modality. As a pioneering attempt at personalized clothing matching, here we only considered the title description and category metadata as the contextual information of the fashion item. We first tokenized the text with the help of the Japanese morphological analyzer Kuromoji.³ To obtain an effective contextual representation, instead of the traditional linguistic features [106, 107], we adopted the CNN architecture [57], which has achieved compelling success in various natural language processing tasks [45, 102]. In particular, we first represented each contextual description as a matrix of concatenated word vectors, where each row represents one constituent word. To represent each word, we employed the 300-D vectors provided by the Japanese word2vec Nwjc2vec in the search mode, which was created from the NINJAL Web Japanese Corpus [103]. We then deployed a single-channel CNN, consisting of a convolutional layer on top of the concatenated word vectors and a max pooling layer. In particular, we used four kernels with sizes of 2, 3, 4, and 5, respectively. For each kernel, we had 100 feature maps, and we employed the rectified linear unit (ReLU) as the activation function. Ultimately, we obtained a 400-D contextual representation for each item.
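
A minimal sketch of this single-channel text CNN follows, assuming the 300-D Nwjc2vec embeddings have already been looked up into a per-description word matrix; the kernel sizes, 100 feature maps per kernel, ReLU activation, and max pooling are as described above:

import torch
import torch.nn as nn

class TextCNN(nn.Module):
    # Single-channel CNN over a (num_words, 300) word-vector matrix:
    # four kernel heights (2, 3, 4, 5), 100 feature maps each, ReLU,
    # and max pooling over word positions, yielding a 400-D vector.
    def __init__(self, emb_dim=300, n_maps=100, sizes=(2, 3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(1, n_maps, kernel_size=(h, emb_dim)) for h in sizes
        )

    def forward(self, words):             # words: (batch, num_words, 300)
        x = words.unsqueeze(1)            # add the single input channel
        pooled = [torch.relu(conv(x)).squeeze(3).max(dim=2).values
                  for conv in self.convs] # max pooling over word positions
        return torch.cat(pooled, dim=1)   # (batch, 400)

For instance, TextCNN()(torch.randn(8, 20, 300)) returns an (8, 400) tensor; descriptions shorter than the largest kernel height (five words) would need padding.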
³ http://www.atilika.org/