(CNNs) and introduced a similarity metric to model the human notion of complementary objects.
Similarly, He et al. [34] introduced a scalable matrix factorization approach that incorporates
visual signals of product images to fulfill the recommendation task. Although these studies
have achieved great success, previous efforts on fashion analysis mainly focus on visual sig-
nals but fail to take contextual information into consideration. To bridge this gap, Li et
al. [67] proposed a multi-modal, multi-instance deep learning system to classify a given outfit
as popular or unpopular. Distinct from the above research, we particularly focus
on modeling the sophisticated compatibility between fashion items by seeking the nonlinear
latent compatibility space with neural networks. Moreover, we seamlessly aggregate the multi-
modal data of fashion items and exploit the inherent relationship between different modalities
to comprehensively model the compatibility between fashion items.
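For illustration, the following PyTorch sketch shows one minimal way such a nonlinear latent compatibility space can be realized: two small encoders map the features of a top and a bottom into a shared latent space, where an inner product serves as the compatibility score. All module names, layer sizes, and feature dimensions here are illustrative assumptions rather than the exact configuration of the proposed model.

import torch
import torch.nn as nn

class CompatibilityEncoder(nn.Module):
    # Hypothetical encoder: maps an item's feature vector into a shared
    # latent compatibility space via a nonlinear transformation.
    def __init__(self, in_dim: int, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512),
            nn.ReLU(),  # nonlinearity lets the space capture complex matching rules
            nn.Linear(512, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Separate encoders for tops and bottoms, since the two categories follow
# different feature distributions (dimensions are assumed for illustration).
top_encoder = CompatibilityEncoder(in_dim=4096)
bottom_encoder = CompatibilityEncoder(in_dim=4096)

def compatibility_score(top_feat: torch.Tensor, bottom_feat: torch.Tensor) -> torch.Tensor:
    # Inner product in the latent space serves as the compatibility score.
    return (top_encoder(top_feat) * bottom_encoder(bottom_feat)).sum(dim=-1)

# Score a batch of 8 (top, bottom) pairs with random stand-in features.
tops, bottoms = torch.randn(8, 4096), torch.randn(8, 4096)
print(compatibility_score(tops, bottoms).shape)  # torch.Size([8])

Higher scores indicate more compatible pairs; in practice the encoders would be trained with a ranking objective over observed outfit compositions.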
3.2.2 REPRESENTATION LEARNING
Representation learning has long been an active research topic for machine learning, which aims
to learn more effective representations for data, as compared to hand-designed representations,
and hence achieve better performance in machine learning tasks [64]. In particular, recent
advances in neural networks have propelled a variety of models, such as autoencoders (AE) [90],
deep belief networks (DBN) [39], deep Boltzmann machines (DBM) [27], and CNNs [62], to
tackle various problems. For example, Wang et al. [118] utilized deep autoencoders to capture
the highly nonlinear network structure and thus learn accurate network embeddings. Due to
the increasingly complex data and tasks, multi-view representation learning has attracted grow-
ing research attention. One basic training criterion that has been applied to multi-view representation
learning is to learn a latent compact representation that can reconstruct the input as much as
possible [117], where autoencoders are naturally adopted [22]. For example, Ngiam et al. [90]
first proposed a structure based on multimodal autoencoders to learn the shared representation
for speech and visual inputs and solve the problem of speech recognition. In addition, Wang
et al. [117] proposed a multimodal deep model to learn unified image-text representations to
tackle the cross-modality retrieval problem. Although representation learning has been success-
fully applied to cross-modality retrieval [20, 22], phonetic recognition [117], and multilingual
classification [98], limited efforts have been dedicated to the fashion domain, which is the
research gap we aim to bridge in this work.
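As a concrete illustration of this shared-representation idea, the following is a minimal PyTorch sketch of a bimodal autoencoder in the spirit of Ngiam et al. [90]: two modality-specific encoders are fused into a single shared code, from which both inputs are reconstructed. The names and dimensions (e.g., 4,096-d visual and 300-d textual features) are assumptions made purely for exposition.

import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    # Hypothetical bimodal autoencoder: encodes two modalities into one
    # shared code and reconstructs both modalities from that code.
    def __init__(self, vis_dim: int, txt_dim: int, shared_dim: int = 256):
        super().__init__()
        self.vis_enc = nn.Sequential(nn.Linear(vis_dim, 512), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, 512), nn.ReLU())
        self.fusion = nn.Linear(512 + 512, shared_dim)  # shared multimodal code
        self.vis_dec = nn.Sequential(nn.Linear(shared_dim, 512), nn.ReLU(),
                                     nn.Linear(512, vis_dim))
        self.txt_dec = nn.Sequential(nn.Linear(shared_dim, 512), nn.ReLU(),
                                     nn.Linear(512, txt_dim))

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        shared = self.fusion(torch.cat([self.vis_enc(vis), self.txt_enc(txt)], dim=-1))
        return self.vis_dec(shared), self.txt_dec(shared), shared

model = BimodalAutoencoder(vis_dim=4096, txt_dim=300)
vis, txt = torch.randn(4, 4096), torch.randn(4, 300)
vis_rec, txt_rec, code = model(vis, txt)
# Reconstructing both modalities forces the shared code to retain
# information from each input; this is the basic multi-view training criterion.
loss = nn.functional.mse_loss(vis_rec, vis) + nn.functional.mse_loss(txt_rec, txt)

Minimizing the joint reconstruction loss yields a compact code that captures information from both views, which can then be reused for downstream tasks such as retrieval.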
3.3 METHODOLOGY
In this section, we first introduce the notation used to formulate the research problem,
and then detail the proposed BPR-DAE.
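As context for the formulation, the name BPR-DAE points to the well-known Bayesian Personalized Ranking (BPR) criterion; the sketch below shows that generic pairwise objective, assuming compatibility scores such as those produced by a latent-space model. It is only an illustration of the ranking criterion, not the exact objective of the proposed model.

import torch
import torch.nn.functional as F

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    # Generic BPR objective: push the score of each observed (compatible)
    # pair above that of a sampled (presumably incompatible) pair.
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# Stand-in scores for 8 positive and 8 sampled negative pairs.
pos, neg = torch.randn(8), torch.randn(8)
print(bpr_loss(pos, neg))

Because the loss depends only on score differences, it optimizes the ranking of compatible over incompatible pairs rather than absolute score values.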