(CNNs) and introduced a similarity metric to model the human notion of complementary objects.
Similarly, He et al. [34] introduced a scalable matrix factorization approach that incorporates
visual signals of product images to fulfill the recommendation task. Although these studies
have achieved great success, previous efforts on fashion analysis mainly focus on visual sig-
nals but fail to take contextual information into consideration. To bridge this gap, Li et
al. [67] proposed a multi-modal, multi-instance deep learning system to classify a given outfit
as popular or unpopular. Distinct from the above research, we particularly focus
on modeling the sophisticated compatibility between fashion items by seeking the nonlinear
latent compatibility space with neural networks. Moreover, we seamlessly aggregate the multi-
modal data of fashion items and exploit the inherent relationship between different modalities
to comprehensively model the compatibility between fashion items.
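For illustration, the following PyTorch sketch shows one minimal way such a nonlinear latent compatibility space can be realized: two small encoders map the features of a top and a bottom into a shared latent space, where an inner product serves as the compatibility score. All module names, layer sizes, and feature dimensions here are illustrative assumptions rather than the exact configuration of the proposed model.

import torch
import torch.nn as nn

class CompatibilityEncoder(nn.Module):
    # Hypothetical encoder: maps an item's feature vector into a shared
    # latent compatibility space via a nonlinear transformation.
    def __init__(self, in_dim: int, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512),
            nn.ReLU(),  # nonlinearity lets the space capture complex matching rules
            nn.Linear(512, latent_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Separate encoders for tops and bottoms, since the two categories follow
# different feature distributions (dimensions are assumed for illustration).
top_encoder = CompatibilityEncoder(in_dim=4096)
bottom_encoder = CompatibilityEncoder(in_dim=4096)

def compatibility_score(top_feat: torch.Tensor, bottom_feat: torch.Tensor) -> torch.Tensor:
    # Inner product in the latent space serves as the compatibility score.
    return (top_encoder(top_feat) * bottom_encoder(bottom_feat)).sum(dim=-1)

# Score a batch of 8 (top, bottom) pairs with random stand-in features.
tops, bottoms = torch.randn(8, 4096), torch.randn(8, 4096)
print(compatibility_score(tops, bottoms).shape)  # torch.Size([8])

Higher scores indicate more compatible pairs; in practice the encoders would be trained with a ranking objective over observed outfit compositions.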
3.2.2 REPRESENTATION LEARNING
Representation learning has long been an active research topic for machine learning, which aims
to learn more effective representations for data, as compared to hand-designed representations,
and hence achieve better performance in machine learning tasks [64]. In particular, recent
advances in neural networks have propelled a variety of models, such as autoencoders (AE) [90],
deep belief networks (DBN) [39], deep Boltzmann machines (DBM) [27], and CNNs [62], to
tackle various problems. For example, Wang et al. [118] utilized deep autoencoders to capture
the highly nonlinear network structure and thus learn accurate network embeddings. Due to
the increasingly complex data and tasks, multi-view representation learning has attracted grow-
ing research attention. One basic training criterion that has been applied to multi-view representation
learning is to learn a latent compact representation that can reconstruct the input as much as
possible [117], where autoencoders are naturally adopted [22]. For example, Ngiam et al. [90]
first proposed a structure based on multimodal autoencoders to learn the shared representation
for speech and visual inputs and solve the problem of speech recognition. In addition, Wang
et al. [117] proposed a multimodal deep model to learn unified image-text representations to
tackle the cross-modality retrieval problem. Although representation learning has been success-
fully applied to cross-modality retrieval [20, 22], phonetic recognition [117], and multilingual
classification [98], limited efforts have been dedicated to the fashion domain, which is the
research gap we aim to bridge in this work.
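As a concrete illustration of this shared-representation idea, the following is a minimal PyTorch sketch of a bimodal autoencoder in the spirit of Ngiam et al. [90]: two modality-specific encoders are fused into a single shared code, from which both inputs are reconstructed. The names and dimensions (e.g., 4,096-d visual and 300-d textual features) are assumptions made purely for exposition.

import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    # Hypothetical bimodal autoencoder: encodes two modalities into one
    # shared code and reconstructs both modalities from that code.
    def __init__(self, vis_dim: int, txt_dim: int, shared_dim: int = 256):
        super().__init__()
        self.vis_enc = nn.Sequential(nn.Linear(vis_dim, 512), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, 512), nn.ReLU())
        self.fusion = nn.Linear(512 + 512, shared_dim)  # shared multimodal code
        self.vis_dec = nn.Sequential(nn.Linear(shared_dim, 512), nn.ReLU(),
                                     nn.Linear(512, vis_dim))
        self.txt_dec = nn.Sequential(nn.Linear(shared_dim, 512), nn.ReLU(),
                                     nn.Linear(512, txt_dim))

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        shared = self.fusion(torch.cat([self.vis_enc(vis), self.txt_enc(txt)], dim=-1))
        return self.vis_dec(shared), self.txt_dec(shared), shared

model = BimodalAutoencoder(vis_dim=4096, txt_dim=300)
vis, txt = torch.randn(4, 4096), torch.randn(4, 300)
vis_rec, txt_rec, code = model(vis, txt)
# Reconstructing both modalities forces the shared code to retain
# information from each input; this is the basic multi-view training criterion.
loss = nn.functional.mse_loss(vis_rec, vis) + nn.functional.mse_loss(txt_rec, txt)

Minimizing the joint reconstruction loss yields a compact code that captures information from both views, which can then be reused for downstream tasks such as retrieval.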
3.3 METHODOLOGY
In this section, we first introduce the notation used to formulate the research problem,
and then detail the proposed BPR-DAE.
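As context for the formulation, the name BPR-DAE points to the well-known Bayesian Personalized Ranking (BPR) criterion; the sketch below shows that generic pairwise objective, assuming compatibility scores such as those produced by a latent-space model. It is only an illustration of the ranking criterion, not the exact objective of the proposed model.

import torch
import torch.nn.functional as F

def bpr_loss(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    # Generic BPR objective: push the score of each observed (compatible)
    # pair above that of a sampled (presumably incompatible) pair.
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# Stand-in scores for 8 positive and 8 sampled negative pairs.
pos, neg = torch.randn(8), torch.randn(8)
print(bpr_loss(pos, neg))

Because the loss depends only on score differences, it optimizes the ranking of compatible over incompatible pairs rather than absolute score values.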