
Chapter 9

Style Recognition and Kinship Understanding

Shuhui Jiang^⁎; Ming Shao^†; Caiming Xiong^‡; Yun Fu^§
^⁎Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, United States
^†Computer and Information Science, University of Massachusetts Dartmouth, Dartmouth, MA, United States
^‡Salesforce Research, Palo Alto, CA, United States
^§Department of Electrical and Computer Engineering and College of Computer and Information Science (Affiliated), Northeastern University, Boston, MA, United States

Abstract

This chapter addresses two novel applications of deep learning: style recognition and kinship understanding. For the former, while style classification has drawn much attention in fields such as fashion, architecture, and manga, most existing methods focus on extracting discriminative features from local patches or patterns. Usually, multiple low-level visual features are extracted and concatenated to form the style descriptor. However, style classification usually relies on high-level abstract concepts. Moreover, style images exhibit a “spread out” phenomenon: visually less representative images in a style class are usually very diverse and easily misclassified. In this section, we first describe related work and the challenges of the style classification task. Then we describe a deep learning based solution, the consensus style centralizing autoencoder (CSCAE) [1], that addresses these challenges. We show experimental results on fashion, manga, and architecture style classification problems with both deep learning and non-deep learning methods.

For the latter, we propose a new familial feature extraction method towards better kinship parsing. First, we propose a novel concept called “family faces”, which are the component-wise k-nearest neighbors of the family mean face. Second, a parallel autoencoder structure is designed for feature extraction, with faces from a given family as the input and their corresponding family faces as the output. A low-rank regularizer is further imposed on the model parameters of the parallel autoencoders to jointly learn common familial features. Extensive experimental results on the KFW and Family101 databases demonstrate that our feature learning method is superior to state-of-the-art methods.

Keywords

Deep learning; Autoencoder; Style classification; Fashion; Manga; Architecture

9.1 Style Classification by Deep Learning¹

9.1.1 Background

Style classification has attracted increasing attention from both researchers and artists in many fields. It is related to, but essentially different from, most existing classification tasks. For example, current online clothing shopping websites usually categorize items as skirt, dress, suit, etc. However, one clothing category may consist of diverse fashion styles. For example, a “suit” could be in either a casual or a renascent fashion style, and a dress could be in either a romantic or an elegant fashion style. Style classification may help people identify style classes and generate relationships between styles. Therefore, learning robust and discriminative feature representations for style classification has become an interesting and challenging research topic. Most style classification methods focus on extracting discriminative local patches or patterns based on low-level features. Some recent works on fashion, manga, and architecture style classification based on low-level feature representations are described as follows:

Fashion style classification. “Fashion and AI” has recently become a hot research topic, e.g., in clothing parsing [2], retrieval [3], recognition [4], and generation [5]. Bossard et al. densely extracted feature descriptors such as HOG in the bounding box of the upper body, followed by a bag-of-words model [6]. In Hipster Wars [7], Kiapour et al. proposed an online game to collect a fashion dataset. A style descriptor was formed by accumulating visual features such as color and texture. They then applied mean-std pooling and concatenated all the pooled features as the final style descriptor, followed by a linear SVM for classification.

Manga style classification. Chu et al. [8] paved the way for manga style classification, which classifies whether the manga is targeting young boys (i.e., shonen) or young girls (i.e., shojo). They designed both explicit (e.g., the density of line segments) and implicit (e.g., included angles between lines) feature descriptors, and concatenated these descriptors as feature representation.

Architecture style classification. Goel et al. focused on architectural style classification (e.g., baroque and gothic) [9]. They mined characteristic features with the semantic utility from the low-level features with various scales. Van et al. created correspondences across images by a generalized spatial pyramid matching scheme [10]. They assumed that images within a category share a similar style defined by attributes such as colorfulness and lighting. Xu et al. adopted deformable part-based models (DPM) to capture the morphological characteristics of basic architectural components [11].

However, style is usually reflected by high-level abstract concepts. These works may fail to extract mid/high-level features for style representation. Furthermore, they do not discuss the spread out phenomenon of style images, which is observed in [12,1]. Fig. 9.1 illustrates this “spread out” phenomenon. We take fashion style classification of “Goth” and “Preppy” as an example. Representative images in the center of each class are assigned the strong style level $l_{3}$. It is easy to distinguish “Goth” and “Preppy” images of strong style. Images that are less representative and distant from the center are assigned the lower style level $l_{1}$ and are called weak style images. Weak style images within one style can be visually diverse, and images in two classes can be visually similar (shown in red frames). This spread out nature makes weak style images easy to misclassify into other classes. To better illustrate the spread out idea, Fig. 9.2 visualizes two feature descriptors of manga data consisting of the shojo and shonen classes [8]. PCA is conducted to reduce the dimension of the feature descriptors to two for visualization. We can see that strong style data points (e.g., in blue) are dense and well separated, whereas weak style data points (e.g., in magenta) are spread out and hard to separate.

Figure 9.1 Illustration of the *weak style* phenomenon in fashion and manga style images in (A) and (B), respectively. Style images are usually “spread out”. Images in the center of each style circle are representative of this style and defined as *strong style*. They are easy to distinguish from other styles. Images far from the center are less similar to strong style images and easily misclassified. Images in red frames on the boundary are weak style images from two different classes, but they look visually similar. We denote by l₁ to l₃ the style levels from weakest to strongest. (For interpretation of the colors in the figures, the reader is referred to the web version of this chapter.)

Figure 9.2 Data visualization of the “spread out” phenomenon in the “shojo” and “shonen” classes of manga style. PCA is conducted to reduce the dimension of the feature descriptors “line strength” and “included angle between lines” to 2D for visualization in (A) and (B), respectively. In both (A) and (B), five colors, blue, green, red, cyan, and magenta, are used to present the data points in different style levels from the strongest to the weakest. We can see that strong style data points are dense and weak style data points are diffuse.

Furthermore, as described above, all the feature descriptors are usually concatenated to form the style descriptor. This means that all the feature descriptors are treated as equally important. However, for different styles, the importance of different feature descriptors may be different. For example, color may be more important than other feature descriptors when describing the “Goth” style, which is usually dominated by black. Thus, adaptively allocating weights to different feature descriptors becomes another challenge in style classification. To address this challenge, a “consensus” idea is introduced in [12,1] to jointly learn weights for different visual features in representation learning. For example, if one patch of the image is critical for discrimination (e.g., the eye patch for a face), a higher weight should be assigned to this patch for all the features, meaning a consistency of weights across different feature descriptors.

In the remainder of this section, we describe a deep learning solution named consensus style centralizing autoencoder (CSCAE) for robust style feature extraction, especially for weak style classification [1]. First, we describe the style centralizing autoencoder (SCAE), which progressively draws weak style images back to the class center for one feature descriptor. The inputs of SCAE are concatenated low-level features from all the local patches of an image (e.g., eyes, nose, and mouth patches in a face image). As shown in Fig. 9.3, for each autoencoder (AE), the corresponding output feature is the same type of feature in the same class but one style level stronger than the input feature. Only neighboring samples are pulled together towards the center of the class, since weak style images can be very diverse even within one class. The progressive steps then slowly mitigate the weak style distinction and ensure the smoothness of the model. In addition, to address the lack of consensus among different kinds of visual features, their weights are jointly learned by a CSCAE through the rank-constrained group sparsity autoencoder (RCGSAE), based on the consensus idea. We show the evaluation of both deep learning and non-deep learning methods on three applications: fashion style classification, manga style classification, and architecture style classification.

Figure 9.3 Illustration of the stacked style centralizing autoencoder (SSCAE). Example images of each level l are presented with colored frames. In step k, samples in l_k are replaced by the nearest neighbors (red) of the l_{k+1} samples found in l_k. Samples in levels higher than l_k are not changed (blue).

Deep learning structures such as autoencoders (AEs) have been exploited to learn discriminative feature representation [13–15]. Conventional AEs [13] include two parts: (1) encoder and (2) decoder. An encoder $f (\cdot)$ attempts to map the input feature $x_{i} \in R^{D}$ to the hidden layer representation $z_{i} \in R^{d}$ ,

$z_{i} = f (x_{i}) = σ (W_{1} \times x_{i} + b_{1}),$

(9.1)

where $W_{1} \in R^{d \times D}$ is a linear transform, $b_{1} \in R^{d}$ is the bias, and σ is the nonlinear activation (e.g., sigmoid function). The decoder $g (\cdot)$ manages to map the hidden representation $z_{i}$ back to the input feature $x_{i}$ , namely,

$x_{i} = g (z_{i}) = σ (W_{2} \times z_{i} + b_{2}),$

(9.2)

where $W_{2} \in R^{D \times d}$ is a linear transform, $b_{2} \in R^{D}$ is the bias.

To optimize the model parameters $W_{1}$ , $b_{1}$ , $W_{2}$ and $b_{2}$ , the least squared error problem is formulated as

$\min_{\binom{W_{1}, b_{1}}{W_{2}, b_{2}}} \frac{1}{2 N} \sum_{i = 1}^{N} {‖ x_{i} - g (f (x_{i})) ‖}^{2} + λ R (W_{1}, W_{2}),$

(9.3)

where N is the number of data points, $R (W_{1}, W_{2}) = ({‖ W_{1} ‖}_{F}^{2} + {‖ W_{2} ‖}_{F}^{2})$ works as a regularizer, ${‖ \cdot ‖}_{F}^{2}$ is the squared Frobenius norm, and λ is the weight decay parameter that suppresses arbitrarily large weights.
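To make Eqs. (9.1)–(9.3) concrete, the following is a minimal NumPy sketch of the encoder, the decoder, and the regularized objective. It is an illustration under stated assumptions rather than the authors' implementation: samples are stored as rows of `X`, the activation is the sigmoid, and the function names (`encode`, `decode`, `ae_objective`) and the toy sizes are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(X, W1, b1):
    """Eq. (9.1) applied to every row of X (N x D); returns Z (N x d)."""
    return sigmoid(X @ W1.T + b1)

def decode(Z, W2, b2):
    """Eq. (9.2): map hidden codes Z (N x d) back to the input space (N x D)."""
    return sigmoid(Z @ W2.T + b2)

def ae_objective(X, W1, b1, W2, b2, lam):
    """Eq. (9.3): reconstruction error over 2N plus Frobenius weight decay."""
    X_hat = decode(encode(X, W1, b1), W2, b2)
    recon = np.sum((X - X_hat) ** 2) / (2 * X.shape[0])
    decay = lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return recon + decay

# Toy data (assumed sizes): N = 8 samples, D = 6 input dims, d = 3 hidden dims.
rng = np.random.default_rng(0)
N, D, d = 8, 6, 3
X = rng.random((N, D))
W1, b1 = 0.1 * rng.standard_normal((d, D)), np.zeros(d)
W2, b2 = 0.1 * rng.standard_normal((D, d)), np.zeros(D)

Z = encode(X, W1, b1)
loss = ae_objective(X, W1, b1, W2, b2, lam=1e-3)
```

In practice the parameters would be updated by backpropagation on `ae_objective`; the sketch only evaluates the objective for fixed parameters.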

9.1.2 Preliminary Knowledge of Stacked Autoencoder (SAE)

A stacked autoencoder (SAE) [16,17] stacks multiple AEs to form a deep structure. It feeds the hidden layer of the kth AE as the input feature to the $(k + 1)$th layer. However, in the weak style classification problem, the performance of AE or SAE degrades due to the “spread out” phenomenon. The reason is that the conventional AE or SAE learns mid/high-level feature representations in an unsupervised fashion, so there is no guidance to pull images of the same class together and push images of different classes apart. This is very similar to conventional PCA (an SAE can be seen as a multilayer nonlinear PCA). In Fig. 9.2, where data are illustrated after PCA, the weak style data points shown in cyan and magenta are diffuse and overlap with other classes, which suggests that the mid/high-level feature representation learned by AE or SAE will suffer from the “spread out” phenomenon. A style centralizing autoencoder (SCAE) is introduced to address these issues [1,12].

9.1.3 Style Centralizing Autoencoder

Local visual features are used as the input for SCAE. Assume that there are N images from $N_{c}$ style classes, and $x_{i}$ $(i \in \{1, \dots, N\})$ is the feature representation of the ith image. First, each image is divided into several patches (e.g., eyes, nose, and mouth patches in a face image). Then visual features (e.g., HoG, RGB, Gabor) are extracted from each patch. For each feature descriptor (e.g., HoG), the extracted features from all the patches are concatenated as one part of the input features for SCAE. By concatenating all the different visual features, we obtain the final input features for SCAE. In addition, each image is assigned a style level label. Intuitively, representative images of each style are usually assigned the strong style level, while less representative images are assigned the weak style level. We use L distinct style levels denoted as $\{l_{1}, l_{2}, \dots, l_{k}, \dots, l_{L}\}$ from the weakest to the strongest.

9.1.3.1 One Layer Basic SCAE

Unlike the conventional AE, which takes identical input and output features, SCAE takes different input and output features. An illustration of the full pipeline of SCAE can be found in Fig. 9.3. Suppose that we have $L = 4$ style levels, and the inputs of the SCAE in the first layer are the features of images in ascending order of style level, namely, $X_{1}$, $X_{2}$, $X_{3}$, and $X_{4}$. For example, $X_{2}$ is the set of features of images in style level $l_{2}$. Let $X^{(k)}$ be the input feature of the kth step, where $x_{i}^{(k)} \in X^{(k)}$ is the feature of the ith sample, i.e., the hidden representation learned in the $(k - 1)$th step.

SCAE handles the following mappings:

$\{X_{k}^{(k)}, X_{k + 1}^{(k)}, \dots, X_{L}^{(k)}\} \to \{X_{k + 1}^{(k)}, X_{k + 1}^{(k)}, \dots, X_{L}^{(k)}\},$

(9.4)

where only $X_{k}$ is pulled towards the stronger style level $l_{k + 1}$, and the others keep the same style level before and after the kth step. In this way, the weak style levels are gradually pulled towards the strong style level, i.e., centralization, until $k = L - 1$. Thus, the $L - 1$ stacked AEs embody the stacked SCAE. It remains to specify the mapping between $X_{k}$ and $X_{k + 1}$. To keep the style level transition smooth, for each output feature $x \in X_{k + 1}$, its nearest neighbor in $X_{k}$ within the same style class is used as the corresponding input to learn SCAE. The whole process is shown in Fig. 9.3.
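The nearest-neighbor pairing behind Eq. (9.4) can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `scae_pairs`, the use of squared Euclidean distance, and the toy data are assumptions. For each class, every level-(k+1) sample is paired with its nearest level-k neighbor as input, and all samples above level k are paired with themselves.

```python
import numpy as np

def scae_pairs(X, levels, classes, k):
    """Build (input, target) rows for step k following Eq. (9.4).

    X: (N, D) features; levels: (N,) ints in 1..L; classes: (N,) class ids.
    """
    inputs, targets = [], []
    for c in np.unique(classes):
        weak = X[(classes == c) & (levels == k)]
        strong = X[(classes == c) & (levels == k + 1)]
        keep = X[(classes == c) & (levels > k)]
        if len(weak) and len(strong):
            # Squared distances from each level-(k+1) sample to each level-k sample.
            d2 = ((strong[:, None, :] - weak[None, :, :]) ** 2).sum(axis=2)
            inputs.append(weak[d2.argmin(axis=1)])   # nearest level-k neighbor
            targets.append(strong)                   # pulled towards level k+1
        # Samples above level k are reconstructed unchanged.
        inputs.append(keep)
        targets.append(keep)
    return np.vstack(inputs), np.vstack(targets)

# Toy example with one class and three style levels.
X = np.array([[0.0, 0.0], [4.0, 4.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]])
levels = np.array([1, 1, 2, 2, 3])
classes = np.zeros(5, dtype=int)
X_in, X_out = scae_pairs(X, levels, classes, k=1)
```

Here the level-2 sample (1, 1) receives the level-1 neighbor (0, 0) as its input, and (5, 5) receives (4, 4); the level-2 and level-3 samples are also kept as identity pairs.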

9.1.3.2 Stacked SCAE (SSCAE)

Having introduced the one layer basic SCAE, we now explain how to build the stacked SCAE (SSCAE). Suppose that we have L style levels. In the kth step and for style class c, the corresponding input for the output $x_{i, ξ + 1}^{(k, c)}$ is given by

${\tilde{x}}_{i, ξ}^{(c)} = \begin{cases} x_{j, ξ}^{(k, c)} \in u (x_{i, ξ + 1}^{(k, c)}), & \text{if } ξ = k, \\ x_{i, ξ}^{(k, c)}, & \text{if } ξ = k + 1, \dots, L, \end{cases}$

(9.5)

where $u (x_{i, ξ + 1}^{(k, c)})$ is the set of nearest neighbors of $x_{i, ξ + 1}^{(k, c)}$ in the ξth layer.

As there are $N_{c}$ style classes, in each layer, we first separately learn parameters and hidden layer features $Z^{(k, c)}$ of SCAE of each class, and then combine all the $Z^{(k, c)}$ together as $Z^{(k)}$ . Mathematically, SSCAE can be formulated as

$\begin{matrix} \min_{\binom{W_{1}^{(k, c)}, b_{1}^{(k, c)}}{W_{2}^{(k, c)}, b_{2}^{(k, c)}}} \sum_{\binom{i, j}{x_{j, k}^{(k, c)} \in u (x_{i, k + 1}^{(k, c)})}} {‖ x_{i, k + 1}^{(k, c)} - g (f (x_{j, k}^{(k, c)})) ‖}^{2} + \\ \sum_{ξ = k + 1}^{L} \sum_{i} {‖ x_{i, ξ}^{(k, c)} - g (f (x_{i, ξ}^{(k, c)})) ‖}^{2} + λ R (W_{1}^{(k, c)}, W_{2}^{(k, c)}) . \end{matrix}$

(9.6)

The problem above can be solved in a similar way as the conventional AE by back propagation algorithms [18]. Similarly, the deep structure can be built in a layer-wise way, which is outlined in Algorithm 9.1.

Algorithm 9.1 Stacked Style Centralizing Autoencoder.
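The algorithm listing itself is not reproduced here. As a rough guide, the layer-wise procedure described in the text can be sketched in NumPy as below. This is a sketch under several assumptions, not Algorithm 9.1 itself: plain gradient descent replaces whatever optimizer the authors used, the loss of Eq. (9.6) is scaled by 1/(2n) for a stable step size, features are assumed to lie in [0, 1] so that a sigmoid decoder can reach them, samples below level k are simply encoded with the learned encoder, and all names and hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pair_step(H, levels, k):
    """Inputs/targets of Eq. (9.5) for one class: level-k neighbors -> level k+1."""
    weak, strong, keep = H[levels == k], H[levels == k + 1], H[levels > k]
    if len(weak) == 0 or len(strong) == 0:
        return keep, keep
    d2 = ((strong[:, None] - weak[None]) ** 2).sum(axis=2)
    return np.vstack([weak[d2.argmin(axis=1)], keep]), np.vstack([strong, keep])

def train_ae(X_in, X_out, d, lam=1e-4, lr=0.5, epochs=300, seed=0):
    """Gradient descent on a 1/(2n)-scaled version of Eq. (9.6)."""
    rng = np.random.default_rng(seed)
    n, D = X_in.shape
    W1, b1 = 0.1 * rng.standard_normal((d, D)), np.zeros(d)
    W2, b2 = 0.1 * rng.standard_normal((D, d)), np.zeros(D)
    losses = []
    for _ in range(epochs):
        Z = sigmoid(X_in @ W1.T + b1)
        Y = sigmoid(Z @ W2.T + b2)
        losses.append(np.sum((X_out - Y) ** 2) / (2 * n)
                      + lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2)))
        D2 = (Y - X_out) / n * Y * (1 - Y)   # error at the decoder pre-activation
        D1 = (D2 @ W2) * Z * (1 - Z)         # error at the encoder pre-activation
        W2 -= lr * (D2.T @ Z + 2 * lam * W2)
        b2 -= lr * D2.sum(axis=0)
        W1 -= lr * (D1.T @ X_in + 2 * lam * W1)
        b1 -= lr * D1.sum(axis=0)
    return (W1, b1), losses

def sscae(X, levels, classes, L, d):
    """Layer-wise loop: one SCAE per class for each step k = 1, ..., L-1."""
    H = X.copy()
    for k in range(1, L):
        H_next = np.zeros((len(H), d))
        for c in np.unique(classes):
            m = classes == c
            X_in, X_out = pair_step(H[m], levels[m], k)
            (W1, b1), _ = train_ae(X_in, X_out, d)
            H_next[m] = sigmoid(H[m] @ W1.T + b1)   # Z^(k,c), merged into Z^(k)
        H = H_next
    return H

# Toy data: 2 classes, 3 style levels, 3 samples per level and class.
rng = np.random.default_rng(0)
levels = np.tile(np.repeat([1, 2, 3], 3), 2)
classes = np.repeat([0, 1], 9)
X = rng.random((18, 4))
H = sscae(X, levels, classes, L=3, d=3)
```

The sketch assumes every class has samples at every level; a fuller implementation would need to handle empty levels explicitly.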

9.1.3.3 Visualization of Encoded Feature in SCAE

Fig. 9.4 shows the visualization of encoded features in the progressive step $k = 1, 2, 3$ in manga style classification. Similar to Fig. 9.2, PCA is employed to reduce the dimensionality of the descriptors. The low-level input feature is “density of line segments” in [8]. In all the subfigures, a dot represents a sample. In the right sub-figures, colors are used to distinguish different styles, while in left subfigures, colors are used to distinguish different style levels.

Figure 9.4 Visualization of the encoded features in SCAE in progressive step k = 1,2,3 on the Manga dataset. Meanings of different colors for different style levels are: Blue, level 5; Green, level 4; Red, level 3; Cyan, level 2; Magenta, level 1. Note that PCA is applied for dimensionality reduction before visualization.

From Fig. 9.4, we can see that at step $k = 1$, in the right subfigures, the “shojo” samples (in red) and “shonen” samples (in blue) overlap. In the left subfigures, samples in the strong style level (in blue) are separated from each other. However, samples in weak style levels overlap with each other. For example, it is hard to separate the samples in cyan in the two styles. During the progressive steps, samples in different styles gradually separate due to the style centralization. At progressive steps $k = 2, 3$, in the right subfigures we can see that the samples in red and blue gradually become separable. In the left subfigures, we can see that during style centralizing, the weak style samples, shown in green, red, cyan, and magenta, move close to the locations of the blue samples in the two different styles. Since the blue samples represent the strong style level and can easily be separated, the centralization process makes the weak style samples more distinguishable.

9.1.3.4 Geometric Interpretation of SCAE

Recalling the mappings presented in Eq. (9.4), if we consider $X_{k}$ as a corrupted version of $X_{k + 1}$, SCAE can be regarded as a denoising autoencoder (DAE) [16] that uses partially corrupted features as the input and the clean, noise-free features as the output to learn robust representations [16,19]. Thus, inspired by the geometric perspective under the manifold assumption [20], we may offer a geometric interpretation for the SCAE, analogous to that of DAE [16].

Fig. 9.5 illustrates the manifold learning perspective of SCAE, where images of the Goth fashion style are shown as examples. Suppose that the higher level $l_{k + 1}$ Goth style images lie close to a low dimensional manifold. The weak style examples are more likely to be far away from the manifold than the higher level ones. Note that $x_{j, k}$ is the corrupted version of $x_{i, k + 1}$ produced by the operator $q (X_{k} | X_{k + 1})$, and therefore lies far away from the manifold. In SCAE, $q (X_{k} | X_{k + 1})$ finds the nearest neighbor of $x_{i, k + 1}$ in level $l_{k}$ within the same category and takes it as the corrupted version $x_{j, k}$ of $x_{i, k + 1}$. During the centralizing training, similar to DAE, SCAE learns the stochastic operator $p (X_{k + 1} | X_{k})$ that maps the lower style level samples $X_{k}$ back to a higher level. Successful centralization implies that the operator $p (\cdot)$ is able to map spread-out weak style data back to the strong style data, which lie close to the manifold.

Figure 9.5 Manifold learning perspective of SCAE with fashion images in “Goth” style.

9.1.4 Consensus Style Centralizing Autoencoder

Given multiple low-level visual descriptors (e.g., HOG, RGB, Gabor), usually low-level feature based methods treat them equally and concatenate them to formulate the final representation [7,8]. Thus, they may fail to consider the correlation between different kinds of feature descriptors, i.e., consensus [21,22].

In this section, we introduce the consensus style centralizing autoencoder (CSCAE) with low-rank group sparsity constraint [1], based on the SCAE. Intuitively, the weights of two features from the same patch should be similar, as they encode the same visual information, but in different ways. Taking face recognition as an example, the eyes patch should be more important than the cheek patch, as demonstrated by many face recognition works. Thus, for manga style classification, given different kinds of features used by different SCAEs, the eyes patches in different SCAEs should be equally important. To that end, a consensus constraint through minimizing the differences of weights of the same patch from different feature descriptors is added onto the SCAE to form the CSCAE.

In the remainder of this section, we first introduce the low-rank constraint and the group sparsity constraint used to achieve the consensus idea. Then we introduce the rank-constrained group sparsity autoencoder (RCGSAE) and its solution. Finally, CSCAE is introduced based on the RCGSAE.

9.1.4.1 Low-Rank Constraint on the Model

Low-rank constraints have been widely used for discovering underlying structure and for data recovery [23,24]. They have been suggested for extracting salient features despite noise, as in the formulation of latent low-rank representation (LaLRR) [24]. As described above, similar weights are expected across different feature descriptors for the same patch. This gives rise to an interesting phenomenon: if we concatenate the weight matrices of all the feature descriptors into a single matrix W, the rank of W should be low. The main reason is that values are similar across different columns of W. Thus, this idea is pursued through a low-rank constraint on the concatenated weight matrix W.

To that end, we introduce a rank-constrained autoencoder model [1] to pursue the low-rankness of the weight matrix W in the following formula:

$\min_{W, E} {‖ W ‖}_{⁎} + λ {‖ E ‖}_{2, 1}, \quad \text{s.t.} \quad X = W \tilde{X} + E,$

(9.7)

where $\tilde{X}$ is the input feature and X is the output feature of the AE, as in SCAE; ${‖ \cdot ‖}_{⁎}$ is the nuclear norm of a matrix, used as the convex surrogate of the original rank constraint; ${‖ E ‖}_{2, 1}$ is the matrix $ℓ_{2, 1}$ norm for characterizing the sparse noise E; and λ is a balancing parameter. Intuitively, the residual term E encourages sparsity, as both W and $W \tilde{X}$ are low-rank matrices. This is also well explored in many low-rank recovery/representation works [24,23,14,22]. Equation (9.7) can also be considered a special form of the work [24] that only considers features in the column space.

Fig. 9.6A illustrates this phenomenon. Each row represents a specific feature of one patch, and different colors indicate different kinds of features. In addition, each column represents all features of one sample. It can be seen that the concatenated features for one sample are formulated by first stacking different features of the same patch, and then stacking different patches. Note that for simplicity, one cell is used to represent one kind of features of one patch. We could see that ideally, both eye-feature1 and eye-feature2 should have the highest weights, and nose-feature1 and nose-feature2 have the second highest weights. Meanwhile, the less important patches cheek-feature1 and cheek-feature2 should be given lower weights to suppress noises. In brief, based on the definition of the matrix rank, the consensus constraint among different features induces the low-rank structure of matrix W identified in Fig. 9.6A.
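The link between consensus and low rank can be checked numerically on a toy matrix. The sketch below is only an illustration: the patch weights (0.9, 0.6, 0.1 for eyes, nose, and cheek), the two-descriptor layout with one column per descriptor, and the noise scale are assumed values, not taken from the chapter.

```python
import numpy as np

# Rows: patches (eyes, nose, cheek). Columns: feature descriptors (feature1, feature2).
patch_weights = np.array([0.9, 0.6, 0.1])            # illustrative values only
W_ideal = np.tile(patch_weights[:, None], (1, 2))    # perfect consensus across descriptors

rng = np.random.default_rng(0)
W_noisy = W_ideal + 0.01 * rng.standard_normal(W_ideal.shape)

rank_ideal = np.linalg.matrix_rank(W_ideal)                   # identical columns
singular_values = np.linalg.svd(W_noisy, compute_uv=False)    # sorted, largest first
nuclear_norm = singular_values.sum()                          # convex surrogate of the rank
```

With perfect consensus the columns are identical, so the matrix has rank one. With small deviations the matrix is numerically full rank, but one singular value dominates, which is the structure that the nuclear norm in Eq. (9.7) rewards.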

Figure 9.6 Illustration of the consensus style centralizing autoencoder (CSCAE). The illustration of the low-rank group sparsity structure of low-level features is shown in (A). The illustration of the matrix factorization in the solution of the model is shown in (B).

9.1.4.2 Group Sparsity Constraint on the Model

To further consider the regularizers introduced in Eq. (9.3) under the new rank constraint autoencoder framework, we introduce an additional group sparsity constraint on W. The reasons are three-fold. First, like conventional regularizers in neural networks, it helps avoid the arbitrarily large magnitude in W. Second, it enforces the row-wise selection on W to ensure a better consensus effect together with the low-rank constraint. Third, it helps find the most discriminative representation.

Mathematically, we can achieve this by adding the matrix $ℓ_{2, 1}$ norm ${‖ W ‖}_{2, 1} = \sum_{i = 1}^{D} {‖ W (i) ‖}_{2}$, which is equal to the sum of the Euclidean norms of all columns of W. The $ℓ_{2}$-norm constraint is applied to each group separately (i.e., each column of W). It ensures that all elements in the same column either approach zero together or are nonzero together. The $ℓ_{1}$ norm guarantees that only a few columns are nonzero. Fig. 9.6A also illustrates how the group sparsity works. If the entry in the jth row and ith column of X indicates an unimportant patch, all the entries of the jth row are also less important, and vice versa. As discussed above, these patches should be assigned very low or zero weights to suppress the noise.
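A short NumPy sketch may help to see how the $ℓ_{2, 1}$ norm acts on whole columns. The first function follows the column-wise definition above. The second is the standard proximal operator of the $ℓ_{2, 1}$ norm (group soft-thresholding), which the chapter does not spell out; it is included as an assumption about how such a penalty is typically handled, and the function names and the toy matrix are hypothetical.

```python
import numpy as np

def l21_norm(W):
    """Sum of the Euclidean norms of the columns of W."""
    return np.linalg.norm(W, axis=0).sum()

def group_soft_threshold(W, tau):
    """Proximal operator of tau * ||W||_{2,1}: shrink each column as a group."""
    norms = np.linalg.norm(W, axis=0)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return W * scale

W = np.array([[3.0, 0.1, 0.0],
              [4.0, 0.1, 2.0]])
value = l21_norm(W)                          # 5 + sqrt(0.02) + 2
W_sparse = group_soft_threshold(W, tau=1.0)  # the weak middle column becomes exactly zero
```

The column with norm 5 is scaled by 0.8, the column with norm 2 by 0.5, and the column with norm below the threshold is set to zero as a whole, which is the "all elements in a column vanish together" behavior described above.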

9.1.4.3 Rank-Constrained Group Sparsity Autoencoder

Considering both rank and group sparsity constraints, the objective function of the rank-constrained group sparsity autoencoder (RCGSAE) is formulated as

${\hat{W}}_{r} = \underset{rank (W) \leq r}{\arg \min} \{ {‖ X - W \tilde{X} ‖}_{F}^{2} + 2 λ {‖ W ‖}_{2, 1} \},$

(9.8)

where ${\hat{W}}_{r}$ is the optimized projection matrix in Eq. (9.8) under the constraint $rank (W) \leq r$, and λ is the balancing parameter. Note that we skip the sparse error term in Eq. (9.7) for simplicity. Clearly, for $r = D$, there is no rank constraint in Eq. (9.8), which reduces to a group sparsity problem, while for $λ = 0$, we obtain the reduced-rank regression estimator. Thus, an appropriate rank r and λ will balance the two parts to yield better performance.

Low-rank and group sparsity constraints not only minimize the difference of patch weights among different descriptors, but also set the weights of unimportant patches to zero. In this way, the influence of the noise of unimportant patches is decreased and we are able to find the most discriminative representation.

9.1.4.4 Efficient Solutions for RCGSAE

Here, we describe how to solve the objective function of RCGSAE. It should be noted that the problem defined in Eq. (9.8) is nonconvex and has no closed-form solution for W. Thus, an iterative algorithm is used to solve it efficiently. As shown in Fig. 9.6B, W is factorized into $W = V^{'} S$, where V is an $r \times D$ matrix with orthonormal rows, $V^{'}$ is the transpose of V, and S is an $r \times D$ matrix with the group sparse constraint [25]. Then the optimization problem of W in Eq. (9.8) becomes

$(\hat{S}, \hat{V}) = \underset{S \in R^{r \times D}, V \in R^{r \times D}}{\arg \min} \{ {‖ X - V^{'} S \tilde{X} ‖}_{F}^{2} + 2 λ {‖ S ‖}_{2, 1} \} .$

(9.9)

The details of the algorithm are outlined in Algorithm 9.2. In addition, the following theorem presents a convergence analysis for Algorithm 9.2 and ensures that the algorithm converges well regardless of the initial point.

Theorem 9.1

Given λ and an arbitrary starting point $V_{r, λ}^{(0)} \in O^{r \times D}$ , let $(S_{r, λ}^{(j)}, V_{r, λ}^{(j)}) (j = 1, 2, \dots)$ be the sequence of iterates generated by Algorithm 9.2. Then, any accumulation point of $(S_{r, λ}^{(j)}, V_{r, λ}^{(j)})$ is a coordinate-wise minimum point (and a stationary point) of F and $F (S_{r, λ}^{(j)}, V_{r, λ}^{(j)})$ converges monotonically to $F (S_{r, λ}^{⁎}, V_{r, λ}^{⁎})$ for some coordinate-wise minimum point $(S_{r, λ}^{⁎}, V_{r, λ}^{⁎})$ .

The proof is given in Appendix A.7 of [25].
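Since the listing of Algorithm 9.2 is not reproduced here, the following NumPy sketch shows one plausible alternating scheme for Eq. (9.9). It should be read as an assumption-laden illustration, not as Algorithm 9.2: the S-step uses a few proximal-gradient updates with column-wise group soft-thresholding, the V-step solves an orthogonal Procrustes problem by SVD, samples are stored as columns of `X` and `X_tilde`, and the function names, iteration counts, and toy data are hypothetical.

```python
import numpy as np

def objective(X, X_tilde, S, V, lam):
    """F(S, V) of Eq. (9.9) with W = V' S and a column-wise l2,1 norm on S."""
    resid = X - V.T @ S @ X_tilde
    return np.sum(resid ** 2) + 2 * lam * np.linalg.norm(S, axis=0).sum()

def rcgsae_sketch(X, X_tilde, r, lam, iters=50, inner=5, seed=0):
    D = X.shape[0]
    rng = np.random.default_rng(seed)
    # Arbitrary starting point: an r x D matrix with orthonormal rows.
    V = np.linalg.qr(rng.standard_normal((D, r)))[0].T
    S = np.zeros((r, D))
    # Step size 1/L, where L = 2 * ||X_tilde||_2^2 bounds the gradient's Lipschitz constant.
    step = 1.0 / (2 * np.linalg.norm(X_tilde, 2) ** 2)
    history = []
    for _ in range(iters):
        # S-step: with V fixed, the fit term equals ||V X - S X_tilde||_F^2 up to a constant.
        Y = V @ X
        for _ in range(inner):
            G = S - step * 2 * (S @ X_tilde - Y) @ X_tilde.T
            norms = np.maximum(np.linalg.norm(G, axis=0), 1e-12)
            S = G * np.maximum(0.0, 1.0 - step * 2 * lam / norms)
        # V-step: orthogonal Procrustes, V' = U Q^T from the SVD of X (S X_tilde)'.
        U, _, Qt = np.linalg.svd(X @ (S @ X_tilde).T, full_matrices=False)
        V = (U @ Qt).T
        history.append(objective(X, X_tilde, S, V, lam))
    return V.T @ S, history

# Toy problem: D = 6 features, N = 40 samples, true rank 2.
rng = np.random.default_rng(1)
D, N, r = 6, 40, 2
X_tilde = rng.standard_normal((D, N))
W_true = rng.standard_normal((D, r)) @ rng.standard_normal((r, D))
X = W_true @ X_tilde + 0.05 * rng.standard_normal((D, N))
W_hat, history = rcgsae_sketch(X, X_tilde, r=r, lam=0.1)
```

Each S-step and each V-step can only lower the objective, so `history` is non-increasing, in line with the monotone convergence stated in Theorem 9.1. The returned `W_hat` has rank at most r by construction.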

9.1.4.5 Progressive CSCAE

After introducing RCGSAE, we introduce progressive CSCAE, which stacks multiple RCGSAEs in a progressive way. In the kth step, it will increase the style level of $\tilde{X}$ from k to $k + 1$ , and meanwhile keep the consensus of different features. As shown in Algorithm 9.3, the input of the CSCAE is the style feature X. The output of the algorithm is the encoded feature $h^{(k)}$ and the projection matrix $W^{(k)}$ in the kth step, $k \in [1, L - 1]$ .

To initialize, we set $h^{(0)} = X$. For step k, the encoded feature $h^{(k - 1)}$ is regarded as the input. First, we calculate the output for $X^{(k)}$ as described in Eq. (9.5). Second, we optimize W by CSCAE using Algorithm 9.2. The learned W has the properties of both group sparsity and low-rankness. Afterwards, we calculate the new features $W \tilde{X}$, followed by a nonlinear function for normalization. Following the suggestion in [26], we use $\tanh (\cdot)$ to provide the nonlinearity. The encoded feature $h^{(k)}$ is regarded as the input in the next step $k + 1$. After $(L - 1)$ steps, we obtain $(L - 1)$ sets of weight matrices and the corresponding encoded features.

9.1.5 Experiments

In this section, we discuss the performance of several low-level based and deep learning based feature representation methods for fashion, manga and architecture style classification tasks.

9.1.5.1 Dataset

Fashion style classification dataset. Kiapour et al. collected a fashion style dataset named Hipster Wars [7] including 1,893 images of 5 fashion styles, as shown in Fig. 9.7. They also launched an online style comparison game to collect human judgments and provide style level information for each image.

Figure 9.7 Examples for 5 categories in the Hipster Wars dataset: (A) bohemian, (B) hipster, (C) goth, (D) pinup, and (E) preppy.

Pose estimation is applied to extract key boxes of the human body [27]. Seven dense features are extracted for each box following [7]: RGB color value, LAB color value, HSI color value, Gabor, MR8 texture response [28], HOG descriptor, and the probability of pixels belonging to skin categories.
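The style descriptor of [7] is formed by mean-std pooling of such dense features and concatenation of the pooled values. A minimal sketch of that pooling step is given below. The box format `(top, left, height, width)`, the function name `mean_std_pool`, and the toy feature maps are assumptions made for illustration; the real pipeline obtains the boxes from pose estimation and uses the seven features listed above.

```python
import numpy as np

def mean_std_pool(feature_maps, boxes):
    """Pool dense features inside each box and concatenate the results.

    feature_maps: list of (H, W, C_f) arrays, one per dense feature type.
    boxes: list of (top, left, height, width) tuples (assumed format).
    """
    parts = []
    for fmap in feature_maps:
        for (t, l, h, w) in boxes:
            patch = fmap[t:t + h, l:l + w].reshape(-1, fmap.shape[2])
            parts.append(patch.mean(axis=0))   # per-channel mean inside the box
            parts.append(patch.std(axis=0))    # per-channel standard deviation
    return np.concatenate(parts)

# Toy input: a 3-channel color map and an 8-channel texture map, two boxes.
rng = np.random.default_rng(0)
color = rng.random((32, 32, 3))
texture = rng.random((32, 32, 8))
boxes = [(0, 0, 16, 16), (16, 8, 16, 16)]
descriptor = mean_std_pool([color, texture], boxes)
```

The descriptor length is 2 (mean and std) times the total number of channels times the number of boxes, here 2 × (3 + 8) × 2 = 44.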

Manga style classification dataset. Chu et al. collected a shonen (boy targeting) and shojo (girl targeting) manga dataset, which includes 240 panels [8]. Six computational features, including angle between lines, line orientation, density of line segments, orientation of nearby lines, number of nearby lines with similar orientation and line strength, are calculated. Example shojo and shonen style panels are shown in Fig. 9.8.

Figure 9.8 Examples for shojo style and shonen style in the Manga dataset. The first row is shojo style and the second row is shonen style.

Since the Manga dataset does not provide manually labeled style level information, an automatic way to calculate the style level is applied. First, mean-shift clustering is applied to find the peak of the density of the images for each style based on the line strength feature. The line strength feature is the most discriminative among the 6 features, as measured by p-values. Images at the peak of the density are regarded as the most representative ones. Then images are ranked according to their distances to the most centralized images and evenly divided, from lowest to highest distance, into five style levels.
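The ranking-and-splitting step can be sketched as follows. Note the substitution: instead of mean-shift clustering, this sketch locates the density peak with a simple Gaussian kernel density estimate evaluated at the samples, which is a stand-in chosen to keep the example self-contained. The function name, the bandwidth, and the toy features are assumptions; level 5 denotes the strongest style, matching the color legend of Fig. 9.4.

```python
import numpy as np

def assign_style_levels(F, n_levels=5, bandwidth=1.0):
    """F: (N, D) features of one style class. Returns levels in 1..n_levels."""
    # Gaussian kernel density at each sample (stand-in for the mean-shift peak).
    d2 = ((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=2)
    density = np.exp(-d2 / (2 * bandwidth ** 2)).sum(axis=1)
    peak = F[density.argmax()]
    # Rank samples by distance to the peak, closest first.
    order = np.argsort(np.linalg.norm(F - peak, axis=1))
    levels = np.empty(len(F), dtype=int)
    for rank, chunk in enumerate(np.array_split(order, n_levels)):
        levels[chunk] = n_levels - rank   # closest chunk gets the strongest level
    return levels

# Toy features for one style class.
rng = np.random.default_rng(0)
F = rng.standard_normal((50, 2))
levels = assign_style_levels(F)
```

With 50 samples and five levels, each level receives exactly 10 samples, which mirrors the even split described above.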

Architecture style classification dataset. Xu et al. collected an architecture style dataset containing 5000 images [11]. It is the largest publicly available dataset for architectural style classification. The category definition is according to “Architecture_by_style” of Wikimedia.² Example images of ten classes are shown in Fig. 9.9. As there is no manually labeled style level information, a similar strategy to that used in Manga dataset is applied to generate the style level information.

Figure 9.9 Examples for 10 categories in the Architecture Style dataset: (A) American craftsman, (B) Baroque architecture, (C) Chicago school architecture, (D) Colonial architecture, (E) Georgian architecture, (F) Gothic architecture, (G) Greek Revival architecture, (H) Queen Anne architecture, (I) Romanesque architecture, and (J) Russian Revival architecture.

9.1.5.2 Compared Methods

Here we briefly describe several low-level feature representation based methods on fashion [7,29,6], manga [8] and architecture [11] style classification tasks. Then we describe general deep learning methods and deep learning methods for style classification tasks.

Low-level based methods:

Kiapour et al. [7] applied mean-std pooling for 7 dense low-level features from clothing images, and then concatenated them as the input to the classifier and named the concatenated features as the style descriptor.
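To make the mean-std pooling step concrete, here is a minimal sketch. It assumes each dense feature is given as a list of per-pixel vectors and uses the population standard deviation; the exact pooling regions and normalization used in [7] are not specified here, and the function names are hypothetical.

```python
import math


def mean_std_pool(dense):
    """Pool a dense feature (list of equal-length per-pixel vectors).

    Returns the per-dimension means followed by the per-dimension
    (population) standard deviations.
    """
    n = len(dense)
    dims = len(dense[0])
    means = [sum(row[d] for row in dense) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((row[d] - means[d]) ** 2 for row in dense) / n)
        for d in range(dims)
    ]
    return means + stds


def style_descriptor(feature_maps):
    """Concatenate the pooled vectors of several dense features."""
    descriptor = []
    for dense in feature_maps:
        descriptor.extend(mean_std_pool(dense))
    return descriptor
```

For example, pooling a 3-channel color feature and a 1-channel skin probability feature yields a descriptor of length 2 × 3 + 2 × 1 = 8.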

Yamaguchi et al. [29] approached clothing parsing via retrieval, and considered a robust style feature for retrieving images of similar style. They concatenated pooled features in a way similar to [7], followed by PCA for dimension reduction.

Bossard et al. [6] focused on apparel classification with style. For style feature representation, they first learned a codebook through k-means clustering based on low-level features. Then the bag-of-words features were further processed by spatial pyramids and max-pooling.

Chu and Chao [8] designed 6 computational features derived from line segments to describe drawing styles. They then concatenated the 6 features with equal weights.

Xu et al. [11] adopted the deformable part-based models (DPM) to capture the morphological characteristics of basic architectural components, where DPM describes an image by a multiscale HOG feature pyramid.

MultiFea [12,1]. The baseline in [11] only employed the HOG feature, whereas CSCAE employs multiple features. For a fair comparison, another low-level feature based method using multiple features is therefore constructed. Six low-level features are chosen according to the SUN dataset,³ including HOG, GIST, DSIFT, LAB, LBP, and tiny image. First, PCA dimension reduction is applied to each feature. Then, the normalized features are concatenated together.

Deep learning based methods:

AE [13]. A conventional autoencoder (AE) [13] is applied for learning mid/high-level features. The inputs of AE are the concatenated low-level features.

DAE [26]. Marginalized stacked denoising autoencoder (mSDA) [26] is a widely applied version of denoising autoencoder (DAE). Both SCAE and DAE share the spirit of “noise”. The inputs of the DAE are corrupted image features. As in [15], [19] and [26], the corruption rate of the dropout noise is learned by cross-validation. Other settings such as the number of stacked layers and the layer size are the same as SCAE.

SCAE [12,1]. Style centralizing autoencoder (SCAE) [12,1] is applied for learning mid/high-level features. The input of SCAE is the concatenation of the various low-level feature descriptors, i.e., an early fusion for SCAE.

CAE [1]. To demonstrate the role of “progressive style centralizing” in CSCAE, a consensus autoencoder (CAE) is constructed as another baseline. CAE is similar to CSCAE except that in each progressive step, the input and output features are exactly the same.

CSCAE [1]. This method contains the full pipeline of consensus style centralizing autoencoder (CSCAE) in [1].

In all the classification tasks, cross-validation is applied with a $9 : 1$ training-to-test ratio. An SVM classifier is applied on the Hipster Wars and Manga datasets, following the settings in [7,8], while a nearest neighbor (NN) classifier is applied on the Architecture dataset. All deep learning baselines use the same number of layers.
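One way to realize a $9 : 1$ cross-validation protocol is plain 10-fold splitting, sketched below. This is an assumption about the protocol: the text does not state the number of folds or whether the splits are stratified by class, and `ten_fold_splits` is a hypothetical helper.

```python
import random


def ten_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs with a 9:1 size ratio.

    Assumption: simple shuffled, unstratified folds.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield train, test
```

With the 240 manga panels, each split contains 216 training and 24 test indices, and every panel is tested exactly once.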

9.1.5.3 Experimental Results

Results on fashion style classification

Table 9.1 shows the accuracy (%) of low-level and deep learning based methods under different style levels $L = 1, \dots, 5$ . First, from Table 9.1 we can see that deep learning based methods (bottom 5 in the table) in general perform better than low-level based methods (top 3 in the table). CSCAE and CAE achieve the best and second best performance under all the style levels. Comparing DAE with SCAE, SCAE outperforms DAE at four of the five style levels (all except $L = 4$ ), which suggests that the style centralizing strategy is more effective than a general denoising strategy for style classification. Comparing CSCAE with CAE, CSCAE outperforms CAE in all the settings, which is also due to the style centralizing learning strategy.

Table 9.1

Performances (%) of fashion style classification on Hipster Wars dataset. The best and second best results under each setting are shown in bold font and underline

Performance	$L = 5$	$L = 4$	$L = 3$	$L = 2$	$L = 1$
Kiapour et al. [7]	77.73	62.86	53.34	37.74	34.61
Yamaguchi et al. [29]	75.75	62.42	50.53	35.36	33.36
Bossard et al. [6]	76.36	62.43	52.68	34.64	33.42
AE [13]	83.76	75.73	60.33	44.42	39.62
DAE [26]	83.89	73.58	58.83	46.87	38.33
SCAE [12,1]	84.37	72.15	59.47	48.32	38.41
CAE [1]	87.55	76.34	63.55	50.06	41.33
CSCAE [1]	90.31	78.42	64.35	54.72	45.31

Results on manga style classification

Table 9.2 shows the accuracy (%) of the deep learning based methods (bottom 5 in the table) and low-level feature based method (top in the table) under five style levels on Manga dataset. CSCAE and CAE achieve the highest and second highest performance under all the style levels. Compared to fashion and architecture images, CSCAE works especially well for the face images. We think that the face structure and different weights of patches (e.g., the weights of eye patch should be higher than cheek patch) work especially well with the low-rankness and group sparsity assumptions.

Table 9.2

Performance (%) of manga style classification

Performance	$L = 5$	$L = 4$	$L = 3$	$L = 2$	$L = 1$
LineBased [8]	83.21	71.35	68.62	64.79	60.07
AE [13]	83.61	72.52	69.32	65.18	61.28
DAE [26]	83.67	72.75	69.32	65.86	62.86
SCAE [12,1]	83.75	73.43	69.32	65.42	63.60
CAE [1]	85.35	76.45	72.57	67.85	65.79
CSCAE [1]	90.70	80.96	77.97	77.63	79.90

Results on architecture style classification

Table 9.3 shows the classification accuracy on the architecture style dataset. First, comparing the method of Xu et al. [11] with MultiFea, we learn that the additional low-level features do contribute to the performance. Second, all the deep learning methods (bottom 5 in the table) achieve better performance than the low-level feature based methods (top 2 in the table).

Table 9.3

Performance (%) of architecture style classification

Performance	$L = 5$	$L = 4$	$L = 3$	$L = 2$	$L = 1$
Xu et al. [11]	40.32	35.96	32.65	33.32	31.34
MultiFea	52.78	53.00	50.29	49.93	46.79
AE [13]	58.72	56.32	52.32	52.32	48.31
DAE [26]	58.55	56.99	53.34	52.39	50.33
SCAE [12,1]	59.61	57.00	53.27	54.28	51.76
CAE [1]	59.54	58.66	54.55	53.46	51.88
CSCAE [1]	60.37	59.41	55.12	54.74	54.68

9.2 Visual Kinship Understanding

9.2.1 Background

Kinship analysis and parsing have long been popular research topics in the psychology and biology communities [30], since kin relations build up the most fundamental social connections. However, verifying the kin relationship between people is not an easy task, since there is no way to do so that is both instant and economical. Although modern tools such as the DNA paternity test provide top-level accuracy, their high cost in terms of both time and money prevents them from being off-the-shelf verification tools. Recently, kinship verification has attracted substantial attention from the computer vision and artificial intelligence communities [31–42]. Inspired by the fact that children inherit genes from their parents, these works exploit facial appearance as well as social context in the photo to predict the kin relationship. With confirmed relationships, real-world applications such as building family trees, seeking missing children, and retrieving family photos become promising.

Most of the state-of-the-art works concentrate on pair-wise kinship verification: given a pair of facial images, the algorithm determines whether the two people have a kin relation. In general, the relationship is restricted to “parent–child”. In a more general case, kinship could include siblings as well, and a higher-order relationship, i.e., many-to-one verification, should be considered in real-world applications. A typical scenario is that, given a single test image and a family album, we determine whether the test image is from this family, which was first discussed in [37]. This problem is more general than one-to-one verification, and the latter is actually a special case. In this part, we formally name it “family membership recognition” (FMR). The problem illustration can be found in Fig. 9.10.

Figure 9.10 Illustration of kinship verification and family membership recognition problems.

Beyond the conventional kinship verification problem, FMR poses new challenges: (1) how to define familial features in terms of a family rather than an individual; (2) how to effectively extract familial features from input images. In this part, we propose low-rank regularized family faces guided parallel autoencoders (rPAE) for the FMR problem; the proposed method addresses both challenges through an integrated framework. In addition, rPAE can easily adapt to the conventional kinship verification problem and significantly boost the performance.

To that end, we first propose a novel concept called “family faces”, which are constructed from the family mean face and its component-wise nearest neighbors in the training set. Second, we design a novel structure called parallel autoencoders whose outputs are guided by multiple family faces. To be concrete, the inputs of each encoder are all facial images from different families, while the outputs are the corresponding family faces. In this way, we have a better chance to capture the inherited family facial features under the assumption that they are from different family members. Finally, to guarantee that the learned autoencoders yield a common feature space, we impose a low-rank constraint on the model parameters in the first layer, which enables us to learn the parallel autoencoders in a unified framework rather than one by one. Extensive experiments conducted on the KFW [36] and Family101 [37] databases demonstrate that the proposed model is effective in solving both kinship verification and family membership recognition problems compared with the state-of-the-art methods.

9.2.2 Related Work

There are three lines in the related work: (1) familial features, (2) autoencoder, (3) low-rank matrix analysis.

Facial appearance and its representation [43,44] have been considered as two of the most important familial features in kinship verification and FMR problems. Among the existing works [31–33,35,34,36–38,41,42], local features [45–47] and component-based strategies provide better performance [31–33]. These facial components carry explicit semantics such as eyes, nose, mouth, cheek, forehead, jaw, and brows, by which people could empirically determine the kin relationship. The final feature vector usually concatenates all the local descriptors extracted from these components for feature assembling. In addition, metric learning has been widely discussed in kinship verification to yield a high-level familial representation [36,39,40].

Although the above hand-crafted features [45–47] empirically provide superior performance, they have recently been challenged by learning based feature representations [48–52]. Among them, autoencoders [50–52] are able to generate features that faithfully reflect the input through hidden layers. Moreover, autoencoder based learning methods have the flexibility to stack more than one encoder to build a deep structure [53], which has been empirically proved effective in visual recognition [54,17]. In this part, different from the traditional autoencoder that uses identical data for both input and output, we tune the model and enforce the output to be multiple family faces. Therefore, we have a parallel structure for a group of autoencoders, and each of them runs in a supervised fashion guided by the family faces. It should be noted that gated autoencoders have most recently been applied to kinship verification and family member recognition problems, and achieved appealing results [55].

Low-rank matrix constraint has been widely discussed recently due to its successful application in data recovery [56], subspace segmentation [57], image segmentation [58], visual domain adaptation [59], and multitask learning [60]. The underlying assumption is low-rank constraint on the linear feature space is able to discover the principal component in spite of noises with arbitrarily large magnitude. This constraint can work on the reconstruction coefficients to recover the hidden subspace structure [57,59]. It is also preferred by multiple features/tasks that have intrinsic connections [58,60]. In this part, different from them, we encourage the model parameters of different autoencoders to be low-rank, which helps with discovering common feature space shared by different family faces. In addition, this low-rank constraint enables us to learn parallel autoencoders in a joint framework, rather than one by one.

9.2.3 Family Faces

Familial features have been discussed recently for kinship verification, and most of them are based on the empirical observation that children inherit facial traits from their parents. Therefore, the foundation of this group of methods is pair-wise comparison between children and parents in terms of facial components. The introduction of FMR breaks through the limitation of pair-wise comparison, since the test needs to refer to the facial traits of a whole family rather than a single person. Therefore, formulating familial features from a group of family photos becomes critical.

It is straightforward to consider mean face of family photos as concise familial features of a family. However, using mean face inevitably introduces blurring and artificial effects on appearance level, and discards personal features of family members on feature level. In addition, it is not easy to extend to multiple familial representations which can guide the parallel autoencoders.

Instead, we consider the first several nearest neighbors of the mean face from family facial images as the ideal representations of the familial feature and call it “family faces”, which is illustrated in Fig. 9.11. Suppose we have n families $[X_{1}, X_{2}, \dots, X_{n}]$ , and each family has $m_{i}$ image folders for $m_{i}$ people [ $X_{(i, 1)}, X_{(i, 2)}, \dots, X_{(i, m_{i})}$ ] where i indexes the family. Further, we use $x_{(i, j, k)}$ to denote the facial image from the ith family, jth person's kth photo. Therefore, the family mean face from the ith family can be formulated as

$f_{i} = \frac{1}{m_{i} \times m_{i j}} \sum_{j, k} x_{(i, j, k)},$

(9.10)

where $m_{i}$ is the number of people in family i, $m_{i j}$ is the number of images of the jth person in family i, and $x_{(i, j, k)}$ could be either a raw image or a visual descriptor. The first family face for the ith family can then be found by a nearest neighbor search around $f_{i}$ , and we denote it as ${\tilde{x}}_{(i, 1)}$ . To expand the set of family faces, more neighbors of the mean face can be added by considering the second, third, and up to the kth nearest neighbors, namely, ${\tilde{x}}_{(i, 2)}, {\tilde{x}}_{(i, 3)}, \dots, {\tilde{x}}_{(i, k)}$ .

Figure 9.11 Illustration of building family faces. For each family album $X_{i}$ , we compute its family mean face by averaging all face images in the family folder. Then for each family, we find family mean face's k-nearest-neighbor within this family, in a component-wise way. Note in each family face, the single face image on the left is the nearest neighbor search result based on the whole face, while the facial components on the right are those results from a component-wise way. The second family face in the first row with red border demonstrates components of family face are not necessarily coming from the same face.

Inspired by the previous works that facial components work better than holistic feature, we construct family faces in a component-wise way. The facial components can be defined by a few key points on the face. Consequently, each facial image in Eq. (9.10) is now replaced by certain facial component $x_{(i, j, k)}^{c}$ , and $f_{i}$ replaced by $f_{i}^{c}$ , where c indexes this component. After the nearest neighbor search, we obtain the local family face ${\tilde{x}}_{(i, j)}^{c}$ for the ith family jth nearest neighbor and cth component. Finally, we assemble these local components into one feature vector, and still use ${\tilde{x}}_{(i, j)}$ to indicate this family face.
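The component-wise construction of family faces can be sketched as follows. This is an illustrative sketch under several assumptions: each face is a flattened feature vector, facial components are contiguous slices given by hypothetical `(start, stop)` index pairs, the mean is taken over all images of the family as in the caption of Fig. 9.11, and squared Euclidean distance is used for the nearest neighbor search.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def family_faces(family_images, components, k):
    """Build k family faces for one family, component by component.

    family_images: flattened face vectors of all members of one family.
    components: (start, stop) slices, one per facial component
                (a hypothetical layout chosen for illustration).
    """
    faces = [[] for _ in range(k)]
    for start, stop in components:
        parts = [img[start:stop] for img in family_images]
        mean_part = mean_vector(parts)  # component of the family mean face
        ranked = sorted(parts, key=lambda p: sq_dist(p, mean_part))
        for j in range(k):
            # The j-th family face takes the j-th nearest component.
            faces[j].extend(ranked[j])
    return faces
```

Because every component is ranked separately, an assembled family face can mix components from different images, which is the effect highlighted by the red-bordered example in Fig. 9.11.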

Interestingly, when we assemble these local family faces into an integrated one on the pixel level, we found that not all components come from the same image, or even the same person, as mentioned in Fig. 9.11. This is reasonable since mean face is essentially a virtual face and each local component from the same face may have different rankings in nearest neighbor search of the mean face. In addition, it supports the fact that children inherit genes from both parents and different people may carry different family traits.

9.2.4 Regularized Parallel Autoencoders

In this section, we detail the structure of parallel autoencoders (shown in Fig. 9.12) and explain how to utilize low-rank regularizer in building parallel autoencoders.

Figure 9.12 Illustration of the proposed low-rank regularized parallel autoencoders (rPAE) for family membership recognition. Here $W_{1}^{(1)}, \dots, W_{k}^{(1)}$ and $W_{1}^{(2)}, \dots, W_{k}^{(2)}$ are weight matrices of the input and hidden layers. Low-rank constraint is imposed on $W = [W_{1}^{(1)}, \dots, W_{k}^{(1)}]$ to achieve better representation for family membership features.

9.2.4.1 Problem Formulation

A typical objective function for autoencoder with N training samples is

$\min_{W, b} \frac{1}{N} \sum_{i} L (W, b; x_{i}, y_{i}) + λ Ω (W),$

(9.11)

where $L (W, b; x, y)$ is the loss function, $Ω (W)$ is the regularization term, $W, b$ are model parameters, $x_{i}, y_{i}$ are input and target value, and weight decay parameter λ balances the relative importance of the two terms. In this part, the loss function is implemented by the squared error between hypothesis and target values, namely

$L (W, b; x_{i}, y_{i}) = {‖ h_{W, b} (x_{i}) - y_{i} ‖}^{2},$

(9.12)

and the regularization term can be written as the sum of squares of all elements in the weight matrices of the first and second layers, which equals ${‖ W^{(1)} ‖}_{F}^{2} + {‖ W^{(2)} ‖}_{F}^{2}$ , where ${‖ \cdot ‖}_{F}$ is the matrix Frobenius norm.

Suppose we have k family faces, then we can formulate the new loss function for the family faces guided autoencoders by

$\frac{1}{N \times k} \sum_{i, j} {‖ h_{W_{j}, b_{j}} (x_{i}) - {\tilde{x}}_{(i, j)} ‖}^{2},$

(9.13)

where j indexes the family face, and k denotes the number of family faces in each family. Apparently, Eq. (9.13) includes k autoencoders, and these encoders can be learned at the same time. Therefore, we call it parallel autoencoders, which provide a wide structure rather than a deep one.
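The loss in Eq. (9.13) can be evaluated with a short sketch. It assumes single-hidden-layer autoencoders with sigmoid activations (the activation function is not specified above) and takes the target of sample $x_{i}$ for the jth autoencoder to be the jth family face of the family that $x_{i}$ belongs to. All function and argument names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, x):
    """Compute W x + b for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(params, x):
    """One autoencoder: input -> hidden -> output (sigmoid is an assumption)."""
    W1, b1, W2, b2 = params
    hidden = [sigmoid(z) for z in affine(W1, b1, x)]
    return [sigmoid(z) for z in affine(W2, b2, hidden)]


def parallel_loss(autoencoders, samples, family_of, family_faces):
    """Average squared error over N samples and k autoencoders, as in Eq. (9.13).

    family_of[i] is the family id of samples[i];
    family_faces[f][j] is the j-th family face of family f.
    """
    total = 0.0
    for x, fam in zip(samples, family_of):
        for j, params in enumerate(autoencoders):
            target = family_faces[fam][j]
            output = forward(params, x)
            total += sum((o - t) ** 2 for o, t in zip(output, target))
    return total / (len(samples) * len(autoencoders))
```

Each of the k parameter sets is driven toward a different family face, while all of them receive the same inputs. This is what makes the structure wide rather than deep.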

However, trivially combining these autoencoders will not necessarily boost the final performance since there are no connections between them. To guarantee that all the encoders share a common feature space, we introduce a low-rank regularization term that encourages the weight matrix assembled from the k autoencoders to have a low-rank structure, which ensures that the familial features generated by the hidden layers of the k autoencoders lie in a common feature space. Combined with the parallel autoencoders, the proposed low-rank regularized objective function is written as

$\begin{matrix} \min_{W_{j}, b_{j}} \frac{1}{N \times k} \sum_{i, j} {‖ h_{W_{j}, b_{j}} (x_{i}) - {\tilde{x}}_{(i, j)} ‖}^{2} \\ + λ_{1} ({‖ W^{(1)} ‖}_{F}^{2} + {‖ W^{(2)} ‖}_{F}^{2}) + λ_{2} {‖ W^{(1)} ‖}_{⁎}, \end{matrix}$

(9.14)

where $W^{(1)} = [W_{1}^{(1)}, \dots, W_{k}^{(1)}]$ is a column-wise concatenation of matrices, $W^{(2)} = [W_{1}^{(2)}, \dots, W_{k}^{(2)}]$ has a similar structure, and ${‖ \cdot ‖}_{⁎}$ is the matrix nuclear norm, which is a convex surrogate of the rank function in the original rank minimization problem. Solving this problem is nontrivial since the nuclear norm is a nonsmooth term. Therefore, we cannot directly use the gradient descent method to solve for both $W^{(1)}$ and $W^{(2)}$ in the traditional way for the proposed autoencoders. Since in practice $W^{(1)}$ and $W^{(2)}$ are solved iteratively, with each one's update relying on the other's, we break the problem down into two subproblems and concentrate on $W^{(1)}$ first.

9.2.4.2 Low-Rank Reframing

Although the nonsmoothness of the low-rank term prevents us from solving the problem directly, we need to keep the term since it helps in recovering the common feature space among different encoders. Recently, atomic decomposition [61] has been proposed to tackle large-scale low-rank regularized classification problems [60], where the matrix trace norm is converted to a vector $l_{1}$ -norm and can therefore be minimized efficiently by a coordinate descent algorithm. We borrow this idea of low-rank reframing since it is fast and efficient for our problem.

Suppose there is an overcomplete and uncountably infinite dictionary of all possible “atoms”, or rank-one matrices in our problem, denoted by a set of matrices $M$ ,

$M = \{ u v^{T} \mid u \in R^{d_{1}}, v \in R^{d_{2}}, {‖ u ‖}_{2} = {‖ v ‖}_{2} = 1 \} .$

(9.15)

Note that $M$ does not necessarily form a basis of $R^{d_{1} \times d_{2}}$ . Further, we use $I$ to represent the index set of the rank-1 matrices in $M$ , namely,

$M = \{ M_{i} \in R^{d_{1} \times d_{2}} \mid i \in I \} = \{ u_{i} v_{i}^{T} \mid i \in I \} .$

(9.16)

Next, we consider a vector $θ \in R^{I}$ and its support $supp (θ) = \{ i \mid θ_{i} \neq 0 \}$ , and further define a vector set as $Θ = \{ θ \in R^{I} \mid supp (θ) \text{ is finite} \}$ . Then we have a decomposition for the matrix $W^{(1)}$ onto atoms in $M$ ,

$W^{(1)} = \sum_{i \in supp (θ)} θ_{i} M_{i} .$

(9.17)

Actually, this is the atom decomposition of the matrix $W^{(1)}$ onto a series of rank-1 matrices and, since the atom dictionary is overcomplete, this decomposition might not be unique. Through the formulation of this decomposition, we can reframe the original low-rank regularized problem proposed in Eq. (9.14) into the following one:

$\min_{θ \in Θ^{+}} I (θ) = \min_{θ \in Θ^{+}} λ_{2} \sum_{i \in supp (θ)} θ_{i} + R (W_{θ}),$

(9.18)

where $R (W_{θ})$ represents the remaining parts in Eq. (9.14) other than the low-rank regularizer. We can see that $\sum_{i \in supp (θ)} θ_{i}$ is essentially the vector $l_{1}$ -norm of θ, with which it coincides whenever the entries of θ are nonnegative.
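As a side note on why the sum in Eq. (9.18) can stand in for the nuclear norm in Eq. (9.14): the identity below is a standard property of the nuclear norm rather than something stated in this chapter, and it assumes nonnegative coefficients $θ_{i}$ together with the unit-norm atoms of Eq. (9.15).

```latex
{‖ W^{(1)} ‖}_{⁎} = \min_{θ} \Big\{ \sum_{i \in supp (θ)} θ_{i} \ :\ W^{(1)} = \sum_{i \in supp (θ)} θ_{i} u_{i} v_{i}^{T},\ θ_{i} \geq 0 \Big\}
```

The minimum is attained by the singular value decomposition of $W^{(1)}$ , in which the $θ_{i}$ are the singular values. Minimizing $λ_{2} \sum_{i} θ_{i}$ over all atomic decompositions therefore penalizes the same quantity as $λ_{2} {‖ W^{(1)} ‖}_{⁎}$ .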

9.2.4.3 Solution

We describe how to use coordinate descent method to solve Eq. (9.18). The algorithm described here actually belongs to the family of atom descent algorithms, whose stopping criteria are given by ε-approximation optimality:

$\left\{ \begin{matrix} \forall i \in I : \frac{\partial R (W)}{\partial θ_{i}} \geq - λ_{2} - ε, \\ \forall i \in supp (θ) : | \frac{\partial R (W)}{\partial θ_{i}} + λ_{2} | \leq ε . \end{matrix} \right.$

(9.19)

The basic flow of coordinate descent is similar to gradient descent, but at each iteration with $θ_{t}$ ,⁴ we need to find the coordinate along which we can achieve the steepest descent while remaining in $Θ^{+}$ . In our model, this is equivalent to picking $i \in I$ with the largest $- \partial I (θ_{t}) / \partial θ_{i}$ .

In practice, the coordinate corresponding to the largest $- \partial I (θ_{t}) / \partial θ_{i}$ is given by the singular vector pair corresponding to the top singular value of the matrix $- \nabla R (W_{θ_{t}})$ , namely,

$\max_{{‖ u ‖}_{2} = {‖ v ‖}_{2} = 1} u^{T} (- \nabla R (W_{θ_{t}})) v .$

(9.20)

There are two possible cases after we find the descent direction: if the coordinate $i \notin supp (θ_{t})$ , then we can only move in the positive direction; otherwise, we can move in either the positive or negative direction. To avoid overly aggressive updates, in practice we use a steep-enough direction (up to $ε / 2$ ) instead.

In each iteration, after we find the steep-enough direction defined by $u_{t} v_{t}^{T}$ , we need to tackle the step size of each update. Since we do not adopt the steepest direction in the algorithm, we will use a line-search to guarantee the objective value $I (θ)$ is decreased in each iteration. So each update can be described as $W_{t + 1} = W_{t} + δ u_{t} v_{t}^{T}$ , and $θ_{t + 1} = θ_{t} + δ e_{t}$ , where $e_{t}$ is an indicator vector, with the tth element being 1 in the tth iteration. The overall process can be found in Algorithm 9.4.
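The update loop described above can be illustrated on a toy problem. The sketch below is not Algorithm 9.4 itself: it replaces the autoencoder loss with a hypothetical quadratic $R (W) = \frac{1}{2} {‖ W - A ‖}_{F}^{2}$ , for which $- \nabla R (W) = A - W$ and the line search has the closed form $δ = σ - λ_{2}$ . It only ever adds new atoms, so it checks only the first condition of Eq. (9.19), and it finds the top singular pair by power iteration in plain Python.

```python
import math


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def top_singular_pair(G, iters=200):
    """Power iteration for the leading singular triplet (sigma, u, v) of G.

    Assumes the uniform start vector is not orthogonal to the top right
    singular vector.
    """
    rows, cols = len(G), len(G[0])
    Gt = [list(col) for col in zip(*G)]
    v = [1.0 / math.sqrt(cols)] * cols
    u = [0.0] * rows
    sigma = 0.0
    for _ in range(iters):
        u = matvec(G, v)
        sigma = math.sqrt(sum(a * a for a in u))
        if sigma == 0.0:
            break
        u = [a / sigma for a in u]
        v = matvec(Gt, u)
        nv = math.sqrt(sum(a * a for a in v))
        v = [a / nv for a in v]
    return sigma, u, v


def atom_descent(A, lam2, eps=0.01, max_iter=50):
    """Minimize lam2 * sum(theta) + 0.5 * ||W - A||_F^2 by adding atoms."""
    rows, cols = len(A), len(A[0])
    W = [[0.0] * cols for _ in range(rows)]
    theta = []
    for _ in range(max_iter):
        # Negative gradient of the toy smooth part R at the current W.
        neg_grad = [[A[r][c] - W[r][c] for c in range(cols)] for r in range(rows)]
        sigma, u, v = top_singular_pair(neg_grad)
        if sigma <= lam2 + eps:  # first condition of Eq. (9.19)
            break
        delta = sigma - lam2  # exact line search, valid for this quadratic R only
        for r in range(rows):
            for c in range(cols):
                W[r][c] += delta * u[r] * v[c]  # W_{t+1} = W_t + delta u v^T
        theta.append(delta)
    return W, theta
```

For this toy objective the iterates move toward the singular-value-thresholded version of `A`. With a general $R$ , the closed-form step would be replaced by a numerical line search, as the text describes.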

Algorithm 9.4 Coordinate descent algorithm for the reframed low-rank regularized problem.

Recall that the proposed regularized family faces guided parallel autoencoders in Eq. (9.14) would ordinarily be solved by a gradient descent algorithm; however, the nonsmooth regularization term makes the objective non-differentiable. Algorithm 9.4 tells us how to handle the low-rank term within a gradient-based scheme. To solve the original problem, we still run gradient descent on Eq. (9.14), with partial derivatives computed by back-propagation. The difference is that each time we update $W^{(1)}$ , we run one iteration of Algorithm 9.4. The entire solution for Eq. (9.14) can be found in Algorithm 9.5.

Algorithm 9.5 Gradient descent algorithm for the regularized family faces guided parallel autoencoders.

Remarks. (1) We mainly focus on the solution of rPAE in this section and do not list the detailed solution of autoencoders, which involves “feed-forward” and “back-propagation” processes, and “nonlinear activation” function, since they can be easily checked from many relevant works. (2) We empirically set the model parameters as: $λ_{1} = λ_{2} = 0.1$ , $T = 50$ , and $ε = 0.01$ in our experiments, and achieve acceptable results. There might be better settings for them, but we leave the space for discussions of other important model parameters. (3) Although rPAE is proposed for FMR problem, it can easily adapt to kinship verification problems by sampling a few subsets of the training data to guide the learning of parallel autoencoders. We will introduce the implementation details in the experiment section. (4) The deep structure can be trained in a layer-wise way which only involves the training of single hidden layer of rPAE.

9.2.5 Experimental Results

In this section, we demonstrate the effectiveness of the proposed method through two groups of experiments: (1) kinship verification and (2) family membership recognition. For the kinship verification experiments, we use the KFW database published in [36,40], while for the experiments on face recognition through familial features, we use the Family101 database published in [37].

9.2.5.1 Kinship Verification

Kinship Face in the Wild (KFW)⁵ is a database of face images collected for studying the problem of kinship verification from unconstrained face images. It includes two parts, KFW-I and KFW-II, both of which include four detailed kin relations: father–son (F–S), father–daughter (F–D), mother–son (M–S), and mother–daughter (M–D). In the KFW-I dataset, there are 156, 134, 116, and 127 pairs of kinship images for these four relations, while in the KFW-II dataset, each relation contains 250 pairs of kinship images. The difference between the two datasets is that KFW-I uses facial images from different photos to build kinship pairs, whereas KFW-II uses facial images from the same photo. Therefore, the variations of lighting and expression may be more dramatic in KFW-I, leading to relatively lower performance. All faces in the database have been manually aligned and cropped to $64 \times 64$ images to exclude background. Sample images can be found in Fig. 9.13. In the following KFW experiments, we use the HOG features provided by the KFW benchmark website as the input to rPAE.

Figure 9.13 Sample faces from KFW-I and KFW-II datasets. Four rows illustrate four different kin relations: father–son, father–daughter, mother–son, and mother–daughter.

Different from the family faces guided rPAE in FMR, here we sample a few small sets from the training data, use each of them to build an autoencoder, and use all of them together to build rPAE. For each specific relation, we randomly sample half of the images from the given positive pairs, and then put these samples back. We repeat this several times to obtain enough sets for the training of rPAE. The input of rPAE is [child, parent] and the corresponding target value is [parent, child]. After we learn the rPAE, we encode the input features through rPAE and concatenate all the encodings to form the final feature vector, which is fed to a binary SVM classifier. We use the absolute value of the vector difference from positive kinship pairs as the positive samples and that from negative kinship pairs as the negative samples. Note that we use LibSVM [63] with an RBF kernel to train the binary model and use its probabilistic output to compute both ROC and AUC. The SVM model parameters, such as the slack variable C and the bandwidth σ, are learned through grid search on the training data. We strictly follow the benchmark protocol of KFW, and report the image-restricted experimental results with five-fold cross-validation.
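The data preparation for this setting can be sketched as follows. This is one reading of the description, with assumptions flagged in the comments: each training set is taken to be half of the positive pairs, drawn without replacement within a set but independently across sets, and the pair feature is taken to be the element-wise absolute difference of the two faces' feature vectors. The helper names are hypothetical.

```python
import random


def sample_training_sets(positive_pairs, n_sets, seed=0):
    """Draw n_sets subsets, each containing half of the positive pairs.

    Assumption: pairs are returned to the pool between draws, so
    different subsets may overlap.
    """
    rng = random.Random(seed)
    half = len(positive_pairs) // 2
    return [rng.sample(positive_pairs, half) for _ in range(n_sets)]


def autoencoder_io(child, parent):
    """Input is [child, parent]; the target swaps them to [parent, child]."""
    return child + parent, parent + child


def pair_feature(feat_a, feat_b):
    """Element-wise absolute difference used as the SVM sample (assumed form)."""
    return [abs(a - b) for a, b in zip(feat_a, feat_b)]
```

Each sampled subset would train one autoencoder of the rPAE, and `pair_feature` would be applied to both positive and negative pairs before the SVM is trained.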

There are several key factors in the model: the number of hidden units in each layer, the number of hidden layers, and the number of samplings used to generate different autoencoders. To evaluate their impact on our model, we examine them one at a time. We first use a single hidden layer autoencoder and experiment with the four kin relations on KFW-II, with the number of hidden units varying from 200 to 1800, as shown in Fig. 9.14A. For all four relations there exists a performance peak within this range, and in general its location differs from one relation to another. To balance performance and time cost, we suggest using 800 hidden units in the following experiments. Second, we gradually increase the number of layers from 1 to 4 to see whether it helps the performance in Fig. 9.14B. Here, we empirically choose $[800, 200, 100, 50]$ as our deepest layer-size setting, and add layers one by one to see the impact. In these experiments, the deeper structure consistently benefits feature learning, in line with what has been reported by many deep learning works. Therefore, we take $[800, 200, 100, 50]$ as our layer-size setting. In addition, we conduct experiments to analyze the number of autoencoders in rPAE in Fig. 9.14C, from which we can observe that more sampling brings a performance boost but takes more time as well. Therefore, we suggest using 10 parallel autoencoders in our framework. Finally, from the three experiments, we can also conclude that the parallel structure and the regularizer are able to boost the performance compared to the single autoencoder case in Figs. 9.14A and 9.14B.

Figure 9.14 ROC curves of kinship verification on KFW-I.

We further compare our method with other existing state-of-the-art methods on KFW-I and KFW-II, and show the results of ROC curves and area under curve (AUC) in Figs. 9.14 and 9.15 and Tables 9.4 and 9.5. Among these comparisons, “HOG” means we directly use HOG feature provided by KFW benchmark website as the input to the binary SVM, to train and classify the kinship pairs. “SILD (image restricted)” [62] and “NRML (image unrestricted)” [36,40] mean that we first feed the original HOG features to the two comparisons and then use their output as the new features for the binary SVM. From the result, we can see that almost all the methods perform better than the direct use of HOG, which demonstrates that these methods work well on kinship verification problem. With the help of rPAE, our method performs better than the other two closely related methods in the four cases on both KFW-I and KFW-II. Finally, we also compare with the most recent work based on gated autoencoders (GAE) [55], which is shown in Table 9.6.
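For reference, the AUC values reported in this comparison can be computed from classifier scores with the rank-based definition sketched below. The sketch is generic and does not reproduce the exact evaluation script of the benchmark. It treats AUC as the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: classifier outputs, e.g. the probability of being a kin pair.
    labels: 1 for positive (kin) pairs, 0 for negative pairs.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The double loop is quadratic in the number of pairs, which is adequate for datasets of a few hundred pairs such as KFW.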

Figure 9.15 ROC curves of kinship verification on KFW-II.

Table 9.4

AUC of KFW-I dataset

Method	F–S	F–D	M–S	M–D	Average
HOG [47]	0.849	0.717	0.703	0.747	0.754
SILD [62]	0.838	0.730	0.708	0.797	0.769
NRML [40]	0.862	0.757	0.721	0.801	0.785
Ours	0.890	0.808	0.801	0.856	0.838

Table 9.5

AUC of KFW-II dataset

Method	F–S	F–D	M–S	M–D	Average
HOG [47]	0.833	0.723	0.721	0.723	0.750
SILD [62]	0.853	0.739	0.765	0.723	0.770
NRML [40]	0.871	0.740	0.784	0.738	0.783
Ours	0.906	0.823	0.840	0.811	0.845

Table 9.6

Comparisons with gated autoencoder (GAE) based methods [55]. Mean average precision (%) is reported for this experiment

KFW-I	F–S	F–D	M–S	M–D	Average
GAE [55]	76.4	72.5	71.9	77.3	74.5
Ours	87.7	78.9	81.1	86.4	83.5

KFW-II	F–S	F–D	M–S	M–D	Average
GAE [55]	83.9	76.7	83.4	84.8	82.2
Ours	90.5	82.0	82.6	78.5	83.4

9.2.5.2 Family Membership Recognition

In this section we conduct face recognition experiments based on familial features, in two parts: (1) family membership recognition and (2) face identification through familial features. For both experiments, we use the Family101 database published in [37]. The Family101 database has 101 different family trees, 206 nuclear families, and 607 individuals, for a total of 14,816 images. Most of the individuals in the database are public figures. Since the number of family members differs across family trees and the image quality is diverse, we select 25 family trees and ensure that each contains at least five family members. During preprocessing, we found that some family trees have dominant members, i.e., individuals with a disproportionately large number of images. Therefore, we restrict each individual to fewer than 50 images. In our test, we randomly select one family member for testing and use the remaining family members for training (including both rPAE training and classifier training). We use nearest neighbor as the classifier and repeat this procedure five times; the average performance and standard deviation are reported in Table 9.7.
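The evaluation protocol described above can be sketched as follows. This is an illustration on synthetic data only: it reads the protocol as holding out one member per family, skips the rPAE feature learning step, and uses hypothetical function names and made-up features.

```python
import numpy as np


def nearest_neighbor_predict(train_x, train_y, test_x):
    """Label each test sample with the label of its closest training sample."""
    # squared Euclidean distances, shape (n_test, n_train)
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return train_y[np.argmin(d2, axis=1)]


def leave_one_member_out(features, family_ids, member_ids, rng):
    """One trial: hold out one random member per family, train on the rest."""
    test_mask = np.zeros(len(features), dtype=bool)
    for fam in np.unique(family_ids):
        members = np.unique(member_ids[family_ids == fam])
        held_out = rng.choice(members)
        test_mask |= (family_ids == fam) & (member_ids == held_out)
    pred = nearest_neighbor_predict(
        features[~test_mask], family_ids[~test_mask], features[test_mask]
    )
    return float(np.mean(pred == family_ids[test_mask]))


def repeated_evaluation(features, family_ids, member_ids, n_trials=5, seed=0):
    """Repeat the trial n_trials times; return mean and std of the accuracy."""
    rng = np.random.default_rng(seed)
    accs = [
        leave_one_member_out(features, family_ids, member_ids, rng)
        for _ in range(n_trials)
    ]
    return float(np.mean(accs)), float(np.std(accs))


# Synthetic data: 3 families, 5 members each, 4 images per member.
rng = np.random.default_rng(0)
n_families, n_members, n_images, dim = 3, 5, 4, 16
centers = 10.0 * rng.normal(size=(n_families, dim))  # well-separated families
family_ids = np.repeat(np.arange(n_families), n_members * n_images)
member_ids = np.tile(np.repeat(np.arange(n_members), n_images), n_families)
features = centers[family_ids] + 0.1 * rng.normal(size=(family_ids.size, dim))

mean_acc, std_acc = repeated_evaluation(features, family_ids, member_ids)
print(mean_acc, std_acc)
```

Because the held-out member never appears in the training set, the classifier can succeed only by matching the test images to relatives, which is what makes the task harder than conventional face recognition.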

Table 9.7

Experimental results of family membership recognition on the Family101 database. NN, SVM, and our method use local Gabor features as input

Method	Random	NN	SVM	Group Sparsity [37]	Ours
Accuracy (%)	4.00	15.59 ± 5.6	19.93 ± 6.94	20.94 ± 5.97	23.96 ± 5.78

In the following evaluation, we keep the parameter settings of kinship verification discussed in the last section and use local Gabor [33] as our input feature. That is, we first crop the image to $127 \times 100$ and then partition the face into components using four key points: the two eyes, the nose tip, and the center of the mouth. Gabor features are extracted from each component in 8 directions and 5 magnitudes and then concatenated into a long vector. We use PCA to reduce the vector length to 1000 for simplicity. From Table 9.7, we can see that family membership recognition is very challenging [37], and few results have been reported before. Most methods are only modestly better than random guessing. Nonetheless, our method performs best thanks to the family face + rPAE framework. It is interesting to find that the standard deviation is relatively large in all experiments. The reason is that the individual selected for testing may inherit few characteristics from the other family members in some trials, while in other trials the selected individual may inherit more. Therefore, the accuracy fluctuates significantly across the five evaluations.
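A feature pipeline of this kind can be sketched as below: a bank of 8 x 5 = 40 Gabor kernels is applied to an image patch, the responses are pooled and concatenated, and PCA shortens the resulting vectors. The sketch reads the "5 magnitudes" as five filter scales, which is an assumption, and the kernel size, wavelength schedule, pooling step, and function names are all hypothetical. It also filters a single patch and does not model the key-point-based partition of the face.

```python
import numpy as np


def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    """Real part of a Gabor kernel with orientation theta (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2.0 * sigma ** 2))
    return envelope * np.cos(2.0 * np.pi * xr / wavelength)


def gabor_bank(n_orientations=8, n_scales=5, size=15):
    """Build n_orientations * n_scales kernels (hypothetical scale schedule)."""
    kernels = []
    for s in range(n_scales):
        wavelength = 4.0 * (2 ** (s / 2.0))
        for o in range(n_orientations):
            theta = np.pi * o / n_orientations
            kernels.append(
                gabor_kernel(size, wavelength, theta, sigma=0.56 * wavelength)
            )
    return kernels


def gabor_features(patch, kernels, pool=4):
    """Filter a patch with each kernel (circular FFT convolution), pool, concatenate."""
    h, w = patch.shape
    f_patch = np.fft.rfft2(patch)
    hh, ww = h - h % pool, w - w % pool
    feats = []
    for k in kernels:
        resp = np.abs(np.fft.irfft2(f_patch * np.fft.rfft2(k, s=(h, w)), s=(h, w)))
        # average-pool on a coarse grid to keep the descriptor short
        pooled = resp[:hh, :ww].reshape(hh // pool, pool, ww // pool, pool)
        feats.append(pooled.mean(axis=(1, 3)).ravel())
    return np.concatenate(feats)


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


rng = np.random.default_rng(0)
bank = gabor_bank()
patches = rng.random(size=(6, 32, 32))  # 6 hypothetical 32 x 32 face components
X = np.stack([gabor_features(p, bank) for p in patches])
# The text keeps 1000 components; a 6-sample toy set supports only a handful.
Z = pca_reduce(X, n_components=3)
print(len(bank), X.shape, Z.shape)
```

With real data, the number of retained components is bounded by the number of training images, so a target length of 1000 presupposes a training set of at least that size.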

In addition, we showcase how the familial feature assists general face recognition. If we treat all the training data in family membership recognition (FMR) as the references (gallery) for face recognition, then FMR can also be regarded as a face recognition problem. The only difference is that in face recognition the references are other facial images of the test individual, while in FMR the references are facial images of family members. Table 9.7 suggests that FMR can boost face recognition, since FMR can identify people using auxiliary data. Therefore, in the following experiments, we compare face recognition (FR) results with and without auxiliary data. In Fig. 9.16, we can observe that familial features and family members' facial images are helpful in the face recognition problem.

Figure 9.16 Family membership recognition accuracy and the number of references in the training set. Methods with “FR” do not use family members' facial images as training data and use only several references from the same individual for training. Note that group sparsity [37] is the only state-of-the-art method for FMR so far.

Finally, we illustrate some face recognition results with query images and the nearest neighbors returned using the proposed familial feature. In Fig. 9.17, we show three cases in rows 1–3. The first is a failure case that returns incorrect facial images from another family. The second query image is correctly recognized because its nearest neighbor is from the same family as the query image. In the third case, we use a dotted frame to mark training images of the query individual, which follows the conventional face recognition procedure. From Figs. 9.16 and 9.17, we can conclude that the proposed approach can indeed help face recognition by using family members.

Figure 9.17 The first column shows the query images, while the 2nd to 6th columns show their first 5 nearest neighbors in the training set. The first row shows a failure case where the nearest neighbor is not a facial image from the Babbar family. The second row is a success case because the nearest neighbor of the query is from the same family. The last row is also a success case, but differs from the previous one: since we also include 2 images of the query individual in the training set, the nearest neighbor of the query turns out to be an image of the query individual, as in the conventional face recognition problem. Note that we use a blue border for the query, green for a correct family member, dotted green for training images of the query individual, and red for an incorrect family member.

9.3 Research Challenges and Future Works

In this chapter, we described using deep learning for style recognition and kinship understanding.

For style classification, the style centralizing autoencoder (SCAE) progressively draws weak-style images toward the class center to increase feature discrimination. The weights of different descriptors are allocated automatically through the consensus constraints. We described a novel rank-constrained group sparsity autoencoder and a corresponding fast solution that achieves competitive performance while saving half of the training time compared with nonlinear ones.

Currently, we address the scenario in which each image belongs to a single style. However, an image may sometimes exhibit multiple styles, such as the mix-and-match styles in fashion. In the future, we plan to explore multiple-style classification. Furthermore, the style classification applications described in this chapter are all vision based; in the future, we plan to explore audio and document style classification, e.g., music style classification. We believe that the weak style phenomenon also exists in audio and document styles. In vision-based applications, we apply the patch consensus of images to constrain the consensus of different features. It would be very interesting to find the corresponding consensus rules for audio or documents.
