Zhangyang Wang Department of Computer Science and Engineering, Texas A&M University, College Station, TX, United States
Many clustering methods depend heavily on the quality of the extracted features. We propose a joint optimization framework that couples feature extraction with discriminative clustering. We utilize graph-regularized sparse codes as the features, and formulate sparse coding as the lower-level constraint for clustering. Two cost functions are developed based on the entropy-minimization and maximum-margin clustering principles, respectively, and serve as the objectives to be minimized. Solving such a bi-level optimization mutually reinforces both the sparse coding and clustering steps. Experiments on several benchmark datasets verify the remarkable performance improvements brought by the proposed joint optimization.
While sparse coding-based clustering methods have been shown to be successful, their bottlenecks in both efficiency and scalability limit their practical usage. In recent years, deep learning has proven to be a highly effective, efficient, and scalable feature learning tool. We propose to emulate the sparse coding-based clustering pipeline in the context of deep learning, leading to a carefully crafted deep model benefiting from both. A feed-forward network structure, named TAGnet, is constructed based on a graph-regularized sparse coding algorithm. It is then trained end to end with task-specific loss functions. We discover that connecting deep learning to sparse coding benefits not only the model performance, but also its initialization and interpretation. Moreover, by introducing auxiliary clustering tasks into the intermediate feature hierarchy, we formulate DTAGnet and obtain a further performance boost. Extensive experiments demonstrate that the proposed model gains remarkable margins over several state-of-the-art methods.
Sparse coding; Deep learning; Clustering; Manifold
Clustering aims to divide data into groups of similar objects (clusters), and plays an important role in many real-world data mining applications. Existing clustering algorithms, which learn the hidden patterns of a dataset in an unsupervised way, can be described as either generative or discriminative in nature. Generative clustering algorithms model categories in terms of their geometric properties in feature spaces, or as statistical processes of the data. Examples include K-means [1] and Gaussian mixture model (GMM) clustering [2], which assume a parametric form of the underlying category distributions. Rather than modeling categories explicitly, discriminative clustering techniques search for the boundaries or distinctions between categories. With fewer assumptions being made, these methods are powerful and flexible in practice. For example, maximum-margin clustering [3–5] aims to find the hyperplane that can separate the data from different classes with a maximum margin. Information-theoretic clustering [6,7] minimizes the conditional entropy of all samples. Many recent discriminative clustering methods have achieved very satisfactory performances [5].
Moreover, many clustering methods extract discriminative features from the input data prior to clustering. The principal component analysis (PCA) feature is a common choice but not necessarily discriminative [8]. Kernel-based clustering methods [9] were explored to find implicit feature representations of the input data. In [10], the features are selected to optimize the discriminativity of the partitioning algorithm used, by solving a linear discriminant analysis (LDA) problem. More recently, sparse codes have been shown to be robust to noise and capable of handling high-dimensional data [11]. Furthermore, the ℓ1-graph [12] builds the graph by reconstructing each data point sparsely and locally with the other data points. Spectral clustering [13] is then performed on the constructed graph matrix. In [14,15], dictionary learning is combined with the clustering process, using Lloyd's-type algorithms that iteratively reassign data to clusters and then optimize the dictionary associated with each cluster. In [8], the authors learned sparse codes that explicitly preserve the local data manifold structures. Their results indicate that encoding geometrical information significantly enhances the learning performance. A joint optimization of clustering and manifold structure was further considered in [16]. However, the clustering step in [8,16] is not coupled with the above-mentioned discriminative clustering methods.
In this section, we propose to jointly optimize feature extraction and discriminative clustering, so that they mutually reinforce each other. We focus on sparse codes as the extracted features, and develop our loss functions based on two representative discriminative clustering methods, entropy-minimization [6] and maximum-margin [3] clustering, respectively. A task-driven bi-level optimization model [17,18] is then built upon the proposed framework. The sparse coding step is formulated as the lower-level constraint, where a graph regularization is enforced to preserve the local manifold structure [8]. The clustering-oriented cost functions are considered as the upper-level objectives to be minimized. Stochastic gradient descent algorithms are developed to solve both bi-level models. Experiments on several popular real datasets verify the noticeable performance improvement brought by such a joint optimization framework.
Sparse codes have proven to be effective features for clustering. In [12], the authors suggested that the contribution of one sample to the reconstruction of another sample is a good indicator of the similarity between these two samples. Therefore, the reconstruction coefficients (sparse codes) can be used to constitute the similarity graph for spectral clustering. The ℓ1-graph performs sparse representation for each data point separately, without considering the geometric information and manifold structure of the entire data. Further research shows that graph-regularized sparse representations produce superior results in various clustering and classification tasks [8,19]. In this section, we adopt the graph-regularized sparse codes as the features for clustering.
We assume that all the data samples $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]$, $\mathbf{x}_i \in \mathbb{R}^{m \times 1}$, $i = 1, 2, \ldots, n$, are encoded into their corresponding sparse codes $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n]$, $\mathbf{a}_i \in \mathbb{R}^{p \times 1}$, $i = 1, 2, \ldots, n$, using a learned dictionary $\mathbf{D} = [\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_p]$, where the $\mathbf{d}_i \in \mathbb{R}^{m \times 1}$, $i = 1, 2, \ldots, p$, are the learned atoms. Moreover, given a pairwise similarity matrix W, the sparse representations that capture the geometric structure of the data according to the manifold assumption should minimize the following objective: $\frac{1}{2}\sum_{i,j=1}^{n} W_{ij}\|\mathbf{a}_i - \mathbf{a}_j\|_2^2 = \mathrm{Tr}(\mathbf{A}\mathbf{L}\mathbf{A}^T)$, where L is the graph Laplacian matrix constructed from W. In this section, W is chosen as the Gaussian kernel, $W_{ij} = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{\delta^2}\right)$, where δ is the controlling parameter selected by cross-validation.
The graph-regularized sparse codes are then obtained by solving the following convex optimization:
$$\min_{\mathbf{A}} \; \frac{1}{2}\|\mathbf{X} - \mathbf{D}\mathbf{A}\|_F^2 + \lambda \sum_{i=1}^{n}\|\mathbf{a}_i\|_1 + \frac{\alpha}{2}\mathrm{Tr}(\mathbf{A}\mathbf{L}\mathbf{A}^T) + \frac{\lambda_2}{2}\|\mathbf{A}\|_F^2.$$
Note that $\lambda_2 > 0$ is necessary for proving the differentiability of the objective function (see [5.2] in the Appendix). However, setting $\lambda_2 = 0$ proves to work well in practice, and thus the $\frac{\lambda_2}{2}\|\mathbf{A}\|_F^2$ term will be omitted by default hereinafter (except for the differentiability proof).
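For illustration, the following NumPy sketch builds the Gaussian-kernel affinity and its graph Laplacian, and computes graph-regularized sparse codes with a simple iterative shrinkage-thresholding (ISTA-style) scheme. It is a minimal sketch under our own assumptions (fixed step size, dense matrices, λ₂ = 0), not the exact solver used in the chapter.

```python
import numpy as np

def gaussian_affinity(X, delta):
    """W_ij = exp(-||x_i - x_j||^2 / delta^2), for columns x_i of X (m x n)."""
    sq = np.sum(X ** 2, axis=0)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    return np.exp(-d2 / delta ** 2)

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = diag(W1) - W."""
    return np.diag(W.sum(axis=1)) - W

def soft_threshold(U, tau):
    return np.sign(U) * np.maximum(np.abs(U) - tau, 0.0)

def graph_reg_sparse_coding(X, D, L, lam=0.1, alpha=1.0, n_iter=200):
    """Minimize 0.5||X - DA||_F^2 + lam*sum_i||a_i||_1 + 0.5*alpha*Tr(A L A^T)
    by proximal gradient descent with a conservative fixed step size."""
    A = np.zeros((D.shape[1], X.shape[1]))
    C = np.linalg.norm(D, 2) ** 2 + alpha * np.linalg.norm(L, 2)  # Lipschitz bound
    for _ in range(n_iter):
        grad = D.T @ (D @ A - X) + alpha * A @ L   # gradient of the smooth part
        A = soft_threshold(A - grad / C, lam / C)
    return A
```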
Obviously, the effect of the sparse codes A largely depends on the quality of the dictionary D. Dictionary learning methods, such as the K-SVD algorithm [20], are widely used in the sparse coding literature. With regard to clustering, the authors of [12,19] constructed the dictionary by directly selecting atoms from the data samples. Zheng et al. [8] learned a dictionary that can reconstruct the input data well. However, good reconstruction does not necessarily lead to discriminative features for clustering. In contrast, we will optimize D together with the clustering task.
The objective cost function of the joint framework can be expressed by the following bi-level optimization:
$$\min_{\mathbf{D}, \mathbf{w}} \; \mathcal{L}(\mathbf{A}, \mathbf{w}) \quad \text{s.t.} \quad \mathbf{A} = \arg\min_{\mathbf{A}} \; \frac{1}{2}\|\mathbf{X} - \mathbf{D}\mathbf{A}\|_F^2 + \lambda \sum_{i=1}^{n}\|\mathbf{a}_i\|_1 + \frac{\alpha}{2}\mathrm{Tr}(\mathbf{A}\mathbf{L}\mathbf{A}^T),$$
where $\mathcal{L}(\mathbf{A}, \mathbf{w})$ is a cost function evaluating the loss of clustering, with parameters $\mathbf{w}$. It can be formulated differently based on various clustering principles, two of which will be discussed and solved in Sect. 5.1.3.
Bi-level optimization [21] has been investigated in both theory and application. In [21], the authors proposed a general bi-level sparse coding model for learning dictionaries across coupled signal spaces. Another similar formulation has been studied in [17] for general regression tasks.
Assuming K clusters, let $\mathbf{w} = [\mathbf{w}_1, \ldots, \mathbf{w}_K]$ be the set of parameters of the loss function, where $\mathbf{w}_i$ corresponds to the ith cluster, $i = 1, 2, \ldots, K$. We introduce two forms of loss functions, each of which is derived from a representative discriminative clustering method.
Maximization of the mutual information with respect to the parameters of the encoder model effectively defines a discriminative unsupervised optimization framework. The model is parameterized similarly to a conditionally trained classifier, but the cluster allocations are unknown [7]. In [22,6], the authors adopted an information-theoretic framework as an implementation of the low-density separation assumption by minimizing the conditional entropy. By substituting the logistic posterior probability into the minimum conditional entropy principle, the authors obtained the logistic clustering algorithm, which is equivalent to finding a labeling strategy so that the total entropy of the data clustering is minimized.
Since the true cluster label of each $\mathbf{x}_i$ is unknown, we introduce the predicted confidence probability $p_{ij}$ that sample $\mathbf{x}_i$ belongs to cluster j, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, K$, which is set as the likelihood of the multinomial logistic (softmax) regression:
$$p_{ij} = p(j \mid \mathbf{w}, \mathbf{a}_i) = \frac{e^{\mathbf{w}_j^T \mathbf{a}_i}}{\sum_{l=1}^{K} e^{\mathbf{w}_l^T \mathbf{a}_i}}.$$
The loss function for all data can accordingly be defined in an entropy-like form:
$$\mathcal{L}(\mathbf{A}, \mathbf{w}) = -\sum_{i=1}^{n}\sum_{j=1}^{K} p_{ij}\log p_{ij}.$$
The predicted cluster label of $\mathbf{a}_i$ is the cluster j for which it achieves the largest likelihood probability $p_{ij}$. Logistic regression can deal with multiclass problems more easily than the support vector machine (SVM). The next important thing we need to study is the differentiability of (5.2).
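For a concrete view of these quantities, here is a minimal NumPy sketch of the softmax confidences, the entropy-like loss, and the label prediction rule (our own simplified notation, with the cluster parameters stacked as columns of W; not the chapter's exact implementation):

```python
import numpy as np

def softmax_probs(A, W):
    """p_ij: confidence that sample i (column a_i of A, p x n) belongs to cluster j (W is p x K)."""
    logits = A.T @ W                               # n x K
    logits -= logits.max(axis=1, keepdims=True)    # for numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def entropy_loss(A, W, eps=1e-12):
    """Entropy-minimization clustering loss: -sum_i sum_j p_ij * log(p_ij)."""
    P = softmax_probs(A, W)
    return -np.sum(P * np.log(P + eps))

def predict_labels(A, W):
    """Assign each sample to the cluster with the largest confidence."""
    return softmax_probs(A, W).argmax(axis=1)
```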
Building on the differentiability proof, we are able to solve (5.1) using a projected first-order stochastic gradient descent (SGD) algorithm, whose detailed steps are outlined in Algorithm 5.1. At a high level, it consists of an outer stochastic gradient descent loop that incrementally samples the training data. Each sample is used to approximate the gradients with respect to the classifier parameters w and the dictionary D, which are then used to update them.
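The outer loop can be sketched as follows. This is a schematic skeleton under our own assumptions — the per-sample gradient routines grad_W and grad_D are passed in as callables (for the dictionary, they would follow the task-driven bi-level calculus of [17]), dictionary atoms are projected onto the unit ℓ2 ball, and the learning rate decays by a factor ρ per epoch — rather than a faithful transcription of Algorithm 5.1.

```python
import numpy as np

def project_atoms(D):
    """Project each dictionary atom onto the unit l2 ball (a common constraint set)."""
    return D / np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1.0)

def projected_sgd(X, D, W, grad_W, grad_D, lr0=0.01, rho=0.9, n_epochs=30, seed=0):
    """Projected first-order SGD over the clustering parameters W and dictionary D."""
    rng = np.random.default_rng(seed)
    lr = lr0
    for _ in range(n_epochs):
        for i in rng.permutation(X.shape[1]):
            x = X[:, [i]]                                 # one sample per step
            W = W - lr * grad_W(x, D, W)                  # update clustering parameters
            D = project_atoms(D - lr * grad_D(x, D, W))   # update and project dictionary
        lr *= rho                                          # shrink the step size each epoch
    return D, W
```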
Xu et al. [3] proposed maximum margin clustering (MMC), which borrows the idea from the SVM theory. Their experimental results showed that the MMC technique could often obtain more accurate results than conventional clustering methods. Technically, MMC just finds a way to label the samples by running an SVM implicitly, and the SVM margin obtained is maximized over all possible labelings [5]. However, unlike supervised large margin methods, which are usually formulated as convex optimization problems, maximum margin clustering is a nonconvex integer optimization problem, which is much more difficult to solve. Li et al. [23] made several relaxations to the original MMC problem and reformulated it as a semidefinite programming (SDP) problem. The cutting plane maximum margin clustering (CPMMC) algorithm was presented in [5] to solve MMC with a much improved efficiency.
To develop the multiclass max-margin loss for clustering, we refer to the classical multiclass SVM formulation in [24]. Given that the sparse codes $\mathbf{a}_i$ comprise the features to be clustered, we define the multiclass model as
$$f(\mathbf{a}_i) = \arg\max_{j = 1, \ldots, K} \mathbf{w}_j^T \mathbf{a}_i,$$
where $\mathbf{w}_j$ is the prototype (weight vector) for the jth cluster. The predicted cluster label of $\mathbf{a}_i$ is the cluster whose weight vector achieves the maximum value $\mathbf{w}_j^T \mathbf{a}_i$; we denote it by $y_i = \arg\max_j \mathbf{w}_j^T \mathbf{a}_i$, and denote the runner-up cluster by $r_i = \arg\max_{j \neq y_i} \mathbf{w}_j^T \mathbf{a}_i$. The multiclass max-margin loss for $\mathbf{a}_i$ is then defined as
$$\mathcal{L}(\mathbf{a}_i, \mathbf{w}) = \max\left(0, \; 1 + \mathbf{w}_{r_i}^T \mathbf{a}_i - \mathbf{w}_{y_i}^T \mathbf{a}_i\right).$$
Note that, different from training a multiclass SVM classifier, where $y_i$ is given as a training label, the clustering scenario requires us to jointly estimate $y_i$ as a variable. The overall max-margin loss to be minimized is (with λ as the coefficient)
$$\mathcal{L}(\mathbf{A}, \mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n}\mathcal{L}(\mathbf{a}_i, \mathbf{w}).$$
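As a sketch of this overall loss (using the reconstructed notation above and our own column-stacked weight matrix W; the placement of the λ regularizer follows the reconstruction and is an assumption), one may compute it as follows:

```python
import numpy as np

def max_margin_loss(A, W, lam=1.0):
    """Clustering max-margin loss: hinge between each sample's best and runner-up
    clusters, plus an l2 penalty on the weights. A: p x n codes, W: p x K weights."""
    scores = A.T @ W                                  # n x K cluster scores
    y = scores.argmax(axis=1)                         # estimated labels y_i
    best = scores[np.arange(len(y)), y]
    masked = scores.copy()
    masked[np.arange(len(y)), y] = -np.inf            # exclude the winning cluster
    runner_up = masked.max(axis=1)                    # scores of the runner-up r_i
    hinge = np.maximum(0.0, 1.0 + runner_up - best)
    return 0.5 * lam * np.sum(W ** 2) + hinge.sum()
```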
However, solving (5.8) or (5.9) within the same framework as the logistic loss involves two additional concerns, which need to be handled specifically.
First, the hinge loss of the form (5.8) is not differentiable; only a subgradient exists. This makes the objective function nondifferentiable with respect to A, so the analysis in the proof of Theorem 5.1 cannot be applied. We could have used the squared hinge loss or the modified Huber loss to obtain a quadratically smoothed loss function [25]. However, as we verified in the experiments, the quadratically smoothed loss is not as good as the hinge loss in terms of training time and sparsity. Also, though not theoretically guaranteed, using the subgradient of (5.8) works well in our case.
Second, given that w is fixed, it should be noted that the predicted label $y_i$ and the runner-up cluster $r_i$ are both functions of $\mathbf{a}_i$. Therefore, calculating the derivative of (5.8) with respect to $\mathbf{a}_i$ would involve expanding both $y_i$ and $r_i$, making the analysis quite complicated. Instead, we borrow ideas from the regularity of the elastic net solution [17], namely that the set of nonzero coefficients of the elastic net solution should not change for small perturbations. Similarly, due to the continuity of the objective, it is assumed that a sufficiently small perturbation of the current $\mathbf{a}_i$ will not change $y_i$ and $r_i$. Therefore, in each iteration, we can directly precalculate $y_i$ and $r_i$ using the current w and $\mathbf{a}_i$, and fix them for the updates.
Given the above two approaches, for a single sample $\mathbf{a}_i$, if the hinge loss is larger than 0, the derivative of (5.8) with respect to w is
$$\frac{\partial \mathcal{L}(\mathbf{a}_i, \mathbf{w})}{\partial \mathbf{w}_j} = \begin{cases} -\mathbf{a}_i, & j = y_i, \\ \mathbf{a}_i, & j = r_i, \\ \mathbf{0}, & \text{otherwise}, \end{cases}$$
where $\frac{\partial \mathcal{L}(\mathbf{a}_i, \mathbf{w})}{\partial \mathbf{w}_j}$ denotes the jth block of the derivative for the sample $\mathbf{a}_i$. If the hinge loss is not larger than 0, then the derivative is $\mathbf{0}$. The derivative of (5.8) with respect to $\mathbf{a}_i$ is $\mathbf{w}_{r_i} - \mathbf{w}_{y_i}$ if the hinge loss is larger than 0, and $\mathbf{0}$ otherwise. Note that the above deduction can be conducted in a batch mode. The problem is then similarly solved using a projected SGD algorithm, whose steps are outlined in Algorithm 5.2.
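A minimal sketch of these fixed-label subgradient computations for one sample (with y_i and r_i precalculated from the current w and a_i, as described above) could look like this:

```python
import numpy as np

def max_margin_subgradients(a, W):
    """Subgradients of the per-sample hinge loss w.r.t. W (p x K) and the code a (p,),
    with the predicted cluster y and runner-up r held fixed for this update."""
    scores = W.T @ a
    y = int(np.argmax(scores))          # predicted cluster y_i
    scores[y] = -np.inf
    r = int(np.argmax(scores))          # runner-up cluster r_i
    hinge = 1.0 + (W[:, r] - W[:, y]) @ a
    grad_W = np.zeros_like(W)
    grad_a = np.zeros_like(a)
    if hinge > 0:                       # otherwise both subgradients are zero
        grad_W[:, y] = -a
        grad_W[:, r] = a
        grad_a = W[:, r] - W[:, y]
    return grad_W, grad_a
```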
We conduct our clustering experiments on four popular real datasets, which are summarized in Table 5.1. The ORL face database contains 400 facial images of 40 subjects, and each subject has 10 images of size 32 × 32. The images were taken at different times with varying lighting and facial expressions. The subjects are all in an upright, frontal position against a dark homogeneous background. The MNIST handwritten digit database consists of a total of 70,000 images, with digits ranging from 0 to 9. The digits are normalized and centered in fixed-size images of 28 × 28. The COIL20 image library contains 1440 images of size 32 × 32 for 20 objects. Each object has 72 images that were taken 5 degrees apart as the object was rotated on a turntable. The CMU-PIE face database contains 68 subjects with 41,368 face images in total. For each subject, we have 21 images of size 32 × 32 under different lighting conditions.
Table 5.1
Comparison of all datasets
Name | Number of Images | Class | Dimension |
---|---|---|---|
ORL | 400 | 10 | 1024 |
MNIST | 70,000 | 10 | 784 |
COIL20 | 1440 | 20 | 1024 |
CMU-PIE | 41,368 | 68 | 1024 |
We apply two widely used measures to evaluate the performance of the clustering methods, the accuracy and the normalized mutual information (NMI) [8,12]. Suppose the predicted label of $\mathbf{x}_i$ produced by the clustering method is $\hat{y}_i$, and $y_i$ is the ground-truth label. The accuracy is defined as
$$\mathrm{Acc} = \frac{1}{n}\sum_{i=1}^{n} I\!\left(y_i = \Phi(\hat{y}_i)\right),$$
where I is the indicator function and Φ is the best permutation mapping function [26]. On the other hand, suppose the clusters obtained from the predicted labels and the ground-truth labels are $\hat{C}$ and C, respectively. The mutual information between $\hat{C}$ and C is defined as
$$\mathrm{MI}(\hat{C}, C) = \sum_{\hat{c} \in \hat{C}} \sum_{c \in C} p(\hat{c}, c)\,\log\frac{p(\hat{c}, c)}{p(\hat{c})\,p(c)},$$
where $p(\hat{c})$ and $p(c)$ are the probabilities that a data point belongs to the clusters $\hat{c}$ and $c$, respectively, and $p(\hat{c}, c)$ is the probability that a data point jointly belongs to $\hat{c}$ and $c$. The normalized mutual information (NMI) is defined as
$$\mathrm{NMI}(\hat{C}, C) = \frac{\mathrm{MI}(\hat{C}, C)}{\max\{H(\hat{C}), H(C)\}},$$
where $H(\hat{C})$ and $H(C)$ are the entropies of $\hat{C}$ and C, respectively. NMI takes values in [0,1].
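Both measures are straightforward to compute; the sketch below uses the Hungarian algorithm (SciPy) for the best permutation mapping and scikit-learn for NMI with the max-entropy normalization, which we assume matches the definition above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping between predicted and true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    counts = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1                       # co-occurrence table
    row, col = linear_sum_assignment(-counts)   # maximize matched counts
    return counts[row, col].sum() / len(y_true)

def clustering_nmi(y_true, y_pred):
    """NMI normalized by the maximum of the two entropies."""
    return normalized_mutual_info_score(y_true, y_pred, average_method="max")
```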
We compare the following eight methods on all four datasets:
All images are first reshaped into vectors, and PCA is then applied to reduce the data dimensionality while keeping 98% of the information, as is also done in [8] to improve efficiency. The multiclass MMC algorithm is implemented based on the publicly available CPMMC code for two-class clustering [5], following the multiclass-case descriptions in the original paper. For all algorithms that involve graph-regularized sparse coding, the graph regularization parameter α is fixed to 1, and the dictionary size p is 128 by default. For joint EMC and joint MMC, we set ITER as 30, ρ as 0.9, and as 5. Other parameters in the competing methods are tuned in cross-validation experiments to the best of our efforts.
All the comparison results (accuracy and NMI) are listed in Table 5.2, from which we could conclude the following:
Table 5.2
Accuracy and NMI performance comparisons on all datasets
Dataset | Metric | KM | KM + SC | EMC | EMC + SC | MMC | MMC + SC | joint EMC | joint MMC |
---|---|---|---|---|---|---|---|---|---|
ORL | Acc | 0.5250 | 0.5887 | 0.6011 | 0.6404 | 0.6460 | 0.6968 | 0.7250 | 0.7458 |
ORL | NMI | 0.7182 | 0.7396 | 0.7502 | 0.7795 | 0.8050 | 0.8043 | 0.8125 | 0.8728 |
MNIST | Acc | 0.6248 | 0.6407 | 0.6377 | 0.6493 | 0.6468 | 0.6581 | 0.6550 | 0.6784 |
MNIST | NMI | 0.5142 | 0.5397 | 0.5274 | 0.5671 | 0.5934 | 0.6161 | 0.6150 | 0.6451 |
COIL20 | Acc | 0.6280 | 0.7880 | 0.7399 | 0.7633 | 0.8075 | 0.8493 | 0.8225 | 0.8658 |
COIL20 | NMI | 0.7621 | 0.9010 | 0.8621 | 0.8887 | 0.8922 | 0.8977 | 0.8850 | 0.9127 |
CMU-PIE | Acc | 0.3176 | 0.8457 | 0.7627 | 0.7836 | 0.8482 | 0.8491 | 0.8250 | 0.8783 |
CMU-PIE | NMI | 0.6383 | 0.9557 | 0.8043 | 0.8410 | 0.9237 | 0.9489 | 0.9020 | 0.9675 |
On the COIL20 dataset, we reconduct the clustering experiments with the cluster number K ranging from 2 to 20, using EMC + SC, MMC + SC, joint EMC, and joint MMC. For each K except 20, 10 test runs are conducted on different randomly chosen clusters, and the final scores are obtained by averaging over the 10 tests. Fig. 5.1 shows the clustering accuracy and NMI measurements versus the number of clusters. It reveals that the two joint methods consistently outperform their non-joint counterparts. As K increases, the performance of the joint methods also appears to degrade more slowly.
As a typical case in machine learning, we use SGD in a setting where it is not guaranteed to converge in theory, but behaves well in practice. As observed in our experiments, a good initialization of D and w can affect the final results notably. We initialize joint EMC by the D and w solved from EMC + SC, and joint MMC by the solutions from MMC + SC, respectively.
There are two parameters that we set empirically in advance, the graph regularization parameter α and the dictionary size p. The regularization term imposes stronger smoothness constraints on the sparse codes as α grows larger. Also, while a compact dictionary is more desirable computationally, a more redundant dictionary may lead to less cluttered features that can be better discriminated. We investigate how the clustering performances of EMC + SC, MMC + SC, joint EMC, and joint MMC change on the ORL dataset for various α and p values. As depicted in Figs. 5.2 and 5.3, we observe that:
We propose a joint framework to optimize sparse coding and discriminative clustering simultaneously. We adopt graph-regularized sparse codes as the feature to be learned, and design two clustering-oriented cost functions, by entropy-minimization and maximum-margin principles, respectively. The formulation of a task-driven bi-level optimization mutually reinforces both sparse coding and clustering steps. Experiments on several benchmark datasets verify the remarkable performance improvement led by the proposed joint optimization.
While many classical clustering algorithms have been proposed, such as K-means, Gaussian mixture model (GMM) clustering [2], maximum-margin clustering [3], and information-theoretic clustering [6], most only work well when the data dimensionality is low. Since high-dimensional data exhibits dense grouping in low-dimensional embeddings [27], researchers have been motivated to first project the original data onto a low-dimensional subspace [10] and then cluster the feature embeddings. Among the many feature embedding learning methods, sparse codes [11] have proven to be robust and efficient features for clustering, as verified in many works [12,8].
Effectiveness and scalability are two major concerns in designing a clustering algorithm under Big Data scenarios [28]. Conventional sparse coding models rely on iterative approximation algorithms, whose inherently sequential structure, as well as data-dependent complexity and latency, often constitutes a major bottleneck in computational efficiency [29]. This also makes it difficult to jointly optimize the unsupervised feature learning and the supervised task-driven steps [17]. Such a joint optimization usually has to rely on solving a complex bi-level optimization [30], as in [31], which constitutes another efficiency bottleneck. What is more, to effectively model and represent datasets of growing sizes, sparse coding needs to refer to larger dictionaries [32]. Since the inference complexity of sparse coding increases more than linearly with respect to the dictionary size [31], the scalability of sparse coding-based clustering methods turns out to be quite limited.
To overcome those limitations, we are motivated to introduce the tool of deep learning into clustering, where it has so far received little attention. The advantages of deep learning are achieved by its large learning capacity, its linear scalability with the aid of stochastic gradient descent (SGD), and its low inference complexity [33]. Feed-forward networks can be naturally tuned jointly with task-driven loss functions. On the other hand, generic deep architectures [34] largely ignore problem-specific formulations and prior knowledge. As a result, one may encounter difficulties in choosing optimal architectures, interpreting their working mechanisms, and initializing the parameters.
In this section, we demonstrate how to incorporate the sparse coding-based pipeline into deep learning models for clustering. The proposed framework takes advantage of both sparse coding and deep learning. Specifically, the feature learning layers are inspired by the graph-regularized sparse coding inference process, obtained by reformulating an iterative algorithm [29] into a feed-forward network, named TAGnet. Those layers are then jointly optimized with the task-specific loss functions from end to end. Our technical novelty and merits are summarized as follows:
Assume data samples $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]$, where $\mathbf{x}_i \in \mathbb{R}^{m \times 1}$, $i = 1, 2, \ldots, n$. They are encoded into sparse codes $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n]$, where $\mathbf{a}_i \in \mathbb{R}^{p \times 1}$, $i = 1, 2, \ldots, n$, using a learned dictionary $\mathbf{D} = [\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_p]$, where the $\mathbf{d}_i \in \mathbb{R}^{m \times 1}$, $i = 1, 2, \ldots, p$, are the learned atoms. The sparse codes are obtained by solving the following convex optimization problem (λ is a constant):
$$\mathbf{A} = \arg\min_{\mathbf{A}} \; \frac{1}{2}\|\mathbf{X} - \mathbf{D}\mathbf{A}\|_F^2 + \lambda \sum_{i=1}^{n}\|\mathbf{a}_i\|_1. \tag{5.15}$$
In [12], the authors suggested that the sparse codes can be used to construct the similarity graph for spectral clustering [13]. Furthermore, to capture the geometric structure of local data manifolds, graph-regularized sparse codes were suggested in [8,19], obtained by solving
$$\mathbf{A} = \arg\min_{\mathbf{A}} \; \frac{1}{2}\|\mathbf{X} - \mathbf{D}\mathbf{A}\|_F^2 + \lambda \sum_{i=1}^{n}\|\mathbf{a}_i\|_1 + \frac{\alpha}{2}\mathrm{Tr}(\mathbf{A}\mathbf{L}\mathbf{A}^T), \tag{5.16}$$
where L is the graph Laplacian matrix, which can be constructed from a prechosen pairwise similarity (affinity) matrix P. More recently, in [31], the authors proposed to simultaneously learn feature extraction and discriminative clustering by formulating a task-driven sparse coding model [17]. They showed that such joint methods consistently outperform their non-joint counterparts.
In [36], the authors explored the possibility of employing deep learning in graph clustering. They first learned a nonlinear embedding of the original graph with an autoencoder (AE), followed by a K-means algorithm on the embedding to obtain the final clustering result. However, this approach neither exploits more adapted deep architectures nor performs any task-specific joint optimization. In [37], a deep belief network (DBN) with nonparametric clustering was presented. As a generative graphical model, the DBN provides faster feature learning, but is less effective than AEs in terms of learning discriminative features for clustering. In [38], the authors extended the semi-nonnegative matrix factorization (Semi-NMF) model to a Deep Semi-NMF model, whose architecture resembles stacked AEs. Our proposed model is substantially different from all these previous approaches, due to its unique task-specific architecture derived from sparse coding domain expertise, as well as its joint optimization with clustering-oriented loss functions.
The proposed pipeline consists of two blocks. As depicted in Fig. 5.4A, it is trained end-to-end in an unsupervised way. It includes a feed-forward architecture, termed Task-specific And Graph-regularized Network (TAGnet), to learn discriminative features, and the clustering-oriented loss function.
Different from generic deep architectures, TAGnet is designed to take advantage of the successful sparse coding-based clustering pipelines [8,31]. It aims to learn features that are optimized under clustering criteria, while encoding the graph constraints of (5.16) to regularize the target solution. TAGnet is derived from the following theorem:
The complete proof of Theorem 5.3 can be found in the Appendix. Theorem 5.3 outlines an iterative algorithm to solve (5.16). Under quite mild conditions [39], after A is initialized, one may repeat the shrinkage and thresholding process in (5.17) until convergence:
$$\mathbf{A}^{k+1} = h_{\boldsymbol{\theta}}\!\left(\mathbf{W}\mathbf{X} + \mathbf{S}\mathbf{A}^{k} - \frac{\alpha}{N}\mathbf{A}^{k}\mathbf{L}\right), \tag{5.17}$$
where $h_{\boldsymbol{\theta}}$ is the element-wise shrinkage (soft-thresholding) neuron with trainable thresholds $\boldsymbol{\theta}$,
$$[h_{\boldsymbol{\theta}}(\mathbf{u})]_i = \mathrm{sign}(u_i)\,\max(|u_i| - \theta_i, 0). \tag{5.18}$$
Moreover, the iterative algorithm can alternatively be expressed as the block diagram in Fig. 5.4B, where the parameters relate to the sparse coding model through
$$\mathbf{W} = \frac{1}{N}\mathbf{D}^T, \qquad \mathbf{S} = \mathbf{I} - \frac{1}{N}\mathbf{D}^T\mathbf{D}, \qquad \boldsymbol{\theta} = \frac{\lambda}{N}, \tag{5.19}$$
with N an upper bound on the largest eigenvalue of $\mathbf{D}^T\mathbf{D}$.
In particular, we define the new operator “$\times\mathbf{L}$”: $\mathbf{A} \mapsto \frac{\alpha}{N}\mathbf{A}\mathbf{L}$, where the input A is multiplied by the prefixed L from the right and scaled by the constant $\frac{\alpha}{N}$.
By time-unfolding and truncating Fig. 5.4B to a fixed number of K iterations (K = 2 by default), we obtain the TAGnet form in Fig. 5.4A; W, S, and θ are all learned jointly from data, while S and θ are tied weights shared by both stages. It is important to note that the output A of TAGnet is not necessarily identical to the sparse codes predicted by solving (5.16). Instead, the goal of TAGnet is to learn a discriminative embedding that is optimal for clustering.
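The sketch below illustrates one way to realize the (untrained) TAGnet feed-forward pass in NumPy, initialized from a given dictionary according to (5.19). It is a simplified illustration under our own assumptions (two unfolded stages, shared S and θ, a fixed precomputed Laplacian), not the cuda-convnet implementation used in the experiments.

```python
import numpy as np

def soft_threshold(U, theta):
    """Element-wise shrinkage h_theta with a per-row threshold vector theta (p,)."""
    return np.sign(U) * np.maximum(np.abs(U) - theta[:, None], 0.0)

class TAGnet:
    """Feed-forward network unfolded from the graph-regularized iterative algorithm."""
    def __init__(self, D, L, lam=1.0, alpha=5.0, n_stages=2):
        N = np.linalg.norm(D, 2) ** 2              # upper bound on eigmax(D^T D)
        self.W = D.T / N                            # (5.19): W = D^T / N
        self.S = np.eye(D.shape[1]) - D.T @ D / N   # (5.19): S = I - D^T D / N
        self.theta = np.full(D.shape[1], lam / N)   # (5.19): theta = lambda / N
        self.graph = (alpha / N) * L                # "x L" branch, kept non-trainable
        self.n_stages = n_stages

    def forward(self, X):
        B = self.W @ X                              # shared input projection
        A = soft_threshold(B, self.theta)
        for _ in range(self.n_stages):              # unfolded, truncated iterations
            A = soft_threshold(B + self.S @ A - A @ self.graph, self.theta)
        return A
```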
To facilitate training, we further rewrite (5.18) as
$$h_{\boldsymbol{\theta}}(\mathbf{u}) = \mathrm{diag}(\boldsymbol{\theta})\, h_{1}\!\left(\mathrm{diag}(\boldsymbol{\theta})^{-1}\mathbf{u}\right), \tag{5.20}$$
where $h_1$ denotes the shrinkage neuron with all thresholds fixed to one.
Equation (5.20) indicates that the original neuron with trainable thresholds can be decomposed into two linear scaling layers plus a unit-threshold neuron. The weights of the two scaling layers are diagonal matrices defined by θ and its element-wise reciprocal, respectively.
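A quick numerical check of this decomposition (reusing the soft_threshold helper assumed in the sketches above):

```python
import numpy as np

def soft_threshold(u, theta):
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

rng = np.random.default_rng(0)
u = rng.standard_normal(8)
theta = rng.uniform(0.1, 1.0, size=8)                  # positive trainable thresholds

direct = soft_threshold(u, theta)                      # neuron with trainable thresholds
decomposed = theta * soft_threshold(u / theta, 1.0)    # scale, unit-threshold, scale back
assert np.allclose(direct, decomposed)
```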
A notable component in TAGnet is the $\times\mathbf{L}$ branch of each stage. The graph Laplacian L can be computed in advance. In the feed-forward process, a $\times\mathbf{L}$ branch takes the intermediate output $\mathbf{A}^{k}$ (k = 1, 2) as its input and applies the “$\times\mathbf{L}$” operator defined above. The output is aggregated with the output of the learnable S layer. In the backpropagation, L is not altered. In such a way, the graph regularization is effectively encoded in the TAGnet structure as a prior.
An appealing highlight of (D)TAGnet lies in its very effective and straightforward initialization strategy. With sufficient data, many latest deep networks train well with random initializations without pretraining. However, it has been discovered that poor initializations hamper the effectiveness of first-order methods (e.g., SGD) in certain cases [40]. For (D)TAGnet, it is, however, much easier to initialize the model in the right regime. This benefits from the analytical relationships between sparse coding and network hyperparameters defined in (5.19): we could initialize deep models from corresponding sparse coding components, the latter of which is easier to obtain. Such an advantage becomes much more important when the training data is limited.
Assume K clusters, and let $\boldsymbol{\omega} = [\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_K]$ be the set of parameters of the loss function, where $\boldsymbol{\omega}_i$ corresponds to the ith cluster, $i = 1, 2, \ldots, K$. In this section, we adopt the following two forms of clustering-oriented loss functions.
One natural choice of the loss function is extended from the popular softmax loss and takes the entropy-like form
$$\mathcal{L}(\mathbf{A}, \boldsymbol{\omega}) = -\sum_{i=1}^{n}\sum_{j=1}^{K} p_{ij}\log p_{ij}, \tag{5.21}$$
where $p_{ij}$ denotes the probability that sample $\mathbf{x}_i$ belongs to cluster j, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, K$, and
$$p_{ij} = p(j \mid \boldsymbol{\omega}, \mathbf{a}_i) = \frac{e^{\boldsymbol{\omega}_j^T \mathbf{a}_i}}{\sum_{l=1}^{K} e^{\boldsymbol{\omega}_l^T \mathbf{a}_i}}.$$
In testing, the predicted cluster label of input $\mathbf{x}_i$ is determined using the maximum likelihood criterion based on the predicted $p_{ij}$.
The maximum-margin clustering (MMC) approach was proposed in [3]. MMC finds a way to label the samples by running an SVM implicitly, such that the resulting SVM margin is maximized over all possible labelings [5]. Referring to the MMC definition, the authors of [31] designed the max-margin loss as
$$\mathcal{L}(\mathbf{A}, \boldsymbol{\omega}) = \sum_{i=1}^{n}\mathcal{L}(\mathbf{a}_i, \boldsymbol{\omega}). \tag{5.23}$$
In the above equation, the loss for an individual sample $\mathbf{a}_i$ is defined as
$$\mathcal{L}(\mathbf{a}_i, \boldsymbol{\omega}) = \max\left(0, \; 1 + \max_{j \neq y_i}\boldsymbol{\omega}_j^T\mathbf{a}_i - \boldsymbol{\omega}_{y_i}^T\mathbf{a}_i\right), \quad \text{where } y_i = \arg\max_{j}\boldsymbol{\omega}_j^T\mathbf{a}_i,$$
and $\boldsymbol{\omega}_j$ is the prototype for the jth cluster. In testing, the predicted cluster label of input $\mathbf{x}_i$ is determined by the weight vector that achieves the maximum $\boldsymbol{\omega}_j^T\mathbf{a}_i$.
Model Complexity. The proposed framework can handle large-scale and high-dimensional data effectively via the stochastic gradient descent (SGD) algorithm. In each step, the back-propagation procedure requires only operations of order O(p) [29]. The training algorithm takes O(Cnp) time (C is a constant depending on the total number of epochs, the number of stages, etc.). In addition, SGD is easy to parallelize and can thus be efficiently trained using GPUs.
There is a close connection between sparse coding and neural networks. In [29], a feed-forward neural network, named LISTA, was proposed to efficiently approximate the sparse code a of an input signal x, where the codes are obtained by solving (5.15) in advance. The LISTA network learns its hyperparameters as a general regression model from training data to the pre-solved sparse codes using back-propagation.
LISTA overlooks the useful geometric information among data points [8], and can therefore be viewed as a special case of TAGnet in Fig. 5.4 with α = 0 (i.e., removing the $\times\mathbf{L}$ branches). Moreover, LISTA aims to approximate the “optimal” sparse codes pre-obtained from (5.15), and therefore requires the estimation of D and the tedious precomputation of A. The authors did not exploit its potential in supervised and task-specific feature learning.
Deep networks are well known for their capability to learn semantically rich representations in their hidden layers [41]. In this section, we investigate how the intermediate features $\mathbf{A}^{k}$ (k = 1, 2) in TAGnet (Fig. 5.4A) can be interpreted, and further utilized to improve the model, for specific clustering tasks. Compared to related non-deep models [31], such a hierarchical clustering property is another unique advantage of being deep.
Our strategy is mainly inspired by the algorithmic framework of deeply supervised nets [42]. As shown in Fig. 5.5, our proposed Deeply-Task-specific And Graph-regularized Network (DTAGnet) brings in additional deep feedback by associating a clustering-oriented local auxiliary loss $\mathcal{L}_k(\mathbf{A}^k, \boldsymbol{\omega}^k)$ (k = 1, 2) with each stage. Such an auxiliary loss takes the same form as the overall loss $\mathcal{L}(\mathbf{A}, \boldsymbol{\omega})$, except that the expected cluster number may be different, depending on the auxiliary clustering task to be performed. DTAGnet back-propagates errors not only from the overall loss layer, but also simultaneously from the auxiliary losses.
While seeking the optimal performance on the target clustering task, DTAGnet is also driven by two auxiliary tasks that explicitly target the clustering of specific attributes. It enforces a constraint on each hidden representation to directly make a good cluster prediction. In addition to the overall loss, the introduction of auxiliary losses gives another strong push to obtain discriminative and sensible features at each individual stage. As discovered in the classification experiments in [42], the auxiliary loss both acts as feature regularization to reduce generalization errors and results in faster convergence. We also find in Sect. 5.2.5 that each intermediate feature $\mathbf{A}^k$ (k = 1, 2) is indeed most suited for its targeted task.
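Schematically, the DTAGnet training objective augments the overall clustering loss with the stage-wise auxiliary losses; a minimal sketch under our own assumptions (callable loss terms and hypothetical weighting coefficients betas, which the chapter does not specify) is:

```python
def dtagnet_total_loss(stage_features, final_feature, aux_losses, overall_loss, betas):
    """Overall clustering loss plus weighted per-stage auxiliary clustering losses.
    stage_features: intermediate features A^k, one per stage;
    aux_losses: clustering-oriented loss callables, one per auxiliary task;
    overall_loss: loss callable for the target clustering task;
    betas: (hypothetical) weights balancing the auxiliary terms."""
    total = overall_loss(final_feature)
    for A_k, loss_k, beta_k in zip(stage_features, aux_losses, betas):
        total += beta_k * loss_k(A_k)     # deep feedback from each stage
    return total
```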
In [38], a Deep Semi-NMF model was proposed to learn hidden representations that lend themselves to an interpretation as clustering according to different attributes. The authors considered the problem of mapping facial images to their identities. A face image also contains attributes like pose and expression that help identify the person depicted. In their experiments, the authors found that by further factorizing this mapping in a way that each factor adds an extra layer of abstraction, the deep model could automatically learn latent intermediate representations suited for clustering identity-related attributes. Although there is a clustering interpretation, those hidden representations are not specifically optimized in a clustering sense. Instead, the entire model is trained only with the overall reconstruction loss, after which clustering is performed by K-means on the learned features. Consequently, their clustering performance is not satisfactory. Our study shares a similar observation and motivation with [38], but proceeds in a more task-specific manner by performing the optimization of the auxiliary clustering tasks jointly with the overall task.
We evaluate the proposed model on three publicly available datasets:
Although we only evaluate the proposed method on image datasets, the methodology itself is not limited to image data. We apply two widely used measures to evaluate the clustering performance, the accuracy and the normalized mutual information (NMI) [8,12]. We follow the convention of many clustering works [8,19,31] and do not distinguish training from testing: we train our models on all available samples of each dataset and report the clustering performance on them as our testing results. Results are averaged over 5 independent runs.
The proposed networks are implemented using the cuda-convnet package [34]. The network takes two stages (K = 2) by default. We apply a constant learning rate of 0.01 with no momentum to all trainable layers, and use a batch size of 128. In particular, to encode the graph regularization as a prior, we fix L during model training by setting its learning rate to 0. Experiments are run on a workstation with 12 Intel Xeon 2.67 GHz CPUs and 1 GTX680 GPU. The training takes approximately 1 hour on the MNIST dataset. It is also observed that the training efficiency of our model scales approximately linearly with the data size.
In our experiments, we set the default value of α to 5 and p to 128, and choose λ from [0.1, 1] by cross-validation. A dictionary D is first learned from X by K-SVD [20]; W, S, and θ are then initialized based on (5.19), while L is precalculated from P, which is formulated by the Gaussian kernel, $P_{ij} = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{\delta^2}\right)$ (δ is also selected by cross-validation). After obtaining the output A from the initial (D)TAGnet models, ω (or $\boldsymbol{\omega}^k$) can be initialized by minimizing (5.21) or (5.23) over A (or $\mathbf{A}^k$).
We denote the proposed model of TAGnet plus entropy-minimization loss (EML) (5.21) as TAGnet-EML, and the one plus maximum-margin loss (MML) (5.23) as TAGnet-MML, respectively. We include the following comparison methods:
As revealed by the full comparison results in Table 5.3, the proposed task-specific deep architectures outperform the others by a noticeable margin. The underlying domain expertise guides the data-driven training in a more principled way. In contrast, the “general-architecture” baseline encoders (BE-EML and BE-MML) produce much worse (often the worst) results. Furthermore, it is evident that the proposed end-to-end optimized models outperform their “non-joint” counterparts. For example, on the MNIST dataset, TAGnet-MML surpasses NJ-TAGnet-MML by around 4% in accuracy and 5% in NMI.
Table 5.3
Accuracy and NMI performance comparisons on all three datasets
Dataset | Metric | TAGnet-EML | TAGnet-MML | NJ-TAGnet-EML | NJ-TAGnet-MML | BE-EML | BE-MML | SC-EML | SC-MML | Deep Semi-NMF |
---|---|---|---|---|---|---|---|---|---|---|
MNIST | Acc | 0.6704 | 0.6922 | 0.6472 | 0.5052 | 0.5401 | 0.6521 | 0.6550 | 0.6784 | / |
MNIST | NMI | 0.6261 | 0.6511 | 0.5624 | 0.6067 | 0.5002 | 0.5011 | 0.6150 | 0.6451 | / |
CMU MultiPIE | Acc | 0.2176 | 0.2347 | 0.1727 | 0.1861 | 0.1204 | 0.1451 | 0.2002 | 0.2090 | 0.17 |
CMU MultiPIE | NMI | 0.4338 | 0.4555 | 0.3167 | 0.3284 | 0.2672 | 0.2821 | 0.3337 | 0.3521 | 0.36 |
COIL20 | Acc | 0.8553 | 0.8991 | 0.7432 | 0.7882 | 0.7441 | 0.7645 | 0.8225 | 0.8658 | / |
COIL20 | NMI | 0.9090 | 0.9277 | 0.8707 | 0.8814 | 0.8028 | 0.8321 | 0.8850 | 0.9127 | / |
By comparing TAGnet-EML/TAGnet-MML with SC-EML/SC-MML, we draw a promising conclusion: adopting a more heavily parameterized deep architecture allows a larger feature learning capacity compared to conventional sparse coding. Although similar points have been made in many other fields [34], we are interested in a closer comparison between the two. Fig. 5.6 plots the clustering accuracy and NMI curves of TAGnet-EML/TAGnet-MML on the MNIST dataset against the number of iterations. Each model is well initialized at the very beginning, and the clustering accuracy and NMI are computed every 100 iterations. At first, the clustering performances of the deep models are even slightly worse than those of the sparse coding methods, mainly because the initialization of TAGnet hinges on a truncated approximation of graph-regularized sparse coding. After a small number of iterations, the performance of the deep models surpasses that of the sparse coding ones, and continues rising monotonically until reaching a higher plateau.
In (5.16), the graph regularization term imposes stronger smoothness constraints on the sparse codes as α grows larger. The same holds for TAGnet. We investigate how the clustering performance of TAGnet-EML/TAGnet-MML is influenced by various α values. From Fig. 5.7, we observe the same general tendency on all three datasets. As α increases, the accuracy/NMI result first rises and then decreases, with the peak appearing at an intermediate value. As an interpretation, the local manifold information is not sufficiently encoded when α is too small (α = 0 completely disables the $\times\mathbf{L}$ branch of TAGnet and reduces it to the LISTA network [29] fine-tuned by the losses). On the other hand, when α is large, the sparse codes are “oversmoothed” and have reduced discriminative ability. Note that similar phenomena are also reported in other relevant literature, e.g., [8,31].
Furthermore, comparing Fig. 5.7A–F, it is noteworthy to observe how graph regularization behaves differently on the three datasets. We notice that the COIL20 dataset is the most sensitive to the choice of α. Increasing α from 0.01 to 50 leads to an improvement of more than 10% in terms of both accuracy and NMI. This verifies the significance of graph regularization when training samples are limited [19]. On the MNIST dataset, both models obtain a gain of up to 6% in accuracy and 5% in NMI by tuning α from 0.01 to 10. However, unlike COIL20, which almost always favors a larger α, the model performance on the MNIST dataset not only saturates but is even significantly hampered when α continues rising to 50. The CMU MultiPIE dataset witnesses moderate improvements of around 2% in both measurements, and is not as sensitive to α as the other two. Potentially, this might be due to the complex variability in the original images, which makes the graph unreliable for estimating the underlying manifold geometry. We suspect that more sophisticated graphs may help alleviate the problem, and we will explore this in the future.
On the MNIST dataset, we reconduct the clustering experiments with the cluster number ranging from 2 to 10, using TAGnet-EML/TAGnet-MML. Fig. 5.8 shows how the clustering accuracy and NMI change as the number of clusters varies. The clustering performance transits smoothly and robustly as the task scale changes.
To examine the proposed models' robustness to noise, we add Gaussian noise with standard deviation s ranging from 0 (noiseless) to 0.3 and retrain our MNIST model. Fig. 5.9 indicates that both TAGnet-EML and TAGnet-MML possess a certain robustness to noise. When s is less than 0.1, there is little visible performance degradation. While TAGnet-MML consistently outperforms TAGnet-EML in all experiments (as MMC is well known to be highly discriminative [3]), it is interesting to observe in Fig. 5.9 that the latter is slightly more robust to noise than the former. This is perhaps owing to the probability-driven loss form (5.21) of EML, which allows for more flexibility.
As observed, CMU MultiPIE is very challenging for the basic identity clustering task. However, it comes with several other attributes: pose, expression, and illumination, which can be of assistance in our proposed DTAGnet framework. In this section, we apply a setting similar to [38] on the same CMU MultiPIE subset, by setting pose clustering as the Stage I auxiliary task and expression clustering as the Stage II auxiliary task. In that way, the Stage I auxiliary task targets 5 clusters, the Stage II auxiliary task targets 6 clusters, and, finally, the overall task targets 147 clusters.
The training of DTAGnet-EML/DTAGnet-MML follows the same aforementioned process, except that extra back-propagated gradients from the auxiliary task at Stage k (k = 1, 2) are taken into account. After that, we test each stage separately on its targeted task. In DTAGnet, each auxiliary task is also jointly optimized with its intermediate feature $\mathbf{A}^k$, which differentiates our methodology substantially from [38]. It is thus no surprise to see in Table 5.4 that each auxiliary task obtains much better performance than in [38]. Most notably, the performance of the overall identity clustering task witnesses a very impressive boost of around 7% in accuracy. We also test DTAGnet-EML/DTAGnet-MML with only the Stage I or the Stage II auxiliary loss kept. Experiments verify that by adding auxiliary tasks gradually, the overall task keeps benefiting. Those auxiliary tasks, when enforced together, also reinforce each other mutually.
Table 5.4
Effects of incorporating auxiliary clustering tasks in DTAGnet-EML/DTAGnet-MML (P, Pose; E, Expression; I, Identity)
Method | Stage I Task | Stage I Acc | Stage II Task | Stage II Acc | Overall Task | Overall Acc |
---|---|---|---|---|---|---|
DTAGnet-EML | / | / | / | / | I | 0.2176 |
DTAGnet-EML | P | 0.5067 | / | / | I | 0.2303 |
DTAGnet-EML | / | / | E | 0.3676 | I | 0.2507 |
DTAGnet-EML | P | 0.5407 | E | 0.7027 | I | 0.2833 |
DTAGnet-MML | / | / | / | / | I | 0.2347 |
DTAGnet-MML | P | 0.5251 | / | / | I | 0.2635 |
DTAGnet-MML | / | / | E | 0.3988 | I | 0.2858 |
DTAGnet-MML | P | 0.5538 | E | 0.4231 | I | 0.3021 |
One might be curious about which matters more for the performance boost: the deeply task-specific architecture that brings extra discriminative feature learning, or the proper design of auxiliary tasks that capture the intrinsic data structure characterized by attributes.
To answer this important question, we vary the target cluster number in either the Stage I or the Stage II auxiliary task and reconduct the experiments. Table 5.5 reveals that more auxiliary tasks, even those without any straightforward task-specific interpretation (e.g., partitioning the Multi-PIE subset into 4, 8, 12, or 20 clusters hardly makes semantic sense), may still help gain better performance. It is understandable that they simply promote more discriminative feature learning in a low-to-high, coarse-to-fine scheme. In fact, this is a complementary observation to the conclusion found for classification [42]. On the other hand, at least in this specific case, the models seem to achieve the best performance when the target cluster numbers of the auxiliary tasks are closest to the ground truth (5 and 6 here). We conjecture that when properly “matched”, the hidden representation in each layer is in fact most suited for clustering the attributes corresponding to that layer. The whole model resembles the practice of sharing low-level feature filters among several relevant high-level tasks in convolutional networks [44], but in a distinct context.
Table 5.5
Effects of varying target cluster numbers of auxiliary tasks in DTAGnet-EML/DTAGnet-MML
Method | #Clusters in Stage I | #Clusters in Stage II | Overall Accuracy |
---|---|---|---|
DTAGnet-EML | 4 | 4 | 0.2827 |
DTAGnet-EML | 8 | 8 | 0.2813 |
DTAGnet-EML | 12 | 12 | 0.2802 |
DTAGnet-EML | 20 | 20 | 0.2757 |
DTAGnet-MML | 4 | 4 | 0.3030 |
DTAGnet-MML | 8 | 8 | 0.3006 |
DTAGnet-MML | 12 | 12 | 0.2927 |
DTAGnet-MML | 20 | 20 | 0.2805 |
We hence conclude that the deeply supervised fashion proves to be helpful for deep clustering models, even when there are no explicit attributes for constructing a practically meaningful hierarchical clustering problem. However, it is preferable to exploit such attributes when available, as they lead not only to superior performance but also to more clearly interpretable models. The learned intermediate features can potentially be utilized for multitask learning [45].
In this section, we present a deep learning-based clustering framework. Trained from end to end, it features a task-specific deep architecture inspired by the sparse coding domain expertise, which is then optimized under clustering-oriented losses. Such a well-designed architecture leads to more effective initialization and training, and significantly outperforms generic architectures of the same parameter complexity. The model could be further interpreted and enhanced, by introducing auxiliary clustering losses to the intermediate features. Extensive experiments verify the effectiveness and robustness of the proposed models.