Zhangyang Wang Department of Computer Science and Engineering, Texas A&M University, College Station, TX, United States
Many clustering methods depend heavily on the quality of the extracted features. We propose a joint optimization framework that couples feature extraction with discriminative clustering. We utilize graph-regularized sparse codes as the features, and formulate sparse coding as the lower-level constraint for clustering. Two cost functions are developed based on the entropy-minimization and maximum-margin clustering principles, respectively, and serve as the objectives to be minimized. Solving such a bi-level optimization mutually reinforces both the sparse coding and clustering steps. Experiments on several benchmark datasets verify the remarkable performance improvements brought by the proposed joint optimization.
While sparse coding-based clustering methods have been shown to be successful, their bottlenecks in both efficiency and scalability limit their practical usage. In recent years, deep learning has proven to be a highly effective, efficient, and scalable feature learning tool. We propose to emulate the sparse coding-based clustering pipeline in the context of deep learning, leading to a carefully crafted deep model benefiting from both. A feed-forward network structure, named TAGnet, is constructed based on a graph-regularized sparse coding algorithm. It is then trained end to end with task-specific loss functions. We discover that connecting deep learning to sparse coding benefits not only the model performance, but also its initialization and interpretation. Moreover, by introducing auxiliary clustering tasks into the intermediate feature hierarchy, we formulate DTAGnet and obtain a further performance boost. Extensive experiments demonstrate that the proposed model gains remarkable margins over several state-of-the-art methods.
Sparse coding; Deep learning; Clustering; Manifold
Clustering aims to divide data into groups of similar objects (clusters), and plays an important role in many real-world data mining applications. Existing clustering algorithms, which learn the hidden patterns of a dataset in an unsupervised way, can be described as either generative or discriminative in nature. Generative clustering algorithms model categories in terms of their geometric properties in feature spaces, or as statistical processes of the data. Examples include K-means [1] and Gaussian mixture model (GMM) clustering [2], which assume a parametric form of the underlying category distributions. Rather than modeling categories explicitly, discriminative clustering techniques search for the boundaries or distinctions between categories. With fewer assumptions being made, these methods are powerful and flexible in practice. For example, maximum-margin clustering [3–5] aims to find the hyperplane that can separate the data from different classes with a maximum margin. Information-theoretic clustering [6,7] minimizes the conditional entropy of all samples. Many recent discriminative clustering methods have achieved very satisfactory performances [5].
Moreover, many clustering methods extract discriminative features from the input data prior to clustering. The principal component analysis (PCA) feature is a common choice but not necessarily discriminative [8]. Kernel-based clustering methods [9] were explored to find implicit feature representations of the input data. In [10], the features are selected to optimize the discriminativity of the partitioning algorithm used, by solving a linear discriminant analysis (LDA) problem. More recently, sparse codes have been shown to be robust to noise and capable of handling high-dimensional data [11]. Furthermore, the ℓ1-graph [12] builds the graph by reconstructing each data point sparsely and locally with the other data points. Spectral clustering [13] is then performed on the constructed graph matrix. In [14,15], dictionary learning is combined with the clustering process, using Lloyd's-type algorithms that iteratively reassign data to clusters and then optimize the dictionary associated with each cluster. In [8], the authors learned sparse codes that explicitly preserve the local data manifold structures. Their results indicate that encoding geometrical information significantly enhances the learning performance. A joint optimization of clustering and manifold structure was further considered in [16]. However, the clustering step in [8,16] is not coupled with the above-mentioned discriminative clustering methods.
In this section, we propose to jointly optimize feature extraction and discriminative clustering, so that they mutually reinforce each other. We focus on sparse codes as the extracted features, and develop our loss functions based on two representative discriminative clustering methods, entropy-minimization [6] and maximum-margin [3] clustering, respectively. A task-driven bi-level optimization model [17,18] is then built upon the proposed framework. The sparse coding step is formulated as the lower-level constraint, where a graph regularization is enforced to preserve the local manifold structure [8]. The clustering-oriented cost functions are considered as the upper-level objectives to be minimized. Stochastic gradient descent algorithms are developed to solve both bi-level models. Experiments on several popular real datasets verify the noticeable performance improvement brought by such a joint optimization framework.
Sparse codes have proven to be effective features for clustering. In [12], the authors suggested that the contribution of one sample to the reconstruction of another sample is a good indicator of the similarity between these two samples. Therefore, the reconstruction coefficients (sparse codes) can be used to constitute the similarity graph for spectral clustering. The ℓ1-graph performs sparse representation for each data point separately, without considering the geometric information and manifold structure of the entire data. Further research shows that graph-regularized sparse representations produce superior results in various clustering and classification tasks [8,19]. In this section, we adopt the graph-regularized sparse codes as the features for clustering.
We assume that all the data samples $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]$, $\mathbf{x}_i \in \mathbb{R}^{m \times 1}$, $i = 1, 2, \ldots, n$, are encoded into their corresponding sparse codes $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n]$, $\mathbf{a}_i \in \mathbb{R}^{p \times 1}$, $i = 1, 2, \ldots, n$, using a learned dictionary $\mathbf{D} = [\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_p]$, where the $\mathbf{d}_i \in \mathbb{R}^{m \times 1}$, $i = 1, 2, \ldots, p$, are the learned atoms. Moreover, given a pairwise similarity matrix W, the sparse representations that capture the geometric structure of the data according to the manifold assumption should minimize the following objective: $\frac{1}{2}\sum_{i,j=1}^{n} W_{ij}\|\mathbf{a}_i - \mathbf{a}_j\|_2^2 = \mathrm{Tr}(\mathbf{A}\mathbf{L}\mathbf{A}^T)$, where L is the graph Laplacian matrix constructed from W. In this section, W is chosen as the Gaussian kernel, $W_{ij} = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{\delta^2}\right)$, where δ is the controlling parameter selected by cross-validation.
The graph-regularized sparse codes are then obtained by solving the following convex optimization:
$$\min_{\mathbf{A}} \; \frac{1}{2}\|\mathbf{X} - \mathbf{D}\mathbf{A}\|_F^2 + \lambda \sum_{i=1}^{n}\|\mathbf{a}_i\|_1 + \frac{\alpha}{2}\mathrm{Tr}(\mathbf{A}\mathbf{L}\mathbf{A}^T) + \frac{\lambda_2}{2}\|\mathbf{A}\|_F^2.$$
Note that $\lambda_2 > 0$ is necessary for proving the differentiability of the objective function (see [5.2] in the Appendix). However, setting $\lambda_2 = 0$ proves to work well in practice, and thus the $\frac{\lambda_2}{2}\|\mathbf{A}\|_F^2$ term will be omitted by default hereinafter (except for the differentiability proof).
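For illustration, the following NumPy sketch builds the Gaussian-kernel affinity and its graph Laplacian, and computes graph-regularized sparse codes with a simple iterative shrinkage-thresholding (ISTA-style) scheme. It is a minimal sketch under our own assumptions (fixed step size, dense matrices, λ₂ = 0), not the exact solver used in the chapter.

```python
import numpy as np

def gaussian_affinity(X, delta):
    """W_ij = exp(-||x_i - x_j||^2 / delta^2), for columns x_i of X (m x n)."""
    sq = np.sum(X ** 2, axis=0)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    return np.exp(-d2 / delta ** 2)

def graph_laplacian(W):
    """Unnormalized graph Laplacian L = diag(W1) - W."""
    return np.diag(W.sum(axis=1)) - W

def soft_threshold(U, tau):
    return np.sign(U) * np.maximum(np.abs(U) - tau, 0.0)

def graph_reg_sparse_coding(X, D, L, lam=0.1, alpha=1.0, n_iter=200):
    """Minimize 0.5||X - DA||_F^2 + lam*sum_i||a_i||_1 + 0.5*alpha*Tr(A L A^T)
    by proximal gradient descent with a conservative fixed step size."""
    A = np.zeros((D.shape[1], X.shape[1]))
    C = np.linalg.norm(D, 2) ** 2 + alpha * np.linalg.norm(L, 2)  # Lipschitz bound
    for _ in range(n_iter):
        grad = D.T @ (D @ A - X) + alpha * A @ L   # gradient of the smooth part
        A = soft_threshold(A - grad / C, lam / C)
    return A
```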
Obviously, the effect of the sparse codes A largely depends on the quality of the dictionary D. Dictionary learning methods, such as the K-SVD algorithm [20], are widely used in the sparse coding literature. With regard to clustering, the authors of [12,19] constructed the dictionary by directly selecting atoms from the data samples. Zheng et al. [8] learned a dictionary that can reconstruct the input data well. However, good reconstruction does not necessarily lead to discriminative features for clustering. In contrast, we will optimize D together with the clustering task.
The objective cost function of the joint framework can be expressed by the following bi-level optimization:
$$\min_{\mathbf{D}, \mathbf{w}} \; \mathcal{L}(\mathbf{A}, \mathbf{w}) \quad \text{s.t.} \quad \mathbf{A} = \arg\min_{\mathbf{A}} \; \frac{1}{2}\|\mathbf{X} - \mathbf{D}\mathbf{A}\|_F^2 + \lambda \sum_{i=1}^{n}\|\mathbf{a}_i\|_1 + \frac{\alpha}{2}\mathrm{Tr}(\mathbf{A}\mathbf{L}\mathbf{A}^T),$$
where $\mathcal{L}(\mathbf{A}, \mathbf{w})$ is a cost function evaluating the loss of clustering, with parameters $\mathbf{w}$. It can be formulated differently based on various clustering principles, two of which will be discussed and solved in Sect. 5.1.3.
Bi-level optimization [21] has been investigated in both theory and application. In [21], the authors proposed a general bi-level sparse coding model for learning dictionaries across coupled signal spaces. Another similar formulation has been studied in [17] for general regression tasks.
Assuming K clusters, let $\mathbf{w} = [\mathbf{w}_1, \ldots, \mathbf{w}_K]$ be the set of parameters of the loss function, where $\mathbf{w}_i$ corresponds to the ith cluster, $i = 1, 2, \ldots, K$. We introduce two forms of loss functions, each of which is derived from a representative discriminative clustering method.
Maximization of the mutual information with respect to the parameters of the encoder model effectively defines a discriminative unsupervised optimization framework. The model is parameterized similarly to a conditionally trained classifier, but the cluster allocations are unknown [7]. In [22,6], the authors adopted an information-theoretic framework as an implementation of the low-density separation assumption by minimizing the conditional entropy. By substituting the logistic posterior probability into the minimum conditional entropy principle, the authors obtained the logistic clustering algorithm, which is equivalent to finding a labeling strategy so that the total entropy of the data clustering is minimized.
Since the true cluster label of each $\mathbf{x}_i$ is unknown, we introduce the predicted confidence probability $p_{ij}$ that sample $\mathbf{x}_i$ belongs to cluster j, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, K$, which is set as the likelihood of the multinomial logistic (softmax) regression:
$$p_{ij} = p(j \mid \mathbf{w}, \mathbf{a}_i) = \frac{e^{\mathbf{w}_j^T \mathbf{a}_i}}{\sum_{l=1}^{K} e^{\mathbf{w}_l^T \mathbf{a}_i}}.$$
The loss function for all data can accordingly be defined in an entropy-like form:
$$\mathcal{L}(\mathbf{A}, \mathbf{w}) = -\sum_{i=1}^{n}\sum_{j=1}^{K} p_{ij}\log p_{ij}.$$
The predicted cluster label of $\mathbf{a}_i$ is the cluster j for which it achieves the largest likelihood probability $p_{ij}$. Logistic regression can deal with multiclass problems more easily than the support vector machine (SVM). The next important thing we need to study is the differentiability of (5.2).
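For a concrete view of these quantities, here is a minimal NumPy sketch of the softmax confidences, the entropy-like loss, and the label prediction rule (our own simplified notation, with the cluster parameters stacked as columns of W; not the chapter's exact implementation):

```python
import numpy as np

def softmax_probs(A, W):
    """p_ij: confidence that sample i (column a_i of A, p x n) belongs to cluster j (W is p x K)."""
    logits = A.T @ W                               # n x K
    logits -= logits.max(axis=1, keepdims=True)    # for numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def entropy_loss(A, W, eps=1e-12):
    """Entropy-minimization clustering loss: -sum_i sum_j p_ij * log(p_ij)."""
    P = softmax_probs(A, W)
    return -np.sum(P * np.log(P + eps))

def predict_labels(A, W):
    """Assign each sample to the cluster with the largest confidence."""
    return softmax_probs(A, W).argmax(axis=1)
```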
Building on the differentiability proof, we are able to solve (5.1) using a projected first-order stochastic gradient descent (SGD) algorithm, whose detailed steps are outlined in Algorithm 5.1. At a high level, it consists of an outer stochastic gradient descent loop that incrementally samples the training data. Each sample is used to approximate the gradients with respect to the classifier parameters w and the dictionary D, which are then used to update them.
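The outer loop can be sketched as follows. This is a schematic skeleton under our own assumptions — the per-sample gradient routines grad_W and grad_D are passed in as callables (for the dictionary, they would follow the task-driven bi-level calculus of [17]), dictionary atoms are projected onto the unit ℓ2 ball, and the learning rate decays by a factor ρ per epoch — rather than a faithful transcription of Algorithm 5.1.

```python
import numpy as np

def project_atoms(D):
    """Project each dictionary atom onto the unit l2 ball (a common constraint set)."""
    return D / np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1.0)

def projected_sgd(X, D, W, grad_W, grad_D, lr0=0.01, rho=0.9, n_epochs=30, seed=0):
    """Projected first-order SGD over the clustering parameters W and dictionary D."""
    rng = np.random.default_rng(seed)
    lr = lr0
    for _ in range(n_epochs):
        for i in rng.permutation(X.shape[1]):
            x = X[:, [i]]                                 # one sample per step
            W = W - lr * grad_W(x, D, W)                  # update clustering parameters
            D = project_atoms(D - lr * grad_D(x, D, W))   # update and project dictionary
        lr *= rho                                          # shrink the step size each epoch
    return D, W
```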
Xu et al. [3] proposed maximum margin clustering (MMC), which borrows the idea from the SVM theory. Their experimental results showed that the MMC technique could often obtain more accurate results than conventional clustering methods. Technically, MMC just finds a way to label the samples by running an SVM implicitly, and the SVM margin obtained is maximized over all possible labelings [5]. However, unlike supervised large margin methods, which are usually formulated as convex optimization problems, maximum margin clustering is a nonconvex integer optimization problem, which is much more difficult to solve. Li et al. [23] made several relaxations to the original MMC problem and reformulated it as a semidefinite programming (SDP) problem. The cutting plane maximum margin clustering (CPMMC) algorithm was presented in [5] to solve MMC with a much improved efficiency.
To develop the multiclass max-margin loss for clustering, we refer to the classical multiclass SVM formulation in [24]. Given that the sparse codes $\mathbf{a}_i$ comprise the features to be clustered, we define the multiclass model as
$$f(\mathbf{a}_i) = \arg\max_{j = 1, \ldots, K} \mathbf{w}_j^T \mathbf{a}_i,$$
where $\mathbf{w}_j$ is the prototype (weight vector) for the jth cluster. The predicted cluster label of $\mathbf{a}_i$ is the cluster whose weight vector achieves the maximum value $\mathbf{w}_j^T \mathbf{a}_i$; we denote it by $y_i = \arg\max_j \mathbf{w}_j^T \mathbf{a}_i$, and denote the runner-up cluster by $r_i = \arg\max_{j \neq y_i} \mathbf{w}_j^T \mathbf{a}_i$. The multiclass max-margin loss for $\mathbf{a}_i$ is then defined as
$$\mathcal{L}(\mathbf{a}_i, \mathbf{w}) = \max\left(0, \; 1 + \mathbf{w}_{r_i}^T \mathbf{a}_i - \mathbf{w}_{y_i}^T \mathbf{a}_i\right).$$
Note that, different from training a multiclass SVM classifier, where $y_i$ is given as a training label, the clustering scenario requires us to jointly estimate $y_i$ as a variable. The overall max-margin loss to be minimized is (with λ as the coefficient)
$$\mathcal{L}(\mathbf{A}, \mathbf{w}) = \frac{\lambda}{2}\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n}\mathcal{L}(\mathbf{a}_i, \mathbf{w}).$$
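As a sketch of this overall loss (using the reconstructed notation above and our own column-stacked weight matrix W; the placement of the λ regularizer follows the reconstruction and is an assumption), one may compute it as follows:

```python
import numpy as np

def max_margin_loss(A, W, lam=1.0):
    """Clustering max-margin loss: hinge between each sample's best and runner-up
    clusters, plus an l2 penalty on the weights. A: p x n codes, W: p x K weights."""
    scores = A.T @ W                                  # n x K cluster scores
    y = scores.argmax(axis=1)                         # estimated labels y_i
    best = scores[np.arange(len(y)), y]
    masked = scores.copy()
    masked[np.arange(len(y)), y] = -np.inf            # exclude the winning cluster
    runner_up = masked.max(axis=1)                    # scores of the runner-up r_i
    hinge = np.maximum(0.0, 1.0 + runner_up - best)
    return 0.5 * lam * np.sum(W ** 2) + hinge.sum()
```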
However, solving (5.8) or (5.9) within the same framework as the logistic loss involves two additional concerns, which need to be handled specifically.
First, the hinge loss of the form (5.8) is not differentiable; only a subgradient exists. This makes the objective function nondifferentiable with respect to A, so the analysis in the proof of Theorem 5.1 cannot be applied. We could have used the squared hinge loss or the modified Huber loss to obtain a quadratically smoothed loss function [25]. However, as we verified in the experiments, the quadratically smoothed loss is not as good as the hinge loss in terms of training time and sparsity. Also, though not theoretically guaranteed, using the subgradient of (5.8) works well in our case.
Second, given that w is fixed, it should be noted that the predicted label $y_i$ and the runner-up cluster $r_i$ are both functions of $\mathbf{a}_i$. Therefore, calculating the derivative of (5.8) with respect to $\mathbf{a}_i$ would involve expanding both $y_i$ and $r_i$, making the analysis quite complicated. Instead, we borrow ideas from the regularity of the elastic net solution [17], namely that the set of nonzero coefficients of the elastic net solution should not change for small perturbations. Similarly, due to the continuity of the objective, it is assumed that a sufficiently small perturbation of the current $\mathbf{a}_i$ will not change $y_i$ and $r_i$. Therefore, in each iteration, we can directly precalculate $y_i$ and $r_i$ using the current w and $\mathbf{a}_i$, and fix them for the updates.
Given the above two approaches, for a single sample $\mathbf{a}_i$, if the hinge loss is larger than 0, the derivative of (5.8) with respect to w is
$$\frac{\partial \mathcal{L}(\mathbf{a}_i, \mathbf{w})}{\partial \mathbf{w}_j} = \begin{cases} -\mathbf{a}_i, & j = y_i, \\ \mathbf{a}_i, & j = r_i, \\ \mathbf{0}, & \text{otherwise}, \end{cases}$$
where $\frac{\partial \mathcal{L}(\mathbf{a}_i, \mathbf{w})}{\partial \mathbf{w}_j}$ denotes the jth block of the derivative for the sample $\mathbf{a}_i$. If the hinge loss is not larger than 0, then the derivative is $\mathbf{0}$. The derivative of (5.8) with respect to $\mathbf{a}_i$ is $\mathbf{w}_{r_i} - \mathbf{w}_{y_i}$ if the hinge loss is larger than 0, and $\mathbf{0}$ otherwise. Note that the above deduction can be conducted in a batch mode. The problem is then similarly solved using a projected SGD algorithm, whose steps are outlined in Algorithm 5.2.
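A minimal sketch of these fixed-label subgradient computations for one sample (with y_i and r_i precalculated from the current w and a_i, as described above) could look like this:

```python
import numpy as np

def max_margin_subgradients(a, W):
    """Subgradients of the per-sample hinge loss w.r.t. W (p x K) and the code a (p,),
    with the predicted cluster y and runner-up r held fixed for this update."""
    scores = W.T @ a
    y = int(np.argmax(scores))          # predicted cluster y_i
    scores[y] = -np.inf
    r = int(np.argmax(scores))          # runner-up cluster r_i
    hinge = 1.0 + (W[:, r] - W[:, y]) @ a
    grad_W = np.zeros_like(W)
    grad_a = np.zeros_like(a)
    if hinge > 0:                       # otherwise both subgradients are zero
        grad_W[:, y] = -a
        grad_W[:, r] = a
        grad_a = W[:, r] - W[:, y]
    return grad_W, grad_a
```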
We conduct our clustering experiments on four popular real datasets, which are summarized in Table 5.1. The ORL face database contains 400 facial images of 40 subjects, and each subject has 10 images of size 32 × 32. The images were taken at different times with varying lighting and facial expressions. The subjects are all in an upright, frontal position against a dark homogeneous background. The MNIST handwritten digit database consists of a total of 70,000 images, with digits ranging from 0 to 9. The digits are normalized and centered in fixed-size images of 28 × 28. The COIL20 image library contains 1440 images of size 32 × 32 for 20 objects. Each object has 72 images that were taken 5 degrees apart as the object was rotated on a turntable. The CMU-PIE face database contains 68 subjects with 41,368 face images in total. For each subject, we have 21 images of size 32 × 32 under different lighting conditions.
Table 5.1
Comparison of all datasets
Name | Number of Images | Class | Dimension |
---|---|---|---|
ORL | 400 | 10 | 1024 |
MNIST | 70,000 | 10 | 784 |
COIL20 | 1440 | 20 | 1024 |
CMU-PIE | 41,368 | 68 | 1024 |
We apply two widely used measures to evaluate the performance of the clustering methods, the accuracy and the normalized mutual information (NMI) [8,12]. Suppose the predicted label of $\mathbf{x}_i$ produced by the clustering method is $\hat{y}_i$, and $y_i$ is the ground-truth label. The accuracy is defined as
$$\mathrm{Acc} = \frac{1}{n}\sum_{i=1}^{n} I\!\left(y_i = \Phi(\hat{y}_i)\right),$$
where I is the indicator function and Φ is the best permutation mapping function [26]. On the other hand, suppose the clusters obtained from the predicted labels and the ground-truth labels are $\hat{C}$ and C, respectively. The mutual information between $\hat{C}$ and C is defined as
$$\mathrm{MI}(\hat{C}, C) = \sum_{\hat{c} \in \hat{C}} \sum_{c \in C} p(\hat{c}, c)\,\log\frac{p(\hat{c}, c)}{p(\hat{c})\,p(c)},$$
where $p(\hat{c})$ and $p(c)$ are the probabilities that a data point belongs to the clusters $\hat{c}$ and $c$, respectively, and $p(\hat{c}, c)$ is the probability that a data point jointly belongs to $\hat{c}$ and $c$. The normalized mutual information (NMI) is defined as
$$\mathrm{NMI}(\hat{C}, C) = \frac{\mathrm{MI}(\hat{C}, C)}{\max\{H(\hat{C}), H(C)\}},$$
where $H(\hat{C})$ and $H(C)$ are the entropies of $\hat{C}$ and C, respectively. NMI takes values in [0,1].
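Both measures are straightforward to compute; the sketch below uses the Hungarian algorithm (SciPy) for the best permutation mapping and scikit-learn for NMI with the max-entropy normalization, which we assume matches the definition above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping between predicted and true labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    counts = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1                       # co-occurrence table
    row, col = linear_sum_assignment(-counts)   # maximize matched counts
    return counts[row, col].sum() / len(y_true)

def clustering_nmi(y_true, y_pred):
    """NMI normalized by the maximum of the two entropies."""
    return normalized_mutual_info_score(y_true, y_pred, average_method="max")
```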
We compare the following eight methods on all four datasets:
All images are first reshaped into vectors, and PCA is then applied to reduce the data dimensionality while keeping 98% of the information, as is also done in [8] to improve efficiency. The multiclass MMC algorithm is implemented based on the publicly available CPMMC code for two-class clustering [5], following the multiclass-case descriptions in the original paper. For all algorithms that involve graph-regularized sparse coding, the graph regularization parameter α is fixed to 1, and the dictionary size p is 128 by default. For joint EMC and joint MMC, we set ITER as 30, ρ as 0.9, and as 5. Other parameters in the competing methods are tuned in cross-validation experiments to the best of our efforts.
All the comparison results (accuracy and NMI) are listed in Table 5.2, from which we could conclude the following:
Table 5.2
Accuracy and NMI performance comparisons on all datasets
Dataset | Metric | KM | KM + SC | EMC | EMC + SC | MMC | MMC + SC | joint EMC | joint MMC |
---|---|---|---|---|---|---|---|---|---|
ORL | Acc | 0.5250 | 0.5887 | 0.6011 | 0.6404 | 0.6460 | 0.6968 | 0.7250 | 0.7458 |
ORL | NMI | 0.7182 | 0.7396 | 0.7502 | 0.7795 | 0.8050 | 0.8043 | 0.8125 | 0.8728 |
MNIST | Acc | 0.6248 | 0.6407 | 0.6377 | 0.6493 | 0.6468 | 0.6581 | 0.6550 | 0.6784 |
MNIST | NMI | 0.5142 | 0.5397 | 0.5274 | 0.5671 | 0.5934 | 0.6161 | 0.6150 | 0.6451 |
COIL20 | Acc | 0.6280 | 0.7880 | 0.7399 | 0.7633 | 0.8075 | 0.8493 | 0.8225 | 0.8658 |
COIL20 | NMI | 0.7621 | 0.9010 | 0.8621 | 0.8887 | 0.8922 | 0.8977 | 0.8850 | 0.9127 |
CMU-PIE | Acc | 0.3176 | 0.8457 | 0.7627 | 0.7836 | 0.8482 | 0.8491 | 0.8250 | 0.8783 |
CMU-PIE | NMI | 0.6383 | 0.9557 | 0.8043 | 0.8410 | 0.9237 | 0.9489 | 0.9020 | 0.9675 |
On the COIL20 dataset, we reconduct the clustering experiments with the cluster number K ranging from 2 to 20, using EMC + SC, MMC + SC, joint EMC, and joint MMC. For each K except 20, 10 test runs are conducted on different randomly chosen clusters, and the final scores are obtained by averaging over the 10 tests. Fig. 5.1 shows the clustering accuracy and NMI measurements versus the number of clusters. It reveals that the two joint methods consistently outperform their non-joint counterparts. As K increases, the performance of the joint methods also appears to degrade more slowly.
As a typical case in machine learning, we use SGD in a setting where it is not guaranteed to converge in theory, but behaves well in practice. As observed in our experiments, a good initialization of D and w can affect the final results notably. We initialize joint EMC by the D and w solved from EMC + SC, and joint MMC by the solutions from MMC + SC, respectively.
There are two parameters that we set empirically in advance, the graph regularization parameter α and the dictionary size p. The regularization term imposes stronger smoothness constraints on the sparse codes as α grows larger. Also, while a compact dictionary is more desirable computationally, a more redundant dictionary may lead to less cluttered features that can be better discriminated. We investigate how the clustering performances of EMC + SC, MMC + SC, joint EMC, and joint MMC change on the ORL dataset for various α and p values. As depicted in Figs. 5.2 and 5.3, we observe that:
We propose a joint framework to optimize sparse coding and discriminative clustering simultaneously. We adopt graph-regularized sparse codes as the feature to be learned, and design two clustering-oriented cost functions, by entropy-minimization and maximum-margin principles, respectively. The formulation of a task-driven bi-level optimization mutually reinforces both sparse coding and clustering steps. Experiments on several benchmark datasets verify the remarkable performance improvement led by the proposed joint optimization.
While many classical clustering algorithms have been proposed, such as K-means, Gaussian mixture model (GMM) clustering [2], maximum-margin clustering [3], and information-theoretic clustering [6], most only work well when the data dimensionality is low. Since high-dimensional data exhibits dense grouping in low-dimensional embeddings [27], researchers have been motivated to first project the original data onto a low-dimensional subspace [10] and then cluster the feature embeddings. Among the many feature embedding learning methods, sparse codes [11] have proven to be robust and efficient features for clustering, as verified in many works [12,8].
Effectiveness and scalability are two major concerns in designing a clustering algorithm under Big Data scenarios [28]. Conventional sparse coding models rely on iterative approximation algorithms, whose inherently sequential structure, as well as data-dependent complexity and latency, often constitutes a major bottleneck in computational efficiency [29]. This also makes it difficult to jointly optimize the unsupervised feature learning and the supervised task-driven steps [17]. Such a joint optimization usually has to rely on solving a complex bi-level optimization [30], as in [31], which constitutes another efficiency bottleneck. What is more, to effectively model and represent datasets of growing sizes, sparse coding needs to refer to larger dictionaries [32]. Since the inference complexity of sparse coding increases more than linearly with respect to the dictionary size [31], the scalability of sparse coding-based clustering methods turns out to be quite limited.
To overcome those limitations, we are motivated to introduce the tool of deep learning into clustering, where it has so far received little attention. The advantages of deep learning are achieved by its large learning capacity, its linear scalability with the aid of stochastic gradient descent (SGD), and its low inference complexity [33]. Feed-forward networks can be naturally tuned jointly with task-driven loss functions. On the other hand, generic deep architectures [34] largely ignore problem-specific formulations and prior knowledge. As a result, one may encounter difficulties in choosing optimal architectures, interpreting their working mechanisms, and initializing the parameters.
In this section, we demonstrate how to incorporate the sparse coding-based pipeline into deep learning models for clustering. The proposed framework takes advantage of both sparse coding and deep learning. Specifically, the feature learning layers are inspired by the graph-regularized sparse coding inference process, obtained by reformulating an iterative algorithm [29] into a feed-forward network, named TAGnet. Those layers are then jointly optimized with the task-specific loss functions from end to end. Our technical novelty and merits are summarized as follows:
Assume data samples $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n]$, where $\mathbf{x}_i \in \mathbb{R}^{m \times 1}$, $i = 1, 2, \ldots, n$. They are encoded into sparse codes $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n]$, where $\mathbf{a}_i \in \mathbb{R}^{p \times 1}$, $i = 1, 2, \ldots, n$, using a learned dictionary $\mathbf{D} = [\mathbf{d}_1, \mathbf{d}_2, \ldots, \mathbf{d}_p]$, where the $\mathbf{d}_i \in \mathbb{R}^{m \times 1}$, $i = 1, 2, \ldots, p$, are the learned atoms. The sparse codes are obtained by solving the following convex optimization problem (λ is a constant):
$$\mathbf{A} = \arg\min_{\mathbf{A}} \; \frac{1}{2}\|\mathbf{X} - \mathbf{D}\mathbf{A}\|_F^2 + \lambda \sum_{i=1}^{n}\|\mathbf{a}_i\|_1. \tag{5.15}$$
In [12], the authors suggested that the sparse codes can be used to construct the similarity graph for spectral clustering [13]. Furthermore, to capture the geometric structure of local data manifolds, graph-regularized sparse codes were suggested in [8,19], obtained by solving
$$\mathbf{A} = \arg\min_{\mathbf{A}} \; \frac{1}{2}\|\mathbf{X} - \mathbf{D}\mathbf{A}\|_F^2 + \lambda \sum_{i=1}^{n}\|\mathbf{a}_i\|_1 + \frac{\alpha}{2}\mathrm{Tr}(\mathbf{A}\mathbf{L}\mathbf{A}^T), \tag{5.16}$$
where L is the graph Laplacian matrix, which can be constructed from a prechosen pairwise similarity (affinity) matrix P. More recently, in [31], the authors proposed to simultaneously learn feature extraction and discriminative clustering by formulating a task-driven sparse coding model [17]. They showed that such joint methods consistently outperform their non-joint counterparts.
In [36], the authors explored the possibility of employing deep learning in graph clustering. They first learned a nonlinear embedding of the original graph with an autoencoder (AE), followed by a K-means algorithm on the embedding to obtain the final clustering result. However, this approach neither exploits more adapted deep architectures nor performs any task-specific joint optimization. In [37], a deep belief network (DBN) with nonparametric clustering was presented. As a generative graphical model, the DBN provides faster feature learning, but is less effective than AEs in terms of learning discriminative features for clustering. In [38], the authors extended the semi-nonnegative matrix factorization (Semi-NMF) model to a Deep Semi-NMF model, whose architecture resembles stacked AEs. Our proposed model is substantially different from all these previous approaches, due to its unique task-specific architecture derived from sparse coding domain expertise, as well as its joint optimization with clustering-oriented loss functions.
The proposed pipeline consists of two blocks. As depicted in Fig. 5.4A, it is trained end-to-end in an unsupervised way. It includes a feed-forward architecture, termed Task-specific And Graph-regularized Network (TAGnet), to learn discriminative features, and the clustering-oriented loss function.
Different from generic deep architectures, TAGnet is designed to take advantage of the successful sparse coding-based clustering pipelines [8,31]. It aims to learn features that are optimized under clustering criteria, while encoding the graph constraints of (5.16) to regularize the target solution. TAGnet is derived from the following theorem:
The complete proof of Theorem 5.3 can be found in the Appendix. Theorem 5.3 outlines an iterative algorithm to solve (5.16). Under quite mild conditions [39], after A is initialized, one may repeat the shrinkage and thresholding process in (5.17) until convergence:
$$\mathbf{A}^{k+1} = h_{\boldsymbol{\theta}}\!\left(\mathbf{W}\mathbf{X} + \mathbf{S}\mathbf{A}^{k} - \frac{\alpha}{N}\mathbf{A}^{k}\mathbf{L}\right), \tag{5.17}$$
where $h_{\boldsymbol{\theta}}$ is the element-wise shrinkage (soft-thresholding) neuron with trainable thresholds $\boldsymbol{\theta}$,
$$[h_{\boldsymbol{\theta}}(\mathbf{u})]_i = \mathrm{sign}(u_i)\,\max(|u_i| - \theta_i, 0). \tag{5.18}$$
Moreover, the iterative algorithm can alternatively be expressed as the block diagram in Fig. 5.4B, where the parameters relate to the sparse coding model through
$$\mathbf{W} = \frac{1}{N}\mathbf{D}^T, \qquad \mathbf{S} = \mathbf{I} - \frac{1}{N}\mathbf{D}^T\mathbf{D}, \qquad \boldsymbol{\theta} = \frac{\lambda}{N}, \tag{5.19}$$
with N an upper bound on the largest eigenvalue of $\mathbf{D}^T\mathbf{D}$.
In particular, we define the new operator “$\times\mathbf{L}$”: $\mathbf{A} \mapsto \frac{\alpha}{N}\mathbf{A}\mathbf{L}$, where the input A is multiplied by the prefixed L from the right and scaled by the constant $\frac{\alpha}{N}$.
By time-unfolding and truncating Fig. 5.4B to a fixed number of K iterations (K = 2 by default), we obtain the TAGnet form in Fig. 5.4A; W, S, and θ are all learned jointly from data, while S and θ are tied weights shared by both stages. It is important to note that the output A of TAGnet is not necessarily identical to the sparse codes predicted by solving (5.16). Instead, the goal of TAGnet is to learn a discriminative embedding that is optimal for clustering.
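The sketch below illustrates one way to realize the (untrained) TAGnet feed-forward pass in NumPy, initialized from a given dictionary according to (5.19). It is a simplified illustration under our own assumptions (two unfolded stages, shared S and θ, a fixed precomputed Laplacian), not the cuda-convnet implementation used in the experiments.

```python
import numpy as np

def soft_threshold(U, theta):
    """Element-wise shrinkage h_theta with a per-row threshold vector theta (p,)."""
    return np.sign(U) * np.maximum(np.abs(U) - theta[:, None], 0.0)

class TAGnet:
    """Feed-forward network unfolded from the graph-regularized iterative algorithm."""
    def __init__(self, D, L, lam=1.0, alpha=5.0, n_stages=2):
        N = np.linalg.norm(D, 2) ** 2              # upper bound on eigmax(D^T D)
        self.W = D.T / N                            # (5.19): W = D^T / N
        self.S = np.eye(D.shape[1]) - D.T @ D / N   # (5.19): S = I - D^T D / N
        self.theta = np.full(D.shape[1], lam / N)   # (5.19): theta = lambda / N
        self.graph = (alpha / N) * L                # "x L" branch, kept non-trainable
        self.n_stages = n_stages

    def forward(self, X):
        B = self.W @ X                              # shared input projection
        A = soft_threshold(B, self.theta)
        for _ in range(self.n_stages):              # unfolded, truncated iterations
            A = soft_threshold(B + self.S @ A - A @ self.graph, self.theta)
        return A
```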
To facilitate training, we further rewrite (5.18) as
$$h_{\boldsymbol{\theta}}(\mathbf{u}) = \mathrm{diag}(\boldsymbol{\theta})\, h_{1}\!\left(\mathrm{diag}(\boldsymbol{\theta})^{-1}\mathbf{u}\right), \tag{5.20}$$
where $h_1$ denotes the shrinkage neuron with all thresholds fixed to one.
Equation (5.20) indicates that the original neuron with trainable thresholds can be decomposed into two linear scaling layers plus a unit-threshold neuron. The weights of the two scaling layers are diagonal matrices defined by θ and its element-wise reciprocal, respectively.
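A quick numerical check of this decomposition (reusing the soft_threshold helper assumed in the sketches above):

```python
import numpy as np

def soft_threshold(u, theta):
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

rng = np.random.default_rng(0)
u = rng.standard_normal(8)
theta = rng.uniform(0.1, 1.0, size=8)                  # positive trainable thresholds

direct = soft_threshold(u, theta)                      # neuron with trainable thresholds
decomposed = theta * soft_threshold(u / theta, 1.0)    # scale, unit-threshold, scale back
assert np.allclose(direct, decomposed)
```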
A notable component in TAGnet is the $\times\mathbf{L}$ branch of each stage. The graph Laplacian L can be computed in advance. In the feed-forward process, a $\times\mathbf{L}$ branch takes the intermediate output $\mathbf{A}^{k}$ (k = 1, 2) as its input and applies the “$\times\mathbf{L}$” operator defined above. The output is aggregated with the output of the learnable S layer. In the backpropagation, L is not altered. In such a way, the graph regularization is effectively encoded in the TAGnet structure as a prior.
An appealing highlight of (D)TAGnet lies in its very effective and straightforward initialization strategy. With sufficient data, many latest deep networks train well with random initializations without pretraining. However, it has been discovered that poor initializations hamper the effectiveness of first-order methods (e.g., SGD) in certain cases [40]. For (D)TAGnet, it is, however, much easier to initialize the model in the right regime. This benefits from the analytical relationships between sparse coding and network hyperparameters defined in (5.19): we could initialize deep models from corresponding sparse coding components, the latter of which is easier to obtain. Such an advantage becomes much more important when the training data is limited.
Assume K clusters, and let $\boldsymbol{\omega} = [\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_K]$ be the set of parameters of the loss function, where $\boldsymbol{\omega}_i$ corresponds to the ith cluster, $i = 1, 2, \ldots, K$. In this section, we adopt the following two forms of clustering-oriented loss functions.
One natural choice of the loss function is extended from the popular softmax loss and takes the entropy-like form
$$\mathcal{L}(\mathbf{A}, \boldsymbol{\omega}) = -\sum_{i=1}^{n}\sum_{j=1}^{K} p_{ij}\log p_{ij}, \tag{5.21}$$
where $p_{ij}$ denotes the probability that sample $\mathbf{x}_i$ belongs to cluster j, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, K$, and
$$p_{ij} = p(j \mid \boldsymbol{\omega}, \mathbf{a}_i) = \frac{e^{\boldsymbol{\omega}_j^T \mathbf{a}_i}}{\sum_{l=1}^{K} e^{\boldsymbol{\omega}_l^T \mathbf{a}_i}}.$$
In testing, the predicted cluster label of input $\mathbf{x}_i$ is determined using the maximum likelihood criterion based on the predicted $p_{ij}$.
The maximum-margin clustering (MMC) approach was proposed in [3]. MMC finds a way to label the samples by running an SVM implicitly, such that the resulting SVM margin is maximized over all possible labelings [5]. Referring to the MMC definition, the authors of [31] designed the max-margin loss as
$$\mathcal{L}(\mathbf{A}, \boldsymbol{\omega}) = \sum_{i=1}^{n}\mathcal{L}(\mathbf{a}_i, \boldsymbol{\omega}). \tag{5.23}$$
In the above equation, the loss for an individual sample $\mathbf{a}_i$ is defined as
$$\mathcal{L}(\mathbf{a}_i, \boldsymbol{\omega}) = \max\left(0, \; 1 + \max_{j \neq y_i}\boldsymbol{\omega}_j^T\mathbf{a}_i - \boldsymbol{\omega}_{y_i}^T\mathbf{a}_i\right), \quad \text{where } y_i = \arg\max_{j}\boldsymbol{\omega}_j^T\mathbf{a}_i,$$
and $\boldsymbol{\omega}_j$ is the prototype for the jth cluster. In testing, the predicted cluster label of input $\mathbf{x}_i$ is determined by the weight vector that achieves the maximum $\boldsymbol{\omega}_j^T\mathbf{a}_i$.
Model Complexity. The proposed framework can handle large-scale and high-dimensional data effectively via the stochastic gradient descent (SGD) algorithm. In each step, the back-propagation procedure requires only operations of order O(p) [29]. The training algorithm takes O(Cnp) time (C is a constant depending on the total number of epochs, the number of stages, etc.). In addition, SGD is easy to parallelize and can thus be efficiently trained using GPUs.
There is a close connection between sparse coding and neural networks. In [29], a feed-forward neural network, named LISTA, was proposed to efficiently approximate the sparse code a of an input signal x, where the codes are obtained by solving (5.15) in advance. The LISTA network learns its hyperparameters as a general regression model from training data to the pre-solved sparse codes using back-propagation.
LISTA overlooks the useful geometric information among data points [8], and can therefore be viewed as a special case of TAGnet in Fig. 5.4 with α = 0 (i.e., removing the $\times\mathbf{L}$ branches). Moreover, LISTA aims to approximate the “optimal” sparse codes pre-obtained from (5.15), and therefore requires the estimation of D and the tedious precomputation of A. The authors did not exploit its potential in supervised and task-specific feature learning.
Deep networks are well known for their capability to learn semantically rich representations in their hidden layers [41]. In this section, we investigate how the intermediate features $\mathbf{A}^{k}$ (k = 1, 2) in TAGnet (Fig. 5.4A) can be interpreted, and further utilized to improve the model, for specific clustering tasks. Compared to related non-deep models [31], such a hierarchical clustering property is another unique advantage of being deep.
Our strategy is mainly inspired by the algorithmic framework of deeply supervised nets [42]. As shown in Fig. 5.5, our proposed Deeply-Task-specific And Graph-regularized Network (DTAGnet) brings in additional deep feedback by associating a clustering-oriented local auxiliary loss $\mathcal{L}_k(\mathbf{A}^k, \boldsymbol{\omega}^k)$ (k = 1, 2) with each stage. Such an auxiliary loss takes the same form as the overall loss $\mathcal{L}(\mathbf{A}, \boldsymbol{\omega})$, except that the expected cluster number may be different, depending on the auxiliary clustering task to be performed. DTAGnet back-propagates errors not only from the overall loss layer, but also simultaneously from the auxiliary losses.
While seeking the optimal performance on the target clustering task, DTAGnet is also driven by two auxiliary tasks that explicitly target the clustering of specific attributes. It enforces a constraint on each hidden representation to directly make a good cluster prediction. In addition to the overall loss, the introduction of auxiliary losses gives another strong push to obtain discriminative and sensible features at each individual stage. As discovered in the classification experiments in [42], the auxiliary loss both acts as feature regularization to reduce generalization errors and results in faster convergence. We also find in Sect. 5.2.5 that each intermediate feature $\mathbf{A}^k$ (k = 1, 2) is indeed most suited for its targeted task.
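Schematically, the DTAGnet training objective augments the overall clustering loss with the stage-wise auxiliary losses; a minimal sketch under our own assumptions (callable loss terms and hypothetical weighting coefficients betas, which the chapter does not specify) is:

```python
def dtagnet_total_loss(stage_features, final_feature, aux_losses, overall_loss, betas):
    """Overall clustering loss plus weighted per-stage auxiliary clustering losses.
    stage_features: intermediate features A^k, one per stage;
    aux_losses: clustering-oriented loss callables, one per auxiliary task;
    overall_loss: loss callable for the target clustering task;
    betas: (hypothetical) weights balancing the auxiliary terms."""
    total = overall_loss(final_feature)
    for A_k, loss_k, beta_k in zip(stage_features, aux_losses, betas):
        total += beta_k * loss_k(A_k)     # deep feedback from each stage
    return total
```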
In [38], a Deep Semi-NMF model was proposed to learn hidden representations that lend themselves to an interpretation as clustering according to different attributes. The authors considered the problem of mapping facial images to their identities. A face image also contains attributes like pose and expression that help identify the person depicted. In their experiments, the authors found that by further factorizing this mapping in a way that each factor adds an extra layer of abstraction, the deep model could automatically learn latent intermediate representations suited for clustering identity-related attributes. Although there is a clustering interpretation, those hidden representations are not specifically optimized in a clustering sense. Instead, the entire model is trained only with the overall reconstruction loss, after which clustering is performed by K-means on the learned features. Consequently, their clustering performance is not satisfactory. Our study shares a similar observation and motivation with [38], but proceeds in a more task-specific manner by performing the optimization of the auxiliary clustering tasks jointly with the overall task.
We evaluate the proposed model on three publicly available datasets:
Although we only evaluate the proposed method on image datasets, the methodology itself is not limited to image data. We apply two widely used measures to evaluate the clustering performance, the accuracy and the normalized mutual information (NMI) [8,12]. We follow the convention of many clustering works [8,19,31] and do not distinguish training from testing: we train our models on all available samples of each dataset and report the clustering performance on them as our testing results. Results are averaged over 5 independent runs.
The proposed networks are implemented using the cuda-convnet package [34]. The network takes two stages (K = 2) by default. We apply a constant learning rate of 0.01 with no momentum to all trainable layers, and use a batch size of 128. In particular, to encode the graph regularization as a prior, we fix L during model training by setting its learning rate to 0. Experiments are run on a workstation with 12 Intel Xeon 2.67 GHz CPUs and 1 GTX680 GPU. The training takes approximately 1 hour on the MNIST dataset. It is also observed that the training efficiency of our model scales approximately linearly with the data size.
In our experiments, we set the default value of α to 5 and p to 128, and choose λ from [0.1, 1] by cross-validation. A dictionary D is first learned from X by K-SVD [20]; W, S, and θ are then initialized based on (5.19), while L is precalculated from P, which is formulated by the Gaussian kernel, $P_{ij} = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^2}{\delta^2}\right)$ (δ is also selected by cross-validation). After obtaining the output A from the initial (D)TAGnet models, ω (or $\boldsymbol{\omega}^k$) can be initialized by minimizing (5.21) or (5.23) over A (or $\mathbf{A}^k$).
We denote the proposed model of TAGnet plus entropy-minimization loss (EML) (5.21) as TAGnet-EML, and the one plus maximum-margin loss (MML) (5.23) as TAGnet-MML, respectively. We include the following comparison methods:
As revealed by the full comparison results in Table 5.3, the proposed task-specific deep architectures outperform the others by a noticeable margin. The underlying domain expertise guides the data-driven training in a more principled way. In contrast, the “general-architecture” baseline encoders (BE-EML and BE-MML) produce much worse (often the worst) results. Furthermore, it is evident that the proposed end-to-end optimized models outperform their “non-joint” counterparts. For example, on the MNIST dataset, TAGnet-MML surpasses NJ-TAGnet-MML by around 4% in accuracy and 5% in NMI.
Table 5.3
Accuracy and NMI performance comparisons on all three datasets
Dataset | Metric | TAGnet-EML | TAGnet-MML | NJ-TAGnet-EML | NJ-TAGnet-MML | BE-EML | BE-MML | SC-EML | SC-MML | Deep Semi-NMF |
---|---|---|---|---|---|---|---|---|---|---|
MNIST | Acc | 0.6704 | 0.6922 | 0.6472 | 0.5052 | 0.5401 | 0.6521 | 0.6550 | 0.6784 | / |
MNIST | NMI | 0.6261 | 0.6511 | 0.5624 | 0.6067 | 0.5002 | 0.5011 | 0.6150 | 0.6451 | / |
CMU MultiPIE | Acc | 0.2176 | 0.2347 | 0.1727 | 0.1861 | 0.1204 | 0.1451 | 0.2002 | 0.2090 | 0.17 |
CMU MultiPIE | NMI | 0.4338 | 0.4555 | 0.3167 | 0.3284 | 0.2672 | 0.2821 | 0.3337 | 0.3521 | 0.36 |
COIL20 | Acc | 0.8553 | 0.8991 | 0.7432 | 0.7882 | 0.7441 | 0.7645 | 0.8225 | 0.8658 | / |
COIL20 | NMI | 0.9090 | 0.9277 | 0.8707 | 0.8814 | 0.8028 | 0.8321 | 0.8850 | 0.9127 | / |
By comparing TAGnet-EML/TAGnet-MML with SC-EML/SC-MML, we draw a promising conclusion: adopting a more heavily parameterized deep architecture allows a larger feature learning capacity compared to conventional sparse coding. Although similar points have been made in many other fields [34], we are interested in a closer comparison between the two. Fig. 5.6 plots the clustering accuracy and NMI curves of TAGnet-EML/TAGnet-MML on the MNIST dataset against the number of iterations. Each model is well initialized at the very beginning, and the clustering accuracy and NMI are computed every 100 iterations. At first, the clustering performances of the deep models are even slightly worse than those of the sparse coding methods, mainly because the initialization of TAGnet hinges on a truncated approximation of graph-regularized sparse coding. After a small number of iterations, the performance of the deep models surpasses that of the sparse coding ones, and continues rising monotonically until reaching a higher plateau.
In (5.16), the graph regularization term imposes stronger smoothness constraints on the sparse codes as α grows larger. The same holds for TAGnet. We investigate how the clustering performance of TAGnet-EML/TAGnet-MML is influenced by various α values. From Fig. 5.7, we observe the same general tendency on all three datasets. As α increases, the accuracy/NMI result first rises and then decreases, with the peak appearing at an intermediate value. As an interpretation, the local manifold information is not sufficiently encoded when α is too small (α = 0 completely disables the $\times\mathbf{L}$ branch of TAGnet and reduces it to the LISTA network [29] fine-tuned by the losses). On the other hand, when α is large, the sparse codes are “oversmoothed” and have reduced discriminative ability. Note that similar phenomena are also reported in other relevant literature, e.g., [8,31].
Furthermore, comparing Fig. 5.7A–F, it is noteworthy to observe how graph regularization behaves differently on the three datasets. We notice that the COIL20 dataset is the most sensitive to the choice of α. Increasing α from 0.01 to 50 leads to an improvement of more than 10% in terms of both accuracy and NMI. This verifies the significance of graph regularization when training samples are limited [19]. On the MNIST dataset, both models obtain a gain of up to 6% in accuracy and 5% in NMI by tuning α from 0.01 to 10. However, unlike COIL20, which almost always favors a larger α, the model performance on the MNIST dataset not only saturates but is even significantly hampered when α continues rising to 50. The CMU MultiPIE dataset witnesses moderate improvements of around 2% in both measurements, and is not as sensitive to α as the other two. Potentially, this might be due to the complex variability in the original images, which makes the graph unreliable for estimating the underlying manifold geometry. We suspect that more sophisticated graphs may help alleviate the problem, and we will explore this in the future.
On the MNIST dataset, we reconduct the clustering experiments with the cluster number ranging from 2 to 10, using TAGnet-EML/TAGnet-MML. Fig. 5.8 shows how the clustering accuracy and NMI change as the number of clusters varies. The clustering performance transits smoothly and robustly as the task scale changes.
To examine the proposed models' robustness to noise, we add Gaussian noise with standard deviation s ranging from 0 (noiseless) to 0.3 and retrain our MNIST model. Fig. 5.9 indicates that both TAGnet-EML and TAGnet-MML possess a certain robustness to noise. When s is less than 0.1, there is little visible performance degradation. While TAGnet-MML consistently outperforms TAGnet-EML in all experiments (as MMC is well known to be highly discriminative [3]), it is interesting to observe in Fig. 5.9 that the latter is slightly more robust to noise than the former. This is perhaps owing to the probability-driven loss form (5.21) of EML, which allows for more flexibility.
As observed, CMU MultiPIE is very challenging for the basic identity clustering task. However, it comes with several other attributes: pose, expression, and illumination, which can be of assistance in our proposed DTAGnet framework. In this section, we apply a setting similar to [38] on the same CMU MultiPIE subset, by setting pose clustering as the Stage I auxiliary task and expression clustering as the Stage II auxiliary task. In that way, the Stage I auxiliary task targets 5 clusters, the Stage II auxiliary task targets 6 clusters, and, finally, the overall task targets 147 clusters.
The training of DTAGnet-EML/DTAGnet-MML follows the same aforementioned process, except that extra back-propagated gradients from the auxiliary task at Stage k (k = 1, 2) are taken into account. After that, we test each stage separately on its targeted task. In DTAGnet, each auxiliary task is also jointly optimized with its intermediate feature $\mathbf{A}^k$, which differentiates our methodology substantially from [38]. It is thus no surprise to see in Table 5.4 that each auxiliary task obtains much better performance than in [38]. Most notably, the performance of the overall identity clustering task witnesses a very impressive boost of around 7% in accuracy. We also test DTAGnet-EML/DTAGnet-MML with only the Stage I or the Stage II auxiliary loss kept. Experiments verify that by adding auxiliary tasks gradually, the overall task keeps benefiting. Those auxiliary tasks, when enforced together, also reinforce each other mutually.
Table 5.4
Effects of incorporating auxiliary clustering tasks in DTAGnet-EML/DTAGnet-MML (P, Pose; E, Expression; I, Identity)
Method | Stage I Task | Stage I Acc | Stage II Task | Stage II Acc | Overall Task | Overall Acc |
---|---|---|---|---|---|---|
DTAGnet-EML | / | / | / | / | I | 0.2176 |
DTAGnet-EML | P | 0.5067 | / | / | I | 0.2303 |
DTAGnet-EML | / | / | E | 0.3676 | I | 0.2507 |
DTAGnet-EML | P | 0.5407 | E | 0.7027 | I | 0.2833 |
DTAGnet-MML | / | / | / | / | I | 0.2347 |
DTAGnet-MML | P | 0.5251 | / | / | I | 0.2635 |
DTAGnet-MML | / | / | E | 0.3988 | I | 0.2858 |
DTAGnet-MML | P | 0.5538 | E | 0.4231 | I | 0.3021 |
One might be curious about which matters more for the performance boost: the deeply task-specific architecture that brings extra discriminative feature learning, or the proper design of auxiliary tasks that capture the intrinsic data structure characterized by attributes.
To answer this important question, we vary the target cluster number in either the Stage I or the Stage II auxiliary task and reconduct the experiments. Table 5.5 reveals that more auxiliary tasks, even those without any straightforward task-specific interpretation (e.g., partitioning the Multi-PIE subset into 4, 8, 12, or 20 clusters hardly makes semantic sense), may still help gain better performance. It is understandable that they simply promote more discriminative feature learning in a low-to-high, coarse-to-fine scheme. In fact, this is a complementary observation to the conclusion found for classification [42]. On the other hand, at least in this specific case, the models seem to achieve the best performance when the target cluster numbers of the auxiliary tasks are closest to the ground truth (5 and 6 here). We conjecture that when properly “matched”, the hidden representation in each layer is in fact most suited for clustering the attributes corresponding to that layer. The whole model resembles the practice of sharing low-level feature filters among several relevant high-level tasks in convolutional networks [44], but in a distinct context.
Table 5.5
Effects of varying target cluster numbers of auxiliary tasks in DTAGnet-EML/DTAGnet-MML
Method | #Clusters in Stage I | #Clusters in Stage II | Overall Accuracy |
---|---|---|---|
DTAGnet-EML | 4 | 4 | 0.2827 |
DTAGnet-EML | 8 | 8 | 0.2813 |
DTAGnet-EML | 12 | 12 | 0.2802 |
DTAGnet-EML | 20 | 20 | 0.2757 |
DTAGnet-MML | 4 | 4 | 0.3030 |
DTAGnet-MML | 8 | 8 | 0.3006 |
DTAGnet-MML | 12 | 12 | 0.2927 |
DTAGnet-MML | 20 | 20 | 0.2805 |
We hence conclude that the deeply supervised fashion proves to be helpful for deep clustering models, even when there are no explicit attributes for constructing a practically meaningful hierarchical clustering problem. However, it is preferable to exploit such attributes when available, as they lead not only to superior performance but also to more clearly interpretable models. The learned intermediate features can potentially be utilized for multitask learning [45].
In this section, we present a deep learning-based clustering framework. Trained from end to end, it features a task-specific deep architecture inspired by the sparse coding domain expertise, which is then optimized under clustering-oriented losses. Such a well-designed architecture leads to more effective initialization and training, and significantly outperforms generic architectures of the same parameter complexity. The model could be further interpreted and enhanced, by introducing auxiliary clustering losses to the intermediate features. Extensive experiments verify the effectiveness and robustness of the proposed models.