Zhangyang Wang Department of Computer Science and Engineering, Texas A&M University, College Station, TX, United States
Despite its nonconvex nature, sparse approximation is desirable in many theoretical and application cases. We study the sparse approximation problem with the tool of deep learning, by proposing deep encoders. Two typical forms, the -regularized problem and the M-sparse problem, are investigated. Based on solid iterative algorithms, we model them as feed-forward neural networks, by introducing novel neurons and pooling functions. Enforcing such structural priors acts as an effective network regularization. The deep encoders also enjoy faster inference, larger learning capacity, and better scalability compared to conventional sparse coding solutions. Furthermore, under task-driven losses, the models can be conveniently optimized from end to end. Numerical results demonstrate the impressive performances of the proposed encoders.
Sparse approximation; Deep learning; Neuron; Pooling
Sparse signal approximation has gained popularity over the last decade. The sparse approximation model suggests that a natural signal could be compactly approximated by only a few atoms out of a properly given dictionary, where the weights associated with the dictionary atoms are called the sparse codes. Proven to be both robust to noise and scalable to high-dimensional data, sparse codes are known as powerful features, and benefit a wide range of signal processing applications, such as source coding [1], denoising [2], source separation [3], pattern classification [4], and clustering [5].
We are particularly interested in the -based sparse approximation problem, which is the fundamental formulation of sparse coding [6]. The nonconvex problem is intractable and often instead attacked by minimizing surrogate measures, such as the -norm, which leads to more tractable computational methods. However, it has been both theoretically and practically discovered that solving sparse approximation is still preferable in many cases.
More recently, deep learning has attracted great attentions in many feature learning problems [7]. The advantages of deep learning lie in its composition of multiple nonlinear transformations to yield more abstract and descriptive embedding representations. With the aid of gradient descent, it also scales linearly in time and space with the number of train samples.
It has been noticed that sparse approximation and deep learning bear certain connections [8]. Their similar methodology has been lately exploited in [9], [10], [11], [12]. By turning sparse coding models into deep networks, one may expect faster inference, larger learning capacity, and better scalability. The network formulation also facilitates the integration of task-driven optimization.
In this chapter, we investigate two typical forms of -based sparse approximation problems, the -regularized problem, and the M-sparse problem. Based on solid iterative algorithms [13], we formulate them as feed-forward neural networks [8], called deep encoders, through introducing novel neurons and pooling functions. We study their applications in image classification and clustering; in both cases the models are optimized in a task-driven, end-to-end manner. Impressive performances are observed in numerical experiments.
Finding the sparsest, or minimum -norm, representation of a signal given a dictionary of basis atoms is an important problem in many application domains. Consider a data sample that is encoded into its sparse code using a learned dictionary , where are the learned atoms. The sparse codes are obtained by solving the -regularized problem (λ is a constant)
Alternatively, one could explicitly impose constraints on the number of nonzero coefficients of the solution, by solving the M-sparse problem
Unfortunately, these optimization problems are often intractable because there is a combinatorial increase in the number of local minima as the number of the candidate basis vectors increases. One potential remedy is to employ a convex surrogate measure, such as the -norm, in place of the -norm that leads to a more tractable optimization problem. For example, (3.1) could be relaxed as
It creates an unimodal optimization problem that can be solved via linear programming techniques. The downside is that we have now introduced a mismatch between the ultimate goal and the objective function [14]. Under certain conditions, the minimum -norm solution equals to that of the minimum -norm [6]. But in practice, the -approximation is often used way beyond these conditions, and is thus quite heuristic. As a result, we often get a solution which is not exactly minimizing the original -norm.
That said, -approximation is found to work practically well for many sparse coding problems. Yet in certain applications, we intend to control the exact number of nonzero elements, such as basis selection [14], where -approximation is indispensable. Beyond that, -approximation is desirable for performance concerns in many ways. In compressive sensing literature, empirical evidence [15] suggested that using an iterative reweighted -scheme to approximate the -solution often improved the quality of signal recovery. In image enhancement, it was shown in [16] that data fidelity was more suitable for reconstructing images corrupted with impulse noise. For the purpose of image smoothing, the authors of [17] utilized gradient minimization to globally control how many nonzero gradients are used to approximate prominent structures in a structure-sparsity-management manner. Recent work [18] revealed that sparse subspace clustering can completely characterize the set of minimal union-of-subspace structure, without additional separation conditions required by its counterpart.
In [8], a feed-forward neural network, as illustrated in Fig. 3.1, was proposed to efficiently approximate the -based sparse code a of the input signal x; the sparse code a is obtained by solving (3.3) for a given dictionary D in advance. The network has a finite number of stages, each of which updates the intermediate sparse code () according to
where is an element-wise shrinkage function (u is a vector and is its ith element, ),
The parameterized encoder, named learned ISTA (LISTA), is a natural network implementation of the iterative shrinkage and thresholding algorithm (ISTA). LISTA learned all its parameters W, S and θ from training data using a back-propagation algorithm [19]. In this way, a good approximation of the underlying sparse code can be obtained after a fixed small number of stages.
In [10], the authors leveraged a similar idea on fast trainable regressors and constructed feed-forward network approximations of the learned sparse models. Such a process-centric view was later extended in [11] to develop a principled process of learned deterministic fixed-complexity pursuits, in lieu of iterative proximal gradient descent algorithms, for structured sparse and robust low rank models. Very recently, [9] further summarized the methodology of the problem-level and model-based “deep unfolding”, and developed new architectures as inference algorithms for both Markov random fields and nonnegative matrix factorization. Our work shares the same prior ideas, yet studies the unexplored -problems and obtains further insights.
To solve the optimization problem in (3.1), an iterative hard-thresholding (IHT) algorithm was derived in [13], namely
where denotes the intermediate result of the kth iteration, and is an element-wise hard thresholding operator given by
Equation (3.6) could be alternatively rewritten as
and expressed as the block diagram in Fig. 3.2, which outlines a recurrent network form of solving (3.6).
By time-unfolding and truncating Fig. 3.2 to a fixed number of K iterations ( in this chapter by default),1 we obtain a feed-forward network structure in Fig. 3.3, where W, S and θ are shared among both stages, named deep -regularized encoder. Furthermore, W, S and θ are all to be learnt, instead of being directly constructed from any precomputed D. Although the equations in (3.8) do not directly apply any more to solving the deep -regularized encoder, they can usually serve as a high-quality initialization of the latter.
Note that the activation thresholds θ are less straightforward to update. We rewrite (3.5) as . It indicates that the original neuron with trainable thresholds can be decomposed into two linear scaling layers, plus a unit-hard-threshold neuron, which is later called Hard thrEsholding Linear Unit (HELU) by us. The weights of the two scaling layers are diagonal matrices defined by θ and its element-wise reciprocal, respectively.
Discussion about HELU. While being inspired by LISTA, the differentiating point of deep -regularized encoder lies in the HELU neuron. Compared to classical neuron functions such as logistic, sigmoid, and ReLU [20], as well as the soft shrinkage and thresholding operation (3.5) in LISTA, HELU does not penalize large values, yet enforces strong (in theory infinite) penalty for small values. As such, HELU tends to produce highly sparse solutions.
The neuron form of LISTA (3.5) could be viewed as a double-sided and translated variant of ReLU, which is continuous and piecewise linear. In contrast, HELU is a discontinuous function that rarely occurs in existing deep network neurons. As pointed out by [21], HELU has countably many discontinuities and is thus (Borel) measurable, in which case the universal approximation capability of the network is not compromised. However, experiments remind us that the algorithmic learnability with such discontinuous neurons (using popular first-order methods) is in question, and the training is in general hard. For computation concerns, we replace HELU with the following continuous and piecewise linear function , during network training:
Obviously, becomes HELU when . To approximate HELU, we tend to choose very small σ, while avoiding putting the training to be ill-posed. As a practical strategy, we start with a moderate σ (0.2 by default), and divide it by 10 after each epoch. After several epochs, turns very close to the ideal HELU.
In [22], the authors introduced an ideal hard thresholding function for solving sparse coding, whose formulation was close to HELU. Note that [22] approximates the ideal function with a sigmoid function, which has connections with our approximation. In [23], a similar truncated linear ReLU was utilized in the networks.
Both the -regularized problem in (3.1) and deep -regularized encoder have no explicit control on the sparsity level of the solution. One would therefore turn to the M-sparse problem in (3.2), and derive the following iterative algorithm [13]:
Equation (3.10) resembles (3.6), except that is now a nonlinear operator retaining the M coefficients with the top M-largest absolute values. Following the same methodology as in the previous section, the iterative form could be time-unfolded and truncated to the deep M-sparse encoder, as in Fig. 3.4. To deal with the operation, we refer to the popular concepts of pooling and unpooling [24] in deep networks, and introduce the pairs of maxM pooling and unpooling, as in Fig. 3.4.
Discussion about maxM pooling/unpooling. Pooling is popular in convolutional networks to obtain translation-invariant features [7]. It is yet less common in other forms of deep networks [25]. The unpooling operation was introduced in [24] to insert the pooled values back to the appropriate locations of feature maps for reconstruction purposes.
In our proposed deep M-sparse encoder, the pooling and unpooling operation pair is used to construct a projection from to its subset . The maxM pooling and unpooling functions are intuitively defined as
For each input u, the pooled map records the top M-largest values (irrespective of sign), and the switch records their locations. The corresponding unpooling operation takes the elements in and places them in at the locations specified by , the remaining elements being set to zero. The resulting is of the same dimension as u, but has exactly no more than M nonzero elements. In back propagation, each position in is propagated with the entire error signal.
It is showed in [13] that the iterative algorithms in both (3.6) and (3.10) are guaranteed not to increase the cost functions. Under mild conditions, their targeted fixed points are local minima of the original problems. As the next step after the time truncation, the deep encoder models are to be solved by the stochastic gradient descent (SGD) algorithm, which converges to stationary points under a few stricter assumptions than those satisfied in this chapter [26].2 However, the entanglement of the iterative algorithms and the SGD algorithm makes the overall convergence analysis really difficult.
One must emphasize that in each step, the back propagation procedure requires only operations of order O(p) [8]. The training algorithm takes O(Cnp) time (C is the constant absorbing epochs, stage numbers, etc.). The testing process is purely feed-forward and is therefore dramatically faster than traditional inference methods by solving (3.1) or (3.2). SGD is also easy to be parallelized.
It is often desirable to jointly optimize the learned sparse code features and the targeted task so that they mutually reinforce each other. The authors of [27] associated label information with each dictionary item by adding discriminable regularization terms to the objective. Recent work [28], [29] developed task-driven sparse coding via bi-level optimization models, where (-based) sparse coding is formulated as the lower-level constraint while a task-oriented cost function is minimized as its upper-level objective. The above approaches in sparse coding are complicated and computationally expensive. It is much more convenient to implement end-to-end task-driven training in deep architectures, by concatenating the proposed deep encoders with certain task-driven loss functions.
In this chapter, we mainly discuss two tasks: classification and clustering, while being aware of other immediate extensions, such as semisupervised learning. Assuming K classes (or clusters), and as the set of parameters of the loss function, where corresponds to the jth class (cluster), . For the classification case, one natural choice is the well-known softmax loss function. For the clustering case, since the true cluster label of each x is unknown, we define the predicted confidence probability that sample x belongs to cluster j as the likelihood of softmax regression
The predicted cluster label of a is the cluster j where it achieves the largest .
Two proposed deep encoders are implemented with the CUDA ConvNet package [7]. We use a constant learning rate of 0.01 with no momentum, and a batch size of 128. In practice, given that the model is well initialized, the training takes approximately 1 hour on the MNIST dataset, on a workstation with 12 Intel Xeon 2.67 GHz CPUs and 1 GTX680 GPU. It is also observed that the training efficiency of our model scales approximately linearly with the size of data.
While many neural networks train well with random initializations without pre-training, given that the training data is sufficient, it has been discovered that poorly initialized networks can hamper the effectiveness of first-order methods (e.g., SGD) [30]. For the proposed models, it is, however, much easier to initialize the model in the right regime, benefiting from the analytical relationships between sparse coding and network hyperparameters in (3.8).
We first compare the performance of different methods on sparse code approximation. The first 60,000 samples of the MNIST dataset are used for training and the last 10,000 for testing. Each patch is resized to and then preprocessed to remove its mean and normalize its variance. The patches with small standard deviations are discarded. A sparsity coefficient is used in (3.1), and the sparsity level is fixed in (3.2). The sparse code dimension (dictionary size) p is to be varied.
Our prediction task resembles the setup in [8]: first learning a dictionary from training data, followed by solving sparse approximation (3.3) with respect to the dictionary, and finally training the network as a regressor from input samples to the solved sparse codes. The only major difference here lies in that unlike the -based problems, the nonconvex -based minimization could only reach a (nonunique) local minimum. To improve stability, we first solve the -problems to obtain a good initialization for -problems, and then run the iterative algorithms (3.6) or (3.10) until convergence. The obtained sparse codes are called “optimal codes” hereinafter and used in both training and testing evaluation (as “ground-truth”). One must keep in mind that we are not seeking to produce approximate sparse code for all possible input vectors, but only for input vectors drawn from the same distribution as our training samples.
We compare the proposed deep encoders with the iterative algorithms under different number of iterations. In addition, we include a baseline encoder into comparison, which is a fully-connected feed-forward network, consisting of three hidden layers of dimension p with ReLu neurons. The baseline encoder thus has the same parameter capacity as deep encoders.3 We apply dropout to the baseline encoders, with the probabilities of retaining the units being 0.9, 0.9, and 0.5. The proposed encoders do not apply dropout.
The deep encoders and the baseline encoder are first trained, and all are then evaluated on the testing set. We calculate the total prediction errors, i.e., the normalized squared errors between the optimal codes and the predicted codes, as in Tables 3.1 and 3.2. For the M-sparse case, we also compare their recovery of nonzero supports in Table 3.3, by counting the mismatched nonzero element locations between optimal and predicted codes (averaged on all samples). Immediate conclusions from the numerical results are as follows:
Table 3.1
Prediction error (%) comparison of all methods on solving the ℓ0-regularized problem (3.1)
p | 128 | 256 | 512 |
---|---|---|---|
Iterative (2 iterations) | 17.52 | 18.73 | 22.40 |
Iterative (5 iterations) | 8.14 | 6.75 | 9.37 |
Iterative (10 iterations) | 3.55 | 4.33 | 4.08 |
Baseline Encoder | 8.94 | 8.76 | 10.17 |
Deep ℓ0-Regularized Encoder | 0.92 | 0.91 | 0.81 |
Table 3.2
Prediction error (%) comparison of all methods on solving the M-sparse problem (3.2)
p | 128 | 256 | 512 |
---|---|---|---|
Iterative (2 iterations) | 17.23 | 19.27 | 19.31 |
Iterative (5 iterations) | 10.84 | 12.52 | 12.40 |
Iterative (10 iterations) | 5.67 | 5.44 | 5.20 |
Baseline Encoder | 14.04 | 16.76 | 12.86 |
Deep M-Sparse Encoder | 2.94 | 2.87 | 3.29 |
Table 3.3
Averaged nonzero support error comparison of all methods on solving the M-sparse problem (3.2)
p | 128 | 256 | 512 |
---|---|---|---|
Iterative (2 iterations) | 10.8 | 13.4 | 13.2 |
Iterative (5 iterations) | 6.1 | 8.0 | 8.8 |
Iterative (10 iterations) | 4.6 | 5.6 | 5.3 |
Deep M-Sparse Encoder | 2.2 | 2.7 | 2.7 |
Since the task-driven models are trained from end to end, no precomputation of a is needed. For classification, we evaluate our methods on the MNIST dataset, and the AVIRIS Indiana Pines hyperspectral image dataset (see [31] for details). We compare our two proposed deep encoders with two competitive sparse coding-based methods: (1) task-driven sparse coding (TDSC) in [28], with the original setting followed and all parameters carefully tuned; (2) a pre-trained LISTA followed by supervised tuning with softmax loss. Note that for the deep M-sparse encoder, M is not known in advance and has to be tuned. To our surprise, the fine-tuning of M is likely to improve the performances significantly, which is analyzed next. The overall error rates are compared in Tables 3.4 and 3.5.
Table 3.4
Classification error rate (%) comparison of all methods on the MNIST dataset
p | 128 | 256 | 512 |
---|---|---|---|
TDSC | 0.71 | 0.55 | 0.53 |
Tuned LISTA | 0.74 | 0.62 | 0.57 |
Deep ℓ0-Regularized | 0.72 | 0.58 | 0.52 |
Deep M-Sparse (M = 10) | 0.72 | 0.57 | 0.53 |
Deep M-Sparse (M = 20) | 0.69 | 0.54 | 0.51 |
Deep M-Sparse (M = 30) | 0.73 | 0.57 | 0.52 |
Table 3.5
Classification error rate (%) comparison of all methods on the AVIRIS Indiana Pines dataset
p | 128 | 256 | 512 |
---|---|---|---|
TDSC | 15.55 | 15.27 | 15.21 |
Tuned LISTA | 16.12 | 16.05 | 15.97 |
Deep ℓ0-Regularized | 15.20 | 15.07 | 15.01 |
Deep M-Sparse (M = 10) | 13.77 | 13.56 | 13.52 |
Deep M-Sparse (M = 20) | 14.67 | 14.23 | 14.07 |
Deep M-Sparse (M = 30) | 15.14 | 15.02 | 15.00 |
In general, the proposed deep encoders provide superior results to the deep -based method (tuned LISTA). TDSC also generates competitive results, but at the cost of the high complexity for inference, i.e., solving conventional sparse coding. It is of particular interest to us that when supplied with specific M values, the deep M-sparse encoder can generate remarkably improved results.4 Especially in Table 3.5, when , the error rate is around 1.5% lower than that of . Note that in the AVIRIS Indiana Pines dataset, the training data volume is much smaller than that of MNIST. In this way, we conjecture that it might not be sufficiently effective to make the training process depend fully on data; instead, crafting a stronger sparsity prior by smaller M could help learn more discriminative features.5 Such a behavior provides us with an important hint to impose suitable structural priors to deep networks.
For clustering, we evaluate our methods on the COIL 20 and the CMU PIE datasets [32]. Two state-of-the-art methods to compare are the jointly optimized sparse coding and clustering method proposed in [29], as well as the graph-regularized deep clustering method in [33].6 The overall error rates are compared in Tables 3.6 and 3.7.
Table 3.6
Clustering error rate (%) comparison of all methods on the COIL 20 dataset
p | 128 | 256 | 512 |
---|---|---|---|
[29] | 17.75 | 17.14 | 17.15 |
[33] | 14.47 | 14.17 | 14.08 |
Deep ℓ0-Regularized | 14.52 | 14.27 | 14.06 |
Deep M-Sparse (M = 10) | 14.59 | 14.25 | 14.03 |
Deep M-Sparse (M = 20) | 14.84 | 14.33 | 14.15 |
Deep M-Sparse (M = 30) | 14.77 | 14.37 | 14.12 |
Table 3.7
Clustering error rate (%) comparison of all methods on the CMU PIE dataset
p | 128 | 256 | 512 |
---|---|---|---|
[29] | 17.50 | 17.26 | 17.20 |
[33] | 16.14 | 15.58 | 15.09 |
Deep ℓ0-Regularized | 16.08 | 15.72 | 15.41 |
Deep M-Sparse (M = 10) | 16.77 | 16.46 | 16.02 |
Deep M-Sparse (M = 20) | 16.44 | 16.23 | 16.05 |
Deep M-Sparse (M = 30) | 16.46 | 16.17 | 16.01 |
Note that the method in [33] incorporated Laplacian regularization as an additional prior while the others did not. It is thus no wonder that this method often performs better than others. Even without any graph information utilized, the proposed deep encoders are able to obtain very close performances, and outperform [33] in certain cases. On the COIL 20 dataset, the lowest error rate is reached by the deep M-sparse () Encoder, when , followed by the deep -regularized encoder.
On the CMU PIE dataset, the deep -regularized encoder leads to competitive accuracy with [33], and outperforms all deep M-sparse encoders with noticeable margins, which is different from other cases. Previous work discovered that sparse approximations over CMU PIE had significant errors [34], which is also verified by us. Therefore, hardcoding exact sparsity could even hamper the model performance.
Remark. From the experiments, we gain additional insights in designing deep architectures:
We hope the above insights could be of reference to many other deep learning models.
We propose deep encoders to solve the sparse approximation problem. Rooted in solid iterative algorithms, the deep -regularized encoder and deep M-sparse encoder are developed, each designed to solve one typical formulation, accompanied with the introduction of the novel HELU neuron and maxM pooling/unpooling. When applied to specific tasks of classification and clustering, the models are optimized in an end-to-end manner. The latest deep learning tools enable us to solve considered problems in a highly effective and efficient fashion. They not only provide us with impressive performance in numerical experiments, but also enlighten with important insights into designing deep models.
While many recent works followed the idea of constructing feed-forward networks by unfolding and truncating iterative algorithms, as fast trainable regressors to approximate the solutions of sparse coding models, progress has been slow towards understanding the efficient approximation from a theoretical perspective. [35] investigated the convergence property of our proposed Deep Encoders. The authors argued that they can use data to train a transformation of dictionary that can improve its restricted isometry property (RIP) constant, when the original dictionary is highly correlated, causing IHT to fail easily. They moreover showed it beneficial to allow the weights to decouple across layers. However, the analysis in [35] cannot be straightforwardly extended to ISTA although IHT is linearly convergent [36] under rather strong assumptions.
Beside, [37] attempted to explain the mechanism of LISTA by re-factorizing the Gram matrix of dictionary, which tries to nearly diagonalize the Gram matrix with a basis that produces a small perturbation of the ball. They the re-parameterized LISTA into a new factorized architecture that achieved similar acceleration gain to LISTA. Using an “indirect” proof, [37] was able to show that LISTA can converge faster than ISTA, but still sublinearly. In [38], a similar learning-based model inspired by another iterative algorithm solve LASSO, approximated message passing (AMP), was studied. The idea was advanced in [39] to substitute the AMP proximal operator (soft-thresholding) with a learnable Gaussian denoiser. However, their main theoretical tool, named “state evolution”, is not directly applicable to analyzing LISTA.