Shuyang Wang⁎; Zhangyang Wang†; Yun Fu‡ ⁎Department of Electrical and Computer Engineering, Northeastern University, Boston, MA, United States
†Department of Computer Science and Engineering, Texas A&M University, College Station, TX, United States
‡Department of Electrical and Computer Engineering and College of Computer and Information Science (Affiliated), Northeastern University, Boston, MA, United States
Learning good representation for images is always a hot topic in machine learning and pattern recognition fields. Among the numerous algorithms, dictionary learning is a well-known strategy for effective feature extraction. Recently, more discriminative subdictionaries have been built by Fisher discriminative dictionary learning with specific class labels. Different types of constraints, such as sparsity, low-rankness and locality, are also exploited to make use of global and local information. On the other hand, as the basic building block of deep structure, the auto-encoder has demonstrated its promising performance in extracting new feature representation.
Marginalized denoising auto-encoder; Locality constraint; Dictionary learning
Learning good representation for images is always a hot topic in machine learning and pattern recognition fields. Among the numerous algorithms, dictionary learning is a well-known strategy for effective feature extraction. Recently, more discriminative sub-dictionaries have been built by Fisher discriminative dictionary learning with specific class labels. Different types of constraints, such as sparsity, low-rankness and locality, are also exploited to make use of global and local information. On the other hand, as the basic building block of a deep structure, the auto-encoder has demonstrated its promising performance in extracting new feature representation. In this section, a unified feature learning framework by incorporating the marginalized denoising auto-encoder into a locality-constrained dictionary learning scheme, named Marginalized Denoising Dictionary Learning (MDDL), will be introduced [1]. Overall, the introduced method deploys a low-rank constraint on each sub-dictionary and locality constraint instead of sparsity on coefficients, in order to learn more concise and pure feature spaces while inheriting the discrimination from sub-dictionary learning. Experimental results listed in this section demonstrate the effectiveness and efficiency of MDDL by comparing with several state-of-the-art methods.
Sparse representation has experienced rapid growth in both theory and application from recent researches and led to interesting results in image classification [2–4], speech denoising [5], and bioinformatics [6], etc. For each input signal, the key idea is to find a linear combination using atoms from a given over-complete dictionary D as a new representation. Therefore, sparse representation is capable of revealing the underlying structure of high dimensional data. Among the large range of problems sparse representations has been applied to, in this section we focus on image classification which demands a discriminative representation.
The generative and discriminative ability of dictionary, apparently, is a major factor for sparse representation. Directly using the original training samples as dictionary in [7] will raise a problem that the noise and ambiguity contained in the training set could impede the test sample being faithfully represented. In addition, the discerning information hidden behind the training samples will be ignored in this strategy. Actually, the aforementioned problem can be solved through dictionary learning which intends to learn a proper dictionary from the original training samples. After the training process is finished, a new coming signal can be well represented by a set of basis learned from the original training set. Solutions to problems like face recognition have been dramatically improved with a well-learned dictionary. In order to distinctively represent the test samples, a lot of research efforts have been made to seek a well-adapted dictionary. Recently, a discriminative constraint was added to the dictionary learning model based on K-SVD [8], in which the classification error was considered in order to gain discriminability [9]. In Jiang et al.'s paper, the discerning ability of dictionary is enforced by associating label information with each dictionary atom [10]. In order to learn a structured dictionary, the Fisher criterion was introduced to learn a set of sub-dictionaries with specific class labels [11]. The above algorithms are designed to deal with clear signals or those corrupted only by small noise. For the situation in which training samples are corrupted by large noise, the learned dictionary will include corruptions resulting in a failure to represent the test samples.
In this section, we mainly discuss two lines of related works, namely dictionary learning and auto-encoder.
Recent researches on dictionary learning have demonstrated that a well-learned dictionary will greatly boost the performance by yielding a better representation in human action recognition [21], scene categorization [22], and image coloration [23].
A sparse representation based classification (SRC) method for robust face recognition was proposed by Wright et al. Suppose we have c classes of a training set X=[X1,X2,…,Xc] where Xi∈Rd×ni is the training sample from the ith class with dimension d and ni samples. The SRC procedure is specified by two phases to classify a given test sample xtest. First, in the coding phase, we solve the following l1-norm minimization problem:
ˉa=argmina‖xtest−Xa‖22+λ‖a‖1,
in which the l1-norm is used as the convex envelope to replace the l0-norm in order to avoid an NP-hard problem and while keeping the sparsity. Then the classification is done via
identity(xtest)=argminiεi,
where εi=‖xtest−Xiˉai‖2, and ˉai is the coefficient vector associated with class i. SRC classifies a test image by picking the smallest reconstruction error εi.
Instead of directly using the training set itself as a dictionary, several algorithms and regularizations have been introduced into the dictionary learning framework to learn a compact dictionary with more representation power. In FDDL, a set of class-specified sub-dictionaries whose atoms correspond to the class labels are updated iteratively based on the Fisher discrimination criterion to include discriminative information. Jiang et al. presented a Label Consistent K-SVD (LC-KSVD) algorithm to make a learned dictionary more discriminative for sparse coding [10]. These methods have shown that a structured dictionary could dramatically improve classification performance. However, the performance of these methods will drop a lot if the training data is largely corrupted.
Recently introduced low-rank dictionary learning [12] aims to uncover the global structure by grouping similar samples into one cluster, which has been successfully applied to many applications, e.g., object detection [13], multi-view learning [14], unsupervised subspace segmentation [15], and 3D visual recovery [16]. The goal is to generate a low-rank matrix from corrupted original input data. That is, if a given data matrix X is corrupted by a sparse matrix E while the samples share a similar pattern, then the sparse noisy matrix can be separated to practically recover X via rank minimization. According to the previous research, many works have integrated low-rank regularization into sparse representation for separating the sparse noises from inputs signals while simultaneously optimizing the dictionary atoms in order to faithfully reconstruct the de-noised data. Moreover, a low-rank dictionary addresses the noisy data well by adding an error term with different norms, e.g., l1-norm, l2,1-norm.
Applying augmented Lagrange multipliers (ALM), Lin [24] proposed Robust PCA, with which the corrupted data in a single subspace can be recovered by solving a matrix rank minimization problem. In [12], Liu et al. proposed converting the task of face image clustering into a subspace segmentation problem with the assumption that face images from different individuals lie in different, nearly independent subspaces.
Most recently, deep learning has attracted a lot of interest when looking for better feature extraction. In this direction, auto-encoder [18] is one of the most popular building blocks to form a lite-version deep learning framework. The auto-encoder has drawn increasing attention in the feature learning area and has been considered as a simulation of the way the human visual system processes imagery. The auto-encoder architecture explicitly involves two modules, an encoder and a decoder. The encoder outputs a group of hidden representation units, which is realized by a linear deterministic mapping with a weight matrix and a nonlinear transformation employing a logistic sigmoid. The decoder reconstructs the input data based on the received sparse hidden representation. The aforementioned dictionary learning model can be formalized as a decoder module.
As a typical single hidden layer neural network with identical input and target, the auto-encoder [29] aims to discover data's intrinsic structure by encouraging the output to be as similar to the target as possible. Essentially, the neurons in the hidden layer can be seen as a good representation since they are able to reconstruct the input data. To encourage structured feature learning, further constraints have been imposed on the parameters during training.
Moreover, there is a well-known trick of the trade to deal with noisy data, that is, manually injecting noise into the training samples thereby learning with artificially corrupted data. Denoising auto-encoders (DAEs) [30,18], learned with artificial corrupted data as input, have been successfully applied to a wide range of machine learning tasks by learning a new denoising representation. During the training process, DAEs reconstruct the input data from partial corruption with a pre-specified corrupting distribution to its original clean version. This process learns a robust representation which ensures tolerance to certain distortions in input data.
On this basis, stacked denoising auto-encoders (SDAs) [19] have been successfully used to learn new representations and attained record accuracy on standard benchmark for domain adaptation. However, there are two crucial limitations of SDAs: (i) high computational cost, and (ii) lack of scalability to high-dimensional features. To address these two problems, the authors of [20] proposed marginalized SDAs (mSDA). Different from SDAs, mSDA marginalizes noise and thus the parameters can be computed in closed-form rather than using stochastic gradient descent or other optimization algorithms. Consequently, mSDA significantly speeds up SDAs by two orders of magnitude.
Wang et al. [28] explored the effectiveness of a locality constraint on two different types of feature learning technique, i.e., dictionary learning and auto-encoder, respectively. A locality-constrained collaborative auto-encoder (LCAE) was proposed to extract features with local information for enhancing the classification ability. In order to introduce locality into the coding procedure, the input data is first reconstructed by LLC coding criteria and then used as the target of the auto-encoder. That is, target ˆx is replaced by a locality reconstruction as follows:
minC∑Ni=1‖xi−Dci‖2+λ‖li⊙ci‖2 s.t. 1Tci=1,∀i,
where dictionary D will be initialized by PCA on the input training matrix X. The proposed LCAE can be trained using the backpropagation algorithm, which updates W and b by backpropagating the reconstruction error gradient from the output layer to the locality coded target layer.
In this chapter, the introduced model jointly learns the auto-encoder and dictionary to benefit from both techniques. To make the model fast, a lite version of the auto-encoder, i.e., marginalized denoising auto-encoder [20] is adapted, which has shown appealing performance and efficiency. Furthermore, there are several benchmarks used to evaluate the proposed algorithm, and the experimental results show its better performance compared to the state-of-the-art methods.
In this section, we first revisit locality-constrained dictionary learning and marginalized denoising auto-encoder. Then we introduce marginalized denoising dictionary learning with a locality constraint along with an efficient solution to optimize the proposed algorithm. The proposed overall framework could be viewed in Fig. 7.1.
In this section, we introduce a feature learning model by unifying the marginalized denoising auto-encoder and locality-constrained dictionary learning (MDDL) together to simultaneously benefit from their merits [1]. Specifically, dictionary learning manages handling corrupted data from the sample space, while marginalized auto-encoder attempts to deal with the noisy data from the feature space. Thus, the algorithm aims to work well with the corrupted data from both the sample and feature spaces by integrating dictionary learning and auto-encoder into a unified framework. The key points of this introduced framework are listed as follows:
Consider a matrix X=[x1,x2,…,xn] having n samples. A low-rank representation tries to segment these samples by minimizing the following objective function:
argminZ‖Z‖⁎s.t. X=DZ,
where ‖⋅‖⁎ denotes the nuclear norm and Z is the coefficient matrix. In the subspace segmentation problem, a low-rank approximation is enforced by minimizing the error between X and its low rank representation (D=X). By applying low-rank regularization in dictionary updating, the DLRD [25] algorithm achieved impressive results, especially when corruption exists. Jiang et al. proposed a sparse and dense hybrid representation based on a supervised low-rank dictionary decomposition to learn a class-specific dictionary and erase non-class-specific information [26]. Furthermore, supervised information has been well utilized to seek a more discriminative dictionary [11,17]. In the introduced model, supervised information is also adopted to learn multiple sub-dictionaries so that samples from the same class are drawn from one low-dimensional subspace.
Above-mentioned sparse representation based methods consider each sample as an independent sparse linear combination, this assumption fails to exploit the spatial consistency between neighboring samples. Recent research efforts have yielded more promising results on the task of classification by using the idea of locality [27]. The method named Local Coordinate Coding (LCC), which specifically encourages the coding to rely on local structure, has been presented as a modification to sparse coding. In [27] the author also theoretically proved that locality is more essential than sparsity under certain assumptions. Inspired by the above learning techniques, Wang et al. proposed a Locality-Constrained Low-Rank Dictionary Learning (LC-LRD) to enhance the identification capability by using the geometric structure information [28].
Given a set of training data X=[X1,X2,…,Xc]∈Rd×n, where d is the feature dimensionality, n is the number of total training samples, c is the number of classes, and Xi∈Rd×ni is the sample from class i which has ni samples. The goal of dictionary learning is to learn an m atoms dictionary D∈Rd×m which yields a sparse representation matrix A∈Rm×n from X for future classification tasks. Then we can write X=DA+E, where E is the sparse noise matrix. Rather than learning the dictionary as a whole from all the training samples, we learn sub-dictionary Di for the ith class separately. Then A and D could be written as A=[A1,A2,…,Ac] and D=[D1,D2,…,Dc], where Ai is the sub-matrix that denotes the coefficients for Xi over D.
In [28], Wang et al. have proposed the following LC-LRD model for each sub-dictionary:
minDi,Ai,EiR(Di,Ai)+α‖Di‖⁎+β‖Ei‖1+λni∑k=1‖li,k⊙ai,k‖2, s.t. Xi=DAi+Ei,
where R(Di,Ai) is the Fisher discriminant regularization on each sub-dictionary, ‖Di‖⁎ is the nuclear norm to enforce low-rank properties, and ‖li,k⊙ai,k‖2 is a locality constraint to replace sparsity on the coding coefficient matrix; ai,k denotes the kth column in Ai, which means the coefficient for the kth sample in class i. This model was broken down into the following modules: discriminative sub-dictionaries, low-rank regularization term, and the locality constraint on the coding coefficients.
Sub-dictionary Di should be endowed with the discrimination power to well represent samples from the ith class. Mathematically, the coding coefficients of Xi over D can be written as Ai=[A1i;A2i;…;Aci], where Aji is the coefficient matrix of Xi over Dj. The discerning power of Di is produced by following two aspects: first, it is expected that Xi should be well represented by Di but not by Dj, j≠i. Therefore, we will have to minimize ‖Xi−DiAii−Ei‖2F, where Ei is the residual. Meanwhile, Di should not be good at representing samples from other classes, that is, each Aij, where j≠i, should have nearly zero coefficients so that ‖DiAij‖2F is as small as possible. Thus we denote the discriminative fidelity term for sub-dictionary Di as follows:
R(Di,Ai)=‖Xi−DiAii−Ei‖2F+c∑j=1,j≠i‖DiAij‖2F.
In the task of image classification, the within-class samples are linearly correlated and lie in a low dimensional manifold. Therefore, we want to find the dictionary with the most concise atoms by minimizing the rank of Di, which suggests replacing it by ‖Di‖⁎ [31], where ‖⋅‖⁎ denotes the nuclear norm of a matrix (i.e., the sum of its singular values).
In addition, a locality constraint is deployed on the coefficient matrix instead of the sparsity constraint. As suggested by LCC [32], locality is more essential than sparsity under certain assumptions, as locality must lead to sparsity but not necessary vice versa. Specifically, the locality constraint uses the following criterion:
minAn∑i=1‖li⊙ai‖2, s.t. 1Tai=1,∀i,
where ⊙ denotes the element-wise multiplication, and li∈Rm is the locality adaptor which gives different freedom for each basis vector proportional to its similarity to the input sample xi. Specifically,
li=exp(dist(xi,D)δ),
where dist(xi,D)=[dist(xi,d1),dist(xi,d2),…,dist(xi,dm)]T, and dist(xi,dj) is the Euclidean distance between sample xi and the jth dictionary atom dj; δ controls the bandwidth of the distribution.
Generally speaking, LC-LRD is based on the following three observations: (i) locality is more essential than sparsity to ensure obtain the similar representations for similar samples; (ii) each sub-dictionary should have discerning ability by introducing the discriminative term; (iii) low-rank is introduced to each sub-dictionary to separate noise from samples and discover the latent structure.
Consider a vector input x∈Rd, with d as the dimensionality of the visual descriptor. There are two important transformations, which can be considered as encoder and decoder processes involved in the auto-encoder, namely “input→hidden units” and “hidden units→output”, which are given by
h=σ(Wx+bh),ˆx=σ(WTh+bo),
where h∈Rz is the hidden representation unit, and ˆx∈Rd is interpreted as a reconstruction of input x. The parameter set includes a weight matrix W∈Rz×d, and two offset vectors bh∈Rz and bo∈Rd for hidden units and output, respectively; σ is a nonlinear mapping such as the sigmoid function of the form σ(x)=(1+e−x)−1. In general, an auto-encoder is a single layer hidden neural network, with identical input and target, meaning the auto-encoder encourages the output to be as similar to the target as possible, namely,
minW,bh,boL(x)=minW,bh,bo12nn∑i=1‖xi−ˆxi‖22,
where n is the number of images, xi is the target, and ˆxi is the reconstructed input. In this way, the neurons in the hidden layer can be seen as a good representation for the input, since they are able to reconstruct the data.
Since an auto-encoder deploys nonlinear functions, it takes more time to train the model, especially when the dimension of the data is very high. Recently, marginalized denoising auto-encoder (mDA) [20] was developed to address the data reconstruction in a linear fashion and achieved comparable performance with the original auto-encoder. The general idea of mDA is to learn a linear transformation matrix W to reconstruct the data with the transformation matrix by minimizing the squared reconstruction loss
12nn∑i=1‖xi−W˜xi‖2,
where ˜xi is a corrupted version of xi. The above objective solution is correlated to the randomly corrupted features of each input. To make the variance lower, a marginal denoising auto-encoder was proposed to minimize the overall squared loss from t different corrupted versions
12tnt∑j=1n∑i=1‖xi−W˜xi,(j)‖2,
where ˜xi,(j) denotes the jth corrupted version of the original input xi. Define X=[x1,…,xn] and its t-times repeated version as ‾X=[X,…,X] with its t-times differently corrupted version ˜X=[˜X(1),…,˜X(t)], where ˜X(i) denotes the ith corrupted version of X. In this way, Eq. (7.12) can be reformulated as
12tn‖‾X−W˜X‖2F,
which has the well-known closed-form solution for ordinary least squares. When t→∞, it can be solved by the expectation, using the weak law of large numbers [20].
Previous discussion on mDA gives a brief idea that, with a linear transformation matrix, mDA can be implemented in several lines of Matlab code and works very efficiently. The learned transformation matrix can well reconstruct the data and extract the noisy data.
Inspired by this, the proposed model aims at jointly learning a dictionary and a marginalized denoising transformation matrix in a unified framework. We formulate the objective function as
minDi,Ai,Ei,WF(Di,Ai,Ei)+‖‾X−W˜X‖2F, s.t.WXi=DAi+Ei
where F(Di,Ai,Ei)=R(Di,Ai)+α‖Di‖⁎+β1‖Ei‖1+λ∑nik=1‖li,k⊙ai,k‖2 is the locality-constrained dictionary learning part in Eq. (7.5) and R(Di,Ai)=‖WXi−DiAii−Ei‖2F+∑cj=1,j≠i‖DiAij‖2F is the discriminative term in Eq. (7.6); α, β1, and λ are trade-off parameters.
Discussion. The proposed marginalized denoising regularized dictionary learning (MDDL) aims to learn a more discriminative dictionary on transformed data. Since the marginalized denoising regularizer could generate a better transformation matrix to address corrupted data, the dictionary could be learned on denoised clean data. The framework unifies the marginal denoising auto-encoder and locality-constrained dictionary learning together. Generally, dictionary learning seeks a well-represented basis in order to achieve more discriminative coefficients for original data. Therefore, dictionary learning can handle noisy data to some extent, while the denoising auto-encoder has demonstrated its power in many applications. To this end, the joint learning scheme can benefit from both the marginal denoising auto-encoder and locality-constrained dictionary learning.
The proposed objective function in Eq. (7.14) could be optimized by dividing into two sub-problems: first, updating one by one each coefficient Ai(i=1,2,…,c) and W, by fixing dictionary D and all other Aj(j≠i), and then putting together to get the coding coefficient matrix A; second, updating Di by fixing other variables. These two steps are iteratively repeated to get the discriminative low-rank sub-dictionaries D, the marginal denoising transformation W, and the locality-constrained coefficients A. One problem arises in the second sub-problem, though. Recall that in Eq. (7.6) the coefficients Aii corresponding to Xi over Di should be updated to minimize the term ‖Xi−DiAii−Ei‖2F. Therefore, when we update Di in the second sub-problem, the related variance Aii is also updated.
Sub-problem I. Assume that the structured dictionary D is given, the coefficient matrices Ai(i=1,2,…,c) are updated one by one, then the original objective function in Eq. (7.14) reduces to the following locality-constrained coding problem for the coefficient of each class and W:
minAi,Ei,Wλni∑k=1‖li,k⊙ai,k‖2+β1‖Ei‖1+‖‾X−W˜X‖2F s.t. WXi=DAi+Ei,
which can be solved by the following Augmented Lagrange Multiplier method [33]. We transform Eq. (7.15) into its Lagrange function as follows:
minAi,Ei,W,T1c∑i=1(λni∑k=1‖li,k⊙ai,k‖2+β1‖Ei‖1+〈T1,WXi−DAi−Ei〉+μ2‖WXi−DAi−Ei‖2F)+‖‾X−W˜X‖2F,
where T1 is the Lagrange multiplier, and μ is a positive penalty parameter. Different from traditional locality-constrained linear coding (LLC) [27], MDDL adds an error term which could handle large noise in samples. In the following, we perform iterative optimization on Ai, Ei, and W.
Updating Ai:
Ai=argminAiμ2‖Zi−DAi‖2F+λni∑k=1‖li,k⊙ai,k‖2,
⇒Ai=LLC(Zi,D,λ,δ),
where Zi=WXi−Ei+T1μ, and li,k=exp(dist(zi,k,D)/δ). Function LLC(⋅) is a locality-constrained linear coding function2 [27].
Updating Ei:
Ei=argminEiβ1μ‖Ei‖1+12‖Ei−(WXi−DAi+T1μ)‖2F,
which can be solved by the shrinkage operator [34].
Updating W:
W=argminWc∑i=1(μ2‖WXi−DAi−Ei+T1μ‖2F)W=+‖‾X−W˜X‖2FW=argminWμ2‖WX−DA‖2F+‖‾X−W˜X‖2F,
where X=[X1,…,Xc] and DA=[DA1+E1−T1,1μ,…,DAc+Ec−T1,cμ]. Equation (7.19) has a well-known closed-form solution
W=(μDAXT+2‾X˜XT)(μXXT+2˜X˜XT)−1
where ‾X is the t-times repeated version of X and ˜X consists of its t-times corrupted version. We define P=μDAXT+2‾X˜XT and Q=μXXT+2˜X˜XT. And we would like the repetition number t to be ∞. Therefore, the denoising transformation W could be effectively learned from infinitely many copies of noisy data. Practically, we cannot generate ˜X with infinitely many corrupted versions; however, matrices P and Q converge to their expectations when t becomes very large. In this way, we can derive the expected values of P and Q, and calculate the corresponding mapping W as:
W=E[P]E[Q]−1W=E[μDAXT+2‾X˜XT]E[μXXT+2˜X˜XT]−1W=(μDAXT+2E[‾X˜XT])(μXXT+2E[˜X˜XT])−1
where DA and μ are treated as constant values when optimizing W. The expectations E[‾X˜XT] and E[˜X˜XT] are easy to be computed through mDA [20].
Sub-problem II. For the procedure of sub-dictionary updating, MDDL uses the same method as in [25]. Considering the second sub-problem, when Ai is fixed, sub-dictionaries Di(i=1,2,...,c) are updated one by one. The objective function in Eq. (7.14) is converted into the following problem:
minDi,Ei,Aiic∑j=1,j≠i‖DiAij‖2F+α‖Di‖⁎+β2‖Ei‖1+λni∑k=1‖lii,k⊙aii,k‖2, s.t. WXi=DiAii+Ei
where aii,k is the kth column in Aii, which means the coefficient for the kth sample in class i over Di. Problem Eq. (7.22) can be solved using the Augmented Lagrange Multiplier method [33] by introducing a relaxing variable J:
minDi,Ei,Aiiλni∑k=1‖lii,k⊙aii,k‖2+c∑j=1,j≠i‖DiAij‖2F+α‖J‖⁎+β2‖Ei‖1+〈T2,WXi−DiAii−Ei〉+〈T3,Di−J〉+μ2(‖WXi−DiAii−Ei‖2F+‖Di−J‖2F),
where T2 and T3 are Lagrange multipliers, and μ is a positive penalty parameter. In the following, we describe the iterative optimization of Di and Aii.
Updating Aii:
Similar as for Eq. (7.17), we have the solution for Aii as follows:
Aii=LLC((WXi−Ei+T2μ),Di,λ,δ),
where function LLC(⋅) is a locality-constrained linear coding function [27].
Updating J and Di:
Here we convert Eq. (7.23) to a problem for J and Di as:
minJ,Di∑cj=1,j≠i‖DiAij‖2F+α‖J‖⁎+〈T2,WXi−DiAii−Ei〉+〈T3,Di−J〉+μ2(‖Di−J‖2F+‖WXi−DiAii−Ei‖2F),
where J=argminJα‖J‖⁎+〈T3,Di−J〉+μ2(‖Di−J‖2F), and the solution for Di is:
Di=(J+WXiAiiT−EiAiiT+(T2AiiT−T3)/μ)(I+AiiAiiT+V)−1,whereV=2λμc∑j=1,j≠iAijAijT;
Updating Ei:
Ei=argminEiβ2‖Ei‖1+〈T2,WXi−DiAii−Ei〉+μ2‖WXi−DiAii−Ei‖2F,
which can be solved by the shrinkage operator [34]. The details of the optimization problem solution for the proposed model can be referred to as Algorithm 7.1.
MDDL uses a linear classifier for classification. After the dictionary is learned, the locality-constrained coefficients A of training data X and Atest of test data Xtest are calculated. The representation ai for test sample i is the ith column vector in Atest. The multivariate ridge regression model [35] was used to obtain a linear classifier ˆP:
ˆP=argminP‖L−PA‖2F+γ‖P‖2F
where L is the class label matrix. This yields ˆP=LAT(AAT+γI)−1. When testing points Atest come in, we first compute ˆPAtest. Then the label for sample i is assigned by the position corresponding to the largest value in the label vector, that is, label=argmaxlabel(ˆPai).
To verify the effectiveness and generality of the introduced MDDL, we show experiments conducted on various visual classification applications in this section. The method is tested on five datasets, including three face datasets: ORL [36], Extend YaleB [37], and CMU PIE [38], one object categorization dataset COIL-100 [39], and digits recognition dataset MNIST [40].
We show the experiments in comparison with LDA [41], linear regression classification (LRC) [42] and several latest dictionary learning based classification methods, i.e., FDDL [11], DLRD [25], D2L2R2 [43], DPL [44], and LC-LRD [28]. Moreover, for verifying the advantage of joint learning, a simple combination framework was proposed as a baseline, named as AE+DL, which first uses a traditional SAE to learn a new representation, then feeds in LC-LRD framework [28].
Parameter selection. The number of atoms in a sub-dictionary, which is denoted as mi, is one of the most important parameters in most of dictionary learning algorithms. We conduct the experiment on Extended YaleB with different number of dictionary atoms mi and analyze its effect on the performance of MDDL model and other competitors. Fig. 7.2 shows that all comparisons obtain better performance with larger dictionary size. In the experiments, the number of the dictionary columns of each class is fixed as the training size for ORL, Extend YaleB, AR and COIL-100 datasets, while it is fixed as 30 for CMU PIE and MNIST datasets. All the dictionaries are initialized with PCA on input data.
There are five parameters in Algorithm 7.1: α, λ, δ along with β1, β2 as two error term parameters, respectively, for updating dictionary and coefficients. These five are associated with the dictionary learning part in MDDL and are chosen by 5-fold cross validation. Experiments show that β1 and β2 play more important roles than the other parameters; therefore, the other parameters are set as α=1, λ=1 and δ=1. For Extended YaleB, β1=15, β2=100; for ORL, β1=5, β2=50; for CMU PIE, β1=5, β2=1.5; for COIL-100, β1=3, β2=150; for MNIST, β1=2.5, β2=2.5.
ORL Face Database. The ORL dataset contains 400 images of 40 individuals, so that there are 10 images for each subject with varying pose and illumination. The subjects of the images are in frontal and upright posture while the background is dark and uniform. The images are resized to 32×32, converted to gray scale, normalized and the pixels are concatenated to form a vector. Each image is manually corrupted by a randomly located and unrelated block image. Fig. 7.3B shows four examples of images with increasing block corruptions. For each subject, 5 samples are selected for training and the rest for testing, and the experiment is repeated on 10 random splits for evaluation. Furthermore, SIFT and Gabor filter features are extracted to evaluate MDDL generality.
We illustrate the recognition rates under different percentages of occlusion in Table 7.1. From the table, we can observe two phenomena: first, locality-constrained dictionary learning based methods achieve the best results for all the settings; second, the MDDL model performs best when the data is clean; however, along with the percentage of occlusion increase, MDDL drops behind with LC-LRD. This makes sense because in this experiment occlusion noise is added on to the images, while the denoising auto-encoder module in MDDL is introduced to tackle Gaussian noise. In conclusion, first, MDDL can achieve top results in a no occlusion situation because of the locality term; second, in a larger occlusion situation, the low-rank term outweighs DAEs.
Table 7.1
Average recognition rate (%) of different algorithms on ORL dataset for various occlusion percentage (%). Red denotes the best results, while blue means the second best results. (For interpretation of the colors in the tables, the reader is referred to the web version of this chapter)
Occlusion | LDA [41] | LRC [42] | FDDL [11] | DLRD [25] | D2L2R2 [43] |
---|---|---|---|---|---|
0 | 92.5±1.8 | 91.8±1.6 | 96.0±1.2 | 93.5±1.5 | 93.9±1.8 |
0 (SIFT) | 95.8±1.3 | 92.9±2.1 | 95.2±1.3 | 93.7±1.3 | 93.9±1.5 |
0 (Gabor) | 89.0±3.3 | 93.4±1.7 | 96.0±1.3 | 96.3±1.3 | 96.6±1.3 |
10 | 71.7±3.2 | 82.2±2.2 | 86.6±1.9 | 91.3±1.9 | 91.0±1.9 |
20 | 54.3±2.0 | 71.3±2.8 | 75.3±3.4 | 82.8±3.0 | 82.8±3.3 |
30 | 40.5±3.7 | 63.7±3.1 | 63.8±2.7 | 78.9±3.1 | 78.8±3.3 |
40 | 25.7±2.5 | 48.0±3.0 | 48.1±2.4 | 67.3±3.2 | 67.4±3.4 |
50 | 20.7±3.0 | 40.9±3.7 | 36.7±1.2 | 58.6±3.1 | |
Occlusion | DPL [44] | LC-LRD [28] | AE+DL [1] | MDDL [1] |
---|---|---|---|---|
0 | 94.1±1.8 | 96.2±1.2 | ||
0 (SIFT) | 95.3±1.4 | 93.5±2.6 | ||
0 (Gabor) | 97.0±1.4 | 94.6±1.8 | ||
10 | 84.5±2.7 | 91.5±1.2 | ||
20 | 71.2±1.7 | 83.6±1.8 | ||
30 | 59.8±3.9 | 78.0±2.4 | ||
40 | 43.0±2.9 | 67.3±2.6 | ||
50 | 32.2±3.5 | 58.7±3.2 | 58.5±3.0 |
The effects from two parameters of the error term, β1 and β2, are demonstrated in Fig. 7.4. From the six sub-figures under increasing percentage of corrupted pixels, parameter β1 in the coefficient updating procedure makes a larger impact. As more occlusion is applied, the best result appears when the parameter β1 is smaller, which means the error term plays a more important role when noise exists.
The results show that MDDL introduces significant improvement on some datasets, and for some other datasets, its significance increases along with the noise level in the input data.
Extended YaleB Dataset. The Extended Yale Face Database B contains 2414 frontal-face images from 38 human subjects captured under various laboratory-controlled illumination conditions. The size of image is cropped to 32×32. Two experiments are deployed on this dataset. First, random subsets with p(=5,10,…,40) images per individual taken with labels are chosen to form the training set, and the rest of the dataset is considered to be the testing set. For each given p, there are 10 random splits. Second, a certain percentage of randomly selected pixels from the images are replaced with pixel value of 255 (shown in Fig. 7.3C). Then we randomly take 30 images as training samples, with the rest reserved for testing, and the experiment is repeated 10 times. The experimental results are given in Tables 7.2 and 7.3, respectively.
Table 7.2
Average recognition rate (%) of different algorithms on Extended YaleB dataset with different number of training samples per class. Red denotes the best results, while blue means the second best results
Training | 5 | 10 | 20 | 30 | 40 |
---|---|---|---|---|---|
LDA [41] | 74.1±1.5 | 86.7±0.9 | 90.6±1.1 | 86.8±0.9 | 95.3±0.8 |
LRC [42] | 60.2±2.0 | 83.00±0.8 | 91.8±1.0 | 94.6±0.6 | 96.1±0.6 |
FDDL [11] | 77.8±1.3 | 91.2±0.9 | 96.2±0.7 | 97.9±0.3 | 98.8±0.5 |
DLRD [25] | 76.2±1.2 | 89.9±0.9 | 96.0±0.8 | 97.9±0.5 | 98.8±0.4 |
D2L2R2[43] | 76.0±1.2 | 89.6±0.9 | 96.0±0.9 | 97.9±0.4 | 98.1±0.4 |
DPL [44] | 75.2±1.9 | 89.3±0.6 | 95.7±0.9 | 97.8±0.4 | 98.7±0.4 |
LC-LRD [28] | 78.6±1.2 | 92.1±0.9 | |||
AE+DL [1] | 96.6±0.9 | 98.6±0.5 | 99.2±0.4 | ||
MDDL [1] |
Table 7.3
Average recognition rate (%) of different algorithms on Extended YaleB dataset with various corruption percentage (%). Red denotes the best results, while blue means the second best results
Corruption | 0 | 5 | 10 | 15 | 20 |
---|---|---|---|---|---|
LDA [41] | 86.8±0.9 | 29.0±0.8 | 18.5±1.2 | 13.6±0.5 | 11.3±0.5 |
LRC [42] | 94.6±0.6 | 80.5±1.1 | 67.6±1.3 | 56.8±1.2 | 47.2±1.6 |
FDDL [11] | 97.9±0.4 | 63.6±0.9 | 44.7±1.2 | 32.7±1.0 | 25.3±0.4 |
DLRD [25] | 97.9±0.5 | 91.8±1.1 | 85.8±1.5 | 80.9±1.4 | 73.6±1.6 |
D2L2R2[43] | 97.8±0.4 | 91.9±1.1 | 85.7±1.5 | 80.5±1.6 | 73.6±1.5 |
DPL [44] | 97.8±0.4 | 78.3±1.2 | 64.6±1.1 | 53.8±0.9 | 44.9±1.4 |
LC-LRD [28] | 87.0±0.9 | 74.1±1.0 | |||
AE+DL [1] | 98.6±0.5 | 93.2±0.7 | 81.6±0.9 | ||
MDDL [1] |
We can observe from Table 7.2 that in different training size settings three locality-constrained based methods (LC-LRD, AE+DL, MDDL) archive the best accuracy, and the proposed MDDL performs the best. MDDL's robustness to noise is demonstrated in Table 7.3. As the percentage of corruption increases, MDDL still produces the best recognition results. The performance of LDA, as well as LRC, FDDL and DPL, drops rapidly with larger corruption, while LC-LRD, MDDL, D2L2R2 and DLRD can still get much better recognition accuracy. This demonstrates the effectiveness of low-rank regularization and the error term when noise exists. LC-LRD and simple AE+DL equally match in different situations, while MDDL constantly performs the best.
CMU PIE Dataset. The CMU PIE dataset contains a total of 41,368 face images from 68 people, each with 13 different poses, 4 different expressions, and 43 different illumination conditions. Two experiments are deployed on two subsets of CMU PIE. First of all, five near frontal poses (C05, C07, C09, C27, C29) are selected as a first subset of PIE, and all the images under different illuminations and expressions (11,554 samples in total) are used. Thus, there are about 170 images for each person, and each image is normalized to have size of 32×32 pixels; 60 images per person are selected for training. Secondly, a relatively large-scale dataset is built by choosing more poses, which contains 24,245 samples in total. Overall, there are around 360 images for each person, and each image is normalized to have size of 32×32 pixels. The training set is constructed by randomly selecting 200 images per person, while the rest is used for evaluation. Table 7.4 shows that MDDL achieves good results and outperforms the compared methods.
Table 7.4
Methods | CMU (near frontal poses) | CMU (all poses) |
---|---|---|
LRC [42] | 4.12 | 9.65 |
FDDL [11] | 3.30 | 11.20 |
DLRD [25] | 3.33 | 10.64 |
D2L2R2[43] | 3.29 | 10.14 |
DPL [44] | 3.47 | 9.30 |
LC-LRD [28] | 3.01 | 8.98 |
MDDL [1] | 2.74 | 7.64 |
COIL-100 Dataset. In this section, MDDL is evaluated on object categorization by using the COIL-100 dataset. The training set is constructed by randomly selecting 10 images per object, and the testing set contains the rest of the images. This random selection is repeated 10 times, and the average results of all the compared methods are reported. To evaluate the scalability of different methods, the experiment separately utilizes images of 20, 40, 60, 80 and 100 objects from the dataset. Table 7.5 shows the average recognition rates with standard deviations of all compared methods. The results show that MDDL could not only work on face recognition but also on object categorization.
Table 7.5
Average recognition rate (%) with standard deviations of different algorithms on COIL-100 dataset with different number of classes. Red denotes the best results, while blue means the second best results
Class No. | 20 | 40 | 60 | 80 | 100 |
---|---|---|---|---|---|
LDA [41] | 81.9±1.2 | 76.7±0.3 | 66.2±1.0 | 59.2±0.7 | 52.5±0.5 |
LRC [42] | 90.7±0.7 | 89.0±0.5 | 86.6±0.4 | 85.1±0.3 | 83.2±0.6 |
FDDL [11] | 85.7±0.8 | 82.1±0.4 | 77.2±0.7 | 74.8±0.6 | 73.6±0.6 |
DLRD [25] | 88.6±1.0 | 86.4±0.5 | 83.5±0.1 | 81.5±0.5 | 79.9±0.6 |
D2L2R2[43] | 91.0±0.4 | 88.3±0.4 | 86.4±0.5 | 84.7±0.5 | 83.1±0.4 |
DPL [44] | 87.6±1.3 | 85.1±0.2 | 81.2±0.2 | 78.8±0.9 | 76.3±0.9 |
LC-LRD [28] | 87.1±0.7 | ||||
AE+DL [1] | 91.3±0.5 | 89.1±0.7 | 85.1±0.5 | 84.1±0.4 | |
MDDL [1] |
MNIST Dataset. This section tests MDDL on the subset of MNIST handwritten digit dataset downloaded from CAD website, which includes first 2000 training images and first 2000 test images, with the size of each digit image being 16×16. This experimental setting follows [43], and the experiments get consistent results. The recognition rates and training/testing time by different algorithms on MNIST dataset are summarized in Table 7.6. MDDL achieves the highest accuracy relative to its competitors. Compared within LC-LRD, MDDL costs only slightly more computational time thanks to the easy updating of marginalized auto-encoder.
Table 7.6
Average recognition rate (%) & running time (s) on MNIST dataset
Methods | Accuracy | Training time | Testing time |
---|---|---|---|
LDA [41] | 77.45 | 0.164 | 0.545 |
LRC [42] | 82.70 | 227.192 | – |
FDDL [11] | 85.35 | 240.116 | 97.841 |
DLRD [25] | 86.05 | 156.575 | 48.373 |
D2L2R2[43] | 84.65 | 203.375 | 48.154 |
DPL [44] | 84.65 | 1.773 | 0.847 |
LC-LRD [28] | 88.25 | 80.581 | 48.970 |
AE+DL [1] | 87.95 | 176.525 | 49.230 |
MDDL [1] | 89.75 | 81.042 | 49.781 |
Another experiment setting is conducted on this dataset to evaluate the effect from denoising auto-encoders. All the training and testing images in MNIST are corrupted with additive white Gaussian noise having signal-to-noise ratio (snr) from 50 to 1 dB (shown in Fig. 7.3A). Fig. 7.5 illustrates the recognition rate curves on 8 noised version of datasets. On the X-axis, we show the noise ratio used in input reconstruction process in DAEs, where close to 1 means more noise added, and 0 means no DAEs evolved. From the figure, we can observe that, with the increasing noise added in the datasets (50 to 1 dB), the highest recognition rate appears when the noise parameter gets larger (from nearly 0.004 for 50 dB to nearly 0.1 for 1 dB). In other words, denoising auto-encoders play a more important role when the dataset contains more noise.
To verify if the improvement from MDDL is statistically significant, a significance test (t-test) is further conducted for the results shown in Fig. 7.6. A significance level of 0.05 was used, that is, when the p-value is less than 0.05, the performance difference of two methods is statistically significant. The p-values of MDDL and other competitors are listed in Fig. 7.6. Since we do −log(p) processing, the comparison shows that MDDL outperforms the others significantly if the values are greater than −log(0.05). The results show that MDDL introduces significant improvement on COIL-100 dataset, and for Extended YaleB dataset, the significance increases along with the noise level in the input data.
In this section, we introduced an efficient marginalized denoising dictionary learning (MDDL) framework with a locality constraint. The proposed algorithm was designed to take advantage of two feature learning schemes, dictionary learning and auto-encoder. Specifically, MDDL adopted a lite version of the auto-encoder to seek a denoising transformation matrix. Then, dictionary learning with a locality constraint was built on the transformed data. These two strategies were iteratively optimized so that a marginalized denoising transformation and a locality-constrained dictionary were jointly learned. Experiments on several image datasets, e.g., face, object, digits, demonstrated the superiority of our proposed algorithm by comparing with other existing dictionary algorithms.
In this chapter, we described MDDL model with a desire to seek a transformation matrix to filter out the noise inside the data with a marginalized denoising auto-encoder. However, one problem arises with the auto-encoder representation architecture, which forces the network to learn an approximation to the identity by encouraging the output to be as similar to the input as possible. This scheme leads to the problem that the majority of the learned high-level features may be blindly used to compresses not only the discriminative information but also lots of redundancies or even noise in data. In the following training procedure, it is unreasonable to endow the discriminablity to this kind of task-irrelevant units.
In the future, we plan to explore the auto-encoder high-level features further to extract the discriminative from the task-irrelevant ones. By compressing more task-relevant information on the hidden units, we hope the dictionary learning module could exploit more discriminative information and learn a more compact and pure dictionary. Furthermore, we plan to explore multi-view data classification where the auto-encoder could serve as a domain adaptation to learn a latent feature space for cross-domain datasets. We believe that low-rank and locality term could also play an important role in cross-domain applications.
We investigate the ℓ∞-constrained representation, which demonstrates robustness to quantization errors, utilizing the tool of deep learning. Based on the Alternating Direction Method of Multipliers (ADMM), we formulate the original convex minimization problem as a feed-forward neural network, named Deep ℓ∞ Encoder, by introducing a novel Bounded Linear Unit (BLU) neuron and modeling the Lagrange multipliers as network biases. Such a structural prior acts as an effective network regularization, and facilitates model initialization. We then investigate the effective use of the proposed model in the application of hashing, by coupling the proposed encoders under a supervised pairwise loss, to develop a Deep Siamese ℓ∞ Network, which can be optimized from end to end. Extensive experiments demonstrate the impressive performance of the proposed model. We also provide an in-depth analysis of its behaviors against the competitors.
While ℓ0 and ℓ1 regularizations have been well-known and successfully applied in sparse signal approximations, utilizing the ℓ∞ norm to regularize signal representations has been less explored. In this section, we are particularly interested in the following ℓ∞-constrained least squares problem:
minx||Dx−y||22s.t.||x||∞≤λ,
where y∈Rn×1 denotes the input signal, D∈Rn×N the (overcomplete) the basis (often called frame or dictionary) with N<n, and x∈RN×1 the learned representation. Further, the maximum absolute magnitude of x is bounded by a positive constant λ, so that each entry of x has the smallest dynamic range [45]. As a result, model (7.29) tends to spread the information of y approximately evenly among the coefficients of x. Thus, x is called “democratic” [46] or “anti-sparse” [47], as all of its entries are of approximately the same importance.
In practice, x usually has most entries reaching the same absolute maximum magnitude [46], therefore resembling an antipodal signal in an N-dimensional Hamming space. Furthermore, the solution x to (7.29) withstands errors in a very powerful way: the representation error gets bounded by the average, rather than the sum, of the errors in the coefficients. These errors may be of arbitrary nature, including distortion (e.g., quantization) and losses (e.g., transmission failure). This property was quantitatively established in Section II.C of [45]:
In the case of N≪n, the above will yield great robustness for the solution to (7.29) with respect to noise, in particular, quantization errors. Also note that its error bound will not grow with the input dimensionality n, a highly desirable stability property for high-dimensional data. Therefore, (7.29) appears to be favorable for the applications such as vector quantization, hashing and approximate nearest neighbor search.
In this section, we investigate (7.29) in the context of deep learning. Based on the Alternating Direction Methods of Multipliers (ADMM) algorithm, we formulate (7.29) as a feed-forward neural network [48], called Deep ℓ∞ Encoder, by introducing a novel Bounded Linear Unit (BLU) neuron and modeling the Lagrange multipliers as network biases. The major technical merit to be presented is how a specific optimization model (7.29) could be translated to designing a task-specific deep model, which displays the desired quantization-robust property. We then study its application in hashing, by developing a Deep Siamese ℓ∞ Network that couples the proposed encoders under a supervised pairwise loss, which could be optimized from end to end. Impressive performances are observed in our experiments.
Similar to the case of ℓ0/ℓ1 sparse approximation problems, solving (7.29) and its variants (e.g., [46]) relies on iterative solutions. The authors of [49] proposed an active set strategy similar to that of [50]. In [51], the authors investigated a primal–dual path-following interior-point method. Albeit effective, the iterative approximation algorithms suffer from their inherently sequential structures, as well as the data-dependent complexity and latency, which often constitute a major bottleneck in the computational efficiency. In addition, the joint optimization of the (unsupervised) feature learning and the supervised steps has to rely on solving complex bi-level optimization problems [52]. Further, to effectively represent datasets of growing sizes, larger dictionaries D are usually needed. Since the inference complexity of those iterative algorithms increases more than linearly with respect to the dictionary size [53], their scalability turns out to be limited. Last but not least, while the hyperparameter λ sometimes has physical interpretations, e.g., for signal quantization and compression, it remains unclear how to set or adjust it for many application cases.
Deep learning has recently attracted great attention [54]. The advantages of deep learning lie in its composition using multiple nonlinear transformations to yield more abstract and descriptive embedding representations. The feed-forward networks could be naturally tuned jointly with task-driven loss functions [55]. With the aid of gradient descent, it also scales linearly in time and space with the number of training samples.
There has been a blooming interest in bridging “shallow” optimization and deep learning models. In [48], a feed-forward neural network, named LISTA, was proposed to efficiently approximate the sparse codes, whose hyperparameters were learned from general regression. In [56], the authors leveraged a similar idea on fast trainable regressors and constructed feed-forward network approximations of the learned sparse models. It was later extended in [57] to develop a principled process of learned deterministic fixed-complexity pursuits, for sparse and low rank models. Lately, [55] proposed Deep ℓ0 Encoders, to model ℓ0 sparse approximation as feed-forward neural networks. The authors extended the strategy to the graph-regularized ℓ1 approximation in [58], and the dual sparsity model in [59]. Despite the above progress, to the best of our knowledge, few efforts have been made beyond sparse approximation (e.g., ℓ0/ℓ1) models.
ADMM has been popular for its remarkable effectiveness in minimizing objectives with linearly separable structures [53]. We first introduce an auxiliary variable z∈RN×1, and rewrite (7.29) as
minx,z12||Dx−y||22s.t.||z||∞≤λ,z−x=0.
The augmented Lagrangian function of (7.30) is
12||Dx−y||22+pT(z−x)+β2||z−x||22+Φλ(z).
Here p∈RN×1 is the Lagrange multiplier attached to the equality constraint, β is a positive constant (with a default value of 0.6), and Φλ(z) is the indicator function which goes to infinity when ||z||∞>λ, and is 0 otherwise. ADMM minimizes (7.31) with respect to x and z in an alternating direction manner, and updates p accordingly. It guarantees global convergence to the optimal solution to (7.29). Starting from any initialization points of x, z, and p, ADMM iteratively solves (t=0,1,2,… denotes the iteration number):
(xupdate)minxt+112||Dx−y||22−pTtx+β2||zt−x||22,
(zupdate)minzt+1β2||z−(xt+1−ptβ)||22+Φλ(z),
(pupdate)pt+1=pt+β(zt+1−xt+1).
Furthermore, both (7.32) and (7.33) enjoy closed-form solutions:
xt+1=(DTD+βI)−1(DTy+βzt+pt),
zt+1=min(max(xt+1−ptβ,−λ),λ).
The above algorithm could be categorized to the primal–dual scheme. However, discussing the ADMM algorithm in more details is beyond the focus of this section. Instead, the purpose of deriving (7.30)–(7.36) is to prepare us for the design of the task-specific deep architecture, as presented below.
We first substitute (7.35) into (7.36), in order to derive an update form explicitly dependent on only z and p:
zt+1=Bλ((DTD+βI)−1(DTy+βzt+pt)−ptβ),
where Bλ is defined as a box-constrained element-wise operator (u denotes a vector and ui is its ith element):
[Bλ(u)]i=min(max(ui,−λ),λ).
Equation (7.37) could be alternatively rewritten as
zt+1=Bλ(Wy+Szt+bt),whereW=(DTD+βI)−1DT,S=β(DTD+βI)−1,bt=[(DTD+βI)−1−1βI]pt,
and expressed as the block diagram in Fig. 7.7, which outlines a recurrent structure when solving (7.29). Note that in (7.39), while W and S are pre-computed hyperparameters shared across iterations, bt remains a variable dependent on pt, and has to be updated throughout iterations, too (bt's update block is omitted in Fig. 7.7).
By time-unfolding and truncating Fig. 7.7 to a fixed number of K iterations (K=2 by default),4 we obtain a feed-forward network structure in Fig. 7.8, named Deep ℓ∞ Encoder. Since the threshold λ is less straightforward to update, we repeat the same trick as in [55] to rewrite (7.38) as [Bλ(u)]i=λiB1(ui/λi). The original operator is thus decomposed into two linear diagonal scaling layers, plus a unit-threshold neuron; the latter is called a Bounded Linear Unit (BLU) by us. All the hyperparameters W, Sk and bk (k=1,2), as well as λ, are all to be learnt from data by backpropagation. Although the equations in (7.39) do not directly apply to solving the deep ℓ∞ encoder, they can serve as high-quality initializations.
It is crucial to notice the modeling of the Lagrange multipliers pt as the biases, and to incorporate their updates into network learning. This provides important clues on how to relate deep networks to a larger class of optimization models, whose solutions rely on dual domain methods.
Comparing BLU with existing neurons. As shown in Fig. 7.9E, BLU tends to suppress large entries while not penalizing small ones, resulting in dense, nearly antipodal representations. A first look at the plot of BLU easily reminds of the tanh neuron (Fig. 7.9A). In fact, with its output range being [−1,1] and a slope of 1 at the origin, tanh could be viewed as a smoothed differentiable approximation of BLU.
We further compare BLU with other popular and recently proposed neurons: Rectifier Linear Unit (ReLU) [54], Soft-tHresholding Linear Unit (SHeLU) [58], and Hard thrEsholding Linear Unit (HELU) [55], as depicted in Fig. 7.9B–D, respectively. Contrary to BLU and tanh, they all introduce sparsity in the outputs, and thus prove successful and outperform tanh in classification and recognition tasks. Interestingly, HELU seems to rival BLU, as it does not penalize large entries but suppresses small ones down to zero.
Rather than solving (7.29) first and then training the encoder as general regression, as [48] did, we instead concatenate encoder(s) with a task-driven loss, and optimize the pipeline from end to end. In this section, we focus on discussing its application in hashing, although the proposed model is not limited to one specific application.
Background. With the ever-growing large-scale image data on the Web, much attention has been devoted to nearest neighbor search via hashing methods [60]. For big data applications, compact bitwise representations improve the efficiency in both storage and search speed. The state-of-the-art approach, learning-based hashing, learns similarity-preserving hash functions to encode input data into binary codes. Furthermore, while earlier methods, such as linear search hashing (LSH) [60], iterative quantization (ITQ) [61] and spectral hashing (SH) [62], do not refer to any supervised information, it has been lately discovered that involving the data similarities/dissimilarities in training benefits the performance [63,64].
Prior Work. Traditional hashing pipelines first represent each input image as a (hand-crafted) visual descriptor, followed by separate projection and quantization steps to encode it into a binary code. The authors of [65] first applied the Siamese network [66] architecture to hashing, which fed two input patterns into two parameter-sharing “encoder” columns and minimized a pairwise-similarity/dissimilarity loss function between their outputs, using pairwise labels. The authors further enforced the sparsity prior on the hash codes in [67], by substituting a pair of LISTA-type encoders [48] for the pair of generic feed-forward encoders in [65] while [68,69] utilized tailored convolution networks with the aid of pairwise labels. The authors of [70] further introduced a triplet loss with a divide-and-encode strategy applied to reduce the hash code redundancy. Note that for the final training step of quantization, [67] relied on an extra hidden layer of tanh neurons to approximate binary codes, while [70] exploited a piecewise linear and discontinuous threshold function.
Our Approach. In view of its robustness to quantization noise, as well as BLU's property as a natural binarization approximation, we construct a Siamese network as in [65], and adopt a pair of parameter-sharing deep ℓ∞ encoders as the two columns. The resulting architecture, named the Deep ℓ∞ Siamese Network, is illustrated in Fig. 7.10. Assume y and y+ make a similar pair while y and y− make a dissimilar pair, and further denote x(y) the output representation by inputting y. The two coupled encoders are then optimized under the following pairwise loss (the constant m represents the margin between dissimilar pairs):
Lp:=12||x(y)−x(y+)||2−12(max(0,m−||x(y)−x(y−)||))2.
The representation is learned to make similar pairs as close as possible and dissimilar pairs to be at least a distance m away. In this section, we follow [65] to use a default m=5 for all experiments.
Once a deep ℓ∞ Siamese network is learned, we apply its encoder part (i.e., a deep ℓ∞ encoder) to a new input. The computation is extremely efficient, involving only a few matrix multiplications and element-wise thresholding operations, with a total complexity of O(nN+2N2). One can obtain an N-bit binary code by simply quantizing the output.
Implementation. The proposed networks are implemented with the CUDA ConvNet package [54]. We use a constant learning rate of 0.01 with no momentum, and a batch size of 128. Different from prior findings such as in [55,58], we discover that untying the values of S1, b1 and S2, b2 boosts the performance more than sharing them. It is not only because that more free parameters enable a larger learning capacity, but also due to the important fact that pt (and thus bk) is in essence not shared across iterations, as in (7.39) and Fig. 7.8.
While many neural networks are trained well with random initializations, it has been discovered that sometimes poor initializations can still hamper the effectiveness of first-order methods [71]. On the other hand, it is much easier to initialize our proposed models in the right regime. We first estimate the dictionary D using the standard K-SVD algorithm [72], and then inexactly solve (7.29) for up to K (K=2) iterations, via the ADMM algorithm in Sect. 7.2.2, with the values of Lagrange multiplier pt recorded for each iteration. Benefiting from the analytical correspondence relationships in (7.39), it is then straightforward to obtain high-quality initializations for W, Sk and bk (k=1,2). As a result, we could achieve a steadily decreasing curve of training errors, without performing common tricks such as annealing the learning rate, which are found to be indispensable if random initialization is applied.
Datasets. The CIFAR10 dataset [73] contains 60K labeled images of 10 different classes. The images are represented using 384-dimensional GIST descriptors [74]. Following the classical setting in [67], we used a training set of 200 images for each class, and a disjoint query set of 100 images per class. The remaining 59K images are treated as database.
NUS-WIDE [75] is a dataset containing 270K annotated images from Flickr. Every image is associated with one or more of the 81 different concepts, and is described using a 500-dimensional bag-of-features. In training and evaluation, we followed the protocol of [76]: two images were considered as neighbors if they share at least one common concept (only 21 most frequent concepts are considered). We use 100K pairs of images for training, and a query set of 100 images per concept in testing.
Comparison Methods. We compare the proposed deep ℓ∞ Siamese network to six state-of-the-art hashing methods:
Comparing the two “deep” competitors to the deep ℓ∞ Siamese network, the only difference among the three is the type of encoder adopted in each's twin columns, as listed in Table 7.7. We reimplement the encoder parts of NNH and SNNH, with three hidden layers (i.e., two unfolded stages for LISTA), so that all three deep hashing models have the same depth.6 Recalling that the input y∈Rn and the hash code x∈RN, we immediately see from (7.39) that W∈Rn×N, Sk∈RN×N, and bk∈RN. We carefully ensure that both NNHash and SparseHash have all their weight layers of the same dimensionality with ours7 for a fair comparison.
Table 7.7
Comparison of NNH, SNNH, and the proposed deep ℓ∞ Siamese network
encoder type | neuron type | structural prior on hashing codes | |
---|---|---|---|
NNH | generic | tanh | / |
SNNH | LISTA | SHeLU | sparse |
Proposed | deep ℓ∞ | BLU | nearly antipodal |
& quantization-robust |
We adopt the following classical criteria for evaluation: (i) precision and recall (PR) for different Hamming radii, and the F1 score as their harmonic average; (ii) mean average precision (MAP) [79]. Besides, for NUS-WIDE, as computing mAP is slow over this large dataset, we follow the convention of [67] to compute the mean precision (MP) of top-5K returned neighbors (MP@5K), as well as report mAP of top-10 results (mAP@10).
We have not compared with convolutional network-based hashing methods [68–70], since it is difficult to ensure their models have the same parameter capacity as our fully-connected model in controlled experiments. We also do not include triplet loss-based methods, e.g., as in [70], into comparison because they will require three parallel encoder columns.
Results and Analysis. The performance of different methods on two datasets are compared in Tables 7.8 and 7.9. Our proposed method ranks top in almost all cases, in terms of mAP/MP and precision. Even under the Hamming radius of 0, our precision result is as high as 33.30% (N=64) for CIFAR10, and 89.49% (N=256) for NUS-WIDE. The proposed method also maintains the second best in most cases, in terms of recall, inferior only to SNNH. In particular, when the hashing code dimensionality is low, e.g., when N=48 for CIFAR10, the proposed method outperforms all other competitors with a significant margin. It demonstrates the competitiveness of the proposed method in generating both compact and accurate hashing codes, achieving more precise retrieval results at lower computation and storage costs.
Table 7.8
Performance (%) of different hashing methods on the CIFAR10 dataset, with different code lengths N
Method | N | mAP | Hamming radius ⩽2 | Hamming radius =0 | ||||
---|---|---|---|---|---|---|---|---|
Prec. | Recall | F1 | Prec. | Recall | F1 | |||
KSH | 48 | 31.10 | 18.22 | 0.86 | 1.64 | 5.39 | 5.6×10−2 | 0.11 |
64 | 32.49 | 10.86 | 0.13 | 0.26 | 2.49 | 9.6×10−3 | 1.9×10−2 | |
AGH1 | 48 | 14.55 | 15.95 | 2.8×10−2 | 5.6×10−2 | 4.88 | 2.2×10−3 | 4.4×10−3 |
64 | 14.22 | 6.50 | 4.1×10−3 | 8.1×10−3 | 3.06 | 1.2×10−3 | 2.4×10−3 | |
AGH2 | 48 | 15.34 | 17.43 | 7.1×10−2 | 3.6×10−2 | 5.44 | 3.5×10−3 | 6.9×10−3 |
64 | 14.99 | 7.63 | 7.2×10−3 | 1.4×10−2 | 3.61 | 1.4×10−3 | 2.7×10−3 | |
PSH | 48 | 15.78 | 9.92 | 6.6×10−3 | 1.3×10−2 | 0.30 | 5.1×10−5 | 1.0×10−4 |
64 | 17.18 | 1.52 | 3.0×10−4 | 6.1×10−4 | 1.0×10−3 | 1.69×10−5 | 3.3×10−5 | |
LH | 48 | 13.13 | 3.0×10−3 | 1.0×10−4 | 5.1×10−5 | 1.0×10−3 | 1.7×10−5 | 3.4×10−5 |
64 | 13.07 | 1.0×10−3 | 1.7×10−5 | 3.3×10−5 | 0.00 | 0.00 | 0.00 | |
NNH | 48 | 31.21 | 34.87 | 1.81 | 3.44 | 10.02 | 9.4×10−2 | 0.19 |
64 | 35.24 | 23.23 | 0.29 | 0.57 | 5.89 | 1.4×10−2 | 2.8×10−2 | |
SNNH | 48 | 26.67 | 32.03 | 12.10 | 17.56 | 19.95 | 0.96 | 1.83 |
64 | 27.25 | 30.01 | 36.68 | 33.01 | 30.25 | 9.8 | 14.90 | |
Proposed | 48 | 31.48 | 36.89 | 12.47 | 18.41 | 24.82 | 0.94 | 1.82 |
64 | 36.76 | 38.67 | 30.28 | 33.96 | 33.30 | 8.9 | 14.05 |
Table 7.9
Performance (%) of different hashing methods on the NUS-WIDE dataset, with different code lengths N
Method | N | mAP@10 | MP@5K | Hamming radius ⩽2 | Hamming radius =0 | ||||
---|---|---|---|---|---|---|---|---|---|
Prec. | Recall | F1 | Prec. | Recall | F1 | ||||
KSH | 64 | 72.85 | 42.74 | 83.80 | 6.1×10−3 | 1.2×10−2 | 84.21 | 1.7×10−3 | 3.3×10−3 |
256 | 73.73 | 45.35 | 84.24 | 1.4×10−3 | 2.9×10−3 | 84.24 | 1.4×10−3 | 2.9×10−3 | |
AGH1 | 64 | 69.48 | 47.28 | 69.43 | 0.11 | 0.22 | 73.35 | 3.9×10−2 | 7.9×10−2 |
256 | 73.86 | 46.68 | 75.90 | 1.5×10−2 | 2.9×10−2 | 81.64 | 3.6×10−3 | 7.1×10−3 | |
AGH2 | 64 | 68.90 | 47.27 | 68.73 | 0.14 | 0.28 | 72.82 | 5.2×10−2 | 0.10 |
256 | 73.00 | 47.65 | 74.90 | 5.3×10−2 | 0.11 | 80.45 | 1.1×10−2 | 2.2×10−2 | |
PSH | 64 | 72.17 | 44.79 | 60.06 | 0.12 | 0.24 | 81.73 | 1.1×10−2 | 2.2×10−2 |
256 | 73.52 | 47.13 | 84.18 | 1.8×10−3 | 3.5×10−3 | 84.24 | 1.5×10−3 | 2.9×10−3 | |
LH | 64 | 71.33 | 41.69 | 84.26 | 1.4×10−3 | 2.9×10−3 | 84.24 | 1.4×10−3 | 2.9×10−3 |
256 | 70.73 | 39.02 | 84.24 | 1.4×10−3 | 2.9×10−3 | 84.24 | 1.4×10−3 | 2.9×10−3 | |
NNH | 64 | 76.39 | 59.76 | 75.51 | 1.59 | 3.11 | 81.24 | 0.10 | 0.20 |
256 | 78.31 | 61.21 | 83.46 | 5.8×10−2 | 0.11 | 83.94 | 4.9×10−3 | 9.8×10−3 | |
SNNH | 64 | 74.87 | 56.82 | 72.32 | 1.99 | 3.87 | 81.98 | 0.37 | 0.73 |
256 | 74.73 | 59.37 | 80.98 | 0.10 | 0.19 | 82.85 | 0.98 | 1.94 | |
Proposed | 64 | 79.89 | 63.04 | 79.95 | 1.72 | 3.38 | 86.23 | 0.30 | 0.60 |
256 | 80.02 | 65.62 | 84.63 | 7.2×10−2 | 0.15 | 89.49 | 0.57 | 1.13 |
The next observation is that, compared to the strongest competitor SNNH, the recall rates of our method seem less compelling. We plot the precision and recall curves of the three best performers (NNH, SNNH, deep l∞), with regard to the bit length of hashing codes N, within the Hamming radius of 2. Fig. 7.12 demonstrates that our method consistently outperforms both SNNH and NNH in precision. On the other hand, SNNH gains advantages in recall over the proposed method, although the margin appears vanishing as N grows.
Although it seems to be a reasonable performance tradeoff, we are curious about the behavior difference between SNNH and the proposed method. We are again reminded that they only differ in the encoder architecture, i.e., one with LISTA while the other using the deep l∞ encoder. We thus plot the learned representations and binary hashing codes of one CIFAR image, using NNH, SNNH, and the proposed method, in Fig. 7.11. By comparing the three pairs, one could see that the quantization from (A) to (B) (also (C) to (D)) suffer visible distortion and information loss. Contrary to them, the output of the deep l∞ encoder has a much smaller quantization error, as it naturally resembles an antipodal signal. Therefore, it suffers minimal information loss during the quantization step.
In view of those, we conclude the following points towards the different behaviors, between SNNH and deep l∞ encoder:
Further, it seems that the sparsity and l∞ structure priors could be complementary. We will explore it as future work.
This section investigates how to import the quantization-robust property of an ℓ∞-constrained minimization model to a specially-designed deep model. It is done by first deriving an ADMM algorithm, which is then reformulated as a feed-forward neural network. We introduce the Siamese architecture concatenated with a pairwise loss, for the application purpose of hashing. We analyze in depth the performance and behaviors of the proposed model against its competitors, and hope it will evoke more interest from the community.