Chapter 6

Kernel Subspace Learning for Pattern Classification

Yinan Yu, Konstantinos Diamantaras, Tomas McKelvey, S.Y. Kung
Chalmers University of Technology, Electrical Engineering, Gothenburg, Sweden
TEI of Thessaloniki, Thessaloniki, Greece
Princeton University, Princeton, NJ, United States

Abstract

Kernel methods are nonparametric feature extraction techniques that attempt to boost the learning capability of machine learning algorithms using nonlinear transformations. However, one major challenge in their basic form is that the computational complexity and the memory requirement do not scale well with the training size. Kernel approximation is commonly employed to resolve this issue. Essentially, kernel approximation is equivalent to learning an approximated subspace in the high-dimensional feature vector space induced and characterized by the kernel function. With streaming data acquisition, approximated subspaces can be constructed adaptively. Explicit feature vectors are then extracted by a transformation onto the approximated subspace, and linear learning techniques can subsequently be applied. From a computational point of view, operations in kernel methods can easily be parallelized, and modern infrastructures can be utilized to achieve efficient computing. Moreover, the extracted explicit feature vectors can easily be interfaced with other learning techniques.

Keywords

Kernel approximation; Subspace learning; Classification; Nyström; Spark; GPU; CUDA

Chapter Points

  •  How to use kernel methods in practice.
  •  How to control the model complexity.
  •  Kernel approximation as a subspace learning technique for feature construction.
  •  Learning criteria for adaptive kernel approximation.
  •  Infrastructure for kernel methods.

6.1 Introduction

Kernel methods are nonlinear transformation techniques that map a given input set into an implicit high-dimensional feature space by utilizing a positive-definite function called the kernel function. In practice, kernel methods are nonparametric learning techniques where the learning process is restricted to the subspace spanned by the training feature vectors. This implies that the complexity of the model grows with the training size. More specifically, given $N$ training data points, the computational complexity and storage depend on an $N \times N$ Gram matrix called the "kernel matrix" $K$, whose elements are pairwise kernel evaluations over all training data.

To reduce this complexity, kernel approximation techniques are commonly applied. There are mainly two types of approximation: the random Fourier features [1] and the Nyström methods [2–4]. Briefly speaking, random Fourier features construct a feature vector by random sampling in the frequency domain. They are computationally efficient, but do not always provide satisfying performance because they are in general not data-dependent. The other category is the Nyström methods, where the kernel approximation is posed as a subspace learning problem. That is, one tries to approximate $K$ using a low-rank matrix $\hat{K}$, with $\mathrm{rank}(\hat{K}) \ll N$. This is usually implemented by sampling a subset of the training data. The actual computational complexity of some operations, such as matrix inversion and multiplication, can then be reduced from a function of $N$ to a function of $\mathrm{rank}(\hat{K})$. In its basic version [2], Nyström approximates the subspace by random sampling. More sophisticated sampling strategies have been proposed to achieve different objectives, such as control over the model complexity [5], low approximation error [4,6], high class separability [7] and efficient memory usage [8]. These algorithms can naturally be computed in an adaptive fashion for streaming data acquisition.

There are typically two ways of representing the approximation outcome: (i) the kernel matrix, which is used in most standard kernel models; or (ii) explicit feature vectors, where one applies a subspace transformation of the feature space induced by the kernel function, which is sometimes more straightforward and explicit compared with the kernel matrix formulation. This is illustrated in Fig. 6.1. Given a training data set and a kernel function, the transformation $\phi: \mathcal{X} \to \mathbb{R}^m$ can be constructed in an iterative fashion. For a data point $x \in \mathcal{X}$, $\phi(x)$ can be used as the input to the subsequent learning modules, such as regression models, deep neural networks and support vector machines.

Figure 6.1 The transformation from input data x to the feature vector as inputs to the subsequent machine learning module.

This chapter serves as a practical guide to the utilization of subspace learning techniques applied to kernel methods.

6.2 Kernel Methods

Kernel methods [9,10] are popular feature extraction techniques that:

  •  have the ability to handle nonnumerical/nonvectorial input data;
  •  are able to deal with high-dimensional data;
  •  can be formulated as convex optimization problems;
  •  have well-established theoretical properties.

Essentially, kernel methods map input data onto a high (possibly infinite)-dimensional space for a better data representation. In this section, we introduce kernel methods as a nonparametric learning technique, where the model complexity grows with the training size, which is essentially the dimension of the feature space. We then show that the extracted features can be represented in two different ways. Although essentially equivalent, one may be preferred over the other depending on the application. Issues related to feature dimension and model complexity are then discussed, leading to the necessity of kernel approximation.

6.2.1 Notations

Given a nonempty input set $\mathcal{X}$, which could be a string of text, a collection of websites, medical records, etc., let $\mathcal{H}$ be the Reproducing Kernel Hilbert Space (RKHS) [9] associated with a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ satisfying $k(x,z) = \langle \varphi(x), \varphi(z) \rangle_{\mathcal{H}}$ for $x, z \in \mathcal{X}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denotes the inner product in $\mathcal{H}$ and $\varphi: \mathcal{X} \to \mathcal{H}$ is the associated feature mapping, with feature vectors $\varphi_i := \varphi(x_i)$, $x_i \in \mathcal{X}$. When $\mathcal{X} \subseteq \mathbb{R}^d$, $d$ denotes the dimension of the input space. In a classification setup, let $y_i$ denote the label corresponding to $x_i$, where $y_i \in \{1, \ldots, C\}$ with $C$ being the number of classes. Given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, $N \in \mathbb{N}$, and a subset $\mathcal{M} \subseteq \mathcal{D}$, the corresponding index set is denoted $I_{\mathcal{M}} = \{i : (x_i, y_i) \in \mathcal{M}\}$. Moreover, let $X_{\mathcal{M}} = [x_i]_{i \in I_{\mathcal{M}}}$ be the matrix with the $x_i$'s as columns. Given a kernel function $k$ and $\mathcal{M}, \mathcal{N} \subseteq \mathcal{D}$, denote $K_{\mathcal{M}\mathcal{N}} = k(X_{\mathcal{M}}, X_{\mathcal{N}}) = [k(x_i, x_j)]_{i \in I_{\mathcal{M}}, j \in I_{\mathcal{N}}}$ and $K_{\mathcal{M}} = [k(x_i, x_j)]_{i \in I_{\mathcal{M}}, j \in I_{\mathcal{M}}}$. In particular, we write $K = [k(x_i, x_j)]_{i \in I_{\mathcal{D}}, j \in I_{\mathcal{D}}}$.

6.2.2 Nonparametric Learning Model

There are many machine learning techniques that are closely related to kernel methods, such as Gaussian Processes (GPs) [11] and Support Vector Machines (SVMs) [12]. The focus of kernel methods is mainly on the feature representation instead of the learning objective itself. However, the feature representation is strongly tied to the learning objective via the Representer Theorem and the empirical risk given below [9].

Empirical Risk  The learning objective used in most kernel techniques is the regularized empirical risk with the following structure:

$$R_e = \mathcal{L}(f \mid \mathcal{D}) + \lambda \|f\|, \qquad (6.1)$$

where $f$ is a point in the Hilbert space, $\mathcal{L}$ is any loss function and $\lambda \|f\|$ is the regularization term.

Representer Theorem  One of the most important results in kernel methods is the Representer Theorem [9], which states that, given an empirical risk (cf. Eq. (6.1)), the optimal solution $f^\star$ lies in the subspace spanned by $\{\varphi(x_1), \ldots, \varphi(x_N)\}$, denoted by

$$\mathcal{F}_{\mathcal{D}} \triangleq \mathrm{span}(\{\varphi(x_1), \ldots, \varphi(x_N)\}). \qquad (6.2)$$

This indicates the nonparametric nature of kernel techniques. A nonparametric model does not make strong assumptions on the data structure, but the model is instead chosen to fit the training data, where regularizations on the model complexity are usually applied to achieve a reasonable generalization ability on unseen datasets. Such an approach is commonly adopted when there are a lot of training data without sufficient prior knowledge available.

6.2.3 Training With Kernel Methods

6.2.3.1 Feature Representations

The main task of kernel methods is to find a good feature representation that is adapted to the training data, which can be used as an input for the subsequent training model that, for example, tries to find an $f^\star$ that minimizes Eq. (6.1).

From Eq. (6.2), we know that the solution $f^\star$ can be found by projecting data onto $\mathcal{F}_{\mathcal{D}}$. In practice, for a given training set $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, the projection $T_{\mathcal{D}}: \mathcal{H} \to \mathcal{F}_{\mathcal{D}}$ and an empirical risk $R_e$, there are mainly two ways of representing the features: (i) by the kernel matrix or (ii) by constructing an explicit feature vector.

  (i)  Representation by inner products and the kernel matrix $K$:
  One way to represent the data is by the kernel matrix of the features projected onto $\mathcal{F}_{\mathcal{D}}$, which is computed by

$$K_{\mathcal{D}} = \left[\langle T_{\mathcal{D}}\varphi(x_i), T_{\mathcal{D}}\varphi(x_j)\rangle\right]_{1 \le i,j \le N} = \begin{bmatrix} k(x_1, x_1) & \cdots & k(x_1, x_N) \\ \vdots & \ddots & \vdots \\ k(x_N, x_1) & \cdots & k(x_N, x_N) \end{bmatrix} = K. \qquad (6.3)$$

  Eq. (6.3) shows that, by projecting the feature vectors onto the subspace spanned by the training set, the kernel matrix remains unchanged. Note that the training data can be entirely represented by $K$.
  (ii)  Representation by an explicit feature vector $\phi(x)$:
  Alternatively, instead of using an $N \times N$ Gram matrix $K$, one can construct an explicit feature map $\phi: \mathcal{X} \to \mathbb{R}^N$ to represent the data. More specifically, for any data vector $x \in \mathcal{X}$, the explicit feature vector is computed as

$$\phi(x) = A^T \begin{bmatrix} k(x_1, x) \\ \vdots \\ k(x_N, x) \end{bmatrix}, \qquad (6.4)$$

  where $A \in \mathbb{R}^{N \times N}$ is a transformation matrix such that $A^T K A = I_{N \times N}$.
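To make Eq. (6.4) concrete, one admissible (not unique) choice of $A$ comes from the eigendecomposition $K = V \Lambda V^T$, giving $A = V \Lambda^{-1/2}$. The following minimal NumPy sketch, which assumes a Gaussian kernel and synthetic data purely for illustration, checks that $A^T K A = I$ and that the explicit features reproduce the kernel values on the training points.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    # Pairwise Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2)).
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 5))              # N = 40 synthetic training points in R^5
K = gaussian_kernel(X, X)                     # N x N kernel matrix

# One choice of A satisfying A^T K A = I: eigendecomposition K = V diag(lam) V^T.
lam, V = np.linalg.eigh(K)
A = V / np.sqrt(lam)

def phi(x):
    # Explicit feature vector of Eq. (6.4): phi(x) = A^T [k(x_1, x), ..., k(x_N, x)]^T.
    return A.T @ gaussian_kernel(X, x[None, :]).ravel()

print(np.allclose(A.T @ K @ A, np.eye(len(X))))      # A^T K A = I
print(np.allclose(phi(X[0]) @ phi(X[1]), K[0, 1]))   # phi(x_i)^T phi(x_j) = k(x_i, x_j)
```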

6.2.3.2 Kernelization of Linear Models

Given a linear model, kernelization refers to introducing nonlinearity to the linear model by using the aforementioned feature representations. This transformation can be implemented using Eq. (6.3) or Eq. (6.4). These two alternative representations are essentially equivalent, in the sense that, for any $x, z \in \mathcal{X}$,

$$\phi(x)^T \phi(z) = \langle T_{\mathcal{D}}\varphi(x), T_{\mathcal{D}}\varphi(z) \rangle. \qquad (6.5)$$

It is more common to use Eq. (6.3), where the learning model is formulated using the kernel matrix, as in SVMs, Kernel Ridge Regression (KRR), Kernel Principal Component Analysis (KPCA) and GPs. This alternative requires the follow-up learning algorithm to be formulated using only inner products to represent the input data. Compared with Eq. (6.3), however, the explicit feature vector (cf. Eq. (6.4)) provides a more flexible representation. One can simply apply the nonlinear transformation in Eq. (6.4) and use $\phi(x) \in \mathbb{R}^N$ as the new feature vector fed into the next learning module. The advantages are:

  (i)  Instead of requiring a reformulation using inner products, any linear model can be applied directly in $\mathbb{R}^N$, and hence all the theoretical properties of well-studied linear models can be adopted.
  (ii)  One can explicitly control the complexity of the feature space using subspace transformations.

6.2.4 Model Complexity

One of the main challenges of kernel methods, and of nonparametric modeling techniques in general, is the control of the model complexity: a low complexity leads to an oversimplified description of the world, which results in a low variance but a high bias, and vice versa. In particular, for nonparametric learning techniques, where the model complexity is a function of the training size, this may cause the issues discussed in this section.

6.2.4.1 Computational Complexity and Memory Usage

One of the major challenges of kernel methods is the computational complexity and memory usage. As shown in Eq. (6.3), to project the features onto the subspace $\mathcal{F}_{\mathcal{D}}$ (cf. Eq. (6.2)), one has to store and invert an $N \times N$ kernel matrix, which requires $O(N^3)$ computations and $O(N^2)$ memory. When $N$ is very large, this is usually not feasible on a single-node machine with limited RAM.

6.2.4.2 Robustness and Overfitting

Another issue related to the model complexity is overfitting. Since the feature space is essentially the subspace spanned by $\{\varphi(x_1), \ldots, \varphi(x_N)\}$, the "quality" of the training dataset is crucial to construct a robust feature space. That is, noisy training data might result in a relatively arbitrary feature representation. Besides trying to increase the Signal-to-Noise Ratio (SNR) of the measurement, which has limitations on its own, one commonly adopted solution is to find a more robust and invariant subspace of the feature space.

6.2.5 Kernel Approximation and Related Work

With the rapidly growing data sizes in modern machine learning applications, model reduction techniques are developed to reduce the computational complexity, the storage requirement and the model complexity. To this end, in kernel methods, approximation techniques are introduced. There are mainly two types of kernel approximation techniques: the random Fourier features [1] and the Nyström methods. A comparison between the original versions of these two techniques can be found in [13,14].

Briefly speaking, random Fourier features are explicitly constructed by a random sampling in the Fourier domain. Although computations of such techniques can be quite efficient [15], the main limitation is that the basis functions of the feature construction are in general not data-dependent. Moreover, in its original form, random Fourier features require numerical input sets.

Another popular technique is the Nyström method, where a subspace approximation of the kernel matrix is learned from data. This technique is motivated in Section 6.3.1. Nyström is generally more computationally costly than random Fourier features, but due to its data-dependent nature, it typically provides better performance in data-driven problems. Moreover, with Nyström techniques, one can iteratively update the approximated subspace by including/rejecting new data points with respect to some criteria, which gives rise to a natural adaptive framework.

6.2.5.1 Random Fourier Features

Random Fourier feature [1] techniques construct an explicit feature map such that the inner product of the explicit feature vectors approximates the kernel function. The kernel functions are required to be shift-invariant, i.e., $k(x,z) = k(x - z)$ for any $x, z \in \mathcal{X}$. The construction is carried out by drawing independent and identically distributed samples in the frequency domain of the kernel function. The main theorem ensuring the existence of random Fourier features is given below.

Theorem 1

[16]

A continuous shift-invariant kernel $k(x,z) = k(x - z)$ on $\mathbb{R}^d$ is positive definite if and only if $k(\delta)$ is the Fourier transform of a nonnegative measure, i.e.,

$$k(x,z) = \int_{\mathbb{R}^d} p(\omega)\, e^{j\omega^T (x - z)}\, d\omega = \mathbb{E}_{\omega}\!\left(\zeta_\omega(x)\, \zeta_\omega(z)^*\right), \qquad (6.6)$$

where $\zeta_\omega(x) \triangleq e^{j\omega^T x} = \cos(\omega^T x) + j\sin(\omega^T x)$, $^*$ denotes the complex conjugate and $j$ is the imaginary unit.

Theorem 1 shows that $\zeta_\omega(x)\zeta_\omega(z)^*$ is an unbiased estimator of $k(x,z)$ for shift-invariant kernels. Since $k(x,z)$ is a real number, only the real part $\cos(\omega^T(x - z))$ needs to be retained.

The idea is then to construct an explicit feature map $z_\omega: \mathcal{X} \to \mathbb{R}^m$, such that

$$z_\omega(x)^T z_\omega(z) = \zeta_\omega(x)\,\zeta_\omega(z)^* \qquad (6.7)$$
$$= \cos(\omega^T x - \omega^T z) \qquad (6.8)$$
$$= \sin(\omega^T x)\sin(\omega^T z) + \cos(\omega^T x)\cos(\omega^T z). \qquad (6.9)$$

Empirically, the feature map can then be constructed using

$$\phi(x) = \frac{1}{\sqrt{m}}\left[\cos(\omega_1^T x) \;\cdots\; \cos(\omega_m^T x) \;\; \sin(\omega_1^T x) \;\cdots\; \sin(\omega_m^T x)\right]^T,$$

where $\omega_1, \ldots, \omega_m$ are independent and identically distributed samples drawn from $p(\omega)$.

The algorithm is summarized in Algorithm 1. Details will be addressed in Chapter 7.

Algorithm 1 Random Fourier Features.
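As an illustration of this construction, the following minimal NumPy sketch (assuming the Gaussian kernel $k(x,z) = \exp(-\|x - z\|^2 / (2\sigma^2))$, whose spectral density $p(\omega)$ is Gaussian with covariance $\sigma^{-2} I$) draws $m$ random frequencies and checks that $\phi(x)^T \phi(z)$ approximates $k(x,z)$:

```python
import numpy as np

def rff_map(X, omegas):
    # phi(x) = 1/sqrt(m) [cos(w_1^T x) ... cos(w_m^T x)  sin(w_1^T x) ... sin(w_m^T x)]^T
    proj = X @ omegas.T                           # each column holds w_i^T x
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(omegas.shape[0])

rng = np.random.default_rng(0)
d, m, sigma = 5, 2000, 1.0
omegas = rng.standard_normal((m, d)) / sigma      # i.i.d. samples from p(omega) = N(0, I/sigma^2)

x, z = rng.standard_normal(d), rng.standard_normal(d)
approx = (rff_map(x[None, :], omegas) @ rff_map(z[None, :], omegas).T).item()
exact = np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))
print(approx, exact)                              # the two values agree up to Monte Carlo error
```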

6.2.5.2 Nyström Methods: Subspace Learning

Another important family of kernel approximation techniques are the Nyström methods, where the effective rank of the kernel matrix is exploited in a more direct fashion. That is, one assumes an underlying subspace structure in the kernel matrix, i.e., $K \approx \hat{K}$ for some $\hat{K}$ such that $\mathrm{rank}(\hat{K}) = m < N$. The task is then to use the rank-deficient matrix $\hat{K}$ to approximate the kernel matrix.

In its basic form, Nyström randomly selects a subset of the training data $\mathcal{G} \subset \mathcal{D}$ and constructs the approximation using

$$\hat{K} = \underbrace{k(X_{\mathcal{D}}, X_{\mathcal{G}})}_{\mathbb{R}^{N \times m}} \; \underbrace{k(X_{\mathcal{G}}, X_{\mathcal{G}})^{-1}}_{\mathbb{R}^{m \times m}} \; \underbrace{k(X_{\mathcal{G}}, X_{\mathcal{D}})}_{\mathbb{R}^{m \times N}} = K_{\mathcal{D}\mathcal{G}} K_{\mathcal{G}}^{-1} K_{\mathcal{G}\mathcal{D}}.$$

The random subset selection is computationally simple and, in many applications, sufficient to capture the underlying data structure. Many more advanced sampling strategies have been proposed for different purposes. Most techniques cast the approximation problem as an estimation of the top $m$ singular values of $K$. Essentially, they attempt to approximate the subspace as well as possible while trying to reduce the computational complexity and/or storage usage. For example, some of these techniques include subspace approximation using the incomplete Cholesky decomposition [17]; identifying representative points by clustering [18]; and enabling parallelization and performance boosting by the ensemble Nyström method [28]. Other methods explore the intrinsic structure of the kernel matrix to obtain more compact and efficient memory storage, such as the Memory-Efficient Kernel Approximation (MEKA) [8], Clustered Low-Rank Approximation (CLRA) [19] and the CLAss-Specific Kernel (CLASK) subspace approximation [7]. These techniques more or less share the same base algorithm with different sampling criteria and/or feature representation strategies. In the following sections, we explore the underlying framework of kernel subspace approximation and its adaptive learning algorithms.
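A minimal NumPy sketch of this basic uniform-sampling construction is shown below; the Gaussian kernel, the data sizes and the use of a pseudoinverse in place of $k(X_{\mathcal{G}}, X_{\mathcal{G}})^{-1}$ (a common numerical safeguard) are assumptions made for illustration only.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
N, m = 500, 50
X = rng.standard_normal((N, 2))

# Basic Nystrom: sample a random subset G of size m and form K_DG K_G^{-1} K_GD.
idx = rng.choice(N, size=m, replace=False)
K_DG = gaussian_kernel(X, X[idx])                 # N x m
K_G = gaussian_kernel(X[idx], X[idx])             # m x m
K_hat = K_DG @ np.linalg.pinv(K_G) @ K_DG.T       # rank <= m approximation of K

K = gaussian_kernel(X, X)                         # exact N x N kernel matrix (for comparison)
print(np.linalg.norm(K - K_hat) / np.linalg.norm(K))   # relative Frobenius error
```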

6.3 Kernel Subspace Approximation

6.3.1 Motivation

Briefly speaking, kernel subspace approximation applies a subspace transformation to the feature vector space $\mathcal{F}_{\mathcal{D}}$. The transformed data vectors, with a lower dimensionality, are then used as the new input features to the subsequent module.

That is,

$$T: \mathcal{F}_{\mathcal{D}} \to \mathcal{F}, \qquad (6.12)$$

where $\mathcal{F}$ has a lower dimension than $\mathcal{F}_{\mathcal{D}}$. By applying this transformation, the complexity is reduced to a function of $\dim(\mathcal{F})$. The flowchart of this process can be found in Fig. 6.2.

Figure 6.2 Transformations in kernel methods.

Given a training set $\mathcal{D}$ and a kernel function $k$, the idea is to find a subset $\mathcal{G} \subset \mathcal{D}$ with cardinality $|\mathcal{G}| \ll N$ and a transformation matrix $T \in \mathbb{R}^{|\mathcal{G}| \times m}$ with $m \le |\mathcal{G}|$, such that

$$T^T k(X_{\mathcal{G}}, X_{\mathcal{G}})\, T = I_{m \times m}, \qquad (6.13)$$

which gives a transformation $k(\cdot, X_{\mathcal{G}})\, T$ with orthonormal columns.

6.3.2 Training Complexity With Approximation

In kernel methods, during the training process, to find the exact or approximated optimal solution of the unknown parameters in the learning model, the computational complexity is typically dominated by the operation $(K + \lambda I)^{-1}$, where $I$ is the $N \times N$ identity matrix and $\lambda \in \mathbb{R}_+$ is the ridge parameter that handles the ill-conditioned matrix inversion [20]. In practice, this matrix inversion requires $O(N^{2.8})$ [21] to $O(N^3)$ operations depending on the algorithm.1 Kernel approximations transform the learning model onto a lower dimensional feature space and subsequently reduce the computational complexity. In this section, we discuss the approximation for the two different feature representations in Eq. (6.3) and Eq. (6.4).

6.3.2.1 Approximated Kernel Matrix

One computational advantage of kernel methods is that all computations are carried out via inner products. The kernel matrix $K$ (cf. Eq. (6.3)), which holds the pairwise inner products of the training data, is therefore a key component of kernel methods. Essentially, the subspace approximation can be interpreted as weighted inner products, which transforms the kernel matrix as

$$K \approx \tilde{K} = \underbrace{\begin{bmatrix} k(X_{\mathcal{G}}, x_1)^T \\ \vdots \\ k(X_{\mathcal{G}}, x_N)^T \end{bmatrix}}_{K_{\mathcal{D}\mathcal{G}} \,\in\, \mathbb{R}^{N \times |\mathcal{G}|}} T\, T^T \underbrace{\begin{bmatrix} k(X_{\mathcal{G}}, x_1) & \cdots & k(X_{\mathcal{G}}, x_N) \end{bmatrix}}_{K_{\mathcal{G}\mathcal{D}} \,\in\, \mathbb{R}^{|\mathcal{G}| \times N}} \in \mathbb{R}^{N \times N}, \qquad (6.14)$$

where $k(X_{\mathcal{G}}, x) = [k(x_{i_1}, x) \;\cdots\; k(x_{i_{|\mathcal{G}|}}, x)]^T$ and the indices $\{i_j : i_j \in I_{\mathcal{G}},\; j = 1, \ldots, |\mathcal{G}|\}$ enumerate the subset $\mathcal{G}$. The weighting matrix $T T^T$ (cf. Eq. (6.13)) is rank-deficient.

After kernel approximation, the size of the kernel matrix remains the same, and the approximated kernel matrix $\tilde{K}$ becomes rank-deficient, i.e., $\mathrm{rank}(\tilde{K}) = m$. However, in combination with the matrix inversion lemma, the kernel matrix inversion can be reduced to

$$(K + \lambda I)^{-1} \approx (\tilde{K} + \lambda I)^{-1} = \frac{1}{\lambda} I - \frac{1}{\lambda^2} K_{\mathcal{D}\mathcal{G}} T \underbrace{\left(I + \frac{1}{\lambda} T^T K_{\mathcal{G}\mathcal{D}} K_{\mathcal{D}\mathcal{G}} T\right)^{-1}}_{\mathbb{R}^{m \times m}} T^T K_{\mathcal{G}\mathcal{D}}. \qquad (6.15)$$

The computational complexity is listed as follows:

  •  Matrix multiplication $K_{\mathcal{D}\mathcal{G}} T$: $O(N |\mathcal{G}| m)$.
  •  Matrix multiplication $(K_{\mathcal{D}\mathcal{G}} T)^T K_{\mathcal{D}\mathcal{G}} T$: $O(N m^2)$.
  •  Matrix inversion: $O(m^3)$.

Since $N \gg |\mathcal{G}| \ge m$, the overall complexity is $O(N |\mathcal{G}| m)$.
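To illustrate Eq. (6.15) numerically, the following sketch builds $\tilde{K} = K_{\mathcal{D}\mathcal{G}} T T^T K_{\mathcal{G}\mathcal{D}}$ for one particular choice of $T$ satisfying Eq. (6.13) (obtained from the eigendecomposition of $K_{\mathcal{G}}$, with a small eigenvalue floor as a numerical safeguard) and confirms that the Woodbury form agrees with direct inversion; the kernel, data and $\lambda$ are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
N, m, lam = 300, 30, 0.1
X = rng.standard_normal((N, 2))
idx = rng.choice(N, size=m, replace=False)

K_DG = gaussian_kernel(X, X[idx])                      # N x m
K_G = gaussian_kernel(X[idx], X[idx])                  # m x m
w, V = np.linalg.eigh(K_G)
T = V / np.sqrt(np.maximum(w, 1e-12))                  # T^T K_G T = I  (Eq. (6.13))

B = K_DG @ T                                           # so that K_tilde = B B^T  (Eq. (6.14))
K_tilde = B @ B.T

direct = np.linalg.inv(K_tilde + lam * np.eye(N))      # O(N^3)
woodbury = (np.eye(N) / lam
            - B @ np.linalg.inv(np.eye(m) + (B.T @ B) / lam) @ B.T / lam ** 2)
print(np.allclose(direct, woodbury))                   # Eq. (6.15) holds
```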

6.3.2.2 Approximated Sample Covariance Matrix Using Explicit Feature Vectors

Kernel approximation can readily be applied to create an explicit feature vector via a map $\tilde{\phi}: \mathcal{X} \to \mathbb{R}^m$. Specifically,

$$\tilde{\phi}(x) = T^T \begin{bmatrix} k(x_{i_1}, x) \\ \vdots \\ k(x_{i_{|\mathcal{G}|}}, x) \end{bmatrix} \in \mathbb{R}^m. \qquad (6.16)$$

Eq. (6.16) can then be used as the feature vector for the succeeding learning model. When applying Eq. (6.16) to linear learning techniques, the sample covariance matrix typically needs to be inverted. Assuming that the feature vector $\tilde{\phi}(x)$ is centered (cf. Appendix 6.A.1), the (scaled) sample covariance matrix is defined as

$$C = \left[\tilde{\phi}(x_1) \;\cdots\; \tilde{\phi}(x_N)\right] \begin{bmatrix} \tilde{\phi}(x_1)^T \\ \vdots \\ \tilde{\phi}(x_N)^T \end{bmatrix} \qquad (6.17)$$
$$= T^T \left[k(X_{\mathcal{G}}, x_1) \;\cdots\; k(X_{\mathcal{G}}, x_N)\right] \begin{bmatrix} k(X_{\mathcal{G}}, x_1)^T \\ \vdots \\ k(X_{\mathcal{G}}, x_N)^T \end{bmatrix} T \in \mathbb{R}^{m \times m}. \qquad (6.18)$$

The computational complexity of computing $C^{-1}$ is composed of:

  •  Matrix multiplication: $O(N |\mathcal{G}| m)$.
  •  Matrix inversion: $O(m^3)$.

Hence, the overall computational complexity is $O(N |\mathcal{G}| m)$.
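As a sketch of how Eq. (6.16) interfaces with a linear model, the following toy example (synthetic two-class data, a Gaussian kernel, uniform sampling of $\mathcal{G}$ and a regularized least-squares classifier are all assumptions for illustration; centering is omitted for brevity) maps the data through $\tilde{\phi}$ and solves a linear problem whose normal equations involve only the $m \times m$ matrix $C + \lambda I$:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
N, m, lam = 400, 40, 1e-2
# Two Gaussian blobs as a toy binary classification problem.
X = np.vstack([rng.standard_normal((N // 2, 2)) + 2, rng.standard_normal((N // 2, 2)) - 2])
y = np.hstack([np.ones(N // 2), -np.ones(N // 2)])

idx = rng.choice(N, size=m, replace=False)            # the subset G (uniform sampling)
K_G = gaussian_kernel(X[idx], X[idx])
w_eig, V = np.linalg.eigh(K_G)
T = V / np.sqrt(np.maximum(w_eig, 1e-12))             # T^T K_G T = I  (Eq. (6.13))

Phi = gaussian_kernel(X, X[idx]) @ T                  # row n is phi_tilde(x_n)^T  (Eq. (6.16))
C = Phi.T @ Phi                                       # m x m (scaled) covariance  (Eq. (6.18))
w = np.linalg.solve(C + lam * np.eye(m), Phi.T @ y)   # linear least squares in R^m
print(np.mean(np.sign(Phi @ w) == y))                 # training accuracy of the linear model
```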

In the next section, we focus on an adaptive algorithmic framework and its instantiations to construct desired subspaces.

6.4 Adaptive Kernel Subspace Approximation Algorithm

6.4.1 Algorithmic Framework

With an online data acquisition, the subspace approximation can be implemented in an adaptive fashion. One generic framework is summarized in Algorithm 2.

Algorithm 2 Generic Kernel Subspace Approximation Framework.

The matrix inversion lemma [22] is applied to obtain an iterative update of the kernel matrix inversion. The actual sampling strategy is embedded in the “criteria”, which are discussed in the upcoming section.

6.4.2 Algorithm Design

The key design components in Algorithm 2 are (1) the sampling criteria to find $\mathcal{G}$ and (2) the construction of the transformation matrix $T$. In this section, we discuss the principles of finding $\mathcal{G}$ and $T$ and how they may affect the performance of the algorithm.

6.4.2.1 Design Criteria

Generally speaking, we are interested in building a model (i.e., the approximated subspace) with a high generalization ability, which is measured by the performance on unseen datasets. The generalization ability is closely related to the model complexity. That is, a model with a high complexity tends to result in a high performance on the training data, but it might perform poorly on testing datasets and vice versa. It is referred to as the “overfitting” problem.

This can be observed via the bias-variance trade-off, where the performance statistics are measured when the model is trained using different training sets of a fixed size. More precisely, when the model complexity is high, the search space of the model parameters is large and hence the bias of the model tends to be low. On the other hand, since the model is derived from the training data, the higher the model complexity, the higher the resulting variance.

The strategy for achieving a high generalization is twofold: one needs to (1) reduce the model complexity and (2) maintain a low empirical risk on the training data. Combining the intuitive illustration in Fig. 6.3 with the actual model representation in Eq. (6.16), we see that the model complexity (the "size" of the circle) is related to (1) the size of the subset $\mathcal{G}$ and (2) the number of columns in $T$, while the empirical risk (the "location" of the circle) depends on (1) the sampling criteria for $\mathcal{G}$ and (2) the construction strategy for the transformation matrix $T$. Given this analysis, we demonstrate some design criteria for Algorithm 2 in this section.

Rejecting samples with a low innovation: The "innovation" of a data point $x$ is the projection error from $\varphi(x)$ onto the existing subspace. The sample is only accepted if the innovation is large enough, i.e.,

$$k(x,x) - k_{\mathcal{G}}^T K_{\mathcal{G}}^{-1} k_{\mathcal{G}} \ge \eta \in (0,1), \qquad (6.23)$$

where $k_{\mathcal{G}} = k(X_{\mathcal{G}}, x)$. When Eq. (6.23) is violated, it is not necessary to include $\varphi(x)$. Hence, the larger $\eta$ is, the fewer data points are included in the subspace construction.
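A small sketch of the innovation test in Eq. (6.23) on a synthetic stream (the Gaussian kernel, for which $k(x,x) = 1$, and the threshold value are assumptions; a practical implementation would update $K_{\mathcal{G}}^{-1}$ incrementally via the matrix inversion lemma rather than re-solving at every step):

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
stream = rng.standard_normal((1000, 2))       # data points arriving one at a time
eta = 0.1                                     # innovation threshold in (0, 1)

G = []                                        # indices of accepted points (the subset G)
for n, x in enumerate(stream):
    if not G:
        G.append(n)
        continue
    XG = stream[G]
    kG = gaussian_kernel(XG, x[None, :]).ravel()                  # k(X_G, x)
    innovation = 1.0 - kG @ np.linalg.solve(gaussian_kernel(XG, XG), kG)
    if innovation >= eta:                     # accept only if Eq. (6.23) is satisfied
        G.append(n)

print(len(G), "of", len(stream), "points kept")
```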

Dimension reduction by choosing principal components: Principal Component Analysis (PCA) [23] is among the most commonly used linear techniques for dimension reduction. From Eq. (6.16), we see that, given the subset $\mathcal{G}$ and a hyperparameter $m < |\mathcal{G}|$, the transformation matrix $T \in \mathbb{R}^{|\mathcal{G}| \times m}$ can be constructed by applying PCA on the dataset $U = \{[k(x_{i_1}, x) \;\cdots\; k(x_{i_{|\mathcal{G}|}}, x)]^T : x \in \mathcal{D}\}$. That is, let $\tilde{T}$ be the matrix that contains the sorted principal vectors computed from $U$ as its columns. The matrix $T$ is obtained by truncation:

$$T = \tilde{T}(:, 1\!:\!m). \qquad (6.24)$$

Since $m < |\mathcal{G}|$, the model complexity is further reduced.
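A minimal sketch of this PCA-based construction of $T$ (the kernel, the subset size and the target dimension $m$ are assumptions for illustration): PCA is run on the $|\mathcal{G}|$-dimensional vectors $[k(x_{i_1}, x) \;\cdots\; k(x_{i_{|\mathcal{G}|}}, x)]^T$, $x \in \mathcal{D}$, and the top $m$ principal directions are kept as the columns of $T$.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
N, card_G, m = 500, 60, 10
X = rng.standard_normal((N, 2))
idx = rng.choice(N, size=card_G, replace=False)       # the subset G

U = gaussian_kernel(X, X[idx])                        # row x: [k(x_i1, x) ... k(x_i|G|, x)]
U_centered = U - U.mean(axis=0)                       # PCA on the dataset U
_, _, Vt = np.linalg.svd(U_centered, full_matrices=False)
T_full = Vt.T                                         # sorted principal vectors as columns
T = T_full[:, :m]                                     # truncation: Eq. (6.24)

Phi = U @ T                                           # m-dimensional explicit features
print(T.shape, Phi.shape)                             # (|G|, m) and (N, m)
```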

Class separability: In classification problems, the reconstruction error is not the only criterion for measuring the quality of the subspace. When label information is available, the discriminative ability is preferably taken into consideration. This information can be encoded in various ways. For example, the CLAss-specific Subspace Kernel (CLASK) representation [7] encodes the information into the feature vector by constructing one subspace for each class, with sampling criteria that depend on the between-class projection error. Fisher discriminant analysis [24] finds a transformation matrix $T$ that maximizes the Fisher quotient, which measures the class separability. Other constrained optimizations can be applied to find a $T$ that results in a good class separation, which can be cast as metric learning [25–27] problems.

Ensemble techniques: One efficient way of reducing the variance is to learn multiple models from subsets of the training data and base the final estimate on an ensemble decision. One such example is the ensemble Nyström method [28], where

$$\hat{K} = \sum_{r=1}^{P} \mu_r \hat{K}_r \qquad (6.25)$$

and each $\hat{K}_r$ is estimated from a subset of $\mathcal{D}$, with weights $\mu_r$ for $r = 1, \ldots, P$, $P \in \mathbb{N}_+$.

Figure 6.3 Illustration of the algorithm design criteria. The size of the circles indicates the model complexity. Circles $\mathcal{F}_1, \ldots, \mathcal{F}_n$ are subspaces with the same complexity, and $\mathcal{F}$ has a lower complexity than the rest. The dot $f^\star$ indicates the optimal solution that minimizes the empirical risk given the subspace $\mathcal{F}_{\mathcal{D}}$ spanned by the whole training dataset (cf. Eq. (6.1)). The goal of the algorithm design is to find a subspace of appropriate size (model complexity) and reasonable location (which hopefully contains $f^\star$) before the training process.

6.4.2.2 Empirical Risk

Given a model FImage designed by the aforementioned criteria, empirical risks are measured to ensure that the “location” of the circle in Fig. 6.3 is not way off. Typically, with the same model complexity, the model corresponding to a lower empirical risk is preferred. Typically, there are two empirical risks to be considered.

Reconstruction error: The approximated subspace $\mathcal{F}$ (cf. Eq. (6.12)) has to be representative, which is evaluated by the following reconstruction error on the training data. Let $\hat{\varphi}_n$ be the projection of $\varphi(x_n)$ onto $\mathcal{F}$, expressed with coefficients $a_{n,i}$ in the basis constructed from $\mathcal{G}$. The empirical risk of the reconstruction error (on the training set $\mathcal{D}$) is defined as

$$E_{\mathcal{F}} = \frac{1}{N} \sum_{n=1}^{N} \frac{\left\| \varphi_n - \hat{\varphi}_n \right\|_2^2}{\left\| \varphi_n \right\|^2} \qquad (6.26)$$
$$= \frac{1}{N} \sum_{n=1}^{N} \left( 1 - \frac{\left\| T^T k(X_{\mathcal{G}}, x_n) \right\|_2^2}{k(x_n, x_n)} \right), \qquad (6.27)$$

where the coefficients $a_{n,i}$ are found by

$$a_{n,i} = \arg\min_{a_{1,1}, \ldots, a_{N,m}} E_{\mathcal{F}}(a_{1,1}, \ldots, a_{N,m}).$$

We see that, by setting the hyperparameter $\eta$ (cf. Eq. (6.23)) to a small value, the reconstruction error $E_{\mathcal{F}}$ becomes small. On the other hand, a small number of columns in $T$ results in a large $E_{\mathcal{F}}$ due to the truncation (cf. Eq. (6.24)).

Classification error: In classification problems, the ultimate risk function is the 0–1 loss, defined as

$$L_{\mathcal{F}} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\!\left(f_{\mathcal{F}}(\hat{\varphi}_n) \ne y_n\right), \qquad (6.28)$$

where $y_n$ is the label of $x_n$, $\mathbb{1}$ is the indicator function and $f_{\mathcal{F}}$ is found by minimizing $L_{\mathcal{F}}$. Note that, given a training set, this empirical risk depends not only on the subspace approximation algorithm, but also on the classification learning model.

6.5 Infrastructures

The computational complexity of kernel methods mainly depends on the dimension d of the input space, the training size N and the approximated rank m. Given a large training size N and a complex intrinsic data structure, to fully explore the capability of the representative power of kernel methods, we need to take advantage of the modern computational infrastructures and frameworks.

6.5.1 Speedup: GPU/CUDA

To speed up the kernel algorithms, one can take advantage of the graphics processing units (GPUs) for general computational purposes [29]. This is readily applicable due to the fact that computations in kernel methods mainly involve a large amount of simple operations that can easily be parallelized.

In recent years graphics cards have evolved into GPUs offering computing resources for general purpose computations. Typically, a GPU card contains hundreds or even thousands of simple processors (cores) allowing the concurrent execution of multiple threads. GPU cores, unlike the cores in a multicore CPU, are not full-blown processors. They do not have their own memory but share memory with other cores, and they execute simple tasks, such as mathematical operations, shared memory access and access to the global GPU memory. The Compute Unified Device Architecture (CUDA) by NVIDIA is a library enabling the use of the GPU cores for general purpose parallel computing. Threads are organized into blocks that run on groups of cores called streaming multiprocessors. CUDA follows a data-parallel model of execution: all threads execute the same function, called a "kernel", but each thread operates on different parts of the data depending on its ID. This programming paradigm is also known as the Single Instruction–Multiple Threads (SIMT) model [30]. The basic idea behind CUDA is that the GPU operates separately and independently from the host. The CPU submits a job to the GPU by copying the necessary data onto the global GPU memory, the GPU does the processing and returns the result to the CPU so that the results can be printed, saved on disk, etc. The GPU device has no access to the system I/O and essentially operates as a slave accelerator. Most popular machine learning packages, such as Tensorflow and Theano, can take advantage of CUDA in order to accelerate computations.

Example: Computing Kernel Functions  We see from Algorithm 2 that the computational complexity of the algorithm is dominated by kernel function computations and vector-matrix multiplications. In this example, we show the speedup achieved by using CUDA on a GPU compared with the CPU performance. The setup of the experiment is specified as follows.

Hardware specs: GPU/CPU specifications

  -  GPU: GeForce GTX 960M with 640 CUDA cores
  -  CPU: Intel i7-6700HQ processor, 2.6 GHz, 4 cores, 8 threads

Data: two synthetic datasets $\{x_1, \ldots, x_{N_1}\}$ and $\{z_1, \ldots, z_{N_2}\}$, where each element of each data vector is generated independently from a normal distribution.

Computations involved: allocate memory and compute the entries of an $N_1 \times N_2$ matrix, where each element is $M(i,j) = k(x_i, z_j)$ for $i = 1, \ldots, N_1$, $j = 1, \ldots, N_2$ and $x_i, z_j \in \mathbb{R}^d$. To vary the number of operations per thread, we compute and compare the run time for the following kernels:

  -  Single kernel: polynomial kernel,

$$k(x_i, z_j) = (x_i^T z_j + 1)^2.$$

  -  Multiple kernel [31],

$$k(x_i, z_j) = \sum_{l=1}^{3} k_l(x_i, z_j) = (x_i^T z_j + 1)^2 + \exp\!\left(-\tfrac{1}{2}\|x_i - z_j\|^2\right) + \exp\!\left(-\tfrac{1}{2}\|x_i - z_j\|\right). \qquad (6.29)$$

Variables: to compare the performance and gain a better understanding, we vary the following parameters:

  -  Data-related: $N_1$, $N_2$, $d$
  -  Operation-related: single-kernel versus multiple-kernel computation
  -  CUDA-related: THREAD_PER_BLOCK
  -  CUDA block grid: dim3 blocks(N1, N2/THREAD_PER_BLOCK, 1)

The design of the CUDA algorithm can be further optimized. For simplicity, we choose to compute one element $k(x_i, z_j)$ per thread.
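For reference, the arithmetic assigned to each thread is one entry of $M$. A plain NumPy version of the multiple-kernel case in Eq. (6.29) is sketched below (CPU only, with smaller sizes than in the timing experiment; it merely fixes the per-element computation that the CUDA kernel performs once per thread):

```python
import numpy as np

rng = np.random.default_rng(0)
N1, N2, d = 128, 64, 100                      # smaller than the timing experiment
X = rng.standard_normal((N1, d))
Z = rng.standard_normal((N2, d))

def multi_kernel_element(x, z):
    # Eq. (6.29): polynomial + Gaussian + Laplacian terms; the work of one CUDA thread.
    r = np.linalg.norm(x - z)
    return (x @ z + 1.0) ** 2 + np.exp(-0.5 * r ** 2) + np.exp(-0.5 * r)

# CPU reference: loop over all (i, j) pairs; on the GPU these are independent threads.
M = np.array([[multi_kernel_element(X[i], Z[j]) for j in range(N2)] for i in range(N1)])
print(M.shape)
```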

The run time comparison can be found in Fig. 6.4, Fig. 6.5 and Fig. 6.6, where the run time is measured in milliseconds using the C function clock() with output type clock_t. With the aforementioned setup, we observe a major speedup using the GPU under most circumstances and for various GPU parameters. However, note that, when the number of operations per thread is too low ($d = 10$), the CPU can outperform the GPU, since the CUDA overhead, such as data transfer, dominates the computation.

Figure 6.4 CUDA performance measurement I. Run time comparison for different d. The number of elements is N1 × N2.
Figure 6.5 CUDA performance measurement II(1). THREAD_PER_BLOCK is varied with N1 = 64.
Figure 6.6 CUDA performance measurement II(2). THREAD_PER_BLOCK is varied with N1 = 512.

6.5.2 Scaling: Spark

Apache Spark [32–34] is a data-parallel computing framework offering scalability, fault tolerance and increased performance due to the use of in-memory operations. Spark operates on top of various cluster managers and is agnostic of the architecture of the underlying cluster. Compared to Hadoop MapReduce [35], Spark can be 10 to 100 times faster. A central data structure in the Spark programming model is the Resilient Distributed Dataset (RDD) [36], which is a fault-tolerant collection of data partitioned across nodes. An RDD can be rebuilt automatically in case of data loss caused by a node failure. RDDs can be created by various actions, for instance reading a text file from a distributed file system or transforming an existing RDD. A typical transformation is the map operation, which applies the same function to each element of the RDD via a closure. Reductions of RDDs into single objects are also possible. For example, an RDD can be reduced into a number by summing up its elements, by counting its elements, etc.

Example: Ensemble Nyström  The Spark framework can be adopted for ensemble algorithms, such as the ensemble Nyström in Eq. (6.25) [28] (cf. Fig. 6.7). The training set is divided into $n$ disjoint subsets, $\mathcal{D} = \mathcal{D}_1 \cup \mathcal{D}_2 \cup \cdots \cup \mathcal{D}_n$, for $n \in \mathbb{N}_+$. In Fig. 6.7, each solid black node denotes the kernel subspace approximation in Algorithm 2. The framework is composed of two parts:

  •  map(): The map function takes a collection of training sets $\{\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_n\}$ and applies Algorithm 2 to each element $\mathcal{D}_i$, for $i = 1, \ldots, n$. The output of each operation is the kernel matrix $\hat{K}_i$.
  •  reduce(): The reduce function takes the collection of outputs $\{\hat{K}_1, \ldots, \hat{K}_n\}$ and applies a weighted sum (cf. Eq. (6.25)) to obtain the final result $\hat{K}$.

The Spark framework scales linearly with respect to the number of nodes. Moreover, one can enable the NVIDIA CUDA GPU on Spark to accelerate the computations.

Figure 6.7 Ensemble Nyström (cf. Eq. (6.25)). A diagram illustrating how the ensemble Nyström can be implemented using the Spark map() and reduce() functions.
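A hedged PySpark sketch of the map()/reduce() structure in Fig. 6.7 is given below; the Gaussian kernel, the uniform weights $\mu_r = 1/P$ and the helper nystrom_subset() (a stand-in for Algorithm 2 running on each subset) are assumptions for illustration, and in practice the full data matrix would be broadcast rather than shipped inside each RDD element.

```python
import numpy as np
from pyspark import SparkContext

def gaussian_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def nystrom_subset(args):
    # Stand-in for Algorithm 2: basic Nystrom using one landmark subset D_r,
    # returning an approximation K_hat_r of the full N x N kernel matrix.
    X_full, subset_idx = args
    K_DG = gaussian_kernel(X_full, X_full[subset_idx])
    K_G = gaussian_kernel(X_full[subset_idx], X_full[subset_idx])
    return K_DG @ np.linalg.pinv(K_G) @ K_DG.T

rng = np.random.default_rng(0)
N, m, P = 1000, 30, 4                         # P disjoint subsets D_1, ..., D_P
X = rng.standard_normal((N, 2))
subsets = np.array_split(rng.permutation(N)[: m * P], P)

sc = SparkContext("local[*]", "ensemble-nystrom")
rdd = sc.parallelize([(X, idx) for idx in subsets], numSlices=P)
# map(): one Nystrom approximation per subset; reduce(): weighted sum with mu_r = 1/P (Eq. (6.25)).
K_hat = rdd.map(nystrom_subset).reduce(lambda a, b: a + b) / P
print(K_hat.shape)
sc.stop()
```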

These frameworks and libraries are updated frequently as the technologies around them develop. However, once the underlying principles are taken into account, this rapid evolution is not as overwhelming as it seems. Nevertheless, learning how to use these tools is an essential part of the algorithm design at hand.

6.6 Conclusion

Kernel techniques are powerful nonlinear feature extraction techniques that handle nonvectorial input sets and have well-established theoretical foundations. Given their nonparametric nature, the model complexity of kernel techniques grows with the training size. High model complexity might lead to prohibitive computational complexity and memory usage. Kernel approximation techniques are typically applied to address this issue. In this chapter, we have mainly discussed two types of kernel approximation strategies, the random Fourier features and the Nyström subspace approximation. We have focused on the Nyström methods due to their data-dependent nature and the ease with which they can be modified into an adaptive learning framework. Designs of the basic Nyström method and its extensions have been discussed and analyzed. Moreover, to fully take advantage of large-scale training sets, we have presented possibilities for using modern computing infrastructures and frameworks, e.g. GPUs with the CUDA library, the map-reduce framework, etc., to scale and speed up kernel methods. Furthermore, the well-established theoretical properties of kernel methods can be beneficial in combination with other machine learning techniques such as deep learning [37–41], where kernel methods can be used as a preprocessing or a classification module, implemented using popular APIs such as Torch [42] and Tensorflow [43]. The interested reader is referred to the example code from this chapter for further investigation.

Appendix 6.A

6.A.1 Centering

One preprocessing technique is to center the feature vectors as follows:

$$k(x,z) \leftarrow \left\langle \varphi(x) - \mathbb{E}(\varphi(x)),\; \varphi(z) - \mathbb{E}(\varphi(z)) \right\rangle = \langle \varphi(x), \varphi(z) \rangle - \langle \mathbb{E}(\varphi(x)), \varphi(z) \rangle - \langle \varphi(x), \mathbb{E}(\varphi(z)) \rangle + \langle \mathbb{E}(\varphi(x)), \mathbb{E}(\varphi(z)) \rangle,$$

where $\mathbb{E}(\varphi(\cdot))$ can be estimated from the training data $\{x_1, \ldots, x_N\}$. We have

$$k(x,z) \leftarrow k(x,z) - \frac{1}{N}\sum_{i=1}^{N} k(x_i, x) - \frac{1}{N}\sum_{i=1}^{N} k(x_i, z) + \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} k(x_i, x_j). \qquad (6.30)$$

After centering the feature vectors, the kernel function only depends on the relative position between $\varphi(x)$ and $\varphi(z)$.
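A small NumPy sketch of Eq. (6.30), applied here to the training kernel matrix itself (i.e., both $x$ and $z$ range over the training set; this special case and the toy kernel are assumptions for brevity):

```python
import numpy as np

def center_kernel_matrix(K):
    # Eq. (6.30) for all training pairs: Kc[i, j] = K[i, j] - rowmean_i - colmean_j + grandmean.
    return K - K.mean(axis=1, keepdims=True) - K.mean(axis=0, keepdims=True) + K.mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2)   # Gaussian kernel matrix
Kc = center_kernel_matrix(K)
# Centered feature vectors have zero empirical mean, so row/column sums vanish.
print(np.allclose(Kc.sum(axis=0), 0), np.allclose(Kc.sum(axis=1), 0))
```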

6.A.2 Normalization

The implementation of normalization on the feature vector $\varphi(x)$ is as follows:

$$k(x,z) \leftarrow \frac{k(x,z)}{\sqrt{k(x,x)\, k(z,z)}}. \qquad (6.31)$$

With normalization, the kernel function essentially represents the angle between two feature vectors; the effect of the absolute length of the vectors is removed. This leads to a more robust result in many applications.
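Similarly, a small NumPy sketch of Eq. (6.31) applied to a kernel matrix (the polynomial kernel and synthetic data are assumptions for illustration):

```python
import numpy as np

def normalize_kernel_matrix(K):
    # Eq. (6.31): K[i, j] / sqrt(K[i, i] * K[j, j]); the diagonal becomes 1.
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
K = (X @ X.T + 1.0) ** 2                      # polynomial kernel matrix (not normalized)
Kn = normalize_kernel_matrix(K)
print(np.allclose(np.diag(Kn), 1.0))          # unit "length" in feature space
```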

References

[1] Ali Rahimi, Benjamin Recht, Random features for large-scale kernel machines, Neural Information Processing Systems. 2007:1–10.

[2] Christopher Williams, Matthias Seeger, Using the Nyström method to speed up kernel machines, Advances in Neural Information Processing Systems 13. MIT Press; 2001:682–688.

[3] Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar, Sampling methods for the Nyström method, Journal of Machine Learning Research 2012;13:981–1006.

[4] Petros Drineas, Michael W. Mahoney, On the Nyström method for approximating a gram matrix for improved kernel-based learning, Journal of Machine Learning Research 2005;6:2153–2175.

[5] Mu Li, James T. Kwok, Bao-Liang Lu, Making large-scale Nyström approximation possible, Johannes Fürnkranz, Thorsten Joachims, eds. ICML. Omnipress; 2010:631–638.

[6] Ming Lin, Fei Wang, Changshui Zhang, Large-scale eigenvector approximation via Hilbert space embedding Nyström, Pattern Recognition 2015;48(5):1904–1912.

[7] Yinan Yu, Konstantinos I. Diamantaras, Tomas McKelvey, S.Y. Kung, Class-specific subspace kernel representations and adaptive margin slack minimization for large scale classification, IEEE Transactions on Neural Networks and Learning Systems 2016;29(2):440–456.

[8] Si Si, Cho-Jui Hsieh, Inderjit S. Dhillon, Memory efficient kernel approximation, ICML. JMLR Workshop and Conference Proceedings. 2014;vol. 32:701–709 JMLR.org.

[9] Bernhard Schölkopf, Alexander J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press; 2001.

[10] S.Y. Kung, Kernel Methods and Machine Learning. Cambridge University Press; 2014.

[11] Carl Edward Rasmussen, Christopher K.I. Williams, Gaussian Processes for Machine Learning. The MIT Press; 2006.

[12] Corinna Cortes, Vladimir Vapnik, Support vector networks, Machine Learning 1995;20:273–297.

[13] Tianbao Yang, Yu-feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-hua Zhou, Nyström method vs random Fourier features: a theoretical and empirical comparison, Neural Information Processing Systems. 2012:1–9.

[14] Steven C.H. Hoi, Jialei Wang, Peilin Zhao, Jinfeng Zhuang, Zhiyong Liu, Large scale online kernel classification, Francesca Rossi, ed. IJCAI. IJCAI/AAAI; 2013:1750–1756.

[15] Ali Rahimi, Benjamin Recht, Weighted sums of random kitchen sinks: replacing minimization with randomization in learning, Daphne Koller, Dale Schuurmans, Yoshua Bengio, Léon Bottou, eds. NIPS. Curran Associates, Inc.; 2008:1313–1320.

[16] Walter Rudin, Fourier Analysis on Groups. reprint edition New York: Wiley-Interscience; 1994.

[17] Francis R. Bach, Michael I. Jordan, Predictive low-rank decomposition for kernel methods, NIPS. 2005.

[18] Kai Zhang, James T. Kwok, Clustered Nyström method for large scale manifold learning and dimension reduction, IEEE Transactions on Neural Networks 2010;21(10):1576–1587.

[19] Berkant Savas, Inderjit S. Dhillon, Clustered low rank approximation of graphs in information science applications, SDM. SIAM/Omnipress; 2011:164–175.

[20] A.E. Hoerl, R.W. Kennard, Ridge regression: biased estimation for nonorthogonal problems, Technometrics 1970;12:55–67.

[21] Steven Skiena, The Algorithm Design Manual. 2nd ed. Springer; 2008.

[22] William W. Hager, Updating the inverse of a matrix, SIAM Review 1989;31(2):221–239.

[23] H. Abdi, L.J. Williams, Principal component analysis, Computational Statistics 2010;2:433–459.

[24] Sebastian Mika, Gunnar Rätsch, Jason Weston, Bernhard Schölkopf, Klaus-Robert Müller, Fisher discriminant analysis with kernels, Neural Networks for Signal Processing IX. IEEE; 1999:3–10.

[25] Eric P. Xing, Andrew Y. Ng, Michael I. Jordan, Stuart J. Russell, Distance metric learning with application to clustering with side-information, NIPS'02. 2002:505–512.

[26] Kilian Q. Weinberger, Metric learning, 2010.

[27] Dor Kedem, Stephen Tyree, Kilian Weinberger, Fei Sha, Gert Lanckriet, Non-linear metric learning, P. Bartlett, F.C.N. Pereira, C.J.C. Burges, L. Bottou, K.Q. Weinberger, eds. Advances in Neural Information Processing Systems 25. 2012:2582–2590.

[28] Sanjiv Kumar, Mehryar Mohri, Ameet Talwalkar, Ensemble Nyström method, Yoshua Bengio, Dale Schuurmans, John D. Lafferty, Christopher K.I. Williams, Aron Culotta, eds. NIPS. Curran Associates, Inc.; 2009:1060–1068.

[29] CUDA: http://www.nvidia.com/object/cuda_home_new.html.

[30] NVIDIA, Whitepaper – NVIDIA's next generation CUDA compute architecture: Fermi, 2009.

[31] Mehmet Gönen, Ethem Alpaydin, Multiple kernel learning algorithms, Journal of Machine Learning Research 2011;12:2211–2268.

[32] SPARK: http://spark.apache.org/.

[33] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica, Spark: cluster computing with working sets, HotCloud 10. 2010:10.

[34] Apache Spark™: Lightning-fast cluster computing, http://spark.apache.org/.

[35] Hadoop MapReduce: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html.

[36] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing, Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association; 2012:2.

[37] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, Eric P. Xing, Deep kernel learning, CoRR arXiv:1511.02222; 2015.

[38] Youngmin Cho, Lawrence K. Saul, Kernel methods for deep learning, Yoshua Bengio, Dale Schuurmans, John D. Lafferty, Christopher K.I. Williams, Aron Culotta, eds. NIPS. Curran Associates, Inc.; 2009:342–350.

[39] Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller, Kernel analysis of deep networks, Journal of Machine Learning Research 2011;12:2563–2581.

[40] Grégoire Montavon, Mikio L. Braun, Klaus-Robert Müller, Layer-wise analysis of deep networks with Gaussian kernels, John D. Lafferty, Christopher K.I. Williams, John Shawe-Taylor, Richard S. Zemel, Aron Culotta, eds. NIPS. Curran Associates, Inc.; 2010:1678–1686.

[41] Brendan McCane, Lech Szymanski, Deep radial kernel networks: approximating radially symmetric functions with deep networks, CoRR arXiv:1703.03470; 2017.

[42] TORCH: http://torch.ch/.

[43] TENSORFLOW: https://www.tensorflow.org/.


1  Throughout this chapter, we assume that the complexity of matrix inversion is $O(N^3)$.
