Chapter 6

Kernel Subspace Learning for Pattern Classification

Yinan Yu^⁎; Konstantinos Diamantaras^†; Tomas McKelvey^⁎; S.Y. Kung^‡
^⁎Chalmers University of Technology, Electrical Engineering, Gothenburg, Sweden
^†TEI of Thessaloniki, Thessaloniki, Greece
^‡Princeton University, Princeton, NJ, United States

Abstract

Kernel methods are nonparametric feature extraction techniques that attempt to boost the learning capability of machine learning algorithms using nonlinear transformations. However, one major challenge in their basic form is that the computational complexity and the memory requirement do not scale well with the training size. Kernel approximation is commonly employed to resolve this issue. Essentially, kernel approximation is equivalent to learning an approximated subspace in the high-dimensional feature vector space induced and characterized by the kernel function. With streaming data acquisition, approximated subspaces can be constructed adaptively. Explicit feature vectors are then extracted by a transformation onto the approximated subspace, and linear learning techniques can subsequently be applied. From a computational point of view, operations in kernel methods can easily be parallelized and modern infrastructures can be utilized to achieve efficient computing. Moreover, the extracted explicit feature vectors can easily be interfaced with other learning techniques.

Keywords

Kernel approximation; Subspace learning; Classification; Nyström; Spark; GPU; CUDA

Chapter Points

• How to use kernel methods in practice.
• How to control the model complexity.
• Kernel approximation as a subspace learning technique for feature construction.
• Learning criteria for adaptive kernel approximation.
• Infrastructure for kernel methods.

6.1 Introduction

Kernel methods are nonlinear transformation techniques that map a given input set into an implicit high-dimensional feature space by utilizing a positive-definite function called the kernel function. In practice, kernel methods are nonparametric learning techniques where the learning process is restricted to the subspace spanned by the training feature vectors. This implies that the complexity of the model grows with the training size. More specifically, given N training data points, the computational complexity and storage depend on an $N \times N$ Gram matrix called the “kernel matrix” K. The elements of the matrix are pairwise operations on all training data.

To reduce this complexity, kernel approximation techniques are commonly applied. There are mainly two types of approximation: the random Fourier features [1] and the Nyström methods [2–4]. Briefly speaking, random Fourier features construct a vector by randomly sampling in the frequency domain. They are computationally efficient, but do not always provide satisfactory performance because they are in general not data-dependent. The other category is the Nyström methods, where the kernel approximation is posed as a subspace learning problem. That is, one tries to approximate K using a low-rank matrix $\hat{K}$, with $\operatorname{rank} (\hat{K}) ≪ N$. This is usually implemented by sampling a subset of the training data. The actual computational complexity of some operations, such as matrix inversion and multiplication, can be reduced from being a function of N to a function of $\operatorname{rank} (\hat{K})$. In its basic version [2], Nyström approximates the subspace by random sampling. More sophisticated sampling strategies have been proposed to achieve different objectives, such as control over the model complexity [5], low approximation error [4,6], high class separability [7] and efficient memory usage [8]. These algorithms can naturally be computed in an adaptive fashion for streaming data acquisition.

There are typically two ways of representing the approximation outcome: (i) the kernel matrix, which is used in most standard kernel models; or (ii) explicit feature vectors, where one applies a subspace transformation of the feature space induced by the kernel function, which is sometimes more straightforward and explicit compared with the kernel matrix formulation. This is illustrated in Fig. 6.1. Given a training data set and a kernel function, the transformation $ϕ : X \to R^{m}$ can be constructed in an iterative fashion. For a data point $x \in X$ , $ϕ (x)$ can be used as the input to the subsequent learning modules, such as regression models, deep neural networks and support vector machines.

Figure 6.1 The transformation from input data x to the feature vector as inputs to the subsequent machine learning module.

This chapter serves as a practical guide to the utilization of subspace learning techniques applied to kernel methods.

6.2 Kernel Methods

Kernel methods [9,10] are popular feature extraction techniques that:

• have the ability to handle nonnumerical/nonvectorial input data;
• are able to deal with high-dimensional data;
• can be formulated as convex optimization problems;
• have well-established theoretical properties.

Essentially, kernel methods map input data onto a high (possibly infinite)-dimensional space for a better data representation. In this section, we introduce kernel methods as a nonparametric learning technique, where the model complexity grows with respect to an increasing training size, which is essentially the dimension of the feature space. We then show that the extracted features can be represented in two different ways. Although being essentially equivalent, one may choose one or the other depending on the application. Issues related to feature dimension and model complexity are then discussed, leading to the necessity of kernel approximation.

6.2.1 Notations

Given a nonempty input set $X$ that could be a collection of text strings, websites, medical records, etc., let $H$ be the Reproducing Kernel Hilbert Space (RKHS) [9] associated with a kernel function $k : X \times X \to R$, with the mapping $φ : X \to H$ such that $k (x, z) = {\langle φ (x), φ (z) \rangle}_{H}$ for $x, z \in X$, where ${\langle \cdot, \cdot \rangle}_{H}$ denotes the inner-product in $H$. Denote the feature vectors $φ : = φ (x)$, $\forall x \in X$, and let d be the dimension of $X$. In a classification setup, let $y_{i}$ denote the label information corresponding to $x_{i}$, where $y_{i} \in \{1, \dots, C\}$ with C being the number of classes. Given a training set $D = \{(x_{i}, y_{i})\}_{i = 1}^{N}$, $N \in \mathbb{N}$, and a subset $M \subseteq D$, the set of indices is denoted $I_{M} = \{i : (x_{i}, y_{i}) \in M\}$. Moreover, let $X_{M} = {[x_{i}]}_{i \in I_{M}}$ be the matrix with the $x_{i}$'s as columns. Given a kernel function k, for subsets $M, N \subseteq D$ (here $N$ names a subset rather than the training size), denote $K_{M N} = k (X_{M}, X_{N}) = {[k (x_{i}, x_{j})]}_{i \in I_{M}, j \in I_{N}}$ and $K_{M} = {[k (x_{i}, x_{j})]}_{i \in I_{M}, j \in I_{M}}$. In particular, we write $K = {[k (x_{i}, x_{j})]}_{i \in I_{D}, j \in I_{D}}$.

6.2.2 Nonparametric Learning Model

There are many machine learning techniques that are closely related to kernel methods, such as Gaussian Processes (GPs) [11] and Support Vector Machines (SVMs) [12]. The focus of kernel methods is mainly on the feature representation instead of the learning objective itself. However, the feature representation is strongly tied to the learning objective via the Representer Theorem and the empirical risk given below [9].

Empirical Risk The learning objective used in most kernel techniques is the regularized empirical risk with the following structure:

$R_{e} = L (f | D) + λ \| f \|,$

(6.1)

where f is a point in the Hilbert space, $L$ is any loss function and $λ \| f \|$ is the regularization term.

Representer Theorem One of the most important results in kernel methods is the Representer Theorem [9], which states that, given an empirical risk (cf. Eq. (6.1)), the optimal solution $f^{⁎}$ lies in the subspace spanned by $\{φ (x_{1}), \dots, φ (x_{N})\}$, denoted by

$F_{D} ≜ \operatorname{span} (\{φ (x_{1}), \dots, φ (x_{N})\}) .$

(6.2)

This indicates the nonparametric nature of kernel techniques. A nonparametric model does not make strong assumptions on the data structure, but the model is instead chosen to fit the training data, where regularizations on the model complexity are usually applied to achieve a reasonable generalization ability on unseen datasets. Such an approach is commonly adopted when there are a lot of training data without sufficient prior knowledge available.

6.2.3 Training With Kernel Methods

6.2.3.1 Feature Representations

The main task of kernel methods is to find a good feature representation that is adapted to the training data, which can be used as an input for the subsequent training model that, for example, tries to find a $f^{⁎}$ that minimizes Eq. (6.1).

From Eq. (6.2), we know that the solution $f^{⁎}$ can be found by projecting data onto $F_{D}$. In practice, for a given training set $\{(x_{1}, y_{1}), \dots, (x_{N}, y_{N})\}$, a kernel function $k : X \times X \to R$, the projection $T_{D} : H \to F_{D}$ and an empirical risk $R_{e}$, there are mainly two ways of representing the features: (i) by the kernel matrix or (ii) by constructing an explicit feature vector.

(i) Representation by inner-product and the kernel matrix K:
One way to represent data is by computing the kernel matrix projected onto $F_{D}$ , which is computed by

$K_{D} = {[\begin{matrix} 〈 T_{D} φ (x_{i}), T_{D} φ (x_{j}) 〉 \end{matrix}]}_{1 ⩽ i, j ⩽ N} = [\begin{matrix} k (x_{1}, x_{1}) & \dots & k (x_{1}, x_{N}) \\ ⋮ & ⋱ & ⋮ \\ k (x_{N}, x_{1}) & \dots & k (x_{N}, x_{N}) \end{matrix}] = K .$

(6.3)

Eq. (6.3) shows that, by projecting feature vectors onto the subspace spanned by the training set, the kernel matrix remains unchanged.
Note that the training data can be entirely represented by K.
(ii) Representation by an explicit feature vector $ϕ (x)$ :
Alternatively, instead of using an $N \times N$ Gram matrix K, one can construct an explicit feature map $ϕ : X \to R^{N}$ to represent data. More specifically, for any data vector $x \in X$ , the explicit feature vector is computed as

$ϕ (x) = A^{T} [\begin{matrix} k (x_{1}, x) \\ ⋮ \\ k (x_{N}, x) \end{matrix}],$

(6.4)

where $A \in R^{N \times N}$ is a transformation matrix such that $A^{T} K A = I_{N \times N}$ .

6.2.3.2 Kernelization of Linear Models

Given a linear model, kernelization refers to introducing nonlinearity to the linear model by using the aforementioned feature representations. This transformation can be implemented using Eq. (6.3) or Eq. (6.4). These two alternative representations are essentially equivalent, in the sense that, for any $x, z \in X$ ,

$ϕ {(x)}^{T} ϕ (z) = \langle T_{D} φ (x), T_{D} φ (z) \rangle .$

(6.5)

It is more common to use Eq. (6.3), where the learning model is formulated using the kernel matrix, as in SVMs, Kernel Ridge Regression (KRR), Kernel Principal Component Analysis (KPCA) and GPs. This alternative requires the follow-up learning algorithm to be formulated using only inner-products to represent input data. Compared with Eq. (6.3), however, the explicit feature vector (cf. Eq. (6.4)) provides a more flexible representation. One can simply apply the nonlinear transformation in Eq. (6.4) and use $ϕ (x) \in R^{N}$ as the input to the next learning module. The advantages are:

(i) Instead of requiring a reformulation using inner-products, any linear model can be applied directly in $R^{N}$ , and hence all the theoretical properties of well-studied linear models can be adopted.
(ii) One can explicitly control the complexity of feature space using subspace transformations.
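The equivalence in Eq. (6.5) can be checked numerically on the training points. The following is a minimal sketch under stated assumptions: a Gaussian kernel, synthetic data, data points stored as rows rather than columns, and one particular choice of the matrix A (built from the eigendecomposition of K) among the many that satisfy $A^{T} K A = I$. The helper names `rbf_kernel` and `phi` are illustrative only.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - z||^2) between rows of X and rows of Z."""
    sq_dist = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # N = 20 synthetic training points (rows)
K = rbf_kernel(X, X)                  # N x N kernel matrix, Eq. (6.3)

# One possible A with A^T K A = I: scale the eigenvectors of K by 1/sqrt(eigenvalue).
eigvals, V = np.linalg.eigh(K)
A = V / np.sqrt(eigvals)

def phi(X_new):
    """Explicit feature vectors of Eq. (6.4); row i is phi(x_i)^T = k(X, x_i)^T A."""
    return rbf_kernel(np.atleast_2d(X_new), X) @ A

Phi = phi(X)                          # N x N matrix of explicit features
print("A^T K A = I:", np.allclose(A.T @ K @ A, np.eye(len(X)), atol=1e-8))
print("Phi Phi^T = K:", np.allclose(Phi @ Phi.T, K, atol=1e-8))
```

Once `Phi` is available, any linear model can be trained on its rows directly, which is the flexibility referred to in advantage (i).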

6.2.4 Model Complexity

One of the main challenges of kernel methods, or nonparametric modeling techniques in general, is the control of the model complexity, where a low complexity leads to an oversimplified description of the world, which results in a low variance but a high bias and vice versa. Particularly, for nonparametric learning techniques, where the model complexity is a function of the training size, this might cause issues discussed in this section.

6.2.4.1 Computational Complexity and Memory Usage

A major challenge of kernel methods is their computational complexity and memory usage. As shown in Eq. (6.3), to project the features onto the subspace $F_{D}$ (cf. Eq. (6.2)), one has to store and invert an $N \times N$ kernel matrix, which requires $O (N^{2})$ memory and $O (N^{3})$ operations. When N is very large, this is usually infeasible on a single-node machine with limited RAM.
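To make the quadratic memory growth concrete, the short calculation below estimates the storage needed for a dense kernel matrix. It assumes 8-byte double-precision entries; the function name `kernel_matrix_gib` is purely illustrative.

```python
def kernel_matrix_gib(n_samples: int, bytes_per_entry: int = 8) -> float:
    """Memory in GiB needed to store a dense n_samples x n_samples kernel matrix."""
    return n_samples ** 2 * bytes_per_entry / 2 ** 30

for n in (10_000, 100_000, 1_000_000):
    print(f"N = {n:>9,d}: {kernel_matrix_gib(n):10.1f} GiB")
```

Under these assumptions, ten thousand training points already need roughly 0.75 GiB, and one million points need several TiB, before any inversion is attempted.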

6.2.4.2 Robustness and Overfitting

Another issue related to the model complexity is overfitting. Since the feature space is essentially the subspace spanned by $\{φ (x_{1}), \dots, φ (x_{N})\}$, the “quality” of the training dataset is crucial to construct a robust feature space. That is, noisy training data might result in a relatively arbitrary feature representation. Besides trying to increase the Signal-to-Noise Ratio (SNR) of the measurement, which has limitations of its own, one commonly adopted solution is to find a more robust and invariant subspace of the feature space.

6.2.5 Kernel Approximation and Related Work

With the rapidly growing data sizes in modern machine learning applications, model reduction techniques are developed to reduce the computational complexity, the storage requirement and the model complexity. To this end, in kernel methods, approximation techniques are introduced. There are mainly two types of kernel approximation techniques: the random Fourier features [1] and the Nyström methods. A comparison between the original versions of these two techniques can be found in [13,14].

Briefly speaking, random Fourier features are explicitly constructed by a random sampling in the Fourier domain. Although computations of such techniques can be quite efficient [15], the main limitation is that the basis functions of the feature construction are in general not data-dependent. Moreover, in its original form, random Fourier features require numerical input sets.

Another popular technique is the Nyström method, where a subspace approximation for the kernel matrix is learned from data. This technique is motivated in Section 6.3.1. Nyström is generally more computationally costly compared with random Fourier features, but due to its data-dependent nature, it typically provides a better performance in data driven problems. Moreover, with Nyström techniques, one can iteratively update the approximated subspace by including/rejecting new data points with respect to some criteria, which induces a nice adaptive framework.

6.2.5.1 Random Fourier Features

Random Fourier feature [1] techniques construct an explicit feature map such that the inner-product of the explicit feature vectors can be used to approximate the kernel function. The kernel functions are required to be shift-invariant, i.e., $k (x, z) = k (x - z)$ for any $x, z \in X$ . The construction is carried out by drawing independent and identically distributed samples in the frequency domain of the kernel function. The main theorem to ensure the existence of random Fourier features is shown here.

Theorem 1

[16]

A continuous shift-invariant kernel $k (x, z) = k (δ)$, with $δ = x - z$, on $R^{d}$ is positive definite if and only if $k (δ)$ is the Fourier transform of a nonnegative measure, i.e.,

$k (x, z) = \int_{R^{d}} p (ω) e^{j ω^{T} (x - z)} d ω = E_{ω} (ζ_{ω} (x) ζ_{ω} {(z)}^{⁎}),$

(6.6)

where $ζ_{ω} (x) ≜ e^{j ω^{T} x} = \cos (ω^{T} x) + j \sin (ω^{T} x)$, $^{⁎}$ denotes the complex conjugate and j is the imaginary unit.

Theorem 1 shows that $ζ_{ω} (x) ζ_{ω} {(z)}^{⁎}$ is an unbiased estimator of $k (x, z)$ for shift-invariant kernels when ω is drawn from $p (ω)$. Since $k (x, z)$ is a real number, the imaginary part of the expectation vanishes and we have $k (x, z) = E_{ω} (\cos (ω^{T} (x - z)))$.

The idea is then to construct an explicit feature map $z : X \to R^{m}$ , such that

$z_{ω} {(x)}^{T} z_{ω} (z) = \operatorname{Re} (ζ_{ω} (x) ζ_{ω} {(z)}^{⁎})$

(6.7)

$= \cos (ω^{T} x - ω^{T} z)$

(6.8)

$= \sin (ω^{T} x) \sin (ω^{T} z) + \cos (ω^{T} x) \cos (ω^{T} z) .$

(6.9)

Empirically, the feature map can then be constructed using

$ϕ (x) = \sqrt{\frac{1}{m}} {[\begin{matrix} \cos (ω_{1}^{T} x) & \dots & \cos (ω_{m}^{T} x) & \sin (ω_{1}^{T} x) & \dots & \sin (ω_{m}^{T} x) \end{matrix}]}^{T},$

where $ω_{1}, \dots, ω_{m}$ are independent and identically distributed samples drawn from $p (ω)$ .

The algorithm is summarized in Algorithm 1. Details will be addressed in Chapter 7.
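As an illustration of the feature map above, the following sketch draws the frequencies for one specific shift-invariant kernel. It assumes the Gaussian kernel $k (x, z) = \exp (- γ \| x - z \|^{2})$, whose spectral density $p (ω)$ is a zero-mean Gaussian with covariance $2 γ I$; the function name `random_fourier_features`, the synthetic data and the parameter values are illustrative and not taken from Algorithm 1.

```python
import numpy as np

def random_fourier_features(X, n_freq, gamma, rng):
    """Map rows of X to 2*n_freq features whose inner products
    approximate the Gaussian kernel exp(-gamma * ||x - z||^2)."""
    d = X.shape[1]
    # For the Gaussian kernel, p(omega) is N(0, 2*gamma*I).
    omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_freq))
    proj = X @ omega                                   # omega_i^T x for all i and x
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(n_freq)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                           # synthetic input data (rows)
gamma = 0.5

# Exact kernel matrix for comparison.
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-gamma * sq_dist)

for m in (10, 100, 1000):
    Phi = random_fourier_features(X, m, gamma, rng)
    print(f"m = {m:4d}: max |K - Phi Phi^T| = {np.max(np.abs(K - Phi @ Phi.T)):.3f}")
```

The frequencies are sampled independently of the data, which is why the approximation quality depends only on the number of frequencies m and not on how well they match the training set.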

6.2.5.2 Nyström Methods: Subspace Learning

The Nyström methods form another important family of kernel approximation techniques, where the effective rank of the kernel matrix is exploited in a more direct fashion. That is, one assumes an underlying subspace structure in the kernel matrix, i.e., $K \approx \hat{K}$ for some $\hat{K}$ such that $\operatorname{rank} (\hat{K}) = m < N$. The task is then to use the rank-deficient matrix $\hat{K}$ to approximate the kernel matrix.

In its basic form, Nyström randomly selects a subset of the training data $G \subset D$ and constructs the approximation using

$\hat{K} = \underbrace{k (X_{D}, X_{G})}_{\in R^{N \times m}} \underbrace{k {(X_{G}, X_{G})}^{- 1}}_{\in R^{m \times m}} \underbrace{k (X_{G}, X_{D})}_{\in R^{m \times N}} = K_{D G} K_{G}^{- 1} K_{G D} .$

The random subset selection is computationally simple and in many applications sufficient to capture the underlying data structure. Many more advanced sampling strategies have been proposed for different purposes. Most techniques cast the approximation problem as an estimation of the top m singular values of K. Essentially, they attempt to approximate the subspace as well as possible while trying to reduce the computational complexity and/or storage usage. For example, these techniques include subspace approximation using the incomplete Cholesky decomposition [17], identifying representative points by clustering [18], and enabling parallelization and performance boosting by the ensemble Nyström method [28]. Some other methods exploit the intrinsic structure of the kernel matrix to obtain a more compact and efficient memory storage, such as the Memory-Efficient Kernel Approximation (MEKA) [8], Clustered Low-Rank Approximation (CLRA) [19] and the CLAss-specific Subspace Kernel (CLASK) representation [7]. These techniques more or less share the same base algorithm with different sampling criteria and/or feature representation strategies. In the following sections, we explore the underlying framework of kernel subspace approximation and its adaptive learning algorithms.
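The basic Nyström approximation $\hat{K} = K_{D G} K_{G}^{- 1} K_{G D}$ with a uniformly sampled subset can be sketched as follows. This is an illustration under assumptions, not a reference implementation: it uses a Gaussian kernel on synthetic data, stores data points as rows, and replaces the inverse by a pseudo-inverse (`np.linalg.pinv`) for numerical robustness when $K_{G}$ is ill-conditioned. The names `rbf_kernel` and `nystrom` are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - z||^2) between rows of X and rows of Z."""
    sq_dist = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def nystrom(X, landmark_idx, gamma=1.0):
    """Return K_hat = K_DG pinv(K_G) K_GD for the subset G given by landmark_idx."""
    K_DG = rbf_kernel(X, X[landmark_idx], gamma)   # N x m
    K_G = K_DG[landmark_idx]                       # m x m block k(X_G, X_G)
    return K_DG @ np.linalg.pinv(K_G) @ K_DG.T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # synthetic training inputs
gamma = 0.1
K = rbf_kernel(X, X, gamma)

for m in (10, 25, 50):
    idx = rng.choice(len(X), size=m, replace=False)   # uniform random subset G
    rel_err = np.linalg.norm(K - nystrom(X, idx, gamma)) / np.linalg.norm(K)
    print(f"m = {m:3d}: relative Frobenius error = {rel_err:.4f}")
```

Only the $N \times m$ block $K_{D G}$ is ever evaluated from the kernel function; the full matrix K is formed here solely to measure the error.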

6.3 Kernel Subspace Approximation

6.3.1 Motivation

Briefly speaking, kernel subspace approximation applies a subspace transformation to the feature vector space $F_{D}$ . The transformed data vectors with a lower dimensionality are then used as the new input features to the subsequent module.

That is,

$T : F_{D} \to F,$

(6.12)

where $F$ has a lower dimension than $F_{D}$. By applying this transformation, the complexity is then reduced to be a function of $\dim (F)$. The flowchart of this process can be found in Fig. 6.2.

Figure 6.2 Transformations in kernel methods.

Given a training set $D$ and a kernel function k, the idea is to find a subset $G \subseteq D$ with cardinality $| G | ≪ N$ and a transformation matrix $T \in R^{| G | \times m}$ with $m ⩽ | G |$, such that

$T^{T} k (X_{G}, X_{G}) T = I_{m \times m},$

(6.13)

which gives a transformation matrix $k (\cdot, X_{G}) T$ with orthonormal columns.
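One way to obtain a matrix T that satisfies Eq. (6.13) is to whiten the top-m eigenpairs of $k (X_{G}, X_{G})$. The sketch below is only one possible construction, assumed here for illustration; it uses a Gaussian kernel, synthetic data stored as rows, and a hypothetical helper `whitening_transform`.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - z||^2) between rows of X and rows of Z."""
    sq_dist = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def whitening_transform(K_G, m):
    """Return T (|G| x m) with T^T K_G T = I_m, built from the top-m eigenpairs of K_G."""
    eigvals, eigvecs = np.linalg.eigh(K_G)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:m]             # indices of the m largest
    return eigvecs[:, top] / np.sqrt(eigvals[top])  # scale each column by 1/sqrt(eigenvalue)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                       # synthetic training inputs
G = rng.choice(len(X), size=15, replace=False)      # subset G with |G| = 15
K_G = rbf_kernel(X[G], X[G], gamma=0.5)

T = whitening_transform(K_G, m=8)
print("T shape:", T.shape)
print("T^T K_G T = I:", np.allclose(T.T @ K_G @ T, np.eye(8), atol=1e-8))
```

Choosing m smaller than |G| discards the directions with the smallest eigenvalues, which also avoids dividing by near-zero values.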

6.3.2 Training Complexity With Approximation

In kernel methods, during the training process, to find the exact or approximated optimal solution of unknown parameters in the learning model, the computational complexity is typically dominated by the operation ${(K + λ I)}^{- 1}$ , where I is the $N \times N$ identity matrix; $λ \in R^{+}$ is the ridge parameter to handle the ill-conditioned matrix inversion [20]. In practice, this matrix inversion requires $O (N^{2.8})$ [21] to $O (N^{3})$ operations depending on the algorithm.¹ Kernel approximations transform the learning model onto a lower dimensional feature space and subsequently reduce the computational complexity. In this section, we discuss the approximation for the two different feature representations in Eq. (6.3) and Eq. (6.4).

6.3.2.1 Approximated Kernel Matrix

One computational advantage of kernel methods is that all computations are carried out by inner-products. The kernel matrix K (cf. Eq. (6.3)) that holds the pair-wise inner-product of the training data is then a key component of kernel methods. Essentially, the subspace approximation can be interpreted as weighted inner-products, which transforms the kernel matrix as

$K \approx \tilde{K} = \underbrace{[\begin{matrix} k {(X_{G}, x_{1})}^{T} \\ ⋮ \\ k {(X_{G}, x_{N})}^{T} \end{matrix}]}_{K_{D G} \in R^{N \times | G |}} T T^{T} \underbrace{[\begin{matrix} k (X_{G}, x_{1}) & \dots & k (X_{G}, x_{N}) \end{matrix}]}_{K_{G D} \in R^{| G | \times N}} \in R^{N \times N},$

(6.14)

where $k (X_{G}, x) = [\begin{matrix} k (x_{i_{1}}, x) \\ ⋮ \\ k (x_{i_{| G |}}, x) \end{matrix}]$ and the indices $\{i_{j} : i_{j} \in I_{G}, j = 1, \dots, | G |\}$ form the index set of the subset $G$. The weighting matrix $T T^{T}$ (cf. Eq. (6.13)) has rank m and is therefore rank-deficient whenever $m < | G |$.

After kernel approximation, the size of the kernel matrix remains the same, but the approximated kernel matrix $\tilde{K}$ becomes rank-deficient, i.e., $\operatorname{rank} (\tilde{K}) = m$. In combination with the matrix inversion lemma, the kernel matrix inversion can therefore be reduced to

${(K + λ I)}^{- 1} \approx {(\tilde{K} + λ I)}^{- 1} = \frac{1}{λ} I - \frac{1}{λ^{2}} K_{D G} T \underbrace{{(I + \frac{1}{λ} T^{T} K_{G D} K_{D G} T)}^{- 1}}_{\in R^{m \times m}} T^{T} K_{G D} .$

(6.15)

The computational complexity is listed as follows:

• Matrix multiplication $K_{D G} T$ : $O (N | G | m)$ .
• Matrix multiplication ${(K_{D G} T)}^{T} K_{D G} T$ : $O (N m^{2})$ .
• Matrix inversion: $O (m^{3})$ .

Due to the fact that $N ≫ | G | ⩾ m$ , we have the overall complexity $O (N | G | m)$ .
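The saving promised by Eq. (6.15) can be illustrated by applying ${(\tilde{K} + λ I)}^{- 1}$ to a vector without ever forming an $N \times N$ matrix. The sketch below assumes a Gaussian kernel, synthetic data stored as rows, and a T built from the top-m eigenpairs of $K_{G}$ (one possible choice that satisfies Eq. (6.13)); the helper `woodbury_solve` is hypothetical, and the direct $N \times N$ solve is included only as a check.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - z||^2) between rows of X and rows of Z."""
    sq_dist = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def woodbury_solve(B, y, lam):
    """Return (B B^T + lam I)^{-1} y using only an m x m solve, cf. Eq. (6.15)."""
    m = B.shape[1]
    small = np.eye(m) + (B.T @ B) / lam               # m x m matrix, cost O(N m^2)
    return y / lam - B @ np.linalg.solve(small, B.T @ y) / lam**2

rng = np.random.default_rng(0)
N, n_sub, m, lam, gamma = 500, 40, 20, 0.1, 0.5
X = rng.normal(size=(N, 3))                           # synthetic training inputs
y = rng.normal(size=N)                                # arbitrary right-hand side

G = rng.choice(N, size=n_sub, replace=False)
K_DG = rbf_kernel(X, X[G], gamma)                     # N x |G|
eigvals, eigvecs = np.linalg.eigh(K_DG[G])            # K_G is the block of rows indexed by G
T = eigvecs[:, -m:] / np.sqrt(eigvals[-m:])           # top-m eigenpairs, T^T K_G T = I_m
B = K_DG @ T                                          # N x m, so K_tilde = B B^T

fast = woodbury_solve(B, y, lam)
direct = np.linalg.solve(B @ B.T + lam * np.eye(N), y)   # O(N^3) reference
print("max abs difference:", np.max(np.abs(fast - direct)))
```

The three steps in `woodbury_solve` correspond to the operations counted in the list above: forming `B`, forming `B.T @ B`, and solving an $m \times m$ system.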

6.3.2.2 Approximated Sample Covariance Matrix Using Explicit Feature Vectors

Kernel approximation can be readily applied to create an explicit feature vector by a map $\tilde{ϕ} : X \to R^{m}$ . Specifically,

$\tilde{ϕ} (x) = T^{T} [\begin{matrix} k (x_{i_{1}}, x) \\ ⋮ \\ k (x_{i_{| G |}}, x) \end{matrix}] \in R^{m} .$

(6.16)

Eq. (6.16) can then be used as the feature vector for the succeeding learning model. When applying Eq. (6.16) to linear learning techniques, the sample covariance matrix typically needs to be inverted. Assuming that the feature vector $\tilde{ϕ} (x)$ is always centered (cf. Appendix 6.A.1), the (scaled) sample covariance matrix is defined as

$C = [\begin{matrix} \tilde{ϕ} (x_{1}) & \dots & \tilde{ϕ} (x_{N}) \end{matrix}] [\begin{matrix} \tilde{ϕ} {(x_{1})}^{T} \\ ⋮ \\ \tilde{ϕ} {(x_{N})}^{T} \end{matrix}]$

(6.17)

$= T^{T} [\begin{matrix} k (X_{G}, x_{1}) & \dots & k (X_{G}, x_{N}) \end{matrix}] [\begin{matrix} k {(X_{G}, x_{1})}^{T} \\ ⋮ \\ k {(X_{G}, x_{N})}^{T} \end{matrix}] T \in R^{m \times m} .$

(6.18)

The computational complexity of $C^{- 1}$ is composed of:

• Matrix multiplication: $O (N | G | m)$ .
• Matrix inversion: $O (m^{3})$ .

Hence, we have the overall computational complexity $O (N | G | m)$ .
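The explicit feature route of Eqs. (6.16)–(6.18) can be sketched as follows. The example makes several assumptions that are not part of the text: a Gaussian kernel, synthetic data stored as rows, a T built from the top-m eigenpairs of $K_{G}$, ridge regression as the linear learner, and no centering (the centering step of Appendix 6.A.1 is omitted for brevity). The helper `explicit_features` is hypothetical.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - z||^2) between rows of X and rows of Z."""
    sq_dist = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

rng = np.random.default_rng(0)
N, n_sub, m, lam, gamma = 300, 30, 10, 0.1, 0.5
X = rng.normal(size=(N, 3))                      # synthetic training inputs
y = np.sin(X[:, 0])                              # toy regression target

G = rng.choice(N, size=n_sub, replace=False)     # subset G
K_G = rbf_kernel(X[G], X[G], gamma)
eigvals, eigvecs = np.linalg.eigh(K_G)
T = eigvecs[:, -m:] / np.sqrt(eigvals[-m:])      # one choice with T^T K_G T = I_m

def explicit_features(X_new):
    """Rows are phi_tilde(x)^T = k(X_G, x)^T T, cf. Eq. (6.16)."""
    return rbf_kernel(X_new, X[G], gamma) @ T

Phi = explicit_features(X)                       # N x m feature matrix
C = Phi.T @ Phi                                  # m x m matrix of Eqs. (6.17)-(6.18), uncentered
w = np.linalg.solve(C + lam * np.eye(m), Phi.T @ y)   # ridge regression in R^m
y_hat = Phi @ w

# The same predictions via the N x N approximated kernel matrix of Eq. (6.14).
K_tilde = Phi @ Phi.T
y_hat_dual = K_tilde @ np.linalg.solve(K_tilde + lam * np.eye(N), y)
print("C shape:", C.shape)
print("primal and dual predictions agree:", np.allclose(y_hat, y_hat_dual, atol=1e-6))
```

Only the $m \times m$ system in `w` has to be solved in the explicit formulation; the $N \times N$ solve is shown purely to confirm that both representations give the same predictions.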

In the next section, we focus on an adaptive algorithmic framework and its instantiations to construct desired subspaces.

6.4 Adaptive Kernel Subspace Approximation Algorithm

6.4.1 Algorithmic Framework

With online data acquisition, the subspace approximation can be implemented in an adaptive fashion. One generic framework is summarized in Algorithm 2.

Algorithm 2 Generic Kernel Subspace Approximation Framework.

The matrix inversion lemma [22] is applied to obtain an iterative update of the kernel matrix inversion. The actual sampling strategy is embedded in the “criteria”, which are discussed in the upcoming section.

6.4.2 Algorithm Design

The key design components in Algorithm 2 are (1) the sampling criteria to find $G$ and (2) the construction of the transformation matrix T. In this section, we discuss the principles of finding $G$ and T and how they may affect the performance of the algorithm.

6.4.2.1 Design Criteria

Generally speaking, we are interested in building a model (i.e., the approximated subspace) with a high generalization ability, which is measured by the performance on unseen datasets. The generalization ability is closely related to the model complexity. That is, a model with a high complexity tends to perform well on the training data but might perform poorly on testing datasets, whereas a model with a low complexity may fail to fit even the training data. The former is referred to as the “overfitting” problem.

This can be understood through the bias-variance trade-off, where the performance statistics are measured when the model is trained using different training sets with a fixed training size. More precisely, when the model complexity is high, the search space of the model parameters is large and hence the bias of the model tends to be low. On the other hand, since the model is derived from the training data, the higher the model complexity, the higher the resulting variance.

The strategy for achieving a high generalization is twofold: one needs to (1) reduce the model complexity and (2) maintain a low empirical risk on the training data. By combining an intuitive illustration in Fig. 6.3 and the actual model representation in Eq. (6.16), we see that the model complexity (the “size” of the circle) is related to (1) the size of the subset $G$ and (2) the number of columns in T; and the empirical risk (the “location” of the circle) depends on (1) the sampling criteria for $G$ and (2) the construction strategy for the transformation matrix T. Given the analysis, in this section, we demonstrate some design criteria for Algorithm 2.

Rejecting samples with a low innovation: The “innovation” of a data point x is the projection error from $φ (x)$ to the existing subspace. The sample is only accepted if the innovation is large enough, i.e.,

$k (x, x) - k_{G j}^{T} {invK}_{G} k_{G j} ⩾ η \in (0, 1) .$

(6.23)

When Eq. (6.23) is violated, $φ (x)$ is already well represented by the existing subspace and does not need to be included. Hence, the larger η is, the fewer data points are included in the subspace construction.
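The innovation test of Eq. (6.23) can be sketched as a single pass over the data. This is an illustrative sketch and not a transcription of Algorithm 2: it assumes a Gaussian kernel (so that $k (x, x) = 1$), synthetic data stored as rows, and a standard block-inverse update (a form of the matrix inversion lemma) to maintain the inverse of $K_{G}$, stored in the variable `K_inv`. The function name `select_by_innovation` is hypothetical.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - z||^2) between rows of X and rows of Z."""
    sq_dist = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def select_by_innovation(X, eta, gamma=1.0):
    """Stream over the rows of X and keep those whose innovation is at least eta."""
    idx = [0]                                    # the first point is always accepted
    K_inv = np.array([[1.0]])                    # inverse of K_G; k(x, x) = 1 here
    for j in range(1, len(X)):
        k_gj = rbf_kernel(X[idx], X[j:j + 1], gamma)[:, 0]   # k(X_G, x_j)
        a = K_inv @ k_gj
        innovation = 1.0 - k_gj @ a              # left-hand side of Eq. (6.23)
        if innovation >= eta:
            # Block-inverse update of K_G^{-1} after appending x_j to G.
            top_left = K_inv + np.outer(a, a) / innovation
            K_inv = np.block([[top_left, -a[:, None] / innovation],
                              [-a[None, :] / innovation, np.array([[1.0 / innovation]])]])
            idx.append(j)
    return idx, K_inv

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                    # synthetic data stream
for eta in (0.05, 0.2, 0.5):
    idx, _ = select_by_innovation(X, eta, gamma=0.5)
    print(f"eta = {eta:4.2f}: |G| = {len(idx)}")
```

Each accepted point costs one rank-one update of `K_inv`, so the inverse never has to be recomputed from scratch as the subset grows.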

Dimension reduction by choosing principal components: Principal Component Analysis (PCA) [23] is among the most commonly used linear techniques for dimension reduction. From Eq. (6.16), we see that, given the subset $G$ and a hyperparameter $m < | G |$, the transformation matrix $T \in R^{| G | \times m}$ can be constructed by applying PCA on the dataset $U = \{{[k (x_{i_{1}}, x) \dots k (x_{i_{| G |}}, x)]}^{T} : x \in D\}$. That is, let $\tilde{T}$ be the matrix that contains the sorted principal vectors computed from $U$ as its columns. The matrix T is obtained by truncation. We have

$T = \tilde{T} (:, 1 : m) .$

(6.24)

Since $m < | G |$ , the model complexity is then further reduced.

Class Separability: In classification problems, the reconstruction error is not the only criterion to measure the quality of the subspace. When label information is available, the discriminative ability is preferably taken into consideration. This information can be encoded in various ways. For example, the CLAss-specific Subspace Kernel (CLASK) representation [7] encodes the information into the feature vector by constructing one subspace for each class. The sampling criteria depend on the between-class projection error. Fisher discriminant analysis [24] finds a transformation matrix T that maximizes the Fisher Quotient, which measures the class separability. Other constrained optimizations can be applied to finding T that results in a good class separation, which can be cast as metric learning [25–27] problems.

Ensemble techniques: One efficient way of reducing the variance is to learn multiple models from subsets of the training data and to base the final estimate on an ensemble decision. One such example is the ensemble Nyström method [28], where

$\hat{K} = \sum_{r = 1}^{P} μ_{r} {\hat{K}}_{r}$

(6.25)

and each ${\hat{K}}_{r}$ is estimated from a subset of $D$ and $μ_{r}$ are the weights for $r = 1 \dots P$ , $P \in N^{+}$ .
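Eq. (6.25) can be illustrated with the simplest weighting scheme. The sketch below assumes uniform weights $μ_{r} = 1 / P$, uniformly sampled subsets, a Gaussian kernel and synthetic data stored as rows; other weightings are possible and are not covered here. The function names are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - z||^2) between rows of X and rows of Z."""
    sq_dist = (X**2).sum(axis=1)[:, None] + (Z**2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

def nystrom(X, landmark_idx, gamma=1.0):
    """Basic Nystrom approximation K_DG pinv(K_G) K_GD for one subset."""
    K_DG = rbf_kernel(X, X[landmark_idx], gamma)
    return K_DG @ np.linalg.pinv(K_DG[landmark_idx]) @ K_DG.T

def ensemble_nystrom(X, n_experts, m, gamma, rng):
    """Average n_experts Nystrom approximations, each built from m random points."""
    N = len(X)
    K_hat = np.zeros((N, N))
    for _ in range(n_experts):
        idx = rng.choice(N, size=m, replace=False)
        K_hat += nystrom(X, idx, gamma) / n_experts    # uniform weights mu_r = 1/P
    return K_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # synthetic training inputs
gamma = 0.1
K = rbf_kernel(X, X, gamma)

single = nystrom(X, rng.choice(len(X), size=20, replace=False), gamma)
ensemble = ensemble_nystrom(X, n_experts=5, m=20, gamma=gamma, rng=rng)
print("single   relative error:", np.linalg.norm(K - single) / np.linalg.norm(K))
print("ensemble relative error:", np.linalg.norm(K - ensemble) / np.linalg.norm(K))
```

Because the P approximations are independent of each other, the loop body can be distributed across workers, which is the parallelization opportunity offered by the ensemble formulation.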

Figure 6.3 Illustration of the algorithm design criteria. The size of the circles indicates the model complexity. Circles $F$, $F_{1}, \dots, F_{n}$ are subspaces with the same complexity and $F^{'}$ has a lower complexity compared with the rest. The dot $f^{⁎}$ indicates the optimal solution that minimizes the empirical risk given the subspace $F_{D}$ that is spanned by the whole training dataset (cf. Eq. (6.1)). The goal of the algorithm design is to find a subspace with an appropriate size (model complexity) and a reasonable location (which hopefully contains $f^{⁎}$) before the training process.

6.4.2.2 Empirical Risk

Given a model $F$ designed by the aforementioned criteria, empirical risks are measured to ensure that the “location” of the circle in Fig. 6.3 is reasonable. With the same model complexity, the model corresponding to a lower empirical risk is preferred. Typically, there are two empirical risks to be considered.

Reconstruction error: The approximated subspace $F$ (cf. Eq. (6.12)) has to be representative and is evaluated by the following reconstruction error on the training data. Let ${\hat{φ}}_{n}$ be the projection of $φ (x_{n})$ onto $F$ . The empirical risk of the reconstruction error (on the training set $D$ ) is defined as

$E_{F} = \frac{1}{N} \sum_{n = 1}^{N} \frac{{\| φ_{n} - \sum_{i = 1}^{| G |} a_{n, i}^{⁎} {\hat{φ}}_{n} \|}_{2}^{2}}{{\| φ_{n} \|}^{2}}$

(6.26)

$= \frac{1}{N} \sum_{n = 1}^{N} (1 - \frac{{\| \sum_{i = 1}^{| G |} a_{n, i}^{⁎} k (x_{n}, x_{i}) T T^{T} k (x_{n}, x_{i}) \|}_{2}^{2}}{k (x_{n}, x_{n})}),$

(6.27)

where $a_{n, i}^{⁎}$ 's are found by

$a_{n, i}^{⁎} = \underset{a_{1, 1}, \dots, a_{N, m}}{\arg \min} E_{F} (a_{1, 1}, \dots, a_{N, m}) .$

We see that, by setting the hyperparameter η (cf. Eq. (6.23)) to a small value, the reconstruction error $E_{F}$ becomes small. On the other hand, a small number of columns in T results in a large $E_{F}$ due to the truncation (cf. Eq. (6.24)).

Classification error: In classification problems, the ultimate risk function is the 0–1 loss, which is defined as

$L_{F} = \frac{1}{N} \sum_{n = 1}^{N} 1 (f_{F}^{⁎} ({\hat{φ}}_{n}) \neq y_{n}),$

(6.28)

where $y_{n}$ are the labels for data $x_{n}$ , 1 is the indicator function and $f_{F}^{⁎}$ is found by minimizing $L_{F}$ . Note that, given a training set, this empirical risk depends not only on the subspace approximation algorithm, but also on the classification learning model.

6.5 Infrastructures

The computational complexity of kernel methods mainly depends on the dimension d of the input space, the training size N and the approximated rank m. Given a large training size N and a complex intrinsic data structure, fully exploiting the representational power of kernel methods requires taking advantage of modern computational infrastructures and frameworks.

6.5.1 Speedup: GPU/CUDA

To speed up the kernel algorithms, one can take advantage of the graphics processing units (GPUs) for general computational purposes [29]. This is readily applicable due to the fact that computations in kernel methods mainly involve a large amount of simple operations that can easily be parallelized.

In recent years, graphics cards have evolved into GPUs that offer computing resources for general-purpose computations. Typically, a GPU card contains hundreds or even thousands of simple processors (cores), allowing the concurrent execution of many threads. GPU cores, unlike the cores in a multicore CPU, are not full-blown processors. They do not have their own memory but share memory with other cores, and they execute simple tasks, such as mathematical operations, shared memory access and access to the global GPU memory.

The Compute Unified Device Architecture (CUDA) by NVIDIA is a parallel computing platform and programming model that enables the use of the GPU cores for general-purpose parallel computing. Threads are organized into blocks that run on groups of cores called streaming multiprocessors. CUDA follows the data-parallel model of execution: all threads execute the same function, called a “kernel” (not to be confused with the kernel function k), but each thread operates on a different part of the data depending on its ID. This programming paradigm is also known as the Single Instruction–Multiple Threads (SIMT) model [30]. The basic idea behind CUDA is that the GPU operates separately from the host. The CPU submits a job to the GPU by copying the necessary data to the global GPU memory, the GPU does the processing, and the result is returned to the CPU so that it can be printed, saved on disk, etc. The GPU device has no access to the system I/O and essentially operates as an accelerator controlled by the host. Most popular machine learning packages, such as TensorFlow and Theano, can take advantage of CUDA to accelerate computations.

Example: Computing Kernel Functions. We see from Algorithm 2 that the computational complexity of the algorithm is dominated by kernel function computations and vector-matrix multiplications. In this example, we show the speedup that can be achieved by using CUDA on a GPU compared with the CPU performance. The setup of the experiment is specified as follows.

Hardware specifications:

- GPU: GeForce GTX 960M with 640 CUDA cores
- CPU: Intel i7-6700HQ processor 2.6 GHz 4 cores 8 threads

Data: two synthetic datasets $\{ x_{1}, \dots, x_{N1} \}$ and $\{ z_{1}, \dots, z_{N2} \}$ , where each element in each data vector is generated independently from a normal distribution.

Computations involved: allocate memory and compute the entries of an $N1 \times N2$ matrix, where each element is $M (i, j) = k (x_{i}, z_{j})$ for $i = 1, \dots, N1$ , $j = 1, \dots, N2$ and $x_{i}, z_{j} \in \mathbb{R}^{d}$ . To vary the number of operations per thread, we compute and compare the run time for the following kernels:

- Single kernel: polynomial kernel,

$k (x_{i}, z_{j}) = {(x_{i}^{T} z_{j} + 1)}^{2} .$

- Multiple-kernel [31],

$\begin{matrix} k (x_{i}, z_{j}) & = & \sum_{l = 1}^{3} k_{l} (x_{i}, z_{j}) \\ & = & {(x_{i}^{T} z_{j} + 1)}^{2} + \exp (- \frac{1}{2} {‖ x_{i} - z_{j} ‖}^{2}) + \exp (- \frac{1}{2} ‖ x_{i} - z_{j} ‖) . \end{matrix}$

(6.29)
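For reference, the two kernel matrices can be computed on the CPU with vectorized NumPy operations as in the following sketch. It is not the CUDA code used in the experiment; the function names `poly_kernel`, `pairwise_sq_dists` and `multiple_kernel` are our own, and the rows of `X` and `Z` are assumed to hold the vectors $x_{i}$ and $z_{j}$ .

```python
import numpy as np

def poly_kernel(X, Z):
    """Single kernel: M[i, j] = (x_i^T z_j + 1)^2."""
    return (X @ Z.T + 1.0) ** 2

def pairwise_sq_dists(X, Z):
    """Squared Euclidean distances ||x_i - z_j||^2 for all pairs (i, j)."""
    sq = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.maximum(sq, 0.0)  # clip tiny negative values from rounding

def multiple_kernel(X, Z):
    """Sum of the three kernels in Eq. (6.29)."""
    sq = pairwise_sq_dists(X, Z)
    return poly_kernel(X, Z) + np.exp(-0.5 * sq) + np.exp(-0.5 * np.sqrt(sq))

# Synthetic data as in the experiment: i.i.d. normal entries.
rng = np.random.default_rng(0)
N1, N2, d = 64, 128, 10
X = rng.standard_normal((N1, d))
Z = rng.standard_normal((N2, d))
M_single = poly_kernel(X, Z)        # shape (N1, N2)
M_multiple = multiple_kernel(X, Z)  # shape (N1, N2)
```

The multiple-kernel variant reuses the inner products and distances but performs more arithmetic per matrix entry, which is what varies the number of operations per thread in the GPU experiment.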

Variables: to compare the performance and gain a better understanding, we vary the following parameters:

- Data-related: N1, N2, d
- Operation-related: single kernel computation versus multiple-kernel computation
- CUDA-related: THREAD_PER_BLOCK
- CUDA block grid: dim3 blocks(N1, N2/THREAD_PER_BLOCK, 1)

The design of the CUDA algorithm can be further optimized. For simplicity, we choose to compute one element $k (x_{i}, z_{j})$ per thread.
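The one-element-per-thread design can be emulated sequentially in Python to make the index mapping explicit. The sketch below assumes that the block index along the first grid dimension selects the row i, and that the block index along the second dimension together with the thread index selects the column j. This mapping, the rounding up of the number of blocks and the bounds check are our assumptions and may differ from the actual CUDA implementation. The names `launch_emulated` and `poly_kernel_pair` are hypothetical.

```python
import numpy as np

def poly_kernel_pair(x, z):
    """Work done by one thread: k(x_i, z_j) = (x_i^T z_j + 1)^2."""
    return (x @ z + 1.0) ** 2

def launch_emulated(X, Z, kernel_fn, threads_per_block):
    """Sequential emulation of a grid of size (N1, ceil(N2 / threads_per_block), 1)."""
    n1, n2 = X.shape[0], Z.shape[0]
    # Round up so that every column is covered when n2 is not a multiple
    # of threads_per_block.
    blocks_y = (n2 + threads_per_block - 1) // threads_per_block
    M = np.full((n1, n2), np.nan)
    for block_x in range(n1):                       # plays the role of blockIdx.x
        for block_y in range(blocks_y):             # plays the role of blockIdx.y
            for thread_x in range(threads_per_block):  # plays the role of threadIdx.x
                i = block_x
                j = block_y * threads_per_block + thread_x
                if j < n2:  # threads beyond the last column do nothing
                    M[i, j] = kernel_fn(X[i], Z[j])
    return M
```

On a GPU the three loops disappear, since each (block, thread) combination runs concurrently and only the loop body is written in the device function.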

The run time comparison can be found in Fig. 6.4, Fig. 6.5 and Fig. 6.6, where the run time is measured in milliseconds using the C function clock(), which returns a value of type clock_t. With the aforementioned setup, we observe a major speedup using the GPU under most circumstances and for various GPU parameters. However, note that when the number of operations per thread is too low ( $d = 10$ ), the CPU can outperform the GPU, since the overhead of CUDA, such as data transfer, dominates the run time.

Figure 6.4 CUDA performance measurement I. Run time comparison for different d. The number of elements is N1 × N2.

Figure 6.5 CUDA performance measurement II(1). THREAD_PER_BLOCK is varied with N1 = 64.

Figure 6.6 CUDA performance measurement II(2). THREAD_PER_BLOCK is varied for N1 = 512.

6.5.2 Scaling: Spark

Apache Spark [32–34] is a data-parallel computing framework offering scalability, fault tolerance and increased performance due to the use of in-memory operations. Spark operates on top of various cluster managers and is agnostic of the architecture of the underlying cluster. Compared to Hadoop MapReduce [35], Spark can be 10 to 100 times faster, mainly for workloads that repeatedly reuse data kept in memory. A central data structure in the Spark programming model is the Resilient Distributed Dataset (RDD) [36], which is a fault-tolerant collection of data partitioned across nodes. An RDD can be rebuilt automatically in case of data loss caused by a node failure. RDDs can be created in various ways, for instance by reading a text file from a distributed file system or by transforming an existing RDD. A typical transformation is the map operation, which applies the same function to each element of the RDD via a closure. Reductions of RDDs into single objects are also possible. For example, an RDD can be reduced to a number by summing up its elements, by counting its elements, etc.
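The map and reduce operations can be illustrated without a cluster by emulating a partitioned collection with plain Python lists, as in the sketch below. This is not Spark code: the helpers `rdd_map` and `rdd_reduce` are hypothetical names of our own, and the emulation ignores distribution and fault tolerance. As in Spark, the reduction operator is assumed to be associative and commutative, so that partial results can be combined in any order.

```python
from functools import reduce

def rdd_map(partitions, f):
    """Apply f to every element; each partition is processed independently."""
    return [[f(x) for x in part] for part in partitions]

def rdd_reduce(partitions, op):
    """Reduce each non-empty partition locally, then combine the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)

# A small collection split into three partitions (standing in for three nodes).
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

squares = rdd_map(partitions, lambda x: x * x)
sum_of_squares = rdd_reduce(squares, lambda a, b: a + b)
count = rdd_reduce(rdd_map(partitions, lambda x: 1), lambda a, b: a + b)
```

With PySpark, the corresponding calls would be `parallelize` on a SparkContext followed by `map` and `reduce` on the resulting RDD.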

Example: Ensemble Nyström. The Spark framework can be adopted for ensemble algorithms, such as the ensemble Nyström in Eq. (6.25) [28] (cf. Fig. 6.7). The training set is divided into n disjoint subsets, $D = D_{1} \cup D_{2} \cup \dots \cup D_{n}$ , for $n \in \mathbb{N}^{+}$ . In Fig. 6.7, each solid black node denotes the kernel subspace approximation in Algorithm 2. The framework is composed of two parts:

• map(): The map function takes a collection of training subsets $\{ D_{1}, D_{2}, \dots, D_{n} \}$ and applies Algorithm 2 to each element $D_{i}$ , for $i = 1, \dots, n$ . The output of each operation is the kernel matrix ${\hat{K}}_{i}$ .
• reduce(): The reduce function takes the collection of outputs $\{ {\hat{K}}_{1}, \dots, {\hat{K}}_{n} \}$ and applies a weighted sum (cf. Eq. (6.25)) to obtain the final result $\hat{K}$ .

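Since the map operations on the subsets are independent of each other, the computation can scale close to linearly with the number of nodes, as long as the communication overhead stays small. Moreover, one can enable the NVIDIA CUDA GPU on Spark to accelerate the computations.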
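The map and reduce steps of the ensemble can be sketched on a single machine as follows. Several assumptions are made here. Algorithm 2 is replaced by the standard Nyström approximation, where the points in $D_{i}$ serve as landmarks. Each expert approximates the full $N \times N$ kernel matrix. The weights of Eq. (6.25) are taken to be uniform unless given. The Gaussian kernel and the names `rbf_kernel`, `nystrom_expert` and `ensemble_nystrom` are our own choices, and Python's built-in `map` and `functools.reduce` stand in for the Spark operations.

```python
import numpy as np
from functools import reduce

def rbf_kernel(X, Z, gamma=0.5):
    """Gaussian kernel matrix with entries exp(-gamma * ||x_i - z_j||^2)."""
    sq = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Z**2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_expert(X, subset_idx, kernel=rbf_kernel):
    """map step: standard Nystrom approximation C W^+ C^T using the
    points X[subset_idx] as landmarks (assumed stand-in for Algorithm 2)."""
    landmarks = X[subset_idx]
    C = kernel(X, landmarks)          # N x m
    W = kernel(landmarks, landmarks)  # m x m
    return C @ np.linalg.pinv(W) @ C.T

def ensemble_nystrom(X, n_subsets, weights=None, seed=0):
    """Split the data into disjoint subsets, map each to an expert,
    and reduce the experts by a weighted sum."""
    rng = np.random.default_rng(seed)
    subsets = np.array_split(rng.permutation(X.shape[0]), n_subsets)
    if weights is None:
        # Assumption: uniform weights that sum to one.
        weights = np.full(n_subsets, 1.0 / n_subsets)
    experts = map(lambda idx: nystrom_expert(X, idx), subsets)  # map()
    weighted = (w * K_i for w, K_i in zip(weights, experts))
    return reduce(np.add, weighted)                             # reduce()
```

Because each expert depends only on the data and its own subset, the map step can be distributed over the nodes, and only the resulting matrices need to be communicated for the reduction.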

Figure 6.7 Ensemble Nyström (cf. Eq. (6.25)). A diagram illustrating how the ensemble Nyström can be implemented using the Spark map() and reduce() functions.

These frameworks and libraries are updated frequently with the development of new technologies around them. However, when the underlying principle is taken into account, this rapid evolution is not as overwhelming as it seems. Nevertheless, learning how to use these tools is an essential part of the algorithm design at hand.

6.6 Conclusion

Kernel techniques are powerful nonlinear feature extraction techniques that handle nonvectorial input sets with well-established theoretical foundations. Given their nonparametric nature, the model complexity of kernel techniques grows with the training size. High model complexity might lead to prohibitive computational complexity and memory usage. Kernel approximation techniques are typically applied to address this issue. In this chapter, we have mainly discussed two types of kernel approximation strategies, the Random Fourier Features and the Nyström subspace approximation. We have focused on the Nyström method due to its data-dependent nature and the ease with which it can be modified into an adaptive learning framework. Designs of the basic Nyström method and its extensions have been discussed and analyzed. Moreover, to fully take advantage of large-scale training sets, we have presented possibilities of using modern computing infrastructures and frameworks, e.g. GPUs with CUDA, the map-reduce framework, etc., to scale and speed up kernel methods. Furthermore, the theoretical properties of kernel methods can be beneficial in combination with other machine learning techniques such as deep learning [37–41], where kernel methods can be used as a preprocessing or a classification module, which can be implemented using popular APIs such as Torch [42] and TensorFlow [43]. The interested reader is referred to the example code from this chapter for further investigation.

Appendix 6.A

6.A.1 Centering

One preprocessing technique is to center the feature vectors as follows:

$\begin{matrix} k (x, z) & \leftarrow & \langle φ (x) - E (φ (x)), φ (z) - E (φ (z)) \rangle \\ & = & \langle φ (x), φ (z) \rangle - \langle E (φ (x)), φ (z) \rangle - \langle φ (x), E (φ (z)) \rangle + \langle E (φ (x)), E (φ (z)) \rangle, \end{matrix}$

where $E (φ (\cdot))$ can be estimated from the training data $\{ x_{1}, \dots, x_{N} \}$ . We have

$k (x, z) \leftarrow k (x, z) - \frac{1}{N} \sum_{i = 1}^{N} k (x_{i}, x) - \frac{1}{N} \sum_{i = 1}^{N} k (x_{i}, z) + \frac{1}{N^{2}} \sum_{i = 1}^{N} \sum_{j = 1}^{N} k (x_{i}, x_{j}) .$

(6.30)

After centering the feature vectors, the kernel function only depends on the relative position between $φ (x)$ and $φ (z)$ .
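Applied to the training kernel matrix, Eq. (6.30) amounts to subtracting the row and column means and adding back the overall mean. The following sketch assumes a symmetric matrix `K` with entries $k (x_{i}, x_{j})$ computed on the training set; the function name `center_kernel_matrix` is our own.

```python
import numpy as np

def center_kernel_matrix(K):
    """Center a symmetric N x N training kernel matrix following Eq. (6.30)."""
    row_mean = K.mean(axis=0, keepdims=True)  # (1/N) sum_i k(x_i, x_b), shape (1, N)
    col_mean = K.mean(axis=1, keepdims=True)  # (1/N) sum_j k(x_a, x_j), shape (N, 1)
    total_mean = K.mean()                     # (1/N^2) sum_i sum_j k(x_i, x_j)
    return K - row_mean - col_mean + total_mean

# Check with the linear kernel, where centering the kernel matrix is
# equivalent to centering the data before taking inner products.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))
K_centered = center_kernel_matrix(X @ X.T)
X_centered = X - X.mean(axis=0)
```

For the linear kernel, `K_centered` agrees with `X_centered @ X_centered.T` up to rounding, which gives a simple sanity check of the formula.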

6.A.2 Normalization

The implementation of normalization on the feature vector $φ (x)$ is as follows:

$k (x, z) \leftarrow \frac{k (x, z)}{\sqrt{k (x, x)} \sqrt{k (z, z)}} .$

(6.31)

With normalization, the kernel function essentially represents the cosine of the angle between two feature vectors. The effect of the absolute length of the vectors is removed, which leads to more robust results in many applications.
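On a training kernel matrix, Eq. (6.31) divides each entry by the square roots of the two corresponding diagonal entries. The sketch below assumes a kernel matrix `K` with strictly positive diagonal; the function name `normalize_kernel_matrix` and the polynomial kernel used in the example are our own choices.

```python
import numpy as np

def normalize_kernel_matrix(K):
    """Normalize a kernel matrix following Eq. (6.31):
    K[i, j] / (sqrt(K[i, i]) * sqrt(K[j, j]))."""
    diag_sqrt = np.sqrt(np.diag(K))
    return K / np.outer(diag_sqrt, diag_sqrt)

# Example with the polynomial kernel (x^T z + 1)^2 on synthetic data.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
K = (X @ X.T + 1.0) ** 2
K_normalized = normalize_kernel_matrix(K)
```

After normalization every diagonal entry equals one, so all feature vectors have unit length and the off-diagonal entries lie between -1 and 1.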
