Chapter 8. Kernel Models and Support Vector Machines

This chapter introduces kernel functions, binary support vector classifiers, one-class support vector machines for anomaly detection, and support vector regression.

In the Binomial classification section in Chapter 6, Regression and Regularization, you were introduced to the concept of hyperplanes used to segregate observations from the training set and to estimate the linear decision boundary. Logistic regression has at least one limitation: it requires that the datasets be linearly separable through a predefined function (the sigmoid). This limitation is especially problematic for high-dimensional problems (a large number of features that are highly nonlinearly dependent). Support vector machines (SVMs) overcome this limitation by estimating the optimal separating hyperplane using kernel functions.

In this chapter, we will cover the following topics:

  • The impact of some of the SVM configuration parameters and the kernel method on the accuracy of the classification
  • How to apply the binary support vector classifier to estimate the risk for a public company to curtail or eliminate its dividend
  • How to detect outliers with a one-class support vector classifier
  • How support vector regression compares to linear regression

Support vector machines are formulated as a convex optimization problem. The mathematical foundation of the related algorithms is described in this chapter, for reference.

Kernel functions

Every machine learning model introduced in this book so far assumes that observations are represented by a feature vector of a fixed size. However, some real-world applications such as text mining or genomics do not lend themselves to this restriction. The critical element of the process of classification is to define a similarity or distance between two observations. Kernel functions allow developers to compute the similarity between observations without the need to encode them in feature vectors [8:1].

An overview

The concept of kernel methods may seem a bit odd at first to a novice. Let's consider the example of the classification of proteins. Proteins have different lengths and compositions, but that does not prevent scientists from classifying them [8:2].

Note

Proteins

Proteins are polymers of amino acids joined together by peptide bonds. Each amino acid is built around a carbon atom bonded to a hydrogen atom, an amino group, a carboxyl group, and a side chain.

A protein is represented using a traditional molecular notation with which biochemists are familiar. Geneticists describe proteins in terms of a sequence of characters known as the protein sequence annotation. The sequence annotation encodes the structure and composition of the protein. The following image illustrates the molecular (left) and encoded (right) representations of a protein:

The sequence annotation of a protein

The classification and the clustering of a set of proteins require the definition of a similarity factor or distance used to evaluate and compare the proteins. For example, the similarity among three proteins can be defined as a normalized dot product of their sequence annotations:

The similarity between the sequence annotations of three proteins

You do not have to represent the entire sequence annotation of the proteins as a feature vector in order to establish that they belong to the same class. You only need to compare each element of each sequence, one by one, and compute the similarity. For the same reason, the estimation of the similarity does not require the two proteins to have the same length.

In this example, we do not have to assign a numerical value to each element of the annotation. Let's consider an element of a protein annotation as the pair of its character c and position p (for example, (K, 4)). The dot product of two protein annotations x and x' of respective lengths n and n' is defined as the number of identical elements (same character and position) in the two annotations divided by the maximum of the two lengths (M1):

sim(x, x') = card{ p | x(p) = x'(p) } / max(n, n')

The computation of the similarity for the three proteins produces the results sim(x, x') = 6/12 = 0.50, sim(x, x'') = 3/13 = 0.23, and sim(x', x'') = 4/13 = 0.31.
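To make the computation concrete, here is a minimal Scala sketch of the similarity (M1), assuming that annotations are represented as plain strings compared character by character; the simAnnotations function and the sample strings are illustrative and not taken from the figure.

// A minimal, hypothetical sketch of the annotation similarity (M1): the number of
// positions holding the same character, divided by the length of the longer annotation.
def simAnnotations(x: String, y: String): Double = {
  val matches = x.zip(y).count { case (cx, cy) => cx == cy } // identical (character, position) pairs
  matches.toDouble / Math.max(x.length, y.length)
}

// Illustrative invocation with made-up annotations
val s = simAnnotations("KLMNHQA", "KLANHQARS")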

Note also that the similarity of two identical annotations is 1.0, and the similarity of two completely different annotations is 0.0.

Note

The visualization of similarity

It is usually more convenient to use a radial representation to visualize the similarity between features, as in the example of the proteins' annotations. The distance d(x, x') = 1/sim(x, x') is visualized as the angle (or cosine) between two features. The cosine metric is commonly used in text mining.
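For reference, the cosine similarity mentioned in the note can be sketched in a few lines of Scala; the cosine function below is illustrative and not part of the book's library:

// Illustrative cosine similarity between two feature vectors
def cosine(x: Vector[Double], y: Vector[Double]): Double = {
  val dot = x.zip(y).map { case (a, b) => a * b }.sum               // inner product
  val norm = Math.sqrt(x.map(a => a * a).sum) * Math.sqrt(y.map(b => b * b).sum)
  if (norm > 0.0) dot / norm else 0.0                               // guard against zero vectors
}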

In this example, the similarity is known as a kernel function in the space of the sequence annotation of proteins.

Common discriminative kernels

Although the measure of similarity is very useful for understanding the concept of a kernel function, kernels have a broader definition. A kernel K(x, x') is a symmetric, positive semi-definite real-valued function that takes two feature vectors as arguments. There are many different types of kernel functions, among which the most common are as follows:

  • The linear kernel (dot product): This is useful in the case of very high-dimensional data where problems can be expressed as a linear combination of the original features.
  • The polynomial kernel: This extends the linear kernel for a combination of features that are not completely linear.
  • The radial basis function (RBF): This is the most commonly applied kernel. It is used where the labeled or target data is noisy and requires some level of regularization.
  • The sigmoid kernel: This is used in conjunction with neural networks.
  • The Laplacian kernel: This is a variant of RBF with a higher regularization impact on training data.
  • The log kernel: This is used in image processing.

Note

The RBF terminology

In this presentation and the library used in its implementation, the radial basis function is a synonym for the Gaussian kernel function. However, RBF also refers to the family of exponential kernel functions that encompasses the Gaussian, Laplacian, and exponential kernels.

The simple linear model for regression consists of the dot product of the regression parameters (weights) and the input data (refer to the Ordinary least squares regression section in Chapter 6, Regression and Regularization).

The model is, in fact, a linear combination of the weights applied to the inputs. The concept can be extended by defining a general regression model as a linear combination of nonlinear functions, known as basis functions (M2):

f(x, w) = w0 + w1 φ1(x) + w2 φ2(x) + … + wn φn(x)

The most commonly used basis functions are the power and Gaussian functions. The kernel function is described as the dot product of the basis functions, φ(x).φ(x'), of two feature vectors x and x'. A partial list of kernel methods is as follows:

Note

M3: The generic kernel function is defined as:

K(x, x') = φ(x).φ(x')

M4: The linear kernel is defined as:

K(x, x') = x.x'

M5: The polynomial kernel with the slope γ, degree n, and constant c is defined as:

K(x, x') = (γ x.x' + c)^n

M6: The sigmoid kernel with the slope γ and constant c is defined as:

K(x, x') = tanh(γ x.x' + c)

M7: The radial basis function kernel with the slope γ is defined as:

K(x, x') = exp(-γ ||x - x'||²)

M8: The Laplacian kernel with the slope γ is defined as:

K(x, x') = exp(-γ ||x - x'||)

M9: The log kernel with the degree n is defined as:

K(x, x') = -log(||x - x'||^n + 1)
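As an illustration only, a few of these kernels can be written directly as Scala functions over feature vectors; the dot, polynomialKernel, and rbfKernel names below are hypothetical helpers, not part of the book's library or of LIBSVM:

// Illustrative implementations of the linear (M4), polynomial (M5), and RBF (M7) kernels
def dot(x: Vector[Double], y: Vector[Double]): Double =
  x.zip(y).map { case (a, b) => a * b }.sum

def linearKernel(x: Vector[Double], y: Vector[Double]): Double = dot(x, y)

def polynomialKernel(gamma: Double, c: Double, n: Int)(
    x: Vector[Double], y: Vector[Double]): Double =
  Math.pow(gamma * dot(x, y) + c, n)

def rbfKernel(gamma: Double)(x: Vector[Double], y: Vector[Double]): Double = {
  val diff2 = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum // squared Euclidean distance
  Math.exp(-gamma * diff2)
}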

The list of discriminative kernel functions described earlier is just a subset of the kernel methods' universe. The other types of kernels include the following:

  • Probabilistic kernels: These are kernels derived from generative models. Probabilistic models such as Gaussian processes can be used as a kernel function [8:3].
  • Smoothing kernels: This is the nonparametric formulation, averaging density with the nearest neighbor observations [8:4].
  • Reproducing kernel Hilbert spaces: This is the dot product of a finite or infinite set of basis functions [8:5].

The kernel functions play a very important role in support vector machines for nonlinear problems.

Kernel monadic composition

The concept of a kernel function is actually derived from differential geometry, and more specifically from the notion of a manifold, which was introduced in the Non-linear models section under Dimension reduction in Chapter 4, Unsupervised Learning.

A manifold is a low-dimensional feature space embedded in an observation space of higher dimension. The dot (or inner) product of two observations, known as the Riemann metric, is computed on a Euclidean tangent space.

Note

The heat kernel function

The kernel function on a manifold is actually computed by solving the heat equation that uses the Laplace-Beltrami operator. The heat kernel is the solution of the heat differential equation. It associates the dot product with an exponential map.

The kernel function is the composition of the dot product on the tangent space projected on the manifold using an exponential map, as shown in the following diagram:

The visualization of a manifold, Riemann metric, and projection of an inner product

A kernel function is the composition g o h of two functions:

  • A function h that implements the Riemann metric or similarity between two vectors v and w
  • A function g that implements the projection of the similarity h(v, w) to the manifold (exponential map)

The KF class implements the kernel function as a composition of the functions g and h:

type F1 = Double => Double
type F2 = (Double, Double) => Double
type DblVector = Vector[Double]

case class KF[G](g: G, h: F2) {
  // Apply the similarity h to each pair of elements, sum the results, and
  // project the sum through g (viewed as F1 via the implicit conversion gf)
  def metric(v: DblVector, w: DblVector)(implicit gf: G => F1): Double = //1
    g(v.zip(w).map { case (_v, _w) => h(_v, _w) }.sum) //2
}

The KF class is parameterized with a type G that can be converted to Function1[Double, Double] (the F1 alias). Therefore, the computation of the metric (dot product) requires an implicit conversion from G to F1 (line 1). The metric is computed by zipping the two vectors, mapping the h similarity function over the pairs of elements, summing the resulting values, and applying the projection g to the sum (line 2).
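As a quick, hypothetical usage example of the KF class as defined above: with an identity projection g and a product similarity h, the metric reduces to the plain dot product of the two vectors (the implicit view from F1 to F1 is resolved automatically):

// Hypothetical KF instance: identity projection g, product similarity h
val dotProductKernel = KF[F1]((x: Double) => x, (x: Double, y: Double) => x * y)
val d = dotProductKernel.metric(Vector(1.0, 2.0), Vector(3.0, 4.0)) // 1*3 + 2*4 = 11.0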

Let's define the monadic composition for the KF class:

val kfMonad = new _Monad[KF] {
  override def map[G,H](kf: KF[G])(f: G =>H): KF[H] = 
     KF[H](f(kf.g), kf.h) //3
  override def flatMap[G,H](kf: KF[G])(f: G =>KF[H]): KF[H] =
     KF[H](f(kf.g).g, kf.h)
}

The creation of the kfMonad instance overrides the map and flatMap methods defined in the generic _Monad trait, as described in the Monads section in Chapter 1, Getting Started. The implementation of the unit method is not essential to the monadic composition and it is, therefore, omitted.

The function argument of the map and flatMap methods applies only to the exponential map function g (line 3). The composition of two kernel functions kf1 = g1 o h and kf2 = g2 o h produces a kernel function kf3 = g2 o (g1 o h) = (g2 o g1) o h = g3 o h.

Note

Interpretation of kernel functions' monadic composition

The visualization of the monadic composition of kernel functions on the manifold is quite intuitive. The composition of two kernel functions consists of composing their respective projections or exponential map functions g. The function g is directly related to the curvature of the manifold around the data point for which the metric is computed. The monadic composition of the kernel functions attempts to adjust the exponential map to fit the curvature of the manifold.

The next step is to define an implicit class to convert a kernel function of the KF type to its monadic representation so that it can access the map and flatMap methods (line 4):

implicit class kF2Monad[G](kf: KF[G]) { //4
  def map[H](f: G => H): KF[H] = kfMonad.map(kf)(f)
  def flatMap[H](f: G => KF[H]): KF[H] = kfMonad.flatMap(kf)(f)
}

Let's implement the radial basis function, RBF, and the polynomial kernel function, Polynomial, by defining their respective g and h functions. The type parameter of these kernel functions is simply Function1[Double, Double] (the F1 alias):

class RBF(s2: Double) extends KF[F1](
  (x: Double) => Math.exp(-0.5 * x * x / s2),  // g: Gaussian projection
  (x: Double, y: Double) => x - y)             // h: difference between elements

class Polynomial(d: Int) extends KF[F1](
  (x: Double) => Math.pow(1.0 + x, d),         // g: polynomial projection
  (x: Double, y: Double) => x * y)             // h: product of elements

Here is an example of the composition of two kernel functions: an RBF kernel kf1 with a standard deviation of 0.6 (line 5) and a polynomial kernel kf2 of degree 3 (line 6):

val v = Vector[Double](0.5, 0.2, 0.3)
val w = Vector[Double](0.1, 0.7, 0.2)
val composed = for {
  kf1 <- new RBF(0.6)  //5
  kf2 <- new Polynomial(3)  //6
} yield kf2
composed.metric(v, w) //7

Finally, the metric is computed on the composed kernel functions (line 7).

Note

Kernel functions in SVM

Our implementation of the support vector machine uses the kernel function included in the LIBSVM library.
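For reference, kernel selection in LIBSVM's Java API is typically done through the svm_parameter class; the following sketch and its parameter values are illustrative and do not reproduce the book's SVM configuration classes:

import libsvm.svm_parameter

// Illustrative LIBSVM configuration selecting the RBF kernel exp(-gamma*||x - x'||^2)
val param = new svm_parameter
param.svm_type = svm_parameter.C_SVC    // binary (C-support vector) classification
param.kernel_type = svm_parameter.RBF   // alternatives include LINEAR, POLY, and SIGMOID
param.gamma = 0.5                       // kernel slope (assumed value)
param.C = 1.0                           // margin/penalty trade-off (assumed value)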
