8 Support vector machine algorithms

This chapter covers the following items:

Algorithm for support vectors

Algorithms for linear and nonlinear classification

Examples and applications

Support vector machines (SVMs) are used as a method for classifying linear and nonlinear data. This chapter discusses how SVMs work. An SVM uses a nonlinear mapping to transform the original training data into a higher dimension, in which it searches for the linear optimal separating hyperplane (a "decision boundary" splitting the datasets of one class from another). If an appropriate nonlinear mapping to a sufficiently high dimension is used, data from two classes can always be separated by a hyperplane [1–5]. The SVM finds this hyperplane and its margins using support vectors ("essential" training datasets); the margins are defined by the support vectors.

Vladimir Vapnik and his colleagues Bernhard Boser and Isabelle Guyon published the first paper on SVMs in 1992. In fact, the groundwork for SVMs has been around since the 1960s, much of it laid by Vapnik and his colleagues, and SVMs have recently become an appealing option. The training time of even the fastest SVMs can be very slow, which may be counted as a drawback; their accuracy, however, is their superior feature, owing to their ability to model complex nonlinear decision boundaries. In addition, they are less inclined to overfitting than other methods, and they provide a compact description of the learned model [5]. SVMs can be used both for numeric prediction and for classification. They have been applied to fields such as speaker identification, object recognition and handwritten digit recognition, and they are also used for benchmark time-series prediction tests.
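
To make the classification and numeric-prediction uses concrete, here is a minimal sketch using the scikit-learn library on synthetic data; the data and parameter choices are illustrative assumptions, not taken from this chapter:

# A minimal sketch (not from the original text) showing how an SVM can be used
# both for classification (SVC) and for numeric prediction (SVR) with scikit-learn.
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.RandomState(0)

# Classification: two synthetic classes labeled +1 and -1
X_cls = rng.randn(40, 2)
y_cls = np.where(X_cls[:, 0] + X_cls[:, 1] > 0, 1, -1)
clf = SVC(kernel="linear").fit(X_cls, y_cls)
print("predicted class:", clf.predict([[0.5, 0.5]]))

# Numeric prediction (regression) on a simple 1D function
X_reg = np.linspace(0, 1, 40).reshape(-1, 1)
y_reg = np.sin(2 * np.pi * X_reg).ravel()
reg = SVR(kernel="rbf").fit(X_reg, y_reg)
print("predicted value:", reg.predict([[0.25]]))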

8.1 The case with data being linearly separable

SVMs pose something of a mystery. We can start solving it with a simple case: a problem with two classes that are linearly separable [6–10]. Let the dataset D be given as (X1, y1), (X2, y2), ..., (X|D|, y|D|), where the Xi are the training datasets with associated class labels yi. Each yi can take one of two values, +1 or −1 (i.e., yi ∈ {+1, −1}).

Figure 8.1: The 2D training data are linearly separable. There are an infinite number of possible separating hyperplanes (decision boundaries), some of which are shown with dashed lines. Which one is the best? [11].

This corresponds to the classes patient = no and healthy = yes shown in Table 2.19, respectively. With the two input attributes A1 and A2, as given in Figure 8.1, the 2D data are linearly separable: a straight line can be drawn to separate all the datasets of class +1 from all the datasets of class −1.

There are an infinite number of separating lines that could be drawn. The "best" one is the one expected to have the minimum classification error on previously unseen datasets. For data with three attributes, one would look for the best separating plane, and for n dimensions, for the best hyperplane. In this book, hyperplane refers to the decision boundary being sought, irrespective of the number of input attributes. How, then, can we find the best hyperplane? For this purpose, the SVM approach searches for the maximum marginal hyperplane (MMH) [11]. Figure 8.2 presents two possible separating hyperplanes and their associated margins. Before defining margins formally, we can take an intuitive glance at this figure. Both hyperplanes classify all the given datasets correctly. However, it is intuitively expected that the hyperplane with the larger margin will classify future datasets more accurately than the hyperplane with the smaller margin. For this reason, during the learning or training phase, the SVM searches for the hyperplane with the largest margin, that is, the MMH. The associated margin gives the largest separation between the classes.
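
As an illustration of this MMH search, the following sketch fits a linear SVM on a small, linearly separable toy dataset with scikit-learn; the data points, the large C value (to approximate a hard margin) and the printed quantities are assumptions made for this example only:

# A hedged sketch (not part of the original text): fitting a linear SVM on a toy
# 2D, linearly separable problem and inspecting the maximum marginal hyperplane (MMH).
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters standing in for class +1 and class -1
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],    # class +1
              [0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]])  # class -1
y = np.array([1, 1, 1, -1, -1, -1])

svm = SVC(kernel="linear", C=1e6)  # large C approximates the hard-margin case
svm.fit(X, y)

# W and b define the separating hyperplane W . X + b = 0
print("W:", svm.coef_[0], "b:", svm.intercept_[0])
print("support vectors:\n", svm.support_vectors_)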

Figure 8.3 presents the flowchart steps for SVM algorithm.

According to the binary (linear) support vector machine algorithm in Figure 8.4, the following can be stated:

Figure 8.2: These figures show two possible separating hyperplanes along with their associated margins. Which one is better? Is it the one with the larger margin (b), which should have greater generalization accuracy? [11].
Figure 8.3: Linear SVM algorithm flowchart.
Figure 8.4: General binary (linear) support vector machine algorithm.

Steps (1–6) The shortest distance from a hyperplane to one side of its margin is equal to the shortest distance from the hyperplane to the other side of its margin, the "sides" of the margin being parallel to the hyperplane. For the MMH, this distance is the shortest distance from the MMH to the closest training dataset of either class.

A separating hyperplane can be written as:

W · X + b = 0    (8.1)

In this notation, W is a weight vector, namely W = {w1, w2, ..., wn}, where n is the number of attributes, and b is a scalar (often called a bias) [11–13]. Considering the input attributes A1 and A2, as in Figure 8.2(b), the training datasets are 2D (e.g., X = (x1, x2)), where x1 and x2 are the values of attributes A1 and A2, respectively, for X. If b is considered an additional weight, w0, eq. (8.1) can be written as follows:

w0 + w1x1 + w2x2 = 0    (8.2)

Based on this, any point above the separating hyperplane fulfills the following criteria:

w0 + w1x1 + w2x2 > 0    (8.3)

Correspondingly, any point below the separating hyperplane fulfills the following criteria:

w0 + w1x1 + w2x2 < 0    (8.4)

The weights can be adjusted so that the hyperplanes defining the sides of the margin can be written as follows:

H1: w0 + w1x1 + w2x2 ≥ 1 for yi = +1    (8.5)

H2: w0 + w1x1 + w2x2 ≤ −1 for yi = −1    (8.6)

Any dataset that lies on or above H1 belongs to class +1, and any dataset that lies on or below H2 belongs to class −1. The two inequalities of eqs. (8.5) and (8.6) can be combined to obtain:

yi(w0 + w1x1 + w2x2) ≥ 1, ∀i    (8.7)

All training datasets that lie on hyperplanes H1 or H2, the "sides" defining the margin, satisfy eq. (8.7) with equality. They are termed support vectors, and they are equally close to the separating MMH [14]. Figure 8.5 presents the support vectors with a bolder (thicker) border. Support vectors are essentially the datasets that are hardest to classify and that yield the most information for the classification process.

It is possible to obtain a formula for the size of the maximal margin. The distance from the separating hyperplane to any point on H1 is 1/‖W‖, where ‖W‖ is the Euclidean norm of W, that is, √(W · W). If W = {w1, w2, ..., wn}, then √(W · W) = √(w1² + w2² + ... + wn²). This is equal to the distance from any point on H2 to the separating hyperplane, so the maximal margin is 2/‖W‖ (Figure 8.5).

Figure 8.5: Support vectors: the SVM finds the maximum separating hyperplane, that is, the one with the maximum distance from the nearest training datasets. The support vectors are shown with a thicker border [11].
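
Continuing the earlier sketch (and assuming the fitted linear model `svm` from it), the maximal margin 2/‖W‖ can be computed directly from the learned weight vector:

# A small follow-on sketch (assumption: `svm` is the linear SVC fitted above):
# the margin width equals 2 / ||W|| for a hard-margin linear SVM.
import numpy as np

w = svm.coef_[0]
margin = 2.0 / np.linalg.norm(w)
print("maximal margin 2/||W|| =", margin)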

Steps (7–15) Following these details, you may ask how an SVM finds the MMH and the support vectors. For this purpose, eq. (8.7) can be rewritten using some mathematical tricks, and the problem becomes what is known in the field as a constrained (convex) quadratic optimization problem. Readers who would like to investigate further can look into the Lagrangian formulation and solutions employing the Karush–Kuhn–Tucker (KKT) conditions, which are beyond the scope of this book [11–14]. If the data are small, with fewer than 2,000 training datasets, any optimization software package for solving constrained convex quadratic problems can be used to find the support vectors and the MMH. For larger data, more efficient and specialized algorithms for training SVMs are needed. Once the support vectors and the MMH are found, a trained SVM is obtained. The MMH is a linear class boundary, so the corresponding SVM can be used to classify data that are linearly separable; such a trained SVM is termed a linear SVM. Based on the Lagrangian formulation, the MMH can be rewritten as the decision boundary:

d(XT) = ∑i=1..l yi ai Xi · XT + b0    (8.8)

Here, yi is the class label of support vector Xi; XT is a test dataset; ai and b0 are numeric parameters that are determined automatically by the optimization or SVM algorithm; and l is the number of support vectors.

For a test dataset XT, you insert it into eq. (8.8) and check the sign of the result, which tells you on which side of the hyperplane the test dataset falls. If the sign is positive, XT falls on or above the MMH and the SVM predicts that XT belongs to class +1 (representing patient = no, in our current case). If the sign is negative, XT falls on or below the MMH and the class prediction is −1 (representing healthy = yes).
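
Here is a hedged sketch of eq. (8.8) in code: using the support vectors, dual coefficients (stored by scikit-learn as the products yi·ai) and intercept of the previously fitted linear model `svm` (an assumption carried over from the earlier example), the decision value d(XT) is computed by hand and compared with the library's decision function:

# A sketch, not part of the original text: eq. (8.8) evaluated by hand,
# d(XT) = sum_i yi ai (Xi . XT) + b0, using the quantities exposed by scikit-learn.
import numpy as np

XT = np.array([1.5, 1.5])                 # a hypothetical test dataset
sv = svm.support_vectors_                 # the support vectors Xi
yi_ai = svm.dual_coef_[0]                 # scikit-learn stores yi * ai together
b0 = svm.intercept_[0]

d = np.sum(yi_ai * (sv @ XT)) + b0
print("d(XT) =", d, "-> predicted class:", 1 if d >= 0 else -1)
print("matches decision_function:", np.isclose(d, svm.decision_function([XT])[0]))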

We should note two significant issues at this point. First, the complexity of the learned classifier is characterized by the number of support vectors rather than by the dimensionality of the data. It is for this reason that SVMs are less prone to overfitting than some other methods. The support vectors are the critical training datasets, lying closest to the decision boundary (MMH). If all the other training datasets were removed and training were repeated, the same separating hyperplane would be found. The number of support vectors found can also be used to compute an upper bound on the expected error rate of the SVM classifier; this bound is independent of the data dimensionality. An SVM with a small number of support vectors can have good generalization even when the dimensionality of the data is high.

8.2 The case when the data are linearly inseparable

The previous section dealt with linear SVMs for the classification of linearly separable data. In some instances, however, we face data that are not linearly separable (as can be seen in Figure 8.6). Since no straight line can be found that separates the classes, linear SVMs cannot offer an applicable solution. There is a solution, though: linear SVMs can be extended to create nonlinear SVMs for the classification of linearly inseparable data (also called nonlinearly separable data, or nonlinear data). Such SVMs are able to find nonlinear decision boundaries (nonlinear hypersurfaces) in the input space. How is the linear approach extended? Extending the approach for linear SVMs yields a nonlinear SVM [13–15]. This involves two basic steps. In the first step, the original input data are transformed into a higher dimensional space using a nonlinear mapping. In the second step, once the data have been transformed into the new higher space, a search for a linear separating hyperplane is carried out in the new space. The result is again a quadratic optimization problem, which is solved using the linear SVM formulation. The maximal marginal hyperplane found in the new space corresponds to a nonlinear separating hypersurface in the original space.

Figure 8.6: A simple 2D case showing data that are linearly inseparable. Unlike the linearly separable data of Figure 8.1, it is not possible to draw a straight line separating the classes. Instead, the decision boundary is nonlinear [11].

The algorithmic flow in nonlinear SVM algorithm is presented in Figure 8.7.

Figure 8.7: General binary (nonlinear) support vector machine algorithm.

Steps (1–5) As can be seen in Figure 8.7, this approach is not without problems. The first problem concerns how the nonlinear mapping to a higher dimensional space is chosen. The second is that the computation involved is costly. Consider eq. (8.8) for the classification of a test dataset XT: the dot product must be calculated with each of the support vectors. The dot product of two vectors XT = (x1T, x2T, ..., xnT) and Xi = (xi1, xi2, ..., xin) is x1T xi1 + x2T xi2 + ... + xnT xin. During training, a similar dot product must be calculated many times in order to find the MMH, which ends up being expensive. The required dot product computation is therefore heavy and costly. Fortunately, another mathematical trick helps deal with this situation.

Steps (6–15) It turns out that, in solving the quadratic optimization problem for the linear SVM (i.e., when searching for a linear SVM in the new higher dimensional space), the training datasets appear only in the form of dot products ϕ(Xi) · ϕ(Xj), where ϕ(X) is simply the nonlinear mapping function applied to transform the training datasets [16–18]. Rather than computing the dot product on the transformed datasets, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original input data:

K(Xi, Xj) = ϕ(Xi) · ϕ(Xj)    (8.9)

In every place where ϕ(Xi) · ϕ(Xj) appears in the training algorithm, it can be replaced by K(Xi, Xj). Thus, all computations are carried out in the original input space, which potentially has a significantly lower dimensionality. The mapping can therefore be safely avoided; it is not even necessary to know what the mapping is. After applying this trick, one proceeds to find a maximal separating hyperplane. The procedure resembles the one described in Section 8.1, but it involves placing a user-specified upper bound, C, on the Lagrange multipliers, ai. This upper bound is best determined experimentally. Another relevant question is which kernel functions can be used to replace the dot product. Three admissible kernel functions are given below (a short code sketch illustrating them and eq. (8.9) follows the list) [17–19]:

Polynomial kernel of degree h: K(Xi, Xj) = (Xi · Xj + 1)^h

Gaussian radial basis function kernel: K(Xi, Xj) = e^(−‖Xi − Xj‖² / 2σ²)

Sigmoid kernel: K(Xi, Xj) = tanh(κ Xi · Xj − δ)
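
The sketch below (an illustration, not part of the original text) codes the three kernels above and numerically checks eq. (8.9) for the degree-2 polynomial kernel by comparing K(Xi, Xj) with the dot product ϕ(Xi) · ϕ(Xj) of explicitly mapped vectors; the parameter values are arbitrary assumptions:

# A hedged sketch of the three admissible kernels and a check of eq. (8.9).
import numpy as np

def poly_kernel(xi, xj, h=2):
    return (xi @ xj + 1.0) ** h

def rbf_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.linalg.norm(xi - xj) ** 2 / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, kappa=1.0, delta=0.0):
    return np.tanh(kappa * (xi @ xj) - delta)

def phi_degree2(x):
    # Explicit mapping for (x . z + 1)^2 in 2D: [1, sqrt(2)x1, sqrt(2)x2, sqrt(2)x1x2, x1^2, x2^2]
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

xi, xj = np.array([1.0, 2.0]), np.array([-1.0, 0.5])
print(poly_kernel(xi, xj), phi_degree2(xi) @ phi_degree2(xj))  # both values agree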

Each of these three kernel functions results in a different nonlinear classifier in the original input space. Readers familiar with neural networks will note that the resulting decision hyperplanes found for nonlinear SVMs are of the same type as those found by well-known neural network classifiers. For example, an SVM with a Gaussian radial basis function (RBF) kernel gives the same decision hyperplane as a type of neural network known as an RBF network. An SVM with a sigmoid kernel is comparable to a simple two-layer neural network known as a multilayer perceptron with no hidden layers. There are no unique rules for determining which admissible kernel will result in the most accurate SVM. In practice, the particular kernel chosen does not usually make a substantial difference in the resulting accuracy [18, 19]. SVM training always finds a global solution (as opposed to neural network methods such as backpropagation).

Up to this point, linear and nonlinear SVMs have been presented for binary (two-class) classification. SVM classifiers can also be combined for the multi-class case. An important aim of SVM research is to improve the speed of the training and testing procedures so that SVMs become a more applicable and viable choice for very large datasets (e.g., millions of support vectors). Other concerns include determining the best kernel for a given dataset and exploring more efficient methods for multi-class cases.

Example 8.1 shows a nonlinear transformation of original data into a higher dimensional space. Consider a 3D input vector X = (x1, x2, x3) that is mapped into a 6D space, Z, using the mappings φ1(X) = x1, φ2(X) = x2, φ3(X) = x3, φ4(X) = (x1)², φ5(X) = x1x2 and φ6(X) = x1x3. A decision hyperplane in the new space is d(Z) = W·Z + b, where W and Z are vectors and the hyperplane is linear. W and b are solved for and then substituted back, so that the linear decision hyperplane in the new (Z) space corresponds to a nonlinear second-order polynomial in the original 3D input space:

d(Z) = w1x1 + w2x2 + w3x3 + w4(x1)² + w5x1x2 + w6x1x3 + b = w1z1 + w2z2 + w3z3 + w4z4 + w5z5 + w6z6 + b
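
A minimal sketch of Example 8.1, with hypothetical weights and input values chosen only for illustration: the decision function is linear in the mapped vector Z but quadratic in the original 3D input X:

# A sketch, not part of the original text: mapping a 3D input into the 6D Z space
# of Example 8.1 and evaluating the (linear in Z) decision function.
import numpy as np

def phi(x):
    x1, x2, x3 = x
    return np.array([x1, x2, x3, x1 ** 2, x1 * x2, x1 * x3])

# Hypothetical weights and bias for d(Z) = W . Z + b
W = np.array([0.5, -1.0, 2.0, 0.3, -0.7, 1.2])
b = 0.1

X = np.array([1.0, 2.0, -1.0])
Z = phi(X)
print("d(Z) =", W @ Z + b)   # linear in Z, nonlinear (quadratic) in X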

8.3 SVM algorithm for the analysis of various data

The SVM algorithm has been applied to the economy (U.N.I.S.) dataset, the multiple sclerosis (MS) dataset and the WAIS-R dataset in Sections 8.3.1, 8.3.2 and 8.3.3, respectively.

8.3.1 SVM algorithm for the analysis of economy (U.N.I.S.) data

As the second set of data, we use in the following sections data related to the economies of the U.N.I.S. countries. The attributes of these economies are data regarding years, unemployment, youth male (% of male labor force ages 15–24) (national estimate), gross value added at factor cost ($), tax revenue and gross domestic product (GDP) growth (annual %). The economy (U.N.I.S.) dataset is made up of a total of 18 attributes. Data belonging to the economies of USA, New Zealand, Italy and Sweden from 1960 to 2015 are defined on the basis of the attributes given in Table 2.8, Economy (U.N.I.S.) Dataset (http://data.worldbank.org) [15], and are used in the following sections for the classification of the D matrix through the SVM algorithm. As the first step, the training procedure is employed: 66.66% of the D matrix is split off as the training dataset (151 × 18) and 33.33% as the test dataset (77 × 18).
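
A hedged sketch of this split-and-train procedure with scikit-learn follows; the file name, the column layout and the kernel settings are assumptions for illustration, not details given in the text:

# Splitting the economy (U.N.I.S.) data matrix D into roughly 66.66% training and
# 33.33% test portions and training a multi-class SVM (a sketch under assumptions).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical placeholders: attributes in the leading columns, class label in the last
D = np.loadtxt("economy_unis.csv", delimiter=",")      # assumed file and layout
X, labels = D[:, :-1], D[:, -1]

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3333, random_state=0)

clf = SVC(kernel="poly", degree=2)   # polynomial kernel, as used later in this section
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))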

Once the training dataset has been trained with the SVM algorithm, we can classify the test dataset. Although the procedural steps of the algorithm may seem complicated at first glance, all you have to do is concentrate on the steps and grasp them. For this, let us have a closer look at the steps provided in Figure 8.8.

Figure 8.8: SVM algorithm for the analysis of the economy (U.N.I.S.) Dataset.

In order to perform classification with the multi-class SVM algorithm, 33.3% of the economy (U.N.I.S.) dataset is randomly chosen as test data.

The economy (U.N.I.S.) dataset has four classes. Let us analyze Example 8.2 in order to understand how the classifiers are found for the economy dataset trained with the multi-class SVM algorithm.

Example 8.2

USA: X1 = (1, 1); X2 = (−1, −1)

New Zealand: X3 = (1, −1); X4 = (−1, 1)

Since y = (+1, +1, −1, −1), let us find the classifier of this sample problem.

The aspects regarding the vectors presented in Example 8.2 are defined as in Figure 8.8.

Let us rewrite the n = 4 relation in eq. (8.8):

Steps (1–5)

L(a) = ∑i=1..n ai − (1/2) aT H a, where Hij = yi yj K(Xi, Xj). Since the kernel is the second-degree polynomial K(Xi, Xj) = (Xi · Xj + 1)², the matrix can be calculated as follows:

H11 = y1y1(1 + X1T X1)² = (1)(1)(1 + [1 1][1 1]T)² = 9
H12 = y1y2(1 + X1T X2)² = (1)(1)(1 + [1 1][−1 −1]T)² = 1

If all the Hij values are calculated in this way, the following matrix is obtained:

H =
[  9   1  −1  −1
   1   9  −1  −1
  −1  −1   9   1
  −1  −1   1   9 ]

Steps (8–17) At this stage, in order to obtain the a values, it is necessary to solve 1 − Ha = 0, that is, Ha = 1 with the H matrix above. The solution gives a1 = a2 = a3 = a4 = 0.125, which means that all of the samples are accepted as support vectors. These results fulfill the condition ∑i=1..4 ai yi = a1 + a2 − a3 − a4 = 0.
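
The following short sketch (not part of the original text) reproduces this computation numerically: it builds H from the four points with the degree-2 polynomial kernel and solves Ha = 1:

# A numerical check of Example 8.2.
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)

K = (X @ X.T + 1) ** 2          # K(Xi, Xj) = (Xi . Xj + 1)^2
H = np.outer(y, y) * K          # Hij = yi yj K(Xi, Xj)
a = np.linalg.solve(H, np.ones(4))
print(H)
print(a)                                   # all entries come out as 0.125
print("constraint sum(ai*yi) =", a @ y)    # 0, as required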

The ϕ(x) corresponding to the second-degree kernel K(Xi, Xj) = (1 + XiT Xj)² is, in line with eq. (8.9), Z = ϕ(x) = [1, √2·x1, √2·x2, √2·x1x2, x1², x2²]T.

In this case, in order to find the w weight vector, the calculation is performed as follows: w = ∑i=1..4 ai yi ϕ(Xi)

For the economies of USA and New Zealand, with X1 = (1, 1); X2 = (−1, −1); X3 = (1, −1); X4 = (−1, 1) and y = (+1, +1, −1, −1), the w vector is calculated as follows:

w = (0.125){(+1)[1, √2, √2, √2, 1, 1]T + (+1)[1, −√2, −√2, √2, 1, 1]T + (−1)[1, √2, −√2, −√2, 1, 1]T + (−1)[1, −√2, √2, −√2, 1, 1]T} = (0.125)[0, 0, 0, 4√2, 0, 0]T = [0, 0, 0, √2/2, 0, 0]T

Step (6)

Such being the case, the classifier is obtained as follows:

d(X) = [0, 0, 0, √2/2, 0, 0][1, √2·x1, √2·x2, √2·x1x2, x1², x2²]T = x1x2
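
Continuing the sketch above (and assuming X, y and a from it), the weight vector w can be formed from the explicit degree-2 mapping, confirming that the classifier reduces to x1·x2:

# A follow-on sketch: w = sum_i ai yi phi(Xi) with the explicit degree-2 mapping.
import numpy as np

def phi(x):
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

w = sum(a_i * y_i * phi(x_i) for a_i, y_i, x_i in zip(a, y, X))
print(w)                                   # ~[0, 0, 0, 0.7071, 0, 0], i.e., sqrt(2)/2
test_point = np.array([0.4, -1.5])
print(w @ phi(test_point), test_point[0] * test_point[1])   # both ~ -0.6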

A test accuracy of 95.2% has been obtained in the classification of the four-class economy (U.N.I.S.) dataset through the SVM polynomial kernel algorithm.

8.3.2 SVM algorithm for the analysis of multiple sclerosis

As presented in Table 2.12, the multiple sclerosis (MS) dataset contains data from the following groups: 76 samples belonging to RRMS, 76 samples to SPMS, 76 samples to PPMS and 76 samples belonging to healthy subjects of the control group (see Chapter 2, Table 2.11). The attributes are data regarding brain stem (MRI 1), corpus callosum–periventricular (MRI 2) and upper cervical (MRI 3) lesion diameter size (millimetric [mm]) in the MR image, together with the EDSS score. The data are made up of a total of 112 attributes. For each of the 304 individuals it is known whether they belong to an MS subgroup or to the healthy group. How can we classify which patient belongs to which subgroup of MS, including healthy individuals and those diagnosed with MS, based on the lesion diameters (MRI 1, MRI 2, MRI 3), the number of lesions for (MRI 1, MRI 2, MRI 3) as obtained from the MRI images, and the EDSS scores? The D matrix has a dimension of (304 × 112), which means that it includes the MS dataset of 304 individuals along with their 112 attributes (see Table 2.12 for the MS dataset). For the classification of the D matrix through the SVM algorithm, the first-step training procedure is employed: 66.66% of the D matrix is split off as the training dataset (203 × 112) and 33.33% as the test dataset (101 × 112).

Once the training dataset has been trained with the SVM algorithm, we can classify the test dataset.

A 33.3% portion of the MS dataset has been selected randomly as test data in order to perform classification with the multi-class SVM algorithm.

The MS dataset has four classes. Let us study Example 8.3 in order to understand how the classifiers are found through the training of the MS dataset with the multi-class SVM algorithm.

Example 8.3

SPMS: X1 = (1, 1); X2 = (−1, −1). RRMS: X3 = (1, −1); X4 = (−1, 1)

Since y = (+1, +1, −1, −1), let us find the classifier of the sample problem.

The aspects regarding the vectors presented in Example 8.3 can be seen in Figure 8.9.

Figure 8.9: SVM algorithm for the analysis of the MS Dataset.

Let us rewrite n = 4 relation in eq. (8.8):

L(a) = ∑i=1..n ai − (1/2) aT H a, where Hij = yi yj K(Xi, Xj). The second-degree polynomial function is K(Xi, Xj) = (Xi · Xj + 1)², and the matrix can be calculated as follows:

Steps (1–7)
