3

Recognizing Faces with Support Vector Machine

In the previous chapter, we built a movie recommendation system with Naïve Bayes. This chapter continues our journey of supervised learning and classification. Specifically, we will be focusing on multiclass classification and support vector machine (SVM) classifiers. SVM is one of the most popular algorithms when it comes to high-dimensional spaces. The goal of the algorithm is to find a decision boundary in order to separate data from different classes. We will be discussing in detail how that works. Also, we will be implementing the algorithm with scikit-learn, and applying it to solve various real-life problems, including our main project of face recognition, along with fetal state categorization in cardiotocography and breast cancer prediction. A dimensionality reduction technique called principal component analysis, which boosts the performance of the image classifier, will also be covered in the chapter.

This chapter explores the following topics:

  • The mechanics of SVM explained through different scenarios
  • The implementations of SVM with scikit-learn
  • Multiclass classification strategies
  • SVM with kernel methods
  • How to choose between linear and Gaussian kernels
  • Face recognition with SVM
  • Principal component analysis
  • Tuning with grid search and cross-validation
  • Fetal state categorization using SVM with a non-linear kernel

Finding the separating boundary with SVM

Now that you have been introduced to a powerful yet simple classifier, Naïve Bayes, we will continue with another great classifier, SVM, which is effective in cases with high-dimensional spaces or where the number of dimensions is greater than the number of samples.

In machine learning classification, SVM finds an optimal hyperplane that best segregates observations from different classes. A hyperplane is a plane of n - 1 dimensions that separates the n-dimensional feature space of the observations into two spaces. For example, the hyperplane in a two-dimensional feature space is a line, and in a three-dimensional feature space the hyperplane is a surface. The optimal hyperplane is picked so that the distance from its nearest points in each space to itself is maximized. And these nearest points are the so-called support vectors. The following toy example demonstrates what support vectors and a separating hyperplane (along with the distance margin, which I will explain later) look like in a binary classification case:

Figure 3.1: Example of support vectors and a hyperplane in binary classification

The ultimate goal of SVM is to find an optimal hyperplane, but the burning question is "how can we find this optimal hyperplane?" You will get the answer as we explore the following scenarios. It's not as hard as you may think. The first thing we will look at is how to find a hyperplane.

Scenario 1 – identifying a separating hyperplane

First, you need to understand what qualifies as a separating hyperplane. In the following example, hyperplane C is the only correct one, as it successfully segregates observations by their labels, while hyperplanes A and B fail:

Figure 3.2: Example of qualified and unqualified hyperplanes

This is an easy observation. Let's express a separating hyperplane in a formal or mathematical way next.

In a two-dimensional space, a line can be defined by a coefficient vector w (represented as a two-dimensional vector) and an intercept b. Similarly, in a space of n dimensions, a hyperplane can be defined by an n-dimensional coefficient vector w and an intercept b. Any data point x on the hyperplane satisfies wx + b = 0. A hyperplane is a separating hyperplane if the following conditions are satisfied:

  • For any data point x from one class, it satisfies wx + b > 0
  • For any data point x from another class, it satisfies wx + b < 0

However, there can be countless possible solutions for w and b. You can move or rotate the hyperplane to a certain extent and it will still remain a separating hyperplane. Next, you will learn how to identify the best hyperplane among the various possible separating hyperplanes.
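To make these two conditions concrete, here is a minimal NumPy sketch with a made-up hyperplane (w, b) and a handful of labeled points; the specific numbers are purely illustrative:

>>> import numpy as np
>>> w = np.array([1.0, 1.0])    # a hypothetical separating hyperplane wx + b = 0
>>> b = -3.0
>>> X = np.array([[3, 2], [4, 3], [2, 3],    # class +1
...               [1, 1], [0, 2], [1, 0]])   # class -1
>>> y = np.array([1, 1, 1, -1, -1, -1])
>>> scores = X @ w + b
>>> print(scores)
[ 2.  4.  2. -1. -1. -2.]
>>> print(np.all(np.sign(scores) == y))   # wx + b > 0 for one class, < 0 for the other
True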

Scenario 2 – determining the optimal hyperplane

Look at the following example: hyperplane C is the preferred one, as it maximizes the sum of the distance from the nearest data point on the positive side to the hyperplane and the distance from the nearest data point on the negative side to the hyperplane:

Figure 3.3: An example of optimal and suboptimal hyperplanes

The nearest point(s) on the positive side can constitute a hyperplane parallel to the decision hyperplane, which we call a positive hyperplane; on the other hand, the nearest point(s) on the negative side can constitute the negative hyperplane. The perpendicular distance between the positive and negative hyperplanes is called the margin, the value of which equates to the sum of the two aforementioned distances. A decision hyperplane is deemed optimal if the margin is maximized.

The optimal (also called maximum-margin) hyperplane and the distance margins for a trained SVM model are illustrated in the following diagram. Again, samples on the margin (two from one class, and one from another class, as shown) are the so-called support vectors:

Figure 3.4: An example of an optimal hyperplane and distance margins

We can interpret it in a mathematical way by first describing the positive and negative hyperplanes as follows:

wx^(p) + b = 1
wx^(n) + b = -1

Here, x^(p) is a data point on the positive hyperplane, and x^(n) is a data point on the negative hyperplane.

The distance between a point x^(p) and the decision hyperplane can be calculated as follows:

|wx^(p) + b| / ||w|| = 1 / ||w||

Similarly, the distance between a point x^(n) and the decision hyperplane is as follows:

|wx^(n) + b| / ||w|| = 1 / ||w||

So, the margin becomes 2 / ||w||. As a result, we need to minimize ||w|| in order to maximize the margin. Importantly, to comply with the fact that the support vectors on the positive and negative hyperplanes are the nearest data points to the decision hyperplane, we add a condition that no data point falls between the positive and negative hyperplanes:

wx^(i) + b ≥ 1, if y^(i) = 1
wx^(i) + b ≤ -1, if y^(i) = -1

Here, (x^(i), y^(i)) is an observation. This can be combined further into the following:

y^(i)(wx^(i) + b) ≥ 1

To summarize, w and b, which determine the SVM decision hyperplane, are trained and solved by the following optimization problem:

  • Minimizing ||w||
  • Subject to y^(i)(wx^(i) + b) ≥ 1, for a training set of (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m))

To solve this optimization problem, we need to resort to quadratic programming techniques, which are beyond the scope of our learning journey. Therefore, we will not cover the computation methods in detail and will instead implement the classifier using the SVC and LinearSVC modules from scikit-learn, which are based, respectively, on libsvm (https://www.csie.ntu.edu.tw/~cjlin/libsvm/) and liblinear (https://www.csie.ntu.edu.tw/~cjlin/liblinear/), two popular open-source SVM machine learning libraries. But it is always valuable to understand the concepts of computing SVM.

Shai Shalev-Shwartz et al. "Pegasos: Primal estimated sub-gradient solver for SVM" (Mathematical Programming, March 2011, volume 127, issue 1, pp. 3-30), and Cho-Jui Hsieh et al. "A dual coordinate descent method for large-scale linear SVM" (Proceedings of the 25th international conference on machine learning, pp 408-415) are great learning materials. They cover two modern approaches, sub-gradient descent and coordinate descent.

The learned model parameters w and b are then used to classify a new sample, x', based on the following conditions:

If wx' + b > 0, x' is assigned to the positive class; if wx' + b < 0, it is assigned to the negative class.

Moreover, |wx' + b| can be portrayed as the distance from the data point x' to the decision hyperplane (up to the constant factor 1 / ||w||), and also interpreted as the confidence of the prediction: the higher the value, the further away the data point is from the decision boundary, and hence the higher the prediction certainty.
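Continuing the small sketch from before, the same hypothetical (w, b) can be used to classify new samples and rank the confidence of each prediction:

>>> x_new = np.array([[4.0, 4.0], [2.0, 1.2], [1.4, 1.5]])
>>> scores = x_new @ w + b               # wx' + b for each new sample
>>> print(np.where(scores > 0, 1, -1))   # predicted classes: [ 1  1 -1]
>>> print(np.abs(scores))                # confidence: roughly [5.  0.2 0.1]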

Although you might be eager to implement the SVM algorithm, let's take a step back and look at a common scenario where data points are not linearly separable in a strict way. Try to find a separating hyperplane in the following example:

Figure 3.5: An example of data points that are not strictly linearly separable

Scenario 3 – handling outliers

How can we deal with cases where it is impossible to strictly linearly segregate a set of observations containing outliers? We can actually allow the misclassification of such outliers and try to minimize the error introduced. The misclassification error ζ^(i) (also called hinge loss) for the sample x^(i) can be expressed as follows:

ζ^(i) = max(0, 1 - y^(i)(wx^(i) + b))

Together with the term ||w|| that we ultimately want to reduce, the final objective value we want to minimize becomes the following:

||w|| + C(ζ^(1) + ζ^(2) + … + ζ^(m))

for a training set of m samples (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(m), y^(m)), where the hyperparameter C controls the trade-off between the two terms:

  • If a large value of C is chosen, the penalty for misclassification becomes relatively high. This means the rule of data segregation becomes stricter and the model might be prone to overfitting, since few mistakes are allowed during training. An SVM model with a large C has low bias, but it might suffer from high variance.
  • Conversely, if the value of C is sufficiently small, the influence of misclassification becomes fairly low. This model allows more misclassified data points than a model with a large C. Thus, data separation becomes less strict. Such a model has low variance, but it might be compromised by high bias.

A comparison between a large and small C is shown in the following diagram:

Figure 3.6: How the value of C affects the strictness of segregation and the margin

The parameter C determines the balance between bias and variance. It can be fine-tuned with cross-validation, which we will practice shortly.
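To make the effect of C tangible before we get to the full implementation in the next section, here is a small sketch using scikit-learn's SVC with a linear kernel on made-up data containing one outlier; you should see the margin 2 / ||w|| shrink as C grows:

>>> import numpy as np
>>> from sklearn.svm import SVC
>>> X = np.array([[3, 3], [4, 3], [3, 4], [1.5, 1.5],   # the last positive point is an outlier
...               [1, 1], [0, 1], [1, 0], [0.5, 0.5]])
>>> y = np.array([1, 1, 1, 1, -1, -1, -1, -1])
>>> for C in [0.1, 1, 100]:
...     clf = SVC(kernel='linear', C=C).fit(X, y)
...     w = clf.coef_[0]
...     print(f'C={C}: margin = {2 / np.linalg.norm(w):.2f}, '
...           f'support vectors = {len(clf.support_)}')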

Implementing SVM

We have largely covered the fundamentals of the SVM classifier. Now, let's apply it right away to an easy binary classification dataset. We will use the classic breast cancer Wisconsin dataset (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html) from scikit-learn.

Let's take a look at the following steps:

  1. We first load the dataset and do some basic analysis, as follows:
    >>> from sklearn.datasets import load_breast_cancer
    >>> cancer_data = load_breast_cancer()
    >>> X = cancer_data.data
    >>> Y = cancer_data.target
    >>> print('Input data size :', X.shape)
    Input data size : (569, 30)
    >>> print('Output data size :', Y.shape)
    Output data size : (569,)
    >>> print('Label names:', cancer_data.target_names)
    Label names: ['malignant' 'benign']
    >>> n_pos = (Y == 1).sum()
    >>> n_neg = (Y == 0).sum()
    >>> print(f'{n_pos} positive samples and {n_neg} negative samples.')
    357 positive samples and 212 negative samples.
    

    As you can see, the dataset has 569 samples with 30 features; its label is binary, and 63% of samples are positive (benign). Again, always check whether classes are imbalanced before trying to solve any classification problem. In this case, they are relatively balanced.

  2. Next, we split the data into training and testing sets:
    >>> from sklearn.model_selection import train_test_split
    >>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=42)
    

    For reproducibility, don't forget to specify a random seed.

  3. We can now apply the SVM classifier to the data. We first initialize an SVC model with the kernel parameter set to linear (I will explain what kernel means in the next section) and the penalty hyperparameter C set to the default value, 1.0:
    >>> from sklearn.svm import SVC
    >>> clf = SVC(kernel='linear', C=1.0, random_state=42)
    
  4. We then fit our model on the training set as follows:
    >>> clf.fit(X_train, Y_train)
    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
        decision_function_shape='ovr', degree=3,
        gamma='auto_deprecated', kernel='linear', max_iter=-1,
        probability=False, random_state=42, shrinking=True,
        tol=0.001, verbose=False)
    
  5. And we predict on the testing set with the trained model and obtain the prediction accuracy directly:
    >>> accuracy = clf.score(X_test, Y_test)
    >>> print(f'The accuracy is: {accuracy*100:.1f}%')
    The accuracy is: 95.8%
    

Our first SVM model works just great, achieving an accuracy of 95.8%. How about dealing with more than two classes? How does SVM handle multiclass classification?

Scenario 4 – dealing with more than two classes

SVM and many other classifiers can be applied to cases with more than two classes. There are two typical approaches we can take, one-vs-rest (also called one-vs-all) and one-vs-one.

In the one-vs-rest setting, for a K-class problem, we construct K different binary SVM classifiers. For the kth classifier, it treats the kth class as the positive case and the remaining K - 1 classes as the negative case as a whole; the hyperplane denoted as (w_k, b_k) is trained to separate these two cases. To predict the class of a new sample, x', it compares the resulting predictions w_k x' + b_k from the K individual classifiers, for k from 1 to K. As we discussed in the previous section, a larger value of w_k x' + b_k means higher confidence that x' belongs to the positive case. Therefore, it assigns x' to the class i where w_i x' + b_i has the largest value among all prediction results:

y' = argmax_i (w_i x' + b_i)

The following diagram shows how the one-vs-rest strategy works in a three-class case:

Figure 3.7: An example of three-class classification using the one-vs-rest strategy

For instance, if we have the following (r, b, and g denote the red, blue, and green classes, respectively):

w_r x' + b_r = 0.78
w_b x' + b_b = 0.35
w_g x' + b_g = -0.64

we can say x' belongs to the red class since 0.78 > 0.35 > -0.64. If we have the following:

w_r x' + b_r = -0.78
w_b x' + b_b = -0.35
w_g x' + b_g = -0.64

then we can determine that x' belongs to the blue class regardless of the sign, since -0.35 > -0.64 > -0.78.

In the one-vs-one strategy, we conduct a pairwise comparison by building a set of SVM classifiers that can distinguish data points from each pair of classes. This will result in K(K - 1)/2 different classifiers.

For a classifier associated with classes i and j, the hyperplane denoted as (w_ij, b_ij) is trained only on the basis of observations from i (which can be viewed as the positive case) and j (which can be viewed as the negative case); it then assigns the class, either i or j, to a new sample, x', based on the sign of w_ij x' + b_ij. Finally, the class with the highest number of assignments is considered the predicted result for x'. The winner is the class that gets the most votes.

The following diagram shows how the one-vs-one strategy works in a three-class case:

Figure 3.8: An example of three-class classification using the one-vs-one strategy

In general, an SVM classifier with a one-vs-rest setting and a classifier with a one-vs-one setting perform comparably in terms of accuracy. The choice between these two strategies is largely computational.

Although one-vs-one requires more classifiers, K(K - 1)/2, than one-vs-rest (K), each pairwise classifier only needs to learn on a small subset of data, as opposed to the entire set in the one-vs-rest setting. As a result, training an SVM model in the one-vs-one setting is generally more memory efficient and less computationally expensive, and hence is preferable for practical use, as argued in Chih-Wei Hsu and Chih-Jen Lin's A comparison of methods for multiclass support vector machines (IEEE Transactions on Neural Networks, March 2002, Volume 13, pp. 415-425).
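scikit-learn's SVC applies one of these strategies internally, as you will see next, but if you ever want to make the choice explicit, the OneVsRestClassifier and OneVsOneClassifier wrappers from sklearn.multiclass can be combined with any binary classifier. Here is a minimal sketch on the wine dataset (the same one we use in the next example); the exact accuracies are illustrative only:

>>> from sklearn.datasets import load_wine
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
>>> from sklearn.svm import SVC
>>> X_wine, Y_wine = load_wine(return_X_y=True)
>>> Xw_train, Xw_test, Yw_train, Yw_test = train_test_split(X_wine, Y_wine, random_state=42)
>>> ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(Xw_train, Yw_train)   # K classifiers
>>> ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(Xw_train, Yw_train)    # K(K - 1)/2 classifiers
>>> print('OvR accuracy:', ovr.score(Xw_test, Yw_test))
>>> print('OvO accuracy:', ovo.score(Xw_test, Yw_test))
>>> print('Number of OvO estimators:', len(ovo.estimators_))   # 3 * 2 / 2 = 3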

In scikit-learn, classifiers handle multiclass cases internally, and we do not need to explicitly write any additional code to enable this. You can see how simple it is in the wine classification example (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine) with three classes, as follows:

  1. We first load the dataset and do some basic analysis, as follows:
    >>> from sklearn.datasets import load_wine
    >>> wine_data = load_wine()
    >>> X = wine_data.data
    >>> Y = wine_data.target
    >>> print('Input data size :', X.shape)
    Input data size : (178, 13)
    >>> print('Output data size :', Y.shape)
    Output data size : (178,)
    >>> print('Label names:', wine_data.target_names)
    Label names: ['class_0' 'class_1' 'class_2']
    >>> n_class0 = (Y == 0).sum() 
    >>> n_class1 = (Y == 1).sum() 
    >>> n_class2 = (Y == 2).sum() 
    >>> print(f'{n_class0} class0 samples,\n'
    ...       f'{n_class1} class1 samples,\n'
    ...       f'{n_class2} class2 samples.')
    59 class0 samples, 
    71 class1 samples, 
    48 class2 samples.
    

    As you can see, the dataset has 178 samples with 13 features; its label has three possible values taking up 33%, 40%, and 27%, respectively.

  2. Next, we split the data into training and testing sets:
    >>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=42)
    
  3. We can now apply the SVM classifier to the data. We first initialize an SVC model and fit it against the training set:
    >>> clf = SVC(kernel='linear', C=1.0, random_state=42)
    >>> clf.fit(X_train, Y_train)
    SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
        decision_function_shape='ovr', degree=3,
        gamma='auto_deprecated', kernel='linear', max_iter=-1,
        probability=False, random_state=42, shrinking=True,
        tol=0.001, verbose=False)
    

    In an SVC model, multiclass support is implicitly handled according to the one-vs-one scheme.

  4. Next, we predict on the testing set with the trained model and obtain the prediction accuracy directly:
    >>> accuracy = clf.score(X_test, Y_test)
    >>> print(f'The accuracy is: {accuracy*100:.1f}%')
    The accuracy is: 97.8%
    

    Our SVM model also works well in the multiclass case, achieving an accuracy of 97.8%.

  5. We also check how it performs for individual classes:
    >>> from sklearn.metrics import classification_report
    >>> pred = clf.predict(X_test)
    >>> print(classification_report(Y_test, pred))
                  precision    recall  f1-score   support
               0       1.00      1.00      1.00        15
               1       1.00      0.94      0.97        18
               2       0.92      1.00      0.96        12
       micro avg       0.98      0.98      0.98        45
       macro avg       0.97      0.98      0.98        45
    weighted avg       0.98      0.98      0.98        45
    

It looks excellent! Is the example too easy? Maybe. What do we do in tricky cases? Of course, we could tweak the values of the kernel and C hyperparameters. As discussed, the factor C controls the strictness of separation, and it can be tuned to achieve the best trade-off between bias and variance. How about the kernel? What does it mean and what are the alternatives to a linear kernel?

In the next section, we will answer those two questions we just raised. You will see how the kernel trick makes SVM so powerful.

Scenario 5 – solving linearly non-separable problems with kernels

The hyperplanes we have found so far are linear, for instance, a line in a two-dimensional feature space, or a surface in a three-dimensional one. However, in the following example, we are not able to find a linear hyperplane that can separate two classes:

Figure 3.9: The linearly non-separable case

Intuitively, we observe that data points from one class are closer to the origin than those from the other class. The distance to the origin provides distinguishable information. So we add a new feature, z = x1² + x2², and transform the original two-dimensional space into a three-dimensional one. In the new space, as displayed in the following diagram, we can find a surface hyperplane separating the data (which appears as a line in the two-dimensional view). With the additional feature, the dataset becomes linearly separable in the higher-dimensional space, (x1, x2, z):

Figure 3.10: Making a non-separable case separable
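Here is a small sketch of this idea on made-up data, with one class near the origin and the other on an outer ring; adding z = x1² + x2² as a third feature lets a linear SVM separate perfectly what it cannot separate in the original two dimensions:

>>> import numpy as np
>>> from sklearn.svm import SVC
>>> rng = np.random.RandomState(42)
>>> inner = rng.normal(scale=0.5, size=(50, 2))                  # class 0: near the origin
>>> angle = rng.uniform(0, 2 * np.pi, 50)
>>> outer = np.c_[3 * np.cos(angle), 3 * np.sin(angle)] + rng.normal(
...     scale=0.2, size=(50, 2))                                 # class 1: an outer ring
>>> X = np.vstack([inner, outer])
>>> y = np.array([0] * 50 + [1] * 50)
>>> z = (X ** 2).sum(axis=1)                                     # new feature z = x1^2 + x2^2
>>> X_3d = np.c_[X, z]
>>> print('Linear SVM, 2D accuracy:', SVC(kernel='linear').fit(X, y).score(X, y))
>>> print('Linear SVM, 3D accuracy:', SVC(kernel='linear').fit(X_3d, y).score(X_3d, y))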

Based upon similar logic, SVMs with kernels were invented to solve non-linear classification problems by converting the original feature space, x, to a higher-dimensional feature space with a transformation function, Φ, such that the transformed dataset Φ(x) is linearly separable.

A linear hyperplane, defined by w and b, is then learned in the new space using the transformed observations Φ(x^(i)) and their labels y^(i). For an unknown sample x', it is first transformed into Φ(x'); the predicted class is then determined by the sign of wΦ(x') + b.

An SVM with kernels enables non-linear separation, but it does not explicitly map each original data point to the high-dimensional space and then perform expensive computation in the new space. Instead, it approaches this in a tricky way.

During the course of solving the SVM quadratic optimization problem, feature vectors are involved only in the form of pairwise dot products x^(i) · x^(j), although we will not expand this mathematically in this book. With kernels, the new feature vectors are Φ(x^(i)), and their pairwise dot products can be expressed as Φ(x^(i)) · Φ(x^(j)). It would be computationally efficient if we could first implicitly conduct a pairwise operation on the two low-dimensional vectors and only later map the result to the high-dimensional space. In fact, a function K that satisfies this does exist:

K(x^(i), x^(j)) = Φ(x^(i)) · Φ(x^(j))

The function K is the so-called kernel function. With this trick, the transformation Φ becomes implicit, and the non-linear decision boundary can be efficiently learned by simply replacing the term x^(i) · x^(j) with K(x^(i), x^(j)).
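To see why this works, here is a tiny check for the degree-2 polynomial kernel K(a, b) = (a · b)²: its explicit mapping for two-dimensional inputs is Φ(x) = (x1², √2·x1·x2, x2²), and the dot product in the mapped three-dimensional space matches the kernel value computed directly in the original two-dimensional space:

>>> import numpy as np
>>> def phi(x):
...     # explicit degree-2 feature mapping for a 2D point
...     return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])
>>> a, b = np.array([1.0, 2.0]), np.array([3.0, 0.5])
>>> lhs = phi(a) @ phi(b)    # dot product in the mapped 3D space
>>> rhs = (a @ b) ** 2       # kernel computed directly in the original 2D space
>>> print(np.isclose(lhs, rhs))   # both equal 16 (up to floating-point error)
True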

The most popular kernel function is probably the radial basis function (RBF) kernel (also called the Gaussian kernel), which is defined as follows:

K(x^(i), x^(j)) = exp(-γ ||x^(i) - x^(j)||²)

Here, γ = 1 / (2σ²). In the Gaussian function, the standard deviation σ controls the amount of variation or dispersion allowed: the higher the σ (or the lower the γ), the larger the width of the bell and the wider the range over which data points are allowed to spread out. Therefore, γ, as the kernel coefficient, determines how strictly or generally the kernel function fits the observations. A large γ implies a small variance allowed and a relatively exact fit on the training samples, which might lead to overfitting. On the other hand, a small γ implies a high variance allowed and a loose fit on the training samples, which might cause underfitting.

To illustrate this trade-off, let's apply the RBF kernel with different γ values to a toy dataset:

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> X = np.c_[# negative class
...           (.3, -.8),
...           (-1.5, -1),
...           (-1.3, -.8),
...           (-1.1, -1.3),
...           (-1.2, -.3),
...           (-1.3, -.5),
...           (-.6, 1.1),
...           (-1.4, 2.2),
...           (1, 1),
...           # positive class
...           (1.3, .8),
...           (1.2, .5),
...           (.2, -2),
...           (.5, -2.4),
...           (.2, -2.3),
...           (0, -2.7),
...           (1.3, 2.1)].T
>>> Y = [-1] * 8 + [1] * 8

Eight data points are from one class, and eight are from another. We take three values, 1, 2, and 4, as kernel coefficient options for this example:

>>> gamma_option = [1, 2, 4]

Under each kernel coefficient, we fit an individual SVM classifier and visualize the trained decision boundary:

>>> for i, gamma in enumerate(gamma_option, 1):
...     svm = SVC(kernel='rbf', gamma=gamma)
...     svm.fit(X, Y)
...     plt.scatter(X[:, 0], X[:, 1], c=['b']*8+['r']*8, 
...                 zorder=10, cmap=plt.cm.Paired)
...     plt.axis('tight')
...     XX, YY = np.mgrid[-3:3:200j, -3:3:200j]
...     Z = svm.decision_function(np.c_[XX.ravel(), YY.ravel()])
...     Z = Z.reshape(XX.shape)
...     plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
...     plt.contour(XX, YY, Z, colors=['k', 'k', 'k'],
...            linestyles=['--', '-', '--'], levels=[-.5, 0, .5])
...     plt.title('gamma = %d' % gamma)
...     plt.show()

Refer to the following screenshot for the end results:

Figure 3.11: The SVM classification decision boundary under different values of gamma

We can observe that a larger γ results in narrow regions, which means a stricter fit on the dataset, while a smaller γ results in broad regions, which means a loose fit on the dataset. Of course, γ can be fine-tuned through cross-validation to obtain the best performance.

Some other common kernel functions include the polynomial kernel:

K(x^(i), x^(j)) = (γ x^(i) · x^(j) + r)^d

and the sigmoid kernel:

K(x^(i), x^(j)) = tanh(γ x^(i) · x^(j) + r)

In the absence of prior knowledge of the distribution, the RBF kernel is usually preferable in practice: the polynomial kernel has an additional parameter to tweak (the polynomial degree d), and the sigmoid kernel empirically performs on a par with the RBF kernel only under certain parameters. Hence, given a dataset, the debate is mainly between the linear kernel (also considered "no kernel") and the RBF kernel.

Choosing between linear and RBF kernels

Of course, linear separability is the rule of thumb when choosing the right kernel to start with. However, most of the time this is very difficult to identify, unless you have sufficient prior knowledge of the dataset, or its features are of low dimensions (1 to 3).

Some general prior knowledge that is commonly known includes that text data is often linearly separable, while data generated from the XOR function (https://en.wikipedia.org/wiki/XOR_gate) is not.
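As a quick illustration of the XOR case with synthetic data (the accuracies are indicative only), a linear kernel struggles while the RBF kernel separates it easily:

>>> import numpy as np
>>> from sklearn.svm import SVC
>>> rng = np.random.RandomState(0)
>>> X_xor = rng.uniform(-1, 1, size=(200, 2))
>>> y_xor = (X_xor[:, 0] * X_xor[:, 1] < 0).astype(int)   # positive when the two signs differ
>>> for kernel in ['linear', 'rbf']:
...     clf = SVC(kernel=kernel).fit(X_xor, y_xor)
...     print(kernel, 'training accuracy:', clf.score(X_xor, y_xor))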

Now, let's look at the following three scenarios where the linear kernel is favored over RBF.

Scenario 1: Both the number of features and the number of instances are large (more than 10^4 or 10^5). Since the dimension of the feature space is already high enough, the additional features resulting from the RBF transformation will not provide any performance improvement, but they will increase the computational expense. Several datasets in the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/) are of this type.

Scenario 2: The number of features is noticeably large compared to the number of training samples. Apart from the reasons stated in scenario 1, the RBF kernel is significantly more prone to overfitting. Such a scenario occurs, for example, when thousands of features describe only a few hundred samples.

Scenario 3: The number of instances is significantly large compared to the number of features. For a dataset of low dimension, the RBF kernel will, in general, boost the performance by mapping it to a higher-dimensional space. However, due to the training complexity, it usually becomes inefficient on a training set with more than 10^6 or 10^7 samples.

Aside from these three scenarios, RBF is ordinarily the first choice.

The rules for choosing between linear and RBF kernels can be summarized as follows:

Scenario                                                             Linear                   RBF
Prior knowledge                                                      If linearly separable    If nonlinearly separable
Visualizable data of 1 to 3 dimension(s)                             If linearly separable    If nonlinearly separable
Both the number of features and the number of instances are large   First choice
Features >> Instances                                                First choice
Instances >> Features                                                First choice
Others                                                                                        First choice

Table 3.1: Rules for choosing between linear and RBF kernels

Once again, first choice means we can begin with this option; it does not mean that this is the only option moving forward.

Next, let's take a look at classifying face images.

Classifying face images with SVM

Finally, it is time to build an SVM-based face image classifier using everything you just learned. We will do so in two parts: exploring the image dataset, and then building the classifier on top of it.

Exploring the face image dataset

We will use the Labeled Faces in the Wild (LFW) people dataset (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_lfw_people.html) from scikit-learn. It consists of more than 13,000 curated face images of more than 5,000 famous people. Each class has various numbers of image samples.

First, we load the face image data as follows:

>>> from sklearn.datasets import fetch_lfw_people
>>> face_data = fetch_lfw_people(min_faces_per_person=80)
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976012
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976009
Downloading LFW metadata: https://ndownloader.figshare.com/files/5976006
Downloading LFW data (~200MB): https://ndownloader.figshare.com/files/5976015

We only load classes with at least 80 samples so that we will have enough training data. Note that if you run into the problem of ImportError: The Python Imaging Library (PIL) is required to load data from jpeg files, please install the package pillow as follows in the terminal:

pip install pillow

Next, we take a look at the data we loaded:

>>> X = face_data.data
>>> Y = face_data.target
>>> print('Input data size :', X.shape)
Input data size : (1140, 2914)
>>> print('Output data size :', Y.shape)
Output data size : (1140,)
>>> print('Label names:', face_data.target_names)
Label names: ['Colin Powell' 'Donald Rumsfeld' 'George W Bush' 'Gerhard Schroeder' 'Tony Blair']

This five-class dataset contains 1,140 samples and a sample is of 2,914 dimensions. As a good practice, we analyze the label distribution as follows:

>>> for i in range(5):
...     print(f'Class {i} has {(Y == i).sum()} samples.')
Class 0 has 236 samples.
Class 1 has 121 samples.
Class 2 has 530 samples.
Class 3 has 109 samples.
Class 4 has 144 samples.

The dataset is rather imbalanced. Let's keep this in mind when we build the model.

Now let's plot a few face images:

>>> import matplotlib.pyplot as plt
>>>
>>> fig, ax = plt.subplots(3, 4)
>>> for i, axi in enumerate(ax.flat):
...     axi.imshow(face_data.images[i], cmap='bone')
...     axi.set(xticks=[], yticks=[],
...             xlabel=face_data.target_names[face_data.target[i]])
...
>>> plt.show()

You will see the following 12 images with their labels:

Figure 3.12: Samples from the LFW people dataset

Now that we have covered exploratory data analysis, we will move on to the model development phase in the next section.

Building an SVM-based image classifier

First, we split the data into the training and testing set:

>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=42)

In this project, the number of dimensions is greater than the number of samples. This is a classification case that SVM is effective at solving. In our solution, we will tune the hyperparameters, including the penalty C, the kernel (linear or RBF), and γ (for the RBF kernel), through cross-validation.

We then initialize a common SVM model:

>>> clf = SVC(class_weight='balanced', random_state=42)

The dataset is imbalanced, so we set class_weight='balanced' to emphasize the underrepresented classes.
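Under the hood, 'balanced' gives each class a weight inversely proportional to its frequency, namely n_samples / (n_classes * class_count), computed on the data passed to fit. As a quick sketch, reusing the Y and face_data objects loaded earlier, the weights look roughly like this:

>>> import numpy as np
>>> counts = np.bincount(Y)
>>> weights = len(Y) / (len(counts) * counts)
>>> for name, weight in zip(face_data.target_names, weights):
...     print(f'{name}: weight {weight:.2f}')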

The way we have conducted cross-validation so far is to explicitly split data into folds and repetitively write a for loop to consecutively examine each hyperparameter. To make this less redundant, we'll introduce a more elegant approach utilizing the GridSearchCV module from scikit-learn. GridSearchCV handles the entire process implicitly, including data splitting, fold generation, cross training and validation, and finally, an exhaustive search over the best set of parameters. What is left for us is just to specify the hyperparameter(s) to tune and the values to explore for each individual hyperparameter:

>>> parameters = {'C': [0.1, 1, 10],
...               'gamma': [1e-07, 1e-08, 1e-06],
...               'kernel' : ['rbf', 'linear'] }
>>> from sklearn.model_selection import GridSearchCV
>>> grid_search = GridSearchCV(clf, parameters, n_jobs=-1, cv=5)

The GridSearchCV model we just initialized will conduct five-fold cross-validation (cv=5) and will run in parallel on all available cores (n_jobs=-1). We then perform hyperparameter tuning by simply applying the fit method:

>>> grid_search.fit(X_train, Y_train)

We obtain the optimal set of hyperparameters using the following code:

>>> print('The best model:\n', grid_search.best_params_)
The best model:
 {'C': 10, 'gamma': 1e-07, 'kernel': 'rbf'}

And we obtain the best five-fold averaged performance under the optimal set of parameters by using the following code:

>>> print('The best averaged performance:', grid_search.best_score_)
 The best averaged performance: 0.8526315789473684

We then retrieve the SVM model with the optimal set of hyperparameters and apply it to the testing set:

>>> clf_best = grid_search.best_estimator_
>>> pred = clf_best.predict(X_test)

We then calculate the accuracy and classification report:

>>> accuracy = clf_best.score(X_test, Y_test)
>>> print(f'The accuracy is: {accuracy*100:.1f}%')
The accuracy is: 87.7%
>>> print(classification_report(Y_test, pred,
...           target_names=face_data.target_names))
                   precision    recall  f1-score   support
     Colin Powell       0.89      0.88      0.88        64
  Donald Rumsfeld       0.84      0.81      0.83        32
    George W Bush       0.88      0.93      0.90       127
Gerhard Schroeder       0.84      0.72      0.78        29
       Tony Blair       0.91      0.88      0.89        33
        micro avg       0.88      0.88      0.88       285
        macro avg       0.87      0.84      0.86       285
     weighted avg       0.88      0.88      0.88       285

It should be noted that we tune the model based on the original training set, which is divided into folds for cross training and validation internally, and that we then apply the optimal model to the original testing set. We examine the classification performance in this manner in order to measure how well the model generalizes to a completely new dataset. An accuracy of 87.7% is achieved with the best SVM model.

There is another SVM classifier, LinearSVC (https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html), from scikit-learn. How is it different from SVC? LinearSVC is similar to SVC with a linear kernel, but it is implemented on top of the liblinear library, which is better optimized than libsvm for the linear case, and its penalty function is more flexible.

In general, training with the LinearSVC model is faster than with SVC. This is because the highly scalable liblinear library is designed for large datasets, while the libsvm library, with more than quadratic computational complexity, is not able to scale well beyond roughly 10^5 training instances. But again, the LinearSVC model is limited to linear kernels.
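As a rough comparison on the face data we split earlier (timings depend on your machine, and LinearSVC may emit a convergence warning on raw pixel features), you could time both linear variants side by side:

>>> import time
>>> from sklearn.svm import LinearSVC
>>> for name, model in [('LinearSVC', LinearSVC(dual=False, class_weight='balanced')),
...                     ('SVC linear', SVC(kernel='linear', class_weight='balanced'))]:
...     start = time.time()
...     model.fit(X_train, Y_train)
...     print(f'{name}: trained in {time.time() - start:.2f}s, '
...           f'test accuracy {model.score(X_test, Y_test)*100:.1f}%')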

Boosting image classification performance with PCA

We can also improve the image classifier by compressing the input features with principal component analysis (PCA) (https://en.wikipedia.org/wiki/Principal_component_analysis). It reduces the dimension of the original feature space and preserves the most important internal relationships among features. In simple terms, PCA projects the original data into a smaller space with the most important directions (coordinates). We hope that in cases where we have more features than training samples, considering fewer features as a result of dimensionality reduction using PCA can prevent overfitting.
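Before fixing the number of components, it can be helpful to check how much of the variance a given number of principal components retains. Here is a small sketch on the face training set; the choice of 100 components is simply the value we will use below:

>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> pca_check = PCA(n_components=100, whiten=True, random_state=42).fit(X_train)
>>> retained = np.sum(pca_check.explained_variance_ratio_)
>>> print(f'The first 100 components retain {retained*100:.1f}% of the variance')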

We will implement PCA with the PCA module (https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) from scikit-learn. We will first apply PCA to reduce the dimensionality and train the classifier on the resulting data. In machine learning, we usually concatenate multiple consecutive steps and treat them as one "model." We call this process pipelining. We utilize the pipeline API (https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) from scikit-learn to facilitate this.

Now let's initialize a PCA model, an SVC model, and a model pipelining these two:

>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=100, whiten=True, random_state=42)
>>> svc = SVC(class_weight='balanced', kernel='rbf',
...           random_state=42)
>>> from sklearn.pipeline import Pipeline
>>> model = Pipeline([('pca', pca),
...                  ('svc', svc)])

The PCA component projects the original data into a 100-dimension space, followed by the SVC classifier with the RBF kernel. We then grid search for the best model from a few options:

>>> parameters_pipeline = {'svc__C': [1, 3, 10],
...                       'svc__gamma': [0.001, 0.005]}
>>> grid_search = GridSearchCV(model, parameters_pipeline)
>>> grid_search.fit(X_train, Y_train)

Finally, we print out the best set of hyperparameters and the classification performance with the best model:

>>> print('The best model:\n', grid_search.best_params_)
The best model:
 {'svc__C': 3, 'svc__gamma': 0.005}
>>> print('The best averaged performance:', grid_search.best_score_)
The best averaged performance: 0.8467836257309942
>>> model_best = grid_search.best_estimator_
>>> print(f'The accuracy is: {model_best.score(X_test, Y_test)*100:.1f}%')
The accuracy is: 92.3%
>>> pred = model_best.predict(X_test)
>>> print(classification_report(Y_test, pred, target_names=face_data.target_names))
                   precision    recall  f1-score   support
     Colin Powell       0.97      0.95      0.96        64
  Donald Rumsfeld       0.93      0.84      0.89        32
    George W Bush       0.92      0.98      0.95       127
Gerhard Schroeder       0.88      0.79      0.84        29
       Tony Blair       0.88      0.85      0.86        33
        micro avg       0.92      0.92      0.92       285
        macro avg       0.92      0.88      0.90       285
     weighted avg       0.92      0.92      0.92       285

The model composed of a PCA and an SVM classifier achieves an accuracy of 92.3%. PCA boosts the performance of the SVM-based image classifier. You can read more about PCA at https://www.kaggle.com/nirajvermafcb/principal-component-analysis-explained if you are interested.

Following the successful application of SVM in image classification, we will look at one more example in the next section.

Fetal state classification on cardiotocography

We are going to build a classifier that helps obstetricians categorize cardiotocograms (CTGs) into one of the three fetal states (normal, suspect, and pathologic). The cardiotocography dataset we will use is from https://archive.ics.uci.edu/ml/datasets/Cardiotocography in the UCI Machine Learning Repository, and it can be directly downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/00193/CTG.xls as an .xls Excel file. The dataset consists of measurements of fetal heart rate and uterine contraction as features, and the fetal state class code (1=normal, 2=suspect, 3=pathologic) as a label. There are in total 2,126 samples with 23 features. Based on the numbers of instances and features (2,126 is not significantly larger than 23), the RBF kernel is the first choice.

We will work with the Excel file using pandas, which is well suited to tabular data. pandas might request an additional installation of the xlrd package when you run the following lines of code, since its Excel module is built on top of xlrd. If so, just run pip install xlrd in the terminal to install it.

We first read the data located in the sheet named Raw Data:

>>> import pandas as pd
>>> df = pd.read_excel('CTG.xls', "Raw Data")

Then, we take these 2,126 data samples and assign the feature set (from columns D to AL in the spreadsheet) and label set (column AN):

>>> X = df.iloc[1:2126, 3:-2].values
>>> Y = df.iloc[1:2126, -1].values

Don't forget to check the class proportions:

>>> from collections import Counter
>>> print(Counter(Y))
Counter({1.0: 1655, 2.0: 295, 3.0: 176})

We set aside 20% of the original data for final testing:

>>> X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
...                                 test_size=0.2, random_state=42)

Now, we tune the RBF-based SVM model in terms of the penalty C and the kernel coefficient γ:

>>> svc = SVC(kernel='rbf')
>>> parameters = {'C': (100, 1e3, 1e4, 1e5),
...               'gamma': (1e-08, 1e-7, 1e-6, 1e-5)}
>>> grid_search = GridSearchCV(svc, parameters, n_jobs=-1, cv=5)
>>> grid_search.fit(X_train, Y_train)
>>> print(grid_search.best_params_)
{'C': 100000.0, 'gamma': 1e-07}
>>> print(grid_search.best_score_)
0.9547058823529412

Finally, we apply the optimal model to the testing set:

>>> svc_best = grid_search.best_estimator_
>>> accuracy = svc_best.score(X_test, Y_test)
>>> print(f'The accuracy is: {accuracy*100:.1f}%')
The accuracy is: 96.5%

Also, we have to check the performance for individual classes since the data is quite imbalanced:

>>> prediction = svc_best.predict(X_test)
>>> report = classification_report(Y_test, prediction)
>>> print(report)
           precision recall f1-score  support
       1.0    0.98    0.98     0.98     333
       2.0    0.89    0.91     0.90     64
       3.0    0.96    0.93     0.95     29
micro avg     0.96    0.96     0.96     426
macro avg     0.95    0.94     0.94     426
weighted avg  0.96    0.96     0.96     426

We have now successfully built another SVM-based classifier to solve a real-world problem: fetal state classification in cardiotocography.

Summary

In this chapter, we continued our journey of supervised learning with SVM. You learned about the mechanics of an SVM, kernel techniques and implementations of SVM, and other important concepts of machine learning classification, including multiclass classification strategies and grid search, as well as useful tips for using an SVM (for example, choosing between kernels and tuning parameters). Then, we finally put into practice what you learned in the form of real-world use cases, including face recognition and fetal state classification.

You have learned and adopted two classification algorithms so far, Naïve Bayes and SVM. Naïve Bayes is a simple algorithm (as its name implies). For a dataset with independent, or close to independent, features, Naïve Bayes will usually perform well. SVM is versatile and adaptive to the linear separability of data. In general, high accuracy can be achieved by SVM with the right kernel and parameters. However, this might be at the expense of intense computation and high memory consumption. In practice, we can simply try both and select the better one with optimal parameters.

In the next chapter, we will look at online advertising and predict whether a user will click on an ad. This will be accomplished by means of tree-based algorithms, including decision tree and random forest.

Exercises

  1. Can you implement SVM using the LinearSVC module? What are the hyperparameters that you need to tweak, and what is the best performance of face recognition you are able to achieve?
  2. Can you classify more classes in the image recognition project? As an example, you can set min_faces_per_person=50. What is the best performance you can achieve using grid search and cross-validation?