There are some similarities between support vector classification models and k-nearest neighbors models. They are both intuitive and flexible. However, support vector classification, due to the nature of the algorithm, scales better than k-nearest neighbor. Unlike logistic regression, it can handle nonlinear models rather easily. The strategies and issues with using support vector machines for classification are similar to those we discussed in Chapter 8, Support Vector Regression, when we used support vector machines for regression.
One of the key advantages of support vector classification (SVC) is the ability it gives us to reduce model complexity without increasing our feature space. But it also provides multiple levers we can adjust to limit the possibility of overfitting. We can choose a linear model or select from several nonlinear kernels. We can use a regularization parameter, much as we did for logistic regression. With extensions, we can also use these same techniques to construct multiclass models.
We will explore the following topics in this chapter:
We will stick to the pandas, NumPy, and scikit-learn libraries in this chapter. All code in this chapter was tested with scikit-learn versions 0.24.2 and 1.0.2. The code that displays the decision boundaries needs scikit-learn version 1.1.1 or later.
We can use support vector machines (SVMs) to find a line or curve to separate instances by class. When classes can be discriminated by a line, they are said to be linearly separable.
There may, however, be many possible linear classifiers, as we can see in Figure 13.1. Each line successfully discriminates between the two classes, represented by dots and squares, using the two features x1 and x2. The key difference is in how the lines would classify new instances, represented by the transparent rectangle. Using the line closest to the squares would cause the transparent rectanglez to be classified as a dot. Using either of the other two lines would classify it as a square.
When a linear discriminant is very close to training instances, as is the case with two of the lines in Figure 13.2, there is a greater risk of misclassifying new instances. We want a line that gives us the maximum margin between classes; one that is furthest away from border data points for each class. That is the middle line in Figure 13.1, but it can be seen more clearly in Figure 13.2:
The bold line splits the maximum margin and is referred to as the decision boundary. The border data points for each class are known as the support vectors.
We use SVM to find the linear discriminant with the maximum margin between classes. It does this by finding an equation representing a margin that can be maximized, where the margin is the distance between a data point and the separating hyperplane. With two features, as in Figure 13.2, that hyperplane is just a line. However, this can be generalized to feature spaces with more dimensions.
With data points such as those in Figure 13.2, we can use what is known as hard margin classification without problems; that is, we can be strict about all observations for each class being on the correct side of the decision boundary. But what if our data points look like those in Figure 13.3? Here, there is a square very close to the dots. The hard margin classifier is the left line, giving us quite tiny margins.
If we use soft margin classification instead, we get the line to the right. Soft margin classification relaxes the constraint that all instances have to be correctly separated. As is the case with the data in Figure 13.3, allowing for a small number of misclassifications in the training data can give us a larger margin. We ignore the wayward square and get a decision boundary represented by the soft margin line.
The amount of relaxation of the constraint is determined by the C hyperparameter. The larger the value of C, the greater the penalty for margin violations. Not surprisingly, models with larger C values are more prone to overfitting. Figure 13.4 illustrates how the margin changes with values of C. At C = 1, the penalty for misclassification is low, giving us a much greater margin than when C is 100. Even at a C of 100, however, some margin violation still happens.
As a practical matter, we almost always build our SVC models with soft margins. The default value for C in scikit-learn is 1.
We have not yet fully addressed the issue of linear separability with SVC. For simplicity, it is helpful to return to a classification problem involving two features. Let’s say a plot of two features against a categorical target looks like the illustration in Figure 13.5. The target has two possible values, represented by the dots and squares. x1 and x2 are numeric and have negative values.
What can we do in a case like this to identify a margin between the classes? It is often the case that a margin can be identified at a higher dimension. In this example, we can use a polynomial transformation, as illustrated in Figure 13.6:
There is now a third dimension, which is the sum of the squares of x1 and x2. The dots are all higher than the squares. This is similar to how we used polynomial transformation with linear regression.
One drawback of this approach is that we can quickly end up with too many features for our model to perform well. This is where the kernel trick comes in very handy. SVC can use a kernel function to expand the feature space implicitly without actually creating more features. This is done by creating a vector of values that can be used to fit a nonlinear margin.
While this allows us to fit a polynomial transformation like the hypothetical one illustrated in Figure 13.6, the most frequently used kernel function with SVC is the radial basis function (RBF). RBF is popular because it is faster than other common kernel functions and because it can be used with the gamma hyperparameter for additional flexibility. The equation for the RBF kernel is as follows:
Here, and are data points. Gamma, , determines the amount of influence of each point. With high values of gamma, points have to be very close to each other to be grouped together. At very high values of gamma, we start to see islands of points.
Of course, what is a high value for gamma, or of C, depends partly on our data. A good approach is to create visualizations of decision boundaries at different values for gamma and C before doing much modeling. This will give us a sense of whether or not we are underfitting or overfitting at different hyperparameter values. We will plot decision boundaries at different values of gamma and C in this chapter.
All of our discussion about SVC so far has centered on binary classification. Fortunately, all of the key concepts that apply to SVMs for binary classification also apply to classification when our target has more than two possible values. We transform the multiclass problem into a binary classification problem by modeling it as a one-versus-one, or a one-versus-rest problem.
One-versus-one classification is easy to illustrate in a three-class example, as shown on the left side of Figure 13.7. A decision boundary is estimated between each class and each of the other classes. For example, the dotted line is the decision boundary for the dot class versus the square class. The solid line is the decision boundary between the dots and the ovals.
With one-versus-rest classification, a decision boundary is constructed between each class and those instances that are not of that class. This is illustrated on the right side of Figure 13.7. The solid line is the decision boundary between the dots and the instances that are not dots (those that are squares or ovals). The dotted and double lines are the decision boundaries for the squares versus the rest and the ovals versus the rest of the instances, respectively.
We can construct both linear and nonlinear SVC models using either one-versus-one or one-versus-rest classification. We can also specify values for C to construct soft margins. However, the construction of more decision boundaries with each of these techniques claims greater computational resources than SVC for binary classification. If we have a large number of observations, many features, and more than a couple of parameters to tune, we will likely need very good system resources to get timely results.
The three-class example hides one thing that is different about one-versus-one and one-versus-rest classifiers. With three classes, they use the same number of classifiers (three), but the number of classifiers increases relatively rapidly with one-versus-one. The number of classifiers will always be equal to the number of class values with one-versus-rest, whereas, with one-versus-one, it is equal to the following:
Here, S is the number of classifiers and N is the cardinality (the number of class values) of the target. So, with a cardinality of 4, one-versus-rest needs 4 classifiers, and one-versus-one uses 6.
We explore multiclass SVC models in the last section of this chapter, but let’s start with a relatively straightforward linear model to see SVC in action. There are two things to keep in mind when doing the preprocessing for an SVC model. First, SVC is sensitive to the scale of features, so we will need to address that before fitting our model. Second, if we are using hard margins or high values for C, outliers might have a large effect on our model.
We can often get good results by using a linear SVC model. When we have more than two features, there is no easy way to visualize whether our data is linearly separable or not. We often decide on linear or nonlinear based on hyperparameter tuning. For this section, we will assume we can get good performance with a linear model and soft margins.
We will work with data on National Basketball Association (NBA) games in this section. The dataset has statistics from each NBA game from the 2017/2018 season through the 2020/2021 season. This includes the home team, whether the home team won, the visiting team, shooting percentages for visiting and home teams, turnovers, rebounds, and assists by both teams, and a number of other measures.
Note
NBA game data is available for download for the public at https://www.kaggle.com/datasets/wyattowalsh/basketball. This dataset has game data starting with the 1946/1947 NBA season. It uses nba_api to pull stats from nba.com. That API is available at https://github.com/swar/nba_api.
Let’s build a linear SVC model:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import LinearSVC
from scipy.stats import uniform
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import RFECV
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.model_selection import cross_validate,
RandomizedSearchCV, RepeatedStratifiedKFold
import sklearn.metrics as skmet
import seaborn as sns
import os
import sys
sys.path.append(os.getcwd() + "/helperfunctions")
from preprocfunc import OutlierTrans
There is not much of a problem with a class imbalance here. This will save us some time later:
nbagames = pd.read_csv("data/nbagames2017plus.csv", parse_dates=['GAME_DATE'])
nbagames =
nbagames.loc[nbagames.WL_HOME.isin(['W','L'])]
nbagames.shape
(4568, 149)
nbagames['WL_HOME'] =
np.where(nbagames.WL_HOME=='L',0,1).astype('int')
nbagames.WL_HOME.value_counts(dropna=False)
1 2586
0 1982
Name: WL_HOME, dtype: int64
num_cols = ['FG_PCT_HOME','FTA_HOME','FG3_PCT_HOME',
'FTM_HOME','FT_PCT_HOME','OREB_HOME','DREB_HOME',
'REB_HOME','AST_HOME','STL_HOME','BLK_HOME',
'TOV_HOME','FG_PCT_AWAY','FTA_AWAY','FG3_PCT_AWAY',
'FT_PCT_AWAY','OREB_AWAY','DREB_AWAY','REB_AWAY',
'AST_AWAY','STL_AWAY','BLK_AWAY','TOV_AWAY']
cat_cols = ['SEASON']
nbagames[['WL_HOME'] + num_cols].agg(['count','min','median','max']).T
count min median max
WL_HOME 4,568 0.00 1.00 1.00
FG_PCT_HOME 4,568 0.27 0.47 0.65
FTA_HOME 4,568 1.00 22.00 64.00
FG3_PCT_HOME 4,568 0.06 0.36 0.84
FTM_HOME 4,568 1.00 17.00 44.00
FT_PCT_HOME 4,568 0.14 0.78 1.00
OREB_HOME 4,568 1.00 10.00 25.00
DREB_HOME 4,568 18.00 35.00 55.00
REB_HOME 4,568 22.00 45.00 70.00
AST_HOME 4,568 10.00 24.00 50.00
.........
FT_PCT_AWAY 4,568 0.26 0.78 1.00
OREB_AWAY 4,568 0.00 10.00 26.00
DREB_AWAY 4,568 18.00 34.00 56.00
REB_AWAY 4,568 22.00 44.00 71.00
AST_AWAY 4,568 9.00 24.00 46.00
STL_AWAY 4,568 0.00 8.00 19.00
BLK_AWAY 4,568 0.00 5.00 15.00
TOV_AWAY 4,568 3.00 14.00 30.00
corrmatrix = nbagames[['WL_HOME'] +
num_cols].corr(method="pearson")
sns.heatmap(corrmatrix,
xticklabels=corrmatrix.columns,
yticklabels=corrmatrix.columns, cmap="coolwarm")
plt.title('Heat Map of Correlation Matrix')
plt.tight_layout()
plt.show()
This produces the following plot:
Several features are correlated with the target, including the field goal percentage of the home team (FG_PCT_HOME) and defensive rebounds of the home team (DREB_HOME).
There is also correlation among the features. For example, the field goal percentage of the home team (FG_PCT_HOME) and the 3-point field goal percentage of the home team (FG3_PCT_HOME) are positively correlated, not surprisingly. Also, rebounds of the home team (REB_HOME) and defensive rebounds of the home team (DREB_HOME) are likely too closely correlated for any model to disentangle their impact.
X_train, X_test, y_train, y_test =
train_test_split(nbagames[num_cols + cat_cols],
nbagames[['WL_HOME']], test_size=0.2, random_state=0)
ohe = OneHotEncoder(drop='first', sparse=False)
cattrans = make_pipeline(ohe)
standtrans = make_pipeline(OutlierTrans(2),
SimpleImputer(strategy="median"), StandardScaler())
coltrans = ColumnTransformer(
transformers=[
("cat", cattrans, cat_cols),
("stand", standtrans, num_cols)
]
)
We create a function, dispbound, which will use the DecisionBoundaryDisplay module to show the boundary. This module is available with scikit-learn versions 1.1.1 or later. DecisionBoundaryDisplay needs a model to fit, two features, and target values:
pipe0 = make_pipeline(OutlierTrans(2),
SimpleImputer(strategy="median"), StandardScaler())
X_train_enc = pipe0.
fit_transform(X_train[['FG_PCT_HOME','DREB_HOME']])
def dispbound(model, X, xvarnames, y, title):
dispfit = model.fit(X,y)
disp = DecisionBoundaryDisplay.from_estimator(
dispfit, X, response_method="predict",
xlabel=xvarnames[0], ylabel=xvarnames[1],
alpha=0.5,
)
scatter = disp.ax_.scatter(X[:,0], X[:,1],
c=y, edgecolor="k")
disp.ax_.set_title(title)
legend1 = disp.ax_.legend(*scatter.legend_elements(),
loc="lower left", title="Home Win")
disp.ax_.add_artist(legend1)
dispbound(LinearSVC(max_iter=1000000,loss='hinge'),
X_train_enc, ['FG_PCT_HOME','DREB_HOME'],
y_train.values.ravel(),
'Linear SVC Decision Boundary')
This produces the following plot:
We get a pretty decent linear boundary with just the two features. That is great news, but let’s do a more carefully constructed model.
svc = LinearSVC(max_iter=1000000, loss='hinge',
random_state=0)
rfecv = RFECV(estimator=svc, cv=5)
pipe1 = make_pipeline(coltrans, rfecv, svc)
pipe1.fit(X_train, y_train.values.ravel())
new_cat_cols =
pipe1.named_steps['columntransformer'].
named_transformers_['cat'].
named_steps['onehotencoder'].
get_feature_names(cat_cols)
new_cols = np.concatenate((new_cat_cols, np.array(num_cols)))
sel_cols = new_cols[pipe1['rfecv'].get_support()]
np.set_printoptions(linewidth=55)
sel_cols
array(['SEASON_2018', 'SEASON_2019', 'SEASON_2020',
'FG_PCT_HOME', 'FTA_HOME', 'FG3_PCT_HOME',
'FTM_HOME', 'FT_PCT_HOME', 'OREB_HOME',
'DREB_HOME', 'REB_HOME', 'AST_HOME',
'TOV_HOME', 'FG_PCT_AWAY', 'FTA_AWAY',
'FG3_PCT_AWAY', 'FT_PCT_AWAY', 'OREB_AWAY',
'DREB_AWAY', 'REB_AWAY', 'AST_AWAY',
'BLK_AWAY', 'TOV_AWAY'], dtype=object)
pd.Series(pipe1['linearsvc'].
coef_[0], index=sel_cols).
sort_values(ascending=False)
FG_PCT_HOME 2.21
TOV_AWAY 1.20
REB_HOME 1.19
FTM_HOME 0.95
FG3_PCT_HOME 0.94
FT_PCT_HOME 0.31
AST_HOME 0.25
OREB_HOME 0.18
DREB_AWAY 0.11
SEASON_2018 0.10
FTA_HOME -0.05
BLK_AWAY -0.07
SEASON_2019 -0.11
SEASON_2020 -0.19
AST_AWAY -0.44
OREB_AWAY -0.47
DREB_HOME -0.49
FT_PCT_AWAY -0.53
REB_AWAY -0.63
FG3_PCT_AWAY -0.80
FTA_AWAY -0.81
TOV_HOME -1.19
FG_PCT_AWAY -1.91
dtype: float64
pred = pipe1.predict(X_test)
print("accuracy: %.2f, sensitivity: %.2f, specificity: %.2f, precision: %.2f" %
(skmet.accuracy_score(y_test.values.ravel(), pred),
skmet.recall_score(y_test.values.ravel(), pred),
skmet.recall_score(y_test.values.ravel(), pred, pos_label=0),
skmet.precision_score(y_test.values.ravel(), pred)))
accuracy: 0.93, sensitivity: 0.95, specificity: 0.92, precision: 0.93
kf = RepeatedStratifiedKFold(n_splits=7,n_repeats=10,
random_state=0)
scores = cross_validate(pipe1, X_train,
y_train.values.ravel(),
scoring=['accuracy','precision','recall','f1'],
cv=kf, n_jobs=-1)
print("accuracy: %.2f, precision: %.2f, sensitivity: %.2f, f1: %.2f" %
(np.mean(scores['test_accuracy']),
np.mean(scores['test_precision']),
np.mean(scores['test_recall']),
np.mean(scores['test_f1'])))
accuracy: 0.93, precision: 0.93, sensitivity: 0.95, f1: 0.94
svc_params = {
'linearsvc__C': uniform(loc=0, scale=100)
}
rs = RandomizedSearchCV(pipe1, svc_params, cv=10,
scoring='accuracy', n_iter=20, random_state=0)
rs.fit(X_train, y_train.values.ravel())
rs.best_params_
{'linearsvc__C': 54.88135039273247}
rs.best_score_
0.9315809566584325
The best C value is 2.02 and the best accuracy score is 0.9316.
results =
pd.DataFrame(rs.cv_results_['mean_test_score'],
columns=['meanscore']).
join(pd.DataFrame(rs.cv_results_['params'])).
sort_values(['meanscore'], ascending=False)
results
meanscore linearsvc__C
0 0.93 54.88
8 0.93 96.37
18 0.93 77.82
17 0.93 83.26
13 0.93 92.56
12 0.93 56.80
11 0.93 52.89
1 0.93 71.52
10 0.93 79.17
7 0.93 89.18
6 0.93 43.76
5 0.93 64.59
3 0.93 54.49
2 0.93 60.28
19 0.93 87.00
9 0.93 38.34
4 0.93 42.37
14 0.93 7.10
15 0.93 8.71
16 0.93 2.02
pred = rs.predict(X_test)
print("accuracy: %.2f, sensitivity: %.2f, specificity: %.2f, precision: %.2f" %
(skmet.accuracy_score(y_test.values.ravel(), pred),
skmet.recall_score(y_test.values.ravel(), pred),
skmet.recall_score(y_test.values.ravel(), pred, pos_label=0),
skmet.precision_score(y_test.values.ravel(), pred)))
accuracy: 0.93, sensitivity: 0.95, specificity: 0.92, precision: 0.93
cm = skmet.confusion_matrix(y_test, pred)
cmplot =
skmet.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Loss', 'Won'])
cmplot.plot()
cmplot.ax_.set(title='Home Team Win Confusion Matrix',
xlabel='Predicted Value', ylabel='Actual Value')
This produces the following plot:
Our model largely predicts home team wins and losses correctly. Tuning the value of C did not make much of a difference, as we get pretty much the same accuracy regardless of the C value.
Note
You may have noticed that we are using the accuracy metric more often with the NBA games data than with the heart disease and machine failure data that we have worked with in previous chapters. We focused more on sensitivity with that data. There are two reasons for that. First, accuracy is a more compelling measure when classes are closer to being balanced for reasons we discussed in detail in Chapter 6, Preparing for Model Evaluation. Second, in predicting heart disease and machine power failure, we are biased towards sensitivity, as the cost of a false negative is higher than that of a false positive in those domains. For predicting NBA games, there is no such bias.
One advantage of linear SVC models is how easy they are to interpret. We are able to look at coefficients, which helps us make sense of the model and communicate the basis of our predictions to others. It nonetheless can be helpful to confirm that we do not get better results with a nonlinear model. We will do that in the next section.
Although nonlinear SVC is more complicated conceptually than linear SVC, as we saw in the first section of this chapter, running a nonlinear model with scikit-learn is relatively straightforward. The main difference from a linear model is that we need to do a fair bit more hyperparameter tuning. We have to specify values for C, for gamma, and for the kernel we want to use.
While there are theoretical reasons for hypothesizing that some hyperparameter values might work better than others for a given modeling challenge, we usually resolve those values empirically, that is, with hyperparameter tuning. We try that in this section with the same NBA games data that we used in the previous section:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from scipy.stats import uniform
from sklearn.feature_selection import RFECV
from sklearn.impute import SimpleImputer
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
import sklearn.metrics as skmet
import os
import sys
sys.path.append(os.getcwd() + "/helperfunctions")
from preprocfunc import OutlierTrans
We also import the dispbound function we used in the previous section to display decision boundaries. We copied that code into a file called displayfunc.py in the helperfunctions subfolder of the current directory:
import nbagames as ng
from displayfunc import dispbound
X_train = ng.X_train
X_test = ng.X_test
y_train = ng.y_train
y_test = ng.y_test
pipe0 = make_pipeline(OutlierTrans(2),
SimpleImputer(strategy="median"),
StandardScaler())
X_train_enc =
pipe0.fit_transform(X_train[['FG_PCT_HOME',
'DREB_HOME']])
dispbound(SVC(kernel='rbf', gamma=30, C=1),
X_train_enc,['FG_PCT_HOME','DREB_HOME'],
y_train.values.ravel(),
"SVC with rbf kernel-gamma=30, C=1")
Running this a few different ways produces the following plots:
At values for gamma and C near the default, we see some bending of the decision boundary to accommodate a few wayward points in the loss class. These are instances where the home team lost despite having very high defensive rebound totals. With the rbf kernel, two of these instances are now correctly classified. There are also a couple of instances with a high home team field goal percentage but low home team defensive rebounds, which are now correctly classified. However, there is not much change overall in our predictions compared with the linear model from the previous section.
But this changes significantly if we increase values for C or gamma. Recall that higher values of C increase the penalty for misclassification. This leads to boundaries that wind around instances more.
Increasing gamma to 30 causes substantial overfitting. High values of gamma mean that data points have to be very close to each other to be grouped together. This results in decision boundaries closely tied to small numbers of instances, sometimes just one instance.
dispbound(SVC(kernel='poly', degree=7),
X_train_enc, ['FG_PCT_HOME','DREB_HOME'],
y_train.values.ravel(),
"SVC with polynomial kernel - degree=7")
Running this a couple of different ways produces the following plots:
We can see some bending of the decision boundary at higher degree levels to handle a couple of unusual instances. There is not much overfitting here, but not really much improvement in our predictions either.
This at least hints at what to expect when we construct the model. We should try some nonlinear models but there is a good chance that they will not lead to much improvement over the linear model we used in the previous section.
rfecv = RFECV(estimator=LogisticRegression())
svc = SVC()
pipe1 = make_pipeline(ng.coltrans, rfecv, svc)
svc_params = [
{
'svc__kernel': ['rbf'],
'svc__C': uniform(loc=0, scale=20),
'svc__gamma': uniform(loc=0, scale=100)
},
{
'svc__kernel': ['poly'],
'svc__degree': randint(1, 5),
'svc__C': uniform(loc=0, scale=20),
'svc__gamma': uniform(loc=0, scale=100)
},
{
'svc__kernel': ['linear','sigmoid'],
'svc__C': uniform(loc=0, scale=20)
}
]
Note
You may have noticed that one of the kernels we will be using is linear, and wonder how this is different from the linear SVC module we used in the previous section. LinearSVC will often converge faster, particularly with large datasets. It does not use the kernel trick. We will also likely get different results as the optimization is different in several ways.
rs = RandomizedSearchCV(pipe1, svc_params, cv=5,
scoring='accuracy', n_iter=10, n_jobs=-1,
verbose=5, random_state=0)
rs.fit(X_train, y_train.values.ravel())
rs.best_params_
{'svc__C': 1.1342595463488636, 'svc__kernel': 'linear'}
rs.best_score_
0.9299405955437289
We sort by accuracy score in descending order. Linear kernels outperform polynomial and rbf kernels, though not substantially better than polynomial at 3, 4, and 5 degrees. rbf kernels perform particularly poorly:
results =
pd.DataFrame(rs.cv_results_['mean_test_score'],
columns=['meanscore']).
join(pd.json_normalize(rs.cv_results_['params'])).
sort_values(['meanscore'], ascending=False)
results
C gamma kernel degree
meanscore
0.93 1.13 NaN linear NaN
0.89 1.42 64.82 poly 3.00
0.89 9.55 NaN sigmoid NaN
0.89 11.36 NaN sigmoid NaN
0.89 2.87 75.86 poly 5.00
0.64 12.47 43.76 poly 4.00
0.64 15.61 72.06 poly 4.00
0.57 11.86 84.43 rbf NaN
0.57 16.65 77.82 rbf NaN
0.57 19.57 79.92 rbf NaN
Note
We use the pandas json_nomalize method to handle the somewhat messy hyperparameter combinations we pull from the params list. It is messy because different hyperparameters are available depending on the kernel used. This means that the 20 dictionaries in the params list will have different keys. For example, the polynomial kernels will have values for degrees. The linear and rbf kernels will not.
rs.best_estimator_['svc'].
support_vectors_.shape
(625, 18)
pred = rs.predict(X_test)
print("accuracy: %.2f, sensitivity: %.2f, specificity: %.2f, precision: %.2f" %
(skmet.accuracy_score(y_test.values.ravel(), pred),
skmet.recall_score(y_test.values.ravel(), pred),
skmet.recall_score(y_test.values.ravel(), pred,
pos_label=0),
skmet.precision_score(y_test.values.ravel(), pred)))
accuracy: 0.93, sensitivity: 0.94, specificity: 0.91, precision: 0.93
Although we have not improved upon our model from the previous section, it was still a worthwhile exercise to experiment with some nonlinear models. Indeed, this is often how we discover whether we have data that can be successfully separated linearly. This is typically difficult to visualize and so we rely on hyperparameter tuning to tell us which kernel classifies our data best.
This section and the previous one demonstrate the key techniques for using SVMs for binary classification. Much of what we have done so far applies to multiclass classification as well. We will take a look at SVC modeling strategies when our target has more than two values in the next section.
All of the same concerns that we had when we used SVC for binary classification apply when we are doing multiclass classification. We need to determine whether the classes are linearly separable, and if not, which kernel will yield the best results. As discussed in the first section of this chapter, we also need to decide whether that classification is best modeled as one-versus-one or one-versus-rest. One-versus-one finds decision boundaries that separate each class from each of the other classes. One-versus-rest finds decision boundaries that distinguish each class from all other instances. We try both approaches in this section.
We will work with the machine failure data that we worked with in previous chapters.
Note
This dataset on machine failure is available for public use at https://www.kaggle.com/datasets/shivamb/machine-predictive-maintenance-classification. There are 10,000 observations, 12 features, and two possible targets. One is binary: the machine failed or did not. The other has types of failure. The instances in this dataset are synthetic, generated by a process designed to mimic machine failure rates and causes.
Let’s build a multiclass SVC model:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from scipy.stats import uniform
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import RandomizedSearchCV
import sklearn.metrics as skmet
import os
import sys
sys.path.append(os.getcwd() + "/helperfunctions")
from preprocfunc import OutlierTrans
machinefailuretype = pd.read_csv("data/machinefailuretype.csv")
machinefailuretype.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 udi 10000 non-null int64
1 product 10000 non-null object
2 machinetype 10000 non-null object
3 airtemp 10000 non-null float64
4 processtemperature 10000 non-null float64
5 rotationalspeed 10000 non-null int64
6 torque 10000 non-null float64
7 toolwear 10000 non-null int64
8 fail 10000 non-null int64
9 failtype 10000 non-null object
dtypes: float64(3), int64(4), object(3)
memory usage: 781.4+ KB
machinefailuretype.head()
udi product machinetype airtemp processtemperature
0 1 M14860 M 298 309
1 2 L47181 L 298 309
2 3 L47182 L 298 308
3 4 L47183 L 298 309
4 5 L47184 L 298 309
rotationalspeed torque toolwear fail failtype
0 1551 43 0 0 No Failure
1 1408 46 3 0 No Failure
2 1498 49 5 0 No Failure
3 1433 40 7 0 No Failure
4 1408 40 9 0 No Failure
machinefailuretype.failtype.
value_counts(dropna=False).sort_index()
Heat Dissipation Failure 112
No Failure 9652
Overstrain Failure 78
Power Failure 95
Random Failures 18
Tool Wear Failure 45
Name: failtype, dtype: int64
def setcode(typetext):
if (typetext=="No Failure"):
typecode = 1
elif (typetext=="Heat Dissipation Failure"):
typecode = 2
elif (typetext=="Power Failure"):
typecode = 3
elif (typetext=="Overstrain Failure"):
typecode = 4
else:
typecode = 5
return typecode
machinefailuretype["failtypecode"] =
machinefailuretype.apply(lambda x: setcode(x.failtype), axis=1)
num_cols = ['airtemp','processtemperature',
'rotationalspeed','torque','toolwear']
cat_cols = ['machinetype']
machinefailuretype[num_cols].agg(['min','median','max']).T
min median max
airtemp 295.30 300.10 304.50
processtemperature 305.70 310.10 313.80
rotationalspeed 1,168.00 1,503.00 2,886.00
torque 3.80 40.10 76.60
toolwear 0.00 108.00 253.00
X_train, X_test, y_train, y_test =
train_test_split(machinefailuretype[num_cols + cat_cols],
machinefailuretype[['failtypecode']],
stratify=machinefailuretype[['failtypecode']],
test_size=0.2, random_state=0)
ohe = OneHotEncoder(drop='first', sparse=False)
cattrans = make_pipeline(ohe)
standtrans = make_pipeline(OutlierTrans(3),
SimpleImputer(strategy="median"),
MinMaxScaler())
coltrans = ColumnTransformer(
transformers=[
("cat", cattrans, cat_cols),
("stand", standtrans, num_cols),
]
)
svc = SVC(class_weight='balanced', probability=True)
pipe1 = make_pipeline(coltrans, svc)
We only have a handful of features in this case so we will not worry about feature selection. (We might still be concerned about features that are highly correlated, but that is not an issue with this dataset.)
svc_params = [
{
'svc__kernel': ['rbf'],
'svc__C': uniform(loc=0, scale=20),
'svc__gamma': uniform(loc=0, scale=100),
'svc__decision_function_shape': ['ovr','ovo']
},
{
'svc__kernel': ['poly'],
'svc__degree': np.arange(0,6),
'svc__C': uniform(loc=0, scale=20),
'svc__gamma': uniform(loc=0, scale=100),
'svc__decision_function_shape': ['ovr','ovo']
},
{
'svc__kernel': ['linear','sigmoid'],
'svc__C': uniform(loc=0, scale=20),
'svc__decision_function_shape': ['ovr','ovo']
}
]
rs = RandomizedSearchCV(pipe1, svc_params, cv=7, scoring="roc_auc_ovr", n_iter=10)
rs.fit(X_train, y_train.values.ravel())
rs.best_params_
{'svc__C': 5.609789456747942,
'svc__decision_function_shape': 'ovo',
'svc__gamma': 27.73459801111866,
'svc__kernel': 'rbf'}
rs.best_score_
0.9187636814475847
results =
pd.DataFrame(rs.cv_results_['mean_test_score'],
columns=['meanscore']).
join(pd.json_normalize(rs.cv_results_['params'])).
sort_values(['meanscore'], ascending=False)
results
meanscore svc__C svc__decision_function_shape svc__gamma svc__kernel
7 0.92 5.61 ovo 27.73 rbf
5 0.91 9.43 ovr NaN linear
3 0.91 5.40 ovr NaN linear
0 0.90 19.84 ovr 28.70 rbf
8 0.87 5.34 ovo 93.87 rbf
6 0.86 8.05 ovr 80.57 rbf
9 0.86 4.41 ovo 66.66 rbf
1 0.86 3.21 ovr 85.35 rbf
4 0.85 0.01 ovo 38.24 rbf
2 0.66 7.61 ovr NaN sigmoid
pred = rs.predict(X_test)
cm = skmet.confusion_matrix(y_test, pred)
cmplot = skmet.ConfusionMatrixDisplay(confusion_matrix=cm,
display_labels=['None', 'Heat','Power','Overstrain','Other'])
cmplot.plot()
cmplot.ax_.set(title='Machine Failure Type Confusion Matrix',
xlabel='Predicted Value', ylabel='Actual Value')
This produces the following plot:
print(skmet.classification_report(y_test, pred,
target_names=['None', 'Heat','Power', 'Overstrain', 'Other']))
precision recall f1-score support
None 0.99 0.97 0.98 1930
Heat 0.50 0.91 0.65 22
Power 0.60 0.47 0.53 19
Overstrain 0.65 0.81 0.72 16
Other 0.06 0.15 0.09 13
accuracy 0.96 2000
macro avg 0.56 0.66 0.59 2000
weighted avg 0.97 0.96 0.96 2000
When modeling targets such as machine failure types that have a high class imbalance, we are often more concerned with metrics other than accuracy. This is partly determined by our domain knowledge. Avoiding false negatives may be more important than avoiding false positives. Doing a thorough check on a machine too early is definitely preferable to doing it too late.
The 96 to 97 percent weighted precision, recall (sensitivity), and f1 scores do not provide a good sense of the performance of our model. They mainly reflect the large class imbalance and the fact that it is very easy to predict no machine failure. The much lower macro averages (which are just simple averages across classes) indicate that our model struggles to predict some types of machine failure.
This example illustrates that it is relatively easy to extend SVC to models that have targets with more than two values. We can specify whether we want to use one-versus-one or one-versus-rest classification. The one-versus-rest approach can be faster when the number of classes is above three since there will be fewer classifiers trained.
In this chapter, we explored the different strategies for implementing SVC. We used linear SVC (which does not use kernels), which can perform very well when our classes are linearly separable. We then examined how to use the kernel trick to extend SVC to cases where the classes are not linearly separable. Finally, we used one-versus-one and one-versus-rest classification to handle targets with more than two values.
SVC is an exceptionally useful technique for binary and multiclass classification. It can handle both straightforward and complicated relationships between features and the target. There are few supervised learning problems for which SVMs should not at least be considered. However, it is not very efficient with very large datasets.
In the next chapter, we will explore another popular and flexible classification algorithm, Naive Bayes.