Business case

In the upcoming case study, we will apply KNN and SVM to the same dataset. This will allow us to compare the R code and learning methods on the same problem, starting with KNN. We will also spend some time drilling down into the confusion matrix, comparing a number of statistics to evaluate model accuracy.

Business understanding

The data that we will examine was originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). It consists of 532 observations and eight input features along with a binary outcome (Yes/No). The patients in this study were of Pima Indian descent from South Central Arizona. Over the past 30 years, NIDDK research has established that obesity is a major risk factor in the development of diabetes. The Pima Indians were selected for the study because one-half of the adult Pima Indians have diabetes and 95 percent of those with diabetes are overweight. The analysis will focus on adult women only. Diabetes was diagnosed according to the WHO criteria and was type 2 diabetes. In this type of diabetes, the pancreas is still able to function and produce insulin, and the condition used to be referred to as non-insulin-dependent diabetes.

Our task is to examine and predict those individuals who have diabetes or the risk factors that could lead to diabetes in this population. Diabetes has become an epidemic in the USA, given the relatively sedentary lifestyle and high-calorie diet. According to the American Diabetes Association (ADA), the disease was the seventh leading cause of death in the USA in 2010, despite being underdiagnosed. Diabetes is also associated with a dramatic increase in comorbidities, such as hypertension, dyslipidemia, stroke, eye disease, and kidney disease. The costs of diabetes and its complications are enormous. The ADA estimates that the total cost of the disease in 2012 was approximately $490 billion. For further background information on the problem, refer to ADA's website at http://www.diabetes.org/diabetes-basics/statistics/.

Data understanding and preparation

The dataset for the 532 women is in two separate data frames. The variables of interest are as follows:

  • npreg: This is the number of pregnancies
  • glu: This is the plasma glucose concentration in an oral glucose tolerance test
  • bp: This is the diastolic blood pressure (mm Hg)
  • skin: This is triceps skin-fold thickness measured in mm
  • bmi: This is the body mass index
  • ped: This is the diabetes pedigree function
  • age: This is the age in years
  • type: This is diabetic, Yes or No

The datasets are contained in the R package, MASS. One data frame is named Pima.tr and the other is named Pima.te. Instead of using these as separate train and test sets, we will combine them and create our own in order to discover how to do such a task in R.

To begin, let's load the following packages that we will need for the exercise:

> library(class) #k-nearest neighbors
> library(kknn) #weighted k-nearest neighbors
> library(e1071) #SVM
> library(caret) #select tuning parameters
> library(MASS) # contains the data
> library(reshape2) #assist in creating boxplots
> library(ggplot2) #create boxplots
> library(kernlab) #assist with SVM feature selection
> library(pROC)

We will now load the datasets and check their structure, ensuring that they are the same, starting with Pima.tr, as follows:

> data(Pima.tr)
> str(Pima.tr)
'data.frame':200 obs. of  8 variables:
 $ npreg: int  5 7 5 0 0 5 3 1 3 2 ...
 $ glu  : int  86 195 77 165 107 97 83 193 142 128 ...
 $ bp   : int  68 70 82 76 60 76 58 50 80 78 ...
 $ skin : int  28 33 41 43 25 27 31 16 15 37 ...
 $ bmi  : num  30.2 25.1 35.8 47.9 26.4 35.6 34.3 25.9 32.4 43.3 ...
 $ ped  : num  0.364 0.163 0.156 0.259 0.133 ...
 $ age  : int  24 55 35 26 23 52 25 24 63 31 ...
 $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...
> data(Pima.te)

> str(Pima.te)
'data.frame':332 obs. of  8 variables:
 $ npreg: int  6 1 1 3 2 5 0 1 3 9 ...
 $ glu  : int  148 85 89 78 197 166 118 103 126 119 ...
 $ bp   : int  72 66 66 50 70 72 84 30 88 80 ...
 $ skin : int  35 29 23 32 45 19 47 38 41 35 ...
 $ bmi  : num  33.6 26.6 28.1 31 30.5 25.8 45.8 43.3 39.3 29 ...
 $ ped  : num  0.627 0.351 0.167 0.248 0.158 0.587 0.551 0.183 0.704 0.263 ...
 $ age  : int  50 31 21 26 53 51 31 33 27 29 ...
 $ type : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 2 2 1 1 2 ...

Looking at the structures, we can be confident that we can combine the two data frames into one. This is very easy to do using the rbind() function, which stands for row binding and appends the rows of one data frame to the other. If you had the same observations in each frame and wanted to append the features, you would bind them by columns with the cbind() function. You simply name the new data frame and use this syntax: new.data = rbind(data.frame1, data.frame2). Our code thus becomes as follows:

> pima = rbind(Pima.tr, Pima.te)

As always, double-check the structure. We can see that there are no issues, as follows:

> str(pima)
'data.frame':532 obs. of  8 variables:
 $ npreg: int  5 7 5 0 0 5 3 1 3 2 ...
 $ glu  : int  86 195 77 165 107 97 83 193 142 128 ...
 $ bp   : int  68 70 82 76 60 76 58 50 80 78 ...
 $ skin : int  28 33 41 43 25 27 31 16 15 37 ...
 $ bmi  : num  30.2 25.1 35.8 47.9 26.4 35.6 34.3 25.9 32.4 43.3 ...
 $ ped  : num  0.364 0.163 0.156 0.259 0.133 ...
 $ age  : int  24 55 35 26 23 52 25 24 63 31 ...
 $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 1 1 1 2 ...

Let's do some exploratory analysis by putting this in boxplots. For this, we want to use the outcome variable, "type", as our ID variable. As we did with logistic regression, the melt() function will do this and prepare a data frame that we can use for the boxplots. We will call the new data frame pima.melt, as follows:

> pima.melt = melt(pima, id.var="type")

The boxplot layout using the ggplot2 package is quite effective, so we will use it. In the ggplot() function, we will specify the data to use, the x and y variables, and the type of plot, and then create a series of plots in two columns. In the following code, we put the response variable as x and its value as y in aes(). Then, geom_boxplot() creates the boxplots. Finally, facet_wrap() lays the boxplots out in two columns:

> ggplot(data=pima.melt, aes(x=type, y=value)) + geom_boxplot() + facet_wrap(~variable, ncol=2)

The following is the output of the preceding command:

[Figure: boxplots of the unscaled features by diabetes type]

This is an interesting plot because it is difficult to discern any dramatic differences between the groups, probably with the exception of glucose (glu). As you may have suspected, glucose appears to be significantly higher in the patients currently diagnosed with diabetes. The main problem here is that the plots are all on the same y axis scale. We can fix this and produce a more meaningful plot by standardizing the values and then re-plotting. R has a built-in function, scale(), which will convert the values to a mean of zero and a standard deviation of one. Let's put this in a new data frame called pima.scale, converting all of the features and leaving out the type response.

Additionally, when using KNN, it is important to have the features on the same scale with a mean of zero and a standard deviation of one. If not, the distance calculations in the nearest neighbor step are flawed: a feature measured on a scale of 1 to 100 will swamp another feature that is measured on a scale of 1 to 10 (a small illustration of this follows the next code block). Note that when you scale a data frame, it automatically becomes a matrix. Using the as.data.frame() function, convert it back to a data frame, as follows:

> pima.scale = as.data.frame(scale(pima[,-8]))
> str(pima.scale)
'data.frame':532 obs. of  7 variables:
 $ npreg: num  0.448 1.052 0.448 -1.062 -1.062 ...
 $ glu  : num  -1.13 2.386 -1.42 1.418 -0.453 ...
 $ bp   : num  -0.285 -0.122 0.852 0.365 -0.935 ...
 $ skin : num  -0.112 0.363 1.123 1.313 -0.397 ...
 $ bmi  : num  -0.391 -1.132 0.423 2.181 -0.943 ...
 $ ped  : num  -0.403 -0.987 -1.007 -0.708 -1.074 ...
 $ age  : num  -0.708 2.173 0.315 -0.522 -0.801 ...
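To make the scaling point concrete, here is a tiny, hedged illustration of how a feature on a large raw scale dominates a Euclidean distance calculation. The two patients and their values are invented for this example and are not taken from the dataset:

> # two hypothetical patients: glu on its raw scale (roughly 50-200), ped on its raw scale (roughly 0-2.5)
> a = c(glu = 95, ped = 0.3)
> b = c(glu = 140, ped = 1.8)
> (a["glu"] - b["glu"])^2 #2025: the glu contribution to the squared distance
> (a["ped"] - b["ped"])^2 #2.25: the ped contribution
> sqrt(sum((a - b)^2)) #total distance of about 45, driven almost entirely by glu

After standardization, both features contribute on a comparable footing, which is exactly what the nearest neighbor calculation needs.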

Now, we will need to include the response in the data frame, as follows:

> pima.scale$type = pima$type

Let's just repeat the boxplotting process again with melt() and ggplot():

> pima.scale.melt = melt(pima.scale, id.var="type")

> ggplot(data=pima.scale.melt, aes(x=type, y=value)) +geom_boxplot()+facet_wrap(~variable, ncol=2)

The following is the output of the preceding command:

[Figure: boxplots of the scaled features by diabetes type]

With the features scaled, the plot is easier to read. In addition to glucose, it appears that the other features may differ by type, in particular, age.

Before splitting this into train and test sets, let's have a look at the correlations with the R function, cor(). This produces a matrix of the Pearson correlations rather than a plot:

> cor(pima.scale[-8])
            npreg       glu          bp       skin
npreg 1.000000000 0.1253296 0.204663421 0.09508511
glu   0.125329647 1.0000000 0.219177950 0.22659042
bp    0.204663421 0.2191779 1.000000000 0.22607244
skin  0.095085114 0.2265904 0.226072440 1.00000000
bmi   0.008576282 0.2470793 0.307356904 0.64742239
ped   0.007435104 0.1658174 0.008047249 0.11863557
age   0.640746866 0.2789071 0.346938723 0.16133614

              bmi         ped        age
npreg 0.008576282 0.007435104 0.64074687
glu   0.247079294 0.165817411 0.27890711
bp    0.307356904 0.008047249 0.34693872
skin  0.647422386 0.118635569 0.16133614
bmi   1.000000000 0.151107136 0.07343826
ped   0.151107136 1.000000000 0.07165413
age   0.073438257 0.071654133 1.00000000

There are a couple of correlations to point out, npreg/age and skin/bmi. Multi-collinearity is generally not a problem with these methods, assuming that they are properly trained and the hyperparameters are tuned.

I think we are now ready to create the train and test sets, but before we do so, I recommend that you always check the ratio of Yes and No in the response. It is important to make sure that you will have a balanced split in the data, which may be a problem if one of the outcomes is sparse. This can cause a bias in a classifier between the majority and minority classes. There are no hard and fast rules on what constitutes an improper balance. A good rule of thumb is that the ratio of the two outcomes should be no worse than 2:1 (He and Ma, 2013).

> table(pima.scale$type)

 No Yes
355 177

The ratio is 2:1 so we can create the train and test sets with our usual syntax using a 70/30 split in the following way:

> set.seed(502)

> ind = sample(2, nrow(pima.scale), replace=TRUE, prob=c(0.7,0.3))

> train = pima.scale[ind==1,]

> test = pima.scale[ind==2,]

> str(train)
'data.frame':385 obs. of  8 variables:
 $ npreg: num  0.448 0.448 -0.156 -0.76 -0.156 ...
 $ glu  : num  -1.42 -0.775 -1.227 2.322 0.676 ...
 $ bp   : num  0.852 0.365 -1.097 -1.747 0.69 ...
 $ skin : num  1.123 -0.207 0.173 -1.253 -1.348 ...
 $ bmi  : num  0.4229 0.3938 0.2049 -1.0159 -0.0712 ...
 $ ped  : num  -1.007 -0.363 -0.485 0.441 -0.879 ...
 $ age  : num  0.315 1.894 -0.615 -0.708 2.916 ...
 $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 2 2 1 1 1 …

> str(test)
'data.frame':147 obs. of  8 variables:
 $ npreg: num  0.448 1.052 -1.062 -1.062 -0.458 ...
 $ glu  : num  -1.13 2.386 1.418 -0.453 0.225 ...
 $ bp   : num  -0.285 -0.122 0.365 -0.935 0.528 ...
 $ skin : num  -0.112 0.363 1.313 -0.397 0.743 ...
 $ bmi  : num  -0.391 -1.132 2.181 -0.943 1.513 ...
 $ ped  : num  -0.403 -0.987 -0.708 -1.074 2.093 ...
 $ age  : num  -0.7076 2.173 -0.5217 -0.8005 -0.0571 ...
 $ type : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 2 1 1 1 ...

All seems to be in order, so we can move on to the building of our predictive models and evaluate them, starting with KNN.

Modeling and evaluation

Now, we will discuss the various aspects pertaining to modeling and evaluation.

KNN modeling

As previously mentioned, it is critical to select the most appropriate parameter (k or K) when using this technique. Let's put the caret package to good use again in order to identify k. We will create a grid of inputs for the experiment, with k ranging from 2 to 20 by an increment of 1. This is easily done with the expand.grid() and seq() functions. The caret package parameter that works with the KNN function is simply .k:

> grid1 = expand.grid(.k=seq(2,20, by=1))

We will also incorporate cross-validation in the selection of the parameter, creating an object called control and utilizing the trainControl() function from the caret package, as follows:

> control = trainControl(method="cv")

Now, we can create the object that will show us how to compute the optimal k value with the train() function, which is also part of the caret package. Remember that while conducting any sort of random sampling, you will need to set the seed value as follows:

> set.seed(502)

The object created by the train() function requires the model formula, train data name, and an appropriate method. The model formula is the same as we've used before—y~x. The method designation is simply knn. With this in mind, this code will create the object that will show us the optimal k value, as follows:

> knn.train = train(type~., data=train, method="knn", trControl=control, tuneGrid=grid1)

Calling the object provides us with the k parameter that we are seeking, which is k=17:

> knn.train
k-Nearest Neighbors

385 samples
  7 predictor
  2 classes: 'No', 'Yes'

No pre-processing
Resampling: Cross-Validated (10 fold)

Summary of sample sizes: 347, 347, 345, 347, 347, 346, ...

Resampling results across tuning parameters:

  k   Accuracy  Kappa  Accuracy SD  Kappa SD
   2  0.736     0.359  0.0506       0.1273  
   3  0.762     0.416  0.0526       0.1313  
   4  0.761     0.418  0.0521       0.1276  
   5  0.759     0.411  0.0566       0.1295  
   6  0.772     0.442  0.0559       0.1474  
   7  0.767     0.417  0.0455       0.1227  
   8  0.767     0.425  0.0436       0.1122  
   9  0.772     0.435  0.0496       0.1316  
  10  0.780     0.458  0.0485       0.1170  
  11  0.777     0.446  0.0437       0.1120  
  12  0.775     0.440  0.0547       0.1443  
  13  0.782     0.456  0.0397       0.1084  
  14  0.780     0.449  0.0557       0.1349  
  15  0.772     0.427  0.0449       0.1061  
  16  0.782     0.453  0.0403       0.0954  
  17  0.795     0.485  0.0382       0.0978  
  18  0.782     0.451  0.0461       0.1205  
  19  0.785     0.455  0.0452       0.1197  
  20  0.782     0.446  0.0451       0.1124  

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 17.  

In addition to the results that yield k=17, we get a table of the Accuracy and Kappa statistics and their standard deviations from the cross-validation. Accuracy tells us the percentage of observations that the model classified correctly. Kappa refers to what is known as Cohen's Kappa statistic, which is commonly used to measure how well two evaluators agree when classifying observations. It adjusts the accuracy score by accounting for the agreement that would be expected purely by chance. The formula for the statistic is Kappa = (Percent of agreement - Percent of chance agreement) / (1 - Percent of chance agreement).

The Percent of agreement is the rate at which the evaluators agreed on the class (the accuracy), and the Percent of chance agreement is the rate at which they would be expected to agree at random. The higher the statistic, the better the performance, with the maximum agreement being one. We will work through an example when we apply our model to the test data.
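As a reference, here is a minimal sketch of that formula written as an R helper that takes a confusion matrix. The helper name kappa_stat is my own and is not part of any package; the chance-agreement term is computed from the row and column totals of the table:

> kappa_stat = function(tab) {
+   po = sum(diag(tab)) / sum(tab) #observed agreement (the accuracy)
+   pe = sum(rowSums(tab) * colSums(tab)) / sum(tab)^2 #chance agreement from the row and column totals
+   (po - pe) / (1 - pe)
+ }

With that sketch in hand, let's now apply our model to the test data.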

To do this, we will utilize the knn() function from the class package. With this function, we will need to specify at least four items. These would be the train inputs, the test inputs, correct labels from the train set, and k. We will do this by creating the knn.test object and see how it performs:

> knn.test = knn(train[,-8], test[,-8], train[,8], k=17)

With the object created, let's examine the confusion matrix and calculate the accuracy and kappa:

> table(knn.test, test$type)
        
knn.test No Yes
     No  77  26
     Yes 16  28

The accuracy is calculated by simply dividing the correctly classified observations by the total number of observations:

> (77+28)/147
[1] 0.7142857

At just over 71 percent, the test accuracy is almost eight percentage points lower than the roughly 79 percent accuracy that we achieved on the train data. We can now work through the code for the kappa statistic. We already have the accuracy; the chance agreement is calculated here as the first row total divided by the total observations, multiplied by the first column total divided by the total observations, as follows:

> #calculate Kappa

> prob.agree = (77+28)/147 #accuracy

> prob.chance = ((77+26)/147) * ((77+16)/147)

> prob.chance
[1] 0.4432875

> kappa = (prob.agree - prob.chance) / (1 - prob.chance)

> kappa
[1] 0.486783

The kappa statistic of 0.49 is in line with what we achieved on the train set. Altman (1991) provides a heuristic to assist us in the interpretation of the statistic, which is shown in the following table:

Value of Kappa    Strength of Agreement
<0.20             Poor
0.21-0.40         Fair
0.41-0.60         Moderate
0.61-0.80         Good
0.81-1.00         Very good

With our kappa only moderate and our accuracy just over 70 percent on the test set, we should see if we can perform better by utilizing weighted neighbors. A weighting schema increases the influence of neighbors that are closest to an observation versus those that are farther away; the farther an observation is from the point being classified, the more its influence is penalized. For this technique, we will use the kknn package and its train.kknn() function to select the optimal weighting scheme.

The train.kknn() function uses LOOCV, which we examined in the prior chapters, to select the best parameters: the optimal number of k neighbors, one of two distance measures, and a kernel function.

The unweighted k-nearest neighbors algorithm that we created uses the Euclidean distance, as we discussed previously. With the kknn package, there is also the option of comparing the sum of the absolute differences (the Manhattan distance) with the Euclidean distance. The package specifies the distance calculation used through a Minkowski distance parameter.

As for the weighting of the distances, many different methods are available. For our purpose, the package that we will use has ten different weighting schemas, which include the unweighted one. They are rectangular (unweighted), triangular, epanechnikov, biweight, triweight, cosine, inverse, gaussian, rank, and optimal. A full discussion of these weighting techniques is available in Hechenbichler K. and Schliep K.P. (2004).

For simplicity, let's focus on just two: triangular and epanechnikov. Prior to the weights being assigned, the algorithm standardizes all of the distances so that they fall between zero and one. The triangular kernel then gives a neighbor a weight of one minus its standardized distance, while the epanechnikov kernel gives a weight of ¾ times (one minus the distance squared). For our problem, we will incorporate these weighting methods along with the standard unweighted version for comparison purposes.
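As a hedged sketch of what these two kernels compute (the function names are mine; kknn performs this weighting internally, so this only spells out the formulas above):

> triangular.weight = function(d) ifelse(abs(d) <= 1, 1 - abs(d), 0) #weight shrinks linearly with distance
> epanechnikov.weight = function(d) ifelse(abs(d) <= 1, 0.75 * (1 - d^2), 0) #weight shrinks quadratically with distance
> triangular.weight(0.4) #0.6
> epanechnikov.weight(0.4) #0.63

In both cases, a neighbor at distance zero receives the maximum weight and a neighbor at the standardized maximum distance of one receives no weight at all.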

After specifying a random seed, we will create the train set object with train.kknn(). This function asks for the maximum number of k values (kmax), the distance (the Minkowski parameter, where one is the absolute distance and two is the Euclidean distance), and the kernel. For this model, kmax will be set to 25 and distance will be 2:

> set.seed(123)

> kknn.train = train.kknn(type~., data=train, kmax=25, distance=2, kernel=c("rectangular", "triangular", "epanechnikov"))

A nice feature of the package is the ability to plot and compare the results, as follows:

> plot(kknn.train)

The following is the output of the preceding command:

[Figure: misclassification rate by k value for each kernel]

This plot shows k on the x axis and the percentage of misclassified observations for each kernel. To my surprise, the unweighted (rectangular) version at k = 19 performs the best. You can also call the object to see the classification error and the best parameters in the following way:

> kknn.train

Call:
train.kknn(formula = type ~ ., data = train, kmax = 25, distance = 2,     kernel = c("rectangular", "triangular", "epanechnikov", "gaussian"))

Type of response variable: nominal
Minimal misclassification: 0.212987
Best kernel: rectangular
Best k: 19

So, with this data, weighting the distance does not improve the model accuracy. There are other weights that we could try, but when I tested them, the results were no more accurate than these. We don't need to pursue KNN any further. I would encourage you to experiment with various parameters on your own to see how they perform.

SVM modeling

We will use the e1071 package to build our SVM models, starting with a linear support vector classifier and then moving on to the nonlinear versions. The e1071 package has a nice function for SVM called tune.svm(), which assists in the selection of the tuning parameters/kernel functions by using cross-validation. Let's create an object called linear.tune and call it using the summary() function, as follows:

> linear.tune = tune.svm(type~., data=train, kernel="linear", cost=c(0.001, 0.01, 0.1, 1,5,10))

> summary(linear.tune)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 cost
    1

- best performance: 0.2051957

- Detailed performance results:
   cost     error dispersion
1 1e-03 0.3197031 0.06367203
2 1e-02 0.2080297 0.07964313
3 1e-01 0.2077598 0.07084088
4 1e+00 0.2051957 0.06933229
5 5e+00 0.2078273 0.07221619
6 1e+01 0.2078273 0.07221619

The optimal cost is one for this data, which leads to a misclassification error of roughly 21 percent. We can make predictions on the test data and examine the results as well, using the predict() function and applying newdata=test:

> best.linear = linear.tune$best.model

> tune.test = predict(best.linear, newdata=test)

> table(tune.test, test$type)
         
tune.test No Yes
      No  80  22
      Yes 13  32

> (80+32)/147
[1] 0.7619048

The linear support vector classifier has slightly outperformed KNN on both the train and test sets. We will now see if nonlinear methods improve the performance, again using cross-validation to select the tuning parameters.

The first kernel function that we will try is the polynomial, and we will be tuning two parameters: the degree of the polynomial (degree) and the kernel coefficient (coef0). The polynomial degrees will be 3, 4, and 5, and the coefficient values will range from 0.1 to 4, as follows:

> set.seed(123)

> poly.tune = tune.svm(type~., data=train, kernel="polynomial", degree=c(3,4,5), coef0=c(0.1,0.5,1,2,3,4))

> summary(poly.tune)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 degree coef0
      3   0.1

- best performance: 0.2310391

The model has selected a degree of 3 for the polynomial and a coefficient of 0.1. Just as with the linear SVM, we can create predictions on the test set with these parameters, as follows:

> best.poly = poly.tune$best.model

> poly.test = predict(best.poly, newdata=test)

> table(poly.test, test$type)
         
poly.test No Yes
      No  81  28
      Yes 12  26

> (81+26)/147
[1] 0.7278912

This did not perform quite as well as the linear model. We will now try the radial basis function kernel. In this instance, the one parameter that we will solve for is gamma, which we will examine at values from 0.1 to 4. If gamma is too small, the model will not capture the complexity of the decision boundary; if it is too large, the model will severely overfit:

> set.seed(123)

> rbf.tune = tune.svm(type~., data=train, kernel="radial", gamma=c(0.1,0.5,1,2,3,4))

> summary(rbf.tune)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma
   0.5

- best performance: 0.2284076

The best gamma value is 0.5, and the performance at this setting does not seem to improve much over the other SVM models. We will check the test set as well in the following way:

> best.rbf = rbf.tune$best.model

> rbf.test = predict(best.rbf, newdata=test)

> table(rbf.test, test$type)
        
rbf.test No Yes
     No  73  33
     Yes 20  21

> (73+21)/147
[1] 0.6394558

The performance is downright abysmal. One last attempt to improve here would be with kernel="sigmoid". We will be solving for two parameters: gamma and the kernel coefficient (coef0):

> set.seed(123)

> sigmoid.tune = tune.svm(type~., data=train, kernel="sigmoid", gamma=c(0.1,0.5,1,2,3,4), coef0=c(0.1,0.5,1,2,3,4))

> summary(sigmoid.tune)

Parameter tuning of 'svm':

- sampling method: 10-fold cross validation

- best parameters:
 gamma coef0
   0.1     2

- best performance: 0.2080972

This error rate is in line with the linear model. It is now just a matter of whether it performs better on the test set or not:

> best.sigmoid = sigmoid.tune$best.model

> sigmoid.test = predict(best.sigmoid, newdata=test)

> table(sigmoid.test, test$type)
            
sigmoid.test No Yes
         No  82  19
         Yes 11  35

> (82+35)/147
[1] 0.7959184

Lo and behold! We finally have a test performance that is in line with the performance on the train data. It appears that we can choose the sigmoid kernel as the best predictor.

So far, we have experimented with a number of different models. Now, let's evaluate their performance, along with that of the linear model, using metrics other than just accuracy.

Model selection

We've looked at two different types of modeling techniques here, and for all intents and purposes, KNN has fallen short. The best accuracy on the test set for KNN was only around 71 percent. Conversely, with SVM, we could obtain an accuracy close to 80 percent. Before simply selecting the most accurate model (in this case, the SVM with the sigmoid kernel), let's look at how we can compare them with a deeper examination of the confusion matrices.

For this exercise, we can turn to our old friend, the caret package, and utilize the confusionMatrix() function. This will produce all of the statistics that we need in order to evaluate and select the best model. Let's start with the last model that we built, the sigmoid kernel SVM, using the same syntax that we used with the base table() function, with the exception of specifying the positive class, as follows:

> confusionMatrix(sigmoid.test, test$type, positive="Yes")

Confusion Matrix and Statistics

          Reference
Prediction No Yes
       No  82  19
       Yes 11  35
                                          
                Accuracy : 0.7959          
                 95% CI : (0.7217, 0.8579)
    No Information Rate : 0.6327          
    P-Value [Acc > NIR] : 1.393e-05       
                                          
                  Kappa : 0.5469          
 Mcnemar's Test P-Value : 0.2012          
                                          
            Sensitivity : 0.6481          
            Specificity : 0.8817          
         Pos Pred Value : 0.7609          
         Neg Pred Value : 0.8119          
             Prevalence : 0.3673          
         Detection Rate : 0.2381          
   Detection Prevalence : 0.3129          
      Balanced Accuracy : 0.7649          
                                          
       'Positive' Class : Yes    

The function produces some items that we already covered such as Accuracy and Kappa. Here are the other stats that it produces:

  • No Information Rate is the proportion of the largest class; 63 percent of the observations did not have diabetes.
  • P-Value [Acc > NIR] is used to test the hypothesis that the accuracy is actually better than the No Information Rate.
  • We will not concern ourselves with Mcnemar's Test, which is used for the analysis of matched pairs, primarily in epidemiology studies.
  • Sensitivity is the true positive rate; in this case, the rate at which patients with diabetes were correctly identified as diabetic.
  • Specificity is the true negative rate or, for our purposes, the rate at which patients without diabetes were correctly identified as non-diabetic.
  • The positive predictive value (Pos Pred Value) is the probability that someone classified as diabetic truly has the disease. It is calculated as follows:
    Pos Pred Value = true positives / (true positives + false positives)
  • The negative predictive value (Neg Pred Value) is the probability that someone classified as not diabetic truly does not have the disease. The formula is as follows:
    Neg Pred Value = true negatives / (true negatives + false negatives)
  • Prevalence is the estimated population prevalence of the disease, calculated here as the total of the second column (the Yes column) divided by the total observations.
  • Detection Rate is the rate of true positives that have been identified (in our case, 35) divided by the total observations.
  • Detection Prevalence is the predicted prevalence rate or, in our case, the bottom row divided by the total observations.
  • Balanced Accuracy is the average accuracy obtained from either class. This measure accounts for a potential bias in the classifier toward the most frequent class. It is simply (sensitivity + specificity) divided by 2. A quick hand calculation of these values follows this list.
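Here is a brief, hedged sketch of that hand calculation; the object names are mine, and the counts are read directly from the sigmoid kernel confusion matrix above:

> tp = 35; fn = 19; fp = 11; tn = 82 #counts from the sigmoid kernel confusion matrix
> tp / (tp + fn) #sensitivity: 0.6481
> tn / (tn + fp) #specificity: 0.8817
> tp / (tp + fp) #Pos Pred Value: 0.7609
> tn / (tn + fn) #Neg Pred Value: 0.8119
> (tp / (tp + fn) + tn / (tn + fp)) / 2 #Balanced Accuracy: 0.7649

These match the values reported by confusionMatrix() above.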

The sensitivity of our model is not as powerful as we would like and tells us that we are missing some features from our dataset that would improve the rate of finding the true diabetic patients. We will now compare these results with the linear SVM, as follows:

> confusionMatrix(tune.test, test$type, positive="Yes")

         Reference
Prediction No Yes
       No  82  24
       Yes 11  30
                                          
               Accuracy : 0.7619          
                 95% CI : (0.6847, 0.8282)
    No Information Rate : 0.6327          
    P-Value [Acc > NIR] : 0.0005615       
                                          
                  Kappa : 0.4605          
 Mcnemar's Test P-Value : 0.0425225       
                                          
            Sensitivity : 0.5556          
            Specificity : 0.8817          
         Pos Pred Value : 0.7317          
         Neg Pred Value : 0.7736          
             Prevalence : 0.3673          
         Detection Rate : 0.2041          
   Detection Prevalence : 0.2789          
      Balanced Accuracy : 0.7186                                           
       'Positive' Class : Yes             

As we can see by comparing the two models, the linear SVM is inferior on nearly every metric. Our clear winner is the sigmoid kernel SVM. However, there is one thing that we are missing here, and that is any sort of feature selection. What we have done is simply throw all of the variables together as the feature input space and let the black-box SVM calculations provide us with a predicted classification. One of the issues with SVMs is that the findings are very difficult to interpret. There are a number of ways to go about this process that I feel are beyond the scope of this chapter; it is something that you should begin to explore and learn on your own as you become comfortable with the basics that have been outlined previously.
