Chapter 10. Exploring advanced methods

This chapter covers

  • Decision tree–based models
  • Generalized additive models
  • Support vector machines

In chapter 7, you learned about linear methods for fitting predictive models. These models are the bread-and-butter methods of machine learning; they are easy to fit; they are small, portable, and efficient; they sometimes provide useful advice; and they can work well in a wide variety of situations. However, they also make strong assumptions about the world: namely, that the outcome is linearly related to all the inputs, and all the inputs contribute additively to the outcome. In this chapter, you will learn about methods that relax these assumptions.

Figure 10.1 represents our mental model for what we'll do in this chapter: use R to master the science of building supervised machine learning models.

Figure 10.1. Mental model

Example

Suppose you want to study the relationship between mortality rates and measures of a person’s health or fitness, including BMI (body mass index).

Figure 10.2 shows the relationship between BMI and mortality hazard ratio for a population of older Thais over a four-year period.[1] It shows that both high and low BMI are associated with higher mortality rates: the relationship between BMI and mortality is not linear. So a straightforward linear model to predict mortality rates based (partly) on BMI may not perform well.

1

Data from Vapattanawong et al., “Obesity and mortality among older Thais: a four year follow up study,” BMC Public Health, 2010. https://doi.org/10.1186/1471-2458-10-604.

Figure 10.2. Mortality rates of men and women as a function of body mass index

In addition, there may be interactions between BMI and other factors, like how active a person is. For example, for people who are highly active, BMI may affect mortality rates much less than for people who are sedentary. Some interactions, such as “if-then” relationships among variables, or multiplicative effects between variables, may not always be expressible in linear models.[2]

2

One can model interactions in linear models, but it must be done explicitly by the data scientist. Instead, we’ll focus on machine learning techniques, such as tree-based methods, that can learn at least certain types of interactions directly.

The machine learning techniques in this chapter use a variety of methods to address non-linearity, interactions, and other issues in modeling.

10.1. Tree-based methods

You saw an example of a basic decision tree model in chapter 1 (reproduced in figure 10.3). Decision trees are useful for both classification and regression, and they are an attractive method for a number of reasons:

  • They accept any type of data, numerical or categorical, without distributional assumptions and without preprocessing.
  • Most implementations (in particular, R) handle missing data; the method is also robust to redundant and non-linear data.
  • The algorithm is easy to use, and the output (the tree) is relatively easy to understand.
  • They naturally express certain kinds of interactions among the input variables: those of the form “IF x is true AND y is true, THEN....”
  • Once the model is fit, scoring is fast.
Figure 10.3. Example decision tree (from chapter 1)

On the other hand, decision trees do have some drawbacks:

  • They have a tendency to overfit, especially without pruning.
  • They have high training variance: samples drawn from the same population can produce trees with different structures and different prediction accuracy.
  • Simple decision trees are not as reliable as the tree-based ensemble methods we'll discuss in this chapter.[3]

    3

    See Lim, Loh, and Shih, “A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms,” Machine Learning 40, 203–229 (2000); online at http://mng.bz/qX06.

For these reasons, we don’t emphasize the use of basic decision trees in this book. However, there are a number of techniques to fix these weaknesses that lead to state-of-the-art, useful, and performant modeling algorithms. We’ll discuss some of these techniques in this section.

10.1.1. A basic decision tree

To motivate the discussion of tree-based methods, we’ll return to an example that we used in chapter 6 and build a basic decision tree.

Example

Suppose you want to classify email into spam (email you do not want) and non-spam (email you want).

For this example, you’ll again use the Spambase dataset. The dataset consists of about 4,600 documents and 57 features that describe the frequency of certain keywords and characters. Here's the process:

  • First, you’ll train a decision tree to estimate the probability that a given document is spam.
  • Next, you’ll evaluate the tree’s performance according to several performance measures, including accuracy, F1, and deviance (all discussed in chapter 7).

Recall from discussions in chapters 6 and 7 that we want accuracy and F1 to be high, and deviance (which is similar to variance) to be low.

First, let’s load the data. As you did in section 6.2, download a copy of spamD.tsv from https://github.com/WinVector/PDSwR2/raw/master/Spambase/spamD.tsv. Then, write a few convenience functions and train a decision tree, as in the following listing.

Listing 10.1. Preparing Spambase data and evaluating a decision tree model
spamD <- read.table('spamD.tsv', header = TRUE, sep = '\t')               1
spamD$isSpam <- spamD$spam == 'spam'
spamTrain <- subset(spamD, spamD$rgroup >= 10)
spamTest <- subset(spamD, spamD$rgroup < 10)

spamVars <- setdiff(colnames(spamD), list('rgroup', 'spam', 'isSpam'))
library(wrapr)
spamFormula <- mk_formula("isSpam", spamVars)                              2

loglikelihood <- function(y, py) {                                         3
   pysmooth <- ifelse(py == 0, 1e-12,
                  ifelse(py == 1, 1 - 1e-12, py))

  sum(y * log(pysmooth) + (1 - y) * log(1 - pysmooth))
}

accuracyMeasures <- function(pred, truth, name = "model") {                4
   dev.norm <- -2 * loglikelihood(as.numeric(truth), pred) / length(pred)  5
   ctable <- table(truth = truth,
                 pred = (pred > 0.5))                                      6
   accuracy <- sum(diag(ctable)) / sum(ctable)
  precision <- ctable[2, 2] / sum(ctable[, 2])
  recall <- ctable[2, 2] / sum(ctable[2, ])
  f1 <- 2 * precision * recall / (precision + recall)
  data.frame(model = name, accuracy = accuracy, f1 = f1, dev.norm)
}

library(rpart)                                                             7
treemodel <- rpart(spamFormula, spamTrain, method = "class")

library(rpart.plot)                                                        8
rpart.plot(treemodel, type = 5, extra = 6)

predTrain <- predict(treemodel, newdata = spamTrain)[, 2]                  9

trainperf_tree <- accuracyMeasures(predTrain,                              10
                 spamTrain$spam == "spam",
                 name = "tree, training")

predTest <- predict(treemodel, newdata = spamTest)[, 2]
testperf_tree <- accuracyMeasures(predTest,
                 spamTest$spam == "spam",
                 name = "tree, test")

  • 1 Loads the data and splits into training (90% of data) and test (10% of data) sets
  • 2 Uses all the features and does binary classification, where TRUE corresponds to spam documents
  • 3 A function to calculate log likelihood (for calculating deviance)
  • 4 A function to calculate and return various measures on the model: normalized deviance, prediction accuracy, and f1
  • 5 Normalizes the deviance by the number of data points so we can compare the deviance across training and test sets
  • 6 Converts the class probability estimator into a classifier by labeling documents that score greater than 0.5 as spam
  • 7 Loads the rpart library and fits a decision tree model
  • 8 For plotting the tree
  • 9 probabilities of the class “spam”
  • 10 Evaluates the decision tree model against the training and test sets

The resulting decision tree model is shown in figure 10.4. The output of the two calls to accuracyMeasures() looks like the following:

Figure 10.4. Decision tree model for spam filtering

library(pander)                              1

panderOptions("plain.ascii", TRUE)           2
panderOptions("keep.trailing.zeros", TRUE)
panderOptions("table.style", "simple")
perf_justify <- "lrrr"

perftable <- rbind(trainperf_tree, testperf_tree)
pandoc.table(perftable, justify = perf_justify)

##
##
## model              accuracy       f1   dev.norm
## ---------------- ---------- -------- ----------
## tree, training       0.8996   0.8691     0.6304
## tree, test           0.8712   0.8280     0.7531

  • 1 A package to make nicely formatted ASCII tables
  • 2 Sets some options globally so we don't have to keep setting them in every call

As expected, the accuracy and F1 scores both degrade on the test set, and the deviance increases.

10.1.2. Using bagging to improve prediction

One way to mitigate the shortcomings of decision tree models is by bootstrap aggregation, or bagging. In bagging, you draw bootstrap samples (random samples with replacement) from your data. From each sample, you build a decision tree model. The final model is the average of all the individual decision trees. This is shown in figure 10.5.[4]

4

Bagging, random forests, and gradient-boosted trees are variations of a general technique called ensemble learning. An ensemble model is composed of the combination of several smaller simple models (often small decision trees). Giovanni Seni and John Elder’s Ensemble Methods in Data Mining (Morgan & Claypool, 2010) is an excellent introduction to the general theory of ensemble learning.

Figure 10.5. Bagging decision trees

To make this concrete, suppose that x is an input datum, y_i(x) is the output of the ith tree, c(y_1(x), y_2(x), ... y_n(x)) is the vector of individual outputs, and y is the output of the final model:

  • For regression, or for estimating class probabilities, y(x) is the average of the scores returned by the individual trees: y(x) = mean(c(y_1(x), ... y_n(x))).
  • For classification, the final model assigns the class that got the most votes from the individual trees.

Bagging decision trees stabilizes the final model by lowering the variance; this improves the accuracy. A bagged ensemble of trees is also less likely to overfit the data.
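To make the two aggregation rules above concrete, here is a tiny, self-contained sketch. The matrix preds is invented for illustration (rows are data points, columns are trees) and is not part of the spam example.

# Hypothetical per-tree class probabilities: 3 data points scored by 5 trees
preds <- matrix(c(0.9, 0.8, 0.7, 0.6, 0.9,
                  0.2, 0.4, 0.1, 0.3, 0.2,
                  0.6, 0.7, 0.8, 0.2, 0.3),
                nrow = 3, byrow = TRUE)

prob_bag <- rowMeans(preds)          # regression/probabilities: average the per-tree scores
votes <- preds > 0.5                 # classification: each tree casts a vote
class_bag <- rowMeans(votes) > 0.5   # the majority wins

prob_bag                             # 0.78 0.24 0.52
class_bag                            # TRUE FALSE TRUE

Listing 10.2 below implements the probability-averaging rule for the spam data.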

Try bagging some tree models for the spam example.

Listing 10.2. Bagging decision trees
ntrain <- dim(spamTrain)[1]
n <- ntrain                                          1

ntree <- 100

samples <- sapply(1:ntree,                           2

                 FUN = function(iter)
                   { sample(1:ntrain, size = n, replace = TRUE) })

treelist <- lapply(1:ntree,                          3

                  FUN = function(iter) {
                    samp <- samples[, iter];
                    rpart(spamFormula, spamTrain[samp, ], method = "class") }
      )

predict.bag <- function(treelist, newdata) {         4

  preds <- sapply(1:length(treelist),
                 FUN = function(iter) {
                   predict(treelist[[iter]], newdata = newdata)[, 2] })
  predsums <- rowSums(preds)
  predsums / length(treelist)
}

pred <- predict.bag(treelist, newdata = spamTrain)
trainperf_bag <- accuracyMeasures(pred,              5

                 spamTrain$spam == "spam",
                 name = "bagging, training")

pred <- predict.bag(treelist, newdata = spamTest)
testperf_bag <- accuracyMeasures(pred,
                 spamTest$spam == "spam",
                 name = "bagging, test")

perftable <- rbind(trainperf_bag, testperf_bag)
pandoc.table(perftable, justify = perf_justify)
##
##
## model                 accuracy       f1   dev.norm
## ------------------- ---------- -------- ----------
## bagging, training       0.9167   0.8917     0.5080
## bagging, test           0.9127   0.8824     0.5793

  • 1 Uses bootstrap samples the same size as the training set, with 100 trees
  • 2 Builds the bootstrap samples by sampling the row indices of spamTrain with replacement. Each column of the matrix samples represents the row indices into spamTrain that comprise the bootstrap sample.
  • 3 Trains the individual decision trees and returns them in a list. Note: This step can take a few minutes.
  • 4 predict.bag assumes the underlying classifier returns decision probabilities, not decisions. predict.bag takes the mean of the predictions of all the individual trees
  • 5 Evaluates the bagged decision trees against the training and test sets

As you see, bagging improves accuracy and F1, and reduces deviance over both the training and test sets when compared to the single decision tree (you’ll see a direct comparison of the scores a little later on). There is also less degradation in the bagged model’s performance going from training to test than there is with the decision tree.

You can further improve prediction performance by going from bagging to random forests.

Bagging classifiers

The proofs that bagging reduces variance are only valid for regression and for estimating class probabilities, not for classifiers (a model that only returns class membership, not class probabilities). Bagging a bad classifier can make it worse. So you definitely want to work over estimated class probabilities, if they’re at all available. But it can be shown that for CART trees (which is the decision tree implementation in R) under mild assumptions, bagging tends to increase classifier accuracy. See Clifton D. Sutton, “Classification and Regression Trees, Bagging, and Boosting,” Handbook of Statistics, Vol. 24 (Elsevier, 2005) for more details.

10.1.3. Using random forests to further improve prediction

In bagging, the trees are built using randomized datasets, but each tree is built by considering the exact same set of features. This means that all the individual trees are likely to use very similar sets of features (perhaps in a different order or with different split values). Hence, the individual trees will tend to be overly correlated with each other. If there are regions in feature space where one tree tends to make mistakes, then all the trees are likely to make mistakes there, too, diminishing our opportunity for correction. The random forest approach tries to decorrelate the trees by randomizing the set of variables that each tree is allowed to use.

The process is shown in figure 10.6. For each individual tree in the ensemble, the random forest method does the following:

  1. Draws a bootstrapped sample from the training data
  2. For each sample, grows a decision tree, and at each node of the tree

    1. Randomly draws a subset of mtry variables from the p total features that are available
    2. Picks the best variable and the best split from that set of mtry variables
    3. Continues until the tree is fully grown
Figure 10.6. Growing a random forest

The final ensemble of trees is then bagged to make the random forest predictions. This is quite involved, but fortunately it's all done for you by a single call to randomForest().

By default, the randomForest() function in R draws mtry = p/3 variables at each node for regression trees, and mtry = sqrt(p) variables for classification trees. In theory, random forests aren’t terribly sensitive to the value of mtry. Smaller values will grow the trees faster; but if you have a very large number of variables to choose from, of which only a small fraction are actually useful, then using a larger mtry is better, since with a larger mtry you’re more likely to draw some useful variables at every step of the tree-growing procedure.
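If you do want to experiment, randomForest() accepts an explicit mtry argument. The following sketch reuses spamTrain and spamVars from listing 10.1; the value 15 is an arbitrary choice for illustration (the classification default for this data would be floor(sqrt(57)), or 7).

library(randomForest)

fmodel_15 <- randomForest(x = spamTrain[, spamVars],   # same data as listing 10.3,
                          y = spamTrain$spam,          # but drawing 15 candidate variables
                          ntree = 100,                 # at each split instead of 7
                          mtry = 15,
                          nodesize = 7)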

Continuing from the data in section 10.1, try building a spam model using random forests.

Listing 10.3. Using random forests
library(randomForest)                                        1
set.seed(5123512)                                            2
fmodel <- randomForest(x = spamTrain[, spamVars],            3
         y = spamTrain$spam,
         ntree = 100,                                        4
         nodesize = 7,                                       5
         importance = TRUE)                                  6

pred <- predict(fmodel,
                spamTrain[, spamVars],
                type = 'prob')[, 'spam']

trainperf_rf <-  accuracyMeasures(predict(fmodel,            7
   newdata = spamTrain[, spamVars], type = 'prob')[, 'spam'],
   spamTrain$spam == "spam", name = "random forest, train")

testperf_rf <-  accuracyMeasures(predict(fmodel,
   newdata = spamTest[, spamVars], type = 'prob')[, 'spam'],
   spamTest$spam == "spam", name = "random forest, test")

perftable <- rbind(trainperf_rf, testperf_rf)
pandoc.table(perftable, justify = perf_justify)

##
##
## model                    accuracy       f1   dev.norm
## ---------------------- ---------- -------- ----------
## random forest, train       0.9884   0.9852     0.1440
## random forest, test        0.9498   0.9341     0.3011

  • 1 Loads the randomForest package
  • 2 Sets the pseudo-random seed to a known value to try to make the random forest run repeatable
  • 3 Calls the randomForest() function to build the model with explanatory variables as x and the category to be predicted as y
  • 4 Uses 100 trees to be compatible with our bagging example. The default is 500 trees.
  • 5 Specifies that each node of a tree must have a minimum of 7 elements to be compatible with the default minimum node size that rpart() uses on this training set
  • 6 Tells the algorithm to save information to be used for calculating variable importance (we’ll see this later)
  • 7 Reports the model quality

You can summarize the results for all three of the models you’ve looked at. First, on training:

trainf <- rbind(trainperf_tree, trainperf_bag, trainperf_rf)
pandoc.table(trainf, justify = perf_justify)
##
##
## model                    accuracy       f1   dev.norm
## ---------------------- ---------- -------- ----------
## tree, training             0.8996   0.8691     0.6304
## bagging, training          0.9160   0.8906     0.5106
## random forest, train       0.9884   0.9852     0.1440

Then, on test:

testf <- rbind(testperf_tree, testperf_bag, testperf_rf)
pandoc.table(testf, justify = perf_justify)
##
##
## model                   accuracy       f1   dev.norm
## --------------------- ---------- -------- ----------
## tree, test                0.8712   0.8280     0.7531
## bagging, test             0.9105   0.8791     0.5834
## random forest, test       0.9498   0.9341     0.3011

The random forest model performed dramatically better than the other two models on both training and test.

You can also look at performance change: the decrease in accuracy and F1 when going from training to test, and the corresponding increase in deviance.

difff <- data.frame(model = c("tree", "bagging", "random forest"),
                  accuracy = trainf$accuracy - testf$accuracy,
                  f1 = trainf$f1 - testf$f1,
                  dev.norm = trainf$dev.norm - testf$dev.norm)

pandoc.table(difff, justify=perf_justify)

##
##
## model             accuracy        f1   dev.norm
## --------------- ---------- --------- ----------
## tree              0.028411   0.04111   -0.12275
## bagging           0.005523   0.01158   -0.07284
## random forest     0.038633   0.05110   -0.15711

The random forest model degraded at least as much as a single decision tree when going from training to test data, and much more than the bagged model did. This is one of the drawbacks of random forest models: the tendency to overfit the training data. However, in this case, the random forest model was still the best performing.

Random forests can overfit!

It’s lore among random forest proponents that “random forests don’t overfit.” In fact, they can. Hastie et al. back up this observation in their chapter on random forests in The Elements of Statistical Learning (Springer, 2011). Seeing virtually perfect prediction on training data and less-than-perfect performance on holdout data is characteristic of random forest models. So when using random forest, it’s extremely important to validate model performance on holdout data.

Examining variable importance

A useful feature of the randomForest() function is its variable importance calculation. Since the algorithm uses a large number of bootstrap samples, each data point x has a corresponding set of out-of-bag samples: those samples that don’t contain the point x. This is shown in figure 10.7 for the data point x1. The out-of-bag samples can be used in a way similar to N-fold cross-validation, to estimate the accuracy of each tree in the ensemble.

Figure 10.7. Out-of-bag samples for datum x1

To estimate the “importance” of a variable v1, the variable’s values are randomly permuted. Each tree is then evaluated against its out-of-bag samples, and the corresponding decrease in each tree’s accuracy is estimated. This is shown in figure 10.8.

Figure 10.8. Calculating variable importance of variable v1

If the average decrease over all the trees is large, then the variable is considered important—its value makes a big difference in predicting the outcome. If the average decrease is small, then the variable doesn’t make much difference to the outcome. The algorithm also measures the total decrease in node purity that comes from splitting on the variable (a measure of how much the variable contributes to the quality of the trees).
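To build some intuition for this permutation test, here is a rough sketch that shuffles a single variable and measures the drop in whole-model accuracy on the test set. This is only an approximation of what randomForest() does internally (which works tree by tree on the out-of-bag samples); it assumes fmodel from listing 10.3 and spamTest and spamVars from listing 10.1.

# Shuffle one variable, re-score, and see how much accuracy drops.
permutation_drop <- function(model, data, truth, var) {
  baseline <- mean(predict(model, data) == truth)
  shuffled <- data
  shuffled[[var]] <- sample(shuffled[[var]])    # break the variable's link to the outcome
  permuted <- mean(predict(model, shuffled) == truth)
  baseline - permuted                           # a large drop suggests an important variable
}

set.seed(22819)                                 # arbitrary seed for the shuffle
permutation_drop(fmodel, spamTest[, spamVars], spamTest$spam, "word.freq.remove")
permutation_drop(fmodel, spamTest[, spamVars], spamTest$spam, "word.freq.make")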

You can calculate the variable importance by setting importance = TRUE in the randomForest() call (as you did in listing 10.3), and then calling the functions importance() and varImpPlot().

Listing 10.4. randomForest variable importances
varImp <- importance(fmodel)                                       1

varImp[1:10, ]                                                     2
##                     non-spam      spam MeanDecreaseAccuracy
## word.freq.make      1.656795  3.432962             3.067899
## word.freq.address   2.631231  3.800668             3.632077
## word.freq.all       3.279517  6.235651             6.137927
## word.freq.3d        3.900232  1.286917             3.753238
## word.freq.our       9.966034 10.160010            12.039651
## word.freq.over      4.657285  4.183888             4.894526
## word.freq.remove   19.172764 14.020182            20.229958
## word.freq.internet  7.595305  5.246213             8.036892
## word.freq.order     3.167008  2.505777             3.065529
## word.freq.mail      3.820764  2.786041             4.869502

varImpPlot(fmodel, type = 1)                                       3

  • 1 Calls importance() on the spam model
  • 2 The importance() function returns a matrix of importance measures (larger values = more important).
  • 3 Plots the variable importance as measured by accuracy change

The result of the varImpPlot() call is shown in figure 10.9. According to the plot, the most important variable for determining if an email is spam is char.freq.bang, or the number of times an exclamation point appears in an email, which makes some intuitive sense. The next most important variable is word.freq.remove, or the number of times the word “remove” appears in the email.

Figure 10.9. Plot of the most important variables in the spam model, as measured by accuracy

Knowing which variables are most important (or at least, which variables contribute the most to the structure of the underlying decision trees) can help you with variable reduction. This is useful not only for building smaller, faster trees, but also for choosing variables to be used by another modeling algorithm, if that’s desired. We can reduce the number of variables in this spam example from 57 to 30 without affecting the quality of the final model.

Variable importance as an initial screen

Data scientist Jeremy Howard (of Kaggle and fast.ai fame) is a big proponent of using an initial variable importance screen early in a data science project to eliminate variables that are not of interest and identify variables to discuss with business partners.

Listing 10.5. Fitting with fewer variables
sorted <- sort(varImp[, "MeanDecreaseAccuracy"],       1
                decreasing = TRUE)

selVars <- names(sorted)[1:30]
fsel <- randomForest(x = spamTrain[, selVars],         2
                         y = spamTrain$spam,
                        ntree = 100,
                        nodesize = 7,
                        importance = TRUE)

trainperf_rf2 <- accuracyMeasures(predict(fsel,
   newdata = spamTrain[, selVars], type = 'prob')[, 'spam'],
   spamTrain$spam == "spam", name = "RF small, train")

testperf_rf2 <- accuracyMeasures(predict(fsel,
   newdata=spamTest[, selVars], type = 'prob')[, 'spam'],
   spamTest$spam == "spam", name = "RF small, test")

perftable <- rbind(testperf_rf, testperf_rf2)          3
pandoc.table(perftable, justify = perf_justify)
##
##
## model                   accuracy       f1   dev.norm
## --------------------- ---------- -------- ----------
## random forest, test       0.9498   0.9341     0.3011
## RF small, test            0.9520   0.9368     0.4000

  • 1 Sorts the variables by their importance, as measured by accuracy change
  • 2 Builds a random forest model using only the 30 most important variables
  • 3 Compares the two random forest models on the test set

The smaller model performs just as well as the random forest model built using all 57 variables.

Random forest variable importance versus LIME

Random forest variable importances measure how important individual variables are to the model’s overall prediction performance. They tell you which variables generally affect the model’s predictions the most, or which variables the model depends on the most.

LIME variable importances (discussed in section 6.3) measure how much different variables affect the model’s prediction on a specific example. LIME explanations can help you determine if the model is using its variables appropriately, by explaining specific decisions.

10.1.4. Gradient-boosted trees

Gradient boosting is another ensemble method that improves the performance of decision trees. Rather than averaging the predictions of several trees together, as bagging and random forests do, gradient boosting tries to improve prediction performance by incrementally adding trees to an existing ensemble. The steps are as follows:

  1. Use the current ensemble TE to make predictions on the training data.
  2. Measure the residuals between the true outcomes and the predictions on the training data.
  3. Fit a new tree T_i to the residuals. Add T_i to the ensemble TE.
  4. Continue until the residuals have vanished, or another stopping criterion is achieved.

The procedure is sketched out in figure 10.10.

Figure 10.10. Building up a gradient-boosted tree model

Gradient-boosted trees can also overfit, because at some point the residuals are just random noise. To mitigate overfitting, most implementations of gradient boosting provide cross-validation methods to help determine when to stop adding trees to the ensemble.
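To make the residual-fitting loop concrete, here is a deliberately simplified, hand-rolled sketch on invented data, using small rpart trees and a fixed learning rate. Real gradient boosting implementations (such as xgboost, used below) work with gradients of a general loss function and add regularization, so treat this only as an illustration of the steps above.

library(rpart)

set.seed(1234)                                 # invented regression data, not from the book
dat <- data.frame(x = runif(500, -3, 3))
dat$y <- sin(2 * dat$x) + rnorm(500, sd = 0.3)

n_rounds <- 25
shrinkage <- 0.1                               # learning rate: take small steps toward the residuals
ensemble <- list()
pred <- rep(mean(dat$y), nrow(dat))            # start from the mean prediction

for (i in 1:n_rounds) {
  dat$resid <- dat$y - pred                                  # step 2: residuals of the current ensemble
  tree_i <- rpart(resid ~ x, data = dat,
                  control = rpart.control(maxdepth = 2))     # step 3: fit a small tree to the residuals
  ensemble[[i]] <- tree_i                                    # add the tree to the ensemble
  pred <- pred + shrinkage * predict(tree_i, dat)            # update the ensemble's predictions
}

sqrt(mean((dat$y - pred)^2))                   # training RMSE shrinks as trees are added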

You saw examples of gradient boosting when we discussed LIME in section 6.3, where you fit the gradient-boosted tree models using the xgboost package. In this section, we’ll go over the modeling code that you used in section 6.3 in more detail.

The iris example

Let’s start with a small example.

Example

Suppose you have a dataset of petal and sepal measurements for three varieties of iris. The objective is to predict whether a given iris is a setosa based on its petal and sepal dimensions.

Listing 10.6. Loading the iris data
iris <- iris
iris$class <- as.numeric(iris$Species == "setosa")       1

set.seed(2345)
intrain <- runif(nrow(iris)) < 0.75                      2
train <- iris[intrain, ]
test <- iris[!intrain, ]
head(train)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species class
## 1          5.1         3.5          1.4         0.2  setosa     1
## 2          4.9         3.0          1.4         0.2  setosa     1
## 3          4.7         3.2          1.3         0.2  setosa     1
## 4          4.6         3.1          1.5         0.2  setosa     1
## 5          5.0         3.6          1.4         0.2  setosa     1
## 6          5.4         3.9          1.7         0.4  setosa     1

input <- as.matrix(train[, 1:4])                         3

  • 1 setosa is the positive class.
  • 2 Splits the data into training and test (75%/25%)
  • 3 Creates the input matrix

Note that xgboost requires its input to be a numeric (no categorical variables) matrix, so in listing 10.6, you take the input data from the training data frame and create an input matrix.

In section 6.3, you fit the iris model by using the provided convenience function fit_iris_example(); here we’ll explain the code from that function in detail. The first step is to run the cross-validation function xgb.cv() to determine the appropriate number of trees to use.

Listing 10.7. Cross-validating to determine model size
library(xgboost)

cv <- xgb.cv(input,                                     1

            label = train$class,                        2

              params = list(
                objective = "binary:logistic"           3
               ),
              nfold = 5,                                4
              nrounds = 100,                            5
              print_every_n = 10,                       6

              metrics = "logloss")                      7

evalframe <- as.data.frame(cv$evaluation_log)           8

head(evalframe)                                         9

##   iter train_logloss_mean train_logloss_std test_logloss_mean
## 1    1          0.4547800      7.758350e-05         0.4550578
## 2    2          0.3175798      9.268527e-05         0.3179284
## 3    3          0.2294212      9.542411e-05         0.2297848
## 4    4          0.1696242      9.452492e-05         0.1699816
## 5    5          0.1277388      9.207258e-05         0.1280816
## 6    6          0.0977648      8.913899e-05         0.0980894
##   test_logloss_std
## 1      0.001638487
## 2      0.002056267
## 3      0.002142687
## 4      0.002107535
## 5      0.002020668
## 6      0.001911152

(NROUNDS <- which.min(evalframe$test_logloss_mean))    10
## [1] 18

library(ggplot2)
ggplot(evalframe, aes(x = iter, y = test_logloss_mean)) +
  geom_line() +
  geom_vline(xintercept = NROUNDS, color = "darkred", linetype = 2) +
  ggtitle("Cross-validated log loss as a function of ensemble size")

  • 1 The input matrix
  • 2 The class labels, which must also be numeric (1 for setosa, 0 for not setosa)
  • 3 Uses the objective “binary:logistic” for binary classification, “reg:linear” for regression
  • 4 Uses 5-fold cross-validation
  • 5 Builds an ensemble of 100 trees
  • 6 Prints a message every 10th iteration (use verbose = FALSE for no messages)
  • 7 Uses minimum cross-validated logloss (related to deviance) to pick the optimum number of trees. For regression, uses metrics = “rmse”.
  • 8 Gets the performance log
  • 9 evalframe records the training and cross-validated logloss as a function of the number of trees.
  • 10 Finds the number of trees that gave the minimum cross-validated logloss

Figure 10.11 shows the cross-validated log loss as a function of the number of trees. In this case, xgb.cv() estimated that 18 trees gave the best model. Once you know the number of trees to use, you can call xgboost() to fit the appropriate model.

Figure 10.11. Cross-validated log loss as a function of ensemble size

Listing 10.8. Fitting an xgboost model
model <- xgboost(data = input,
                 label = train$class,
                 params = list(
                    objective = "binary:logistic"
                  ),
                 nrounds = NROUNDS,
                 verbose = FALSE)

test_input <- as.matrix(test[, 1:4])      1
pred <- predict(model,  test_input)       2

accuracyMeasures(pred, test$class)

##   model accuracy f1   dev.norm
## 1 model        1  1 0.03458392

  • 1 Creates the input matrix for the test data
  • 2 Makes predictions

The model predicts perfectly on the holdout data, because this is an easy problem. Now that you are familiar with the steps, you can try xgboost on a harder problem: the movie review classification problem from section 6.3.3.

Gradient boosting for text classification
Example

For this example, you will classify movie reviews from the Internet Movie Database (IMDB). The task is to identify positive reviews.

As you did in section 6.3.3, you’ll use the training and test data, IMDBtrain.RDS and IMDBtest.RDS, found at https://github.com/WinVector/PDSwR2/tree/master/IMDB. Each RDS object is a list with two elements: a character vector representing 25,000 reviews, and a vector of numeric labels where 1 means a positive review and 0 a negative review.

First, load the training data:

library(zeallot)
c(texts, labels) %<-% readRDS("IMDBtrain.RDS")

You have to convert the textual input data to a numeric representation. As in section 6.3.3, you’ll convert the training data into a document-term matrix, implemented as a sparse matrix of class dgCMatrix. The convenience functions to do this conversion are in https://github.com/WinVector/PDSwR2/tree/master/IMDB/lime_imdb_example.R. Next, you’ll create the vocabulary of terms in the corpus, and then make the document-term matrix for the training data:

source("lime_imdb_example.R")
vocab <- create_pruned_vocabulary(texts)
dtm_train <- make_matrix(texts, vocab)

The first step to fit the model is to determine the number of trees to use. This may take a while.

cv <- xgb.cv(dtm_train,
             label = labels,
             params = list(
               objective = "binary:logistic"
               ),
             nfold = 5,
             nrounds = 500,
             early_stopping_rounds = 20,     1
             print_every_n = 10,
             metrics = "logloss")

evalframe <- as.data.frame(cv$evaluation_log)
(NROUNDS <- which.min(evalframe$test_logloss_mean))
## [1] 319

  • 1 Stop early if performance doesn’t improve for 20 rounds.

Then fit the model and evaluate it:

model <- xgboost(data = dtm_train, label = labels,
                  params = list(
                    objective = "binary:logistic"
                  ),
                  nrounds = NROUNDS,
                  verbose = FALSE)

pred <- predict(model, dtm_train)
trainperf_xgb <- accuracyMeasures(pred, labels, "training")

c(test_texts, test_labels) %<-% readRDS("IMDBtest.RDS")      1
dtm_test <- make_matrix(test_texts, vocab)

pred <- predict(model, dtm_test)
testperf_xgb <- accuracyMeasures(pred, test_labels, "test")

perftable <- rbind(trainperf_xgb, testperf_xgb)
pandoc.table(perftable, justify = perf_justify)
##
##
## model        accuracy       f1   dev.norm
## ---------- ---------- -------- ----------
## training       0.9891   0.9891     0.1723
## test           0.8725   0.8735     0.5955

  • 1 Loads the test data and converts it to a document-term matrix

As with random forests, this gradient-boosted model gives near-perfect performance on training data, and less-than-perfect, but still decent performance on holdout data. Even though the cross-validation step suggested 319 trees, you may want to examine evalframe (as you did in the iris example), and experiment with different numbers of trees, to see if that reduces the overfitting.
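For instance, you could plot the cross-validated log loss in evalframe (as you did for the iris example) and refit with a smaller, "good enough" number of rounds. The following sketch picks the first round that comes within 1% of the minimum cross-validated log loss; the 1% tolerance is an arbitrary choice of ours, not an xgboost recommendation.

library(ggplot2)

ggplot(evalframe, aes(x = iter, y = test_logloss_mean)) +
  geom_line() +
  ggtitle("Cross-validated log loss, IMDB model")

tol <- 0.01                                                # arbitrary tolerance
best <- min(evalframe$test_logloss_mean)
NROUNDS_smaller <- min(which(evalframe$test_logloss_mean <= best * (1 + tol)))

model_smaller <- xgboost(data = dtm_train, label = labels,
                         params = list(objective = "binary:logistic"),
                         nrounds = NROUNDS_smaller,
                         verbose = FALSE)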

Gradient boosting models vs. random forests

In our own work, we’ve found that gradient boosting models tend to outperform random forests on most problems where we’ve tried both. However, there are occasionally situations where gradient boosting models perform poorly, while random forest models give acceptable performance. Your experiences may be different. In any case, it’s a good idea to keep both methods in your arsenal.

Using xgboost with categorical variables

In the iris example, all the input variables were numeric; in the movie review example, you converted the unstructured text input into a structured, numeric matrix representation. In many situations, you will have structured input data with categorical levels, as in the following example.

Example

Suppose you want to predict a newborn’s birth weight as a function of several variables, both numeric and categorical, using xgboost.

The data for this example is from the 2010 CDC natality dataset; it is similar to the data that you used in chapter 7 for predicting at-risk births.[5]

5

The dataset can be found at https://github.com/WinVector/PDSwR2/blob/master/CDC/NatalBirthData.rData.

Listing 10.9. Loading the natality data
load("NatalBirthData.rData")
train <- sdata[sdata$ORIGRANDGROUP <= 5, ]                            1

test <- sdata[sdata$ORIGRANDGROUP > 5, ]

input_vars <- setdiff(colnames(train), c("DBWT", "ORIGRANDGROUP"))    2

str(train[, input_vars])

## 'data.frame':    14386 obs. of  11 variables:
##  $ PWGT      : int  155 140 151 160 135 180 200 135 112 98 ...
##  $ WTGAIN    : int  42 40 1 47 25 20 24 51 36 22 ...
##  $ MAGER     : int  30 32 34 32 24 25 26 26 20 22 ...
##  $ UPREVIS   : int  14 13 15 1 4 10 14 15 14 10 ...
##  $ CIG_REC   : logi  FALSE FALSE FALSE TRUE FALSE FALSE ...
##  $ GESTREC3  : Factor w/ 2 levels ">= 37 weeks",..: 1 1 1 2 1 1 1 1 1 1 ...
##  $ DPLURAL   : Factor w/ 3 levels "single","triplet or higher",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ URF_DIAB  : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ URF_CHYPER: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ URF_PHYPER: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ URF_ECLAM : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

  • 1 Splits the data into training and test sets
  • 2 Uses all the variables in the model. DBWT (baby's birth weight) is the value to be predicted, and ORIGRANDGROUP is the grouping variable.

As you can see, the input data has numerical variables, logical variables, and categorical (factor) variables. If you want to use xgboost() to fit a gradient-boosted model for a baby’s birth weight using all of these variables, you must convert the input to all-numerical data. There are several ways to do this, including the base R model.matrix() function. We recommend using vtreat, as you did in chapter 8.

For this scenario, there are three ways you can use vtreat:

  • Split the data into three sets: calibration/train/test. Use the calibration set with designTreatmentsN() to create the treatment plan; prepare() the training set to fit the xgboost model; and then prepare() the test set to validate the model. This is a good option when you have a large training set with either complex variables (categorical variables that take on a large number of possible levels), or a large number of categorical variables. It is also a good option if you want to prune some of the variables before fitting the model (using significance pruning—see section 8.4.2).
  • Split the data into train/test (as we did here). Use mkCrossFrameNExperiment() to create the treatment plan and a cross-frame for training the xgboost model; prepare() the test set to validate the model. This is a good option when you don’t have enough training data to split into three groups, but you have complex variables or a large number of categorical variables, and/or you want to prune some of the variables before fitting the model. (A code sketch of this option appears just after this list.)
  • Split the data into train/test. Use designTreatmentsZ() to create a treatment plan that manages missing values and converts categorical variables to indicator variables. prepare() both the training and test sets to create purely numeric input. This solution is quite similar to calling model.matrix(), with the added advantage that it manages missing values, and also gracefully handles situations where some categorical levels show up in either training or test, but not both. It’s a good solution when you only have a few categorical variables, and none of the variables are too complex.
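For reference, a minimal sketch of the second option might look like the following. It assumes the train, test, and input_vars objects from listing 10.9; the names cross_exp, bw_model2, and so on are ours, and the 50 rounds are for demonstration only (normally you would choose the number of rounds with xgb.cv()).

library(vtreat)
library(xgboost)

cross_exp <- mkCrossFrameNExperiment(train, input_vars, "DBWT")   # treatment plan plus cross-frame
treatplan2 <- cross_exp$treatments
train_cross <- cross_exp$crossFrame                # train on the cross-frame, not on prepare(train)

model_vars <- setdiff(colnames(train_cross), "DBWT")
bw_model2 <- xgboost(as.matrix(train_cross[, model_vars]),
                     train_cross$DBWT,
                     params = list(objective = "reg:linear"),
                     nrounds = 50,
                     verbose = FALSE)

test_treated2 <- prepare(treatplan2, test)         # the test set is prepared as usual
pred2 <- predict(bw_model2, as.matrix(test_treated2[, model_vars]))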

Since in this scenario there are only two categorical variables, and none of them are too complex (GESTREC3 takes on two values, and DPLURAL takes on three), you can use the third option.

Listing 10.10. Using vtreat to prepare data for xgboost
library(vtreat)

treatplan <- designTreatmentsZ(train,                                        1
                               input_vars,
                               codeRestriction = c("clean", "isBAD", "lev" ),2
                               verbose = FALSE)

train_treated <- prepare(treatplan, train)                                   3
str(train_treated)

## 'data.frame':    14386 obs. of  14 variables:
##  $ PWGT                           : num  155 140 151 160 135 180 200 135 112 98 ...
##  $ WTGAIN                         : num  42 40 1 47 25 20 24 51 36 22 ...
##  $ MAGER                          : num  30 32 34 32 24 25 26 26 20 22 ...
##  $ UPREVIS                        : num  14 13 15 1 4 10 14 15 14 10 ...
##  $ CIG_REC                        : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ URF_DIAB                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ URF_CHYPER                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ URF_PHYPER                     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ URF_ECLAM                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ GESTREC3_lev_x_37_weeks        : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ GESTREC3_lev_x_37_weeks_1      : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ DPLURAL_lev_x_single           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ DPLURAL_lev_x_triplet_or_higher: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ DPLURAL_lev_x_twin             : num  0 0 0 0 0 0 0 0 0 0 ...

  • 1 Creates the treatment plan
  • 2 Creates clean numeric variables (“clean”), missingness indicators (“isBAD”), and indicator variables (“lev”), but not catP (prevalence) variables
  • 3 Prepares the training data

Note that train_treated is purely numerical, with no missing values, and it doesn’t contain the outcome column, so it’s safe to use with xgboost (though you must convert it to a matrix first). To demonstrate this, the following listing directly fits a gradient-boosted model with 50 trees to the prepared training data (no cross-validation to pick the best size), and then applies the model to the prepared test data. This is just for demonstration purposes; normally you would want to call xgb.cv() to pick an appropriate number of trees first.

Listing 10.11. Fitting and applying an xgboost model for birth weight
birthwt_model <- xgboost(as.matrix(train_treated),
                         train$DBWT,
                         params = list(
                           objective = "reg:linear",
                           base_score = mean(train$DBWT)
                         ),
                         nrounds = 50,
                         verbose = FALSE)
test_treated <- prepare(treatplan, test)
pred <- predict(birthwt_model, as.matrix(test_treated))
Exercise: Try to use xgboost to solve the birth weight problem.

Try xgboost to predict DBWT, that is, set up the data and run the preceding code.
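One way to check your fit is to compare the model's root mean squared error (RMSE) on the test set to that of a null model that always predicts the mean training birth weight. This sketch reuses pred from listing 10.11 and train and test from listing 10.9.

# RMSE of the gradient-boosted model vs. a mean-only null model
rmse_gbm  <- sqrt(mean((test$DBWT - pred)^2))
rmse_null <- sqrt(mean((test$DBWT - mean(train$DBWT))^2))
rmse_gbm
rmse_null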

Bagging, random forests, and gradient boosting are after-the-fact improvements you can try in order to improve decision tree models. In the next section, you’ll work with generalized additive models, which use a different method to represent non-linear relationships between inputs and outputs.

10.1.5. Tree-based model takeaways

Here's what you should remember about tree-based models:

  • Trees are useful for modeling data with non-linear relationships between the input and the output, and potential interactions among variables.
  • Tree-based ensembles generally have better performance than basic decision tree models.
  • Bagging stabilizes decision trees and improves accuracy by reducing variance.
  • Both random forests and gradient-boosted trees may have a tendency to overfit on training data. Be sure to evaluate the models on holdout data to get a better estimate of model performance.

10.2. Using generalized additive models (GAMs) to learn non-monotone relationships

In chapter 7, you used linear regression to model and predict quantitative output, and logistic regression to predict class probabilities. Linear and logistic regression models are powerful tools, especially when you want to understand the relationship between the input variables and the output. They’re robust to correlated variables (when regularized), and logistic regression preserves the marginal probabilities of the data. The primary shortcoming of both these models is that they assume that the relationship between the inputs and the output is monotone. That is, if more is good, then much more is always better.

But what if the actual relationship is non-monotone? Consider the BMI example that you saw at the beginning of the chapter. For underweight adults, increasing BMI can lower mortality. But there’s a limit: at some point a higher BMI is bad, and mortality will increase as BMI increases. Linear and logistic regression miss this distinction. For the data that we are working with, as figure 10.12 shows, a linear model would predict that mortality always decreases as BMI increases.

Figure 10.12. The effect of BMI on mortality: linear model vs. GAM

Generalized additive models (GAMs) are a way to model non-monotone responses within the framework of a linear or logistic model (or any other generalized linear model). In the mortality example, GAM would try to find a good “u-shaped” function of BMI, s(BMI), that describes the relationship between BMI and mortality, as shown in figure 10.12. GAM would then fit a function to predict mortality in terms of s(BMI).

10.2.1. Understanding GAMs

Recall that if y[i] is the numeric quantity you want to predict, and x[i, ] is a row of inputs that corresponds to output y[i], then linear regression finds a function f(x) such that

f(x[i, ]) = b0 + b[1] * x[i, 1] + b[2] * x[i, 2] + ... b[n] * x[i, n]

And f(x[i, ]) is as close to y[i] as possible.

In its simplest form, a GAM model relaxes the linearity constraint and finds a set of functions s_i() (and a constant term a0) such that

f(x[i,]) = a0 + s_1(x[i, 1]) + s_2(x[i, 2]) + ... s_n(x[i, n])

We also want f(x[i, ]) to be as close to y[i] as possible. The functions s_i() are smooth curve fits that are built up from polynomials. The curves are called splines and are designed to pass as closely as possible through the data without being too “wiggly” (without overfitting). An example of a spline fit is shown in figure 10.13.

Figure 10.13. A spline that has been fit through a series of points
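If you'd like to see a single spline fit like the one in figure 10.13 before working with gam(), base R's smooth.spline() gives a quick approximation. The data below is invented for illustration and is not the data used in this chapter.

# Fit a smoothing spline to noisy non-linear data
set.seed(1234)
xs <- sort(runif(200, -3, 3))
ys <- sin(2 * xs) + rnorm(200, sd = 0.3)

fit <- smooth.spline(xs, ys)     # chooses the amount of smoothing by (generalized) cross-validation
plot(xs, ys)
lines(predict(fit), col = "blue", lwd = 2)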

Let’s work on a concrete example.

10.2.2. A one-dimensional regression example

First, consider this toy example.

Example

Suppose you want to fit a model to data where the response y is a noisy non-linear function of the input variable x (in fact, it’s the function shown in figure 10.13).

As usual, we’ll split the data into training and test sets.

Listing 10.12. Preparing an artificial problem
set.seed(602957)

x <- rnorm(1000)
noise <- rnorm(1000, sd = 1.5)

y <- 3 * sin(2 * x) + cos(0.75 * x) - 1.5 * (x^2) + noise

select <- runif(1000)
frame <- data.frame(y = y, x = x)

train <- frame[select > 0.1, ]
test <- frame[select <= 0.1, ]

Given that the data is from the non-linear functions sin() and cos(), there shouldn’t be a good linear fit from x to y. We’ll start by building a (poor) linear regression.

Listing 10.13. Applying linear regression to the artificial example
lin_model <- lm(y ~ x, data = train)
summary(lin_model)

##
## Call:
## lm(formula = y ~ x, data = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -17.698  -1.774   0.193   2.499   7.529
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.8330     0.1161  -7.175 1.51e-12 ***
## x             0.7395     0.1197   6.180 9.74e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.485 on 899 degrees of freedom
## Multiple R-squared:  0.04075,    Adjusted R-squared:  0.03968
## F-statistic: 38.19 on 1 and 899 DF,  p-value: 9.737e-10

rmse <- function(residuals) {                      1
   sqrt(mean(residuals^2))
}

train$pred_lin <- predict(lin_model, train)        2
resid_lin <- with(train, y - pred_lin)
rmse(resid_lin)
## [1] 3.481091

library(ggplot2)                                   3

ggplot(train, aes(x = pred_lin, y = y)) +
  geom_point(alpha = 0.3) +
  geom_abline()

  • 1 A convenience function for calculating root mean squared error (RMSE) from a vector of residuals
  • 2 Calculates the RMSE of this model on the training data
  • 3 Plots y versus prediction

The resulting model’s predictions are plotted versus true response in figure 10.14. As expected, it’s a very poor fit, with an R-squared of about 0.04. In particular, the errors are not homoscedastic: there are regions where the model systematically underpredicts and regions where it systematically overpredicts. If the relationship between x and y were truly linear (with independent noise), then the errors would be homoscedastic: the errors would be evenly distributed (mean 0) around the predicted value everywhere.

Figure 10.14. Linear model’s predictions vs. actual response. The solid line is the line of perfect prediction (prediction == actual).
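One way to see this heteroscedasticity directly is to plot the residuals against the predictions; homoscedastic errors would form an even band around zero. This sketch reuses pred_lin and resid_lin from listing 10.13.

library(ggplot2)

ggplot(data.frame(pred = train$pred_lin, resid = resid_lin),
       aes(x = pred, y = resid)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = 0, color = "darkgray") +     # where the residuals should center
  geom_smooth(se = FALSE)                              # a trend in the residuals signals a poor fit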

Now try finding a non-linear model that maps x to y. We’ll use the function gam() in the package mgcv.[6] When using gam(), you can model variables as either linear or non-linear. You model a variable x as non-linear by wrapping it in the s() notation. In this example, rather than using the formula y ~ x to describe the model, you’d use the formula y ~ s(x). Then gam() will search for the spline s() that best describes the relationship between x and y, as shown in listing 10.14. Only terms surrounded by s() get the GAM/spline treatment.

6

There’s an older package called gam, written by Hastie and Tibshirani, the inventors of GAMs. The gam package works fine. But it’s incompatible with the mgcv package, which ggplot already loads. Since we’re using ggplot for plotting, we’ll use mgcv for our examples.

Listing 10.14. Applying GAM to the artificial example
library(mgcv)                                                       1
gam_model <- gam(y ~ s(x), data = train)                            2
gam_model$converged                                                 3
## [1] TRUE

summary(gam_model)

## Family: gaussian                                                 4
## Link function: identity
##
## Formula:
## y ~ s(x)
##
## Parametric coefficients:                                         5
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.83467    0.04852   -17.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:                        6
##        edf Ref.df     F p-value
## s(x) 8.685  8.972 497.8  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) =  0.832   Deviance explained = 83.4%                 7
## GCV score =  2.144  Scale est. = 2.121     n = 901

train$pred <- predict(gam_model, train)                             8
resid_gam <- with(train, y - pred)
rmse(resid_gam)

## [1] 1.448514

ggplot(train, aes(x = pred, y = y)) +                               9
  geom_point(alpha = 0.3) +
  geom_abline()

  • 1 Loads the mgcv package
  • 2 Builds the model, specifying that x should be treated as a non-linear variable
  • 3 The converged parameter tells you if the algorithm converged. You should only trust the output if this is TRUE.
  • 4 The family (gaussian) and link function (identity) tell you that the model was fit with the same distributional assumptions as a standard linear regression.
  • 5 The parametric coefficients are the linear terms (in this example, only the constant term). This section of the summary tells you which linear terms were significantly different from 0.
  • 6 The smooth terms are the non-linear terms. This section of the summary tells you which non-linear terms were significantly different from 0. It also tells you the effective degrees of freedom (edf) used to build each smooth term. An edf near 1 indicates that the variable has an approximately linear relationship to the output.
  • 7 R-sq.(adj) is the adjusted R-squared. “Deviance explained” is the raw R-squared (0.834).
  • 8 Calculates the RMSE of this model on the training data
  • 9 Plots y versus prediction

The resulting model’s predictions are plotted versus true response in figure 10.15. This fit is much better: the model explains over 80% of the variance (R-squared of 0.83), and the root mean squared error (RMSE) over the training data is less than half the RMSE of the linear model. Note that the points in figure 10.15 are distributed more or less evenly around the line of perfect prediction. The GAM has been fit to be homoscedastic, and any given prediction is as likely to be an overprediction as an underprediction.

Figure 10.15. GAM’s predictions vs. actual response. The solid line is the theoretical line of perfect prediction (prediction == actual).

Modeling linear relationships using gam()

By default, gam() will perform standard linear regression. If you were to call gam() with the formula y ~ x, you’d get the same model that you got using lm(). More generally, the call gam(y ~ x1 + s(x2), data=...) would model the variable x1 as having a linear relationship with y, and try to fit the best possible smooth curve to model the relationship between x2 and y. Of course, the best smooth curve could be a straight line, so if you’re not sure whether the relationship between x and y is linear, you can use s(x). If you see that the coefficient has an edf (effective degrees of freedom—see the model summary in listing 10.14) of about 1, then you can try refitting the variable as a linear term.
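Here is a minimal sketch of mixing linear and smooth terms; the data frame d and the variables x1 and x2 are invented for illustration.

library(mgcv)

set.seed(1234)
d <- data.frame(x1 = runif(500), x2 = runif(500))          # invented data
d$y <- 2 * d$x1 + sin(6 * d$x2) + rnorm(500, sd = 0.1)     # linear in x1, non-linear in x2

mixed_model <- gam(y ~ x1 + s(x2), data = d)               # model x1 as linear, x2 as a smooth term
summary(mixed_model)                                       # x1 appears under parametric coefficients;
                                                           # s(x2) appears under smooth terms with edf > 1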

The use of splines gives GAMs a richer model space to choose from; this increased flexibility brings a higher risk of overfitting. You should also check the models’ performances on the test data.

Listing 10.15. Comparing linear regression and GAM performance
test <- transform(test,                                  1
                  pred_lin = predict(lin_model, test),
                  pred_gam = predict(gam_model, test) )

test <- transform(test,                                  2
                  resid_lin = y - pred_lin,
                  resid_gam = y - pred_gam)

rmse(test$resid_lin)                                     3
## [1] 2.792653

rmse(test$resid_gam)
## [1] 1.401399

library(sigr)                                            4
 wrapFTest(test, "pred_lin", "y")$R2
## [1] 0.115395
wrapFTest(test, "pred_gam", "y")$R2
## [1] 0.777239

  • 1 Gets predictions from both models on the test data. The function transform() is a base R version of dplyr::mutate().
  • 2 Calculates the residuals
  • 3 Compares the RMSE of both models on the test data
  • 4 Compares the R-squared of both models on the test data, using the sigr package

The GAM performed similarly on both training and test sets: RMSE of 1.40 on test versus 1.45 on training; R-squared of 0.78 on test versus 0.83 on training. So there’s likely no overfit.

10.2.3. Extracting the non-linear relationships

Once you fit a GAM, you’ll probably be interested in what the s() functions look like. Calling plot() on a GAM will give you a plot for each s() curve, so you can visualize non-linearities. In our example, plot(gam_model) produces the top curve in figure 10.16.

Figure 10.16. Top: The non-linear function s(x) discovered by gam(), as output by plot(gam_model). Bottom: The same spline superimposed over the training data.

The shape of the curve is quite similar to the scatter plot we saw in figure 10.13 (which is reproduced as the lower half of figure 10.16). In fact, the spline that’s superimposed on the scatter plot in figure 10.13 is the same curve.

You can extract the data points that were used to make this graph by using the predict() function with the argument type = "terms". This produces a matrix where the ith column represents s(x[,i]). The following listing demonstrates how to reproduce the lower plot in figure 10.16.

Listing 10.16. Extracting a learned spline from a GAM
sx <- predict(gam_model, type = "terms")
summary(sx)
##       s(x)
##  Min.   :-17.527035
##  1st Qu.: -2.378636
##  Median :  0.009427
##  Mean   :  0.000000
##  3rd Qu.:  2.869166
##  Max.   :  4.084999

xframe <- cbind(train, sx = sx[,1])

ggplot(xframe, aes(x = x)) +
     geom_point(aes(y = y), alpha = 0.4) +
     geom_line(aes(y = sx))

Now that you’ve worked through a simple example, you are ready to try a more realistic example with more variables.

10.2.4. Using GAM on actual data

Example

Suppose you want to predict a newborn baby’s weight (DBWT) from a number of variables:

  • Mother’s weight (PWGT)
  • Mother’s pregnancy weight gain (WTGAIN)
  • Mother’s age (MAGER)
  • The number of prenatal medical visits (UPREVIS)

For this example, you’ll use data from the 2010 CDC natality dataset that you used in section 7.2 (though this is not the risk data used in that chapter).[7] Note that we’ve chosen this example to highlight the mechanisms of gam(), not to find the best model for birth weight. Adding other variables beyond the four we’ve chosen will improve the fit, but obscure the exposition.

7

The dataset can be found at https://github.com/WinVector/PDSwR2/blob/master/CDC/NatalBirthData.rData. A script for preparing the dataset from the original CDC extract can be found at https://github.com/WinVector/PDSwR2/blob/master/CDC/prepBirthWeightData.R.

In the next listing, you’ll fit a linear model and a GAM, and compare.

Listing 10.17. Applying linear regression (with and without GAM) to health data
library(mgcv)
library(ggplot2)
load("NatalBirthData.rData")
train <- sdata[sdata$ORIGRANDGROUP <= 5, ]
test <- sdata[sdata$ORIGRANDGROUP > 5, ]

form_lin <- as.formula("DBWT ~ PWGT + WTGAIN + MAGER + UPREVIS")
linmodel <- lm(form_lin, data = train)                             1

summary(linmodel)

## Call:
## lm(formula = form_lin, data = train)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -3155.43  -272.09    45.04   349.81  2870.55
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2419.7090    31.9291  75.784  < 2e-16 ***
## PWGT           2.1713     0.1241  17.494  < 2e-16 ***
## WTGAIN         7.5773     0.3178  23.840  < 2e-16 ***
## MAGER          5.3213     0.7787   6.834  8.6e-12 ***
## UPREVIS       12.8753     1.1786  10.924  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 562.7 on 14381 degrees of freedom
## Multiple R-squared:  0.06596,    Adjusted R-squared:  0.0657    2
## F-statistic: 253.9 on 4 and 14381 DF,  p-value: < 2.2e-16
form_gam <- as.formula("DBWT ~ s(PWGT) + s(WTGAIN) +
                        s(MAGER) + s(UPREVIS)")
gammodel <- gam(form_gam, data = train)                            3
gammodel$converged                                                 4
## [1] TRUE

summary(gammodel)

##
## Family: gaussian
## Link function: identity
##
## Formula:
## DBWT ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS)
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3276.948      4.623   708.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##              edf Ref.df       F  p-value
## s(PWGT)    5.374  6.443  69.010  < 2e-16 ***
## s(WTGAIN)  4.719  5.743 102.313  < 2e-16 ***
## s(MAGER)   7.742  8.428   7.145 1.37e-09 ***
## s(UPREVIS) 5.491  6.425  48.423  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) =  0.0927   Deviance explained = 9.42%               5
## GCV = 3.0804e+05  Scale est. = 3.0752e+05  n = 14386

  • 1 Builds a linear model with four variables
  • 2 The model explains about 6.6% of the variance; all coefficients are significantly different from 0.
  • 3 Builds a GAM with the same variables
  • 4 Verifies that the model has converged
  • 5 The model explains a little over 9% of the variance; all variables have a non-linear effect significantly different from 0.

The GAM has improved the fit, and all four variables seem to have a non-linear relationship with birth weight, as evidenced by edfs all greater than 1. You could use plot(gammodel) to examine the shape of the s() functions; instead, let's compare them with a direct smoothing curve of birth weight against each input variable.

Listing 10.18. Plotting GAM results
terms <- predict(gammodel, type = "terms")               1
terms <- cbind(DBWT = train$DBWT, terms)                 2

tframe <- as.data.frame(scale(terms, scale = FALSE))     3
colnames(tframe) <- gsub('[()]', '', colnames(tframe))   4

vars <- c("PWGT", "WTGAIN", "MAGER", "UPREVIS")
pframe <- cbind(tframe, train[, vars])                   5

ggplot(pframe, aes(PWGT)) +                              6
  geom_point(aes(y = sPWGT)) +
  geom_smooth(aes(y = DBWT), se = FALSE)

# [...]                                                  7

  • 1 Gets the matrix of s() functions
  • 2 Binds in the birth weight (DBWT)
  • 3 Shifts all the columns to be zero mean (to make comparisons easy); converts to a data frame
  • 4 Makes the column names reference-friendly (s(PWGT) is converted to sPWGT, etc.)
  • 5 Binds in the input variables
  • 6 Compares the spline s(PWGT) to the smoothed curve of DBWT (baby’s weight) as a function of mother’s weight (PWGT)
  • 7 Repeats for the remaining variables (omitted for brevity)

Figure 10.17 shows the s() splines learned by gam() as the dotted curves. These splines are gam()’s estimate of the (joint) relationship between each variable and the outcome, DBWT. The sum of the splines (plus an offset) is the model’s best estimate of DBWT as a function of the input variables.

Figure 10.17. Smoothing curves of each of the four input variables plotted against birth weight, compared with the splines discovered by gam(). All curves have been shifted to be zero mean for comparison of shape.

The figure also shows the smoothing curves that directly relate each variable to DBWT. The smooth curves in each case are similar to the corresponding s() in shape, and non-linear for all the variables. The differences in shape are because the splines are fit jointly (which is more useful for modeling), and the smoothing curves are merely calculated one at a time.
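
As a quick check of the claim that the sum of the splines plus an offset gives the model’s prediction (a sketch, not from the original text), you can sum the extracted terms and add the model’s intercept; for this Gaussian, identity-link GAM, the result should match the model’s predictions up to numerical noise:

sterms <- predict(gammodel, type = "terms")       # the s() contributions on the training data
intercept <- coef(gammodel)[["(Intercept)"]]      # the parametric offset
recovered <- rowSums(sterms) + intercept          # sum of splines plus offset

summary(abs(recovered - predict(gammodel, newdata = train)))   # should be essentially zero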

As usual, you should check for overfit with holdout data.

Listing 10.19. Checking GAM model performance on holdout data
test <- transform(test,                                 1
                  pred_lin = predict(linmodel, test),
                  pred_gam = predict(gammodel, test) )

test <- transform(test,                                 2
                  resid_lin = DBWT - pred_lin,
                  resid_gam = DBWT - pred_gam)

rmse(test$resid_lin)                                    3
## [1] 566.4719

rmse(test$resid_gam)
## [1] 558.2978

wrapFTest(test, "pred_lin", "DBWT")$R2                  4
## [1] 0.06143168

wrapFTest(test, "pred_gam", "DBWT")$R2
## [1] 0.08832297

  • 1 Gets predictions from both models on test data
  • 2 Gets the residuals
  • 3 Compares the RMSE of both models on the test data
  • 4 Compares the R-squared of both models on the test data, using sigr

The linear model and the GAM performed similarly on the test set, just as they did on the training set, so in this example there’s no substantial overfit.

10.2.5. Using GAM for logistic regression

The gam() function can be used for logistic regression as well.

Example

Suppose you want to predict when a baby will be born underweight (defined as DBWT < 2000), using the same input variables as the previous scenario.

The logistic regression call to do this is shown in the following listing.

Listing 10.20. GLM logistic regression
form <- as.formula("DBWT < 2000 ~ PWGT + WTGAIN + MAGER + UPREVIS")
logmod <- glm(form, data = train, family = binomial(link = "logit"))

The corresponding call to gam() also specifies the binomial family with the logit link.

Listing 10.21. GAM logistic regression
form2 <- as.formula("DBWT < 2000 ~ s(PWGT) + s(WTGAIN) +
                                              s(MAGER) + s(UPREVIS)")
glogmod <- gam(form2, data = train, family = binomial(link = "logit"))
glogmod$converged
## [1] TRUE

summary(glogmod)
## Family: binomial
## Link function: logit
##
## Formula:
## DBWT < 2000 ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS)
##
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.94085    0.06794     -58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##              edf Ref.df  Chi.sq  p-value
## s(PWGT)    1.905  2.420   2.463  0.36412                 1
## s(WTGAIN)  3.674  4.543  64.426 1.72e-12 ***
## s(MAGER)   1.003  1.005   8.335  0.00394 **
## s(UPREVIS) 6.802  7.216 217.631  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) =  0.0331   Deviance explained = 9.14%        2
## UBRE score = -0.76987  Scale est. = 1         n = 14386

  • 1 Note the large p-value associated with mother’s weight (PWGT). That means that there’s no statistical evidence that the mother’s weight has a significant effect on the outcome.
  • 2 “Deviance explained” is the pseudo R-squared: 1 - (deviance/null.deviance).

As with the standard logistic regression call, we recover the class probabilities with the call predict(glogmod, newdata = train, type = "response"). Again, these models are of low quality; in practice, we would look for more explanatory variables to build better screening models.
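
To compare the two classifiers on the holdout data, one simple approach (a sketch, not from the original text; the helper deviance_of is ours) is to compute the deviance-based pseudo R-squared directly from the predicted probabilities:

deviance_of <- function(y, p) {              # deviance of probabilities p against 0/1 outcomes y
  -2 * sum(y * log(p) + (1 - y) * log(1 - p))
}

y_test <- with(test, DBWT < 2000)
p_null <- mean(y_test)                       # null model: predict the base rate for everyone
p_glm <- predict(logmod, newdata = test, type = "response")
p_gam <- predict(glogmod, newdata = test, type = "response")

1 - deviance_of(y_test, p_glm) / deviance_of(y_test, p_null)   # pseudo R-squared of the GLM on test
1 - deviance_of(y_test, p_gam) / deviance_of(y_test, p_null)   # pseudo R-squared of the GAM on test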

10.2.6. GAM takeaways

Here’s what you should remember about GAMs:

  • GAMs let you represent non-linear and non-monotonic relationships between variables and outcome in a linear or logistic regression framework.
  • In the mgcv package, you can extract the discovered relationship from the GAM model using the predict() function with the type = "terms" parameter.
  • You can evaluate the GAM with the same measures you’d use for standard linear or logistic regression: residuals, deviance, R-squared, and pseudo R-squared. The gam() summary also gives you an indication of which variables have a significant effect on the model.
  • Because GAMs have increased complexity compared to standard linear or logistic regression models, there’s more risk of overfit.

GAMs extend linear methods (and generalized linear methods) by allowing variables to have non-linear (or even non-monotone) effects on outcome. Another approach is to form new variables from non-linear combinations of existing variables. The data scientist can do this by hand, by adding interactions or new synthetic variables, or it can be done mechanically, by support vector machines (SVMs), as shown in the next section. The hope is that with access to enough of these new variables, your modeling problem becomes easier.

In the next section, we’ll work with two of the most popular ways to add and manage new variables: kernel methods and support vector machines.

10.3. Solving “inseparable” problems using support vector machines

Some classification problems are called inseparable: instances of one class, A, are inside regions bounded by another class, B, so that class A can’t be separated from class B by a flat boundary. For example, in figure 10.18, we see a number of o’s inside a triangle defined by x’s (and we also see the data converted to a nice separable arrangement by a so-called kernel function phi()). The original arrangement on the left side is linearly inseparable: there is no hyperplane that separates the x’s from the o’s. Hence, it would be impossible for linear methods to completely separate the two classes. We could use the tree-based methods demonstrated in section 10.1 to fit a classifier, or we could use techniques called kernel methods. In this section, we will use SVMs and kernel methods to build good classifiers on linearly inseparable data.

Figure 10.18. Notional illustration of a kernel transform (based on Cristianini and Shawe-Taylor, 2000)

Having more than one way to do things

At this point, we have seen a number of advanced methods that give us more than one way to handle complex problems. For example: random forests, boosting, and SVMs can all introduce variable interactions to solve problems. It would be nice if there were always one obvious best method. However, each of these methodologies can dominate for different problems. So there is no one best method.

Our advice is to try simple methods, such as linear and logistic regression, first. Then bring in advanced methods such as GAMs (which can handle single-variable reshaping), tree-based methods (which can handle many-variable interactions), and SVMs (which can handle many-variable reshapings) to address modeling issues.

10.3.1. Using an SVM to solve a problem

Let’s start with an example adapted from R’s kernlab library documentation. Learning to separate two spirals is a famous “impossible” problem that cannot be solved by linear methods (though it is solvable by spectral clustering, kernel methods, SVMs, and deep learning or deep neural nets).

Example

Figure 10.19 shows two spirals, one within the other. Your task is to build a decision procedure that cuts up the plane such that the 1-labeled examples are in one region and the 2-labeled examples are in the complementary region.[8]

8

See K. J. Lang and M. J. Witbrock, “Learning to tell two spirals apart” in Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski (eds), Morgan Kaufmann, 1988 (pp. 52–59).

Support vector machines excel at learning concepts of the form “examples that are near each other should be given the same classification.” To use the SVM technique, the user must choose a kernel (to control what is considered "near" or "far"), and pick a value for a hyperparameter called C or nu (to try to control model complexity).

Spiral example

Listing 10.22 shows the recovery and labeling of the two spirals shown in figure 10.19. You will use the labeled data for the example task: given the labeled data, recover the 1 versus 2 regions by supervised machine learning.

Figure 10.19. The spiral counterexample

Listing 10.22. Setting up the spirals data as a classification problem
library('kernlab')
data(spirals)                                               1
sc <- specc(spirals, centers = 2)                           2
s <- data.frame(x = spirals[, 1], y = spirals[, 2],         3
                class = as.factor(sc))

library('ggplot2')
ggplot(data = s) +                                          4
                geom_text(aes(x = x, y = y,
                label = class, color = class)) +
  scale_color_manual(values = c("#d95f02", "#1b9e77")) +
  coord_fixed() +
  theme_bw() +
  theme(legend.position  = 'none') +
  ggtitle("example task: separate the 1s from the 2s")

  • 1 Loads the kernlab kernel and SVM package and then asks that the included example spirals be made available
  • 2 Uses kernlab’s spectral clustering routine to identify the two different spirals in the example dataset
  • 3 Combines the spiral coordinates and the spiral label into a data frame
  • 4 Plots the spirals with class labels

Figure 10.19 shows the labeled spiral dataset. Two classes (represented by digits) of data are arranged in two interwoven spirals. This dataset is difficult for methods that don’t have a rich enough concept space (perceptrons, shallow neural nets) and easy for more-sophisticated learners that can introduce the right new features. Support vector machines, with the right kernel, are a way to introduce new composite features in order to solve the problem.

Support vector machines with an oversimple kernel

Support vector machines are powerful, but without the correct kernel, they have difficulty with some concepts (such as the spiral example). Listing 10.23 shows a failed attempt to learn the spiral concept with an SVM using the identity or dot-product (linear) kernel. The linear kernel does no transformations on the data; it can work for some applications, but in this case it does not give us the data-separating properties we want.

Listing 10.23. SVM with a poor choice of kernel
set.seed(2335246L)
s$group <- sample.int(100, size = dim(s)[[1]], replace = TRUE)
sTrain <- subset(s, group > 10)
sTest <- subset(s, group <= 10)                                            1

library('e1071')
mSVMV <- svm(class ~ x + y, data = sTrain,
             kernel = 'linear', type = 'nu-classification')                2
sTest$predSVMV <- predict(mSVMV, newdata = sTest, type = 'response')       3

shading <- expand.grid(                                                    4
  x = seq(-1.5, 1.5, by = 0.01),
  y = seq(-1.5, 1.5, by = 0.01))
shading$predSVMV <- predict(mSVMV, newdata = shading, type = 'response')

ggplot(mapping = aes(x = x, y = y)) +                                      5
  geom_tile(data = shading, aes(fill = predSVMV),
            show.legend = FALSE, alpha = 0.5) +
  scale_color_manual(values = c("#d95f02", "#1b9e77")) +
  scale_fill_manual(values = c("white", "#1b9e77")) +
  geom_text(data = sTest, aes(label = predSVMV),
            size = 12) +
  geom_text(data = s, aes(label = class, color = class),
            alpha = 0.7) +
  coord_fixed() +
  theme_bw() +
  theme(legend.position = 'none') +
  ggtitle("linear kernel")

  • 1 Prepares to try to learn spiral class label from coordinates using an SVM
  • 2 Builds the support vector model using a linear (dot product) kernel, which is not a good kernel for this problem
  • 3 Uses the model to predict class on held-out data
  • 4 Calls the model on a grid of points to generate background shading indicating the learned concept
  • 5 Plots the predictions on top of a grey copy of all the data so we can see if predictions agree with the original markings

This attempt results in figure 10.20. The figure shows the total dataset in a small font and the SVM classifications of the test dataset in large text. It also indicates the learned concept by shading. The SVM didn’t produce a good model with the identity kernel, as it was forced to pick a linear separator. In the next section, you’ll repeat the process with the Gaussian radial kernel and get a much better result.

Figure 10.20. Identity kernel failing to learn the spiral concept

Support vector machines with a good kernel

In listing 10.24, you’ll repeat the SVM fitting process, but this time specifying the Gaussian or radial kernel. Figure 10.21 again plots the SVM test classifications in black (with the entire dataset in a smaller font). Note that this time the algorithm correctly learned the actual spiral concept, as indicated by the shading.

Figure 10.21. Radial kernel successfully learning the spiral concept

Listing 10.24. SVM with a good choice of kernel
mSVMG <- svm(class ~ x + y, data = sTrain,
             kernel = 'radial', type = 'nu-classification')        1
sTest$predSVMG <- predict(mSVMG, newdata = sTest, type = 'response')

shading <- expand.grid(
  x = seq(-1.5, 1.5, by = 0.01),
  y = seq(-1.5, 1.5, by = 0.01))
shading$predSVMG <- predict(mSVMG, newdata = shading, type = 'response')

ggplot(mapping = aes(x = x, y = y)) +
  geom_tile(data = shading, aes(fill = predSVMG),
            show.legend = FALSE, alpha = 0.5) +
  scale_color_manual(values = c("#d95f02", "#1b9e77")) +
  scale_fill_manual(values = c("white", "#1b9e77")) +
  geom_text(data = sTest, aes(label = predSVMG),
            size = 12) +
  geom_text(data = s, aes(label = class, color = class),
            alpha = 0.7) +
  coord_fixed() +
  theme_bw() +
  theme(legend.position = 'none') +
  ggtitle("radial/Gaussian kernel")

  • 1 This time uses the “radial” or Gaussian kernel, which is a nice geometric distance measure

Exercise: Try to use xgboost to solve the spirals problem.

As we stated, some methods work better on some problems than others. Try to use the xgboost package to solve the spirals problem. Do you find the xgboost results to be better or worse than the SVM results? (A worked version of this example can be found here: https://github.com/WinVector/PDSwR2/tree/master/Spirals.)
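
As a starting point, here is a minimal, untuned sketch using the xgboost package on the same sTrain/sTest split from listing 10.23 (the number of rounds is just a guess, not a recommendation):

library(xgboost)

input <- as.matrix(sTrain[, c("x", "y")])        # xgboost expects a numeric matrix
label <- as.numeric(sTrain$class == "2")         # recode the class factor as 0/1

model_xgb <- xgboost(data = input, label = label,
                     nrounds = 50,               # untuned; cross-validate for real work
                     objective = "binary:logistic",
                     verbose = 0)

sTest$pred_xgb <- predict(model_xgb,
                          as.matrix(sTest[, c("x", "y")]))   # predicted probability of class 2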

10.3.2. Understanding support vector machines

An SVM is often portrayed as a magic machine that makes classification easier.[9] To dispel the awe and be able to use support vector methods with confidence, we need to take some time to learn their principles and how they work. The intuition is this: SVMs with the radial kernel are very good nearest-neighbor-style classifiers.

9

Support vector machines can also be used for regression, but we will not cover that here.

In figure 10.22, in the “real space” (on the left), the data is separated by a non-linear boundary. When the data is lifted into the higher-dimensional kernel space (on the right), the lifted points are separated by a hyperplane. Let’s call the normal to that hyperplane w and the offset from the origin b (not shown).

Figure 10.22. Notional illustration of SVM

An SVM finds a linear decision function (determined by parameters w and b), where for a given example x the machine decides x is in the class if

w %*% phi(x) + b >= 0

for some w and b, and not in the class otherwise. The model is completely determined by the function phi(), the vector w, and the scalar offset b. The idea is that phi() lifts or reshapes the data into a nicer space (where things are linearly separable), and then the SVM finds a linear boundary separating the two data classes in this new space (represented by w and b). This linear boundary in the lifted space can be pulled back as a general curved boundary in the original space. The principle is sketched out in figure 10.22.

The support vector training operation finds w and b. There are variations on the SVM that make decisions between more than two classes, perform scoring/regression, and detect novelty. But we’ll discuss only the SVMs for simple classification.

As a user of SVMs, you don’t immediately need to know how the training procedure works; that’s what the software does for you. But you do need to have some notion of what it’s trying to do. The model w,b is ideally picked so that

w %*% phi(x) + b >= u

for all training xs that were in the class, and

w %*% phi(x) + b <= v

for all training examples not in the class.

The data is called separable if u > v. The size of the separation is (u - v) / sqrt(w %*% w), and is called the margin. The goal of the SVM optimizer is to maximize the margin. A large margin can actually ensure good behavior on future data (good generalization performance). In practice, real data isn’t always separable even in the presence of a kernel. To work around this, most SVM implementations implement the so-called soft margin optimization goal.

A soft margin optimizer adds additional error terms that are used to allow a limited fraction of the training examples to be on the wrong side of the decision surface.[10] The model doesn’t actually perform well on the altered training examples, but trades the error on these examples against increased margin on the remaining training examples. For most implementations, the model hyperparameter C or nu determines the trade-off between margin width for the remaining data and how much data is pushed around to achieve the margin. We will use the nu hyperparameter. nu takes settings between zero and one; lower values allow fewer training misclassifications, favoring more-complex models (more support vectors).[11] For our example, we will just use the function default value: 0.5.

10

A common type of dataset that is inseparable under any kernel is a dataset where there are at least two examples belonging to different outcome classes with the exact same values for all input or x variables. The original “hard margin” SVM couldn’t deal with this sort of data and was for that reason not considered to be practical.

11

For more details on SVMs, we recommend Cristianini and Shawe-Taylor’s An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.

10.3.3. Understanding kernel functions

The SVM picks which data is unimportant (left out) and which is very important (used as support vectors). But the reshaping of the problem to make the data separable is actually performed by what are called kernel methods or kernel functions.

Figure 10.22 illustrates[12] what we hope for from a good kernel: our data being pushed around so it’s easier to sort or classify. By using a kernel transformation, we move to a situation where the distinction we’re trying to learn is representable by a linear separator of our transformed data.

12

Cristianini and Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods.

To begin to understand SVMs, we need to take a quick look at the common math and terminology that a user of SVMs and kernel methods should be conversant with. First is the notion of a kernel function, which is used to implement the phi() we saw reshaping space.

Formal definition of a kernel function

In our application, a kernel is a function with a very specific definition. Let u and v be any pair of variables. u and v are typically vectors of input or independent variables (possibly taken from two rows of a dataset). A function k(,) that maps pairs (u,v) to numbers is called a kernel function if and only if there is some function phi() mapping (u,v)s to a vector space such that k(u,v) = phi(u) %*% phi(v) for all u,v.[13] We’ll informally call the expression k(u,v) = phi(u) %*% phi(v) the Mercer expansion of the kernel (in reference to Mercer’s theorem; see http://mng.bz/xFD2) and consider phi() the certificate that tells us k(,) is a good kernel. This is much easier to understand from a concrete example. In the following listing, we show an equivalent phi() / k(,) pair.

13

%*% is R’s notation for dot product or inner product; see help('%*%') for details. Note that phi() is allowed to map to very large (and even infinite) vector spaces.

Listing 10.25. An artificial kernel example
u <- c(1, 2)
v <- c(3, 4)
k <- function(u, v) {                   1
      u[1] * v[1] +
        u[2] * v[2] +
        u[1] * u[1] * v[1] * v[1] +
        u[2] * u[2] * v[2] * v[2] +
        u[1] * u[2] * v[1] * v[2]
  }
phi <- function(x) {                    2
      x <- as.numeric(x)
     c(x, x*x, combn(x, 2, FUN = prod))
  }
print(k(u, v))                          3
## [1] 108
print(phi(u))
## [1] 1 2 1 4 2
print(phi(v))
## [1]  3  4  9 16 12
print(as.numeric(phi(u) %*% phi(v)))    4
## [1] 108

  • 1 Defines a function of two vector variables (both two dimensional) as the sum of various products of terms
  • 2 Defines a function of a single vector variable that returns a vector containing the original entries plus all products of entries
  • 3 Example evaluation of k (,)
  • 4 Confirms phi() agrees with k(,). phi() is the certificate that shows k(,) is in fact a kernel.

Most kernel methods use the function k(,) directly and only use properties of k(,) guaranteed by the matching phi() to ensure method correctness. The k(,) function is usually quicker to compute than the notional function phi(). A simple example of this is what we’ll call the dot-product similarity of documents. The dot-product document similarity is defined as the dot product of two vectors where each vector is derived from a document by building a huge vector of indicators, one for each possible feature. For instance, if the features you’re considering are word pairs, then for every pair of words in a given dictionary, the document gets a feature of 1 if the pair occurs as a consecutive utterance in the document and 0 if not. This method is the phi(), but in practice we never use the phi() procedure. Instead, when comparing two documents, each consecutive pair of words in one document is generated and a bit of score is added if this pair is both in the dictionary and found consecutively in the other document. For moderate-sized documents and large dictionaries, this direct k(,) implementation is vastly more efficient than the phi() implementation.
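
The following sketch (our own illustration, not from the text above, and ignoring the dictionary restriction for simplicity) makes this concrete: a direct k(,) that counts the consecutive word pairs shared by two documents, which equals the dot product of the huge 0/1 indicator vectors phi() would build, without ever building them.

word_pairs <- function(doc) {                  # the distinct consecutive word pairs in a document
  words <- strsplit(tolower(doc), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  unique(paste(words[-length(words)], words[-1]))
}

k_docs <- function(doc_u, doc_v) {             # k(u, v): number of shared consecutive word pairs
  length(intersect(word_pairs(doc_u), word_pairs(doc_v)))
}

k_docs("the cat sat on the mat", "the cat sat by the door")   # 2: "the cat" and "cat sat"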

The support vectors

The support vector machine gets its name from how the vector w is usually represented: as a linear combination of training examples—the support vectors. Recall we said in section 10.3.3 that the function phi() is allowed, in principle, to map into a very large or even infinite vector space. This means it may not be possible to directly write down w.

Support vector machines work around the “can’t write down w” issue by restricting to ws that are in principle a sum of phi() terms as shown here:

w = sum(a1 * phi(s1), ... , am * phi(sm))

The vectors s1, ..., sm are actually m training examples and are called the support vectors. The preceding formulation helps because such sums are (with some math) equivalent to sums of kernel terms k(si, x), of the form we show next:

w %*% phi(x) + b = sum(a1 * k(s1, x), ..., am * k(sm, x)) + b

The right side is a quantity we can compute.

The work of the support vector training algorithm is to pick the vectors s1, ..., sm, the scalars a1, ..., am, and the offset b. All of this is called “the kernel trick.”

What to remember about a support vector model

A support vector model consists of these things:

  • A kernel phi() that reshapes space (chosen by the user)
  • A subset of training data examples, called the support vectors (chosen by the SVM algorithm)
  • A set of scalars a1, ..., am that specify what linear combination of the support vectors define the separating surface (chosen by the SVM algorithm)
  • A scalar threshold b we compare to (chosen by the SVM algorithm)

The reason why the data scientist must be aware of the support vectors is that they’re stored in the support vector model. For example, with too complex a model, there can be a very large number of support vectors, causing the model to be large and expensive to evaluate. In the worst case, the number of support vectors in the model can be almost as large as the number of training examples, making support vector model evaluation potentially as expensive as nearest-neighbor evaluation, and increasing the risk of overfit. The user controls the number of support vectors indirectly, by choosing a good value of C or nu through cross-validation.
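
With the e1071 package used in this section, you can check this directly on a fitted model (a quick sketch, assuming the mSVMG model from listing 10.24 is still in your workspace):

nrow(mSVMG$SV)       # number of support vectors kept by the radial-kernel spiral model
nrow(sTrain)         # number of training examples, for comparison
head(mSVMG$index)    # row indices of the training examples used as support vectors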

Exercise: Try different values of nu on the spirals problem.

nu is the important hyperparameter for SVMs. Ideally, we should cross-validate for a good value of nu. Instead of full cross-validation, just try a few values of nu to get the landscape. (We have a worked solution here: https://github.com/WinVector/PDSwR2/tree/master/Spirals.)
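
A minimal sketch for this exercise, reusing sTrain and sTest from listing 10.23 (the candidate nu values below are just illustrative):

library(e1071)

accuracy_for_nu <- function(nu) {              # held-out accuracy of a radial-kernel SVM at a given nu
  model <- svm(class ~ x + y, data = sTrain,
               kernel = 'radial', type = 'nu-classification', nu = nu)
  mean(predict(model, newdata = sTest) == sTest$class)
}

nus <- c(0.05, 0.1, 0.25, 0.5, 0.75)
data.frame(nu = nus,
           accuracy = vapply(nus, accuracy_for_nu, numeric(1)))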

10.3.4. Support vector machine and kernel methods takeaways

Here’s what you should remember from this section:

  • Support vector machines are a kernel-based classification approach where a complex separating surface is parameterized in terms of a (possibly very large) subset of the training examples (called the support vectors).
  • The goal of “the kernel trick” is to lift the data into a space where the data is separable, or where linear methods can be used directly. Support vector machines and kernel methods work best when the problem has a moderate number of variables and the data scientist suspects that the relation to be modeled is a non-linear combination of variable effects.

Summary

In this chapter, we demonstrated some advanced methods to fix specific issues with basic modeling approaches: modeling variance, modeling bias, issues with non-linearity, and issues with variable interactions. An important additional family of methods we wish we had time to touch on is deep learning, the improved modern treatment of neural nets. Fortunately there is already a good book we can recommend on this topic: Deep Learning with R, by François Chollet with J. J. Allaire, Manning, 2018.

You should understand that you bring in advanced methods and techniques to fix specific modeling problems, not because they have exotic names or exciting histories. We also feel you should at least try to find an existing technique to fix a problem you suspect is hiding in your data before building your own custom technique; often the existing technique already incorporates a lot of tuning and wisdom. Which method is best depends on the data, and there are many advanced methods to try. Advanced methods can help fix overfit, variable interactions, non-additive relations, and unbalanced distributions, but not lack of features or data.

Finally, the goal of learning the theory of advanced techniques is not to be able to recite the steps of the common implementations, but to know when the techniques apply and what trade-offs they represent. The data scientist needs to supply thought and judgment and realize that the platform can supply implementations.

In this chapter you have learned

  • How to bag decision trees to stabilize their models and improve prediction performance
  • How to further improve decision-tree-based models by using random forests or gradient boosting
  • How to use random forest variable importances to help with variable selection
  • How to use generalized additive models to better model non-linear relationships between inputs and outputs in the context of linear and logistic regression
  • How to use support vector machines with the Gaussian kernel to model classification tasks with complex decision surfaces, especially nearest-neighbor-style tasks.

The actual point of a modeling project is to deliver results for production deployment and to present useful documentation and evaluations to your partners. The next part of this book will address best practices for delivering your results.
