This chapter covers
In chapter 7, you learned about linear methods for fitting predictive models. These models are the bread-and-butter methods of machine learning; they are easy to fit; they are small, portable, and efficient; they sometimes provide useful advice; and they can work well in a wide variety of situations. However, they also make strong assumptions about the world: namely, that the outcome is linearly related to all the inputs, and all the inputs contribute additively to the outcome. In this chapter, you will learn about methods that relax these assumptions.
Figure 10.1 represents our mental model for what we'll do in this chapter: use R to master the science of building supervised machine learning models.
Suppose you want to study the relationship between mortality rates and measures of a person’s health or fitness, including BMI (body mass index).
Figure 10.2 shows the relationship between BMI and mortality hazard ratio for a population of older Thais over a four-year period.[1] It shows that both high and low BMI are associated with higher mortality rates: the relationship between BMI and mortality is not linear. So a straightforward linear model to predict mortality rates based (partly) on BMI may not perform well.
Data from Vapattanawong, et.al. “Obesity and mortality among older Thais: a four year follow up study,” BMC Public Health, 2010. https://doi.org/10.1186/1471-2458-10-604.
In addition, there may be interactions between BMI and other factors, like how active a person is. For example, for people who are highly active, BMI may affect mortality rates much less than for people who are sedentary. Some interactions, such as “if-then” relationships among variables, or multiplicative effects between variables, may not always be expressible in linear models.[2]
One can model interactions in linear models, but it must be done explicitly by the data scientist. Instead, we’ll focus on machine learning techniques, such as tree-based methods, that can learn at least certain types of interactions directly.
The machine learning techniques in this chapter use a variety of methods to address non-linearity, interactions, and other issues in modeling.
You saw an example of a basic decision tree model in chapter 1 (reproduced in figure 10.3). Decision trees are useful for both classification and regression, and they are an attractive method for a number of reasons:
On the other hand, decision trees do have some drawbacks:
See Lim, Loh, and Shih, “A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirtythree Old and New Classification Algorithms,” Machine Learning, 2000. 40, 203–229; online at http://mng.bz/qX06.
For these reasons, we don’t emphasize the use of basic decision trees in this book. However, there are a number of techniques to fix these weaknesses that lead to state-of-the-art, useful, and performant modeling algorithms. We’ll discuss some of these techniques in this section.
To motivate the discussion of tree-based methods, we’ll return to an example that we used in chapter 6 and build a basic decision tree.
Suppose you want to classify email into spam (email you do not want) and non-spam (email you want).
For this example, you’ll again use the Spambase dataset. The dataset consists of about 4,600 documents and 57 features that describe the frequency of certain keywords and characters. Here's the process:
Recall from discussions in chapters 6 and 7 that we want accuracy and F1 to be high, and deviance (which is similar to variance) to be low.
First, let’s load the data. As you did in section 6.2, download a copy of spamD.tsv from https://github.com/WinVector/PDSwR2/raw/master/Spambase/spamD.tsv. Then, write a few convenience functions and train a decision tree, as in the following listing.
spamD <- read.table('spamD.tsv', header = TRUE, sep = ' ') 1 spamD$isSpam <- spamD$spam == 'spam' spamTrain <- subset(spamD, spamD$rgroup >= 10) spamTest <- subset(spamD, spamD$rgroup < 10) spamVars <- setdiff(colnames(spamD), list('rgroup', 'spam', 'isSpam')) library(wrapr) spamFormula <- mk_formula("isSpam", spamVars) 2 loglikelihood <- function(y, py) { 3 pysmooth <- ifelse(py == 0, 1e-12, ifelse(py == 1, 1 - 1e-12, py)) sum(y * log(pysmooth) + (1 - y) * log(1 - pysmooth)) } accuracyMeasures <- function(pred, truth, name = "model") { 4 dev.norm <- -2 * loglikelihood(as.numeric(truth), pred) / length(pred) 5 ctable <- table(truth = truth, pred = (pred > 0.5)) 6 accuracy <- sum(diag(ctable)) / sum(ctable) precision <- ctable[2, 2] / sum(ctable[, 2]) recall <- ctable[2, 2] / sum(ctable[2, ]) f1 <- 2 * precision * recall / (precision + recall) data.frame(model = name, accuracy = accuracy, f1 = f1, dev.norm) } library(rpart) 7 treemodel <- rpart(spamFormula, spamTrain, method = "class") library(rpart.plot) 8 rpart.plot(treemodel, type = 5, extra = 6) predTrain <- predict(treemodel, newdata = spamTrain)[, 2] 9 trainperf_tree <- accuracyMeasures(predTrain, 10) spamTrain$spam == "spam", name = "tree, training") predTest <- predict(treemodel, newdata = spamTest)[, 2] testperf_tree <- accuracyMeasures(predTest, spamTest$spam == "spam",
The resulting decision tree model is shown in figure 10.4. The output of the two calls to accuracyMeasures() looks like the following:
library(pander) 1 panderOptions("plain.ascii", TRUE) 2 panderOptions("keep.trailing.zeros", TRUE) panderOptions("table.style", "simple") perf_justify <- "lrrr" perftable <- rbind(trainperf_tree, testperf_tree) pandoc.table(perftable, justify = perf_justify) ## ## ## model accuracy f1 dev.norm ## ---------------- ---------- -------- ---------- ## tree, training 0.8996 0.8691 0.6304 ## tree, test 0.8712 0.8280 0.7531
As expected, the accuracy and F1 scores both degrade on the test set, and the deviance increases.
One way to mitigate the shortcomings of decision tree models is by bootstrap aggregation, or bagging. In bagging, you draw bootstrap samples (random samples with replacement) from your data. From each sample, you build a decision tree model. The final model is the average of all the individual decision trees. This is shown in figure 10.5.[4]
Bagging, random forests, and gradient-boosted trees are variations of a general technique called ensemble learning. An ensemble model is composed of the combination of several smaller simple models (often small decision trees). Giovanni Seni and John Elder’s Ensemble Methods in Data Mining (Morgan & Claypool, 2010) is an excellent introduction to the general theory of ensemble learning.
To make this concrete, suppose that x is an input datum, y_i(x) is the output of the ith tree, c(y_1(x), y_2(x), ... y_n(x)) is the vector of individual outputs, and y is the output of the final model:
Bagging decision trees stabilizes the final model by lowering the variance; this improves the accuracy. A bagged ensemble of trees is also less likely to overfit the data.
Try bagging some tree models for the spam example.
ntrain <- dim(spamTrain)[1] n <- ntrain 1 ntree <- 100 samples <- sapply(1:ntree, 2 FUN = function(iter) { sample(1:ntrain, size = n, replace = TRUE) }) treelist <-lapply(1:ntree, 3 FUN = function(iter) { samp <- samples[, iter]; rpart(spamFormula, spamTrain[samp, ], method = "class") } ) predict.bag <- function(treelist, newdata) { 4 preds <- sapply(1:length(treelist), FUN = function(iter) { predict(treelist[[iter]], newdata = newdata)[, 2] }) predsums <- rowSums(preds) predsums / length(treelist) } pred <- predict.bag(treelist, newdata = spamTrain) trainperf_bag <- accuracyMeasures(pred, 5 spamTrain$spam == "spam", name = "bagging, training") pred <- predict.bag(treelist, newdata = spamTest) testperf_bag <- accuracyMeasures(pred, spamTest$spam == "spam", name = "bagging, test") perftable <- rbind(trainperf_bag, testperf_bag) pandoc.table(perftable, justify = perf_justify) ## ## ## model accuracy f1 dev.norm ## ------------------- ---------- -------- ---------- ## bagging, training 0.9167 0.8917 0.5080 ## bagging, test 0.9127 0.8824 0.5793
As you see, bagging improves accuracy and F1, and reduces deviance over both the training and test sets when compared to the single decision tree (you’ll see a direct comparison of the scores a little later on). There is also less degradation in the bagged model’s performance going from training to test than there is with the decision tree.
You can further improve prediction performance by going from bagging to random forests.
The proofs that bagging reduces variance are only valid for regression and for estimating class probabilities, not for classifiers (a model that only returns class membership, not class probabilities). Bagging a bad classifier can make it worse. So you definitely want to work over estimated class probabilities, if they’re at all available. But it can be shown that for CART trees (which is the decision tree implementation in R) under mild assumptions, bagging tends to increase classifier accuracy. See Clifton D. Sutton, “Classification and Regression Trees, Bagging, and Boosting,” Handbook of Statistics, Vol. 24 (Elsevier, 2005) for more details.
In bagging, the trees are built using randomized datasets, but each tree is built by considering the exact same set of features. This means that all the individual trees are likely to use very similar sets of features (perhaps in a different order or with different split values). Hence, the individual trees will tend to be overly correlated with each other. If there are regions in feature space where one tree tends to make mistakes, then all the trees are likely to make mistakes there, too, diminishing our opportunity for correction. The random forest approach tries to decorrelate the trees by randomizing the set of variables that each tree is allowed to use.
The process is shown in figure 10.6. For each individual tree in the ensemble, the random forest method does the following:
The final ensemble of trees is then bagged to make the random forest predictions. This is quite involved, but fortunately all done by a single-line random forest call.
By default, the randomForest() function in R draws mtry = p/3 variables at each node for regression trees, and m = sqrt(p) variables for classification trees. In theory, random forests aren’t terribly sensitive to the value of mtry. Smaller values will grow the trees faster; but if you have a very large number of variables to choose from, of which only a small fraction are actually useful, then using a larger mtry is better, since with a larger mtry you’re more likely to draw some useful variables at every step of the tree-growing procedure.
Continuing from the data in section 10.1, try building a spam model using random forests.
library(randomForest) 1 set.seed(5123512) 2 fmodel <- randomForest(x = spamTrain[, spamVars], 3 y = spamTrain$spam, ntree = 100, 4 nodesize = 7, 5 importance = TRUE) 6 pred <- predict(fmodel, spamTrain[, spamVars], type = 'prob')[, 'spam'] trainperf_rf <- accuracyMeasures(predict(fmodel, 7 newdata = spamTrain[, spamVars], type = 'prob')[, 'spam'], spamTrain$spam == "spam", name = "random forest, train") testperf_rf <- accuracyMeasures(predict(fmodel, newdata = spamTest[, spamVars], type = 'prob')[, 'spam'], spamTest$spam == "spam", name = "random forest, test") perftable <- rbind(trainperf_rf, testperf_rf) pandoc.table(perftable, justify = perf_justify) ## ## ## model accuracy f1 dev.norm ## ---------------------- ---------- -------- ---------- ## random forest, train 0.9884 0.9852 0.1440 ## random forest, test 0.9498 0.9341 0.3011
You can summarize the results for all three of the models you’ve looked at. First, on training:
trainf <- rbind(trainperf_tree, trainperf_bag, trainperf_rf) pandoc.table(trainf, justify = perf_justify) ## ## ## model accuracy f1 dev.norm ## ---------------------- ---------- -------- ---------- ## tree, training 0.8996 0.8691 0.6304 ## bagging, training 0.9160 0.8906 0.5106 ## random forest, train 0.9884 0.9852 0.1440
testf <- rbind(testperf_tree, testperf_bag, testperf_rf) pandoc.table(testf, justify = perf_justify) ## ## ## model accuracy f1 dev.norm ## --------------------- ---------- -------- ---------- ## tree, test 0.8712 0.8280 0.7531 ## bagging, test 0.9105 0.8791 0.5834 ## random forest, test 0.9498 0.9341 0.3011
The random forest model performed dramatically better than the other two models on both training and test.
You can also look at performance change: the decrease in accuracy and F1 when going from training to test, and the corresponding increase in deviance.
difff <- data.frame(model = c("tree", "bagging", "random forest"), accuracy = trainf$accuracy - testf$accuracy, f1 = trainf$f1 - testf$f1, dev.norm = trainf$dev.norm - testf$dev.norm) pandoc.table(difff, justify=perf_justify) ## ## ## model accuracy f1 dev.norm ## --------------- ---------- --------- ---------- ## tree 0.028411 0.04111 -0.12275 ## bagging 0.005523 0.01158 -0.07284 ## random forest 0.038633 0.05110 -0.15711
The random forest’s model degraded about as much as a single decision tree when going from training to test data, and much more than the bagged model did. This is one of the drawbacks of random forest models: the tendency to overfit the training data. However, in this case, the random forest model was still the best performing.
It’s lore among random forest proponents that “random forests don’t overfit.” In fact, they can. Hastie et al. back up this observation in their chapter on random forests in The Elements of Statistical Learning (Springer, 2011). Seeing virtually perfect prediction on training data and less-than-perfect performance on holdout data is characteristic of random forest models. So when using random forest, it’s extremely important to validate model performance on holdout data.
A useful feature of the randomForest() function is its variable importance calculation. Since the algorithm uses a large number of bootstrap samples, each data point x has a corresponding set of out-of-bag samples: those samples that don’t contain the point x. This is shown in figure 10.7 for the data point x1. The out-of-bag samples can be used in a way similar to N-fold cross-validation, to estimate the accuracy of each tree in the ensemble.
To estimate the “importance” of a variable v1, the variable’s values are randomly permuted. Each tree is then evaluated against its out-of-bag samples, and the corresponding decrease in each tree’s accuracy is estimated. This is shown in figure 10.8.
If the average decrease over all the trees is large, then the variable is considered important—its value makes a big difference in predicting the outcome. If the average decrease is small, then the variable doesn’t make much difference to the outcome. The algorithm also measures the decrease in node purity that occurs from splitting on a permuted variable (how this variable affects the quality of the tree).
You can calculate the variable importance by setting importance = TRUE in the randomForest() call (as you did in listing 10.3), and then calling the functions importance() and varImpPlot().
varImp <- importance(fmodel) 1 varImp[1:10, ] 2 ## non-spam spam MeanDecreaseAccuracy ## word.freq.make 1.656795 3.432962 3.067899 ## word.freq.address 2.631231 3.800668 3.632077 ## word.freq.all 3.279517 6.235651 6.137927 ## word.freq.3d 3.900232 1.286917 3.753238 ## word.freq.our 9.966034 10.160010 12.039651 ## word.freq.over 4.657285 4.183888 4.894526 ## word.freq.remove 19.172764 14.020182 20.229958 ## word.freq.internet 7.595305 5.246213 8.036892 ## word.freq.order 3.167008 2.505777 3.065529 ## word.freq.mail 3.820764 2.786041 4.869502 varImpPlot(fmodel, type = 1) 3
The result of the varImpPlot() call is shown in figure 10.9. According to the plot, the most important variable for determining if an email is spam is char.freq.bang, or the number of times an exclamation point appears in an email, which makes some intuitive sense. The next most important variable is word.freq.remove, or the number of times the word “remove” appears in the email.
Knowing which variables are most important (or at least, which variables contribute the most to the structure of the underlying decision trees) can help you with variable reduction. This is useful not only for building smaller, faster trees, but also for choosing variables to be used by another modeling algorithm, if that’s desired. We can reduce the number of variables in this spam example from 57 to 30 without affecting the quality of the final model.
Data scientist Jeremy Howard (of Kaggle and fast.ai fame) is a big proponent of using an initial variable importance screen early in a data science project to eliminate variables that are not of interest and identify variables to discuss with business partners.
sorted <- sort(varImp[, "MeanDecreaseAccuracy"], 1 decreasing = TRUE) selVars <- names(sorted)[1:30] fsel <- randomForest(x = spamTrain[, selVars], 2 y = spamTrain$spam, ntree = 100, nodesize = 7, importance = TRUE) trainperf_rf2 <- accuracyMeasures(predict(fsel, newdata = spamTrain[, selVars], type = 'prob')[, 'spam'], spamTrain$spam == "spam", name = "RF small, train") testperf_rf2 <- accuracyMeasures(predict(fsel, newdata=spamTest[, selVars], type = 'prob')[, 'spam'], spamTest$spam == "spam", name = "RF small, test") perftable <- rbind(testperf_rf, testperf_rf2) 3 pandoc.table(perftable, justify = perf_justify) ## ## ## model accuracy f1 dev.norm ## --------------------- ---------- -------- ---------- ## random forest, test 0.9498 0.9341 0.3011 ## RF small, test 0.9520 0.9368 0.4000
The smaller model performs just as well as the random forest model built using all 57 variables.
Random forest variable importance measures how important individual variables are to the model’s overall prediction performance. They tell you which variables generally affect the model’s predictions the most, or which variables the model depends on the most.
LIME variable importances (discussed in section 6.3) measure how much different variables affect the model’s prediction on a specific example. LIME explanations can help you determine if the model is using its variables appropriately, by explaining specific decisions.
Gradient boosting is another ensemble method that improves the performance of decision trees. Rather than averaging the predictions of several trees together, as bagging and random forests do, gradient boosting tries to improve prediction performance by incrementally adding trees to an existing ensemble. The steps are as follows:
The procedure is sketched out in figure 10.10.
Gradient-boosted trees can also overfit, because at some point the residuals are just random noise. To mitigate overfitting, most implementations of gradient boosting provide cross-validation methods to help determine when to stop adding trees to the ensemble.
You saw examples of gradient boosting when we discussed LIME in section 6.3, where you fit the gradient-boosted tree models using the xgboost package. In this section, we’ll go over the modeling code that you used in section 6.3 in more detail.
Let’s start with a small example.
Suppose you have a dataset of petal and sepal measurements for three varieties of iris. The object is to predict whether a given iris is a setosa based on its petal and sepal dimensions.
iris <- iris iris$class <- as.numeric(iris$Species == "setosa") 1 set.seed(2345) intrain <- runif(nrow(iris)) < 0.75 2 train <- iris[intrain, ] test <- iris[!intrain, ] head(train) ## Sepal.Length Sepal.Width Petal.Length Petal.Width Species class ## 1 5.1 3.5 1.4 0.2 setosa 1 ## 2 4.9 3.0 1.4 0.2 setosa 1 ## 3 4.7 3.2 1.3 0.2 setosa 1 ## 4 4.6 3.1 1.5 0.2 setosa 1 ## 5 5.0 3.6 1.4 0.2 setosa 1 ## 6 5.4 3.9 1.7 0.4 setosa 1 input <- as.matrix(train[, 1:4]) 3
Note that xgboost requires its input to be a numeric (no categorical variables) matrix, so in listing 10.6, you take the input data from the training data frame and create an input matrix.
In section 6.3, you fit the iris model by using the preprovided convenience function fit_iris_example(); here we’ll explain the code from that function in detail. The first step is to run the cross-validation function xgb.cv() to determine the appropriate number of trees to use.
library(xgboost) cv <- xgb.cv(input, 1 label = train$class, 2 params = list( objective = "binary:logistic" 3 ), nfold = 5, 4 nrounds = 100, 5 print_every_n = 10, 6 metrics = "logloss") 7 evalframe <- as.data.frame(cv$evaluation_log) 8 head(evalframe) 9 ## iter train_logloss_mean train_logloss_std test_logloss_mean ## 1 1 0.4547800 7.758350e-05 0.4550578 ## 2 2 0.3175798 9.268527e-05 0.3179284 ## 3 3 0.2294212 9.542411e-05 0.2297848 ## 4 4 0.1696242 9.452492e-05 0.1699816 ## 5 5 0.1277388 9.207258e-05 0.1280816 ## 6 6 0.0977648 8.913899e-05 0.0980894 ## test_logloss_std ## 1 0.001638487 ## 2 0.002056267 ## 3 0.002142687 ## 4 0.002107535 ## 5 0.002020668 ## 6 0.001911152 (NROUNDS <- which.min(evalframe$test_logloss_mean)) 10) ## [1] 18 library(ggplot2) ggplot(evalframe, aes(x = iter, y = test_logloss_mean)) + geom_line() + geom_vline(xintercept = NROUNDS, color = "darkred", linetype = 2) + ggtitle("Cross-validated log loss as a function of ensemble size")
Figure 10.11 shows the cross-validated log loss as a function of the number of trees. In this case, xgb.cv() estimated that 18 trees gave the best model. Once you know the number of trees to use, you can call xgboost() to fit the appropriate model.
model <- xgboost(data = input, label = train$class, params = list( objective = "binary:logistic" ), nrounds = NROUNDS, verbose = FALSE) test_input <- as.matrix(test[, 1:4]) 1 pred <- predict(model, test_input) 2 accuracyMeasures(pred, test$class) ## model accuracy f1 dev.norm ## 1 model 1 1 0.03458392
The model predicts perfectly on the holdout data, because this is an easy problem. Now that you are familiar with the steps, you can try xgboost on a harder problem: the movie review classification problem from section 6.3.3.
For this example, you will classify movie reviews from the Internet Movie Database (IMDB). The task is to identify positive reviews.
As you did in section 6.3.3, you’ll use the training and test data, IMDBtrain.RDS and IMDBtest.RDS, found at https://github.com/WinVector/PDSwR2/tree/master/IMDB. Each RDS object is a list with two elements: a character vector representing 25,000 reviews, and a vector of numeric labels where 1 means a positive review and 0 a negative review.
First, load the training data:
library(zeallot) c(texts, labels) %<-% readRDS("IMDBtrain.RDS")
You have to convert the textual input data to a numeric representation. As in section 6.3.3, you’ll convert the training data into a document-term matrix, implemented as a sparse matrix of class dgCMatrix. The convenience functions to do this conversion are in https://github.com/WinVector/PDSwR2/tree/master/IMDB/lime_imdb_example.R. Next, you’ll create the vocabulary of terms in the corpus, and then make the document-term matrix for the training data:
source("lime_imdb_example.R") vocab <- create_pruned_vocabulary(texts) dtm_train <- make_matrix(texts, vocab)
The first step to fit the model is to determine the number of trees to use. This may take a while.
cv <- xgb.cv(dtm_train, label = labels, params = list( objective = "binary:logistic" ), nfold = 5, nrounds = 500, early_stopping_rounds = 20, 1 print_every_n = 10, metrics = "logloss") evalframe <- as.data.frame(cv$evaluation_log) (NROUNDS <- which.min(evalframe$test_logloss_mean)) ## [1] 319
Then fit the model and evaluate it:
model <- xgboost(data = dtm_train, label = labels, params = list( objective = "binary:logistic" ), nrounds = NROUNDS, verbose = FALSE) pred = predict(model, dtm_train) trainperf_xgb = accuracyMeasures(pred, labels, "training") c(test_texts, test_labels) %<-% readRDS("IMDBtest.RDS") 1 dtm_test = make_matrix(test_texts, vocab) pred = predict(model, dtm_test) testperf_xgb = accuracyMeasures(pred, test_labels, "test") perftable <- rbind(trainperf_xgb, testperf_xgb) pandoc.table(perftable, justify = perf_justify) ## ## ## model accuracy f1 dev.norm ## ---------- ---------- -------- ---------- ## training 0.9891 0.9891 0.1723 ## test 0.8725 0.8735 0.5955
As with random forests, this gradient-boosted model gives near-perfect performance on training data, and less-than-perfect, but still decent performance on holdout data. Even though the cross-validation step suggested 319 trees, you may want to examine evalframe (as you did in the iris example), and experiment with different numbers of trees, to see if that reduces the overfitting.
In our own work, we’ve found that gradient boosting models tend to outperform random forests on most problems where we’ve tried both. However, there are occasionally situations where gradient boosting models perform poorly, while random forest models give acceptable performance. Your experiences may be different. In any case, it’s a good idea to keep both methods in your arsenal.
In the iris example, all the input variables were numeric; in the movie review example, you converted the unstructured text input into a structured, numeric matrix representation. In many situations, you will have structured input data with categorical levels, as in the following example.
Suppose you want to predict a newborn’s birth weight as a function of several variables, both numeric and categorical, using xgboost.
The data for this example is from the 2010 CDC natality dataset; it is similar to the data that you used in chapter 7 for predicting at-risk births.[5]
The dataset can be found at https://github.com/WinVector/PDSwR2/blob/master/CDC/NatalBirthData.rData.
load("NatalBirthData.rData") train <- sdata[sdata$ORIGRANDGROUP <= 5, ] 1 test <- sdata[sdata$ORIGRANDGROUP >5 , ] input_vars <- setdiff(colnames(train), c("DBWT", "ORIGRANDGROUP")) 2 str(train[, input_vars]) ## 'data.frame': 14386 obs. of 11 variables: ## $ PWGT : int 155 140 151 160 135 180 200 135 112 98 ... ## $ WTGAIN : int 42 40 1 47 25 20 24 51 36 22 ... ## $ MAGER : int 30 32 34 32 24 25 26 26 20 22 ... ## $ UPREVIS : int 14 13 15 1 4 10 14 15 14 10 ... ## $ CIG_REC : logi FALSE FALSE FALSE TRUE FALSE FALSE ... ## $ GESTREC3 : Factor w/ 2 levels ">= 37 weeks",..: 1 1 1 2 1 1 1 1 1 1 ... ## $ DPLURAL : Factor w/ 3 levels "single","triplet or higher",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ URF_DIAB : logi FALSE FALSE FALSE FALSE FALSE FALSE ... ## $ URF_CHYPER: logi FALSE FALSE FALSE FALSE FALSE FALSE ... ## $ URF_PHYPER: logi FALSE FALSE FALSE FALSE FALSE FALSE ... ## $ URF_ECLAM : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
As you can see, the input data has numerical variables, logical variables, and categorical (factor) variables. If you want to use xgboost() to fit a gradient-boosted model for a baby’s birth weight using all of these variables, you must convert the input to all-numerical data. There are several ways to do this, including the base R model .matrix() function. We recommend using vtreat, as you did in chapter 8.
For this scenario, there are three ways you can use vtreat:
Since in this scenario there are only two categorical variables, and none of them are too complex (GESTREC3 takes on two values, and DPLURAL takes on three), you can use the third option.
library(vtreat) treatplan <- designTreatmentsZ(train, 1 input_vars, codeRestriction = c("clean", "isBAD", "lev" ),2 verbose = FALSE) train_treated <- prepare(treatplan, train) 3 str(train_treated) ## 'data.frame': 14386 obs. of 14 variables: ## $ PWGT : num 155 140 151 160 135 180 200 135 112 98 ... ## $ WTGAIN : num 42 40 1 47 25 20 24 51 36 22 ... ## $ MAGER : num 30 32 34 32 24 25 26 26 20 22 ... ## $ UPREVIS : num 14 13 15 1 4 10 14 15 14 10 ... ## $ CIG_REC : num 0 0 0 1 0 0 0 0 0 0 ... ## $ URF_DIAB : num 0 0 0 0 0 0 0 0 0 0 ... ## $ URF_CHYPER : num 0 0 0 0 0 0 0 0 0 0 ... ## $ URF_PHYPER : num 0 0 0 0 0 0 0 0 0 0 ... ## $ URF_ECLAM : num 0 0 0 0 0 0 0 0 0 0 ... ## $ GESTREC3_lev_x_37_weeks : num 0 0 0 1 0 0 0 0 0 0 ... ## $ GESTREC3_lev_x_37_weeks_1 : num 1 1 1 0 1 1 1 1 1 1 ... ## $ DPLURAL_lev_x_single : num 1 1 1 1 1 1 1 1 1 1 ... ## $ DPLURAL_lev_x_triplet_or_higher: num 0 0 0 0 0 0 0 0 0 0 ... ## $ DPLURAL_lev_x_twin : num 0 0 0 0 0 0 0 0 0 0 ...
Note that train_treated is purely numerical, with no missing values, and it doesn’t contain the outcome column, so it’s safe to use with xgboost (though you must convert it to a matrix first). To demonstrate this, the following listing directly fits a gradient-boosted model with 50 trees to the prepared training data (no cross-validation to pick the best size), and then applies the model to the prepared test data. This is just for demonstration purposes; normally you would want to call xgb.cv() to pick an appropriate number of trees first.
birthwt_model <- xgboost(as.matrix(train_treated), train$DBWT, params = list( objective = "reg:linear", base_score = mean(train$DBWT) ), nrounds = 50, verbose = FALSE) test_treated <- prepare(treatplan, test) pred <- predict(birthwt_model, as.matrix(test_treated))
Try xgboost to predict DBWT, that is, set up the data and run the preceding code.
Bagging, random forests, and gradient boosting are after-the-fact improvements you can try in order to improve decision tree models. In the next section, you’ll work with generalized additive models, which use a different method to represent non-linear relationships between inputs and outputs.
Here's what you should remember about tree-based models:
In chapter 7, you used linear regression to model and predict quantitative output, and logistic regression to predict class probabilities. Linear and logistic regression models are powerful tools, especially when you want to understand the relationship between the input variables and the output. They’re robust to correlated variables (when regularized), and logistic regression preserves the marginal probabilities of the data. The primary shortcoming of both these models is that they assume that the relationship between the inputs and the output is monotone. That is, if more is good, than much more is always better.
But what if the actual relationship is non-monotone? Consider the BMI example that you saw at the beginning of the chapter. For underweight adults, increasing BMI can lower mortality. But there’s a limit: at some point a higher BMI is bad, and mortality will increase as BMI increases. Linear and logistic regression miss this distinction. For the data that we are working with, as figure 10.12 shows, a linear model would predict that mortality always decreases as BMI increases.
Generalized additive models (GAMs) are a way to model non-monotone responses within the framework of a linear or logistic model (or any other generalized linear model). In the mortality example, GAM would try to find a good “u-shaped” function of BMI, s(BMI), that describes the relationship between BMI and mortality, as shown in figure 10.12. GAM would then fit a function to predict mortality in terms of s(BMI).
Recall that if y[i] is the numeric quantity you want to predict, and x[i, ] is a row of inputs that corresponds to output y[i], then linear regression finds a function f(x) such that
f(x[i, ]) = b0 + b[1] * x[i, 1] + b[2] * x[i, 2] + ... b[n] * x[i, n]
And f(x[i, ]) is as close to y[i] as possible.
In its simplest form, a GAM model relaxes the linearity constraint and finds a set of functions s_i() (and a constant term a0) such that
f(x[i,]) = a0 + s_1(x[i, 1]) + s_2(x[i, 2]) + ... s_n(x[i, n])
We also want f(x[i, ]) to be as close to y[i] as possible. The functions s_i() are smooth curve fits that are built up from polynomials. The curves are called splines and are designed to pass as closely as possible through the data without being too “wiggly” (without overfitting). An example of a spline fit is shown in figure 10.13.
Let’s work on a concrete example.
First, consider this toy example.
Suppose you want to fit a model to data where the response y is a noisy non-linear function of the input variable x (in fact, it’s the function shown in figure 10.13).
As usual, we’ll split the data into training and test sets.
set.seed(602957) x <- rnorm(1000) noise <- rnorm(1000, sd = 1.5) y <- 3 * sin(2 * x) + cos(0.75 * x) - 1.5 * (x^2) + noise select <- runif(1000) frame <- data.frame(y = y, x = x) train <- frame[select > 0.1, ] test <-frame[select <= 0.1, ]
Given that the data is from the non-linear functions sin() and cos(), there shouldn’t be a good linear fit from x to y. We’ll start by building a (poor) linear regression.
lin_model <- lm(y ~ x, data = train) summary(lin_model) ## ## Call: ## lm(formula = y ~ x, data = train) ## ## Residuals: ## Min 1Q Median 3Q Max ## -17.698 -1.774 0.193 2.499 7.529 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.8330 0.1161 -7.175 1.51e-12 *** ## x 0.7395 0.1197 6.180 9.74e-10 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.485 on 899 degrees of freedom ## Multiple R-squared: 0.04075, Adjusted R-squared: 0.03968 ## F-statistic: 38.19 on 1 and 899 DF, p-value: 9.737e-10 rmse <- function(residuals) { 1 sqrt(mean(residuals^2)) } train$pred_lin <- predict(lin_model, train) 2 resid_lin <- with(train, y - pred_lin) rmse(resid_lin) ## [1] 3.481091 library(ggplot2) 3 ggplot(train, aes(x = pred_lin, y = y)) + geom_point(alpha = 0.3) + geom_abline()
The resulting model’s predictions are plotted versus true response in figure 10.14. As expected, it’s a very poor fit, with an R-squared of about 0.04. In particular, the errors are not homoscedastic: there are regions where the model systematically underpredicts and regions where it systematically overpredicts. If the relationship between x and y were truly linear (with independent noise), then the errors would be homoscedastic: the errors would be evenly distributed (mean 0) around the predicted value everywhere.
Now try finding a non-linear model that maps x to y. We’ll use the function gam() in the package mgcv.[6] When using gam(), you can model variables as either linear or non-linear. You model a variable x as non-linear by wrapping it in the s() notation. In this example, rather than using the formula y ~ x to describe the model, you’d use the formula y ~ s(x). Then gam() will search for the spline s() that best describes the relationship between x and y, as shown in listing 10.14. Only terms surrounded by s() get the GAM/spline treatment.
There’s an older package called gam, written by Hastie and Tibshirani, the inventors of GAMs. The gam package works fine. But it’s incompatible with the mgcv package, which ggplot already loads. Since we’re using ggplot for plotting, we’ll use mgcv for our examples.
library(mgcv) 1 gam_model <- gam(y ~ s(x), data = train) 2 gam_model$converged 3 ## [1] TRUE summary(gam_model) ## Family: gaussian 4 ## Link function: identity ## ## Formula: ## y ~ s(x) ## ## Parametric coefficients: 5 ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -0.83467 0.04852 -17.2 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: 6 ## edf Ref.df F p-value ## s(x) 8.685 8.972 497.8 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 0.832 Deviance explained = 83.4% 7 ## GCV score = 2.144 Scale est. = 2.121 n = 901 train$pred <- predict(gam_model, train) 8 resid_gam <- with(train, y - pred) rmse(resid_gam) ## [1] 1.448514 ggplot(train, aes(x = pred, y = y)) + 9 geom_point(alpha = 0.3) + geom_abline()
The resulting model’s predictions are plotted versus true response in figure 10.15. This fit is much better: the model explains over 80% of the variance (R-squared of 0.83), and the root mean squared error (RMSE) over the training data is less than half the RMSE of the linear model. Note that the points in figure 10.15 are distributed more or less evenly around the line of perfect prediction. The GAM has been fit to be homoscedastic, and any given prediction is as likely to be an overprediction as an underprediction.
By default, gam() will perform standard linear regression. If you were to call gam() with the formula y ~ x, you’d get the same model that you got using lm(). More generally, the call gam(y ~ x1 + s(x2), data=...) would model the variable x1 as having a linear relationship with y, and try to fit the best possible smooth curve to model the relationship between x2 and y. Of course, the best smooth curve could be a straight line, so if you’re not sure whether the relationship between x and y is linear, you can use s(x). If you see that the coefficient has an edf (effective degrees of freedom—see the model summary in listing 10.14) of about 1, then you can try refitting the variable as a linear term.
The use of splines gives GAMs a richer model space to choose from; this increased flexibility brings a higher risk of overfitting. You should also check the models’ performances on the test data.
test <- transform(test, 1 pred_lin = predict(lin_model, test), pred_gam = predict(gam_model, test) ) test <- transform(test, 2 resid_lin = y - pred_lin, resid_gam = y - pred_gam) rmse(test$resid_lin) 3 ## [1] 2.792653 rmse(test$resid_gam) ## [1] 1.401399 library(sigr) 4 wrapFTest(test, "pred_lin", "y")$R2 ## [1] 0.115395 wrapFTest(test, "pred_gam", "y")$R2 ## [1] 0.777239
The GAM performed similarly on both training and test sets: RMSE of 1.40 on test versus 1.45 on training; R-squared of 0.78 on test versus 0.83 on training. So there’s likely no overfit.
Once you fit a GAM, you’ll probably be interested in what the s() functions look like. Calling plot() on a GAM will give you a plot for each s() curve, so you can visualize non-linearities. In our example, plot(gam_model) produces the top curve in figure 10.16.
The shape of the curve is quite similar to the scatter plot we saw in figure 10.13 (which is reproduced as the lower half of figure 10.16). In fact, the spline that’s superimposed on the scatter plot in figure 10.13 is the same curve.
You can extract the data points that were used to make this graph by using the predict() function with the argument type = "terms". This produces a matrix where the ith column represents s(x[,i]). The following listing demonstrates how to reproduce the lower plot in figure 10.16.
sx <- predict(gam_model, type = "terms") summary(sx) ## s(x) ## Min. :-17.527035 ## 1st Qu.: -2.378636 ## Median : 0.009427 ## Mean : 0.000000 ## 3rd Qu.: 2.869166 ## Max. : 4.084999 xframe <- cbind(train, sx = sx[,1]) ggplot(xframe, aes(x = x)) + geom_point(aes(y = y), alpha = 0.4) + geom_line(aes(y = sx))
Now that you’ve worked through a simple example, you are ready to try a more realistic example with more variables.
Suppose you want to predict a newborn baby’s weight (DBWT) from a number of variables:
For this example, you’ll use data from the 2010 CDC natality dataset that you used in section 7.2 (though this is not the risk data used in that chapter).[7] Note that we’ve chosen this example to highlight the mechanisms of gam(), not to find the best model for birth weight. Adding other variables beyond the four we’ve chosen will improve the fit, but obscure the exposition.
The dataset can be found at https://github.com/WinVector/PDSwR2/blob/master/CDC/NatalBirthData.rData. A script for preparing the dataset from the original CDC extract can be found at https://github.com/WinVector/PDSwR2/blob/master/CDC/prepBirthWeightData.R.
In the next listing, you’ll fit a linear model and a GAM, and compare.
library(mgcv) library(ggplot2) load("NatalBirthData.rData") train <- sdata[sdata$ORIGRANDGROUP <= 5, ] test <- sdata[sdata$ORIGRANDGROUP > 5, ] form_lin <- as.formula("DBWT ~ PWGT + WTGAIN + MAGER + UPREVIS") linmodel <- lm(form_lin, data = train) 1 summary(linmodel) ## Call: ## lm(formula = form_lin, data = train) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3155.43 -272.09 45.04 349.81 2870.55 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 2419.7090 31.9291 75.784 < 2e-16 *** ## PWGT 2.1713 0.1241 17.494 < 2e-16 *** ## WTGAIN 7.5773 0.3178 23.840 < 2e-16 *** ## MAGER 5.3213 0.7787 6.834 8.6e-12 *** ## UPREVIS 12.8753 1.1786 10.924 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 562.7 on 14381 degrees of freedom ## Multiple R-squared: 0.06596, Adjusted R-squared: 0.0657 2 ## F-statistic: 253.9 on 4 and 14381 DF, p-value: < 2.2e-16 form_gam <- as.formula("DBWT ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS)") gammodel <- gam(form_gam, data = train) 3 gammodel$converged 4 ## [1] TRUE summary(gammodel) ## ## Family: gaussian ## Link function: identity ## ## Formula: ## DBWT ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS) ## ## Parametric coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 3276.948 4.623 708.8 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: ## edf Ref.df F p-value ## s(PWGT) 5.374 6.443 69.010 < 2e-16 *** ## s(WTGAIN) 4.719 5.743 102.313 < 2e-16 *** ## s(MAGER) 7.742 8.428 7.145 1.37e-09 *** ## s(UPREVIS) 5.491 6.425 48.423 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 0.0927 Deviance explained = 9.42% 5 ## GCV = 3.0804e+05 Scale est. = 3.0752e+05 n = 14386
The GAM has improved the fit, and all four variables seem to have a non-linear relationship with birth weight, as evidenced by edfs all greater than 1. You could use plot(gammodel) to examine the shape of the s() functions; instead, let's compare them with a direct smoothing curve of each variable against mother’s weight.
terms <- predict(gammodel, type = "terms") 1 terms <- cbind(DBWT = train$DBWT, terms) 2 tframe <- as.data.frame(scale(terms, scale = FALSE)) 3 colnames(tframe) <- gsub('[()]', '', colnames(tframe)) 4 vars = c("PWGT", "WTGAIN", "MAGER", "UPREVIS") pframe <- cbind(tframe, train[, vars]) 5 ggplot(pframe, aes(PWGT)) + 6 geom_point(aes(y = sPWGT)) + geom_smooth(aes(y = DBWT), se = FALSE) # [...] 7
Figure 10.17 shows the s() splines learned by gam() as the dotted curves. These splines are gam()’s estimate of the (joint) relationship between each variable and the outcome, DBWT. The sum of the splines (plus an offset) is the model’s best estimate of DBWT as a function of the input variables.
The figure also shows the smoothing curves that directly relate each variable to DBWT. The smooth curves in each case are similar to the corresponding s() in shape, and non-linear for all the variables. The differences in shape are because the splines are fit jointly (which is more useful for modeling), and the smoothing curves are merely calculated one at a time.
As usual, you should check for overfit with holdout data.
test <- transform(test, 1 pred_lin = predict(linmodel, test), pred_gam = predict(gammodel, test) ) test <- transform(test, 2 resid_lin = DBWT - pred_lin, resid_gam = DBWT - pred_gam) rmse(test$resid_lin) 3 ## [1] 566.4719 rmse(test$resid_gam) ## [1] 558.2978 wrapFTest(test, "pred_lin", "DBWT")$R2 4 ## [1] 0.06143168 wrapFTest(test, "pred_gam", "DBWT")$R2 ## [1] 0.08832297
The performance of the linear model and the GAM were similar on the test set, as they were on the training set, so in this example, there’s no substantial overfit.
The gam() function can be used for logistic regression as well.
Suppose you want to predict when a baby will be born underweight (defined as DBWT < 2000), using the same input variables as the previous scenario.
The logistic regression call to do this is shown in the following listing.
form <- as.formula("DBWT < 2000 ~ PWGT + WTGAIN + MAGER + UPREVIS") logmod <- glm(form, data = train, family = binomial(link = "logit"))
The corresponding call to gam() also specifies the binomial family with the logit link.
form2 <- as.formula("DBWT < 2000 ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS)") glogmod <- gam(form2, data = train, family = binomial(link = "logit")) glogmod$converged ## [1] TRUE summary(glogmod) ## Family: binomial ## Link function: logit ## ## Formula: ## DBWT < 2000 ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS) ## ## Parametric coefficients: ## Estimate Std. Error z value Pr(>|z|) ## (Intercept) -3.94085 0.06794 -58 <2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Approximate significance of smooth terms: ## edf Ref.df Chi.sq p-value ## s(PWGT) 1.905 2.420 2.463 0.36412 1 ## s(WTGAIN) 3.674 4.543 64.426 1.72e-12 *** ## s(MAGER) 1.003 1.005 8.335 0.00394 ** ## s(UPREVIS) 6.802 7.216 217.631 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## R-sq.(adj) = 0.0331 Deviance explained = 9.14% 2 ## UBRE score = -0.76987 Scale est. = 1 n = 14386
As with the standard logistic regression call, we recover the class probabilities with the call predict(glogmodel, newdata = train, type = "response"). Again, these models are coming out with low quality, and in practice we would look for more explanatory variables to build better screening models.
Here’s what you should remember about GAMs:
GAMs extend linear methods (and generalized linear methods) by allowing variables to have non-linear (or even non-monotone) effects on outcome. Another approach is to form new variables from non-linear combinations of existing variables. The data scientist can do this by hand, by adding interactions or new synthetic variables, or it can be done mechanically, by support vector machines (SVMs), as shown in the next section. The hope is that with access to enough of these new variables, your modeling problem becomes easier.
In the next section, we’ll work with two of the most popular ways to add and manage new variables: kernel methods and support vector machines.
Some classification problems are called inseparable: instances of one class, A, are inside regions bounded by another class, B, so that class A can’t be separated from class B by a flat boundary. For example, in figure 10.18, we see a number of o’s inside a triangle defined by x’s (and we also see the data converted to a nice separable arrangement by a so-called kernel function phi()). The original arrangement on the left side is linearly inseparable: there is no hyperplane that separates the x’s from the o’s. Hence, it would be impossible for linear methods to completely separate the two classes. We could use the tree-based methods demonstrated in section 10.1 to fit a classifier, or we could use a technique called kernel methods. In this section, we will use SVMs and kernel methods to build good classifiers on linearly inseparable data.
At this point, we have seen a number of advanced methods that give us more than one way to handle complex problems. For example: random forests, boosting, and SVMs can all introduce variable interactions to solve problems. It would be nice if there were always one obvious best method. However, each of these methodologies can dominate for different problems. So there is no one best method.
Our advice is try simple methods such as linear and logistic regression first. Then bring in and try advanced methods such as GAMs (which can handle single-variable reshaping), tree-based methods (which can handle many-variable interactions), and SVMs (which can handle many-variable reshapings) to address modeling issues.
Let’s start with an example adapted from R’s kernlab library documentation. Learning to separate two spirals is a famous “impossible” problem that cannot be solved by linear methods (though it is solvable by spectral clustering, kernel methods, SVMs, and deep learning or deep neural nets).
Figure 10.19 shows two spirals, one within the other. Your task is to build a decision procedure that cuts up the plane such that the 1-labeled examples are in one region and the 2-labeled examples are in the complimentary region.[8]
See K. J. Lang and M. J. Witbrock, “Learning to tell two spirals apart” in Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski (eds), Morgan Kaufmann, 1988 (pp. 52–59).
Support vector machines excel at learning concepts of the form “examples that are near each other should be given the same classification.” To use the SVM technique, the user must choose a kernel (to control what is considered "near" or "far"), and pick a value for a hyperparameter called C or nu (to try to control model complexity).
Listing 10.22 shows the recovery and labeling of the two spirals shown in figure 10.19. You will use the labeled data for the example task: given the labeled data, recover the 1 versus 2 regions by supervised machine learning.
library('kernlab') data(spirals) 1 sc <- specc(spirals, centers = 2) 2 s <- data.frame(x = spirals[, 1], y = spirals[, 2], 3 class = as.factor(sc)) library('ggplot2') ggplot(data = s) + 4 geom_text(aes(x = x, y = y, label = class, color = class)) + scale_color_manual(values = c("#d95f02", "#1b9e77")) + coord_fixed() + theme_bw() + theme(legend.position = 'none') + ggtitle("example task: separate the 1s from the 2s")
Figure 10.19 shows the labeled spiral dataset. Two classes (represented by digits) of data are arranged in two interwoven spirals. This dataset is difficult for methods that don’t have a rich enough concept space (perceptrons, shallow neural nets) and easy for more-sophisticated learners that can introduce the right new features. Support vector machines, with the right kernel, are a way to introduce new composite features in order to solve the problem.
Support vector machines are powerful, but without the correct kernel, they have difficulty with some concepts (such as the spiral example). Listing 10.23 shows a failed attempt to learn the spiral concept with an SVM using the identity or dot-product (linear) kernel. The linear kernel does no transformations on the data; it can work for some applications, but in this case it does not give us the data-separating properties we want.
set.seed(2335246L) s$group <- sample.int(100, size = dim(s)[[1]], replace = TRUE) sTrain <- subset(s, group > 10) sTest <- subset(s,group <= 10) 1 library('e1071') mSVMV <- svm(class ~ x + y, data = sTrain, kernel = 'linear', type = 'nu-classification') 2 sTest$predSVMV <- predict(mSVMV, newdata = sTest, type = 'response') 3 shading <- expand.grid( 4 x = seq(-1.5, 1.5, by = 0.01), y = seq(-1.5, 1.5, by = 0.01)) shading$predSVMV <- predict(mSVMV, newdata = shading, type = 'response') ggplot(mapping = aes(x = x, y = y)) + 5 geom_tile(data = shading, aes(fill = predSVMV), show.legend = FALSE, alpha = 0.5) + scale_color_manual(values = c("#d95f02", "#1b9e77")) + scale_fill_manual(values = c("white", "#1b9e77")) + geom_text(data = sTest, aes(label = predSVMV), size = 12) + geom_text(data = s, aes(label = class, color = class), alpha = 0.7) + coord_fixed() + theme_bw() + theme(legend.position = 'none') + ggtitle("linear kernel")
This attempt results in figure 10.20. The figure shows the total dataset in a small font and the SVM classifications of the test dataset in large text. It also indicates the learned concept by shading. The SVM didn’t produce a good model with the identity kernel, as it was forced to pick a linear separator. In the next section, you’ll repeat the process with the Gaussian radial kernel and get a much better result.
In listing 10.24, you’ll repeat the SVM fitting process, but this time specifying the Gaussian or radial kernel. Figure 10.21 again plots the SVM test classifications in black (with the entire dataset in a smaller font). Note that this time the algorithm correctly learned the actual spiral concept, as indicated by the shading.
mSVMG <- svm(class ~ x + y, data = sTrain, kernel = 'radial', type = 'nu-classification') 1 sTest$predSVMG <- predict(mSVMG, newdata = sTest, type = 'response') shading <- expand.grid( x = seq(-1.5, 1.5, by = 0.01), y = seq(-1.5, 1.5, by = 0.01)) shading$predSVMG <- predict(mSVMG, newdata = shading, type = 'response') ggplot(mapping = aes(x = x, y = y)) + geom_tile(data = shading, aes(fill = predSVMG), show.legend = FALSE, alpha = 0.5) + scale_color_manual(values = c("#d95f02", "#1b9e77")) + scale_fill_manual(values = c("white", "#1b9e77")) + geom_text(data = sTest, aes(label = predSVMG), size = 12) + geom_text(data = s,aes(label = class, color = class), alpha = 0.7) + coord_fixed() + theme_bw() + theme(legend.position = 'none') + ggtitle("radial/Gaussian kernel")
As we stated, some methods work better on some problems than others. Try to use the xgboost package to solve the spirals problem. Do you find the xgboost results to be better or worse than the SVM results? (A worked version of this example can be found here: https://github.com/WinVector/PDSwR2/tree/master/Spirals.)
An SVM is often portrayed as a magic machine that makes classification easier.[9] To dispel the awe and be able to use support vector methods with confidence, we need to take some time to learn their principles and how they work. The intuition is this: SVMs with the radial kernel are very good nearest-neighbor-style classifiers.
Support vector machines can also be used for regression, but we will not cover that here.
In figure 10.22, in the “real space” (on the left), the data is separated by a non-linear boundary. When the data is lifted into the higher-dimensional kernel space (on the right), the lifted points are separated by a hyperplane. Let’s call the normal to that hyperplane w and the offset from the origin b (not shown).
An SVM finds a linear decision function (determined by parameters w and b), where for a given example x the machine decides x is in the class if
w %*% phi(x) + b >= 0
for some w and b, and not in the class otherwise. The model is completely determined by the function phi(), the vector w, and the scalar offset b. The idea is that phi() lifts or reshapes the data into a nicer space (where things are linearly separable), and then the SVM finds a linear boundary separating the two data classes in this new space (represented by w and b). This linear boundary in the lifted space can be pulled back as a general curved boundary in the original space. The principle is sketched out in figure 10.22.
The support vector training operation finds w and b. There are variations on the SVM that make decisions between more than two classes, perform scoring/regression, and detect novelty. But we’ll discuss only the SVMs for simple classification.
As a user of SVMs, you don’t immediately need to know how the training procedure works; that’s what the software does for you. But you do need to have some notion of what it’s trying to do. The model w,b is ideally picked so that
w %*% phi(x) + b >= u
for all training xs that were in the class, and
w %*% phi(x) + b <= v
for all training examples not in the class.
The data is called separable if u > v. The size of the separation is (u - v) / sqrt(w %*% w), and is called the margin. The goal of the SVM optimizer is to maximize the margin. A large margin can actually ensure good behavior on future data (good generalization performance). In practice, real data isn’t always separable even in the presence of a kernel. To work around this, most SVM implementations implement the so-called soft margin optimization goal.
A soft margin optimizer adds additional error terms that are used to allow a limited fraction of the training examples to be on the wrong side of the decision surface.[10] The model doesn’t actually perform well on the altered training examples, but trades the error on these examples against increased margin on the remaining training examples. For most implementations, the model hyperparameter C or nu determines the trade-off between margin width for the remaining data and how much data is pushed around to achieve the margin. We will use the nu hyperparameter. nu takes settings between zero and one; lower values allow fewer training misclassifications, favoring more-complex models (more support vectors).[11] For our example, we will just use the function default value: 0.5.
A common type of dataset that is inseparable under any kernel is a dataset where there are at least two examples belonging to different outcome classes with the exact same values for all input or x variables. The original “hard margin” SVM couldn’t deal with this sort of data and was for that reason not considered to be practical.
For more details on SVMs, we recommend Cristianini and Shawe-Taylor’s An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
The SVM picks which data is unimportant (left out) and which is very important (used as support vectors). But the reshaping of the problem to make the data separable is actually performed by what are called kernel methods or kernel functions.
Figure 10.22 illustrates[12] what we hope for from a good kernel: our data being pushed around so it’s easier to sort or classify. By using a kernel transformation, we move to a situation where the distinction we’re trying to learn is representable by a linear separator of our transformed data.
Cristianini and Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods.
To begin to understand SVMs, we need to take a quick look at the common math and terminology that a user of SVMs and kernel methods should be conversant with. First is the notion of a kernel function, which is used to implement the phi() we saw reshaping space.
In our application, a kernel is a function with a very specific definition. Let u and v be any pair of variables. u and v are typically vectors of input or independent variables (possibly taken from two rows of a dataset). A function k(,) that maps pairs (u,v) to numbers is called a kernel function if and only if there is some function phi() mapping (u,v)s to a vector space such that k(u,v) = phi(u) %*% phi(v) for all u,v.[13] We’ll informally call the expression k(u,v) = phi(u) %*% phi(v) the Mercer expansion of the kernel (in reference to Mercer’s theorem; see http://mng.bz/xFD2) and consider phi() the certificate that tells us k(,) is a good kernel. This is much easier to understand from a concrete example. In the following listing, we show an equivalent phi() / k(,) pair.
%*% is R’s notation for dot product or inner product; see help('%*%') for details. Note that phi() is allowed to map to very large (and even infinite) vector spaces.
u <- c(1, 2) v <- c(3, 4) k <- function(u, v) { 1 u[1] * v[1] + u[2] * v[2] + u[1] * u[1] * v[1] * v[1] + u[2] * u[2] * v[2] * v[2] + u[1] * u[2] * v[1] * v[2] } phi <- function(x) { 2 x <- as.numeric(x) c(x, x*x, combn(x, 2, FUN = prod)) } print(k(u, v)) 3 ## [1] 108 print(phi(u)) ## [1] 1 2 1 4 2 print(phi(v)) ## [1] 3 4 9 16 12 print(as.numeric(phi(u) %*% phi(v))) 4 ## [1] 108
Most kernel methods use the function k(,) directly and only use properties of k(,) guaranteed by the matching phi() to ensure method correctness. The k(,) function is usually quicker to compute than the notional function phi(). A simple example of this is what we’ll call the dot-product similarity of documents. The dot-product document similarity is defined as the dot product of two vectors where each vector is derived from a document by building a huge vector of indicators, one for each possible feature. For instance, if the features you’re considering are word pairs, then for every pair of words in a given dictionary, the document gets a feature of 1 if the pair occurs as a consecutive utterance in the document and 0 if not. This method is the phi(), but in practice we never use the phi() procedure. Instead, when comparing two documents, each consecutive pair of words in one document is generated and a bit of score is added if this pair is both in the dictionary and found consecutively in the other document. For moderate-sized documents and large dictionaries, this direct k(,) implementation is vastly more efficient than the phi() implementation.
The support vector machine gets its name from how the vector w is usually represented: as a linear combination of training examples—the support vectors. Recall we said in section 10.3.3 that the function phi() is allowed, in principle, to map into a very large or even infinite vector space. This means it may not be possible to directly write down w.
Support vector machines work around the “can’t write down w” issue by restricting to ws that are in principle a sum of phi() terms as shown here:
w = sum(a1 * phi(s1), ... , am * phi(sm))
The vectors s1, ..., sm are actually m training examples and are called the support vectors. The preceding formulation helps because such sums are (with some math) equivalent to sums of k( ,x) kernel terms of the form we show next:
w %*% phi(x) + b = sum(a1 * k(s1, x),... , am * k(sm, x)) + b
The right side is a quantity we can compute.
The work of the support vector training algorithm is to pick the vectors s1, ..., sm, the scalars a1, ..., am, and the offset b. All of this is called “the kernel trick.”
A support vector model consists of these things:
The reason why the data scientist must be aware of the support vectors is that they’re stored in the support vector model. For example, with too complex a model, there can be a very large number of support vectors, causing the model to be large and expensive to evaluate. In the worst case, the number of support vectors in the model can be almost as large as the number of training examples, making support vector model evaluation potentially as expensive as nearest-neighbor evaluation, and increasing the risk of overfit. The user picks a good number of support vectors by picking a good value of C or nu through cross-validation.
nu is the important hyperparameter for SVMs. Ideally, we should cross-validate for a good value of nu. Instead of full cross-validation, just try a few values of nu to get the landscape. (We have a worked solution here: https://github.com/WinVector/PDSwR2/tree/master/Spirals.)
Here’s what you should remember from this section:
In this chapter, we demonstrated some advanced methods to fix specific issues with basic modeling approaches: modeling variance, modeling bias, issues with non-linearity, and issues with variable interactions. An important additional family of methods we wish we had time to touch on is deep learning, the improved modern treatment of neural nets. Fortunately there is already a good book we can recommend on this topic: Deep Learning with R, by François Chollet with J. J. Allaire, Manning, 2018.
You should understand that you bring in advanced methods and techniques to fix specific modeling problems, not because they have exotic names or exciting histories. We also feel you should at least try to find an existing technique to fix a problem you suspect is hiding in your data before building your own custom technique; often the existing technique already incorporates a lot of tuning and wisdom. Which method is best depends on the data, and there are many advanced methods to try. Advanced methods can help fix overfit, variable interactions, non-additive relations, and unbalanced distributions, but not lack of features or data.
Finally, the goal of learning the theory of advanced techniques is not to be able to recite the steps of the common implementations, but to know when the techniques apply and what trade-offs they represent. The data scientist needs to supply thought and judgment and realize that the platform can supply implementations.
In this chapter you have learned
The actual point of a modeling project is to deliver results for production deployment and to present useful documentation and evaluations to your partners. The next part of this book will address best practices for delivering your results.