Chapter 7. Linear and logistic regression

This chapter covers

  • Using linear regression to predict quantities
  • Using logistic regression to predict probabilities or categories
  • Extracting relations and advice from linear models
  • Interpreting the diagnostics from R’s lm() call
  • Interpreting the diagnostics from R’s glm() call
  • Using regularization via the glmnet package to address issues that can arise with linear models

In the previous chapter, you learned how to evaluate models. Now that we have the ability to discuss if a model is good or bad, we’ll move on to the modeling step, as shown in the mental model (figure 7.1). In this chapter, we’ll cover fitting and interpreting linear models in R.

Figure 7.1. Mental model

Linear models are especially useful when you want not only to predict an outcome, but also to understand the relationship between the input variables and the outcome. This knowledge can prove valuable because the relationship can often be used as advice on how to get the outcome that you want.

We’ll first define linear regression and then use it to predict customer income. Later, we will use logistic regression to predict the probability that a newborn baby will need extra medical attention. We’ll also walk through the diagnostics that R produces when you fit a linear or logistic model.

Linear methods can work well in a surprisingly wide range of situations. However, there can be issues when the inputs to the model are correlated or collinear. In the case of logistic regression, there can also be issues (ironically) when a subset of the variables predicts a classification output perfectly in a subset of the training data. The last section of the chapter will show how to address these issues by a technique called regularization.

7.1. Using linear regression

Linear regression is the bread and butter prediction method for statisticians and data scientists. If you’re trying to predict a numerical quantity like profit, cost, or sales volume, you should always try linear regression first. If it works well, you’re done; if it fails, the detailed diagnostics produced can give you a good clue as to what methods you should try next.

7.1.1. Understanding linear regression

Example

Suppose you want to predict how many pounds a person on a diet and exercise plan will lose in a month. You will base that prediction on other facts about that person, like how much they reduce their average daily caloric intake over that month and how many hours a day they exercise. In other words, for every person i, you want to predict pounds_lost[i] based on daily_cals_down[i] and daily_exercise[i].

Linear regression assumes that the outcome pounds_lost is linearly related to each of the inputs daily_cals_down[i] and daily_exercise[i]. This means that the relationship between (for instance) daily_cals_down[i] and pounds_lost looks like a (noisy) straight line, as shown in figure 7.2.[1]

1

It is tempting to hope that the coefficients of the joint model introduced below line up with those of the single-variable fits: for example, that the joint model’s intercept equals bc0 + be0, or that the joint model’s b.cals equals the b.cals of this single-variable fit. However, a joint regression does not ensure this.

Figure 7.2. The linear relationship between daily_cals_down and pounds_lost

The relationship between daily_exercise and pounds_lost would similarly be a straight line. Suppose that the equation of the line shown in figure 7.2 is

pounds_lost = bc0 + b.cals * daily_cals_down

This means that for every unit change in daily_cals_down (every calorie reduced), the value of pounds_lost changes by b.cals, no matter what the starting value of daily_cals_down was. To make it concrete, suppose pounds_lost = 3 + 2 * daily_cals_down. Then increasing daily_cals_down by one increases pounds_lost by 2, no matter what value of daily_cals_down you start with. This would not be true for, say, pounds_lost = 3 + 2 * (daily_cals_down^2).

Linear regression further assumes that the total pounds lost is a linear combination of our variables daily_cals_down[i] and daily_exercise[i], or the sum of the pounds lost due to reduced caloric intake, and the pounds lost due to exercise. This gives us the following form for the linear regression model of pounds_lost:

pounds_lost[i] = b0 + b.cals * daily_cals_down[i] +
     b.exercise * daily_exercise[i]

The goal of linear regression is to find the values of b0, b.cals, and b.exercise so that the linear combination of daily_cals_down[i] and daily_exercise[i] (plus some offset b0) comes very close to pounds_lost[i] for all persons i in the training data.
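To make this concrete, here is a minimal sketch in R that simulates data of this form and recovers the coefficients with lm(). All of the numbers are invented for illustration; no weight-loss dataset accompanies this example.

# Simulated data (hypothetical values, for illustration only)
set.seed(1234)
n <- 100
daily_cals_down <- runif(n, 0, 500)    # calories cut per day
daily_exercise <- runif(n, 0, 2)       # hours of exercise per day
pounds_lost <- 0.5 + 0.01 * daily_cals_down + 2 * daily_exercise +
  rnorm(n, sd = 0.5)                   # "true" b0, b.cals, b.exercise, plus noise
d <- data.frame(daily_cals_down, daily_exercise, pounds_lost)

fit <- lm(pounds_lost ~ daily_cals_down + daily_exercise, data = d)
coef(fit)                              # estimates of b0, b.cals, and b.exercise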

Let’s put this in more general terms. Suppose that y[i] is the numeric quantity you want to predict (called the dependent or response variable), and x[i,] is a row of inputs that corresponds to output y[i] (the x[i,] are the independent or explanatory variables). Linear regression attempts to find a function f(x) such that

Equation 7.1. The expression for a linear regression model

f(x[i,]) = b[0] + b[1] * x[i,1] + ... + b[n] * x[i,n]
y[i] = f(x[i,]) + e[i]

You want numbers b[0],...,b[n] (called the coefficients or betas) such that f(x[i,]) is as near as possible to y[i] for all (x[i,],y[i]) pairs in the training data. R supplies a one-line command to find these coefficients: lm().

The last term in equation 7.1, e[i], represents what are called unsystematic errors, or noise. Unsystematic errors are defined to all have a mean value of 0 (so they don’t represent a net upward or net downward bias) and are defined as uncorrelated with x[i,]. In other words, x[i,] should not encode information about e[i] (or vice versa).

By assuming that the noise is unsystematic, linear regression tries to fit what is called an “unbiased” predictor. This is another way of saying that the predictor gets the right answer “on average” over the entire training set, or that it underpredicts about as much as it overpredicts. In particular, unbiased estimates tend to get totals correct.

Example

Suppose you have fit a linear regression model to predict weight loss based on reduction of caloric intake and exercise. Now consider the set of subjects in the training data, LowExercise, who exercised between zero and one hour a day. Together, these subjects lost a total of 150 pounds over the course of the study. How much did the model predict they would lose?

With a linear regression model, if you take the predicted weight loss for all the subjects in LowExercise and sum them up, that total will sum to 150 pounds, which means that the model predicts the average weight loss of a person in the LowExercise group correctly, even though some of the individuals will have lost more than the model predicted, and some of them will have lost less. In a business setting, getting sums like this correct is critical, particularly when summing up monetary amounts.

Under these assumptions (linear relationships and unsystematic noise), linear regression is absolutely relentless in finding the best coefficients b[i]. If there’s some advantageous combination or cancellation of features, it’ll find it. One thing that linear regression doesn’t do is reshape variables to be linear. Oddly enough, linear regression often does an excellent job, even when the actual relation is not in fact linear.

Thinking about linear regression

When working with linear regression, you’ll go back and forth between “Adding is too simple to work,” and “How is it even possible to estimate the coefficients?” This is natural and comes from the fact that the method is both simple and powerful. Our friend Philip Apps sums it up: “You have to get up pretty early in the morning to beat linear regression.”

When the assumptions of linear regression are violated

As a toy example, consider trying to fit the squares of the integers 1–10 using only a linear function plus a constant. We’re asking for coefficients b[0] and b[1] such that

x[i]^2 nearly equals b[0] + b[1] * x[i]

This is clearly not a fair thing to ask, since we know that what we are trying to predict is not linear. In this case, however, linear regression still does a pretty good job. It picks the following fit:

x[i]^2 nearly equals -22 + 11 * x[i]

As figure 7.3 shows, this is a good fit in the region of values we trained on.
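You can reproduce this fit with a few lines of R (output formatting may vary slightly):

x <- 1:10
y <- x^2                 # the quantity we are (unfairly) trying to fit with a line
fit <- lm(y ~ x)
coef(fit)
## (Intercept)           x
##         -22          11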

Figure 7.3. Fit versus actuals for y = x^2

The example in figure 7.3 is typical of how linear regression is “used in the field”—we’re using a linear model to predict something that is itself not linear. Be aware that this is a minor sin. In particular, note that the errors between the model’s predictions and the true y are not random, but systematic: the model underpredicts for specific ranges of x and overpredicts for others. This isn’t ideal, but often the best we can do. Note also that, in this example, the predictions veer further away from the true outcome near the endpoints of the fit, which indicates that this model is probably not safe to use outside the range of x that the model observed in the training data.

Extrapolation is not as safe as interpolation

In general, you should try to use a model only for interpolation: predicting for new data that falls inside the range of your training data. Extrapolation (predicting for new data outside the range observed during training) is riskier for any model. It’s especially risky for linear models, unless you know that the system that you are modeling is truly linear.

Next we’ll work through an example of how to apply linear regression on more-interesting real data.

Introducing the PUMS dataset
Example

Suppose you want to predict personal income of any individual in the general public, within some relative percent, given their age, education, and other demographic variables. In addition to predicting income, you also have a secondary goal: to determine the effect of a bachelor’s degree on income, relative to having no degree at all.

For this task, you will use the 2016 US Census PUMS dataset. For simplicity, we have prepared a small sample of PUMS data to use for this example. The data preparation steps include these:

  • Restricting the data to full-time employees between 20 and 50 years of age, with an income between $1,000 and $250,000.
  • Dividing the data into a training set, dtrain, and a test set, dtest.

You can continue the example by downloading psub.RDS (from https://github.com/WinVector/PDSwR2/raw/master/PUMS/psub.RDS) into your working directory and performing the steps in the following listing.[1]

1

The script for preparing the data sample can be found at https://github.com/WinVector/PDSwR2/blob/master/PUMS/makeSubSample.Rmd.

Listing 7.1. Loading the PUMS data and fitting a model
psub <- readRDS("psub.RDS")

set.seed(3454351)
gp <- runif(nrow(psub))                                               1

dtrain <- subset(psub, gp >= 0.5)                                     2
dtest <- subset(psub, gp < 0.5)

model <- lm(log10(PINCP) ~ AGEP + SEX + COW + SCHL, data = dtrain)    3
dtest$predLogPINCP <- predict(model, newdata = dtest)                 4
dtrain$predLogPINCP <- predict(model, newdata = dtrain)

  • 1 Makes a random variable to group and partition the data
  • 2 Splits 50–50 into training and test sets
  • 3 Fits a linear model to log(income)
  • 4 Gets the predicted log(income) on the test and training sets

Each row of PUMS data represents a single anonymized person or household. Personal data recorded includes occupation, level of education, personal income, and many other demographic variables.

For this example we have decided to predict log10(PINCP), or the logarithm of income. Fitting logarithm-transformed data typically gives results with smaller relative error, emphasizing smaller errors on smaller incomes. But this improved relative error comes at a cost of introducing a bias: on average, predicted incomes are going to be below actual training incomes. An unbiased alternative to predicting log(income) would be to use a type of generalized linear model called Poisson regression. We will discuss generalized linear models (specifically, logistic regression) in section 7.2. The Poisson regression is unbiased, but typically at the cost of larger relative errors.[1]

1

For a series of articles discussing these issues, please see http://www.win-vector.com/blog/2019/07/link-functions-versus-data-transforms/.

For the analysis in this section, we’ll consider the input variables age (AGEP), sex (SEX), class of worker (COW), and level of education (SCHL). The output variable is personal income (PINCP). We’ll also set the reference level, or “default” sex to M (male); the reference level of class of worker to Employee of a private for-profit; and the reference level of education level to no high school diploma. We’ll discuss reference levels later in this chapter.

Reference levels are baselines, not value judgments

When we say that the default sex is male and the default educational level is no high school diploma, we are not implying that you should expect that a typical worker is male, or that a typical worker has no high school diploma. The reference level of a variable is the baseline that other values of the variable are compared to. So we are saying that at some point in this analysis, we may want to compare the income of female workers to that of male workers with equivalent characteristics, or that we may want to compare the income of workers with a high school degree or a bachelor’s degree to that of a worker with no high school diploma (but otherwise equivalent characteristics).

By default, R selects the alphabetically first value of a categorical variable as the reference level.

Now on to the model building.

7.1.2. Building a linear regression model

The first step in either prediction or finding relations (advice) is to build the linear regression model. The function to build the linear regression model in R is lm(), supplied by the stats package. The most important argument to lm() is a formula with ~ used in place of an equals sign. The formula specifies what column of the data frame is the quantity to be predicted, and what columns are to be used to make the predictions.

Statisticians call the quantity to be predicted the dependent variable and the variables/columns used to make the prediction the independent variables. We find it is easier to call the quantity to be predicted the y and the variables used to make the predictions the xs. Our formula is this: log10(PINCP) ~ AGEP + SEX + COW + SCHL, which is read “Predict the log base 10 of income as a function of age, sex, employment class, and education.”[1] The overall method is demonstrated in figure 7.4.

1

Recall from the discussion of the lognormal distribution in section 4.2 that it’s often useful to log transform monetary quantities. The log transform is also compatible with our original task of predicting incomes with a relative error (meaning large errors count more against small incomes). The glm() methods of section 7.2 can be used to avoid the log transform and predict in such a way as to minimize square errors (so being off by $50,000 would be considered the same error for both large and small incomes).

Figure 7.4. Building a linear model using lm()

The statement in figure 7.4 builds the linear regression model and stores the results in the new object called model. This model object can be used both to make predictions and to extract important relations (advice) from the data.

R stores training data in the model

R holds a copy of the training data in its model to supply the residual information seen in summary(model). Holding a copy of the data this way is not strictly necessary, and can needlessly run you out of memory. If you’re running low on memory (or swapping), you can dispose of R objects like model using the rm() command. In this case, you’d dispose of the model by running rm("model").

7.1.3. Making predictions

Once you’ve called lm() to build the model, your first goal is to predict income. This is easy to do in R. To predict, you pass data into the predict() method. Figure 7.5 demonstrates this using both the test and training data frames dtest and dtrain.

Figure 7.5. Making predictions with a linear regression model

The data frame columns dtest$predLogPINCP and dtrain$predLogPINCP now store the predictions for the test and training sets, respectively. We have now both produced and applied a linear regression model.

Characterizing prediction quality

Before publicly sharing predictions, you want to inspect both the predictions and the model for quality. We recommend plotting the actual y that you’re trying to predict (in this case, log income) as if it were a function of your prediction. In this case, plot log10(PINCP) as if it were a function of predLogPINCP. If the predictions are very good, then the plot will be dots arranged near the line y=x, which we call the line of perfect prediction (the phrase is not standard terminology; we use it to make talking about the graph easier). The steps to produce this, illustrated in figure 7.6, are shown in the next listing.

Figure 7.6. Plot of actual log income as a function of predicted log income

Listing 7.2. Plotting log income as a function of predicted log income
library('ggplot2')
ggplot(data = dtest, aes(x = predLogPINCP, y = log10(PINCP))) +
   geom_point(alpha = 0.2, color = "darkgray") +
   geom_smooth(color = "darkblue") +
   geom_line(aes(x = log10(PINCP),                  1
                  y = log10(PINCP)),
             color = "blue", linetype = 2) +
   coord_cartesian(xlim = c(4, 5.25),               2
                    ylim = c(3.5, 5.5))

  • 1 Plots the line x=y
  • 2 Limits the range of the graph for legibility

Statisticians prefer the residual plot shown in figure 7.7, where the residual errors (in this case, predLogPINCP - log10(PINCP)) are plotted as a function of predLogPINCP. In this case, the line of perfect prediction is the line y=0. Notice that the points are scattered widely from this line (a possible sign of low-quality fit). The residual plot in figure 7.7 is prepared with the R steps in the next listing.

Figure 7.7. Plot of residual error as a function of prediction

Listing 7.3. Plotting residual error as a function of predicted log income
ggplot(data = dtest, aes(x = predLogPINCP,
                     y = predLogPINCP - log10(PINCP))) +
  geom_point(alpha = 0.2, color = "darkgray") +
  geom_smooth(color = "darkblue") +
  ylab("residual error (prediction - actual)")

Why are the predictions, not the true values, on the x-axis?

A graph that plots predictions on the x-axis and either true values (as in figure 7.6) or residuals (as in figure 7.7) on the y-axis answers different questions than a graph that puts true values on the x-axis and predictions (or residuals) on the y-axis. Statisticians tend to prefer the graph as shown in figure 7.7. A residual graph with predictions on the x-axis gives you a sense of when the model may be under- or overpredicting, based on the model’s output.

A residual graph with the true outcome on the x-axis and residuals on the y-axis would almost always appear to have undesirable residual structure, even when there is no modeling problem. This illusion is due to an effect called regression to the mean or reversion to mediocrity.

When you look at the true-versus-fitted or residual graphs, you’re looking for some specific things that we’ll discuss next.

On average, are the predictions correct?

Does the smoothing curve lie more or less along the line of perfect prediction? Ideally, the points will all lie very close to that line, but you may instead get a wider cloud of points (as we do in figures 7.6 and 7.7) if your input variables don’t explain the output too closely. But if the smoothing curve lies along the line of perfect prediction and “down the middle” of the cloud of points, then the model predicts correctly on average: it underpredicts about as much as it overpredicts.

Are there systematic errors?

If the smoothing curve veers off the line of perfect prediction too much, as in figure 7.8, this is a sign of systematic under- or overprediction in certain ranges: the error is correlated with the prediction. Systematic errors indicate that the system is not “linear enough” for a linear model to be a good fit, so you should try one of the different modeling approaches that we will discuss later in this book.

Figure 7.8. An example of systematic errors in model predictions

R-squared and RMSE

In addition to inspecting graphs, you should produce quantitative summaries of the quality of the predictions and the residuals. One standard measure of quality of a prediction is called R-squared, which we covered in section 6.2.4. R-squared is a measure of how well the model “fits” the data, or its “goodness of fit.” You can compute the R-squared between the prediction and the actual y with the R steps in the following listing.

Listing 7.4. Computing R-squared
rsq <- function(y, f) { 1 - sum((y - f)^2)/sum((y - mean(y))^2) }

rsq(log10(dtrain$PINCP), dtrain$predLogPINCP)     1
## [1] 0.2976165

rsq(log10(dtest$PINCP), dtest$predLogPINCP)       2
## [1] 0.2911965

  • 1 R-squared of the model on the training data
  • 2 R-squared of the model on the test data

R-squared can be thought of as what fraction of the y variation is explained by the model. You want R-squared to be fairly large (1.0 is the largest you can achieve) and R-squareds that are similar on test and training. A significantly lower R-squared on test data is a symptom of an overfit model that looks good in training and won’t work in production. In this case, the R-squareds were about 0.3 for both the training and test data. We’d like to see R-squareds higher than this (say, 0.7–1.0). So the model is of low quality, but not overfit.

For well-fit models, R-squared is also equal to the square of the correlation between the predicted values and actual training values.
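You can check this on the training data; the result should come out near the R-squared of about 0.2976 computed in listing 7.4:

cor(log10(dtrain$PINCP), dtrain$predLogPINCP)^2     # square of the correlation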


R-squared can be overoptimistic

In general, R-squared on training data will be higher for models with more input parameters, independent of whether the additional variables actually improve the model or not. That’s why many people prefer the adjusted R-squared (which we’ll discuss later in this chapter).

Also, R-squared is related to correlation, and the correlation can be artificially inflated if the model correctly predicts a few outliers. This is because the increased data range makes the overall data cloud appear “tighter” against the line of perfect prediction. Here’s a toy example. Let y <- c(1,2,3,4,5,9,10) and pred <- c(0.5,0.5,0.5, 0.5,0.5,9,10). This corresponds to a model that’s completely uncorrelated to the true outcome for the first five points, and perfectly predicts the last two points, which are somewhat far away from the first five. You can check for yourself that this obviously poor model has a correlation cor(y, pred) of about 0.926, with a corresponding R-squared of 0.858. So it’s an excellent idea to look at the true-versus-fitted graph on test data, in addition to checking R-squared.
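Here is that check in R:

y <- c(1, 2, 3, 4, 5, 9, 10)
pred <- c(0.5, 0.5, 0.5, 0.5, 0.5, 9, 10)
cor(y, pred)       # about 0.926
cor(y, pred)^2     # about 0.858, even though the model is useless on most points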

Another good measure to consider is root mean square error (RMSE).

Listing 7.5. Calculating root mean square error
rmse <- function(y, f) { sqrt(mean( (y-f)^2 )) }

rmse(log10(dtrain$PINCP), dtrain$predLogPINCP)      1
## [1] 0.2685855

rmse(log10(dtest$PINCP), dtest$predLogPINCP)        2
## [1] 0.2675129

  • 1 RMSE of the model on the training data
  • 2 RMSE of the model on the test data

You can think of the RMSE as a measure of the width of the data cloud around the line of perfect prediction. We’d like RMSE to be small, and one way to achieve this is to introduce more useful, explanatory variables.

7.1.4. Finding relations and extracting advice

Recall that your other goal, beyond predicting income, is to find the value of having a bachelor’s degree. We’ll show how this value, and other relations in the data, can be read directly off a linear regression model.

All the information in a linear regression model is stored in a block of numbers called the coefficients. The coefficients are available through the coefficients(model) function. The coefficients of our income model are shown in figure 7.9.

Figure 7.9. The model coefficients

Reported coefficients

Our original modeling variables were only AGEP, SEX, COW (class of work), and SCHL (schooling/education); yet the model reports many more coefficients than these four. We’ll explain what all the reported coefficients are.

In figure 7.9, there are eight coefficients that start with SCHL. The original variable SCHL took on these eight string values plus one more not shown: no high school diploma. Each of these possible strings is called a level, and SCHL itself is called a categorical or factor variable. The level that isn’t shown is called the reference level; the coefficients of the other levels are measured with respect to the reference level.

For example, in SCHLBachelor's degree we find the coefficient 0.36, which is read as “The model gives a 0.36 bonus to log base 10 income for having a bachelor’s degree, relative to not having a high school degree.” You can solve for the income ratio between someone with a bachelor’s degree and the equivalent person (same sex, age, and class of work) without a high school degree as follows:

log10(income_bachelors) = log10(income_no_hs_degree) + 0.36
log10(income_bachelors) - log10(income_no_hs_degree) = 0.36
         (income_bachelors) / (income_no_hs_degree)  = 10^(0.36)

This means that someone with a bachelor’s degree will tend to have an income about 10^0.36, or 2.29 times higher than the equivalent person without a high school degree.

And under SCHLRegular high school diploma, we find the coefficient 0.11, which is read as “The model gives a 0.11 bonus to log base 10 income for having a regular high school diploma, relative to not having a high school degree.” It follows that the model believes a bachelor’s degree tends to add 0.36 - 0.11 units to the predicted log income, relative to a regular high school diploma:

log10(income_bachelors) - log10(income_no_hs_degree) = 0.36
       log10(income_hs) - log10(income_no_hs_degree) = 0.11

log10(income_bachelors) - log10(income_hs) = 0.36 - 0.11         1
          (income_bachelors) / (income_hs)  = 10^(0.36 - 0.11)

  • 1 Subtracts the second equation from the first

The modeled relation between the bachelor’s degree holder’s expected income and the high school graduate’s (all other variables being equal) is 10^(0.36 - 0.11), or about 1.8 times greater. The advice: college is worth it if you can find a job (remember that we limited the analysis to the fully employed, so the result already assumes employment).

SEX and COW are also discrete variables, with reference levels Male and Employee of a private for profit [company], respectively. The coefficients that correspond to the different levels of SEX and COW can be interpreted in a manner similar to the education level. AGEP is a continuous variable with coefficient 0.0116. You can interpret this as saying that a one-year increase in age adds a 0.0116 bonus to log income; in other words, an increase in age of one year corresponds to an increase of income of 10^0.0116, or a factor of 1.027—about a 2.7% increase in income (all other variables being equal).
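These conversions from log base 10 coefficients to multiplicative effects on income are easy to check directly in R:

10^0.36            # bachelor's degree vs. no high school diploma: about 2.29
10^(0.36 - 0.11)   # bachelor's degree vs. regular high school diploma: about 1.78
10^0.0116          # one additional year of age: about 1.027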

The coefficient (Intercept) corresponds to a variable that always has a value of 1, which is implicitly added to linear regression models unless you use the special 0+ notation in the formula during the call to lm(). One way to interpret the intercept is to think of it as “the prediction for the reference subject”—that is, the subject who takes on the values of all the reference levels for the categorical inputs, and zero for the continuous variables. Note that this may not be a physically plausible subject.

In our example, the reference subject would be a male employee of a private for-profit company, with no high school degree, who is zero years old. If such a person could exist, the model would predict their log base 10 income to be about 4.0, which corresponds to an income of $10,000.

Indicator variables

Most modeling methods handle a string-valued (categorical) variable with n possible levels by converting it to n (or n-1) binary variables, or indicator variables. R has commands to explicitly control the conversion of string-valued variables into well-behaved indicators: as.factor() creates categorical variables from string variables; relevel() allows the user to specify the reference level.

But beware of variables with a very large number of levels, like ZIP codes. The runtime of linear (and logistic) regression increases as roughly the cube of the number of coefficients. Too many levels (or too many variables in general) will bog the algorithm down and require much more data for reliable inference. In chapter 8, we will discuss methods, such as effects coding or impact coding, for dealing with such high-cardinality variables.
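The following small sketch, using a made-up variable called color, shows how R expands a categorical variable into indicator columns and how relevel() changes the reference level:

d <- data.frame(color = c("red", "green", "blue", "green"))
d$color <- as.factor(d$color)
levels(d$color)                            # "blue" "green" "red": alphabetical order by default
d$color <- relevel(d$color, ref = "red")   # make "red" the reference level
model.matrix(~ color, data = d)
# The result has an intercept column plus indicator columns for "green" and
# "blue"; the reference level "red" gets no column of its own.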

The preceding interpretations of the coefficients assume that the model has provided good estimates of the coefficients. We’ll see how to check that in the next section.

7.1.5. Reading the model summary and characterizing coefficient quality

In section 7.1.3, we checked whether our income predictions were to be trusted. We’ll now show how to check whether the model coefficients are reliable. This is especially important, as we’ve been treating the coefficients, and the relations they express, as advice.

Most of what we need to know is already in the model summary, which is produced using the summary() command: summary(model). This produces the output shown in figure 7.10.

Figure 7.10. Model summary

This figure looks intimidating, but it contains a lot of useful information and diagnostics. You’re likely to be asked about elements of figure 7.10 when presenting results, so we’ll demonstrate how all of these fields are derived and what the fields mean.

We’ll first break down the summary() into pieces.

The original model call

The first part of the summary() is how the lm() model was constructed:

Call:
lm(formula = log10(PINCP) ~ AGEP + SEX + COW + SCHL,
    data = dtrain)

This is a good place to double-check whether you used the correct data frame, performed your intended transformations, and used the right variables. For example, you can double-check whether you used the data frame dtrain and not the data frame dtest.

The residuals summary

The next part of the summary() is the residuals summary:

Residuals:
    Min      1Q  Median      3Q     Max
-1.5038 -0.1354  0.0187  0.1710  0.9741

Recall that the residuals are the errors in prediction: log10(dtrain$PINCP) - predict(model, newdata = dtrain). In linear regression, the residuals are everything. Most of what you want to know about the quality of your model fit is in the residuals. You can calculate useful summaries of the residuals for both the training and test sets, as shown in the following listing.

Listing 7.6. Summarizing residuals
( resids_train <- summary(log10(dtrain$PINCP) -
      predict(model, newdata = dtrain)) )
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -1.5038 -0.1354  0.0187  0.0000  0.1710  0.9741

( resids_test <- summary(log10(dtest$PINCP) -
      predict(model, newdata = dtest)) )
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
## -1.789150 -0.130733  0.027413  0.006359  0.175847  0.912646

In linear regression, the coefficients are chosen to minimize the sum of squares of the residuals. This is why the method is also often called the least squares method. So for good models, you expect the residuals to be small.

In the residual summary, you’re given the Min. and Max., which are the smallest and largest residuals seen. You’re also given the quartiles of the residuals: 1st. Qu., or the value that upper bounds the first 25% of the data; the Median, or the value that upper bounds the first 50% of the data; and 3rd Qu., or the value that upper bounds the first 75% of the data (the Max is the 4th quartile: the value that upper bounds 100% of the data). The quartiles give you a rough idea of the data’s distribution.

What you hope to see in the residual summary is that the median is near 0 (as it is in our example), and that the 1st. Qu. and the 3rd Qu. are roughly equidistant from the median (with neither too large). In our example, the 1st. Qu. and 3rd Qu. of the training residuals (resids_train) are both about 0.15 from the median. They are slightly less symmetric for the test residuals (0.16 and 0.15 from the median), but still within bounds.

The 1st. Qu. and 3rd Qu. quantiles are interesting because exactly half of the training data has a residual in this range. In our example, if you drew a random training datum, its residual would be in the range –0.1354 to 0.1710 exactly half the time. So you really expect to commonly see prediction errors of these magnitudes. If these errors are too big for your application, you don’t have a usable model.
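Because the model predicts log base 10 of income, you can translate these quartile residuals back into relative errors on income itself (a quick check, not shown in the original output):

10^c(-0.1354, 0.1710)   # ratio of actual to predicted income at the quartile residuals

In other words, for about half the training data the actual income falls between roughly 73% and 148% of the predicted income.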

The coefficients table

The next part of the summary(model) is the coefficients table, as shown in figure 7.11. A matrix form of this table can be retrieved as summary(model)$coefficients.

Figure 7.11. Model summary coefficient columns

Each model coefficient forms a row of the summary coefficients table. The columns report the estimated coefficient, the uncertainty of the estimate, how large the coefficient is relative to the uncertainty, and how likely such a ratio would be due to mere chance. Figure 7.11 gives the names and interpretations of the columns.

You set out to study income and the impact that getting a bachelor’s degree has on income. But you must look at all the coefficients to check for interfering effects.

For example, the coefficient of –0.108 for SEXF means that your model learned a penalty of –0.108 to log10(PINCP) for being female. The ratio of female income to male income is modeled to be 10^(-0.108): women earn 78% of what men earn, all other model parameters being equal. Note we said “all other model parameters being equal” not “all other things being equal.” That’s because we’re not modeling the number of years in the workforce (which age may not be a reliable proxy for) or occupation/industry type (which has a big impact on income). This model is not, with the features it was given, capable of testing if, on average, a female in the same job with the same number of years of experience is paid less.

Insignificant coefficients

Notice in figure 7.11 the coefficient COWSelf employed incorporated is “not significant.” This means there is not enough evidence with respect to this model design to determine if the coefficient is non-zero.

Some analysts recommend stepwise regression to remove such variables, which adds an inductive bias of the form “If we can’t tell it is non-zero, force it to zero.” In this case, this wouldn’t be convenient as the variable is just a level of a categorical variable (so it’s a bit harder to treat independently). We do not recommend stepwise regression, as stepwise regression introduces multiple comparison problems that bias the estimates of the remaining coefficients.[a] We recommend either living with the non-significant estimates (as even replacing them with zero is still trading one uncertain estimate for another), or prefiltering the variables for utility, or regularized methods (such as glmnet/lasso). All of these ideas are covered throughout this book.

a

See Robert Tibshirani, “Regression shrinkage and selection via the lasso.” Journal of the Royal Statistical Society, Series B 58: 267–288, 1996.

A point to remember: in terms of prediction (our primary goal), it’s not a problem to have a small number of insignificant coefficients with small effect sizes. Problems arise when there are insignificant coefficients with large effects, or a great number of insignificant coefficients.

Statistics as an attempt to correct bad experimental design

The absolute best experiment to test if there’s a sex-driven difference in income distribution would be to compare incomes of individuals who were identical in all possible variables (age, education, years in industry, performance reviews, race, region, and so on) but differ only in sex. We’re unlikely to have access to such data, so we’d settle for a good experimental design: a population where there’s no correlation between any other feature and sex. Random selection can help in experimental design, but it’s not a complete panacea. Barring a good experimental design, the usual pragmatic strategy is this: introduce extra variables to represent effects that may have been interfering with the effect we were trying to study. Thus a study of the effect of sex on income may include other variables like education and age to try to disentangle the competing effects.

The p-value and significance

The p-value (also called the significance) is one of the most important diagnostic columns in the coefficient summary. The p-value estimates the probability of seeing a coefficient with a magnitude as large as you observed if the true coefficient is really zero (if the variable has no effect on the outcome). So don’t trust the estimate of any coefficient with a large p-value. Generally, people pick a threshold, and call all the coefficients with a p-value below that threshold statistically significant, meaning that those coefficients are likely not zero. A common threshold is p < 0.05; however, this is an arbitrary level.
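If you want to work with the p-values programmatically, they are available as a column of the coefficient matrix returned by summary() (a small sketch using the standard lm() summary):

coef_table <- summary(model)$coefficients
coef_table[, "Pr(>|t|)"]                                  # the p-value column
rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]     # coefficients below a 0.05 threshold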

Note that lower p-values aren’t always “better” once they’re good enough. There’s no reason to prefer a coefficient with a p-value of 1e-23 to one with a p-value of 1e-08 as long as both p-values are below your chosen threshold; at this point, you know both coefficients are likely good estimates and you should prefer the ones that explain the most variance. Also note that high p-values don’t always tell you which of the coefficients are bad, as we discuss in the sidebar.

Collinearity also lowers significance

Sometimes, a predictive variable won’t appear significant because it’s collinear (or correlated) with another predictive variable. For example, if you did try to use both age and number of years in the workforce to predict income, neither variable may appear significant. This is because age tends to be correlated with number of years in the workforce. If you remove one of the variables and the other one gains significance, this is a good indicator of correlation.

If you see coefficients that seem unreasonably large (often of opposite signs), or unusually large standard errors on the coefficients, that may indicate collinear variables.

Another possible indication of collinearity in the inputs is seeing coefficients with an unexpected sign: for example, seeing that income is negatively correlated with years in the workforce.

The overall model can still predict income quite well, even when the inputs are correlated; it just can’t determine which variable deserves the credit for the prediction.

Using regularization can be helpful in collinear situations, as we will discuss in section 7.3. Regularization prefers small coefficients, which can be less hazardous when used on new data.

If you want to use the coefficient values as advice as well as to make good predictions, try to avoid collinearity in the inputs as much as possible.
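Though not used in this chapter, one common numeric screen for collinearity is the variance inflation factor, available from the separately installed car package; values much larger than about 5 to 10 are often taken as a warning sign:

library(car)    # assumes the car package is installed
vif(model)      # for models with categorical inputs, this reports generalized VIFs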

Overall model quality summaries

The last part of the summary(model) report is the overall model quality statistics. It’s a good idea to check the overall model quality before sharing any predictions or coefficients. The summaries are as follows:

Residual standard error: 0.2688 on 11186 degrees of freedom
Multiple R-squared:  0.2976,    Adjusted R-squared:  0.2966
F-statistic: 296.2 on 16 and 11186 DF,  p-value: < 2.2e-16

Let’s explain each of the summaries in a little more detail.

Degrees of freedom

The degrees of freedom is the number of data rows minus the number of coefficients fit; in our case, this:

(df <- nrow(dtrain) - nrow(summary(model)$coefficients))
## [1] 11186

The degrees of freedom is the number of training data rows you have after correcting for the number of coefficients you tried to solve for. You want the number of datums in the training set to be large compared to the number of coefficients you are solving for; in other words, you want the degrees of freedom to be high. A low degree of freedom indicates that the model you are trying to fit is too complex for the amount of data that you have, and your model is likely to be overfit. Overfitting is when you find chance relations in your training data that aren’t present in the general population. Overfitting is bad: you think you have a good model when you don’t.

Residual standard error

The residual standard error is the square root of the sum of squared residuals divided by the degrees of freedom. So it’s similar to the RMSE (root mean squared error) that we discussed earlier, except with the number of data rows adjusted to be the degrees of freedom; in R, this is calculated as follows:

(modelResidualError <- sqrt(sum(residuals(model)^2) / df))
## [1] 0.2687895

The residual standard error is a more conservative estimate of model performance than the RMSE, because it’s adjusted for the complexity of the model (the degrees of freedom is less than the number of rows of training data, so the residual standard error is larger than the RMSE). Again, this tries to compensate for the fact that more-complex models have a higher tendency to overfit the data.

Degrees of freedom on test data

On test data (data not used during training), the degrees of freedom equal the number of rows of data. This differs from the case of training data, where, as we have said, the degrees of freedom equal the number of rows of data minus the number of parameters of the model.

The difference arises from the fact that model training “peeks at” the training data, but not the test data.

Multiple and adjusted R-squared

Multiple R-squared is just the R-squared of the model on the training data (discussed in section 7.1.3).

The adjusted R-squared is the multiple R-squared penalized for the number of input variables. The reason for this penalty is that, in general, increasing the number of input variables will improve the R-squared on the training data, even if the added variables aren’t actually informative. This is another way of saying that more-complex models tend to look better on training data due to overfitting, so the adjusted R-squared is a more conservative estimate of the model’s goodness of fit.
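The adjustment can be reproduced by hand from the multiple R-squared, the number of training rows, and the number of coefficients; it should come out near the reported 0.2966:

r2 <- summary(model)$r.squared                  # multiple R-squared, about 0.2976
n <- nrow(dtrain)                               # number of training rows
p <- nrow(summary(model)$coefficients) - 1      # number of coefficients, excluding the intercept
1 - (1 - r2) * (n - 1) / (n - p - 1)            # adjusted R-squared, about 0.2966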

If you do not have test data, it’s a good idea to rely on the adjusted R-squared when evaluating your model. But it’s even better to compute the R-squared between predictions and actuals on holdout test data. In section 7.1.3, we showed the R-squared on test data was 0.29, which in this case is about the same as the reported adjusted R-squared of 0.3. However, we still advise preparing both training and test datasets; the test dataset estimates can be more representative of production model performance than statistical formulas.

The F-statistic and its p-value

The F-statistic is similar to the t-values for coefficients that you saw earlier in figure 7.11. Just as the t-values are used to calculate p-values on the coefficients, the F-statistic is used to calculate a p-value on the model fit. It gets its name from the F-test, which is the technique used to check if two variances—in this case, the variance of the residuals from the constant model and the variance of the residuals from the linear model—are significantly different. The corresponding p-value is the estimate of the probability that we would’ve observed an F-statistic this large or larger if the two variances in question were in reality the same. So you want the p-value to be small (a common threshold: less than 0.05).

In our example, the F-statistic p-value is quite small (< 2.2e-16): the model explains more variance than the constant model does, and the improvement is incredibly unlikely to have arisen only from sampling error.
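If you need the F-statistic or its p-value programmatically, both can be recovered from the model summary with standard R functions:

fstat <- summary(model)$fstatistic                       # F value, numerator df, denominator df
fstat
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)     # the p-value of the F-test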

Interpreting model significances

Most of the tests of linear regression, including the tests for coefficient and model significance, are based on the assumption that the error terms or residuals are normally distributed. It’s important to examine the residuals, graphically or with quantile analysis, to determine whether the regression model is appropriate for your data.
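A standard graphical check, not shown in the text but easy to produce with base R, is a normal quantile-quantile plot of the residuals:

qqnorm(residuals(model))   # points near the reference line suggest roughly normal residuals
qqline(residuals(model))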

7.1.6. Linear regression takeaways

Linear regression is the go-to statistical modeling method for predicting quantities. It is simple and has the advantage that the coefficients of the model can often function as advice. Here are a few points you should remember about linear regression:

  • Linear regression assumes that the outcome is a linear combination of the input variables. Naturally, it works best when that assumption is nearly true, but it can predict surprisingly well even when it isn’t.
  • If you want to use the coefficients of your model for advice, you should only trust the coefficients that appear statistically significant.
  • Overly large coefficient magnitudes, overly large standard errors on the coefficient estimates, and the wrong sign on a coefficient could be indications of correlated inputs.
  • Linear regression can predict well even in the presence of correlated variables, but correlated variables lower the quality of the advice.
  • Linear regression will have trouble with problems that have a very large number of variables, or categorical variables with a very large number of levels.
  • Linear regression packages have some of the best built-in diagnostics available, but rechecking your model on test data is still your most effective safety check.

7.2. Using logistic regression

Logistic regression is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict values that are restricted to the (0, 1) interval, such as probabilities. It’s the go-to method for predicting probabilities or rates, and like linear regression, the coefficients of a logistic regression model can be treated as advice. It’s also a good first choice for binary classification problems.

In this section, we’ll use a medical classification example (predicting whether a newborn will need extra medical attention) to work through all the steps of producing and using a logistic regression model.[1]

1

Logistic regression is usually used to perform classification, but logistic regression and its close cousin beta regression are also useful in estimating rates. In fact, R’s standard glm() call will work with predicting numeric values between 0 and 1 in addition to predicting classifications.

As we did with linear regression, we’ll take a quick overview of logistic regression before tackling the main example.

7.2.1. Understanding logistic regression

Example

Suppose you want to predict whether or not a flight will be delayed, based on facts like the flight’s origin and destination, weather, and air carrier. For every flight i, you want to predict flight_delayed[i] based on origin[i], destination[i], weather[i], and air_carrier[i].

We’d like to use linear regression to predict the probability that a flight i will be delayed, but probabilities are strictly in the range 0:1, and linear regression doesn’t restrict its prediction to that range.

One idea is to find a function of probability that is in the range -Infinity:Infinity, fit a linear model to predict that quantity, and then solve for the appropriate probabilities from the model predictions. So let’s look at a slightly different problem: instead of predicting the probability that a flight is delayed, consider the odds that the flight is delayed, or the ratio of the probability that the flight is delayed over the probability that it is not.

odds[flight_delayed] = P[flight_delayed == TRUE] / P[flight_delayed == FALSE]

The range of the odds function isn’t -Infinity:Infinity; it’s restricted to be a non-negative number. But we can take the log of the odds—the log-odds—to get a function of the probabilities that is in the range -Infinity:Infinity.

log_odds[flight_delayed] = log(P[flight_delayed == TRUE] / P[flight_delayed == FALSE])

Let: p = P[flight_delayed == TRUE]; then
log_odds[flight_delayed] = log(p / (1 - p))

Note that if it’s more likely that a flight will be delayed than on time, the odds ratio will be greater than one; if it’s less likely that a flight will be delayed than on time, the odds ratio will be less than one. So the log-odds is positive if it’s more likely that the flight will be delayed, negative if it’s more likely that the flight will be on time, and zero if the chances of delay are 50-50. This is shown in figure 7.12.

Figure 7.12. Mapping the odds of a flight delay to log-odds

The log-odds of a probability p is also known as logit(p). The inverse of logit(p) is the sigmoid function, shown in figure 7.13. The sigmoid function maps values in the range from -Infinity:Infinity to the range 0:1—in this case, the sigmoid maps unbounded log-odds ratios to a probability value that is between 0 and 1.

logit <- function(p) { log(p/(1-p)) }
s <- function(x) { 1/(1 + exp(-x))}

s(logit(0.7))
# [1] 0.7

logit(s(-2))
# [1] -2

Figure 7.13. Mapping log-odds to the probability of a flight delay via the sigmoid function

Now we can try to fit a linear model to the log-odds of a flight being delayed:

logit(P[flight_delayed[i] == TRUE]) = b0 + b_origin * origin[i] + ...

But what we are really interested in is the probability that a flight is delayed. To get that, take the sigmoid s() of both sides:

P[flight_delayed[i] == TRUE] =  s(b0 + b_origin * origin[i] + ...)

This is the logistic regression model for the probability that a flight will be delayed. The preceding derivation may seem ad hoc, but using the logit function to transform the probabilities is known to have a number of favorable properties. For instance, like linear regression, it gets totals right (as we will see in section 7.2.3).

More generally, suppose y[i] is the class of object i: TRUE or FALSE; delayed or on_time. Also, suppose that x[i,] is a row of inputs, and call one of the classes the “class of interest” or target class—that is, the class you are trying to predict (you want to predict whether something is TRUE or whether the flight is in the class delayed). Then logistic regression attempts to fit a function f(x) such that

Equation 7.2. The expression for a logistic regression model

f(x[i,]) = s(a + b[1] * x[i,1] + ... + b[n] * x[i,n])

If the y[i] are the probabilities that the x[i,] belong to the class of interest, then the task of fitting is to find the a, b[1], ..., b[n] such that f(x[i,]) is the best possible estimate of y[i]. R supplies a one-line statement to find these coefficients: glm().[1] Note that you don’t need to supply y[i] that are probability estimates to run glm(); the training method only requires y[i] that say which class a given training example belongs to.

1

Logistic regression can be used for classifying into any number of categories (as long as the categories are disjoint and cover all possibilities: every x has to belong to one of the given categories). But glm() only handles the two-category case, so our discussion will focus on this case.

As we’ve shown, you can think of logistic regression as a linear regression that finds the log-odds of the probability that you’re interested in. In particular, logistic regression assumes that logit(y) is linear in the values of x. Like linear regression, logistic regression will find the best coefficients to predict y, including finding advantageous combinations and cancellations when the inputs are correlated.

Now to the main example.

Example

Imagine that you’re working at a hospital. The overall goal is to design a plan that provisions neonatal emergency equipment to delivery rooms. Newborn babies are assessed at one and five minutes after birth using what’s called the Apgar test, which is designed to determine if a baby needs immediate emergency care or extra medical attention. A baby who scores below 7 (on a scale from 0 to 10) on the Apgar scale needs extra attention.

Such at-risk babies are rare, so the hospital doesn’t want to provision extra emergency equipment for every delivery. On the other hand, at-risk babies may need attention quickly, so provisioning resources proactively to appropriate deliveries can save lives. Your task is to build a model to identify ahead of time situations with a higher probability of risk, so that resources can be allocated appropriately.

We’ll use a sample dataset from the 2010 CDC natality public-use data file (http://mng.bz/pnGy). This dataset records statistics for all US births registered in the 50 states and the District of Columbia, including facts about the mother and father, and about the delivery. The sample has just over 26,000 births in a data frame called sdata.[2] The data is split into training and test sets, using a random grouping column that we added, which allows for repeatable experiments with the split ratio.

2

Our pre-prepared file is at https://github.com/WinVector/PDSwR2/tree/master/CDC/NatalRiskData.rData; we also provide a script file (https://github.com/WinVector/PDSwR2/blob/master/CDC/PrepNatalRiskData.R), which prepares the data frame from an extract of the full natality dataset. Details found at https://github.com/WinVector/PDSwR2/blob/master/CDC/README.md.

Listing 7.7. Loading the CDC data
load("NatalRiskData.rData")
train <- sdata[sdata$ORIGRANDGROUP <= 5, ]
test <- sdata[sdata$ORIGRANDGROUP > 5, ]

Table 7.1 lists the columns of the dataset that you will use. Because the goal is to anticipate at-risk infants ahead of time, we’ll restrict variables to those whose values are known before delivery or can be determined during labor. For example, facts about the mother’s weight and health history are valid inputs, but post-birth facts like infant birth weight are not. We can include in-labor complications like breech birth by reasoning that the model can be updated in the delivery room (via a protocol or checklist) in time for emergency resources to be allocated before delivery.

Table 7.1. Some variables in the natality dataset

Variable     Type                Description
atRisk       Logical             TRUE if 5-minute Apgar score < 7; FALSE otherwise
PWGT         Numeric             Mother’s prepregnancy weight
UPREVIS      Numeric (integer)   Number of prenatal medical visits
CIG_REC      Logical             TRUE if smoker; FALSE otherwise
GESTREC3     Categorical         Two categories: <37 weeks (premature) and >=37 weeks
DPLURAL      Categorical         Birth plurality, three categories: single/twin/triplet+
ULD_MECO     Logical             TRUE if moderate/heavy fecal staining of amniotic fluid
ULD_PRECIP   Logical             TRUE for unusually short labor (< three hours)
ULD_BREECH   Logical             TRUE for breech (pelvis first) birth position
URF_DIAB     Logical             TRUE if mother is diabetic
URF_CHYPER   Logical             TRUE if mother has chronic hypertension
URF_PHYPER   Logical             TRUE if mother has pregnancy-related hypertension
URF_ECLAM    Logical             TRUE if mother experienced eclampsia: pregnancy-related seizures

Now we’re ready to build the model.

7.2.2. Building a logistic regression model

The function to build a logistic regression model in R is glm(), supplied by the stats package. In our case, the dependent variable y is the logical (or Boolean) atRisk; all the other variables in table 7.1 are the independent variables x. The formula for building a model to predict atRisk using these variables is rather long to type in by hand; you can generate the formula using the mk_formula() function from the wrapr package, as shown next.

Listing 7.8. Building the model formula
complications <- c("ULD_MECO","ULD_PRECIP","ULD_BREECH")
riskfactors <- c("URF_DIAB", "URF_CHYPER", "URF_PHYPER",
                  "URF_ECLAM")

y <- "atRisk"
x <- c("PWGT",
      "UPREVIS",
      "CIG_REC",
      "GESTREC3",
      "DPLURAL",
      complications,
      riskfactors)
library(wrapr)
fmla <- mk_formula(y, x)

Now we’ll build the logistic regression model, using the training dataset.

Listing 7.9. Fitting the logistic regression model
print(fmla)

## atRisk ~ PWGT + UPREVIS + CIG_REC + GESTREC3 + DPLURAL + ULD_MECO +
##     ULD_PRECIP + ULD_BREECH + URF_DIAB + URF_CHYPER + URF_PHYPER +
##     URF_ECLAM
## <environment: base>

model <- glm(fmla, data = train, family = binomial(link = "logit"))

This is similar to the linear regression call to lm(), with one additional argument: family = binomial(link = "logit"). The family function specifies the assumed distribution of the dependent variable y. In our case, we’re modeling y as a binomial distribution, or as a coin whose probability of heads depends on x. The link function “links” the output to a linear model—it’s as if you pass y through the link function, and then model the resulting value as a linear function of the x values. Different combinations of family functions and link functions lead to different kinds of generalized linear models (for example, Poisson, or probit). In this book, we’ll only discuss logistic models, so we’ll only need to use the binomial family with the logit link.[1]

1

The logit link is the default link for the binomial family, so the call glm(fmla, data = train, family = binomial) works just fine. We explicitly specified the link in our example for the sake of discussion.

Don’t forget the family argument!

Without an explicit family argument, glm() defaults to standard linear regression (like lm).

The family argument can be used to select many different behaviors of the glm() function. For example, choosing family = quasipoisson chooses a “log” link, which models the logarithm of the prediction as linear in the inputs.

This would be another approach to try for the income prediction problem of section 7.1. However, whether a log transformation plus a linear model, or a log link plus a generalized linear model, is the better choice for a given problem is a subtle question. The log-link approach measures error on the dollar scale, so it is better at predicting total incomes (a $50,000 error counts the same for small and large incomes alike). The log-transform approach measures error on the relative (log) scale, so it is better at predicting relative incomes (a $50,000 error is less dire for a large income than for a small one).
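To make the distinction concrete, here is a minimal sketch of the two approaches, assuming a hypothetical data frame incomedata with columns income, age, and employment (these names are illustrative and not from this chapter's datasets):

# Log-link approach: models log(E[income]) as linear in the inputs;
# prediction errors are effectively measured on the dollar scale.
model_loglink <- glm(income ~ age + employment,
                     data = incomedata,
                     family = quasipoisson(link = "log"))

# Log-transform approach: models E[log(income)] as linear in the inputs;
# prediction errors are effectively measured on the relative (log) scale.
model_logtrans <- lm(log(income) ~ age + employment, data = incomedata)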

As before, we’ve stored the results in the object model.

7.2.3. Making predictions

Making predictions with a logistic model is similar to making predictions with a linear model—use the predict() function. The following code stores the predictions for the training and test sets as the column pred in the respective data frames.

Listing 7.10. Applying the logistic regression model
train$pred <- predict(model, newdata=train, type = "response")
test$pred <- predict(model, newdata=test, type="response")

Note the additional parameter type = "response". This tells the predict() function to return the predicted probabilities. If you don't specify type = "response", then by default predict() returns the output of the link function, logit(y).
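As a quick sanity check (this snippet is not part of the original listings), you can verify that the response-scale predictions are just the inverse logit of the link-scale predictions:

pred_link <- predict(model, newdata = train)                       # default: link scale, logit(y)
pred_resp <- predict(model, newdata = train, type = "response")    # probability scale
summary(pred_resp - 1 / (1 + exp(-pred_link)))                     # differences should be essentially zero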

One strength of logistic regression is that it preserves the marginal probabilities of the training data. That means that if you sum the predicted probability scores for the entire training set, that quantity will be equal to the number of positive outcomes (atRisk == TRUE) in the training set. This is also true for subsets of the data determined by variables included in the model. For example, in the subset of the training data that has train$GESTREC3 == "< 37 weeks" (the baby was premature), the sum of the predicted probabilities equals the number of positive training examples (see, for example http://mng.bz/j338).

Listing 7.11. Preserving marginal probabilities with logistic regression
sum(train$atRisk == TRUE)                               1
## [1] 273

sum(train$pred)                                         2
## [1] 273

premature <- subset(train, GESTREC3 == "< 37 weeks")    3
sum(premature$atRisk == TRUE)
## [1] 112

sum(premature$pred)                                     4
## [1] 112

  • 1 Counts the number of at-risk infants in the training set.
  • 2 Sums all the predicted probabilities over the training set. Notice that it adds to the number of at-risk infants.
  • 3 Counts the number of at-risk premature infants in the training set
  • 4 Sums all the predicted probabilities for premature infants in the training set. Note that it adds to the number of at-risk premature infants.

Because logistic regression preserves marginal probabilities, you know that the model is in some sense consistent with the training data. When the model is applied to future data with distributions similar to the training data, it should then return results consistent with that data: about the correct probability mass of expected at-risk infants, distributed correctly with respect to the infants’ characteristics. However, if the model is applied to future data with very different distributions (for example, a much higher rate of at-risk infants), the model may not predict as well.

Characterizing prediction quality

If your goal is to use the model to classify new instances into one of two categories (in this case, at-risk or not-at-risk), then you want the model to give high scores to positive instances and low scores otherwise. As we discussed in section 6.2.5, you can check if this is so by plotting the distribution of scores for both the positive and negative instances. Let’s do this on the training set (you should also plot the test set, to make sure the performance is of similar quality).

Listing 7.12. Plotting distribution of prediction score grouped by known outcome
library(WVPlots)
DoubleDensityPlot(train, "pred", "atRisk",
                  title = "Distribution of natality risk scores")

The result is shown in figure 7.14. Ideally, we’d like the distribution of scores to be separated, with the scores of the negative instances (FALSE) to be concentrated on the left, and the distribution for the positive instances to be concentrated on the right. Earlier in figure 6.15 (reproduced here as figure 7.15), we showed an example of a classifier (the spam filter) that separates the positives and the negatives quite well. With the natality risk model, both distributions are concentrated on the left, meaning that both positive and negative instances score low. This isn’t surprising, since the positive instances (the ones with the baby at risk) are rare (about 1.8% of all births in the dataset). The distribution of scores for the negative instances dies off sooner than the distribution for positive instances. This means that the model did identify subpopulations in the data where the rate of at-risk newborns is higher than the average, as is pointed out in figure 7.14.

Figure 7.14. Distribution of score broken up by positive examples (TRUE) and negative examples (FALSE)

Figure 7.15. Reproduction of the spam filter score distributions from chapter 6

In order to use the model as a classifier, you must pick a threshold; scores above the threshold will be classified as positive, those below as negative. When you pick a threshold, you’re trying to balance the precision of the classifier (what fraction of the predicted positives are true positives) and its recall (how many of the true positives the classifier finds).

If the score distributions of the positive and negative instances are well separated, as in figure 7.15, you can pick an appropriate threshold in the “valley” between the two peaks. In the current case, the two distributions aren’t well separated, which indicates that the model can’t build a classifier that simultaneously achieves good recall and good precision.

However, you might be able to build a classifier that identifies a subset of situations with a higher-than-average rate of at-risk births: for example, you may be able to find a threshold that produces a classifier with a precision of 3.6%. Even though this precision is low, it represents a subset of the data that has twice the risk of the overall population (3.6% versus 1.8%), so preprovisioning resources to those situations may be advised. We’ll call the ratio of the classifier precision to the average rate of positives the enrichment rate.

The higher you set the threshold, the more precise the classifier will be (you’ll identify a set of situations with a much higher-than-average rate of at-risk births); but you’ll also miss a higher percentage of at-risk situations, as well. When picking the threshold, you should use the training set, since picking the threshold is part of classifier-building. You can then use the test set to evaluate classifier performance.

To help pick the threshold, you can use a plot like figure 7.16, which shows both enrichment and recall as functions of the threshold.

Figure 7.16. Enrichment (top) and recall (bottom) plotted as functions of threshold for the training set

Looking at figure 7.16, you see that higher thresholds result in more-precise classifications (precision is proportional to enrichment), at the cost of missing more cases; a lower threshold will identify more cases, at the cost of many more false positives (lower precision). The best trade-off between precision/enrichment and recall is a function of how many resources the hospital has available to allocate, and how many they can keep in reserve (or redeploy) for situations that the classifier missed. A threshold of 0.02 (marked in figure 7.16 by the dashed line) might be a good trade-off. The resulting classifier will identify a subset of the population where the rate of risky births is 2.5 times higher than in the overall population, and which contains about half of all the true at-risk situations.
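As a rough check of these numbers (this snippet isn't in the original listings), you can compute the precision, enrichment, and recall at the 0.02 threshold directly on the training set; the exact values depend on the train/test split, but they should roughly match figure 7.16:

threshold <- 0.02
predicted_pos <- train$pred > threshold
(prec_train <- mean(train$atRisk[predicted_pos]))          # precision among predicted positives
prec_train / mean(train$atRisk)                            # enrichment: roughly 2.5
sum(train$atRisk & predicted_pos) / sum(train$atRisk)      # recall: roughly half the at-risk cases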

You can produce figure 7.16 using the PRTPlot() function in WVPlots.

Listing 7.13. Exploring modeling trade-offs
library("WVPlots")
library("ggplot2")
plt <- PRTPlot(train, "pred", "atRisk", TRUE,                            1
               plotvars = c("enrichment", "recall"),
               thresholdrange = c(0, 0.05),
               title = "Enrichment/recall vs. threshold for natality model")
plt + geom_vline(xintercept = 0.02, color = "red", linetype = 2)         2

  • 1 Calls PRTPlot() where pred is the column of predictions, atRisk is the true outcome column, and TRUE is the class of interest
  • 2 Adds a line to mark threshold = 0.02.

Once you’ve picked an appropriate threshold, you can evaluate the resulting classifier by looking at the confusion matrix, as we discussed in section 6.2.3. Let’s use the test set to evaluate the classifier with a threshold of 0.02.

Listing 7.14. Evaluating the chosen model
( ctab.test <- table(pred = test$pred > 0.02, atRisk = test$atRisk)  )   1

##        atRisk
## pred    FALSE TRUE
##   FALSE  9487   93
##   TRUE   2405  116

( precision <- ctab.test[2,2] / sum(ctab.test[2,]) )
## [1] 0.04601349

( recall <- ctab.test[2,2] / sum(ctab.test[,2]) )
## [1] 0.5550239

( enrichment <- precision / mean(as.numeric(test$atRisk))  )
## [1] 2.664159

  • 1 Builds the confusion matrix. The rows contain predicted negatives and positives; columns contain actual negatives and positives.

The resulting classifier is low-precision, but identifies a set of potential at-risk cases that contains 55.5% of the true positive cases in the test set, at a rate 2.66 times higher than the overall average. This is consistent with the results on the training set.

In addition to making predictions, a logistic regression model also helps you extract useful information and advice. We’ll show this in the next section.

7.2.4. Finding relations and extracting advice from logistic models

The coefficients of a logistic regression model encode the relationships between the input variables and the output in a way similar to how the coefficients of a linear regression model do. You can get the model’s coefficients with the call coefficients(model).

Listing 7.15. The model coefficients
coefficients(model)
##              (Intercept)                     PWGT
##              -4.41218940               0.00376166
##                  UPREVIS              CIG_RECTRUE
##              -0.06328943               0.31316930
##       GESTREC3< 37 weeks DPLURALtriplet or higher
##               1.54518311               1.39419294
##              DPLURALtwin             ULD_MECOTRUE
##               0.31231871               0.81842627
##           ULD_PRECIPTRUE           ULD_BREECHTRUE
##               0.19172008               0.74923672
##             URF_DIABTRUE           URF_CHYPERTRUE
##              -0.34646672               0.56002503
##           URF_PHYPERTRUE            URF_ECLAMTRUE
##               0.16159872               0.49806435

Negative coefficients that are statistically significant[1] correspond to variables that are negatively correlated to the odds (and hence to the probability) of a positive outcome (the baby being at risk). Positive coefficients that are statistically significant are positively correlated to the odds of the baby being at risk.

1

We’ll show how to check for statistical significance in the next section.

As with linear regression, every categorical variable is expanded to a set of indicator variables. If the original variable has n levels, there will be n-1 indicator variables; the remaining level is the reference level.

For example, the variable DPLURAL has three levels corresponding to single births, twins, and triplets or higher. The logistic regression model has two corresponding coefficients: DPLURALtwin and DPLURALtriplet or higher. The reference level is single births. Both of the DPLURAL coefficients are positive, indicating that multiple births have higher odds of being at risk than single births do, all other variables being equal.

Logistic regression also dislikes a very large variable count

And as with linear regression, you should avoid categorical variables with too many levels.

Interpreting the coefficients

Interpreting coefficient values is a little more complicated with logistic than with linear regression. If the coefficient for the variable x[,k] is b[k], then the odds of a positive outcome are multiplied by a factor of exp(b[k]) for every unit change in x[,k].

Example

Suppose a full-term baby with certain characteristics has a 1% probability of being at risk. Then the risk odds for that baby are p/(1-p), or 0.01/0.99 = 0.0101. What are the risk odds (and the risk probability) for a baby with the same characteristics, but born prematurely?

The coefficient for GESTREC3< 37 weeks (for a premature baby) is 1.545183. So for a premature baby, the odds of being at risk are exp(1.545183)= 4.68883 times higher compared to a baby that’s born full-term, with all other input variables unchanged. The risk odds for a premature baby with the same characteristics as our hypothetical full-term baby are 0.0101 * 4.68883 = 0.047.

You can invert the formula odds = p / (1 - p) to solve for p as a function of odds:

p = odds * (1 - p) = odds - p * odds
p * (1 + odds) = odds
p = odds/(1 + odds)

The probability of this premature baby being at risk is 0.047/1.047, or about 4.5%—quite a bit higher than for the equivalent full-term baby.

Similarly, the coefficient for UPREVIS (number of prenatal medical visits) is about –0.06. This means every prenatal visit lowers the odds of an at-risk baby by a factor of exp(-0.06), or about 0.94. Suppose the mother of a premature baby had made no prenatal visits; a baby in the same situation whose mother had made three prenatal visits would have odds of being at risk of about 0.047 * 0.94 * 0.94 * 0.94 = 0.039. This corresponds to a probability of being at risk of 3.75%.
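The same arithmetic can be done directly from the fitted model; here is a small sketch (the values shown in the comments are approximate):

odds_ratios <- exp(coefficients(model))    # multiplicative effect on the odds per unit change
odds_ratios["GESTREC3< 37 weeks"]          # about 4.69
odds_ratios["UPREVIS"]                     # about 0.94

base_odds <- 0.01 / 0.99                                      # the hypothetical full-term baby
prem_odds <- base_odds * odds_ratios["GESTREC3< 37 weeks"]    # same baby, but premature
prem_odds / (1 + prem_odds)                                   # back to a probability: about 4.5%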

The general advice in this case might be to keep a special eye on premature births (and multiple births), and encourage expectant mothers to make regular prenatal visits.

7.2.5. Reading the model summary and characterizing coefficients

As we mentioned earlier, conclusions about the coefficient values are only to be trusted if the coefficient values are statistically significant. We also want to make sure that the model is actually explaining something. The diagnostics in the model summary will help us determine some facts about model quality. The call, as before, is summary(model).

Listing 7.16. The model summary
summary(model)

## Call:
## glm(formula = fmla, family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.9732  -0.1818  -0.1511  -0.1358   3.2641
##
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)
## (Intercept)              -4.412189   0.289352 -15.249  < 2e-16 ***
## PWGT                      0.003762   0.001487   2.530 0.011417 *
## UPREVIS                  -0.063289   0.015252  -4.150 3.33e-05 ***
## CIG_RECTRUE               0.313169   0.187230   1.673 0.094398 .
## GESTREC3< 37 weeks        1.545183   0.140795  10.975  < 2e-16 ***
## DPLURALtriplet or higher  1.394193   0.498866   2.795 0.005194 **
## DPLURALtwin               0.312319   0.241088   1.295 0.195163
## ULD_MECOTRUE              0.818426   0.235798   3.471 0.000519 ***
## ULD_PRECIPTRUE            0.191720   0.357680   0.536 0.591951
## ULD_BREECHTRUE            0.749237   0.178129   4.206 2.60e-05 ***
## URF_DIABTRUE             -0.346467   0.287514  -1.205 0.228187
## URF_CHYPERTRUE            0.560025   0.389678   1.437 0.150676
## URF_PHYPERTRUE            0.161599   0.250003   0.646 0.518029
## URF_ECLAMTRUE             0.498064   0.776948   0.641 0.521489
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##    Null deviance: 2698.7  on 14211  degrees of freedom
## Residual deviance: 2463.0  on 14198  degrees of freedom
## AIC: 2491
##
## Number of Fisher Scoring iterations: 7

Again, you’re likely to be asked about elements of the model summary when presenting results, so we’ll discuss what the fields mean, and how to use them to interpret your model.

The original model call

The first line of the summary is the call to glm():

Call:
glm(formula = fmla, family = binomial(link = "logit"), data = train)

Here is where we check that we’ve used the correct training set and the correct formula (although in our case, the formula itself is in another variable). We can also verify that we used the correct family and link function to produce a logistic model.

The deviance residuals summary

The deviance residuals are the analog to the residuals of a linear regression model:

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9732  -0.1818  -0.1511  -0.1358   3.2641

Linear regression models are found by minimizing the sum of the squared residuals; logistic regression models are found by minimizing the sum of the residual deviances, which is equivalent to maximizing the log likelihood of the data, given the model (we’ll talk about log likelihood later in this chapter).

Logistic models can also be used to explicitly compute rates: given several groups of identical data points (identical except the outcome), predict the rate of positive outcomes in each group. This kind of data is called grouped data. In the case of grouped data, the deviance residuals can be used as a diagnostic for model fit. This is why the deviance residuals are included in the summary. We’re using ungrouped data—every data point in the training set is potentially unique. In the case of ungrouped data, the model fit diagnostics that use the deviance residuals are no longer valid, so we won’t discuss them here.[1]

1

See Daniel Powers and Yu Xie, Statistical Methods for Categorical Data Analysis, 2nd ed, Emerald Group Publishing Ltd., 2008.

The summary coefficients table

The summary coefficients table for logistic regression has the same format as the coefficients table for linear regression:

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)              -4.412189   0.289352 -15.249  < 2e-16 ***
PWGT                      0.003762   0.001487   2.530 0.011417 *
UPREVIS                  -0.063289   0.015252  -4.150 3.33e-05 ***
CIG_RECTRUE               0.313169   0.187230   1.673 0.094398 .
GESTREC3< 37 weeks        1.545183   0.140795  10.975  < 2e-16 ***
DPLURALtriplet or higher  1.394193   0.498866   2.795 0.005194 **
DPLURALtwin               0.312319   0.241088   1.295 0.195163
ULD_MECOTRUE              0.818426   0.235798   3.471 0.000519 ***
ULD_PRECIPTRUE            0.191720   0.357680   0.536 0.591951
ULD_BREECHTRUE            0.749237   0.178129   4.206 2.60e-05 ***
URF_DIABTRUE             -0.346467   0.287514  -1.205 0.228187
URF_CHYPERTRUE            0.560025   0.389678   1.437 0.150676
URF_PHYPERTRUE            0.161599   0.250003   0.646 0.518029
URF_ECLAMTRUE             0.498064   0.776948   0.641 0.521489
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The columns of the table represent

  • A coefficient
  • Its estimated value
  • The error around that estimate
  • The signed distance of the estimated coefficient value from 0 (using the standard error as the unit of distance)
  • The probability of seeing a coefficient value at least as large in magnitude as the one we observed, under the null hypothesis that the true coefficient value is zero

This last value, called the p-value or significance, tells us whether we should trust the estimated coefficient value. The common practice is to assume that coefficients with p-values less than 0.05 are reliable, although some researchers prefer stricter thresholds.

For the birth data, we can see from the coefficient summary that premature birth and triplet birth are strong predictors of the newborn needing extra medical attention: the coefficient magnitudes are non-negligible and the p-values indicate significance. Other variables that affect the outcome are

  • PWGT—The mother’s prepregnancy weight (heavier mothers indicate higher risk—slightly surprising)
  • UPREVIS—The number of prenatal medical visits (the more visits, the lower the risk)
  • ULD_MECOTRUE—Meconium staining in the amniotic fluid
  • ULD_BREECHTRUE—Breech position at birth

There might be a positive correlation between a mother’s smoking and an at-risk birth, but the data doesn’t indicate it definitively. None of the other variables show a strong relationship to an at-risk birth.

Lack of significance could mean collinear inputs

As with linear regression, logistic regression can predict well with collinear (or correlated) inputs, but the correlations can mask good advice.

To see this for yourself, we left data about the babies’ birth weight in grams in the dataset sdata. It’s present in both the test and training data as the column DBWT. Try adding DBWT to the logistic regression model in addition to all the other variables; you’ll see that the coefficient for baby’s birth weight will be significant, non-negligible (has a substantial impact on prediction), and negatively correlated with risk. The coefficient for DPLURALtriplet or higher will appear insignificant, and the coefficient for GESTREC3< 37 weeks has a much smaller magnitude. This is because low birth weight is correlated to both prematurity and multiple birth. Of the three related variables, birth weight is the best single predictor of the outcome: knowing that the baby is a triplet adds no additional useful information, and knowing the baby is premature adds only a little information.
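If you want to run the experiment yourself, here is a minimal sketch, reusing y, x, and the formula-building pattern from listing 7.8 (fmla_dbwt and model_dbwt are just names chosen here):

fmla_dbwt <- mk_formula(y, c(x, "DBWT"))         # add birth weight to the original inputs
model_dbwt <- glm(fmla_dbwt, data = train,
                  family = binomial(link = "logit"))
summary(model_dbwt)    # compare the GESTREC3 and DPLURAL coefficients to listing 7.16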

In the context of the modeling goal—to proactively allocate emergency resources where they’re more likely to be needed—birth weight isn’t a very useful variable, because we don’t know the baby’s weight until it’s born. We do know ahead of time whether the baby is being born prematurely, or whether it’s one of multiple babies. So it’s better to use GESTREC3 and DPLURAL as input variables, instead of DBWT.

Other signs of possibly collinear inputs are coefficients with the wrong sign and unusually large coefficient magnitudes with giant standard errors.

Overall model quality summaries

The next section of the summary contains the model quality statistics:

Null deviance: 2698.7  on 14211  degrees of freedom
Residual deviance: 2463.0  on 14198  degrees of freedom
AIC: 2491
Null and residual deviances

Deviance is a measure of how well the model fits the data. It is two times the negative log likelihood of the dataset, given the model. As we discussed previously in section 6.2.5, the idea behind log likelihood is that positive instances y should have high probability py of occurring under the model; negative instances should have low probability of occurring (or putting it another way, (1 - py) should be large). The log likelihood function rewards matches between the outcome y and the predicted probability py, and penalizes mismatches (high py for negative instances, and vice versa).

If you think of deviance as analogous to variance, then the null deviance is similar to the variance of the data around the average rate of positive examples. The residual deviance is similar to the variance of the data around the model. As with variance, you want the residual deviance to be small, compared to the null deviance. The model summary reports the deviance and null deviance of the model on the training data; you can (and should) also calculate them for test data. In the following listing we calculate the deviances for both the training and test sets.

Listing 7.17. Computing deviance
loglikelihood <- function(y, py) {                                     1
   sum(y * log(py) + (1-y)*log(1 - py))
}

(pnull <- mean(as.numeric(train$atRisk))  )                            2
## [1] 0.01920912

(null.dev <- -2 * loglikelihood(as.numeric(train$atRisk), pnull) )     3
## [1] 2698.716

model$null.deviance                                                    4
## [1] 2698.716

pred <- predict(model, newdata = train, type = "response")             5
(resid.dev <- -2 * loglikelihood(as.numeric(train$atRisk), pred) )     6
## [1] 2462.992

model$deviance                                                         7
## [1] 2462.992

testy <- as.numeric(test$atRisk)                                       8
testpred <- predict(model, newdata = test,
                        type = "response")

( pnull.test <- mean(testy) )
## [1] 0.0172713

( null.dev.test <- -2 * loglikelihood(testy, pnull.test) )
## [1] 2110.91

( resid.dev.test <- -2 * loglikelihood(testy, testpred) )
## [1] 1947.094

  • 1 Function to calculate the log likelihood of a dataset. Variable y is the outcome in numeric form (1 for positive examples, 0 for negative). Variable py is the predicted probability that y==1.
  • 2 Calculates the rate of positive examples in the dataset
  • 3 Calculates the null deviance
  • 4 For training data, the null deviance is stored in the slot model$null.deviance.
  • 5 Predicts probabilities for the training data
  • 6 Calculates deviance of the model for training data
  • 7 For training data, model deviance is stored in the slot model$deviance.
  • 8 Calculates the null deviance and residual deviance for the test data
The pseudo R-squared

A useful goodness-of-fit measure based on the deviances is the pseudo R-squared: 1 - (dev.model/dev.null). The pseudo R-squared is the analog to the R-squared measure for linear regression. It’s a measure of how much of the deviance is “explained” by the model. Ideally, you want the pseudo R-squared to be close to 1. Let’s calculate the pseudo R-squared for both the test and training data.

Listing 7.18. Calculating the pseudo R-squared
pr2 <- 1 - (resid.dev / null.dev)

print(pr2)
## [1] 0.08734674
pr2.test <- 1 - (resid.dev.test / null.dev.test)
print(pr2.test)
## [1] 0.07760427

The model only explains about 7.7–8.7% of the deviance; it’s not a highly predictive model (you should have suspected that already from figure 7.14). This tells us that we haven’t yet identified all the factors that actually predict at-risk births.

Model significance

The other thing you can do with the null and residual deviances is check whether the model’s probability predictions are better than just guessing the average rate of positives, statistically speaking. In other words, is the reduction in deviance from the model meaningful, or just something that was observed by chance? This is similar to calculating the F-test statistic and associated p-value that are reported for linear regression. In the case of logistic regression, the test you’ll run is the chi-squared test. To do that, you need to know the degrees of freedom for the null model and the actual model (which are reported in the summary). The degrees of freedom of the null model is the number of data points minus 1:

df.null =  dim(train)[[1]] - 1

The degrees of freedom of the model that you fit is the number of data points minus the number of coefficients in the model:

df.model = dim(train)[[1]] - length(model$coefficients)

If the number of data points in the training set is large, and df.null - df.model is small, then the probability of the difference in deviances null.dev - resid.dev being as large as we observed is approximately distributed as a chi-squared distribution with df.null - df.model degrees of freedom.

Listing 7.19. Calculating the significance of the observed fit
( df.null <- dim(train)[[1]] - 1  )                               1
## [1] 14211

( df.model <- dim(train)[[1]] - length(model$coefficients) )      2
## [1] 14198

( delDev <- null.dev - resid.dev )                                3
## [1] 235.724
( deldf <- df.null - df.model )
## [1] 13
( p <- pchisq(delDev, deldf, lower.tail = FALSE) )                4
## [1] 5.84896e-43

  • 1 The null model has (number of data points - 1) degrees of freedom.
  • 2 The fitted model has (number of data points - number of coefficients) degrees of freedom.
  • 3 Computes the difference in deviances and difference in degrees of freedom
  • 4 Estimates the probability of seeing the observed difference in deviances under the null model (the p-value) using chi-squared distribution

The p-value is very small; it’s extremely unlikely that we could’ve seen this much reduction in deviance by chance. This means it is plausible (but unfortunately not definitive) that this model has found informative patterns in the data.

Goodness of fit vs. significance

It’s worth noting that the model we found is a significant model, just not a powerful one. The good p-value tells us that the model is significant: it predicts at-risk birth in the training data at a quality that is unlikely to be pure chance. The poor pseudo R-squared means that the model isn’t giving us enough information to effectively distinguish between low-risk and high-risk births.

It’s also possible to have good pseudo R-squared (on the training data) with a bad p-value. This is an indication of overfit. That’s why it’s a good idea to check both, or better yet, check the pseudo R-squared of the model on both training and test data.

The AIC

The last metric given in the section of the summary is the AIC, or the Akaike information criterion. The AIC is the log likelihood adjusted for the number of coefficients. Just as the R-squared of a linear regression is generally higher when the number of variables is higher, the log likelihood also increases with the number of variables.

Listing 7.20. Calculating the Akaike information criterion
aic <- 2 * (length(model$coefficients) -
         loglikelihood(as.numeric(train$atRisk), pred))
aic
## [1] 2490.992

The AIC is generally used to decide which and how many input variables to use in the model. If you train many different models with different sets of variables on the same training set, you can consider the model with the lowest AIC to be the best fit.
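As a small illustration of that use (the reduced variable set here is arbitrary, chosen only for the example), you can compare models with the stats function AIC():

fmla_small <- mk_formula(y, c("PWGT", "UPREVIS", "CIG_REC",
                              "GESTREC3", "DPLURAL", complications))    # drop the risk factors
model_small <- glm(fmla_small, data = train, family = binomial(link = "logit"))
AIC(model, model_small)    # the model with the lower AIC has the better fit/complexity trade-off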

Fisher scoring iterations

The last line of the model summary is the number of Fisher scoring iterations:

Number of Fisher Scoring iterations: 7

The Fisher scoring method is an iterative optimization method, similar to Newton’s method, that glm() uses to find the best coefficients for the logistic regression model. You should expect it to converge in about six to eight iterations. If there are many more iterations than that, then the algorithm may not have converged, and the model may not be valid.
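If you'd rather check this programmatically than read it off the summary, the iteration count is stored on the fitted glm object:

model$iter    # number of Fisher scoring (IWLS) iterations; expect roughly 6 to 8
## [1] 7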

Separation and quasi-separation

The probable reason for non-convergence is separation or quasi-separation: one of the model variables or some combination of the model variables predicts the outcome perfectly for at least a subset of the training data. You’d think this would be a good thing; but, ironically, logistic regression fails when the variables are too powerful. Ideally, glm() will issue a warning when it detects separation or quasi-separation:

Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred

Unfortunately, there are situations when it seems that no warning is issued, but there are other warning signs:

  • An unusually high number of Fisher iterations
  • Very large coefficients, usually with extremely large standard errors
  • Residual deviances larger than the null deviances

If you see any of these signs, the model is suspect. The last section of this chapter covers one way to address the problem: regularization.

7.2.6. Logistic regression takeaways

Logistic regression is the go-to statistical modeling method for binary classification. As with linear regression, the coefficients of a logistic regression model can often function as advice. Here are some points to remember about logistic regression:

  • Logistic regression is well calibrated: it reproduces the marginal probabilities of the data.
  • Pseudo R-squared is a useful goodness-of-fit heuristic.
  • Logistic regression will have trouble with problems with a very large number of variables, or categorical variables with a very large number of levels.
  • Logistic regression can predict well even in the presence of correlated variables, but correlated variables lower the quality of the advice.
  • Overly large coefficient magnitudes, overly large standard errors on the coefficient estimates, and the wrong sign on a coefficient could be indications of correlated inputs.
  • Too many Fisher iterations, or overly large coefficients with very large standard errors, could be signs that your logistic regression model has not converged, and may not be valid.
  • glm() provides good diagnostics, but rechecking your model on test data is still your most effective diagnostic.

7.3. Regularization

As mentioned earlier, overly large coefficient magnitudes and overly large standard errors can indicate some issues in your model: nearly collinear variables in either a linear or logistic regression, or separation or quasi-separation in a logistic regression system.

Nearly collinear variables can cause the regression solver to needlessly introduce large coefficients that often nearly cancel each other out, and that have large standard errors. Separation/quasi-separation can cause a logistic regression to not converge to the intended solution; this is a separate source of large coefficients and large standard errors.

Overly large coefficient magnitudes are less trustworthy and can be hazardous when the model is applied to new data. Each of the coefficient estimates has some measurement noise, and with large coefficients this noise in estimates can drive large variations (and errors) in prediction. Intuitively speaking, large coefficients fit to nearly collinear variables must cancel each other out in the training data to express the observed effect of the variables on the outcome. This set of cancellations is an overfit of the training data, if the same variables don’t balance out in exactly the same way in future data.

Example

Suppose that age and years_in_workforce are strongly correlated, and being one year older/one year longer in the workforce increases log income by one unit in the training data. If only years_in_workforce is in the model, it would get a coefficient of about 1. What happens if the model includes age as well?

In some circumstances, if both age and years_in_workforce are in the model, linear regression might give years_in_workforce and age large counterbalancing coefficients of opposite sign; for instance, a coefficient of 99 for years_in_workforce and a coefficient of –98 for age. These large coefficients would “cancel each other out” to the appropriate effect.
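The effect is easy to reproduce with a small simulation (the data here is entirely made up, and the fitted coefficients will vary from run to run):

n <- 1000
years_in_workforce <- rnorm(n, mean = 10, sd = 3)
age <- years_in_workforce + 22 + rnorm(n, sd = 0.001)    # nearly collinear with years_in_workforce
log_income <- years_in_workforce + rnorm(n, sd = 0.1)    # "true" effect: one unit per year

summary(lm(log_income ~ years_in_workforce + age))
# With inputs this collinear, the individual coefficients are poorly determined: expect
# very large standard errors, and estimates that can land far from the "true" values of
# 1 and 0 (often with opposite signs), even though their sum stays near 1.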

A similar effect can arise in a logistic model due to quasi-separation, even when there are no collinear variables. To demonstrate this, we’ll introduce the bigger scenario that we will work with in this section.

7.3.1. An example of quasi-separation

Example

Suppose a car review site rates cars on several characteristics, including affordability and safety rating. Car ratings can be “very good,” “good,” “acceptable,” or “unacceptable.” Your goal is to predict whether a car will fail the review: that is, get an unacceptable rating.

For this example, you will again use the car data from the UCI Machine Learning Repository that you used in chapter 2. This dataset has information on 1728 makes of auto, with the following variables:

  • car_price—(vhigh, high, med, low)
  • maint_price—(vhigh, high, med, low)
  • doors—(2, 3, 4, 5, more)
  • persons—(2, 4, more)
  • lug_boot—(small, med, big)
  • safety—(low, med, high)

The outcome variable is rating (vgood, good, acc, unacc).

First, let’s read in the data and split it into training and test. If you have not done so already, download car.data.csv from https://github.com/WinVector/PDSwR2/blob/master/UCICar/car.data.csv and make sure the file is in your working directory.

Listing 7.21. Preparing the cars data
cars <- read.table(
  'car.data.csv',
  sep = ',',
  header = TRUE,
  stringsAsFactors = TRUE
)

vars <- setdiff(colnames(cars), "rating")              1

cars$fail <- cars$rating == "unacc"
outcome <- "fail"                                      2

set.seed(24351)
gp <- runif(nrow(cars))                                3

library("zeallot")
c(cars_test, cars_train) %<-% split(cars, gp < 0.7)    4

nrow(cars_test)
## [1] 499
nrow(cars_train)
## [1] 1229

  • 1 Gets the input variables
  • 2 You want to predict whether the car gets an unacceptable rating
  • 3 Creates the grouping variable for the test/train split (70% for training, 30% for test)
  • 4 The split() function returns a list of two groups with the group gp < 0.7 == FALSE first. The zeallot package’s %<-% multiassignment takes this list of values and unpacks them into the variables named cars_test and cars_train.

The first thing you might do to solve this problem is try a simple logistic regression.

Listing 7.22. Fitting a logistic regression model
library(wrapr)
(fmla <- mk_formula(outcome, vars) )

## fail ~ car_price + maint_price + doors + persons + lug_boot +
##     safety
## <environment: base>

model_glm <- glm(fmla,
            data = cars_train,
            family = binomial)

You will see that glm() returns a warning:

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

This warning indicates that the problem is quasi-separable: some set of variables perfectly predicts a subset of the data. In fact, this problem is simple enough that you can easily determine that a safety rating of low perfectly predicts that a car will fail the review (we leave that as an exercise for the reader). However, even cars with higher safety ratings can get ratings of unacceptable, so the safety variable only predicts a subset of the data.
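One quick way to do that check (not shown in the original listings) is to cross-tabulate safety against the outcome on the training data; every car with safety == "low" should land in the fail == TRUE column:

table(safety = cars_train$safety, fail = cars_train$fail)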

You can also see the problem if you look at the summary of the model.

Listing 7.23. Looking at the model summary
summary(model_glm)

##
## Call:
## glm(formula = fmla, family = binomial, data = cars_train)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.35684  -0.02593   0.00000   0.00001   3.11185
##
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)        28.0132  1506.0310   0.019 0.985160
## car_pricelow       -4.6616     0.6520  -7.150 8.67e-13 ***
## car_pricemed       -3.8689     0.5945  -6.508 7.63e-11 ***
## car_pricevhigh      1.9139     0.4318   4.433 9.30e-06 ***
## maint_pricelow     -3.2542     0.5423  -6.001 1.96e-09 ***
## maint_pricemed     -3.2458     0.5503  -5.899 3.66e-09 ***
## maint_pricevhigh    2.8556     0.4865   5.869 4.38e-09 ***
## doors3             -1.4281     0.4638  -3.079 0.002077 **
## doors4             -2.3733     0.4973  -4.773 1.82e-06 ***
## doors5more         -2.2652     0.5090  -4.450 8.58e-06 ***
## persons4          -29.8240  1506.0310  -0.020 0.984201          1
## personsmore       -29.4551  1506.0310  -0.020 0.984396
## lug_bootmed         1.5608     0.4529   3.446 0.000568 ***
## lug_bootsmall       4.5238     0.5721   7.908 2.62e-15 ***
## safetylow          29.9415  1569.3789   0.019 0.984778          2
## safetymed           2.7884     0.4134   6.745 1.53e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1484.7  on 1228  degrees of freedom
## Residual deviance:  245.5  on 1213  degrees of freedom
## AIC: 277.5
##
## Number of Fisher Scoring iterations: 21                          3

  • 1 The variables persons4 and personsmore have notably large negative magnitudes, and a giant standard error.
  • 2 The variable safetylow has a notably large positive magnitude, and a giant standard error.
  • 3 The algorithm ran for an unusually large number of Fisher scoring iterations.

The variables safetylow, persons4, and personsmore all have unusually high magnitudes and very high standard errors. As mentioned earlier, safetylow always corresponds to an unacceptable rating, so safetylow is a strong indicator of failing the review. However, larger cars (cars that hold more people) are not always going to pass the review. It’s possible that the algorithm has observed that larger cars tend to be safer (get a safety rating better than safetylow), and so it is using the persons4 and personsmore variables to cancel out the overly high coefficient from safetylow.

In addition, you can see that the number of Fisher scoring iterations is unusually high; the algorithm did not converge.

This problem is fairly simple, so the model may predict acceptably well on the test set; however, in general, when you see evidence that glm() did not converge, you should not trust the model.

For comparison with the regularized algorithms, let’s plot the coefficients of the logistic regression model (figure 7.17).

Figure 7.17. Coefficients of the logistic regression model

Listing 7.24. Looking at the logistic model’s coefficients
coefs <- coef(model_glm)[-1]                      1
coef_frame <- data.frame(coef = names(coefs),
                        value = coefs)

library(ggplot2)
ggplot(coef_frame, aes(x = coef, y = value)) +
  geom_pointrange(aes(ymin = 0, ymax = value)) +
  ggtitle("Coefficients of logistic regression model") +
  coord_flip()

  • 1 Gets the coefficients (except the intercept)

In the plot, coefficients that point to the right are positively correlated with failing the review, and coefficients that point to the left are negatively correlated with failure.

You can also look at the model’s performance on the test data.

Listing 7.25. The logistic model’s test performance
cars_test$pred_glm <- predict(model_glm,
                             newdata=cars_test,
                             type = "response")                          1

library(sigr)                                                            2

confmat <- function(dframe, predvar) {                                   3
   cmat <- table(truth = ifelse(dframe$fail, "unacceptable", "passed"),
               prediction = ifelse(dframe[[predvar]] > 0.5,
                                   "unacceptable", "passed"))
  accuracy <- sum(diag(cmat)) / sum(cmat)
  deviance <- calcDeviance(dframe[[predvar]], dframe$fail)
  list(confusion_matrix = cmat,
       accuracy = accuracy,
       deviance = deviance)
}

confmat(cars_test, "pred_glm")
## $confusion_matrix
##               prediction
## truth          passed unacceptable
##   passed          150            9
##   unacceptable     17          323
##
## $accuracy
## [1] 0.9478958
##
## $deviance
## [1] 97.14902

  • 1 Gets the model’s predictions on the test set
  • 2 Attaches the sigr package for deviance calculation (sigr includes a number of goodness-of-fit summaries and tests)
  • 3 Convenience function to print confusion matrix, accuracy, and deviance

In this case, the model seems to be good. However, you cannot always trust non-converged models, or models with needlessly large coefficients.

In situations where you see suspiciously large coefficients with extremely large standard errors, whether due to collinearity or quasi-separation, we recommend regularization.[1] Regularization adds a penalty to the formulation that biases the model’s coefficients towards zero. This makes it harder for the solver to drive the coefficients to unnecessarily large values.

1

Some people suggest using principal components regression (PCR) to deal with collinear variables: PCR uses the existing variables to create synthetic variables that are mutually orthogonal, eliminating the collinearities. This won’t help with quasi-separation. We generally prefer regularization.

Regarding overfitting

The modeling goal is to predict well on future application data. Improving your measured performance on training data does not always do this. This is what we’ve been discussing as overfit. Regularization degrades the quality of the training data fit, in the hope of improving future model performance.

7.3.2. The types of regularized regression

There are multiple types of regularized regression, each defined by the penalty that is put on the model’s coefficients. Here we cover the different regularization approaches.

Ridge regression

Ridge regression (or L2-regularized regression) tries to minimize the training prediction error, subject to also minimizing the sum of the squared magnitudes of the coefficients.[2] Let’s look at ridge regularized linear regression. Remember that linear regression tries to find the coefficients b such that

2

This is called “the L2 norm of the vector of coefficients,” hence the name.

f(x[i,]) = b[0] + b[1] x[i,1] + ... b[n] x[i,n]

is as close as possible to y[i] for all the training data. It does this by minimizing (y - f(x))^2, the sum of the squared error between y and f(x). Ridge regression tries to find the b that minimizes

(y - f(x))^2 + lambda * (b[1]^2 + ...  + b[n]^2)

where lambda >= 0. When lambda = 0, this reduces to regular linear regression; the larger lambda is, the harder the algorithm will penalize large coefficients. The expression for regularized logistic regression is similar.

How ridge regression affects coefficients

When variables are nearly collinear, ridge regression tends to average the collinear variables together. You can think of this as “ridge regression shares the credit.”

For instance, let’s go back to the example of fitting a linear regression for log income using both age and years in workforce (which are nearly collinear). Recall that being one year older/one year longer in the workforce increases log income by one unit in the training data.

In this situation, ridge regression might assign both variables age and years_in_workforce a coefficient of 0.5, which adds up to the appropriate effect.

Lasso regression

Lasso regression (or L1-regularized regression) tries to minimize the training prediction error, subject to also minimizing the sum of the absolute value of the coefficients.[1] For linear regression, this looks like minimizing

1

Or the “L1 norm of the vector of coefficients.”

(y - f(x))^2 + lambda * ( abs(b[1]) + abs(b[2]) + ... + abs(b[n]) )
How lasso regression affects coefficients

When variables are nearly collinear, lasso regression tends to drive one or more of them to zero. So in the income scenario, lasso regression might assign years_in_workforce a coefficient of 1 and age a coefficient of 0.[a] For this reason, lasso regression is often used as a form of variable selection. A larger lambda will tend to drive more coefficients to zero.

a

As Hastie et al. point out in The Elements of Statistical Learning, 2nd ed (Springer, 2009), which of the correlated variables get zeroed out is somewhat arbitrary.

Elastic net

In some situations, like quasi-separability, the ridge solution may be preferred. In other situations, such as when you have a very large number of variables, many of which are correlated to each other, the lasso may be preferred. You may not be sure which is the best approach, so one compromise is to combine the two. This is called elastic net. The penalty of using elastic net is a combination of the ridge and the lasso penalties:

(1 - alpha) * (b[1]^2 + ... + b[n]^2) +
    alpha * ( abs(b[1]) + abs(b[2]) + ... + abs(b[n]) )

When alpha = 0, this reduces to ridge regression; when alpha = 1, it reduces to lasso. Different values of alpha between 0 and 1 give different trade-offs between sharing the credit among correlated variables, and only keeping a subset of them.

7.3.3. Regularized regression with glmnet

All the types of regularized regression that we’ve discussed are implemented in R by the package glmnet. Unfortunately, the glmnet package uses a calling interface that is not very R-like; in particular, it expects that the input data is a numeric matrix rather than a data frame. So we’ll use the glmnetUtils package to provide a more R-like interface to the functions.

Calling interfaces

It would be best if all modeling procedures had the same calling interface. The lm() and glm() functions nearly do, and glmnetUtils helps make glmnet more compatible with R’s calling interface conventions.

However, to use a given method correctly, you must know some things about its particular constraints and consequences. This means that even if all modeling methods had the same calling interface, you still must study the documentation to understand how to use it properly.

Let’s compare the different regularization approaches on the car-ratings prediction problem.

The ridge regression solution

When reducing the number of variables is not an issue, we generally try ridge regression first, because it’s a smoother regularization that we feel retains the most interpretability for the coefficients (but see the warning later in this section). The parameter alpha specifies the mixture of ridge and lasso penalties (0=ridge, 1=lasso); so for ridge regression, set alpha = 0. The parameter lambda is the regularization penalty.

Since you generally don’t know the best lambda, the original function glmnet::glmnet() tries several values of lambda (100 by default) and returns the models corresponding to each value. The function glmnet::cv.glmnet() in addition does the cross-validation needed to pick the lambda that gives the minimum cross-validation error for a fixed alpha, and returns it as the field lambda.min. It also returns a value lambda.1se, the largest value of lambda such that the error is within 1 standard error of the minimum. This is shown in figure 7.18.

Figure 7.18. Schematic of cv.glmnet()

The function glmnetUtils::cv.glmnet() lets you call the cross-validated version in an R-friendlier way.

When using regularized regression, it’s a good idea to standardize, or center and scale the data (see section 4.2.2). Fortunately, cv.glmnet() does this by default. If for some reason you want to turn this off (perhaps you have already standardized the data), use the parameter standardize = FALSE.[1]

1

For help/documentation on glmnetUtils::cv.glmnet(), see help(cv.glmnet, package = "glmnetUtils"), help(cv.glmnet, package = "glmnet"), and help(glmnet, package = "glmnet").

Listing 7.26. Fitting the ridge regression model
library(glmnet)
library(glmnetUtils)

(model_ridge <- cv.glmnet(fmla,
                         cars_train,
                         alpha = 0,
                         family = "binomial"))      1

## Call:
## cv.glmnet.formula(formula = fmla, data = cars_train, alpha = 0,
##     family = "binomial")
##
## Model fitting options:
##     Sparse model matrix: FALSE
##     Use model.frame: FALSE
##     Number of crossvalidation folds: 10
##     Alpha: 0
##     Deviance-minimizing lambda: 0.02272432  (+1 SE): 0.02493991

  • 1 For logistic regression-style models, use family = "binomial". For linear regression-style models, use family = "gaussian".

Printing out model_ridge tells you the lambda that corresponds to the minimum cross-validation error (the deviance)—that is, model_ridge$lambda.min. It also reports the value of model_ridge$lambda.1se.

Remember that cv.glmnet() returns 100 (by default) models; of course, you really only want one—the “best” one. As shown in figure 7.18, when you call a function like predict() or coef(), the cv.glmnet object by default uses the model corresponding to lambda.1se, as some people consider lambda.1se less likely to be overfit than lambda.min.

The following listing examines the coefficients of the lambda.1se model. If you want to see the model corresponding to lambda.min, replace the first line of the listing with (coefs <- coef(model_ridge, s = model_ridge$lambda.min)).

Listing 7.27. Looking at the ridge model’s coefficients
(coefs <- coef(model_ridge))

## 22 x 1 sparse Matrix of class "dgCMatrix"
##                            1
## (Intercept)       2.01098708
## car_pricehigh     0.34564041
## car_pricelow     -0.76418240
## car_pricemed     -0.62791346
## car_pricevhigh    1.05949870
## maint_pricehigh   0.18896383
## maint_pricelow   -0.72148497
## maint_pricemed   -0.60000546
## maint_pricevhigh  1.14059599
## doors2            0.37594292
## doors3            0.01067978
## doors4           -0.21546650
## doors5more       -0.17649206
## persons2          2.61102897      1
## persons4         -1.35476871
## personsmore      -1.26074907
## lug_bootbig      -0.52193562
## lug_bootmed      -0.18681644
## lug_bootsmall     0.68419343
## safetyhigh       -1.70022006
## safetylow         2.54353980
## safetymed        -0.83688361

coef_frame <- data.frame(coef = rownames(coefs)[-1],
                        value = coefs[-1,1])

ggplot(coef_frame, aes(x = coef, y = value)) +
  geom_pointrange(aes(ymin = 0, ymax = value)) +
  ggtitle("Coefficients of ridge model") +
  coord_flip()

  • 1 Note that all the levels of the categorical variable persons are present (no reference level).

Notice that cv.glmnet() does not use reference levels for categorical variables: for instance, the coefs vector includes the variables persons2, persons4, and personsmore, corresponding to the levels 2, 4, and “more” for the persons variable. The logistic regression model in section 7.3.1 used the variables persons4 and personsmore, and used the level value 2 as the reference level. Using all the variable levels when regularizing has the advantage that the coefficient magnitudes are regularized toward zero, rather than toward a (possibly arbitrary) reference level.

You can see in figure 7.19 that this model no longer has the unusually large magnitudes. The directions of the coefficients suggest that low safety ratings, small cars, and very high purchase or maintenance price all positively predict rating of unacceptable. One might suspect that small cars correlate with low safety ratings, so safetylow and persons2 are probably sharing the credit.

Figure 7.19. Coefficients of the ridge regression model

Regularization affects interpretability

Because regularization adds an additional term to the algorithm’s optimization function, you can’t quite interpret the coefficients the same way you did in sections 7.1.4 and 7.2.4. For instance, no coefficient significances are reported. However, you can at least use the signs of the coefficients as indications of which variables are positively or negatively correlated with the outcome in the joint model.

You can also evaluate the performance of model_ridge on the test data.

Listing 7.28. Looking at the ridge model’s test performance
prediction <- predict(model_ridge,
                     newdata = cars_test,
                     type = "response")

cars_test$pred_ridge <- as.numeric(prediction)    1

confmat(cars_test, "pred_ridge")
## $confusion_matrix
##               prediction
## truth          passed unacceptable
##   passed          147           12
##   unacceptable     16          324
##
## $accuracy
## [1] 0.9438878
##
## $deviance
## [1] 191.9248

  • 1 The prediction variable is a 1-d matrix; convert it to a vector before adding it to the cars_test data frame.

To look at the predictions for the model corresponding to lambda.min, replace the first command of the preceding listing with this:

prediction <- predict(model_ridge,
                      newdata = cars_test,
                      type="response",
                      s = model_ridge$lambda.min)
The lasso regression solution

You can run the same steps as in the previous section with alpha = 1 (the default) to fit a lasso regression model. We leave fitting the model as an exercise for the reader; here are the results.
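For completeness, here is a minimal sketch of that fit, mirroring listing 7.26 (model_lasso is just the name used here):

(model_lasso <- cv.glmnet(fmla,
                          cars_train,
                          alpha = 1,             # the lasso penalty (also the default)
                          family = "binomial"))

coef(model_lasso)    # coefficients of the lambda.1se model, shown in listing 7.29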

Listing 7.29. The lasso model’s coefficients
## 22 x 1 sparse Matrix of class "dgCMatrix"
##                             1
## (Intercept)      -3.572506339
## car_pricehigh     2.199963497
## car_pricelow     -0.511577936
## car_pricemed     -0.075364079
## car_pricevhigh    3.558630135
## maint_pricehigh   1.854942910
## maint_pricelow   -0.101916375
## maint_pricemed   -0.009065081
## maint_pricevhigh  3.778594043
## doors2            0.919895270
## doors3            .
## doors4           -0.374230464
## doors5more       -0.300181160
## persons2          9.299272641
## persons4         -0.180985786
## personsmore       .
## lug_bootbig      -0.842393694
## lug_bootmed       .
## lug_bootsmall     1.886157531
## safetyhigh       -1.757625171
## safetylow         7.942050790
## safetymed         .

As you see in figure 7.20, cv.glmnet() did not reduce the magnitudes of the largest coefficients as much, although it did zero out a few variables (doors3, personsmore, lug_bootmed, safetymed), and it selected a similar set of variables as strongly predictive of an unacceptable rating.

Figure 7.20. Coefficients of the lasso regression model

The lasso model’s accuracy on the test data is similar to the ridge model’s, but its deviance is much lower, indicating better performance.
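
The performance numbers in listing 7.30 come from essentially the same steps as in listing 7.28. Here is a sketch, assuming the model_lasso object from the earlier sketch; the column name pred_lasso is also an assumption.

prediction <- predict(model_lasso,
                      newdata = cars_test,
                      type = "response")

cars_test$pred_lasso <- as.numeric(prediction)   # again, convert the 1-d matrix to a vector

confmat(cars_test, "pred_lasso")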

Listing 7.30. The lasso model’s test performance
## $confusion_matrix
##               prediction
## truth          passed unacceptable
##   passed          150            9
##   unacceptable     17          323
##
## $accuracy
## [1] 0.9478958
##
## $deviance
## [1] 112.7308

The elastic net solution: picking alpha

The cv.glmnet() function only optimizes over lambda; it assumes that alpha, the variable that specifies the mix of the ridge and lasso penalties, is fixed. The glmnetUtils package provides a function called cva.glmnet() that will simultaneously cross-validate for both alpha and lambda.

Listing 7.31. Cross-validating for both alpha and lambda
(elastic_net <- cva.glmnet(fmla,
                           cars_train,
                           family = "binomial"))

## Call:
## cva.glmnet.formula(formula = fmla, data = cars_train, family = "binomial")
##
## Model fitting options:
##     Sparse model matrix: FALSE
##     Use model.frame: FALSE
##     Alpha values: 0 0.001 0.008 0.027 0.064 0.125 0.216 0.343 0.512 0.729 1
##     Number of crossvalidation folds for lambda: 10

The process of extracting the best model is a bit involved. Unlike cv.glmnet, cva.glmnet doesn’t return an alpha.min or an alpha.1se. Instead, the field elastic_net$alpha holds all the alphas that the function tried (11 of them, by default), and elastic_net$modlist holds the corresponding glmnet::cv.glmnet model objects (see figure 7.21). Each of these model objects actually represents a whole sequence of models, one for each value of lambda tried (100 by default), so for a given alpha, we’ll choose the lambda.1se model as “the best model.”

Figure 7.21. Schematic of using cva.glmnet to pick alpha

The following listing implements the process sketched in figure 7.21 to get the mean cross-validation error for each “best model,” and plot the errors as a function of alpha (figure 7.22). You can create a similar plot using the function minlossplot(elastic_net), but the following listing also returns the value of the best tested alpha.

Figure 7.22. Cross-validation error as a function of alpha

Listing 7.32. Finding the minimum error alpha
get_cvm <- function(model) {                                           1
  index <- match(model$lambda.1se, model$lambda)
  model$cvm[index]
}

enet_performance <- data.frame(alpha = elastic_net$alpha)              2
models <- elastic_net$modlist                                          3
enet_performance$cvm <- vapply(models, get_cvm, numeric(1))            4

minix <- which.min(enet_performance$cvm)                               5
(best_alpha <- elastic_net$alpha[minix])                               6
## [1] 0.729
ggplot(enet_performance, aes(x = alpha, y = cvm)) +                    7
  geom_point() +
  geom_line() +
  geom_vline(xintercept = best_alpha, color = "red", linetype = 2) +
  ggtitle("CV loss as a function of alpha")

  • 1 A function to get the mean cross-validation error of a cv.glmnet lambda.1se model
  • 2 Gets the alphas that the algorithm tried
  • 3 Gets the model objects produced
  • 4 Gets the errors of each best model
  • 5 Finds the minimum cross-validation error
  • 6 Gets the corresponding alpha
  • 7 Plots the model performances as a function of alpha

Remember that both cv.glmnet and cva.glmnet are randomized, so the results can vary from run to run. The documentation for glmnetUtils (https://cran.r-project.org/web/packages/glmnetUtils/vignettes/intro.html) recommends running cva.glmnet multiple times to reduce the noise. If you want to cross-validate for alpha, we suggest calculating the equivalent of enet_performance multiple times, and averaging the values of the cvm column together—the alpha values will be identical from run to run, although the corresponding lambda.1se values may not be. After you’ve determined the alpha that corresponds to the best average cvm, call cv.glmnet one more time with the chosen alpha to get the final model.
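
Here is a sketch of that averaging procedure. It reuses get_cvm() from listing 7.32; the names n_runs, cvms, avg_cvm, and best_alpha_avg are hypothetical.

n_runs <- 5                                                  # how many times to repeat the search

cvms <- replicate(n_runs, {                                  # each run yields one cvm per alpha
  enet <- cva.glmnet(fmla, cars_train, family = "binomial")
  vapply(enet$modlist, get_cvm, numeric(1))
})                                                           # cvms: one row per alpha, one column per run

avg_cvm <- rowMeans(cvms)                                    # average the error for each alpha
(best_alpha_avg <- elastic_net$alpha[which.min(avg_cvm)])    # the alphas are the same in every run

You would then pass the averaged choice to the final cv.glmnet() call; listing 7.33 below uses the single-run best_alpha from listing 7.32.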

Listing 7.33. Fitting and evaluating the elastic net model
(model_enet <- cv.glmnet(fmla,
                         cars_train,
                         alpha = best_alpha,
                         family = "binomial"))

## Call:
## cv.glmnet.formula(formula = fmla, data = cars_train, alpha = best_alpha,
##     family = "binomial")
##
## Model fitting options:
##     Sparse model matrix: FALSE
##     Use model.frame: FALSE
##     Number of crossvalidation folds: 10
##     Alpha: 0.729
##     Deviance-minimizing lambda: 0.0002907102  (+1 SE): 0.002975509

prediction <- predict(model_enet,
                      newdata = cars_test,
                      type = "response")

cars_test$pred_enet <- as.numeric(prediction)

confmat(cars_test, "pred_enet")

## $confusion_matrix
##               prediction
## truth          passed unacceptable
##   passed          150            9
##   unacceptable     17          323
##
## $accuracy
## [1] 0.9478958
##
## $deviance
## [1] 117.7701

It’s also worth noting that in this case, the cross-validated loss falls off quite quickly after alpha = 0, so in practice, almost any non-zero alpha will give models of similar quality.

Summary

Both linear and logistic regression assume that the outcome is a function of a linear combination of the inputs. This seems restrictive, but in practice linear and logistic regression models can perform well even when the theoretical assumptions aren’t exactly met. We’ll show how to further work around these limits in chapter 10.

Linear and logistic regression can also provide advice by quantifying the relationships between the outcomes and the model’s inputs. Since the models are expressed completely by their coefficients, they’re small, portable, and efficient—all valuable qualities when putting a model into production. If the model’s errors are uncorrelated with y, the model might be trusted to extrapolate predictions outside the training range. Extrapolation is never completely safe, but it’s sometimes necessary.

In situations where variables are correlated or the prediction problem is quasi-separable, linear methods may not perform as well. In these cases, regularization methods can produce models that are safer to apply to new data, although the coefficients of these models are not as useful for advice about the relationships between variables and the outcome.

While learning about linear models in this chapter, we have assumed that the data is well behaved: the data has no missing values, the number of possible levels for categorical variables is low, and all possible levels are present in the training data. In real-world data, these assumptions are not always true. In the next chapter, you will learn about advanced methods to prepare ill-behaved data for modeling.

In this chapter you have learned

  • How to predict numerical quantities with linear regression models
  • How to predict probabilities or classify using logistic regression models
  • How to interpret the diagnostics from lm() and glm() models
  • How to interpret the coefficients of linear models
  • How to diagnose when a linear model may not be “safe” or reliable (collinearity, quasi-separation)
  • How to use glmnet to fit regularized linear and logistic regression models