Variable selection is an important process, as it tries to make models simpler to interpret, easier to train, and free of spurious associations by eliminating variables unrelated to the output. This is one possible approach to dealing with the problem of overfitting. In general, we don't expect a model to completely fit our training data; in fact, the problem of overfitting often means that it may be detrimental to our predictive model's accuracy on unseen data if we fit our training data too well. In this section on regularization, we'll study an alternative to reducing the number of variables in order to deal with overfitting. Regularization is essentially a process of introducing an intentional bias or constraint in our training procedure that prevents our coefficients from taking large values. As this is a process that tries to shrink the coefficients, the methods we'll look at are also known as shrinkage methods.
When the number of parameters is very large, particularly compared to the number of available observations, linear regression tends to exhibit very high variance. This is to say that small changes in a few of the observations will cause the coefficients to change substantially. Ridge regression is a method that introduces bias through its constraint but is effective at reducing the model's variance. Ridge regression tries to minimize the sum of the residual sum of squares and a term that involves the sum of the squares of the coefficients multiplied by a constant for which we'll use the Greek letter λ. For a model with k parameters, not counting the constant term β0, and a data set with n observations, ridge regression minimizes the following quantity:
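$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\right)^2 \;+\; \lambda \sum_{j=1}^{k}\beta_j^2$$

Here, $y_i$ denotes the output value of the $i$th observation and $x_{ij}$ the value of its $j$th feature.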
We are still minimizing the RSS, but the second term is the penalty term, which is high when any of the coefficients is large. Thus, when minimizing, we are effectively pushing the coefficients toward smaller values. The λ parameter is known as a meta-parameter, which we need to select or tune. A very large value of λ will mask the RSS term and simply push the coefficients toward zero. An overly small value of λ will not be as effective against overfitting, and a λ value of 0 just performs regular linear regression.
When performing ridge regression, we often want to scale our features by dividing each one by its standard deviation, so that they all have unit variance. This was not necessary with regular linear regression because, if one feature is scaled up by a factor of ten, its coefficient is simply scaled down by a factor of ten to compensate. With ridge regression, the scale of a feature affects the computation of all the other features through the penalty term.
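As a minimal sketch, assuming a hypothetical numeric feature matrix X, this kind of scaling can be done with base R's scale() function. Note that glmnet(), which we use below, standardizes the features internally by default (via its standardize argument) and reports coefficients back on the original scale, so explicit scaling is mainly a concern when using other implementations:

# X is a hypothetical matrix of numeric features
# scale() centers each column and divides it by its standard deviation,
# so that no feature dominates the penalty term purely because of its units
X_scaled <- scale(X)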
The lasso is an alternative regularization method to ridge regression. The difference appears only in the penalty term, which involves the sum of the absolute values of the coefficients rather than the sum of their squares.
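In the same notation as before, the lasso therefore minimizes:

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\right)^2 \;+\; \lambda \sum_{j=1}^{k}\lvert\beta_j\rvert$$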
It turns out that this difference in the penalty term is very significant, as the lasso combines both shrinkage and selection because it shrinks some coefficients to exactly zero, which is not the case with ridge regression. Despite this, there is no clear winner between these two. Models that depend on a subset of the input features will tend to perform better with lasso; models that have a large spread in coefficients across many different variables will tend to perform better with ridge regression. It is usually worth trying both.
The penalty in ridge regression is often referred to as an l2 penalty, whereas the penalty term in the lasso is known as an l1 penalty. This terminology arises from the mathematical notion of the norm of a vector. A norm of a vector is a function that assigns a non-negative number to that vector to represent its length or size. There are many different types of norms. Both the l1 and l2 norms are examples of a family of norms known as p-norms, which have the following general form for a vector v with n components:
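$$\lVert v \rVert_p = \left(\sum_{i=1}^{n} \lvert v_i \rvert^p\right)^{1/p}$$

Here, $v_i$ denotes the $i$th component of $v$. Setting $p = 2$ gives the Euclidean length whose square appears in the ridge penalty, while $p = 1$ gives the sum of absolute values used by the lasso.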
There are a number of different functions and packages that implement ridge regression, such as lm.ridge() from the MASS package and ridge() from the genridge package. For the lasso, there is also the lars package. In this chapter, we are going to work with the glmnet() function from the glmnet package, due to its consistent and friendly interface.

The key to working with regularization is to determine an appropriate value of λ to use. The approach that the glmnet() function takes is to use a grid of different λ values and train a regression model for each value. Then, one can either pick a value manually or use a technique to estimate the best λ. We can specify the sequence of λ values to try via the lambda parameter; otherwise, a default sequence with 100 values will be used. The first parameter to the glmnet() function must be a matrix of features, which we can build using the model.matrix() function. The second parameter is a vector with the output variable. Finally, the alpha parameter is a switch between ridge regression (0) and the lasso (1). We're now ready to train some models on the cars data set:
> library(glmnet)
> cars_train_mat <- model.matrix(Price ~ .-Saturn, cars_train)[,-1]
> lambdas <- 10 ^ seq(8, -4, length = 250)
> cars_models_ridge <- glmnet(cars_train_mat, cars_train$Price, alpha = 0, lambda = lambdas)
> cars_models_lasso <- glmnet(cars_train_mat, cars_train$Price, alpha = 1, lambda = lambdas)
As we provided a sequence of 250 λ values, we've actually trained 250 ridge regression models and another 250 lasso models. We can see the value of λ from the lambda attribute of the object produced by glmnet(), and apply the coef() function to this object to retrieve the corresponding coefficients, here for the 100th model, as follows:
> cars_models_ridge$lambda[100]
[1] 1694.009
> coef(cars_models_ridge)[,100]
  (Intercept)       Mileage      Cylinder         Doors
 6217.5498831    -0.1574441  2757.9937160   371.2268405
       Cruise         Sound       Leather         Buick
 1694.6023651   100.2323812  1326.7744321  -358.8397493
     Cadillac         Chevy       Pontiac          Saab
11160.4861489 -2370.3268837 -2256.7482905  8416.9209564
  convertible     hatchback         sedan
10576.9050477 -3263.4869674 -2058.0627013
We can use the plot() function to obtain a plot showing how the values of the coefficients change as the logarithm of λ changes. It is very helpful to show the corresponding plots for ridge regression and the lasso side by side:
> layout(matrix(c(1, 2), 1, 2))
> plot(cars_models_ridge, xvar = "lambda", main = "Ridge Regression")
> plot(cars_models_lasso, xvar = "lambda", main = "Lasso")
The key difference between these two graphs is that the lasso forces many coefficients to fall to exactly zero, whereas in ridge regression they tend to drop off smoothly and only reach zero together at extreme values of λ. This is also evident from the numbers on the top horizontal axis of both graphs, which show the number of non-zero coefficients as λ varies. The lasso therefore has a significant advantage in that it can often be used to perform feature selection (a feature with a zero coefficient is essentially excluded from the model) as well as providing regularization to reduce overfitting. We can obtain other useful plots by changing the value supplied to the xvar parameter. The value norm plots the l1 norm of the coefficients on the x-axis, and dev plots the percentage of deviance explained. We will learn about deviance in the next chapter.
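For example, the following calls (shown here for the lasso models; the same works for the ridge models) produce these two alternative views:

> plot(cars_models_lasso, xvar = "norm")
> plot(cars_models_lasso, xvar = "dev")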
To deal with the issue of finding a good value for λ, the glmnet package offers the cv.glmnet() function. This uses a technique known as cross-validation (we'll study this in Chapter 5, Support Vector Machines) on the training data to find a λ that minimizes the MSE:
> ridge.cv <- cv.glmnet(cars_train_mat, cars_train$Price, alpha = 0)
> lambda_ridge <- ridge.cv$lambda.min
> lambda_ridge
[1] 641.6408
> lasso.cv <- cv.glmnet(cars_train_mat, cars_train$Price, alpha = 1)
> lambda_lasso <- lasso.cv$lambda.min
> lambda_lasso
[1] 10.45715
If we plot the results produced by the cv.glmnet() function, we can see how the MSE changes over the different values of λ:
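These plots can be produced directly from the objects returned by cv.glmnet(); for example, to show the two side by side as before:

> layout(matrix(c(1, 2), 1, 2))
> plot(ridge.cv)
> plot(lasso.cv)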
The bars shown above and below each dot are error bars extending one standard deviation above and below the estimate of the MSE for each plotted value of λ. The plots also show two vertical dotted lines. The first vertical line corresponds to the value of lambda.min, which is the optimal value proposed by cross-validation. The second vertical line, to the right, marks the value stored in the lambda.1se attribute. This is the largest value of λ whose cross-validated error lies within one standard error of the minimum, and it produces a more regularized model.
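Both values are stored on the objects returned by cv.glmnet(), so we can, for example, retrieve the more conservative choice and its coefficients as follows (output omitted, as the exact values depend on the random assignment of observations to cross-validation folds):

> lasso.cv$lambda.1se
> coef(lasso.cv, s = "lambda.1se")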
With the glmnet package, the predict() function operates in a variety of contexts. We can, for example, obtain the coefficients of a model for a value of λ that was not in our original sequence:
> predict(cars_models_lasso, type = "coefficients", s = lambda_lasso)
15 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept)  -521.3516739
Mileage        -0.1861493
Cylinder     3619.3006985
Doors        1400.7484461
Cruise        310.9153455
Sound         340.7585158
Leather       830.7770461
Buick        1139.9522370
Cadillac    13377.3244020
Chevy        -501.7213442
Pontiac     -1327.8094954
Saab        12306.0915679
convertible 11160.6987522
hatchback   -6072.0031626
sedan       -4179.9112364
Note that the lasso has not forced any coefficients to zero in this case, indicating that, based on the MSE, it does not suggest removing any of them for the cars data set. Finally, using the predict() function again, we can make predictions with a regularized model, using the newx parameter to provide a matrix of features for the observations on which we want to make predictions:
> cars_test_mat <- model.matrix(Price ~ . -Saturn, cars_test)[,-1]
> cars_ridge_predictions <- predict(cars_models_ridge, s = lambda_ridge, newx = cars_test_mat)
> compute_mse(cars_ridge_predictions, cars_test$Price)
[1] 7609538
> cars_lasso_predictions <- predict(cars_models_lasso, s = lambda_lasso, newx = cars_test_mat)
> compute_mse(cars_lasso_predictions, cars_test$Price)
[1] 7173997
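Note that compute_mse() is not part of the glmnet package; it is a small helper for computing the mean squared error of a vector of predictions. If it is not already defined in your session, a minimal version looks like this:

# Minimal MSE helper: the mean of the squared residuals
compute_mse <- function(predictions, actual) {
  mean((predictions - actual) ^ 2)
}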
The lasso model performs best and, unlike ridge regression in this case, also slightly outperforms the regular linear regression model on the test data.