Variable selection is an important process, as it tries to make models simpler to interpret, easier to train, and free of spurious associations by eliminating variables unrelated to the output. This is one possible approach to dealing with the problem of overfitting. In general, we don't expect a model to completely fit our training data; in fact, the problem of overfitting often means that it may be detrimental to our predictive model's accuracy on unseen data if we fit our training data too well. In this section on regularization, we'll study an alternative to reducing the number of variables in order to deal with overfitting. Regularization is essentially a process of introducing an intentional bias or constraint in our training procedure that prevents our coefficients from taking large values. As this is a process that tries to shrink the coefficients, the methods we'll look at are also known as shrinkage methods.
When the number of parameters is very large, particularly compared to the number of available observations, linear regression tends to exhibit very high variance. This is to say that small changes in a few of the observations will cause the coefficients to change substantially. Ridge regression is a method that introduces bias through its constraint but is effective at reducing the model's variance. Ridge regression tries to minimize the sum of the residual sum of squares and a term that involves the sum of the squares of the coefficients multiplied by a constant for which we'll use the Greek letter λ. For a model with k parameters, not counting the constant term β0, and a data set with n observations, ridge regression minimizes the following quantity:
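$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\right)^2 \;+\; \lambda \sum_{j=1}^{k}\beta_j^2$$

Here, $y_i$ denotes the output value of the $i$th observation and $x_{ij}$ the value of its $j$th feature.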
We are still minimizing the RSS, but the second term is the penalty term, which is high when any of the coefficients is large. Thus, when minimizing, we are effectively pushing the coefficients toward smaller values. The λ parameter is known as a meta-parameter, which we need to select or tune. A very large value of λ will mask the RSS term and simply push the coefficients toward zero. An overly small value of λ will not be as effective against overfitting, and a λ value of 0 just performs regular linear regression.
When performing ridge regression, we often want to scale our features by dividing each one by its standard deviation, so that they all have unit variance. This was not necessary with regular linear regression because, if one feature is scaled up by a factor of ten, its coefficient is simply scaled down by a factor of ten to compensate. With ridge regression, the scale of a feature affects the computation of all the other features through the penalty term.
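As a minimal sketch, assuming a hypothetical numeric feature matrix X, this kind of scaling can be done with base R's scale() function. Note that glmnet(), which we use below, standardizes the features internally by default (via its standardize argument) and reports coefficients back on the original scale, so explicit scaling is mainly a concern when using other implementations:

# X is a hypothetical matrix of numeric features
# scale() centers each column and divides it by its standard deviation,
# so that no feature dominates the penalty term purely because of its units
X_scaled <- scale(X)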
The lasso is an alternative regularization method to ridge regression. The difference appears only in the penalty term, which involves the sum of the absolute values of the coefficients rather than the sum of their squares.
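In the same notation as before, the lasso therefore minimizes:

$$\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{k}\beta_j x_{ij}\right)^2 \;+\; \lambda \sum_{j=1}^{k}\lvert\beta_j\rvert$$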
It turns out that this difference in the penalty term is very significant, as the lasso combines both shrinkage and selection because it shrinks some coefficients to exactly zero, which is not the case with ridge regression. Despite this, there is no clear winner between these two. Models that depend on a subset of the input features will tend to perform better with lasso; models that have a large spread in coefficients across many different variables will tend to perform better with ridge regression. It is usually worth trying both.
The penalty in ridge regression is often referred to as an l2 penalty, whereas the penalty term in the lasso is known as an l1 penalty. This terminology arises from the mathematical notion of the norm of a vector. A norm of a vector is a function that assigns a non-negative number to that vector to represent its length or size. There are many different types of norms. Both the l1 and l2 norms are examples of a family of norms known as p-norms, which have the following general form for a vector v with n components:
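$$\lVert v \rVert_p = \left(\sum_{i=1}^{n} \lvert v_i \rvert^p\right)^{1/p}$$

Here, $v_i$ denotes the $i$th component of $v$. Setting $p = 2$ gives the Euclidean length whose square appears in the ridge penalty, while $p = 1$ gives the sum of absolute values used by the lasso.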
There are a number of different functions and packages that implement ridge regression, such as lm.ridge() from the MASS package and ridge() from the genridge package. For the lasso, there is also the lars package. In this chapter, we are going to work with the glmnet() function from the glmnet package, due to its consistent and friendly interface.

The key to working with regularization is to determine an appropriate value of λ to use. The approach that the glmnet() function takes is to use a grid of different λ values and train a regression model for each value. Then, one can either pick a value manually or use a technique to estimate the best λ. We can specify the sequence of λ values to try via the lambda parameter; otherwise, a default sequence with 100 values will be used. The first parameter to the glmnet() function must be a matrix of features, which we can build using the model.matrix() function. The second parameter is a vector with the output variable. Finally, the alpha parameter is a switch between ridge regression (0) and the lasso (1). We're now ready to train some models on the cars data set:
> library(glmnet)
> cars_train_mat <- model.matrix(Price ~ .-Saturn, cars_train)[,-1]
> lambdas <- 10 ^ seq(8, -4, length = 250)
> cars_models_ridge <- glmnet(cars_train_mat, cars_train$Price, alpha = 0, lambda = lambdas)
> cars_models_lasso <- glmnet(cars_train_mat, cars_train$Price, alpha = 1, lambda = lambdas)
As we provided a sequence of 250 λ values, we've actually trained 250 ridge regression models and another 250 lasso models. We can see the value of λ from the lambda attribute of the object produced by glmnet(), and apply the coef() function to this object to retrieve the corresponding coefficients, here for the 100th model, as follows:
> cars_models_ridge$lambda[100]
[1] 1694.009
> coef(cars_models_ridge)[,100]
  (Intercept)       Mileage      Cylinder         Doors
 6217.5498831    -0.1574441  2757.9937160   371.2268405
       Cruise         Sound       Leather         Buick
 1694.6023651   100.2323812  1326.7744321  -358.8397493
     Cadillac         Chevy       Pontiac          Saab
11160.4861489 -2370.3268837 -2256.7482905  8416.9209564
  convertible     hatchback         sedan
10576.9050477 -3263.4869674 -2058.0627013
We can use the plot() function to obtain a plot showing how the values of the coefficients change as the logarithm of λ changes. It is very helpful to show the corresponding plots for ridge regression and the lasso side by side:
> layout(matrix(c(1, 2), 1, 2))
> plot(cars_models_ridge, xvar = "lambda", main = "Ridge Regression")
> plot(cars_models_lasso, xvar = "lambda", main = "Lasso")
The key difference between these two graphs is that the lasso forces many coefficients to fall to exactly zero, whereas in ridge regression they tend to drop off smoothly and only reach zero together at extreme values of λ. This is also evident from the numbers on the top horizontal axis of both graphs, which show the number of non-zero coefficients as λ varies. The lasso therefore has a significant advantage in that it can often be used to perform feature selection (a feature with a zero coefficient is essentially excluded from the model) as well as providing regularization to reduce overfitting. We can obtain other useful plots by changing the value supplied to the xvar parameter. The value norm plots the l1 norm of the coefficients on the x-axis, and dev plots the percentage of deviance explained. We will learn about deviance in the next chapter.
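For example, the following calls (shown here for the lasso models; the same works for the ridge models) produce these two alternative views:

> plot(cars_models_lasso, xvar = "norm")
> plot(cars_models_lasso, xvar = "dev")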
To deal with the issue of finding a good value for λ, the glmnet package offers the cv.glmnet() function. This uses a technique known as cross-validation (we'll study this in Chapter 5, Support Vector Machines) on the training data to find a λ that minimizes the MSE:
> ridge.cv <- cv.glmnet(cars_train_mat, cars_train$Price, alpha = 0)
> lambda_ridge <- ridge.cv$lambda.min
> lambda_ridge
[1] 641.6408
> lasso.cv <- cv.glmnet(cars_train_mat, cars_train$Price, alpha = 1)
> lambda_lasso <- lasso.cv$lambda.min
> lambda_lasso
[1] 10.45715
If we plot the results produced by the cv.glmnet() function, we can see how the MSE changes over the different values of λ:
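These plots can be produced directly from the objects returned by cv.glmnet(); for example, to show the two side by side as before:

> layout(matrix(c(1, 2), 1, 2))
> plot(ridge.cv)
> plot(lasso.cv)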
The bars shown above and below each dot are error bars extending one standard deviation above and below the estimate of the MSE for each plotted value of λ. The plots also show two vertical dotted lines. The first vertical line corresponds to the value of lambda.min, which is the optimal value proposed by cross-validation. The second vertical line, to the right, marks the value stored in the lambda.1se attribute. This is the largest value of λ whose cross-validated error lies within one standard error of the minimum, and it produces a more regularized model.
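Both values are stored on the objects returned by cv.glmnet(), so we can, for example, retrieve the more conservative choice and its coefficients as follows (output omitted, as the exact values depend on the random assignment of observations to cross-validation folds):

> lasso.cv$lambda.1se
> coef(lasso.cv, s = "lambda.1se")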
With the glmnet package, the predict() function operates in a variety of contexts. We can, for example, obtain the coefficients of a model for a value of λ that was not in our original sequence:
> predict(cars_models_lasso, type = "coefficients", s = lambda_lasso)
15 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept)  -521.3516739
Mileage        -0.1861493
Cylinder     3619.3006985
Doors        1400.7484461
Cruise        310.9153455
Sound         340.7585158
Leather       830.7770461
Buick        1139.9522370
Cadillac    13377.3244020
Chevy        -501.7213442
Pontiac     -1327.8094954
Saab        12306.0915679
convertible 11160.6987522
hatchback   -6072.0031626
sedan       -4179.9112364
Note that the lasso has not forced any coefficients to zero in this case, indicating that, based on the MSE, it does not suggest removing any of them for the cars data set. Finally, using the predict() function again, we can make predictions with a regularized model, using the newx parameter to provide a matrix of features for the observations on which we want to make predictions:
> cars_test_mat <- model.matrix(Price ~ . -Saturn, cars_test)[,-1]
> cars_ridge_predictions <- predict(cars_models_ridge, s = lambda_ridge, newx = cars_test_mat)
> compute_mse(cars_ridge_predictions, cars_test$Price)
[1] 7609538
> cars_lasso_predictions <- predict(cars_models_lasso, s = lambda_lasso, newx = cars_test_mat)
> compute_mse(cars_lasso_predictions, cars_test$Price)
[1] 7173997
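Note that compute_mse() is not part of the glmnet package; it is a small helper for computing the mean squared error of a vector of predictions. If it is not already defined in your session, a minimal version looks like this:

# Minimal MSE helper: the mean of the squared residuals
compute_mse <- function(predictions, actual) {
  mean((predictions - actual) ^ 2)
}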
The lasso model performs best and, unlike ridge regression in this case, also slightly outperforms the regular linear regression model on the test data.