One of the dangers of machine learning is over-fitting: the algorithm captures not only the signal in the training set, but also the statistical noise that results from the finite size of the training set.
A way to mitigate over-fitting in logistic regression is to use regularization: we impose a penalty for large values of the parameters when optimizing. We can do this by adding a penalty to the cost function that is proportional to the magnitude of the parameters. Formally, we re-write the logistic regression cost function (described in Chapter 2, Manipulating Data with Breeze) as:

$$C_{reg}(\text{params}) = C(\text{params}) + \lambda \, \|\text{params}\|_n$$

where $C(\text{params})$ is the normal logistic regression cost function:

$$C(\text{params}) = -\sum_i \left[\, y_i \log \sigma(\text{params} \cdot x_i) + (1 - y_i) \log\left(1 - \sigma(\text{params} \cdot x_i)\right) \right]$$

Here, params is the vector of parameters, $x_i$ is the vector of features for the $i$th training example, $y_i$ is 1 if the $i$th training example is spam and 0 otherwise, and $\sigma$ is the sigmoid function. This is identical to the logistic regression cost function introduced in Chapter 2, Manipulating Data with Breeze, apart from the addition of the regularization term $\lambda \, \|\text{params}\|_n$, the $L_n$ norm of the parameter vector scaled by a constant $\lambda$. The most common value of $n$ is 2, in which case $\|\text{params}\|_2$ is just the magnitude of the parameter vector:

$$\|\text{params}\|_2 = \sqrt{\sum_j \text{params}_j^2}$$
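To make this concrete, here is a minimal Breeze sketch of the $L_2$-regularized cost. The function name and its signature are our own illustration, not MLlib's implementation:

import breeze.linalg.{DenseVector, norm}
import breeze.numerics.sigmoid

// Illustrative sketch of the regularized cost with an L2 penalty.
def regularizedCost(
    params: DenseVector[Double],
    features: Seq[DenseVector[Double]], // x_i: one feature vector per example
    target: Seq[Double],                // y_i: 1.0 for spam, 0.0 otherwise
    lambda: Double                      // regularization strength
): Double = {
  val negativeLogLikelihood = features.zip(target).map { case (x, y) =>
    val p = sigmoid(params dot x) // predicted probability that the example is spam
    -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
  }.sum
  // Penalize large parameter vectors: lambda times the L2 norm of params.
  negativeLogLikelihood + lambda * norm(params)
}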
The additional regularization term drives the algorithm to reduce the magnitude of the parameter vector. When using regularization, the features must all have comparable magnitudes. This is commonly achieved by normalizing the features. The logistic regression estimator provided by MLlib normalizes all features by default. This can be turned off with the setStandardization parameter.
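For instance, if the features had already been scaled upstream, standardization could be switched off explicitly. A minimal sketch (the variable name is our own, and this step is not needed in our pipeline):

import org.apache.spark.ml.classification.LogisticRegression

// Illustrative only: build an estimator with feature standardization
// disabled, for data that has already been scaled.
val lrNoStandardization = new LogisticRegression()
  .setStandardization(false)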
Spark has two hyperparameters that can be tweaked to control regularization:
- The type of regularization is set with the elasticNetParam parameter. A value of 0 indicates $L_2$ regularization (a value of 1 would select pure $L_1$ regularization).
- The degree of regularization ($\lambda$ in our notation) is set with the regParam parameter. A high value of the regularization parameter indicates a strong regularization. In general, the greater the danger of over-fitting, the larger the regularization parameter ought to be.
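As a sketch of how the two parameters combine, the following estimator (hypothetical, and not part of our pipeline) would apply a pure $L_1$ penalty of strength 0.01:

// Illustrative only: elasticNetParam = 1.0 selects a pure L1 penalty,
// and regParam sets its strength (lambda).
val lrL1 = new LogisticRegression()
  .setElasticNetParam(1.0)
  .setRegParam(0.01)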
Let's create a new logistic regression instance that uses $L_2$ regularization:
scala> val lrWithRegularization = (new LogisticRegression()
  .setMaxIter(50))
lrWithRegularization: LogisticRegression = logreg_16b65b325526

scala> lrWithRegularization.setElasticNetParam(0)
lrWithRegularization.type = logreg_1e3584a59b3a
To choose the appropriate value of $\lambda$, we fit the pipeline to the training set and calculate the classification error on the test set for several values of $\lambda$. Further on in the chapter, we will learn about cross-validation in MLlib, which provides a much more rigorous way of choosing hyperparameters.
scala> val lambdas = Array(0.0, 1.0E-12, 1.0E-10, 1.0E-8)
lambdas: Array[Double] = Array(0.0, 1.0E-12, 1.0E-10, 1.0E-8)

scala> lambdas foreach { lambda =>
  lrWithRegularization.setRegParam(lambda)
  val pipeline = new Pipeline().setStages(
    Array(indexer, tokenizer, hashingTF, lrWithRegularization))
  val model = pipeline.fit(trainDF)
  val transformedTest = model.transform(testDF)
  val classificationError = transformedTest.filter {
    $"prediction" !== $"label"
  }.count
  println(s"$lambda => $classificationError")
}
0 => 20
1.0E-12 => 20
1.0E-10 => 20
1.0E-8 => 23
For our example, adding $L_2$ regularization does not improve the classifier: the smallest values of $\lambda$ leave the classification error unchanged, and larger values actually increase it (from 20 to 23 misclassified messages at $\lambda = 10^{-8}$).