One of the dangers of machine learning is over-fitting: the algorithm captures not only the signal in the training set, but also the statistical noise that results from the finite size of the training set.
A way to mitigate over-fitting in logistic regression is to use regularization: we impose a penalty for large values of the parameters when optimizing. We can do this by adding a penalty to the cost function that is proportional to the magnitude of the parameters. Formally, we re-write the logistic regression cost function (described in Chapter 2, Manipulating Data with Breeze) as:

$$C_{reg}(\text{params}) = C(\text{params}) + \lambda \, \|\text{params}\|_n$$

where $C(\text{params})$ is the normal logistic regression cost function:

$$C(\text{params}) = -\sum_i \left[\, y_i \log \sigma(\text{params} \cdot x_i) + (1 - y_i) \log\left(1 - \sigma(\text{params} \cdot x_i)\right) \right]$$

Here, params is the vector of parameters, $x_i$ is the vector of features for the $i$th training example, $y_i$ is 1 if the $i$th training example is spam and 0 otherwise, and $\sigma$ is the sigmoid function. This is identical to the logistic regression cost function introduced in Chapter 2, Manipulating Data with Breeze, apart from the addition of the regularization term $\lambda \, \|\text{params}\|_n$, the $L_n$ norm of the parameter vector scaled by a constant $\lambda$. The most common value of $n$ is 2, in which case $\|\text{params}\|_2$ is just the magnitude of the parameter vector:

$$\|\text{params}\|_2 = \sqrt{\sum_j \text{params}_j^2}$$
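To make this concrete, here is a minimal Breeze sketch of the $L_2$-regularized cost. The function name and its signature are our own illustration, not MLlib's implementation:

import breeze.linalg.{DenseVector, norm}
import breeze.numerics.sigmoid

// Illustrative sketch of the regularized cost with an L2 penalty.
def regularizedCost(
    params: DenseVector[Double],
    features: Seq[DenseVector[Double]], // x_i: one feature vector per example
    target: Seq[Double],                // y_i: 1.0 for spam, 0.0 otherwise
    lambda: Double                      // regularization strength
): Double = {
  val negativeLogLikelihood = features.zip(target).map { case (x, y) =>
    val p = sigmoid(params dot x) // predicted probability that the example is spam
    -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
  }.sum
  // Penalize large parameter vectors: lambda times the L2 norm of params.
  negativeLogLikelihood + lambda * norm(params)
}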
The additional regularization term drives the algorithm to reduce the magnitude of the parameter vector. When using regularization, the features must all have comparable magnitudes. This is commonly achieved by normalizing the features. The logistic regression estimator provided by MLlib normalizes all features by default. This can be turned off with the setStandardization parameter.
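For instance, if the features had already been scaled upstream, standardization could be switched off explicitly. A minimal sketch (the variable name is our own, and this step is not needed in our pipeline):

import org.apache.spark.ml.classification.LogisticRegression

// Illustrative only: build an estimator with feature standardization
// disabled, for data that has already been scaled.
val lrNoStandardization = new LogisticRegression()
  .setStandardization(false)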
Spark has two hyperparameters that can be tweaked to control regularization:
- The type of regularization is set with the elasticNetParam parameter. A value of 0 indicates $L_2$ regularization (a value of 1 would select pure $L_1$ regularization).
- The degree of regularization ($\lambda$ in our notation) is set with the regParam parameter. A high value of the regularization parameter indicates a strong regularization. In general, the greater the danger of over-fitting, the larger the regularization parameter ought to be.
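As a sketch of how the two parameters combine, the following estimator (hypothetical, and not part of our pipeline) would apply a pure $L_1$ penalty of strength 0.01:

// Illustrative only: elasticNetParam = 1.0 selects a pure L1 penalty,
// and regParam sets its strength (lambda).
val lrL1 = new LogisticRegression()
  .setElasticNetParam(1.0)
  .setRegParam(0.01)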
Let's create a new logistic regression instance that uses $L_2$ regularization:
scala> val lrWithRegularization = (new LogisticRegression()
  .setMaxIter(50))
lrWithRegularization: LogisticRegression = logreg_16b65b325526

scala> lrWithRegularization.setElasticNetParam(0)
lrWithRegularization.type = logreg_1e3584a59b3a
To choose the appropriate value of $\lambda$, we fit the pipeline to the training set and calculate the classification error on the test set for several values of $\lambda$. Further on in the chapter, we will learn about cross-validation in MLlib, which provides a much more rigorous way of choosing hyperparameters.
scala> val lambdas = Array(0.0, 1.0E-12, 1.0E-10, 1.0E-8)
lambdas: Array[Double] = Array(0.0, 1.0E-12, 1.0E-10, 1.0E-8)

scala> lambdas foreach { lambda =>
  lrWithRegularization.setRegParam(lambda)
  val pipeline = new Pipeline().setStages(
    Array(indexer, tokenizer, hashingTF, lrWithRegularization))
  val model = pipeline.fit(trainDF)
  val transformedTest = model.transform(testDF)
  val classificationError = transformedTest.filter {
    $"prediction" !== $"label"
  }.count
  println(s"$lambda => $classificationError")
}
0 => 20
1.0E-12 => 20
1.0E-10 => 20
1.0E-8 => 23
For our example, adding $L_2$ regularization does not improve the classifier: the smallest values of $\lambda$ leave the classification error unchanged, and larger values actually increase it (from 20 to 23 misclassified messages at $\lambda = 10^{-8}$).