The ordinary least squares method for finding the regression parameters is a special case of maximum likelihood estimation. Therefore, regression models are subject to the same overfitting challenge as any other discriminative model. You are already aware that regularization is used to reduce model complexity and avoid overfitting, as stated in the Overfitting section in Chapter 2, Hello World!
Regularization consists of adding a penalty function J(w) to the loss function (or to the residual sum of squares, RSS, in the case of a regression) in order to prevent the model parameters (also known as weights) from reaching high values. This process is known as shrinkage. It is needed because a model that fits a training set very well tends to have many feature variables with relatively large weights. Practically, shrinkage involves adding a function with the model parameters as an argument to the loss function (M5):
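A common way to write this regularized objective is shown here as a sketch. The loss symbol L, the model f(x | w), and the regularization factor λ are notational assumptions chosen to match the rest of this section.

```latex
\hat{w} = \arg\min_{w} \left\{ L\big(y, f(x \mid w)\big) + \lambda\, J(w) \right\}
```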
The penalty function is completely independent of the training set {x, y}. The penalty term is usually expressed as a power of the norm of the model parameters (or weights) w_d. For a model of dimension D, the generic Lp-norm is defined as follows (M6):
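The standard definition of the Lp-norm, written with the w_d and D symbols used above, is:

```latex
\lVert w \rVert_{p} = \left( \sum_{d=1}^{D} \lvert w_d \rvert^{p} \right)^{1/p}
```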
The two most commonly used penalty functions for regularization are L1 and L2.
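As a minimal sketch, the two penalties and the generic Lp-norm could be computed as follows. The Penalties object and its method names are hypothetical and are not part of the library described in this chapter. The sketch also assumes the usual convention of taking the squared L2-norm as the ridge penalty.

```scala
object Penalties {
  // L1 penalty (lasso): sum of the absolute values of the weights
  def l1(w: Seq[Double]): Double = w.map(x => math.abs(x)).sum

  // L2 penalty (ridge): squared L2-norm, that is, the sum of squared weights
  def l2(w: Seq[Double]): Double = w.map(x => x * x).sum

  // Generic Lp-norm for p >= 1
  def lpNorm(w: Seq[Double], p: Double): Double =
    math.pow(w.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)

  // Penalized loss: RSS plus the penalty scaled by the factor lambda
  def penalizedLoss(rss: Double, lambda: Double, penalty: Double): Double =
    rss + lambda * penalty
}
```

For instance, Penalties.penalizedLoss(rss, 0.5, Penalties.l2(weights)) would give a ridge-style loss for a penalty factor of 0.5.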
Regularization in machine learning
The regularization technique is not specific to the linear or logistic regression. Any algorithm that minimizes the residual sum of squares, such as a support vector machine or feed-forward neural network, can be regularized by adding a roughness penalty function to the RSS.
The L1 regularization applied to the linear regression is known as the lasso regularization. The ridge regression is a linear regression that uses the L2 regularization penalty.
You may wonder which regularization makes sense for a given training set. In a nutshell, the L2 and L1 regularizations differ in terms of computational efficiency, estimation, and feature selection [6:10] [6:11].
Let's implement the ridge regression, and then evaluate the impact of the L2-norm penalty factor.
The ridge regression is a multivariate linear regression with an L2-norm penalty term (M7):
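One standard way to write the ridge objective is sketched here. The indexing over n observations (x_i, y_i) and the use of λ for the penalty factor are assumptions consistent with the lambda parameter used in the code of this section.

```latex
\hat{w}_{Ridge} = \arg\min_{w} \left\{ \sum_{i=0}^{n-1} \left( y_i - w^{T} x_i \right)^{2} + \lambda \lVert w \rVert_{2}^{2} \right\}
```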
The computation of the ridge regression parameters requires solving a system of linear equations similar to that of the linear regression.
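In its textbook form, that system is (XᵀX + λI) w = Xᵀy. The following dependency-free sketch solves it with Gaussian elimination. It is not the Apache Commons Math-based implementation described in this section. RidgeSketch, solve, and ridgeWeights are hypothetical names, and the intercept is omitted for brevity.

```scala
object RidgeSketch {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // Solves a * x = b by Gaussian elimination with partial pivoting
  def solve(a: Mat, b: Vec): Vec = {
    val n = b.length
    // Augmented matrix [a | b], copied so that the inputs are not mutated
    val m = Array.tabulate(n, n + 1)((i, j) => if (j < n) a(i)(j) else b(i))
    for (col <- 0 until n) {
      // Select the row with the largest pivot for numerical stability
      val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
      val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
      for (row <- col + 1 until n) {
        val factor = m(row)(col) / m(col)(col)
        for (j <- col to n) m(row)(j) -= factor * m(col)(j)
      }
    }
    // Back substitution
    val x = new Array[Double](n)
    for (i <- n - 1 to 0 by -1) {
      val s = (i + 1 until n).map(j => m(i)(j) * x(j)).sum
      x(i) = (m(i)(n) - s) / m(i)(i)
    }
    x
  }

  // Ridge weights: solves (X^T X + lambda * I) w = X^T y, without intercept
  def ridgeWeights(x: Mat, y: Vec, lambda: Double): Vec = {
    val d = x.head.length
    val xtx = Array.tabulate(d, d) { (i, j) =>
      x.map(row => row(i) * row(j)).sum + (if (i == j) lambda else 0.0)
    }
    val xty = Array.tabulate(d)(i => x.indices.map(k => x(k)(i) * y(k)).sum)
    solve(xtx, xty)
  }
}
```

With λ = 0, ridgeWeights reduces to the ordinary least squares solution, provided that XᵀX is invertible.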
The implementation of the ridge regression adds the L2 regularization term to the multiple linear regression computation of the Apache Commons Math library. The methods of RidgeRegression have the same signature as their ordinary least squares counterparts except for the lambda L2 penalty term (line 1):
class RidgeRegression[T <: AnyVal](   //1
    xt: XVSeries[T],
    expected: DblVector,
    lambda: Double)(implicit f: T => Double)
  extends ITransform[Array[T]](xt) with Regression
    with Monitor[Double] {   //2

  type V = Double   //3
  override def train: Option[RegressionModel]   //4
  override def |> : PartialFunction[Array[T], Try[V]]
}
The RidgeRegression class is implemented as an ITransform data transformation whose model is implicitly derived from the input data (training set), as described in the Monadic data transformation section in Chapter 2, Hello World! (line 2). The V type of the output of the |> predictive function is a Double (line 3). The model is created through training during the instantiation of the class (line 4).
The relationship between the different components of the ridge regression is described in the following UML class diagram:
The UML diagram omits the helper traits or classes such as Monitor or the Apache Commons Math components.
Let's take a look at the training method, train:
def train: RegressionModel = {
  val mlr = new RidgeRAdapter(lambda, xt.head.size)   //5
  mlr.createModel(data, expected)   //6
  RegressionModel(mlr.getWeights, mlr.getRss)   //7
}
It is rather simple: it initializes and executes the regression algorithm implemented in the RidgeRAdapter class (line 5), which acts as an adapter to the internal Apache Commons Math library AbstractMultipleLinearRegression class in the org.apache.commons.math3.stat.regression package (line 6). The method returns a fully initialized regression model that is similar to the one produced by the ordinary least squares regression (line 7).
Let's take a look at the RidgeRAdapter adapter class:
class RidgeRAdapter(
    lambda: Double,
    dim: Int) extends AbstractMultipleLinearRegression {
  var qr: QRDecomposition = _   //8

  def createModel(x: DblMatrix, y: DblVector): Unit = {   //9
    this.newXSampleData(x)   //10
    super.newYSampleData(y.toArray)
  }
  def getWeights: DblArray = calculateBeta.toArray   //11
  def getRss: Double = rss
}
The constructor for the RidgeRAdapter class takes two parameters: the lambda L2 penalty parameter and the number of features, dim, in an observation. The QR decomposition in the AbstractMultipleLinearRegression base class does not process the penalty term (line 8). Therefore, the creation of the model has to be redefined in the createModel method (line 9), which requires overriding the newXSampleData method (line 10):
override protected def newXSampleData(x: DblMatrix): Unit = {
  super.newXSampleData(x)   //12
  val r: RealMatrix = getX
  Range(0, dim).foreach(i =>
    r.setEntry(i, i, r.getEntry(i, i) + lambda) )   //13
  qr = new QRDecomposition(r)   //14
}
The newXSampleData method overrides the default observations-features matrix r (line 12) by adding the lambda coefficient to its diagonal elements (line 13), and then updates the QR decomposition components (line 14).
The weights for the ridge regression model are computed by implementing the M7 formula (line 11) in the overridden calculateBeta method (line 15):
override protected def calculateBeta: RealVector = qr.getSolver().solve(getY()) //15
As with the ordinary least squares regression, the predictive algorithm is implemented by the |> data transformation. The method predicts the output value, given a model and an input value x (line 16):
def |> : PartialFunction[Array[T], Try[V]] = {
case x: Array[T] if(isModel &&
x.length == model.get.size-1) =>
Try( dot(x, model.get) ) //16
}
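The length guard and the dot product can be illustrated with a stand-alone sketch. It assumes that the first weight is the intercept. That assumption is suggested by the model.get.size-1 condition but is not stated in the text. PredictSketch and predict are hypothetical names, and Option is used here in place of Try to keep the example short.

```scala
object PredictSketch {
  // Assumes weights(0) is the intercept and weights(1..) match the features.
  // Returns None when the observation does not have the expected dimension.
  def predict(weights: Array[Double], x: Array[Double]): Option[Double] =
    if (x.length == weights.length - 1)
      Some(weights.head + x.zip(weights.tail).map { case (xi, wi) => xi * wi }.sum)
    else
      None
}
```

An observation with the wrong number of features yields None rather than a prediction, which mirrors the role of the guard in the partial function.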
The objective of the test case is to identify the impact of the L2 penalization on the RSS value and then compare the predicted values with the original values.
Let's consider the first test case related to the regression on the daily price variation of the Copper ETF (symbol: CU) using the stock daily volatility and volume as features. The implementation of the extraction of observations is identical to that for the least squares regression, as described in the previous section:
val LAMBDA: Double = 0.5
val src = DataSource(path, true, true, 1)   //17

for {
  price <- src.get(adjClose)   //18
  volatility <- src.get(volatility)   //19
  volume <- src.get(volume)   //20
  (features, expected) <- differentialData(volatility,
      volume, price, diffDouble)   //21
  regression <- RidgeRegression[Double](features,
      expected, LAMBDA)   //22
} yield {
  if( regression.isModel ) {
    val trend = features
        .map( dot(_, regression.weights.get) )   //23
    val y1 = predict(0.2, expected, volatility, volume)   //24
    val y2 = predict(5.0, expected, volatility, volume)
    val output = (2 until 10 by 2).map( n =>
        predict(n*0.1, expected, volatility, volume) )
  }
}
Let's take a look at the steps required for the execution of the test. The steps consist of collecting data, extracting the features and expected values, and training the ridge regression model:
1. Collect the price at the close of the trading session, the volatility within the session, and the volume for the session for the ETF CU using the DataSource transformation (line 17).
2. Extract the price of the ETF (line 18), its volatility within a trading session (line 19), and the volume of trading during the same session (line 20).
3. Generate the features and the expected outcome {0, 1} for training the model, where 1 represents the increase in the price and 0 represents the decrease in the price (line 21). The differentialData generic method of the XTSeries singleton is described in the Time series in Scala section in Chapter 3, Data Preprocessing.
4. Train the ridge regression model using the features set and the expected change in the daily stock price (line 22).
5. Compute the trend values using the dot function of the RegressionModel singleton (line 23).
6. Execute the prediction with the predict method (line 24).

The code is as follows:
def predict(
    lambda: Double,
    deltaPrice: DblVector,
    volatility: DblVector,
    volume: DblVector): DblVector = {

  val observations = zipToSeries(volatility, volume)   //25
  val regression = new RidgeRegression[Double](observations,
      deltaPrice, lambda)
  // Dot notation prevents the next line from being parsed as an infix argument
  val fnRegr = regression.|>   //26
  observations.map( fnRegr(_).get)   //27
}
The observations are extracted from the volatility and volume time series (line 25). The predictive function of the ridge regression, fnRegr (line 26), is applied to each observation (line 27). The RSS value, rss, is plotted for different values of λ, as shown in the following chart:
The residual sum of squares decreases as λ increases. The curve seems to be reaching for a minimum around λ = 1. The case of λ = 0 corresponds to the least squares regression.
Next, let's plot the RSS value for λ varying between 1 and 100:
This time around, the value of RSS increases with λ before reaching a maximum for λ > 60. This behavior is consistent with other findings [6:12]. As λ increases, overfitting is penalized more heavily, and therefore, the RSS value increases.
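The effect of a growing penalty on the training RSS can be reproduced on a toy data set with the textbook one-feature ridge estimator, w = Σxy / (Σx² + λ). This is only a sketch under that assumption. It does not use the RidgeRAdapter implementation or the Copper ETF data, so it is not expected to reproduce the charts. LambdaSweep and its methods are hypothetical names.

```scala
object LambdaSweep {
  // One-feature ridge without intercept: w = sum(x*y) / (sum(x*x) + lambda)
  def weight(x: Seq[Double], y: Seq[Double], lambda: Double): Double = {
    val sxy = x.zip(y).map { case (a, b) => a * b }.sum
    val sxx = x.map(a => a * a).sum
    sxy / (sxx + lambda)
  }

  // Residual sum of squares on the training set for a given weight
  def rss(x: Seq[Double], y: Seq[Double], w: Double): Double =
    x.zip(y).map { case (a, b) => val r = b - w * a; r * r }.sum

  // Pairs each lambda with the training RSS of the corresponding ridge fit
  def sweep(
      x: Seq[Double],
      y: Seq[Double],
      lambdas: Seq[Double]): Seq[(Double, Double)] =
    lambdas.map(l => (l, rss(x, y, weight(x, y, l))))
}
```

For this estimator, the weight shrinks toward zero as λ grows. It therefore moves away from the least squares optimum, and the training RSS cannot decrease.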
Let's plot the predicted price variation of the Copper ETF using the ridge regression with different values of lambda (λ):
The original price variation of the Copper ETF, Δ = price(t + 1) - price(t), is plotted as λ = 0. Let's analyze the behavior of the predictive model for different values of λ:
The logistic regression, which was briefly introduced in the Let's kick the tires section in Chapter 1, Getting Started, is the next logical regression model to be discussed. The logistic regression relies on optimization methods. Let's go through a short refresher course in optimization before diving into the logistic regression.