Chapter 6. Regression and Regularization

In the first chapter, we briefly introduced the binary logistic regression (the binomial logistic regression for a single variable) as our first test case. The purpose was to illustrate the concept of discriminative classification. There are many more regression models, starting with the ubiquitous ordinary least squares linear regression and the logistic regression [6:1].

The purpose of regression is to minimize a loss function, with the residual sum of squares (RSS) being one that is commonly used. The problem of overfitting described in the Overfitting section under Assessing a model in Chapter 2, Hello World!, can be addressed by adding a penalty term to the loss function. The penalty term is an element of the larger concept of regularization.

The first section of this chapter will describe and implement the linear least-squares regression. The second section will introduce the concept of regularization with an implementation of the ridge regression. Finally, the logistic regression will be revisited in detail from the perspective of a classification model.

Linear regression

Linear regression is by far the most widely used, or at least the most commonly known, regression method. The terminology is usually associated with the concept of fitting a model to data and minimizing the error between the expected and predicted values by computing the sum of squared errors, the residual sum of squares, or the least-squares error.

The least squares problems fall into the following two categories:

  • Ordinary least squares
  • Nonlinear least squares

One-variate linear regression

Let's start with the simplest form of linear regression, which is the single variable regression, in order to introduce the terms and concepts behind linear regression. In its simplest interpretation, the one-variate linear regression consists of fitting a line to a set of data points {x, y}.

Note

M1: A single variable linear regression for a model f with weights wj for features xj and labels (or expected values) yj is given by the following formula:

f(x|w) = w_0 + w_1 x, \quad (w_0, w_1) = \arg\min_{w_0, w_1} \sum_{j=0}^{n-1} \left( y_j - w_0 - w_1 x_j \right)^2

Here, w1 is the slope, w0 is the intercept, f is the linear function that minimizes the RSS, and (xj, yj) is a set of n observations.

The RSS is also known as the sum of squared errors (SSE). The mean squared error (MSE) for n observations is defined as the ratio RSS/n.
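As a minimal sketch of the M1 formula and of the MSE = RSS/n ratio, the slope and intercept of a single-variable fit can also be computed in closed form; the variable names and data values below are hypothetical:

val xs = Vector(1.0, 2.0, 3.0, 4.0)   // hypothetical observations
val ys = Vector(1.1, 1.9, 3.2, 3.9)   // hypothetical expected values

// Closed-form least squares: slope = cov(x, y)/var(x), intercept = mean(y) - slope*mean(x)
val (mx, my) = (xs.sum/xs.size, ys.sum/ys.size)
val slope = xs.zip(ys).map{ case (x, y) => (x - mx)*(y - my) }.sum /
            xs.map(x => (x - mx)*(x - mx)).sum
val intercept = my - slope*mx

// Residual sum of squares and mean squared error of the fit
val rss = xs.zip(ys).map{ case (x, y) =>
  val err = y - (slope*x + intercept); err*err
}.sum
val mse = rss/xs.size   // MSE = RSS/n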

Note

Terminology

The terminology used in the scientific literature regarding regression is a bit confusing at times. Regression weights are also known as regression coefficients or regression parameters. The weights are referred to as w in formulas and the source code throughout the chapter, although β is also used in reference books.

Implementation

Let's create a SingleLinearRegression parameterized class to implement the M1 formula. The linear regression is a data transformation that uses a model implicitly derived or built from data. Therefore, the simple linear regression implements the ITransform trait, as described in the Monadic data transformation section in Chapter 2, Hello World!

The SingleLinearRegression class takes the following two arguments:

  • An xt vector of single variable observations
  • A vector of expected values or labels (line 1)

The code will be as follows:

class SingleLinearRegression[T <: AnyVal](
    xt: XSeries[T], 
    expected: Vector[T])(implicit f: T => Double)
  extends ITransform[T](xt) with Monitor[Double] {   //1
  type V = Double  //2

  val model: Option[DblPair] = train //3
  def train: Option[DblPair]
  override def |> : PartialFunction[T, Try[V]]
  def slope: Option[Double] = model.map(_._1)
  def intercept: Option[Double] = model.map(_._2)
}

The Monitor trait is used to collect the profiling information during training (refer to the Monitor section under Utility classes in the Appendix A, Basic Concepts).

The class has to define the type of the output of the |> prediction method, which is a Double (line 2).

Note

Model instantiation

The model parameters are computed through training and the model is instantiated regardless of whether the model is actually validated. A commercial application requires the model to be validated using a methodology such as the K-fold validation, as described in the Design template for immutable classifiers section in the Appendix A, Basic Concepts.

The training generates the model defined as the regression weights (slope and intercept) (line 3). The model is set as None if an exception is thrown during training:

def train: Option[DblPair] = {
   val regr = new SimpleRegression(true) //4
   regr.addData(zipToSeries(xt, expected).toArray)  //5
   Some((regr.getSlope, regr.getIntercept))  //6
}

The regression weights or coefficients, that is the model tuple, are computed using the SimpleRegression class from the stats.regression package of the Apache Commons Math library with the true argument to trigger the computation of the intercept (line 4). The input time series and the labels (or expected values) are zipped to generate an array of (input, expected) pairs (line 5). The model is initialized with the slope and intercept computed during the training (line 6).

The zipToSeries method of the XTSeries object is described in the Time series in Scala section in Chapter 3, Data Preprocessing.
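For reference, the underlying Apache Commons Math class can also be used directly; this is a standalone sketch with hypothetical data values, not the SingleLinearRegression wrapper:

import org.apache.commons.math3.stat.regression.SimpleRegression

val regr = new SimpleRegression(true)   // true: compute the intercept
regr.addData(Array(Array(1.0, 1.1), Array(2.0, 1.9), Array(3.0, 3.2)))
val model = (regr.getSlope, regr.getIntercept)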

Note

private versus private[this]

A private value or variable can be accessed by any instance of the same class. A value declared private[this] can be accessed only by the this instance. For example, the model value can be accessed only by the this instance of SingleLinearRegression.
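A minimal illustration of the difference; the Counter class is hypothetical:

class Counter {
  private val shared = 0        // any Counter instance can read other.shared
  private[this] val local = 0   // visible only within this instance

  def ok(other: Counter): Int = other.shared     // compiles
  // def ko(other: Counter): Int = other.local   // does not compile
}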

Test case

For our first test case, we compute the single variate linear regression of the price of the Copper ETF (ticker symbol: CU) over a period of 6 months (January 1, 2013 to June 30, 2013):

val path = "resources/data/chap6/CU.csv"
for {
  price <- DataSource(path, false, true, 1) get adjClose //7
  days <- Try(Vector.tabulate(price.size)(_.toDouble)) //8
  linRegr <- SingleLinearRegression[Double](days, price) //9
} yield {
  if( linRegr.isModel ) {
    val slope = linRegr.slope.get
    val intercept = linRegr.intercept.get
    val error = mse(days, price, slope, intercept)//10
  }
  …
}

The daily closing price of the ETF CU is extracted from a CSV file (line 7) as the expected values using a DataSource instance, as described in the Data extraction and Data sources section in the Appendix A, Basic Concepts. The x values days are automatically generated as a linear function (line 8). The expected values (price) and sessions (days) are the inputs to the instantiation of the simple linear regression (line 9).

Once the model is created successfully, the test code computes the mse mean squared error of the predicted and expected values (line 10):

def mse(
    xt: DblVector, 
    expected: DblVector, 
    slope: Double, 
    intercept: Double): Double = {
  val predicted = xt.map( slope*_ + intercept)
  XTSeries.mse(predicted, expected)  //11
}

The mean squared error is computed using the mse method of XTSeries (line 11). The original stock price and the linear regression line are plotted on the following chart:

The original stock price and the least squares regression line for the Copper ETF

The total least squares error is 0.926.

Although the single variable linear regression is convenient, it is limited to a scalar time series. Let's consider the case of multiple variables.

Ordinary least squares regression

The ordinary least squares regression computes the parameters w of a linear function y = f(x0, x1, …, xd) by minimizing the residual sum of squares. The optimization problem is solved by performing vector and matrix operations (transposition, inversion, and substitution).

Note

M2: The minimization of the loss function is given by the following formula:

\hat{w} = \arg\min_{w} \sum_{i=0}^{n-1} \left( y_i - f(x_i|w) \right)^2, \quad f(x|w) = w_0 + \sum_{j=1}^{d} w_j x_j

Here, wj are the weights (or parameters) of the regression, (xi, yi) i:0, n-1 are the n observations of a vector x and an expected output value y, and f is the linear multivariate function, y = f(x0, x1, …, xd).

There are several methodologies to minimize the residual sum of squares for a linear regression:

  • Resolution of the set of n equations with d variables (weights) using the QR decomposition of the n by d matrix, representing the time series of n observations of a vector of d dimensions with n >= d [6:2]
  • Singular value decomposition on the observations-features matrix, in the case where the dimension d exceeds the number of observations n [6:3]
  • Gradient descent [6:4]
  • Stochastic gradient descent [6:5]

An overview of these matrix decompositions and optimization techniques can be found in the Linear algebra and Summary of optimization techniques sections in the Appendix A, Basic Concepts.

The QR decomposition generates the smallest relative error MSE for the most common least squares problem. The technique is used in our implementation of the least squares regression.
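As an illustrative sketch, and not this chapter's implementation (which delegates to the Apache Commons Math OLSMultipleLinearRegression class), the least squares weights can be extracted directly from a QR decomposition of the design matrix; the 4-by-3 matrix below, with a leading column of 1s for the intercept, is hypothetical:

import org.apache.commons.math3.linear.{Array2DRowRealMatrix, ArrayRealVector, QRDecomposition}

val x = Array(
  Array(1.0, 0.5, 1.2), Array(1.0, 0.7, 0.9),
  Array(1.0, 1.1, 1.4), Array(1.0, 0.3, 0.8))
val y = Array(1.0, 1.3, 2.1, 0.7)

// Solve the least squares problem X.w = y through the QR decomposition of X
val qr = new QRDecomposition(new Array2DRowRealMatrix(x, false))
val weights = qr.getSolver.solve(new ArrayRealVector(y, false)).toArray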

Design

The implementation of the least squares regression leverages the Apache Commons Math library implementation of the ordinary least squares regression [6:6].

This chapter describes several types of regression algorithms. It makes sense to define a generic Regression trait that defines the key components of a regression algorithm:

  • A model of the RegressionModel type (line 1)
  • Two methods to access the components of the regression model: weights and rss (lines 2 and 3)
  • A train polymorphic method that implements the training of this specific regression algorithm (line 4)
  • A training protected method that wraps train into a Try monad

The code will be as follows:

trait Regression {
  val model: Option[RegressionModel] = training //1
   
  def weights: Option[DblArray] = model.map( _.weights)//2
  def rss: Option[Double] = model.map(_.rss) //3
  def isModel: Boolean = model != None
     
  protected def train: RegressionModel  //4
  def training: Option[RegressionModel] = Try(train) match {
    case Success(_model) => Some(_model)
    case Failure(e) => e match {
      case err: MatchError => { … }
      case _ => { … }
    }
  }
}

The model is simply defined by its weights and its residual sum of squares (line 5):

case class RegressionModel(  //5
   val weights: DblArray, val rss: Double) extends Model

object RegressionModel {
  def dot[T <: AnyVal](x: Array[T], 
     w: DblArray)(implicit f: T => Double): Double = 
     x.zip(w.drop(1))  
      .map{ case(_x, _w) => _x*_w}.sum + w.head //6
}

The RegressionModel companion object implements the computation of the dot product (inner product) of the regression weights and an observation x; the first weight is the intercept (line 6). The dot method is used throughout the chapter.
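A minimal usage sketch with hypothetical weights and a hypothetical observation, assuming the RegressionModel companion defined above:

val weights = Array(0.5, 1.2, -0.3)   // w0 (intercept), w1, w2
val x = Array(2.0, 4.0)
val y = RegressionModel.dot(x, weights)   // 0.5 + 1.2*2.0 - 0.3*4.0 = 1.7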

Let's create a MultiLinearRegression class as a data transformation whose model is implicitly derived from the input data (training set), as described in the Monadic data transformation section in Chapter 2, Hello World!:

class MultiLinearRegression[T <: AnyVal](
     xt: XVSeries[T], 
     expected: DblVector)(implicit f: T => Double)
  extends ITransform[Array[T]](xt) with Regression 
       with Monitor[Double] { //7
  type V = Double  //8

  override protected def train: RegressionModel //9
  override def |> : PartialFunction[Array[T], Try[V]] //10
}

The MultiLinearRegression class takes two arguments: the multidimensional time series of the xt observations and the vector of expected values (line 7). The class implements the ITransform trait and needs to define the type of the output value for the prediction or regression, V as a Double (line 8). The constructor for MultiLinearRegression creates the model through training (line 9). The ITransform trait's |> method implements the runtime prediction for the multilinear regression (line 10).

The Monitor trait is used to collect the profiling information during training (refer to the Monitor section under Utility classes in the Appendix A, Basic Concepts).

Note

The regression model

The RSS is included in the model because it provides the client code with the important information regarding the accuracy of the underlying technique used to minimize the loss function.

The relationship between the different components of the multilinear regression is described in the following UML class diagram:


The UML class diagram for the multilinear (OLS) regression

The UML diagram omits the helper traits and classes such as Monitor or the Apache Commons Math components.

Implementation

The training is performed during the instantiation of the MultiLinearRegression class (refer to the Design template for immutable classifiers section in the Appendix A, Basic Concepts):

def train: RegressionModel = {
  val olsMlr = new MultiLinearRAdapter   //11
  olsMlr.createModel(expected, data)  //12
  RegressionModel(olsMlr.weights, olsMlr.rss) //13
}

The functionality of the ordinary least squares regression in the Apache Commons Math library is accessed through an olsMlr reference to the MultiLinearRAdapter adapter class (line 11).

The train method creates the model by invoking the OLSMultipleLinearRegression Apache Commons Math class (line 12) and returns the regression model (line 13). The various methods of the class are accessed through the MultiLinearRAdapter adapter class:

class MultiLinearRAdapter extends OLSMultipleLinearRegression {
  def createModel(y: DblVector, x: Vector[DblArray]): Unit = 
     super.newSampleData(y.toArray, x.toArray)
  
  def weights: DblArray = estimateRegressionParameters
  def rss: Double = calculateResidualSumOfSquares
}

The createModel, weights, and rss methods route the request to the corresponding methods in OLSMultipleLinearRegression.

The Try {} Scala exception-handling monad wraps the invocation of the train method in order to catch the different types of exceptions thrown by the Apache Commons Math library, such as MathIllegalArgumentException, MathRuntimeException, or OutOfRangeException.

Note

Exception handling

Wrapping the invocation of third-party methods in a Try {} Scala exception handler matters for a couple of reasons (a minimal sketch follows the list):

  • It makes debugging easier by segregating your code from the third party
  • It allows your code to recover from the exception by reexecuting the same function with an alternative third-party library method, whenever possible
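A minimal sketch of the pattern; the thirdPartyFit and fallbackFit functions are hypothetical stubs:

import scala.util.{Try, Success, Failure}

def thirdPartyFit(x: Array[Double], y: Array[Double]): Array[Double] = ???
def fallbackFit(x: Array[Double], y: Array[Double]): Array[Double] = ???

def safeFit(x: Array[Double], y: Array[Double]): Option[Array[Double]] =
  Try( thirdPartyFit(x, y) ) match {   // segregate the third-party failure
    case Success(weights) => Some(weights)
    case Failure(_) => Try( fallbackFit(x, y) ).toOption   // recover if possible
  }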

The predictive algorithm for the ordinary least squares regression is implemented by the |> data transformation. The method predicts the output value, given a model and an input value, x:

def |> : PartialFunction[Array[T], Try[V]] = {
  case x: Array[T] if isModel && 
       x.length == model.get.size-1  
         => Try( dot(x, model.get.weights) ) //14
}

The predictive value is computed using the dot method defined in the RegressionModel singleton, which was introduced earlier in this section (line 14).
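A minimal usage sketch, assuming a regression instance trained on two features as in the preceding listings:

import scala.util.Try

val pfnRegr = regression |>   // partial prediction function
val x = Array(0.5, 1.2)       // hypothetical observation
val predicted: Option[Try[Double]] =
  if( pfnRegr.isDefinedAt(x) ) Some(pfnRegr(x)) else None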

Test case 1 – trending

Trending consists of extracting the long-term movement in a time series. Trend lines are detected using a multivariate least squares regression. The objective of this first test is to evaluate the filtering capability of the ordinary least squares regression.

The regression is performed on the relative price variation of the Copper ETF (ticker symbol: CU). The selected features are volatility and volume, and the label or target variable is the price change between two consecutive trading sessions.

The naming convention for the trading data and metrics is described in the Trading data section under Technical analysis in the Appendix A, Basic Concepts.

The volume, volatility, and price variation for CU between January 1, 2013 and June 30, 2013 are plotted on the following chart:


The chart for price variation, volatility, and trading volume for Copper ETF

Let's write the client code to compute the multivariate linear regression, price change = w0 + volatility.w1 + volume.w2:

import  YahooFinancials._
val path = "resources/data/chap6/CU.csv"  //15
val src = DataSource(path, true, true, 1)  //16

for {
  price <- src.get(adjClose)  //17
  volatility <- src.get(volatility)  //18
  volume <- src.get(volume)  //19
  (features, expected) <- differentialData(volatility, 
                     volume, price, diffDouble)  //20
  regression <- MultiLinearRegression[Double]( 
                  features, expected)  //21
} yield {
  if( regression.isModel ) {
    val trend = features.map(dot(_,regression.weights.get))
    display(expected, trend)  //22
  }
}

Let's take a look at the steps required for the execution of the test: it consists of collecting data, extracting the features and expected values, and training the multilinear regression model:

  1. Locate the CSV formatted data source file (line 15).
  2. Create a data source extractor, DataSource, for the trading session closing price, the session volatility, and the session volume for the ETF CU (line 16).
  3. Extract the price of the ETF (line 17), its volatility within a trading session (line 18), and the trading volume during the session (line 19) using the DataSource transform.
  4. Generate the labeled data as a pair of features (relative volatility and relative volume for the ETF) and expected outcome (0, 1) for training the model for which 1 represents the increase in the price and 0 represents the decrease in the price (line 20). The differentialData generic method of the XTSeries singleton is described in the Time series in Scala section in Chapter 3, Data Preprocessing.
  5. The multilinear regression is instantiated using the features set and the expected change in the daily ETF price (line 21).
  6. Display the expected and trending values using JFreeChart (line 22).

The time series of expected values and the data predicted by the regression are plotted on the following chart:


The price variation and the least squares regression for the Copper ETF according to volatility and volume

The least squares regression model is defined by the linear function for the estimation of price variation as follows:

price(t+1) - price(t) = -0.01 + 0.014.volatility - 0.0042.volume

The estimated price change (the dotted line in the preceding chart) represents the long-term trend from which the noise is filtered out. In other words, the least squares regression operates as a simple low-pass filter as an alternative to some of the filtering techniques such as the discrete Fourier transform or the Kalman filter, as described in Chapter 3, Data Preprocessing [6:7].

Although trend detection is an interesting application of the least squares regression, the method has limited filtering capabilities for time series [6:8]:

  • It is sensitive to outliers
  • The first and last few observations need to be discarded
  • As a deterministic method, it does not support noise analysis (distribution, frequencies, and so on)

Test case 2 – feature selection

The second test case is related to feature selection. The objective is to discover which subset of initial features generates the most accurate regression model, that is, the model with the smallest residual sum of squares on the training set.

Let's consider an initial set of D features {xi}. The objective is to estimate the subset of features {xid} that are most relevant to the set of observations using a least squares regression. Each subset of features is associated with a fj(x|wj) model:


A diagram for the features set selection

The ordinary least squares regression is used to select the model parameters w in the case where the feature set is small; performing a regression on each subset of a large original feature set is not practical.
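A brute-force sketch of this exhaustive selection for a small feature set, assuming the MultiLinearRegression class and the DblArray and DblVector type aliases introduced earlier; the bestSubset helper is hypothetical:

// Evaluate every non-empty subset of feature columns and keep the smallest RSS
def bestSubset(features: Vector[DblArray], expected: DblVector): (Seq[Int], Double) = {
  val d = features.head.length
  (1 until (1 << d)).map( mask => {
    val cols = (0 until d).filter( i => (mask & (1 << i)) != 0 )
    val subset = features.map( row => cols.map(row(_)).toArray )
    val rss = MultiLinearRegression[Double](subset, expected)
                .rss.getOrElse(Double.MaxValue)
    (cols, rss)
  }).minBy( _._2 )
}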

Note

M3: The features selection can be expressed mathematically as follows:

\hat{w}_j = \arg\min_{w_j} \sum_{i=0}^{n-1} \left( y_i - f_j(x_i|w_j) \right)^2 \quad \text{for each candidate model } f_j

Here, wjk are the weights of the regression for the function/model fj, (xi, yi) i:0, n-1 are the n observations of a vector x and an expected output value y, and fj is the linear multivariate function, y = fj(x0, x1, …, xd).

Let's consider the following four financial time series over the period from January 1, 2009 to December 31, 2013:

  • The exchange rate of Chinese Yuan to US Dollar
  • The S&P 500 index
  • The spot price of gold
  • The 10-year Treasury bond price

The problem is to estimate which combination of the S&P 500 index, gold price, and 10-year Treasury bond price variables is the most correlated to the exchange rate of the Yuan. For practical reasons, we use the exchange-traded fund CNY as the proxy for the Yuan/US dollar exchange rate (similarly, SPY, GLD, and TLT for the S&P 500 index, spot price of gold, and 10-year Treasury bond price, respectively).

Note

Automation of features extraction

The code in this section implements an ad hoc extraction of features with an arbitrary fixed set of models. The process can be easily automated with an optimizer (the gradient descent, genetic algorithm, and so on) using 1/RSS as the objective function.

The number of models to evaluate is relatively small, so an ad hoc approach to compute the RSS for each combination is acceptable. Let's take a look at the following graph:


The graph of the Chinese Yuan exchange rate, gold price, 10-year Treasury bond price, and S&P 500 index

The getRss method implements the computation of the RSS value given a set of xt observations, y expected (smoothed) values, and featureLabels labels for features and then returns a textual result:

def getRss(
     xt: Vector[DblArray], 
     expected: DblVector, 
     featureLabels: Array[String]): String = {

  val regression = 
         MultiLinearRegression[Double](xt, expected) //23
  val modelStr = regression.weights.get.view
        .zipWithIndex.map{ case (w, n) => {
    val weights_str = format(w, emptyString, SHORT)
    if( n == 0 ) s"${featureLabels(n)} = $weights_str"
    else s"${weights_str}.${featureLabels(n)}"
  }}.mkString(" + ")
  s"model: $modelStr\nRSS = ${regression.rss.get}" //24
}

The getRss method merely trains the model by instantiating the multilinear regression class (line 23). Once the regression model is trained during the instantiation of the MultiLinearRegression class, the regression weights and the RSS value are converted into a string (line 24). The getRss method is invoked for each combination of the GLD, SPY, and TLT ETF variables against the CNY label.

Let's take a look at the following test code:

val SMOOTHING_PERIOD: Int = 16  //25
val path = "resources/data/chap6/"
val symbols = Array[String]("CNY", "GLD", "SPY", "TLT") //26
val movAvg = SimpleMovingAverage[Double](SMOOTHING_PERIOD) //27

for {
  pfnMovAve <- Try(movAvg |>)  //28
  smoothed <- filter(pfnMovAve)  //29
  models <- createModels(smoothed)  //30
  rsses <- Try(getModelsRss(models, smoothed)) //31
  (mses, tss) <- totalSquaresError(models,smoothed.head) //32
} yield {
   s"""${rsses.mkString("
")}
${mses.mkString("
")}
      | 
Residual error= $tss".stripMargin
}

The dataset is large (1,260 trading sessions) and noisy enough to warrant filtering using a simple moving average with a period of 16 trading sessions (line 25). The purpose of the test is to evaluate the possible correlation between the four ETFs: CNY, GLD, SPY, and TLT (line 26). The execution test instantiates the simple moving average (line 27), as described in The simple moving average section in Chapter 3, Data Preprocessing.
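For intuition, smoothing with a plain trailing moving average can be sketched as follows; this is not the SimpleMovingAverage class from Chapter 3, just a minimal stand-in:

// Trailing moving average over a fixed period; the output is shorter than
// the input by period - 1 observations
def smooth(xs: DblVector, period: Int): DblVector =
  xs.sliding(period).map( _.sum/period ).toVector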

The workflow executes the following steps:

  1. Instantiate a simple moving average pfnMovAve partial function (line 28).
  2. Generate smoothed historical prices for the CNY, GLD, SPY, and TLT ETFs using the filter function (line 29):
    type PFNMOVAVE = PartialFunction[DblVector, Try[DblVector]]
    
    def filter(pfnMovAve: PFNMOVAVE): Try[Array[DblVector]] = Try {
       symbols.map(s => DataSource(s"$path$s.csv", true, true, 1))
          .map( _.get(adjClose).get )      // extract the adjusted closing prices
          .map( pfnMovAve(_) ).map(_.get)  // apply the smoothing partial function
    }
  3. Generate the list of features for each model using the createModels method (line 30):
    type Models = List[(Array[String], DblMatrix)]
    
    def createModels(smoothed: Array[DblVector]): Try[Models] = 
    Try {
      val features = smoothed.drop(1).map(_.toArray)  //33
      List[(Array[String], DblMatrix)](   //34
       (Array[String]("CNY","SPY","GLD","TLT"), features.transpose),
       (Array[String]("CNY","GLD","TLT"),features.drop(1).transpose),
       (Array[String]("CNY","SPY","GLD"),features.take(2).transpose),
       (Array[String]("CNY","SPY","TLT"), features.zipWithIndex
                          .filter( _._2 != 1).map( _._1).transpose),
       (Array[String]("CNY","GLD"), features.slice(1,2).transpose)
       )
    }

    The smoothed values for CNY are used as the expected values. Therefore, they are removed from the features list (line 33). The five models are evaluated by adding or removing elements from the features list (line 34).

  4. Next, the workflow computes the residual sum of squares for all the models using getModelsRss (line 31). The method invokes getRss, which was introduced earlier in this section, for each model (line 35):
    def getModelsRss(
        models: Models, 
        y: Array[DblVector]): List[String] = 
      models.map{ case (labels, m) => 
             s"${getRss(m.toVector, y.head, labels)}" }  //35
    
  5. Finally, the last step of the workflow consists of computing the mses mean squared errors for each model and the total squared error (line 32):
    def totalSquaresError(
        models: Models, 
        expected: DblVector): Try[(List[String], Double)] = Try {
     
      val errors = models.map{ case (labels, m) => rssSum(m, expected)._1 } //36
      val mses = models.zip(errors)
               .map{case(f,e) => s"MSE: ${f._1.mkString(" ")} = $e"}
      (mses, Math.sqrt(errors.sum)/models.size)  //37
    }

The totalSquaresError method computes the error for each model by invoking rssSum (line 36). The method returns a pair of the list of mean squared errors (one per model) and the total squared error (line 37).

The RSS does not always provide an accurate picture of the fitness of the regression model. The fitness of a regression model is commonly assessed using the r2 statistic, a value that indicates how well the data fits the statistical model.

Note

M4: The RSS and r2 statistics are defined by the following formulae:

RSS = \sum_{i=0}^{n-1} \left( y_i - f(x_i|w) \right)^2, \qquad r^2 = 1 - \frac{RSS}{\sum_{i=0}^{n-1} \left( y_i - \bar{y} \right)^2}

The implementation of the computation of the r2 statistics is simple. For each model fj, the rssSum method computes the tuple (rss and least squares errors), as defined in the M4 formula:

def rssSum(xt: DblMatrix, expected: DblVector): DblPair = {
  val regression = 
         MultiLinearRegression[Double](xt, expected) //38
  val pfnRegr = regression |>  //39
  val results = sse(expected.toArray, xt.map(pfnRegr(_).get))
  (regression.rss.get, results) //40
}
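As a complement, a minimal sketch of the r2 statistic from the M4 formula; the rSquared helper is hypothetical:

def rSquared(rss: Double, expected: DblVector): Double = {
  val mean = expected.sum/expected.size
  val tss = expected.map( y => (y - mean)*(y - mean) ).sum   // total sum of squares
  1.0 - rss/tss
}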

The rssSum method instantiates the MultiLinearRegression class (line 38), retrieves the RSS value, and then validates the pfnRegr regression model (line 39) against the expected values (line 40).


The output results clearly show that the three variable regression, CNY = f (SPY, GLD, TLT), is the most accurate or fittest model for the CNY time series, followed by CNY = f (SPY, TLT). Therefore, the feature selection process generates the features set, {SPY, GLD, TLT}.

Let's plot the model against the raw data:


Ordinary least squares regression on the Chinese Yuan ETF (CNY)

The regression model smoothed the original CNY time series, weeding out all but the most significant price variations. The graph plotting the r2 value for each of the models confirms that the three-feature model CNY = f(SPY, GLD, TLT) is the most accurate.


Note

The general linear regression

The concept of a linear regression is not restricted to polynomial fitting models such as y = w0 + w1.x + w2.x² + … + wn.xⁿ. Regression models can also be defined as a linear combination of basis functions ϕj: y = w0 + w1.ϕ1(x) + w2.ϕ2(x) + … + wn.ϕn(x) [6:9].
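A minimal sketch of this generalization, reusing the MultiLinearRegression class from this chapter; the basis functions and data values are hypothetical:

// Expand a scalar input into basis-function features, then apply the
// ordinary least squares regression to the expanded observations
val basis: Vector[Double => Double] = Vector(x => x, x => x*x, math.sin)
def expand(x: Double): Array[Double] = basis.map( _(x) ).toArray

val xs: DblVector = Vector(0.1, 0.5, 1.0, 1.5)         // hypothetical inputs
val expected: DblVector = Vector(0.2, 0.9, 1.8, 2.4)   // hypothetical labels
val features = xs.map(expand)
val regression = MultiLinearRegression[Double](features, expected)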
