Dimension reduction

Without prior knowledge of the problem domain, data scientists include all possible features in their first attempt to create a classification, prediction, or regression model. After all, making assumptions is a poor and dangerous way to reduce the search space. It is not uncommon for a model to use hundreds of features, which adds complexity and significant computational cost to building and validating the model.

Noise filtering techniques reduce the sensitivity of the model to features that are associated with sporadic behavior. However, these noise-related features are not known prior to the training phase, and therefore, cannot be completely discarded. As a consequence, training of the model becomes a very cumbersome and time-consuming task.

Overfitting is another hurdle that can arise from a large feature set. A training set of limited size does not allow you to create an accurate model with a large number of features.

Dimension reduction techniques alleviate these problems by detecting features that have little influence on the overall model behavior.

There are three approaches to reduce the number of features in a model:

  • Statistical analysis solutions such as ANOVA for smaller feature sets
  • Regularization and shrinking techniques, which are introduced in the Regularization section in Chapter 6, Regression and Regularization
  • Algorithms that maximize the variance of the dataset by transforming the covariance matrix

The next section introduces one of the most commonly used algorithms of the third category: principal component analysis.

Principal components analysis

The purpose of principal components analysis is to transform the original set of features into a new set of features ordered by decreasing variance. The original observations are transformed into a set of variables with a lower degree of correlation. Let's consider a model with two features, {x, y}, and a set of observations, {xi, yi}, plotted on the following chart:

Visualization of principal components analysis for a two-dimensional model

The x and y features are converted into two variables, X and Y (that is, a rotation), to better match the distribution of observations. The variable with the highest variance is known as the first principal component. The variable with the nth highest variance is known as the nth principal component.

Algorithm

I highly recommend that you read the tutorial from Lindsay Smith [4:13] that describes the PCA algorithm in a very concrete and simple way using a two-dimensional model.

Note

PCA and covariance matrix

M11: The covariance of two features, x and y, given a set of observations {xi, yi} and their respective mean values, is defined as:

cov(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

Here, \bar{x} and \bar{y} are the respective mean values of the observations x and y.

M12: The covariance is computed from the Z-score of each observation:

cov(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} z_{x,i}\, z_{y,i} \quad \text{with} \quad z_{x,i} = \frac{x_i - \bar{x}}{\sigma_x},\;\; z_{y,i} = \frac{y_i - \bar{y}}{\sigma_y}

M13: For a model with n features, xi, the covariance matrix is defined as:

\Sigma = [\Sigma_{ij}]_{n \times n} \quad \text{with} \quad \Sigma_{ij} = cov(x_i, x_j),\;\; i, j = 1, \dots, n

M14: The transformation of x to X consists of computing the eigenvalues of the covariance matrix:

\Sigma\, v_j = \lambda_j v_j,\;\; j = 1, \dots, n \quad \Rightarrow \quad X = x\,V \quad \text{with} \quad V = [v_1, \dots, v_n]

M15: The eigenvalues are ranked in decreasing order of variance. Finally, the top m eigenvalues for which the cumulative variance exceeds a predefined threshold \alpha (a percentage of the trace of the matrix) are the principal components:

\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \quad \text{and} \quad \frac{\sum_{j=1}^{m} \lambda_j}{tr(\Sigma')} \ge \alpha
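
As a point of reference, here is a minimal, self-contained Scala sketch of formulas M11 to M13, computing the covariance matrix from the Z-scores of the observations. The function name covarianceMatrix is illustrative and is not part of the library described in this chapter:

def covarianceMatrix(obs: Array[Array[Double]]): Array[Array[Double]] = {
  val m = obs.length        // number of observations
  val n = obs.head.length   // number of features
  val means = Array.tabulate(n)(j => obs.map(_(j)).sum / m)
  val stdDevs = Array.tabulate(n)(j =>
    math.sqrt(obs.map(x => (x(j) - means(j)) * (x(j) - means(j))).sum / (m - 1)))
  // Z-score of each observation (formula M12)
  val z = obs.map(x => Array.tabulate(n)(j => (x(j) - means(j)) / stdDevs(j)))
  // Covariance of the features i and j (formulas M11 and M13)
  Array.tabulate(n, n)((i, j) => z.map(zk => zk(i) * zk(j)).sum / (m - 1))
}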

The algorithm is implemented in five steps:

  1. Compute the Z-score for the observations by standardizing the mean and standard deviation.
  2. Compute the covariance matrix Σ for the original set of observations.
  3. Compute the new covariance matrix Σ' for the observations with the transformed features by extracting the eigenvalues and eigenvectors.
  4. Rank the eigenvalues in decreasing order of variance; the ordered eigenvalues are the principal components.
  5. Select the principal components for which the cumulative variance exceeds a predefined threshold, expressed as a percentage of the trace of the new covariance matrix.
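
The following sketch illustrates steps 3 to 5 with the Apache Commons Math eigendecomposition, assuming the covariance matrix has already been computed (for instance, with the covarianceMatrix sketch shown earlier). The helper name numPrincipalComponents is illustrative:

import org.apache.commons.math3.linear.{Array2DRowRealMatrix, EigenDecomposition}

def numPrincipalComponents(cov: Array[Array[Double]], threshold: Double): Int = {
  // Extract the eigenvalues of the covariance matrix (formula M14)
  val eigenValues = new EigenDecomposition(new Array2DRowRealMatrix(cov))
    .getRealEigenvalues
    .sortWith(_ > _)              // rank in decreasing order of variance
  val trace = eigenValues.sum     // trace of the transformed covariance matrix
  // Count the eigenvalues whose cumulative variance exceeds the threshold (formula M15)
  val cumulative = eigenValues.scanLeft(0.0)(_ + _).tail
  cumulative.indexWhere(_ / trace >= threshold) + 1
}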

The extraction of principal components by diagonalization of the covariance matrix Σ is visualized in the following diagram. The shades of grey used to represent the covariance values vary from white (lowest value) to black (highest value):

Visualization of the extraction of eigenvalues in PCA

The eigenvalues (the variances of X) are ranked in decreasing order of their values. The PCA algorithm succeeds when the cumulative value of the last eigenvalues (the bottom-right section of the diagonal matrix) becomes insignificant.

Implementation

The principal components analysis can be easily implemented using the Apache Commons Math library methods that compute the eigenvalues and eigenvectors. The PCA class is defined as a data transformation of the ITransform type, as described in the Monadic data transformation section in Chapter 2, Hello World!

The PCA class has a single argument: the xt training set (line 1). The output type V is a Double, the projection of an observation onto the eigenvectors (line 2). The constructor defines the z-score norm function (line 3):

class PCA[@specialized(Double) T <: AnyVal](
    xt: XVSeries[T])(implicit f: T => Double) 
  extends ITransform[Array[T]](xt) with Monitor[T] { //1
  type V = Double    //2
  
  val norm = (xv: XVSeries[T]) =>  zScores(xv) //3
  val model: Option[PCAModel] = train //4
  override def |> : PartialFunction[U, Try[V]]
}

The model for the PCA algorithm is defined by the PCAModel case class (line 4).

Note

Triggering an implicit conversion

Implicit conversions can be invoked through an assignment to a fully declared variable. For example, the conversion from XVSeries[T] to XVSeries[Double] is invoked by declaring the type of the target variable: val z: XVSeries[Double] = xv (line 4).

The model for the PCA algorithm, PCAModel, consists of the covariance matrix, covariance, defined in formula M11, and the array of eigenvalues computed in formula M14:

case class PCAModel(val covariance: DblMatrix, 
   val eigenvalues: DblArray)

The train method implements the computation of the principal components (that is, the eigenvectors and eigenvalues):

def train: Option[PCAModel] = zScores(xt).map(x => { //5
  val obs: DblMatrix = x.toArray
  val cov = new Covariance(obs).getCovarianceMatrix //6

  val transform = new EigenDecomposition(cov) //7
  val eigenVectors = transform.getV  //8
  val eigenValues =
    new ArrayRealVector(transform.getRealEigenvalues)

  val covariance = obs.multiply(eigenVectors).getData  //9
  PCAModel(covariance, eigenValues.toArray)   //10
}) match {/* … */}

The normalization function zScores applies the Z-score transformation (formula M12) (line 5). Next, the method computes the covariance matrix from the normalized data (line 6). The eigendecomposition is computed (line 7), and the eigenvectors, eigenVectors, are retrieved using the getV method of the Apache Commons Math EigenDecomposition class (line 8). The method computes the diagonal, transformed covariance matrix from the eigenvectors (line 9). Finally, the data transformation returns an instance of the PCA model (line 10).

The |> predictive method consists of projecting an observation onto the principal components:

override def |> : PartialFunction[Array[T], Try[V]] = {
  case x: Array[T] 
     if(isModel && x.length == dimension(xt)) => 
        Try( inner(x, model.get.eigenvalues) )
}

The inner method of the XTSeries object computes the dot product of the observation x and the model's eigenvalues.
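
For reference, a minimal sketch of such a dot product helper might look as follows; the actual inner method in the XTSeries object may have a different signature:

def inner[T](x: Array[T], w: Array[Double])(implicit f: T => Double): Double =
  x.zip(w).map { case (xi, wi) => f(xi) * wi }.sum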

Test case

Let's apply the PCA algorithm to extract a subset of the features that represent some of the financial metric ratios of 34 S&P 500 companies. The metrics under consideration are as follows:

  • Trailing Price-to-Earnings ratio (PE)
  • Price-to-Sales ratio (PS)
  • Price-to-Book ratio (PB)
  • Return on Equity (ROE)
  • Operating Margin (OM)

The financial metrics are described in the Terminology section under Finances 101 in Appendix A, Basic Concepts.

The input data is specified as a tuple of a ticker symbol and an array of five financial ratios (PE, PS, PB, ROE, and OM):

val data = Array[(String, DblVector)] (
  // Ticker              PE     PS     PB   ROE    OM
  ("QCOM", Array[Double](20.8, 5.32, 3.65, 17.65,29.2)),
  ("IBM",  Array[Double](13, 1.22, 12.2, 88.1,19.9)), 
   …
)

The client code that executes the PCA algorithm is defined simply as follows:

val dim = data.head._2.size
val input = data.map( _._2.take(dim)) 
val pca = new PCA[Double](input) //11
show(s"PCA model: ${pca.toString}")  //12

The PCA is instantiated with the input data (line 11) and then displayed in a textual format (line 12).

Evaluation

The first test on the 34 sets of financial ratios uses a model with five dimensions. As expected, the algorithm produces a list of five ordered eigenvalues:

2.5321, 1.0350, 0.7438, 0.5218, 0.3284

Let's plot the relative value of the eigenvalues (that is, relative importance of each feature) on the following bar chart:

Distribution of eigenvalues in PCA for 5 dimensions

The chart shows that three out of five features account for 85 percent of the total variance (trace of the transformed covariance matrix). I invite you to experiment with different combinations of these features. The selection of a subset of the existing features is as simple as applying Scala's take or drop methods:

data.map( _._2.take(dim))
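
For example, configurations limited to the first four or first three ratios can be created and processed with the same PCA class; the variable names below are illustrative:

val pca5 = new PCA[Double](data.map(_._2.take(5)))  // PE, PS, PB, ROE, OM
val pca4 = new PCA[Double](data.map(_._2.take(4)))  // PE, PS, PB, ROE
val pca3 = new PCA[Double](data.map(_._2.take(3)))  // PE, PS, PB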

Let's plot the cumulative eigenvalues for the three different model configurations:

  • Five features: PE, PS, PB, ROE, and OM
  • Four features: PE, PS, PB, and ROE
  • Three features: PE, PS, and PB

The graph will be as follows:

Distribution of eigenvalues in PCA for 3, 4, and 5 features

The chart displays the cumulative value of the eigenvalues, which are the variances of the transformed features, Xi. If we apply a threshold of 90 percent to the cumulative variance, then the number of principal components for each test model is as follows:

  • {PE, PS, PB}: 2
  • {PE, PS, PB, ROE}: 3
  • {PE, PS, PB, ROE, OM}: 3

In conclusion, for a threshold of 90 percent, the PCA algorithm reduced the dimension of the model by 33 percent for the three-feature model, 25 percent for the four-feature model, and 40 percent for the five-feature model.

Note

Cross-validation of PCA

As with any other unsupervised learning technique, the resulting principal components have to be validated through a 1-fold or K-fold cross-validation methodology using a regression estimator such as Partial Least Squares Regression (PLSR) or the Predicted Residual Error Sum of Squares (PRESS). For those who are not afraid of statistics, I recommend reading Fast Cross-validation in Robust PCA by S. Engelen and M. Hubert [4:14]. Be aware that the implementation of these regression estimators is not simple. The validation of PCA is beyond the scope of this book.

Principal components analysis is a special case of the more general factor analysis. The latter class of algorithms does not require the transformation of the covariance matrix to be orthogonal.

Non-linear models

The principal components analysis technique requires the model to be linear. Although the study of such algorithms is beyond the scope of this book, it's worth mentioning two approaches that extend PCA for non-linear models:

  • Kernel PCA
  • Manifold learning

Kernel PCA

PCA extracts a set of orthogonal linear projections of an array of correlated values, X = {xi}. The kernel PCA algorithm consists of extracting a similar set of orthogonal projections of the inner product matrix, X^T X. Nonlinearity is supported by applying a kernel function to the inner product. Kernel functions are described in the Kernel functions section in Chapter 8, Kernel Models and Support Vector Machines. Kernel PCA is an attempt to extract a low-dimension feature set (or manifold) from the original observation space. Linear PCA is the projection onto the tangent space of the manifold.
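
Although kernel PCA is not implemented in this book, the idea can be sketched with the Apache Commons Math classes used earlier. The names rbfKernel and kernelPCA below are illustrative, and the radial basis function is just one possible choice of kernel:

import org.apache.commons.math3.linear.{Array2DRowRealMatrix, EigenDecomposition, RealMatrix}

// One possible kernel: the radial basis function (RBF)
def rbfKernel(sigma: Double)(u: Array[Double], v: Array[Double]): Double = {
  val d2 = u.zip(v).map { case (a, b) => (a - b) * (a - b) }.sum
  math.exp(-d2 / (2.0 * sigma * sigma))
}

def kernelPCA(
    obs: Array[Array[Double]],
    kernel: (Array[Double], Array[Double]) => Double): (Array[Double], RealMatrix) = {
  val n = obs.length
  // Kernel (Gram) matrix of the kernel applied to each pair of observations
  val k = new Array2DRowRealMatrix(Array.tabulate(n, n)((i, j) => kernel(obs(i), obs(j))))
  // Center the kernel matrix: Kc = K - 1n.K - K.1n + 1n.K.1n, with 1n = 1/n everywhere
  val ones = new Array2DRowRealMatrix(Array.fill(n, n)(1.0 / n))
  val kc = k.subtract(ones.multiply(k)).subtract(k.multiply(ones))
    .add(ones.multiply(k).multiply(ones))
  // Eigen-decomposition of the centered kernel matrix; the eigenvalues still need
  // to be ranked in decreasing order, as in formula M15
  val eig = new EigenDecomposition(kc)
  (eig.getRealEigenvalues, eig.getV)
}

For instance, kernelPCA(obs, rbfKernel(1.0)) would return the eigenvalues and eigenvectors of the centered RBF kernel matrix for a set of observations obs.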

Manifolds

The concept of a manifold is borrowed from differential geometry. Manifolds generalize the notions of curves in a two-dimensional space or surfaces in a three-dimensional space to higher dimensions. Nonlinear models are associated with Riemann manifolds whose metric is the inner product, X^T X, on a tangent space. The manifold represents a low-dimension feature space embedded into the original observation space. The idea is to project the principal components from the linear tangent space onto a manifold using an exponential map. This feat is accomplished using a variety of techniques, from Locally Linear Embedding and density-preserving maps to Laplacian Eigenmaps [4:15].

The vector of observations cannot be used directly on a manifold because metrics such as norms or inner products depend on the location on the manifold at which the vector is applied. Computation on manifolds relies on tensors such as contravariant and covariant vectors. Tensor algebra is supported by covariant and contravariant functors, which are introduced in the Abstraction section in Chapter 1, Getting Started.

Techniques that use differentiable manifolds are known as spectral dimensionality reduction.

Note

Alternative dimension reduction techniques

Here are some more alternative techniques, listed as references: factor analysis, principal factor analysis, maximum likelihood factor analysis, independent component analysis (ICA), random projection, nonlinear ICA, Kohonen's self-organizing maps, neural networks, and multidimensional scaling, just to name a few [4:16].

Manifold learning algorithms, whether used as classifiers or dimension reduction techniques, are associated with semi-supervised learning.
