Principal Component Analysis

Principal Component Analysis (PCA) is a technique that takes a dataset with several correlated features and projects it onto a new coordinate (axis) system with fewer, uncorrelated features. These new, uncorrelated features (which I referred to earlier as super-columns) are called principal components. The principal components serve as an alternative coordinate system for the original feature space, one that requires fewer features and captures as much of the variance as possible. Referring back to our example with the cameras, the principal components are exemplified by the cameras themselves.
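To make the idea of uncorrelated features concrete, here is a minimal sketch using NumPy. The data is synthetic, and the variable names (`x1`, `x2`, `Z`) are illustrative assumptions rather than anything from our earlier examples. It re-expresses two strongly correlated features in the principal component coordinate system and compares the correlations before and after.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated synthetic features (illustrative data only)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])

# Center the data, then eigendecompose its covariance matrix.
# The eigenvectors are the directions of the principal components.
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Express every observation in the principal component coordinate system
Z = X_centered @ eigenvectors

corr_original = np.corrcoef(X, rowvar=False)[0, 1]
corr_components = np.corrcoef(Z, rowvar=False)[0, 1]
print(f"correlation between original features:    {corr_original:.3f}")
print(f"correlation between principal components: {corr_components:.3f}")
```

The original features should show a correlation close to 1, while the correlation between the two new columns should be zero up to floating-point error.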

Put another way, the goal of PCA is to identify patterns and latent structures within a dataset in order to create new columns and use them in place of the original features. Just as in feature selection, if we start with a data matrix of size n x d, where n is the number of observations and d is the number of original features, we project it onto a matrix of size n x k (where k < d).
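The change in shape from n x d to n x k can be sketched in a few lines of NumPy. The helper name `pca_project` and the random data are assumptions made for illustration. This is one common way to compute the projection (eigendecomposition of the covariance matrix), not necessarily how any particular library implements it.

```python
import numpy as np

def pca_project(X, k):
    """Project an (n, d) data matrix onto its top-k principal components."""
    # PCA is defined on centered data
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)            # (d, d)
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigenvalues ascending
    # Keep the k eigenvectors with the largest eigenvalues
    order = np.argsort(eigenvalues)[::-1]
    components = eigenvectors[:, order[:k]]           # (d, k)
    return X_centered @ components                    # (n, k)

rng = np.random.default_rng(42)
n, d, k = 100, 5, 2
X = rng.normal(size=(n, d))        # synthetic stand-in for a real dataset

X_reduced = pca_project(X, k)
print(X.shape, "->", X_reduced.shape)   # (100, 5) -> (100, 2)
```

Note that the number of rows is unchanged: every observation is kept, and only the number of columns shrinks.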

Our principal components give rise to new columns that maximize the variance in our data. This means that each column is trying to explain the shape of our data. Principal components are ordered by variance explained, so the first principal component explains the most variance in the data, the second explains the second most, and so on. The goal is to use as many components as we need to optimize the machine learning task, whether supervised or unsupervised.
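The ordering by variance explained can be inspected directly. The sketch below again assumes synthetic data, and the 95% cutoff is an arbitrary illustrative choice, not a recommendation. It computes the share of variance each component explains and picks the smallest number of components whose cumulative share reaches the cutoff.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data whose columns have very different spreads (illustrative only)
X = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 1.0, 0.1])

X_centered = X - X.mean(axis=0)

# Eigenvalues of the covariance matrix are the variances along each component.
# eigvalsh returns them in ascending order, so reverse to get largest first.
eigenvalues = np.linalg.eigvalsh(np.cov(X_centered, rowvar=False))[::-1]

explained_ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest k whose components together explain at least `threshold` of the variance
threshold = 0.95   # assumed cutoff for this example
k = int(np.searchsorted(cumulative, threshold) + 1)

for i, (ratio, total) in enumerate(zip(explained_ratio, cumulative), start=1):
    print(f"component {i}: {ratio:.3f} of variance, {total:.3f} cumulative")
print(f"components needed for {threshold:.0%} of the variance: {k}")
```

In practice, a variance cutoff like this is only a starting point. The number of components is usually tuned against the performance of the downstream machine learning task.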

Feature transformation is about transforming datasets into matrices with the same number of rows but a reduced number of features. This is similar to the goal of feature selection, but in this case we are concerned with the creation of brand-new features.

PCA is itself an unsupervised task, meaning that it does not use a response column to make the projection/transformation. This matters because the second feature transformation algorithm that we will work with is supervised: it uses the response variable to create super-columns in a different way, one that optimizes predictive tasks.
