Dimensionality reduction

Dimensionality reduction techniques help handle the complex, noisy, and high-dimensional nature of textual data. These techniques reduce the dimensionality of the textual data while still preserving its underlying statistics and, importantly, the inter-document relationships. The idea is to use the minimum number of dimensions that still captures the intrinsic dimensionality of the data.

A textual collection is most often represented as a term-document matrix, in which each entry captures the importance of a term in a document. The dimensionality of such a collection grows with the number of unique terms. The simplest possible dimensionality reduction method is to set upper and lower bounds on the frequency distribution of terms in the collection. Terms that occur with very high frequency are not informative, and terms that barely appear can safely be treated as noise and ignored.

Words that occur with high frequency but carry no particular meaning, such as is, was, then, and the, are referred to as stop words. Words that occur only once or twice are more likely to be spelling errors or rare, complicated words. Neither stop words nor such rare terms should be considered when modeling the documents in the Term Document Matrix (TDM).
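To make this concrete, here is a minimal sketch using the tm package (an assumption; any text-mining package with similar functionality would do). It builds a small term-document matrix, removes stop words, and drops the sparsest terms:

library(tm)
# Three toy documents
docs <- c("the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs can be friends")
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove high-frequency, low-information stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
tdm <- TermDocumentMatrix(corpus)
# Drop terms absent from most documents (rare terms treated as noise)
tdm_reduced <- removeSparseTerms(tdm, sparse = 0.66)
inspect(tdm_reduced)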

We will discuss a few dimensionality reduction techniques in brief and dive into their implementation using R.

Principal component analysis

Principal component analysis (PCA) reveals the internal structure of a dataset in a way that best explains the variance within the data. PCA identifies patterns in order to reduce the dimensions of the dataset without significant loss of information. The main aim of PCA is to project a high-dimensional feature space onto a smaller subspace to decrease the computational cost. PCA computes new features, called principal components, which are uncorrelated linear combinations of the original features projected in the directions of highest variability. The key step is to map the set of features into a matrix, M, and compute its eigenvalues and eigenvectors. Eigenvectors provide simple solutions to problems that can be modeled as linear transformations along axes, such as stretching, compressing, or flipping; eigenvalues give the magnitude of the eigenvectors along which such transformations occur. Eigenvectors with larger eigenvalues are selected for the new feature space because they carry more information about the data distribution than eigenvectors with smaller eigenvalues. The first principal component has the greatest possible variance, that is, the largest eigenvalue; the second principal component has the largest variance possible under the constraint that it is uncorrelated with the first. In general, the nth PC is the linear combination with maximum variance that is uncorrelated with all previous PCs.

PCA comprises the following steps:

  1. Compute the n-dimensional mean of the given dataset.
  2. Compute the covariance matrix of the features.
  3. Compute the eigenvectors and eigenvalues of the covariance matrix.
  4. Rank/sort the eigenvectors by descending eigenvalue.
  5. Choose the x eigenvectors with the largest eigenvalues to form the new feature space.

Eigenvector values represent the contribution of each variable to the principal component axis. Principal components are oriented in the direction of maximum variance in m-dimensional space.
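The following is a minimal illustration of these steps in base R on a small simulated matrix (the data and the choice of two components are purely illustrative):

set.seed(42)
m <- matrix(rnorm(200), ncol = 4)
m_centered <- scale(m, center = TRUE, scale = FALSE)  # step 1: subtract the column means
cov_m <- cov(m_centered)                              # step 2: covariance matrix
eig <- eigen(cov_m)                                   # step 3: eigenvalues and eigenvectors
# eigen() returns the eigenvalues in decreasing order (step 4)
top2 <- eig$vectors[, 1:2]                            # step 5: keep the eigenvectors with the largest eigenvalues
pcs <- m_centered %*% top2                            # project the data onto the new axes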

PCA is one of the most widely used multivariate methods for discovering meaningful, new, informative, and uncorrelated features. This methodology also reduces dimensionality by rejecting low-variance features and is useful in reducing the computational requirements for classification and regression analysis.

Using R for PCA

R also has two inbuilt functions for performing PCA: prcomp() and princomp(). Both functions expect the dataset to be organized with variables in columns and observations in rows, in a structure such as a data frame. They return the transformed data with the principal components in columns.

prcomp() and princomp() are similar functions for performing PCA, but their internal implementations differ. princomp() performs PCA using an eigendecomposition of the covariance (or correlation) matrix, whereas prcomp() uses a related technique, singular value decomposition (SVD), applied to the data matrix. SVD has slightly better numerical accuracy, so prcomp() is generally the preferred function.

Note

princomp() fails when the number of variables is larger than the number of observations.

Each function returns an object (a list) of class prcomp or princomp, respectively.

The information returned and the terminology are summarized in the following table:

prcomp()    princomp()    Explanation
sdev        sdev          Standard deviations of the principal components
rotation    loadings      The principal components (matrix of variable loadings)
center      center        The values subtracted from each column to center the data
scale       scale         The scale factors used
x           scores        The rotated data
            n.obs         The number of observations of each variable
            call          The call to the function that created the object
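As a quick sketch on simulated data, the following shows how the corresponding components of the two objects are accessed (the variable names are illustrative):

set.seed(1)
x <- matrix(rnorm(500), ncol = 5)
fit_prcomp   <- prcomp(x, scale. = TRUE)            # SVD-based PCA
fit_princomp <- princomp(x, cor = TRUE)             # eigendecomposition-based PCA
fit_prcomp$sdev;          fit_princomp$sdev         # standard deviations of the components
fit_prcomp$rotation;      fit_princomp$loadings     # the principal components
head(fit_prcomp$x);       head(fit_princomp$scores) # the rotated data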

Here's a list of the functions available in different R packages for performing PCA:

  • PCA(): FactoMineR package
  • acp(): amap package
  • prcomp(): stats package
  • princomp(): stats package
  • dudi.pca(): ade4 package
  • pcaMethods: This package from Bioconductor has various convenient methods to compute PCA

Understanding the FactoMineR package

FactoMineR is an R package that provides multiple functions for multivariate data analysis and dimensionality reduction. The functions in the package deal with both quantitative and categorical data. Apart from PCA, correspondence analysis and multiple correspondence analysis can also be performed using this package:

library(FactoMineR)
# Simulate 1,000 observations of 10 standard normal variables
data <- replicate(10, rnorm(1000))
# PCA on the first nine variables, scaled to unit variance, with graphs
result.pca <- PCA(data[, 1:9], scale.unit = TRUE, graph = TRUE)
print(result.pca)

Results for the principal component analysis (PCA).

The analysis was performed on 1,000 individuals, described by nine variables.

The results are available in the following objects:

Name                Description
$eig                Eigenvalues
$var                Results for the variables
$var$coord          Coordinates for the variables
$var$cor            Correlations between variables and dimensions
$var$cos2           cos2 for the variables
$var$contrib        Contributions of the variables
$ind                Results for the individuals
$ind$coord          Coordinates for the individuals
$ind$cos2           cos2 for the individuals
$ind$contrib        Contributions of the individuals
$call               Summary statistics
$call$centre        Mean of the variables
$call$ecart.type    Standard error of the variables
$call$row.w         Weights for the individuals
$call$col.w         Weights for the variables

The eigenvalue, percentage of variance, and cumulative percentage of variance for each component are as follows:

        eigenvalue   percentage of variance   cumulative percentage of variance
comp 1  1.1573559    12.859510                 12.85951
comp 2  1.0991481    12.212757                 25.07227
comp 3  1.0553160    11.725734                 36.79800
comp 4  1.0076069    11.195632                 47.99363
comp 5  0.9841510    10.935011                 58.92864
comp 6  0.9782554    10.869505                 69.79815
comp 7  0.9466867    10.518741                 80.31689
comp 8  0.9172075    10.191194                 90.50808
comp 9  0.8542724     9.491916                100.00000
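This table is stored in the result object and can be retrieved directly (the values will differ on each run because the data are randomly generated):

# Eigenvalues, percentage of variance, and cumulative percentage of variance
result.pca$eig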

Amap package

amap is another R package that provides tools for clustering and PCA. Its name is an acronym for Another Multidimensional Analysis Package. One of the most widely used functions in this package is acp(), which performs PCA on a data frame.

This function is akin to princomp() and prcomp(), except that it has a slightly different graphical representation.

For more intricate details, refer to the CRAN-R resource page:

https://cran.r-project.org/web/packages/amap/amap.pdf

library(amap)
acp(data, center = TRUE, reduce = TRUE)

Additionally, weight vectors can also be provided as an argument. We can perform a robust PCA by using the acpgen function in the amap package:

acpgen(data, h1, h2, center=TRUE, reduce=TRUE, kernel="gaussien")
K(u, kernel="gaussien")
W(x, h, D=NULL, kernel="gaussien")
acprob(x, h, center=TRUE, reduce=TRUE, kernel="gaussien")
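A hedged usage sketch follows; the bandwidth values h1 and h2 are purely illustrative, and plot() relies on the plot method that amap provides for acp objects:

library(amap)
# Robust PCA with a Gaussian kernel; h1 and h2 are illustrative bandwidths
robust_fit <- acpgen(data, h1 = 1, h2 = 1, center = TRUE, reduce = TRUE)
plot(robust_fit)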

Proportion of variance

We aim to construct the components and then choose the minimum number of them that explains the variance of the data with high confidence.

R has the prcomp() function in the stats package to estimate principal components. Let's learn how to use this function to estimate the proportion of variance explained by each component (the squared standard deviations are the eigenvalues):

# Fit PCA on the simulated data (prcomp() centers the variables by default)
pca_base <- prcomp(data)
print(pca_base)

The pca_base object contains the standard deviations of the components and the rotation matrix. The columns of the rotation matrix are the principal components of the data. Let's find out the proportion of variance each component explains:

pr_variance<- (pca_base$sdev^2/sum(pca_base$sdev^2))*100
pr_variance
 [1] 11.678126 11.301480 10.846161 10.482861 10.176036  9.605907  9.498072
 [8]  9.218186  8.762572  8.430598

pr_variance signifies the proportion of variance explained by each component in descending order of magnitude.

Let's calculate the cumulative proportion of variance for the components:

cumsum(pr_variance)
 [1]  11.67813  22.97961  33.82577  44.30863  54.48467  64.09057  73.58864
 [8]  82.80683  91.56940 100.00000

Components 1 to 8 together explain about 82% of the variance in the data.
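A small sketch of choosing the smallest number of components that covers a given threshold, say 90% of the variance (the threshold is illustrative):

# Smallest number of components explaining at least 90% of the variance
n_components <- which(cumsum(pr_variance) >= 90)[1]
n_components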

Scree plot

If you wish to plot the variances against the number of components, you can use the screeplot function on the fitted model:

screeplot(pca_base)
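By default, screeplot() draws a barplot of the variances; a line plot of the first components can also be requested:

# Line-style scree plot of the first ten components
screeplot(pca_base, npcs = 10, type = "lines")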

Reconstruction error

The components are estimated in such a way that the variance along each component is maximized while the reconstruction error is minimized. Reconstruction error in terms of PCA is the squared Euclidean distance between the actual data and its estimated equivalent. We choose the orthonormal basis that minimizes this error, which is equivalent to discarding the eigenvectors with the smallest eigenvalues.

Can we reconstruct the original data using the results of PCA? The following keeps only the first principal component:

loadings <- t(pca_base$rotation[, 1])          # 1 x p loadings of the first component
scores   <- pca_base$x[, 1]                    # scores of each observation on the first component
reconstruct_data <- scores %*% loadings        # rank-one approximation of the centered data
# Add back the column means that prcomp() subtracted while centering
reconstruct_data <- sweep(reconstruct_data, 2, pca_base$center, "+")
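As a follow-up, the squared Euclidean distance between the original data and this rank-one reconstruction gives the reconstruction error described above:

# Squared Euclidean reconstruction error when keeping only the first component
sum((data - reconstruct_data)^2)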