Complex and noisy characteristics of high-dimensional textual data can be handled by dimensionality reduction techniques. These techniques reduce the dimension of the textual data while still preserving its underlying statistics. Though the dimensions are reduced, it is important to preserve the inter-document relationships. The idea is to use the minimum number of dimensions that still captures the intrinsic dimensionality of the data.
A textual collection is mostly represented in the form of a term-document matrix, wherein we record the importance of each term in each document. The dimensionality of such a collection increases with the number of unique terms. The simplest possible dimensionality reduction method is to specify lower and upper bounds on the frequency distribution of terms in the collection. A term that occurs with a significantly high frequency is not going to be informative for us, and the barely present terms can undoubtedly be ignored and considered as noise.
Words that generally occur with high frequency and carry no particular meaning are referred to as stop words; some examples are is, was, then, and the. Words that occur just once or twice are more likely to be spelling errors or obscure words. Neither stop words nor these rare terms should be considered when modeling the document in the Term Document Matrix (TDM).
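The frequency-based filtering described above can be sketched in a few lines of base R. The matrix, the terms, and the two cut-off values below are all illustrative assumptions, not prescribed settings:

```r
# A hypothetical term-document matrix: terms in rows, documents in columns.
tdm <- matrix(
  c(10, 12, 9,   # "the"  : a high-frequency stop word
     3,  2, 4,   # "model": an informative mid-frequency term
     2,  3, 2,   # "topic": another mid-frequency term
     1,  0, 0),  # "teh"  : a one-off term, likely a typo
  nrow = 4, byrow = TRUE,
  dimnames = list(c("the", "model", "topic", "teh"), paste0("doc", 1:3))
)

# Total frequency of each term across the collection.
term_freq <- rowSums(tdm)

# Keep only terms whose total frequency lies between the two cut-offs;
# the thresholds here are arbitrary and would be tuned per collection.
lower_cut <- 2
upper_cut <- 20
tdm_reduced <- tdm[term_freq >= lower_cut & term_freq <= upper_cut, ]
```

Here only model and topic survive: the is too frequent to be informative, and teh is too rare to be anything but noise.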
We will discuss a few dimensionality reduction techniques in brief and dive into their implementation using R.
Principal component analysis (PCA) reveals the internal structure of a dataset in a way that best explains the variance within the data. PCA identifies patterns to reduce the dimensions of the dataset without significant loss of information. The main aim of PCA is to project a high-dimensional feature space onto a smaller subspace to decrease computational cost. PCA computes new features, called principal components; these principal components are uncorrelated linear combinations of the original features, projected in the directions of highest variability. The key step is to map the set of features into a matrix, M, and compute its eigenvalues and eigenvectors. Eigenvectors provide simple descriptions of linear transformations as stretching, compressing, or flipping along fixed axes; eigenvalues give the magnitude of these transformations. Eigenvectors with greater eigenvalues are selected for the new feature space because they capture more information about the data distribution than eigenvectors with lower eigenvalues. The first principal component has the greatest possible variance, that is, the largest eigenvalue; each subsequent principal component has the maximum variance possible while being uncorrelated with all the preceding components.
PCA comprises the following steps:

1. Center (and optionally scale) each variable.
2. Compute the covariance matrix of the centered data.
3. Compute the eigenvalues and eigenvectors of the covariance matrix.
4. Sort the eigenvectors by decreasing eigenvalue and retain the top ones.
5. Project the data onto the retained eigenvectors to obtain the principal components.
Eigenvector values represent the contribution of each variable to the principal component axis. Principal components are oriented in the direction of maximum variance in m-dimensional space.
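The steps above can be sketched directly in base R using the covariance matrix and its eigendecomposition. The random data and the choice of k = 2 components below are assumptions for illustration:

```r
# Generate a small random dataset: 100 observations, 5 variables.
set.seed(42)
data <- replicate(5, rnorm(100))

# Step 1: center each variable.
centered <- scale(data, center = TRUE, scale = FALSE)

# Step 2: covariance matrix M of the centered data.
M <- cov(centered)

# Step 3: eigendecomposition; eigen() returns eigenvalues in
# decreasing order, so step 4 (sorting) is already done.
eig <- eigen(M)

# Step 5: project onto the top k eigenvectors.
k <- 2
pcs <- centered %*% eig$vectors[, 1:k]

# The eigenvalues equal the variances of the projected components.
pc_var <- apply(centered %*% eig$vectors, 2, var)
```

Checking pc_var against eig$values is a useful sanity test: the variance captured by each principal component is exactly the corresponding eigenvalue.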
PCA is one of the most widely used multivariate methods for discovering meaningful, new, informative, and uncorrelated features. This methodology also reduces dimensionality by rejecting low-variance features and is useful in reducing the computational requirements for classification and regression analysis.
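As a quick sketch of that use case, one can fit a regression on a few principal components instead of the full set of correlated features. The dataset, response, and choice of three components below are illustrative assumptions:

```r
# Simulated features and a response driven by the first two variables.
set.seed(3)
x <- replicate(6, rnorm(200))
y <- x[, 1] - x[, 2] + rnorm(200, sd = 0.1)

# Reduce the feature space to its first three principal components.
pca <- prcomp(x)
k <- 3
reduced <- data.frame(y = y, pca$x[, 1:k])

# Regress on the reduced feature set: 3 predictors instead of 6.
fit <- lm(y ~ ., data = reduced)
```

The model now estimates four coefficients (intercept plus three components) rather than seven, which is the computational saving the text refers to.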
R also has two inbuilt functions for accomplishing PCA: prcomp() and princomp(). These two functions expect the dataset to be organized with variables in columns and observations in rows, in a structure like a data frame. They also return the new data in the form of a data frame, with the principal components given in columns.
prcomp() and princomp() are similar functions for accomplishing PCA; they differ slightly in their implementation. Internally, the princomp() function performs PCA using the eigendecomposition of the covariance matrix. The prcomp() function uses a related technique known as singular value decomposition (SVD). SVD has slightly better numerical accuracy, so prcomp() is generally the preferred function.
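One way to see the SVD connection is to reproduce the prcomp() standard deviations by hand: the singular values of the centered data, divided by the square root of n - 1, match pca$sdev up to numerical precision. The random dataset below is an assumption for illustration:

```r
# Random data: 100 observations, 5 variables.
set.seed(2)
data <- replicate(5, rnorm(100))

# SVD of the centered data matrix.
centered <- scale(data, center = TRUE, scale = FALSE)
sv <- svd(centered)

# Singular values scaled by sqrt(n - 1) give the component
# standard deviations, which is what prcomp() reports as sdev.
sdev_svd <- sv$d / sqrt(nrow(data) - 1)

pr <- prcomp(data)
```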
Each function returns a list whose class is prcomp or princomp, respectively.
The information returned and the terminology are summarized in the following table:

| prcomp() | princomp() | Explanation |
|---|---|---|
| sdev | sdev | Standard deviation of each column |
| rotation | loadings | Principal components |
| center | center | Value subtracted from each row or column to get the centered data |
| scale | scale | Scale factors used |
| x | scores | The rotated data |
| | n.obs | Number of observations of each variable |
| | call | The call to the function that created the object |
Here's a list of the functions available in different R packages for performing PCA:
FactoMineR is an R package that provides multiple functions for multivariate data analysis and dimensionality reduction. The functions provided in the package deal not only with quantitative data but also with categorical data. Apart from PCA, correspondence analysis and multiple correspondence analysis can also be performed using this package:
library(FactoMineR)
data <- replicate(10, rnorm(1000))
result.pca = PCA(data[,1:9], scale.unit=TRUE, graph=T)
print(result.pca)
Results for the principal component analysis (PCA).
The analysis was performed on 1,000 individuals, described by nine variables.
The results are available in the following objects:
| Name | Description |
|---|---|
| $eig | Eigenvalues |
| $var | Results for the variables |
| $var$coord | coord. for the variables |
| $var$cor | Correlations variables - dimensions |
| $var$cos2 | cos2 for the variables |
| $var$contrib | Contributions of the variables |
| $ind | Results for the individuals |
| $ind$coord | coord. for the individuals |
| $ind$cos2 | cos2 for the individuals |
| $ind$contrib | Contributions of the individuals |
| $call | Summary statistics |
| $call$centre | Mean of the variables |
| $call$ecart.type | Standard error of the variables |
| $call$row.w | Weights for the individuals |
| $call$col.w | Weights for the variables |
Eigenvalue, percentage of variance, and cumulative percentage of variance:

       eigenvalue percentage of variance cumulative percentage of variance
comp 1  1.1573559              12.859510                          12.85951
comp 2  1.0991481              12.212757                          25.07227
comp 3  1.0553160              11.725734                          36.79800
comp 4  1.0076069              11.195632                          47.99363
comp 5  0.9841510              10.935011                          58.92864
comp 6  0.9782554              10.869505                          69.79815
comp 7  0.9466867              10.518741                          80.31689
comp 8  0.9172075              10.191194                          90.50808
comp 9  0.8542724               9.491916                         100.00000
Amap is another package in the R environment that provides tools for clustering and PCA. It is an acronym for Another Multidimensional Analysis Package. One of the most widely used functions in this package is acp(), which performs PCA on a data frame.
This function is akin to princomp() and prcomp(), except that it has a slightly different graphical representation.
For more intricate details, refer to the CRAN-R resource page:
https://cran.r-project.org/web/packages/amap/amap.pdf

library(amap)
acp(data, center=TRUE, reduce=TRUE)
Additionally, weight vectors can also be provided as an argument. We can perform a robust PCA by using the acpgen function in the amap package:

acpgen(data, h1, h2, center=TRUE, reduce=TRUE, kernel="gaussien")
K(u, kernel="gaussien")
W(x, h, D=NULL, kernel="gaussien")
acprob(x, h, center=TRUE, reduce=TRUE, kernel="gaussien")
We look to construct the components and then choose, from among them, the minimum number of components that explains the variance of the data with high confidence.
R has a prcomp() function in the default stats package to estimate principal components. Let's learn how to use this function to estimate the proportion of variance, eigenvalues, and eigenvectors:
pca_base<-prcomp(data)
print(pca_base)
The pca_base object contains the standard deviations and rotations of the vectors. The rotations are also known as the principal components of the data. Let's find out the proportion of variance each component explains:
pr_variance<- (pca_base$sdev^2/sum(pca_base$sdev^2))*100
pr_variance
 [1] 11.678126 11.301480 10.846161 10.482861 10.176036  9.605907  9.498072
 [8]  9.218186  8.762572  8.430598
pr_variance signifies the proportion of variance explained by each component, in descending order of magnitude.
Let's calculate the cumulative proportion of variance for the components:
cumsum(pr_variance)
 [1]  11.67813  22.97961  33.82577  44.30863  54.48467  64.09057  73.58864
 [8]  82.80683  91.56940 100.00000
Components 1-8 explain about 82% of the variance in the data.
The components are estimated in such a way that the variance along each component is maximized while the reconstruction error is minimized. The reconstruction error, in terms of PCA, is the squared Euclidean distance between the actual data and its estimated equivalent. We choose the orthonormal basis that minimizes this error, which amounts to discarding the eigenvectors with the smallest eigenvalues.
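The reconstruction error just described can be measured directly: project the data onto the first k components, map back, and compute the squared Euclidean distance to the original. The block below recreates the random data and PCA fit so it is self-contained; the helper function name is an assumption, not part of any package:

```r
# Self-contained setup mirroring the earlier example.
set.seed(7)
data <- replicate(10, rnorm(1000))
pca_base <- prcomp(data)

# Squared Euclidean distance between the data and its rank-k
# approximation built from the first k principal components.
reconstruction_error <- function(pca, data, k) {
  rot <- pca$rotation[, 1:k, drop = FALSE]
  approx <- pca$x[, 1:k, drop = FALSE] %*% t(rot)
  approx <- sweep(approx, 2, pca$center, "+")   # undo the centering
  sum((data - approx)^2)
}

err_1   <- reconstruction_error(pca_base, data, 1)
err_all <- reconstruction_error(pca_base, data, ncol(data))
```

With all components retained the error is numerically zero, and it grows as components are dropped, which is exactly the trade-off PCA manages.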
Can we reconstruct the original data using the results of PCA? The following snippet builds a rank-1 approximation from the first component; note that because prcomp() centers the data, the column means must be added back row-wise (a plain vector addition would recycle them down the columns instead):

loadings = t(pca_base$rotation[,1])
scores = pca_base$x[,1]
reconstruct_data = scores %*% loadings +
  matrix(colMeans(data), nrow(data), ncol(data), byrow=TRUE)