Like PCA, correspondence analysis reduces the dimensionality of data and represents it in a low-dimensional space. It works on contingency tables (cross tabs) and is designed for exploratory analysis of two-way and multi-way tables whose dimensions show some degree of correspondence. The usual approach standardizes the cross tab of frequencies so that its entries can be represented as distances between the row and column categories in a low-dimensional space.
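The standardization step described above can be sketched in base R. This is a minimal illustration rather than the code used by any particular package; it assumes the built-in `HairEyeColor` data (collapsed over sex) as a stand-in contingency table, and the names `S`, `row_coord`, and `col_coord` are chosen here for illustration only.

```r
# Simple correspondence analysis "by hand" on a small two-way table.
# HairEyeColor ships with base R; summing over Sex gives a 4 x 4 table.
tab <- apply(HairEyeColor, c(1, 2), sum)

n  <- sum(tab)        # grand total
P  <- tab / n         # correspondence matrix (relative frequencies)
rm <- rowSums(P)      # row masses
cm <- colSums(P)      # column masses

# Standardized residuals: (observed - expected) / sqrt(expected)
E <- outer(rm, cm)
S <- (P - E) / sqrt(E)

# The SVD of the standardized residuals gives the principal axes
dec <- svd(S)

# Principal coordinates of the rows and columns on the first two dimensions
row_coord <- sweep(dec$u[, 1:2], 1, sqrt(rm), "/") %*% diag(dec$d[1:2])
col_coord <- sweep(dec$v[, 1:2], 1, sqrt(cm), "/") %*% diag(dec$d[1:2])
rownames(row_coord) <- rownames(tab)
rownames(col_coord) <- colnames(tab)

# Total inertia is the sum of squared standardized residuals
total_inertia <- sum(S^2)

print(round(row_coord, 3))
print(round(col_coord, 3))
```

Plotting `row_coord` and `col_coord` in the same plane gives the familiar symmetric correspondence map.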
There are a few packages available in R that provide efficient functions for correspondence analysis:
R functions | Package |
---|---|
Let's look at an example application of the R functions for simple correspondence analysis:
    # Load the anacor package into the session
    library(anacor)
    # Load the tocher dataset; it's a frequency table
    data(tocher)
    resid <- anacor(tocher, scaling = c("standard", "centroid"))
    print(resid)
CA fit:
Sum of eigenvalues: 0.2293315
Total chi-square value: 1240.039
Chi-Square decomposition
Chisq Proportion Cumulative Proportion:
Component 1 1073.331 0.866 0.866
Component 2 162.077 0.131 0.996
Component 3 4.630 0.004 1.000
Looking at the chi-square decomposition, we can conclude that component 1 accounts for about 87% of the total inertia (proportion 0.866). Components 1 and 2 together account for 99.6% of the inertia, so a two-dimensional solution represents the table almost completely.
Let's run the chi-square test of independence:
chisq.test(tocher)
Pearson's Chi-squared test:
data: tocher
X-squared = 1240.039, df = 12, p-value < 2.2e-16
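The two outputs above are linked: the total inertia reported by the correspondence analysis is the Pearson chi-square statistic divided by the grand total of the table, and each component's share of chi-square is its eigenvalue times that total. The sketch below checks this in base R. It uses the built-in `HairEyeColor` data as an assumed stand-in for `tocher`, so the numbers differ from those printed above.

```r
# Link between the chi-square statistic and the CA inertia decomposition
tab <- apply(HairEyeColor, c(1, 2), sum)   # 4 x 4 hair-by-eye table
n   <- sum(tab)

chi <- suppressWarnings(chisq.test(tab))
total_inertia <- unname(chi$statistic) / n

# Pearson residuals divided by sqrt(n) are the standardized residuals of CA
S   <- chi$residuals / sqrt(n)
eig <- svd(S)$d^2
eig <- eig[eig > 1e-12]                    # drop the trivial zero dimension

prop <- eig / sum(eig)
decomp <- data.frame(
  Component  = seq_along(eig),
  Chisq      = eig * n,                    # chi-square carried by each component
  Proportion = prop,
  Cumulative = cumsum(prop)
)
print(decomp)
```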
Let's visualize the joint and graph plots for the residuals:
    plot(resid, plot.type = "jointplot", ylim = c(-1.5, 1.5))
    plot(resid, plot.type = "graphplot", wlines = 5)
Other plotting options available in anacor are as follows:
Canonical correspondence analysis (CCA) differs from PCA in that the relationships modeled by PCA are linear. We apply CCA to explore the relationship between two multivariate sets of variables when we assume a cause-effect relation between them. The qualitative variables are recoded as binary dummy variables for CCA, and the fitted model provides a quantitative rescaling of the categorical variables:
    library(anacor)      # the bitterling dataset ships with anacor
    library(ca)
    library(FactoMineR)  # provides CA(), used below for Malinvaud's test
    data(bitterling)
    data <- bitterling
    total <- sum(data)
    nrows <- nrow(data)
    ncols <- ncol(data)
    # dimensionality
    a <- min(ncol(data) - 1, nrow(data) - 1)
    labs <- c(1:a)  # x-axis labels
    # create the contingency table
    data_matrix <- as.matrix(data)
    # add row and column margins to the contingency table
    data_rowsum <- addmargins(data_matrix, 1)
    data_colsum <- addmargins(data_matrix, 2)
    # apply the average rule to get the number of dimensions
    col_dim <- round(100 / (ncols - 1), digits = 1)
    row_dim <- round(100 / (nrows - 1), digits = 1)
    thresh_dim <- max(col_dim, row_dim)
    data_ca <- summary(ca(data))
    n_dim <- length(which(data_ca$scree[, 3] >= thresh_dim))
    # Malinvaud's test
    mal_ca <- CA(data, ncp = a, graph = FALSE)
    mal_trow <- a
    mal_tcol <- 6
    mal_out <- matrix(ncol = mal_tcol, nrow = mal_trow)
    colnames(mal_out) <- c("K", "Dimension", "Eigen value", "Chi-square", "df", "p value")
    mal_out[, 1] <- c(0:(a - 1))
    mal_out[, 2] <- c(1:a)
    # eigenvalue and degrees of freedom for each dimension; a plain loop is used
    # because assignments made inside %dopar% workers are not copied back to mal_out
    for (i in 1:mal_trow) {
      k <- i - 1
      mal_out[i, 3] <- mal_ca$eig[i, 1]
      mal_out[i, 5] <- (nrows - k - 1) * (ncols - k - 1)
    }
    mal_out[, 4] <- rev(cumsum(rev(mal_out[, 3]))) * total
    mal_out[, 6] <- round(pchisq(mal_out[, 4], mal_out[, 5], lower.tail = FALSE), digits = 6)
    optimal.dimensionality <- length(which(mal_out[, 6] <= 0.05))
    # plot a bar chart of the correlation between rows and columns, and add a reference line
    dev.new()
    perf_corr <- 1.0
    sqr.trace <- round(sqrt(sum(data_ca$scree[, 2])), digits = 3)
    barplot(c(perf_corr, sqr.trace),
            main = "Correlation coefficient between rows & columns (=square root of the inertia)",
            sub = "reference line: threshold of important correlation ",
            ylab = "correlation coeff.",
            names.arg = c("correlation coeff. range", "correlation coeff. bt rows & cols"),
            cex.main = 0.80, cex.sub = 0.80, cex.lab = 0.80)
    abline(h = 0.20)
    # percentage of inertia per dimension, with the average-rule threshold as reference line
    barplot(data_ca$scree[, 3], xlab = "Dimensions", ylab = "% of Inertia",
            names.arg = data_ca$scree[, 1])
    abline(h = thresh_dim)
    title(main = "Percentage of inertia attributed to the dimensions",
          sub = "ref line: threshold of an optimal dimensionality of the solution, according to the average rule",
          cex.main = 0.80, cex.sub = 0.80)
    # p values of Malinvaud's test per dimension
    plot(mal_out[, 6], type = "o", xaxt = "n", xlim = c(1, a),
         xlab = "Dimensions", ylab = "p value")
    axis(1, at = labs, labels = sprintf("%.0f", labs))
    title(main = "Malinvaud's test Plot", sub = "dashed line: alpha 0.05 threshold",
          col.sub = "RED", cex.sub = 0.80)
    abline(h = 0.05, lty = 2, col = "RED")
    userow_dimensionality <- 3
    dims.to.be.plotted <- userow_dimensionality
    # CA analysis by Greenacre's package, to be used later on for the standard biplots
    res.ca <- ca(data, nd = dims.to.be.plotted)
    str(res.ca)
    List of 15
     $ sv        : num [1:11] 0.831 0.799 0.585 0.529 0.443 ...
     $ nd        : num 3
     $ rownames  : chr [1:12] "jk" "tu" "hb" "chs" ...
     $ rowmass   : num [1:12] 0.1748 0.0382 0.1021 0.0709 0.0853 ...
     $ rowdist   : num [1:12] 1.19 1.01 1.02 1.31 2.69 ...
     $ rowinertia: num [1:12] 0.2466 0.0386 0.1064 0.1214 0.6195 ...
     $ rowcoord  : num [1:12, 1:3] -0.04047 -0.00192 0.16761 0.20569 -3.12697 ...
     $ rowsup    : logi(0)
     $ colnames  : chr [1:12] "jk" "tu" "hb" "chs" ...
     $ colmass   : num [1:12] 0.1983 0.0101 0.1167 0.0792 0.0864 ...
     $ coldist   : num [1:12] 1.159 0.798 1.075 1.191 2.677 ...
     $ colinertia: num [1:12] 0.26613 0.00644 0.13485 0.11225 0.6189 ...
     $ colcoord  : num [1:12, 1:3] -0.0126 0.0104 0.1632 0.2324 -3.113 ...
     $ colsup    : logi(0)
     $ call      : language ca.matrix(obj = as.matrix(obj), nd = ..1)
     - attr(*, "class")= chr "ca"
    # CA output as a data frame, to be used for some of the graphs to come
    cadataframe <- summary(ca(data, nd = dims.to.be.plotted))
    # plot the quality of the display of categories on successive pairs of dimensions
    # row categories
    dev.new()
    counter <- 1
    for (i in seq(9, ncol(cadataframe$rows), 3)) {
      counter <- counter + 1
      quality.rows <- (cadataframe$rows[, 6] + cadataframe$rows[, i]) / 10
      barplot(quality.rows, ylim = c(0, 100), xlab = "Row categories",
              ylab = paste("Quality of the display (% of inertia) on Dim. 1+", counter),
              names.arg = cadataframe$rows[, 1], cex.lab = 0.80)
    }
    # column categories
    dev.new()
    counter <- 1
    for (i in seq(9, ncol(cadataframe$columns), 3)) {
      counter <- counter + 1
      quality.cols <- (cadataframe$columns[, 6] + cadataframe$columns[, i]) / 10
      barplot(quality.cols, ylim = c(0, 100), xlab = "Column categories",
              ylab = paste("Quality of the display (% of inertia) on Dim. 1+", counter),
              names.arg = cadataframe$columns[, 1], cex.lab = 0.80)
    }
    # charts of category contributions
    # bar charts of the contribution of row categories to the axes, with a reference line
    dev.new()
    counter <- 0
    for (i in seq(7, ncol(cadataframe$rows), 3)) {
      counter <- counter + 1
      barplot(cadataframe$rows[, i], ylim = c(0, 1000), xlab = "Row categories",
              ylab = paste("Contribution to Dim. ", counter, " (in permills)"),
              names.arg = cadataframe$rows[, 1], cex.lab = 0.80)
      abline(h = round(((100 / nrows) * 10), digits = 0))
    }
    # bar charts of the contribution of column categories to the axes, with a reference line
    dev.new()
    counter <- 0
    for (i in seq(7, ncol(cadataframe$columns), 3)) {
      counter <- counter + 1
      barplot(cadataframe$columns[, i], ylim = c(0, 1000), xlab = "Column categories",
              ylab = paste("Contribution to Dim. ", counter, " (in permills)"),
              names.arg = cadataframe$columns[, 1], cex.lab = 0.80)
      abline(h = round(((100 / ncols) * 10), digits = 0))
    }
    # estimate the correlation of categories to dimensions
    # row categories
    dev.new()
    counter <- 0
    for (i in seq(6, ncol(cadataframe$rows), 3)) {
      counter <- counter + 1
      correl.rows <- round(sqrt((cadataframe$rows[, i] / 1000)), digits = 3)
      barplot(correl.rows, ylim = c(0, 1), xlab = "Row categories",
              ylab = paste("Correlation with Dim. ", counter),
              names.arg = cadataframe$rows[, 1], cex.lab = 0.80)
    }
    # column categories
    dev.new()
    counter <- 0
    for (i in seq(6, ncol(cadataframe$columns), 3)) {
      counter <- counter + 1
      correl.cols <- round(sqrt((cadataframe$columns[, i] / 1000)), digits = 3)
      barplot(correl.cols, ylim = c(0, 1), xlab = "Column categories",
              ylab = paste("Correlation with Dim. ", counter),
              names.arg = cadataframe$columns[, 1], cex.lab = 0.80)
    }
    # check the contingency table
    print(addmargins(data_matrix))
    # association coefficients can be estimated with vcd
    library(vcd)
    print(assocstats(data_matrix))
                       X^2  df P(> X^2)
    Likelihood Ratio  9251 121        0
    Pearson          14589 121        0
    Phi-Coefficient   : 1.581
    Contingency Coeff.: 0.845
    Cramer's V        : 0.477
    # Chi-square test
    print(chisq.test(data))
    X-squared = 14589.07, df = 121, p-value < 2.2e-16
    # Total inertia
    print(sum(cadataframe$scree[, 2]))
    [1] 2.499841
    # Square root of the total inertia
    print(sqr.trace)
    [1] 1.581
    # Correspondence analysis summary
    print(cadataframe)
    Principal inertias (eigenvalues):
     dim    value      %   cum%   scree plot
     1      0.689905  27.6  27.6  *************************
     2      0.639174  25.6  53.2  ***********************
     3      0.342155  13.7  66.9  ************
     4      0.280273  11.2  78.1  **********
     5      0.196284   7.9  85.9  *******
     6      0.154954   6.2  92.1  ******
     7      0.142684   5.7  97.8  *****
     8      0.048760   2.0  99.8  **
     9      0.005384   0.2 100.0
     10     0.000232   0.0 100.0
     11     3.5e-050   0.0 100.0
            -------- -----
     Total: 2.499841 100.0
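The Malinvaud test used in the listing above can also be written as a small self-contained function. The sketch below is an illustration under stated assumptions: it relies only on base R, computes the eigenvalues from the SVD of the standardized residuals instead of calling `ca` or `FactoMineR`, and the function name `malinvaud` is hypothetical. It is run on the built-in `HairEyeColor` table rather than on `bitterling`.

```r
# Malinvaud's test: for K = 0, 1, ..., test whether the dimensions beyond
# the first K still carry significant association.
malinvaud <- function(tab) {
  tab <- as.matrix(tab)
  n <- sum(tab)
  I <- nrow(tab)
  J <- ncol(tab)
  a <- min(I, J) - 1                       # maximum number of dimensions

  # eigenvalues (principal inertias) from the standardized residuals
  P   <- tab / n
  E   <- outer(rowSums(P), colSums(P))
  eig <- svd((P - E) / sqrt(E))$d[seq_len(a)]^2

  K    <- 0:(a - 1)
  stat <- n * rev(cumsum(rev(eig)))        # n * (sum of eigenvalues after the first K)
  df   <- (I - K - 1) * (J - K - 1)
  p    <- pchisq(stat, df, lower.tail = FALSE)

  data.frame(K = K, Dimension = K + 1, Eigenvalue = eig,
             ChiSquare = stat, df = df, p.value = p)
}

res <- malinvaud(apply(HairEyeColor, c(1, 2), sum))
print(res)

# Number of dimensions retained at the 5% level, as in the listing above
optimal_dims <- sum(res$p.value <= 0.05)
```

The first row (K = 0) reproduces the ordinary chi-square test of independence, which gives a quick sanity check on the function.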
Multiple correspondence analysis (MCA) is a methodology for establishing the association between multiple discrete categorical or qualitative variables. This makes it different from simple correspondence analysis, which accounts for the association between only two categorical variables. It is a useful statistical tool for allocating scores to subjects and to the categories of multiple categorical variables. Multiple correspondence analysis is characterized by the optimal scaling of categorical variables. It is considered a categorical equivalent of PCA, that is, a form of nonlinear principal component analysis, and it can also be seen as multidimensional scaling of matrices. Many academic fields use multiple correspondence analysis to analyze large amounts of survey data.
This technique provides the association between two or more categorical variables. The data can be represented graphically in a highly informative and intuitive way using this technique. One of the distinctive features of correspondence analysis is the ways in which one can derive the basic simultaneous equations. These equations are related to the Pearson's chi-squared statistic and also to the different methods of quantification. The association between the variables of a two-way contingency table may be considered a special case of multiple correspondence analyses.
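One common way to see how MCA generalizes the two-way case is to apply simple correspondence analysis to the indicator (dummy) matrix of the categorical variables. The following base R sketch illustrates that idea only; the `survey` data frame is made up for the example, and the helper `indicator` is a hypothetical name, not a function from any MCA package.

```r
# Toy survey with three categorical variables (made-up data)
survey <- data.frame(
  smoker = factor(c("yes", "no", "no", "yes", "no", "no", "yes", "no")),
  diet   = factor(c("meat", "veg", "veg", "meat", "meat", "veg", "meat", "veg")),
  sport  = factor(c("low", "high", "high", "low", "mid", "high", "low", "mid"))
)

# Build the 0/1 indicator block of one factor, one column per category
indicator <- function(f, name) {
  m <- outer(as.character(f), levels(f), "==") * 1
  colnames(m) <- paste(name, levels(f), sep = ":")
  m
}
Z <- do.call(cbind, Map(indicator, survey, names(survey)))

# Simple CA applied to the indicator matrix
P  <- Z / sum(Z)
rm <- rowSums(P)
cm <- colSums(P)
E  <- outer(rm, cm)
dec <- svd((P - E) / sqrt(E))

# Principal coordinates of the categories on the first two dimensions
cat_coord <- sweep(dec$v[, 1:2], 1, sqrt(cm), "/") %*% diag(dec$d[1:2])
rownames(cat_coord) <- colnames(Z)

total_inertia <- sum(dec$d^2)
print(round(cat_coord, 3))
```

For an indicator matrix with J categories spread over Q variables, the total inertia equals J/Q - 1, which here is 7/3 - 1.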
The scaling in multiple correspondence analyses can be performed by the following methods:
There are different variations of multiple correspondence analysis:
MCA functions | Package |
---|---|
Singular value decomposition (SVD) is a dimensionality reduction technique that gained a lot of popularity after the famous Netflix movie recommendation challenge. It is used in many applications in statistics, mathematics, and signal processing.
It is primarily a technique to factorize a matrix, which can be real or complex. A rectangular matrix can be factorized into two matrices with orthonormal columns and a diagonal matrix of non-negative real values (the singular values). An m*n matrix can be viewed as m points in n-dimensional space; SVD finds the best k-dimensional subspace that fits the data.
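The factorization and the best rank-k fit can be illustrated with the base R function `svd()`. This is a small sketch on a random matrix generated only for the example; the variable names are illustrative.

```r
set.seed(42)
A <- matrix(rnorm(20 * 6), nrow = 20, ncol = 6)   # 20 points in 6 dimensions

dec <- svd(A)   # A = U diag(d) V^T, with orthonormal columns in U and V

# Reconstruct A from the three factors
A_full <- dec$u %*% diag(dec$d) %*% t(dec$v)

# Best rank-k approximation: keep the k largest singular values
k   <- 2
A_k <- dec$u[, 1:k] %*% diag(dec$d[1:k]) %*% t(dec$v[, 1:k])

# Frobenius-norm error of the rank-k fit ...
err_k <- norm(A - A_k, type = "F")
# ... equals the root of the sum of the discarded squared singular values
expected_err <- sqrt(sum(dec$d[-(1:k)]^2))

# Coordinates of the 20 points in the k-dimensional subspace
scores <- dec$u[, 1:k] %*% diag(dec$d[1:k])
```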
In R, SVD can be used to compute approximations of the singular values and singular vectors of large-scale data matrices. These approximations are made using different types of memory-efficient algorithms; the implicitly restarted Lanczos bidiagonalization algorithm (IRLBA) is one of them. We shall be using the irlba package here in order to implement SVD.
The following code will show the implementation of SVD using R:
    # List of packages for the session
    packages <- c("foreach", "doParallel", "irlba")
    # Install CRAN packages (if not already installed)
    inst <- packages %in% installed.packages()
    if (length(packages[!inst]) > 0) install.packages(packages[!inst])
    # Load packages into the session
    lapply(packages, require, character.only = TRUE)
    # Register the parallel backend for the session
    registerDoParallel(cores = detectCores(all.tests = TRUE))
    std_svd <- function(x, k, p = 25, iter = 1) {
      m1 <- as.matrix(x)
      r <- nrow(m1)
      c <- ncol(m1)
      p <- min(min(r, c) - k, p)   # oversampling, capped by the matrix size
      z <- k + p
      # random test matrix and initial orthonormal basis of the sampled range
      m2 <- matrix(rnorm(z * c), nrow = c, ncol = z)
      y <- m1 %*% m2
      q <- qr.Q(qr(y))
      b <- t(q) %*% m1
      # power iterations; each pass depends on the previous one, so they run sequentially
      for (i in seq_len(iter)) {
        y <- m1 %*% t(b)
        q <- qr.Q(qr(y))
        b <- t(q) %*% m1
      }
      b2 <- b %*% t(b)
      eigens <- eigen(b2, symmetric = TRUE)
      result <- list()
      result$svalues <- sqrt(eigens$values)[1:k]
      result$u <- (q %*% eigens$vectors)[, 1:k]
      # right singular vectors: t(b) %*% U scaled by the inverse singular values
      result$v <- (t(b) %*% eigens$vectors %*% diag(1 / sqrt(eigens$values)))[, 1:k]
      return(result)
    }
    # data is the numeric matrix to decompose
    svd <- std_svd(x = data, k = 5)
    # singular values
    svd$svalues
    [1] 35.37645 33.76244 32.93265 32.72369 31.46702
We obtain the following values after running SVD using the IRLBA algorithm:
- d: approximate singular values
- u: nu approximate left singular vectors
- v: nv approximate right singular vectors
- iter: number of IRLBA algorithm iterations
- mprod: number of matrix-vector products performed

These values can be used for obtaining results of SVD and understanding the overall statistics about how the algorithm performed.
Latent factors
    # svd$u, svd$v
    dim(svd$u)  # u value after running IRLBA
    [1] 1000 5
    dim(svd$v)  # v value after running IRLBA
    [1] 10 5
A modified version of the previous function can be achieved by altering the power iterations for a robust implementation:
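    for (i in seq_len(iter)) {
      y1 <- m1 %*% t(b)
      y2 <- t(y1) %*% y1
      r2 <- chol(y2, pivot = TRUE)
      piv <- attr(r2, "pivot")
      # Cholesky-based orthonormalization: Q = Y R^{-1}, with the columns of Y
      # permuted to match the pivoting used by chol()
      q <- y1[, piv] %*% solve(r2)
      b <- t(q) %*% m1
    }
    b2 <- b %*% t(b)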
Some other functions available in R packages are as follows:
Functions | Package |
---|---|
ISOMAP – moving toward non-linearity
ISOMAP is a nonlinear dimension reduction method and is representative of isometric mapping methods. It is one of the approaches to manifold learning. ISOMAP finds the map that preserves the global, nonlinear geometry of the data by preserving the geodesic inter-point distances on the manifold. Like multidimensional scaling, ISOMAP creates a visual presentation of the distances between a number of objects. A geodesic is the shortest curve along the manifold connecting two points, and it is approximated by a path in a neighborhood graph. Multidimensional scaling uses the Euclidean distance measure; because the data lies on a nonlinear structure, ISOMAP uses the geodesic distance instead. ISOMAP can be viewed as an extension of metric multidimensional scaling.
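The idea can be sketched in a few lines of base R: build a k-nearest-neighbor graph, approximate geodesic distances by shortest paths in that graph, and apply classical multidimensional scaling to the result. This is a simplified illustration, not the implementation used by `RDRToolbox` or `vegan`; the function name `isomap_sketch` and the arc-shaped test data are assumptions made for the example.

```r
isomap_sketch <- function(X, k = 4, ndim = 2) {
  D <- as.matrix(dist(X))          # Euclidean distances
  n <- nrow(D)

  # 1. k-nearest-neighbor graph (symmetrized); Inf marks "no edge"
  G <- matrix(Inf, n, n)
  diag(G) <- 0
  for (i in seq_len(n)) {
    nn <- order(D[i, ])[2:(k + 1)] # skip the point itself
    G[i, nn] <- D[i, nn]
    G[nn, i] <- D[nn, i]
  }

  # 2. Geodesic distances as shortest paths (Floyd-Warshall)
  for (m in seq_len(n)) {
    G <- pmin(G, outer(G[, m], G[m, ], "+"))
  }
  stopifnot(all(is.finite(G)))     # the neighborhood graph must be connected

  # 3. Classical MDS on the geodesic distances
  list(geodesic = G, points = cmdscale(as.dist(G), k = ndim))
}

# Points on three quarters of a unit circle: a 1-D manifold embedded in 2-D
t_par <- seq(0, 1.5 * pi, length.out = 50)
X <- cbind(cos(t_par), sin(t_par))

iso <- isomap_sketch(X, k = 4, ndim = 1)

# The end points are close in Euclidean terms but far apart along the arc
euclid_ends   <- sqrt(sum((X[1, ] - X[50, ])^2))
geodesic_ends <- iso$geodesic[1, 50]
```

The one-dimensional embedding in `iso$points` orders the points along the arc, which plain Euclidean MDS would not do for the two ends.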
At a very high level, ISOMAP can be described in four steps:
Geodesic distance approximation is basically calculated in three ways:
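    # RDRToolbox is a Bioconductor package; current Bioconductor releases install it via BiocManager
    if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
    BiocManager::install("RDRToolbox")
    library(RDRToolbox)
    library(rgl)  # provides open3d() and plot3d()
    swiss_Data <- SwissRoll(N = 1000, Plot = TRUE)
    x <- SwissRoll()
    open3d()
    plot3d(x, col = rainbow(1050)[-c(1:50)], box = FALSE, type = "s", size = 1)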
simData_Iso = Isomap(data=swiss_Data, dims=1:10, k=10,plotResiduals=TRUE)
    library(vegan)
    data(BCI)
    dis <- vegdist(BCI)
    tree <- spantree(dis)
    pl1 <- ordiplot(cmdscale(dis), main = "cmdscale")
    lines(tree, pl1, col = "red")
    z <- isomap(dis, k = 3)
    # rgl.isomap() is provided by the vegan3d package in recent releases of vegan
    rgl.isomap(z, size = 4, color = "red")
    pl2 <- plot(isomap(dis, epsilon = 0.5), main = "isomap epsilon=0.5")
    pl3 <- plot(isomap(dis, k = 5), main = "isomap k=5")
    pl4 <- plot(z, main = "isomap k=3")