Correspondence analysis

Just like PCA, the basic idea behind correspondence analysis is to reduce the dimensionality of data and represent it in a low-dimensional space. Correspondence analysis operates on contingency tables (cross tabs) and is designed for exploratory analysis of multi-way tables in which some degree of correspondence between the rows and columns is expected. The usual procedure standardizes the cross tab of frequencies so that its entries can be represented as distances between the row and column categories in a low-dimensional space.
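To make the standardization step concrete, here is a minimal sketch on a small made-up cross tab (the numbers are purely illustrative): the table of relative frequencies is centered by the expected frequencies, rescaled by the row and column masses, and the squared singular values of the result are the principal inertias.

tab <- matrix(c(20, 10, 5, 15, 25, 10), nrow = 2, byrow = TRUE,
              dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
P <- tab / sum(tab)                       # correspondence matrix
rmass <- rowSums(P); cmass <- colSums(P)  # row and column masses
S <- diag(1/sqrt(rmass)) %*% (P - outer(rmass, cmass)) %*% diag(1/sqrt(cmass))
svd(S)$d^2   # principal inertias; a 2 x 3 table has only one nontrivial dimension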

There are a few packages available in R that provide efficient functions for correspondence analysis:

R functions                           Package
ca()                                  ca
corresp(formula, nf, data)            MASS
dudi.coa(df, scannf = TRUE, nf = 2)   ade4
CA()                                  FactoMineR
afc()                                 amap

Let's look at an example application of the R functions for simple correspondence analysis:

# Load the anacor package into the session
library(anacor)
# Load the tocher dataset, a frequency table of eye colour versus hair colour
data(tocher)
resid <- anacor(tocher, scaling = c("standard", "centroid"))
print(resid)

CA fit:

Sum of eigenvalues: 0.2293315
Total chi-square value: 1240.039

Chi-Square decomposition:
              Chisq     Proportion  Cumulative Proportion
Component 1   1073.331  0.866       0.866
Component 2    162.077  0.131       0.996
Component 3      4.630  0.004       1.000

Looking at the chi-square decomposition, we can conclude that component 1 alone accounts for 86.6% of the total inertia, and components 1 and 2 together account for 99.6%, so a two-dimensional solution is sufficient.

Let's run the chi-square test of independence:

chisq.test(tocher)

        Pearson's Chi-squared test

data:  tocher
X-squared = 1240.039, df = 12, p-value < 2.2e-16

Let's visualize the joint plot and graph plot for the fitted CA object:

plot(resid, plot.type = "jointplot", ylim = c(-1.5, 1.5))
plot(resid, plot.type = "graphplot", wlines = 5)

Other plotting options available in anacor are as follows:

  • regplot: Plots the frequency grid and regression line
  • transplot: Plots the initial against the transformed/scaled data
  • benzplot: Plots the observed distances against the fitted distances
  • rowplot, colplot: Plots the row/column scores separately
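For example, the row scores and the frequency-grid regression plot can be drawn from the same fitted object (a sketch reusing resid from above, with default plot arguments assumed):

plot(resid, plot.type = "rowplot")   # row scores only
plot(resid, plot.type = "regplot")   # frequency grid with regression line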

Canonical correspondence analysis

Canonical correspondence analysis (CCA) differs from PCA in that PCA assumes linear relationships between variables. CCA is applied to explore the relationship between two multivariate sets of variables when a cause-effect relationship between the sets can be assumed. For CCA, the qualitative variables are recoded as binary dummy variables, and the fitted model provides a quantitative rescaling of the categorical variables.
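CCA in this cause-effect sense is implemented, for example, by cca() in the vegan package; here is a minimal sketch using vegan's bundled varespec (species abundances) and varechem (soil chemistry) data sets:

library(vegan)
data(varespec)
data(varechem)
# species matrix constrained by three soil chemistry variables
ccamodel <- cca(varespec ~ Al + P + K, data = varechem)
ccamodel         # constrained (canonical) and unconstrained inertia
plot(ccamodel)   # triplot of sites, species, and constraining variables

The longer walk-through that follows stays with correspondence analysis machinery, applying the ca, FactoMineR, and anacor packages to the bitterling courtship data: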

# bitterling ships with anacor; ca() comes from ca and CA() from FactoMineR
library(anacor)
library(ca)
library(FactoMineR)
data(bitterling)
data <- bitterling
total <- sum(data)
nrows <- nrow(data)
ncols <- ncol(data)
# maximum dimensionality of the solution
a <- min(ncol(data)-1, nrow(data)-1)
labs <- c(1:a) # x-axis labels
# create the contingency table
data_matrix <- as.matrix(data)
# add row/column margins to the contingency table
data_rowsum <- addmargins(data_matrix, 1)
data_colsum <- addmargins(data_matrix, 2)
# apply the average rule to get the number of dimensions
col_dim <- round(100/(ncols-1), digits=1)
row_dim <- round(100/(nrows-1), digits=1)
thresh_dim <- (max(col_dim, row_dim))
data_ca <- summary(ca(data))
n_dim <- length(which(data_ca$scree[,3] >= thresh_dim))
# Malinvaud's Test
mal_ca <- CA(data, ncp=a, graph=FALSE)
mal_trow <- a
mal_tcol <- 6
mal_out <- matrix(ncol=mal_tcol, nrow=mal_trow)
colnames(mal_out) <- c("K", "Dimension", "Eigen value", "Chi-square", "df", "p value")
mal_out[,1] <- c(0:(a-1))
mal_out[,2] <- c(1:a)
library(foreach)
library(doParallel)
cl <- makeCluster(4) # number of cores
registerDoParallel(cl)
# %dopar% runs in separate workers, so collect each dimension's eigenvalue and
# degrees of freedom with .combine instead of assigning into mal_out in place
mal_res <- foreach(i = 1:mal_trow, .combine=rbind) %dopar% {
  k <- i - 1
  c(mal_ca$eig[i,1], (nrows-k-1)*(ncols-k-1))
}
stopCluster(cl)
mal_out[,3] <- mal_res[,1]
mal_out[,5] <- mal_res[,2]
mal_out[,4] <- rev(cumsum(rev(mal_out[,3])))*total
mal_out[,6] <- round(pchisq(mal_out[,4], mal_out[,5], lower.tail=FALSE), digits=6)
optimal.dimensionality <- length(which(mal_out[,6]<=0.05))
# plot bar chart of correlation between rows and columns, and add reference line
dev.new()
perf_corr<-(1.0)
sqr.trace<-round(sqrt(sum(data_ca$scree[,2])), digits=3)
barplot(c(perf_corr, sqr.trace),
        main="Correlation coefficient between rows & columns (= square root of the inertia)",
        sub="reference line: threshold of important correlation",
        ylab="correlation coeff.",
        names.arg=c("correlation coeff. range", "correlation coeff. bt rows & cols"),
        cex.main=0.80, cex.sub=0.80, cex.lab=0.80)
abline(h=0.20)
barplot(data_ca$scree[,3], xlab="Dimensions", ylab="% of Inertia", names.arg=data_ca$scree[,1])
abline(h=thresh_dim) # the average-rule threshold computed earlier
title(main="Percentage of inertia attributed to the dimensions", sub="ref line: threshold of an optimal dimensionality of the solution, according to the average rule", cex.main=0.80, cex.sub=0.80)
plot(mal_out[,6], type="o", xaxt="n", xlim=c(1, a), xlab="Dimensions", ylab="p value")
axis(1, at=labs, labels=sprintf("%.0f",labs))
title(main="Malinvaud's test Plot", sub="dashed line: alpha 0.05 threshold", col.sub="RED", cex.sub=0.80)
abline(h=0.05, lty=2, col="RED")
userow_dimensionality <- 3 # number of dimensions retained for plotting
dims.to.be.plotted <- userow_dimensionality
# CA analysis by Greenacre's package to be used later on for the Standard Biplots
res.ca <- ca(data, nd=dims.to.be.plotted)
str(res.ca)
List of 15
 $ sv        : num [1:11] 0.831 0.799 0.585 0.529 0.443 ...
 $ nd        : num 3
 $ rownames  : chr [1:12] "jk" "tu" "hb" "chs" ...
 $ rowmass   : num [1:12] 0.1748 0.0382 0.1021 0.0709 0.0853 ...
 $ rowdist   : num [1:12] 1.19 1.01 1.02 1.31 2.69 ...
 $ rowinertia: num [1:12] 0.2466 0.0386 0.1064 0.1214 0.6195 ...
 $ rowcoord  : num [1:12, 1:3] -0.04047 -0.00192 0.16761 0.20569 -3.12697 ...
 $ rowsup    : logi(0) 
 $ colnames  : chr [1:12] "jk" "tu" "hb" "chs" ...
 $ colmass   : num [1:12] 0.1983 0.0101 0.1167 0.0792 0.0864 ...
 $ coldist   : num [1:12] 1.159 0.798 1.075 1.191 2.677 ...
 $ colinertia: num [1:12] 0.26613 0.00644 0.13485 0.11225 0.6189 ...
 $ colcoord  : num [1:12, 1:3] -0.0126 0.0104 0.1632 0.2324 -3.113 ...
 $ colsup    : logi(0) 
 $ call      : language ca.matrix(obj = as.matrix(obj), nd = ..1)
 - attr(*, "class")= chr "ca"
# CA output as a data frame, to be used for some of the graphs to come
cadataframe<-summary(ca(data, nd=dims.to.be.plotted))
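# note on indexing: in summary(ca(...)) the $rows and $columns matrices are laid
# out as name, mass, qlt, inr followed by one (coord, cor, ctr) triplet per
# dimension, with cor and ctr reported in permills; this is why the loops below
# step by 3 (columns 6, 9, ... hold cor; columns 7, 10, ... hold ctr) and
# divide by 1000 or 10 to recover proportions and percentages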
# plot the quality of the display of categories on successive pairs of dimensions
#row categories
dev.new()
counter <- 1
for(i in seq(9, ncol(cadataframe$rows), 3)){  
  counter <- counter +1
  quality.rows <- (cadataframe$rows[,6]+cadataframe$rows[,i])/10
  barplot(quality.rows, ylim=c(0,100), xlab="Row categories", ylab=paste("Quality of the display (% of inertia) on Dim. 1+", counter), names.arg=cadataframe$rows[,1], cex.lab=0.80)
}
#column categories
dev.new()
counter <- 1
for(i in seq(9, ncol(cadataframe$columns), 3)){  
  counter <- counter +1
  quality.cols <- (cadataframe$columns[,6]+cadataframe$columns[,i])/10
  barplot(quality.cols, ylim=c(0,100), xlab="Column categories", ylab=paste("Quality of the display (% of inertia) on Dim. 1+", counter), names.arg=cadataframe$columns[,1], cex.lab=0.80)
}
# charts of categories contribution
# plot bar charts of contribution of row categories to the axes, and add a reference line
dev.new()
counter <- 0
for(i in seq(7, ncol(cadataframe$rows), 3)){  
  counter <- counter +1
  barplot(cadataframe$rows[,i], ylim=c(0,1000), xlab="Row categories", ylab=paste("Contribution to Dim. ",counter," (in permills)"), names.arg=cadataframe$rows[,1], cex.lab=0.80)
  abline(h=round(((100/nrows)*10), digits=0))
}
# plot bar charts of contribution of column categories to the axes, and add a reference line
dev.new()
counter <- 0
for(i in seq(7, ncol(cadataframe$columns), 3)){  
  counter <- counter +1
  barplot(cadataframe$columns[,i], ylim=c(0,1000), xlab="Column categories", ylab=paste("Contribution to Dim. ",counter," (in permills)"), names.arg=cadataframe$columns[,1], cex.lab=0.80)
  abline(h=round(((100/ncols)*10), digits=0))
}
# let us estimate the correlation of categories to dimensions
# row categories
dev.new()
counter <- 0
for(i in seq(6, ncol(cadataframe$rows), 3)){  
  counter <- counter +1
  correl.rows <- round(sqrt((cadataframe$rows[,i]/1000)), digits=3)
  barplot(correl.rows, ylim=c(0,1), xlab="Row categories", ylab=paste("Correlation with Dim. ", counter), names.arg=cadataframe$rows[,1], cex.lab=0.80)
}
#column categories
dev.new()
counter <- 0
for(i in seq(6, ncol(cadataframe$columns), 3)){  
  counter <- counter +1
  correl.cols <- round(sqrt((cadataframe$columns[,i]/1000)), digits=3)
  barplot(correl.cols, ylim=c(0,1), xlab="Column categories", ylab=paste("Correlation with Dim. ", counter), names.arg=cadataframe$columns[,1], cex.lab=0.80)
}
# let us check the contingency table
print(addmargins(data_matrix))
# Association coefficients can be estimated by
library(vcd)
print(assocstats(data_matrix))

                     X^2  df P(> X^2)
Likelihood Ratio    9251 121        0
Pearson            14589 121        0

Phi-Coefficient   : 1.581
Contingency Coeff.: 0.845
Cramer's V        : 0.477

Pearson's Chi-squared test

#Chi-square test
print(chisq.test(data))
X-squared = 14589.07, df = 121, p-value < 2.2e-16
#Total Inertia
print(sum(cadataframe$scree[,2]))
[1] 2.499841
# Square root of the Total Inertia
print(sqr.trace)
[1] 1.581
# Correspondence Analysis summary
print(cadataframe)
       
Principal inertias (eigenvalues):
 dim    value      %   cum%   scree plot               
 1      0.689905  27.6  27.6  *************************
 2      0.639174  25.6  53.2  ***********************  
 3      0.342155  13.7  66.9  ************             
 4      0.280273  11.2  78.1  **********               
 5      0.196284   7.9  85.9  *******                  
 6      0.154954   6.2  92.1  ******                   
 7      0.142684   5.7  97.8  *****                    
 8      0.048760   2.0  99.8  **                       
 9      0.005384   0.2 100.0                           
 10     0.000232   0.0 100.0                           
 11     3.5e-05    0.0 100.0                           
        -------- -----                                 
 Total: 2.499841 100.0                                 

Multiple correspondence analysis

Multiple correspondence analysis is a methodology for establishing the association between multiple discrete categorical or qualitative variables. This makes it different from simple correspondence analysis, which accounts for the association between only two categorical variables. It is a compelling statistical tool for allocating scores to subjects and categories across multiple categorical variables. Multiple correspondence analysis is characterized by the optimal scaling of categorical variables; it can be considered the categorical equivalent of PCA, a form of non-linear principal component analysis, and can also be seen as a form of multidimensional scaling for categorical data. It has been adopted by many academic fields for analyzing large amounts of survey data.

This technique captures the association between two or more categorical variables, and the data can be represented graphically in a highly informative and intuitive way. One of the distinctive features of correspondence analysis is the number of ways in which its basic simultaneous equations can be derived; these equations are related to Pearson's chi-squared statistic and to the different methods of quantification. The association between the variables of a two-way contingency table may be considered a special case of multiple correspondence analysis.

The scaling in multiple correspondence analyses can be performed by the following methods:

  • Generalized SVD
  • Least-squares algorithms and alternating least squares (see the homals sketch below)
  • Eigendecomposition
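The alternating least squares route is implemented, for instance, by the homals package; a rough sketch using its bundled galo school data (the active argument marks which variables enter the solution):

library(homals)
data(galo)
# four active categorical variables; the fifth (School) is kept passive
fit <- homals(galo, active = c(rep(TRUE, 4), FALSE))
fit   # prints the eigenvalues of the two-dimensional solution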

There are different variations of multiple correspondence analysis:

  • Stacking and concatenation
  • Joint correspondence analysis
  • Ordered multiple correspondence analysis

MCA functions               Package
mca()                       MASS
MCA()                       FactoMineR
dudi.acm()                  ade4
homals()                    homals
mjca()                      ca
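None of these functions needs much ceremony; as a quick sketch, MCA() from FactoMineR applied to the first few categorical columns of its bundled tea survey data looks as follows:

library(FactoMineR)
data(tea)                          # tea consumption survey, mostly factors
res_mca <- MCA(tea[, 1:6], graph = FALSE)
res_mca$eig                        # eigenvalues and percentages of variance
plot(res_mca, invisible = "ind")   # category map without individual points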

Singular value decomposition

Singular value decomposition (SVD) is a dimensionality reduction technique that gained a lot of popularity after the famous Netflix movie recommendation challenge. Since its inception, it has found use in many applications in statistics, mathematics, and signal processing.

It is primarily a technique to factorize any matrix, real or complex. A rectangular matrix is factorized into two orthonormal matrices and a diagonal matrix of non-negative real values. An m*n matrix can be viewed as m points in n-dimensional space; SVD attempts to find the best k-dimensional subspace that fits the data:

A = U * Σ * V^T, where U (m*k) and V (n*k) have orthonormal columns and Σ is a k*k diagonal matrix of non-negative singular values
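The factorization itself is a single call in base R; a quick sketch on a random matrix:

set.seed(42)
m <- matrix(rnorm(20), nrow = 5)             # a 5 x 4 example matrix
s <- svd(m)
s$d                                          # singular values
all.equal(m, s$u %*% diag(s$d) %*% t(s$v))   # exact reconstruction: TRUE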

SVD in R can be used to compute approximations of the singular values and singular vectors of large-scale data matrices. These approximations are made using memory-efficient algorithms, one of which is IRLBA, named after the implicitly restarted Lanczos bidiagonalization algorithm. We shall be using the irlba package here in order to implement SVD.

Implementation of SVD using R

The following code will show the implementation of SVD using R:

# List of packages for the session
packages = c("foreach", "doParallel", "irlba")
# Install CRAN packages (if not already installed)
inst <- packages %in% installed.packages()
if(length(packages[!inst]) > 0) install.packages(packages[!inst])
# Load packages into session 
lapply(packages, require, character.only=TRUE)
# register the parallel backend for the session
registerDoParallel(cores=detectCores(all.tests=TRUE))
std_svd <- function(x, k, p=25, iter=1) {
  m1 <- as.matrix(x)
  r <- nrow(m1)
  c <- ncol(m1)
  p <- min(min(r,c)-k, p)   # oversampling, capped by the matrix size
  z <- k+p
  m2 <- matrix(rnorm(z*c), nrow=c, ncol=z)  # random test matrix
  y <- m1 %*% m2
  q <- qr.Q(qr(y))          # orthonormal basis for the range of m1
  b <- t(q) %*% m1
  # power iterations; each step depends on the previous one,
  # so this loop must run sequentially rather than via %dopar%
  for(i in seq_len(iter)) {
    y1 <- m1 %*% t(b)
    q <- qr.Q(qr(y1))
    b <- t(q) %*% m1
  }
  b2 <- b %*% t(b)
  eigens <- eigen(b2, symmetric=TRUE)
  result <- list()
  result$svalues <- sqrt(eigens$values)[1:k]
  result$u <- (q %*% eigens$vectors)[,1:k]
  # right singular vectors: scale by 1/singular values, not 1/eigenvalues
  result$v <- (t(b) %*% eigens$vectors %*% diag(1/sqrt(eigens$values)))[,1:k]
  return(result)
}
svd <- std_svd(x=data, k=5)  # data is assumed to be a 1000 x 10 numeric matrix
# singular values
svd$svalues
[1] 35.37645 33.76244 32.93265 32.72369 31.46702

We obtain the following values when running SVD with the irlba() function (the IRLBA algorithm):

  • d: Approximate singular values
  • u: The nu approximate left singular vectors
  • v: The nv approximate right singular vectors
  • iter: The number of IRLBA iterations performed
  • mprod: The number of matrix-vector products performed

These values can be used to obtain the results of the SVD and to understand overall statistics about how the algorithm performed.
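These fields are returned when irlba() is called directly; a minimal sketch, assuming the same data matrix used above:

library(irlba)
s <- irlba(as.matrix(data), nu = 5, nv = 5)
s$d       # approximate singular values
s$iter    # number of IRLBA iterations performed
s$mprod   # number of matrix-vector products performed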

Latent factors

# svd$u, svd$v
dim(svd$u)  #u value after running IRLBA
[1] 1000    5
dim(svd$v)  #v value after running IRLBA
[1] 10  5
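As a sketch of how the latent factors combine, multiplying them back together yields a rank-5 approximation of the input matrix:

approx5 <- svd$u %*% diag(svd$svalues) %*% t(svd$v)
dim(approx5)   # same shape as the original 1000 x 10 matrix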

A modified version of the previous function can be achieved by altering the power iterations for a more numerically robust implementation:

# power iterations with Cholesky-based reorthogonalization;
# again sequential, since each step refines b
for(i in seq_len(iter)) {
  y1 <- m1 %*% t(b)
  y2 <- t(y1) %*% y1
  r2 <- chol(y2)
  q1 <- y1 %*% solve(r2)  # orthonormalize y1 via the Cholesky factor
  b <- t(q1) %*% m1
}
b2 <- b %*% t(b)

Some other functions available in R packages are as follows:

Functions      Package
svd()          base
irlba()        irlba
svdImpute()    pcaMethods

ISOMAP – moving toward non-linearity

ISOMAP is a nonlinear dimensionality reduction method and is representative of isometric mapping methods; it is one of the approaches to manifold learning. ISOMAP finds a map that preserves the global, nonlinear geometry of the data by preserving the geodesic inter-point distances along the manifold. Like multidimensional scaling, ISOMAP creates a visual presentation of the distances among a number of objects. A geodesic is the shortest curve along the manifold connecting two points, induced by a neighborhood graph. Multidimensional scaling uses the Euclidean distance measure; because manifold data is nonlinear, ISOMAP uses the geodesic distance instead. ISOMAP can thus be viewed as an extension of metric multidimensional scaling.
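The geodesic step on its own is available as isomapdist() in the vegan package; a small sketch on toy data (fragmentedOK keeps the largest connected component if the k-nearest-neighbor graph splits):

library(vegan)
set.seed(1)
xy <- matrix(rnorm(200), ncol = 2)                      # 100 random 2-D points
d_euc <- dist(xy)                                       # plain Euclidean distances
d_geo <- isomapdist(d_euc, k = 5, fragmentedOK = TRUE)  # geodesic distances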

At a very high level, ISOMAP can be described in four steps:

  1. Determine the neighbor of each point
  2. Construct a neighborhood graph
  3. Compute the shortest distance path between all pairs
  4. Construct k-dimensional coordinate vectors by applying MDS

Geodesic distance approximation is calculated in three ways:

  • Neighboring points: Input-space distance
  • Faraway points: A sequence of short hops between neighboring points
  • Method: Finding shortest paths in a graph with edges connecting neighboring data points
The RDRToolbox package from Bioconductor provides an ISOMAP implementation along with Swiss roll demo data:

source("http://bioconductor.org/biocLite.R")
biocLite("RDRToolbox")
library('RDRToolbox')
library(rgl) # provides open3d() and plot3d()
swiss_Data = SwissRoll(N = 1000, Plot = TRUE)
open3d()
plot3d(swiss_Data, col = rainbow(1050)[-c(1:50)], box = FALSE, type = "s", size = 1)
simData_Iso = Isomap(data=swiss_Data, dims=1:10, k=10,plotResiduals=TRUE)
library(vegan)
data(BCI)
distance <- vegdist(BCI)
tree <- spantree(distance)
pl1 <- ordiplot(cmdscale(distance), main="cmdscale")
lines(tree, pl1, col="red")
z <- isomap(distance, k=3)
rgl.isomap(z, size=4, color="red")
pl2 <- plot(isomap(distance, epsilon=0.5), main="isomap epsilon=0.5")
pl3 <- plot(isomap(distance, k=5), main="isomap k=5")
pl4 <- plot(z, main="isomap k=3")