Latent Semantic Analysis (LSA) is a modeling technique for understanding a given collection of documents. It provides insight into the relationships between words in the documents, uncovers the hidden structure in the document contents, and produces a set of suitable topics, where each topic captures part of the variation in the data and helps explain the context of the corpus. This technique comes in handy in a variety of natural language processing and information retrieval tasks: LSA can filter out noisy features, represent the data in a simpler form, and discover topics with high affinity.
The topics extracted from the collection of documents are obtained through singular value decomposition (SVD), which allows us to derive a low-dimensional vector from a high-dimensional vector with minimal loss of precision.
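To make the SVD step concrete, here is a minimal base-R sketch; the matrix and the number of retained dimensions are invented purely for illustration:

```r
# Hypothetical 6-term x 5-document matrix of word counts
set.seed(42)
M <- matrix(rpois(30, lambda = 2), nrow = 6, ncol = 5)

# Full singular value decomposition: M = U %*% diag(d) %*% t(V)
s <- svd(M)

# Keep only the top k = 2 singular triples for a low-rank approximation
k <- 2
M_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

# M_k has the same shape as M but rank at most 2: each document is now
# described by 2 latent dimensions instead of 6 raw term counts
dim(M_k)
```

Truncating to the dominant singular values is what lets LSA discard noisy features while preserving most of the structure in the data.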
We are going to use the lsa package for latent semantic analysis. The package supports methods for dimensionality calculation, term weighting, triple binding, and correlation measurement. It provides a high-level abstraction over the core API through the lsa(), fold_in(), as.textmatrix(), query(), and textmatrix() functions.
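As a quick orientation to these functions, the following sketch builds a tiny LSA space from a hand-made term-by-document matrix and folds a query into it; the matrix, term names, and query string are invented for illustration:

```r
library(lsa)

# Hypothetical 3-term x 3-document count matrix
tdm <- matrix(c(1, 0, 1,
                0, 1, 1,
                1, 1, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("stock", "estate", "invest"),
                              c("d1", "d2", "d3")))

space <- lsa(tdm, dims = 2)                # build a 2-dimensional latent space
q <- query("stock invest", rownames(tdm))  # map a query string onto the term list
q_lsa <- fold_in(q, space)                 # project the query into the existing space
```

The recipe below only needs lsa() and as.textmatrix(), but query() and fold_in() are handy when you want to match new queries or documents against an already-computed space.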
install.packages("ggplot2")
install.packages("lsa")
install.packages("tm")
library(tm)
library(ggplot2)
library(lsa)
InvestingMantra <- c(
  "Little Guide to Stock Market Investing.",
  "Investing For Dummies, 4th Edition.",
  "Guarantee Your Stock Market Returns.",
  "The Little Book of Value Investing.",
  "Value of Investing: From Paul",
  "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
  "Investing in Real Estate, 5th Edition.",
  "Stock Investing For Dummies.",
  "The ABC's of Real Estate Investing"
)
Create a factor labeling each document, as follows:

view <- factor(rep(c("T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8", "T9"), each = 1))
Build a DataFrame from the character vectors:

IM_DataFrame <- data.frame(InvestingMantra, view, stringsAsFactors = FALSE)
Create a Corpus object and inspect it:

investingMantra_corpus <- Corpus(VectorSource(IM_DataFrame$InvestingMantra))
inspect(investingMantra_corpus)
investingMantra_corpus <- tm_map(investingMantra_corpus, content_transformer(tolower))
investingMantra_corpus <- tm_map(investingMantra_corpus, removePunctuation)
investingMantra_corpus <- tm_map(investingMantra_corpus, removeNumbers)
investingMantra_corpus <- tm_map(investingMantra_corpus, removeWords, stopwords("english"))
investingMantra_corpus <- tm_map(investingMantra_corpus, stemDocument, language = "english")
Inspect the corpus after cleansing:

inspect(investingMantra_corpus)
Create a TermDocumentMatrix and convert it to a plain matrix:

investingMantra_TD_Matrix <- as.matrix(TermDocumentMatrix(investingMantra_corpus))
Weight the matrix using local binary term frequency and global inverse document frequency:

investingMantra_TD_Matrix.lsa <- lw_bintf(investingMantra_TD_Matrix) * gw_idf(investingMantra_TD_Matrix)
Compute the latent semantic space, lsaSpace:

lsaSpace <- lsa(investingMantra_TD_Matrix.lsa)
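The object returned by lsa() stores the truncated SVD triple: tk (term vectors), dk (document vectors), and sk (singular values), and as.textmatrix() multiplies them back together. A self-contained sketch with a toy matrix (all values invented for illustration):

```r
library(lsa)

# Toy 4-term x 3-document matrix
tdm <- matrix(c(2, 0, 1,
                0, 3, 1,
                1, 1, 0,
                0, 1, 2),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("term", 1:4), paste0("doc", 1:3)))

space <- lsa(tdm, dims = 2)

# Reconstruct the reduced-rank matrix from the triple by hand ...
approx <- space$tk %*% diag(space$sk) %*% t(space$dk)

# ... which matches what as.textmatrix() returns
max(abs(approx - as.textmatrix(space)))   # should be numerically negligible
```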
Calculate the distance matrix in the LSA space:

distance_matrix_lsa <- dist(t(as.textmatrix(lsaSpace)))
distance_matrix_lsa

Finally, scale the distances down to two dimensions with cmdscale() and plot the documents:

fit <- cmdscale(distance_matrix_lsa, eig = TRUE, k = 2)
points <- data.frame(x = fit$points[, 1], y = fit$points[, 2])
ggplot(points, aes(x = x, y = y)) +
  geom_point(aes(color = IM_DataFrame$view), size = 5) +
  geom_text(aes(y = y - 0.2, label = row.names(IM_DataFrame)))