Latent semantic analysis

Latent Semantic Analysis (LSA) is a modeling technique for understanding a collection of documents. It provides insight into the relationships between the words in those documents, uncovers hidden structure in their contents, and produces a set of topics, each of which captures part of the variation in the data and helps explain the context of the corpus. The technique is useful in a variety of natural language processing and information retrieval tasks: LSA filters out noisy features, represents the data in a simpler form, and discovers topics with high affinity.

The topics that are extracted from the collection of documents have the following properties:

  • The degree of similarity between each topic and each document in the corpus.
  • The degree of similarity between each topic and each term in the corpus.
  • A significance score that indicates how important each topic is and how much of the variance in the data set it explains.

LSA uses a linear algebra technique called singular value decomposition (SVD) to deduce a lower-dimensional representation of vectors from a high-dimensional space. The input to the LSA model is a term-document matrix generated from the corpus using word frequencies: each row corresponds to a term and each column to a document. SVD then factorizes this matrix into three matrices: the first relates terms to topics, the second contains the importance (singular value) of each topic, and the third relates topics to documents.

SVD allows us to obtain a low-dimensional vector from a high-dimensional vector with minimal loss of precision.
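To make this concrete, here is a minimal sketch of the idea in Python with NumPy (the term-document counts below are made up purely for illustration, not drawn from the book's corpus): truncating the SVD to the k largest singular values gives the best rank-k approximation of the matrix, so most of the structure survives the dimensionality reduction.

```python
import numpy as np

# A toy 5-term x 4-document count matrix (rows = terms, columns = documents).
# The vocabulary and counts are hypothetical, chosen only to show the mechanics.
X = np.array([
    [2, 0, 1, 0],   # "stock"
    [1, 0, 2, 0],   # "market"
    [0, 3, 0, 1],   # "estate"
    [0, 1, 0, 2],   # "real"
    [1, 1, 1, 1],   # "investing"
], dtype=float)

# Full SVD: X = U @ diag(s) @ Vt
# U relates terms to topics, s holds topic importances, Vt relates topics to documents.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values -> rank-k "latent semantic space".
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# By the Eckart-Young theorem, X_k is the best rank-k approximation of X
# in the least-squares sense, so the relative reconstruction error is small.
err = np.linalg.norm(X - X_k) / np.linalg.norm(X)
print(round(err, 3))
```

Even with only two of the four singular values retained, the reconstruction error stays modest, which is exactly the "minimal loss of precision" property LSA relies on.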

R package for latent semantic analysis

We are going to use the lsa package for latent semantic analysis. The package supports methods for dimensionality calculation, term weighting, triple binding, and correlation measurement. It provides a high-level abstraction over its core functions: lsa(), fold_in(), as.textmatrix(), query(), and textmatrix().
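One of those operations, folding a new document into an existing latent semantic space (the idea behind the package's fold_in()), can be sketched in a few lines of Python with NumPy. The matrix, vocabulary, and variable names below are hypothetical illustrations of the underlying math, not the lsa package's API:

```python
import numpy as np

# Existing term-document matrix (4 terms, 3 documents); counts are made up.
X = np.array([[2., 0., 1.],
              [1., 0., 2.],
              [0., 3., 0.],
              [0., 1., 0.]])

# Build the latent semantic space: X = U @ diag(s) @ Vt, truncated to k topics.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]

# A new document expressed as raw term counts over the same vocabulary.
d_new = np.array([1., 1., 0., 0.])

# Fold it in: project onto the topic space via d_hat = S_k^-1 @ U_k^T @ d.
d_hat = np.diag(1.0 / s_k) @ U_k.T @ d_new
print(d_hat.shape)
```

The folded-in vector d_hat lives in the same k-dimensional topic space as the original documents, so it can be compared against them (e.g., by cosine similarity) without recomputing the SVD.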

Illustrative example of LSA

  1. First, we will install the required libraries:
    install.packages("ggplot2")
    install.packages("lsa")
    install.packages("tm")
  2. Next, we will load the libraries:
    library(tm)
    library(ggplot2)
    library(lsa)
  3. We need some data to apply LSA to, so let's create a corpus; I have selected nine titles of books related to investing:
    InvestingMantra <- c(
      "Little Guide to Stock Market Investing.",
      "Investing For Dummies, 4th Edition.",
      "Guarantee Your Stock Market Returns.",
      "The Little Book of Value Investing.",
      "Value of Investing: From Paul",
      "Rich Dad's Guide to Investing: What the Rich Invest in, That the Poor and the Middle Class Do Not!",
      "Investing in Real Estate, 5th Edition.",
      "Stock Investing For Dummies.",
      "The ABC's of Real Estate Investing"
    )
  4. Let's create an index for each book title, which will help us when plotting graphs; we create a factor as follows:
    view <- factor(c("T1", "T2", "T3", "T4", "T5", "T6", "T7", "T8", "T9"))
  5. We will create a data frame from the character vectors:
    IM_DataFrame <- data.frame(InvestingMantra, view, stringsAsFactors = FALSE)
  6. For the purpose of analysis, we need to convert the data frame into a Corpus object:
    investingMantra_corpus <- Corpus(VectorSource(IM_DataFrame$InvestingMantra))

    inspect(investingMantra_corpus)
  7. We now need to pre-process the corpus: convert the text to lower case, remove punctuation characters, remove numbers, remove stop words, and perform stemming:
    investingMantra_corpus <- tm_map(investingMantra_corpus, content_transformer(tolower))
    investingMantra_corpus <- tm_map(investingMantra_corpus, removePunctuation)
    investingMantra_corpus <- tm_map(investingMantra_corpus, removeNumbers)
    investingMantra_corpus <- tm_map(investingMantra_corpus, removeWords, stopwords("english"))
    investingMantra_corpus <- tm_map(investingMantra_corpus, stemDocument, language = "english")
  8. Inspect the corpus after cleansing:
    inspect(investingMantra_corpus)
  9. Create the term-document matrix:
    investingMantra_TD_Matrix <- as.matrix(TermDocumentMatrix(investingMantra_corpus))
  10. Calculate a weighted term-document matrix according to the chosen local and global weighting schemes (here, binary term frequency weighted by inverse document frequency):
    investingMantra_TD_Matrix.lsa <- lw_bintf(investingMantra_TD_Matrix) * gw_idf(investingMantra_TD_Matrix)
  11. Calculate the latent semantic space for the given term-document matrix and create lsaSpace:
    lsaSpace <- lsa(investingMantra_TD_Matrix.lsa)
  12. Compute the distance matrix:
    distance_matrix_lsa <- dist(t(as.textmatrix(lsaSpace)))
    distance_matrix_lsa
  13. Let's project the distance matrix into two dimensions using classical multidimensional scaling (cmdscale) and plot it:
    fit <- cmdscale(distance_matrix_lsa, eig = TRUE, k = 2)
    points <- data.frame(x = fit$points[, 1], y = fit$points[, 2])
    ggplot(points, aes(x = x, y = y)) +
      geom_point(aes(color = IM_DataFrame$view), size = 5) +
      geom_text(aes(y = y - 0.2, label = row.names(IM_DataFrame)))
  14. The following figure shows the plot of the titles; we can clearly see the clusters:
    • T1 and T3 are about the stock market
    • T7 and T9 are about real estate
    • T2, T4, T5, and T8 are about investing in general
    • T6 is an outlier