High-dimensional unstructured data comes with the great trouble of organizing, querying, and retrieving information. If we can learn how to extract the latent thematic structure of a text document, or a collection of such documents, we can harness a wealth of information that would not have been accessible without advances in natural language processing. In this chapter, we will learn about topic modeling and text summarization. We will learn how to extract hidden themes from documents and collections so that we can use them effectively for purposes such as corpus summarization, document organization, document classification, taxonomy generation for web documents, organizing search engine query results, news or article recommendation systems, and duplicate content detection. We will also discuss an interesting application of probabilistic language models in sentence completion.
Topic models can be used for discovering the underlying themes or topics that are present in an unstructured collection of documents. The collection can then be organized by the discovered topics, so that users can easily browse documents on topics of their interest. There are various topic modeling algorithms that can be applied to a collection of documents to achieve this. Clustering is a very useful technique for grouping documents, but it doesn't always fit the requirements. When we cluster text documents, each document ends up belonging to exactly one cluster. Let's consider this scenario: we have a book called Text Mining with R Programming Language. Should this book be grouped with R programming-related books, or with text mining-related books? The book is about R programming as well as text mining, and thus should be listed in both sections. In this section, we will learn methods that do not cluster documents into completely separate groups, but instead allow each document to refer to several topics. These topics are identified automatically from a collection of text documents. The field of machine learning that deals with these problems is called topic modeling. What does a document signify? What are the themes that can be ascribed to it? A collection of social media articles can relate to multiple different themes, such as business, fashion, and sports. When given the task of organizing or querying a large corpus of documents, topic modeling is a tool that comes in handy.
Let's briefly discuss the two common topic models:
Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling methods and belongs to the class of generative models. The assumption it encodes about a document or collection is that latent themes are present in every document, and that each word in the document contributes to one of those themes or topics. If we can effectively group documents with similar underlying topics and themes, we can address otherwise hard problems in searching, organizing, and summarizing huge archives of unstructured data. We uncover the latent topics pervading the collection, annotate the documents according to the topics discovered, and use those annotations to extract context, summarize, or organize the collection. The idea behind LDA is that we assume a fixed number of topics distributed over the documents in the whole collection.
Each document is an amalgamation of multiple topics across the corpus; each topic is an assortment of thousands of words, while each word is an entity that contributes to the theme of the document. Still, we can only observe a document as a whole; everything else is hidden. Probabilistic topic models aim to dissect documents to extract those latent features, which can help summarize a document or organize a group of them.
Topic modeling adopts a three-pronged strategy to tackle the complexity of extracting themes from a collection of documents.
The topic label assigned to an observed word depends on a posterior distribution, which takes into account the topic and proportion parameters, the assignment of topics to each word in a document, and the assignment of topics to each document in the corpus. A topic can be regarded as a probability distribution over a fixed vocabulary of words, and a topic model is nothing but a probabilistic relationship between the latent, unobserved themes and the observed linguistic variables. LDA embodies this process: the model randomly generates observable data values based on some hidden parameters, and in this generative process we estimate the joint probability distribution of all the variables. We calculate probability weights for words and create the topics based on the weight of each word; each topic assigns different weights to different words. For this model, the order of the words does not matter, as it treats each document as a bag of words. This assumption may not be ideal, since the sequence of the words in each sentence is lost. The order of the documents does not matter either. This kind of simplification is rudimentary, yet it often works because it still helps us understand the semantics of the topics; knowing which words were used in a document, and with what frequencies, is good enough to decide which topics it belongs to.
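To make the generative process concrete, here is a minimal sketch in base R that generates one toy document the way LDA assumes documents are generated. The vocabulary, number of topics, and Dirichlet parameters are all hypothetical values chosen for illustration:

```r
# A minimal sketch of LDA's generative process with a toy vocabulary
# (all values here are illustrative assumptions, not fitted parameters).
set.seed(42)
vocab   <- c("game", "team", "score", "market", "stock", "trade")
k       <- 2    # number of topics
n_words <- 10   # words per document

# Draw from a Dirichlet distribution via normalized gamma draws
rdirichlet <- function(alpha) { g <- rgamma(length(alpha), alpha); g / sum(g) }

# Each topic is a probability distribution over the vocabulary
beta <- t(replicate(k, rdirichlet(rep(0.5, length(vocab)))))

# Generate one document:
theta <- rdirichlet(rep(1, k))                          # 1. draw topic proportions
z     <- sample(k, n_words, replace = TRUE, prob = theta)  # 2. draw a topic per word
doc   <- vocab[sapply(z, function(t)                    # 3. draw each word from
  sample(length(vocab), 1, prob = beta[t, ]))]          #    its topic's distribution
doc
```

Fitting LDA is the inverse problem: given only `doc` (for many documents), infer `theta`, `z`, and `beta`.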
```r
install.packages("topicmodels")
install.packages("tm")

library(topicmodels)
library(tm)
```
To understand topic modeling, we will be using the NYTimes dataset, which contains headlines from New York Times articles. The dataset is arranged in five columns, Article_ID, Date, Title, Subject, and Topic.Code, and has 3,104 rows of data. We will choose a random sample of 500 articles for the purposes of our example:
```r
data(NYTimes)
```
Let's use the sample function from the R base package; this function takes a sample of the specified size from the input elements, either with or without replacement. After execution of the following statement, NYTimesData will hold a data frame of 500 random samples:

```r
NYTimesData <- NYTimes[sample(1:3104, size = 500, replace = FALSE), ]
```
```r
head(NYTimesData, 5)
Article_ID  Date       Title                                                                       Subject                                                           Topic.Code
28096       10-May-01  Horse Racing; Mystery Illness Hurts Horse Breeding in Kentucky              fungus believed to be cause of stillborn foal births in Kentucky  4
42085       1-Apr-96   POLITICS: IN CONGRESS; The Speaker's Gruff No. 2 Takes Charge in the House  dick armey                                                        20
```
Next, combine the Title and Subject columns into a single text matrix:

```r
Textdata <- cbind(as.vector(NYTimesData$Title), as.vector(NYTimesData$Subject))
```
Then, collapse the title and subject of each article into a single string using the apply function:

```r
TestData <- apply(as.matrix(Textdata), 1, paste, collapse = " ")
```
Next, convert the text to UTF8 character encoding:

```r
TestData <- sapply(as.vector(TestData, mode = "character"), iconv, to = "UTF8", sub = "byte")
```
Now, let's create a Corpus from the data so that we can run various pre-processing steps on it:

```r
corpus <- Corpus(VectorSource(TestData), readerControl = list(language = 'english'))
```
Next, define the list of pre-processing options to be applied while building the document-term matrix (note that tm's stemming option is the logical `stemming` flag):

```r
control <- list(bounds = list(local = c(1, Inf)),
                language = 'english',
                tolower = TRUE,
                removeNumbers = TRUE,
                removePunctuation = TRUE,
                stopwords = TRUE,
                stripWhitespace = TRUE,
                stemming = TRUE,
                wordLengths = c(3, 20),
                weighting = weightTf)
```
Now, we will create the DocumentTermMatrix and pass our list of pre-processing actions to be performed on the corpus. We can also remove the sparse terms from the matrix if need be:

```r
matrix <- DocumentTermMatrix(corpus, control = control)
```
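If the matrix turns out too sparse, tm's `removeSparseTerms()` can drop rare terms. The 0.99 threshold below is an illustrative choice, keeping only terms that appear in at least 1% of documents:

```r
# Optional: shrink the document-term matrix by dropping very sparse
# terms; 0.99 is a hypothetical threshold chosen for illustration.
matrix <- removeSparseTerms(matrix, 0.99)
```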
For LDA, the number of topics must be fixed before modeling; we have to identify the number of topics in our dataset. The NYTimes dataset has already been classified, so we can simply use the unique() function to determine the number of unique topics:

```r
numberOfTopics <- length(unique(NYTimesData$Topic.Code))
```
Now, let's generate the LDA model based on the two inputs, numberOfTopics and the DocumentTermMatrix:

```r
lda <- LDA(matrix, numberOfTopics)
terms(lda)
```
```
  Topic 1    Topic 2    Topic 3    Topic 4    Topic 5    Topic 6    Topic 7    Topic 8    Topic 9     Topic 10   Topic 11   Topic 12
  "end"      "program"  "crisis"   "kerry"    "budget"   "scandal"  "bush"     "court"    "president" "schools"  "suicide"  "iraq"
  Topic 13   Topic 14   Topic 15   Topic 16   Topic 17   Topic 18   Topic 19   Topic 20   Topic 21    Topic 22   Topic 23   Topic 24
  "special"  "new"      "bill"     "census"   "sars"     "time"     "war"      "microsoft" "day"      "iraq"     "bush"     "oil"
  Topic 25   Topic 26   Topic 27
  "campaign" "bush"     "ground"
```
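Beyond the top term per topic, the fitted model can be inspected further. The topicmodels package exposes `topics()` for the most likely topic of each document, `terms()` with a count argument for more terms per topic, and `posterior()` for the full probability matrices (exact values will differ from run to run, since the sample and the fit are random):

```r
# Inspect the fitted LDA model (assumes 'lda' from the previous step).
topics(lda)[1:5]   # most likely topic for the first five documents
terms(lda, 5)      # top five terms for every topic

post <- posterior(lda)
dim(post$topics)   # documents x topics probability matrix
dim(post$terms)    # topics x vocabulary probability matrix
```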
As the name of the model suggests, correlated topic modeling helps us understand the correlation between the hidden topics in a collection of documents. This technique is very useful when we have to understand the relationships between topics, build graphs over them, or build a topic or document browser. Such a browser helps the user navigate a collection of documents by their topics of interest or preference, and eases the experience of finding the right content in a huge set of documents. LDA sets out the basic principles of topic modeling, and correlated topic modeling is an extension that builds upon the LDA model. As explained in the previous section, an LDA model does not take into account the order in which words occur; the words are treated as exchangeable. LDA also assumes that the occurrence of one topic is uncorrelated with any other, so it fails to directly model correlation between topics, whereas the Correlated Topic Model (CTM) is a hierarchical model that does. A CTM provides better insights into the data, which helps with visualization. It models the words of each document from document-specific random variables, and can capture the diversity in grouped data that exhibits multiple hidden patterns. A CTM usually gives better predictive performance, but at the expense of extra computational cost.
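Since `CTM()` takes the same kind of inputs as `LDA()`, a CTM can be fitted on the document-term matrix built earlier. This is a sketch reusing `matrix` and `numberOfTopics` from the LDA example (note that CTM fitting is noticeably slower than LDA on the same data):

```r
# Fit a correlated topic model on the same document-term matrix
# used for LDA earlier in this section.
ctm <- CTM(matrix, numberOfTopics)
terms(ctm)     # top term per topic, as with the LDA fit
topics(ctm)[1:5]
```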
For both models, LDA and CTM, the number of topics has to be fixed before modeling a corpus, and both follow a generative process. For LDA, the steps for each document can be summarized as follows:

1. Draw the topic proportions of the document from a Dirichlet distribution.
2. For each word position in the document, draw a topic assignment from those topic proportions.
3. Draw the word itself from the word distribution of the assigned topic.

In a CTM, the topic proportions are instead drawn from a logistic normal distribution, whose covariance matrix captures the correlations between topics.
Selecting the number of topics is a tricky problem; to fit a given document-term matrix using the LDA model or the CTM, the number of topics needs to be fixed beforehand. There are various approaches to selecting the number of topics.
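One common approach, sketched here, is to fit models for several candidate values of k and compare perplexity using the package's `perplexity()` function (lower is better). The candidate values are arbitrary for illustration, and in practice the perplexity should be evaluated on a held-out document-term matrix rather than the training data:

```r
# Compare candidate topic counts by perplexity (assumes 'matrix'
# from the earlier example; candidate values are illustrative).
candidates <- c(10, 20, 30)
perps <- sapply(candidates, function(k) {
  fit <- LDA(matrix, k)
  perplexity(fit, matrix)  # ideally, pass a held-out matrix here
})
candidates[which.min(perps)]  # candidate with the lowest perplexity
```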
We use the topicmodels package for topic modeling. This package provides an interface to the C code for LDA models and the CTM, and to C++ code for fitting LDA models using Gibbs sampling.
The main functions in the topicmodels package for fitting the LDA and CTM models are LDA() and CTM(), respectively. The two functions take the same arguments, as shown in the following code:

```r
LDA(x, k, method = "VEM", control = NULL, model = NULL, ...)
CTM(x, k, method = "VEM", control = NULL, model = NULL, ...)
```
The following arguments are possible for the control list:
```r
control_LDA_VEM <- list(estimate.alpha = TRUE,
                        alpha = 50/k,
                        estimate.beta = TRUE,
                        verbose = 0,
                        prefix = tempfile(),
                        save = 0,
                        keep = 0,
                        seed = as.integer(Sys.time()),
                        nstart = 1,
                        best = TRUE,
                        var = list(iter.max = 500, tol = 10^-6),
                        em = list(iter.max = 1000, tol = 10^-4),
                        initialize = "random")
```
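The control list is then passed through the `control` argument of `LDA()`. A brief sketch, assuming `matrix` and `numberOfTopics` are defined as in the earlier example (note that `k` must be assigned before building the list, since `alpha = 50/k` refers to it):

```r
# Fit LDA with explicit VEM control settings (reuses the control
# list defined above; k must exist before the list is created).
k <- numberOfTopics
lda_vem <- LDA(matrix, k, method = "VEM", control = control_LDA_VEM)
```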
For more information about function parameters, refer to the package documentation.