Chapter 5. Text Summarization and Clustering

High-dimensional, unstructured data is difficult to organize, query, and retrieve information from. If we can learn how to extract the latent thematic structure of a text document, or of a collection of such documents, we can harness a wealth of information that would not be retrievable without recent advances in natural language processing. In this chapter, we will learn about topic modeling and text summarization. We will learn how to extract hidden themes from documents and collections, and how to use them for purposes such as corpus summarization, document organization and classification, taxonomy generation for web documents, organizing search engine query results, news and article recommendation, and duplicate content detection. We will also discuss an interesting application of probabilistic language models: sentence completion. This chapter covers the following topics:

  • Topic modeling
  • Latent semantic analysis
  • Machine learning-based text summarization
  • Text clustering
  • Sentence completion

Topic modeling

Topic models can be used to discover the underlying themes or topics present in an unstructured collection of documents. The collection can then be organized according to the discovered topics, so that users can easily browse the documents related to topics of their interest. Various topic modeling algorithms can be applied to a collection of documents to achieve this. Clustering is a very useful technique for grouping documents, but it does not always fit the requirements: when we cluster text documents, each document ends up belonging to exactly one cluster. Consider this scenario: we have a book called Text Mining with R Programming Language. Should this book be grouped with R programming-related books, or with text mining-related books? The book is about both R programming and text mining, and should therefore be listed in both sections. In this section, we will learn about methods that do not partition documents into disjoint groups, but instead allow each document to refer to several topics, which are identified automatically from the collection of text documents. The field of machine learning that deals with these problems is called topic modeling. What does a document signify? What themes can be ascribed to it? A collection of social media articles can relate to several different themes, such as business, fashion, and sports. Whenever we are given the task of organizing or querying a large collection of documents, topic modeling is a tool that comes in handy.

Let's briefly discuss the two common topic models:

  • Latent Dirichlet Allocation (LDA)
  • Correlated Topic Model (CTM)

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling methods and belongs to the class of generative models. The assumption it encodes is that latent themes are present in every document and that each word in a document contributes to one of those themes. If we can effectively group documents with similar underlying topics and themes, we can greatly simplify the otherwise difficult tasks of searching, organizing, and summarizing huge archives of unstructured data. We uncover the latent topics pervading the collection, annotate the documents according to the discovered topics, and use those annotations to extract context, summarize, or organize the collection. The idea behind LDA is that we assume a fixed number of topics distributed over the documents of the whole collection:

Figure: Latent Dirichlet Allocation

Each document is an amalgamation of multiple topics across the corpus; each topic is an assortment of thousands of words, and each word is an entity that contributes to the theme of the document. Still, we can only observe the documents themselves; everything else is hidden. The objective of probabilistic topic models is to dissect the observed documents and extract these latent features, which can then help summarize a document or organize a group of them.

Topic modeling adopts a three-pronged strategy to tackle the complexity of extracting themes from the collection of documents:

  • Every word in each document is assigned a topic
  • The proportion of each unique topic is estimated for every document
  • For every corpus, the topic distribution is explored

The topic assigned to an observed word depends on a posterior distribution, which takes into account the topic and proportion parameters, the assignment of topics to each word in a document, and the assignment of topics to each document in the corpus. A topic can be thought of as a probability distribution over the vocabulary of words, so topic models are nothing but a probabilistic relationship between the latent, unobserved themes and the observed words. LDA embodies this generative process: the model randomly generates observable data based on hidden parameters, and we estimate the joint probability distribution of all the variables. We calculate probability weights for words and create the topics based on the weight of each word; each topic assigns different weights to different words.

For this model, the order of the words does not matter, since each document is treated as a bag of words. This assumption is clearly a simplification, because the sequence of words in each sentence is lost; the order of the documents does not matter either. This kind of simplification is rudimentary, yet it often works, because knowing which words were used in a document and how frequently they occur is usually enough to decide which topics the document belongs to.

Let's walk through fitting an LDA model in R. First, we install and load the required packages:

install.packages("topicmodels")
install.packages("tm")
library(topicmodels)
library(tm)

To understand topic modeling, we will use the NYTimes dataset, which contains headlines from New York Times articles. The dataset is arranged in five columns, Article_ID, Date, Title, Subject, and Topic.Code, and has 3,104 rows. We will work with a random sample of 500 articles for the purposes of our example:
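Note that the data() call below assumes that the package providing the NYTimes dataset has been installed and attached; in this walkthrough we assume it comes from the RTextTools package, which ships a NYTimes data frame of labeled headlines:

# Assumption: the NYTimes headline data is provided by the RTextTools package
library(RTextTools)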

data(NYTimes)

Let's use the sample function from the R base package; it takes a sample of the specified size from the input elements, either with or without replacement. After executing the following statement, NYTimesData will hold 500 randomly sampled rows in its data frame:

NYTimesData <- NYTimes[sample(1:3104,size=500,replace=FALSE),]
  1. If we inspect the data frame, we can see the two text columns, Title and Subject:
    head(NYTimesData, 5)
    Article_ID  Date       Title                                                                        Subject                                                            Topic.Code
    28096       10-May-01  Horse Racing; Mystery Illness Hurts Horse Breeding in Kentucky              fungus believed to be cause of stillborn foal births in Kentucky  4
    42085       1-Apr-96   POLITICS: IN CONGRESS; The Speaker's Gruff No. 2 Takes Charge in the House  dick armey                                                         20
  2. Let's combine the data from both the columns into a variable for processing:
    Textdata <- cbind(as.vector(NYTimesData$Title),as.vector(NYTimesData$Subject));
  3. Each row of Textdata now holds a title and a subject; let's paste them together into a single character string per article using an apply function:
    TestData <- apply(as.matrix(Textdata), 1, paste, collapse = " ")
  4. Here we convert the text data to UTF-8 character encoding:
    TestData <- sapply(as.vector(TestData, mode = "character"), iconv, to = "UTF8", sub = "byte")
  5. Now let's create a Corpus from the data so that we can run various pre-processing steps on it:
    corpus <- Corpus(VectorSource(TestData), readerControl = list(language = 'english'))
  6. As explained in the previous chapters, we will do various pre-processing steps on the text data before analyzing it as shown here:
    • Remove punctuation
    • Remove numbers
    • Remove stop words
    • Strip out white spaces
    • Stem words
control <- list(bounds = list(local = c(1, Inf)), language = 'english',
                tolower = TRUE, removeNumbers = TRUE, removePunctuation = TRUE,
                stopwords = TRUE, stripWhitespace = TRUE, stemming = TRUE,
                wordLengths = c(3, 20), weighting = weightTf)

Now, we will create the DocumentTermMatrix and pass it the list of pre-processing options to be applied to our corpus. We can also remove the sparse terms from the matrix if need be, as shown after the following code:

matrix <- DocumentTermMatrix(corpus, control = control)	
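For example, terms that occur in almost no documents can be dropped with removeSparseTerms() from the tm package; the 0.99 sparsity threshold here is an illustrative choice:

# Optionally drop terms that are absent from more than 99% of the documents
matrix <- removeSparseTerms(matrix, 0.99)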

For LDA, the number of topics must be fixed before modeling, so we first have to decide how many topics our dataset contains. The NYTimes dataset has already been classified, so we can simply use the unique() function to count the distinct topic codes:

numberOfTopics <- length(unique(NYTimesData$Topic.Code))

Now, let's generate the LDA model based on the two inputs, numberOfTopics and DocumentTermMatrix:

lda <- LDA(matrix, numberOfTopics)

terms(lda)

     Topic 1      Topic 2      Topic 3      Topic 4      Topic 5      Topic 6
       "end"    "program"     "crisis"      "kerry"     "budget"    "scandal"
     Topic 7      Topic 8      Topic 9     Topic 10     Topic 11     Topic 12
      "bush"      "court"  "president"    "schools"    "suicide"       "iraq"
    Topic 13     Topic 14     Topic 15     Topic 16     Topic 17     Topic 18
   "special"        "new"       "bill"     "census"       "sars"       "time"
    Topic 19     Topic 20     Topic 21     Topic 22     Topic 23     Topic 24
       "war"  "microsoft"        "day"       "iraq"       "bush"        "oil"
    Topic 25     Topic 26     Topic 27
  "campaign"       "bush"     "ground"

Correlated topic model

As the name of the model suggests, this way of modeling helps us understand the correlation between the hidden topics in a collection of documents. The technique is very useful when we need to understand the relationships between topics, build graphs of the topics, or build a topic or document browser. It helps users navigate a collection of documents by their topics of interest or preference, and eases the experience of finding the right content in a huge set of documents. LDA sets out the basic principles of topic modeling, and the correlated topic model is an extension that builds on it. As explained in the previous section, LDA does not take into account the order in which words occur; the words of a document are treated as exchangeable. LDA also assumes that the occurrence of one topic is uncorrelated with the occurrence of another, so it cannot directly model correlation between topics. The Correlated Topic Model (CTM) removes this limitation: it is a hierarchical model that replaces LDA's Dirichlet prior over topic proportions with a logistic normal distribution, whose covariance structure captures correlations between topics. A CTM therefore provides better insight into the data, which helps with visualization; it models the words of each document from document-specific random variables and can capture the diversity in grouped data that exhibits multiple hidden patterns. A CTM generally gives better predictive performance, but at the expense of extra computational cost.

For both models, LDA and CTM, the number of topics has to be fixed before modeling a corpus. Both follow a generative process (sketched in code after the following list):

  • Determine term distribution for each topic
  • Determine proportions of the topic distribution for the document
  • For each word choose a topic, then choose a word conditioned on that topic
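To make this generative process concrete, here is a toy sketch in base R that samples one synthetic document the LDA way (a CTM would instead draw the topic proportions from a logistic normal distribution). The sizes (k topics, V vocabulary terms, document length) and the Dirichlet parameters are arbitrary illustrative choices:

# Toy sketch of the generative process behind LDA (illustrative values only)
set.seed(1)
k <- 3; V <- 10; docLength <- 20
rdirichlet1 <- function(alpha) { g <- rgamma(length(alpha), shape = alpha); g / sum(g) }
beta  <- t(replicate(k, rdirichlet1(rep(0.1, V))))  # step 1: a term distribution per topic
theta <- rdirichlet1(rep(50 / k, k))                # step 2: topic proportions for the document
z <- sample(1:k, docLength, replace = TRUE, prob = theta)          # step 3a: a topic for each word
w <- sapply(z, function(zi) sample(1:V, 1, prob = beta[zi, ]))     # step 3b: a word drawn from that topic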

This is the list of steps followed in topic modeling:

  1. The provided data can be in various formats. Creating a corpus or vocabulary out of the given data is the first step.
  2. Process the created corpus to remove noisy data. This involves:
    • Tokenizing
    • Stemming
    • Stop word removal
    • Removing numbers
    • Removing punctuation
    • Removing terms below a certain length
    • Converting to lower case
  3. Create the document term matrix of the processed corpus.
  4. Remove the sparse entries from the document term matrix.
  5. This matrix can be provided as the input to LDA and CTM.

Model selection

Selecting the number of topics is a tricky problem; to fit a given document-term matrix with the LDA model or the CTM, the number of topics needs to be fixed before modeling. There are various approaches to selecting the number of topics:

  • Generally the number of topics is not known, but we can run the models for different numbers of topics and find the best value in a data-driven way (see the sketch after this list)
  • Bayesian approach
  • Hierarchical Dirichlet process
  • Cross validation on likelihood approach
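Here is a minimal sketch of the data-driven approach mentioned above: fit LDA for several candidate values of k on a training split of the document-term matrix and compare held-out perplexity (lower is better). The candidate values and the 80/20 split are illustrative assumptions, not package defaults:

# Sketch: choose the number of topics by held-out perplexity.
# 'matrix' is the DocumentTermMatrix built earlier; the candidate k values are arbitrary.
candidateK <- c(5, 10, 20, 30, 40)
trainRows  <- sample(seq_len(nrow(matrix)), size = floor(0.8 * nrow(matrix)))
trainDtm   <- matrix[trainRows, ]
testDtm    <- matrix[-trainRows, ]
perp <- sapply(candidateK, function(k) {
  fit <- LDA(trainDtm, k = k)           # VEM fit on the training documents
  perplexity(fit, newdata = testDtm)    # evaluated on the held-out documents
})
bestK <- candidateK[which.min(perp)]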

R Package for topic modeling

We use the topicmodels package for topic modeling. This package provides an interface to the C code for LDA models and CTM, and C++ code for fitting LDA models using Gibbs sampling.

The main functions in the topicmodels package for fitting the LDA and CTM models are LDA() and CTM(), respectively. The two functions take the same arguments, as shown in the following code:

LDA(x, k, method = "VEM", control = NULL, model = NULL, ...)
CTM(x, k, method = "VEM", control = NULL, model = NULL, ...)
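For instance, continuing with the DocumentTermMatrix and the number of topics from the NYTimes walkthrough earlier in this chapter, both models can be fitted with a single call each (a sketch; VEM is the default estimation method):

lda_fit <- LDA(matrix, k = numberOfTopics)   # latent Dirichlet allocation
ctm_fit <- CTM(matrix, k = numberOfTopics)   # correlated topic model
terms(ctm_fit, 5)                            # top five terms per CTM topic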

Fitting the LDA model with the VEM algorithm

The following arguments are possible for the control list:

control_LDA_VEM <- list(
  estimate.alpha = TRUE, alpha = 50/k, estimate.beta = TRUE,
  verbose = 0, prefix = tempfile(), save = 0, keep = 0,
  seed = as.integer(Sys.time()), nstart = 1, best = TRUE,
  var = list(iter.max = 500, tol = 10^-6),
  em = list(iter.max = 1000, tol = 10^-4),
  initialize = "random")
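As a quick sketch of how this control list is used, we can pass it to LDA() along with the document-term matrix. Note that the list above refers to k, so the number of topics must be defined before the list is built; here we reuse numberOfTopics from the NYTimes walkthrough:

k <- numberOfTopics                      # must be defined before control_LDA_VEM is created
lda_vem <- LDA(matrix, k = k, method = "VEM", control = control_LDA_VEM)
terms(lda_vem, 3)                        # top three terms per topic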

For more information about function parameters, refer to the package documentation.
