Topic modeling

A final, very useful application of word counting is topic modeling. Given a set of texts, can we find clusters of topics in them? The method used to do this is called Latent Dirichlet Allocation (LDA).

Note: The code and data for this section can be found on Kaggle at https://www.kaggle.com/jannesklaas/topic-modeling-with-lda.

While the name is quite a mouthful, the algorithm is a very useful one, so we will look at it step by step. LDA makes the following assumption about how texts are written:

  1. First, a topic distribution is chosen, say 70% machine learning and 30% finance.
  2. Second, the distribution of words for each topic is chosen. For example, the topic "machine learning" might be made up of 20% the word "tensor," 10% the word "gradient," and so on. This means that our topic distribution is a distribution of distributions, also called a Dirichlet distribution.
  3. Once the text gets written, two probabilistic decisions are made for each word: first, a topic is chosen from the document's distribution of topics; then, a word is chosen from that topic's distribution of words. The sketch after this list simulates this process.
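To make this generative story concrete, the following is a minimal sketch that simulates it with NumPy. The two topics, the toy vocabulary, and all of the probabilities are made-up illustrative values, not quantities estimated from data:

import numpy as np

rng = np.random.default_rng(seed=0)

# Step 1: a made-up topic distribution for one document:
# 70% "machine learning", 30% "finance"
topic_dist = np.array([0.7, 0.3])

# Step 2: made-up word distributions, one per topic, over a toy vocabulary
vocabulary = np.array(["tensor", "gradient", "stock", "bond"])
word_dist = np.array([[0.60, 0.30, 0.05, 0.05],   # "machine learning"
                      [0.05, 0.05, 0.50, 0.40]])  # "finance"

# Step 3: for each word, first draw a topic from the topic distribution,
# then draw a word from that topic's word distribution
for _ in range(10):
    topic = rng.choice(2, p=topic_dist)
    word = rng.choice(vocabulary, p=word_dist[topic])
    print(topic, word)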

Note that not all documents in a corpus have the same distribution of topics; however, we do need to specify a fixed number of topics up front. In the learning process, we start out by randomly assigning each word in the corpus to one topic. For each document, we then calculate the following:

$$p(t \mid d) = \frac{\text{number of words in document } d \text{ assigned to topic } t}{\text{total number of words in document } d}$$

The preceding formula is the probability that each topic, t, is included in document d. For each word, we then calculate:

$$p(w \mid t) = \frac{\text{number of assignments of word } w \text{ to topic } t \text{ across all documents}}{\text{total number of words assigned to topic } t}$$

That is the probability that a word, w, belongs to a topic, t. We then assign the word to a new topic, t, with probability proportional to the following:

$$p(t \mid d) \times p(w \mid t)$$

In other words, we assume that all of the words are already correctly assigned to a topic except for the word currently under consideration. We then try to reassign words to topics so that documents become more homogeneous in their topic distribution. This way, words that actually belong to the same topic cluster together.
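The following is a minimal sketch of a single reassignment step using toy count matrices. The variable names and numbers are our own illustrative choices, and for brevity the sketch skips subtracting the current word's own assignment from the counts, which a full sampler would do:

import numpy as np

rng = np.random.default_rng(seed=0)

# Toy counts from the current (random) assignment:
# doc_topic[d, t]: number of words in document d assigned to topic t
# topic_word[t, w]: times word w is assigned to topic t across all documents
doc_topic = np.array([[3., 7.], [8., 2.]])
topic_word = np.array([[1., 4., 6.], [5., 3., 2.]])

d, w = 0, 1  # reassign word w in document d

p_t_given_d = doc_topic[d] / doc_topic[d].sum()          # p(t|d)
p_w_given_t = topic_word[:, w] / topic_word.sum(axis=1)  # p(w|t)

# Draw the new topic with probability proportional to p(t|d) * p(w|t)
p = p_t_given_d * p_w_given_t
new_topic = rng.choice(len(p), p=p / p.sum())
print(new_topic)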

Scikit-learn offers an easy-to-use LDA tool that will help us achieve this. To use it, we must first create a new LDA analyzer and specify the number of topics, called components, that we expect.

This can be done by simply running the following:

from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components=2)

We then create count vectors, just as we did for the bag-of-words analysis. For LDA, it is important to remove frequent words that don't carry meaning on their own, such as "an" or "the," so-called stop words. CountVectorizer comes with a built-in list of English stop words that it can remove automatically. To use this, we'll need to run the following code:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
tf = vectorizer.fit_transform(df['joint_lemmas'])

Next, we fit the LDA to the count vectors:

lda.fit(tf)

To inspect our results, we can print out the most frequent words for each topic. To this end, we first need to specify the number of words per topic we want to print, in this case 5. We also need to extract the mapping from count vector indices to words:

n_top_words = 5
tf_feature_names = vectorizer.get_feature_names_out()  # index-to-word mapping

Now we can loop over the topics of the LDA in order to print their most frequent words:

for topic_idx, topic in enumerate(lda.components_):
    message = "Topic #%d: " % topic_idx
    # argsort sorts indices from least to most frequent, so we take
    # the last n_top_words indices in reverse order
    message += " ".join([tf_feature_names[i]
                         for i in topic.argsort()[:-n_top_words - 1:-1]])
    print(message)

This gives us the following two topics:

Topic #0: http news bomb kill disaster
Topic #1: pron http like just https

As you can see, the LDA seems to have discovered the grouping into serious tweets and non-serious ones by itself without being given the targets.

This method is very useful for classifying news articles, too. Back in the world of finance, investors might want to know if there is a news article mentioning a risk factor they are exposed to. The same goes for support requests for consumer-facing organizations, which can be clustered this way.
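Once the model is fitted, the topic mixture of unseen documents can be obtained with transform, reusing the vectorizer and lda objects from above. The following is a minimal sketch with a made-up example text:

# Hypothetical new document, preprocessed the same way as the training texts
new_doc = ["massive explosion reported after the storm"]

# Reuse the fitted vectorizer and LDA model from above
new_tf = vectorizer.transform(new_doc)
topic_mixture = lda.transform(new_tf)  # shape: (1, n_components)
print(topic_mixture)

A document whose mixture loads heavily on a risk-related topic could then, for example, be surfaced to the relevant analyst or support team.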
