A final, very useful application of word counting is topic modeling. Given a set of texts, are we able to find clusters of topics? The method to do this is called Latent Dirichlet Allocation (LDA).
Note: The code and data for this section can be found on Kaggle at https://www.kaggle.com/jannesklaas/topic-modeling-with-lda.
While the name is quite a mouthful, the algorithm is a very useful one, so we will look at it step by step. LDA makes the following assumption about how texts are written: every document is a mixture of a small number of topics, and every word in a document is drawn from one of that document's topics.
Note that not all documents in a corpus have the same distribution of topics. We need to specify a fixed number of topics. In the learning process, we start out by assigning each word in the corpus randomly to one topic. For each document, we then calculate the following:

$p(t \mid d) = \frac{\text{number of words in document } d \text{ assigned to topic } t}{\text{total number of words in document } d}$
The preceding formula is the probability of each topic, t, being included in document d. For each word, we then calculate:

$p(w \mid t) = \frac{\text{number of assignments of word } w \text{ to topic } t \text{ across all documents}}{\text{total number of words assigned to topic } t}$
That is the probability of a word, w, belonging to a topic, t. We then assign the word to a new topic, t, with a probability proportional to the following:

$p(t \mid d) \times p(w \mid t)$
In other words, we assume that all of the words are already correctly assigned to a topic except for the word currently under consideration. We then try to assign words to topics in a way that makes documents more homogeneous in their topic distribution. This way, words that actually belong to a topic cluster together.
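To make this procedure concrete, here is a minimal, illustrative sketch of one such reassignment pass over a toy corpus in plain Python. It is not how scikit-learn implements LDA internally, and all names here (docs, doc_topic, topic_word) are made up for this example; real implementations also add Dirichlet smoothing priors to the counts, which is where the method gets its name:

import numpy as np

rng = np.random.default_rng(42)

# Toy corpus: each document is a list of word indices into a 4-word vocabulary
docs = [[0, 1, 2, 0], [2, 3, 3, 1]]
n_topics, n_words = 2, 4

# Step 1: randomly assign every word occurrence to a topic
assignments = [[int(rng.integers(n_topics)) for _ in doc] for doc in docs]

# Count matrices derived from those assignments
doc_topic = np.zeros((len(docs), n_topics))   # topic counts per document
topic_word = np.zeros((n_topics, n_words))    # word counts per topic
for d, doc in enumerate(docs):
    for w, t in zip(doc, assignments[d]):
        doc_topic[d, t] += 1
        topic_word[t, w] += 1

# Step 2: one reassignment pass over every word in the corpus
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t_old = assignments[d][i]
        # Treat all words except the current one as correctly assigned
        doc_topic[d, t_old] -= 1
        topic_word[t_old, w] -= 1
        # p(t | d): share of words in this document assigned to each topic
        p_t_d = doc_topic[d] / max(doc_topic[d].sum(), 1)
        # p(w | t): share of each topic's words that are this word
        p_w_t = topic_word[:, w] / np.maximum(topic_word.sum(axis=1), 1)
        # Draw a new topic with probability proportional to p(t | d) * p(w | t)
        probs = p_t_d * p_w_t
        total = probs.sum()
        probs = probs / total if total > 0 else np.full(n_topics, 1 / n_topics)
        t_new = int(rng.choice(n_topics, p=probs))
        assignments[d][i] = t_new
        doc_topic[d, t_new] += 1
        topic_word[t_new, w] += 1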
Scikit-learn offers an easy-to-use LDA tool that will help us achieve this. To use it, we must first create a new LDA analyzer and specify the number of topics, called components, that we expect.
This can be done by simply running the following:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=2)
We then create count vectors, just as we did for the bag-of-words analysis. For LDA, it is important to remove frequent words that don't carry meaning on their own, such as "an" or "the," so-called stop words. CountVectorizer comes with a built-in stop-word dictionary that removes these words automatically. To use it, we'll need to run the following code:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
tf = vectorizer.fit_transform(df['joint_lemmas'])
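Before fitting, a quick sanity check shows the stop-word filter at work. The specific words here are only illustrative; whether "disaster" survives depends on it actually occurring in the data:

# Stop words never make it into the learned vocabulary
print('the' in vectorizer.vocabulary_)       # False
print('disaster' in vectorizer.vocabulary_)  # True, if the word occurs in the tweets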
Next, we fit the LDA to the count vectors:
lda.fit(tf)
To inspect our results, we can print out the most frequent words for each topic. To this end, we first need to specify the number of words per topic we want to print, in this case 5. We also need to extract the mapping from count vector indices to words:
n_top_words = 5
tf_feature_names = vectorizer.get_feature_names()
Now we can loop over the topics of the LDA in order to print the most frequent words:
for topic_idx, topic in enumerate(lda.components_):
    message = "Topic #%d: " % topic_idx
    message += " ".join([tf_feature_names[i]
                         for i in topic.argsort()[:-n_top_words - 1:-1]])
    print(message)

This produces the following output:

Topic #0: http news bomb kill disaster
Topic #1: pron http like just https
As you can see, the LDA seems to have discovered the grouping into serious tweets and non-serious ones by itself without being given the targets.
This method is very useful for classifying news articles, too. Back in the world of finance, investors might want to know if there is a news article mentioning a risk factor they are exposed to. The same goes for support requests sent to consumer-facing organizations, which can be clustered in the same way.
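To use the topic assignments for classification or filtering in this way, we need the topic mixture of each individual document rather than the top words per topic. Here is a minimal sketch using the lda and tf objects fitted above; the 0.8 threshold is an arbitrary choice for illustration:

import numpy as np

# One row per document, one column per topic; each row sums to one
doc_topic_dist = lda.transform(tf)

# Assign every document to its most probable topic
dominant_topic = np.argmax(doc_topic_dist, axis=1)

# Flag documents dominated by topic 0, the "disaster" cluster in our run
flagged = np.where(doc_topic_dist[:, 0] > 0.8)[0]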