Let's explore another popular topic modeling algorithm, latent Dirichlet allocation (LDA). LDA is a generative probabilistic graphical model that explains each input document as a mixture of topics, each with a certain probability. Again, a topic in topic modeling means a collection of words that tend to occur together. In other words, LDA essentially deals with two probability distributions, P(term | topic) and P(topic | document). This can be difficult to grasp at first, so let's start from the end result of an LDA model and work backward.
Let's take a look at the following set of documents:
Document 1: This restaurant is famous for fish and chips.
Document 2: I had fish and rice for lunch.
Document 3: My sister bought me a cute kitten.
Document 4: Some research shows eating too much rice is bad.
Document 5: I always forget to feed fish to my cat.
Now, let's say we want two topics. The topics derived from these documents may appear as follows:
Topic 1: 30% fish, 20% chip, 30% rice, 10% lunch, 10% restaurant (we can interpret Topic 1 as being food related)
Topic 2: 40% cute, 40% cat, 10% fish, 10% feed (we can interpret Topic 2 as being pet related)
Then, we can see how each document is represented by these two topics:
Document 1: 85% Topic 1, 15% Topic 2
Document 2: 88% Topic 1, 12% Topic 2
Document 3: 100% Topic 2
Document 4: 100% Topic 1
Document 5: 33% Topic 1, 67% Topic 2
Now that we have seen a dummy example, let's come back to the learning procedure (a runnable sketch of this loop follows the list):
1. Specify the number of topics, T. We now have topics 1, 2, …, and T.
2. For each document, randomly assign one of the topics to each term in the document.
3. For each document, calculate P(topic=t | document), which is the proportion of terms in the document that are assigned to topic t.
4. For each topic, calculate P(term=w | topic), which is the proportion of term w among all terms that are assigned to that topic.
5. For each term w, reassign its topic based on the latest probabilities P(topic=t | document) and P(term=w | topic=t).
6. Repeat steps 3 to 5 with the latest topic assignments in each iteration. Training stops when the model converges or reaches the maximum number of iterations.
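To make these steps concrete, here is a minimal, self-contained sketch of this iterative procedure in plain NumPy (an illustration of the idea only, not how scikit-learn implements LDA internally; the toy_lda function and the integer-ID encoding of documents are made up for this example):

import numpy as np

def toy_lda(docs, n_topics, n_iter=50, seed=0):
    # docs: a list of documents, each a list of integer term IDs
    rng = np.random.default_rng(seed)
    n_terms = max(max(doc) for doc in docs) + 1
    # Step 2: randomly assign a topic to every term occurrence
    topics = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for _ in range(n_iter):
        # Steps 3 and 4: estimate both distributions by counting
        # (starting from ones acts as simple smoothing so no probability is zero)
        doc_topic = np.ones((len(docs), n_topics))
        topic_term = np.ones((n_topics, n_terms))
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                doc_topic[d, topics[d][i]] += 1
                topic_term[topics[d][i], w] += 1
        doc_topic /= doc_topic.sum(axis=1, keepdims=True)    # P(topic | document)
        topic_term /= topic_term.sum(axis=1, keepdims=True)  # P(term | topic)
        # Step 5: reassign each term's topic using the latest probabilities
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                p = doc_topic[d] * topic_term[:, w]
                topics[d][i] = rng.choice(n_topics, p=p / p.sum())
    return doc_topic, topic_term

Production implementations refine this resampling loop (for example, with collapsed Gibbs sampling or the variational methods that scikit-learn uses), but the core idea is the same.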
LDA is trained in a generative manner, where it tries to abstract from the documents a set of hidden topics that are likely to generate a certain collection of words.
With all this in mind, let's see LDA in action. The LDA model is also included in scikit-learn:
>>> from sklearn.decomposition import LatentDirichletAllocation
>>> t = 20
>>> lda = LatentDirichletAllocation(n_components=t,
...                                 learning_method='batch',
...                                 random_state=42)
Again, we specify 20 topics (n_components). The other key parameters of the model are learning_method, which can be 'batch' (the default, using all training data in each update) or 'online' (updating the model in mini-batches, which scales better to large datasets), and max_iter, the maximum number of training iterations (10 by default); random_state fixes the random seed so that the results are reproducible.
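For instance, if the corpus were much larger, we could switch to the online variant with mini-batch updates (a quick sketch; the specific values here are only illustrative):
>>> lda_online = LatentDirichletAllocation(n_components=t,
...                                        learning_method='online',
...                                        batch_size=128,
...                                        random_state=42)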
For the input data to LDA, remember that LDA only takes in term counts, as it is a probabilistic graphical model. This is unlike NMF, which can work with both the term count matrix and the tf-idf matrix, as long as the data is non-negative. Again, we use the term matrix defined previously as input to the lda model:
>>> data = count_vector.fit_transform(data_cleaned)
Now, fit the LDA model on the term matrix, data:
>>> lda.fit(data)
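Incidentally, once the model is fitted, we can recover the P(topic | document) side of the model from our dummy example with transform, which returns one topic mixture per document (a quick sketch):
>>> doc_topic = lda.transform(data)
>>> # doc_topic has one row per document and 20 columns;
>>> # each row sums to 1, just like the toy mixtures above
>>> first_doc_mixture = doc_topic[0]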
We can obtain the resulting topic-term matrix after the model is trained, where each element represents how strongly a term is associated with a topic:
>>> lda.components_
[[0.05 2.05 2.05 ... 0.05 0.05 0.05 ]
[0.05 0.05 0.05 ... 0.05 0.05 0.05 ]
[0.05 0.05 0.05 ... 4.0336285 0.05 0.05 ]
...
[0.05 0.05 0.05 ... 0.05 0.05 0.05 ]
[0.05 0.05 0.05 ... 0.05 0.05 0.05 ]
[0.05 0.05 0.05 ... 0.05 0.05 3.05 ]]
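Note that the raw values in components_ are not probabilities; to recover P(term | topic), we can normalize each row (a quick sketch):
>>> topic_term = lda.components_ / lda.components_.sum(axis=1, keepdims=True)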
Similarly, for each topic, we display the top 10 terms based on their weights, as follows:
>>> terms = count_vector.get_feature_names()
>>> for topic_idx, topic in enumerate(lda.components_):
...     print("Topic {}:".format(topic_idx))
...     print(" ".join([terms[i] for i in
...                     topic.argsort()[-10:]]))
Topic 0:
atheist doe ha believe say jesus people christian wa god
Topic 1:
moment just adobe want know ha wa hacker article radius
Topic 2:
center point ha wa available research computer data graphic hst
Topic 3:
objective argument just thing doe people wa think say article
Topic 4:
time like brian ha good life want know just wa
Topic 5:
computer graphic think know need university just article wa like
Topic 6:
free program color doe use version gif jpeg file image
Topic 7:
gamma ray did know university ha just like article wa
Topic 8:
tool ha processing using data software color program bit image
Topic 9:
apr men know ha think woman just university article wa
Topic 10:
jpl propulsion mission april mar jet command data spacecraft wa
Topic 11:
russian like ha university redesign point option article space station
Topic 12:
ha van book star material physicist universe physical theory wa
Topic 13:
bank doe book law wa article rushdie muslim islam islamic
Topic 14:
think gopher routine point polygon book university article know wa
Topic 15:
ha rocket new lunar mission satellite shuttle nasa launch space
Topic 16:
want right article ha make like just think people wa
Topic 17:
just light space henry wa like zoology sky article toronto
Topic 18:
comet venus solar moon orbit planet earth probe ha wa
Topic 19:
site format image mail program available ftp send file graphic
There are a number of interesting topics that we just mined, for instance, computer graphics-related topics, such as 2, 5, 6, 8, and 19; space-related ones, such as 10, 11, 12, and 15; and religion-related ones, such as 0 and 13. There are also noisy topics, for example, 9 and 16, which may require some imagination to interpret. Again, this is not surprising at all, since LDA, and topic modeling in general, is a form of unsupervised learning, so there is no guarantee that every discovered topic will be easy to interpret.