Let's explore another popular topic modeling algorithm, latent Dirichlet allocation (LDA). LDA is a generative probabilistic graphical model that explains each input document as a mixture of topics, each with a certain probability. Again, a topic in topic modeling means a collection of words that tend to occur together. In other words, LDA essentially deals with two probability distributions, P(term | topic) and P(topic | document). This can be difficult to grasp at first, so let's start from the end result of an LDA model and work backward.
Let's take a look at the following set of documents:
Document 1: This restaurant is famous for fish and chips.
Document 2: I had fish and rice for lunch.
Document 3: My sister bought me a cute kitten.
Document 4: Some research shows eating too much rice is bad.
Document 5: I always forget to feed fish to my cat.
Now, let's say we want two topics. The topics derived from these documents may appear as follows:
Topic 1: 30% fish, 20% chip, 30% rice, 10% lunch, 10% restaurant (we can interpret Topic 1 as being food related)
Topic 2: 40% cute, 40% cat, 10% fish, 10% feed (we can interpret Topic 2 as being pet related)
Then, we can see how each document is represented by these two topics:
Document 1: 85% Topic 1, 15% Topic 2
Document 2: 88% Topic 1, 12% Topic 2
Document 3: 100% Topic 2
Document 4: 100% Topic 1
Document 5: 33% Topic 1, 67% Topic 2
Now that we have seen a dummy example, let's come back to the learning procedure (a runnable sketch of this loop follows the list):
1. Specify the number of topics, T. We now have topics 1, 2, …, and T.
2. For each document, randomly assign one of the topics to each term in the document.
3. For each document, calculate P(topic=t | document), which is the proportion of terms in the document that are assigned to topic t.
4. For each topic, calculate P(term=w | topic), which is the proportion of term w among all terms that are assigned to that topic.
5. For each term w, reassign its topic based on the latest probabilities P(topic=t | document) and P(term=w | topic=t).
6. Repeat steps 3 to 5 with the latest topic assignments in each iteration. Training stops when the model converges or reaches the maximum number of iterations.
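To make these steps concrete, here is a minimal, self-contained sketch of this iterative procedure in plain NumPy (an illustration of the idea only, not how scikit-learn implements LDA internally; the toy_lda function and the integer-ID encoding of documents are made up for this example):

import numpy as np

def toy_lda(docs, n_topics, n_iter=50, seed=0):
    # docs: a list of documents, each a list of integer term IDs
    rng = np.random.default_rng(seed)
    n_terms = max(max(doc) for doc in docs) + 1
    # Step 2: randomly assign a topic to every term occurrence
    topics = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for _ in range(n_iter):
        # Steps 3 and 4: estimate both distributions by counting
        # (starting from ones acts as simple smoothing so no probability is zero)
        doc_topic = np.ones((len(docs), n_topics))
        topic_term = np.ones((n_topics, n_terms))
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                doc_topic[d, topics[d][i]] += 1
                topic_term[topics[d][i], w] += 1
        doc_topic /= doc_topic.sum(axis=1, keepdims=True)    # P(topic | document)
        topic_term /= topic_term.sum(axis=1, keepdims=True)  # P(term | topic)
        # Step 5: reassign each term's topic using the latest probabilities
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                p = doc_topic[d] * topic_term[:, w]
                topics[d][i] = rng.choice(n_topics, p=p / p.sum())
    return doc_topic, topic_term

Production implementations refine this resampling loop (for example, with collapsed Gibbs sampling or the variational methods that scikit-learn uses), but the core idea is the same.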
LDA is trained in a generative manner, where it tries to abstract from the documents a set of hidden topics that are likely to generate a certain collection of words.
With all this in mind, let's see LDA in action. The LDA model is also included in scikit-learn:
>>> from sklearn.decomposition import LatentDirichletAllocation
>>> t = 20
>>> lda = LatentDirichletAllocation(n_components=t,
...                                 learning_method='batch',
...                                 random_state=42)
Again, we specify 20 topics (n_components). The other key parameters of the model are learning_method, which can be 'batch' (the default, using all training data in each update) or 'online' (updating the model in mini-batches, which scales better to large datasets), and max_iter, the maximum number of training iterations (10 by default); random_state fixes the random seed so that the results are reproducible.
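For instance, if the corpus were much larger, we could switch to the online variant with mini-batch updates (a quick sketch; the specific values here are only illustrative):
>>> lda_online = LatentDirichletAllocation(n_components=t,
...                                        learning_method='online',
...                                        batch_size=128,
...                                        random_state=42)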
For the input data to LDA, remember that LDA only takes in term counts, as it is a probabilistic graphical model. This is unlike NMF, which can work with both the term count matrix and the tf-idf matrix, as long as the data is non-negative. Again, we use the term matrix defined previously as input to the lda model:
>>> data = count_vector.fit_transform(data_cleaned)
Now, fit the LDA model on the term matrix, data:
>>> lda.fit(data)
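Incidentally, once the model is fitted, we can recover the P(topic | document) side of the model from our dummy example with transform, which returns one topic mixture per document (a quick sketch):
>>> doc_topic = lda.transform(data)
>>> # doc_topic has one row per document and 20 columns;
>>> # each row sums to 1, just like the toy mixtures above
>>> first_doc_mixture = doc_topic[0]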
We can obtain the resulting topic-term matrix after the model is trained, where each element represents how strongly a term is associated with a topic:
>>> lda.components_
[[0.05 2.05 2.05 ... 0.05 0.05 0.05 ]
[0.05 0.05 0.05 ... 0.05 0.05 0.05 ]
[0.05 0.05 0.05 ... 4.0336285 0.05 0.05 ]
...
[0.05 0.05 0.05 ... 0.05 0.05 0.05 ]
[0.05 0.05 0.05 ... 0.05 0.05 0.05 ]
[0.05 0.05 0.05 ... 0.05 0.05 3.05 ]]
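Note that the raw values in components_ are not probabilities; to recover P(term | topic), we can normalize each row (a quick sketch):
>>> topic_term = lda.components_ / lda.components_.sum(axis=1, keepdims=True)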
Similarly, for each topic, we display the top 10 terms based on their weights, as follows:
>>> terms = count_vector.get_feature_names()
>>> for topic_idx, topic in enumerate(lda.components_):
...     print("Topic {}:".format(topic_idx))
...     print(" ".join([terms[i] for i in
...                     topic.argsort()[-10:]]))
Topic 0:
atheist doe ha believe say jesus people christian wa god
Topic 1:
moment just adobe want know ha wa hacker article radius
Topic 2:
center point ha wa available research computer data graphic hst
Topic 3:
objective argument just thing doe people wa think say article
Topic 4:
time like brian ha good life want know just wa
Topic 5:
computer graphic think know need university just article wa like
Topic 6:
free program color doe use version gif jpeg file image
Topic 7:
gamma ray did know university ha just like article wa
Topic 8:
tool ha processing using data software color program bit image
Topic 9:
apr men know ha think woman just university article wa
Topic 10:
jpl propulsion mission april mar jet command data spacecraft wa
Topic 11:
russian like ha university redesign point option article space station
Topic 12:
ha van book star material physicist universe physical theory wa
Topic 13:
bank doe book law wa article rushdie muslim islam islamic
Topic 14:
think gopher routine point polygon book university article know wa
Topic 15:
ha rocket new lunar mission satellite shuttle nasa launch space
Topic 16:
want right article ha make like just think people wa
Topic 17:
just light space henry wa like zoology sky article toronto
Topic 18:
comet venus solar moon orbit planet earth probe ha wa
Topic 19:
site format image mail program available ftp send file graphic
There are a number of interesting topics that we just mined, for instance, computer graphics-related topics, such as 2, 5, 6, 8, and 19; space-related ones, such as 10, 11, 12, and 15; and religion-related ones, such as 0 and 13. There are also noisy topics, for example, 9 and 16, which may require some imagination to interpret. Again, this is not surprising at all, since LDA, and topic modeling in general, is a form of unsupervised learning, so there is no guarantee that every discovered topic will be easy to interpret.