Latent Dirichlet allocation

Unfortunately, there are two methods in machine learning with the initials LDA: latent Dirichlet allocation, which is a topic modeling method, and linear discriminant analysis, which is a classification method. The two are completely unrelated except for their shared initials, which can be confusing in certain situations. The scikit-learn library has a submodule, sklearn.lda, which implements linear discriminant analysis; at the time of writing, scikit-learn does not implement latent Dirichlet allocation.

The first topic model we will look at is latent Dirichlet allocation. The mathematical ideas behind LDA are fairly complex, and we will not go into the details here.

For those who are interested and adventurous enough, Wikipedia provides all the equations behind these algorithms: http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation.

However, we can understand the ideas behind LDA intuitively at a high level. LDA belongs to a class of models called generative models, as they come with a sort of fable that explains how the data was generated. This generative story is, of course, a simplification of reality, meant to make the machine learning easier. In the LDA fable, we first create topics by assigning probability weights to words. Each topic assigns different weights to different words. For example, a Python topic will assign a high probability to the word variable and a low probability to the word inebriated. When we wish to generate a new document, we first choose the topics it will use and then mix words from these topics.

For example, let's say we have only three topics that books discuss:

  • Machine learning
  • Python
  • Baking

For each topic, we have a list of words associated with it. This book will be a mixture of the first two topics, perhaps 50 percent each. The mixture does not need to be equal; it can also be a 70/30 split. When we generate the actual text, we generate it word by word: first, we decide which topic this word will come from. This is a random decision based on the topic weights. Once a topic is chosen, we generate a word from that topic's list of words. To be precise, we choose one of the words in the vocabulary with the probability that the topic assigns to it. The same word can be generated from multiple topics. For example, weight is a common word in both machine learning and baking (albeit with different meanings).
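
The following short Python sketch plays out this fable end to end. The vocabulary, the three topics with their word probabilities, and the 70/30 document mixture are all made-up numbers for illustration, not parameters of any real model:

    import numpy as np

    rng = np.random.default_rng(0)

    # Invented vocabulary and topics; each topic assigns a probability to
    # every word, and each row of probabilities sums to one.
    vocabulary = ["variable", "classifier", "weight", "flour", "oven", "loop"]
    topics = {
        "machine learning": [0.05, 0.40, 0.35, 0.05, 0.05, 0.10],
        "python":           [0.40, 0.10, 0.05, 0.05, 0.05, 0.35],
        "baking":           [0.02, 0.02, 0.30, 0.33, 0.31, 0.02],
    }

    # This hypothetical document is a 70/30 mixture of the first two topics.
    doc_topic_weights = {"machine learning": 0.7, "python": 0.3, "baking": 0.0}

    def generate_document(n_words=10):
        topic_names = list(topics)
        weights = [doc_topic_weights[name] for name in topic_names]
        words = []
        for _ in range(n_words):
            # First pick a topic according to the document's topic weights,
            # then pick a word according to that topic's word probabilities.
            topic = rng.choice(topic_names, p=weights)
            words.append(str(rng.choice(vocabulary, p=topics[topic])))
        return " ".join(words)

    print(generate_document())

Because the document leans 70/30 towards the machine learning topic, words that are likely under that topic will show up more often in the generated text than words that are only likely under the Python topic.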

In this model, the order of words does not matter. This is a bag of words model, as we have already seen in the previous chapter. It is a crude simplification of language, but it often works well enough, because just knowing which words were used in a document and their frequencies is enough to make machine learning decisions.
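
As a tiny illustration of this assumption (the two example phrases below are invented), two documents that use the same words in a different order end up with exactly the same bag of words representation:

    from collections import Counter

    # Two toy documents with identical words in a different order.
    doc_a = "the weight of the model".split()
    doc_b = "the model of the weight".split()

    # Under the bag of words assumption, only word counts matter,
    # so the two documents become indistinguishable.
    print(Counter(doc_a) == Counter(doc_b))  # prints: True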

In the real world, we do not know what the topics are. Our task is to take a collection of text and to reverse engineer this fable in order to discover what topics are out there and simultaneously figure out which topics each document uses.
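
One way to perform this reverse engineering in Python is with the gensim library; since gensim is not introduced in this excerpt, treat the following as a minimal sketch under that assumption, with a toy corpus and a choice of two topics invented purely for illustration:

    from gensim import corpora, models

    # Toy corpus: each document is already tokenized into lowercase words.
    documents = [
        ["python", "variable", "loop", "function"],
        ["classifier", "training", "weight", "python"],
        ["flour", "oven", "weight", "dough"],
    ]

    # Map each word to an integer id and convert the documents to
    # bag of words counts, which is the input format LDA expects.
    dictionary = corpora.Dictionary(documents)
    corpus = [dictionary.doc2bow(doc) for doc in documents]

    # Ask LDA to recover two topics from the corpus.
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)

    # The inferred topic mixture for the first document...
    print(lda[corpus[0]])
    # ...and the highest-probability words in each discovered topic.
    for topic_id in range(2):
        print(lda.show_topic(topic_id))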
