Choosing the number of topics

So far in this chapter, we have used a fixed number of topics for our analysis, namely 100. This was a purely arbitrary choice; we could just as well have used 20 or 200 topics. Fortunately, for many uses, this number does not matter much. If you are only going to use the topics as an intermediate step, as we did previously when finding similar posts, the final behavior of the system is rarely very sensitive to the exact number of topics in the model. As long as you use enough of them, the recommendations that result from the process will not differ much between, say, 100 and 200 topics. For a general collection of text documents, 100 is often a good choice (while 20 is too few), and we could have used more if we had more documents.
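
For instance, reusing the mm corpus and id2word dictionary built earlier in the chapter, comparing two sizes is just a matter of changing the num_topics keyword (a minimal sketch; the model names are ours, and training each model takes a while):

from gensim import models

# Same corpus and dictionary as before; only the number of topics differs.
model100 = models.ldamodel.LdaModel(mm, id2word=id2word, num_topics=100)
model200 = models.ldamodel.LdaModel(mm, id2word=id2word, num_topics=200)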

The same is true of the alpha value. While playing around with it can change the topics, the final results are again robust to such changes. Naturally, this depends on the exact nature of your data and should be tested empirically to make sure that the results are indeed stable.
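
In gensim, alpha is just another keyword argument to LdaModel (a sketch reusing mm and id2word; the values below are illustrative, not recommendations):

from gensim import models

# Lower alpha pushes each document towards fewer topics (sparser mixtures);
# higher alpha spreads each document across more topics.
model_sparse = models.ldamodel.LdaModel(mm, id2word=id2word, num_topics=100, alpha=0.1)
model_smooth = models.ldamodel.LdaModel(mm, id2word=id2word, num_topics=100, alpha=1.0)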

Topic modeling is often a means to an end rather than the goal itself. In that case, it is not always very important exactly which parameter values are used. Systems built with a different number of topics, or with different values for parameters such as alpha, will produce end results that are almost identical.

On the other hand, if you are going to explore the topics directly, or build a visualization tool that exposes them, you should probably try a few values and see which gives you the most useful or most appealing results. There are also statistical measures, such as perplexity, that quantify how well each of a series of models fits the data, enabling a more informed decision.
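
For example, gensim's LdaModel exposes a log_perplexity method that computes a per-word likelihood bound on a chunk of documents (a sketch using the two models from above; properly, the chunk should be held out from training):

import itertools

# Hypothetical held-out chunk: the first 1,000 documents of the corpus.
heldout = list(itertools.islice(iter(mm), 1000))
for model in (model100, model200):
    print(model.num_topics, model.log_perplexity(heldout))

The model with the higher (less negative) per-word bound fits the held-out documents better.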

Alternatively, there are a few methods that will automatically determine the number of topics for you, depending on the dataset. One popular model is called the hierarchical Dirichlet process (HDP). Again, the full mathematical model behind it is complex and beyond the scope of this book. However, what we can tell you is that instead of the topics being fixed in advance, as in the LDA generative process, the topics themselves are generated along with the data, one at a time. Whenever the writer creates a new document, they have the option of using topics that already exist or of creating a completely new one. As more topics are created, the probability of creating a new one, instead of reusing an existing one, goes down, but the possibility always remains.

This means that the more documents we have, the more topics we will end up with. This is one of those statements that is unintuitive at first but makes perfect sense upon reflection. We are grouping documents, and the more examples we have, the finer the groups we can form. If we only have a few examples of news articles, then sports will be a single topic. However, as we get more, we start to break it up into individual sports: hockey, soccer, and so on. With even more data, we can start to tell nuances apart: articles about individual teams and even individual players. The same is true for people. In a group of many different backgrounds, with a few computer people, you might put them together; in a slightly larger group, you would have separate gatherings for programmers and systems administrators; and in the real world, we even have different gatherings for Python and Ruby programmers.
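
The reuse-or-create dynamic described above can be sketched with a toy Chinese restaurant process simulation (this illustrates the intuition only; it is not gensim code, and the gamma concentration parameter is illustrative):

import random

def crp_topics(n_draws, gamma=1.0, seed=0):
    # Draw i opens a brand-new topic with probability gamma / (gamma + i);
    # otherwise it reuses an existing topic in proportion to its popularity.
    rng = random.Random(seed)
    counts = []  # counts[k] = number of draws assigned to topic k
    for i in range(n_draws):
        if rng.random() < gamma / (gamma + i):
            counts.append(1)
        else:
            r = rng.randrange(i)  # i equals sum(counts) at this point
            acc = 0
            for k, c in enumerate(counts):
                acc += c
                if r < acc:
                    counts[k] += 1
                    break
    return len(counts)

for n in (100, 1000, 10000):
    print(n, crp_topics(n))

The number of topics grows roughly logarithmically with the number of draws: more data supports finer distinctions, but each new topic becomes progressively harder to open.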

HDP is available in gensim. Using it is trivial. To adapt the code we wrote for LDA, we just need to replace the call to gensim.models.ldamodel.LdaModel with a call to the HdpModel constructor as follows:

# Same corpus (mm) and dictionary (id2word) as for the LDA model.
hdp = gensim.models.hdpmodel.HdpModel(mm, id2word=id2word)

That's it (except that it takes a bit longer to compute—there are no free lunches). Now, we can use this model in much the same way as we used the LDA model, except that we did not need to specify the number of topics.
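
For example, we can list the most prominent topics that HDP inferred (a sketch; the exact method names and output format vary slightly between gensim versions):

# HDP decides on its own how many topics to use; show a few of them.
for topic in hdp.print_topics(num_topics=5, num_words=8):
    print(topic)

If you need an LDA-like model downstream, recent gensim versions also offer hdp.suggested_lda_model(), which converts the fitted HDP into an approximately equivalent LdaModel.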
