Modeling the whole of Wikipedia

While the initial LDA implementations were slow, which limited their use to small document collections, modern algorithms work well with very large datasets. Following the gensim documentation, we are going to build a topic model for the whole of the English-language Wikipedia. This takes hours, but can be done with just a laptop! With a cluster of machines, we could make it go much faster, but we will look at that sort of processing environment in a later chapter.

First, we download the whole Wikipedia dump from http://dumps.wikimedia.org. This is a large file (currently over 14 GB), so it may take a while, unless your internet connection is very fast. Then, we will index it with a gensim tool:

python -m gensim.scripts.make_wiki enwiki-latest-pages-articles.xml.bz2 wiki_en_output

Run the previous line in the command shell, not in the Python shell! After several hours, the index will be saved in the same directory. At this point, we can build the final topic model. This process looks exactly like it did for the small AP dataset. We first import a few packages:

import logging, gensim
import numpy as np

Now, we set up logging using the standard Python logging module (which gensim uses to print out status messages). This step is not strictly necessary, but it is nice to have a little more output to know what is happening:

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Now we load the preprocessed data:

id2word = gensim.corpora.Dictionary.load_from_text( 
              'wiki_en_output_wordids.txt') 
mm = gensim.corpora.MmCorpus('wiki_en_output_tfidf.mm') 
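As a quick sanity check (just a convenience, not a required step), the MmCorpus object reports how much data was loaded:

print('Number of documents: {}'.format(mm.num_docs))
print('Vocabulary size: {}'.format(mm.num_terms))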

Finally, we build the LDA model as we did earlier:

model = gensim.models.ldamodel.LdaModel( 
          corpus=mm, 
          id2word=id2word, 
          num_topics=100) 

This will again take a couple of hours. You will see the progress on your console, which can give you an indication of how long you still have to wait.
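If your machine has several cores and you are running a recent version of gensim, one way to shorten the wait is the LdaMulticore class, which trains the same kind of model using several worker processes. The following is a sketch; the workers value is just an example (a common choice is the number of physical cores minus one):

model = gensim.models.ldamulticore.LdaMulticore(
          corpus=mm,
          id2word=id2word,
          num_topics=100,
          workers=3)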

Once it is done, we can save the topic model to a file, so we don't have to redo it:

model.save('wiki_lda.pkl') 

If you exit your session and come back later, you can load the model again using the following command (after the appropriate imports, naturally):

model = gensim.models.ldamodel.LdaModel.load('wiki_lda.pkl') 

The model object can be used to explore the collection of documents and build the topics matrix as we did earlier.
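For completeness, here is a minimal sketch of one way to build that matrix (our illustration, not necessarily the exact code used earlier): one row per document, one column per topic, filled from model[doc], which returns only the (topic, weight) pairs above gensim's default threshold. Be warned that a dense matrix for over 4 million documents and 100 topics occupies a few gigabytes of memory, and the loop itself takes a while:

# Dense document-by-topic matrix; uses the numpy import from above
topics = np.zeros((mm.num_docs, model.num_topics))
for di, doc in enumerate(mm):
    # model[doc] yields only the topics with non-negligible weight
    for ti, weight in model[doc]:
        topics[di, ti] = weight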

We can see that this is still a sparse model, even though we have many more documents than we had earlier (over 4 million as of this writing):

lens = (topics > 0).sum(axis=1)
print('Mean number of topics mentioned: {0:.4}'.format(np.mean(lens)))
print('Percentage of articles mentioning <=10 topics: {0:.1%}'.format(
               np.mean(lens <= 10)))
Mean number of topics mentioned: 6.244
Percentage of articles mentioning <=10 topics: 95.1%

So, the average document mentions 6.244 topics and 95.1 percent of them mention 10 or fewer topics.
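If you prefer a picture over summary statistics, the distribution of lens can be plotted as a histogram. This is a quick sketch, assuming matplotlib is installed:

import matplotlib.pyplot as plt

# One bin per integer count, centered on the integers
plt.hist(lens, bins=np.arange(lens.max() + 2) - 0.5)
plt.xlabel('Number of topics per document')
plt.ylabel('Number of documents')
plt.savefig('topics_per_document.png')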

We can ask what the most talked about topic on Wikipedia is. We will first compute the total weight for each topic (by summing up the weights from all the documents) and then retrieve the words corresponding to the most highly weighted topic. This is performed using the following code:

weights = topics.sum(axis=0) 
words = model.show_topic(weights.argmax(), 64) 
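Note that the return format of show_topic has changed across gensim versions: recent versions return (word, weight) pairs, while very old ones returned (weight, word). Assuming the recent format, we can peek at the topic directly before building any visualization:

# Print the twenty highest-weighted words of the topic
print(', '.join(word for word, _ in words[:20]))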

Using the same tools as we did earlier to build up a visualization, we can see that the most talked about topic is related to music and is a very coherent topic. A full 18 percent of Wikipedia pages are partially related to this topic (5.5 percent of all the words in Wikipedia are assigned to this topic):

[Figure: word cloud of the most talked about topic, dominated by music-related terms]

These plots and numbers were obtained when the book was being written. As Wikipedia keeps changing, your results will be different. We expect that the trends will be similar, but the details may vary.

Alternatively, we can look at the least talked about topic:

words = model.show_topic(weights.argmin(), 64) 

[Figure: word cloud of the least talked about topic]

The least talked about topic is harder to interpret, but many of its top words refer to locations in Africa. Just 2.1 percent of documents touch upon it, and it represents just 0.1 percent of the words.
