In machine learning and natural language processing, a topic model is a type of statistical model used to discover the abstract topics that occur in a collection of documents. Twitter offers a good use case to illustrate the concept: suppose we want to analyze an individual's (or an organization's) tweets to discover any overriding theme. Let's look at a simple example.
If you have a Twitter account, you can perform this exercise quite easily, and you can then apply the same process to any archive of tweets you want to model. First, we need to create a tweet archive file.
Under Settings, you can submit a request to receive your tweets in an archive file. Once it's ready, you'll get an email with a link to download it. Download the archive and save the file locally.
Now that we have a data source to work with, we can move the tweets into a list object (we'll call it x) and then convert that list into an R data frame object (df1).
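As a minimal sketch of this step, the following assumes the archive provides a CSV file with a text column holding each tweet. That layout, the file path, and the sample tweets are illustrative assumptions, since the archive format has changed over time and your own file may need a different reader.

```r
# Hypothetical stand-in for the downloaded archive: a tiny CSV with a
# `text` column. The real archive's file name and layout may differ.
archive_path <- tempfile(fileext = ".csv")
writeLines(c(
  "text",
  "\"Heading to the south for the weekend\"",
  "\"The carolinas are lovely this time of year\"",
  "\"South of the border, then on to the carolinas\""
), archive_path)

# Read the file and move the tweet text into a list object, x
raw <- read.csv(archive_path, stringsAsFactors = FALSE)
x <- as.list(raw$text)

# Convert the list into a data frame, df1, with one row per tweet
df1 <- data.frame(text = unlist(x), stringsAsFactors = FALSE)
str(df1)
```

To work with a real archive, replace archive_path with the location of your saved file and adjust the column name to match it.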
The tweets are converted to a data frame first so that the R tm package can then turn them into a corpus, that is, a Corpus object holding a collection of text documents.
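One way this conversion might look is sketched below. It assumes the tm package is installed and that df1 has a text column. The sample tweets are made up for illustration, and VectorSource is only one of several sources tm offers.

```r
library(tm)

# Illustrative data frame; in practice df1 comes from your tweet archive
df1 <- data.frame(
  text = c("Heading to the south for the weekend",
           "The carolinas are lovely this time of year",
           "South of the border, then on to the carolinas"),
  stringsAsFactors = FALSE
)

# VectorSource treats each element of the character vector as one document
corp <- Corpus(VectorSource(df1$text))

# One document per tweet
length(corp)
```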
Next, we convert the Corpus to a Document-Term Matrix object. This is a mathematical matrix that describes the frequency of terms occurring in a collection of documents: each row represents a document (in this case, one of our tweets), each column represents a term, and each cell holds the number of times that term appears in that document.
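The sketch below shows one possible way to build the matrix with tm's DocumentTermMatrix function. The sample tweets and the control options (lowercasing, punctuation removal, English stop word removal) are assumptions chosen for illustration, not necessarily the settings used for the results discussed here.

```r
library(tm)

# Illustrative tweets standing in for the real archive
tweets <- c("Heading to the south for the weekend",
            "The carolinas are lovely this time of year",
            "South of the border, then on to the carolinas")
corp <- Corpus(VectorSource(tweets))

# Rows are documents (tweets), columns are terms, cells are term counts
dtm <- DocumentTermMatrix(
  corp,
  control = list(tolower = TRUE,
                 removePunctuation = TRUE,
                 stopwords = TRUE)
)

dim(dtm)      # number of documents, number of distinct terms
inspect(dtm)  # print a sample of the matrix
```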
After building a document-term matrix (shown earlier), we can more easily show the importance of the words found within our tweets with a word cloud (also known as a tag cloud). We can do this using the R package wordcloud.
Finally, let's generate the word cloud visual.
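A minimal sketch of the plotting step follows, assuming the tm and wordcloud packages are installed. The sample tweets, the seed, and the min.freq and random.order settings are illustrative choices. Term frequencies are taken as the column sums of the document-term matrix.

```r
library(tm)
library(wordcloud)

# Illustrative tweets standing in for the real archive
tweets <- c("Heading to the south for the weekend",
            "The carolinas are lovely this time of year",
            "South of the border, then on to the carolinas")
corp <- Corpus(VectorSource(tweets))
dtm <- DocumentTermMatrix(
  corp,
  control = list(tolower = TRUE,
                 removePunctuation = TRUE,
                 stopwords = TRUE)
)

# Total count of each term across all tweets, most frequent first
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# Word size is proportional to frequency; the seed fixes the layout
set.seed(42)
wordcloud(words = names(freq), freq = freq,
          min.freq = 1, random.order = FALSE)
```

Setting random.order = FALSE places the most frequent words near the center, which makes the dominant terms easier to spot.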
It seems there may be a theme here! In a word cloud, the size of each word reflects how often it occurs, and this one shows that south and carolinas are the most prominent words in our tweets.