We will leave behind the 19th century and look at these recent times of trial and tribulation (1965 through 2016). In looking at this data, I found something interesting and troubling. Let's take a look at the 1970s:
> sotu_meta[185:191, 1:4]
# A tibble: 7 x 4
  president        year  years_active party
  <chr>            <int> <chr>        <chr>
1 Richard M. Nixon 1970  1969-1973    Republican
2 Richard M. Nixon 1971  1969-1973    Republican
3 Richard M. Nixon 1972  1969-1973    Republican
4 Richard M. Nixon 1972  1969-1973    Republican
5 Richard M. Nixon 1974  1973-1974    Republican
6 Richard M. Nixon 1974  1973-1974    Republican
7 Gerald R. Ford   1975  1974-1977    Republican
We see two 1972 addresses and two 1974 addresses, but none for 1973. What? I went to the Nixon Foundation website, spent about ten minutes trying to deconflict this, and finally threw my hands in the air and decided to implement a quick fix. Be advised that there are a number of these conflicts to put in order:
> sotu_meta[188, 2] <- "1972_2"
> sotu_meta[190, 2] <- "1974_2"
> sotu_meta[157, 2] <- "1945_2"
> sotu_meta[166, 2] <- "1953_2"
> sotu_meta[170, 2] <- "1956_2"
> sotu_meta[176, 2] <- "1961_2"
> sotu_meta[195, 2] <- "1978_2"
> sotu_meta[197, 2] <- "1979_2"
> sotu_meta[199, 2] <- "1980_2"
> sotu_meta[201, 2] <- "1981_2"
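Rather than locating each duplicate by eye, the second copy of any repeated year can be found and relabeled programmatically. Here is a minimal base-R sketch of that idea on a toy vector; on the real data you would apply the same logic to sotu_meta$year:

```r
# Toy vector standing in for sotu_meta$year
years <- c(1972L, 1972L, 1973L, 1974L, 1974L)

# duplicated() flags the second (and later) occurrence of each value
dup <- duplicated(years)

# Relabel the duplicates with a "_2" suffix, matching the manual fix
fixed <- as.character(years)
fixed[dup] <- paste0(fixed[dup], "_2")

fixed
# -> "1972" "1972_2" "1973" "1974" "1974_2"
```

This handles exactly two addresses per year, which is all the data contains; a year with three copies would need the row_number-style suffixes instead.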
An email to the author of this package is in order. I won't bother with that, but feel free to solve the issue yourself.
With this tragedy behind us, we'll go through tokenizing and removing stop words again for our relevant time frame:
> sotu_meta_recent <- sotu_meta %>%
    dplyr::filter(year > 1964)
> sotu_unnest_recent <- sotu_meta_recent %>%
    tidytext::unnest_tokens(word, text)
> sotu_recent <- sotu_unnest_recent %>%
    dplyr::anti_join(stop_words, by = "word")
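Under the hood, anti_join() simply drops every token that appears in the stop-word list. A toy base-R sketch of the same filtering logic:

```r
# Toy token and stop-word vectors; the real anti_join() matches on the
# word column of the tokenized data frame
tokens <- c("the", "union", "is", "strong")
stops  <- c("the", "is", "a")

# Keep only tokens that are not in the stop list
kept <- tokens[!tokens %in% stops]

kept
# -> "union" "strong"
```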
As discussed previously, we need to put the data into a DTM before building a model. This is done by creating a word count grouped by year, then passing that to the cast_dtm() function:
> lda_words <- sotu_recent %>%
    dplyr::group_by(year) %>%
    dplyr::count(word)
> sotu_dtm <- tidytext::cast_dtm(lda_words, year, word, n)
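Conceptually, cast_dtm() pivots those grouped counts into a documents-by-terms matrix. A toy sketch with base R's xtabs() shows the shape of the result (the real call additionally produces the sparse DocumentTermMatrix class that topicmodels expects):

```r
# Toy version of lda_words: one row per (year, word) with a count n
counts <- data.frame(
  year = c(1970, 1970, 1971),
  word = c("peace", "economy", "economy"),
  n    = c(3, 2, 4)
)

# One row per document (year), one column per term, counts in the cells
dtm <- xtabs(n ~ year + word, data = counts)
# 1970 row: economy 2, peace 3; 1971 row: economy 4, peace 0
```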
Let's build the model. I'm going to create six topics using the Gibbs method, with verbose output so we can watch the sampler's progress; by default it runs 2,000 iterations:
> sotu_lda <-
topicmodels::LDA(
sotu_dtm,
k = 6,
method = "Gibbs",
control = list(seed = 1965, verbose = 1)
)
> sotu_lda
A LDA_Gibbs topic model with 6 topics.
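As an aside, the Gibbs control list also accepts burnin, thin, and iter arguments (iter is the iteration count, which defaults to 2,000). A sketch of a more explicit version of the same call, assuming the sotu_dtm from above:

```r
# Sketch only -- the same model as above, with the Gibbs sampler's
# iteration schedule spelled out explicitly
sotu_lda <-
  topicmodels::LDA(
    sotu_dtm,
    k = 6,
    method = "Gibbs",
    control = list(
      seed    = 1965,
      verbose = 100,   # report progress every 100 iterations
      burnin  = 500,   # discard the first 500 draws
      iter    = 2000   # the default number of iterations
    )
  )
```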
The algorithm gives each topic a number, and we can see which year is mapped to which topic. I abbreviate the output to the years from 2002 onward:
> topicmodels::topics(sotu_lda)
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
   2    2    2    2    2    2    2    4    4    4    4    4    4    4    4
We see a clear transition between Bush and Obama from topic 2 to topic 4. Here is a table of the count of topics:
> table(topicmodels::topics(sotu_lda))
 1  2  3  4  5  6
 8  7  5 18 14  5
Topic 4 is the most prevalent; it covers the Obama years and is associated with Clinton's term as well. This output gives us the top five words associated with each topic:
> topicmodels::terms(sotu_lda, 5)
     Topic 1      Topic 2    Topic 3
[1,] "future"     "america"  "administration"
[2,] "tax"        "security" "congress"
[3,] "spending"   "country"  "economic"
[4,] "government" "world"    "legislation"
[5,] "economic"   "iraq"     "energy"
     Topic 4      Topic 5    Topic 6
[1,] "people"     "world"    "federal"
[2,] "american"   "people"   "programs"
[3,] "jobs"       "american" "government"
[4,] "america"    "congress" "program"
[5,] "children"   "peace"    "act"
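The depth of that word list is just the second argument to terms(); assuming the sotu_lda object from earlier, a deeper view is one call away:

```r
# Top 10 words per topic instead of 5; 15 or 20 work the same way
topicmodels::terms(sotu_lda, 10)
```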
This all makes good sense, and topic 2 is spot on for its era. Drilling down further to, say, 10, 15, or 20 words per topic is even more revealing, but I won't bore you further. What about an application in the tidy ecosystem and a visualization? Certainly! We'll first turn the model object into a data frame and, in the process, capture the per-topic-per-word probabilities, called beta:
> lda_topics <- tidytext::tidy(sotu_lda, matrix = "beta")
> ap_top_terms <- lda_topics %>%
dplyr::group_by(topic) %>%
dplyr::top_n(10, beta) %>%
dplyr::ungroup() %>%
dplyr::arrange(topic, -beta)
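A side note on that pipeline: in recent dplyr versions top_n() is superseded, and slice_max() is the recommended equivalent. A sketch of the same computation in that style, assuming the lda_topics object from the tidy() call:

```r
# Same result as the top_n() pipeline, written with slice_max()
ap_top_terms <- lda_topics %>%
  dplyr::group_by(topic) %>%
  dplyr::slice_max(beta, n = 10) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(topic, -beta)
```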
We can explore that data further or just plot it as follows:
> ap_top_terms %>%
dplyr::mutate(term = reorder(term, beta)) %>%
ggplot2::ggplot(ggplot2::aes(term, beta, fill = factor(topic))) +
ggplot2::geom_col(show.legend = FALSE) +
ggplot2::facet_wrap(~ topic, scales = "free") +
ggplot2::coord_flip() +
ggthemes::theme_economist_white()
The resulting plot shows the top 10 words per topic based on the beta probability. Another thing we can do is look at the probability that an address is related to a topic. This is referred to as gamma in the model, and we can pull those values in just like the beta:
> ap_documents <- tidytext::tidy(sotu_lda, matrix = "gamma")
We now have the probabilities of an address per topic. Let's look at the 1981 Ronald Reagan values:
> dplyr::filter(ap_documents, document == "1981")
# A tibble: 6 x 3
  document topic  gamma
  <chr>    <int>  <dbl>
1 1981         1 0.286
2 1981         2 0.0163
3 1981         3 0.0923
4 1981         4 0.118
5 1981         5 0.0777
6 1981         6 0.411
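With the gamma table in hand, a common next step is to pick each address's single most probable topic. Here is a toy base-R sketch of that selection (dplyr::slice_max() on ap_documents would do the same in the tidy style):

```r
# Toy gamma table standing in for ap_documents
gammas <- data.frame(
  document = c("1981", "1981", "1982", "1982"),
  topic    = c(1, 6, 1, 6),
  gamma    = c(0.286, 0.411, 0.50, 0.30)
)

# For each document, keep the row with the highest gamma
dominant <- do.call(rbind, lapply(
  split(gammas, gammas$document),
  function(d) d[which.max(d$gamma), ]
))

dominant$topic
# -> 6 1
```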
Topic 1 is a close second in the topic race. If you think about it, this suggests that more than six topics might create better separation in the probabilities. However, I'll stay with six topics in this chapter for the purpose of demonstration.
Our next endeavor will be turning the DTM into input features for a simple classification model to predict political party, because partying is what politicians do best.