We will leave behind the 19th century and look at these recent times of trial and tribulation (1965 through 2016). In looking at this data, I found something interesting and troubling. Let's take a look at the 1970s:
> sotu_meta[185:191, 1:4]
# A tibble: 7 x 4
  president        year  years_active party
  <chr>            <int> <chr>        <chr>
1 Richard M. Nixon 1970  1969-1973    Republican
2 Richard M. Nixon 1971  1969-1973    Republican
3 Richard M. Nixon 1972  1969-1973    Republican
4 Richard M. Nixon 1972  1969-1973    Republican
5 Richard M. Nixon 1974  1973-1974    Republican
6 Richard M. Nixon 1974  1973-1974    Republican
7 Gerald R. Ford   1975  1974-1977    Republican
We see two 1972 addresses and two 1974 addresses, but none for 1973. What? I went to the Nixon Foundation website, spent about ten minutes trying to deconflict this, and finally threw my hands in the air and decided to implement a quick fix. Be advised that there are a number of these conflicts to put in order:
> sotu_meta[188, 2] <- "1972_2"
> sotu_meta[190, 2] <- "1974_2"
> sotu_meta[157, 2] <- "1945_2"
> sotu_meta[166, 2] <- "1953_2"
> sotu_meta[170, 2] <- "1956_2"
> sotu_meta[176, 2] <- "1961_2"
> sotu_meta[195, 2] <- "1978_2"
> sotu_meta[197, 2] <- "1979_2"
> sotu_meta[199, 2] <- "1980_2"
> sotu_meta[201, 2] <- "1981_2"
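Rather than locating each duplicate by eye, the second copy of any repeated year can be found and relabeled programmatically. Here is a minimal base-R sketch of that idea on a toy vector; on the real data you would apply the same logic to sotu_meta$year:

```r
# Toy vector standing in for sotu_meta$year
years <- c(1972L, 1972L, 1973L, 1974L, 1974L)

# duplicated() flags the second (and later) occurrence of each value
dup <- duplicated(years)

# Relabel the duplicates with a "_2" suffix, matching the manual fix
fixed <- as.character(years)
fixed[dup] <- paste0(fixed[dup], "_2")

fixed
# -> "1972" "1972_2" "1973" "1974" "1974_2"
```

This handles exactly two addresses per year, which is all the data contains; a year with three copies would need the row_number-style suffixes instead.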
An email to the author of this package is in order. I won't bother with that, but feel free to solve the issue yourself.
With this tragedy behind us, we'll go through tokenizing and removing stop words again for our relevant time frame:
> sotu_meta_recent <- sotu_meta %>%
    dplyr::filter(year > 1964)
> sotu_unnest_recent <- sotu_meta_recent %>%
    tidytext::unnest_tokens(word, text)
> sotu_recent <- sotu_unnest_recent %>%
    dplyr::anti_join(stop_words, by = "word")
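Under the hood, anti_join() simply drops every token that appears in the stop-word list. A toy base-R sketch of the same filtering logic:

```r
# Toy token and stop-word vectors; the real anti_join() matches on the
# word column of the tokenized data frame
tokens <- c("the", "union", "is", "strong")
stops  <- c("the", "is", "a")

# Keep only tokens that are not in the stop list
kept <- tokens[!tokens %in% stops]

kept
# -> "union" "strong"
```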
As discussed previously, we need to put the data into a DTM before building a model. This is done by creating a word count grouped by year, then passing that to the cast_dtm() function:
> lda_words <- sotu_recent %>%
    dplyr::group_by(year) %>%
    dplyr::count(word)
> sotu_dtm <- tidytext::cast_dtm(lda_words, year, word, n)
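Conceptually, cast_dtm() pivots those grouped counts into a documents-by-terms matrix. A toy sketch with base R's xtabs() shows the shape of the result (the real call additionally produces the sparse DocumentTermMatrix class that topicmodels expects):

```r
# Toy version of lda_words: one row per (year, word) with a count n
counts <- data.frame(
  year = c(1970, 1970, 1971),
  word = c("peace", "economy", "economy"),
  n    = c(3, 2, 4)
)

# One row per document (year), one column per term, counts in the cells
dtm <- xtabs(n ~ year + word, data = counts)
# 1970 row: economy 2, peace 3; 1971 row: economy 4, peace 0
```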
Let's build the model. I'm going to create six topics using the Gibbs method, with verbose output so we can watch the sampler's progress; by default it runs 2,000 iterations:
> sotu_lda <-
topicmodels::LDA(
sotu_dtm,
k = 6,
method = "Gibbs",
control = list(seed = 1965, verbose = 1)
)
> sotu_lda
A LDA_Gibbs topic model with 6 topics.
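As an aside, the Gibbs control list also accepts burnin, thin, and iter arguments (iter is the iteration count, which defaults to 2,000). A sketch of a more explicit version of the same call, assuming the sotu_dtm from above:

```r
# Sketch only -- the same model as above, with the Gibbs sampler's
# iteration schedule spelled out explicitly
sotu_lda <-
  topicmodels::LDA(
    sotu_dtm,
    k = 6,
    method = "Gibbs",
    control = list(
      seed    = 1965,
      verbose = 100,   # report progress every 100 iterations
      burnin  = 500,   # discard the first 500 draws
      iter    = 2000   # the default number of iterations
    )
  )
```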
The algorithm gives each topic a number, and we can see which year is mapped to which topic. I abbreviate the output to the years from 2002 onward:
> topicmodels::topics(sotu_lda)
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
   2    2    2    2    2    2    2    4    4    4    4    4    4    4    4
We see a clear transition between Bush and Obama from topic 2 to topic 4. Here is a table of the count of topics:
> table(topicmodels::topics(sotu_lda))
 1  2  3  4  5  6
 8  7  5 18 14  5
Topic 4 is the most prevalent; it covers the Obama years and is associated with Clinton's term as well. This output gives us the top five words associated with each topic:
> topicmodels::terms(sotu_lda, 5)
     Topic 1      Topic 2    Topic 3
[1,] "future"     "america"  "administration"
[2,] "tax"        "security" "congress"
[3,] "spending"   "country"  "economic"
[4,] "government" "world"    "legislation"
[5,] "economic"   "iraq"     "energy"
     Topic 4      Topic 5    Topic 6
[1,] "people"     "world"    "federal"
[2,] "american"   "people"   "programs"
[3,] "jobs"       "american" "government"
[4,] "america"    "congress" "program"
[5,] "children"   "peace"    "act"
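The depth of that word list is just the second argument to terms(); assuming the sotu_lda object from earlier, a deeper view is one call away:

```r
# Top 10 words per topic instead of 5; 15 or 20 work the same way
topicmodels::terms(sotu_lda, 10)
```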
This all makes good sense, and topic 2 is spot on for its era. Drilling down further to, say, 10, 15, or 20 words per topic is even more revealing, but I won't bore you further. What about an application in the tidy ecosystem and a visualization? Certainly! We'll first turn the model object into a data frame and, in the process, capture the per-topic-per-word probabilities, called beta:
> lda_topics <- tidytext::tidy(sotu_lda, matrix = "beta")
> ap_top_terms <- lda_topics %>%
dplyr::group_by(topic) %>%
dplyr::top_n(10, beta) %>%
dplyr::ungroup() %>%
dplyr::arrange(topic, -beta)
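A side note on that pipeline: in recent dplyr versions top_n() is superseded, and slice_max() is the recommended equivalent. A sketch of the same computation in that style, assuming the lda_topics object from the tidy() call:

```r
# Same result as the top_n() pipeline, written with slice_max()
ap_top_terms <- lda_topics %>%
  dplyr::group_by(topic) %>%
  dplyr::slice_max(beta, n = 10) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(topic, -beta)
```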
We can explore that data further or just plot it as follows:
> ap_top_terms %>%
dplyr::mutate(term = reorder(term, beta)) %>%
ggplot2::ggplot(ggplot2::aes(term, beta, fill = factor(topic))) +
ggplot2::geom_col(show.legend = FALSE) +
ggplot2::facet_wrap(~ topic, scales = "free") +
ggplot2::coord_flip() +
ggthemes::theme_economist_white()
The resulting plot shows the top 10 words per topic based on the beta probability. Another thing we can do is look at the probability that an address is related to a topic. This is referred to as gamma in the model, and we can pull those values in just like the beta:
> ap_documents <- tidytext::tidy(sotu_lda, matrix = "gamma")
We now have the probabilities of an address per topic. Let's look at the 1981 Ronald Reagan values:
> dplyr::filter(ap_documents, document == "1981")
# A tibble: 6 x 3
  document topic  gamma
  <chr>    <int>  <dbl>
1 1981         1 0.286
2 1981         2 0.0163
3 1981         3 0.0923
4 1981         4 0.118
5 1981         5 0.0777
6 1981         6 0.411
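With the gamma table in hand, a common next step is to pick each address's single most probable topic. Here is a toy base-R sketch of that selection (dplyr::slice_max() on ap_documents would do the same in the tidy style):

```r
# Toy gamma table standing in for ap_documents
gammas <- data.frame(
  document = c("1981", "1981", "1982", "1982"),
  topic    = c(1, 6, 1, 6),
  gamma    = c(0.286, 0.411, 0.50, 0.30)
)

# For each document, keep the row with the highest gamma
dominant <- do.call(rbind, lapply(
  split(gammas, gammas$document),
  function(d) d[which.max(d$gamma), ]
))

dominant$topic
# -> 6 1
```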
Topic 1 is a close second in the topic race. If you think about it, this suggests that more than six topics might create better separation in the probabilities. However, I'll stay with six topics in this chapter for the purpose of demonstration.
Our next endeavor will be turning the DTM into input features for a simple classification model to predict political party, because partying is what politicians do best.