Modeling will be broken into two distinct parts. The first will focus on word frequency and correlation and will culminate in building a topic model. In the second, we will examine many different quantitative techniques offered by the qdap package in order to compare two different speeches.
As we have everything set up in the document-term matrix, we can move on to exploring word frequencies by creating an object with the column sums, sorted in descending order. Note that as.matrix() is needed in the code to sum the columns. The default sort order is ascending, so putting - in front of freq will change it to descending:
> freq = colSums(as.matrix(dtm))
> ord = order(-freq)
We will examine the head and tail of the object with the following code:
> freq[head(ord)]
american     year      job     work  america      new
     243      241      212      195      187      177
> freq[tail(ord)]
      voic     welcom worldclass    yearold      yemen      youll
         3          3          3          3          3          3
The most frequent word is american, as you might expect from the President, but notice how prominently job and work feature alongside it. You can also see how stemming changed voice to voic and welcome/welcoming/welcomed to welcom.
To look at the distribution of the word frequencies, you can create tables, as follows:
> head(table(freq))
freq
  3   4   5   6   7   8
127 118 112  75  65  50
> tail(table(freq))
freq
177 187 195 212 241 243
  1   1   1   1   1   1
These tables show the number of words occurring at each specific frequency: 127 words occurred three times, and one word, american in our case, occurred 243 times.
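You can cross-check these counts directly against the freq object created above (a quick sanity check, assuming freq is still in the workspace):

```r
# Cross-check the frequency table against freq itself
sum(freq == 3)                # number of words occurring exactly three times (127 above)
names(freq)[which.max(freq)]  # the single most frequent word ("american" above)
```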
Using findFreqTerms(), we can see which words occurred at least 100 times. It looks like he talked quite a bit about business, and it is clear that the government, including the IRS, is here to "help", perhaps even help "now". That is a relief!
> findFreqTerms(dtm, 100)
 [1] "america"  "american" "busi"     "countri"  "everi"
 [6] "get"      "help"     "job"      "let"      "like"
[11] "make"     "need"     "new"      "now"      "one"
[16] "peopl"    "right"    "time"     "work"     "year"
You can find words associated with a given word by correlation using the findAssocs() function. Let's look at busi and job as two examples, using 0.9 as the correlation cutoff:
> findAssocs(dtm, "busi", corlimit=0.9)
$busi
 drop eager  hear  fund   add  main track
 0.98  0.98  0.92  0.91  0.90  0.90  0.90
> findAssocs(dtm, "job", corlimit=0.9)
$job
     hightech           lay       announc         natur
         0.94          0.94          0.93          0.93
          aid  alloftheabov         burma       cleaner
         0.92          0.92          0.92          0.92
         ford        gather        involv          poor
         0.92          0.92          0.92          0.92
     redesign         skill         yemen         sourc
         0.92          0.92          0.92          0.91
The busi associations need further exploration, but job is interesting for its focus on high-tech jobs. It is curious that burma and yemen show up; I guess we still have a job to do with respect to those countries, certainly in yemen.
For visual portrayal, we can produce word clouds and a bar chart. We will create two word clouds to show the different ways to produce them: one with a minimum frequency and the other by specifying the maximum number of words to include. The first one, with a minimum frequency, also includes code to specify the color. The scale syntax determines the minimum and maximum word size by frequency; in this case, the minimum frequency is 50:
> library(wordcloud)  # also loads RColorBrewer, which provides brewer.pal()
> wordcloud(names(freq), freq, min.freq=50, scale=c(3, .5),
    colors=brewer.pal(6, "Dark2"))
The output of the preceding command is as follows:
One can forgo all the fancy graphics, as we will in the following image, which captures the 30 most frequent words:
> wordcloud(names(freq), freq, max.words=30)
The output of the preceding command is as follows:
To produce a bar chart, the code can get a bit complicated, whether you use base R, ggplot2, or lattice. The following code shows how to produce a bar chart of the 10 most frequent words in base R:
> freq = sort(colSums(as.matrix(dtm)), decreasing=TRUE)
> wf = data.frame(word=names(freq), freq=freq)
> wf = wf[1:10, ]
> barplot(wf$freq, names=wf$word, main="Word Frequency",
    xlab="Words", ylab="Counts", ylim=c(0, 250))
The output of the preceding command is as follows:
We will now move on to building topic models using the topicmodels package, which offers the LDA() function. The question now is how many topics to create. It seems logical to solve for three or four, so we will try both, starting with three topics (k=3):
> library(topicmodels)
> set.seed(123)
> lda3 = LDA(dtm, k=3, method="Gibbs")
> topics(lda3)
2010 2011 2012 2013 2014 2015
   3    3    1    1    2    2
We can see that the topics fall into consecutive two-year groups.
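The choice of k is a judgment call. One rough heuristic (a sketch, not a rigorous model-selection procedure; it assumes the dtm built earlier) is to fit a model for each candidate k and compare log-likelihoods:

```r
# Heuristic comparison of candidate topic counts (assumes dtm exists).
# A higher (less negative) log-likelihood is better, all else being
# equal, but note that it tends to keep rising as k grows.
library(topicmodels)
set.seed(123)
ks = 2:5
fits = lapply(ks, function(k) LDA(dtm, k = k, method = "Gibbs"))
data.frame(k = ks, logLik = sapply(fits, logLik))
```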
Now we will try four topics (k=4):
> set.seed(456)
> lda4 = LDA(dtm, k=4, method="Gibbs")
> topics(lda4)
2010 2011 2012 2013 2014 2015
   4    4    3    2    1    1
Here, the topic groupings are similar to the preceding ones, except that the 2012 and 2013 speeches each get their own topic. For simplicity, let's stick with three topics for the speeches. The terms() function produces an ordered list of the most frequent words for each topic. The number of words is specified in the function call, so let's look at the top 20 per topic:
> terms(lda3, 20)
      Topic 1    Topic 2    Topic 3
 [1,] "american" "new"      "year"
 [2,] "job"      "america"  "peopl"
 [3,] "now"      "work"     "know"
 [4,] "right"    "help"     "nation"
 [5,] "get"      "one"      "last"
 [6,] "tax"      "everi"    "take"
 [7,] "busi"     "need"     "invest"
 [8,] "energi"   "make"     "govern"
 [9,] "home"     "world"    "school"
[10,] "time"     "countri"  "also"
[11,] "like"     "let"      "cut"
[12,] "million"  "congress" "two"
[13,] "give"     "state"    "next"
[14,] "well"     "want"     "come"
[15,] "compani"  "tonight"  "deficit"
[16,] "reform"   "first"    "chang"
[17,] "back"     "futur"    "famili"
[18,] "educ"     "keep"     "care"
[19,] "put"      "today"    "economi"
[20,] "unit"     "worker"   "work"
Topic 3 covers the first two speeches. Some key words stand out, such as "invest", "school", "economi", and "deficit". During this time, Congress passed and implemented the $787 billion American Recovery and Reinvestment Act with the goal of stimulating the economy.
Topic 1 covers the next two speeches. Here, the message transitions to "job", "tax", "busi", and what appear to be comments on "energi" policy, the supposedly comprehensive one put forward under the rhetorical "all of the above" in the 2012 speech. Note the association between that rhetorical phrase (alloftheabov) and job when we examined it with findAssocs().
Topic 2 brings us to the last two speeches. There doesn't appear to be a clear theme that rises to the surface as with the others. It appears that these speeches were less about specific calls to action and more about what was done and the future vision of the country and the world. In the next section, we will dig into the exact speech content further, comparing and contrasting his first State of the Union speech with the most recent one.
This portion of the analysis will focus on the power of the qdap package, which allows you to compare multiple documents over a wide array of measures. Our effort will go into comparing the 2010 and 2015 speeches. For starters, we will need to turn the text into data frames, perform sentence splitting, and then combine them into one data frame with a variable that specifies the year of each speech. We will use this as our grouping variable in the analyses. (You can include multiple variables in your groupings.) We will not need any of the other transformations, such as stemming or lowercasing.
Before creating a data frame, we will need to get rid of that pesky (Applause.) text with the gsub() function, and we will also need to load the library:
> library(qdap)
> # fixed=TRUE matches the literal string; without it, the parentheses
> # are treated as a regex group and would not be removed from the text
> state15 = gsub("(Applause.)", "", sou2015, fixed=TRUE)
Now, put the text in a data frame and split it into sentences, which will put one sentence per row. As proper punctuation is present in the text, you can use the sentSplit() function. If punctuation were not there, other functions are available to detect the sentences:
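For reference, one such alternative is qdap's sent_detect() function, which detects sentence boundaries in raw character text (a minimal sketch; assumes qdap is loaded):

```r
# sent_detect() splits raw text on detected endmarks rather than
# relying on a pre-structured data frame column
sent_detect("We must act. And we will act.")
```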
> speech15 = data.frame(speech=state15)
> sent15 = sentSplit(speech15, "speech")
The last thing is to create the year variable:
> sent15$year = "2015"
Repeat the steps for the 2010 speech:
> state10 = gsub("(Applause.)", "", sou2010, fixed=TRUE)
> speech10 = data.frame(speech=state10)
> sent10 = sentSplit(speech10, "speech")
> sent10$year = "2010"
Now, concatenate the two datasets:
> sentences = rbind(sent10, sent15)
To compare the polarity (sentiment scores), use the polarity() function, specifying the text and grouping variables:
> pol = polarity(sentences$speech, sentences$year)
> pol
  year total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
1 2010             443        7233        0.040       0.319              0.124
2 2015             378        6712        0.098       0.274              0.356
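As a quick arithmetic check, the last column is simply the average polarity divided by its standard deviation (using the rounded figures from the table, so the results match only approximately):

```r
# stan.mean.polarity = ave.polarity / sd.polarity
round(0.040 / 0.319, 3)  # 2010: 0.125, close to the reported 0.124
round(0.098 / 0.274, 3)  # 2015: 0.358, close to the reported 0.356
```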
The stan.mean.polarity value represents the standardized mean polarity, which is the average polarity divided by the standard deviation. We see that 2015 was somewhat higher (0.356) than 2010 (0.124). This is in line with what we expected. You can also plot the data. The plot produces two charts: the first shows the polarity by sentence over time, and the second shows the distribution of the polarity:
> plot(pol)
The output of the preceding command is as follows:
This plot may be a challenge to read here, but let me do my best to interpret it. The 2010 speech starts out with a strongly negative sentiment and is, overall, more negative than 2015. We can identify the most negative sentence by creating a data frame from the pol object, finding the sentence number, and calling up that sentence:
> pol.df = pol$all
> which.min(pol.df$polarity)
[1] 12
> pol.df$text.var[12]
[1] "One year ago, I took office amid two wars, an economy rocked by a
severe recession, a financial system on the verge of collapse, and a
government deeply in debt."
Now that is negative sentiment! We will look at the readability index next:
> ari = automated_readability_index(sentences$speech, sentences$year)
> ari$Readability
  year word.count sentence.count character.count Automated_Readability_Index
1 2010       7207            443           33623                    8.677994
2 2015       6671            378           30469                    8.906440
I think it is no surprise that they are basically the same. Formality analysis is next. This takes a couple of minutes to run in R:
> form = formality(sentences$speech, sentences$year)
> form
  year word.count formality
1 2015       6676     62.49
2 2010       7412     58.88
This looks very similar between the two speeches. We can examine the proportions of the parts of speech and also produce a plot that confirms this, as follows:
> form$form.prop.by
  year word.count  noun   adj  prep articles pronoun  verb adverb interj other
1 2010       7412 24.22 11.39 14.64     6.46   10.75 21.57   6.58   0.03  4.36
2 2015       6676 24.94 12.46 16.37     6.34   10.23 19.19   5.69   0.01  4.76
> plot(form)
The following is the output of the preceding command:
Next, we produce the diversity measures. Again, they are nearly identical. A plot is also available (plot(div)), but being so similar, it adds no value. It is interesting to note that Obama's speechwriter for 2010 was Jon Favreau, and in 2015 it was Cody Keenan:
> div = diversity(sentences$speech, sentences$year)
> div
  year   wc simpson shannon collision berger_parker brillouin
1 2010 7207   0.992   6.163     4.799         0.047     5.860
2 2015 6671   0.992   6.159     4.791         0.039     5.841
One of my favorite plots is the dispersion plot, which shows where a word occurs throughout a text. Let's examine the dispersion of "economy", "jobs", and "families":
> dispersion_plot(sentences$speech, grouping.var=sentences$year,
    c("economy", "jobs", "families"), color="black", bg.color="white")
This is quite interesting as these topics were discussed early on in the 2010 speech but at the end in the 2015 speech.
Many of the tasks that we performed earlier with the tm package can also be done in qdap. So, the last thing I want to show you is how to compute word frequencies with qdap and count the top ten words for each speech. This is easy with the freq_terms() function. In addition to specifying the top ten words, we will also specify one of the stopword defaults available in qdap, in this case the top 200 words (versus the other option, the top 100):
> freq2010 = freq_terms(sent10$speech, top=10, stopwords=Top200Words)
> freq2010
   WORD       FREQ
1  americans    28
2  that's       26
3  jobs         23
4  it's         20
5  years        19
6  american     18
7  businesses   18
8  those        18
9  families     17
10 last         16
> freq2015 = freq_terms(sent15$speech, top=10, stopwords=Top200Words)
> freq2015
   WORD       FREQ
1  that's       28
2  years        25
3  every        24
4  american     19
5  country      19
6  economy      18
7  jobs         18
8  americans    17
9  lets         17
10 families     16
This completes our analysis of the two speeches. I must confess that I did not listen to any of these speeches. In fact, I haven't watched a State of the Union address since Reagan was President with the exception of the 2002 address. This provided some insight for me on how the topics and speech formats have changed over time to accommodate political necessity, while the overall style of formality and sentence structure has remained consistent. Keep in mind that this code can be adapted to text for dozens, if not hundreds, of documents and with multiple speakers, for example, screenplays, legal proceedings, interviews, social media, and on and on. Indeed, text mining can bring quantitative order to what has been qualitative chaos.