Working with tidy text

For this, we will use the tidytext package. This package is built on the philosophy of tidy data, introduced by Hadley Wickham in his 2014 paper (https://www.jstatsoft.org/article/view/v059i10). A dataset is tidy if the following three conditions are satisfied:

  • Each variable is a column
  • Each observation is a row
  • Each type of observational unit is a table
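As a quick illustration of the difference (using a small hypothetical city/year table, not data from this chapter), here is the same data in wide and in tidy form:

```r
library(tidyr)
library(dplyr)

# Hypothetical wide table: the year columns hide a variable
# in their names, so this is NOT tidy
wide <- tibble(city = c("Prague", "Brno"),
               `2013` = c(10, 7),
               `2014` = c(12, 8))

# pivot_longer() gathers the year columns into a (year, value)
# pair, so that each row is exactly one observation
tidy <- wide %>%
  pivot_longer(-city, names_to = "year", values_to = "value")

tidy  # four rows, one per (city, year) observation
```

The tidy version satisfies all three conditions above: `city`, `year`, and `value` are each a column, and each row is one observation.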

The tidytext package helps us turn our text into tidy form by putting one token per row. Let's start by loading dplyr and tidytext. If you don't have tidytext installed, install it first with install.packages("tidytext").

Load the packages and let's transform our text into a data frame:

library(tidytext)
library(dplyr)
# `text` is the character vector of reviews we created earlier;
# data_frame() is deprecated, so we use tibble() instead
text_df <- tibble(line = 1:4, text = text)

The unnest_tokens function is where the magic of tidytext begins:

text_df <- text_df %>%
  unnest_tokens(word, text)
head(text_df)
# A tibble: 6 x 2
   line word
  <int> <chr>
1     1 the
2     1 food
3     1 is
4     1 typical
5     1 czech
6     1 and

As you can see, our text was transformed into one token per row (by default, one token = one word). Next, let's get rid of stop words:

data(stop_words)
head(stop_words)
text_df <- text_df %>%
  anti_join(stop_words, by = "word")
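To see concretely what the anti-join removes, we can re-tokenize a single toy review and compare the tokens before and after dropping stop words. The first six tokens below match the output above; the rest of the sentence is a hypothetical completion for illustration:

```r
library(dplyr)
library(tidytext)

data(stop_words)

# A toy review: the start of review 1, hypothetically completed
toy <- tibble(line = 1,
              text = "the food is typical czech and the beer is great")

tokens <- toy %>% unnest_tokens(word, text)
nrow(tokens)  # 10 tokens in total

# Function words such as "the", "is", and "and" are removed;
# content words like "food" and "czech" survive
kept <- tokens %>% anti_join(stop_words, by = "word")
kept$word
```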

Our goal is to determine, at least visually for now, the sentiment of the preceding reviews. Let's begin with a quick summary of the word counts:

library(ggplot2)
text_df %>%
  count(word, sort = TRUE) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  theme_bw()

We get a nice bar chart like this:

Basic word count in our toy review dataset

The tidytext package includes three lexicons (collections of words annotated by sentiment):

The AFINN lexicon assigns common English words a numeric value between -5 and 5, with negative values indicating negative sentiment. For instance:

get_sentiments("afinn") %>%
  filter(score == -5) %>%
  head

Gives (sensitive readers should skip the next snippet):

# A tibble: 6 x 2
  word     score
  <chr>    <int>
1 bastard     -5
2 bastards    -5

Whereas the following:

get_sentiments("afinn") %>%
  filter(score == 0) %>%
  head

Is simply:

# A tibble: 1 x 2
  word      score
  <chr>     <int>
1 some kind     0

And the following:

get_sentiments("afinn") %>%
  filter(score == 5) %>%
  head

Returns:

# A tibble: 5 x 2
  word         score
  <chr>        <int>
1 breathtaking     5
2 hurrah           5
3 outstanding      5
4 superb           5
5 thrilled         5

The bing lexicon simply labels words as positive or negative:

get_sentiments("bing") %>% head
# A tibble: 6 x 2
  word       sentiment
  <chr>      <chr>
1 2-faced    negative
2 2-faces    negative
3 a+         positive
4 abnormal   negative
5 abolish    negative
6 abominable negative

The nrc lexicon, in contrast, tags words with several different emotion categories:

get_sentiments("nrc") %>% head
# A tibble: 6 x 2
  word      sentiment
  <chr>     <chr>
1 abacus    trust
2 abandon   fear
3 abandon   negative
4 abandon   sadness
5 abandoned anger
6 abandoned fear

How can we use these word lists? Well, once our data is tidy, we can join them and create different aggregations to get a feel for what is going on. Let's start by storing the bing lexicon somewhere:

bing <- get_sentiments("bing")

And joining it with our data:

text_df %>% inner_join(bing) %>% count(line, sentiment)
Joining, by = "word"
# A tibble: 5 x 3
   line sentiment     n
  <int> <chr>     <int>
1     1 negative      1
2     2 negative      1
3     3 negative      3
4     3 positive      1
5     4 positive      2

Not bad, but we can always do better with a plot:

# Plot
text_df %>%
  inner_join(bing) %>%
  count(line, sentiment) %>%
  ggplot(aes(line, n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  theme_bw()

The results are shown as follows. Not bad: a simple join and a summary aggregation already give a basic insight into how to classify the reviews by sentiment:

Aggregate statistics using the bing lexicon

This is already valuable, but in some cases we would like to know how positive or negative a review is; for instance, to redirect the issue to the proper customer service representative. In this case, it might be useful to use the AFINN lexicon instead:

afinn <- get_sentiments("afinn")

Now, we join the review data as before:

text_df %>% inner_join(afinn)

And look at the total score per review:

# Group
text_df %>%
  inner_join(afinn) %>%
  group_by(line) %>%
  summarize(total_score = sum(score))

Not bad; but again, it's better to make a plot:

# Plot
text_df %>%
  inner_join(afinn) %>%
  group_by(line) %>%
  summarize(total_score = sum(score)) %>%
  mutate(sentiment = ifelse(total_score > 0, "positive", "negative")) %>%
  ggplot(aes(line, total_score, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  theme_bw()

Which is shown here:

Aggregate statistics using the AFINN lexicon

Wow, what happened here? The situation looks a bit weird. First, note that one review has disappeared. This is because there are no common words with the AFINN lexicon for the first review (recall that we had an inner join). A bit more worrying is what has happened to the third review. The score is zero, which comes from summing the positive and negative scores of each word in the review, as per the AFINN lexicon. However, note that review 3 is mostly positive. What happened then?
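The cancellation is easy to reproduce with a miniature example (hypothetical words and a hand-made AFINN-style score table, not the chapter's actual review data): a strongly positive and a strongly negative word sum to zero, while words the lexicon does not cover are silently dropped by the inner join.

```r
library(dplyr)

# Hand-made AFINN-style lexicon (hypothetical scores)
mini_afinn <- tibble(word  = c("great", "awful"),
                     score = c(3L, -3L))

# A hypothetical review with one positive, one negative, and
# one unscored word
review <- tibble(line = 3,
                 word = c("great", "awful", "location"))

# The scored words cancel out to a total of zero
review %>%
  inner_join(mini_afinn, by = "word") %>%
  summarize(total_score = sum(score))

# The unscored word is dropped without warning by inner_join();
# anti_join() reveals what the lexicon missed
review %>% anti_join(mini_afinn, by = "word")
```

Running the same anti_join() between text_df and afinn would show which words of each real review never entered the score in the first place.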
