Exploratory data analysis

As explained earlier in this chapter, one simple approach is to take a lexicon in which each word is annotated with a sentiment and run some basic analysis against it, which the tidytext package makes easy.

First, we import a few libraries that will come in handy and load our Twitter history:

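library(plyr)   # load plyr before dplyr so that dplyr's verbs are not masked
library(dplyr)
library(tidytext)
library(ggplot2)
# Read the exported tweets, keeping text columns as character rather than factor
df <- read.csv("./data/Tweets.csv", stringsAsFactors = FALSE)
# tibble() replaces the deprecated data_frame()
text_df <- tibble(tweet_id = df$tweet_id, tweet = df$text)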

Now, we use the unnest_tokens function to bring the data into tidy format, with one word per row:

text_df <- text_df %>%
 unnest_tokens(word, tweet)
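
To see what the tidy format looks like, here is a minimal sketch on a made-up two-tweet data frame. The toy object and its contents are illustrative assumptions, not part of the Twitter export. With its default settings, unnest_tokens lowercases the text and strips punctuation:

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the real tweet data
toy <- tibble(
  tweet_id = c(1, 2),
  tweet = c("I love this flight!", "This delay is terrible")
)

# One row per (tweet_id, word) pair; the original tweet column is dropped
toy_tokens <- toy %>%
  unnest_tokens(word, tweet)

print(toy_tokens)
```

Each tweet_id is repeated once per word, which is what allows the joins in the following steps to work word by word.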

And remove the stop words:

data(stop_words)
head(stop_words)
# Drop every token that appears in the stop-word list
text_df <- text_df %>% anti_join(stop_words, by = "word")
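
As a rough illustration of what the anti-join does, the sketch below applies it to a handful of hand-written tokens. The toy_tokens data frame is a hypothetical example. Naming the join column explicitly with by = "word" also avoids the message dplyr prints when it has to guess the key:

```r
library(dplyr)
library(tidytext)

data(stop_words)

# Hypothetical tokens from a single tweet
toy_tokens <- tibble(
  tweet_id = c(1, 1, 1, 1),
  word = c("i", "enjoyed", "this", "flight")
)

# Keep only the rows whose word is NOT in the stop-word list
toy_clean <- toy_tokens %>%
  anti_join(stop_words, by = "word")

print(toy_clean)
```

Because stop_words combines several stop-word lists, it is worth inspecting it before trusting the result: any word it contains is removed before the sentiment lexicon ever sees it.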

Once this is done, we join it with, for instance, the bing lexicon:

bing <- get_sentiments("bing")
text_df %>% inner_join(bing, by = "word")

And we are ready!

# Plot the number of positive and negative words
text_df %>%
  inner_join(bing, by = "word") %>%
  count(sentiment) %>%
  ggplot(aes(sentiment, n, fill = sentiment)) +
  geom_col() +
  theme_bw()

What do we learn from this? The negative words outnumber the positive ones, so Pablo comes across as a slightly more negative person on Twitter. Unfortunately, it is our duty as data scientists to present the facts, even when they are not favorable to us.

Pablo is slightly more negative on Twitter than in person
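
To put a number on "slightly more negative", one option is to turn the counts into shares. The sketch below does this on a few hand-picked words. The toy_tokens data frame and the share column are assumptions made for illustration, and the same pipeline could be applied to text_df instead:

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens; "flight" carries no sentiment and should drop out of the join
toy_tokens <- tibble(
  word = c("love", "happy", "terrible", "awful", "bad", "flight")
)

sentiment_share <- toy_tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%  # keep only words found in the lexicon
  count(sentiment) %>%                                 # one row per sentiment label
  mutate(share = n / sum(n))                           # fraction of matched words

print(sentiment_share)
```

Note that the shares are computed over the words that matched the lexicon, not over all the words in the tweets, so they say nothing about how much of the text was neutral.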
