Exploratory data analysis

As explained earlier in this chapter, one option is to take a lexicon in which each word is annotated with a sentiment and run some basic analysis against it, using the tidytext package.

First, we import a few libraries that will come in handy and load our Twitter history:

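library(plyr)   # load plyr before dplyr so dplyr's verbs are not masked
library(dplyr)
library(tidytext)
library(ggplot2)
# Read the exported tweets, keeping text columns as character vectors
df <- read.csv("./data/Tweets.csv", stringsAsFactors = FALSE)
# tibble() replaces the deprecated data_frame()
text_df <- tibble(tweet_id = df$tweet_id, tweet = df$text)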

Now, we use the unnest_tokens function to bring the data into tidy format:

text_df <- text_df %>%
  unnest_tokens(word, tweet)
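
To make the tidy format concrete, here is a minimal sketch on two made-up tweets (the `toy` data frame is hypothetical and not taken from Tweets.csv). With its default settings, `unnest_tokens` lowercases the text, strips punctuation, and returns one row per word while keeping the `tweet_id` column:

```r
library(dplyr)
library(tidytext)

# Hypothetical toy tweets, used only for illustration
toy <- tibble(
  tweet_id = c(1, 2),
  tweet = c("Loving the new release!", "This bug is terrible")
)

# One row per word; the original tweet column is replaced by `word`
toy_tokens <- toy %>%
  unnest_tokens(word, tweet)

print(toy_tokens)
```

Each word still carries the identifier of the tweet it came from, which is what makes the later joins and counts straightforward.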

And remove the stop words:

data(stop_words)
head(stop_words)
# Drop every token that appears in the stop word list
text_df <- text_df %>% anti_join(stop_words, by = "word")
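
The effect of `anti_join` may be easier to see on a small case. The sketch below uses a hypothetical three-word stop list (`toy_stops`) instead of the real `stop_words` table, so the result can be checked by eye: only the rows whose `word` has no match in the stop list are kept.

```r
library(dplyr)

# Hypothetical tokens and stop list, used only for illustration
toy_tokens <- tibble(
  tweet_id = c(2, 2, 2, 2),
  word = c("this", "bug", "is", "terrible")
)
toy_stops <- tibble(word = c("this", "is", "the"))

# anti_join keeps the rows of toy_tokens with no match in toy_stops
toy_clean <- toy_tokens %>%
  anti_join(toy_stops, by = "word")

print(toy_clean)
```

Passing `by = "word"` makes the join column explicit and avoids the message that dplyr prints when it has to guess it.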

Once this is done, we join it with, for instance, the bing lexicon:

bing <- get_sentiments("bing")
# Keep only the words that carry a sentiment label in the lexicon
text_df %>% inner_join(bing, by = "word")
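
As a rough illustration of what this join produces, the sketch below uses a hypothetical miniature lexicon (`toy_lexicon`) in place of bing. Words missing from the lexicon are dropped by `inner_join`, and the remaining rows can be counted per sentiment. The `share` column is an addition for illustration, not part of the original analysis:

```r
library(dplyr)

# Hypothetical tokens and lexicon, used only for illustration
toy_tokens <- tibble(
  tweet_id = c(1, 1, 2, 2, 2),
  word = c("loving", "release", "terrible", "bug", "broken")
)
toy_lexicon <- tibble(
  word = c("loving", "terrible", "broken"),
  sentiment = c("positive", "negative", "negative")
)

# Label the words found in the lexicon, then count per sentiment
sentiment_counts <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(sentiment) %>%
  mutate(share = n / sum(n))

print(sentiment_counts)
```

Looking at shares rather than raw counts can help when comparing accounts with very different numbers of tweets.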

And we are ready!

# Plot
text_df %>%
  inner_join(bing, by = "word") %>%
  count(sentiment) %>%
  ggplot(aes(sentiment, n, fill = sentiment)) +
  geom_col() +
  theme_bw()

What do we learn from this? The tweets contain somewhat more words labeled negative than positive, so Pablo comes across as a slightly more negative person on Twitter. Unfortunately, it is our duty as data scientists to present the facts, even when they are not favorable to us.

Pablo is slightly more negative on Twitter than in person