Getting Twitter data

There are a number of ways to gather Twitter data, from web scraping to using custom libraries, and each has different advantages and disadvantages. For our implementation, as we also need sentiment labeling, we will utilize the Sentiment140 dataset (http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip). The main reason we do not collect our own data is the time it would take to label it. In the last section of this chapter, we will see how we can collect our own data and analyze it in real time. The dataset consists of 1.6 million tweets, each containing the following six fields:

  • The tweet's polarity
  • A numeric ID
  • The date it was tweeted
  • The query used to record the tweet
  • The user's name
  • The tweet's text content

For our models, we will only need the tweet's text and polarity. As can be seen in the following graph, there are 800,000 positive (with a polarity of 4) and 800,000 negative (with a polarity of 0) tweets:

Polarity distribution
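Loading the dataset and keeping only these two fields can be sketched with pandas as follows. Note the assumptions: the CSV has no header row, the column names are our own labels, the `latin-1` encoding and the file path are placeholders for your local copy, and the small in-memory sample at the end merely mimics the file's layout for illustration.

```python
import pandas as pd

# Our own labels for the six columns, in the order described above
# (the Sentiment140 CSV itself has no header row).
COLUMNS = ["polarity", "id", "date", "query", "user", "text"]

def load_tweets(path):
    """Load the Sentiment140 CSV and keep only polarity and text."""
    df = pd.read_csv(path, encoding="latin-1", header=None, names=COLUMNS)
    return df[["polarity", "text"]]

# A tiny made-up sample with the same layout, standing in for the real file
sample = pd.DataFrame(
    [[0, 1, "Mon Apr 06", "NO_QUERY", "userA", "this is bad"],
     [4, 2, "Mon Apr 06", "NO_QUERY", "userB", "this is great"]],
    columns=COLUMNS,
)
data = sample[["polarity", "text"]]

# Count tweets per polarity class; on the full dataset this yields
# 800,000 each for polarities 0 and 4
print(data["polarity"].value_counts().to_dict())
```

On the real file, `load_tweets("training.csv")` would return the 1.6-million-row frame, and the same `value_counts` call reproduces the balanced split shown in the graph.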

Here, we can also verify the statement we made earlier about word frequencies. The following graph depicts the 30 most common words in the dataset. As is evident, none of them bears any sentiment. Thus, an IDF transform would be more beneficial to our models:

The 30 most common words in the dataset and the number of occurrences of each
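The word-frequency claim is easy to reproduce: tally every token and take the most common ones. A minimal sketch with `collections.Counter` follows; the two tweet texts are our own examples, not drawn from the dataset.

```python
from collections import Counter

# Made-up tweets standing in for the dataset's text column
tweets = [
    "the movie was the best",
    "the service was not good",
]

# Tally lowercase whitespace-separated tokens across all tweets
counts = Counter()
for text in tweets:
    counts.update(text.lower().split())

# The top entries are function words ("the", "was") that carry no
# sentiment, which is exactly why IDF down-weighting helps
top = counts.most_common(3)
print(top)
```

Replacing `most_common(3)` with `most_common(30)` on the full text column gives the counts plotted above; an IDF transform then discounts precisely these ubiquitous, sentiment-free words.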