Getting Twitter data

There are a number of ways to gather Twitter data, from web scraping to using custom libraries, and each has different advantages and disadvantages. For our implementation, as we also need sentiment labeling, we will utilize the Sentiment140 dataset (http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip). The main reason we do not collect our own data is the time it would take to label it. In the last section of this chapter, we will see how we can collect our own data and analyze it in real time. The dataset consists of 1.6 million tweets, each containing the following six fields:

  • The tweet's polarity
  • A numeric ID
  • The date it was tweeted
  • The query used to record the tweet
  • The user's name
  • The tweet's text content

For our models, we will only need the tweet's text and polarity. As can be seen in the following graph, there are 800,000 positive (with a polarity of 4) and 800,000 negative (with a polarity of 0) tweets:

Polarity distribution
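Loading the dataset and keeping only these two fields can be sketched with pandas as follows. Note the assumptions: the CSV has no header row, the column names are our own labels, the `latin-1` encoding and the file path are placeholders for your local copy, and the small in-memory sample at the end merely mimics the file's layout for illustration.

```python
import pandas as pd

# Our own labels for the six columns, in the order described above
# (the Sentiment140 CSV itself has no header row).
COLUMNS = ["polarity", "id", "date", "query", "user", "text"]

def load_tweets(path):
    """Load the Sentiment140 CSV and keep only polarity and text."""
    df = pd.read_csv(path, encoding="latin-1", header=None, names=COLUMNS)
    return df[["polarity", "text"]]

# A tiny made-up sample with the same layout, standing in for the real file
sample = pd.DataFrame(
    [[0, 1, "Mon Apr 06", "NO_QUERY", "userA", "this is bad"],
     [4, 2, "Mon Apr 06", "NO_QUERY", "userB", "this is great"]],
    columns=COLUMNS,
)
data = sample[["polarity", "text"]]

# Count tweets per polarity class; on the full dataset this yields
# 800,000 each for polarities 0 and 4
print(data["polarity"].value_counts().to_dict())
```

On the real file, `load_tweets("training.csv")` would return the 1.6-million-row frame, and the same `value_counts` call reproduces the balanced split shown in the graph.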

Here, we can also verify the statement we made earlier about word frequencies. The following graph depicts the 30 most common words in the dataset. As is evident, none of them bears any sentiment. Thus, an IDF transform would be more beneficial to our models:

The 30 most common words in the dataset and the number of occurrences of each
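The word-frequency claim is easy to reproduce: tally every token and take the most common ones. A minimal sketch with `collections.Counter` follows; the two tweet texts are our own examples, not drawn from the dataset.

```python
from collections import Counter

# Made-up tweets standing in for the dataset's text column
tweets = [
    "the movie was the best",
    "the service was not good",
]

# Tally lowercase whitespace-separated tokens across all tweets
counts = Counter()
for text in tweets:
    counts.update(text.lower().split())

# The top entries are function words ("the", "was") that carry no
# sentiment, which is exactly why IDF down-weighting helps
top = counts.most_common(3)
print(top)
```

Replacing `most_common(3)` with `most_common(30)` on the full text column gives the counts plotted above; an IDF transform then discounts precisely these ubiquitous, sentiment-free words.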