Chapter 7. Naive Bayes Classification of Twitter Data

In the last chapter, we looked at regression analysis and created a simple linear regression based on baseball data in order to determine whether a team was playing above or below expectations. In some sense, we created a classifier: we took raw data and partitioned the data space into two classes, those performing above expectation and those performing below it, based on a linear interpretation of the data.

Up to this point in the book, all of the data that we have investigated has been numerical. Looking at numerical data is a good introduction to data analysis because so much of an analyst's job is applying simple formulas (such as averages, normalization, or regression) to data and interpreting the results. Numerical data, however, is only half the picture. Raw, unstructured data is an equally important part of that picture, and one we haven't touched upon yet. We will look at unstructured data in this chapter.

In this chapter, we cover the following:

  • An introduction to Naive Bayes classification
  • Downloading tweets via the Twitter API
  • Creating a database to collect tweets
  • Analyzing text data based on word frequency
  • Cleaning our tweets
  • Creating a Naive Bayes classifier to detect the language of tweets
  • Testing our classifier

In this chapter, we look at tweets, which are the short messages posted to the enormously popular social networking website, Twitter. Tweets have only one limitation: they must fit into 140 characters. Because of this limitation, tweets are short bursts of ideas or quick messages to other members. The collection of tweets by a member is called that member's timeline. The social network on Twitter forms a directed graph: members are encouraged to follow other members, and a member who gains a follower has the option to follow back. By default, no special permissions are required to follow a member, to view a member's timeline, or to send public messages to another member. This openness allows members to have conversations with others who may join and leave at their pleasure. Active conversations can be discovered through hashtags, which tag a tweet with a subject and make tweets on that topic easier for others to find, thus encouraging discussion. Hashtags begin with a # followed by a subject term. Searching hashtags is a way for people to quickly find like-minded members and build communities around an interest.

Twitter has broad international appeal and supports a wide variety of languages, and it is language specifically that we wish to study in this chapter. Language is a tool that humans use to convey ideas to other humans, but not everyone communicates in the same language. Look at the following two sentences:

  • My house is your house
  • Mi casa es su casa

Both of these sentences convey the same idea, just in a different natural language (the first, obviously, in English, and the second in Spanish). Even if you don't know Spanish (I must confess that I don't), you might be able to guess that the second sentence is written in Spanish based on its individual words. What we would like to accomplish in this chapter is the creation of a classifier that takes a sentence and produces its best guess of the language used. This problem builds upon data analysis by using that analysis to make an informed decision. We will gently step into the field of machine learning, where we write software that makes decisions for us. In this case, we will be creating a language detector.

An introduction to Naive Bayes classification

Naive Bayes classification is a simple yet effective method of classifying data, built on the Bayes theorem. In the context of our example, tweets will be analyzed based on their individual words. There are three factors that go into a Naive Bayes classifier: prior knowledge, likelihood, and evidence. Together, they produce a measurement proportional to the probability of an unknown class of an event, based on features we can observe.

Prior knowledge

Prior knowledge allows us to contemplate our problem of discovering the language represented by a sentence without thinking about the features of the sentence. Think about answering the question blindly; that is, a sentence is spoken and you aren't allowed to see or hear it. What language was used? Of all of the tens of thousands of languages used across time, how could you ever guess this one? You are forced to play the odds. The top five most widely spoken languages are Mandarin, Spanish, English, Hindi, and Arabic. By selecting one of these languages, you have improved your odds (albeit still in the realm of speculation). The Naive Bayes classifier takes the same basic strategy: when the data presented does not favor one class over another, it leans towards the more popular class.

Prior knowledge is denoted in the following way:

P(A)

In other words, without any information to go on, the probability that a language will be selected is based on the popularity of that language. This quantity is sometimes called the prior belief, since it often has to be estimated rather than measured directly. Care should be taken when estimating the prior belief, since our results will be sensitive to this quantity.
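As a small illustration of this idea, the prior for each language can be estimated from a labeled collection of tweets by counting how often each language appears. The following sketch is in Python; the handful of labeled tweets and the language codes are invented purely for this example and are not the data we collect later in the chapter.

from collections import Counter

# A tiny, made-up labeled sample: (language, tweet text) pairs.
labeled_tweets = [
    ("en", "my house is your house"),
    ("en", "the weather is nice today"),
    ("es", "mi casa es su casa"),
    ("pt", "a casa é bonita"),
]

# Prior knowledge P(A): the relative frequency of each language in the
# sample, before looking at any of the words in a new tweet.
language_counts = Counter(language for language, _ in labeled_tweets)
total = sum(language_counts.values())
priors = {language: count / total for language, count in language_counts.items()}

print(priors)  # {'en': 0.5, 'es': 0.25, 'pt': 0.25}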

Likelihood

Likelihood asks for the probability of our known features, given that the class of those features is already known. In essence, we ask how likely we are to see a particular word in a tweet, given that we already know the language the tweet is written in. A word is a feature of a language. Some words are used often, such as the word "the" in English. If you were told that a sentence contains the word "the", then based on this word alone there's a pretty good chance that the sentence is written in English. Some words cross language boundaries. The word casa is house in Spanish, but casa is also house in Portuguese, making casa a word that might help you narrow a sentence down to a subset of languages, but it won't determine the language for you on its own.

Likelihood is denoted by the following:

P(B | A)

In other words, we are asking, What is the probability of B given that we already know A is true? We could rephrase this as, What is the probability that a sentence contains the word casa given that we know it's written in Spanish? and What is the probability that a sentence contains the word casa given that we know it's written in Portuguese?
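To make this concrete, here is a rough sketch of estimating P(B | A) from the same kind of labeled sample: count how often a word appears in tweets of a given language. The sample tweets are invented, and the add-one (Laplace) smoothing used here is one common choice rather than the only one.

from collections import Counter, defaultdict

# Made-up labeled tweets: (language, tweet text) pairs.
labeled_tweets = [
    ("es", "mi casa es su casa"),
    ("es", "la casa es grande"),
    ("pt", "a casa é bonita"),
]

# Count how often each word appears in tweets of each language.
word_counts = defaultdict(Counter)
for language, text in labeled_tweets:
    word_counts[language].update(text.split())

vocabulary = {word for counter in word_counts.values() for word in counter}

def likelihood(word, language):
    """Estimate P(word | language) with add-one smoothing."""
    counts = word_counts[language]
    return (counts[word] + 1) / (sum(counts.values()) + len(vocabulary))

print(likelihood("casa", "es"))  # larger: casa is common in the Spanish sample
print(likelihood("casa", "pt"))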

Evidence

Evidence asks for the probability of our known features regardless of their class. We will divide the likelihood by the evidence, thus creating a proportional measurement of how the probability of a feature given the known class relates to the probability of the feature as a whole.

Evidence is denoted in the following:

P(B)
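A matching sketch for the evidence term follows: the probability of seeing a word at all, ignoring which language each tweet was written in. As before, the sample tweets are made up for illustration.

from collections import Counter

# Made-up tweets; the language labels are deliberately dropped, because
# evidence looks at a word's frequency regardless of class.
tweets = [
    "mi casa es su casa",
    "la casa es grande",
    "a casa é bonita",
    "my house is your house",
]

all_words = Counter(word for text in tweets for word in text.split())
total_words = sum(all_words.values())

def evidence(word):
    """Estimate P(word): how common the word is across every tweet."""
    return all_words[word] / total_words

print(evidence("casa"))   # appears in most of the sample, so relatively large
print(evidence("house"))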

Putting the parts of the Bayes theorem together

Putting it all together, we get the classical Bayes theorem:

P(A | B) = P(B | A) × P(A) / P(B)

The Bayes theorem tells us the probability of A (the unknown language class of a tweet) given that we know B (a single word) is in a sentence. This needs to be generalized further. We know that there are multiple classes of languages (English, Spanish, Portuguese, and so on) and each tweet will have multiple words (also known as features).

Since the evidence term is the same no matter which class we consider, it does not change which class scores highest, and so it is often ignored when there are multiple features. When dealing with multiple features, Naive Bayes makes the naive assumption that the features are independent of one another given the class; we therefore multiply the likelihood of each feature given a class by the prior probability of that class. This can be denoted by the following:

P(A | B1, B2, ..., Bn) ∝ P(A) × P(B1 | A) × P(B2 | A) × ... × P(Bn | A)

This gives a value proportional to the probability that a feature vector (a list of words represented by B1 to Bn) belongs to a class (language A), based on our prior knowledge of that class and the likelihood of each known feature. We perform this calculation for each of our classes and select the class with the highest probability.
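The following is a minimal sketch of that decision rule, tying the earlier pieces together: multiply the prior by the likelihood of each word and pick the language with the largest product. The training sample, the add-one smoothing, and the helper names (score, classify) are assumptions made for illustration; this is not the classifier we build later in the chapter.

from collections import Counter, defaultdict

# Made-up labeled tweets: (language, tweet text) pairs.
labeled_tweets = [
    ("en", "my house is your house"),
    ("en", "the weather is nice"),
    ("es", "mi casa es su casa"),
    ("es", "la casa es grande"),
    ("pt", "a casa é bonita"),
]

# Priors P(A) and per-language word counts for the likelihoods P(Bi | A).
class_counts = Counter(language for language, _ in labeled_tweets)
word_counts = defaultdict(Counter)
for language, text in labeled_tweets:
    word_counts[language].update(text.split())
vocabulary = {word for counter in word_counts.values() for word in counter}

def score(text, language):
    """P(A) multiplied by P(Bi | A) for every word, with add-one smoothing."""
    result = class_counts[language] / sum(class_counts.values())
    counts = word_counts[language]
    for word in text.split():
        result *= (counts[word] + 1) / (sum(counts.values()) + len(vocabulary))
    return result

def classify(text):
    """Return the language whose score is highest; the evidence is ignored."""
    return max(word_counts, key=lambda language: score(text, language))

print(classify("su casa es grande"))    # 'es'
print(classify("your house is nice"))   # 'en'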

But before we can do any of this, we need data.
