Creating a Twitter application

We need to collect data from Twitter via Twitter's available APIs. In order to follow along with the book, you are going to need a Twitter account and a Twitter application project. Go to www.twitter.com and create an account if you don't have one already.

Sign in to your account and go to https://apps.twitter.com/ and create a new Twitter application. Here's a screenshot of the application that I used for this chapter:

[Screenshot: creating a Twitter application]

Upon creating an application, Twitter will generate a Consumer key and a Consumer secret key for you. It's considered a best practice to keep this information secret, since Twitter holds the holder of these keys responsible for any activity generated with them. In addition to these two keys, you will have to manually generate an Access token and an Access token secret in order to access the REST APIs. These keys can be generated from the Keys and Access Tokens tab within the Twitter Application Management page.

Communicating with Twitter

Now we need to craft some Haskell code that will communicate with the Twitter API and download tweets. The Twitter API uses OAuth to provide some security for its services. All responses from the Twitter API are returned as JSON objects. The code presented in this chapter for communicating with the Twitter API was adapted from a tutorial on the FP Complete website. You can find the full tutorial here: https://www.fpcomplete.com/school/starting-with-haskell/libraries-and-frameworks/text-manipulation/json.

Our goal is to download tweets, and we aren't picky about which tweets are downloaded. We just need data and plenty of it. The closest API call within the documentation is the search API, which requires that we provide a term on which to search. For this, we selected the search term a, a term used across many languages (and capturing multiple languages is our goal in this exercise). At the time of writing, Twitter allows us to download a maximum of 100 tweets per query and allows us to query the API up to 180 times every 15 minutes (for a total of 18,000 tweets). Complete information about the search API command that we will be using in this chapter can be found here: https://dev.twitter.com/rest/reference/get/search/tweets.

The JSON object returned by the Twitter API contains lots of information on each tweet, including the tweet itself, the member who wrote the tweet, the time, and the detected language. Twitter's documentation admits that language detection is best-effort. Understanding that Twitter might be wrong, we will still use Twitter's detected language as the training data for our classifier.

From the GHCi prompt, we import the following libraries:

> import Data.HashMap.Strict as HM
> import Data.List as L

We should display all of the libraries used in this chapter (and we depend on several). This is the beginning of this chapter's module (LearningDataAnalysis07):

{-# LANGUAGE OverloadedStrings, DeriveGeneric #-}
module LearningDataAnalysis07 where
import Data.List as L
import Data.Hashable
import Data.HashMap.Strict as HM
import Database.HDBC.Sqlite3
import Database.HDBC
import Control.Concurrent
import Data.Char
import Data.Ord (comparing)
import Network.HTTP.Conduit
import Web.Authenticate.OAuth
import Data.Aeson
import GHC.Generics

As you can see in these import snippets, I've started to use the as keyword to differentiate the Data.List and the Data.HashMap.Strict libraries. Both of these libraries have a function called map, and this requires that the two imports be given distinct qualified names.

Let's begin by creating our credentials for the API. Replace the fields that begin with YOUR... with your own API credentials; the myoauth and mycred values will allow us to be properly identified by Twitter:

myoauth :: OAuth
myoauth =
  newOAuth { oauthServerName     = "api.twitter.com"
           , oauthConsumerKey    = "YOUR CONSUMER KEY"
           , oauthConsumerSecret = "YOUR CONSUMER SECRET KEY"
           }
mycred :: Credential
mycred = newCredential "YOUR ACCESS TOKEN"
                       "YOUR ACCESS TOKEN SECRET"

Next, we have to create some data objects that will be pattern-matched with the JSON object returned by the API. Each JSON tweet object returned by the search command will contain various fields. We can select just the fields we desire here.

Within the User object, there is the screenName field. We collect it here. While we aren't using the screenName information in this chapter, we do use it in the next chapter:

data User =
    User { screenName :: !String } deriving (Show, Generic)

The Status object contains the tweet (called text), the detected language (called lang), and a User object, as shown in the following definition:

data Status =
  Status { text :: !String,
           lang :: !String,
           user :: !User } deriving (Show, Generic)

Each Status object is contained in a list of objects named statuses, as shown in the following definition:

data Search =
  Search { statuses :: ![Status] } deriving (Show, Generic)

Once our data objects have been defined, we make sure that Haskell recognizes that we will be performing JSON pattern matching on these objects using the following instance statements:

instance FromJSON User
instance ToJSON User
instance FromJSON Status
instance ToJSON Status
instance FromJSON Search
instance ToJSON Search
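
Before pointing the code at the live API, we can verify the JSON pattern matching from the GHCi prompt by decoding a hand-written fragment (this sample payload is made up for illustration; real API responses carry many more fields, which Aeson simply ignores):

> :set -XOverloadedStrings
> eitherDecode "{\"statuses\":[{\"text\":\"hello\",\"lang\":\"en\",\"user\":{\"screenName\":\"someone\"}}]}" :: Either String Search
Right (Search {statuses = [Status {text = "hello", lang = "en", user = User {screenName = "someone"}}]})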

We then create the twitterSearch function to download tweets. This function will call Twitter's search API using a signed OAuth request and parse what is returned using the Search object.

twitterSearch :: String -> IO (Either String Search)
twitterSearch term = do
  req <- parseUrl $ "https://api.twitter.com/1.1/search/tweets.json?count=100&q=" ++ term
  res <- withManager $ \m -> do
          signedreq <- signOAuth myoauth mycred req
          httpLbs signedreq m
  return $ eitherDecode $ responseBody res

This function will return either an error string or a Search object.
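
Assuming your credentials are valid, the function can be exercised directly from the GHCi prompt (the search term here is arbitrary, and your results will vary):

> result <- twitterSearch "haskell"
> either putStrLn (\(Search found) -> print $ L.length found) result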

Creating a database to collect tweets

Now that we have a function to search Twitter, we need to create a database to collect tweets. We will create a tweets.sql file in the same manner that we used in Chapter 2, Getting Our Feet Wet:

createTweetsDatabase :: IO()
createTweetsDatabase = do
    conn <- connectSqlite3 "tweets.sql"
    run conn createStatement []
    commit conn
    disconnect conn
    putStrLn "Successfully created database."
  where
    createStatement =
        "CREATE TABLE tweets (message TEXT, user TEXT, language TEXT)"

Next, we need a separate function for inserting tweets into this database. Again, we will be using a similar technique to insert records that we used in Chapter 2, Getting Our Feet Wet for CSV records:

insertTweetsInDatabase :: [Status] -> IO()
insertTweetsInDatabase tweets = do
    conn <- connectSqlite3 "tweets.sql"
    stmt <- prepare conn insertStatement
    executeMany stmt sqlRecords
    commit conn
    disconnect conn
    putStrLn "Successfully inserted Tweets to database."
  where
    insertStatement = "INSERT INTO tweets VALUES (?, ?, ?)"
    sqlRecords = L.map (\(Status message language (User user)) ->
                 [toSql message, toSql user, toSql language]) tweets
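
As a quick smoke test, we can insert a single fabricated Status from the GHCi prompt (assuming that createTweetsDatabase has already been run; the tweet, language, and screen name here are made up):

> insertTweetsInDatabase [Status "hola mundo" "es" (User "someone")]
Successfully inserted Tweets to database.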

Next, we need a function to call our twitterSearch function and insert the returned objects into the database via insertTweetsInDatabase. We are using threadDelay in order to have a delay of five seconds after each API call in order to allow a little breathing time between each call, as seen in the following function:

collectTweetsIntoDatabase :: IO()
collectTweetsIntoDatabase = do
    status <- twitterSearch "a"
    either
       putStrLn
       (\(Search statuses) -> insertTweetsInDatabase statuses)
       status
    threadDelay 5000000 -- threadDelay expects microseconds; this is five seconds

Finally, we collect the tweets. Simply create the database and call collectTweetsIntoDatabase 180 times. If written correctly, this should print the success message 180 times on the screen and the database will be populated with 18,000 tweets. It's enough to get us started.

> :l LearningDataAnalysis02 LearningDataAnalysis04 LearningDataAnalysis07 
> :m LearningDataAnalysis02 LearningDataAnalysis04 LearningDataAnalysis07
> createTweetsDatabase
> mapM_ (\x -> collectTweetsIntoDatabase) [1..180]

Back at the GHCi prompt, we can pull our tweets from the database:

> sqlTweets <- queryDatabase "tweets.sql" "SELECT message, language FROM tweets"
> let tweets = zip (readStringColumn sqlTweets 0) (readStringColumn sqlTweets 1)

A frequency study of tweets

A frequency function is one that counts the number of times each element is seen in a list. We will be using our frequency function in order to create a unique set of tweets, words, and languages in our database. The function that we will be creating returns a HashMap structure and will be used extensively in this chapter. The Data.HashMap.Strict module lives in the unordered-containers package (and Data.Hashable in the hashable package); make sure that you install them using cabal:

$ cabal install unordered-containers hashable

This recursive function works through each element in a list, inserting a mapped value of 1 for elements that do not currently exist in the HashMap and adding 1 to the count of elements that do:

frequency :: (Eq k, Data.Hashable.Hashable k, Integral v) => [k] -> HashMap k v
frequency [] = HM.empty
frequency (x:xs) = HM.insertWith (+) x 1 (frequency xs)
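
The same table can also be built with a strict left fold, which avoids the deep chain of recursive calls on very long lists. This is a minor variant of the preceding function that produces identical results:

-- Equivalent to frequency, but accumulates with a strict left fold.
frequency' :: (Eq k, Data.Hashable.Hashable k, Integral v) => [k] -> HashMap k v
frequency' = L.foldl' (\acc x -> HM.insertWith (+) x 1 acc) HM.empty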

You can quickly test to see if the function is working with a little help from Dr. Seuss. The title of one fish two fish red fish blue fish has five unique words and the word fish is repeated four times.

> frequency $ words "one fish two fish red fish blue fish"
fromList [("blue",1),("one",1),("two",1),("red",1),("fish",4)]

We can pass our listing of tweets to the frequency function in order to create a unique list:

> let freqTable = frequency tweets
> let uniqueTweets = HM.keys freqTable
> HM.size freqTable
15656

It seems that almost 87 percent of the tweets from your author's downloaded dataset represented unique content. We will be using these unique tweets for the remaining phases.

Cleaning our tweets

Now that we have data, we need to scrub this data. We wish to focus on the individual words in a tweet. Our ideal tweet is one that is written in all lowercase letters without any punctuation, without any hashtags, without any links, and without any replies to other users. This mythical tweet is rare and we must adapt the existing tweets to this form.

To scrub our data, I'm going to borrow a function found in the Haskell Data Analysis Cookbook, Nishant Shukla, Packt Publishing (I was a technical reviewer for this book; it's an excellent book and I referred back to it regularly when preparing for this book):

-- Removes @ replies, hashtags, and links from strings.
clean :: String -> String
clean myString = unwords $ L.filter
    (\myWord -> not (or
                [ isInfixOf "@" myWord
                , isInfixOf "#" myWord
                , isInfixOf "http://" myWord ]))
    (words myString)

Now what this function won't do is convert tweets to all lowercase letters and remove punctuation. For that, I created a second function called removePunctuation:

removePunctuation :: String -> String
removePunctuation myString =
                  [toLower c | c <- myString, or [isAlpha c, isSpace c]]

With these two functions in place, we can clean our data to our ideal working conditions:

> let cleanedTweets = zip (L.map (removePunctuation.clean.fst) uniqueTweets) (L.map snd uniqueTweets)
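
As a quick sanity check of the two scrubbing functions working together (this sample tweet is made up):

> (removePunctuation . clean) "Hello @friend, check out http://example.com #Haskell! It's great."
"hello check out its great"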

Creating our feature vectors

To begin, let's create the frequency table of our languages seen in the database:

> let languageFrequency = (frequency . L.map snd) cleanedTweets
> languageFrequency
fromList [("uk",3),("pt",1800),("th",1),("sl",13),("in",34),("ja",274),("tl",13),("ar",180),("hi",1),("en",8318),("lv",2),("sk",13),("fi",2),("el",4),("vi",1),("ht",16),("es",3339),("pl",50),("da",4),("hu",10),("zh",1),("nl",13),("ko",17),("tr",73),("und",86),("it",235),("sv",8),("fr",1078),("is",1),("de",20),("bg",2),("fa",3),("ru",38),("et",3)]

You can glance through the list and see the language codes. English (en) has just over 8,000 tweets. Next in size is Spanish (es) with over 3,000 tweets. Third is Portuguese (pt) with 1,800 tweets, and fourth is French (fr) with a little over 1,000 tweets. By dividing each of these language counts by the sum (15,656), we have our prior estimation required by the Bayes theorem. Since English is represented by 8,318 tweets, the probability that a tweet will be English will be about 53 percent without knowing the contents of that tweet. (My original search term of "a" causes the database to be heavily slanted towards English. For the purposes of this tutorial, that's okay.)
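
To make the prior concrete: English accounts for 8,318 of the 15,656 unique tweets, so its prior probability is 8318 / 15656 ≈ 0.531, which is the roughly 53 percent quoted above.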

While we're looking at the languageFrequency table, let's grab the unique languages represented:

> let allLanguages = HM.keys languageFrequency
> length allLanguages
34

Next, we need to know the frequency table of each word across all languages. This will be similar to our last command, which computed the frequency table of languages.

> let wordFrequency = (frequency . concatMap words) (L.map fst cleanedTweets)
> HM.size wordFrequency
34250

Our database contains 34,000 unique words across 15,000 unique tweets and over 30 unique languages, all from waiting 15 minutes to download tweets. Not bad at all.

Here's where it gets tricky. We now need a frequency table of each word, frequency with respect to each language. This requires a HashMap object of languages that maps to a HashMap object of words and frequencies:

> let wordFrequencyByLanguage = (HM.fromList . L.map (\language -> (language, (frequency . concatMap words . L.map fst) (L.filter (\tweet -> language == (snd tweet)) cleanedTweets)))) allLanguages

We've got HashMap objects embedded within a HashMap. On the first layer of our HashMap object is each two-letter language code. Each two-letter language code maps to another HashMap with the word frequency for words among this language.
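
For example, we can peek at how often a single word was counted within a single language (whether the word is present at all, and its count, depend entirely on your downloaded dataset):

> HM.lookup "casa" (wordFrequencyByLanguage ! "es")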

At the end of this section, we should have four variables: allLanguages (a list of String objects representing each language code), languageFrequency (a HashMap of String objects that map to Integer objects), wordFrequency (a HashMap of String objects that map to Integer objects), and wordFrequencyByLanguage (a HashMap of String objects that map to a HashMap of String objects that map to Integer objects). We will be using these variables when writing our Naive Bayes classifier.

Writing the code for the Bayes theorem

To begin, we will be writing the code for the simple single-feature Bayes theorem:

probLanguageGivenWord ::
    String
    -> String
    -> HashMap String Integer
    -> HashMap String Integer
    -> HashMap String (HashMap String Integer)
    -> Double
probLanguageGivenWord
    language
    word
    languageFrequency
    wordFrequency
    wordFrequencyByLanguage =
    pLanguage * pWordGivenLanguage / pWord
  where
      countTweets = fromIntegral . sum $ elems languageFrequency

      countAllWords = fromIntegral . sum $ elems wordFrequency

      countLanguage = fromIntegral $
                      lookupDefault 0 language languageFrequency

      countWordsUsedInLanguage = fromIntegral . sum . elems $
                                 wordFrequencyByLanguage ! 
                                 language

      countWord = fromIntegral $ lookupDefault 0 word 
                  wordFrequency

      countWordInLanguage = fromIntegral $ lookupDefault 0 word
                            (wordFrequencyByLanguage ! language)

      pLanguage = countLanguage / countTweets

      pWordGivenLanguage = countWordInLanguage / 
                              countWordsUsedInLanguage

      pWord = countWord / countAllWords

This code pieces together each of the individual parts of the Bayes theorem. Under the function's where clause, I've created several variables that begin with count. These count variables tally various important values from the four variables collected in the previous section. The three variables essential to the Bayes theorem are pLanguage, pWordGivenLanguage, and pWord. You should be able to identify how these variables are being calculated in the preceding code.
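
As a reminder, these pieces combine exactly as the Bayes theorem dictates: P(language | word) = P(language) × P(word | language) / P(word), which is the expression computed on the first line of the function body.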

Let's test this algorithm on the word house with the top four languages found in our database:

> probLanguageGivenWord "en" "house" languageFrequency wordFrequency wordFrequencyByLanguage
0.9637796833052451

> probLanguageGivenWord "es" "house" languageFrequency wordFrequency wordFrequencyByLanguage
0.0

> probLanguageGivenWord "pt" "house" languageFrequency wordFrequency wordFrequencyByLanguage
0.0

> probLanguageGivenWord "fr" "house" languageFrequency wordFrequency wordFrequencyByLanguage
0.0

As you can see, house is a very English word. None of the other three languages return a score greater than 0. Let's look at the word casa with the top four languages in our database:

> probLanguageGivenWord "en" "casa" languageFrequency wordFrequency wordFrequencyByLanguage
7.899833469715125e-3
> probLanguageGivenWord "es" "casa" languageFrequency wordFrequency wordFrequencyByLanguage
0.4144443354127799

> probLanguageGivenWord "pt" "casa" languageFrequency wordFrequency wordFrequencyByLanguage
0.5225466167002369

> probLanguageGivenWord "fr" "casa" languageFrequency wordFrequency wordFrequencyByLanguage
2.3998008047484393e-2

The term casa appears in all four languages to some degree, but especially in Spanish and Portuguese. There appears to be a slight edge to Portuguese over Spanish in terms of the dominant ownership of the word.

Creating a Naive Bayes classifier with multiple features

As we stated earlier in the chapter, of the three parts of the Bayes Theorem, the two more important ones are the prior probability and the likelihood probability. I've extracted out the probability of a language into its own function here. We compute the total number of tweets of this language divided by the total number of tweets:

probLanguage :: String
    -> HashMap String Integer
    -> Double
probLanguage language languageFrequency =
    countLanguage / countTweets
  where
    countTweets = fromIntegral . sum $ elems languageFrequency
    countLanguage = fromIntegral $
                    lookupDefault 0 language languageFrequency

Next, we find the probability of a word given a language, in which we divide the number of times a word is seen in a language by the count of all occurrences of all words within a language.

Note

The lookupDefault function uses a default value of 0. This means that if lookupDefault cannot find a word within a language's HashMap, it returns 0.

Recall that our formula for computing the probability of a classifier given a set of features requires that we multiply the probability of each feature. The product of any value multiplied by 0 is 0, thus, if a word cannot be identified within a language, our approach will automatically assume that the probability of a match is 0, when it could just mean that we don't have enough data.

This can be done using the probWordGivenLanguage function as follows:

probWordGivenLanguage :: String
    -> String
    -> HashMap String (HashMap String Integer)
    -> Double
probWordGivenLanguage word language wordFrequencyByLanguage =
    countWordInLanguage / countWordsUsedInLanguage
  where
    countWordInLanguage = fromIntegral .
                          lookupDefault 0 word $
                          wordFrequencyByLanguage ! language
    countWordsUsedInLanguage = fromIntegral . sum . elems $
                               wordFrequencyByLanguage ! 
                               language
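
If the zero-probability problem noted above proves troublesome in practice, one common remedy is Laplace (add-one) smoothing, which pretends that every word was seen one extra time in every language. The following variant is only a sketch of that idea and is not used elsewhere in this chapter:

-- Sketch: add-one (Laplace) smoothing so unseen words never yield 0.
probWordGivenLanguageSmoothed :: String
    -> String
    -> HashMap String (HashMap String Integer)
    -> Double
probWordGivenLanguageSmoothed word language wordFrequencyByLanguage =
    (countWordInLanguage + 1) / (countWordsUsedInLanguage + vocabularySize)
  where
    wordsInLanguage = wordFrequencyByLanguage ! language
    countWordInLanguage = fromIntegral $ lookupDefault 0 word wordsInLanguage
    countWordsUsedInLanguage = fromIntegral . sum $ elems wordsInLanguage
    vocabularySize = fromIntegral $ HM.size wordsInLanguage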

Finally, we can craft our Naive Bayes calculation based on multiple features. This function will multiply the probabilities of each word assuming a particular language by the probability of the language itself:

probLanguageGivenMessage :: String
    -> String
    -> HashMap String Integer
    -> HashMap String (HashMap String Integer)
    -> Double
probLanguageGivenMessage language message languageFrequency wordFrequencyByLanguage =
    probLanguage language languageFrequency *
    product (L.map
            (\word ->
            probWordGivenLanguage word language wordFrequencyByLanguage)
            (words message))

Great! We should test this function on various phrases and languages:

> probLanguageGivenMessage "en" "my house is your house" languageFrequency wordFrequencyByLanguage
1.4151926487738795e-14

> probLanguageGivenMessage "es" "my house is your house" languageFrequency wordFrequencyByLanguage
0.0

> probLanguageGivenMessage "en" "mi casa su casa" languageFrequency wordFrequencyByLanguage
2.087214738575832e-20

> probLanguageGivenMessage "es" "mi casa su casa" languageFrequency wordFrequencyByLanguage
6.3795321947397925e-12

Note that the results are tiny numbers. This can be expected when multiplying the probabilities of words. We are not concerned with the size of the probability but with how the probabilities relate to each other across languages. Here, we see that my house is your house returns 1.4e-14 in English, which is small, but returns 0.0 in Spanish. We would select the English class for this sentence. We also see mi casa su casa returns 6.4e-12 in Spanish and 2.1e-20 in English. Here, Spanish is the selected class.
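
If underflow ever becomes a concern for longer messages, a standard trick is to compare languages in log space, summing the logarithms of the probabilities instead of multiplying the raw values. This sketch assumes the word probabilities are nonzero (for instance, by substituting a smoothed estimate such as the one shown earlier):

-- Sketch: rank languages by summed log probabilities to avoid underflow.
logProbLanguageGivenMessage :: String
    -> String
    -> HashMap String Integer
    -> HashMap String (HashMap String Integer)
    -> Double
logProbLanguageGivenMessage language message languageFrequency wordFrequencyByLanguage =
    log (probLanguage language languageFrequency) +
    sum (L.map (\word -> log (probWordGivenLanguage word language wordFrequencyByLanguage))
               (words message))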

We can aggregate this process by mapping the function across all known languages:

languageClassifierGivenMessage ::
    String
    -> (HashMap String Integer)
    -> (HashMap String (HashMap String Integer))
    -> [(String, Double)]
languageClassifierGivenMessage
    message languageFrequency wordFrequencyByLanguage =
    L.map (\language ->
          (language,
           probLanguageGivenMessage
              language message languageFrequency wordFrequencyByLanguage))
          (keys languageFrequency)

Here, we can take the results of our language classifier and return the maximum (which should be the most likely language):

maxClassifier :: [(String, Double)] -> (String, Double)
maxClassifier = L.maximumBy (comparing snd)
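
A tiny check of this helper on hand-made scores (the values here are fabricated, just to show the shape of the result):

> maxClassifier [("en", 1.0e-14), ("es", 0.0), ("fr", 2.0e-13)]
("fr",2.0e-13)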

Testing our classifier

We are finally at a point where we can perform some testing on our approach. We will begin with simple phrases in English and Spanish, since these two languages make up most of the database.

First, we will test five phrases in English:

> maxClassifier $ languageClassifierGivenMessage "the quick brown fox jumps over the lazy dog" languageFrequency wordFrequencyByLanguage
("en",1.0384385163880495e-32)

> maxClassifier $ languageClassifierGivenMessage "hi how are you doing today" languageFrequency wordFrequencyByLanguage
("en",4.3296809098896647e-17)

> maxClassifier $ languageClassifierGivenMessage "it is a beautiful day outside" languageFrequency wordFrequencyByLanguage
("en",4.604145482001343e-16)

> maxClassifier $ languageClassifierGivenMessage "would you like to join me for lunch" languageFrequency wordFrequencyByLanguage
("en",6.91160501990044e-21)

> maxClassifier $ languageClassifierGivenMessage "my teacher gave me too much homework" languageFrequency wordFrequencyByLanguage
("en",6.532933008201886e-23)

Next, we evaluate five phrases in Spanish using the same training data. As I mentioned earlier in the chapter, I don't know Spanish. These phrases were pulled from a Spanish language education website:

> maxClassifier $ languageClassifierGivenMessage "estoy bien gracias" languageFrequency wordFrequencyByLanguage
("es",3.5494939242101163e-10)

> maxClassifier $ languageClassifierGivenMessage "vaya ud derecho" languageFrequency wordFrequencyByLanguage
("es",7.86551836549381e-13)

> maxClassifier $ languageClassifierGivenMessage "eres muy amable" languageFrequency wordFrequencyByLanguage
("es",2.725124039761997e-12)

> maxClassifier $ languageClassifierGivenMessage "le gusta a usted aquí" languageFrequency wordFrequencyByLanguage
("es",6.631704901901517e-15)

> maxClassifier $ languageClassifierGivenMessage "feliz cumpleaños" languageFrequency wordFrequencyByLanguage
("es",2.4923205860794728e-8)

Finally, I decided to throw some French phrases at the classifier; again, these phrases were pulled from a French language education website:

> maxClassifier $ languageClassifierGivenMessage "cest une bonne idée" languageFrequency wordFrequencyByLanguage
("fr",2.5206114495244297e-13)

> maxClassifier $ languageClassifierGivenMessage "il est très beau" languageFrequency wordFrequencyByLanguage
("fr",8.027963170060149e-13)

Hopefully, you can see that our classifier, which we built on a small database of 18,000 tweets, was enough to detect the correct language of simple phrases.
