Creating a model

The most important step in sentiment analysis (as is the case with most machine learning problems) is the preprocessing of our data. The following table contains 10 tweets, randomly sampled from the dataset:

id         text
44         @JonathanRKnight Awww I soo wish I was there to see...
143873     Shaking stomach flipping........god i hate thi...
466449     why do they refuse to put nice things in our v...
1035127    @KrisAllenmusic visit here
680337     Rafa out of Wimbledon Love Drunk by BLG out S...
31250      It's official, printers hate me Going to sul...
1078430    @_Enigma__ Good to hear
1436972    Dear Photoshop CS2. i love you. and i miss you!
401990     my boyfriend got in a car accident today !
1053169    Happy birthday, Wisconsin! 161 years ago, you ...
An outline of 10 random samples from the dataset

We can immediately make the following observations. First, there are references to other users, for example, @KrisAllenmusic. These references do not provide any information about the tweet's sentiment, so we will remove them during preprocessing. Second, there are numbers and punctuation. These also do not contribute to the tweet's sentiment, so they must also be removed. Third, some letters are capitalized while others are not. As capitalization does not alter a word's sentiment, we can convert all letters to either lowercase or uppercase. This ensures that words such as LOVE, love, and Love will be handled as the same unigram. If we sample more tweets, we can identify more problems. There are hashtags (such as #summer), which also do not contribute to the tweet's sentiment. Furthermore, there are URLs (for example, https://www.packtpub.com/eu/) and HTML entities (such as &amp;, which corresponds to &). These will also be removed during our preprocessing.

In order to preprocess our data, we must first import the required libraries. We will use pandas; Python's built-in regular expressions library, re; punctuation from string; and the Natural Language Toolkit (NLTK). The nltk library can be easily installed through either pip or conda. The imports are as follows:

import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from string import punctuation
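
Note that NLTK's stop word list is a separate corpus that has to be downloaded once before it can be used. A minimal setup sketch (the package names are the standard pip/conda ones):

# One-time setup, from a shell: pip install nltk   (or: conda install nltk)
import nltk
nltk.download('stopwords')  # fetch the stop word corpus used below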

After loading the libraries, we load the data, change the polarity from [0, 4] to [0, 1], and discard all fields except for the text content and the polarity:

# Read the data and assign labels
labels = ['polarity', 'id', 'date', 'query', 'user', 'text']
data = pd.read_csv("sent140.csv", names=labels)

# Keep only text and polarity, change polarity to 0-1
data = data[['text', 'polarity']]
data.polarity.replace(4, 1, inplace=True)
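
As a quick sanity check (not part of the original pipeline), we can confirm that only the two columns remain and that the polarity is now binary:

# Inspect the first rows and the label distribution
print(data.head())
print(data.polarity.value_counts())  # should contain only 0 and 1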

As we saw earlier, many words do not contribute to a tweet's sentiment, although they frequently appear in text. Search engines handle this by removing such words, which are called stop words. NLTK has a list of the most common stop words that we are going to utilize. Furthermore, as there are a number of stop words that are contractions (such as "you're" and "don't") and tweets frequently omit single quotes in contractions, we expand the list in order to include contractions without single quotes (such as "dont"):

# Create a list of stopwords
stops = stopwords.words("english")
# Add stopword variants without single quotes
no_quotes = []
for word in stops:
    if "'" in word:
        no_quotes.append(re.sub(r"'", '', word))
stops.extend(no_quotes)
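
A quick check of the expanded list (assuming, as noted above, that contractions such as "don't" appear in NLTK's list) confirms that both spellings are now treated as stop words:

print("don't" in stops)  # True, from NLTK's original list
print("dont" in stops)   # True, the variant added above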

We then define two distinct functions. The first function, clean_string, cleans the tweet by removing the elements we discussed earlier (references, hashtags, and so on). The second function, preprocess, removes all punctuation and stop words, and stems each word using NLTK's PorterStemmer:

def clean_string(string):
    # Remove HTML entities
    tmp = re.sub(r'&\w*;', '', string)
    # Remove @user
    tmp = re.sub(r'@(\w+)', '', tmp)
    # Remove links
    tmp = re.sub(r'(http|https|ftp)://[a-zA-Z0-9\./]+', '', tmp)
    # Lowercase
    tmp = tmp.lower()
    # Remove Hashtags
    tmp = re.sub(r'#(\w+)', '', tmp)
    # Remove repeating chars
    tmp = re.sub(r'(.)\1{1,}', r'\1\1', tmp)
    # Remove anything that is not letters
    tmp = re.sub("[^a-zA-Z]", " ", tmp)
    # Remove anything that is less than two characters
    tmp = re.sub(r'\b\w{1,2}\b', '', tmp)
    # Remove multiple spaces
    tmp = re.sub(r'\s\s+', ' ', tmp)
    return tmp
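
As a quick illustration, here is how clean_string handles a made-up tweet that exhibits several of the issues discussed earlier (the example text is hypothetical, not taken from the dataset):

example = "@someuser I loooove this!!! #summer https://www.packtpub.com/eu/ &amp;"
print(clean_string(example))
# Roughly ' loove this ' -- the mention, link, hashtag, HTML entity,
# short words, and repeated characters have been stripped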

def preprocess(string):
    stemmer = PorterStemmer()
    # Remove any punctuation character
    removed_punc = ''.join([char for char in string if char not in punctuation])
    cleaned = []
    # Remove any stopword
    for word in removed_punc.split(' '):
        if word not in stops:
            cleaned.append(stemmer.stem(word.lower()))
    return ' '.join(cleaned)
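
Both functions can then be applied to the dataset. The application step and the train/test split indices are not shown in this excerpt, and the evaluation code below also relies on a few additional imports; the following is a minimal sketch, with illustrative split sizes (adjust them to the subset of the dataset you are working with):

import numpy as np
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier

# Clean and preprocess every tweet in place
data.text = data.text.apply(clean_string).apply(preprocess)

# Illustrative train/test split indices used by check_classifier below
train_size = 10000
test_start = 10000
test_end = 100000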

As we would like to compare the performance of the ensemble with that of the base learners themselves, we will define a function that evaluates any given classifier. The two most important factors that define our dataset are the n-grams we use and the number of features. Scikit-learn provides a TF-IDF feature extractor, the TfidfVectorizer class. It allows us to utilize only the top M most frequent features, as well as define the n-gram range, through the max_features and ngram_range parameters. It creates sparse arrays of features, which saves a great deal of memory, but the results must be converted to normal arrays before our classifiers can process them; this is achieved by calling the toarray() function. Our check_features_ngrams function accepts the number of features, a tuple of minimum and maximum n-grams, and a list of named classifiers ((name, classifier) tuples). It extracts the required features from the dataset and passes them to the nested check_classifier function, which trains and evaluates each classifier and appends the results to the specified file, outs.txt:

def check_features_ngrams(features, n_grams, classifiers):
    print(features, n_grams)

    # Create the TF-IDF feature extractor
    tf = TfidfVectorizer(max_features=features, ngram_range=n_grams,
                         stop_words='english')

    # Create the TF-IDF features
    tf.fit(data.text)
    transformed = tf.transform(data.text)
    np.random.seed(123456)

    def check_classifier(name, classifier):
        print('--' + name + '--')

        # Train the classifier
        x_data = transformed[:train_size].toarray()
        y_data = data.polarity[:train_size].values
        classifier.fit(x_data, y_data)
        i_s = metrics.accuracy_score(y_data, classifier.predict(x_data))

        # Evaluate on the test set
        x_data = transformed[test_start:test_end].toarray()
        y_data = data.polarity[test_start:test_end].values
        oos = metrics.accuracy_score(y_data, classifier.predict(x_data))

        # Export the results
        with open("outs.txt", "a") as f:
            f.write(str(features) + ',')
            f.write(str(n_grams[-1]) + ',')
            f.write(name + ',')
            f.write('%.4f' % i_s + ',')
            f.write('%.4f' % oos + '\n')

    for name, classifier in classifiers:
        check_classifier(name, classifier)

Finally, we test for n-grams in the range of [1, 3] and for the top 500, 1000, 5000, 10000, 20000, and 30000 features.

# Create csv header
with open("outs.txt", "a") as f:
    f.write('features,ngram_range,classifier,train_acc,test_acc\n')

# Test all features and n-grams combinations
for features in [500, 1000, 5000, 10000, 20000, 30000]:
    for n_grams in [(1, 1), (1, 2), (1, 3)]:
        # Create the ensemble
        voting = VotingClassifier([('LR', LogisticRegression()),
                                   ('NB', MultinomialNB()),
                                   ('Ridge', RidgeClassifier())])
        # Create the named classifiers
        classifiers = [('LR', LogisticRegression()),
                       ('NB', MultinomialNB()),
                       ('Ridge', RidgeClassifier()),
                       ('Voting', voting)]
        # Evaluate them
        check_features_ngrams(features, n_grams, classifiers)
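
Once every combination has run, outs.txt can be loaded back with pandas to compare the classifiers; a minimal sketch (assuming the header and rows were written as above):

results = pd.read_csv("outs.txt")
# Best test accuracy achieved by each classifier
print(results.groupby('classifier').test_acc.max())
# Test accuracy per number of features and classifier
print(results.pivot_table(index='features', columns='classifier', values='test_acc'))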

The results are depicted in the following diagram. As is evident, accuracy increases for all classifiers as we increase the number of features. Furthermore, when the number of features is relatively small, unigrams outperform combinations of unigrams with bigrams/trigrams, because the most frequent multi-word expressions do not carry sentiment. Finally, although voting performs reasonably well, it is not able to outperform logistic regression:

Results of voting and base learners