Creating a model

The most important step in sentiment analysis (as is the case with most machine learning problems) is the preprocessing of our data. The following table contains 10 tweets, randomly sampled from the dataset:

id         text
44         @JonathanRKnight Awww I soo wish I was there to see...
143873     Shaking stomach flipping........god i hate thi...
466449     why do they refuse to put nice things in our v...
1035127    @KrisAllenmusic visit here
680337     Rafa out of Wimbledon Love Drunk by BLG out S...
31250      It's official, printers hate me Going to sul...
1078430    @_Enigma__ Good to hear
1436972    Dear Photoshop CS2. i love you. and i miss you!
401990     my boyfriend got in a car accident today !
1053169    Happy birthday, Wisconsin! 161 years ago, you ...
An outline of 10 random samples from the dataset

We can immediately make the following observations. First, there are references to other users, for example, @KrisAllenmusic. These references do not provide any information about the tweet's sentiment, so we will remove them during preprocessing. Second, there are numbers and punctuation. These also do not contribute to the tweet's sentiment, so they must also be removed. Third, some letters are capitalized while others are not. As capitalization does not alter a word's sentiment, we can convert all letters to either lowercase or uppercase. This ensures that words such as LOVE, love, and Love will be handled as the same unigram. If we sample more tweets, we can identify more problems. There are hashtags (such as #summer), which also do not contribute to the tweet's sentiment. Furthermore, there are URLs (for example, https://www.packtpub.com/eu/) and HTML entities (such as &amp;, which corresponds to &). These will also be removed during our preprocessing.

In order to preprocess our data, we must first import the required libraries. We will use pandas; Python's built-in regular expressions library, re; punctuation from string; and the Natural Language Toolkit (NLTK). The nltk library can be easily installed through either pip or conda. The imports are as follows:

import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from string import punctuation
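
Note that NLTK's stop word list is a separate corpus that has to be downloaded once before it can be used. A minimal setup sketch (the package names are the standard pip/conda ones):

# One-time setup, from a shell: pip install nltk   (or: conda install nltk)
import nltk
nltk.download('stopwords')  # fetch the stop word corpus used below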

After loading the libraries, we load the data, change the polarity from [0, 4] to [0, 1], and discard all fields except for the text content and the polarity:

# Read the data and assign labels
labels = ['polarity', 'id', 'date', 'query', 'user', 'text']
data = pd.read_csv("sent140.csv", names=labels)

# Keep only text and polarity, change polarity to 0-1
data = data[['text', 'polarity']]
data.polarity.replace(4, 1, inplace=True)
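
As a quick sanity check (not part of the original pipeline), we can confirm that only the two columns remain and that the polarity is now binary:

# Inspect the first rows and the label distribution
print(data.head())
print(data.polarity.value_counts())  # should contain only 0 and 1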

As we saw earlier, many words do not contribute to a tweet's sentiment, although they frequently appear in text. Search engines handle this by removing such words, which are called stop words. NLTK has a list of the most common stop words that we are going to utilize. Furthermore, as there are a number of stop words that are contractions (such as "you're" and "don't") and tweets frequently omit single quotes in contractions, we expand the list in order to include contractions without single quotes (such as "dont"):

# Create a list of stopwords
stops = stopwords.words("english")
# Add stopword variants without single quotes
no_quotes = []
for word in stops:
    if "'" in word:
        no_quotes.append(re.sub(r"'", '', word))
stops.extend(no_quotes)
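
A quick check of the expanded list (assuming, as noted above, that contractions such as "don't" appear in NLTK's list) confirms that both spellings are now treated as stop words:

print("don't" in stops)  # True, from NLTK's original list
print("dont" in stops)   # True, the variant added above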

We then define two distinct functions. The first function, clean_string, cleans the tweet by removing the elements we discussed earlier (references, hashtags, and so on). The second function, preprocess, removes all punctuation and stop words, and stems each word using NLTK's PorterStemmer:

def clean_string(string):
    # Remove HTML entities
    tmp = re.sub(r'&\w*;', '', string)
    # Remove @user
    tmp = re.sub(r'@(\w+)', '', tmp)
    # Remove links
    tmp = re.sub(r'(http|https|ftp)://[a-zA-Z0-9\./]+', '', tmp)
    # Lowercase
    tmp = tmp.lower()
    # Remove Hashtags
    tmp = re.sub(r'#(\w+)', '', tmp)
    # Remove repeating chars
    tmp = re.sub(r'(.)\1{1,}', r'\1\1', tmp)
    # Remove anything that is not letters
    tmp = re.sub("[^a-zA-Z]", " ", tmp)
    # Remove anything that is less than two characters
    tmp = re.sub(r'\b\w{1,2}\b', '', tmp)
    # Remove multiple spaces
    tmp = re.sub(r'\s\s+', ' ', tmp)
    return tmp
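
As a quick illustration, here is how clean_string handles a made-up tweet that exhibits several of the issues discussed earlier (the example text is hypothetical, not taken from the dataset):

example = "@someuser I loooove this!!! #summer https://www.packtpub.com/eu/ &amp;"
print(clean_string(example))
# Roughly ' loove this ' -- the mention, link, hashtag, HTML entity,
# short words, and repeated characters have been stripped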

def preprocess(string):
    stemmer = PorterStemmer()
    # Remove any punctuation character
    removed_punc = ''.join([char for char in string if char not in punctuation])
    cleaned = []
    # Remove any stopword
    for word in removed_punc.split(' '):
        if word not in stops:
            cleaned.append(stemmer.stem(word.lower()))
    return ' '.join(cleaned)
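
Both functions can then be applied to the dataset. The application step and the train/test split indices are not shown in this excerpt, and the evaluation code below also relies on a few additional imports; the following is a minimal sketch, with illustrative split sizes (adjust them to the subset of the dataset you are working with):

import numpy as np
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier

# Clean and preprocess every tweet in place
data.text = data.text.apply(clean_string).apply(preprocess)

# Illustrative train/test split indices used by check_classifier below
train_size = 10000
test_start = 10000
test_end = 100000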

As we would like to compare the performance of the ensemble with that of the base learners themselves, we will define a function that evaluates any given classifier. The two most important factors that define our dataset are the n-grams we use and the number of features. Scikit-learn provides a TF-IDF feature extractor, the TfidfVectorizer class. It allows us to utilize only the top M most frequent features, as well as define the n-gram range, through the max_features and ngram_range parameters. It creates sparse arrays of features, which saves a great deal of memory, but the results must be converted to normal arrays before our classifiers can process them; this is achieved by calling the toarray() function. Our check_features_ngrams function accepts the number of features, a tuple of minimum and maximum n-grams, and a list of named classifiers ((name, classifier) tuples). It extracts the required features from the dataset and passes them to the nested check_classifier function, which trains and evaluates each classifier and appends the results to the specified file, outs.txt:

def check_features_ngrams(features, n_grams, classifiers):
    print(features, n_grams)

    # Create the TF-IDF feature extractor
    tf = TfidfVectorizer(max_features=features, ngram_range=n_grams,
                         stop_words='english')

    # Create the TF-IDF features
    tf.fit(data.text)
    transformed = tf.transform(data.text)
    np.random.seed(123456)

    def check_classifier(name, classifier):
        print('--' + name + '--')

        # Train the classifier
        x_data = transformed[:train_size].toarray()
        y_data = data.polarity[:train_size].values
        classifier.fit(x_data, y_data)
        i_s = metrics.accuracy_score(y_data, classifier.predict(x_data))

        # Evaluate on the test set
        x_data = transformed[test_start:test_end].toarray()
        y_data = data.polarity[test_start:test_end].values
        oos = metrics.accuracy_score(y_data, classifier.predict(x_data))

        # Export the results
        with open("outs.txt", "a") as f:
            f.write(str(features) + ',')
            f.write(str(n_grams[-1]) + ',')
            f.write(name + ',')
            f.write('%.4f' % i_s + ',')
            f.write('%.4f' % oos + '\n')

    for name, classifier in classifiers:
        check_classifier(name, classifier)

Finally, we test for n-grams in the range of [1, 3] and for the top 500, 1000, 5000, 10000, 20000, and 30000 features.

# Create csv header
with open("outs.txt", "a") as f:
    f.write('features,ngram_range,classifier,train_acc,test_acc\n')

# Test all features and n-grams combinations
for features in [500, 1000, 5000, 10000, 20000, 30000]:
    for n_grams in [(1, 1), (1, 2), (1, 3)]:
        # Create the ensemble
        voting = VotingClassifier([('LR', LogisticRegression()),
                                   ('NB', MultinomialNB()),
                                   ('Ridge', RidgeClassifier())])
        # Create the named classifiers
        classifiers = [('LR', LogisticRegression()),
                       ('NB', MultinomialNB()),
                       ('Ridge', RidgeClassifier()),
                       ('Voting', voting)]
        # Evaluate them
        check_features_ngrams(features, n_grams, classifiers)
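
Once every combination has run, outs.txt can be loaded back with pandas to compare the classifiers; a minimal sketch (assuming the header and rows were written as above):

results = pd.read_csv("outs.txt")
# Best test accuracy achieved by each classifier
print(results.groupby('classifier').test_acc.max())
# Test accuracy per number of features and classifier
print(results.pivot_table(index='features', columns='classifier', values='test_acc'))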

The results are depicted in the following diagram. As is evident, accuracy increases for all classifiers as we increase the number of features. Furthermore, when the number of features is relatively small, unigrams outperform combinations of unigrams with bigrams/trigrams, because the most frequent multi-word expressions do not carry sentiment. Finally, although voting performs reasonably well, it is not able to outperform logistic regression:

Results of voting and base learners