Sentiment analysis tools 

Sentiment analysis can be implemented in a number of ways. The easiest to both implement and understand are lexicon-based approaches. These methods rely on lists (lexicons) of polarized words and expressions. Given a sentence, they count the number of positive and negative words and expressions it contains. If there are more positive words/expressions, the sentence is labeled as positive; if there are more negative ones, it is labeled as negative; if the counts are equal, it is labeled as neutral. Although this approach is relatively easy to code and does not require any training, it has two major disadvantages. First, it does not take interactions between words into account. For example, "not bad", which is actually a positive expression, can be classified as negative, as it is composed of two negative words. Even if the expression is included in the lexicon as positive, the variant "not that bad" may not be. The second disadvantage is that the whole process relies on good and complete lexicons. If the lexicon omits certain words, the results can be very poor.
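As a sketch of the idea, the following Python snippet implements a minimal lexicon-based classifier; the two word sets are tiny illustrative stand-ins for a real lexicon:

POSITIVE = {"good", "nice", "great", "happy"}
NEGATIVE = {"bad", "awful", "sad", "hurts"}

def lexicon_sentiment(sentence):
    # Count how many words of the sentence appear in each lexicon.
    words = sentence.lower().split()
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(lexicon_sentiment("The weather is nice"))  # positive
print(lexicon_sentiment("Not bad"))              # negative: the first disadvantage in action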

Another approach is to train a machine learning model to classify sentences. This requires a training dataset, in which a number of sentences have been labeled as positive or negative by human experts. The labeling process indirectly uncovers a hidden problem in (and also indicates the difficulty of) sentiment analysis: human analysts agree on only 80% to 85% of the cases. This is partly due to the subjective nature of many expressions. For example, the sentence "Today the weather is nice, yesterday it was bad" can be either positive, negative, or neutral, depending on intonation. If "nice" is stressed, the sentence is positive; if "bad" is stressed, it is negative; with no particular stress, it is actually neutral (a simple observation of a change in the weather).

You can read more about the problem of disagreement between human analysts in sentiment classification at: https://www.lexalytics.com/lexablog/sentiment-accuracy-quick-overview.

To create machine learning features from text data, n-grams are usually created. N-grams are sequences of n words extracted from each sentence. For example, the sentence "Hello there, kids" contains the following (a short extraction sketch follows the list):

  • 1-grams: "Hello", "there,", "kids"
  • 2-grams: "Hello there,", "there, kids"
  • 3-grams: "Hello there, kids"
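The following minimal Python sketch reproduces this extraction by splitting on whitespace; note that punctuation stays attached to words, as in "there," above:

def ngrams(sentence, n):
    # Slide a window of n words across the whitespace-split sentence.
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

for n in (1, 2, 3):
    print(ngrams("Hello there, kids", n))
# ['Hello', 'there,', 'kids']
# ['Hello there,', 'there, kids']
# ['Hello there, kids']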

To create numeric features for a dataset, a single feature is created for each unique n-gram. For each instance, a feature's value is the number of times the corresponding n-gram appears in the sentence. For example, consider the following toy dataset:

Sentence                  Polarity
My head hurts             Positive
The food was good food    Negative
The sting hurts           Positive
That was a good time      Negative

A sentiment toy dataset

Assume that we will only use 1-grams (unigrams). The unique unigrams contained in the dataset are: "My", "head", "hurts", "The", "food", "was", "good", "sting", "That", "a", and "time". Thus, each instance has 11 features. Each feature corresponds to a single n-gram (in our case, a unigram). Each feature’s value equals the number of appearances of the corresponding n-gram in the instance. The final dataset is depicted in the following table:

My  Head  Hurts  The  Food  Was  Good  Sting  That  A  Time  Polarity
1   1     1      0    0     0    0     0      0     0  0     Positive
0   0     0      1    2     1    1     0      0     0  0     Negative
0   0     1      1    0     0    0     1      0     0  0     Positive
0   0     0      0    0     1    1     0      1     1  1     Negative

The extracted features dataset
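Assuming scikit-learn is available, a count matrix like the one above can be built with CountVectorizer, as in the minimal sketch below. Note that CountVectorizer lowercases tokens and orders its vocabulary alphabetically, so the columns will differ from the table's order, and the default token pattern is loosened here so that one-letter words such as "a" are kept (get_feature_names_out requires scikit-learn 1.0 or later):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["My head hurts",
             "The food was good food",
             "The sting hurts",
             "That was a good time"]

# Loosen the token pattern so single-character words ("a") are not dropped.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # the 11 unique unigrams, lowercased
print(counts.toarray())                    # one row of counts per sentence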

Usually, each instance is normalized, so each feature represents the relative frequency, rather than the absolute frequency (count), of each n-gram. This method is called Term Frequency (TF). The TF dataset is depicted as follows:

My    Head  Hurts  The   Food  Was   Good  Sting  That  A     Time  Polarity
0.33  0.33  0.33   0     0     0     0     0      0     0     0     Positive
0     0     0      0.2   0.4   0.2   0.2   0      0     0     0     Negative
0     0     0.33   0.33  0     0     0     0.33   0     0     0     Positive
0     0     0      0     0     0.2   0.2   0      0.2   0.2   0.2   Negative

The TF dataset
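The TF values above can be reproduced by dividing each row of the count matrix by that row's sum; a minimal numpy sketch, using the counts from the extracted features table:

import numpy as np

# Count matrix from the extracted features table (columns: My, Head,
# Hurts, The, Food, Was, Good, Sting, That, A, Time).
counts = np.array([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                   [0, 0, 0, 1, 2, 1, 1, 0, 0, 0, 0],
                   [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1]])

# Divide each row by the total number of words in the sentence.
tf = counts / counts.sum(axis=1, keepdims=True)
print(tf.round(2))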

In the English language, some terms occur with very high frequency while contributing little to an expression's sentiment. To account for this fact, the Inverse Document Frequency (IDF) is employed, which puts more emphasis on infrequent terms. For N instances, the IDF of a unigram u that is present in M of them is calculated as follows (using the base-10 logarithm, as the values in the next table show):

IDF(u) = log(N / M)

The following table depicts the IDF-transformed dataset:

My   Head  Hurts  The  Food  Was  Good  Sting  That  A    Time  Polarity
0.6  0.6   0.3    0    0     0    0     0      0     0    0     Positive
0    0     0      0.3  0.6   0.3  0.3   0      0     0    0     Negative
0    0     0.3    0.3  0     0    0     0.6    0     0    0     Positive
0    0     0      0    0     0.3  0.3   0      0.6   0.6  0.6   Negative

The IDF dataset
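A minimal numpy sketch that reproduces the table above from the count matrix, applying the base-10 logarithm of N/M wherever a unigram is present:

import numpy as np

# Count matrix from the extracted features table (columns: My, Head,
# Hurts, The, Food, Was, Good, Sting, That, A, Time).
counts = np.array([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                   [0, 0, 0, 1, 2, 1, 1, 0, 0, 0, 0],
                   [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1]])

N = counts.shape[0]           # number of instances
M = (counts > 0).sum(axis=0)  # number of instances containing each unigram
idf = np.log10(N / M)         # IDF of each unigram

# Each instance receives the unigram's IDF wherever the unigram is present.
idf_dataset = (counts > 0) * idf
print(idf_dataset.round(1))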