Sentiment analysis tools 

Sentiment analysis can be implemented in a number of ways. The easiest to both implement and understand are lexicon-based approaches. These methods rely on lists (lexicons) of polarized words and expressions. Given a sentence, they count the number of positive and negative words and expressions it contains. If there are more positive words/expressions, the sentence is labeled as positive; if there are more negative ones, it is labeled as negative; if the counts are equal, it is labeled as neutral. Although this approach is relatively easy to code and does not require any training, it has two major disadvantages. First, it does not take interactions between words into account. For example, "not bad", which is actually a positive expression, can be classified as negative, as it is composed of two negative words. Even if the expression is included in the lexicon as positive, the variant "not that bad" may not be. The second disadvantage is that the whole process relies on good and complete lexicons. If the lexicon omits certain words, the results can be very poor.
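As a sketch of the idea, the following Python snippet implements a minimal lexicon-based classifier; the two word sets are tiny illustrative stand-ins for a real lexicon:

POSITIVE = {"good", "nice", "great", "happy"}
NEGATIVE = {"bad", "awful", "sad", "hurts"}

def lexicon_sentiment(sentence):
    # Count how many words of the sentence appear in each lexicon.
    words = sentence.lower().split()
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(lexicon_sentiment("The weather is nice"))  # positive
print(lexicon_sentiment("Not bad"))              # negative: the first disadvantage in action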

Another approach is to train a machine learning model to classify sentences. This requires a training dataset, in which a number of sentences have been labeled as positive or negative by human experts. The labeling process indirectly uncovers a hidden problem in (and also indicates the difficulty of) sentiment analysis: human analysts agree on only 80% to 85% of the cases. This is partly due to the subjective nature of many expressions. For example, the sentence "Today the weather is nice, yesterday it was bad" can be either positive, negative, or neutral, depending on intonation. If "nice" is stressed, the sentence is positive; if "bad" is stressed, it is negative; with no particular stress, it is actually neutral (a simple observation of a change in the weather).

You can read more about the problem of disagreement between human analysts in sentiment classification at: https://www.lexalytics.com/lexablog/sentiment-accuracy-quick-overview.

To create machine learning features from text data, n-grams are usually created. N-grams are sequences of n words extracted from each sentence. For example, the sentence "Hello there, kids" contains the following (a short extraction sketch follows the list):

  • 1-grams: "Hello", "there,", "kids"
  • 2-grams: "Hello there,", "there, kids"
  • 3-grams: "Hello there, kids"
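The following minimal Python sketch reproduces this extraction by splitting on whitespace; note that punctuation stays attached to words, as in "there," above:

def ngrams(sentence, n):
    # Slide a window of n words across the whitespace-split sentence.
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

for n in (1, 2, 3):
    print(ngrams("Hello there, kids", n))
# ['Hello', 'there,', 'kids']
# ['Hello there,', 'there, kids']
# ['Hello there, kids']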

To create numeric features for a dataset, a single feature is created for each unique n-gram. For each instance, a feature's value is the number of times the corresponding n-gram appears in the sentence. For example, consider the following toy dataset:

Sentence                  Polarity
My head hurts             Positive
The food was good food    Negative
The sting hurts           Positive
That was a good time      Negative

A sentiment toy dataset

Assume that we will only use 1-grams (unigrams). The unique unigrams contained in the dataset are: "My", "head", "hurts", "The", "food", "was", "good", "sting", "That", "a", and "time". Thus, each instance has 11 features. Each feature corresponds to a single n-gram (in our case, a unigram). Each feature’s value equals the number of appearances of the corresponding n-gram in the instance. The final dataset is depicted in the following table:

My  Head  Hurts  The  Food  Was  Good  Sting  That  A  Time  Polarity
1   1     1      0    0     0    0     0      0     0  0     Positive
0   0     0      1    2     1    1     0      0     0  0     Negative
0   0     1      1    0     0    0     1      0     0  0     Positive
0   0     0      0    0     1    1     0      1     1  1     Negative

The extracted features dataset
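Assuming scikit-learn is available, a count matrix like the one above can be built with CountVectorizer, as in the minimal sketch below. Note that CountVectorizer lowercases tokens and orders its vocabulary alphabetically, so the columns will differ from the table's order, and the default token pattern is loosened here so that one-letter words such as "a" are kept (get_feature_names_out requires scikit-learn 1.0 or later):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["My head hurts",
             "The food was good food",
             "The sting hurts",
             "That was a good time"]

# Loosen the token pattern so single-character words ("a") are not dropped.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # the 11 unique unigrams, lowercased
print(counts.toarray())                    # one row of counts per sentence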

Usually, each instance is normalized, so each feature represents the relative frequency, rather than the absolute frequency (count), of each n-gram. This method is called Term Frequency (TF). The TF dataset is depicted as follows:

My    Head  Hurts  The   Food  Was   Good  Sting  That  A     Time  Polarity
0.33  0.33  0.33   0     0     0     0     0      0     0     0     Positive
0     0     0      0.2   0.4   0.2   0.2   0      0     0     0     Negative
0     0     0.33   0.33  0     0     0     0.33   0     0     0     Positive
0     0     0      0     0     0.2   0.2   0      0.2   0.2   0.2   Negative

The TF dataset
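The TF values above can be reproduced by dividing each row of the count matrix by that row's sum; a minimal numpy sketch, using the counts from the extracted features table:

import numpy as np

# Count matrix from the extracted features table (columns: My, Head,
# Hurts, The, Food, Was, Good, Sting, That, A, Time).
counts = np.array([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                   [0, 0, 0, 1, 2, 1, 1, 0, 0, 0, 0],
                   [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1]])

# Divide each row by the total number of words in the sentence.
tf = counts / counts.sum(axis=1, keepdims=True)
print(tf.round(2))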

In the English language, some terms occur with very high frequency while contributing little to an expression's sentiment. To account for this fact, the Inverse Document Frequency (IDF) is employed, which puts more emphasis on infrequent terms. For N instances, the IDF of a unigram u that is present in M of them is calculated as follows (using the base-10 logarithm, as the values in the next table show):

IDF(u) = log(N / M)

The following table depicts the IDF-transformed dataset:

My   Head  Hurts  The  Food  Was  Good  Sting  That  A    Time  Polarity
0.6  0.6   0.3    0    0     0    0     0      0     0    0     Positive
0    0     0      0.3  0.6   0.3  0.3   0      0     0    0     Negative
0    0     0.3    0.3  0     0    0     0.6    0     0    0     Positive
0    0     0      0    0     0.3  0.3   0      0.6   0.6  0.6   Negative

The IDF dataset
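A minimal numpy sketch that reproduces the table above from the count matrix, applying the base-10 logarithm of N/M wherever a unigram is present:

import numpy as np

# Count matrix from the extracted features table (columns: My, Head,
# Hurts, The, Food, Was, Good, Sting, That, A, Time).
counts = np.array([[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
                   [0, 0, 0, 1, 2, 1, 1, 0, 0, 0, 0],
                   [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0],
                   [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1]])

N = counts.shape[0]           # number of instances
M = (counts > 0).sum(axis=0)  # number of instances containing each unigram
idf = np.log10(N / M)         # IDF of each unigram

# Each instance receives the unigram's IDF wherever the unigram is present.
idf_dataset = (counts > 0) * idf
print(idf_dataset.round(1))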