Bag-of-words

A simple yet effective way of classifying text is to see the text as a bag-of-words. This means that we do not care for the order in which words appear in the text, instead we only care about which words appear in the text.

One of the ways of doing a bag-of-words classification is by simply counting the occurrences of different words from within a text. This is done with a so-called count vector. Each word has an index, and for each text, the value of the count vector at that index is the number of occurrences of the word that belong to the index.

As an example, the count vector for the text "I see cats and dogs and elephants" could look like this:

i      see    cats    and    dogs    elephants
1      1      1       2      1       1

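To make this concrete, here is a minimal sketch of how such a count vector could be built by hand, using Python's collections.Counter and a small hypothetical word-to-index mapping chosen just for this sentence (sklearn will do this work for us in a moment):

from collections import Counter

text = "I see cats and dogs and elephants"
tokens = text.lower().split()  # naive whitespace tokenization

# hypothetical vocabulary for this example only: each word gets a fixed index
vocabulary = {"i": 0, "see": 1, "cats": 2, "and": 3, "dogs": 4, "elephants": 5}

counts = Counter(tokens)  # maps each word to its number of occurrences
count_vector = [counts.get(word, 0) for word in vocabulary]
print(count_vector)  # [1, 1, 1, 2, 1, 1]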
In reality, count vectors are very sparse: there are about 23,000 different words in our text corpus, but any single text contains only a small fraction of them. It therefore makes sense to limit the number of words we include in our count vectors, which mostly drops very rare words that are often just gibberish or typos with no real meaning. As a side note, keeping all of these rare words could also be a source of overfitting.

We are using sklearn's built-in count vectorizer. By setting max_features, we can control how many words we want to consider in our count vector. In this case, we will only consider the 10,000 most frequent words:

from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(max_features=10000)

Our count vectorizer can now transform texts into count vectors. Each count vector will have 10,000 dimensions:

X_train_counts = count_vectorizer.fit_transform(X_train)
X_test_counts = count_vectorizer.transform(X_test)
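As a quick sanity check (a sketch, using the variables defined above), you can inspect the shape of the resulting matrix and the size of the fitted vocabulary:

print(X_train_counts.shape)               # (number of training texts, 10000)
print(len(count_vectorizer.vocabulary_))  # 10000 words kept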

Once we have obtained our count vectors, we can then perform a simple logistic regression on them. While we could use Keras for logistic regression, as we did in the first chapter of this book, it is often easier to just use the logistic regression class from scikit-learn:

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()

clf.fit(X_train_counts, y_train)

y_predicted = clf.predict(X_test_counts)

Now that we have predictions from our logistic regressor, we can measure its accuracy with sklearn:

from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predicted)
0.8011049723756906

As you can see, we get 80% accuracy, which is pretty decent for such a simple method. A simple count vector-based classification is useful as a baseline for the more advanced methods we will discuss later.

TF-IDF

TF-IDF stands for Term Frequency, Inverse Document Frequency. It addresses a shortcoming of simple word counting: words that appear frequently in a particular text are important for that text, but words that appear in virtually all texts carry little information and should be weighted down.

The TF component is just like a count vector, except that TF divides the counts by the total number of words in a text. Meanwhile, the IDF component is the logarithm of the total number of texts in the entire corpus divided by the number of texts that include a specific word.

TF-IDF is the product of these two measurements. TF-IDF vectors are like count vectors, except that they contain TF-IDF scores instead of raw counts. Words that occur frequently in a text but rarely in the corpus as a whole receive a high TF-IDF score.
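Written out with the definitions above, the scores look like this (a sketch of the textbook formulas; note that sklearn's TfidfVectorizer uses a smoothed and normalized variant by default, so its exact values differ slightly):

\mathrm{tf}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{number of words in } d}

\mathrm{idf}(t) = \log \frac{N}{\left|\{d : t \in d\}\right|}

\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)

where N is the total number of texts in the corpus.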

We create TF-IDF vectors just as we created count vectors with sklearn:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

Once we have the TF-IDF vectors, we can train a logistic regressor on them just like we did for count vectors:

clf_tfidf = LogisticRegression()
clf_tfidf.fit(X_train_tfidf, y_train)

y_predicted = clf_tfidf.predict(X_test_tfidf)

In this case, TF-IDF does slightly worse than count vectors. However, the performance difference is so small that it might well be attributable to chance:

accuracy_score(y_pred=y_predicted, y_true=y_test)
0.7978821362799263