How it works...

As mentioned earlier, we start with some text. In our example, we use some extracts from Spark's documentation.

.RegexTokenizer(...) is the text tokenizer that uses regular expressions to split the sentences. In our example, we split on one or more whitespace characters; that's the \s+ part of the expression. However, our pattern also splits on a comma, a period, or a quotation mark; that's the [,."] part. The pipe, |, means split on either the whitespace or the punctuation marks. The text, after passing through .RegexTokenizer(...), will look as follows:

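If you want to reproduce this step in isolation, here is a minimal sketch of the tokenizer; the sample sentence and the input and input_arr column names are assumptions made for illustration rather than the exact ones used in the recipe:

from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer

spark = SparkSession.builder.getOrCreate()

# A tiny stand-in corpus; the recipe uses extracts from Spark's documentation
text_df = spark.createDataFrame(
    [(0, 'Apache Spark achieves high performance for both batch and streaming data.')],
    ['id', 'input']
)

# Split on one or more whitespace characters, or on a comma, period, or quotation mark
tokenizer = RegexTokenizer(
    inputCol='input',
    outputCol='input_arr',
    pattern=r'\s+|[,."]'
)

tokenizer.transform(text_df).select('input_arr').show(truncate=False)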
Next, we use the .StopWordsRemover(...) method to remove the stop words, as the name suggests.

Check out NLTK's list of the most common stop words: https://gist.github.com/sebleier/554280.

.StopWordsRemover(...) simply scans the tokenized text and discards any stop word it encounters. After removing the stop words, our text will look as follows:

As you can see, what is left is the essential meaning of the sentences; a human can read these words and still make some sense of them.
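Continuing the sketch started above, the stop word removal step could look like this; sq_remover follows the object name mentioned later in the text, while input_stop is an assumed column name:

from pyspark.ml.feature import StopWordsRemover

# Drop English stop words (the default list) from the tokenized column
sq_remover = StopWordsRemover(
    inputCol=tokenizer.getOutputCol(),
    outputCol='input_stop'
)

sq_remover.transform(tokenizer.transform(text_df)) \
    .select('input_stop').show(truncate=False)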

A hashing trick (or feature hashing) is a method that transforms an arbitrary list of features into indices in a vector. It is a space-efficient way of tokenizing text and, at the same time, turning the text into a numerical representation. The hashing trick relies on a hashing function, that is, a function that maps an input of arbitrary size to a fixed-size value. Normally, the mapping is lossy and one-way: different inputs can be hashed to the same value (this is called a collision) and, once hashed, it is almost always prohibitively difficult to reconstruct the original input. The .HashingTF(...) method takes the input column from the sq_remover object and transforms (or encodes) the tokenized text into a vector of 20 features. Here's what our text will look like after it has been hashed:

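Continuing the same sketch, the hashing step could look as follows; the input_tf column name is an assumption, while numFeatures=20 matches the vector size quoted above:

from pyspark.ml.feature import HashingTF

# Hash the filtered tokens into a 20-element term-frequency vector
hasher = HashingTF(
    inputCol=sq_remover.getOutputCol(),
    outputCol='input_tf',
    numFeatures=20
)

hashed_df = hasher.transform(
    sq_remover.transform(tokenizer.transform(text_df))
)
hashed_df.select('input_tf').show(truncate=False)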
Now that we have the features hashed, we could potentially use them to train a machine learning model. However, raw counts of word occurrences can be misleading, because words that appear frequently everywhere are not necessarily informative. A better measure is the term frequency-inverse document frequency (TF-IDF): it weighs how often a word occurs in a document against how many documents in the whole corpus contain that word, so words that are common across the corpus are down-weighted. This measure helps to evaluate how important a word is to a document within the whole collection of documents. In Spark, we use the .IDF(...) method, which does this rescaling for us.

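A sketch of the IDF step, again continuing from the objects defined above; note that .IDF(...) is an estimator, so it has to be fitted before it can rescale the hashed counts (the Pipeline shown after the next paragraph takes care of that), and input_tfidf is an assumed column name:

from pyspark.ml.feature import IDF

# Rescale the hashed term frequencies by the inverse document frequency
idf = IDF(
    inputCol=hasher.getOutputCol(),
    outputCol='input_tfidf'
)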
Here's what our text would look like after passing through the whole Pipeline:

So, effectively, we have encoded the passage from Spark's documentation into a vector of 20 elements that we could now use to train a machine learning model.
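Putting the sketched stages together into a single Pipeline could look as follows; the stage objects come from the snippets above, and the recipe's own Pipeline may differ in naming:

from pyspark.ml import Pipeline

# Chain tokenization, stop word removal, hashing, and IDF into one workflow
pipeline = Pipeline(stages=[tokenizer, sq_remover, hasher, idf])

features_df = pipeline.fit(text_df).transform(text_df)
features_df.select('input_tfidf').show(truncate=False)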
