Featurization - feature hashing

Now it is time to transform the string representation into a numeric one. We adopt a bag-of-words approach; however, instead of building an explicit vocabulary, we use a trick called feature hashing, which serves as a time- and space-efficient implementation of the bag-of-words model. Let's look in more detail at how Spark employs this powerful technique to help us construct and access our tokenized dataset efficiently.

At its core, feature hashing is a fast and space-efficient method for dealing with high-dimensional data (typical when working with text) by converting arbitrary features into indices within a vector or matrix. This is best described with an example. Suppose we have the following two movie reviews:

  1. The movie Goodfellas was well worth the money spent. Brilliant acting!
  2. Goodfellas is a riveting movie with a great cast and a brilliant plot - a must see for all movie lovers!

For each token in these reviews, we can apply a "hashing trick," whereby we assign each distinct token a number. So, the set of unique tokens (after lowercasing and text processing) in the preceding two reviews, in alphabetical order, would be:

{"acting": 1, "all": 2, "brilliant": 3, "cast": 4, "goodfellas": 5, "great": 6, "lover": 7, "money": 8, "movie": 9, "must": 10, "plot": 11, "riveting": 12, "see": 13, "spent": 14, "well": 15, "with": 16, "worth": 17}

We then apply this mapping to create the following matrix, where each entry counts how many times a token appears in a review:

[[1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1]
[0, 1, 1, 1, 1, 1, 1, 0, 2, 1, 1, 1, 1, 0, 0, 1, 0]]

The matrix from the feature hashing is constructed as follows:

  • Rows represent the individual movie reviews.
  • Columns represent the features (not the actual words!). The feature space is the range of the hash function used, so every row has the same, fixed number of columns rather than one ever-growing, wide matrix.
  • Thus, an entry (i, j) = k in the matrix means that feature j appears k times in row i. For example, the token "movie", which is hashed to feature 9, appears twice in the second review; therefore, matrix (2, 9) = 2.
  • The hash function can leave gaps: if it maps a small set of words into a large numeric space, the resulting matrix will be highly sparse.
  • One important consideration is the notion of hashing collisions, where two different features (tokens, in this case) are hashed to the same index in our feature matrix. A way to guard against this is to choose a large number of features to hash into, which is a parameter we can control in Spark (the Spark ML HashingTF transformer defaults to 2^18, that is, 262,144 features). A minimal sketch of this hashing scheme follows the list.
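To make the trick concrete, the following is a minimal sketch of hashed term counting in plain Scala. The hashedTf helper and the tiny 16-bucket feature space are hypothetical choices for illustration only; Spark's own implementation uses MurmurHash3 rather than Java's hashCode, but the mechanics are the same: hash each token to an index and increment the count stored there.

// Hypothetical helper, for illustration only: map each token to an index
// via a hash and count occurrences at that index.
def hashedTf(tokens: Seq[String], numFeatures: Int): Array[Int] = {
  val vector = Array.fill(numFeatures)(0)
  for (token <- tokens) {
    // Math.floorMod keeps the index non-negative even for negative hash codes
    val index = Math.floorMod(token.hashCode, numFeatures)
    vector(index) += 1
  }
  vector
}

// The processed tokens of the second review; a tiny feature space makes collisions likely
val review2 = Seq("goodfellas", "riveting", "movie", "great", "cast",
                  "brilliant", "plot", "must", "see", "all", "movie", "lover")
println(hashedTf(review2, 16).mkString("[", ", ", "]"))

With only 16 buckets for 12 distinct tokens, some tokens are likely to collide and share an index; scaling the number of features up, as Spark does by default, makes such collisions rare.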

Now, we can employ Spark's hashing function, which maps each token to a hash index; these indices make up our feature vector/matrix. As always, we start by importing the necessary classes, and then we change the number of features to hash into from its default to 4096 (2^12).

In the code, we use the HashingTF transformer from the Spark ML package (you will learn more about transformers later in this chapter). It requires the names of the input and output columns. For our movieReviews dataset, the input column is reviewTokens, which holds the tokens created in the previous steps. The result of the transformation is stored in a new column called tf:

import org.apache.spark.ml.feature.HashingTF

val hashingTF = new HashingTF()
  .setInputCol("reviewTokens")
  .setOutputCol("tf")
  .setNumFeatures(1 << 12) // 2^12 = 4096 features
  .setBinary(false)        // keep raw term counts instead of 0/1 indicators
val tfTokens = hashingTF.transform(movieReviews)
println("Vectorized movie reviews:")
tfTokens.show(5)

After invoking the transformation, the resulting tfTokens dataset contains, alongside the original data, a new column called tf, which holds an instance of org.apache.spark.ml.linalg.Vector for each input row. In our case, the vector is sparse because the hash space is much larger than the number of unique tokens.
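To get a feel for the result, we can pull the first vector out of the tf column and inspect it. This is a small sketch assuming the preceding code has been run; size and numNonzeros are standard members of org.apache.spark.ml.linalg.Vector:

import org.apache.spark.ml.linalg.Vector

// Grab the hashed feature vector of the first review
val firstVector = tfTokens.select("tf").head.getAs[Vector](0)
println(s"vector size      = ${firstVector.size}")        // 4096 hash buckets
println(s"non-zero entries = ${firstVector.numNonzeros}") // roughly one per distinct token

Since only a handful of the 4,096 buckets are occupied for any single review, Spark stores the result as a sparse vector of (index, value) pairs instead of a dense array.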
