Term Frequency - Inverse Document Frequency (TF-IDF) weighting scheme

We will now use Spark ML to apply a very common weighting scheme called TF-IDF to convert our tokenized reviews into vectors, which will be inputs to our machine learning models. The math behind this transformation is relatively straightforward:

For each token:

  1. Count the term frequency within a given document (in our case, a movie review).
  2. Multiply this count by the logarithm of the inverse document frequency, which measures how commonly the token occurs among all of the documents (commonly referred to as the corpus).
  3. Taking the inverse is useful because it penalizes tokens that occur too frequently across the corpus (for example, "movie") and boosts tokens that appear less frequently.
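The steps above can be sketched in plain Scala without Spark. The tiny corpus below is hypothetical, and the sketch assumes Spark ML's smoothed variant of the formula, idf(t) = log((n + 1) / (df(t) + 1)), where n is the number of documents and df(t) is the number of documents containing token t:

```scala
// A hypothetical toy corpus of three tokenized "reviews"
val corpus = Seq(
  Seq("great", "movie", "great", "cast"),
  Seq("boring", "movie"),
  Seq("great", "acting"))

val n = corpus.size.toDouble

// Document frequency: in how many documents each token appears
val df = corpus.flatMap(_.distinct).groupBy(identity)
  .map { case (t, docs) => t -> docs.size.toDouble }

// TF-IDF using the smoothed formula tf * log((n + 1) / (df + 1))
def tfIdf(doc: Seq[String]): Map[String, Double] =
  doc.groupBy(identity)
    .map { case (t, occ) => t -> occ.size * math.log((n + 1) / (df(t) + 1)) }

// "movie" occurs in two of three documents, "cast" in only one, so even
// with the same in-document count, "cast" receives a higher weight
tfIdf(corpus.head).foreach { case (t, w) => println(f"$t%-6s $w%.4f") }
```

Note how "cast" outscores "movie" in the first review despite both occurring once there; this is exactly the boosting of rare tokens described in step 3.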

Now, we can scale terms based on the inverse document frequency formula explained earlier. First, we need to compute a model, that is, a prescription for how to scale term frequencies. In this case, we use the Spark IDF estimator to create a model based on the data produced in the previous step by hashingTF:

import org.apache.spark.ml.feature.IDF
val idf = new IDF()
  .setInputCol(hashingTF.getOutputCol)
  .setOutputCol("tf-idf")
val idfModel = idf.fit(tfTokens)

Here, we trained (fitted) a Spark estimator on the input data, that is, the output of the transformation in the previous step. The IDF estimator computes the weights of individual tokens. Once we have the model, we can apply it to any data that contains the column defined during fitting:

val tfIdfTokens = idfModel.transform(tfTokens)
println("Vectorized and scaled movie reviews:")
tfIdfTokens.show(5)

Let's look in more detail at a single row and the difference between the hashingTF and IDF outputs. Both operations produced a sparse vector of the same length. We can look at the non-zero elements and verify that both rows contain non-zero values at the same locations:

import org.apache.spark.ml.linalg.Vector
val vecTf = tfTokens.take(1)(0).getAs[Vector]("tf").toSparse
val vecTfIdf = tfIdfTokens.take(1)(0).getAs[Vector]("tf-idf").toSparse
println(s"Both vectors contain the same layout of non-zeros: ${java.util.Arrays.equals(vecTf.indices, vecTfIdf.indices)}")

We can also print a few non-zero values:

println(vecTf.values.zip(vecTfIdf.values).take(5).mkString("\n"))

You can see directly that tokens with the same frequency within a review can receive different resulting scores based on how often they occur across all of the reviews.
