Applying word2vec and exploring our data with vectors

Now that you have a good understanding of word2vec, doc2vec, and the power of word vectors, it's time to turn our focus to our original IMDB dataset, on which we will perform the following preprocessing:

  • Split words in each movie review by a space
  • Remove punctuation
  • Remove stopwords and strip out all non-alphabetic characters
  • Using our tokenization function from the previous chapter, we end up with an array of comma-separated words

Because we have already covered the preceding steps in Chapter 4, Predicting Movie Reviews Using NLP and Spark Streaming, we'll quickly reproduce them in this section.

As usual, we begin with starting the Spark shell, which is our working environment:

export SPARKLING_WATER_VERSION="2.1.12"
export SPARK_PACKAGES="ai.h2o:sparkling-water-core_2.11:${SPARKLING_WATER_VERSION},\
ai.h2o:sparkling-water-repl_2.11:${SPARKLING_WATER_VERSION},\
ai.h2o:sparkling-water-ml_2.11:${SPARKLING_WATER_VERSION},\
com.packtpub:mastering-ml-w-spark-utils:1.0.0"

$SPARK_HOME/bin/spark-shell \
        --master 'local[*]' \
        --driver-memory 8g \
        --executor-memory 8g \
        --conf spark.executor.extraJavaOptions=-XX:MaxPermSize=384M \
        --conf spark.driver.extraJavaOptions=-XX:MaxPermSize=384M \
        --packages "$SPARK_PACKAGES" "$@"

In the prepared environment, we can directly load the data:

val DATASET_DIR = s"${sys.env.get("DATADIR").getOrElse("data")}/aclImdb/train"
val FILE_SELECTOR = "*.txt"

case class Review(label: Int, reviewText: String)

val positiveReviews = spark.read.textFile(s"$DATASET_DIR/pos/$FILE_SELECTOR")
  .map(line => Review(1, line)).toDF
val negativeReviews = spark.read.textFile(s"$DATASET_DIR/neg/$FILE_SELECTOR")
  .map(line => Review(0, line)).toDF
var movieReviews = positiveReviews.union(negativeReviews)

We can also define the tokenization function to split the reviews into tokens, removing all the common words:

import org.apache.spark.ml.feature.StopWordsRemover
val stopWords = StopWordsRemover.loadDefaultStopWords("english") ++ Array("ax", "arent", "re")

val MIN_TOKEN_LENGTH = 3
val toTokens = (minTokenLen: Int, stopWords: Array[String], review: String) =>
  review.split("""\W+""")
    .map(_.toLowerCase.replaceAll("""[^\p{IsAlphabetic}]""", ""))
    .filter(w => w.length > minTokenLen)
    .filter(w => !stopWords.contains(w))

With all the building blocks ready, we just apply them to the loaded input data, augmenting it with a new column, reviewTokens, which holds the list of words extracted from each review:


import org.apache.spark.sql.functions.udf

val toTokensUDF = udf(toTokens.curried(MIN_TOKEN_LENGTH)(stopWords))
movieReviews = movieReviews.withColumn("reviewTokens", toTokensUDF('reviewText))

The reviewTokens column is a perfect input for the word2vec model. We can build it using the Spark ML library:

import org.apache.spark.ml.feature.Word2Vec

val word2vec = new Word2Vec()
  .setInputCol("reviewTokens")
  .setOutputCol("reviewVector")
  .setMinCount(1)
val w2vModel = word2vec.fit(movieReviews)

The Spark implementation exposes several additional hyperparameters (a configuration sketch follows this list):

  • setMinCount: This is the minimum number of times a token must appear in the corpus to be included in the vocabulary. It is another filtering step so that the model is not trained on super rare terms with low counts. The default value is 5.
  • setMaxIter (setNumIterations in the older RDD-based API): Typically, a higher number of iterations leads to more accurate word vectors (think of these as the number of epochs in a traditional feed-forward neural network). The default value is 1.
  • setVectorSize: This is where we declare the size of our vectors. It can be any integer, with a default size of 100. Many of the public word vectors that come pretrained tend to favor larger vector sizes; however, this is purely application-dependent.
  • setStepSize (setLearningRate in the older RDD-based API): Just like with a regular neural network, which we learned about in Chapter 2, Detecting Dark Matter - The Higgs-Boson Particle, discretion is needed on the part of the data scientist: too small a learning rate and the model will take forever-and-a-day to converge, while too large a learning rate risks a non-optimal set of learned weights in the network. The default value is 0.025.
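
For reference, the following is a minimal sketch of how these hyperparameters could be set explicitly on the estimator; the values are illustrative rather than tuned for our dataset:

import org.apache.spark.ml.feature.Word2Vec

// Illustrative settings only -- tune them on your own data
val word2vecTuned = new Word2Vec()
  .setInputCol("reviewTokens")
  .setOutputCol("reviewVector")
  .setMinCount(5)       // ignore tokens occurring fewer than 5 times (the default)
  .setMaxIter(5)        // more training iterations than the default of 1
  .setVectorSize(100)   // dimensionality of every word vector (the default)
  .setStepSize(0.025)   // learning rate (the default)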

Now that our model has finished training, it's time to inspect some of our word vectors! Recall that whenever you are unsure of which methods your model exposes, you can always hit the Tab key in the Spark shell for auto-completion, as follows:

w2vModel.findSynonyms("funny", 5).show()

The output is as follows:

Let's take a step back and consider what we just did here. First, we condensed the word funny into a vector composed of 100 floating-point numbers (recall that this is the default vector size for the Spark implementation of the word2vec algorithm). Because we have reduced all the words in our corpus of reviews to the same distributed representation of 100 numbers, we can make comparisons using cosine similarity, which is what the second column in our result set reflects (the highest cosine similarity in this case is for the word nutty).

Note that we can also access the vector for funny or any other word in our dictionary using the getVectors function, as follows:

w2vModel.getVectors.where("word = 'funny'").show(truncate = false)

The output is as follows:
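
To make the cosine-similarity comparison concrete, the following is a small, purely illustrative sketch that pulls two word vectors to the driver and recomputes their similarity by hand (it assumes both words are present in the model's vocabulary; the helper names are ours, not part of the Spark API):

import org.apache.spark.ml.linalg.{Vector => MLVector}

// Illustrative helpers: look up a word vector and compute the cosine
// similarity of two vectors locally on the driver
def vectorOf(word: String): Array[Double] =
  w2vModel.getVectors.where(s"word = '$word'")
    .select("vector").head.getAs[MLVector](0).toArray

def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norms = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
  dot / norms
}

cosine(vectorOf("funny"), vectorOf("nutty")) // compare with the value reported by findSynonyms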

A lot of interesting research has been done on clustering similar words together based on these representations. We will revisit clustering later in this chapter, when we try to cluster similar movie reviews after performing a hacked version of doc2vec in the next section.
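
As a purely illustrative taste of this idea (and not the approach we will take later), Spark ML's KMeans can be run directly on the table of word vectors produced by the model; the number of clusters here is arbitrary:

import org.apache.spark.ml.clustering.KMeans

// Illustrative sketch: partition the learned word vectors into 10 clusters
// and peek at the words that landed in one of them
val wordVectors = w2vModel.getVectors
val kmeans = new KMeans()
  .setK(10)
  .setFeaturesCol("vector")
  .setPredictionCol("cluster")
val wordClusters = kmeans.fit(wordVectors).transform(wordVectors)
wordClusters.where("cluster = 0").select("word").show(10)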
