Creating document vectors

So, now that we can create vectors that encode the meaning of words, and we know that any movie review, post tokenization, is an array of N words, we can create a poor man's doc2vec by averaging the vectors of all the words that make up the review. Note that by averaging the individual word vectors, we lose the specific sequencing of the words, which, depending on the sensitivity of your application, can make a difference:

(v(word_1) + v(word_2) + v(word_3) + ... + v(word_N)) / count(words in review)
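The averaging formula can be sketched in plain Scala; the three-dimensional word vectors below are made up purely for illustration:

```scala
// Toy sketch of the averaging formula above, using made-up 3-dimensional
// word vectors (real word2vec vectors typically have 100+ dimensions).
val wordVectors = Seq(
  Array(0.2, -0.1, 0.4), // v(word_1)
  Array(0.0, 0.3, 0.1),  // v(word_2)
  Array(0.4, -0.2, 0.1)  // v(word_3)
)

// Sum the vectors element-wise, then divide by the number of words.
val docVector = wordVectors
  .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
  .map(_ / wordVectors.length)
// docVector is the element-wise mean, approximately Array(0.2, 0.0, 0.2)
```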

Ideally, one would use a flavor of doc2vec to create document vectors; however, doc2vec had yet to be implemented in MLlib at the time of writing this book, so for now, we are going to use this simple version, which, as you will see, yields surprisingly good results. Fortunately, the Spark ML implementation of the word2vec model already averages the word vectors if the input column contains a list of tokens. For example, we can show that the phrase "funny movie" has a vector that is equal to the average of the vectors for the funny and movie tokens:

val testDf = Seq(Seq("funny"), Seq("movie"), Seq("funny", "movie")).toDF("reviewTokens")
w2vModel.transform(testDf).show(truncate=false)
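Rather than comparing the printed vectors by eye, the equality can also be checked programmatically. The following sketch assumes the `w2vModel` and `testDf` values defined above, and further assumes that the model's output column was named `reviewVector` via `setOutputCol` (the actual column name depends on how the model was configured):

```scala
import org.apache.spark.ml.linalg.Vector

// Assumes w2vModel and testDf from the snippet above, and that the model's
// output column was set with setOutputCol("reviewVector").
val Array(funny, movie, both) = w2vModel.transform(testDf)
  .select("reviewVector")
  .collect()
  .map(_.getAs[Vector](0).toArray)

// The element-wise average of the two single-token vectors...
val avg = funny.zip(movie).map { case (f, m) => (f + m) / 2 }

// ...matches the vector produced for the two-token review.
assert(avg.zip(both).forall { case (a, b) => math.abs(a - b) < 1e-6 })
```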

The output (a three-row table of vectors, omitted here) shows that the vector for the funny movie row is the element-wise average of the vectors for the funny and movie rows.

Hence, we can prepare our simple version of doc2vec with a single model transformation:

val inputData = w2vModel.transform(movieReviews)

As practitioners in this field, we have had the unique opportunity to work with various flavors of document vectors, including word averaging, doc2vec, LSTM autoencoders, and skip-thought vectors. What we have found is that for small word snippets, where the sequencing of words isn't crucial, simple word averaging does a surprisingly good job on supervised learning tasks. That is not to say that results could not be improved with doc2vec and other variants; it is rather an observation based on the many use cases we have seen across various customer applications.