The distributed bag-of-words model

The last algorithm in doc2vec is modeled after the word2vec skip-gram model, with one exception: instead of using the focus word as the input, we now take the document ID as the input and try to predict randomly sampled words from that document. That is, we completely ignore the context words in the input altogether.
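Spark itself does not ship a doc2vec implementation (more on that shortly), but as a rough sketch of what training this distributed bag-of-words variant might look like, here is a minimal example using the gensim library, where dm=0 selects the DBOW architecture; the toy corpus and parameter values are illustrative assumptions, not values from this chapter:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each review becomes a TaggedDocument whose tag is the
# document ID that the DBOW model uses as its input.
corpus = [
    TaggedDocument(words=["a", "masterpiece", "of", "modern", "cinema"], tags=[0]),
    TaggedDocument(words=["two", "hours", "i", "will", "never", "get", "back"], tags=[1]),
]

# dm=0 selects the distributed bag-of-words (PV-DBOW) architecture:
# the paragraph vector alone is trained to predict words randomly
# sampled from its document, ignoring the context words in the input.
model = Doc2Vec(corpus, dm=0, vector_size=300, min_count=1, epochs=40)

# Each document ID now maps to a 300-dimensional paragraph vector.
print(model.dv[0].shape)  # (300,)
```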

Like word2vec, these paragraph vectors let us compute similarities between documents of N words, and they have proven hugely successful in both supervised and unsupervised tasks. Here are some of the experiments that Mikolov et al. ran, including, notably, a supervised task that leverages the same dataset we used in the last two chapters!

An information-retrieval task over triplets of paragraphs, where the first paragraph should be closer to the second paragraph than to the third.
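As a concrete illustration of that ranking criterion, the sketch below checks whether the first paragraph's vector is closer, by cosine similarity, to the second than to the third; the random placeholder vectors merely stand in for real paragraph vectors inferred by a trained doc2vec model:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder 300-dimensional paragraph vectors; in practice these would
# come from a trained doc2vec model for the three paragraphs.
rng = np.random.default_rng(42)
p1, p2, p3 = rng.normal(size=(3, 300))

# The model scores a win on this triplet when paragraph 1 is more
# similar to paragraph 2 than to paragraph 3.
print(cosine(p1, p2) > cosine(p1, p3))
```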

In the subsequent sections, we are going to create a poor man's document vector by taking the average of a document's individual word vectors, which will encode entire movie reviews of arbitrary length into vectors of dimension 300.
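For a sense of what this averaging looks like in Spark, note that Spark ML's Word2Vec model transforms a document into the average of its word vectors, which is exactly this poor man's document vector. The minimal sketch below, with a made-up two-review DataFrame and illustrative parameters, produces one 300-dimensional vector per review:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("AvgWordVectors").getOrCreate()

# Hypothetical DataFrame of tokenized movie reviews: one row per
# review, with a "words" column holding that review's tokens.
reviews = spark.createDataFrame([
    ("a masterpiece of modern cinema".split(),),
    ("two hours i will never get back".split(),),
], ["words"])

# The fitted Word2VecModel averages the word vectors in each document,
# encoding every review as a single 300-dimensional vector.
word2vec = Word2Vec(vectorSize=300, minCount=1,
                    inputCol="words", outputCol="docVector")
model = word2vec.fit(reviews)
model.transform(reviews).select("docVector").show(truncate=False)
```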

At the time of writing this book, Spark's MLlib does not have an implementation of doc2vec; however, there are many incubating projects that leverage this technology, which you can test out.