Doc2vec explained

As we mentioned in the chapter's introduction, there is an extension of word2vec that encodes entire documents as opposed to individual words. In this context, a document is whatever you make of it: a sentence, a paragraph, an article, an essay, and so on. Not surprisingly, the paper describing this technique came out after the original word2vec paper and was coauthored by Quoc Le and Tomas Mikolov. Even though MLlib has yet to introduce doc2vec into its stable of algorithms, we feel it is necessary for a data science practitioner to know about this extension of word2vec, given its promising results on supervised learning and information retrieval tasks.

Like word2vec, doc2vec (sometimes referred to as paragraph vectors) relies on a supervised prediction task to learn distributed representations of documents from their contextual words. Doc2vec is also a family of algorithms, whose architectures look extremely similar to the CBOW and skip-gram models of word2vec that you learned about in the previous sections. As you will see next, implementing doc2vec requires jointly training individual word vectors alongside the document vectors that represent whatever we deem a document.
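
Since MLlib does not yet ship doc2vec, the following is a minimal sketch of the idea using the gensim Python library instead; the toy corpus, tags, and hyperparameter values are our own illustrative assumptions, not anything prescribed by the original paper.

```python
# A minimal doc2vec sketch using gensim (doc2vec is not available in MLlib).
# The toy corpus, tags, and hyperparameters here are illustrative assumptions.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "spark makes large scale data processing simple",
    "word2vec learns distributed representations of words",
    "doc2vec extends word2vec to whole documents",
]

# Each training document gets a tag; the model learns one vector per tag
# in parallel with the individual word vectors.
tagged = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(corpus)]

model = Doc2Vec(vector_size=50, window=2, min_count=1, epochs=100, dm=1)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for an unseen document and find the closest training document.
new_vec = model.infer_vector("distributed representations of documents".split())
print(model.dv.most_similar([new_vec], topn=1))
```

Note the dm flag: dm=1 selects the distributed-memory architecture, the CBOW-like variant that trains word and document vectors jointly, while dm=0 selects the distributed bag-of-words architecture, the skip-gram-like variant.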
