Doc2vec explained

As we mentioned in the chapter's introduction, there is an extension of word2vec that encodes entire documents as opposed to individual words. In this context, a document is whatever you make of it: a sentence, a paragraph, an article, an essay, and so on. Not surprisingly, the paper describing this technique came out after the original word2vec paper and was coauthored by Quoc Le and Tomas Mikolov. Even though MLlib has yet to introduce doc2vec into its stable of algorithms, we feel it is necessary for a data science practitioner to know about this extension of word2vec, given its promising results on supervised learning and information retrieval tasks.

Like word2vec, doc2vec (sometimes referred to as paragraph vectors) relies on a supervised prediction task to learn distributed representations of documents from their contextual words. Doc2vec is also a family of algorithms, whose architectures look extremely similar to the CBOW and skip-gram models of word2vec that you learned about in the previous sections. As you will see next, implementing doc2vec requires jointly training individual word vectors alongside the document vectors that represent whatever we deem a document.
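
Since MLlib does not yet ship doc2vec, the following is a minimal sketch of the idea using the gensim Python library instead; the toy corpus, tags, and hyperparameter values are our own illustrative assumptions, not anything prescribed by the original paper.

```python
# A minimal doc2vec sketch using gensim (doc2vec is not available in MLlib).
# The toy corpus, tags, and hyperparameters here are illustrative assumptions.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "spark makes large scale data processing simple",
    "word2vec learns distributed representations of words",
    "doc2vec extends word2vec to whole documents",
]

# Each training document gets a tag; the model learns one vector per tag
# in parallel with the individual word vectors.
tagged = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(corpus)]

model = Doc2Vec(vector_size=50, window=2, min_count=1, epochs=100, dm=1)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for an unseen document and find the closest training document.
new_vec = model.infer_vector("distributed representations of documents".split())
print(model.dv.most_similar([new_vec], topn=1))
```

Note the dm flag: dm=1 selects the distributed-memory architecture, the CBOW-like variant that trains word and document vectors jointly, while dm=0 selects the distributed bag-of-words architecture, the skip-gram-like variant.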
