Introduction to word embedding

In the preceding section, we studied how to perform NLP using the bag-of-words (BoW) model as the abstraction for the input text data. One of the major advances in NLP is our ability to create a meaningful numeric representation of words in the form of dense vectors. This technique is called word embedding. Yoshua Bengio first introduced the term in his paper A Neural Probabilistic Language Model. Each word in an NLP problem can be thought of as a categorical object; mapping each word to a vector of real numbers is called word embedding. A distinguishing feature of embeddings is that they use dense vectors, in contrast to traditional approaches, which use sparse vector representations.
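The mapping described above can be sketched as a simple lookup table from words to dense vectors. This is a minimal illustration, not a trained model: the vectors here are random, whereas a real embedding is learned (for example, by a neural language model), and the vocabulary and dimension below are made up for the example.

```python
import numpy as np

# Toy illustration: each word in the vocabulary is mapped to a dense,
# low-dimensional vector. In practice these vectors are learned, not random.
rng = np.random.default_rng(seed=0)

vocabulary = ["king", "queen", "man", "woman"]  # example vocabulary
embedding_dim = 4  # real models commonly use 50-300 dimensions

# The embedding is simply a lookup table: word -> dense vector.
embeddings = {word: rng.normal(size=embedding_dim) for word in vocabulary}

vector = embeddings["queen"]
print(vector.shape)  # a dense 4-dimensional vector
```

Note that every word gets a vector of the same fixed size, regardless of vocabulary size; this is what makes the representation dense.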

There are two main problems with using BoW for NLP:

  • Loss of semantic context: When we tokenize the data, its context is lost. A word may have different meanings depending on where it appears in a sentence; this becomes even more important when interpreting complex human expressions, such as humor or satire.
  • Sparse input: When we tokenize, each word in the vocabulary becomes a feature, which results in sparse data structures.
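The sparsity problem can be seen with a small sketch (the documents below are invented for illustration): each document becomes a count vector over the entire vocabulary, so most entries are zero, and the vector length grows with the vocabulary.

```python
# Toy BoW illustration: each document is represented as a vector of word
# counts over the whole vocabulary, so most entries are zero.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "word embeddings give dense vectors",
]

# Build the vocabulary: one feature per unique word across all documents.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def bow_vector(doc):
    """Return the count of each vocabulary word in the document."""
    words = doc.split()
    return [words.count(word) for word in vocabulary]

vectors = [bow_vector(doc) for doc in documents]

# Count how many entries are zero: the representation is sparse.
total = sum(len(v) for v in vectors)
zeros = sum(v.count(0) for v in vectors)
print(f"{zeros}/{total} entries are zero")  # prints "22/36 entries are zero"
```

Even with only three short documents, most entries are zero; with a realistic vocabulary of tens of thousands of words, the fraction of zeros approaches one.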