Stemming and lemmatization

One very common preprocessing step in NLP is stemming: reducing words to their root form. For example, "accounts" and "accounting" would both be stemmed to "account," which at first blush seems very reasonable. However, stemming is prone to the following two kinds of error, which you should be aware of:

1. Over-stemming: This is when stemming fails to keep two words with distinct meanings separate. For example, an aggressive stemmer might give stem("general", "genetic") = "gene".

2. Under-stemming: This is the failure to reduce words with the same meaning to a common root form. For example, stem("jumping", "jumpiness") = "jumpi" but stem("jumped", "jumps") = "jump". In this example, we know that each of the preceding terms is simply a form of the root word "jump"; however, depending on the stemmer you choose to employ (the two most common stemmers are Porter [oldest and most common] and Lancaster), you may fall into this error.
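
To make the two failure modes concrete, here is a minimal sketch of a suffix-stripping stemmer. It is a toy written for illustration only: the rule list, the `toy_stem` and `group_by_stem` names, and the minimum stem length are all assumptions, and it is not the Porter or Lancaster algorithm. Real stemmers use far richer rules, but they can stumble in the same two ways.

```python
# Toy suffix-stripping stemmer (hypothetical; NOT Porter or Lancaster).
SUFFIXES = ("ing", "ed", "s")  # assumed rule list, checked in this order
MIN_STEM_LEN = 3               # assumed guard so short words are not emptied


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if suffix == "s" and word.endswith("ss"):
            continue  # do not strip a plural "s" from words ending in "ss"
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def group_by_stem(words):
    """Group words that share the same toy stem."""
    groups = {}
    for w in words:
        groups.setdefault(toy_stem(w), []).append(w)
    return groups


if __name__ == "__main__":
    # The intended behavior: related forms collapse to one stem.
    print(group_by_stem(["accounts", "accounting"]))
    # Over-stemming: "news" and "new" have distinct meanings but share a stem.
    print(group_by_stem(["news", "new"]))
    # Under-stemming: "jumpiness" is not grouped with the other "jump" forms.
    print(group_by_stem(["jumping", "jumped", "jumps", "jumpiness"]))
```

With these toy rules, "news" collapses onto "new" (over-stemming), while "jumpiness" is left out of the "jump" group (under-stemming).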

Given the possibility of over- and under-stemming words in your corpus, NLP practitioners cooked up the notion of lemmatization to help combat these known issues. Lemmatization maps a set of related words to their canonical (dictionary) form, the lemma, based on the context of the word. For example, lemma("paying", "pays", "paid") = "pay". Like stemming, lemmatization tries to group related words, but it goes one step further by trying to group words by their word sense because, after all, the same word can have entirely different meanings depending on the context! Given the depth and complexity of this chapter already, we will refrain from performing any lemmatization techniques, but interested parties can read further about this topic at http://stanfordnlp.github.io/CoreNLP/.
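
Although this chapter does not apply lemmatization, the core idea can be sketched in a few lines. The snippet below is a toy under stated assumptions: the hand-built `LEMMA_TABLE`, the `toy_lemma` function, and the part-of-speech labels are hypothetical and are not the CoreNLP API. A real lemmatizer relies on a full lexicon and infers the part of speech from the surrounding sentence rather than being handed it.

```python
# Toy dictionary-based lemmatizer (hypothetical; not the CoreNLP API).
# Keys pair a surface form with a part-of-speech tag, standing in for "context".
LEMMA_TABLE = {
    ("paying", "VERB"): "pay",
    ("pays", "VERB"): "pay",
    ("paid", "VERB"): "pay",
    ("saw", "VERB"): "see",   # "I saw the report"
    ("saw", "NOUN"): "saw",   # "a hand saw"
}


def toy_lemma(word: str, pos: str) -> str:
    """Return the dictionary form for (word, pos); fall back to the word itself."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


if __name__ == "__main__":
    for form in ("paying", "pays", "paid"):
        print(form, "->", toy_lemma(form, "VERB"))
    # The same surface form maps to different lemmas depending on context.
    print("saw (VERB) ->", toy_lemma("saw", "VERB"))
    print("saw (NOUN) ->", toy_lemma("saw", "NOUN"))
```

The lookup is keyed on both the word and its tag, which is what lets the same surface form resolve to different lemmas; a stemmer, which sees only the characters of the word, cannot make that distinction.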
