Stemming and lemmatization

One very common step in NLP is stemming, which reduces words to their root form. For example, "accounts" and "accounting" would both be stemmed to "account," which at first blush seems very reasonable. However, stemming is prone to the following two kinds of error, which you should be aware of:

1. Over-stemming: This is when stemming fails to keep two words with distinct meanings separate. For example, stem("general", "genetic") = "gene".

2. Under-stemming: This is the failure to reduce words with the same meaning to the same root form. For example, stem("jumping", "jumpiness") = "jumpi" but stem("jumped", "jumps") = "jump". In this example, we know that each of the preceding terms is simply a variant of the root word "jump"; however, depending on the stemmer you choose to employ (the two most common stemmers are Porter [oldest and most common] and Lancaster), you may fall into this error.
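Both failure modes can be reproduced with a toy stemmer. The following is only a minimal sketch of naive suffix stripping in Python; it is not the Porter or Lancaster algorithm, and the suffix list, the minimum stem length, and the function name naive_stem are assumptions made purely for illustration.

```python
# Hypothetical suffix list for illustration only; real stemmers such as
# Porter or Lancaster apply far more elaborate, multi-step rules.
SUFFIXES = ("ness", "ing", "ed", "s")


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


# The intended behaviour: related forms collapse to one root.
print(naive_stem("accounts"), naive_stem("accounting"))  # account account

# Over-stemming: "news" and "new" have distinct meanings but share a stem.
print(naive_stem("news"), naive_stem("new"))  # new new

# Under-stemming: forms of the same root end up with different stems.
print(naive_stem("jumping"), naive_stem("jumpiness"))  # jump jumpi
print(naive_stem("pays"), naive_stem("paid"))  # pay paid
```

Because the rules look only at the surface form of each word, tightening them to fix one kind of error tends to introduce the other, which is why the choice of stemmer matters.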

Given the possibility of over- and under-stemming words in your corpus, NLP practitioners developed lemmatization to help combat these known issues. Lemmatization takes the canonical (dictionary) form, or lemma, of a set of related words based on the context of the word. For example, lemma("paying", "pays", "paid") = "pay". Like stemming, lemmatization tries to group related words, but it goes one step further by trying to group words by their word sense because, after all, the same word can have entirely different meanings depending on the context! Given the depth and complexity of this chapter already, we will refrain from performing any lemmatization techniques, but interested readers can read further about this topic at http://stanfordnlp.github.io/CoreNLP/.
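To make the role of context concrete, here is a minimal sketch of dictionary-based lemmatization in Python. It is not how CoreNLP or any particular library works; the lookup table, the part-of-speech labels, and the function name lemmatize are hypothetical, and a real lemmatizer would infer the part of speech from the surrounding sentence rather than take it as an argument.

```python
# Hypothetical lookup table keyed by (word, part of speech). A real lemmatizer
# relies on a full dictionary plus a tagger that supplies the part of speech.
LEMMA_TABLE = {
    ("paying", "VERB"): "pay",
    ("pays", "VERB"): "pay",
    ("paid", "VERB"): "pay",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}


def lemmatize(word: str, pos: str) -> str:
    """Return the dictionary form for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


# Irregular forms map to the same lemma, which suffix stripping cannot do.
print([lemmatize(w, "VERB") for w in ("paying", "pays", "paid")])

# The same surface form gets a different lemma depending on its context.
print(lemmatize("saw", "VERB"), lemmatize("saw", "NOUN"))
```

Keying the table on both the word and its part of speech is the design choice that lets "saw" resolve to "see" in one context and stay "saw" in another.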
