Using lemmatization in OpenNLP

OpenNLP also supports lemmatization through the JWNLDictionary class. The class's constructor takes a string containing the path to the dictionary files used to identify roots. We will use a WordNet dictionary developed at Princeton University (wordnet.princeton.edu). The dictionary itself is a series of files stored in a directory; these files contain a list of words and their roots. For the examples in this section, we will use the dictionary found at https://code.google.com/p/xssm/downloads/detail?name=SimilarityUtils.zip&can=2&q=.

The JWNLDictionary class's getLemmas method is passed the word we want to process and a second parameter that specifies the POS for the word. For accurate results, the POS must match the word's actual type.

In the following code sequence, we create an instance of the JWNLDictionary class using a path ending with dict, which is the location of the dictionary. We also define our sample text. The constructor can throw IOException and JWNLException, which we handle in a try...catch block:

try {
    JWNLDictionary dictionary = new JWNLDictionary("...dict");
    String paragraph = "Eat, drink, and be merry, for life is but a dream";
    ...
} catch (IOException | JWNLException ex) {
    // Handle exceptions
}

Following the text initialization, add the following statements. First, we tokenize the string using the WhitespaceTokenizer class, as explained in the Using the WhitespaceTokenizer class section. Then, each token is passed to the getLemmas method with an empty string as the POS type. The original token and its lemmas are then displayed:

String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(paragraph);
for (String token : tokens) {
    String[] lemmas = dictionary.getLemmas(token, "");
    for (String lemma : lemmas) {
        System.out.println("Token: " + token + "  Lemma: " + lemma);
    }
}

The output is as follows:

Token: Eat,  Lemma: at
Token: drink,  Lemma: drink
Token: be  Lemma: be
Token: life  Lemma: life
Token: is  Lemma: is
Token: is  Lemma: i
Token: a  Lemma: a
Token: dream  Lemma: dream
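
Because whitespace tokenization leaves punctuation attached to tokens such as Eat, and drink, in this output, one option is to strip leading and trailing punctuation before a dictionary lookup. The following is a minimal sketch that uses only the Java standard library. The TokenCleaner class and its stripPunctuation and cleanTokens methods are hypothetical helpers introduced for illustration; they are not part of OpenNLP:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of OpenNLP.
class TokenCleaner {

    // Removes leading and trailing punctuation characters from a token.
    // Punctuation inside the token (for example, an apostrophe) is kept.
    static String stripPunctuation(String token) {
        return token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
    }

    // Splits on whitespace, strips punctuation from each token,
    // and drops tokens that become empty.
    static List<String> cleanTokens(String text) {
        List<String> cleaned = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String stripped = stripPunctuation(token);
            if (!stripped.isEmpty()) {
                cleaned.add(stripped);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        String paragraph = "Eat, drink, and be merry, for life is but a dream";
        System.out.println(cleanTokens(paragraph));
    }
}
```

Each cleaned token could then be passed to a lemmatizer in place of the raw whitespace token.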

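The lemmatization process works well for most tokens, with a few exceptions. The is token returns two lemmas, and the second one is not valid. This illustrates the importance of using the proper POS for a token. Note also that because the WhitespaceTokenizer class splits on whitespace only, the tokens Eat, and drink, keep their trailing commas, and the lemma at shown for Eat, looks suspect as well. Tokens for which no lemmas are returned, such as and, do not appear in the output because the inner loop prints one line per lemma. We could have used one or more of the POS tags as the argument to the getLemmas method. However, this raises the question: how do we determine the correct POS? This topic is discussed in detail in Chapter 5, Detecting Parts of Speech.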
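
To illustrate why the POS argument matters, the following sketch uses a tiny in-memory table keyed by POS. The ToyLemmaTable class, its sample entries, and the rule that an empty POS consults every category are all assumptions made for illustration. This is not how JWNLDictionary is implemented, and the entries are not WordNet data:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy table for illustration; not the JWNLDictionary implementation.
class ToyLemmaTable {
    // Outer key: coarse POS ("noun", "verb"); inner map: word form -> lemma.
    private final Map<String, Map<String, String>> table = new LinkedHashMap<>();

    void add(String pos, String word, String lemma) {
        table.computeIfAbsent(pos, k -> new LinkedHashMap<>()).put(word, lemma);
    }

    // With a POS, only that category is consulted. With an empty POS,
    // every category is consulted (an assumption of this sketch), so an
    // ambiguous word can yield several candidate lemmas.
    List<String> getLemmas(String word, String pos) {
        List<String> lemmas = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> entry : table.entrySet()) {
            if (pos.isEmpty() || pos.equals(entry.getKey())) {
                String lemma = entry.getValue().get(word);
                if (lemma != null && !lemmas.contains(lemma)) {
                    lemmas.add(lemma);
                }
            }
        }
        return lemmas;
    }

    // Builds a small sample table with two ambiguous word forms.
    static ToyLemmaTable sample() {
        ToyLemmaTable t = new ToyLemmaTable();
        t.add("noun", "leaves", "leaf");
        t.add("verb", "leaves", "leave");
        t.add("noun", "saw", "saw");
        t.add("verb", "saw", "see");
        return t;
    }

    public static void main(String[] args) {
        ToyLemmaTable t = sample();
        System.out.println(t.getLemmas("leaves", ""));     // candidates from every category
        System.out.println(t.getLemmas("leaves", "verb")); // restricted to one category
    }
}
```

In this toy model, supplying the POS narrows an ambiguous word form to a single lemma, which is the behavior we want from a real dictionary lookup once the POS is known.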

A short list of POS tags is given in the following table. The list is adapted from https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. The complete University of Pennsylvania (Penn) Treebank tagset can be found at http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html:

Tag	Description
JJ	Adjective
NN	Noun, singular or mass
NNS	Noun, plural
NNP	Proper noun, singular
NNPS	Proper noun, plural
POS	Possessive ending
PRP	Personal pronoun
RB	Adverb
RP	Particle
VB	Verb, base form
VBD	Verb, past tense
VBG	Verb, gerund or present participle
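
WordNet organizes its entries into four broad categories (noun, verb, adjective, and adverb), so Penn Treebank tags usually need to be collapsed before a WordNet-based lookup. The following sketch shows one way to do that for the tags in the table. The PennTagMapper class, its method name, and the returned category strings are hypothetical; the exact POS strings that getLemmas expects are not specified here:

```java
// Hypothetical helper for illustration; not part of OpenNLP or JWNL.
class PennTagMapper {

    // Collapses a Penn Treebank tag to a coarse category name.
    // Returns an empty string for tags with no matching category.
    static String toCoarseCategory(String pennTag) {
        if (pennTag == null) return "";
        if (pennTag.startsWith("NN")) return "noun";      // NN, NNS, NNP, NNPS
        if (pennTag.startsWith("VB")) return "verb";      // VB, VBD, VBG, ...
        if (pennTag.startsWith("JJ")) return "adjective"; // JJ, JJR, JJS
        if (pennTag.startsWith("RB")) return "adverb";    // RB, RBR, RBS
        return "";                                        // e.g. POS, PRP, RP
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"JJ", "NNS", "PRP", "RB", "VBG"}) {
            System.out.println(tag + " -> " + toCoarseCategory(tag));
        }
    }
}
```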
