Using lemmatization in OpenNLP

OpenNLP also supports lemmatization using the JWNLDictionary class. The constructor of this class takes a string containing the path to the dictionary files used to identify roots. We will use a WordNet dictionary that has been developed at Princeton University (wordnet.princeton.edu). The actual dictionary is a series of files stored in a directory. These files contain lists of words and their roots. For the examples used in this section, we will use the dictionary found at https://code.google.com/p/xssm/downloads/detail?name=SimilarityUtils.zip&can=2&q=.

The JWNLDictionary class' getLemmas method is passed the word we want to process and a second parameter that specifies the POS for the word. It is important that the POS matches the actual word type if we want accurate results.

In the following code sequence, we create an instance of the JWNLDictionary class using a path ending with dict, which is the location of the dictionary. We also define our sample text. The constructor can throw IOException and JWNLException, which we handle in a try...catch block:

try {
    dictionary = new JWNLDictionary("...dict");
    paragraph = "Eat, drink, and be merry, for life is but a dream";
    ...
} catch (IOException | JWNLException ex) {
    // Handle exceptions
}

Following the text initialization, add the following statements. First, we tokenize the string using the WhitespaceTokenizer class, as explained in the Using the WhitespaceTokenizer class section. Then, each token is passed to the getLemmas method with an empty string as the POS type. The original token and its lemmas are then displayed:

String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(paragraph);
for (String token : tokens) {
    String[] lemmas = dictionary.getLemmas(token, "");
    for (String lemma : lemmas) {
        System.out.println("Token: " + token + " Lemma: " + lemma);
    }
}

The output is as follows:

Token: Eat, Lemma: at
Token: drink, Lemma: drink
Token: be Lemma: be
Token: life Lemma: life
Token: is Lemma: is
Token: is Lemma: i
Token: a Lemma: a
Token: dream Lemma: dream
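
Notice that the tokens Eat, and drink, in this output still carry their trailing commas, because whitespace tokenization splits only on whitespace. As a minimal sketch, assuming plain Java string handling rather than any OpenNLP class, punctuation could be stripped from each token before a dictionary lookup. The names TokenCleaner and cleanTokens are hypothetical and are not part of OpenNLP:

```java
import java.util.ArrayList;
import java.util.List;

final class TokenCleaner {

    // Splits on runs of whitespace, much as a whitespace tokenizer would,
    // then removes leading and trailing punctuation from each token.
    // Hypothetical helper for illustration; not an OpenNLP class.
    static List<String> cleanTokens(String text) {
        List<String> cleaned = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String stripped = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!stripped.isEmpty()) {
                cleaned.add(stripped);
            }
        }
        return cleaned;
    }

    public static void main(String[] args) {
        String paragraph = "Eat, drink, and be merry, for life is but a dream";
        for (String token : cleanTokens(paragraph)) {
            System.out.println("Token: " + token);
        }
    }
}
```

Each cleaned token could then be passed to a lemmatizer in place of the raw token. Punctuation inside a token, such as the apostrophe in a contraction, is left untouched by this sketch.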

The lemmatization process works well, except for the is token, which returns two lemmas. The second one, i, is not valid. This illustrates the importance of using the proper POS for a token. We passed an empty string, but we could have used one or more of the POS tags as the argument to the getLemmas method. However, this raises the question: how do we determine the correct POS? This topic is discussed in detail in Chapter 5, Detecting Parts of Speech.
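
To see why the POS matters, consider a toy lookup in which the same word maps to different lemmas depending on its category. The following is only an illustrative sketch: the class PosAwareLookup, its handful of entries, and the category names are hypothetical, and it does not reproduce how JWNLDictionary or WordNet works internally:

```java
import java.util.HashMap;
import java.util.Map;

final class PosAwareLookup {

    // Hypothetical toy dictionary keyed by "word/category".
    // A real WordNet dictionary holds far more entries.
    private static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("is/verb", "be");
        LEMMAS.put("dreams/noun", "dream");
        LEMMAS.put("dreams/verb", "dream");
        LEMMAS.put("better/adjective", "good");
        LEMMAS.put("better/adverb", "well");
    }

    // Returns the lemma for the word in the given category, or the
    // lowercased word unchanged when the toy dictionary has no entry.
    static String lemma(String word, String category) {
        String key = word.toLowerCase() + "/" + category;
        return LEMMAS.getOrDefault(key, word.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println("is (verb): " + lemma("is", "verb"));
        System.out.println("better (adjective): " + lemma("better", "adjective"));
        System.out.println("better (adverb): " + lemma("better", "adverb"));
    }
}
```

In this sketch, better resolves to good as an adjective and to well as an adverb, so a lookup made without the right category has no way of choosing between them.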

A short list of POS tags is found in the following table. This list is adapted from https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. The complete University of Pennsylvania (Penn) Treebank tagset can be found at http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html:

Tag     Description
JJ      Adjective
NN      Noun, singular or mass
NNS     Noun, plural
NNP     Proper noun, singular
NNPS    Proper noun, plural
POS     Possessive ending
PRP     Personal pronoun
RB      Adverb
RP      Particle
VB      Verb, base form
VBD     Verb, past tense
VBG     Verb, gerund or present participle
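
Penn Treebank tags are finer-grained than the four categories WordNet organizes its entries into (nouns, verbs, adjectives, and adverbs). As a hedged sketch, one common way to bridge the two is to map a tag to a coarse category by its prefix. The class PennTagMapper and the category strings it returns are hypothetical, and whether a given dictionary class expects Penn tags or coarse categories should be checked against its own documentation:

```java
final class PennTagMapper {

    // Maps a Penn Treebank tag to a coarse WordNet-style category by prefix.
    // Tags outside the four categories (for example POS, PRP, RP) map to "other".
    // Hypothetical helper for illustration; not an OpenNLP class.
    static String coarseCategory(String pennTag) {
        if (pennTag.startsWith("JJ")) {
            return "adjective";
        }
        if (pennTag.startsWith("NN")) {
            return "noun";
        }
        if (pennTag.startsWith("RB")) {
            return "adverb";
        }
        if (pennTag.startsWith("VB")) {
            return "verb";
        }
        return "other";
    }

    public static void main(String[] args) {
        String[] tags = {"JJ", "NNS", "NNPS", "RB", "RP", "VBD", "VBG", "PRP"};
        for (String tag : tags) {
            System.out.println(tag + " -> " + coarseCategory(tag));
        }
    }
}
```

Prefix matching keeps the mapping short, since tags such as NNS and NNPS fall under noun, and VBD and VBG fall under verb, without being listed individually.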
