OpenNLP also supports lemmatization through the JWNLDictionary class. Its constructor takes a string containing the path to the dictionary files used to identify roots. We will use the WordNet dictionary developed at Princeton University (wordnet.princeton.edu). The dictionary is a set of files stored in a directory; these files contain words and their roots. For the examples in this section, we use the dictionary found at https://code.google.com/p/xssm/downloads/detail?name=SimilarityUtils.zip&can=2&q=.
The getLemmas method of the JWNLDictionary class takes the word to process and a second parameter that specifies the word's part of speech (POS). For accurate results, the POS must match the word's actual type.
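To see why the POS argument matters, the following minimal sketch replaces the WordNet files with a tiny in-memory table. It does not use OpenNLP: the class PosLemmaSketch, its lemmas method, and the sample entries are hypothetical and only illustrate that the same surface form can map to different roots depending on its part of speech.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a dictionary-backed lemmatizer; this is NOT the OpenNLP API.
final class PosLemmaSketch {
    // Keys have the form "word/POS". The entries are illustrative assumptions,
    // not data read from the WordNet dictionary files.
    private static final Map<String, List<String>> TABLE = new HashMap<>();
    static {
        TABLE.put("saw/NN", Arrays.asList("saw"));
        TABLE.put("saw/VBD", Arrays.asList("see"));
        TABLE.put("meeting/NN", Arrays.asList("meeting"));
        TABLE.put("meeting/VBG", Arrays.asList("meet"));
    }

    // Returns the lemmas recorded for the word under the given POS tag,
    // or an empty list when the (word, POS) pair is unknown.
    static List<String> lemmas(String word, String pos) {
        String key = word.toLowerCase(Locale.ROOT) + "/" + pos;
        return TABLE.getOrDefault(key, Collections.emptyList());
    }

    public static void main(String[] args) {
        // The same token yields different roots under different tags.
        System.out.println("saw as NN:  " + lemmas("saw", "NN"));
        System.out.println("saw as VBD: " + lemmas("saw", "VBD"));
    }
}
```

In this sketch an unknown pair simply returns an empty list, so a caller can tell "no lemma found" apart from a real result.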
In the following code sequence, we create an instance of the JWNLDictionary class using a path ending with dict, which is the location of the dictionary, and define our sample text. The constructor can throw IOException and JWNLException, which we handle in a try...catch block:
try {
    dictionary = new JWNLDictionary("...dict");
    paragraph = "Eat, drink, and be merry, for life is but a dream";
    ...
} catch (IOException | JWNLException ex) {
    // Handle exceptions
}
Following the text initialization, add the following statements. First, we tokenize the string using the WhitespaceTokenizer class, as explained in the Using the WhitespaceTokenizer class section. Then, each token is passed to the getLemmas method with an empty string as the POS type. The original token and its lemmas are then displayed:
String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(paragraph);
for (String token : tokens) {
    String[] lemmas = dictionary.getLemmas(token, "");
    for (String lemma : lemmas) {
        System.out.println("Token: " + token + " Lemma: " + lemma);
    }
}
The output is as follows:
Token: Eat, Lemma: at
Token: drink, Lemma: drink
Token: be Lemma: be
Token: life Lemma: life
Token: is Lemma: is
Token: is Lemma: i
Token: a Lemma: a
Token: dream Lemma: dream
The lemmatization process works well, except for the is token, which returns two lemmas; the second one, i, is not valid. This illustrates the importance of supplying the proper POS for a token. We could have passed one or more of the POS tags as the second argument to the getLemmas method. However, this raises the question: how do we determine the correct POS? This topic is discussed in detail in Chapter 5, Detecting Parts of Speech.
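The output above also shows tokens such as Eat, and drink, keeping their trailing commas, because the text was split on whitespace only. One possible way to tidy tokens before a dictionary lookup is sketched below. The class TokenCleanupSketch and its methods are hypothetical helpers written with the standard library only; they are not part of OpenNLP, and whether lowercasing is appropriate depends on the dictionary being used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; an assumption of this sketch, not an OpenNLP class.
final class TokenCleanupSketch {
    // Removes punctuation at the start and end of a token and lowercases it.
    // Punctuation inside the token (for example, an apostrophe) is left alone.
    static String normalize(String token) {
        return token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "")
                    .toLowerCase(Locale.ROOT);
    }

    // Splits text on whitespace, normalizes each token, and drops tokens
    // that become empty (for example, a stand-alone comma).
    static List<String> normalizeAll(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String cleaned = normalize(token);
            if (!cleaned.isEmpty()) {
                result.add(cleaned);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String paragraph = "Eat, drink, and be merry, for life is but a dream";
        System.out.println(normalizeAll(paragraph));
    }
}
```

Each cleaned token could then be passed to a lemma lookup in place of the raw token.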
A short list of POS tags is found in the following table. This list is adapted from https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. The complete University of Pennsylvania (Penn) Treebank tagset can be found at http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html:
Tag | Description
JJ | Adjective
NN | Noun, singular or mass
NNS | Noun, plural
NNP | Proper noun, singular
NNPS | Proper noun, plural
POS | Possessive ending
PRP | Personal pronoun
RB | Adverb
RP | Particle
VB | Verb, base form
VBD | Verb, past tense
VBG | Verb, gerund or present participle
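WordNet groups its entries into nouns, verbs, adjectives, and adverbs, so the fine-grained tags in the table often need to be reduced to one of these broad classes before a lookup. The sketch below shows one way to do this by tag prefix. The class PennTagSketch, its Category enum, and the categoryOf method are hypothetical; how JWNLDictionary itself interprets its POS argument is not covered here, so treat this mapping as an assumption rather than the library's behavior.

```java
// Hypothetical helper that groups Penn Treebank tags into broad word classes.
final class PennTagSketch {
    enum Category { NOUN, VERB, ADJECTIVE, ADVERB, OTHER }

    // Classifies a tag by its two-letter prefix: NN* are nouns, VB* are verbs,
    // JJ* are adjectives, and RB* are adverbs. Anything else (POS, PRP, RP, ...)
    // falls into OTHER.
    static Category categoryOf(String tag) {
        if (tag == null) {
            return Category.OTHER;
        }
        if (tag.startsWith("NN")) {
            return Category.NOUN;
        }
        if (tag.startsWith("VB")) {
            return Category.VERB;
        }
        if (tag.startsWith("JJ")) {
            return Category.ADJECTIVE;
        }
        if (tag.startsWith("RB")) {
            return Category.ADVERB;
        }
        return Category.OTHER;
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"NNPS", "VBG", "JJ", "RB", "RP"}) {
            System.out.println(tag + " -> " + categoryOf(tag));
        }
    }
}
```

Matching on prefixes keeps the helper short and covers every tag in the table; note that RP (particle) does not start with RB and therefore is not treated as an adverb.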