The Porter Stemmer is a commonly used stemmer for English. Its home page can be found at http://tartarus.org/martin/PorterStemmer/. It stems a word in five steps:
- Handle plurals and the simple present, past, and past participle endings, and convert a final y to i. For example, agreed will be changed to agree, and sleepy will be changed to sleepi
- Change double suffixes to single suffixes. For example, specialization will be changed to specialize
- Change the remaining suffixes in the same way as step 2. For example, specialize will be changed to special
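- Remove the remaining single suffixes when the stem left behind is long enough. For example, revival will be changed to reviv, whereas special is left unchanged because the stem speci is too short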
- Remove a final e or reduce a double l at the end of longer words. For example, attribute will be changed to attribut, and controll (what the earlier steps leave of controlling) will be changed to control
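The suffix rewriting in steps 2 and 3 can be pictured with a small sketch. This is not the PorterStemmer source: the class name, the two rule tables, and the applyRules helper are hypothetical, each table holds a single rule, and the length conditions that the real algorithm checks before applying a rule are left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SuffixRuleSketch {
    // Hypothetical, heavily reduced rule tables: one rule each for steps 2 and 3.
    static final Map<String, String> STEP2 = new LinkedHashMap<>();
    static final Map<String, String> STEP3 = new LinkedHashMap<>();
    static {
        STEP2.put("ization", "ize"); // double suffix -> single suffix
        STEP3.put("alize", "al");    // remaining suffix, handled like step 2
    }

    // Applies the first rule whose suffix matches; returns the word unchanged otherwise.
    // The real algorithm also checks the length of the remaining stem, which is omitted here.
    static String applyRules(String word, Map<String, String> rules) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            String suffix = rule.getKey();
            if (word.endsWith(suffix) && word.length() > suffix.length()) {
                String stem = word.substring(0, word.length() - suffix.length());
                return stem + rule.getValue();
            }
        }
        return word;
    }

    public static void main(String[] args) {
        String word = "specialization";
        String afterStep2 = applyRules(word, STEP2);
        String afterStep3 = applyRules(afterStep2, STEP3);
        // Prints: specialization -> specialize -> special
        System.out.println(word + " -> " + afterStep2 + " -> " + afterStep3);
    }
}
```

Each step works on the output of the previous one, which is why a word such as specialization can lose more than one suffix on its way to a stem.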
Although Apache OpenNLP 1.5.3 does not contain the PorterStemmer class, its source code can be downloaded from https://svn.apache.org/repos/asf/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/stemmer/PorterStemmer.java. It can then be added to your project.
In the following example, we demonstrate the PorterStemmer class against an array of words. The input could easily have originated from some other text source. An instance of the PorterStemmer class is created and then its stem method is applied to each word of the array:
    String[] words = {"bank", "banking", "banks", "banker", "banked", "bankart"};
    PorterStemmer ps = new PorterStemmer();
    for (String word : words) {
        String stem = ps.stem(word);
        System.out.println("Word: " + word + " Stem: " + stem);
    }
When executed, you will get the following output:
    Word: bank Stem: bank
    Word: banking Stem: bank
    Word: banks Stem: bank
    Word: banker Stem: banker
    Word: banked Stem: bank
    Word: bankart Stem: bankart
The last word, bankart, normally appears together with the word lesion, as in Bankart lesion. This is an injury of the shoulder and doesn't have much to do with the previous words. The word is left unchanged, which shows that only common affixes are considered when finding the stem.
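One way to put the stems to use is to group words that share a stem. The following is a minimal sketch under a few assumptions: the StemGroups class and its groupByStem and listedStems helpers are hypothetical, and a lookup table holding the stems listed in the output above stands in for the stemmer so that the example runs without OpenNLP. Assuming stem(String) returns a String as it does in the earlier example, ps::stem could be passed in place of the lookup.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class StemGroups {
    // Groups words under the stem returned by the supplied function,
    // keeping stems in the order in which they are first seen.
    static Map<String, List<String>> groupByStem(String[] words,
                                                 Function<String, String> stemmer) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String word : words) {
            groups.computeIfAbsent(stemmer.apply(word), k -> new ArrayList<>()).add(word);
        }
        return groups;
    }

    // Stand-in for the stemmer: the word/stem pairs from the output shown earlier.
    static Map<String, String> listedStems() {
        Map<String, String> stems = new LinkedHashMap<>();
        stems.put("bank", "bank");
        stems.put("banking", "bank");
        stems.put("banks", "bank");
        stems.put("banker", "banker");
        stems.put("banked", "bank");
        stems.put("bankart", "bankart");
        return stems;
    }

    public static void main(String[] args) {
        String[] words = {"bank", "banking", "banks", "banker", "banked", "bankart"};
        // Prints: {bank=[bank, banking, banks, banked], banker=[banker], bankart=[bankart]}
        System.out.println(groupByStem(words, listedStems()::get));
    }
}
```

Passing the stemmer as a Function keeps the grouping logic independent of any particular stemmer implementation.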
Other potentially useful PorterStemmer class methods can be found in the following table:
| Method | Meaning |
| --- | --- |
| add | Adds a char to the end of the word currently held by the stemmer |
| stem | When called without an argument, stems the word built up with add and returns true if the stem differs from the original word |
| reset | Resets the stemmer so that a different word can be processed |