Using the Porter Stemmer

The Porter Stemmer is a commonly used stemmer for English. Its home page can be found at http://tartarus.org/martin/PorterStemmer/. It stems a word in five steps:

  1. Changes plurals, simple present, past, and past participle forms and converts y to i; for example, agreed will be changed to agree and sleepy will be changed to sleepi
  2. Changes double suffixes to single suffixes; for example, specialization will be changed to specialize
  3. Changes remaining suffixes as in step 2; for example, specialize will be changed to special
  4. Removes remaining single suffixes; for example, revival will be changed to reviv
  5. Removes a final e or reduces a final double l to a single l; for example, attribute will be changed to attribut and controll will be changed to control
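To make the idea of a step more concrete, the following is a minimal sketch of the plural rules alone, which are only one part of step 1. It is not the PorterStemmer source code; the class name Step1aSketch and the method name stripPlural are hypothetical names chosen for this illustration. The sketch assumes lowercase input and leaves out the length checks the real stemmer applies to very short words:

```java
public class Step1aSketch {

    // Applies only the plural rules of step 1. The rules are tried from
    // the longest suffix to the shortest, and the first match wins.
    // Hypothetical helper for illustration, not part of OpenNLP.
    static String stripPlural(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // sses -> ss
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ies -> i
        }
        if (word.endsWith("ss")) {
            return word;                                  // ss is left unchanged
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // trailing s is removed
        }
        return word;                                      // no plural suffix found
    }

    public static void main(String[] args) {
        String[] words = {"caresses", "ponies", "caress", "banks", "bank"};
        for (String word : words) {
            System.out.println("Word: " + word + " Stem: " + stripPlural(word));
        }
    }
}
```

Ordering the rules from longest to shortest suffix matters here: if the single s rule were tried first, caress would lose its final letter. The remaining steps follow the same pattern of matching a suffix and rewriting it, with additional conditions on the part of the word that is left.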

Although Apache OpenNLP 1.5.3 does not contain the PorterStemmer class, its source code can be downloaded from https://svn.apache.org/repos/asf/opennlp/trunk/opennlp-tools/src/main/java/opennlp/tools/stemmer/PorterStemmer.java. It can then be added to your project.

In the following example, we demonstrate the PorterStemmer class against an array of words. The input could easily have originated from some other text source. An instance of the PorterStemmer class is created and then its stem method is applied to each word of the array:

String[] words = {"bank", "banking", "banks", "banker", "banked", "bankart"};
PorterStemmer ps = new PorterStemmer();
for (String word : words) {
    String stem = ps.stem(word);
    System.out.println("Word: " + word + " Stem: " + stem);
}

When executed, you will get the following output:

Word: bank Stem: bank
Word: banking Stem: bank
Word: banks Stem: bank
Word: banker Stem: banker
Word: banked Stem: bank
Word: bankart Stem: bankart

The last word, bankart, is normally used together with the word lesion, as in Bankart lesion. This is a shoulder injury and has little to do with the other words. It does show that the stemmer only strips common affixes when finding the stem.

Other potentially useful PorterStemmer class methods can be found in the following table:

Method   Meaning
------   -------
add      Adds a char to the end of the current stem word
stem     When called without an argument, returns true if the stem differs from the original word
reset    Resets the stemmer so that a different word can be used
