Creating a StopWords class

The process of removing stopwords involves examining a stream of tokens, comparing them to a list of stopwords, and then removing the stopwords from the stream. To illustrate this approach, we will create a simple class that supports basic operations, as defined in the following table:

| Constructor/method | Usage |
|---|---|
| Default constructor | Uses a default set of stopwords |
| Single argument constructor | Uses stopwords stored in a file |
| addStopWord | Adds a new stopword to the internal list |
| removeStopWords | Accepts an array of words and returns a new array with the stopwords removed |

Create a class called StopWords that declares two instance variables, as shown in the following code block. The defaultStopWords variable is an array that holds the default stopword list. The stopWords variable is a HashSet that holds the stopwords used during processing:

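// Requires java.io.BufferedReader, FileReader, and IOException,
// and java.util.ArrayList, Arrays, and HashSet
public class StopWords {

    private String[] defaultStopWords = {"i", "a", "about", "an",
        "are", "as", "at", "be", "by", "com", "for", "from", "how",
        "in", "is", "it", "of", "on", "or", "that", "the", "this",
        "to", "was", "what", "when", "where", "who", "will", "with"};
    // Instance-level set, so each StopWords object keeps its own list
    private HashSet<String> stopWords = new HashSet<>();
    ...
}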

The two constructors of the class follow. Each populates the HashSet:

public StopWords() { 
    stopWords.addAll(Arrays.asList(defaultStopWords)); 
} 
 
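public StopWords(String fileName) {
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader bufferedReader =
            new BufferedReader(new FileReader(fileName))) {
        String line;
        // readLine returns null at end of file; each line holds one stopword
        while ((line = bufferedReader.readLine()) != null) {
            stopWords.add(line);
        }
    } catch (IOException ex) {
        ex.printStackTrace();
    }
}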
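A stopword file is assumed here to hold one word per line. As a minimal sketch of that loading step, the following standalone helper reads such a file and additionally trims whitespace and skips blank lines, which the constructor above does not do. The StopWordFileLoader class and its parse and load methods are hypothetical names introduced only for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the StopWords class.
final class StopWordFileLoader {

    // Builds a stopword set from raw lines, trimming whitespace
    // and skipping lines that are empty after trimming.
    static Set<String> parse(List<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    // Reads the whole file (assumed to be UTF-8, one stopword per line).
    static Set<String> load(Path file) throws IOException {
        return parse(Files.readAllLines(file, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary stopword file for demonstration.
        Path file = Files.createTempFile("stopwords", ".txt");
        Files.write(file, List.of("the", "  of ", "", "and"),
                StandardCharsets.UTF_8);
        System.out.println(load(file).size()); // three distinct stopwords
        Files.delete(file);
    }
}
```

Separating parse from load keeps the cleanup logic testable without touching the filesystem.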

The addStopWord convenience method allows additional words to be added:

public void addStopWord(String word) { 
    stopWords.add(word); 
}

The removeStopWords method removes the stopwords. It creates an ArrayList to hold the original words passed to the method. A for loop then removes the stopwords from this list: the contains method determines whether a given word is a stopword, and if so, the word is removed. Finally, the ArrayList is converted into an array of strings and returned. This is shown as follows:

public String[] removeStopWords(String[] words) {
    ArrayList<String> tokens =
        new ArrayList<String>(Arrays.asList(words));
    // Iterate backward so that removing an element does not
    // cause the element that follows it to be skipped
    for (int i = tokens.size() - 1; i >= 0; i--) {
        if (stopWords.contains(tokens.get(i))) {
            tokens.remove(i);
        }
    }
    return tokens.toArray(new String[tokens.size()]);
}
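Index-based removal needs care, because removing an element shifts every later element down by one position, so a forward loop can step over the word that immediately follows a removed stopword. As an alternative sketch, the same filtering can be expressed with a stream, which avoids index bookkeeping altogether. The StopWordFilter class and its static method signature are assumptions made for this example and are not part of the StopWords class:

```java
import java.util.Arrays;
import java.util.Set;

// Hypothetical standalone helper for illustration.
final class StopWordFilter {

    // Returns a new array containing only the words not found in stopWords.
    static String[] removeStopWords(String[] words, Set<String> stopWords) {
        return Arrays.stream(words)
                .filter(word -> !stopWords.contains(word))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        Set<String> stopWords = Set.of("is", "to", "a");
        String[] words = {"approach", "is", "to", "create", "a", "class"};
        // The consecutive stopwords "is" and "to" are both removed.
        System.out.println(
                Arrays.toString(removeStopWords(words, stopWords)));
        // prints [approach, create, class]
    }
}
```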

The following sequence illustrates how the StopWords class can be used. First, we create an instance of the class using the default constructor. We then obtain an instance of the OpenNLP SimpleTokenizer class and define the sample text, as shown here:

StopWords stopWords = new StopWords(); 
SimpleTokenizer simpleTokenizer = SimpleTokenizer.INSTANCE; 
String paragraph = "A simple approach is to create a class "
    + "to hold and remove stopwords."; 

The sample text is tokenized and then passed to the removeStopWords method.
The new list is then displayed:

String tokens[] = simpleTokenizer.tokenize(paragraph); 
String list[] = stopWords.removeStopWords(tokens); 
for (String word : list) { 
    System.out.println(word); 
}

When executed, we get the following output. A is not removed because it is uppercase and the class does not perform case conversion. The word and is retained because it does not appear in the default stopword list:

A
simple
approach
create
class
hold
and
remove
stopwords
.
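If case should not matter, one option is to normalize both the stopword list and each token before comparing them. The following is a minimal sketch of that idea, assuming that lowercasing with Locale.ROOT is an acceptable normalization for the text at hand. The CaseInsensitiveStopWords class is a hypothetical variant written for this example, not the StopWords class defined earlier:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical variant for illustration only.
final class CaseInsensitiveStopWords {

    private final Set<String> stopWords = new HashSet<>();

    // Stores every stopword in lowercase form.
    CaseInsensitiveStopWords(String... words) {
        for (String word : words) {
            stopWords.add(word.toLowerCase(Locale.ROOT));
        }
    }

    // Compares the lowercase form of each token against the set,
    // but keeps the original casing of the tokens that survive.
    String[] removeStopWords(String[] words) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!stopWords.contains(word.toLowerCase(Locale.ROOT))) {
                kept.add(word);
            }
        }
        return kept.toArray(new String[0]);
    }

    public static void main(String[] args) {
        CaseInsensitiveStopWords filter =
                new CaseInsensitiveStopWords("a", "is", "to");
        String[] words = {"A", "simple", "approach", "is", "to",
            "create", "a", "class"};
        System.out.println(Arrays.toString(filter.removeStopWords(words)));
        // prints [simple, approach, create, class]
    }
}
```

Normalizing only for the comparison leaves the surviving tokens untouched, which matters when later processing steps depend on the original capitalization.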