Using the DocumentPreprocessor class

The DocumentPreprocessor class tokenizes input from an input stream. In addition, it implements the Iterable interface, making it easy to traverse the tokenized sequence. The tokenizer supports the tokenization of simple text and XML data.

To illustrate this process, we will use an instance of the StringReader class that reads from the paragraph string:

Reader reader = new StringReader(paragraph);

A DocumentPreprocessor instance is then created from the reader:

DocumentPreprocessor documentPreprocessor = 
      new DocumentPreprocessor(reader); 

The DocumentPreprocessor class implements the Iterable<java.util.List<HasWord>> interface. The HasWord interface contains two methods that deal with words: setWord and word. The latter returns the word as a string. In the following code sequence, the DocumentPreprocessor class splits the input text into sentences, each stored as a List<HasWord>. An Iterator object extracts each sentence, and a for-each statement then displays its tokens:

Iterator<List<HasWord>> it = documentPreprocessor.iterator(); 
while (it.hasNext()) { 
    List<HasWord> sentence = it.next(); 
    for (HasWord token : sentence) { 
        System.out.println(token); 
    } 
} 

When executed, we get the following output:

Let
's
pause
,
and
then
reflect
.  
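
Because DocumentPreprocessor implements Iterable, the explicit Iterator shown earlier could likely be replaced with a nested for-each loop over the sentences and their tokens. The following is a minimal sketch of that loop shape. So that it runs without the Stanford library on the classpath, it uses SentenceIterableSketch, a hypothetical stand-in class that is not part of the Stanford API. Its sentence splitting and tokenization are deliberately naive regular expressions and will not match the real tokenizer's output (for example, it would not keep 's together as one token):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for DocumentPreprocessor (NOT a Stanford class).
// It only mimics the Iterable<List<...>> shape: one List of tokens per sentence.
public class SentenceIterableSketch implements Iterable<List<String>> {
    // Assumption: a token is a run of word characters or a single punctuation mark.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    private final List<List<String>> sentences = new ArrayList<>();

    public SentenceIterableSketch(String text) {
        // Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        for (String sentence : text.trim().split("(?<=[.!?])\\s+")) {
            List<String> tokens = new ArrayList<>();
            Matcher matcher = TOKEN.matcher(sentence);
            while (matcher.find()) {
                tokens.add(matcher.group());
            }
            if (!tokens.isEmpty()) {
                sentences.add(tokens);
            }
        }
    }

    @Override
    public Iterator<List<String>> iterator() {
        return sentences.iterator();
    }

    public static void main(String[] args) {
        SentenceIterableSketch document =
            new SentenceIterableSketch("We pause. Then we reflect.");
        // Nested for-each: the outer loop yields sentences, the inner loop tokens.
        for (List<String> sentence : document) {
            for (String token : sentence) {
                System.out.println(token);
            }
        }
    }
}
```

With the real library, the same pattern would iterate over documentPreprocessor directly, with List<HasWord> as the sentence type and HasWord as the token type.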