Using the DocumentPreprocessor class

A DocumentPreprocessor instance uses the Reader passed to its constructor to produce a list of sentences, each of which is itself a list of tokens. The class also implements the Iterable interface, so the sentences can be traversed with a for-each loop.

In the following example, the paragraph is used to create a StringReader object, and this object is used to instantiate the DocumentPreprocessor instance:

// Each sentence is a list of tokens (HasWord is in edu.stanford.nlp.ling)
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
for (List<HasWord> sentence : dp) {
    System.out.println(sentence);
}

On execution, we get the following output:

    [When, determining, the, end, of, sentences, we, need, to, consider, several, factors, .]
    [Sentences, may, end, with, exclamation, marks, !]
    [Or, possibly, questions, marks, ?]
    [Within, sentences, we, may, find, numbers, like, 3.14159, ,, abbreviations, such, as, found, in, Mr., Smith, ,, and, possibly, ellipses, either, within, a, sentence, ..., ,, or, at, the, end, of, a, sentence, ...]
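The for-each loop above works because the object being iterated implements Iterable. The following minimal sketch shows that pattern using only the JDK. NaiveSentenceSource is a hypothetical stand-in, not the CoreNLP class, and its splitting rule (a sentence ends at '.', '!' or '?' followed by whitespace) is an assumption made only for illustration:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in, NOT the CoreNLP class: it only mimics the
// "Reader in, Iterable of token lists out" shape.
final class NaiveSentenceSource implements Iterable<List<String>> {
    private final List<List<String>> sentences = new ArrayList<>();

    NaiveSentenceSource(Reader reader) {
        // Read the whole input into memory.
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[1024];
        try {
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        // Naive rule (an assumption for this sketch): a sentence ends at
        // '.', '!' or '?' followed by whitespace.
        for (String s : text.toString().trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                // Tokens are whitespace-separated; punctuation stays attached.
                sentences.add(Arrays.asList(s.split("\\s+")));
            }
        }
    }

    @Override
    public Iterator<List<String>> iterator() {
        return sentences.iterator();
    }

    public static void main(String[] args) {
        Reader reader = new StringReader(
            "Sentences may end with exclamation marks! Or possibly questions marks?");
        for (List<String> sentence : new NaiveSentenceSource(reader)) {
            System.out.println(sentence);
        }
    }
}
```

Unlike the tokenizer used by DocumentPreprocessor, this sketch leaves punctuation attached to the preceding word and would wrongly split after abbreviations such as Mr., which is one reason to rely on the library class for real text.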

By default, PTBTokenizer is used to tokenize the input. The setTokenizerFactory method can be used to specify a different tokenizer. There are several other methods that can be useful, as detailed in the following table:

Method	Purpose
`setElementDelimiter`	Its argument specifies an XML element. Only the text inside those elements will be processed.
`setSentenceDelimiter`	The processor will assume that the string argument is a sentence delimiter.
`setSentenceFinalPuncWords`	Its string array argument specifies the tokens that mark the end of a sentence.
`setKeepEmptySentences`	When used with whitespace models, if its argument is `true`, empty sentences will be retained.

The class can process either plain text or XML documents.

To demonstrate how an XML file can be processed, we will create a simple XML file called XMLText.xml, containing the following data:

<?xml version="1.0" encoding="UTF-8"?> 
<?xml-stylesheet type="text/xsl"?> 
<document> 
    <sentences> 
        <sentence id="1"> 
            <word>When</word> 
            <word>the</word> 
            <word>day</word> 
            <word>is</word> 
            <word>done</word> 
            <word>we</word> 
            <word>can</word> 
            <word>sleep</word> 
            <word>.</word> 
        </sentence> 
        <sentence id="2"> 
            <word>When</word> 
            <word>the</word> 
            <word>morning</word> 
            <word>comes</word> 
            <word>we</word> 
            <word>can</word> 
            <word>wake</word> 
            <word>.</word> 
        </sentence> 
        <sentence id="3"> 
            <word>After</word> 
            <word>that</word> 
            <word>who</word> 
            <word>knows</word> 
            <word>.</word> 
        </sentence> 
    </sentences> 
</document>

We will reuse the code from the previous example. However, we will open the XMLText.xml file instead, and pass DocumentPreprocessor.DocType.XML as the second argument of the DocumentPreprocessor constructor, as shown in the following code. This tells the processor to treat the input as XML. In addition, we will specify that only the text inside <sentence> elements should be processed:

try (Reader reader = new FileReader("XMLText.xml")) {
    DocumentPreprocessor dp = new DocumentPreprocessor(
        reader, DocumentPreprocessor.DocType.XML);
    // Process only the text inside <sentence> elements
    dp.setElementDelimiter("sentence");
    for (List<HasWord> sentence : dp) {
        System.out.println(sentence);
    }
} catch (IOException ex) {
    // Handle exception (FileNotFoundException is a subclass of IOException)
}

The output of this example is as follows:

    [When, the, day, is, done, we, can, sleep, .] 
    [When, the, morning, comes, we, can, wake, .]
    [After, that, who, knows, .]

A cleaner output is possible using ListIterator, as shown here:

for (List<HasWord> sentence : dp) {
    ListIterator<HasWord> list = sentence.listIterator();
    while (list.hasNext()) {
        System.out.print(list.next() + " ");
    }
    System.out.println();
}

Its output is the following:

    When the day is done we can sleep . 
    When the morning comes we can wake . 
    After that who knows .
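The ListIterator loop prints a space after every token, including the last one. As an alternative, the tokens can be joined with a single-space separator. The following sketch uses only the JDK. TokenJoiner and its join method are hypothetical names introduced for illustration, and the method relies on each token's toString() form:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of CoreNLP.
final class TokenJoiner {
    // Joins tokens with single spaces, using each token's toString() form.
    static String join(List<?> tokens) {
        return tokens.stream()
            .map(token -> String.valueOf(token))
            .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("After", "that", "who", "knows", ".");
        System.out.println(join(sentence)); // prints "After that who knows ."
    }
}
```

Inside the for-each loop over dp, a call such as System.out.println(TokenJoiner.join(sentence)) would then replace the inner while loop.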

If we had not specified an element delimiter, each word would have been displayed as a separate sentence, like this:

    [When]
    [the]
    [day]
    [is]
    [done]
    ...
    [who]
    [knows]
    [.]
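To cross-check which words belong to each <sentence> element, the same structure can be read with the JDK's DOM parser. This is a different technique from the one DocumentPreprocessor uses, shown only as a sketch. SentenceElementReader and wordsPerElement are hypothetical names, and the code assumes that every token is wrapped in a <word> element, as in XMLText.xml:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; it does not use CoreNLP.
final class SentenceElementReader {
    // Returns one list of <word> texts per element named containerTag.
    static List<List<String>> wordsPerElement(String xml, String containerTag) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
        List<List<String>> result = new ArrayList<>();
        NodeList containers = doc.getElementsByTagName(containerTag);
        for (int i = 0; i < containers.getLength(); i++) {
            // Collect the <word> descendants of this container element.
            NodeList words = ((Element) containers.item(i)).getElementsByTagName("word");
            List<String> sentence = new ArrayList<>();
            for (int j = 0; j < words.getLength(); j++) {
                sentence.add(words.item(j).getTextContent().trim());
            }
            result.add(sentence);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<document><sentences>"
            + "<sentence id=\"3\"><word>After</word><word>that</word>"
            + "<word>who</word><word>knows</word><word>.</word></sentence>"
            + "</sentences></document>";
        for (List<String> sentence : wordsPerElement(xml, "sentence")) {
            System.out.println(sentence);
        }
    }
}
```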
