Using the PTBTokenizer class

The PTBTokenizer class uses rules to perform SBD and supports a variety of tokenization options. Its constructor takes three parameters:

A Reader class that encapsulates the text to be processed
An object that implements the LexedTokenFactory interface
A string holding the tokenization options

These parameters let us specify the text, the token factory to be used, and any options needed for a specific text stream.

In the following code sequence, an instance of the StringReader class is created to encapsulate the text. The CoreLabelTokenFactory class is used with the options left as null for this example:

PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(new StringReader(paragraph),
    new CoreLabelTokenFactory(), null);

We will use the WordToSentenceProcessor class to create a List of List instances, with one inner list per sentence holding that sentence's tokens. Its process method takes the tokens produced by the PTBTokenizer instance and returns this list of lists, as shown here:

WordToSentenceProcessor<CoreLabel> wtsp = new WordToSentenceProcessor<>();
List<List<CoreLabel>> sents = wtsp.process(ptb.tokenize());

This list of lists can be displayed in several ways. In the following sequence, the toString method of the List class displays each sentence enclosed in brackets, with its tokens separated by commas:

for (List<CoreLabel> sent : sents) { 
    System.out.println(sent); 
}

This sequence produces the following output:

    [When, determining, the, end, of, sentences, we, need, to, consider, several, factors, .]
    [Sentences, may, end, with, exclamation, marks, !]
    [Or, possibly, questions, marks, ?]
    [Within, sentences, we, may, find, numbers, like, 3.14159, ,, abbreviations, such, as, found, in, Mr., Smith, ,, and, possibly, ellipses, either, within, a, sentence, ..., ,, or, at, the, end, of, a, sentence, ...]

An alternative approach, shown here, displays each sentence on a separate line with its tokens separated by spaces:

for (List<CoreLabel> sent : sents) { 
    for (CoreLabel element : sent) { 
        System.out.print(element + " "); 
    }
    System.out.println(); 
}

The output is as follows:

    When determining the end of sentences we need to consider several factors . 
    Sentences may end with exclamation marks ! 
    Or possibly questions marks ? 
    Within sentences we may find numbers like 3.14159 , abbreviations such as found in Mr. Smith , and possibly ellipses either within a sentence ... , or at the end of a sentence ...

If we are only interested in the positions of the words and sentences, we can use the endPosition method, as illustrated here:

for (List<CoreLabel> sent : sents) { 
    for (CoreLabel element : sent) { 
        System.out.print(element.endPosition() + " "); 
    }
    System.out.println(); 
}

When this is executed, we get the following output. Each number is the character offset at which a token ends, so the last number on each line marks the sentence boundary:

    4 16 20 24 27 37 40 45 48 57 65 73 74 
    84 88 92 97 109 115 116 
    119 128 138 144 145 
    152 162 165 169 174 182 187 195 196 210 215 218 224 227 231 237 238 242 251 260 267 274 276 285 287 288 291 294 298 302 305 307 316 317
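
Because endPosition reports the character offset just past a token, the final offset on each line can be used to cut the corresponding sentence out of the original string. The following is a minimal sketch in plain Java that does not use the Stanford classes. The SentenceSlicer class, its slice method, the sample text, and the hand-counted offsets are all assumptions made for illustration:

```java
import java.util.ArrayList;
import java.util.List;

class SentenceSlicer {
    // Cuts text into sentences, given the exclusive end offset of each sentence.
    public static List<String> slice(String text, int[] sentenceEnds) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int end : sentenceEnds) {
            // trim() drops the whitespace that separates adjacent sentences
            sentences.add(text.substring(start, end).trim());
            start = end;
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hypothetical text; the offsets stand in for tokenizer output
        String text = "It works! Does it? Yes.";
        int[] ends = {9, 18, 23};
        for (String sentence : slice(text, ends)) {
            System.out.println(sentence);
        }
    }
}
```

In a real program, the offsets would come from calling endPosition on the last token of each sentence rather than from a hard-coded array.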

The following sequence displays the first element of each sentence along with its starting position:

for (List<CoreLabel> sent : sents) { 
    System.out.println(sent.get(0) + " "  
        + sent.get(0).beginPosition()); 
}

The output is as follows:

    When 0
    Sentences 75
    Or 117
    Within 146

If we are interested in the last element of each sentence, we can use the following sequence. The size of each list is used to locate the terminating token, which is displayed along with its ending position:

for (List<CoreLabel> sent : sents) { 
    int size = sent.size(); 
    System.out.println(sent.get(size-1) + " "  
        + sent.get(size-1).endPosition()); 
}

This will produce the following output:

    . 74
    ! 116
    ? 145
    ... 317

A number of options are available when the constructor of the PTBTokenizer class is invoked. These options are passed as the constructor's third argument. The option string consists of name=value pairs separated by commas, as shown here:

"americanize=true,normalizeFractions=true,asciiQuotes=true"

Several of these options are listed in this table:

Option	Meaning
`invertible`	Used to indicate that the tokens and whitespace must be preserved so that the original string can be reconstructed
`tokenizeNLs`	Indicates that the ends of lines must be treated as tokens
`americanize`	If true, this will rewrite British spellings as American spellings
`normalizeAmpersandEntity`	Will convert the XML &amp; entity to an ampersand
`normalizeFractions`	Converts common fraction characters, such as ½, to the long form (1/2)
`asciiQuotes`	Will convert quote characters to the simpler ' and " characters
`unicodeQuotes`	Will convert quote characters to the Unicode quote characters in the range U+2018 to U+201D
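
When several options are toggled at runtime, it can be less error-prone to assemble the option string from a map than to concatenate it by hand. The following is a minimal sketch using only the Java standard library. The TokenizerOptions class and its toOptionString method are hypothetical helpers and are not part of the Stanford API:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class TokenizerOptions {
    // Joins the entries into the comma-separated name=value form.
    public static String toOptionString(Map<String, Boolean> options) {
        return options.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the options in insertion order
        Map<String, Boolean> options = new LinkedHashMap<>();
        options.put("americanize", true);
        options.put("normalizeFractions", true);
        options.put("asciiQuotes", true);
        // prints americanize=true,normalizeFractions=true,asciiQuotes=true
        System.out.println(toOptionString(options));
    }
}
```

The resulting string can then be passed as the third argument of the PTBTokenizer constructor.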

The following sequence illustrates the use of this option string:

// \u201C and \u201D are the opening and closing smart quotes
paragraph = "The colour of money is green. Common fraction "
    + "characters such as ½ are converted to the long form 1/2. "
    + "Quotes such as \u201Ccat\u201D are converted to their simpler form.";
ptb = new PTBTokenizer( 
    new StringReader(paragraph), new CoreLabelTokenFactory(), 
    "americanize=true,normalizeFractions=true,asciiQuotes=true"); 
wtsp = new WordToSentenceProcessor(); 
sents = wtsp.process(ptb.tokenize()); 
for (List<CoreLabel> sent : sents) { 
    for (CoreLabel element : sent) { 
        System.out.print(element + " "); 
    } 
    System.out.println(); 
}

The output is as follows:

    The color of money is green . 
    Common fraction characters such as 1/2 are converted to the long form 1/2 . 
    Quotes such as " cat " are converted to their simpler form .

The British spelling of the word "colour" was converted to its American equivalent. The fraction ½ was expanded to three characters: 1/2. In the last sentence, the smart quotes were converted to their simpler form.
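
To make the effect of the asciiQuotes option concrete, the following sketch performs a comparable replacement with plain Java string methods. It is not how PTBTokenizer implements the option, and it covers only the four characters U+2018, U+2019, U+201C, and U+201D. The QuoteNormalizer class and its toAsciiQuotes method are hypothetical names:

```java
class QuoteNormalizer {
    // Replaces curly single and double quotes with their ASCII counterparts.
    public static String toAsciiQuotes(String s) {
        return s.replace('\u2018', '\'')
                .replace('\u2019', '\'')
                .replace('\u201C', '"')
                .replace('\u201D', '"');
    }

    public static void main(String[] args) {
        // prints Quotes such as "cat"
        System.out.println(toAsciiQuotes("Quotes such as \u201Ccat\u201D"));
    }
}
```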
