Using the PTBTokenizer class

This tokenizer mimics the Penn Treebank 3 (PTB) tokenizer (http://www.cis.upenn.edu/~treebank/). It differs from PTB in terms of its options and its support for Unicode. The PTBTokenizer class supports several older constructors; however, the three-argument constructor is recommended. This constructor takes a Reader object, a LexedTokenFactory<T> argument, and a string that specifies which options to use.

The LexedTokenFactory interface is implemented by the CoreLabelTokenFactory and WordTokenFactory classes. The former class supports the retention of the beginning and ending character positions of a token, whereas the latter class simply returns a token as a string without any positional information. The WordTokenFactory class is used by default. We will demonstrate the use of both classes.

The CoreLabelTokenFactory class is used in the following example. A StringReader instance is created from the paragraph variable. The last argument specifies the options, which is null for this example. The PTBTokenizer class implements the Iterator interface, allowing us to use the hasNext and next methods to display the tokens:

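import java.io.StringReader;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

// A null options string selects the default behavior
PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
    new StringReader(paragraph), new CoreLabelTokenFactory(), null);
while (ptb.hasNext()) {
    System.out.println(ptb.next());
}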

The output is as follows:

Let
's
pause
,
and
then
reflect
.

The same output can be obtained using the WordTokenFactory class, as shown here:

PTBTokenizer<Word> ptb = new PTBTokenizer<>(
    new StringReader(paragraph), new WordTokenFactory(), null);

The power of the CoreLabelTokenFactory class is realized with the options parameter of the PTBTokenizer constructor. These options control the behavior of the tokenizer. They include how to handle quotes, how to map ellipses, and whether British English spellings are rewritten as American English spellings. A list of options can be found at http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/process/PTBTokenizer.html.
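Several options can be passed at once by separating the key=value pairs with commas. The following is a minimal sketch, using only the JDK, of assembling such a string. The class name PtbOptionsSketch and the helper buildOptions are hypothetical and not part of the Stanford API; the option names invertible and americanize are taken from the PTBTokenizer Javadoc, and the set of supported options may differ between releases:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class PtbOptionsSketch {

    // Hypothetical helper: joins option names and values into the
    // comma-separated "key=value" form used by the options argument.
    static String buildOptions(Map<String, Boolean> options) {
        return options.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the options in insertion order
        Map<String, Boolean> options = new LinkedHashMap<>();
        options.put("invertible", true);
        options.put("americanize", false);
        // Prints: invertible=true,americanize=false
        System.out.println(buildOptions(options));
    }
}
```

The resulting string would then be supplied as the third argument of the PTBTokenizer constructor in place of null.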

In the following code sequence, the PTBTokenizer object is created using the CoreLabelTokenFactory variable, ctf, along with an option of "invertible=true". This option makes the tokenizer retain each token's original text and surrounding whitespace in the CoreLabel object it returns. Together with the beginning and ending positions held by the CoreLabel object, this lets us relate each token back to the source text:

CoreLabelTokenFactory ctf = new CoreLabelTokenFactory();
PTBTokenizer<CoreLabel> ptb = new PTBTokenizer<>(
    new StringReader(paragraph), ctf, "invertible=true");
while (ptb.hasNext()) {
    CoreLabel cl = ptb.next();
    System.out.println(cl.originalText() + " (" +
        cl.beginPosition() + "-" + cl.endPosition() + ")");
}

The output of this sequence is as follows. The numbers within the parentheses indicate the tokens' beginning and ending positions:

Let (0-3)
's (3-5)
pause (6-11)
, (11-12)
and (14-17)
then (18-22)
reflect (23-30)
. (30-31)
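
The positions are character offsets into the source text, with the ending position pointing one character past the token. The following sketch, which uses only the JDK, checks this by slicing a string with the offsets listed above. The class name OffsetCheck and the helper slice are hypothetical, and the value of paragraph is an assumption: it is not defined in this section, and the string used here was chosen because it reproduces the listed offsets, including the space and line break after the comma:

```java
public class OffsetCheck {

    // Hypothetical helper: returns the characters from begin (inclusive)
    // to end (exclusive), the assumed meaning of the printed positions.
    static String slice(String text, int begin, int end) {
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        // Assumed value of paragraph; not defined in this section
        String paragraph = "Let's pause, \nand then reflect.";
        // Offsets copied from the output listing
        int[][] spans = {{0, 3}, {3, 5}, {6, 11}, {11, 12},
                         {14, 17}, {18, 22}, {23, 30}, {30, 31}};
        for (int[] span : spans) {
            System.out.println(slice(paragraph, span[0], span[1])
                + " (" + span[0] + "-" + span[1] + ")");
        }
    }
}
```

Under that assumption, each slice matches the token printed next to its offsets, and the jump from 12 to 14 between the comma and "and" corresponds to the space and line break that the tokenizer skips.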
