Using the BreakIterator class

Another approach to tokenization uses the BreakIterator class. This class locates boundaries, expressed as integer indexes into the text, for different units of text such as characters, words, lines, and sentences. In this section, we will illustrate how it can be used to find words.

The class has a single default constructor, which is protected, so we will use the static getWordInstance method to get an instance of the class. This method is overloaded, with one version accepting a Locale object. The class has one field, DONE, which the navigation methods return when there is no further boundary in the direction of travel. The class possesses several methods to access boundaries, as listed in the following table:

Method      Usage
first       Returns the first boundary of the text
next        Returns the next boundary following the current one
previous    Returns the boundary preceding the current one
setText     Associates a string with the BreakIterator instance
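As a quick illustration of the methods in the table, the following minimal sketch steps forward and then backward over one word. The class name BoundaryNavigation and the helper firstNextPrevious are hypothetical names chosen for this example, and Locale.US is an assumed locale used only to show the overloaded factory method:

```java
import java.text.BreakIterator;
import java.util.Locale;

class BoundaryNavigation {
    // Hypothetical helper: returns the boundaries reported by
    // first(), next(), and previous(), in that order.
    static int[] firstNextPrevious(String text) {
        // Overloaded factory method that takes an explicit Locale
        BreakIterator iterator = BreakIterator.getWordInstance(Locale.US);
        iterator.setText(text);
        int first = iterator.first();       // start of the text
        int next = iterator.next();         // boundary after the first word
        int previous = iterator.previous(); // steps back to the prior boundary
        return new int[] {first, next, previous};
    }

    public static void main(String[] args) {
        int[] b = firstNextPrevious("Let's pause");
        System.out.println(b[0] + " " + b[1] + " " + b[2]); // prints "0 5 0"
    }
}
```

Each call both moves the iterator and returns its new position, which is why previous brings the position back to 0 here.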

 

To demonstrate this class, we declare an instance of the BreakIterator class and a string to use with it:

// Requires: import java.text.BreakIterator;
BreakIterator wordIterator = BreakIterator.getWordInstance();
String text = "Let's pause, and then reflect.";

The text is then assigned to the instance and the first boundary is determined:

wordIterator.setText(text); 
int boundary = wordIterator.first();

The loop that follows will store the beginning and ending boundary indexes for word breaks, using the begin and end variables. The boundary values are integers. Each boundary pair and its associated text are displayed.

When the last boundary is found, the loop terminates:

while (boundary != BreakIterator.DONE) {
    int begin = boundary;
    boundary = wordIterator.next();
    int end = boundary;
    // No segment follows the last boundary, so stop before printing
    if (end == BreakIterator.DONE) break;
    System.out.println(begin + "-" + end + " ["
        + text.substring(begin, end) + "]");
}

The output follows, with brackets used to delineate each segment of text:

0-5 [Let's]
5-6 [ ]
6-11 [pause]
11-12 [,]
12-13 [ ]
13-16 [and]
16-17 [ ]
17-21 [then]
21-22 [ ]
22-29 [reflect]
29-30 [.]  

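This technique does a fairly good job of identifying the basic tokens: the contraction Let's is kept intact and punctuation is split from the adjacent words. Note, however, that whitespace and punctuation marks are also returned as segments of their own.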
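If only the words are wanted, the segments can be filtered as they are produced. The following is a minimal sketch, not the only way to do this. The class name WordTokens and the method tokenize are hypothetical, and it assumes that a segment counts as a word when its first character is a letter or digit:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class WordTokens {
    // Hypothetical helper: collects only the segments that start with
    // a letter or digit, dropping whitespace and punctuation segments.
    static List<String> tokenize(String text) {
        BreakIterator wordIterator = BreakIterator.getWordInstance();
        wordIterator.setText(text);
        List<String> tokens = new ArrayList<>();
        int begin = wordIterator.first();
        int end = wordIterator.next();
        while (end != BreakIterator.DONE) {
            String segment = text.substring(begin, end);
            // Assumption: a word segment begins with a letter or digit
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                tokens.add(segment);
            }
            begin = end;
            end = wordIterator.next();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Let's pause, and then reflect."));
        // prints [Let's, pause, and, then, reflect]
    }
}
```

Testing only the first character keeps the check simple. A stricter filter might examine every character in the segment.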
