Another approach to tokenization uses the BreakIterator class. This class locates boundaries, expressed as integer offsets into the text, for different units of text such as characters, words, sentences, and lines. In this section, we will illustrate how it can be used to find words.
The class has a single default constructor, which is protected, so we use the static getWordInstance method to obtain an instance. This method is overloaded, with one version accepting a Locale object. The class provides several methods for accessing boundaries, as listed in the following table. It also has one field, DONE, which next and previous return when no further boundary exists in that direction:
Method | Usage
first | Returns the first boundary of the text
next | Returns the next boundary following the current one
previous | Returns the boundary preceding the current one
setText | Associates a string with the BreakIterator instance
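As a small illustration of the Locale overload and the previous method, the following sketch walks the word boundaries of a string from the end back to the start. The class name BackwardBoundaries and the helper boundariesBackward are hypothetical names chosen for this example. It also relies on last, a BreakIterator method not listed in the table, which returns the final boundary of the text:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BackwardBoundaries {

    // Hypothetical helper: collects word boundaries from the end of the text to the start.
    static List<Integer> boundariesBackward(String text, Locale locale) {
        // The overloaded factory method takes the locale whose word-break rules apply.
        BreakIterator iterator = BreakIterator.getWordInstance(locale);
        iterator.setText(text);

        List<Integer> boundaries = new ArrayList<>();
        // last() moves to the final boundary; previous() returns DONE
        // once the first boundary has already been visited.
        for (int b = iterator.last(); b != BreakIterator.DONE; b = iterator.previous()) {
            boundaries.add(b);
        }
        return boundaries;
    }

    public static void main(String[] args) {
        // Prints [11, 6, 5, 0]
        System.out.println(boundariesBackward("Let's pause", Locale.US));
    }
}
```

The boundaries are the same ones a forward traversal would visit, only in reverse order, so the choice between next and previous depends on the direction in which the text needs to be processed.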
To demonstrate this class, we declare an instance of the BreakIterator class and a string to use with it:
BreakIterator wordIterator = BreakIterator.getWordInstance();
String text = "Let's pause, and then reflect.";
The text is then assigned to the instance and the first boundary is determined:
wordIterator.setText(text);
int boundary = wordIterator.first();
The loop that follows will store the beginning and ending boundary indexes for word breaks, using the begin and end variables. The boundary values are integers. Each boundary pair and its associated text are displayed.
When the last boundary is found, the loop terminates:
while (boundary != BreakIterator.DONE) {
    int begin = boundary;
    boundary = wordIterator.next();
    int end = boundary;
    // Stop before printing anything once no further boundary exists
    if (end == BreakIterator.DONE) {
        break;
    }
    // Display the boundary pair and the text between the two boundaries
    System.out.println(begin + "-" + end + " [" + text.substring(begin, end) + "]");
}
The output follows, with brackets used to clearly delineate the text of each token:
0-5 [Let's]
5-6 [ ]
6-11 [pause]
11-12 [,]
12-13 [ ]
13-16 [and]
16-17 [ ]
17-21 [then]
21-22 [ ]
22-29 [reflect]
29-30 [.]
This technique does a fairly good job of identifying the basic tokens, although whitespace and punctuation are reported as separate tokens alongside the words.
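When only the words are wanted, the whitespace and punctuation segments seen in the output above can be filtered out. The following is a minimal sketch under one assumption that is not part of the original example: a segment is treated as a word if its first character is a letter or a digit. The class name WordTokens and the method words are hypothetical names used for illustration:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

public class WordTokens {

    // Hypothetical helper: returns only the segments that look like words.
    static List<String> words(String text) {
        BreakIterator iterator = BreakIterator.getWordInstance();
        iterator.setText(text);

        List<String> tokens = new ArrayList<>();
        int begin = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                begin = end, end = iterator.next()) {
            // Assumption: a segment starting with a letter or digit is a word;
            // whitespace and punctuation segments are skipped.
            if (Character.isLetterOrDigit(text.codePointAt(begin))) {
                tokens.add(text.substring(begin, end));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [Let's, pause, and, then, reflect]
        System.out.println(words("Let's pause, and then reflect."));
    }
}
```

Checking only the first character keeps the filter simple. A stricter rule, such as requiring every character in the segment to be a letter, would reject tokens like Let's, so the test to apply depends on what should count as a word in a given application.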