Using the BreakIterator class

The BreakIterator class detects text boundaries, such as those between characters, words, sentences, and lines. A separate static factory method creates the iterator for each kind of boundary:

  • For characters, the getCharacterInstance method is used
  • For words, the getWordInstance method is used
  • For sentences, the getSentenceInstance method is used
  • For lines, the getLineInstance method is used
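As a minimal sketch of how the four factory methods differ, the following example collects the boundary offsets that each iterator reports for the same string. The class name WordBoundaryDemo, the helper boundaries, and the sample text are illustrative assumptions and do not come from the original text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBoundaryDemo {

    // Collects every boundary offset the given iterator reports for the text.
    static List<Integer> boundaries(BreakIterator iterator, String text) {
        iterator.setText(text);
        List<Integer> offsets = new ArrayList<>();
        for (int b = iterator.first(); b != BreakIterator.DONE; b = iterator.next()) {
            offsets.add(b);
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "Hello world"; // hypothetical sample text
        Locale locale = Locale.US;

        // Word boundaries surround the space as well: [0, 5, 6, 11]
        System.out.println("words:      "
            + boundaries(BreakIterator.getWordInstance(locale), text));
        System.out.println("characters: "
            + boundaries(BreakIterator.getCharacterInstance(locale), text));
        System.out.println("sentences:  "
            + boundaries(BreakIterator.getSentenceInstance(locale), text));
        System.out.println("lines:      "
            + boundaries(BreakIterator.getLineInstance(locale), text));
    }
}
```

Note that the word iterator reports a boundary on both sides of the space, so code that extracts words usually needs to skip segments that contain only whitespace or punctuation.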

Detecting breaks between characters matters when a single user-visible character is composed of multiple Unicode characters, such as ü. This character is sometimes formed by combining the \u0075 (u) and \u0308 (combining diaeresis, ¨) Unicode characters. The character iterator treats such a sequence as a single character. This capability is further detailed at https://docs.oracle.com/javase/tutorial/i18n/text/char.html.
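The following sketch illustrates the point by counting user-visible characters with the character iterator. It assumes the decomposed form of ü is written as u followed by the combining diaeresis U+0308. The class and method names are hypothetical.

```java
import java.text.BreakIterator;

class UserCharacterCount {

    // Counts user-visible characters by counting the gaps between boundaries.
    static int countUserCharacters(String text) {
        BreakIterator iterator = BreakIterator.getCharacterInstance();
        iterator.setText(text);
        int count = 0;
        iterator.first();
        while (iterator.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumption: ü is written as 'u' plus the combining diaeresis U+0308.
        String decomposed = "u\u0308ber";

        // length() counts UTF-16 code units, so the combining mark is counted separately.
        System.out.println("code units:      " + decomposed.length());
        // The character iterator keeps the base letter and its mark together.
        System.out.println("user characters: " + countUserCharacters(decomposed));
    }
}
```

Here String.length reports five code units, while the iterator reports four characters, because it does not place a boundary between the base letter and its combining mark.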

The BreakIterator class can be used to detect the end of a sentence. It uses a cursor that references the current boundary, and its next and previous methods move the cursor forward and backward in the text, respectively. BreakIterator has a single protected default constructor, so instances are obtained through static factory methods. To obtain an instance that detects the end of a sentence, use the static getSentenceInstance method, as shown here:

BreakIterator sentenceIterator =
    BreakIterator.getSentenceInstance();

There is also an overloaded version of the method. It takes a Locale instance as an argument:

Locale currentLocale = new Locale("en", "US"); 
BreakIterator sentenceIterator =  
    BreakIterator.getSentenceInstance(currentLocale); 

Once an instance has been created, the setText method will associate the text to be processed with the iterator:

sentenceIterator.setText(paragraph); 

BreakIterator reports the boundaries found in the text through a series of methods and one constant field. Each of these yields an integer value, as detailed in the following table:

Method      Usage
first       Returns the first boundary of the text
next        Returns the boundary following the current boundary
previous    Returns the boundary preceding the current boundary
DONE        Constant with the value -1, returned when there are no more boundaries to be found
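The previous method can be exercised by walking the text from its end. The sketch below does this with the last method, which is part of BreakIterator but is not listed in the table above. The class name, method name, and sample text are assumptions made for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class BackwardBoundaries {

    // Collects sentence boundaries from the end of the text back to the start.
    static List<Integer> boundariesInReverse(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
        iterator.setText(text);
        List<Integer> offsets = new ArrayList<>();
        // last() positions the cursor at the final boundary;
        // previous() returns DONE once the first boundary has been passed.
        for (int b = iterator.last(); b != BreakIterator.DONE; b = iterator.previous()) {
            offsets.add(b);
        }
        return offsets;
    }

    public static void main(String[] args) {
        String text = "Hi there. Bye now."; // hypothetical sample text
        System.out.println(boundariesInReverse(text));
    }
}
```

The same DONE check terminates the loop in both directions, which is why forward and backward traversals can share one loop shape.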

To use the iterator in a sequential fashion, the first boundary is identified using the first method, and then the next method is called repeatedly to find the subsequent boundaries. The process is terminated when DONE is returned. This technique is illustrated in the following code sequence, which uses the previously declared sentenceIterator instance:

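int boundary = sentenceIterator.first();
while (boundary != BreakIterator.DONE) {
    int begin = boundary;
    // Print the start offset; on the final pass nothing follows it
    System.out.print(boundary + "-");
    boundary = sentenceIterator.next();
    int end = boundary;
    if (end == BreakIterator.DONE) {
        break; // no boundary follows the last one
    }
    // Print the end offset and the sentence between the two boundaries
    System.out.println(boundary + " ["
        + paragraph.substring(begin, end) + "]");
}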

On execution, we get the following output:

    0-75 [When determining the end of sentences we need to consider several factors. ]
    75-117 [Sentences may end with exclamation marks! ]
    117-146 [Or possibly questions marks? ]
    146-233 [Within sentences we may find numbers like 3.14159 , abbreviations such as found in Mr. ]
    233-319 [Smith, and possibly ellipses either within a sentence ... , or at the end of a sentence...]
    319-
  

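The iterator handles the simple sentences correctly and is not misled by the decimal point in 3.14159 or by the ellipsis inside the last sentence. However, it wrongly ends a sentence after the abbreviation Mr., separating it from Smith.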
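The loop above can be packaged as a reusable helper that returns the sentences as a list. This is a sketch under stated assumptions: the SentenceSplitter class, its split method, and the sample text are hypothetical and are not part of the original example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitter {

    // Splits text into sentences at the boundaries reported by BreakIterator.
    // Trailing whitespace stays attached to the sentence that precedes it.
    static List<String> split(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            sentences.add(text.substring(start, end));
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Hypothetical sample containing an abbreviation.
        String text = "I met Mr. Smith. He was late!";
        for (String sentence : split(text, Locale.US)) {
            System.out.println("[" + sentence + "]");
        }
    }
}
```

Given the output shown earlier, a break after Mr. is to be expected for this sample as well. Because the helper only slices the text at the reported boundaries, concatenating the returned strings reproduces the input exactly.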

Both regular expressions and the BreakIterator class have limitations. They are useful for text consisting of relatively simple sentences. For more complex text, the NLP APIs discussed in the next section are a better choice.
