Using the SentenceChunker class

An alternative approach is to use the SentenceChunker class to perform SBD. The constructor of this class requires a TokenizerFactory object and a SentenceModel object, as shown here:

TokenizerFactory tokenizerfactory = 
    IndoEuropeanTokenizerFactory.INSTANCE; 
SentenceModel sentenceModel = new IndoEuropeanSentenceModel();

The SentenceChunker instance is created using the tokenizerfactory and
sentenceModel instances:

SentenceChunker sentenceChunker =  
    new SentenceChunker(tokenizerfactory, sentenceModel); 

The SentenceChunker class implements the Chunker interface, which declares a chunk method. This method returns an object that implements the Chunking interface. This object holds a set of "chunks" of text together with the character sequence (CharSequence) they refer to.

The chunk method uses a character array and indexes within the array to specify which portions of the text need to be processed. A Chunking object is returned like this:

Chunking chunking = sentenceChunker.chunk( 
    paragraph.toCharArray(), 0, paragraph.length()); 

We will use the Chunking object for two purposes. First, we will use its chunkSet method to return a set of Chunk objects. Then, we will obtain a string holding all the sentences:

Set<Chunk> sentences = chunking.chunkSet(); 
String slice = chunking.charSequence().toString();

A Chunk object stores the character offsets of a sentence's boundaries. We will use its start and end methods in conjunction with the slice to display the sentences, as shown in the following code. Each element, sentence, holds one sentence's start and end offsets. We use this information to extract and display each sentence from the slice:

for (Chunk sentence : sentences) { 
    System.out.println("[" + slice.substring(sentence.start(), 
        sentence.end()) + "]"); 
}

The following is the output. This approach still has problems with sentences that end with an ellipsis, so a period was added to the end of the last sentence before the text was processed.

    [When determining the end of sentences we need to consider several factors.]
    [Sentences may end with exclamation marks!]
    [Or possibly questions marks?]
    [Within sentences we may find numbers like 3.14159, abbreviations such as found in Mr. Smith, and possibly ellipses either within a sentence ..., or at the end of a sentence....]
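The two mechanical steps around the chunker, padding a trailing ellipsis with a period and slicing the text by start and end offsets, can be tried without LingPipe on the classpath. The following is a minimal sketch, not the library's implementation: the Span class is a hypothetical stand-in for a Chunk, padTrailingEllipsis is an assumed way of applying the period workaround, and the offsets are computed by hand rather than by a sentence model. The spans are sorted by start offset before display, since a Set does not promise any particular iteration order in general.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SentenceOffsetsSketch {

    // Hypothetical stand-in for a chunk: only a start and an end offset.
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Assumed form of the workaround: append a period when the text
    // ends with an ellipsis written as three periods.
    static String padTrailingEllipsis(String text) {
        return text.endsWith("...") ? text + "." : text;
    }

    // Returns each span's text in brackets, ordered by start offset.
    // The end offset is treated as exclusive, as in String.substring.
    static List<String> bracketed(String slice, List<Span> spans) {
        List<Span> ordered = new ArrayList<>(spans);
        ordered.sort(Comparator.comparingInt((Span s) -> s.start));
        List<String> result = new ArrayList<>();
        for (Span span : ordered) {
            result.add("[" + slice.substring(span.start, span.end) + "]");
        }
        return result;
    }

    public static void main(String[] args) {
        String text = padTrailingEllipsis(
            "Sentences may end with exclamation marks! "
            + "Or at the end of a sentence...");

        // Hand-computed boundaries: the first sentence ends just after '!',
        // and the second starts after the following space.
        int firstEnd = text.indexOf('!') + 1;
        List<Span> spans = Arrays.asList(
            new Span(firstEnd + 1, text.length()),
            new Span(0, firstEnd));

        for (String sentence : bracketed(text, spans)) {
            System.out.println(sentence);
        }
    }
}
```

With a real Chunking object, the same substring call applies to the string obtained from its charSequence method, using each Chunk's start and end values in place of the hand-built spans.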
  

Although the IndoEuropeanSentenceModel class works reasonably well for English text, it may not always work well for specialized text. In the next section, we will examine the use of the MedlineSentenceModel class, which has been trained to work with medical text.
