Using regular expressions

Regular expressions can be difficult to understand. Simple expressions are not usually a problem, but readability worsens as they become more complex. This is one of the limitations of using regular expressions for SBD.

We will present two different regular expressions. The first expression is simple, but does not do a very good job. It illustrates a solution that may be too simple for some problem domains. The second is more sophisticated and does a better job.

In this example, we create a regular expression character class that matches periods, question marks, and exclamation marks. The String class's split method is used to split the text into sentences:

String simple = "[.?!]";
String[] splitString = paragraph.split(simple);
for (String string : splitString) {
    System.out.println(string);
}

The output is as follows:

    When determining the end of sentences we need to consider several factors
     Sentences may end with exclamation marks
     Or possibly questions marks
     Within sentences we may find numbers like 3
    14159, abbreviations such as found in Mr
     Smith, and possibly ellipses either within a sentence ..., or at the end of a sentence...
  

As expected, the method splits the paragraph at every terminator character, regardless of whether that character is part of a number or an abbreviation. The terminators themselves are discarded.
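
For readers who want to run the simple approach, the following is a minimal, self-contained sketch. The class name SimpleSplitDemo, the method simpleSplit, and the constant PARAGRAPH are hypothetical names introduced here, and the sample text is an assumption reconstructed from the output shown above. Each piece is printed in brackets so that leading spaces and empty strings are visible:

```java
public class SimpleSplitDemo {
    // Assumed sample text, reconstructed from the output shown above.
    static final String PARAGRAPH = "When determining the end of sentences "
        + "we need to consider several factors. Sentences may end with "
        + "exclamation marks! Or possibly questions marks? Within "
        + "sentences we may find numbers like 3.14159, abbreviations "
        + "such as found in Mr. Smith, and possibly ellipses either "
        + "within a sentence ..., or at the end of a sentence...";

    // Splits at every '.', '?' or '!'; the terminators are discarded.
    static String[] simpleSplit(String text) {
        return text.split("[.?!]");
    }

    public static void main(String[] args) {
        for (String piece : simpleSplit(PARAGRAPH)) {
            // Brackets make leading spaces and empty pieces visible.
            System.out.println("[" + piece + "]");
        }
    }
}
```

If the ellipses in the text are written as three plain periods, as assumed here, split also breaks at each of those periods, so empty strings appear between them. Trailing empty strings are dropped by split, which is why nothing is printed for the final ellipsis.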

A second approach follows, which produces better results. This example has been adapted from an example found at http://stackoverflow.com/questions/5553410/regular-expression-match-a-sentence. The Pattern class, which compiles the following regular expression, is used:

    [^.!?\s][^.!?]*(?:[.!?](?!['"]?\s|$)[^.!?]*)*[.!?]?['"]?(?=\s|$)
  

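The comments embedded in the following code sequence explain what each part of the expression represents. Because the Pattern.COMMENTS flag is set, whitespace and text from # to the end of a line are ignored when the pattern is compiled: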

Pattern sentencePattern = Pattern.compile(
    "# Match a sentence ending in punctuation or EOS.\n"
    + "[^.!?\\s]    # First char is non-punct, non-ws\n"
    + "[^.!?]*      # Greedily consume up to punctuation.\n"
    + "(?:          # Group for unrolling the loop.\n"
    + "  [.!?]      # (special) inner punctuation ok if\n"
    + "  (?!['\"]?\\s|$)  # not followed by ws or EOS.\n"
    + "  [^.!?]*    # Greedily consume up to punctuation.\n"
    + ")*           # Zero or more (special normal*)\n"
    + "[.!?]?       # Optional ending punctuation.\n"
    + "['\"]?       # Optional closing quote.\n"
    + "(?=\\s|$)",
    Pattern.MULTILINE | Pattern.COMMENTS);

Another representation of this expression can be generated using the display tool found at http://regexper.com/. As shown in the following diagram, it graphically depicts the expression and can clarify how it works:

The matcher method is executed against the sample paragraph and then the results are displayed:

Matcher matcher = sentencePattern.matcher(paragraph); 
while (matcher.find()) { 
    System.out.println(matcher.group()); 
} 

The output follows. The sentence terminators are retained and the number 3.14159 is kept intact, but abbreviations remain a problem, as the break after Mr. shows:

    When determining the end of sentences we need to consider several factors.
    Sentences may end with exclamation marks!
    Or possibly questions marks?
    Within sentences we may find numbers like 3.14159, abbreviations such as found in Mr.
    Smith, and possibly ellipses either within a sentence ..., or at the end of a sentence...
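
The Pattern-based approach can also be tried as a single runnable unit. The following is a minimal sketch, not a definitive implementation: the class name PatternSplitDemo, the method sentences, and the constants are hypothetical names introduced here, and the sample text is an assumption reconstructed from the output shown above. The expression is the same one, written on one line without the embedded comments, so the Pattern.COMMENTS flag is not needed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternSplitDemo {
    // Assumed sample text, reconstructed from the output shown above.
    static final String PARAGRAPH = "When determining the end of sentences "
        + "we need to consider several factors. Sentences may end with "
        + "exclamation marks! Or possibly questions marks? Within "
        + "sentences we may find numbers like 3.14159, abbreviations "
        + "such as found in Mr. Smith, and possibly ellipses either "
        + "within a sentence ..., or at the end of a sentence...";

    // Same expression as above; backslashes and quotes are escaped for Java.
    static final Pattern SENTENCE = Pattern.compile(
        "[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)",
        Pattern.MULTILINE);

    // Collects every match of the sentence pattern, terminators included.
    static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String sentence : sentences(PARAGRAPH)) {
            System.out.println(sentence);
        }
    }
}
```

Returning the matches as a list, rather than printing them inside the loop, makes it easier to check how many sentences were found and where the breaks fall.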
  