Sentence boundary detection

Sentence boundary detection is an important step in NLP and an essential problem to be solved before analyzing the text for further use in information extraction, word tokenization, part of speech tagging, and so on. A sentence is a basic unit of text. Though SBD has been solved to a good extent, extracting sentences from a text document is not a simple process. Sentence boundary detection is language dependent, since the sentence termination character in each language may be different. It can be done with a machine learning approach, by training a model, or with a rule-based approach. If we consider the English language, then a simple set of rules that gives fairly accurate results is (a short R sketch follows the list):

  • Text is terminated by a period ( . )
  • Text is terminated by an exclamation mark ( ! )
  • Text is terminated by a question mark ( ? )
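These rules can be sketched in a few lines of R; the helper name naive_sbd is ours, purely for illustration:

naive_sbd <- function(text) {
  # Split after a ., !, or ? that is followed by whitespace
  unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
}
naive_sbd("NLP is a vast topic. Lots of research has been done in this field.")

On simple text, such as the example that follows, this works well.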

Consider the following example:

NLP is a vast topic. Lots of research has been done in this field.

When we apply the preceding set of rules, we can extract all the sentences easily. We can see there are two sentences. Sometimes, due to the way these characters are used in English, sentence boundary detection becomes a tricky task. Consider the following example:

Mr. James is an expert in NLP. James lives in the U.S.A.

When we apply the preceding set of rules, we get five sentences, since there are five periods. The human brain understands that Mr. is a prefix, so its period doesn't mean the end of a sentence, and that U.S.A. is an abbreviation, whose periods likewise don't mark the end of a sentence, because it has contextual information. Some of the common challenges while detecting sentences are abbreviations, use of punctuation, quotes within a sentence, and special characters in text such as tweets. But how can we make a machine understand this? We will need a complex set of rules to handle abbreviations, prefixes, and quotes, so, based on this, we can add two more rules:

  • A period does not end a sentence if it is preceded by a known abbreviation.
  • A period does not end a sentence if it is followed by a number (as in a decimal).

But how do you detect abbreviations? Are all-caps words always abbreviations? We can use a list of domain-specific abbreviations; these rules keep growing as the complexity and the context of the text change. Subtle differences like these in the use of punctuation can make SBD tricky.
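As a sketch of how such a rule could be implemented, the following R function protects the periods of a small, purely illustrative abbreviation list before splitting; the helper names are ours:

protect_and_split <- function(text, abbrevs = c("Mr", "Mrs", "Dr", "Prof")) {
  # Temporarily mask the period after each known abbreviation
  for (a in abbrevs) {
    text <- gsub(paste0("\\b", a, "\\."), paste0(a, "<PRD>"), text)
  }
  # Split on terminators followed by whitespace, then restore the masked periods
  sentences <- unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE))
  gsub("<PRD>", ".", sentences, fixed = TRUE)
}
protect_and_split("Mr. James is an expert in NLP. James lives in the U.S.A.")

This now returns two sentences; note that U.S.A. only survives here because its internal periods are not followed by whitespace, and a production abbreviation list would have to be far larger.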

The following examples show abbreviations causing ambiguity in SBD: in the first sentence, there are no periods between the letters of the abbreviation NASA, while in the second sentence, NASA has periods in between:

"Mark is an astronaut in NASA."

"Mark is an astronaut in N.A.S.A."

Quotes inside a sentence, or a sentence within a sentence:

Alien said, "Welcome to MARS"

An exclamation mark can be part of a word, not the end of a sentence:

"Yahoo!"

We can use double exclamation marks but that does not mean there are two sentences:

"Congratulations!!"

In the following sentence, an ellipsis is used as a termination character:

"The story continued …"

Here it's not the end of the sentence:

"I wasn't really ...well"

Character encoding can also complicate the problem, since some punctuation marks are treated as one character in certain encodings and as multiple characters in others.
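A quick way to see this in R: the Unicode ellipsis is a single character, while three ASCII dots are three:

nchar("\u2026")  # 1; the Unicode horizontal ellipsis
nchar("...")     # 3; three ASCII periods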

Let's do a simple SBD in R using the OpenNLP package.

Load the required libraries; we assume these libraries are already installed:

library(rJava)
library(NLP)
library(openNLP)

We will consider a simple text and extract the sentences out of it:

simpleText <- "Text Mining is a interesting field. Mr. Paul Is a good programmer. He loves watching football. He lives in U.S.A."

Let's convert the character vector into a String object:

simpleText_str <- as.String(simpleText)

Let's use the Maxent_Sent_Token_Annotator() method; this generates an annotator which computes sentence annotations using the Apache OpenNLP Maxent sentence detector. The method is just an interface to the actual API exposed by Apache OpenNLP. It can also take the following parameters as input:

  • language
  • probs
  • model

If you don't provide the parameters, then the defaults are:

  • language = "en"
  • probs = FALSE
  • model = NULL

It will use the en-sent.bin when we initialize this with defaults:

sent_token_annotator <- Maxent_Sent_Token_Annotator()
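If you have downloaded a different pre-trained model, you can override the defaults; the following is a sketch in which the Dutch model file nl-sent.bin is assumed to be present in the working directory:

sent_token_annotator_nl <- Maxent_Sent_Token_Annotator(language = "nl", model = "nl-sent.bin")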

We use the annotate function from the NLP R package; it computes annotations by iterating over the given annotators and applying them to the input text. The output is the newly computed annotations merged with the current ones:

annotated_sentence <- annotate(simpleText_str,sent_token_annotator)

Let's inspect the values inside annotated_sentence:

annotated_sentence

The output is as follows:

[Output snapshot: four sentence annotations with their start and end positions]

From the preceding output snapshot, we can see there are four sentences; it also provides information on the start and end of each sentence.

Let's use the Maxent sentence annotator with the parameter probs = TRUE and see the difference in the output:

annotated_sentence_prob <- annotate(simpleText_str, Maxent_Sent_Token_Annotator(probs = TRUE))
[Output snapshot: sentence annotations with a features column showing probabilities]

The output has a features column which shows the confidence of each detected sentence. In order to get all the sentences from the string, we can use the following code:

simpleText_str[annotated_sentence]
[Output snapshot: the four extracted sentences]

We just saw how to invoke the Maxent sentence boundary detector from the R package; now let's understand what is happening under the hood, what Java code actually gets called, and what it actually does.

Apache OpenNLP can detect whether a punctuation mark terminates a sentence. The model used in OpenNLP is trained on data where white space follows sentence-terminating punctuation; that is, a sentence is defined as the longest white-space-trimmed character sequence between two punctuation marks. The first non-whitespace character is assumed to be the beginning of a sentence, and the last non-whitespace character is assumed to be a sentence end; this is a reasonable standard for English orthography. English sentence detectors are trained to differentiate between sentence-terminating punctuation and punctuation used within a sentence, in abbreviations and so on.

Let's look at a sample of Java code used for sentence detection with Apache OpenNLP.

We will assume all the required libraries are on your Java classpath. The import statements pull in the relevant classes and their methods:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.InvalidFormatException;


public class SentenceDetector {

public static void DetectSentence() throws InvalidFormatException,
    IOException {
    
String text = "Text Mining is a interesting field. Mr. Paul Is a good programmer. He loves watching football. He lives in U.S.A.";
 

Now let's load a model:

  1. This model is created by using training data; we will read the file en-sent.bin which is the OpenNLP pre-trained model for the English language:
    InputStream is = new FileInputStream("en-sent.bin");
  2. We need to create a new object of the SentenceModel class using the bin file loaded into the input stream in the previous line:
    SentenceModel model = new SentenceModel(is);
  3. We will instantiate a SentenceDetectorME instance; it is a sentence detector class for splitting up raw text into sentences. It uses a maximum entropy model to evaluate end-of-sentence characters in a string to determine if they are the actual end of a sentence:
    SentenceDetectorME sdetector = new SentenceDetectorME(model);
  4. Detect sentences in a string. The input for this is the actual text:
    String sentences[] = sdetector.sentDetect(text);
  5. Print all the detected sentences to the console:
    for (String sentence : sentences) {
        System.out.println(sentence);
    }
  6. Clean up the resource:
    is.close();
    }
    
    public static void main(String[] args) throws InvalidFormatException, IOException {
  7. Invoke the method to perform sentence detection:
         DetectSentence();
      }
    }
  8. The output after executing the preceding program is:
    [Output snapshot: the four detected sentences, printed one per line]

Some of the useful methods exposed by SentenceDetectorME class are:

  • getSentenceProbabilities(): Returns the probabilities associated with the most recent calls to sentDetect()
  • sentDetect(String s): Detects sentences in a string
  • sentPosDetect(String s): Detects the position of the first word of each sentence in a string
  • train(String, ObjectStream, SentenceDetectorFactory, TrainingParameters): Trains a new sentence detection model on annotated text

So when we invoke the R code for SBD, it invokes the preceding Java code under the hood, and finally passes the output to the Simple_Sent_Token_Annotator() method of the NLP package, which creates annotator objects.

We can obtain various pre-trained sentence boundary detection models for different languages from http://opennlp.sourceforge.net/models-1.5/ (a download sketch follows the list):

  • en-sent.bin
  • nl-sent.bin
  • se-sent.bin
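For example, one of these models can be fetched directly from R; the file name here is taken from the listing above:

download.file("http://opennlp.sourceforge.net/models-1.5/nl-sent.bin",
              destfile = "nl-sent.bin", mode = "wb")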

We can train or create our own sentence boundary detector model by using the train() API exposed by the SentenceDetectorME class. In order to achieve this, the training data must meet certain prerequisites: it must be in the OpenNLP Sentence Detector training format, or it has to be converted into that format, which is the following (a tiny sample follows the list):

  • One sentence per line
  • Empty line indicates a document boundary
  • Recommended to have an empty line every 10 sentences if the document boundary is unknown
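A tiny illustrative training file would therefore look like this, with the blank line marking a document boundary (the sentences themselves are made up):

NLP is a vast topic.
Lots of research has been done in this field.

Text Mining is an interesting field.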

Word token annotator

In order to tokenize the words in a document, we can use Maxent_Word_Token_Annotator() from the openNLP package; this method invokes the Apache OpenNLP Maxent tokenizer, which segments the input text into tokens. Tokens are nothing but words, numbers, and punctuation. Apache OpenNLP has three different types of tokenizer (a short base R contrast follows the list):

  • Whitespace tokenizer
  • Simple tokenizer
  • Maximum entropy tokenizer
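To see how the simplest of these behaves, a whitespace split in base R mimics the whitespace tokenizer; note that punctuation stays glued to the words, which the maximum entropy tokenizer would separate:

unlist(strsplit("He lives in U.S.A.", "\\s+"))
# [1] "He"     "lives"  "in"     "U.S.A."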

Word tokenization is an important step in language processing; the tokenized output may be used in parsers, POS taggers, and entity extractors. When we use the word tokenizer in OpenNLP, we first have to identify the sentence boundaries by using Maxent_Sent_Token_Annotator(), and then the sentences are further tokenized into words.

Let's see how to execute Maxent_Word_Token_Annotator() in R:

  1. Load the required libraries; we assume these libraries are already installed:
    library(rJava)
    library(NLP)
    library(openNLP)
  2. We will consider a simple text and extract the sentences out of it:
    simpleText <- "Text Mining is a interesting field. Mr. Paul Is a good programmer. He loves watching football. He lives in U.S.A."
  3. Let's convert the character vector into a String object and run the sentence annotation method; this is mandatory because Apache OpenNLP first detects the sentences and then tokenizes the words in each sentence:
    simpleText_str <- as.String(simpleText)
    sent_token_annotator <- Maxent_Sent_Token_Annotator(probs=TRUE)
    annotated_sentence <- annotate(simpleText_str,sent_token_annotator)

Let's use the Maxent_Word_Token_Annotator() method; this generates an annotator which computes word token annotations using the Apache OpenNLP Maxent word tokenizer. The method is just an interface to the actual API exposed by Apache OpenNLP. It takes the following parameters:

  • language
  • probs
  • model

If you don't provide the parameters, then the defaults are:

  • language = "en"
  • probs = FALSE
  • model = NULL

It will use en-token.bin when we initialize this with defaults:

word_token_annotator <- Maxent_Word_Token_Annotator(probs=TRUE)

We use the annotate function from the NLP R package; it computes annotations by iterating over the given annotators and applying them to the input text. The output is the newly computed annotations merged with the current ones:

annotated_word <- annotate(simpleText_str, word_token_annotator, annotated_sentence)

Let's inspect the values inside annotated_word:

[Output snapshot: sentence and word annotations with start, end, and probability information]

The output provides a lot of information, such as the number of sentences, the start and end of each sentence, the probability of each sentence, the number of words in the text, the start and end of each word, and the probability of each detected word.
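If you want those probabilities as a plain vector, you can pull them out of the features; this sketch assumes each feature list stores its value under the name prob, as the probs column in the printed output suggests:

word_probs <- sapply(annotated_word$features, function(f) f$prob)  # prob name assumed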

Let's inspect the output:

simpleText_str[annotated_word]
[Output snapshot: the four sentences followed by the individual word tokens]

First it lists all the sentences and then the words.
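If you only want the word tokens and not the sentence spans, you can subset the annotation object by type, as in the NLP package examples:

annotated_word_only <- subset(annotated_word, type == "word")
simpleText_str[annotated_word_only]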

Let us look at the Java APIs exposed by Apache OpenNLP that perform the preceding process, through a simple Java program:

  1. Load the required libraries:
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import opennlp.tools.tokenize.Tokenizer;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.InvalidFormatException;
    public class WordTokenizer {
      public static void main(String[] args) throws InvalidFormatException, IOException {
        
        Tokenize();
      }
      public static void Tokenize() throws InvalidFormatException, IOException {
    // Input string:
        String text = "Text Mining is a interesting field. Mr. Paul Is a good programmer. He loves watching football. He lives in U.S.A.";    
  2. We will load the pre-trained model provided by Apache OpenNLP:
        InputStream is = new FileInputStream("C:\Users\avia.ORADEV\Documents\R\win-library\3.1\openNLPdata\models\en-token.bin"); 
  3. We need to initialize the TokenizerModel class using the input stream:
        TokenizerModel model = new TokenizerModel(is);   
  4. We create a tokenizer. We instantiate a maximum entropy tokenizer. This tokenizer converts raw text into separated tokens. It uses Maximum Entropy to make its decisions:
        Tokenizer tokenizer = new TokenizerME(model);
  5. Invoke tokenize method on the input text:
        String tokens[] = tokenizer.tokenize(text);
       
        for (String token : tokens){
          System.out.println(token);
        }
         
        is.close();
      }
    
    }
  6. The output of the program is as follows. Now you know which APIs of OpenNLP are being called under the hood to do word tokenization:
    [Output snapshot: the individual word tokens, printed one per line]