Using OpenNLP for NER

We will demonstrate the use of the TokenNameFinderModel class to perform NER using the OpenNLP API. Additionally, we will demonstrate how to determine the probability that the entity identified is correct.

The general approach is to convert the text into a series of tokenized sentences, create an instance of the TokenNameFinderModel class using an appropriate model, wrap that model in a NameFinderME object, and then use the NameFinderME find method to identify the entities in the text.

The following example demonstrates the use of the TokenNameFinderModel class. We will use a simple sentence initially, and then use multiple sentences. The sentence is defined here:

String sentence = "He was the last person to see Fred."; 

We will use the models found in the en-token.bin and en-ner-person.bin files for the tokenizer and name finder models, respectively. An InputStream object for each of these files is opened using a try-with-resources block, as shown here:

try (InputStream tokenStream = new FileInputStream( 
        new File(getModelDir(), "en-token.bin")); 
        InputStream modelStream = new FileInputStream( 
            new File(getModelDir(), "en-ner-person.bin"));) { 
    ... 
 
} catch (Exception ex) { 
    // Handle exceptions 
} 
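As a side note on the try-with-resources block above, the following is a minimal sketch of how two resources declared in one try header are closed automatically, in the reverse order of their declaration. It does not use OpenNLP: the ResourceOrder and TrackedStream names are hypothetical, and the empty in-memory streams are assumed stand-ins for the two model files.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

public class ResourceOrder {
    // Hypothetical stand-in for a model file stream; it records its name when closed.
    static class TrackedStream extends ByteArrayInputStream {
        private final String name;
        private final List<String> log;

        TrackedStream(String name, List<String> log) {
            super(new byte[0]);
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() throws IOException {
            log.add(name);
            super.close();
        }
    }

    // Opens two streams in one try header and returns the order in which they were closed.
    public static List<String> closeOrder() throws IOException {
        List<String> log = new ArrayList<>();
        try (InputStream tokenStream = new TrackedStream("token", log);
             InputStream modelStream = new TrackedStream("model", log)) {
            // The models would be loaded from the streams here.
            tokenStream.read();
            modelStream.read();
        }
        return log;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(closeOrder()); // prints [model, token]
    }
}
```

Both streams are closed even if the body throws an exception, which is why no explicit close calls appear in the example in the text.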

Within the try block, the TokenizerModel and Tokenizer objects are created:

    TokenizerModel tokenModel = new TokenizerModel(tokenStream); 
    Tokenizer tokenizer = new TokenizerME(tokenModel); 

Next, an instance of the NameFinderME class is created using the person model:

TokenNameFinderModel entityModel =  
    new TokenNameFinderModel(modelStream); 
NameFinderME nameFinder = new NameFinderME(entityModel); 

We can now use the tokenize method to tokenize the text and the find method to identify the person in the text. The find method will use the tokenized String array as input and return an array of Span objects, as shown here:

String tokens[] = tokenizer.tokenize(sentence); 
Span nameSpans[] = nameFinder.find(tokens);

We discussed the Span class in Chapter 3, Finding Sentences. As you may remember, this class holds positional information about the entities found. The actual string entities are still in the tokens array.

The following for statement displays the person found in the sentence. Its positional information and the person are displayed on separate lines:

for (int i = 0; i < nameSpans.length; i++) { 
    System.out.println("Span: " + nameSpans[i].toString()); 
    System.out.println("Entity: " 
        + tokens[nameSpans[i].getStart()]); 
} 

The output is as follows:

    Span: [7..9) person
    Entity: Fred
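Note that the preceding loop displays only the token at the start of each span. A span is a half-open range of token indexes: the start index is included and the end index is excluded, so a name made up of several tokens covers more than one entry in the tokens array. The following is a minimal sketch of how the full text of a span might be rebuilt. The SpanText class, the textOf method, and the sample tokens are hypothetical; the start and end arguments are assumed to come from the getStart and getEnd methods of a Span object.

```java
import java.util.Arrays;

public class SpanText {
    // Joins tokens[start..end) with single spaces; the end index is exclusive.
    public static String textOf(String[] tokens, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(tokens, start, end));
    }

    public static void main(String[] args) {
        // Illustrative tokens only, not the output of an OpenNLP tokenizer.
        String[] tokens = {"He", "was", "the", "last", "person",
            "to", "see", "Fred", "Smith", "."};
        // A span covering token indexes 7 and 8.
        System.out.println(textOf(tokens, 7, 9)); // prints Fred Smith
    }
}
```

OpenNLP's Span class also offers a static spansToStrings method that performs this conversion for a whole array of spans, which may be preferable in real code.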

We will often work with multiple sentences. To demonstrate this, we will use the previously defined sentences string array. The previous for statement is replaced with the following sequence. The tokenize method is invoked against each sentence and then the entity information is displayed, as it was earlier:

for (String sentence : sentences) { 
    String tokens[] = tokenizer.tokenize(sentence); 
    Span nameSpans[] = nameFinder.find(tokens); 
    for (int i = 0; i < nameSpans.length; i++) { 
        System.out.println("Span: " + nameSpans[i].toString()); 
        System.out.println("Entity: "  
            + tokens[nameSpans[i].getStart()]); 
    } 
    System.out.println(); 
} 

The output is as follows. There is an extra blank line between the two groups of results because the second sentence did not contain a person:

    Span: [0..1) person
    Entity: Joe
    Span: [7..9) person
    Entity: Fred
    
    
    Span: [0..1) person
    Entity: Joe
    Span: [19..20) person
    Entity: Sally
    Span: [26..27) person
    Entity: Fred
