Using OpenNLP chunking

The process of chunking involves breaking a sentence into parts or chunks. These chunks can then be annotated with tags. We will use the ChunkerME class to illustrate how this is accomplished. This class uses a model loaded into a ChunkerModel instance. The ChunkerME class's chunk method performs the actual chunking process. We will also examine the use of the chunkAsSpans method to return information about the span of these chunks. This allows us to see how long a chunk is and what elements make up the chunk.

We will use the en-pos-maxent.bin file to create a model for the POSTaggerME instance. We need to use this instance to tag the text as we did in the Using OpenNLP POSTaggerME class for POS taggers section earlier in this chapter. We will also use the en-chunker.bin file to create a ChunkerModel instance to be used with the ChunkerME instance.

These models are created using input streams, as shown in the following example. We use a try-with-resources block to open and close files and to deal with any exceptions that may be thrown:

try (
        // A forward slash works as a path separator on Windows and Unix-like systems
        InputStream posModelStream = new FileInputStream(
            getModelDir() + "/en-pos-maxent.bin");
        InputStream chunkerStream = new FileInputStream(
            getModelDir() + "/en-chunker.bin")) {
    ...
} catch (IOException ex) {
    // Handle exceptions
}
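As a minimal sketch of the same idea, the model paths can also be built with java.nio so that no separator has to be written by hand. The ModelPaths class, the modelPath helper, and the "models" directory are hypothetical names used only for illustration; they are not part of OpenNLP or of the example above:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ModelPaths {
    // Hypothetical helper: joins a model directory and a file name
    // using the platform's own separator.
    static Path modelPath(String modelDir, String fileName) {
        return Paths.get(modelDir).resolve(fileName);
    }

    public static void main(String[] args) {
        // "models" is a placeholder directory, not a path taken from the text.
        Path posPath = modelPath("models", "en-pos-maxent.bin");
        Path chunkerPath = modelPath("models", "en-chunker.bin");
        // Both streams are closed automatically when the block exits.
        try (InputStream posModelStream = Files.newInputStream(posPath);
             InputStream chunkerStream = Files.newInputStream(chunkerPath)) {
            System.out.println("Opened " + posPath + " and " + chunkerPath);
        } catch (IOException ex) {
            System.err.println("Could not open model files: " + ex.getMessage());
        }
    }
}
```

If either file is missing, the catch block reports the problem instead of letting the exception propagate.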

The following code sequence creates a tagger and uses it to find the POS tag of each token in the sentence. Here, sentence is an array of strings holding the tokens of the sentence. Each token and its tag are then displayed:

POSModel model = new POSModel(posModelStream); 
POSTaggerME tagger = new POSTaggerME(model); 
 
String tags[] = tagger.tag(sentence); 
for(int i=0; i<tags.length; i++) { 
    System.out.print(sentence[i] + "/" + tags[i] + " "); 
} 
System.out.println();

The output is as follows. We have shown this output so that it will be clear how the chunker works:

The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident./NN

A ChunkerModel instance is created using the input stream. From this, the ChunkerME instance is created, followed by the use of the chunk method, as shown here. The chunk method uses the sentence's tokens and their POS tags to create an array of strings. Each string holds the chunk tag for the token at the same index:

ChunkerModel chunkerModel = new ChunkerModel(chunkerStream);
ChunkerME chunkerME = new ChunkerME(chunkerModel);
String[] result = chunkerME.chunk(sentence, tags);

Each token in the results array and its chunk tag are displayed, as shown here:

for (int i = 0; i < result.length; i++) { 
    System.out.println("[" + sentence[i] + "] " + result[i]); 
}

In the output, each token is enclosed in brackets and followed by its chunk tag. Each chunk tag has two parts separated by a hyphen, as explained in the following table:

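    Part          Tag   Meaning
    First part    B     Beginning of a chunk
    First part    I     Continuation of a chunk
    First part    E     End of a chunk (does not occur in the output shown here)
    Second part   NP    Noun phrase chunk
    Second part   VP    Verb phrase chunk
    Second part   PP    Prepositional phrase chunk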

Multiple words are grouped together, such as "The voyage" and "the Abraham Lincoln":

    [The] B-NP
    [voyage] I-NP
    [of] B-PP
    [the] B-NP
    [Abraham] I-NP
    [Lincoln] I-NP
    [was] B-VP
    [for] B-PP
    [a] B-NP
    [long] I-NP
    [time] I-NP
    [marked] B-VP
    [by] B-PP
    [no] B-NP
    [special] I-NP
    [incident.] I-NP
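To make the two-part tag format concrete, the following sketch counts how many chunks of each type appear in an array of chunk tags by looking only at the tags that begin a chunk. It uses plain Java and does not call OpenNLP; the ChunkTagCounter class and its countChunks method are hypothetical names introduced for this illustration, and the tags are copied from the output above:

```java
import java.util.Map;
import java.util.TreeMap;

class ChunkTagCounter {
    // Counts chunks per type by counting the tags that open a chunk ("B-").
    static Map<String, Integer> countChunks(String[] chunkTags) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String tag : chunkTags) {
            if (tag.startsWith("B-")) {
                String type = tag.substring(2); // second part of the tag, e.g. "NP"
                counts.merge(type, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Chunk tags copied from the output above, in token order.
        String[] result = {
            "B-NP", "I-NP", "B-PP", "B-NP", "I-NP", "I-NP", "B-VP", "B-PP",
            "B-NP", "I-NP", "I-NP", "B-VP", "B-PP", "B-NP", "I-NP", "I-NP"};
        System.out.println(countChunks(result)); // prints {NP=4, PP=3, VP=2}
    }
}
```

Because every chunk starts with exactly one B tag, counting B tags is enough; the I tags only extend a chunk that is already open.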

If we are interested in getting more detailed information about the chunks, we can use the ChunkerME class's chunkAsSpans method. This method returns an array of Span objects. Each object represents one span found in the text.

The Span class provides several methods for examining a span. Here, we will illustrate the use of the getType, getStart, getEnd, and length methods. The getType method returns the second part of the chunk tag. The getStart method returns the index of the span's first token in the original sentence array, and the getEnd method returns the index one past its last token. The length method returns the length of the span as a number of tokens.

In the following sequence, the chunkAsSpans method is executed using the sentence and tags arrays. The spans array is then displayed. The outer for loop processes one Span object at a time, displaying the basic span information. The inner for loop displays the spanned text enclosed within brackets:

Span[] spans = chunkerME.chunkAsSpans(sentence, tags); 
for (Span span : spans) { 
    System.out.print("Type: " + span.getType() + " - "  
        + " Begin: " + span.getStart()  
        + " End:" + span.getEnd() 
        + " Length: " + span.length() + "  ["); 
    for (int j = span.getStart(); j < span.getEnd(); j++) { 
        System.out.print(sentence[j] + " "); 
    } 
    System.out.println("]"); 
}

The following output shows the span type, its position in the sentence array, its length, and then the actual spanned text:

    Type: NP -  Begin: 0 End:2 Length: 2  [The voyage ]
    Type: PP -  Begin: 2 End:3 Length: 1  [of ]
    Type: NP -  Begin: 3 End:6 Length: 3  [the Abraham Lincoln ]
    Type: VP -  Begin: 6 End:7 Length: 1  [was ]
    Type: PP -  Begin: 7 End:8 Length: 1  [for ]
    Type: NP -  Begin: 8 End:11 Length: 3  [a long time ]
    Type: VP -  Begin: 11 End:12 Length: 1  [marked ]
    Type: PP -  Begin: 12 End:13 Length: 1  [by ]
    Type: NP -  Begin: 13 End:16 Length: 3  [no special incident. ]
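The relationship between the chunk tags and the span indices can be sketched in plain Java. The following code is only an illustration of how B and I tags map to start and end indices, where the end index is one past the last token; it is not OpenNLP's implementation, and the BioSpans and ChunkSpan names are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

class BioSpans {
    // Hypothetical stand-in for a span: a type plus [start, end) token indices.
    static final class ChunkSpan {
        final String type;
        final int start; // index of the first token in the chunk
        final int end;   // index one past the last token in the chunk

        ChunkSpan(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }

        int length() {
            return end - start;
        }

        @Override
        public String toString() {
            return type + "[" + start + ".." + end + ")";
        }
    }

    // Groups consecutive chunk tags into spans.
    static List<ChunkSpan> toSpans(String[] chunkTags) {
        List<ChunkSpan> spans = new ArrayList<>();
        String type = null; // type of the open chunk, or null if none is open
        int start = 0;
        for (int i = 0; i < chunkTags.length; i++) {
            String tag = chunkTags[i];
            boolean continues = type != null && tag.equals("I-" + type);
            if (!continues) {
                if (type != null) {
                    spans.add(new ChunkSpan(type, start, i)); // close the open chunk
                }
                if (tag.startsWith("B-") || tag.startsWith("I-")) {
                    type = tag.substring(2); // open a new chunk at this token
                    start = i;
                } else {
                    type = null; // any other tag is outside every chunk
                }
            }
        }
        if (type != null) {
            spans.add(new ChunkSpan(type, start, chunkTags.length));
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tags = {"B-NP", "I-NP", "B-PP", "B-NP", "I-NP", "I-NP"};
        for (ChunkSpan span : toSpans(tags)) {
            System.out.println(span + " length " + span.length());
        }
    }
}
```

Applied to the first six chunk tags of the example sentence, this yields an NP span from 0 to 2, a PP span from 2 to 3, and an NP span from 3 to 6, which matches the first three lines of the output above.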