Using the OpenNLP POSTaggerME class for POS tagging

The OpenNLP POSTaggerME class uses maximum entropy to determine the tags.
The tagger chooses a tag based on the word itself and the word's context. Any given word may have multiple tags associated with it; the tagger uses a probability model to determine the specific tag to be assigned.

POS models are loaded from a file. The en-pos-maxent.bin model is used frequently and is based on the Penn TreeBank tag set. Various pretrained POS models for OpenNLP can be found at http://opennlp.sourceforge.net/models-1.5/.

We start with a try-with-resources block to handle any IOException that might be generated when loading the model. We use the en-pos-maxent.bin file for the model; the code discussed in the rest of this section goes inside the try block:

try (InputStream modelIn = new FileInputStream( 
        new File(getModelDir(), "en-pos-maxent.bin"))) { 
    ... 
} catch (IOException e) { 
    // Handle exceptions 
} 
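
The getModelDir method is a helper used by these examples to locate the directory holding the downloaded models; it is not part of OpenNLP. A minimal stand-in, assuming the models were saved to a local models directory (the path is an assumption):

private static File getModelDir() { 
    // Hypothetical helper: assumes the pretrained models, such as 
    // en-pos-maxent.bin, were downloaded from 
    // http://opennlp.sourceforge.net/models-1.5/ into ./models 
    return new File("models"); 
} 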

Next, create the POSModel and POSTaggerME instances, as shown here:

POSModel model = new POSModel(modelIn); 
POSTaggerME tagger = new POSTaggerME(model); 

The tag method can now be applied to the tagger, using the tokenized text to be processed as its argument. The sentence variable is a String array holding the tokens:

String tags[] = tagger.tag(sentence); 
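
This excerpt assumes the sentence variable has already been populated. A minimal sketch of producing it with OpenNLP's WhitespaceTokenizer (the tokenizer choice is an assumption, though it is consistent with the incident./NN token in the output below):

String text = "The voyage of the Abraham Lincoln was for a long " 
    + "time marked by no special incident."; 
String sentence[] = WhitespaceTokenizer.INSTANCE.tokenize(text); 

This requires import opennlp.tools.tokenize.WhitespaceTokenizer;.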

The words and their tags are then displayed, as shown here:

for (int i = 0; i < sentence.length; i++) { 
    System.out.print(sentence[i] + "/" + tags[i] + " "); 
} 

The output is as follows. Each word is followed by its tag. Note that the period remains attached to the final token, incident., which leaves its tag ambiguous, as the alternative sequences shown later demonstrate:

The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident./NN

With any sentence, there may be more than one possible assignment of tags to words. The topKSequences method returns a set of sequences ranked by their probability of being correct. In the following code sequence, the topKSequences method is executed against the sentence variable and the resulting sequences are displayed:

Sequence topSequences[] = tagger.topKSequences(sentence); 
for (int i = 0; i < topSequences.length; i++) { 
    System.out.println(topSequences[i]); 
}

Its output follows, in which the first number represents a weighted score (a higher, that is, less negative, value indicates a more likely sequence) and the tags within the brackets are the sequence of tags scored:

    -0.5563571615737618 [DT, NN, IN, DT, NNP, NNP, VBD, IN, DT, JJ, NN, VBN, IN, DT, JJ, NN]
    -2.9886144610050907 [DT, NN, IN, DT, NNP, NNP, VBD, IN, DT, JJ, NN, VBN, IN, DT, JJ, .]
    -3.771930515521527 [DT, NN, IN, DT, NNP, NNP, VBD, IN, DT, JJ, NN, VBN, IN, DT, NN, NN]
Ensure that you include the correct Sequence class. For this example, use import opennlp.tools.util.Sequence;.

The Sequence class has several methods, as detailed in the following table:

Method         Meaning
getOutcomes    Returns a list of strings representing the tags for the sentence
getProbs       Returns an array of doubles representing the probability of each tag in the sequence
getScore       Returns a weighted score for the sequence

In the following code sequence, we use several of these methods to demonstrate what they do. For each sequence, the tags and their probabilities are displayed, separated by a forward slash:

for (int i = 0; i < topSequences.length; i++) { 
    List<String> outcomes = topSequences[i].getOutcomes(); 
    double probabilities[] = topSequences[i].getProbs(); 
    for (int j = 0; j < outcomes.size(); j++) { 
        System.out.printf("%s/%5.3f ", outcomes.get(j), 
            probabilities[j]); 
    } 
    System.out.println(); 
} 
System.out.println();

The output is as follows. Each line represents one sequence:

    DT/0.992 NN/0.990 IN/0.989 DT/0.990 NNP/0.996 NNP/0.991 VBD/0.994 IN/0.996 DT/0.996 JJ/0.991 NN/0.994 VBN/0.860 IN/0.985 DT/0.960 JJ/0.919 NN/0.832 
    DT/0.992 NN/0.990 IN/0.989 DT/0.990 NNP/0.996 NNP/0.991 VBD/0.994 IN/0.996 DT/0.996 JJ/0.991 NN/0.994 VBN/0.860 IN/0.985 DT/0.960 JJ/0.919 ./0.073 
    DT/0.992 NN/0.990 IN/0.989 DT/0.990 NNP/0.996 NNP/0.991 VBD/0.994 IN/0.996 DT/0.996 JJ/0.991 NN/0.994 VBN/0.860 IN/0.985 DT/0.960 NN/0.073 NN/0.419
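
The getScore method, listed in the table earlier but not exercised above, returns the weighted value for a sequence; this appears to be the same score that printing a Sequence shows at the start of each line. A minimal sketch of retrieving it:

for (int i = 0; i < topSequences.length; i++) { 
    // Print the weighted score for each candidate tag sequence 
    System.out.printf("Sequence %d score: %.4f%n", 
        i, topSequences[i].getScore()); 
} 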