Using the Stanford pipeline to perform tagging

We have used the Stanford pipeline in several previous examples. In this example, we will use it to extract POS tags. As with our previous Stanford examples, we create a pipeline based on a set of annotators: tokenize, ssplit, and pos.

These will tokenize the text, split it into sentences, and then find the POS tags:

Properties props = new Properties(); 
props.put("annotators", "tokenize, ssplit, pos"); 
StanfordCoreNLP pipeline = new StanfordCoreNLP(props); 

To process the text, we pass the theSentence variable to the Annotation constructor. The pipeline's annotate method is then invoked on the resulting object, as shown here:

Annotation document = new Annotation(theSentence); 
pipeline.annotate(document);

Since the pipeline can perform different types of processing, a list of CoreMap objects is used to access the words and tags. The Annotation class's get method returns the list of sentences, as shown here:

List<CoreMap> sentences = 
    document.get(SentencesAnnotation.class);

The contents of a CoreMap object can be accessed using its get method. The method's argument is the class for the information needed. As shown in the following code example, the tokens of each sentence are retrieved using the TokensAnnotation class. For each token, the word is accessed using the TextAnnotation class, and its POS tag is retrieved using the PartOfSpeechAnnotation class. Each word of each sentence and its tag are displayed:

for (CoreMap sentence : sentences) { 
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) { 
        String word = token.get(TextAnnotation.class); 
        String pos = token.get(PartOfSpeechAnnotation.class); 
        System.out.print(word + "/" + pos + " "); 
    } 
    System.out.println(); 
}

The output will be as follows:

The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN no/DT special/JJ incident/NN ./.
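
The word/tag format shown above is easy to post-process with plain Java. The following is a minimal sketch that uses only the standard library; the TaggedLineParser class and its wordsWithTagPrefix method are hypothetical helpers introduced for illustration and are not part of Stanford CoreNLP. It splits each item on its last slash and keeps the words whose tag starts with a given prefix:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Stanford CoreNLP): post-processes the
// "word/TAG" text printed by the tagging loop.
class TaggedLineParser {

    // Returns the words whose tag starts with the given prefix, such as "NN".
    static List<String> wordsWithTagPrefix(String taggedLine, String prefix) {
        List<String> words = new ArrayList<>();
        for (String item : taggedLine.trim().split("\\s+")) {
            // Split on the last slash so a word containing '/' stays intact
            int slash = item.lastIndexOf('/');
            if (slash <= 0) {
                continue; // Skip anything that is not in word/TAG form
            }
            String word = item.substring(0, slash);
            String tag = item.substring(slash + 1);
            if (tag.startsWith(prefix)) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "The/DT voyage/NN of/IN the/DT Abraham/NNP Lincoln/NNP "
                + "was/VBD for/IN a/DT long/JJ time/NN marked/VBN by/IN "
                + "no/DT special/JJ incident/NN ./.";
        // Prints [voyage, Abraham, Lincoln, time, incident]
        System.out.println(wordsWithTagPrefix(line, "NN"));
    }
}
```

When the pipeline objects are still available, filtering on the value returned by PartOfSpeechAnnotation inside the loop is the more direct route; parsing the text is mainly useful when only the printed output has been saved.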

The pipeline can use additional options to control how the tagger works. For example, there is a pos.maxlen property to control the maximum sentence size. Also, by default, the english-left3words-distsim.tagger tagger model is used. We can specify a different model using the pos.model property, as shown here:

props.put("pos.model", 
    "C:/.../Models/english-caseless-left3words-distsim.tagger"); 

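The pos.maxlen property is set in the same way as any other pipeline property. The following is a minimal sketch using only java.util.Properties; the TaggerOptions class name and the limit of 100 tokens are illustrative assumptions, not recommended values:

```java
import java.util.Properties;

// Hypothetical helper class used only to illustrate the property names.
class TaggerOptions {

    // Builds the properties for a tokenize/ssplit/pos pipeline.
    static Properties build() {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos");
        // Maximum sentence size for the tagger; 100 is an arbitrary example value
        props.setProperty("pos.maxlen", "100");
        return props;
    }

    public static void main(String[] args) {
        // Prints 100
        System.out.println(build().getProperty("pos.maxlen"));
    }
}
```

The resulting Properties object would then be passed to the StanfordCoreNLP constructor exactly as in the earlier example.
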
Sometimes, it is useful to have a tagged document that is XML formatted. The StanfordCoreNLP class's xmlPrint method will write out such a document. The method's first argument is the Annotation object to be displayed. Its second argument is the OutputStream object to write to. In the following code sequence, the previous tagging results are written to standard output. The call is enclosed in a try...catch block to handle IO exceptions:

try { 
    pipeline.xmlPrint(document, System.out); 
} catch (IOException ex) { 
    // Report the failure rather than silently ignoring it 
    ex.printStackTrace(); 
}

A partial listing of the results is as follows. Only the first two words and the last word are displayed. Each token element contains the word, its beginning and ending character offsets, and its POS tag:

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet href="CoreNLP-to-HTML.xsl" type="text/xsl"?>
    <root>
    <document>
    <sentences>
    <sentence id="1">
    <tokens>
    <token id="1">
    <word>The</word>
    <CharacterOffsetBegin>0</CharacterOffsetBegin>
    <CharacterOffsetEnd>3</CharacterOffsetEnd>
    <POS>DT</POS>
    </token>
    <token id="2">
    <word>voyage</word>
    <CharacterOffsetBegin>4</CharacterOffsetBegin>
    <CharacterOffsetEnd>10</CharacterOffsetEnd>
    <POS>NN</POS>
    </token>
             ...
    <token id="17">
    <word>.</word>
    <CharacterOffsetBegin>83</CharacterOffsetBegin>
    <CharacterOffsetEnd>84</CharacterOffsetEnd>
    <POS>.</POS>
    </token>
    </tokens>
    </sentence>
    </sentences>
    </document>
    </root>
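
XML in this shape can be read back with the JDK's built-in DOM parser. The sketch below assumes the XML has already been captured as a string (for instance, by passing a ByteArrayOutputStream to xmlPrint instead of System.out). The XmlTagReader class and its readWordTagPairs method are hypothetical names introduced for illustration; only the element names token, word, and POS come from the listing above:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Stanford CoreNLP): extracts word/POS
// pairs from XML shaped like the listing above.
class XmlTagReader {

    // Returns one "word/POS" string per <token> element, in document order.
    static List<String> readWordTagPairs(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            NodeList tokens = doc.getElementsByTagName("token");
            List<String> pairs = new ArrayList<>();
            for (int i = 0; i < tokens.getLength(); i++) {
                Element token = (Element) tokens.item(i);
                String word = token.getElementsByTagName("word")
                        .item(0).getTextContent();
                String pos = token.getElementsByTagName("POS")
                        .item(0).getTextContent();
                pairs.add(word + "/" + pos);
            }
            return pairs;
        } catch (Exception ex) {
            // Wrap parser and IO failures so callers need no checked exceptions
            throw new IllegalArgumentException("Could not parse XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><document><sentences><sentence id=\"1\"><tokens>"
                + "<token id=\"1\"><word>The</word><POS>DT</POS></token>"
                + "<token id=\"2\"><word>voyage</word><POS>NN</POS></token>"
                + "</tokens></sentence></sentences></document></root>";
        // Prints [The/DT, voyage/NN]
        System.out.println(readWordTagPairs(xml));
    }
}
```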

The prettyPrint method works in a similar manner:

pipeline.prettyPrint(document, System.out); 

However, the output is not really that pretty, as shown here. The original sentence is displayed, followed by each word, its position, and its tag. The output has been formatted to make it more readable:

    The voyage of the Abraham Lincoln was for a long time marked by no special incident.
    [Text=The CharacterOffsetBegin=0 CharacterOffsetEnd=3 PartOfSpeech=DT]
    [Text=voyage CharacterOffsetBegin=4 CharacterOffsetEnd=10 PartOfSpeech=NN]
    [Text=of CharacterOffsetBegin=11 CharacterOffsetEnd=13 PartOfSpeech=IN]
    [Text=the CharacterOffsetBegin=14 CharacterOffsetEnd=17 PartOfSpeech=DT]
    [Text=Abraham CharacterOffsetBegin=18 CharacterOffsetEnd=25 PartOfSpeech=NNP]
    [Text=Lincoln CharacterOffsetBegin=26 CharacterOffsetEnd=33 PartOfSpeech=NNP]
    [Text=was CharacterOffsetBegin=34 CharacterOffsetEnd=37 PartOfSpeech=VBD]
    [Text=for CharacterOffsetBegin=38 CharacterOffsetEnd=41 PartOfSpeech=IN]
    [Text=a CharacterOffsetBegin=42 CharacterOffsetEnd=43 PartOfSpeech=DT]
    [Text=long CharacterOffsetBegin=44 CharacterOffsetEnd=48 PartOfSpeech=JJ]
    [Text=time CharacterOffsetBegin=49 CharacterOffsetEnd=53 PartOfSpeech=NN]
    [Text=marked CharacterOffsetBegin=54 CharacterOffsetEnd=60 PartOfSpeech=VBN]
    [Text=by CharacterOffsetBegin=61 CharacterOffsetEnd=63 PartOfSpeech=IN]
    [Text=no CharacterOffsetBegin=64 CharacterOffsetEnd=66 PartOfSpeech=DT]
    [Text=special CharacterOffsetBegin=67 CharacterOffsetEnd=74 PartOfSpeech=JJ]
    [Text=incident CharacterOffsetBegin=75 CharacterOffsetEnd=83 PartOfSpeech=NN]
    [Text=. CharacterOffsetBegin=83 CharacterOffsetEnd=84 PartOfSpeech=.]
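
The position values in this output are character offsets into the original text, with the end offset being exclusive, so they map directly onto String.substring. The following short sketch illustrates this with the example sentence; the OffsetDemo class and its tokenAt method are hypothetical names used only for this illustration:

```java
// Hypothetical demo class: shows how the reported offsets relate to the text.
class OffsetDemo {

    // Returns the text covered by the span from begin (inclusive)
    // to end (exclusive).
    static String tokenAt(String text, int begin, int end) {
        return text.substring(begin, end);
    }

    public static void main(String[] args) {
        String text = "The voyage of the Abraham Lincoln was for a long time "
                + "marked by no special incident.";
        System.out.println(tokenAt(text, 4, 10));  // prints voyage
        System.out.println(tokenAt(text, 75, 83)); // prints incident
        System.out.println(tokenAt(text, 83, 84)); // prints .
    }
}
```

This also explains why the final period starts at offset 83, the same value as the end offset of incident: no space separates the two tokens.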