Detecting parts of speech

Another way of classifying the parts of text is at the sentence level. A sentence can be decomposed into individual words or combinations of words according to categories, such as nouns, verbs, adverbs, and prepositions. Most of us learned how to do this in school. We also learned not to end a sentence with a preposition, contrary to what we did in the second sentence of this paragraph.

Detecting parts of speech (POS) is useful in other tasks, such as extracting relationships and determining the meaning of text. Determining these relationships is called parsing. POS processing is useful for enhancing the quality of data sent to other elements of a pipeline.

The internals of a POS process can be complex. Fortunately, most of the complexity is hidden from us and encapsulated in classes and methods. We will use a couple of OpenNLP classes to illustrate this process. First, we need a model to detect the POS. The POSModel class is instantiated from the model found in the en-pos-maxent.bin file, as shown here:

// File(parent, child): the model directory, then the model file name
POSModel model = new POSModelLoader().load(
    new File("../OpenNLP Models/", "en-pos-maxent.bin"));

The POSTaggerME class is used to perform the actual tagging. Create an instance of this class based on the previous model, as shown here:

POSTaggerME tagger = new POSTaggerME(model); 

Next, declare a string containing the text to be processed:

String sentence = "POS processing is useful for enhancing the "  
   + "quality of data sent to other elements of a pipeline."; 

Here, we will use WhitespaceTokenizer to tokenize the text:

String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(sentence);
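
Because a whitespace tokenizer splits only on whitespace, punctuation stays attached to the neighboring word, which is why the last token in the output later in this section is pipeline. with its period. The following is a minimal sketch of that behavior using plain Java's String.split rather than OpenNLP. The class name WhitespaceSplitSketch and the method splitOnWhitespace are hypothetical names used only for illustration, and the sketch only approximates what WhitespaceTokenizer does:

```java
import java.util.Arrays;

public class WhitespaceSplitSketch {

    // Hypothetical helper: approximates whitespace tokenization with plain Java.
    // It is not the OpenNLP implementation.
    public static String[] splitOnWhitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        // Split on runs of whitespace; punctuation is left attached to words.
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String sentence = "POS processing is useful for enhancing the "
            + "quality of data sent to other elements of a pipeline.";
        String[] tokens = splitOnWhitespace(sentence);
        System.out.println(tokens.length + " tokens");
        System.out.println(Arrays.toString(tokens));
    }
}
```

If the trailing period should be a separate token, a tokenizer that handles punctuation would be needed in place of a whitespace split.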

The tag method is then used to find the parts of speech, and the results are stored in an array of strings:

String[] tags = tagger.tag(tokens); 

The tokens and their corresponding tags are then displayed:

for(int i=0; i<tokens.length; i++) { 
    System.out.print(tokens[i] + "[" + tags[i] + "] "); 
} 

When executed, the following output will be produced:

    POS[NNP] processing[NN] is[VBZ] useful[JJ] for[IN] enhancing[VBG] the[DT] quality[NN] of[IN] data[NNS] sent[VBN] to[TO] other[JJ] elements[NNS] of[IN] a[DT] pipeline.[NN]  

Each token is followed by an abbreviation, contained within brackets, for its POS. For example, NNP means that it is a proper noun. These abbreviations will be covered in Chapter 5, Detecting Parts-of-Speech, which is devoted to exploring this topic in depth.
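
As a quick reference for the output above, the following sketch maps the tags that appear in it to short descriptions. It assumes the model uses the Penn Treebank tag set, which is the usual convention for English POS models. The class name TagGlossary and its describe method are hypothetical and are not part of OpenNLP:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TagGlossary {

    // Assumption: tags follow the Penn Treebank tag set.
    // Only the tags seen in the sample output are listed here.
    private static final Map<String, String> DESCRIPTIONS = new LinkedHashMap<>();
    static {
        DESCRIPTIONS.put("NNP", "proper noun, singular");
        DESCRIPTIONS.put("NN", "noun, singular or mass");
        DESCRIPTIONS.put("NNS", "noun, plural");
        DESCRIPTIONS.put("VBZ", "verb, 3rd person singular present");
        DESCRIPTIONS.put("VBG", "verb, gerund or present participle");
        DESCRIPTIONS.put("VBN", "verb, past participle");
        DESCRIPTIONS.put("JJ", "adjective");
        DESCRIPTIONS.put("IN", "preposition or subordinating conjunction");
        DESCRIPTIONS.put("DT", "determiner");
        DESCRIPTIONS.put("TO", "to");
    }

    // Returns a short description, or "unknown tag" if the tag is not listed.
    public static String describe(String tag) {
        return DESCRIPTIONS.getOrDefault(tag, "unknown tag");
    }

    public static void main(String[] args) {
        // Hard-coded sample taken from the start of the output above.
        String[] tokens = {"POS", "processing", "is", "useful"};
        String[] tags = {"NNP", "NN", "VBZ", "JJ"};
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(tokens[i] + " -> " + tags[i]
                + " (" + describe(tags[i]) + ")");
        }
    }
}
```

In a real program, the tokens and tags arrays would come from the tokenizer and the tag method shown earlier rather than being hard-coded.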
