Finding word dependencies using the GrammaticalStructure class

Another approach to parsing text is to use the LexicalizedParser object that we created in the previous section in conjunction with the TreebankLanguagePack interface. A treebank is a text corpus that has been annotated with syntactic or semantic information about each sentence's structure. The first major treebank was the Penn Treebank (http://www.cis.upenn.edu/~treebank/). Treebanks can be created manually or semi-automatically.

The following example illustrates how a simple string can be tokenized and then parsed. A TokenizerFactory creates the tokenizer, which splits the sentence into a list of tokens for the parser.

The CoreLabel class that we discussed in the Using the LexicalizedParser class section is used here:

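// Tokenize the sentence into CoreLabel tokens with the Penn Treebank tokenizer
String sentence = "The cow jumped over the moon.";
TokenizerFactory<CoreLabel> tokenizerFactory =
    PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tokenizer =
    tokenizerFactory.getTokenizer(new StringReader(sentence));
List<CoreLabel> wordList = tokenizer.tokenize();
// Parse the tokens; lexicalizedParser and parseTree are assumed to be
// declared as in the previous section
parseTree = lexicalizedParser.apply(wordList);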

The TreebankLanguagePack interface specifies methods for working with a treebank. In the following code, a series of objects is created, culminating in a list of TypedDependency instances, which are used to obtain dependency information about the elements of a sentence. A GrammaticalStructureFactory instance is created and used to create an instance of the GrammaticalStructure class.

As this class's name implies, it stores the grammatical relationships between elements in the tree:

// treebankLanguagePack is a method call, so it needs parentheses
TreebankLanguagePack tlp =
    lexicalizedParser.treebankLanguagePack();
GrammaticalStructureFactory gsf =
    tlp.grammaticalStructureFactory();
GrammaticalStructure gs =
    gsf.newGrammaticalStructure(parseTree);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();

We can simply display the list, as shown here:

System.out.println(tdl);

The output is as follows:

    [det(cow-2, The-1), nsubj(jumped-3, cow-2), root(ROOT-0, jumped-3), det(moon-6, the-5), prep_over(jumped-3, moon-6)]  
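Each entry in this output has the form relation(governor-index, dependent-index), where the index is the word's position in the sentence and ROOT-0 marks the artificial root. As a rough illustration of that notation only, the following sketch splits the printed string into its parts with a regular expression. The DependencyNotation class and its Entry type are hypothetical helpers, not part of the Stanford API, and the pattern assumes relation names made of word characters, as in the output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the Stanford API): parses the printed
// form of a dependency list, such as the output shown above.
class DependencyNotation {

    // One printed entry: relation(governor-index, dependent-index).
    static final class Entry {
        final String relation;
        final String governor;
        final int governorIndex;
        final String dependent;
        final int dependentIndex;

        Entry(String relation, String governor, int governorIndex,
              String dependent, int dependentIndex) {
            this.relation = relation;
            this.governor = governor;
            this.governorIndex = governorIndex;
            this.dependent = dependent;
            this.dependentIndex = dependentIndex;
        }
    }

    // Greedy \S+ backtracks to the last hyphen, so the trailing number is
    // always captured as the index.
    private static final Pattern ENTRY =
        Pattern.compile("(\\w+)\\((\\S+)-(\\d+), (\\S+)-(\\d+)\\)");

    // Returns the entries in the order in which they appear in the string.
    static List<Entry> parse(String printed) {
        List<Entry> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(printed);
        while (m.find()) {
            entries.add(new Entry(
                m.group(1),
                m.group(2), Integer.parseInt(m.group(3)),
                m.group(4), Integer.parseInt(m.group(5))));
        }
        return entries;
    }

    public static void main(String[] args) {
        String printed = "[det(cow-2, The-1), nsubj(jumped-3, cow-2), "
            + "root(ROOT-0, jumped-3), det(moon-6, the-5), "
            + "prep_over(jumped-3, moon-6)]";
        for (Entry e : parse(printed)) {
            System.out.println(e.relation + ": " + e.governor + "@"
                + e.governorIndex + " -> " + e.dependent + "@"
                + e.dependentIndex);
        }
    }
}
```

In real code there is no need to parse the string, since the TypedDependency objects expose the same parts directly; the sketch is only meant to make the notation easier to read.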

This information can also be extracted using the gov, reln, and dep methods, which return the governor word, the relationship, and the dependent element, respectively, as illustrated here:

for(TypedDependency dependency : tdl) { 
    System.out.println("Governor Word: [" + dependency.gov()  
        + "] Relation: [" + dependency.reln().getLongName() 
        + "] Dependent Word: [" + dependency.dep() + "]"); 
} 

The output is as follows:

    Governor Word: [cow/NN] Relation: [determiner] Dependent Word: [The/DT]
    Governor Word: [jumped/VBD] Relation: [nominal subject] Dependent Word: [cow/NN]
    Governor Word: [ROOT] Relation: [root] Dependent Word: [jumped/VBD]
    Governor Word: [moon/NN] Relation: [determiner] Dependent Word: [the/DT]
    Governor Word: [jumped/VBD] Relation: [prep_collapsed] Dependent Word: [moon/NN]  

From this, we can glean the relationships within a sentence and the elements of each relationship.
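As one way of putting these triples to use, the following sketch looks up the main verb and its subject. To keep it runnable without the Stanford libraries, it hard-codes the governor, relation, and dependent strings from the output above in a hypothetical Triple class that stands in for TypedDependency; the DependencyIndex class and its method names are likewise illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for working with TypedDependency results;
// none of these names come from the Stanford API.
class DependencyIndex {

    // A (governor, relation, dependent) triple held as plain strings.
    static final class Triple {
        final String governor;
        final String relation;
        final String dependent;

        Triple(String governor, String relation, String dependent) {
            this.governor = governor;
            this.relation = relation;
            this.dependent = dependent;
        }
    }

    // The five triples printed in the output above.
    static List<Triple> sample() {
        return Arrays.asList(
            new Triple("cow/NN", "determiner", "The/DT"),
            new Triple("jumped/VBD", "nominal subject", "cow/NN"),
            new Triple("ROOT", "root", "jumped/VBD"),
            new Triple("moon/NN", "determiner", "the/DT"),
            new Triple("jumped/VBD", "prep_collapsed", "moon/NN"));
    }

    // Groups triples by governor, preserving first-seen order. Keying on
    // the word string is a simplification: a repeated word would be merged.
    static Map<String, List<Triple>> byGovernor(List<Triple> triples) {
        Map<String, List<Triple>> index = new LinkedHashMap<>();
        for (Triple t : triples) {
            index.computeIfAbsent(t.governor, k -> new ArrayList<>()).add(t);
        }
        return index;
    }

    // Returns the first dependent linked to the governor by the given
    // relation, or null when there is no such triple.
    static String find(List<Triple> triples, String governor, String relation) {
        for (Triple t : triples) {
            if (t.governor.equals(governor) && t.relation.equals(relation)) {
                return t.dependent;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Triple> triples = sample();
        String verb = find(triples, "ROOT", "root");
        String subject = find(triples, verb, "nominal subject");
        System.out.println("Main verb: " + verb + ", subject: " + subject);
    }
}
```

With real parser output, the same lookup could be written against the list of TypedDependency objects using the gov, reln, and dep methods, comparing token positions rather than word strings.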
