Another approach to parsing text is to use the LexicalizedParser object that we created in the previous section in conjunction with the TreebankLanguagePack interface. A Treebank is a text corpus that has been annotated with syntactic or semantic information, providing information about a sentence's structure. The first major Treebank was the Penn Treebank (http://www.cis.upenn.edu/~treebank/). Treebanks can be created manually or semi-automatically.
The following example illustrates how a simple string can be tokenized and then parsed. A TokenizerFactory creates a tokenizer, which splits the string into a list of tokens that is passed to the parser.
The CoreLabel class that we discussed in the Using the LexicalizedParser class section is used here:
String sentence = "The cow jumped over the moon.";
TokenizerFactory<CoreLabel> tokenizerFactory =
    PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
Tokenizer<CoreLabel> tokenizer =
    tokenizerFactory.getTokenizer(new StringReader(sentence));
List<CoreLabel> wordList = tokenizer.tokenize();
parseTree = lexicalizedParser.apply(wordList);
The TreebankLanguagePack interface specifies methods for working with a Treebank. In the following code, a series of objects is created, culminating in a list of TypedDependency instances, which are used to obtain dependency information about the elements of a sentence. A GrammaticalStructureFactory instance is created and used to create a GrammaticalStructure instance.
As the name of this class implies, it stores the grammatical relations between elements in the tree:
TreebankLanguagePack tlp = lexicalizedParser.treebankLanguagePack();
GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
GrammaticalStructure gs = gsf.newGrammaticalStructure(parseTree);
List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
We can simply display the list, as shown here:
System.out.println(tdl);
The output is as follows:
[det(cow-2, The-1), nsubj(jumped-3, cow-2), root(ROOT-0, jumped-3), det(moon-6, the-5), prep_over(jumped-3, moon-6)]
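Each entry in this output has the form relation(governor-index, dependent-index), where the index is the position of the token in the sentence and ROOT-0 is an artificial root node. Note that "over" (token 4) has no entry of its own because it has been folded into the prep_over relation. The following is a minimal sketch that splits the printed text into its parts using only the Java standard library. The DependencyNotation class and its Entry record are hypothetical helpers introduced for illustration; they are not part of the Stanford API, and the sketch assumes a Java version that supports records.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DependencyNotation {

    // Hypothetical holder for one printed entry; not part of the Stanford API.
    record Entry(String relation, String governor, int governorIndex,
                 String dependent, int dependentIndex) {}

    // Matches entries such as "det(cow-2, The-1)".
    private static final Pattern ENTRY = Pattern.compile(
            "(\\w+)\\((\\S+)-(\\d+), (\\S+)-(\\d+)\\)");

    // Extracts every relation(governor-i, dependent-j) entry from the text.
    static List<Entry> parse(String printed) {
        List<Entry> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(printed);
        while (m.find()) {
            entries.add(new Entry(
                    m.group(1),
                    m.group(2), Integer.parseInt(m.group(3)),
                    m.group(4), Integer.parseInt(m.group(5))));
        }
        return entries;
    }

    public static void main(String[] args) {
        // The list exactly as printed by System.out.println(tdl).
        String printed = "[det(cow-2, The-1), nsubj(jumped-3, cow-2), "
                + "root(ROOT-0, jumped-3), det(moon-6, the-5), "
                + "prep_over(jumped-3, moon-6)]";
        for (Entry e : parse(printed)) {
            System.out.println(e.relation() + ": " + e.dependent()
                    + " (token " + e.dependentIndex() + ") depends on "
                    + e.governor() + " (token " + e.governorIndex() + ")");
        }
    }
}
```

Parsing the printed text is useful only for reading or post-processing saved output. When the TypedDependency objects themselves are available, the relation, governor, and dependent can be read from them directly.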
This information can also be extracted using the gov, reln, and dep methods,
which return the governor word, the relationship, and the dependent element, respectively, as illustrated here:
for (TypedDependency dependency : tdl) {
    System.out.println("Governor Word: [" + dependency.gov()
        + "] Relation: [" + dependency.reln().getLongName()
        + "] Dependent Word: [" + dependency.dep() + "]");
}
The output is as follows:
Governor Word: [cow/NN] Relation: [determiner] Dependent Word: [The/DT]
Governor Word: [jumped/VBD] Relation: [nominal subject] Dependent Word: [cow/NN]
Governor Word: [ROOT] Relation: [root] Dependent Word: [jumped/VBD]
Governor Word: [moon/NN] Relation: [determiner] Dependent Word: [the/DT]
Governor Word: [jumped/VBD] Relation: [prep_collapsed] Dependent Word: [moon/NN]
From this, we can glean the relationships within a sentence and the elements of each relationship.
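One way to make these relationships easier to inspect is to group the dependents under their governor. The following is a minimal sketch using only the Java standard library. The GovernorIndex class and its Dep record are hypothetical names introduced for illustration, and the values are copied by hand from the output shown earlier. With the parser available, they would instead be filled from the gov, reln, and dep methods. The sketch assumes a Java version that supports records.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GovernorIndex {

    // Hypothetical stand-in for the three values read from a TypedDependency.
    record Dep(String governor, String relation, String dependent) {}

    // Groups "relation -> dependent" strings under each governor,
    // keeping governors in the order in which they first appear.
    static Map<String, List<String>> byGovernor(List<Dep> deps) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (Dep d : deps) {
            index.computeIfAbsent(d.governor(), k -> new ArrayList<>())
                 .add(d.relation() + " -> " + d.dependent());
        }
        return index;
    }

    public static void main(String[] args) {
        // Values taken from the printed output for "The cow jumped over the moon."
        List<Dep> deps = List.of(
                new Dep("cow/NN", "determiner", "The/DT"),
                new Dep("jumped/VBD", "nominal subject", "cow/NN"),
                new Dep("ROOT", "root", "jumped/VBD"),
                new Dep("moon/NN", "determiner", "the/DT"),
                new Dep("jumped/VBD", "prep_collapsed", "moon/NN"));

        byGovernor(deps).forEach((governor, relations) ->
                System.out.println(governor + ": " + relations));
    }
}
```

Grouped this way, the verb "jumped" appears as the governor of both its subject and its prepositional object, which reflects its role as the head of the sentence.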