Using the ExactDictionaryChunker class

The ExactDictionaryChunker class provides an easy way to find entities in text using a dictionary of entities and their types. A MapDictionary object stores the entries, and the ExactDictionaryChunker class then extracts chunks from the text based on that dictionary.

The AbstractDictionary class supports basic operations for entities, categories, and scores. The score is used in the matching process. The MapDictionary and TrieDictionary classes both extend AbstractDictionary. The TrieDictionary class stores its entries in a character trie, which uses less memory, so it works well when memory is limited. We will use the MapDictionary class for our example.
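To make the idea of a character trie more concrete, the following is a minimal plain-Java sketch. It is not the TrieDictionary implementation; the CharTrieSketch class and its add, categoryOf, and nodeCount methods are hypothetical names used only for illustration. It shows how entries that share a prefix also share nodes, which is where the potential memory saving comes from:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative character trie; NOT LingPipe's TrieDictionary implementation.
class CharTrieSketch {
    // One node per character; category is non-null only where an entry ends.
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String category;
    }

    private final Node root = new Node();
    private int nodeCount = 1; // counts the root node

    // Walks (and extends) the path for the phrase, then marks the final node.
    void add(String phrase, String category) {
        Node node = root;
        for (char c : phrase.toCharArray()) {
            Node next = node.children.get(c);
            if (next == null) {
                next = new Node();
                node.children.put(c, next);
                nodeCount++;
            }
            node = next;
        }
        node.category = category;
    }

    // Returns the category stored for the phrase, or null if it is not an entry.
    String categoryOf(String phrase) {
        Node node = root;
        for (char c : phrase.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return null;
            }
        }
        return node.category;
    }

    int nodeCount() {
        return nodeCount;
    }

    public static void main(String[] args) {
        CharTrieSketch trie = new CharTrieSketch();
        trie.add("Boston", "PLACE");
        trie.add("Bostonian", "PERSON");
        System.out.println(trie.categoryOf("Boston"));
        System.out.println(trie.nodeCount());
    }
}
```

Here, Bostonian reuses the six nodes already created for Boston and adds only three more. A production trie would use a more compact child representation than a HashMap per node; the sketch only illustrates the prefix sharing.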

To illustrate the dictionary approach, we will start with the declaration of a MapDictionary variable. It is declared static because it is initialized from a static method:

private static MapDictionary<String> dictionary;

The dictionary will contain the entities that we are interested in finding. We need to initialize the dictionary, as performed in the following initializeDictionary method. The DictionaryEntry constructor used here accepts three arguments:

  • String: The name of the entity
  • String: The category of the entity
  • double: A score for the entity

The score is used when determining matches. A few entities are declared and added to the dictionary:

private static void initializeDictionary() { 
    dictionary = new MapDictionary<String>(); 
    dictionary.addEntry( 
        new DictionaryEntry<String>("Joe","PERSON",1.0)); 
    dictionary.addEntry( 
        new DictionaryEntry<String>("Fred","PERSON",1.0)); 
    dictionary.addEntry( 
        new DictionaryEntry<String>("Boston","PLACE",1.0)); 
    dictionary.addEntry( 
        new DictionaryEntry<String>("pub","PLACE",1.0)); 
    dictionary.addEntry( 
        new DictionaryEntry<String>("Vermont","PLACE",1.0)); 
    dictionary.addEntry( 
        new DictionaryEntry<String>("IBM","ORGANIZATION",1.0)); 
    dictionary.addEntry( 
        new DictionaryEntry<String>("Sally","PERSON",1.0)); 
} 

An ExactDictionaryChunker instance will use this dictionary. The arguments of the ExactDictionaryChunker constructor are detailed here:

  • Dictionary<String>: The dictionary containing the entities
  • TokenizerFactory: The tokenizer factory used by the chunker
  • boolean: If true, the chunker returns all matches
  • boolean: If true, matches are case sensitive

Matches can overlap. For example, in the phrase The First National Bank, the entity Bank could be matched by itself or as part of the longer phrase. The third parameter, a boolean, determines whether all of the matches are returned.
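The effect of the two boolean flags can be sketched in plain Java. The following is an illustration only and is not how ExactDictionaryChunker is implemented: the OverlapSketch class and its find method are hypothetical, tokens are split on whitespace rather than with a tokenizer factory, and the sketch assumes that when not all matches are requested, only the longest match starting at the leftmost position is kept:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative matcher; NOT how ExactDictionaryChunker is implemented.
class OverlapSketch {
    private static String key(String s, boolean caseSensitive) {
        return caseSensitive ? s : s.toLowerCase(Locale.ROOT);
    }

    // Returns dictionary phrases found in text, scanning whitespace tokens.
    // returnAll=true keeps every match, including overlapping ones;
    // returnAll=false (assumed policy) keeps only the longest match that
    // starts at each position and then skips past it.
    static List<String> find(Set<String> phrases, String text,
                             boolean returnAll, boolean caseSensitive) {
        Set<String> dict = new HashSet<>();
        for (String p : phrases) {
            dict.add(key(p, caseSensitive));
        }
        String[] tokens = text.trim().split("\\s+");
        List<String> found = new ArrayList<>();
        int i = 0;
        while (i < tokens.length) {
            int longestEnd = -1;
            StringBuilder candidate = new StringBuilder();
            for (int j = i; j < tokens.length; j++) {
                if (j > i) {
                    candidate.append(' ');
                }
                candidate.append(tokens[j]);
                if (dict.contains(key(candidate.toString(), caseSensitive))) {
                    if (returnAll) {
                        found.add(candidate.toString());
                    }
                    longestEnd = j;
                }
            }
            if (!returnAll && longestEnd >= 0) {
                found.add(String.join(" ",
                    Arrays.copyOfRange(tokens, i, longestEnd + 1)));
                i = longestEnd + 1;
            } else {
                i++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> phrases =
            new HashSet<>(Arrays.asList("First National Bank", "Bank"));
        String text = "The First National Bank";
        System.out.println(find(phrases, text, true, true));
        System.out.println(find(phrases, text, false, true));
    }
}
```

With all matches requested, both First National Bank and Bank are reported; otherwise, only the longer phrase is kept. Passing false for the last argument lowercases both the dictionary and the text before comparing, so the first national bank would match as well.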

In the following sequence, the dictionary is initialized. We then create an instance of the ExactDictionaryChunker class using the Indo-European tokenizer, where we return all matches and ignore the case of the tokens:

initializeDictionary(); 
ExactDictionaryChunker dictionaryChunker 
    = new ExactDictionaryChunker(dictionary, 
        IndoEuropeanTokenizerFactory.INSTANCE, true, false); 

The dictionaryChunker object is used with each sentence, as shown in the following code sequence. We will use the displayChunkSet method, as developed in the Using the RegExChunker class of LingPipe section earlier in this chapter:

for (String sentence : sentences) { 
    System.out.println("\nTEXT=" + sentence); 
    displayChunkSet(dictionaryChunker, sentence); 
} 

On execution, we get the following output:

TEXT=Joe was the last person to see Fred. 
Type: PERSON Entity: [Joe] Score: 1.0
Type: PERSON Entity: [Fred] Score: 1.0
    
TEXT=He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale. 
Type: PLACE Entity: [Boston] Score: 1.0
Type: PLACE Entity: [pub] Score: 1.0
    
TEXT=Joe wanted to go to Vermont for the day to visit a cousin who works at IBM, but Sally and he had to look for Fred
Type: PERSON Entity: [Joe] Score: 1.0
Type: PLACE Entity: [Vermont] Score: 1.0
Type: ORGANIZATION Entity: [IBM] Score: 1.0
Type: PERSON Entity: [Sally] Score: 1.0
Type: PERSON Entity: [Fred] Score: 1.0  

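This finds every dictionary entity that appears in the sample sentences, but it reports nothing that is missing from the dictionary (McKenzie's, for example, is not found), and creating the dictionary for a large vocabulary requires a lot of effort.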
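One way to reduce that effort is to keep the entries in a text file rather than in code. The following plain-Java sketch assumes a hypothetical tab-separated layout of phrase, category, and an optional score; the DictionaryLoaderSketch class, its Entry holder, and the file layout are all assumptions made for illustration and are not part of LingPipe:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative loader; the tab-separated layout is an assumed format.
class DictionaryLoaderSketch {
    // Plain holder mirroring the three DictionaryEntry constructor arguments.
    static final class Entry {
        final String phrase;
        final String category;
        final double score;

        Entry(String phrase, String category, double score) {
            this.phrase = phrase;
            this.category = category;
            this.score = score;
        }
    }

    // Parses "phrase<TAB>category[<TAB>score]" lines. Blank lines and lines
    // starting with '#' are skipped; a missing score defaults to 1.0.
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String[] fields = trimmed.split("\t");
            if (fields.length < 2) {
                throw new IllegalArgumentException(
                    "Expected phrase and category: " + line);
            }
            double score = fields.length > 2
                ? Double.parseDouble(fields[2].trim()) : 1.0;
            entries.add(new Entry(fields[0].trim(), fields[1].trim(), score));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "# phrase\tcategory\tscore", "Joe\tPERSON\t1.0", "Boston\tPLACE");
        for (Entry e : parse(lines)) {
            System.out.println(e.phrase + " " + e.category + " " + e.score);
        }
    }
}
```

In practice, the lines could come from a call such as Files.readAllLines, and each parsed entry could be passed to dictionary.addEntry in a new DictionaryEntry<String>, in the same way as in the initializeDictionary method.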
