Training a model

We will use OpenNLP to demonstrate how a model is trained. The training file used must:

  • Contain <START> and <END> tags demarcating the entities
  • Have one sentence per line

We will use the following training file, named en-ner-person.train:

<START:person> Joe <END> was the last person to see <START:person> Fred <END>.  
He saw him in Boston at McKenzie's pub at 3:00 where he paid $2.45 for an ale.  
<START:person> Joe <END> wanted to go to Vermont for the day to visit a cousin who works at IBM, but <START:person> Sally <END> and he had to look for <START:person> Fred <END>. 
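To see how this annotation scheme works, the entity marks can be pulled out with a simple regular expression. The following standalone sketch is not part of OpenNLP; the class name and pattern are our own, written only to illustrate the <START:person> ... <END> format:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnnotationDemo {
    // Matches "<START:person> name <END>" and captures the name without
    // the surrounding whitespace
    private static final Pattern PERSON =
        Pattern.compile("<START:person>\\s*(.*?)\\s*<END>");

    public static List<String> extractPersons(String line) {
        List<String> names = new ArrayList<>();
        Matcher m = PERSON.matcher(line);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String line = "<START:person> Joe <END> was the last person to see "
                + "<START:person> Fred <END>.";
        System.out.println(extractPersons(line)); // [Joe, Fred]
    }
}
```

Each tagged span marks one entity of the declared type; everything outside the tags is treated as ordinary text during training.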

Several methods in this example can throw exceptions. These statements will be placed in a try-with-resources block, as shown here, where the model's output stream is created:

try (OutputStream modelOutputStream = new BufferedOutputStream( 
        new FileOutputStream(new File("modelFile")))) { 
    ... 
} catch (IOException ex) { 
    // Handle exception 
} 
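The key property of a try-with-resources block is that every resource declared in its header is closed automatically, in reverse order of declaration, even when an exception is thrown. The following standalone sketch uses only plain JDK I/O; the byte-for-byte copy is just a stand-in for the train-and-serialize logic so the resource handling can be shown:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ResourceDemo {
    // Copies the training file to the model file location and returns the
    // number of bytes written; stands in for the train-and-serialize step
    public static long copy(Path trainingFile, Path modelFile) throws IOException {
        long bytes = 0;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(trainingFile));
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(modelFile))) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
                bytes++;
            }
        } // both streams are closed here, output before input, even on exception
        return bytes;
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("train", ".txt");
        Files.write(in, "one sentence per line".getBytes("UTF-8"));
        Path out = Files.createTempFile("model", ".bin");
        System.out.println(copy(in, out)); // 21
    }
}
```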

Within the block, we create an ObjectStream<String> object using the PlainTextByLineStream class. This class's constructor takes a FileInputStream instance and returns each line of the file as a String object. The en-ner-person.train file is used as the input file, as shown here. The "UTF-8" string specifies the character encoding used:

ObjectStream<String> lineStream = new PlainTextByLineStream( 
    new FileInputStream("en-ner-person.train"), "UTF-8"); 

The lineStream object streams lines annotated with tags that delineate the entities in the text. These lines need to be converted to NameSample objects so that the model can be trained. This conversion is performed by the NameSampleDataStream class, as shown here. A NameSample object holds the names of the entities found in the text:

ObjectStream<NameSample> sampleStream =  
    new NameSampleDataStream(lineStream); 

The train method can now be executed as follows:

TokenNameFinderModel model = NameFinderME.train( 
    "en", "person", sampleStream, 
    Collections.<String, Object>emptyMap(), 100, 5);

The arguments of the method are detailed in the following table:

Parameter                                 Meaning
"en"                                      Language code
"person"                                  Entity type
sampleStream                              Sample data
Collections.<String, Object>emptyMap()    Resources (an empty map, as no additional resources are used)
100                                       Number of iterations
5                                         Cutoff

The model is then serialized to an output file:

model.serialize(modelOutputStream); 
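Once serialized, the model can later be reloaded and used to find names in tokenized text. A minimal sketch, assuming the modelFile written above exists and the OpenNLP libraries are on the classpath; the sample tokens are our own:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class UseModel {
    public static void main(String[] args) throws IOException {
        try (InputStream modelIn = new FileInputStream("modelFile")) {
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME finder = new NameFinderME(model);

            // The name finder operates on pre-tokenized text
            String[] tokens = {"Joe", "was", "looking", "for", "Fred", "."};
            Span[] spans = finder.find(tokens);
            for (String name : Span.spansToStrings(spans, tokens)) {
                System.out.println(name);
            }
        }
    }
}
```

The quality of the matches depends entirely on the training data; a model trained on only three sentences, as here, is useful for demonstration rather than production.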

The output of this sequence, shortened here to conserve space, is as follows. It provides basic information about the creation of the model:

    Indexing events using cutoff of 5
    
      Computing event counts...  done. 53 events
      Indexing...  done.
    Sorting and merging events... done. Reduced 53 events to 46.
    Done indexing.
    Incorporating indexed data for training...  
    done.
      Number of Event Tokens: 46
          Number of Outcomes: 2
        Number of Predicates: 34
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ... loglikelihood=-36.73680056967707  0.05660377358490566
      2:  ... loglikelihood=-17.499660626361216  0.9433962264150944
      3:  ... loglikelihood=-13.216835449617108  0.9433962264150944
      4:  ... loglikelihood=-11.461783667999262  0.9433962264150944
      5:  ... loglikelihood=-10.380239416084963  0.9433962264150944
      6:  ... loglikelihood=-9.570622475692486  0.9433962264150944
      7:  ... loglikelihood=-8.919945779143012  0.9433962264150944
    ...
     99:  ... loglikelihood=-3.513810438211968  0.9622641509433962
    100:  ... loglikelihood=-3.507213816708068  0.9622641509433962
  
