Training the OpenNLP POSModel

Training an OpenNLP POSModel is similar to the previous training examples. A training file is needed and should be large enough to provide a good sample set. Each sentence of the training file must be on a line by itself. Each line consists of whitespace-separated entries, where each entry is a token followed by the underscore character and then its tag.

The following training data was created using the first four sentences of Chapter 5, At A Venture, of Twenty Thousand Leagues Under the Sea. Although this is not a large sample set, it is easy to create and adequate for illustration purposes. It is saved in a file named sample.train:

    The_DT voyage_NN of_IN the_DT Abraham_NNP Lincoln_NNP was_VBD for_IN a_DT long_JJ time_NN marked_VBN by_IN no_DT special_JJ incident._NN
    But_CC one_CD circumstance_NN happened_VBD which_WDT showed_VBD the_DT wonderful_JJ dexterity_NN of_IN Ned_NNP Land,_NNP and_CC proved_VBD what_WP confidence_NN we_PRP might_MD place_VB in_IN him._PRP$ 
    The_DT 30th_JJ of_IN June,_NNP the_DT frigate_NN spoke_VBD some_DT American_NNP whalers,_, from_IN whom_WP we_PRP learned_VBD that_IN they_PRP knew_VBD nothing_NN about_IN the_DT narwhal._NN 
    But_CC one_CD of_IN them,_PRP$ the_DT captain_NN of_IN the_DT Monroe,_NNP knowing_VBG that_IN Ned_NNP Land_NNP had_VBD shipped_VBN on_IN board_NN the_DT Abraham_NNP Lincoln,_NNP begged_VBD for_IN his_PRP$ help_NN in_IN chasing_VBG a_DT whale_NN they_PRP had_VBD in_IN sight._NN
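
To make the format concrete, the following sketch splits a training line into token and tag pairs using only the standard Java library. The WordTagLineSketch class and its parse method are hypothetical helpers, not part of OpenNLP, and the sketch assumes that the last underscore in each entry separates the token from its tag:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: parses one line of word_tag training data.
final class WordTagLineSketch {

    // Returns one {token, tag} pair per whitespace-separated entry in the line.
    static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return pairs; // A blank line holds no entries
        }
        for (String entry : trimmed.split("\\s+")) {
            // Assumption: the last underscore separates the token from its tag
            int split = entry.lastIndexOf('_');
            if (split <= 0 || split == entry.length() - 1) {
                throw new IllegalArgumentException("Malformed entry: " + entry);
            }
            pairs.add(new String[] {
                entry.substring(0, split), entry.substring(split + 1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Print each token with its tag for part of the first sample sentence
        for (String[] pair : parse("The_DT voyage_NN of_IN the_DT Abraham_NNP")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

A check like this can catch malformed entries before training starts. Running it over the sample lines also shows that punctuation stays attached to the preceding word in this data, as in the entry incident._NN.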

We will demonstrate the creation of the model using the POSTaggerME class's train method and how the model can be saved to a file. We start with the declaration of the POSModel variable:

POSModel model = null;

A try-with-resources block opens the sample file:

try (InputStream dataIn = new FileInputStream("sample.train");) { 
    ... 
} catch (IOException e) { 
    // Handle exceptions 
}

An instance of the PlainTextByLineStream class is created and used with the WordTagSampleStream class to create an ObjectStream<POSSample> instance. This puts the sample data into the format required by the train method:

ObjectStream<String> lineStream =  
    new PlainTextByLineStream(dataIn, "UTF-8"); 
ObjectStream<POSSample> sampleStream =  
    new WordTagSampleStream(lineStream); 

The train method takes the language code, the sample stream, the training parameters, and two optional dictionaries: a tag dictionary and an n-gram dictionary. Neither is used here, so null is passed for both:

model = POSTaggerME.train("en", sampleStream, 
    TrainingParameters.defaultParams(), null, null); 

The output of this process is lengthy. The following output has been shortened to conserve space:

    Indexing events using cutoff of 5
    
      Computing event counts...  done. 90 events
      Indexing...  done.
    Sorting and merging events... done. Reduced 90 events to 82.
    Done indexing.
    Incorporating indexed data for training...  
    done.
      Number of Event Tokens: 82
          Number of Outcomes: 17
        Number of Predicates: 45
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ... loglikelihood=-254.98920096505964  0.14444444444444443
      2:  ... loglikelihood=-201.19283975630537  0.6
      3:  ... loglikelihood=-174.8849213436524  0.6111111111111112
      4:  ... loglikelihood=-157.58164262220754  0.6333333333333333
      5:  ... loglikelihood=-144.69272379986646  0.6555555555555556
    ...
     99:  ... loglikelihood=-33.461128002846024  0.9333333333333333
    100:  ... loglikelihood=-33.29073273669207  0.9333333333333333
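
Each iteration line can be read programmatically. The following sketch parses a line in the format shown above and converts the trailing number into a count of events. It assumes that the trailing number is the fraction of training events classified correctly in that iteration. The TrainingLogLineSketch class is a hypothetical helper and not part of OpenNLP:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP: parses one iteration line of the training log.
final class TrainingLogLineSketch {

    // Matches lines such as "  5:  ... loglikelihood=-144.69  0.6555"
    private static final Pattern ITERATION = Pattern.compile(
        "^\\s*(\\d+):\\s+\\.\\.\\.\\s+loglikelihood=(-?[0-9.Ee+-]+)\\s+([0-9.Ee+-]+)\\s*$");

    final int iteration;
    final double logLikelihood;
    final double accuracy; // Assumed: fraction of training events classified correctly

    private TrainingLogLineSketch(int iteration, double logLikelihood, double accuracy) {
        this.iteration = iteration;
        this.logLikelihood = logLikelihood;
        this.accuracy = accuracy;
    }

    static TrainingLogLineSketch parse(String line) {
        Matcher m = ITERATION.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not an iteration line: " + line);
        }
        return new TrainingLogLineSketch(
            Integer.parseInt(m.group(1)),
            Double.parseDouble(m.group(2)),
            Double.parseDouble(m.group(3)));
    }

    // Converts the accuracy fraction back into a whole number of events.
    long correctEvents(int totalEvents) {
        return Math.round(accuracy * totalEvents);
    }

    public static void main(String[] args) {
        TrainingLogLineSketch last = parse(
            "100:  ... loglikelihood=-33.29073273669207  0.9333333333333333");
        System.out.println(last.correctEvents(90) + " of 90 events");
    }
}
```

Under this reading, the final accuracy of 0.9333 over 90 events corresponds to 84 events. The figure is measured on the training data itself, so it says little about how the model will perform on unseen text.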

To save the model to a file, we use the following code. The output stream is created and the POSModel class's serialize method saves the model to the en_pos_verne.bin file:

try (OutputStream modelOut = new BufferedOutputStream( 
        new FileOutputStream(new File("en_pos_verne.bin")));) { 
    model.serialize(modelOut); 
} catch (IOException e) { 
    // Handle exceptions 
}