Training an OpenNLP POSModel is similar to the previous training examples. A training file is needed and should be large enough to provide a good sample set. Each sentence of the training file must be on a line by itself. Each line consists of whitespace-separated tokens, and each token is followed by an underscore character and then its tag.
The following training data was created using the first four sentences of Chapter 5, At a Venture, of Twenty Thousand Leagues Under the Sea. Although this is not a large sample set, it is easy to create and adequate for illustration purposes. It is saved in a file named sample.train:
The_DT voyage_NN of_IN the_DT Abraham_NNP Lincoln_NNP was_VBD for_IN a_DT long_JJ time_NN marked_VBN by_IN no_DT special_JJ incident._NN
But_CC one_CD circumstance_NN happened_VBD which_WDT showed_VBD the_DT wonderful_JJ dexterity_NN of_IN Ned_NNP Land,_NNP and_CC proved_VBD what_WP confidence_NN we_PRP might_MD place_VB in_IN him._PRP$
The_DT 30th_JJ of_IN June,_NNP the_DT frigate_NN spoke_VBD some_DT American_NNP whalers,_, from_IN whom_WP we_PRP learned_VBD that_IN they_PRP knew_VBD nothing_NN about_IN the_DT narwhal._NN
But_CC one_CD of_IN them,_PRP$ the_DT captain_NN of_IN the_DT Monroe,_NNP knowing_VBG that_IN Ned_NNP Land_NNP had_VBD shipped_VBN on_IN board_NN the_DT Abraham_NNP Lincoln,_NNP begged_VBD for_IN his_PRP$ help_NN in_IN chasing_VBG a_DT whale_NN they_PRP had_VBD in_IN sight._NN
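To make the word_tag format concrete, the following is a minimal sketch that splits one training line into word and tag pairs using only the standard library. The WordTagLine class and its parse method are hypothetical names introduced for illustration, and splitting at the last underscore is an assumption of this sketch rather than a statement about how OpenNLP parses the file:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of OpenNLP): splits a word_tag line into pairs. */
final class WordTagLine {

    /** Returns one {word, tag} array per whitespace-separated item on the line. */
    static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : line.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue; // a blank line produces a single empty item
            }
            // Assumption: the tag follows the LAST underscore in the item.
            int split = item.lastIndexOf('_');
            if (split <= 0 || split == item.length() - 1) {
                throw new IllegalArgumentException("Not in word_tag form: " + item);
            }
            pairs.add(new String[] {item.substring(0, split), item.substring(split + 1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        String line = "The_DT voyage_NN of_IN the_DT Abraham_NNP Lincoln_NNP";
        for (String[] pair : parse(line)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

A check like this can be run over a hand-built training file before training to catch items that are missing a tag.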
We will demonstrate the creation of the model using the POSTaggerME class's train method and how the model can be saved to a file. We start with the declaration of the POSModel instance variable:
POSModel model = null;
A try-with-resources block opens the sample file:
try (InputStream dataIn = new FileInputStream("sample.train")) {
    ...
} catch (IOException e) {
    // Handle exceptions
}
An instance of the PlainTextByLineStream class is created and used with the WordTagSampleStream class to create an ObjectStream<POSSample> instance. This puts the sample data into the format required by the train method:
ObjectStream<String> lineStream =
    new PlainTextByLineStream(dataIn, "UTF-8");
ObjectStream<POSSample> sampleStream =
    new WordTagSampleStream(lineStream);
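The idea of reading the training data one sentence per line can be illustrated without OpenNLP. The sketch below is a standard-library stand-in, not the PlainTextByLineStream implementation: the SentenceLines class is a hypothetical name, and skipping blank lines is an assumption made here for convenience:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in: reads UTF-8 text and returns one entry per non-blank line. */
final class SentenceLines {

    static List<String> read(InputStream in) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption of this sketch: blank lines carry no sentence and are skipped.
                if (!line.trim().isEmpty()) {
                    lines.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "A_DT whale_NN\nThey_PRP knew_VBD nothing_NN\n"
                .getBytes(StandardCharsets.UTF_8);
        for (String sentence : read(new ByteArrayInputStream(data))) {
            System.out.println(sentence);
        }
    }
}
```

Passing the character encoding explicitly, as the original code does with "UTF-8", keeps the result independent of the platform default.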
The train method uses its parameters to specify the language, the sample stream, the training parameters, and any dictionaries (none, in this case) needed, as shown here:
model = POSTaggerME.train("en", sampleStream, TrainingParameters.defaultParams(), null, null);
The output of this process is lengthy. The following output has been shortened to conserve space:
Indexing events using cutoff of 5
Computing event counts... done. 90 events
Indexing... done.
Sorting and merging events... done. Reduced 90 events to 82.
Done indexing.
Incorporating indexed data for training... done.
Number of Event Tokens: 82
Number of Outcomes: 17
Number of Predicates: 45
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-254.98920096505964 0.14444444444444443
2: ... loglikelihood=-201.19283975630537 0.6
3: ... loglikelihood=-174.8849213436524 0.6111111111111112
4: ... loglikelihood=-157.58164262220754 0.6333333333333333
5: ... loglikelihood=-144.69272379986646 0.6555555555555556
...
99: ... loglikelihood=-33.461128002846024 0.9333333333333333
100: ... loglikelihood=-33.29073273669207 0.9333333333333333
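Two figures in this output can be checked against the training data: sample.train contains 90 tokens and 17 distinct tags, which matches the 90 events and 17 outcomes reported. This suggests one training event per token and one outcome per tag, although that reading is an inference from the numbers rather than something the log states. A minimal sketch of the count follows, with the TrainingStats class and its methods as hypothetical names, and with the tag again taken to be the text after the last underscore:

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: counts tokens and distinct tags in word_tag lines. */
final class TrainingStats {

    /** Total number of whitespace-separated word_tag items across all lines. */
    static int countTokens(List<String> lines) {
        int count = 0;
        for (String line : lines) {
            for (String item : line.trim().split("\\s+")) {
                if (!item.isEmpty()) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Sorted set of the tags that occur at least once. */
    static Set<String> distinctTags(List<String> lines) {
        Set<String> tags = new TreeSet<>();
        for (String line : lines) {
            for (String item : line.trim().split("\\s+")) {
                int split = item.lastIndexOf('_'); // assumption: tag follows the last '_'
                if (split >= 0 && split < item.length() - 1) {
                    tags.add(item.substring(split + 1));
                }
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "The_DT voyage_NN of_IN the_DT Abraham_NNP Lincoln_NNP was_VBD for_IN "
            + "a_DT long_JJ time_NN marked_VBN by_IN no_DT special_JJ incident._NN");
        System.out.println("tokens: " + countTokens(lines));
        System.out.println("tags:   " + distinctTags(lines));
    }
}
```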
To save the model to a file, we use the following code. The output stream is created and the POSModel class's serialize method saves the model to the en_pos_verne.bin file:
try (OutputStream modelOut = new BufferedOutputStream(
        new FileOutputStream(new File("en_pos_verne.bin")))) {
    model.serialize(modelOut);
} catch (IOException e) {
    // Handle exceptions
}