Training an OpenNLP classification model

First, we have to train our model because OpenNLP does not have prebuilt models. This process consists of creating a file of training data and then using the DocumentCategorizerME model to perform the actual training. The model that is created
is typically saved in a file for later use.

The training file format consists of a series of lines where each line represents a document. The first word of the line is the category. The category is followed by text separated by whitespace. Here is an example of the dog category:

dog The most interesting feature of a dog is its ...

To demonstrate the training process, we created the en-animals.train file, where we created two categories: cats and dogs. For the training text, we used sections of Wikipedia. For dogs (http://en.wikipedia.org/wiki/Dog), we used the As Pets section. For cats (http://en.wikipedia.org/wiki/Cats_and_humans), we used the Pet section plus the first paragraph of the Domesticated varieties section. We also removed the numeric references from the sections.

The first part of each line is shown in the following code:

dog The most widespread form of interspecies bonding occurs ... 
dog There have been two major trends in the changing status of  ... 
dog There are a vast range of commodity forms available to  ... 
dog An Australian Cattle Dog in reindeer antlers sits on Santa's lap ... 
dog A pet dog taking part in Christmas traditions ... 
dog The majority of contemporary people with dogs describe their  ... 
dog Another study of dogs' roles in families showed many dogs have  ... 
dog According to statistics published by the American Pet Products  ... 
dog The latest study using Magnetic resonance imaging (MRI) ... 
cat Cats are common pets in Europe and North America, and their  ... 
cat Although cat ownership has commonly been associated  ... 
cat The concept of a cat breed appeared in Britain during ... 
cat Cats come in a variety of colors and patterns. These are physical  ... 
cat A natural behavior in cats is to hook their front claws periodically  ... 
cat Although scratching can serve cats to keep their claws from growing  ...

When creating training data, it is important to use a large enough sample size. The data we used is not sufficient for some kinds of analysis. However, as we will see, it does a pretty good job of identifying the categories correctly.

The DoccatModel class supports the categorization and classification of text. A model is trained using the train method based on annotated text. The train method uses a string denoting the language and an ObjectStream<DocumentSample> instance that's holding the training data. The DocumentSample instance holds the annotated text and its category.

In the following example, the en-animal.train file is used to train the model. Its input stream is used to create a PlainTextByLineStream instance, which is then converted to an ObjectStream<DocumentSample> instance. The train method is then applied. The code is enclosed in a try-with-resources block to handle exceptions. We also created an output stream that we will use to persist the model:

DoccatModel model = null; 
try (InputStream dataIn =  
            new FileInputStream("en-animal.train"); 
        OutputStream dataOut =  
            new FileOutputStream("en-animal.model");) { 
    ObjectStream<String> lineStream 
        = new PlainTextByLineStream(dataIn, "UTF-8"); 
    ObjectStream<DocumentSample> sampleStream =  
        new DocumentSampleStream(lineStream);             
    model = DocumentCategorizerME.train("en", sampleStream); 
    ... 
} catch (IOException e) { 
// Handle exceptions   
}

The output is as follows, and has been shortened for the sake of brevity:

    Indexing events using cutoff of 5
    
      Computing event counts...  done. 12 events
      Indexing...  done.
    Sorting and merging events... done. Reduced 12 events to 12.
    Done indexing.
    Incorporating indexed data for training...  
    done.
      Number of Event Tokens: 12
          Number of Outcomes: 2
        Number of Predicates: 30
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ... loglikelihood=-8.317766166719343  0.75
      2:  ... loglikelihood=-7.1439957443937265  0.75
      3:  ... loglikelihood=-6.560690872956419  0.75
      4:  ... loglikelihood=-6.106743124066829  0.75
      5:  ... loglikelihood=-5.721805583104927  0.8333333333333334
      6:  ... loglikelihood=-5.3891508904777785  0.8333333333333334
      7:  ... loglikelihood=-5.098768040466029  0.8333333333333334
    ...
     98:  ... loglikelihood=-1.4117372921765519  1.0
     99:  ... loglikelihood=-1.4052738190352423  1.0
    100:  ... loglikelihood=-1.398916120150312  1.0

The model is saved using the serialize method, as shown in the following code. The model is saved to the en-animal.model file, as opened in the previous try-with-resources block:

OutputStream modelOut = null; 
modelOut = new BufferedOutputStream(dataOut); 
model.serialize(modelOut);

Table of Contents for Training an OpenNLP classification model

Create new playlist

Sign In

Sign Up

Table of Contents for
Training an OpenNLP classification model