Training a sentence-detector model

We will use OpenNLP's SentenceDetectorME class to illustrate the training process. This class has a static train method that uses sample sentences found in a file. The method returns a model that is usually serialized to a file for later use.

Models are trained on annotated data that explicitly marks where each sentence ends. Frequently, a large file is used to provide a good sample for training purposes. Part of the file is used for training, and the rest is used to verify the model after it has been trained.

The training file used by OpenNLP consists of one sentence per line. Usually, at least 10 to 20 sample sentences are needed to avoid processing errors. To demonstrate this process, we will use a file called sentence.train. It consists of Chapter 5 of Twenty Thousand Leagues Under the Sea, by Jules Verne. The text of the book can be found at http://www.gutenberg.org/files/164/164-h/164-h.htm#chap05. The sentence.train file can be downloaded from this book's GitHub repository at https://github.com/PacktPublishing/Natural-Language-Processing-with-Java-Second-Edition.

A FileReader object is used to open the file. This object is used as the argument of the PlainTextByLineStream constructor. The resulting stream produces one string for each line of the file. This stream is used as the argument of the SentenceSampleStream constructor, which converts the sentence strings to SentenceSample objects. These objects hold the sample text along with the positions where each sentence begins. This process is shown as follows, where the statements are enclosed in a try block to handle exceptions that may be thrown by these statements:

try { 
    ObjectStream<String> lineStream = new PlainTextByLineStream( 
        new FileReader("sentence.train")); 
    ObjectStream<SentenceSample> sampleStream 
        = new SentenceSampleStream(lineStream); 
    ... 
} catch (FileNotFoundException ex) { 
    // Handle exception 
    ex.printStackTrace(); 
} catch (IOException ex) { 
    // Handle exception 
    ex.printStackTrace(); 
}

Now, the train method can be used like this:

SentenceModel model = SentenceDetectorME.train("en", 
    sampleStream, true, null, TrainingParameters.defaultParams());

The output of the method is a trained model. The parameters of this method are detailed in the following table:

Parameter                             Meaning
"en"                                  Specifies that the language of the text is English
sampleStream                          The training text stream
true                                  Specifies whether end-of-sentence tokens should be used
null                                  A dictionary of abbreviations; null means that no dictionary is used
TrainingParameters.defaultParams()    Specifies that the default training parameters should be used
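The trained model can be used right away for detection. The following is a minimal sketch that continues from the code above: it wraps the model returned by the train method in a SentenceDetectorME instance and applies it to a sample string (the sample text itself is an invented example):

```java
// Wrap the trained model in a detector and split a short passage.
// "model" is the SentenceModel returned by the train method above.
SentenceDetectorME detector = new SentenceDetectorME(model);
String paragraph = "We entered the saloon. Captain Nemo was waiting for us.";
String[] sentences = detector.sentDetect(paragraph);
for (String sentence : sentences) {
    System.out.println(sentence);
}
```

The sentDetect method returns one string per detected sentence, so a well-trained model would print the two sentences on separate lines.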

In the following sequence, an OutputStream is created and used to save the model to the modelFile file. This allows the model to be reused by other applications:

OutputStream modelStream = new BufferedOutputStream( 
    new FileOutputStream("modelFile")); 
model.serialize(modelStream); 
modelStream.close(); 
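To reuse the serialized model later, it can be read back using the SentenceModel constructor that accepts an InputStream. A minimal sketch, assuming the modelFile written above is present in the working directory (the sample text is an invented example):

```java
// Reload the serialized model for use in another application.
try (InputStream modelIn = new FileInputStream("modelFile")) {
    SentenceModel reloadedModel = new SentenceModel(modelIn);
    SentenceDetectorME detector = new SentenceDetectorME(reloadedModel);
    String[] sentences = detector.sentDetect("It was a whale. It was enormous.");
    System.out.println(sentences.length);
}
```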

The output of this process is as follows. Not all of the iterations are shown here, to save space. By default, events are indexed with a cutoff of 5 and training runs for 100 iterations:

    Indexing events using cutoff of 5
    
        Computing event counts...  done. 93 events
        Indexing...  done.
    Sorting and merging events... done. Reduced 93 events to 63.
    Done indexing.
    Incorporating indexed data for training...  
    done.
        Number of Event Tokens: 63
            Number of Outcomes: 2
          Number of Predicates: 21
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ... loglikelihood=-64.4626877920749    0.9032258064516129
      2:  ... loglikelihood=-31.11084296202819    0.9032258064516129
      3:  ... loglikelihood=-26.418795734248626    0.9032258064516129
      4:  ... loglikelihood=-24.327956749903198    0.9032258064516129
      5:  ... loglikelihood=-22.766489585258565    0.9032258064516129
      6:  ... loglikelihood=-21.46379347841989    0.9139784946236559
      7:  ... loglikelihood=-20.356036369911394    0.9139784946236559
      8:  ... loglikelihood=-19.406935608514992    0.9139784946236559
      9:  ... loglikelihood=-18.58725539754483    0.9139784946236559
     10:  ... loglikelihood=-17.873030559849326    0.9139784946236559
     ...
     99:  ... loglikelihood=-7.214933901940582    0.978494623655914
    100:  ... loglikelihood=-7.183774954664058    0.978494623655914
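As noted earlier, part of the training file can be held out to verify the model. OpenNLP provides the SentenceDetectorEvaluator class for this purpose. The following is a sketch, where testSampleStream is an assumed second ObjectStream<SentenceSample> built from the held-out sentences in the same way as sampleStream:

```java
// Evaluate the trained model against held-out samples.
// "model" is the SentenceModel produced by the train method;
// "testSampleStream" is an assumed stream of held-out samples.
SentenceDetectorME detector = new SentenceDetectorME(model);
SentenceDetectorEvaluator evaluator = new SentenceDetectorEvaluator(detector);
evaluator.evaluate(testSampleStream);
// The F-measure summarizes how well the detected sentence boundaries
// match the annotated boundaries in the held-out data.
System.out.println(evaluator.getFMeasure());
```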
  