Using the TokenizerME class

The TokenizerME class uses models created with Maximum Entropy (MaxEnt), a statistical modeling technique, to perform tokenization. The MaxEnt model is used to determine relationships between data, in our case, text. Some text sources, such as various social media, are not well formatted and use a lot of slang and special symbols, such as emoticons. A statistical tokenizer, such as one based on the MaxEnt model, improves the quality of the tokenization process.

A detailed discussion of this model is not possible here due to its complexity. A good starting point for an interested reader can be found at http://en.wikipedia.org/w/index.php?title=Multinomial_logistic_regression&redirect=no.

A TokenizerModel class hides the model and is used to instantiate the tokenizer. The model must have been previously trained. In the following example, the tokenizer is instantiated using the model found in the en-token.bin file. This model has been trained to work with common English text.

The location of the model file is returned by the getModelDir method, which you will need to implement. The returned value is dependent on where the models are stored on your system. Many of these models can be found at http://opennlp.sourceforge.net/models-1.5/.
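Since getModelDir is left to the reader to implement, a minimal sketch might look like the following. The OPENNLP_MODELS environment variable name and the fallback directory are illustrative assumptions, not part of the OpenNLP API; adjust them to match wherever you store the downloaded model files.

```java
import java.io.File;

public class ModelDirLocator {
    // A minimal sketch of the getModelDir method the book asks you to supply.
    // It reads the model directory from the (hypothetical) OPENNLP_MODELS
    // environment variable and falls back to a "models" subdirectory of the
    // current working directory when the variable is not set.
    public static File getModelDir() {
        String dir = System.getenv("OPENNLP_MODELS");
        if (dir == null) {
            dir = "models";
        }
        return new File(dir);
    }
}
```

With this in place, `new File(getModelDir(), "en-token.bin")` resolves to the model file relative to whichever directory you configured.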

After the instance of a FileInputStream class is created, the input stream is used as the argument of the TokenizerModel constructor. The tokenize method will generate an array of strings. This is followed by code to display the tokens:

try {
    InputStream modelInputStream = new FileInputStream(
        new File(getModelDir(), "en-token.bin"));
    TokenizerModel model = new TokenizerModel(modelInputStream);
    Tokenizer tokenizer = new TokenizerME(model);
    String tokens[] = tokenizer.tokenize(paragraph);
    for (String token : tokens) {
        System.out.println(token);
    }
} catch (IOException ex) {
    // Handle the exception
}

The output is as follows:

Let
's
pause
,
and
then
reflect
.