OpenNLP provides a Tokenizer interface that is implemented by three classes: SimpleTokenizer, TokenizerME, and WhitespaceTokenizer. The interface declares two methods:
- tokenize: This is passed a string to tokenize and returns an array of tokens as strings.
- tokenizePos: This is passed a string and returns an array of Span objects. The Span class is used to specify the beginning and ending offsets of the tokens.
Each of these classes is demonstrated in the following sections.
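Before turning to the individual classes, it may help to see how the two methods relate to each other. The following is a minimal sketch that uses only the Java standard library, not OpenNLP itself: the class TokenSpanSketch and its nested Offsets type are hypothetical stand-ins for a whitespace-based tokenizer and for Span, and the end offset is assumed to be exclusive. It illustrates that the strings returned by a tokenize-style method can be recovered from the offsets returned by a tokenizePos-style method.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenSpanSketch {

    // Hypothetical stand-in for Span: start offset (inclusive) and
    // end offset (assumed exclusive) of one token in the original text.
    static final class Offsets {
        final int start;
        final int end;

        Offsets(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Analogue of tokenizePos: report where each run of
    // non-whitespace characters begins and ends.
    static Offsets[] tokenizePos(String text) {
        List<Offsets> spans = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a token"
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (start >= 0) {
                    spans.add(new Offsets(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        if (start >= 0) {
            // The text ended while still inside a token.
            spans.add(new Offsets(start, text.length()));
        }
        return spans.toArray(new Offsets[0]);
    }

    // Analogue of tokenize: cut the token strings out of the text
    // using the offsets computed above.
    static String[] tokenize(String text) {
        Offsets[] spans = tokenizePos(text);
        String[] tokens = new String[spans.length];
        for (int i = 0; i < spans.length; i++) {
            tokens[i] = text.substring(spans[i].start, spans[i].end);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Mr. Smith went home.";
        for (Offsets s : tokenizePos(text)) {
            System.out.println("[" + s.start + ".." + s.end + ") "
                    + text.substring(s.start, s.end));
        }
    }
}
```

In this sketch the second token of the sample sentence, Smith, covers offsets 4 to 9. Keeping offsets rather than only strings is useful when tokens need to be mapped back to their positions in the original text, for example to highlight them.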