Comparing tokenizers

The following table briefly compares the tokenizers from the NLP APIs. The tokens that each tokenizer generates are listed under its name, and all are based on the same text: "Let's pause, and then reflect." Keep in mind that the output reflects a simple use of each class. Options not used in the examples may influence how the tokens are generated. The intent is simply to show the type of output that can be expected from the sample code and data:

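| SimpleTokenizer | WhitespaceTokenizer | TokenizerME | PTBTokenizer | DocumentPreprocessor | IndoEuropeanTokenizerFactory |
|---|---|---|---|---|---|
| Let | Let's | Let | Let | Let | Let |
| ' | pause, | 's | 's | 's | ' |
| s | and | pause | pause | pause | s |
| pause | then | , | , | , | pause |
| , | reflect. | and | and | and | , |
| and | | then | then | then | and |
| then | | reflect | reflect | reflect | then |
| reflect | | . | . | . | reflect |
| . | | | | | . |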
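The difference between the WhitespaceTokenizer and SimpleTokenizer columns comes down to where token boundaries are drawn: only at whitespace, or wherever the character class changes. The following is a minimal sketch of those two strategies using only `java.util.regex`. It does not call any of the NLP APIs in the table. The class name `TokenizerSketch`, both method names, and the regular expression are assumptions made for illustration, and the sketch is not expected to match the library tokenizers on every input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of OpenNLP, Stanford NLP, or LingPipe.
public class TokenizerSketch {

    // Assumed approximation of character-class tokenization: a token is a run
    // of letters, a run of digits, or a single symbol that is neither a
    // letter, a digit, nor whitespace.
    private static final Pattern CHAR_CLASS =
            Pattern.compile("\\p{L}+|\\p{N}+|[^\\p{L}\\p{N}\\s]");

    // Splits on runs of whitespace only, so punctuation stays attached
    // to the neighboring word.
    public static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Collects every match of the character-class pattern, in order.
    public static List<String> characterClassTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = CHAR_CLASS.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        System.out.println(whitespaceTokens(text));
        System.out.println(characterClassTokens(text));
    }
}
```

For the sample sentence, the whitespace version yields five tokens with the punctuation still attached, while the character-class version yields nine, with the apostrophe, comma, and period as separate tokens. These are the same token sequences listed under WhitespaceTokenizer and SimpleTokenizer in the table.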
