Uses of tokenizers

The output of tokenization can be used for simple tasks, such as spell checking and processing simple searches. It is also useful for various downstream NLP tasks, such as part-of-speech (POS) tagging, sentence detection, and classification. Most of the chapters that follow involve tasks that require tokenization.

Frequently, tokenization is just one step in a larger sequence of tasks. These steps are often chained together in a pipeline, as we will illustrate in the Using a pipeline section. This highlights the need for tokenizers that produce quality results for the downstream task: if the tokenizer does a poor job, every later step inherits its errors.
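
To make the effect on a downstream task concrete, the following is a minimal sketch, not a definitive implementation. It assumes a toy pipeline whose downstream task is a word count. The class name TokenizerPipelineSketch, its method names, and the sample sentence are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical illustration: the same downstream step fed by two different tokenizers.
class TokenizerPipelineSketch {

    // Naive tokenizer: splits on whitespace only, so punctuation stays attached to words.
    static List<String> naiveTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Slightly better tokenizer: splits on any run of characters that are
    // not letters, digits, or apostrophes, which drops punctuation.
    static List<String> punctuationAwareTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}\\p{Nd}']+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Downstream task: lowercase each token and count its occurrences.
    static Map<String, Integer> countWords(String text,
                                           Function<String, List<String>> tokenizer) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokenizer.apply(text)) {
            counts.merge(token.toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "The dog barked. I fed the dog.";
        System.out.println(countWords(text, TokenizerPipelineSketch::naiveTokenize));
        System.out.println(countWords(text, TokenizerPipelineSketch::punctuationAwareTokenize));
    }
}
```

With the naive tokenizer, dog and dog. are counted as two different words, so the count for dog is 1. With the punctuation-aware tokenizer, the count is 2. The counting step is identical in both runs, and only the quality of the tokens it receives has changed.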

Java offers many tokenizers and tokenization techniques. Several core Java classes were designed to support tokenization, although some of these are now outdated. A number of NLP APIs also address both simple and complex tokenization problems. The next two sections examine these approaches: first, we will see what the Java core classes have to offer, and then we will demonstrate a number of the NLP API tokenization libraries.
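
As a brief preview of the core-class approach, the following sketch tokenizes text in three ways using only the standard library: String.split, java.util.StringTokenizer, and java.text.BreakIterator. The choice of these three classes is our assumption about what "core Java classes" covers here, and the wrapper class CoreTokenizerSketch and its method names are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.StringTokenizer;

// Hypothetical wrapper class; only the JDK classes it calls are real.
class CoreTokenizerSketch {

    // String.split with a regular expression: one or more whitespace characters.
    static List<String> withSplit(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // StringTokenizer with its default delimiters (space, tab, newline,
    // carriage return, form feed). Its Javadoc describes it as a legacy class.
    static List<String> withStringTokenizer(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // BreakIterator word instance: walks locale-aware word boundaries.
    // Whitespace-only segments are skipped; punctuation segments are kept.
    static List<String> withBreakIterator(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator it = BreakIterator.getWordInstance(Locale.US);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "pause, and reflect.";
        System.out.println(withSplit(text));
        System.out.println(withStringTokenizer(text));
        System.out.println(withBreakIterator(text));
    }
}
```

The first two methods leave punctuation attached to the neighboring word, while the BreakIterator version reports the comma and the period as separate tokens. Which behavior is preferable depends on the downstream task.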
