Summary

In this chapter, we illustrated various approaches to tokenizing and normalizing text. We started with simple tokenization techniques based on core Java classes, such as the String class's split method and the StringTokenizer class. These approaches can be useful when we decide not to use a dedicated NLP API.
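As a brief reminder of the core Java techniques, the following is a minimal sketch rather than the chapter's exact code. The class name SimpleTokenizers and its two method names are hypothetical, chosen only for illustration; String.split and StringTokenizer are the standard JDK facilities mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class; the name is illustrative only.
final class SimpleTokenizers {

    // Tokenize with String.split, using a regular expression that
    // matches one or more whitespace characters as the delimiter.
    static List<String> splitTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // Tokenize with StringTokenizer, which by default breaks on
    // spaces, tabs, newlines, carriage returns, and form feeds.
    static List<String> tokenizerTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }
}
```

Both methods split only on whitespace, so punctuation stays attached to the neighboring word (for example, "pause," remains a single token). This is one reason the NLP APIs are often preferred for anything beyond simple input.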

We demonstrated how tokenization can be performed using the OpenNLP, Stanford, and LingPipe APIs. We found variations in how tokenization can be performed and options that can be applied in these APIs. A brief comparison of their output was provided.

We also discussed normalization, which can involve converting characters to lowercase, expanding abbreviations, removing stopwords, stemming, and lemmatization. We illustrated how these techniques can be applied using both core Java classes and the NLP APIs.
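Two of these normalization steps, lowercasing and stopword removal, can be sketched with core Java classes alone. The class SimpleNormalizer and its very small stopword set are assumptions made for this illustration; a real application would typically load a much fuller stopword list.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper class; the name and stopword list are illustrative only.
final class SimpleNormalizer {

    // A deliberately tiny stopword set, assumed for this sketch.
    static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "and", "of", "to", "in"));

    // Lowercase each token, then drop any token found in the stopword set.
    static List<String> normalize(List<String> tokens) {
        return tokens.stream()
                .map(token -> token.toLowerCase(Locale.ROOT))
                .filter(token -> !STOPWORDS.contains(token))
                .collect(Collectors.toList());
    }
}
```

Lowercasing is applied before the stopword check so that "The" and "AND" are matched against the lowercase entries in the set. Stemming and lemmatization are not attempted here, since they generally call for one of the NLP APIs.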

In the next chapter, Chapter 3, Finding Sentences, we will investigate the issues involved in determining the end of a sentence using various NLP APIs.
