Understanding the parts of text

Text can be divided into parts in several ways. At the character level, we may be concerned with issues such as punctuation, and we may need to ignore or expand contractions. At the word level, we may need to perform operations such as the following:

Identifying morphemes using stemming and/or lemmatization
Expanding abbreviations and acronyms
Isolating number units
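
As a rough illustration of these word-level operations, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `ABBREVIATIONS` table is a hypothetical three-entry lookup, "number units" is interpreted as a number followed by a unit of measure, and `naive_stem` is a toy suffix stripper rather than a real stemming algorithm such as Porter's.

```python
import re

# Hypothetical lookup table; a real system would need a far larger,
# domain-specific one.
ABBREVIATIONS = {
    "Dr.": "Doctor",
    "approx.": "approximately",
    "NLP": "natural language processing",
}

# A number (optionally with a decimal part) followed by a run of letters,
# e.g. "5kg" or "2.5 m". This is one possible reading of "number units".
NUMBER_UNIT = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)")


def expand_abbreviations(words):
    """Replace each word found in the lookup table with its expansion."""
    return [ABBREVIATIONS.get(w, w) for w in words]


def isolate_number_units(text):
    """Return (value, unit) pairs for every number followed by letters."""
    return [(float(n), unit) for n, unit in NUMBER_UNIT.findall(text)]


def naive_stem(word):
    """Strip one common suffix if at least three characters remain.

    This is a toy: it produces non-words for many inputs.
    """
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print(expand_abbreviations(["Dr.", "Smith"]))
print(isolate_number_units("The box weighs 5kg and is 2.5 m long"))
print(naive_stem("walking"))
```

The number pattern also matches any number that happens to be followed by an ordinary word, so a practical version would likely check the captured unit against a list of known units.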

We cannot always split words at punctuation, because the punctuation is sometimes part of the word, as in can't. We may also want to group multiple words into meaningful phrases. Sentence detection can be a factor as well, since we do not necessarily want to group words that cross sentence boundaries.
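
The points above can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the regular expressions, the function names, and the rule that a sentence ends at `.`, `!`, or `?` followed by whitespace are all illustrative choices, and the sentence rule will fail on abbreviations such as "Dr.".

```python
import re

# A word is a run of word characters, optionally continued by an internal
# apostrophe plus more word characters, so "can't" stays a single token.
# Any other non-space character becomes a token of its own.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def split_sentences(text):
    """Naively split after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def bigrams_within_sentences(text):
    """Pair adjacent words, never pairing across a sentence boundary."""
    pairs = []
    for sentence in split_sentences(text):
        # Keep only tokens that start with a letter or digit.
        words = [t for t in tokenize(sentence) if t[0].isalnum()]
        pairs.extend(zip(words, words[1:]))
    return pairs


print(tokenize("We can't stop."))
print(bigrams_within_sentences("It rained. We left."))
```

Because the pairs are built one sentence at a time, the last word of one sentence is never grouped with the first word of the next.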

In this chapter, we are primarily concerned with the tokenization process and a few specialized techniques, such as stemming. We will not attempt to show how they are used in other NLP tasks. Those efforts are reserved for later chapters.
