Finding parts of text

Text can be decomposed into several types of elements, such as words, sentences, and paragraphs, and there are several ways of classifying these elements. When we refer to parts of text in this book, we are referring to words, sometimes called tokens. Morphology is the study of the structure of words, and we will use a number of morphology terms in our exploration of NLP. Words themselves can be classified in many ways, including the following:

Simple words: These are words in the everyday sense of the term, including the 17 words in this sentence.
Morphemes: These are the smallest units of a word that are meaningful. For example, in the word bounded, bound is considered to be a morpheme. Morphemes also include parts such as the suffix ed.
Prefix/suffix: These precede or follow the root of a word. For example, in the word graduation, ation is a suffix added to the root word graduate.
Synonyms: These are words that have the same meaning as another word. Words such as small and tiny can be recognized as synonyms. Deciding whether two words are synonyms in a given context requires word-sense disambiguation.
Abbreviations: These are shortened forms of a word. Instead of using Mister Smith, we use Mr. Smith.
Acronyms: These are used extensively in many fields, including computer science. They use a combination of letters for phrases such as FORmula TRANslation for FORTRAN. They can be recursive, such as GNU. Of course, the one we will continue to use is NLP.
Contractions: We'll find these useful for commonly used combinations of words, such as the first word of this sentence.
Numbers: These are specialized words that normally use only digits. However, more complex versions can include a period and a special character to reflect scientific notation or numbers of a specific base.
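To make the last category concrete, the following is a minimal sketch of how a token might be tested for being a number. The NumberTokens class, its isNumber method, and the two regular expressions are hypothetical and not taken from any library; they assume that only plain decimal, scientific-notation, and 0x-prefixed hexadecimal forms need to be recognized, which will not cover every number format found in real text.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration: recognizes a few common number formats.
class NumberTokens {
    // Optional sign, digits, optional fraction, optional exponent (123, 3.14, 6.02e23)
    private static final Pattern DECIMAL =
        Pattern.compile("[+-]?\\d+(\\.\\d+)?([eE][+-]?\\d+)?");

    // Assumed convention for a number of a specific base: a 0x-prefixed hexadecimal value
    private static final Pattern HEX = Pattern.compile("0[xX][0-9a-fA-F]+");

    // Returns true only when the whole token matches one of the patterns
    static boolean isNumber(String token) {
        return DECIMAL.matcher(token).matches() || HEX.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"123", "3.14", "6.02e23", "0x1F", "Washington", "avenue."};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isNumber(sample));
        }
    }
}
```

Because matches() requires the entire token to fit the pattern, a token such as avenue. or 123. (with trailing punctuation still attached) is not reported as a number. This is one reason why punctuation usually needs to be handled before tokens are classified.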

Identifying these parts is useful for other NLP tasks. For example, to determine the boundaries of a sentence, it is necessary to break the text apart and determine which elements terminate a sentence.
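The difficulty of finding sentence terminators can be illustrated with a deliberately naive sketch. The hypothetical TerminatorCandidates class splits the text on whitespace and reports every element that ends with a period, question mark, or exclamation point. It is an assumption-laden illustration of the problem, not a workable sentence detector.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration: a naive search for sentence terminators.
class TerminatorCandidates {
    // Returns the whitespace-separated elements whose last character is '.', '?' or '!'
    static List<String> candidates(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty or all-whitespace input yields one empty element
            }
            char last = token.charAt(token.length() - 1);
            if (last == '.' || last == '?' || last == '!') {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(candidates("Mr. Smith went to 123 Washington avenue."));
        // prints [Mr., avenue.]
    }
}
```

Only avenue. actually ends the sentence. The abbreviation Mr. is a false positive, which shows why the parts of the text need to be identified before sentence boundaries can be determined reliably.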

The process of breaking text apart is called tokenization. The result is a stream of tokens. The elements of the text that determine where it should be split are called delimiters. For most English text, whitespace is used as a delimiter. Whitespace typically includes blanks, tabs, and newline characters.

Tokenization can be simple or complex. Here, we will demonstrate a simple tokenization using the String class's split method. First, declare a string to hold the text that is to be tokenized:

String text = "Mr. Smith went to 123 Washington avenue.";

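The split method takes a regular expression argument that specifies how the text should be split. In the following code sequence, the regular expression is \s+, which specifies that one or more whitespace characters will be used as the delimiter. In Java source code, the backslash must itself be escaped inside a string literal, so the argument is written as "\\s+":

String[] tokens = text.split("\\s+");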

A for-each statement is used to display the resulting tokens:

for (String token : tokens) {
  System.out.println(token);
}

When executed, the output will appear as shown here:

Mr.
Smith
went
to
123
Washington
avenue.
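The fragments above can be assembled into a complete program, sketched here under a few assumptions. The SimpleTokenizer class and its tokenize method are hypothetical names, and the sample text has been altered to contain leading blanks, a tab, and a newline to show that \s+ treats all of them as delimiters. The text is trimmed first because split returns an empty first token when the string begins with a delimiter.

```java
// Hypothetical wrapper for illustration: whitespace tokenization with String.split.
class SimpleTokenizer {
    // Splits text on runs of whitespace (blanks, tabs, newlines)
    static String[] tokenize(String text) {
        // Remove leading and trailing whitespace so no empty first token is produced
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        // "\\s+" is the Java string literal for the regular expression \s+
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String text = "  Mr. Smith\twent to\n123 Washington avenue.";
        for (String token : tokenize(text)) {
            System.out.println(token);
        }
    }
}
```

Running this sketch produces the same seven tokens as before. Note that punctuation stays attached to its neighboring word, as in Mr. and avenue., because only whitespace is treated as a delimiter.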

In Chapter 2, Finding Parts of Text, we will explore the tokenization process in depth.
