What is tokenization?

Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Tokens are split using a set of delimiters, most frequently whitespace characters. In Java, whitespace is defined by the Character class's isWhitespace method; the characters it recognizes are listed in the following table. At times, however, a different set of delimiters is needed. For example, alternative delimiters can be useful when whitespace obscures text breaks that we need to detect, such as paragraph boundaries:

Character   Meaning
            Unicode space character (SPACE_SEPARATOR, LINE_SEPARATOR, or PARAGRAPH_SEPARATOR)
\t          U+0009 horizontal tabulation
\n          U+000A line feed
\u000B      U+000B vertical tabulation
\f          U+000C form feed
\r          U+000D carriage return
\u001C      U+001C file separator
\u001D      U+001D group separator
\u001E      U+001E record separator
\u001F      U+001F unit separator

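As a minimal sketch of delimiter-based tokenization using the JDK alone (the sample sentence and class name here are illustrative), the String class's split method breaks text on runs of whitespace. Note that the regex class \s covers a slightly different set of characters than Character.isWhitespace, so the two are close but not identical:

import java.util.Arrays;

public class WhitespaceTokenizer {
    public static void main(String[] args) {
        String text = "Mr. Smith went to 123 Washington avenue.";
        // Split on runs of whitespace; punctuation stays attached to tokens
        String[] tokens = text.split("\\s+");
        System.out.println(Arrays.toString(tokens));
        // [Mr., Smith, went, to, 123, Washington, avenue.]

        // Character.isWhitespace tests individual characters
        System.out.println(Character.isWhitespace('\t'));  // true
    }
}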
The tokenization process is complicated by a large number of factors, such as the following:

  • Language: Different languages present unique challenges. Whitespace is a commonly used delimiter, but it is not sufficient for languages such as Chinese, which does not use it.
  • Text format: Text is often stored or presented in different formats. Processing simple text differs from processing HTML or other markup, which complicates tokenization.
  • Stopwords: Commonly used words may not be important for some NLP tasks, such as general searches. These common words are called stopwords, and they are sometimes removed when they do not contribute to the task at hand. They include words such as a, and, and she.
  • Text expansion: It is sometimes desirable to expand acronyms and abbreviations so that subsequent processing can produce better-quality results. For example, if a search is interested in the word machine, knowing that IBM stands for International Business Machines can be useful.
  • Case: The case of a word (upper or lower) may be significant in some situations. For example, a word's case can help identify proper nouns. When identifying parts of text, converting everything to the same case can simplify searches.
  • Stemming and lemmatization: These processes reduce words to their root forms. Stemming heuristically strips affixes, while lemmatization maps a word to its dictionary form (its lemma). A stemming sketch follows this list.
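The following sketch illustrates the stemming mentioned in the last point, assuming OpenNLP's opennlp-tools library (and its PorterStemmer class) is on the classpath; the sample words are illustrative. Note that a stem is not always a dictionary word:

import opennlp.tools.stemmer.PorterStemmer;

public class StemmingDemo {
    public static void main(String[] args) {
        PorterStemmer stemmer = new PorterStemmer();
        String[] words = {"running", "runs", "ran", "easily"};
        for (String word : words) {
            // The Porter algorithm heuristically strips suffixes
            System.out.println(word + " -> " + stemmer.stem(word));
        }
        // running -> run, runs -> run, ran -> ran, easily -> easili
    }
}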

Removing stopwords can save space in an index and make the indexing process faster. However, some engines do not remove stopwords because they can be useful for certain queries. For example, an exact-match query will miss results if its stopwords have been removed. Named-entity recognition (NER) also often depends on stopword inclusion: recognizing that Romeo and Juliet is a play depends on the presence of the word and.
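A stopword filter can be as simple as a set-membership test. The following sketch uses a tiny illustrative stopword set; real lists, such as those referenced below, are much longer:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopwordFilter {
    // A tiny illustrative stopword set; real lists are much longer
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    public static void main(String[] args) {
        String[] tokens = "the cat sat on a mat".split("\\s+");
        List<String> filtered = Arrays.stream(tokens)
                .filter(token -> !STOPWORDS.contains(token.toLowerCase()))
                .collect(Collectors.toList());
        System.out.println(filtered);  // [cat, sat, on, mat]
    }
}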

There are many lists that define stopwords, and what constitutes a stopword sometimes depends on the problem domain. A list of stopwords can be found at http://www.ranks.nl/stopwords. It includes several categories of English stopwords as well as stopwords for languages other than English. At http://www.textfixer.com/resources/common-english-words.txt, you will find a comma-separated list of English stopwords.

The top 10 stopwords, adapted from Stanford (http://library.stanford.edu/blogs/digital-library-blog/2011/12/stopwords-searchworks-be-or-not-be), can be found in the following table:

Stopword   Occurrences
the        7,578
of         6,582
and        4,106
in         2,298
a          1,137
to         1,033
for        695
on         685
an         289
with       231

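Counts like these come from a simple frequency tally over a corpus. Here is a minimal sketch of such a tally, with a short string standing in for a real corpus:

import java.util.HashMap;
import java.util.Map;

public class FrequencyTally {
    public static void main(String[] args) {
        String corpus = "the cat and the dog and the bird";
        Map<String, Integer> counts = new HashMap<>();
        for (String token : corpus.toLowerCase().split("\\s+")) {
            counts.merge(token, 1, Integer::sum);  // increment this token's count
        }
        System.out.println(counts);
        // {the=3, and=2, cat=1, dog=1, bird=1} (HashMap iteration order may vary)
    }
}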
We will focus on the techniques used to tokenize English text. This usually involves using whitespace or other delimiters to return a list of tokens.
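Beyond splitting on delimiters, the JDK's java.text.BreakIterator locates word boundaries using Unicode rules, which handles punctuation and contractions more gracefully than a whitespace split. A minimal sketch with an illustrative sentence:

import java.text.BreakIterator;

public class BreakIteratorDemo {
    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";
        BreakIterator iterator = BreakIterator.getWordInstance();
        iterator.setText(text);
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            String span = text.substring(start, end);
            // Boundaries also bracket spaces and punctuation; keep only
            // spans containing at least one letter or digit
            if (span.chars().anyMatch(Character::isLetterOrDigit)) {
                System.out.println(span);  // Let's, pause, and, then, reflect
            }
        }
    }
}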

Parsing is closely related to tokenization. Both are concerned with identifying parts of text, but parsing also identifies the parts of speech and their relationships to each other.