What is tokenization?

Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Tokens are split based on a set of delimiters, which are frequently whitespace characters. At times, however, a different set of delimiters may be needed. For example, different delimiters can be useful when whitespace delimiters obscure text breaks, such as paragraph boundaries, and detecting these text breaks is important. Whitespace in Java is defined by the Character class's isWhitespace method, and the characters it recognizes are listed in the following table:

Character	Meaning
Unicode space character	(space_separator, line_separator, or paragraph_separator)
`\t`	U+0009 horizontal tabulation
`\n`	U+000A line feed
`\u000B`	U+000B vertical tabulation
`\f`	U+000C form feed
`\r`	U+000D carriage return
`\u001C`	U+001C file separator
`\u001D`	U+001D group separator
`\u001E`	U+001E record separator
`\u001F`	U+001F unit separator
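
To make the table concrete, the following minimal sketch checks a few characters with Character.isWhitespace. The WhitespaceDemo class and its countWhitespace helper are hypothetical names introduced only for illustration. Note that the non-breaking space (U+00A0) is a Unicode space separator, yet isWhitespace returns false for it:

```java
final class WhitespaceDemo {
    // Counts the characters in the text that Java treats as whitespace.
    // (Hypothetical helper, not part of any library.)
    static int countWhitespace(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Horizontal tabulation
        System.out.println(Character.isWhitespace('\t'));      // true
        // File separator, one of the four separator control characters
        System.out.println(Character.isWhitespace('\u001C'));  // true
        // Non-breaking space is excluded
        System.out.println(Character.isWhitespace('\u00A0'));  // false
        // One space, one tab, and one line feed
        System.out.println(countWhitespace("a b\tc\nd"));      // 3
    }
}
```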

The tokenization process is complicated by a large number of factors, such as the following:

Language: Different languages present unique challenges. Whitespace is a commonly used delimiter, but it is not sufficient for languages such as Chinese, where words are not separated by whitespace.
Text format: Text is often stored or presented in different formats. Processing HTML or other markup, as opposed to simple text, complicates the tokenization process.
Stopwords: Commonly used words might not be important for some NLP tasks, such as general searches. These common words are called stopwords, and they are sometimes removed when they do not contribute to the NLP task at hand. They can include words such as a, and, and she.
Text-expansion: For acronyms and abbreviations, it is sometimes desirable to expand them so that postprocesses can produce better-quality results. For example, if a search is interested in the word machine, knowing that IBM stands for International Business Machines can be useful.
Case: The case of a word (upper or lower) may be significant in some situations. For example, the case of a word can help identify proper nouns. When identifying the parts of text, converting everything to the same case can simplify searches.
Stemming and lemmatization: These processes reduce words to their root forms. For example, a stemmer might reduce running to run.
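
Two of the factors above, text expansion and case, can be sketched in a few lines. This is only an illustration under stated assumptions: the TokenNormalizer class, its EXPANSIONS table, and the normalize method are hypothetical names, the expansion table holds a single entry taken from the IBM example, and Map.of requires Java 9 or later:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TokenNormalizer {
    // Hypothetical acronym table with one entry; a real one would be
    // much larger and probably domain-specific.
    static final Map<String, String> EXPANSIONS =
            Map.of("IBM", "International Business Machines");

    // Expands known acronyms, then converts every token to lowercase.
    static List<String> normalize(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            // Look up the acronym before lowercasing, since the table is case-sensitive.
            String expanded = EXPANSIONS.getOrDefault(token, token);
            for (String part : expanded.split(" ")) {
                // Locale.ROOT avoids locale-specific surprises in case mapping.
                result.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalize(List.of("IBM", "builds", "Machines")));
        // [international, business, machines, builds, machines]
    }
}
```

After normalization, a search for machines matches both the expanded acronym and the capitalized token. The cost is that case information, which might have identified a proper noun, is lost.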

Removing stopwords can save space in an index and make the indexing process faster. However, some engines do not remove stopwords because they can be useful for certain queries. For example, when performing an exact match, removing stopwords will result in misses. Also, the named-entity recognition (NER) task often depends on stopword inclusion. Recognizing that Romeo and Juliet is a play depends on the inclusion of the word and.
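
The trade-off can be seen in a minimal sketch of stopword removal. The StopwordFilter class and its tiny STOPWORDS set are hypothetical and exist only for illustration; real stopword lists are much longer:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class StopwordFilter {
    // A tiny illustrative stopword list (an assumption, not a standard list).
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "and", "she", "the", "of"));

    // Returns the tokens that are not stopwords, comparing case-insensitively.
    static List<String> removeStopwords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOPWORDS.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(removeStopwords(Arrays.asList("Romeo", "and", "Juliet")));
        // [Romeo, Juliet]
    }
}
```

Once and has been dropped, an exact-match search for the title Romeo and Juliet can no longer succeed against the filtered tokens.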

There are many lists that define stopwords. Sometimes, what constitutes a stopword is dependent on the problem domain. A list of stopwords can be found at http://www.ranks.nl/stopwords. It lists a few categories of English stopwords and stopwords for languages other than English. At http://www.textfixer.com/resources/common-english-words.txt, you will find a comma-separated list of English stopwords.

The top-10 stopwords adapted from Stanford (http://library.stanford.edu/blogs/digital-library-blog/2011/12/stopwords-searchworks-be-or-not-be) can be found in the following table:

Stopword	Occurrences
the	7,578
of	6,582
and	4,106
in	2,298
a	1,137
to	1,033
for	695
on	685
an	289
with	231

We will focus on the techniques used to tokenize English text. This usually involves using whitespace or other delimiters to return a list of tokens.
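
As a first, minimal sketch of this idea, the following code splits a string with String.split and a regular expression for the delimiters. The SimpleTokenizer class and its tokenize method are hypothetical names used for illustration:

```java
import java.util.ArrayList;
import java.util.List;

final class SimpleTokenizer {
    // Splits the text on the given delimiter regex and drops empty pieces,
    // which String.split can produce when the text starts with a delimiter.
    static List<String> tokenize(String text, String delimiterRegex) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split(delimiterRegex)) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Let's pause, and then reflect.";

        // Whitespace only: punctuation stays attached to the words.
        System.out.println(tokenize(text, "\\s+"));

        // Whitespace, commas, and periods as delimiters.
        System.out.println(tokenize(text, "[\\s,.]+"));
        // [Let's, pause, and, then, reflect]
    }
}
```

With whitespace alone, the tokens pause, and reflect. keep their trailing punctuation, which is why a different set of delimiters is sometimes needed.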

Parsing is closely related to tokenization. They are both concerned with identifying parts of text, but parsing is also concerned with identifying the parts of speech and their relationship to each other.
