Tokenization is the process of breaking text down into simpler units. For most text, we are concerned with isolating words. Text is split into tokens based on a set of delimiters, which are frequently whitespace characters. Whitespace in Java is defined by the Character class's isWhitespace method. At times, however, a different set of delimiters is needed. For example, whitespace delimiters can obscure text breaks such as paragraph boundaries, so a different delimiter set is useful when detecting those breaks is important. The characters that isWhitespace treats as whitespace are listed in the following table:
| Character | Meaning |
| --- | --- |
| Unicode space character | (space_separator, line_separator, or paragraph_separator) |
| \t | U+0009 horizontal tabulation |
| \n | U+000A line feed |
| \u000B | U+000B vertical tabulation |
| \f | U+000C form feed |
| \r | U+000D carriage return |
| \u001C | U+001C file separator |
| \u001D | U+001D group separator |
| \u001E | U+001E record separator |
| \u001F | U+001F unit separator |
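As a quick check of the whitespace definition, the following sketch calls Character.isWhitespace on the listed characters and then splits a sample string on blank lines instead of on every whitespace character. The class name DelimiterDemo, the helper methods, the blank-line convention for paragraphs, and the sample text are illustrative assumptions, not part of any library.

```java
import java.util.Arrays;
import java.util.List;

class DelimiterDemo {

    // Returns true when every character in the string is Java whitespace.
    static boolean allWhitespace(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isWhitespace(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Splits text into paragraphs, assuming paragraphs are separated by
    // one or more blank lines. Whitespace inside a paragraph is kept.
    static List<String> splitParagraphs(String text) {
        return Arrays.asList(text.trim().split("\\R\\s*\\R"));
    }

    public static void main(String[] args) {
        // Space plus the control characters listed as whitespace.
        String listed = " \t\n\u000B\f\r\u001C\u001D\u001E\u001F";
        System.out.println(allWhitespace(listed)); // true

        String text = "First paragraph,\nline two.\n\nSecond paragraph.";
        // Splitting on all whitespace hides where the paragraph break was.
        System.out.println(Arrays.toString(text.split("\\s+")));
        // Splitting on blank lines keeps each paragraph as one unit.
        System.out.println(splitParagraphs(text));
    }
}
```

Splitting on blank lines is only one possible choice of delimiter; text from other sources may mark paragraphs differently.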
The tokenization process is complicated by a large number of factors, such as the following:
- Language: Different languages present unique challenges. Whitespace is a commonly used delimiter, but it is not sufficient for a language such as Chinese, which does not separate words with whitespace.
- Text format: Text is often stored or presented in different formats. Processing HTML or other markup is more involved than processing plain text, which complicates tokenization.
- Stopwords: Commonly used words might not be important for some NLP tasks, such as general searches. These common words are called stopwords. Stopwords are sometimes removed when they do not contribute to the NLP task at hand. These can include words such as a, and, and she.
- Text-expansion: For acronyms and abbreviations, it is sometimes desirable to expand them so that postprocesses can produce better-quality results. For example, if a search is interested in the word machine, knowing that IBM stands for International Business Machines can be useful.
- Case: The case of a word (upper or lower) may be significant in some situations. For example, the case of a word can help identify proper nouns. When identifying the parts of text, conversion to the same case can be useful in simplifying searches.
- Stemming and lemmatization: These processes will alter the words to get to their roots.
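Two of the factors above, text expansion and case, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the class name NormalizeDemo, the normalize method, and the one-entry expansion table are hypothetical, and a real system would load a much larger table.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class NormalizeDemo {

    // Hypothetical expansion table, used only for illustration.
    static final Map<String, String> EXPANSIONS = new HashMap<>();
    static {
        EXPANSIONS.put("IBM", "International Business Machines");
    }

    // Expands known acronyms, then converts every token to lowercase.
    static List<String> normalize(String[] tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            // Expand before lowercasing so the lookup sees the original case.
            String expanded = EXPANSIONS.getOrDefault(token, token);
            for (String part : expanded.split(" ")) {
                out.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] tokens = {"IBM", "builds", "Machines"};
        // After normalization, a search for "machines" matches twice.
        System.out.println(normalize(tokens));
    }
}
```

Lowercasing everything simplifies matching but discards the capitalization that could help identify proper nouns, so whether to apply it depends on the task.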
Removing stopwords can save space in an index and make the indexing process faster. However, some engines do not remove stopwords because they can be useful for certain queries. For example, when performing an exact match, removing stopwords will result in misses. The NER task also often depends on stopword inclusion: recognizing that Romeo and Juliet is a play depends on the inclusion of the word and.
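The trade-off can be seen in a small sketch. The class name StopwordDemo, the removeStopwords method, and the five-word stopword set are assumptions made for illustration; they are not a complete list or a library API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

class StopwordDemo {

    // A tiny illustrative stopword set, not a complete list.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "and", "the", "of", "she"));

    // Keeps only the tokens that are not in the stopword set,
    // comparing in lowercase so "And" and "and" are treated alike.
    static List<String> removeStopwords(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOPWORDS.contains(t.toLowerCase(Locale.ROOT)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> title = Arrays.asList("Romeo", "and", "Juliet");
        List<String> kept = removeStopwords(title);
        System.out.println(kept); // [Romeo, Juliet]

        // An exact-phrase lookup against the filtered tokens now misses.
        String indexed = String.join(" ", kept);
        System.out.println(indexed.contains("Romeo and Juliet")); // false
    }
}
```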
The top ten stopwords, adapted from Stanford (http://library.stanford.edu/blogs/digital-library-blog/2011/12/stopwords-searchworks-be-or-not-be), are listed in the following table:
| Stopword | Occurrences |
| --- | --- |
| the | 7,578 |
| of | 6,582 |
| and | 4,106 |
| in | 2,298 |
| a | 1,137 |
| to | 1,033 |
| for | 695 |
| on | 685 |
| an | 289 |
| with | 231 |
We will focus on the techniques used to tokenize English text. This usually involves using whitespace or other delimiters to return a list of tokens.
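A minimal sketch of this idea, using only the standard String.split method, is shown here. The class name WhitespaceTokenizer and the tokenize method are assumptions for illustration, and the approach is one simple option rather than a recommended tokenizer.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class WhitespaceTokenizer {

    // Splits text on runs of whitespace and returns the tokens as a list.
    // In a Java regex, \s covers space, tab, line feed, vertical tab,
    // form feed, and carriage return by default.
    static List<String> tokenize(String text) {
        // Trim first: a leading delimiter would otherwise produce
        // an empty first token.
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        // Punctuation stays attached to the neighboring word.
        System.out.println(tokenize("Let's pause, and then reflect."));
    }
}
```

Because only whitespace is treated as a delimiter, tokens such as "pause," keep their punctuation. Adding punctuation to the delimiter set is one way to separate it, at the cost of also breaking tokens like "Let's".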