Tokenization

Given a text sequence, tokenization is the task of breaking it into fragments, which can be words, characters, or sentences. Certain characters, such as punctuation marks, digits, and emoticons, are sometimes removed in the process. These fragments are the so-called tokens used for further processing. In computational linguistics, a token composed of one word is also called a unigram; two consecutive words form a bigram, three consecutive words a trigram, and n consecutive words an n-gram. Here is an example of tokenization:

We can implement word-based tokenization using the word_tokenize function in NLTK. As an example, we will use an input text that spans three lines: 'I am reading a book.', then 'It is Python Machine Learning By Example,', and finally '2nd edition.'. The following commands show this:

>>> from nltk.tokenize import word_tokenize
>>> sent = '''I am reading a book.
... It is Python Machine Learning By Example,
... 2nd edition.'''
>>> print(word_tokenize(sent))
['I', 'am', 'reading', 'a', 'book', '.', 'It', 'is', 'Python', 'Machine', 'Learning', 'By', 'Example', ',', '2nd', 'edition', '.']

Word tokens are obtained.

The word_tokenize function keeps punctuation marks and digits as tokens, and only discards whitespace characters, including newlines.
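As mentioned earlier, consecutive word tokens can be combined into n-grams. Here is a minimal sketch using NLTK's ngrams utility to build bigrams from the tokens we just obtained (only the first three bigrams are printed):

>>> from nltk.util import ngrams
>>> tokens = word_tokenize(sent)
>>> # Each bigram is a tuple of two consecutive word tokens
>>> print(list(ngrams(tokens, 2))[:3])
[('I', 'am'), ('am', 'reading'), ('reading', 'a')]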

You might think word tokenization is simply splitting a sentence by spaces and punctuation. Here is an interesting example showing that tokenization is more complex than that:

>>> sent2 = 'I have been to U.K. and U.S.A.'
>>> print(word_tokenize(sent2))
['I', 'have', 'been', 'to', 'U.K.', 'and', 'U.S.A', '.']

The tokenizer correctly recognizes 'U.K.' and 'U.S.A' as single tokens, instead of splitting them into 'U', '.', 'K', and so on.
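For comparison, here is a minimal sketch of a naive approach that simply splits on word characters and punctuation with a regular expression; as you can see, it breaks the abbreviations apart:

>>> import re
>>> # Match runs of word characters, or single punctuation characters
>>> print(re.findall(r"\w+|[^\w\s]", sent2))
['I', 'have', 'been', 'to', 'U', '.', 'K', '.', 'and', 'U', '.', 'S', '.', 'A', '.']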

spaCy also has an outstanding tokenization feature. It relies on trained pipeline models that are regularly updated. Assuming spaCy itself is already installed (for example, via pip install spacy), we can download its small English model, en_core_web_sm, with the following command:

python -m spacy download en_core_web_sm

Then, we'll load the en_core_web_sm model and parse sent2 using this model:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> tokens2 = nlp(sent2)
>>> print([token.text for token in tokens2])
['I', 'have', 'been', 'to', 'U.K.', 'and', 'U.S.A.']
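As a quick check, we can also run the spaCy pipeline on the earlier multi-line text. Note that spaCy may keep whitespace-only tokens (such as newline characters), so the sketch below filters them out with the is_space attribute; the result should contain word and punctuation tokens similar to the NLTK output above, though it is not guaranteed to match exactly:

>>> tokens1 = nlp(sent)
>>> # Drop whitespace-only tokens such as '\n'
>>> print([token.text for token in tokens1 if not token.is_space])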

We can also segment text into sentences. For example, applying the sent_tokenize function from NLTK to the same input text, we have the following commands:

>>> from nltk.tokenize import sent_tokenize
>>> print(sent_tokenize(sent))
['I am reading a book.', 'It is Python Machine Learning By Example,\n2nd edition.']

Two sentence-based tokens are returned, as there are two sentences in the input text; the newline after the comma does not start a new sentence.
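spaCy can segment sentences as well, since the en_core_web_sm pipeline provides sentence boundaries on the parsed Doc object. Here is a minimal sketch using the sents attribute of the Doc; the result should be the same two sentences, although whitespace handling may differ slightly:

>>> doc = nlp(sent)
>>> # Iterate over the sentence spans detected by the pipeline
>>> print([s.text for s in doc.sents])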
