The corpora that are part of the NLTK distribution are already tokenized, so we can easily get lists of words and sentences. For our own corpora, we should apply tokenization too. This recipe demonstrates how to implement tokenization with NLTK. The text file we will use is in this book's code bundle. This particular text is in English, but NLTK supports other languages too.
The program is in the tokenizing.py file in this book's code bundle:
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
import dautil as dl

fname = '46_bbc_world.txt'
printer = dl.log_api.Printer(nelems=3)

with open(fname, "r", encoding="utf-8") as txt_file:
    txt = txt_file.read()
    printer.print('Sentences', sent_tokenize(txt))
    printer.print('Words', word_tokenize(txt))
Refer to the following screenshot for the end result:
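To experiment with tokenization without dautil or the bundled text file, the following is a minimal sketch that assumes only NLTK is installed. It is not the recipe's program: it swaps in PunktSentenceTokenizer with its default, untrained parameters and the regex-based wordpunct_tokenize, neither of which needs downloaded model data, whereas sent_tokenize and word_tokenize rely on NLTK's pretrained Punkt data. The sample text is made up for illustration.

```python
from nltk.tokenize import PunktSentenceTokenizer, wordpunct_tokenize

# Hypothetical sample text; the recipe reads its text from a file instead.
txt = "Markets fell on Monday. Analysts expect a rebound."

# Untrained Punkt tokenizer: knows no abbreviations, so it treats
# every sentence-final period followed by whitespace as a boundary.
sentences = PunktSentenceTokenizer().tokenize(txt)

# Regex-based word tokenizer: splits into runs of word characters
# and runs of punctuation.
words = wordpunct_tokenize(txt)

# Show only the first three items, similar in spirit to nelems=3 above.
print('Sentences', sentences[:3])
print('Words', words[:3])
```

Because the untrained tokenizer has no abbreviation list, text containing abbreviations such as "Dr." or "U.S." may be split too eagerly; the pretrained model behind sent_tokenize is likely the better choice for real corpora.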