Tokenizing news articles into sentences and words

The corpora that ship with the NLTK distribution are already tokenized, so we can easily retrieve lists of words and sentences. For our own corpora, we have to apply tokenization ourselves. This recipe demonstrates how to tokenize text with NLTK. The text file we will use is in this book's code bundle; this particular text is in English, but NLTK supports other languages too.

Getting ready

Install NLTK, following the instructions in the Introduction section of this chapter.
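The sentence and word tokenizers in this recipe rely on NLTK's pretrained Punkt models, which are downloaded separately from the library itself; if they are missing, sent_tokenize() raises a LookupError. The following one-time setup (a minimal sketch) fetches them:

    import nltk

    # Download the Punkt tokenizer models used by sent_tokenize()
    # and word_tokenize(); this only needs to be done once.
    nltk.download('punkt')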

How to do it...

The program is in the tokenizing.py file in this book's code bundle:

  1. The imports are as follows:
    from nltk.tokenize import sent_tokenize
    from nltk.tokenize import word_tokenize
    import dautil as dl
  2. The following code demonstrates tokenization:
    fname = '46_bbc_world.txt'
    printer = dl.log_api.Printer(nelems=3)
    
    with open(fname, "r", encoding="utf-8") as txt_file:
        txt = txt_file.read()
        printer.print('Sentences', sent_tokenize(txt))
        printer.print('Words', word_tokenize(txt))
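If the dautil helper package or the 46_bbc_world.txt file is not at hand, the following standalone sketch shows the same two tokenizers on an inline sample text (the sample string is my own, not taken from the book's data):

    from nltk.tokenize import sent_tokenize, word_tokenize

    # A short sample standing in for the BBC article file.
    txt = ("NLTK ships with several tokenizers. "
           "Sentence tokenization splits text into sentences. "
           "Word tokenization splits each sentence into tokens.")

    # Print only the first few elements, mirroring Printer(nelems=3).
    print('Sentences', sent_tokenize(txt)[:3])
    print('Words', word_tokenize(txt)[:3])

As in the main program, sent_tokenize() returns a list of sentence strings and word_tokenize() returns a list of tokens, with punctuation marks kept as separate tokens.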

Refer to the following screenshot for the end result:


See also
