B.1. Working with strings

Natural language processing is all about processing strings. And strings have a lot of quirks in Python 3 that may take you by surprise, especially if you have a lot of Python 2 experience. So you’ll want to play around with strings and all the ways you can manipulate them until you’re comfortable working with natural language text.

B.1.1. String types (str and bytes)

Strings (str) are sequences of Unicode characters. If you use a non-ASCII character in a str, that character may take up multiple bytes once the string is encoded (as UTF-8, for example). Non-ASCII characters pop up a lot if you are copying and pasting from the internet into your Python console or program. And some of them are hard to spot, like those curly asymmetrical quote characters and apostrophes.
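
As a minimal sketch of that point, the snippet below compares a word typed with a straight ASCII apostrophe against the same word pasted with a curly apostrophe (U+2019). The variable names are made up for illustration, and UTF-8 is assumed as the encoding.

```python
plain = "don't"        # straight ASCII apostrophe (U+0027)
curly = "don\u2019t"   # curly apostrophe (U+2019), typical of pasted web text

# Both strings contain five Unicode characters...
n_chars_plain = len(plain)
n_chars_curly = len(curly)

# ...but the curly apostrophe needs three bytes when encoded as UTF-8.
n_bytes_plain = len(plain.encode("utf-8"))
n_bytes_curly = len(curly.encode("utf-8"))

# One way to spot the hard-to-see characters: list every non-ASCII one.
non_ascii = [ch for ch in curly if ord(ch) > 127]

print(n_chars_plain, n_bytes_plain)  # 5 5
print(n_chars_curly, n_bytes_curly)  # 5 7
print(non_ascii)
print(curly.isascii())               # False (str.isascii needs Python 3.7+)
```

The two strings look nearly identical on screen, but they are not equal, so a tokenizer or vocabulary lookup will treat them as different tokens unless you normalize them first.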

When you open a file with the Python open command, it’s opened in text mode and read as a str by default. If you open a binary file, like a pretrained Word2vec model '.txt' file, without specifying mode='rb', it won’t load correctly. Even though the gensim.KeyedVectors model type may be text, not binary, the file must be opened in binary mode so that Unicode characters aren’t garbled as gensim loads the model; likewise for a CSV file or any other text file saved with Python 2.
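
The difference between the two modes can be sketched with a throwaway file. This example does not involve gensim at all; the file name and sample text are hypothetical, and UTF-8 is assumed as the encoding.

```python
import os
import tempfile

sample = "caf\u00e9 \u201cquoted\u201d"  # contains an accented e and curly quotes

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")  # hypothetical file name

    # Write the text out as UTF-8 bytes.
    with open(path, mode="w", encoding="utf-8") as f:
        f.write(sample)

    # Text mode (the default): read() decodes the bytes and returns a str.
    with open(path, mode="r", encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode: read() returns the raw bytes, untouched.
    with open(path, mode="rb") as f:
        as_bytes = f.read()

print(type(as_text).__name__)    # str
print(type(as_bytes).__name__)   # bytes
print(as_bytes.decode("utf-8") == as_text)  # True
```

Reading in binary mode postpones the decoding decision, which is useful when you (or a library) need to control exactly which encoding is applied to the bytes.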

Bytes (bytes) are arrays of 8-bit values, usually used to hold ASCII characters or Extended ASCII characters (with integer ord values greater than 127).[1] Bytes are also sometimes used to store RAW images, WAV audio files, or other binary data blobs.

[1] There’s no single official Extended ASCII character set, so don’t ever use one for NLP unless you want to confuse your machine trying to learn a general language model.
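
To make the Extended ASCII ambiguity concrete, here is a small sketch that decodes the same three bytes with two different single-byte code pages. The choice of latin-1 and cp437 is only illustrative; any pair of legacy code pages that disagree above 127 would show the same thing.

```python
raw = bytes([72, 105, 233])  # two ASCII values plus one value above 127

# Indexing or iterating over bytes yields integers in the range 0-255.
first_value = raw[0]
values = list(raw)

# The byte 233 maps to different characters in different code pages.
as_latin1 = raw.decode("latin-1")
as_cp437 = raw.decode("cp437")

# Strict ASCII only covers 0-127, so decoding as ASCII fails outright.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(first_value, values)    # 72 [72, 105, 233]
print(as_latin1)              # Hié
print(as_latin1 == as_cp437)  # False
print(ascii_failed)           # True
```

Because the bytes alone don't say which code page produced them, text stored this way can silently turn into different characters depending on who decodes it.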

B.1.2. Templates in Python (.format())

Python comes with a versatile string templating system that allows you to populate a string with the values of variables. This allows you to create dynamic responses with knowledge from a database or the context of a running Python program (locals()).
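
A minimal sketch of both uses follows. The function name, the template wording, and the dictionary standing in for a database row are all hypothetical, chosen only to illustrate str.format().

```python
def build_reply(user, n_results):
    """Fill a template from the local variables of this function."""
    template = "Hi {user}, I found {n_results} matching documents."
    # locals() is a dict of the local names; names the template doesn't use are ignored.
    return template.format(**locals())


# A dict standing in for a row fetched from a database (made-up values).
row = {"title": "Example Title", "year": 2019}
citation = "{title} ({year})".format(**row)

# Format specs after the colon control how a value is rendered.
score_text = "similarity: {score:.2f}".format(score=0.98765)

print(build_reply("Ada", 3))  # Hi Ada, I found 3 matching documents.
print(citation)               # Example Title (2019)
print(score_text)             # similarity: 0.99
```

Passing **locals() is convenient for quick experiments, but naming the keyword arguments explicitly makes it clearer which values a response depends on.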
