Chapter 10. File and Text Operations

This chapter covers most of the issues related to dealing with files and filesystems in Python. A file is a stream of text or bytes that a program can read and/or write; a filesystem is a hierarchical repository of files on a computer system.

Other Chapters That Also Deal with Files

Because files are such a crucial concept in programming, even though this chapter is the largest one in the book, several other chapters also contain material that is relevant when you’re handling specific kinds of files. In particular, Chapter 11 deals with many kinds of files related to persistence and database functionality (JSON files in “The json Module”, pickle files in “The pickle and cPickle Modules”, shelve files in “The shelve Module”, DBM and DBM-like files in “The v3 dbm Package”, and SQLite database files in “SQLite”), Chapter 22 deals with files and other streams in HTML format, and Chapter 23 deals with files and other streams in XML format.

Organization of This Chapter

Files and streams come in many flavors: their contents can be arbitrary bytes or text (with various encodings, if the underlying storage or channel deals only with bytes, as most do); they may be suitable for reading, writing, or both; they may or may not be buffered; they may or may not allow “random access,” going back and forth in the file (a stream whose underlying channel is an Internet socket, for example, only allows sequential, “going forward” access—there’s no “going back and forth”).

Traditionally, old-style Python coalesced most of this diverse functionality into the built-in file object, working in different ways depending on how the built-in function open created it. These built-ins are still in v2, for backward compatibility.

In both v2 and v3, however, input/output (I/O) is more logically structured, within the standard library’s io module. In v3, the built-in function open is, in fact, simply an alias for the function io.open. In v2, the built-in open still works the old-fashioned way, creating and returning an old-fashioned built-in file object (a type that does not exist anymore in v3). However, you can from io import open to use, instead, the new and better structured io.open: the file-like objects it returns operate quite similarly to the old-fashioned built-in file objects in most simple cases, and you can call the new function in a similar way to old-fashioned built-in function open, too. (Alternatively, of course, in both v2 and v3, you can practice the excellent principle “explicit is better than implicit” by doing import io and then explicitly using io.open.)

If you have to maintain old code using built-in file objects in complicated ways, and don’t want to port that code to the newer approach, use the online docs as a reference to the details of old-fashioned built-in file objects. This book (and, specifically, the start of this chapter) does not cover old-fashioned built-in file objects, just io.open and the various classes from the io module.

Immediately after that, this chapter covers the polymorphic concept of file-like objects (objects that are not actually files but behave to some extent like files) in “File-Like Objects and Polymorphism”.

The chapter next covers modules that deal with temporary files and file-like objects (tempfile in “The tempfile Module”, and io.StringIO and io.BytesIO in “In-Memory “Files”: io.StringIO and io.BytesIO”).

Next comes the coverage of modules that help you access the contents of text and binary files (fileinput in “The fileinput Module”, linecache in “The linecache Module”, and struct in “The struct Module”) and support compressed files and other data archives (gzip in “The gzip Module”, bz2 in “The bz2 Module”, tarfile in “The tarfile Module”, zipfile in “The zipfile Module”, and zlib in “The zlib Module”). v3 also supports LZMA compression, as used, for example, by the xz program: we don’t cover that issue in this book, but see the online docs and PyPI for a backport to v2.

In Python, the os module supplies many of the functions that operate on the filesystem, so this chapter continues by introducing the os module in “The os Module”. The chapter then covers, in “Filesystem Operations”, operations on the filesystem (comparing, copying, and deleting directories and files; working with file paths; and accessing low-level file descriptors) offered by os (in “File and Directory Functions of the os Module”), os.path (in “The os.path Module”), and other modules (dircache under listdir in Table 10-3, stat in “The stat Module”, filecmp in “The filecmp Module”, fnmatch in “The fnmatch Module”, glob in “The glob Module”, and shutil in “The shutil Module”). We do not cover the module pathlib, supplying an object-oriented approach to filesystem paths, since, as of this writing, it has been included in the standard library only on a provisional basis, meaning it can undergo backward-incompatible changes, up to and including removal of the module; if you nevertheless want to try it out, see the online docs, and PyPI for a v2 backport.

While most modern programs rely on a graphical user interface (GUI), often via a browser or a smartphone app, text-based, nongraphical “command-line” user interfaces are still useful, since they’re simple, fast to program, and lightweight. This chapter concludes with material about text input and output in Python in “Text Input and Output”, richer text I/O in “Richer-Text I/O”, interactive command-line sessions in “Interactive Command Sessions”, and, finally, a subject generally known as internationalization (often abbreviated i18n). Building software that processes text understandable to different users, across languages and cultures, is described in “Internationalization”.

The io Module

As mentioned in “Organization of This Chapter”, io is a standard library module in Python and provides the most common ways for your Python programs to read or write files. Use io.open to make a Python “file” object—which, depending on what parameters you pass to io.open, can in fact be an instance of io.TextIOWrapper if textual, or, if binary, io.BufferedReader, io.BufferedWriter, or io.BufferedRandom, depending on whether it’s read-only, write-only, or read-write—to read and/or write data to a file as seen by the underlying operating system. We refer to these as “file” objects, in quotes, to distinguish them from the old-fashioned built-in file type still present in v2.

In v3, the built-in function open is a synonym for io.open. In v2, use from io import open to get the same effect. We use io.open explicitly (assuming a previous import io has executed, of course), for clarity and to avoid ambiguity.

This section covers such “file” objects, as well as the important issue of making and using temporary files (on disk, or even in memory).

Python reacts to any I/O error related to a “file” object by raising an instance of built-in exception class IOError (in v3, that’s a synonym for OSError, but many useful subclasses exist, and are covered in “OSError and subclasses (v3 only)”). Errors that cause this exception include open failing to open a file, calls to a method on a “file” to which that method doesn’t apply (e.g., calling write on a read-only “file,” or calling seek on a nonseekable file)—which could also cause ValueError or AttributeError—and I/O errors diagnosed by a “file” object’s methods.

The io module also provides the underlying web of classes, both abstract and concrete, that, by inheritance and by composition (also known as wrapping), make up the “file” objects (instances of classes mentioned in the first paragraph of this section) that your program generally uses. We do not cover these advanced topics in this book. If you have access to unusual channels for data, or nonfilesystem data storage, and want to provide a “file” interface to those channels or storage, you can ease your task, by appropriate subclassing and wrapping, using other classes in the module io. For such advanced tasks, consult the online docs.

Creating a “file” Object with io.open

To create a Python “file” object, call io.open with the following syntax:

open(file, mode='r', buffering=-1, encoding=None, errors='strict',
     newline=None, closefd=True, opener=os.open)

file can be a string, in which case it’s any path to a file as seen by the underlying OS, or it can be an integer, in which case it’s an OS-level file descriptor as returned by os.open (or, in v3 only, whatever function you pass as the opener argument—opener is not supported in v2). When file is a string, open opens the file thus named (possibly creating it, depending on mode—despite its name, open is not just for opening existing files: it can also create new ones); when file is an integer, the underlying OS file must already be open (via os.open or whatever).

Opening a file Pythonically

open is a context manager: use with io.open(...) as f:, not f = io.open(...), to ensure the “file” f gets closed as soon as the with statement’s body is done.
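
For example, here's a minimal sketch of the recommended pattern (the filename data.txt is just a placeholder for illustration):

import io

# 'data.txt' is a placeholder path; the file is guaranteed to be closed
# when the with block ends, even if an exception is raised inside it
with io.open('data.txt', 'r', encoding='utf-8') as f:
    contents = f.read()
print(len(contents))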

open creates and returns an instance f of the appropriate class of the module io, depending on mode and buffering—we refer to all such instances as “file” objects; they all are reasonably polymorphic with respect to each other.

mode

mode is a string indicating how the file is to be opened (or created). mode can be:

'r'

The file must already exist, and it is opened in read-only mode.

'w'

The file is opened in write-only mode. The file is truncated to zero length and overwritten if it already exists, or created if it does not exist.

'a'

The file is opened in write-only mode. The file is kept intact if it already exists, and the data you write is appended to what’s already in the file. The file is created if it does not exist. Calling f.seek on the file changes the result of the method f.tell, but does not change the write position in the file.

'r+'

The file must already exist and is opened for both reading and writing, so all methods of f can be called.

'w+'

The file is opened for both reading and writing, so all methods of f can be called. The file is truncated and overwritten if it already exists, or created if it does not exist.

'a+'

The file is opened for both reading and writing, so all methods of f can be called. The file is kept intact if it already exists, and the data you write is appended to what’s already in the file. The file is created if it does not exist. Calling f.seek on the file, depending on the underlying operating system, may have no effect when the next I/O operation on f writes data, but does work normally when the next I/O operation on f reads data.

Binary and text modes

The mode string may have any of the values just explained, followed by a b or t. b means a binary file, while t means a text one. When mode has neither b nor t, the default is text (i.e., 'r' is like 'rt', 'w' is like 'wt', and so on).

Binary files let you read and/or write strings of type bytes; text ones let you read and/or write Unicode text strings (str in v3, unicode in v2). For text files, when the underlying channel or storage system deals in bytes (as most do), encoding (the name of an encoding known to Python) and errors (an error-handler name such as 'strict', 'replace', and so on, as covered under decode in Table 8-6) matter, as they specify how to translate between text and bytes, and what to do on encoding and decoding errors.
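
Here's a small sketch contrasting the two kinds of "file" objects (the filename notes.txt is just a placeholder for illustration):

import io

# 'notes.txt' is a placeholder path
# text mode: pass an encoding; reads and writes Unicode text strings
with io.open('notes.txt', 'w', encoding='utf-8') as f:
    f.write(u'caf\u00e9\n')      # the u'' literal keeps this valid in v2, too

# binary mode: no encoding; reads and writes bytestrings
with io.open('notes.txt', 'rb') as f:
    raw = f.read()
print(raw)                        # b'caf\xc3\xa9\n', the UTF-8 encoding of the text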

Buffering

buffering is an integer that denotes the buffering you’re requesting for the file. When buffering is less than 0, a default is used. Normally, this default is line buffering for files that correspond to interactive consoles, and a buffer of io.DEFAULT_BUFFER_SIZE bytes for other files. When buffering is 0 (allowed only for binary-mode files), the file is unbuffered; the effect is as if the file’s buffer were flushed every time you write anything to the file. When buffering equals 1, the file (which must be text mode) is line-buffered, which means the file’s buffer is flushed every time a line-end character is written to the file. When buffering is greater than 1, the file uses a buffer of about buffering bytes, rounded up to some reasonable amount.

Sequential and nonsequential (“random”) access

A “file” object f is inherently sequential (a stream of bytes or text). When you read, you get bytes or text in the sequential order in which they’re present. When you write, the bytes or text you write are added in the order in which you write them.

To allow nonsequential access (also known as “random access”), a “file” object whose underlying storage allows this keeps track of its current position (the position in the underlying file where the next read or write operation starts transferring data). f.seekable() returns True when f supports nonsequential access.

When you open a file, the initial position is at the start of the file. Any call to f.write on a “file” object f opened with a mode of 'a' or 'a+' always sets f’s position to the end of the file before writing data to f. When you write or read n bytes to/from “file” object f, f’s position advances by n. You can query the current position by calling f.tell and change the position by calling f.seek, both covered in the next section.

f.tell and f.seek also work on a text-mode f, but in this case the offset you pass to f.seek must be 0 (to position f at the start or end, depending on f.seek’s second parameter), or the opaque result previously returned by a call to f.tell, to position f back to a position you had thus “bookmarked” before.
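
For example, here's a small sketch of "bookmarking" a position in a text-mode "file" with tell and then returning to it with seek (it assumes a text file notes.txt exists):

import io

# 'notes.txt' is a placeholder path
with io.open('notes.txt', 'r', encoding='utf-8') as f:
    f.readline()               # skip the first line
    bookmark = f.tell()        # opaque marker for the current position
    rest = f.read()            # read everything after the first line
    f.seek(bookmark)           # jump back to the bookmarked position
    assert f.read() == rest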

Attributes and Methods of “file” Objects

A “file” object f supplies the attributes and methods documented in this section.

close

f.close()

Closes the file. You can call no other method on f after f.close. Multiple calls to f.close are allowed and innocuous.

closed

closed

f.closed is a read-only attribute that is True when f.close() has been called; otherwise, False.

encoding

encoding

f.encoding is a read-only attribute, a string naming the encoding (as covered in “Unicode”). The attribute does not exist on binary “files.”

flush

f.flush()

Requests that f’s buffer be written out to the operating system, so that the file as seen by the system has the exact contents that Python’s code has written. Depending on the platform and the nature of f’s underlying file, f.flush may not be able to ensure the desired effect.

isatty

f.isatty()

True when f’s underlying file is an interactive terminal; otherwise, False.

fileno

f.fileno()

Returns an integer, the file descriptor of f’s file at operating-system level. File descriptors are covered in “File and Directory Functions of the os Module”.

mode

mode

f.mode is a read-only attribute that is the value of the mode string used in the io.open call that created f.

name

name

f.name is a read-only attribute that is the value of the file string or int used in the io.open call that created f.

read

f.read(size=-1)

In v2, or in v3 when f is open in binary mode, read reads up to size bytes from f’s file and returns them as a bytestring. read reads and returns less than size bytes if the file ends before size bytes are read. When size is less than 0, read reads and returns all bytes up to the end of the file. read returns an empty string when the file’s current position is at the end of the file or when size equals 0. In v3, when f is open in text mode, size is a number of characters, not bytes, and read returns a text string.

readline

f.readline(size=-1)

Reads and returns one line from f’s file, up to the end of line (\n), included. When size is greater than or equal to 0, readline reads no more than size bytes. In that case, the returned string might not end with \n; \n might also be absent when readline reads up to the end of the file without finding \n. readline returns an empty string when the file’s current position is at the end of the file or when size equals 0.

readlines

f.readlines(size=-1)

Reads and returns a list of all lines in f’s file, each a string ending in \n. If size>0, readlines stops and returns the list after collecting data for a total of about size bytes rather than reading all the way to the end of the file; in that case, the last string in the list might not end in \n.

seek

f.seek(pos, how=io.SEEK_SET)

Sets f’s current position to the signed integer byte offset pos away from a reference point. how indicates the reference point. The io module has attributes named SEEK_SET, SEEK_CUR, and SEEK_END, to specify that the reference point is, respectively, the file’s beginning, current position, or end.

When f is opened in text mode, f.seek must have a pos of 0, or, for io.SEEK_SET only, a pos that is the result of a previous call to f.tell.

When f is opened in mode 'a' or 'a+', on some but not all platforms, data written to f is appended to the data that is already in f, regardless of calls to f.seek.

tell

f.tell()

Returns f’s current position: for a binary file, that’s an integer offset in bytes from the start of the file; for a text file, an opaque value usable in future calls to f.seek to position f back to the position that is now current.

truncate

f.truncate([size])

Truncates f’s file, which must be open for writing. When size is present, truncates the file to be at most size bytes. When size is absent, uses f.tell() as the file’s new size.

write

f.write(s)

Writes the bytes of string s (binary or text, depending on f’s mode) to the file.

writelines

f.writelines(lst)

Like:

for line in lst: f.write(line)

It does not matter whether the strings in iterable lst are lines: despite its name, the method writelines just writes each of the strings to the file, one after the other. In particular, writelines does not add line-ending markers: such markers must, if required, already be present in the items of lst.
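
For example, here's a minimal sketch that adds the line endings explicitly (the filename out.txt is just a placeholder for illustration):

import io

# 'out.txt' is a placeholder path
lines = [u'alpha', u'beta', u'gamma']
with io.open('out.txt', 'w', encoding='utf-8') as f:
    # writelines adds no '\n' itself: supply the line endings yourself
    f.writelines(line + u'\n' for line in lines)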

Iteration on “File” Objects

A “file” object f, open for text-mode reading, is also an iterator whose items are the file’s lines. Thus, the loop:

for line in f:

iterates on each line of the file. Due to buffering issues, interrupting such a loop prematurely (e.g., with break), or calling next(f) instead of f.readline(), leaves the file’s position set to an arbitrary value. If you want to switch from using f as an iterator to calling other reading methods on f, be sure to set the file’s position to a known value by appropriately calling f.seek. On the plus side, a loop directly on f has very good performance, since these specifications allow the loop to use internal buffering to minimize I/O without taking up excessive amounts of memory even for huge files.
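
For example, here's a small sketch that counts characters line by line (the filename notes.txt is just a placeholder for illustration):

import io

# 'notes.txt' is a placeholder path
total = 0
with io.open('notes.txt', 'r', encoding='utf-8') as f:
    for line in f:             # buffered, memory-friendly line iteration
        total += len(line)
print(total)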

File-Like Objects and Polymorphism

An object x is file-like when it behaves polymorphically to a “file” object as returned by io.open, meaning that we can use x “as if” x were a “file.” Code using such an object (known as client code of the object) usually gets the object as an argument, or by calling a factory function that returns the object as the result. For example, if the only method that client code calls on x is x.read(), without arguments, then all x needs to supply in order to be file-like for that code is a method read that is callable without arguments and returns a string. Other client code may need x to implement a larger subset of file methods. File-like objects and polymorphism are not absolute concepts: they are relative to demands placed on an object by some specific client code.

Polymorphism is a powerful aspect of object-oriented programming, and file-like objects are a good example of polymorphism. A client-code module that writes to or reads from files can automatically be reused for data residing elsewhere, as long as the module does not break polymorphism by the dubious practice of type checking. When we discussed built-ins type and isinstance in Table 7-1, we mentioned that type checking is often best avoided, as it blocks the normal polymorphism that Python otherwise supplies. Most often, to support polymorphism in your client code, all you have to do is avoid type checking.

You can implement a file-like object by coding your own class (as covered in Chapter 4) and defining the specific methods needed by client code, such as read. A file-like object fl need not implement all the attributes and methods of a true “file” object f. If you can determine which methods the client code calls on fl, you can choose to implement only that subset. For example, when fl is only going to be written, fl doesn’t need “reading” methods, such as read, readline, and readlines.

If the main reason you want a file-like object instead of a real file object is to keep the data in memory, use the io module’s classes StringIO and BytesIO, covered in “In-Memory “Files”: io.StringIO and io.BytesIO”. These classes supply “file” objects that hold data in memory and largely behave polymorphically to other “file” objects.

The tempfile Module

The tempfile module lets you create temporary files and directories in the most secure manner afforded by your platform. Temporary files are often a good solution when you’re dealing with an amount of data that might not comfortably fit in memory, or when your program must write data that another process later uses.

The order of the parameters for the functions in this module is a bit confusing: to make your code more readable, always call these functions with named-argument syntax. The tempfile module exposes the functions and classes outlined in Table 10-1.

Table 10-1. Functions and classes of the tempfile module

mkdtemp

mkdtemp(suffix=None, prefix=None, dir=None)

Securely creates a new temporary directory that is readable, writable, and searchable only by the current user, and returns the absolute path to the temporary directory. The optional arguments suffix, prefix, and dir are as for function mkstemp. Ensuring that the temporary directory is removed when you’re done using it is your program’s responsibility. Here is a typical usage example that creates a temporary directory, passes its path to another function, and finally ensures the directory is removed together with all of its contents:

import tempfile, shutil
path = tempfile.mkdtemp()
try:
    use_dirpath(path)
finally:
    shutil.rmtree(path)

mkstemp

mkstemp(suffix=None, prefix=None, dir=None, text=False)

Securely creates a new temporary file, readable and writable only by the current user, not executable, not inherited by subprocesses; returns a pair (fd, path), where fd is the file descriptor of the temporary file (as returned by os.open, covered in Table 10-5) and string path is the absolute path to the temporary file. You can optionally pass arguments to specify strings to use as the start (prefix) and end (suffix) of the temporary file’s filename, and the path to the directory in which the temporary file is created (dir); if you want the temporary file to be a text file, explicitly pass the argument text=True.

Ensuring that the temporary file is removed when you’re done using it is up to you: mkstemp is not a context manager, so you can’t use a with statement—you’ll generally use try/finally instead. Here is a typical usage example that creates a temporary text file, closes it, passes its path to another function, and finally ensures the file is removed:

import tempfile, os
fd, path = tempfile.mkstemp(suffix='.txt', text=True)
try:
    os.close(fd)
    use_filepath(path)
finally:
    os.unlink(path)

SpooledTemporaryFile

SpooledTemporaryFile(mode='w+b', bufsize=-1, suffix=None, prefix=None, dir=None)

Just like TemporaryFile, covered next, except that the “file” object SpooledTemporaryFile returns can stay in memory, if space permits, until you call its fileno method (or the rollover method, which ensures the file gets materialized on disk); as a result, performance can be better with SpooledTemporaryFile, as long as you have enough memory that’s not otherwise in use.

TemporaryFile

TemporaryFile(mode='w+b', bufsize=-1, suffix=None, prefix=None, dir=None)

Creates a temporary file with mkstemp (passing to mkstemp the optional arguments suffix, prefix, and dir), makes a “file” object from it with os.fdopen as covered in Table 10-5 (passing to fdopen the optional arguments mode and bufsize), and returns the “file” object. The temporary file is removed as soon as the file object is closed (implicitly or explicitly). For greater security, the temporary file has no name on the filesystem, if your platform allows that (Unix-like platforms do; Windows doesn’t). The returned “file” object from TemporaryFile is a context manager, so you can use a with statement to ensure it’s closed when you’re done with it.

NamedTemporaryFile

NamedTemporaryFile(mode='w+b', bufsize=-1, suffix=None, prefix=None, dir=None)

Like TemporaryFile, except that the temporary file does have a name on the filesystem. Use the name attribute of the “file” object to access that name. Some platforms, mainly Windows, do not allow the file to be opened again; therefore, the usefulness of the name is limited if you want to ensure that your program works cross-platform. If you need to pass the temporary file’s name to another program that opens the file, you can use function mkstemp, instead of NamedTemporaryFile, to guarantee correct cross-platform behavior. Of course, when you choose to use mkstemp, you do have to take care to ensure the file is removed when you’re done with it. The returned “file” object from NamedTemporaryFile is a context manager, so you can use a with statement.
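
For example, here's a minimal sketch of using NamedTemporaryFile as a context manager:

import tempfile

with tempfile.NamedTemporaryFile(suffix='.txt') as f:
    f.write(b'scratch data\n')
    f.flush()
    print('temporary file: {}'.format(f.name))
# by default, the temporary file is removed as soon as it's closed
# (here, at the end of the with block)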

Auxiliary Modules for File I/O

“File” objects supply the minimal functionality needed for file I/O. Some auxiliary Python library modules, however, offer convenient supplementary functionality, making I/O even easier and handier in several important cases.

The fileinput Module

The fileinput module lets you loop over all the lines in a list of text files. Performance is good, comparable to the performance of direct iteration on each file, since buffering is used to minimize I/O. You can therefore use module fileinput for line-oriented file input whenever you find the module’s rich functionality convenient, with no worry about performance. The input function is the key function of module fileinput; the module also supplies a FileInput class whose methods support the same functionality as the module’s functions. The module contents are listed here:

close

close()

Closes the whole sequence so that iteration stops and no file remains open.

FileInput

class FileInput(files=None, inplace=False, backup='', bufsize=0,
openhook=None)

Creates and returns an instance f of class FileInput. Arguments are the same as for fileinput.input, and methods of f have the same names, arguments, and semantics, as functions of module fileinput. f also supplies a method readline, which reads and returns the next line. You can use class FileInput explicitly when you want to nest or mix loops that read lines from more than one sequence of files.

filelineno

filelineno()

Returns the number of lines read so far from the file now being read. For example, returns 1 if the first line has just been read from the current file.

filename

filename()

Returns the name of the file being read, or None if no line has been read yet.

input

input(files=None, inplace=False, backup='', bufsize=0,
openhook=None)

Returns the sequence of lines in the files, suitable for use in a for loop. files is a sequence of filenames to open and read one after the other, in order. Filename '-' means standard input (sys.stdin). When files is a string, it’s a single filename to open and read. When files is None, input uses sys.argv[1:] as the list of filenames. When the sequence of filenames is empty, input reads sys.stdin.

The sequence object that input returns is an instance of the class FileInput; that instance is also the global state of the module fileinput, so all other functions of the fileinput module operate on the same shared state. Each function of the fileinput module corresponds directly to a method of the class FileInput.

When inplace is False (the default), input just reads the files. When inplace is True, input moves each file being read (except standard input) to a backup file and redirects standard output (sys.stdout) to write to a new file at the same path as the original file being read. This way, you can simulate overwriting files in-place. If backup is a string that starts with a dot, input uses backup as the extension of the backup files and does not remove the backup files. If backup is an empty string (the default), input uses .bak and deletes each backup file as the input files are closed.

bufsize is the size of the internal buffer that input uses to read lines from the input files. When bufsize is 0, input uses a buffer of 8,192 bytes.

You can optionally pass an openhook function to use as an alternative to io.open. A popular one: openhook=fileinput.hook_compressed, which transparently decompresses any input file with extension .gz or .bz2 (not compatible with inplace=True).

isfirstline

isfirstline()

Returns True or False, just like filelineno()==1.

isstdin

isstdin()

Returns True when the current file being read is sys.stdin; otherwise, False.

lineno

lineno()

Returns the total number of lines read since the call to input.

nextfile

nextfile()

Closes the file being read: the next line to read is the first one of the next file.

Here’s a typical example of using fileinput for a “multifile search and replace,” changing one string into another throughout the text files whose names were passed as command-line arguments to the script:

import fileinput
for line in fileinput.input(inplace=True):
    print(line.replace('foo', 'bar'), end='')

In such cases it’s important to have the end='' argument to print, since each line has its line-end character at the end, and you need to ensure that print doesn’t add another (or else each file would end up “double-spaced”).

The linecache Module

The linecache module lets you read a given line (specified by number) from a file with a given name, keeping an internal cache so that, when you read several lines from a file, it’s faster than opening and examining the file each time. The linecache module exposes the following functions:

checkcache

checkcache()

Ensures that the module’s cache holds no stale data and reflects what’s on the filesystem. Call checkcache when the files you’re reading may have changed on the filesystem, to ensure that further calls to getline return updated information.

clearcache

clearcache()

Drops the module’s cache so that the memory can be reused for other purposes. Call clearcache when you know you don’t need to perform any reading for a while.

getline

getline(filename, lineno)

Reads and returns the line numbered lineno (the first line is 1, not 0 as is usual in Python) from the text file named filename, including the trailing \n. For any error, getline does not raise exceptions but rather returns the empty string ''. If filename is not found, getline looks for the file in the directories listed in sys.path (ignoring ZIP files, if any, in sys.path).

getlines

getlines(filename)

Reads and returns all lines from the text file named filename as a list of strings, each including the trailing \n. For any error, getlines does not raise exceptions but rather returns the empty list []. If filename is not found, getlines looks for the file in the directories listed in sys.path.
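
For example, here's a minimal sketch (the path /etc/hosts is just a placeholder for illustration):

import linecache

# '/etc/hosts' is a placeholder path; getline returns '' (rather than
# raising) if the file or line is missing
third_line = linecache.getline('/etc/hosts', 3)
print(repr(third_line))
linecache.clearcache()        # release the cache when you're done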

The struct Module

The struct module lets you pack binary data into a bytestring, and unpack the bytes of such a bytestring back into the data they represent. Such operations are useful for many kinds of low-level programming. Most often, you use struct to interpret data records from binary files that have some specified format, or to prepare records to write to such binary files. The module’s name comes from C’s keyword struct, which is usable for related purposes. On any error, functions of the module struct raise exceptions that are instances of the exception class struct.error, the only class the module supplies.

The struct module relies on struct format strings following a specific syntax. The first character of a format string gives byte order, size, and alignment of packed data:

@

Native byte order, native data sizes, and native alignment for the current platform; this is the default if the first character is none of the characters listed here (note that format P in Table 10-2 is available only for this kind of struct format string). Look at string sys.byteorder when you need to check your system’s byte order ('little' or 'big').

=

Native byte order for the current platform, but standard size and alignment.

<

Little-endian byte order (like Intel platforms); standard size and alignment.

>, !

Big-endian byte order (network standard); standard size and alignment.

Table 10-2. Format characters for struct
Character  C type          Python type       Standard size
B          unsigned char   int               1 byte
b          signed char     int               1 byte
c          char            bytes (length 1)  1 byte
d          double          float             8 bytes
f          float           float             4 bytes
H          unsigned short  int               2 bytes
h          signed short    int               2 bytes
I          unsigned int    long              4 bytes
i          signed int      int               4 bytes
L          unsigned long   long              4 bytes
l          signed long     int               4 bytes
P          void*           int               N/A
p          char[]          bytes             N/A
s          char[]          bytes             N/A
x          padding byte    no value          1 byte

Standard sizes are indicated in Table 10-2. Standard alignment means no forced alignment, with explicit padding bytes used if needed. Native sizes and alignment are whatever the platform’s C compiler uses. Native byte order can put the most significant byte at either the lowest (big-endian) or highest (little-endian) address, depending on the platform.

After the optional first character, a format string is made up of one or more format characters, each optionally preceded by a count (an integer represented by decimal digits). (The format characters are shown in Table 10-2.) For most format characters, the count means repetition (e.g., '3h' is exactly the same as 'hhh'). When the format character is s or p—that is, a bytestring—the count is not a repetition: it’s the total number of bytes in the string. Whitespace can be freely used between formats, but not between a count and its format character.

Format s means a fixed-length bytestring as long as its count (the Python string is truncated, or padded with copies of the null byte b'\0', if needed). The format p means a “Pascal-like” bytestring: the first byte is the number of significant bytes that follow, and the actual contents start from the second byte. The count is the total number of bytes, including the length byte.

The struct module supplies the following functions:

calcsize

calcsize(fmt)

Returns the size in bytes corresponding to format string fmt.

pack

pack(fmt, *values)

Packs the values per format string fmt, returns the resulting bytestring. values must match in number and type the values required by fmt.

pack_into

pack_into(fmt, buffer, offset, *values)

Packs the values per format string fmt into writeable buffer buffer (usually an instance of bytearray) starting at index offset in it. values must match in number and type the values required by fmt. len(buffer[offset:]) must be >= struct.calcsize(fmt).

unpack

unpack(fmt, s)

Unpacks bytestring s per format string fmt, returns a tuple of values (if just one value, a one-item tuple). len(s) must equal struct.calcsize(fmt).

unpack_from

unpack_from(fmt, s, offset)

Unpacks bytestring (or other readable buffer) s, starting from offset offset, per format string fmt, returning a tuple of values (if just one value, a 1-item tuple). len(s[offset:]) must be >= struct.calcsize(fmt).
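
For example, here's a small sketch of a round trip through pack and unpack, using a little-endian format string made up for illustration:

import struct

fmt = '<Hi4s'                  # unsigned short, signed int, 4-byte bytestring
record = struct.pack(fmt, 7, -12, b'spam')
assert len(record) == struct.calcsize(fmt) == 10
print(struct.unpack(fmt, record))            # (7, -12, b'spam')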

In-Memory “Files”: io.StringIO and io.BytesIO

You can implement file-like objects by writing Python classes that supply the methods you need. If all you want is for data to reside in memory, rather than on a file as seen by the operating system, use the class StringIO or BytesIO of the io module. The difference between them is that instances of StringIO are text-mode “files,” so reads and writes consume or produce Unicode strings, while instances of BytesIO are binary “files,” so reads and writes consume or produce bytestrings.

When you instantiate either class you can optionally pass a string argument, respectively Unicode or bytes, to use as the initial content of the “file.” An instance f of either class, in addition to “file” methods, supplies one extra method:

getvalue

f.getvalue()

Returns the current data contents of f as a string (text or bytes). You cannot call f.getvalue after you call f.close: close frees the buffer that f internally keeps, and getvalue needs to return the buffer as its result.
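
For example, here's a minimal sketch of an in-memory text "file":

import io

memfile = io.StringIO(u'one\ntwo\n')    # the initial content is optional
print(memfile.read())                   # reads u'one\ntwo\n', moving to the end
memfile.write(u'three\n')               # appends at the current position
print(memfile.getvalue())               # u'one\ntwo\nthree\n'
memfile.close()                         # frees the in-memory buffer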

Compressed Files

Storage space and transmission bandwidth are increasingly cheap and abundant, but in many cases you can save such resources, at the expense of some extra computational effort, by using compression. Computational power grows cheaper and more abundant even faster than some other resources, such as bandwidth, so compression’s popularity keeps growing. Python makes it easy for your programs to support compression, since the Python standard library contains several modules dedicated to compression.

Since Python offers so many ways to deal with compression, some guidance may be helpful. Files containing data compressed with the zlib module are not automatically interchangeable with other programs, except for those files built with the zipfile module, which respects the standard format of ZIP file archives. You can write custom programs, with any language able to use the free zlib compression library, to read files produced by Python programs using the zlib module. However, if you need to interchange compressed data with programs coded in other languages but have a choice of compression methods, we suggest you use the modules bz2 (best), gzip, or zipfile instead. The zlib module, however, may be useful when you want to compress some parts of datafiles that are in some proprietary format of your own and need not be interchanged with any other program except those that make up your application.

In v3 only, you can also use the newer module lzma for even (marginally) better compression and compatibility with the newer xz utility. We do not cover lzma in this book; see the online docs, and, for use in v2, the v2 backport.

The gzip Module

The gzip module lets you read and write files compatible with those handled by the powerful GNU compression programs gzip and gunzip. The GNU programs support many compression formats, but the module gzip supports only the gzip format, often denoted by appending the extension .gz to a filename. The gzip module supplies the GzipFile class and an open factory function:

GzipFile

class GzipFile(filename=None, mode=None, compresslevel=9,
fileobj=None)

Creates and returns a file-like object f wrapping the “file” or file-like object fileobj. When fileobj is None, filename must be a string that names a file; GzipFile opens that file with the mode (by default, 'rb'), and f wraps the resulting file object.

mode should be 'ab', 'rb', 'wb', or None. When mode is None, f uses the mode of fileobj if it can discover that mode; otherwise, it uses 'rb'. When filename is None, f uses the filename of fileobj if it can discover that name; otherwise, it uses ''. compresslevel is an integer between 1 and 9: 1 requests modest compression but fast operation; 9 requests the best compression at the cost of more computation.

The file-like object f delegates most methods to the underlying file-like object fileobj, transparently accounting for compression as needed. However, f does not allow nonsequential access, so f does not supply methods seek and tell. Calling f.close does not close fileobj if f was created with a not-None fileobj. This matters especially when fileobj is an instance of io.BytesIO: you can call fileobj.getvalue after f.close to get the compressed data string. So, you always have to call fileobj.close (explicitly, or implicitly by using a with statement) after f.close.
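
For example, here's a minimal sketch of compressing data entirely in memory, then retrieving the compressed bytes from the underlying io.BytesIO instance:

import gzip, io

underlying = io.BytesIO()
with gzip.GzipFile(fileobj=underlying, mode='wb') as f:
    f.write(b'some data worth compressing')
# f is now closed, but underlying is not: getvalue() still works
compressed = underlying.getvalue()
underlying.close()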

open

open(filename, mode='rb', compresslevel=9)

Like GzipFile(filename, mode, compresslevel), but filename is mandatory and there is no provision for passing an already opened fileobj. In v3 only, filename can be an already-opened “file” object; also, in v3, mode can be a text one (for example 'rt'), in which case open wraps an io.TextIOWrapper around the “file” it returns.

A gzip example

Say that you have some function f(x) that writes Unicode text to a text file object x passed in as an argument, by calling x.write and/or x.writelines. To make f write text to a gzip-compressed file instead:

import gzip, io
with io.open('x.txt.gz', 'wb') as underlying:
    with gzip.GzipFile(fileobj=underlying, mode='wb') as wrapper:
        f(io.TextIOWrapper(wrapper, 'utf8'))

This example opens the underlying binary file x.txt.gz and explicitly wraps it with gzip.GzipFile; thus, we need two nested with statements. This separation is not strictly necessary: we could pass the filename directly to gzip.GzipFile (or gzip.open); in v3 only, with gzip.open, we could even ask for mode='wt' and have the TextIOWrapper transparently provided for us. However, the example is coded to be maximally explicit, and portable between v2 and v3.

Reading back a compressed text file—for example, to display it on standard output—uses a pretty similar structure of code:

import gzip, io
with io.open('x.txt.gz', 'rb') as underlying:
    with gzip.GzipFile(fileobj=underlying, mode='rb') as wrapper:
        for line in wrapper:
            print(line.decode('utf8'), end='')

Here, we can’t just use an io.TextIOWrapper, since, in v2, it would not be iterable by line, given the characteristics of the underlying (decompressing) wrapper. However, the explicit decode of each line works fine in both v2 and v3.

The bz2 Module

The bz2 module lets you read and write files compatible with those handled by the compression programs bzip2 and bunzip2, which often achieve even better compression than gzip and gunzip. Module bz2 supplies the BZ2File class, for transparent file compression and decompression, and functions compress and decompress to compress and decompress data strings in memory. It also provides objects to compress and decompress data incrementally, enabling you to work with data streams that are too large to comfortably fit in memory at once. For the latter, advanced functionality, consult the Python standard library’s online docs.

For richer functionality in v2, consider the third-party module bz2file, whose more complete feature set matches that of v3’s standard library bz2 module, as listed here:

BZ2File

class BZ2File(filename=None, mode='r', buffering=0,
compresslevel=9)

Creates and returns a file-like object f, corresponding to the bzip2-compressed file named by filename, which must be a string denoting a file’s path (in v3 only, filename can also be an open file object). mode can be 'r', for reading, or 'w', for writing (in v3 only, 'a' for appending, and 'x' for exclusive-open—like 'w', but raising an exception when the file already exists—are also OK; and so is appending a 'b' to the mode string, since the resulting file-like object f is binary).

buffering is deprecated; don’t pass it. compresslevel is an integer between 1 and 9: 1 requests modest compression but fast operation; 9 requests the best compression at the cost of more computation.

f supplies all methods of “file” objects, including seek and tell. Thus, f is seekable; however, the seek operation is emulated, and, while guaranteed to be semantically correct, may in some cases be very slow.

compress

compress(s, level=9)

Compresses string s and returns the string of compressed data. level is an integer between 1 and 9: 1 requests modest compression but fast operation; 9 requests the best compression at the cost of more computation.

decompress

decompress(s)

Decompresses the compressed data string s and returns the string of uncompressed data.
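
For example, here's a minimal sketch of an in-memory round trip:

import bz2

data = b'hello world ' * 100
squeezed = bz2.compress(data, 9)
print('{} -> {}'.format(len(data), len(squeezed)))   # much smaller when compressed
assert bz2.decompress(squeezed) == data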

The tarfile Module

The tarfile module lets you read and write TAR files (archive files compatible with those handled by popular archiving programs such as tar), optionally with either gzip or bzip2 compression (and, in v3 only, lzma too). In v3 only, python -m tarfile offers a useful command-line interface to the module’s functionality: run it without further arguments to get a brief help message.

When handling invalid TAR files, functions of tarfile raise instances of tarfile.TarError. The tarfile module supplies the following classes and functions:

is_tarfile

is_tarfile(filename)

Returns True when the file named by string filename appears to be a valid TAR file (possibly with compression), judging by the first few bytes; otherwise, returns False.

TarInfo

class TarInfo(name='')

The methods getmember and getmembers of TarFile instances return instances of TarInfo, supplying information about members of the archive. You can also build a TarInfo instance with a TarFile instance’s method gettarinfo. The most useful attributes supplied by a TarInfo instance t are:

linkname

A string, the target file’s name, when t.type is LNKTYPE or SYMTYPE

mode

Permission and other mode bits of the file identified by t

mtime

Time of last modification of the file identified by t

name

Name in the archive of the file identified by t

size

Size in bytes (uncompressed) of the file identified by t

type

File type, one of many constants that are attributes of module tarfile (SYMTYPE for symbolic links, REGTYPE for regular files, DIRTYPE for directories, and so on)

To check the type of t, rather than testing t.type, you can call t’s methods. The most frequently used methods of t are:

t.isdir()

Returns True if the file is a directory

t.isfile()

Returns True if the file is a regular file

t.issym()

Returns True if the file is a symbolic link

open

open(filename, mode='r', fileobj=None, bufsize=10240, **kwargs)

Creates and returns a TarFile instance f to read or create a TAR file through file-like object fileobj. When fileobj is None, filename must be a string naming a file; open opens the file with the given mode (by default, 'r'), and f wraps the resulting file object. Calling f.close does not close fileobj if f was opened with a fileobj that is not None. This behavior of f.close is important when fileobj is an instance of io.BytesIO: you can call fileobj.getvalue after f.close to get the archived and possibly compressed data as a string. This behavior also means that you have to call fileobj.close explicitly after calling f.close.

mode can be 'r', to read an existing TAR file, with whatever compression it has (if any); 'w', to write a new TAR file, or truncate and rewrite an existing one, without compression; or 'a', to append to an existing TAR file, without compression. Appending to compressed TAR files is not supported. To write a new TAR file with compression, mode can be 'w:gz' for gzip compression, or 'w:bz2' for bzip2 compression. Special mode strings 'r:' or 'w:' can be used to read or write uncompressed, nonseekable TAR files using a buffer of bufsize bytes, and 'r:gz' and 'r:bz2' can be used to read such files with compression. In v3 only, you can also use 'r:xz' and 'w:xz' to specify LZMA compression.

In the mode strings specifying compression, you can use a vertical bar | instead of a colon : in order to force sequential processing and fixed-size blocks (precious if you’re ever handling a tape device).

A TarFile instance f supplies the following methods.

add

f.add(filepath, arcname=None, recursive=True)

Adds to archive f the file named by filepath (can be a regular file, a directory, or a symbolic link). When arcname is not None, it’s used as the archive member name in lieu of filepath. When filepath is a directory, add recursively adds the whole filesystem subtree rooted in that directory, unless you pass recursive as False.

addfile

f.addfile(tarinfo, fileobj=None)

Adds to archive f a member identified by tarinfo, a TarInfo instance (if fileobj is not None, the data is the first tarinfo.size bytes of file-like object fileobj).

close

f.close()

Closes archive f. You must call close, or else an incomplete, unusable TAR file might be left on disk. Such mandatory finalization is best performed with a try/finally, as covered in “try/finally”, or, even better, a with statement, covered in “The with Statement and Context Managers”.

extract

f.extract(member, path='.')

Extracts the archive member identified by member (a name or a TarInfo instance) into a corresponding file in the directory named by path (the current directory by default).

extractfile

f.extractfile(member)

Extracts the archive member identified by member (a name or a TarInfo instance) and returns a read-only file-like object with the methods read, readline, readlines, seek, and tell.

getmember

f.getmember(name)

Returns a TarInfo instance with information about the archive member named by the string name.

getmembers

f.getmembers()

Returns a list of TarInfo instances, one for each member in archive f, in the same order as the entries in the archive itself.

getnames

f.getnames()

Returns a list of strings, the names of each member in archive f, in the same order as the entries in the archive itself.

gettarinfo

f.gettarinfo(name=None, arcname=None, fileobj=None)

Returns a TarInfo instance with information about the open “file” object fileobj, when not None, or else the existing file whose path is string name. When arcname is not None, it’s used as the name attribute of the resulting TarInfo instance.

list

f.list(verbose=True)

Outputs a directory of the archive f to sys.stdout. If the optional argument verbose is False, outputs only the names of the archive’s members.
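
For example, here's a minimal sketch that builds a gzip-compressed TAR archive and then lists its members (the filenames are just placeholders for illustration):

import tarfile

# 'backup.tar.gz' and 'notes.txt' are placeholder paths
with tarfile.open('backup.tar.gz', 'w:gz') as archive:
    archive.add('notes.txt', arcname='notes.txt')

with tarfile.open('backup.tar.gz', 'r') as archive:
    for member in archive.getmembers():
        print('{} {}'.format(member.name, member.size))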

The zipfile Module

The zipfile module can read and write ZIP files (i.e., archive files compatible with those handled by popular compression programs such as zip and unzip, pkzip and pkunzip, WinZip, and so on). python -m zipfile offers a useful command-line interface to the module’s functionality: run it without further arguments to get a brief help message.

Detailed information about ZIP files is at the pkware and info-zip web pages. You need to study that detailed information to perform advanced ZIP file handling with zipfile. If you do not specifically need to interoperate with other programs using the ZIP file standard, the modules gzip and bz2 are often better ways to deal with file compression.

The zipfile module can’t handle multidisk ZIP files, and cannot create encrypted archives (however, it can decrypt them, albeit rather slowly). The module also cannot handle archive members using compression types besides the usual ones, known as stored (a file copied to the archive without compression) and deflated (a file compressed using the ZIP format’s default algorithm). (In v3 only, zipfile also handles compression types bzip2 and lzma, but beware: not all tools, including v2’s zipfile, can handle those, so if you use them you’re sacrificing some portability to get better compression.) For errors related to invalid .zip files, functions of zipfile raise exceptions that are instances of the exception class zipfile.error.

The zipfile module supplies the following classes and functions:

is_zipfile

is_zipfile(file)

Returns True when the file named by string file (or, the file-like object file) seems to be a valid ZIP file, judging by the first few and last bytes of the file; otherwise, returns False.

ZipInfo

class ZipInfo(filename='NoName', date_time=(1980, 1, 1, 0, 0, 0))

The methods getinfo and infolist of ZipFile instances return instances of ZipInfo to supply information about members of the archive. The most useful attributes supplied by a ZipInfo instance z are:

comment

A string that is a comment on the archive member

compress_size

Size in bytes of the compressed data for the archive member

compress_type

An integer code recording the type of compression of the archive member

date_time

A tuple of six integers with the time of the last modification to the file: the items are year, month, day (1 and up), hour, minute, second (0 and up)

file_size

Size in bytes of the uncompressed data for the archive member

filename

Name of the file in the archive

ZipFile

class ZipFile(file, mode='r', compression=zipfile.ZIP_STORED,
allowZip64=True)

Opens a ZIP file named by string file (or, the file-like object file). mode can be 'r', to read an existing ZIP file; 'w', to write a new ZIP file or truncate and rewrite an existing one; or 'a', to append to an existing file. (In v3 only, it can also be 'x', which is like 'w' but raises an exception if the ZIP file already existed—here, 'x' stands for “exclusive.”)

When mode is 'a', filename can name either an existing ZIP file (in which case new members are added to the existing archive) or an existing non-ZIP file. In the latter case, a new ZIP file–like archive is created and appended to the existing file. The main purpose of this latter case is to let you build a self-unpacking executable file that unpacks itself when run. The existing file must then be a pristine copy of a self-unpacking executable prefix, as supplied by www.info-zip.org and by other purveyors of ZIP file compression tools.

compression is an integer code that can be either of two attributes of the module zipfile. zipfile.ZIP_STORED requests that the archive use no compression; zipfile.ZIP_DEFLATED requests that the archive use the deflation mode of compression (i.e., the most usual and effective compression approach used in .zip files). In v3 only, it can also be zipfile.ZIP_BZIP2 or zipfile.ZIP_LZMA (sacrificing portability for more compression).

When allowZip64 is true, the ZipFile instance is allowed to use the ZIP64 extensions to produce an archive larger than 4 GB; otherwise, any attempt to produce such a large archive raises exception LargeZipFile. The default is True in v3, but False in v2.

A ZipFile instance z has the attributes compression and mode, corresponding to the arguments with which z was instantiated; fp and filename, the file-like object z works on and its filename if known; comment, the possibly empty string that is the archive’s comment; and filelist, the list of ZipInfo instances in the archive.

In addition, z has a writable attribute called debug, an int from 0 to 3 that you can assign to control how much debugging output to emit to sys.stdout—from nothing, when z.debug is 0, to the maximum amount of information available, when z.debug is 3.

ZipFile is a context manager; thus, you can use it in a with statement to ensure the underlying file gets closed when you’re done with it. For example:

with ZipFile('archive.zip') as z:
    data = z.read('data.txt')

A ZipFile instance z supplies the following methods:

close

z.close()

Closes archive file z. Make sure the close method gets called, or else an incomplete and unusable ZIP file might be left on disk. Such mandatory finalization is generally best performed with a try/finally statement, as covered in “try/finally”, or—even better—a with statement.

extract

z.extract(member, path=None, pwd=None)

Extract an archive member to disk, to the directory path, or, by default, to the current working directory; member is the member’s full name, or an instance of ZipInfo identifying the member. extract normalizes path info within member, turning absolute paths into relative ones, removing any component that’s '..', and, on Windows, turning characters that are illegal in filenames into underscores (_). pwd, if present, is the password to use to decrypt an encrypted member.

Returns the path to the file it has created (or overwritten if it already existed), or to the directory it has created (or left alone if it already existed).

extractall

z.extractall(path=None, members=None, pwd=None)

Extract archive members to disk (by default, all of them), to directory path, or, by default, to the current working directory; members optionally limits which members to extract, and must be a subset of the list of strings returned by z.namelist(). extractall normalizes path info within members it extracts, turning absolute paths into relative ones, removing any component that’s '..', and, on Windows, turning characters that are illegal in filenames into underscores (_). pwd, if present, is the password to use to decrypt an encrypted member.

getinfo

z.getinfo(name)

Returns a ZipInfo instance that supplies information about the archive member named by the string name.

infolist

z.infolist()

Returns a list of ZipInfo instances, one for each member in archive z, in the same order as the entries in the archive.

namelist

z.namelist()

Returns a list of strings, the name of each member in archive z, in the same order as the entries in the archive.

open

z.open(name, mode='r', pwd=None)

Extracts and returns the archive member identified by name (a member name string or ZipInfo instance) as a read-only file-like object. mode should be 'r': in v2 and early v3 versions it could also be 'U' or 'rU' to get “universal newlines” mode, but that’s now deprecated. pwd, if present, is the password to use to decrypt an encrypted member.

printdir

z.printdir()

Outputs a textual directory of the archive z to file sys.stdout.

read

z.read(name, pwd=None)

Extracts the archive member identified by name (a member name string or ZipInfo instance) and returns the bytestring of its contents. pwd, if present, is the password to use to decrypt an encrypted member.

setpassword

z.setpassword(pwd)

Sets string pwd as the default password to use to decrypt encrypted files.

testzip

z.testzip()

Reads and checks the files in archive z. Returns a string with the name of the first archive member that is damaged, or None if the archive is intact.

write

z.write(filename, arcname=None, compress_type=None)

Writes the file named by string filename to archive z, with archive member name arcname. When arcname is None, write uses filename as the archive member name. When compress_type is None, write uses z’s compression type; otherwise, compress_type specifies how to compress the file. z must be opened for 'w' or 'a'.

writestr

z.writestr(zinfo, bytes)

zinfo must be a ZipInfo instance specifying at least filename and date_time, or a string (in which case it’s used as the archive member name, and the date and time are set to the current moment). bytes is a string of bytes. writestr adds a member to archive z using the metadata specified by zinfo and the data in bytes. z must be opened for 'w' or 'a'. When you have data in memory and need to write the data to the ZIP file archive z, it’s simpler and faster to use z.writestr rather than z.write: the latter would require you to write the data to disk first and later remove the useless disk file. The following example shows both approaches, encapsulated into functions polymorphic to each other:

import zipfile

def data_to_zip_direct(z, data, name):
    import time
    zinfo = zipfile.ZipInfo(name, time.localtime()[:6])
    zinfo.compress_type = zipfile.ZIP_DEFLATED
    z.writestr(zinfo, data)

def data_to_zip_indirect(z, data, name):
    import os
    flob = open(name, 'wb')
    flob.write(data)
    flob.close()
    z.write(name)
    os.unlink(name)

with zipfile.ZipFile('z.zip', 'w',
                     zipfile.ZIP_DEFLATED) as zz:
    data = b'four score\nand seven\nyears ago\n'
    data_to_zip_direct(zz, data, 'direct.txt')
    data_to_zip_indirect(zz, data, 'indirect.txt')

Besides being faster and more concise, data_to_zip_direct is handier, since it works in memory and doesn’t require the current working directory to be writable, as data_to_zip_indirect does. Of course, write has its uses, when the data is in a file on disk and you just want to add the file to the archive.

Here’s how you can print a list of all files contained in the ZIP file archive created by the previous example, followed by each file’s name and contents:

import zipfile
zz = zipfile.ZipFile('z.zip')
zz.printdir()
for name in zz.namelist():
    print('{}: {!r}'.format(name, zz.read(name)))
zz.close()

The zlib Module

The zlib module lets Python programs use the free zlib compression library, version 1.1.4 or later. zlib is used by the modules gzip and zipfile, but is also available directly for any special compression needs. The most commonly used functions supplied by zlib are:

compress

compress(s, level=6)

Compresses string s and returns the string of compressed data. level is an integer between 1 and 9; 1 requests modest compression but fast operation, and 9 requests best compression, requiring more computation.

decompress

decompress(s)

Decompresses the compressed bytestring s and returns the bytestring of uncompressed data (also accepts optional arguments for pretty advanced uses—see the online docs).
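
For example, here's a minimal round trip through compress and decompress (the sample data and compression level are arbitrary):

import zlib

data = b'a sample bytestring, repeated many times ' * 100
compressed = zlib.compress(data, 9)     # 9: best compression, slowest
print('{} -> {}'.format(len(data), len(compressed)))   # compressed is much smaller
assert zlib.decompress(compressed) == data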

The zlib module also supplies functions to compute Cyclic-Redundancy Check (CRC), which allows detection of damage in compressed data, as well as objects to compress and decompress data incrementally to allow working with data streams too large to fit in memory at once. For such advanced functionality, consult the Python library’s online docs.

The os Module

os is an umbrella module presenting a reasonably uniform cross-platform view of the capabilities of various operating systems. It supplies low-level ways to create and handle files and directories, and to create, manage, and destroy processes. This section covers filesystem-related functions of os; “Running Other Programs with the os Module” covers process-related functions.

The os module supplies a name attribute, a string that identifies the kind of platform on which Python is being run. Common values for name are 'posix' (all kinds of Unix-like platforms, including Linux and macOS), 'nt' (all kinds of Windows platforms), and 'java' (Jython). You can exploit some unique capabilities of a platform through functions supplied by os. However, this book deals with cross-platform programming, not with platform-specific functionality, so we do not cover parts of os that exist only on one platform, nor platform-specific modules. Functionality covered in this book is available at least on 'posix' and 'nt' platforms. We do, though, cover some of the differences among the ways in which a given functionality is provided on various platforms.

OSError Exceptions

When a request to the operating system fails, os raises an exception, an instance of OSError. os also exposes built-in exception class OSError with the synonym os.error. Instances of OSError expose three useful attributes:

errno

The numeric error code of the operating system error

strerror

A string that summarily describes the error

filename

The name of the file on which the operation failed (file-related functions only)

In v3 only, OSError has many subclasses that specify more precisely what problem was encountered, as covered in “OSError and subclasses (v3 only)”.

os functions can also raise other standard exceptions, such as TypeError or ValueError, when the cause of the error is that you have called them with invalid argument types or values, so that the underlying operating system functionality has not even been attempted.

The errno Module

The errno module supplies dozens of symbolic names for error code numbers. To handle possible system errors selectively, based on error codes, use errno to enhance your program’s portability and readability. For example, here’s how you might handle “file not found” errors, while propagating all other kinds of errors (when you want your code to work as well in v2 as in v3):

try: os.some_os_function_or_other()
except OSError as err:
    import errno
    # check for "file not found" errors, re-raise other cases
    if err.errno != errno.ENOENT: raise
    # proceed with the specific case you can handle
    print('Warning: file', err.filename, 'not found—continuing')

If you’re coding for v3 only, however, you can make an equivalent snippet much simpler and clearer, by catching just the applicable OSError subclass:

try: os.some_os_function_or_other()
except FileNotFoundError as err:
    print('Warning: file', err.filename, 'not found—continuing')

errno also supplies a dictionary named errorcode: the keys are error code numbers, and the corresponding values are the error names, which are strings such as 'ENOENT'. Displaying errno.errorcode[err.errno], as part of your diagnosis of some OSError instance err, can often make the diagnosis clearer and more understandable to readers who specialize in the specific platform.

Filesystem Operations

Using the os module, you can manipulate the filesystem in a variety of ways: creating, copying, and deleting files and directories; comparing files; and examining filesystem information about files and directories. This section documents the attributes and methods of the os module that you use for these purposes, and covers some related modules that operate on the filesystem.

Path-String Attributes of the os Module

A file or directory is identified by a string, known as its path, whose syntax depends on the platform. On both Unix-like and Windows platforms, Python accepts Unix syntax for paths, with a slash (/) as the directory separator. On non-Unix-like platforms, Python also accepts platform-specific path syntax. On Windows, in particular, you may use a backslash (\) as the separator. However, you then need to double up each backslash, as \\, in string literals, or use raw-string syntax as covered in “Literals”; you also needlessly lose portability. Unix path syntax is handier and usable everywhere, so we strongly recommend that you always use it. In the rest of this chapter, we assume Unix path syntax in both explanations and examples.

The os module supplies attributes that provide details about path strings on the current platform. You should typically use the higher-level path manipulation operations covered in “The os.path Module” rather than lower-level string operations based on these attributes. However, the attributes may be useful at times.

curdir

The string that denotes the current directory ('.' on Unix and Windows)

defpath

The default search path for programs, used if the environment lacks a PATH environment variable

linesep

The string that terminates text lines ('\n' on Unix; '\r\n' on Windows)

extsep

The string that separates the extension part of a file’s name from the rest of the name ('.' on Unix and Windows)

pardir

The string that denotes the parent directory ('..' on Unix and Windows)

pathsep

The separator between paths in lists of paths, such as those used for the environment variable PATH (':' on Unix; ';' on Windows)

sep

The separator of path components ('/' on Unix; '\\' on Windows)

Permissions

Unix-like platforms associate nine bits with each file or directory: three each for the file’s owner, its group, and anybody else (AKA “the world”), indicating whether the file or directory can be read, written, and executed by the given subject. These nine bits are known as the file’s permission bits, and are part of the file’s mode (a bit string that includes other bits that describe the file). These bits are often displayed in octal notation, since that groups three bits per digit. For example, mode 0o664 indicates a file that can be read and written by its owner and group, and read—but not written—by anybody else. When any process on a Unix-like system creates a file or directory, the operating system applies to the specified mode a bit mask known as the process’s umask, which can remove some of the permission bits.

Non-Unix-like platforms handle file and directory permissions in very different ways. However, the os functions that deal with file permissions accept a mode argument according to the Unix-like approach described in the previous paragraph. Each platform maps the nine permission bits in a way appropriate for it. For example, on versions of Windows that distinguish only between read-only and read/write files and do not distinguish file ownership, a file’s permission bits show up as either 0o666 (read/write) or 0o444 (read-only). On such a platform, when creating a file, the implementation looks only at bit 0o200, making the file read/write when that bit is 1, read-only when 0.
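
As a small illustration of the umask's effect, here's a sketch that reads the process's current umask (os.umask both sets the umask and returns the previous value, so we set it and immediately restore it) and computes the permission bits that a file created with a requested mode of 0o666 would actually get:

import os

old_umask = os.umask(0)     # set umask to 0, remembering the previous value
os.umask(old_umask)         # immediately restore the original umask
requested_mode = 0o666      # read/write for owner, group, and others
effective_mode = requested_mode & ~old_umask
print('umask {}: files created with mode {}'.format(
    oct(old_umask), oct(effective_mode)))   # e.g., with umask 0o022, mode 0o644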

File and Directory Functions of the os Module

The os module supplies several functions to query and set file and directory status. In all versions and platforms, the argument path to any of these functions can be a string giving the path of the file or directory involved. In v3 only, on some Unix platforms, some of the functions also support as argument path a file descriptor (AKA fd), an int denoting a file (as returned, for example, by os.open). In this case, the module attribute os.supports_fd is the set of functions of the os module that do support a file descriptor as argument path (the module attribute is missing in v2, and in v3 on platforms lacking such support).

In v3 only, on some Unix platforms, some functions support the optional, keyword-only argument follow_symlinks, defaulting to True. When true, and always in v2, if path indicates a symbolic link, the function follows it to reach an actual file or directory; when false, the function operates on the symbolic link itself. The module attribute os.supports_follow_symlinks, if present, is the set of functions of the os module that do support this argument.

In v3 only, on some Unix platforms, some functions support the optional, keyword-only argument dir_fd, defaulting to None. When present, path (if relative) is taken as being relative to the directory open at that file descriptor; when missing, and always in v2, path (if relative) is taken as relative to the current working directory. The module attribute os.supports_dir_fd, if present, is the set of functions of the os module that do support this argument.

Table 10-3. os module functions

access

access(path, mode)

Returns True when the file path has all of the permissions encoded in integer mode; otherwise, False. mode can be os.F_OK to test for file existence, or one or more of the constant integers named os.R_OK, os.W_OK, and os.X_OK (with the bitwise-OR operator | joining them, if more than one) to test permissions to read, write, and execute the file.

access does not use the standard interpretation for its mode argument, covered in “Permissions”. Rather, access tests only if this specific process’s real user and group identifiers have the requested permissions on the file. If you need to study a file’s permission bits in more detail, see the function stat, covered in Table 10-4.

In v3 only, access supports an optional, keyword-only argument effective_ids, defaulting to False. When this argument is passed as true, access uses effective rather than real user and group identifiers.

chdir

chdir(path)

Sets the current working directory of the process to path.

chmod

chmod(path, mode)

Changes the permissions of the file path, as encoded in integer mode. mode can be zero or more of os.R_OK, os.W_OK, and os.X_OK (with the bitwise-OR operator | joining them, if more than one) for read, write, and execute (respectively). On Unix-like platforms, mode can be a richer bit pattern (as covered in “Permissions”) to specify different permissions for user, group, and other, as well as other special bits defined in the module stat and listed in the online docs.

getcwd

getcwd()

Returns a str, the path of the current working directory. In v3, getcwdb returns the same value as bytes; in v2, getcwdu returns the same value as unicode.

link

link(src, dst)

Create a hard link named dst, pointing to src. In v2, this is only available on Unix platforms; in v3, it’s also available on Windows.

listdir

listdir(path)

Returns a list whose items are the names of all files and subdirectories in the directory path. The list is in arbitrary order and does not include the special directory names '.' (current directory) and '..' (parent directory). See also the v3-only alternative function scandir, covered later in this table, which can offer performance improvements in some cases.

The v2-only dircache module also supplies a function named listdir, which works like os.listdir, with two enhancements. dircache.listdir returns a sorted list; and dircache caches the list, so that repeated requests for the same directory are faster if the directory’s contents don’t change. dircache automatically detects changes: when you call dircache.listdir, you get a list of the directory’s contents at that time.

makedirs, mkdir

makedirs(path, mode=0o777) mkdir(path, mode=0o777)

makedirs creates all directories that are part of path and do not yet exist. mkdir creates only the rightmost directory of path and raises OSError if any of the previous directories in path do not exist. Both functions use mode as permission bits of directories they create. Both raise OSError when creation fails, or when a file or directory named path already exists.

remove, unlink

remove(path) unlink(path)

Removes the file named path (see rmdir in this table to remove a directory). unlink is a synonym of remove.

removedirs

removedirs(path)

Loops from right to left over the directories that are part of path, removing each one. The loop ends when a removal attempt raises an exception, generally because a directory is not empty. removedirs does not propagate the exception, as long as it has removed at least one directory.

rename

rename(source, dest)

Renames (i.e., moves) the file or directory named source to dest. If dest already exists, rename may either replace dest, or raise an exception; in v3 only, to guarantee replacement rather than exception, call, instead, the function os.replace.

renames

renames(source, dest)

Like rename, except that renames tries to create all intermediate directories needed for dest. Also, after renaming, renames tries to remove empty directories from the path source using removedirs. It does not propagate any resulting exception; it’s not an error if the starting directory of source does not become empty after the renaming.

rmdir

rmdir(path)

Removes the empty directory named path (raises OSError if the removal fails, and, in particular, if the directory is not empty).

scandir

scandir(path) v3 only

Returns an iterator over os.DirEntry instances representing each item in the path; using scandir, and calling each resulting instance’s methods to determine its characteristics, can offer performance improvements compared to using listdir and stat, depending on the underlying platform.

 

class DirEntry

An instance d of class DirEntry supplies string attributes name and path, holding the item’s base name and full path, respectively; and several methods, of which the most frequently used are the no-arguments, bool-returning methods is_dir, is_file, and is_symlink. is_dir and is_file by default follow symbolic links: pass keyword-only argument follow_symlinks=False to avoid this behavior. For more complete information, see the online docs. d avoids system calls as much as feasible, and, when it needs one, it caches the results; if you need information that’s guaranteed to be up to date, call os.stat(d.path) and use the stat_result instance it returns (however, this sacrifices scandir’s potential performance improvements).
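
For example, here's a sketch (v3 only, since it relies on os.scandir) that lists the regular files immediately within a directory, together with their sizes:

import os

def files_with_sizes(path='.'):
    result = []
    for entry in os.scandir(path):
        # is_file follows symlinks by default; pass follow_symlinks=False to avoid that
        if entry.is_file():
            result.append((entry.name, entry.stat().st_size))
    return result

for name, size in files_with_sizes():
    print('{}: {} bytes'.format(name, size))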

stat

stat(path)

Returns a value x of type stat_result, which provides 10 items of information about the file or subdirectory path. Accessing those items by their numeric indices is possible but generally not advisable, because the resulting code is not very readable; use the corresponding attribute names instead. Table 10-4 lists the attributes of a stat_result instance and the meaning of corresponding items.

Table 10-4. Items (attributes) of a stat_result instance
Item index   Attribute name   Meaning

0            st_mode          Protection and other mode bits
1            st_ino           Inode number
2            st_dev           Device ID
3            st_nlink         Number of hard links
4            st_uid           User ID of owner
5            st_gid           Group ID of owner
6            st_size          Size in bytes
7            st_atime         Time of last access
8            st_mtime         Time of last modification
9            st_ctime         Time of last status change

For example, to print the size in bytes of file path, you can use any of:

import os
print(os.path.getsize(path))
print(os.stat(path)[6])
print(os.stat(path).st_size)

Time values are in seconds since the epoch, as covered in Chapter 12 (int on most platforms). Platforms unable to give a meaningful value for an item use a dummy value.

tempnam, tmpnam

tempnam(dir=None, prefix=None) tmpnam( )

Returns an absolute path usable as the name of a new temporary file.

Note: tempnam and tmpnam are weaknesses in your program’s security. Avoid these functions and use instead the standard library module tempfile, covered in “The tempfile Module”.

utime

utime(path, times=None)

Sets the accessed and modified times of file or directory path. If times is None, utime uses the current time. Otherwise, times must be a pair of numbers (in seconds since the epoch, as covered in Chapter 12) in the order (accessed, modified).

walk

walk(top, topdown=True, onerror=None, followlinks=False)

A generator yielding an item for each directory in the tree whose root is directory top. When topdown is True, the default, walk visits directories from the tree’s root downward; when topdown is False, walk visits directories from the tree’s leaves upward. When onerror is None, walk catches and ignores any OSError exception raised during the tree-walk. Otherwise, onerror must be a function; walk catches any OSError exception raised during the tree-walk and passes it as the only argument in a call to onerror, which may process it, ignore it, or raise it to terminate the tree-walk and propagate the exception.

Each item walk yields is a tuple of three subitems: dirpath, a string that is the directory’s path; dirnames, a list of names of subdirectories that are immediate children of the directory (special directories ‘.’ and ‘..’ are not included); and filenames, a list of names of files that are directly in the directory. If topdown is True, you can alter list dirnames in-place, removing some items and/or reordering others, to affect the tree-walk of the subtree rooted at dirpath; walk iterates only on subdirectories left in dirnames, in the order in which they’re left. Such alterations have no effect if topdown is False (in this case, walk has already visited all subdirectories by the time it visits the current directory and yields its item).

By default, walk does not walk down symbolic links that resolve to directories. To get such extra walking, pass followlinks as true, but beware: this can cause infinite looping if a symbolic link resolves to a directory that is its ancestor—walk doesn’t take precautions against this anomaly.
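
For example, here's a sketch that totals the sizes of all files in a tree, pruning dirnames in-place (as just described) so that walk never descends into subdirectories whose names start with a dot:

import os

def total_size(top):
    total = 0
    for dirpath, dirnames, filenames in os.walk(top):
        # prune "hidden" subdirectories in-place: walk won't visit them
        dirnames[:] = [d for d in dirnames if not d.startswith('.')]
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

print(total_size('.'))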

The os.path Module

The os.path module supplies functions to analyze and transform path strings. To use this module, you can import os.path; however, even if you just import os, you can also access the os.path module and all of its attributes. The most commonly useful functions from the module are listed here:

abspath

abspath(path)

Returns a normalized absolute path string equivalent to path, just like:

os.path.normpath(os.path.join(os.getcwd(), path))

For example, os.path.abspath(os.curdir) is the same as os.getcwd().

basename

basename(path)

Returns the base name part of path, just like os.path.split(path)[1]. For example, os.path.basename('b/c/d.e') returns 'd.e'.

commonprefix

commonprefix(list)

Accepts a list of strings and returns the longest string that is a prefix of all items in the list. Unlike all other functions in os.path, commonprefix works on arbitrary strings, not just on paths. For example, os.path.commonprefix(['foobar', 'foolish']) returns 'foo'.

In v3 only, function os.path.commonpath works similarly, but returns, specifically, only common prefix paths, not arbitrary string prefixes.

dirname

dirname(path)

Returns the directory part of path, just like os.path.split(path)[0]. For example, os.path.dirname('b/c/d.e') returns 'b/c'.

exists,
lexists

exists(path) lexists(path)

Returns True when path names an existing file or directory; otherwise, False. In other words, os.path.exists(x) is the same as os.access(x, os.F_OK). lexists is the same, but also returns True when path names a symbolic link that indicates a nonexisting file or directory (sometimes known as a broken symlink), while exists returns False in such cases.

expandvars,
expanduser

expandvars(path) expanduser(path)

Returns a copy of string path, where each substring of the form $name or ${name} is replaced with the value of environment variable name. For example, if environment variable HOME is set to /u/alex, the following code:

import os
print(os.path.expandvars('$HOME/foo/'))

emits /u/alex/foo/.

os.path.expanduser expands a leading ~ to the path of the home directory of the current user.

getatime, getmtime, getctime, getsize

getatime(path) getmtime(path) getctime(path) getsize(path)

Each of these functions returns an attribute from the result of os.stat(path): respectively, st_atime, st_mtime, st_ctime, and st_size. See Table 10-4 for more details about these attributes.

isabs

isabs(path)

Returns True when path is absolute. (A path is absolute when it starts with a slash (/), or, on some non-Unix-like platforms, with a drive designator followed by os.sep.) When path is not absolute, isabs returns False.

isfile

isfile(path)

Returns True when path names an existing regular file (however, isfile also follows symbolic links); otherwise, False.

isdir

isdir(path)

Returns True when path names an existing directory (however, isdir also follows symbolic links); otherwise, False.

islink

islink(path)

Returns True when path names a symbolic link; otherwise, False.

ismount

ismount(path)

Returns True when path names a mount point; otherwise, False.

join

join(path, *paths)

Returns a string that joins the argument strings with the appropriate path separator for the current platform. For example, on Unix, exactly one slash character / separates adjacent path components. If any argument is an absolute path, join ignores previous components. For example:

print(os.path.join('a/b', 'c/d', 'e/f'))
# on Unix prints: a/b/c/d/e/f
print(os.path.join('a/b', '/c/d', 'e/f'))
# on Unix prints: /c/d/e/f

The second call to os.path.join ignores its first argument 'a/b', since its second argument '/c/d' is an absolute path.

normcase

normcase(path)

Returns a copy of path with case normalized for the current platform. On case-sensitive filesystems (as is typical in Unix-like systems), path is returned unchanged. On case-insensitive filesystems (as is typical in Windows), all letters in the returned string are lowercase. On Windows, normcase also converts each / to a \\.

normpath

normpath(path)

Returns a normalized pathname equivalent to path, removing redundant separators and path-navigation aspects. For example, on Unix, normpath returns 'a/b' when path is any of 'a//b', 'a/./b', or 'a/c/../b'. normpath makes path separators appropriate for the current platform. For example, on Windows, separators become \\.

realpath

realpath(path)

Returns the actual path of the specified file or directory, resolving symlinks along the way.

relpath

relpath(path, start=os.curdir)

Returns a relative path to the specified file or directory, relative to directory start (by default, the process’s current working directory).

samefile

samefile(path1, path2)

Returns True if both arguments refer to the same file or directory.

sameopenfile

sameopenfile(fd1, fd2)

Returns True if both file descriptor arguments refer to the same open file or directory.

samestat

samestat(stat1, stat2)

Returns True if both arguments, instances of os.stat_result (typically, results of os.stat calls), refer to the same file or directory.

split

split(path)

Returns a pair of strings (dir, base) such that join(dir, base) equals path. base is the last component and never contains a path separator. If path ends in a separator, base is ''. dir is the leading part of path, up to the last separator, shorn of trailing separators. For example, os.path.split('a/b/c/d') returns ('a/b/c', 'd').

splitdrive

splitdrive(path)

Returns a pair of strings (drv,pth) such that drv+pth equals path. drv is either a drive specification, or ''—always '' on platforms not supporting drive specifications, such as Unix-like systems. On Windows, os.path.splitdrive('c:d/e') returns ('c:', 'd/e').

splitext

splitext(path)

Returns a pair (root, ext) such that root+ext equals path. ext is either '' or starts with a '.' and has no other '.' or path separator. For example, os.path.splitext('a.a/b.c.d') returns the pair ('a.a/b.c', '.d').

walk

walk(path, func, arg)

(v2 only) Calls func(arg, dirpath, namelist) for each directory in the tree whose root is the directory path, starting with path itself. This function is hard to use and obsolete; use, instead, generator os.walk, covered in Table 10-3, on both v2 and v3.

The stat Module

The function os.stat (covered in Table 10-4) returns instances of stat_result, whose item indices, attribute names, and meaning are also covered there. The stat module supplies attributes with names like those of stat_result’s attributes, turned into uppercase, and corresponding values that are the corresponding item indices.

The more interesting contents of the stat module are functions to examine the st_mode attribute of a stat_result instance and determine the kind of file. os.path also supplies functions for such tasks, which operate directly on the file’s path. The functions supplied by stat shown in the following list are faster when you perform several tests on the same file: they require only one os.stat call at the start of a series of tests, while the functions in os.path implicitly ask the operating system for the same information at each test. Each function returns True when mode denotes a file of the given kind; otherwise, False.

S_ISDIR(mode)

Is the file a directory?

S_ISCHR(mode)

Is the file a special device-file of the character kind?

S_ISBLK(mode)

Is the file a special device-file of the block kind?

S_ISREG(mode)

Is the file a normal file (not a directory, special device-file, and so on)?

S_ISFIFO(mode)

Is the file a FIFO (also known as a “named pipe”)?

S_ISLNK(mode)

Is the file a symbolic link?

S_ISSOCK(mode)

Is the file a Unix-domain socket?

Several of these functions are meaningful only on Unix-like systems, since other platforms do not keep special files such as devices and sockets in the namespace for regular files, as Unix-like systems do.

The stat module supplies two more functions that extract relevant parts of a file’s mode (x.st_mode, for some result x of function os.stat):

S_IFMT

S_IFMT(mode)

Returns those bits of mode that describe the kind of file (i.e., those bits that are examined by functions S_ISDIR, S_ISREG, etc.).

S_IMODE

S_IMODE(mode)

Returns those bits of mode that can be set by the function os.chmod (i.e., the permission bits and, on Unix-like platforms, a few other special bits such as the set-user-id flag).
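
For example, here's a sketch that performs several such tests with a single os.stat call, then displays just the permission bits in octal (the path some_path is only a placeholder):

import os, stat

info = os.stat('some_path')            # hypothetical path
mode = info.st_mode
if stat.S_ISDIR(mode):
    print('directory')
elif stat.S_ISREG(mode):
    print('regular file')
print('permission bits: {}'.format(oct(stat.S_IMODE(mode))))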

The filecmp Module

The filecmp module supplies the following functions to compare files and directories:

cmp

cmp(f1, f2, shallow=True)

Compares the files named by path strings f1 and f2. If the files are deemed to be equal, cmp returns True; otherwise, False. If shallow is true, files are deemed to be equal if their stat tuples are. When shallow is False, cmp reads and compares the contents of files whose stat tuples are equal.

cmpfiles

cmpfiles(dir1, dir2, common, shallow=True)

Loops on the sequence common. Each item of common is a string that names a file present in both directories dir1 and dir2. cmpfiles returns a tuple whose items are three lists of strings: (equal, diff, errs). equal is the list of names of files that are equal in both directories, diff is the list of names of files that differ between directories, and errs is the list of names of files that could not be compared (because they do not exist in both directories, or there is no permission to read them). The argument shallow is the same as for cmp.

dircmp

class dircmp(dir1, dir2, ignore=('RCS', 'CVS', 'tags'), hide=('.', '..'))

Creates a new directory-comparison instance object, comparing directories named dir1 and dir2, ignoring names listed in ignore, and hiding names listed in hide. (In v3, the default value for ignore lists more files, and is supplied by attribute DEFAULT_IGNORES of module filecmp; at the time of this writing, ['RCS', 'CVS', 'tags', '.git', '.hg', '.bzr', '_darcs', '__pycache__'].)

A dircmp instance d supplies three methods:

d.report()

Outputs to sys.stdout a comparison between dir1 and dir2

d.report_partial_closure()

Outputs to sys.stdout a comparison between dir1 and dir2 and their common immediate subdirectories

d.report_full_closure()

Outputs to sys.stdout a comparison between dir1 and dir2 and all their common subdirectories, recursively

In addition, d supplies several attributes, covered in the next section.

dircmp instance attributes

A dircmp instance d supplies several attributes, computed “just in time” (i.e., only if and when needed, thanks to a __getattr__ special method) so that using a dircmp instance suffers no unnecessary overhead:

d.common

Files and subdirectories that are in both dir1 and dir2

d.common_dirs

Subdirectories that are in both dir1 and dir2

d.common_files

Files that are in both dir1 and dir2

d.common_funny

Names that are in both dir1 and dir2 for which os.stat reports an error or returns different kinds for the versions in the two directories

d.diff_files

Files that are in both dir1 and dir2 but with different contents

d.funny_files

Files that are in both dir1 and dir2 but could not be compared

d.left_list

Files and subdirectories that are in dir1

d.left_only

Files and subdirectories that are in dir1 and not in dir2

d.right_list

Files and subdirectories that are in dir2

d.right_only

Files and subdirectories that are in dir2 and not in dir1

d.same_files

Files that are in both dir1 and dir2 with the same contents

d.subdirs

A dictionary whose keys are the strings in common_dirs; the corresponding values are instances of dircmp for each subdirectory
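
For example, here's a small sketch that prints the names of files whose contents differ between two directory trees, recursing into common subdirectories via subdirs (the directory names dir1 and dir2 are placeholders):

import filecmp

def report_diff_files(d):
    # d is a filecmp.dircmp instance
    for name in d.diff_files:
        print('{} differs between {} and {}'.format(name, d.left, d.right))
    for sub in d.subdirs.values():
        report_diff_files(sub)

report_diff_files(filecmp.dircmp('dir1', 'dir2'))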

The fnmatch Module

The fnmatch module (an abbreviation for filename match) matches filename strings with patterns that resemble the ones used by Unix shells:

*

Matches any sequence of characters

?

Matches any single character

[chars]

Matches any one of the characters in chars

[!chars]

Matches any one character not among those in chars

fnmatch does not follow other conventions of Unix shells’ pattern matching, such as treating a slash / or a leading dot . specially. It also does not allow escaping special characters: rather, to literally match a special character, enclose it in brackets. For example, to match a filename that’s a single closed bracket, use the pattern '[]]'.

The fnmatch module supplies the following functions:

filter

filter(names, pattern)

Returns the list of items of names (a sequence of strings) that match pattern.

fnmatch

fnmatch(filename, pattern)

Returns True when string filename matches pattern; otherwise, False. The match is case-sensitive when the platform is (for example, on any Unix-like system), and case-insensitive otherwise (for example, on Windows); beware of this if you’re dealing with a filesystem whose case-sensitivity doesn’t match your platform’s (for example, macOS is Unix-like, but its typical filesystems are case-insensitive).

fnmatchcase

fnmatchcase(filename, pattern)

Returns True when string filename matches pattern; otherwise, False. The match is always case-sensitive on any platform.

translate

translate(pattern)

Returns the regular expression pattern (as covered in “Pattern-String Syntax”) equivalent to the fnmatch pattern pattern. This can be quite useful, for example, to perform matches that are always case-insensitive on any platform—a functionality fnmatch doesn’t supply:

import fnmatch, re

def fnmatchnocase(filename, pattern):
    re_pat = fnmatch.translate(pattern)
    return re.match(re_pat, filename, re.IGNORECASE)

The glob Module

The glob module lists (in arbitrary order) the path names of files that match a path pattern using the same rules as fnmatch; in addition, it does treat a leading dot . specially, like Unix shells do.

glob

glob(pathname)

Returns the list of path names of files that match pattern pathname. In v3 only, you can also optionally pass named argument recursive=True to have path component ** recursively match zero or more levels of subdirectories.
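
For example (the second call is v3 only, since it uses the recursive option just mentioned):

import glob

# Python source files directly in the current directory
print(glob.glob('*.py'))

# v3 only: Python source files anywhere under the current directory
print(glob.glob('**/*.py', recursive=True))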

iglob

iglob(pathname)

Like glob, but returns an iterator yielding one relevant path name at a time.

The shutil Module

The shutil module (an abbreviation for shell utilities) supplies the following functions to copy and move files, and to remove an entire directory tree. In v3 only, on some Unix platforms, most of the functions support optional, keyword-only argument follow_symlinks, defaulting to True. When true, and always in v2, if a path indicates a symbolic link, the function follows it to reach an actual file or directory; when false, the function operates on the symbolic link itself.

copy

copy(src, dst)

Copies the contents of the file src, creating or overwriting the file dst. If dst is a directory, the target is a file with the same base name as src, but located in dst. copy also copies permission bits, but not last-access and modification times. In v3 only, returns the path to the destination file it has copied to (in v2, less usefully, copy returns None).

copy2

copy2(src, dst)

Like copy, but also copies times of last access and modification.

copyfile

copyfile(src, dst)

Copies just the contents (not permission bits, nor last-access and modification times) of the file src, creating or overwriting the file dst.

copyfileobj

copyfileobj(fsrc, fdst, bufsize=16384)

Copies all bytes from the “file” object fsrc, which must be open for reading, to “file” object fdst, which must be open for writing. Copies no more than bufsize bytes at a time if bufsize is greater than 0. “File” objects are covered in “The io Module”.

copymode

copymode(src, dst)

Copies permission bits of the file or directory src to file or directory dst. Both src and dst must exist. Does not change dst’s contents, nor its status as being a file or a directory.

copystat

copystat(src, dst)

Copies permission bits and times of last access and modification of the file or directory src to the file or directory dst. Both src and dst must exist. Does not change dst’s contents, nor its status as being a file or a directory.

copytree

copytree(src, dst, symlinks=False, ignore=None)

Copies the directory tree rooted at src into the destination directory named by dst. dst must not already exist: copytree creates it (as well as creating any missing parent directory). copytree copies each file by using function copy2 (in v3 only, you can optionally pass a different file-copy function as named argument copy_function).

When symlinks is true, copytree creates symbolic links in the new tree when it finds symbolic links in the source tree. When symlinks is false, copytree follows each symbolic link it finds and copies the linked-to file with the link’s name. On platforms that do not have the concept of a symbolic link, copytree ignores argument symlinks.

When ignore is not None, it must be a callable accepting two arguments (a directory path and a list of the directory’s immediate children) and returning a list of such children to be ignored in the copy process. If present, ignore is usually the result of a call to shutil.ignore_patterns; for example:

import shutil
ignore = shutil.ignore_patterns('.*', '*.bak')
shutil.copytree('src', 'dst', ignore=ignore)

copies the tree rooted at directory src into a new tree rooted at directory dst, ignoring any file or subdirectory whose name starts with a dot, and any file or subdirectory whose name ends with '.bak'.

ignore_patterns

ignore_patterns(*patterns)

Returns a callable picking out files and subdirectories matching patterns, like those used in the fnmatch module (see “The fnmatch Module”). The result is suitable for passing as the ignore argument to function copytree.

move

move(src, dst)

Moves the file or directory src to dst. First tries os.rename. Then, if that fails (because src and dst are on separate filesystems, or because dst already exists), copies src to dst (copy2 for a file, copytree for a directory; in v3 only, you can optionally pass a file-copy function other than copy2 as named argument copy_function), then removes src (os.unlink for a file, rmtree for a directory).

rmtree

rmtree(path, ignore_errors=False, onerror=None)

Removes directory tree rooted at path. When ignore_errors is True, rmtree ignores errors. When ignore_errors is False and onerror is None, errors raise exceptions. When onerror is not None, it must be callable with three parameters: func, path, and ex. func is the function raising the exception (os.remove or os.rmdir), path is the path passed to func, and ex is the tuple of information sys.exc_info() returns. When onerror raises an exception, rmtree terminates, and the exception propagates.

Beyond offering functions that are directly useful, the source file shutil.py in the standard Python library is an excellent example of how to use many os functions.

File Descriptor Operations

The os module supplies, among many others, many functions to handle file descriptors, integers the operating system uses as opaque handles to refer to open files. Python “file” objects (covered in “The io Module”) are usually better for I/O tasks, but sometimes working at file-descriptor level lets you perform some operation faster, or (sacrificing portability) in ways not directly available with io.open. “File” objects and file descriptors are not interchangeable.

To get the file descriptor n of a Python “file” object f, call n=f.fileno(). To wrap a new Python “file” object f around an open file descriptor fd, call f=os.fdopen(fd), or pass fd as the first argument of io.open. On Unix-like and Windows platforms, some file descriptors are pre-allocated when a process starts: 0 is the file descriptor for the process’s standard input, 1 for the process’s standard output, and 2 for the process’s standard error.
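
For example, here's a sketch of moving between the two levels (the filename is just a placeholder):

import io, os

f = io.open('some_file.dat', 'wb')      # hypothetical file name
f.write(b'a few bytes')
f.flush()
n = f.fileno()                          # the descriptor underlying "file" object f
print(os.fstat(n).st_size)              # query it at the descriptor level: 11
f.close()

fd = os.open('some_file.dat', os.O_RDONLY)
g = os.fdopen(fd, 'rb')                 # wrap a "file" object around descriptor fd
print(g.read())                         # b'a few bytes'
g.close()                               # closing g also closes fd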

os provides many functions dealing with file descriptors; the most often used ones are listed in Table 10-5.

Table 10-5. os module functions dealing with file descriptors

close

close(fd)

Closes file descriptor fd.

closerange

closerange(fd_low, fd_high)

Closes all file descriptors from fd_low, included, to fd_high, excluded, ignoring any errors that may occur.

dup

dup(fd)

Returns a file descriptor that duplicates file descriptor fd.

dup2

dup2(fd, fd2)

Duplicates file descriptor fd to file descriptor fd2. When file descriptor fd2 is already open, dup2 first closes fd2.

fdopen

fdopen(fd, *a, **k)

Like io.open, except that fd must be an int that’s an open file descriptor.

fstat

fstat(fd)

Returns a stat_result instance x, with info about the file open on file descriptor fd. Table 10-4 covers x’s contents.

lseek

lseek(fd, pos, how)

Sets the current position of file descriptor fd to the signed integer byte offset pos and returns the resulting byte offset from the start of the file. how indicates the reference (point 0). When how is os.SEEK_SET, the reference is the start of the file; when os.SEEK_CUR, the current position; when os.SEEK_END, the end of the file. In particular, lseek(fd, 0, os.SEEK_CUR) returns the current position’s byte offset from the start of the file without affecting the current position. Normal disk files support seeking; calling lseek on a file that does not support seeking (e.g., a file open for output to a terminal) raises an exception.

open

open(file, flags, mode=0o777)

Returns a file descriptor, opening or creating a file named by string file. If open creates the file, it uses mode as the file’s permission bits. flags is an int, normally the bitwise OR (with operator |) of one or more of the following attributes of os:

O_RDONLY O_WRONLY O_RDWR

Opens file for read-only, write-only, or read/write, respectively (mutually exclusive: exactly one of these attributes must be in flags)

O_NDELAY O_NONBLOCK

Opens file in nonblocking (no-delay) mode if the platform supports this

O_APPEND

Appends any new data to file’s previous contents

O_DSYNC O_RSYNC O_SYNC O_NOCTTY

Sets synchronization mode accordingly if the platform supports this

O_CREAT

Creates file if file does not already exist

O_EXCL

Raises an exception if file already exists

O_TRUNC

Throws away previous contents of file (incompatible with O_RDONLY)

O_BINARY

Opens file in binary rather than text mode on non-Unix platforms (innocuous and without effect on Unix-like platforms)
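
For example, here's a sketch that creates a new file at the descriptor level (raising an exception if the file already exists), writes some bytes, rewinds, and reads them back (the filename is just a placeholder):

import os

fd = os.open('newfile.dat', os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o644)
os.write(fd, b'some bytes of data')
os.lseek(fd, 0, os.SEEK_SET)    # rewind to the start of the file
print(os.read(fd, 100))         # b'some bytes of data'
os.close(fd)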

pipe

pipe()

Creates a pipe and returns a pair of file descriptors (r, w), respectively open for reading and writing.

read

read(fd, n)

Reads up to n bytes from file descriptor fd and returns them as a bytestring. Reads and returns m<n bytes when only m more bytes are currently available for reading from the file. In particular, returns the empty string when no more bytes are currently available from the file, typically because the file is finished.

write

write(fd, s)

Writes all bytes from bytestring s to file descriptor fd and returns the number of bytes written (i.e., len(s)).

Text Input and Output

Python presents non-GUI text input and output channels to your programs as “file” objects, so you can use the methods of “file” objects (covered in “Attributes and Methods of “file” Objects”) to operate on these channels.

Standard Output and Standard Error

The sys module (covered in “The sys Module”) has the attributes stdout and stderr, writeable “file” objects. Unless you are using shell redirection or pipes, these streams connect to the “terminal” running your script. Nowadays, actual terminals are very rare: a so-called “terminal” is generally a screen window that supports text I/O (e.g., a command prompt console on Windows or an xterm window on Unix).

The distinction between sys.stdout and sys.stderr is a matter of convention. sys.stdout, known as standard output, is where your program emits results. sys.stderr, known as standard error, is where error messages go. Separating results from error messages helps you use shell redirection effectively. Python respects this convention, using sys.stderr for errors and warnings.

Standard Input

The sys module provides the stdin attribute, which is a readable “file” object. When you need a line of text from the user, you can call built-in function input (covered in Table 7-2; in v2, it’s named raw_input), optionally with a string argument to use as a prompt.

When the input you need is not a string (for example, when you need a number), use input to obtain a string from the user, then other built-ins, such as int, float, or ast.literal_eval (covered below), to turn the string into the number you need.
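
For example, here's a typical sketch that keeps prompting until the user enters a valid integer (in v2, use raw_input instead of input):

def ask_int(prompt='Enter an integer: '):
    while True:
        reply = input(prompt)
        try:
            return int(reply)
        except ValueError:
            print('{!r} is not a valid integer, please retry'.format(reply))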

You could, in theory, also use eval (normally preceded by compile, for better control of error diagnostics), so as to let the user input any expression, as long as you totally trust the user. A nasty user can exploit eval to breach security and cause damage (a well-meaning but careless user can also unfortunately cause just about as much damage). There is no effective defense—just avoid eval (and exec!) on any input from sources you do not fully trust.

One advanced alternative we do recommend is to use the function literal_eval from the standard library module ast (as covered in the online docs). ast.literal_eval(astring) returns a valid Python value for the given literal astring when it can, or else raises a SyntaxError or ValueError; it never has any side effect. However, to ensure complete safety, astring in this case cannot use any operator, nor any nonkeyword identifier. For example:

import ast
print(ast.literal_eval('23'))     # prints 23
print(ast.literal_eval('[2,3]'))  # prints [2, 3]
print(ast.literal_eval('2+3'))    # raises ValueError
print(ast.literal_eval('2+'))     # raises SyntaxError

The getpass Module

Very occasionally, you may want the user to input a line of text in such a way that somebody looking at the screen cannot see what the user is typing. This may occur when you’re asking the user for a password. The getpass module provides the following functions:

getpass

getpass(prompt='Password: ')

Like input, except that the text the user inputs is not echoed to the screen as the user is typing. getpass’s default prompt is different from input’s.

getuser

getuser()

Returns the current user’s username. getuser tries to get the username as the value of one of the environment variables LOGNAME, USER, LNAME, and USERNAME, in order. If none of these variables are in os.environ, getuser asks the operating system.

Richer-Text I/O

The tools covered so far supply the minimal subset of text I/O functionality on all platforms. Most platforms offer richer-text I/O, such as responding to single keypresses (not just entire lines) and showing text in any spot on the terminal.

Python extensions and core Python modules let you access platform-specific functionality. Unfortunately, various platforms expose this functionality in very different ways. To develop cross-platform Python programs with rich-text I/O functionality, you may need to wrap different modules uniformly, importing platform-specific modules conditionally (usually with the try/except idiom covered in “try/except”).

The readline Module

The readline module wraps the GNU Readline Library. Readline lets the user edit text lines during interactive input, and recall previous lines for editing and reentry. Readline comes pre-installed on many Unix-like platforms, and it’s available online. On Windows, you can install and use the third-party module pyreadline.

When readline is available, Python uses it for all line-oriented input, such as input. The interactive Python interpreter always tries to load readline to enable line editing and recall for interactive sessions. Some readline functions control advanced functionality, particularly history, for recalling lines entered in previous sessions, and completion, for context-sensitive completion of the word being entered. (See the GNU Readline docs for details on configuration commands.) You can access the module’s functionality using the following functions:

add_history

add_history(s)

Adds string s as a line at the end of the history buffer.

clear_history

clear_history()

Clears the history buffer.

get_completer

get_completer()

Returns the current completer function (as last set by set_completer), or None if no completer function is set.

get_history_length

get_history_length()

Returns the number of lines of history to be saved to the history file. When the result is less than 0, all lines in the history are to be saved.

parse_and_bind

parse_and_bind(readline_cmd)

Gives Readline a configuration command. To let the user hit Tab to request completion, call parse_and_bind('tab: complete'). See the Readline documentation for other useful values of string readline_cmd.

A good completion function is in the module rlcompleter. In the interactive interpreter (or in the startup file executed at the start of interactive sessions, covered in “Environment Variables”), enter:

import readline, rlcompleter
readline.parse_and_bind('tab: complete')

For the rest of this interactive session, you can hit Tab during line editing and get completion for global names and object attributes.

read_history_file

read_history_file(filename='~/.history')

Loads history lines from the text file at path filename.

read_init_file

read_init_file(filename=None)

Makes Readline load a text file: each line is a configuration command. When filename is None, loads the same file as last time.

set_completer

set_completer(f=None)

Sets the completion function. When f is None, Readline disables completion. Otherwise, when the user types a partial word start, then Tab, Readline calls f(start, i), with i initially 0. f returns the ith possible word starting with start, or None when there are no more. Readline loops calling f, with i set to 0, 1, 2, etc., until f returns None.
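
For example, here's a minimal completer in the form set_completer expects, completing over a small, hypothetical fixed vocabulary:

import readline

WORDS = ['spam', 'spline', 'spirit']    # hypothetical completion vocabulary

def complete(start, i):
    matches = [w for w in WORDS if w.startswith(start)]
    return matches[i] if i < len(matches) else None

readline.set_completer(complete)
readline.parse_and_bind('tab: complete')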

set_history_length

set_history_length(x)

Sets the number of lines of history that are to be saved to the history file. When x is less than 0, all lines in the history are to be saved.

write_history_file

write_history_file(filename='~/.history')

Saves history lines to the text file whose name or path is filename.

Console I/O

“Terminals” today are usually text windows on a graphical screen. You may also, in theory, use a true terminal, or (perhaps a tad less theoretical, but, these days, not by much) the console (main screen) of a personal computer in text mode. All such “terminals” in use today offer advanced text I/O functionality, accessed in platform-dependent ways. The curses package works on Unix-like platforms; for a cross-platform (Windows, Unix, Mac) solution, you may use third-party package UniCurses. The msvcrt module, by contrast, exists only on Windows.

The curses package

The classic Unix approach to advanced terminal I/O is named curses, for obscure historical reasons.1 The Python package curses affords reasonably simple use, but still lets you exert detailed control if required. We cover a small subset of curses, just enough to let you write programs with rich-text I/O functionality. (See Eric Raymond’s tutorial Curses Programming with Python for more). Whenever we mention “the screen” in this section, we mean the screen of the terminal (usually, these days, that’s the text window of a terminal-emulator program).

The simplest and most effective way to use curses is through the curses.wrapper function:

wrapper

wrapper(func, *args)

When you call wrapper, you must pass as its first argument a function func. wrapper initializes the curses system, creating a top-level curses.window object w, then calls func(w, *args); when func returns (or propagates an exception), wrapper performs curses finalization, setting the terminal back to normal behavior, and finally returns func’s result (or re-raises whatever exception func may have propagated). The key point is that wrapper sets the terminal back to normal behavior whether func terminates normally or propagates an exception.

func should be a function that performs all tasks in your program that may need curses functionality. In other words, func normally contains (or calls, directly or indirectly, functions containing) all of your program’s functionality, save perhaps for some noninteractive initialization and/or finalization tasks.
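
For example, here's a minimal program in this style: func displays a message on the top-level window, then waits for a single keystroke before returning (at which point wrapper restores the terminal to normal behavior):

import curses

def func(stdscr):
    stdscr.addstr(0, 0, 'Hello from curses -- press any key to exit')
    stdscr.refresh()
    stdscr.getch()

curses.wrapper(func)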

curses models text and background colors as character attributes. Colors available on the terminal are numbered from 0 to curses.COLORS - 1. The function color_content takes a color number n as its argument and returns a tuple (r, g, b) of integers between 0 and 1000 giving the amount of each primary color in n. The function color_pair takes the number n of a color pair (previously set up with init_pair) and returns an attribute code that you can pass to various methods of a curses.Window object in order to display text in that pair’s colors.

curses lets you create multiple instances of type curses.Window, each corresponding to a rectangle on the screen. You can also create exotic variants, such as instances of Panel, polymorphic with Window but not tied to a fixed screen rectangle. You do not need such advanced functionality in simple curses programs: just use the Window object stdscr that curses.wrapper gives you. Call w.refresh() to ensure that changes made to any Window instance w, including stdscr, show up on screen. curses can buffer the changes until you call refresh.

An instance w of Window supplies, among many others, the following frequently used methods:

addstr

w.addstr([y, x, ]s[, attr])

Puts the characters in the string s, with the attribute attr, on w at the given coordinates (x, y), overwriting any previous contents. All curses functions and methods accept coordinate arguments in reverse order, with y (the row number) before x (the column number). If you omit y and x, addstr uses w’s current cursor coordinates. If you omit attr, addstr uses w’s current default attribute. In any case, addstr, when done adding the string, sets w’s current cursor coordinates to the end of the string it has added.

clrtobot, clrtoeol

w.clrtobot() w.clrtoeol()

clrtoeol writes blanks from w’s current cursor coordinates to the end of the line. clrtobot, in addition, also blanks all lines lower down on the screen.

delch

w.delch([y, x])

Deletes one character from w at coordinates (x, y). If you omit the y and x arguments, delch uses w’s current cursor coordinates. In any case, delch does not change w’s current cursor coordinates. All the following characters in line y, if any, shift left by one.

deleteln

w.deleteln()

Deletes from w the entire line at w’s current cursor coordinates, and scrolls up by one line all lines lower down on the screen.

erase

w.erase()

Writes blanks to every position of w, clearing the window.

getch

w.getch()

Returns an integer c corresponding to a user keystroke. A value between 0 and 255 represents an ordinary character, while a value greater than 255 represents a special key. curses supplies names for special keys, so you can test c for equality with readable constants such as curses.KEY_HOME (the Home special key), curses.KEY_LEFT (the left-arrow special key), and so on. (The list of all of curses’ many special-key names is in Python’s online docs.)

If you have set window w to no-delay mode by calling w.nodelay(True), w.getch returns -1 when no keystroke is ready. By default, w.getch waits until the user hits a key.

getyx

w.getyx()

Returns w’s current cursor coordinates as a tuple (y, x).

insstr

w.insstr([y, x, ]s[, attr])

Inserts the characters in string s, with attribute attr, on w at coordinates (x, y), shifting the rest of the line rightward. Any characters shifted beyond the end of the line are lost. If you omit y and x, insstr uses w’s current cursor coordinates. If you omit attr, insstr uses w’s current default attribute. In any case, when done inserting the string, insstr sets w’s current cursor coordinates to the first character of the string it has inserted.

move

w.move(y, x)

Moves w’s cursor to the given coordinates (x, y).

nodelay

w.nodelay(flag)

Sets w to no-delay mode when flag is true; resets w back to normal mode when flag is false. No-delay mode affects the method w.getch.

refresh

w.refresh()

Updates window w on screen with all changes the program has effected on w.

The curses.textpad module supplies the Textbox class, which lets you support advanced input and text editing.

Textbox

class Textbox(window)

Creates and returns an instance t of class Textbox that wraps the curses window instance window. Instance t has one frequently used method:

t.edit()

Lets the user perform interactive editing on the contents of the window instance that t wraps. The editing session supports simple Emacs-like key bindings: normal characters overwrite the window’s previous contents, arrow keys move the cursor, and Ctrl-H deletes the character to the cursor’s left. When the user hits Ctrl-G, the editing session ends, and edit returns the window’s contents as a single string, with newlines as line separators.
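
For example, here is a minimal sketch that lets the user edit a small sub-window and then shows what they typed; the window size and coordinates are just illustrative:

import curses
from curses.textpad import Textbox

def main(stdscr):
    stdscr.addstr(0, 0, 'Type some text; hit Ctrl-G when done')
    stdscr.refresh()
    edit_win = curses.newwin(5, 60, 2, 0)   # 5 rows, 60 columns, starting at row 2, column 0
    text = Textbox(edit_win).edit()         # interactive editing session
    stdscr.addstr(8, 0, 'You typed: {!r}'.format(text[:40]))
    stdscr.refresh()
    stdscr.getch()                          # wait for a final keystroke

curses.wrapper(main)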

The msvcrt Module

The msvcrt module, available only on Windows, supplies functions that let Python programs access a few proprietary extras supplied by the Microsoft Visual C++ runtime library msvcrt.dll. Some msvcrt functions let you read user input character by character, rather than a full line at a time, as listed here:

getch, getche

getch() getche()

Reads and returns one character from keyboard input, waiting if no character is yet available for reading. getche also echoes the character to screen (if printable), while getch doesn’t. (In v3, these functions return bytes objects of length one; in v2, they return one-character strings.) When the user presses a special key (arrows, function keys, etc.), it’s seen as two characters: first a chr(0) or chr(224), then a second character that, together with the first one, defines the special key the user pressed. To find out what getch returns for any key, run the following small script on a Windows machine:

import msvcrt
print("press z to exit, or any other key"
      " to see the key's code:")
while True:
    c = msvcrt.getch()
    if c in (b'z', 'z'):   # getch returns bytes in v3, str in v2
        break
    print('{} ({!r})'.format(ord(c), c))

kbhit

kbhit()

Returns True when a character is available for reading (getch, when called, returns immediately); otherwise, False (getch, when called, waits).

ungetch

ungetch(c)

“Ungets” character c; the next call to getch or getche returns c. It’s an error to call ungetch twice without intervening calls to getch or getche.
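
For example, here is a minimal sketch (Windows only) that uses kbhit to poll the keyboard without blocking while doing other work; the sleep is just a stand-in for that work:

import msvcrt, time

print('Press any key to stop the loop...')
while not msvcrt.kbhit():    # no key ready: getch would block, so don't call it yet
    time.sleep(0.1)          # "other work" would go here
c = msvcrt.getch()           # returns immediately, since kbhit() was True
print('stopped by key {!r}'.format(c))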

Interactive Command Sessions

The cmd module offers a simple way to handle interactive sessions of commands. Each command is a line of text. The first word of each command is a verb defining the requested action. The rest of the line is passed as an argument to the method that implements the verb’s action.

The cmd module supplies the class Cmd to use as a base class, and you define your own subclass of cmd.Cmd. Your subclass supplies methods with names starting with do_ and help_, and may optionally override some of Cmd’s methods. When the user enters a command line such as verb and the rest, as long as your subclass defines a method named do_verb, Cmd.onecmd calls:

self.do_verb('and the rest')

Similarly, as long as your subclass defines a method named help_verb, Cmd.do_help calls the method when the command line starts with 'help verb' or '?verb'. Cmd shows suitable error messages if the user tries to use, or asks for help about, a verb for which the needed method is not defined.

Initializing a Cmd Instance

Your subclass of cmd.Cmd, if it defines its own __init__ special method, must call the base class’s __init__, whose signature is as follows:

__init__

Cmd.__init__(self, completekey='Tab', stdin=sys.stdin,
stdout=sys.stdout)

Initializes the instance self with specified or default values for completekey (name of the key to use for command completion with the readline module; pass None to disable command completion), stdin (file object to get input from), and stdout (file object to emit output to).

If your subclass does not define __init__, then it just inherits the one from the base class cmd.Cmd that we just covered. In this case, to instantiate your subclass, call it, with the optional parameters completekey, stdin, and stdout, as documented in the previous paragraph.

Methods of Cmd Instances

An instance c of a subclass of the class Cmd supplies the following methods (many of these methods are “hooks” meant to be optionally overridden by the subclass):

cmdloop

c.cmdloop(intro=None)

Performs an interactive session of line-oriented commands. cmdloop starts by calling c.preloop(), and then emits string intro (c.intro if intro is None). Then c.cmdloop enters a loop. In each iteration of the loop, cmdloop reads line s with s=input(c.prompt). When standard input reaches end-of-file, cmdloop sets s='EOF'. If s is not 'EOF', cmdloop preprocesses string s with s=c.precmd(s), and then calls flag=c.onecmd(s). When onecmd returns a true value, this is a tentative request to terminate the command loop. Whatever the value of flag, cmdloop calls flag=c.postcmd(flag, s) to check whether the loop should terminate. If flag is now true, the loop terminates; otherwise, the loop repeats again. When the loop terminates, cmdloop calls c.postloop(), and then terminates. This structure of cmdloop that we just described is easiest to understand by looking at equivalent Python code:

def cmdloop(self, intro=None):
    self.preloop()
    if intro is None: intro = self.intro
    if intro: print(intro)
    finis_flag = False
    while not finis_flag:
        try: s = input(self.prompt)    # raw_input in v2
        except EOFError: s = 'EOF'
        else: s = self.precmd(s)
        finis_flag = self.onecmd(s)
        finis_flag = self.postcmd(finis_flag, s)
    self.postloop()

cmdloop is a good example of the classic design pattern known as Template Method. Such a method performs little substantial work itself; rather, it structures and organizes calls to other methods, known as hook methods. Subclasses may override some or all of the hook methods to define the details of class behavior within the overall framework thus established. When you inherit from Cmd, you almost never override the method cmdloop, since cmdloop’s structure is the main thing you get by subclassing Cmd.

default

c.default(s)

c.onecmd calls c.default(s) when there is no method c.do_verb for the first word verb of line s. Subclasses often override default. The base-class method Cmd.default prints an error message.

do_help

c.do_help(verb)

c.onecmd calls c.do_help(verb) when the command-line s starts with 'help verb' or '?verb'. Subclasses rarely override do_help. The Cmd.do_help method calls the method help_verb if the subclass supplies it; otherwise, it displays the docstring of the method do_verb if the subclass supplies that method with a nonempty docstring. If the subclass does not supply either source of help, Cmd.do_help outputs a message to inform the user that no help is available on verb.

emptyline

c.emptyline()

c.onecmd calls c.emptyline() when command-line s is empty or blank. Unless a subclass overrides this method, the base-class method Cmd.emptyline reexecutes the last nonblank command line seen, stored in the attribute c.lastcmd of c.

onecmd

c.onecmd(s)

c.cmdloop calls c.onecmd(s) for each command-line s that the user inputs. You can also call onecmd directly if you have otherwise obtained a line s to process as a command. Normally, subclasses do not override onecmd. Cmd.onecmd sets c.lastcmd=s. Then onecmd calls do_verb when s starts with the word verb and the subclass supplies such a method; otherwise, it calls emptyline or default, as explained earlier. In any case, Cmd.onecmd returns the result of whatever other method it calls to be interpreted by postcmd as a termination-request flag.
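
For example, here is a minimal sketch that feeds onecmd command lines obtained from a list rather than from an interactive loop; the Calc class and its verbs are made up purely for illustration:

import cmd

class Calc(cmd.Cmd):
    def do_double(self, rest):
        print(int(rest) * 2)
    def do_stop(self, rest):
        return True

c = Calc()
for line in ['double 21', 'stop']:   # commands obtained from some other source
    if c.onecmd(c.precmd(line)):     # a true result is a termination request
        break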

postcmd

c.postcmd(flag, s)

c.cmdloop calls c.postcmd(flag, s) for each command-line s after c.onecmd(s) has returned value flag. If flag is true, the command just executed is posing a tentative request to terminate the command loop. If postcmd returns a true value, cmdloop’s loop terminates. Unless your subclass overrides this method, the base-class method Cmd.postcmd is called and returns flag itself as the method’s result.

postloop

c.postloop()

c.cmdloop calls c.postloop() when cmdloop’s loop terminates. Your subclass may override this method; the base-class method Cmd.postloop does nothing.

precmd

c.precmd(s)

c.cmdloop calls s=c.precmd(s) to preprocess each command-line s. The current iteration of the loop bases all further processing on the string that precmd returns. Your subclass may override this method; the base-class method Cmd.precmd just returns s itself as the method’s result.

preloop

c.preloop()

c.cmdloop calls c.preloop() before cmdloop’s loop begins. Your subclass may override this method; the base-class method Cmd.preloop does nothing.

Attributes of Cmd Instances

An instance c of a subclass of the class Cmd supplies the following attributes:

identchars

A string whose characters are all those that can be part of a verb; by default, c.identchars contains letters, digits, and an underscore (_).

intro

The message that cmdloop outputs first, when called with no argument.

lastcmd

The last nonblank command line seen by onecmd.

prompt

The string that cmdloop uses to prompt the user for interactive input. You almost always bind c.prompt explicitly, or override prompt as a class attribute of your subclass; the default Cmd.prompt is just '(Cmd) '.

use_rawinput

When false (the default is true), cmdloop prompts and reads input via calls to methods of c.stdout and c.stdin (by default, sys.stdout and sys.stdin), rather than via the built-in input function.

Other attributes of Cmd instances, which we do not cover here, let you exert fine-grained control on many formatting details of help messages.
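
For example, a subclass typically customizes intro and prompt by overriding them as class attributes; here is a minimal sketch (the class name, messages, and verb are made up for illustration):

import cmd

class MyShell(cmd.Cmd):
    intro = 'Welcome! Type help or ? to list commands.'
    prompt = 'my-app> '
    def do_quit(self, rest):
        return True

if __name__ == '__main__':
    MyShell().cmdloop()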

A Cmd Example

The following example shows how to use cmd.Cmd to supply the verbs print (to output the rest of the line) and stop (to end the loop):

import cmd

class X(cmd.Cmd):
    def do_print(self, rest):
        print(rest)
    def help_print(self):
        print('print (any string): outputs (any string)')
    def do_stop(self, rest):
        return True
    def help_stop(self):
        print('stop: terminates the command loop')

if __name__ == '__main__':
    X().cmdloop()

A session using this example might proceed as follows:

C:>python examples/chapter10/cmdex.py
(Cmd) help
Documented commands (type help <topic>):
========================================
print           stop
Undocumented commands:
======================
help
(Cmd) help print
print (any string): outputs (any string)
(Cmd) print hi there
hi there
(Cmd) stop

Internationalization

Most programs present some information to users as text. Such text should be understandable and acceptable to the user. For example, in some countries and cultures, the date “March 7” can be concisely expressed as “3/7.” Elsewhere, “3/7” indicates “July 3,” and the string that means “March 7” is “7/3.” In Python, such cultural conventions are handled with the help of the standard module locale.

Similarly, a greeting can be expressed in one natural language by the string “Benvenuti,” while in another language the string to use is “Welcome.” In Python, such translations are handled with the help of standard module gettext.

Both kinds of issues are commonly called internationalization (often abbreviated i18n, as there are 18 letters between i and n in the full spelling)—a misnomer, since the same issues apply to different languages or cultures within a single nation.

The locale Module

Python’s support for cultural conventions imitates that of C, slightly simplified. A program operates in an environment of cultural conventions known as a locale. The locale setting permeates the program and is typically set at program startup. The locale is not thread-specific, and the locale module is not thread-safe. In a multithreaded program, set the program’s locale before starting secondary threads.

locale is only useful for process-wide settings

If your application needs to handle multiple locales at the same time in a single process—whether that’s in threads or asynchronously—locale is not the answer, due to its process-wide nature. Consider alternatives such as PyICU, mentioned in “More Internationalization Resources”.

If a program does not call locale.setlocale, the locale is a neutral one known as the C locale. The C locale is named from this architecture’s origins in the C language; it’s similar, but not identical, to the U.S. English locale. Alternatively, a program can find out and accept the user’s default locale. In this case, the locale module interacts with the operating system (via the environment or in other system-dependent ways) to find the user’s preferred locale. Finally, a program can set a specific locale, presumably determining which locale to set on the basis of user interaction or via persistent configuration settings.
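
For example, a program that wants to accept the user’s default cultural conventions typically executes, early at startup, something like:

import locale

# accept the user's default locale for all categories of cultural conventions;
# setlocale returns the (platform-dependent) name of the locale actually set
locale.setlocale(locale.LC_ALL, '')
print(locale.getlocale())   # e.g., ('en_US', 'UTF-8'); the exact value depends on the system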

Locale setting is normally performed across the board for all relevant categories of cultural conventions. This wide-spectrum setting is denoted by the constant attribute LC_ALL of module locale. However, the cultural conventions handled by locale are grouped into categories, and, in some cases, a program can choose to mix and match categories to build up a synthetic composite locale. The categories are identified by the following constant attributes of the locale module:

LC_COLLATE

String sorting; affects functions strcoll and strxfrm in locale

LC_CTYPE

Character types; affects aspects of the module string (and string methods) that have to do with lowercase and uppercase letters

LC_MESSAGES

Messages; may affect messages displayed by the operating system—for example, the function os.strerror and module gettext

LC_MONETARY

Formatting of currency values; affects function locale.localeconv

LC_NUMERIC

Formatting of numbers; affects the functions atoi, atof, format, localeconv, and str in locale

LC_TIME

Formatting of times and dates; affects the function time.strftime

The settings of some categories (denoted by LC_CTYPE, LC_TIME, and LC_MESSAGES) affect behavior in other modules (string, time, os, and gettext, as indicated). Other categories (denoted by LC_COLLATE, LC_MONETARY, and LC_NUMERIC) affect only some functions of locale itself.

The locale module supplies the following functions to query, change, and manipulate locales, as well as functions that implement the cultural conventions of locale categories LC_COLLATE, LC_MONETARY, and LC_NUMERIC:

atof

atof(s)

Converts string s to a floating-point number using the current LC_NUMERIC setting.

atoi

atoi(s)

Converts string s to an integer using the current LC_NUMERIC setting.

format

format(fmt, num, grouping=False)

Returns the string obtained by formatting number num according to the format string fmt and the LC_NUMERIC setting. Except for cultural convention issues, the result is like fmt%num. If grouping is true, format also groups digits in the result string according to the LC_NUMERIC setting. For example:

>>> locale.setlocale(locale.LC_NUMERIC, 'en')
'English_United States.1252'
>>> locale.format('%d', 1000*1000)
'1000000'
>>> locale.format('%d', 1000*1000, grouping=True)
'1,000,000'

When the numeric locale is U.S. English and argument grouping is true, format groups digits by threes with commas.

getdefaultlocale

getdefaultlocale(envvars=('LANGUAGE', 'LC_ALL', 'LC_CTYPE',
'LANG'))

Checks the environment variables whose names are specified by envvars, in order. The first one found in the environment determines the default locale. getdefaultlocale returns a pair of strings (lang, encoding) compliant with RFC 1766 (except for the 'C' locale), such as ('en_US', 'ISO8859-1'). Each item of the pair may be None if getdefaultlocale is unable to discover what value the item should have.

getlocale

getlocale(category=LC_CTYPE)

Returns a pair of strings (lang, encoding) with the current setting for the given category. The category cannot be LC_ALL.

localeconv

localeconv()

Returns a dict d with the cultural conventions specified by categories LC_NUMERIC and LC_MONETARY of the current locale. While LC_NUMERIC is best used indirectly, via other functions of module locale, the details of LC_MONETARY are accessible only through d. Currency formatting is different for local and international use. The U.S. currency symbol, for example, is '$' for local use only. '$' is ambiguous in international use, since the same symbol is also used for other currencies called “dollars” (Canadian, Australian, Hong Kong, etc.). In international use, therefore, the U.S. currency symbol is the unambiguous string 'USD'. The keys into d to use for currency formatting are the following strings:

'currency_symbol'

Currency symbol to use locally.

'frac_digits'

Number of fractional digits to use locally.

'int_curr_symbol'

Currency symbol to use internationally.

'int_frac_digits'

Number of fractional digits to use internationally.

'mon_decimal_point'

String to use as the “decimal point” (AKA radix) for monetary values.

'mon_grouping'

List of digit-grouping numbers for monetary values.

'mon_thousands_sep'

String to use as digit-groups separator for monetary values.

'negative_sign' 'positive_sign'

Strings to use as the sign symbol for negative (positive) monetary values.

'n_cs_precedes' 'p_cs_precedes'

True if the currency symbol comes before negative (positive) monetary values.

'n_sep_by_space' 'p_sep_by_space'

True if a space goes between sign and negative (positive) monetary values.

'n_sign_posn' 'p_sign_posn'

Numeric codes to use to format negative (positive) monetary values:

0

The value and the currency symbol are placed inside parentheses.

1

The sign is placed before the value and the currency symbol.

2

The sign is placed after the value and the currency symbol.

3

The sign is placed immediately before the value.

4

The sign is placed immediately after the value.

CHAR_MAX

The current locale does not specify any convention for this formatting.

d['mon_grouping'] is a list of numbers of digits to group when formatting a monetary value. When d['mon_grouping'][-1] is 0, there is no further grouping beyond the indicated numbers of digits. When d['mon_grouping'][-1] is locale.CHAR_MAX, grouping continues indefinitely, as if d['mon_grouping'][-2] were endlessly repeated. locale.CHAR_MAX is a constant used as the value for all entries in d for which the current locale does not specify any convention.
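
For example, here is a minimal sketch that picks up the user’s conventions and inspects a few of these keys:

import locale

locale.setlocale(locale.LC_ALL, '')   # pick up the user's monetary conventions
d = locale.localeconv()
# In the default C locale these are '', '', and locale.CHAR_MAX; on a U.S. English
# system they are typically '$', 'USD ', and 2.
print(d['currency_symbol'], d['int_curr_symbol'], d['frac_digits'])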

normalize

normalize(localename)

Returns a string, suitable as an argument to setlocale, that is the normalized equivalent to localename. If normalize cannot normalize string localename, then normalize returns localename unchanged.

resetlocale

resetlocale(category=LC_ALL)

Sets the locale for category to the default given by getdefaultlocale.

setlocale

setlocale(category, locale=None)

Sets the locale for category to locale, if not None; returns the setting (the existing one when locale is None; otherwise, the new one). locale can be a string, or a pair (lang, encoding). lang is normally a language code based on ISO 639 two-letter codes ('en' is English, 'nl' is Dutch, and so on). When locale is the empty string '', setlocale sets the user’s default locale.

str

str(num)

Like locale.format('%f', num).

strcoll

strcoll(str1, str2)

Like v2’s cmp(str1, str2) (i.e., returns a value that’s negative, zero, or positive as str1 compares less than, equal to, or greater than str2), but per the LC_COLLATE setting.

strxfrm

strxfrm(s)

Returns a string sx such that Python’s built-in comparisons of strings so transformed are equivalent to calling locale.strcoll on the original strings. strxfrm lets you use the key= argument for sorts that involve locale-conformant string comparisons. For example:

import locale

def locale_sort(list_of_strings):
    list_of_strings.sort(key=locale.strxfrm)
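
sorted accepts the same key= argument; for example (the word list is just illustrative, and the resulting order depends on the locale that’s active at the time):

import locale

locale.setlocale(locale.LC_COLLATE, '')   # collate per the user's locale
words = ['banana', 'Apple', 'cherry']
print(sorted(words, key=locale.strxfrm))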

The gettext Module

A key issue in internationalization is the ability to use text in different natural languages, a task known as localization. Python supports localization via the module gettext, inspired by GNU gettext. The gettext module is optionally able to use the latter’s infrastructure and APIs, but also offers a simpler, higher-level approach, so you don’t need to install or study GNU gettext to use Python’s gettext effectively.

For full coverage of gettext from a different perspective, see the online docs.

Using gettext for localization

gettext does not deal with automatic translation between natural languages. Rather, it helps you extract, organize, and access the text messages that your program uses. Pass each string literal subject to translation, also known as a message, to a function named _ (underscore) rather than using it directly. gettext normally installs a function named _ in the builtins module (named __builtin__ in v2). To ensure that your program runs with or without gettext, conditionally define a do-nothing function, named _, that just returns its argument unchanged. Then you can safely use _('message') wherever you would normally use a literal 'message' that should be translated if feasible. The following example shows how to start a module for conditional use of gettext:

try:
    _
except NameError:
    def _(s): return s

def greet():
    print(_('Hello world'))

If some other module has installed gettext before you run this example code, the function greet outputs a properly localized greeting. Otherwise, greet outputs the string 'Hello world' unchanged.

Edit your source, decorating message literals with function _. Then use any of various tools to extract messages into a text file (normally named messages.pot) and distribute the file to the people who translate messages into the various natural languages your application must support. Python supplies a script pygettext.py (in directory Tools/i18n in the Python source distribution) to perform message extraction on your Python sources.

Each translator edits messages.pot to produce a text file of translated messages with extension .po. Compile the .po files into binary files with extension .mo, suitable for fast searching, using any of various tools. Python supplies script Tools/i18n/msgfmt.py for this purpose. Finally, install each .mo file with a suitable name in a suitable directory.

Conventions about which directories and names are suitable differ among platforms and applications. gettext’s default is subdirectory share/locale/<lang>/LC_MESSAGES/ of directory sys.prefix, where <lang> is the language’s code (two letters). Each file is named <name>.mo, where <name> is the name of your application or package.

Once you have prepared and installed your .mo files, you normally execute, at the time your application starts up, some code such as the following:

import os, gettext
os.environ.setdefault('LANG', 'en')   # application-default language
gettext.install('your_application_name')

This ensures that calls such as _('message') return the appropriate translated strings. You can choose different ways to access gettext functionality in your program—for example, if you also need to localize C-coded extensions, or to switch between languages during a run. Another important consideration is whether you’re localizing a whole application, or just a package that is distributed separately.

Essential gettext functions

gettext supplies many functions; the most often used ones are:

install

install(domain, localedir=None, unicode=False)

Installs in Python’s built-in namespace a function named _ to perform translations given in the file <lang>/LC_MESSAGES/<domain>.mo in the directory localedir, with language code <lang> as per getdefaultlocale. When localedir is None, install uses the directory os.path.join(sys.prefix, 'share', 'locale'). When unicode is true, function _ accepts and returns Unicode strings, not bytestrings (v2 only: in v3, install has no unicode argument, and _ always deals with text strings).
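
For example, here is a minimal sketch; 'my_app' and the locale directory are just illustrative names, and, when no matching .mo catalog is found, the installed _ simply returns its argument unchanged:

import gettext

gettext.install('my_app', localedir='locale')   # looks for locale/<lang>/LC_MESSAGES/my_app.mo
print(_('Hello world'))                         # translated if a catalog was found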

translation

translation(domain, localedir=None, languages=None)

Searches for a .mo file similarly to the function install. When languages is None, translation looks in the environment for the language to use, like install: it examines, in order, the environment variables LANGUAGE, LC_ALL, LC_MESSAGES, and LANG; the first nonempty one is split on ':' to give a list of language names (for example, 'de:en' is split into ['de', 'en']). When not None, languages must be a list of one or more language names (for example, ['de', 'en']). translation uses the first language name in the list for which it finds a .mo file. The function translation returns an instance object that supplies the methods gettext (to translate a bytestring; in v3, a text string), ugettext (to translate a Unicode string; v2 only), and install (to install gettext or ugettext under the name _ in Python’s built-in namespace).

translation offers more detailed control than install, which is like translation(domain,localedir).install(unicode). With translation, you can localize a single package without affecting the built-in namespace, by binding name _ on a per-module basis—for example, with:

_ = translation(domain).ugettext   # in v3, use .gettext, since ugettext doesn't exist

translation also lets you switch globally between several languages, since you can pass an explicit languages argument, keep the resulting instance, and call the install method of the appropriate language as needed:

import gettext
trans = {}
def switch_to_language(lang, domain='my_app',
                       use_unicode=True):
    if lang not in trans:
        trans[lang] = gettext.translation(domain,
                                          languages=[lang])
    trans[lang].install(use_unicode)

More Internationalization Resources

Internationalization is a very large topic. For a general introduction, see Wikipedia. One of the best packages of code and information for internationalization is ICU, which also embeds the Unicode Consortium’s excellent Common Locale Data Repository (CLDR) database of locale conventions, and code to access the CLDR. To use ICU in Python, install the third-party package PyICU.

1 “Curses” does describe well the typical utterances of programmers faced with this rich, complicated approach.
