Chapter 8. Strings and Things

v3 supplies Unicode text strings as type str, with operators, built-in functions, methods, and dedicated modules. It also supplies the somewhat similar bytes type, representing arbitrary binary data as a sequence of bytes, also known as a bytestring or byte string. This is a major difference from v2, where type str is a sequence of bytes, while Unicode text strings are of type unicode. Many textual operations, in both versions, are possible on objects of either type.

This chapter covers the methods of string objects, in “Methods of String and Bytes Objects”; string formatting, in “String Formatting”; and the modules string (in “The string Module”) and pprint (in “The pprint Module”). Issues related specifically to Unicode are also covered, in “Unicode”. The new (v3.6) formatted string literals are covered in “New in 3.6: Formatted String Literals”.

Methods of String and Bytes Objects

Unicode str and bytes objects are immutable sequences, as covered in “Strings”. All immutable-sequence operations (repetition, concatenation, indexing, and slicing) apply to them, returning an object of the same type (except that, in v3, indexing a bytes object returns an int). A string or bytes object s also supplies several nonmutating methods, as documented in Table 8-1.
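As a quick illustration of the sequence operations just mentioned, here is a minimal sketch for v3; the variable names are ours, chosen only for this example.

```python
text = 'spam'          # a Unicode str
data = b'spam'         # a bytes object

# Repetition, concatenation, and slicing return an object of the same type
text_ops = (text * 2, text + '!', text[1:3])
data_ops = (data * 2, data + b'!', data[1:3])

# Indexing a str gives a one-character str; in v3, indexing bytes gives an int
first_char = text[0]
first_byte = data[0]
```

Slicing with data[0:1] rather than indexing with data[0] is one way to get a length-one bytes object in v3.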

Unless otherwise noted, methods are present on objects of either type. In v3, str methods return a Unicode string, while methods of bytes objects return a bytestring (in v2, type unicode stands for a textual string—i.e., Unicode—and type str for a bytestring). Terms such as “letters,” “whitespace,” and so on, refer to the corresponding attributes of the string module, covered in “The string Module”.

Note

In Table 8-1, for conciseness, we use sys.maxsize for integer default values meaning, in practice, “any number, no matter how large.”

Table 8-1.  

capitalize

s.capitalize()

Returns a copy of s where the first character, if a letter, is uppercase, and all other letters, if any, are lowercase.

casefold

s.casefold()

str only, v3 only. Returns a string processed by the algorithm described in section 3.13 of the Unicode standard. This is similar to s.lower (described later in this list) but also takes into account broader equivalences, such as that between the German lowercase 'ß' and 'ss', and is thus better suited to case-insensitive matching.

center

s.center(n,fillchar=' ')

Returns a string of length max(len(s), n), with a copy of s in the central part, surrounded by copies of character fillchar on both sides, in equal numbers when possible and otherwise differing by one (e.g., 'ciao'.center(2) is 'ciao' and 'x'.center(4,'_') is '_x__').

count

s.count(sub,start=0,end=sys.maxsize)

Returns the number of nonoverlapping occurrences of substring sub in s[start:end].

decode

s.decode(encoding='utf-8',errors='strict')

bytes only. Returns a str object decoded from the bytes s according to the given encoding. errors determines how decoding errors are handled: 'strict' causes errors to raise UnicodeError exceptions, 'ignore' skips the malformed data, and 'replace' replaces the malformed data with the Unicode replacement character (U+FFFD); see “Unicode” for details. Other values can be registered via codecs.register_error(), covered in Table 8-7.

encode

s.encode(encoding=None,errors='strict')

str only. Returns a bytes object obtained from s with the given encoding and error handling. See “Unicode” for more details.

endswith

s.endswith(suffix,start=0,end=sys.maxsize)

Returns True when s[start:end] ends with string suffix; otherwise, False. suffix can be a tuple of strings, in which case endswith returns True when s[start:end] ends with any one of them.

expandtabs

s.expandtabs(tabsize=8)

Returns a copy of s where each tab character is changed into one or more spaces, with tab stops every tabsize characters.

find

s.find(sub,start=0,end=sys.maxsize)

Returns the lowest index in s where substring sub is found, such that sub is entirely contained in s[start:end]. For example, 'banana'.find('na') is 2, as is 'banana'.find('na',1), while 'banana'.find('na',3) is 4, as is 'banana'.find('na',-2). find returns -1 when sub is not found.

format

s.format(*args,**kwargs)

str only. Formats the positional and named arguments according to formatting instructions contained in the string s. See “String Formatting” for further details.

format_map

s.format_map(mapping)

str only, v3 only. Formats the mapping argument according to formatting instructions contained in the string s. Equivalent to s.format(**mapping) but uses the mapping directly.

index

s.index(sub,start=0,end=sys.maxsize)

Like find, but raises ValueError when sub is not found.

isalnum

s.isalnum()

Returns True when len(s) is greater than 0 and all characters in s are letters or digits. When s is empty, or when at least one character of s is neither a letter nor a digit, isalnum returns False.

isalpha

s.isalpha()

Returns True when len(s) is greater than 0 and all characters in s are letters. When s is empty, or when at least one character of s is not a letter, isalpha returns False.

isdecimal

s.isdecimal()

str only, v3 only. Returns True when len(s) is greater than 0 and all characters in s can be used to form decimal-radix numbers. This includes Unicode characters defined as Arabic digits.[a]

isdigit

s.isdigit()

Returns True when len(s) is greater than 0 and all characters in s are digits. When s is empty, or when at least one character of s is not a digit, isdigit returns False.

isidentifier

s.isidentifier()

str only, v3 only. Returns True when s is a valid identifier according to the Python language’s definition; keywords also satisfy the definition, so, for example, 'class'.isidentifier() returns True.

islower

s.islower()

Returns True when all letters in s are lowercase. When s contains no letters, or when at least one letter of s is uppercase, islower returns False.

isnumeric

s.isnumeric()

str only, v3 only. Similar to s.isdigit(), but uses a broader definition of numeric characters that includes all characters defined as numeric in the Unicode standard (such as fractions).

isprintable

s.isprintable()

str only, v3 only. Returns True when all characters in s are spaces ('\x20') or are defined in the Unicode standard as printable. Differently from other methods starting with is, ''.isprintable() returns True.

isspace

s.isspace()

Returns True when len(s) is greater than 0 and all characters in s are whitespace. When s is empty, or when at least one character of s is not whitespace, isspace returns False.

istitle

s.istitle()

Returns True when letters in s are titlecase: a capital letter at the start of each contiguous sequence of letters, all other letters lowercase (e.g., 'King Lear'.istitle() is True). When s contains no letters, or when at least one letter of s violates the titlecase condition, istitle returns False (e.g., '1900'.istitle() and 'Troilus and Cressida'.istitle() return False).

isupper

s.isupper()

Returns True when all letters in s are uppercase. When s contains no letters, or when at least one letter of s is lowercase, isupper returns False.

join

s.join(seq)

Returns the string obtained by concatenating the items of seq, which must be an iterable whose items are strings, and interposing a copy of s between each pair of items (e.g., ''.join(str(x) for x in range(7)) is '0123456' and 'x'.join('aeiou') is 'axexixoxu').

ljust

s.ljust(n,fillchar=' ')

Returns a string of length max(len(s),n), with a copy of s at the start, followed by zero or more trailing copies of character fillchar.

lower

s.lower()

Returns a copy of s with all letters, if any, converted to lowercase.

lstrip

s.lstrip(x=string.whitespace)

Returns a copy of s, removing leading characters that are found in string x. For example, 'banana'.lstrip('ab') returns 'nana'.

replace

s.replace(old,new,count=sys.maxsize)

Returns a copy of s with the first count (or fewer, if there are fewer) nonoverlapping occurrences of substring old replaced by string new (e.g., 'banana'.replace('a','e',2) returns 'benena').

rfind

s.rfind(sub,start=0,end=sys.maxsize)

Returns the highest index in s where substring sub is found, such that sub is entirely contained in s[start:end]. rfind returns -1 if sub is not found.

rindex

s.rindex(sub,start=0,end=sys.maxsize)

Like rfind, but raises ValueError if sub is not found.

rjust

s.rjust(n,fillchar=' ')

Returns a string of length max(len(s),n), with a copy of s at the end, preceded by zero or more leading copies of character fillchar.

rstrip

s.rstrip(x=string.whitespace)

Returns a copy of s, removing trailing characters that are found in string x. For example, 'banana'.rstrip('ab') returns 'banan'.

split

s.split(sep=None,maxsplit=sys.maxsize)

Returns a list L of up to maxsplit+1 strings. Each item of L is a “word” from s, where string sep separates words. When s has more than maxsplit words, the last item of L is the substring of s that follows the first maxsplit words. When sep is None, any string of whitespace separates words (e.g., 'four score and seven years'.split(None,3) is ['four','score','and','seven years']).

Note the difference between splitting on None (any string of whitespace is a separator) and splitting on ' ' (each single space character, not other whitespace such as tabs and newlines, and not strings of spaces, is a separator). For example:

>>> x = 'a  b'  # two spaces between a and b
>>> x.split()  # or, equivalently, x.split(None)
['a', 'b']
>>> x.split(' ')
['a', '', 'b']

In the first case, the two-spaces string in the middle is a single separator; in the second case, each single space is a separator, so that there is an empty string between the two spaces.

splitlines

s.splitlines(keepends=False)

Like s.split('\n'). When keepends is true, however, the trailing '\n' is included in each item of the resulting list.

startswith

s.startswith(prefix,start=0,end=sys.maxsize)

Returns True when s[start:end] starts with string prefix; otherwise, False. prefix can be a tuple of strings, in which case startswith returns True when s[start:end] starts with any one of them.

strip

s.strip(x=string.whitespace)

Returns a copy of s, removing both leading and trailing characters that are found in string x. For example, 'banana'.strip('ab') is 'nan'.

swapcase

s.swapcase()

Returns a copy of s with all uppercase letters converted to lowercase and vice versa.

title

s.title()

Returns a copy of s transformed to titlecase: a capital letter at the start of each contiguous sequence of letters, with all other letters (if any) lowercase.

translate

s.translate(table)

Returns a copy of s where characters found in table are translated or deleted. In v3 (and in v2, when s is an instance of unicode), table is a dict whose keys are Unicode ordinals; values are Unicode ordinals, Unicode strings, or None (to delete the character)—for example (coded to work both in v2 and v3, with the redundant-in-v3 u prefix on strings):

print(u'banana'.translate({ord('a'):None,ord('n'):u'ze'}))
# prints: bzeze

In v2, when s is a string of bytes, its translate method is quite different—see the online docs. Here are some examples of v2’s translate:

import string
identity = string.maketrans('','')
print('some string'.translate(identity,'aeiou'))
# prints: sm strng

The Unicode or v3 equivalent of this would be:

no_vowels = dict.fromkeys(ord(x) for x in 'aeiou')
print(u'some string'.translate(no_vowels))
# prints: sm strng

Here are v2 examples of turning all vowels into a’s and also deleting s’s:

intoas = string.maketrans('eiou','aaaa')
print('some string'.translate(intoas))
# prints: sama strang
print('some string'.translate(intoas,'s'))
# prints: ama trang

The Unicode or v3 equivalent of this would be:

intoas = dict.fromkeys((ord(x) for x in 'eiou'), 'a')
print(u'some string'.translate(intoas))
# prints: sama strang
intoas_nos = dict(intoas)
intoas_nos[ord('s')] = None  # the key must be an ordinal; None deletes
print(u'some string'.translate(intoas_nos))
# prints: ama trang

upper

s.upper()

Returns a copy of s with all letters, if any, converted to uppercase.

[a] Note that this does not include the punctuation marks used as a radix, such as dot (.) and comma (,).
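A few of the v3 methods in Table 8-1 are easiest to tell apart by example. The following sketch (v3 only) contrasts lower with casefold, compares the three numeric predicates, and shows format_map using a mapping directly; the class name Defaulting is hypothetical, introduced only for this illustration.

```python
# Case-insensitive matching: casefold handles equivalences that lower misses
lowered = 'Stra\u00dfe'.lower()        # the sharp s stays as it is
folded = 'Stra\u00dfe'.casefold()      # the sharp s becomes 'ss'
same = 'STRASSE'.casefold() == 'Stra\u00dfe'.casefold()

# Numeric predicates, from narrowest (isdecimal) to broadest (isnumeric)
squared = '\u00b2'                     # SUPERSCRIPT TWO
half = '\u00bd'                        # VULGAR FRACTION ONE HALF
checks = {
    'plain': ('42'.isdecimal(), '42'.isdigit(), '42'.isnumeric()),
    'squared': (squared.isdecimal(), squared.isdigit(), squared.isnumeric()),
    'half': (half.isdecimal(), half.isdigit(), half.isnumeric()),
}

# format_map uses the mapping itself, so a dict subclass can supply defaults
class Defaulting(dict):
    def __missing__(self, key):
        return '<' + key + '>'

greeting = 'Hello, {name}! You are {role}.'.format_map(Defaulting(name='Ann'))
```

With s.format(**mapping) the Defaulting instance would be copied into a plain dict first, so its __missing__ method would never be consulted.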

The string Module

The string module supplies several useful string attributes:

ascii_letters

The string ascii_lowercase+ascii_uppercase

ascii_lowercase

The string 'abcdefghijklmnopqrstuvwxyz'

ascii_uppercase

The string 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

digits

The string '0123456789'

hexdigits

The string '0123456789abcdefABCDEF'

octdigits

The string '01234567'

punctuation

The string '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~' (i.e., all ASCII characters that are deemed punctuation characters in the 'C' locale; does not depend on which locale is active)

printable

The string of those ASCII characters that are deemed printable (i.e., digits, letters, punctuation, and whitespace)

whitespace

A string containing all ASCII characters that are deemed whitespace: at least space, tab, linefeed, and carriage return, but more characters (e.g., certain control characters) may be present, depending on the active locale

You should not rebind these attributes; the effects of doing so are undefined, since other parts of the Python library may rely on them.

The module string also supplies the class Formatter, covered in “String Formatting”.
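The attributes listed above are handy as building blocks for small text utilities. Here is a minimal sketch for v3 (str.maketrans with three arguments is v3 only); the helper name is_hex is ours, not part of the string module.

```python
import string

# Map every ASCII punctuation character to None, so that translate deletes it
drop_punct = str.maketrans('', '', string.punctuation)
cleaned = 'Hello, world! (Again?)'.translate(drop_punct)

def is_hex(s):
    """Return True when s is nonempty and holds only hexadecimal digits."""
    return bool(s) and all(c in string.hexdigits for c in s)
```

For example, is_hex('ff09A') is true, while is_hex('xyz') and is_hex('') are false.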

String Formatting

v3 has introduced a powerful new string formatting facility, which has also been backported into v2. Unicode strings (but, in v3, not bytestrings) provide a format method that you call with arguments to interpolate into the format string, in which values to be formatted are indicated by replacement fields enclosed within braces.

The formatting process is best understood as a sequence of operations, each of which is guided by its replacement field. First, each value to be formatted is selected; next, it is converted, if required; finally, it is formatted.

Value Selection

'[str]{[selector]}[str]'.format(*args, **kwargs)

The format selector can handle positional (automatic or numbered) and named arguments. The simplest replacement field is the empty pair of braces ({}). With no indication of which argument is to be interpolated, each such replacement field automatically refers to the value of the next positional argument to format:

>>> 'First: {} second: {}'.format(1, 'two')
'First: 1 second: two'

To repeatedly select an argument, or use it out of order, use the argument’s number to specify its position in the list of arguments:

>>> 'Second: {1} first: {0}'.format(1, 'two')
'Second: two first: 1'

You cannot mix automatic and numbered replacement fields: it’s an either-or choice. For named arguments, you can use argument names and mix them with (automatic or numbered) positional arguments:

>>> 'a: {a}, 1st: {}, 2nd: {}, a again: {a}'.format(1, 'two', a=3)
'a: 3, 1st: 1, 2nd: two, a again: 3'
>>> 'a: {a} first:{0} second: {1} first: {0}'.format(1, 'two', a=3)
'a: 3 first:1 second: two first: 1'

If an argument is a sequence, you can use numeric indexes within it to select a specific element of the argument as the value to be formatted. This applies to both positional (automatic or numbered) and named arguments.

>>> 'p0[1]: {[1]} p1[0]: {[0]}'.format(('zero', 'one'),
                                       ('two', 'three'))
'p0[1]: one p1[0]: two'
>>> 'p1[0]: {1[0]} p0[1]: {0[1]}'.format(('zero', 'one'), 
                                         ('two', 'three'))
'p1[0]: two p0[1]: one'
>>> '{} {} {a[2]}'.format(1, 2, a=(5, 4, 3))
'1 2 3'

If an argument is a composite object, you can select its individual attributes as values to be formatted by applying attribute-access dot notation to the argument selector. Here is an example using complex numbers, which have real and imag attributes that hold the real and imaginary parts, respectively:

>>> 'First r: {.real} Second i: {a.imag}'.format(1+2j, a=3+4j)
'First r: 1.0 Second i: 4.0'

Indexing and attribute-selection operations can be used multiple times, if required.
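As a sketch of such chaining (the data used here is made up for the example), one replacement field can index into a dict, then into a tuple, and finally read an attribute:

```python
data = {'points': (1+2j, 3+4j)}

# d[points] indexes the dict (no quotes around the key inside a replacement
# field), [1] indexes the tuple, and .imag reads an attribute of the result
chained = '{d[points][1].imag}'.format(d=data)

# The same chaining works on positional arguments
nested = '{0[1][0]}'.format(('ab', ('cd', 'ef')))
```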

Value Conversion

You may apply a default conversion to the value via one of its methods. You indicate this by following any selector with !s to apply the object’s __str__ method, !r for its __repr__ method, or !a for the ascii built-in (v3 only).

If you apply any conversion, the converted value replaces the originally selected value in the remainder of the formatting process.
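The three conversions can be compared side by side in a short sketch for v3; note how, in the last line, the width applies to the converted value, quotes included.

```python
word = 'caf\u00e9'               # ends with LATIN SMALL LETTER E WITH ACUTE

as_str = '{!s}'.format(word)     # uses str(word)
as_repr = '{!r}'.format(word)    # uses repr(word): adds quotes
as_ascii = '{!a}'.format(word)   # uses ascii(word): escapes non-ASCII (v3 only)

# The converted value is what gets formatted: the width applies to the repr
padded = '{!r:>8}'.format('ab')
```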

Value Formatting

'{[selector][conversion]:[format_specifier]}'.format(value)

The formatting of the value (if any further formatting is required) is determined by a final (optional) portion of the replacement field, following a colon (:), known as the format specifier. The absence of a colon in the replacement field means that the converted value is used with no further formatting. Format specifiers may include one or more of the following: fill, alignment, sign, #, width, comma, precision, type.

Alignment, with optional (preceding) fill

If alignment is required, the formatted value is padded with the fill character to the field width. The default fill character is the space, but an alternative fill character (which may not be an opening or closing brace) can, if required, precede the alignment indicator. See Table 8-2.

Table 8-2. Alignment indicators
Character Significance as alignment indicator
'<' Align value on left of field
'>' Align value on right of field
'^' Align value in the center of the field
'=' Only for numeric types: add fill characters between the sign and the first digit of the numeric value

When no alignment is specified, most values are left-aligned, except that numeric values are right-aligned. Unless a field width is specified later in the format specifier no fill characters are added, whatever the fill and alignment may be.
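The alignment indicators of Table 8-2 can be tried out as follows; this is only a sketch, with values chosen so that the padding is easy to count.

```python
left = '{:<6}|'.format('ab')       # 'ab    |'
right = '{:>6}|'.format('ab')      # '    ab|'
centered = '{:*^6}'.format('ab')   # '**ab**'
signed = '{:0=+6d}'.format(42)     # '+00042': fill goes after the sign

# Defaults: strings align left, numbers align right
defaults = '{:4}|{:4}|'.format('a', 1)

# Without a width, fill and alignment have no visible effect
no_width = '{:*^}'.format('ab')
```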

Optional sign indication

For numeric values only, you can indicate how positive and negative numbers are differentiated by optionally including a sign indicator. See Table 8-3.

Table 8-3. Sign indicators
Character Significance as sign indicator
'+' Insert '+' as sign for positive numbers; '-' as sign for negative numbers
'-' Insert '-' as sign for negative numbers; do not insert any sign for positive numbers (default behavior if no sign indicator is included)
' ' Insert ' ' as sign for positive numbers; '-' as sign for negative numbers
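A short sketch of the three sign indicators in Table 8-3, applied to one positive and one negative value:

```python
values = (5, -5)
plus = ['{:+d}'.format(v) for v in values]    # always show a sign
minus = ['{:-d}'.format(v) for v in values]   # default: sign only if negative
space = ['{: d}'.format(v) for v in values]   # leading space for positives
```

The space indicator is useful when positive and negative numbers must line up in a column without showing explicit plus signs.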

Radix indicator

For numeric integer formats only, you can include a radix indicator, the '#' character. If present, this indicates that the digits of binary-formatted numbers are preceded by '0b', those of octal-formatted numbers by '0o', and those of hexadecimal-formatted numbers by '0x'. For example, '{:x}'.format(23) is '17', while '{:#x}'.format(23) is '0x17'.
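To see the radix indicator with each of the integer bases, here is a small sketch that formats the same number with and without '#':

```python
n = 23
plain = ['{:b}'.format(n), '{:o}'.format(n),
         '{:x}'.format(n), '{:X}'.format(n)]
prefixed = ['{:#b}'.format(n), '{:#o}'.format(n),
            '{:#x}'.format(n), '{:#X}'.format(n)]
```

Note that with the 'X' format type the prefix is uppercase too.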

Field width

You can specify the width of the field to be printed. If the width specified is less than the length of the value, the length of the value is used (no truncation). If alignment is not specified, the value is left-justified (except numbers, which are right-justified):

>>> s = 'a string'
>>> '{:^12s}'.format(s)
'  a string  '
>>> '{:.>12s}'.format(s)
'....a string'

Comma separation

For numeric values only, only in decimal (default) format type, you can insert a comma to request that each group of three digits in the result be separated by a comma. This behavior ignores system locale; for a locale-aware use of appropriate digit grouping and decimal point character, see format type 'n' in Table 8-4. For example:

print('{:,}'.format(12345678))
# prints 12,345,678

Precision specification

The precision (e.g., .2) has different meaning for different format types (see the following section), with .6 as the default for most numeric formats. For the f and F format types, it specifies the number of decimal digits to which the value should be rounded in formatting; for the g and G format types, it specifies the number of significant digits to which the value should be rounded; for nonnumeric values, it specifies truncation of the value to its leftmost characters before formatting.

>>> s = 'a string'
>>> x = 1.12345 
>>> 'as f: {:.4f}'.format(x) 
'as f: 1.1235' 
>>> 'as g: {:.4g}'.format(x) 
'as g: 1.123'
>>> 'as s: {:.6s}'.format(s)
'as s: a stri'

Format type

The format specification ends with an optional format type, which determines how the value gets represented in the given width and at the given precision. When the format type is omitted, the value being formatted applies a default format type.

The s format type is used to format Unicode strings.

Integer numbers have a range of acceptable format types, listed in Table 8-4.

Table 8-4.  
Format type Formatting description
'b' Binary format—a series of ones and zeros
'c' The Unicode character whose ordinal value is the formatted value
'd' Decimal (the default format type)

'o' Octal format—a series of octal digits
'x' or 'X' Hexadecimal format—a series of hexadecimal digits, with the letters, respectively, in lower- or uppercase
'n' Decimal format, with locale-specific separators (commas in the UK and US) when system locale is set
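The integer format types in Table 8-4 can be compared with a minimal sketch such as the following. The 'n' type is left out because its output depends on the locale in effect.

```python
n = 255
as_binary = '{:b}'.format(n)        # '11111111'
as_char = '{:c}'.format(65)         # 'A'
as_decimal = '{:d}'.format(n)       # '255'
as_octal = '{:o}'.format(n)         # '377'
as_hex = '{:x} {:X}'.format(n, n)   # 'ff FF'
```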

Floating-point numbers have a different set of format types, shown in Table 8-5.

Table 8-5.  
Format type Formatting description
'e' or 'E' Exponential format—scientific notation, with an integer part between one and nine, using 'e' or 'E' just before the exponent
'f' or 'F' Fixed-point format with infinities ('inf') and nonnumbers ('nan') in lower- or uppercase
'g' or 'G' General format—uses a fixed-point format if possible, and otherwise uses exponential format; uses lower- or uppercase representations for 'e', 'inf', and 'nan', depending on the case of the format type

'n' Like general format, but uses locale-specific separators, when system locale is set, for groups of three digits and decimal points
'%' Percentage format—multiplies the value by 100 and formats it as a fixed-point followed by '%'
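The floating-point format types in Table 8-5 are sketched below with an arbitrary sample value. As with integers, 'n' is omitted since its result depends on the locale.

```python
x = 1234.5678
exponential = '{:.2e}'.format(x)          # '1.23e+03'
fixed = '{:.2f}'.format(x)                # '1234.57'
general_small = '{:g}'.format(x)          # '1234.57' (six significant digits)
general_big = '{:g}'.format(x * 1e6)      # '1.23457e+09'
percent = '{:.1%}'.format(0.256)          # '25.6%'
upper_inf = '{:F}'.format(float('inf'))   # 'INF'
```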

When no format type is specified, a float uses the 'g' format, with at least one digit after the decimal point and a default precision of 12.

>>> n = [3.1415, -42, 1024.0] 
>>> for num in n:
...     '{:>+9,.2f}'.format(num)
... 
'    +3.14'
'   -42.00'
'+1,024.00'

Nested format specifications

In some cases you want to include an argument to format to help determine the precise format of another argument. Nested formatting can be used to achieve this. For example, to format a string in a field four characters wider than the string itself, you can pass a value for the width to format, as in:

>>> s = 'a string'
>>> '{0:>{1}s}'.format(s, len(s)+4)
'    a string'
>>> '{0:_^{1}s}'.format(s, len(s)+4)
'__a string__'

With some care, you can use width specification and nested formatting to print a sequence of tuples into well-aligned columns. For example:

def print_strings_in_columns(seq_of_strings, widths):
    for cols in seq_of_strings:
        row = ['{c:{w}.{w}s}'.format(c=c, w=w)
               for c, w in zip(cols, widths)]
        print(' '.join(row))

(In 3.6, '{c:{w}.{w}s}'.format(c=c, w=w) can be simplified to f'{c:{w}.{w}s}', as covered in “New in 3.6: Formatted String Literals”.) Given this function, the following code:

c = [
        'four score and'.split(),
        'seven years ago'.split(),
        'our forefathers brought'.split(),
        'forth on this'.split(),
    ]

print_strings_in_columns(c, (8, 8, 8))

prints:

four     score    and     
seven    years    ago     
our      forefath brought 
forth    on       this    

Formatting of user-coded classes

Values are ultimately formatted by a call to their __format__ method with the format specification as an argument. Built-in types either implement their own method or inherit from object, whose __format__ method only accepts an empty string as an argument.

>>> import math
>>> object().__format__('')
'<object object at 0x110045070>'
>>> math.pi.__format__('18.6')
'           3.14159'

You can use this knowledge to implement an entirely different formatting mini-language of your own, should you choose. The following simple example demonstrates the passing of format specifications and the return of a (constant) formatted string result. Since the format specification is arbitrary, you may implement whatever formatting notation you like.

>>> class S(object):
...     def __format__(self, fstr):
...         print('Format string:', fstr)
...         return '42'
...
>>> my_s = S()
>>> '{:format_string}'.format(my_s)
Format string: format_string
'42'
>>> 'The formatted value is: {:anything you like}'.format(my_s)
Format string: anything you like
'The formatted value is: 42'

The return value of the __format__ method is substituted for the replacement field in the output of the call to format, allowing any desired interpretation of the format string.
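A somewhat more practical sketch than a constant result is a __format__ method that passes the format specification on to the components of the object. The Point class below is hypothetical, written only to illustrate the idea.

```python
class Point(object):
    """Hypothetical 2D point whose format spec is applied to each coordinate."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __format__(self, spec):
        # Delegate to the built-in format function, once per coordinate
        return '({}, {})'.format(format(self.x, spec), format(self.y, spec))

p = Point(1, 2.5)
fixed = '{:.2f}'.format(p)        # '(1.00, 2.50)'
default = '{}'.format(p)          # empty spec: '(1, 2.5)'
via_builtin = format(p, '+.1f')   # '(+1.0, +2.5)'
```

Delegating in this way lets instances reuse the standard mini-language for numbers rather than defining a new notation.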

In order to control your objects’ formatting more easily, the string module provides a Formatter class with many helpful methods for handling formatting tasks. See the documentation for Formatter in the online docs.
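As one possible use of Formatter, the following v3 sketch overrides its get_value method so that a missing named argument yields a marker instead of raising KeyError. The subclass name and the '<missing>' marker are our own choices for illustration.

```python
import string

class DefaultingFormatter(string.Formatter):
    """Hypothetical Formatter subclass: missing named fields get a marker."""
    def get_value(self, key, args, kwargs):
        # key is a str for named fields, an int for positional ones
        if isinstance(key, str):
            return kwargs.get(key, '<missing>')
        return super().get_value(key, args, kwargs)

fmt = DefaultingFormatter()
line = fmt.format('{user} logged in from {host}', user='ann')
padded = fmt.format('{0:>5}|{name:<5}|', 7, name='x')
```

Format specifiers keep working as usual, since only the value lookup is customized.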

New in 3.6: Formatted String Literals

This new feature helps use the formatting capabilities just described. It uses the same formatting syntax, but lets you specify expression values inline rather than through parameter substitution. Instead of argument specifiers, f-strings use expressions, evaluated and formatted as specified. For example, instead of:

>>> name = 'Dawn'
>>> print('{name!r} is {l} characters long'
...       .format(name=name, l=len(name)))
'Dawn' is 4 characters long

from 3.6 onward you can use the more concise form:

>>> print(f'{name!r} is {len(name)} characters long')
'Dawn' is 4 characters long

You can use nested braces to specify components of formatting expressions:

>>> for width in 8, 11:
...     for precision in 2, 3, 4, 5:
...         print(f'{3.14159:{width}.{precision}}')
...
     3.1
    3.14
   3.142
  3.1416
        3.1
       3.14
      3.142
     3.1416

Do remember, though, that these string literals are not constants—they evaluate each time a statement containing them runs, potentially implying runtime overhead.
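To illustrate that point, here is a minimal sketch (3.6 or later) in which one f-string, placed in a function, produces a different result on each call; the function name describe is hypothetical.

```python
def describe(name, price):
    # The f-string is evaluated anew on every call, using the current values
    return f'{name!r:<10}|{price:>10,.2f}|'

rows = [describe('tea', 3.5), describe('coffee', 1234.5)]
```

Conversions (!r) and format specifiers (after the colon) combine in f-strings just as they do in calls to format.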

Legacy String Formatting with %

A legacy form of string formatting expression in Python has the syntax:

format % values

where format is a string containing format specifiers and values are the values to format, usually as a tuple (in this book we cover only the subset of this legacy feature, the format specifier, that you must know to properly use the logging module, covered in “The logging package”).

The equivalent use in logging would be, for example:

logging.info(format, *values)

with the values coming as positional arguments after the first, format one.

The legacy string-formatting approach has roughly the same set of features as the C language’s printf and operates in a similar way. Each format specifier is a substring of format that starts with a percent sign (%) and ends with one of the conversion characters shown in Table 8-6.

Table 8-6. String-formatting conversion characters
Character Output format Notes

d, i

Signed decimal integer

Value must be number.

u

Unsigned decimal integer

Value must be number.

o

Unsigned octal integer

Value must be number.

x

Unsigned hexadecimal integer (lowercase letters)

Value must be number.

X

Unsigned hexadecimal integer (uppercase letters)

Value must be number.

e

Floating-point value in exponential form (lowercase e for exponent)

Value must be number.

E

Floating-point value in exponential form (uppercase E for exponent)

Value must be number.

f, F

Floating-point value in decimal form

Value must be number.

g, G

Like e or E when exp is < -4 or >= precision; otherwise, like f or F

exp is the exponent of the number being converted.

c

Single character

Value can be integer or single-character string.

r

String

Converts any value with repr.

s

String

Converts any value with str.

%

Literal % character

Consumes no value.

The r, s, and % conversion characters are the ones most often used with the logging module. Between the % and the conversion character, you can specify a number of optional modifiers, as we’ll discuss shortly.

What is logged with a formatting expression is format, where each format specifier is replaced by the corresponding item of values converted to a string according to the specifier. Here are some simple examples:

import logging
logging.getLogger().setLevel(logging.INFO)
x = 42
y = 3.14
z = 'george'
logging.info('result = %d', x)        # logs: result = 42
logging.info('answers: %d %f', x, y)  # logs: answers: 42 3.140000
logging.info('hello %s', z)           # logs: hello george

Format Specifier Syntax

A format specifier can include modifiers to control how the corresponding item in values is converted to a string. The components of a format specifier, in order, are:

  1. The mandatory leading % character that marks the start of the specifier

  2. Zero or more optional conversion flags:

    #

    The conversion uses an alternate form (if any exists for its type).

    0

    The conversion is zero-padded.

    -

    The conversion is left-justified.

    A space

    A space is placed before a positive number.

    +

    A numeric sign (+ or -) is placed before any numeric conversion.

  3. An optional minimum width of the conversion: one or more digits, or an asterisk (*), meaning that the width is taken from the next item in values

  4. An optional precision for the conversion: a dot (.) followed by zero or more digits, or by a *, meaning that the precision is taken from the next item in values

  5. A mandatory conversion type from Table 8-6

Each format specifier corresponds to an item in values by position, and there must be exactly as many values as format has specifiers (plus one extra for each width or precision given by *). When a width or precision is given by *, the * consumes one item in values, which must be an integer and is taken as the number of characters to use as width or precision of that conversion.
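The components just listed can be sketched with the % operator directly, which applies the same rules as logging does; the values are arbitrary.

```python
results = [
    '%5d|' % 42,             # minimum width 5, right-justified
    '%-5d|' % 42,            # '-' flag: left-justified
    '%05d' % 42,             # '0' flag: zero-padded
    '%+d % d' % (42, 42),    # '+' flag, then the space flag
    '%#x %#o' % (255, 8),    # '#' flag: alternate form
    '%.2f' % 3.14159,        # precision 2
    '%*d|' % (5, 42),        # width taken from values
    '%.*f' % (2, 3.14159),   # precision taken from values
    '%6.6s|' % 'truncated',  # width 6, at most 6 characters
]
```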

When to use %r

Most often, the format specifiers in your format string are all %s; occasionally, you’ll want to ensure horizontal alignment on the output (for example, in a right-justified, possibly truncated space of exactly six characters, in which case you might use %6.6s). However, there is an important special case for %r.

Always use %r to log possibly erroneous strings

When you’re logging a string value that might be erroneous (for example, the name of a file that is not found), don’t use %s: when the error is that the string has spurious leading or trailing spaces, or contains some nonprinting characters such as \b, %s might make this hard for you to spot by studying the logs. Use %r instead, so that all characters are clearly shown.
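The difference is easy to see in a small sketch, where a made-up filename carries a spurious trailing space:

```python
import logging

filename = 'data.txt '   # note the spurious trailing space

with_s = 'cannot open %s' % filename   # the trailing space is invisible
with_r = 'cannot open %r' % filename   # the quotes make the space obvious

# With logging, pass the value as an argument after the format string
logging.getLogger(__name__).warning('cannot open %r', filename)
```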

Text Wrapping and Filling

The textwrap module supplies a class and a few functions to format a string by breaking it into lines of a given maximum length. To fine-tune the filling and wrapping, you can instantiate the TextWrapper class supplied by textwrap and apply detailed control. Most of the time, however, one of the two main functions exposed by textwrap suffices:

wrap

wrap(s,width=70)

Returns a list of strings (without terminating newlines), each of which is no longer than width characters, and which (joined back together with spaces) equal s. wrap also supports other named arguments (equivalent to attributes of instances of class TextWrapper); for such advanced uses, see the online docs.

fill

fill(s,width=70)

Returns a single multiline string equal to '\n'.join(wrap(s,width)).
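A brief sketch of both functions, using a deliberately narrow width so that the line breaks are easy to follow:

```python
import textwrap

s = 'four score and seven years ago'
lines = textwrap.wrap(s, width=12)   # a list of lines, none longer than 12
block = textwrap.fill(s, width=12)   # the same lines as one multiline string
```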

The pprint Module

The pprint module pretty-prints complicated data structures, with formatting that strives to be more readable than that supplied by the built-in function repr (covered in Table 7-2). To fine-tune the formatting, you can instantiate the PrettyPrinter class supplied by pprint and apply detailed control, helped by auxiliary functions also supplied by pprint. Most of the time, however, one of two functions exposed by pprint suffices:

pformat

pformat(obj)

Returns a string representing the pretty-printing of obj.

pprint

pprint(obj,stream=sys.stdout)

Outputs the pretty-printing of obj to open-for-writing file object stream, with a terminating newline.

The following statements do exactly the same thing:

print(pprint.pformat(x))
pprint.pprint(x)

Either of these constructs is roughly the same as print(x) in many cases, such as when the string representation of x fits within one line. However, with something like x=list(range(30)), print(x) emits one long line, which a typical 80-column terminal wraps onto two lines at an arbitrary point, while the module pprint displays x over 30 lines, one line per item. Use pprint when you prefer the module’s specific display effects to those of the normal string representation.
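A small sketch of this behavior, assuming pprint's default settings (in particular, the default width of 80) and using an io.StringIO object as a stand-in stream:

```python
import io
import pprint

small = [1, 2, 3]
x = list(range(30))

# A short structure fits on one line, so pformat matches repr.
small_text = pprint.pformat(small)

# A long list is split, one item per line.
text = pprint.pformat(x)
line_count = len(text.splitlines())
print(line_count)

# pprint.pprint writes the same text, plus a newline, to the stream.
buf = io.StringIO()
pprint.pprint(x, stream=buf)
```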

The reprlib Module

The reprlib module (named repr in v2) supplies an alternative to the built-in function repr (covered in Table 7-2), with limits on length for the representation string. To fine-tune the length limits, you can instantiate or subclass the Repr class supplied by the module and apply detailed control. Most of the time, however, the function exposed by the module suffices.

repr

repr(obj)

Returns a string representing obj, with sensible limits on length.
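For instance, here is a brief sketch in v3 (where the module is named reprlib). The list of 1,000 integers is arbitrary sample data, and the maxlist setting is one of the limits a Repr instance exposes as attributes:

```python
import reprlib

big = list(range(1000))

# The module-level function applies the default limits.
short = reprlib.repr(big)
print(short)

# For finer control, set limits on your own Repr instance.
r = reprlib.Repr()
r.maxlist = 3            # show at most three list items
shorter = r.repr(big)
print(shorter)
```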

Unicode

To convert bytestrings into Unicode strings, in v2, use the unicode built-in or the decode method of bytestrings; or, you can let the conversion happen implicitly, when you pass a bytestring to a function that expects Unicode. In v3, the conversion must always be explicit, with the decode method of bytestrings.

In either case, the conversion is done by an auxiliary object known as a codec (short for coder-decoder). A codec can also convert Unicode strings to bytestrings, either explicitly, with the encode method of Unicode strings, or, in v2 only, implicitly.

To identify a codec, pass the codec name to unicode, decode, or encode. When you pass no codec name, and for v2 implicit conversion, Python uses a default encoding, normally 'ascii' in v2 ('utf8' in v3).
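A minimal v3 sketch of explicit conversion in both directions (the sample text is arbitrary):

```python
data = b'caf\xc3\xa9'              # UTF-8 bytes for 'cafe' with e-acute

text = data.decode('utf-8')        # bytes -> str, codec named explicitly
latin = text.encode('latin-1')     # str -> bytes, with a different codec
default = text.encode()            # no codec name: v3 defaults to UTF-8

print(repr(text), latin, default)
```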

Every conversion has a parameter errors, a string specifying how conversion errors are to be handled. The default is 'strict', meaning any error raises an exception. When errors is 'replace', the conversion replaces each character causing errors with '?' in a bytestring result, with u'\ufffd' in a Unicode result. When errors is 'ignore', the conversion silently skips characters causing errors. When errors is 'xmlcharrefreplace', the conversion replaces each character causing errors with the XML character reference representation of that character in the result. You may code your own function to implement a conversion-error-handling strategy and register it under an appropriate name by calling codecs.register_error, covered in Table 8-7.
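The standard values of errors can be compared side by side. This is a v3 sketch with arbitrary sample data:

```python
text = 'caf\u00e9'                 # not representable in ASCII

try:
    text.encode('ascii')           # errors defaults to 'strict'
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

replaced = text.encode('ascii', 'replace')
ignored = text.encode('ascii', 'ignore')
xml = text.encode('ascii', 'xmlcharrefreplace')

# Decoding: the final byte is not valid UTF-8 on its own.
bad = b'caf\xe9'
decoded = bad.decode('utf-8', 'replace')

print(strict_failed, replaced, ignored, xml, repr(decoded))
```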

The codecs Module

The mapping of codec names to codec objects is handled by the codecs module. This module also lets you develop your own codec objects and register them so that they can be looked up by name, just like built-in codecs. The codecs module also lets you look up any codec explicitly, obtaining the functions the codec uses for encoding and decoding, as well as factory functions to wrap file-like objects. Such advanced facilities of the module codecs are rarely used, and we do not cover them in this book.

The codecs module, together with the encodings package of the standard Python library, supplies built-in codecs useful to Python developers dealing with internationalization issues. Python comes with over 100 codecs; a list of these codecs, with a brief explanation of each, is in the online docs. It’s technically possible to install any supplied codec as the site-wide default in the module sitecustomize, but this is not good practice: rather, the preferred usage is to always specify the codec by name whenever you are converting between byte and Unicode strings. The codec installed by default in v2 is 'ascii', which accepts only characters with codes between 0 and 127, the 7-bit range of the American Standard Code for Information Interchange (ASCII) that is common to almost all encodings. A popular codec in Western Europe is 'latin-1', a fast, built-in implementation of the ISO 8859-1 encoding that offers a one-byte-per-character encoding of special characters found in Western European languages (note that 'latin-1' lacks the Euro currency character '€'; if you need that, use 'iso8859-15').
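To illustrate the points about 'ascii', 'latin-1', and the Euro character, here is a short v3 sketch; the helper function can_encode is our own illustrative name, not part of any module:

```python
def can_encode(text, codec):
    """Hypothetical helper: True if text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

e_acute = '\u00e9'
euro = '\u20ac'

print(can_encode(e_acute, 'ascii'))      # outside the 0-127 range
print(can_encode(e_acute, 'latin-1'))    # one byte in ISO 8859-1
print(can_encode(euro, 'latin-1'))       # the Euro sign is missing
print(can_encode(euro, 'iso8859-15'))    # present in ISO 8859-15

euro_bytes = euro.encode('iso8859-15')
```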

The codecs module also supplies codecs implemented in Python for most ISO 8859 encodings, with codec names from 'iso8859-1' to 'iso8859-15'. For example, if you use ASCII plus some Greek letters, as is common in scientific papers, you might choose 'iso8859-7'. On Windows systems only, the codec named 'mbcs' wraps the platform’s multibyte character set conversion procedures. Many codecs specifically support Asian languages. The codecs module also supplies several standard code pages (codec names from 'cp037' to 'cp1258'), old Mac-specific encodings (codec names from 'mac-cyrillic' to 'mac-turkish'), and Unicode standard encodings 'utf-8' (likely to be most often the best choice, thus recommended, and the default in v3) and 'utf-16' (the latter also has specific big-endian and little-endian variants: 'utf-16-be' and 'utf-16-le'). For use with UTF-16, codecs also supplies attributes BOM_BE and BOM_LE, byte-order marks for big-endian and little-endian machines, respectively, and BOM, the byte-order mark for the current platform.
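The following v3 sketch exercises a few of the codecs just mentioned, with arbitrary sample strings. Note that the plain 'utf-16' codec prepends a byte-order mark when encoding, while the explicitly big-endian and little-endian variants do not:

```python
import codecs

# ASCII plus a Greek letter fits in one byte per character.
alpha = '\u03b1'
greek_bytes = ('x=' + alpha).encode('iso8859-7')

# 'utf-16' uses the platform's byte order, marked by a leading BOM.
native = 'hi'.encode('utf-16')
big = 'hi'.encode('utf-16-be')
little = 'hi'.encode('utf-16-le')
print(native.startswith(codecs.BOM))

# When decoding with 'utf-16', a leading BOM selects the byte order.
roundtrip = (codecs.BOM_BE + big).decode('utf-16')
```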

The codecs module also supplies a function to let you register your own conversion-error-handling functions, as described in Table 8-7.

Table 8-7.  

register_error

register_error(name,func)

name must be a string. func must be callable with one argument e that’s an instance of exception UnicodeDecodeError (or, when the error occurs while encoding, UnicodeEncodeError), and must return a tuple with two items: the Unicode string to insert in the converted-string result and the index from which to continue the conversion (the latter is normally e.end). The function’s body can use e.encoding, the name of the codec of this conversion, and e.object[e.start:e.end], the substring that caused the conversion error.
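As a sketch of how such a handler might look in v3, the function below replaces each undecodable byte with its hexadecimal value in angle brackets. The handler name 'mark_bad_bytes' and the replacement format are our own choices for illustration:

```python
import codecs

def mark_bad_bytes(e):
    """Hypothetical handler: show undecodable bytes as <hex>."""
    if isinstance(e, UnicodeDecodeError):
        bad = e.object[e.start:e.end]       # the offending bytes
        return ('<%s>' % bad.hex(), e.end)  # replacement, resume index
    raise e                                 # other errors: stay strict

codecs.register_error('mark_bad_bytes', mark_bad_bytes)

result = b'caf\xe9!'.decode('ascii', 'mark_bad_bytes')
print(result)
```

Once registered, the name can be passed as the errors argument of any decode call, just like the predefined names such as 'replace'.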

The codecs module also supplies a function to help with files of encoded text:

EncodedFile

EncodedFile(file,datacodec,filecodec=None,errors='strict')

Wraps the file-like object file, returning a file-like object ef that implicitly and transparently applies the given encodings to all data read from or written to the file. When you write a bytestring s to ef, ef first decodes s with the codec named by datacodec, then encodes the result with the codec named by filecodec and writes it to file. When you read a string, ef applies filecodec first, then datacodec. When filecodec is None, ef uses datacodec for both steps in either direction.

For example, if you want to write bytestrings that are encoded in latin-1 to sys.stdout and have the strings come out in utf-8, in v2, use the following:

import sys, codecs
sys.stdout = codecs.EncodedFile(sys.stdout, 'latin-1', 'utf-8')

It’s normally sufficient to use, instead, io.open, covered in “Creating a “file” Object with io.open”.
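The following v3 sketch contrasts the two approaches. It uses an in-memory io.BytesIO object and a temporary file (both stand-ins chosen for illustration) rather than sys.stdout, so that the resulting bytes can be inspected:

```python
import codecs
import io
import os
import tempfile

# EncodedFile: latin-1 bytes go in, utf-8 bytes reach the wrapped file.
raw = io.BytesIO()
ef = codecs.EncodedFile(raw, 'latin-1', 'utf-8')
ef.write(b'caf\xe9')
recoded = raw.getvalue()

# io.open: work with text, and name the file's encoding once.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'caf\u00e9')
with io.open(path, 'rb') as f:
    on_disk = f.read()
with io.open(path, encoding='utf-8') as f:
    text = f.read()

print(recoded, on_disk, repr(text))
```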

The unicodedata Module

The unicodedata module supplies easy access to the Unicode Character Database. Given any Unicode character, you can use functions supplied by unicodedata to obtain the character’s Unicode category, official name (if any), and other, more exotic information. You can also look up the Unicode character (if any) that corresponds to a given official name. Such advanced facilities are rarely needed, and we do not cover them further in this book.
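For orientation only, here is a minimal v3 sketch of the lookups just described, with arbitrarily chosen sample characters:

```python
import unicodedata

ch = '\u00e9'
name = unicodedata.name(ch)            # official name
category = unicodedata.category(ch)    # 'Ll' means lowercase letter
print(name, category)

# Reverse lookup: from an official name to the character.
euro = unicodedata.lookup('EURO SIGN')

# Characters with no official name: supply a default to avoid ValueError.
newline_name = unicodedata.name('\n', 'no name')
```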
