Chapter 7. Strings

Given the fundamental nature of strings in all forms of programming, it should come as no surprise that Python's string features can fill an entire chapter. Whether it's interacting with users by way of keyboard input, sending content over the Web, analyzing the great American novel or participating in a Turing test,^[8] strings can be used for just about anything.

With all this emphasis on strings, Python makes sure to include a wide variety of features to support them. Some of these features are built right into the string objects themselves, others are provided by modules in the standard library, and many third-party libraries offer even more options. This chapter will focus on strings themselves and the tools in the standard library, rather than investigating third-party applications.

The first thing to understand about Python strings is that there are actually two different types: bytes and Unicode strings.

Bytes

At a very basic level, a string is really just a sequence of individual bytes. In this general sense, bytes are used for every piece of data a computer processes. Numbers, strings and more complex objects are all stored as bytes at some point, and anything more structured is built on top of a sequence of bytes. In a byte string, represented in Python by a bytes object, each character represents exactly one byte, so it's easy to interact with files and other interfaces to the outside world.

While standard strings—described later in the section on text—are identified as literals simply with a pair of straight single quotes ('example'), byte string literals include a b before the first quote. This is used in the source code as well as the repr() output for these values.

>>> b'example'
b'example'

Compatibility: Prior to 3.0

Originally, Python strings only supported sequences of bytes. Once it gained Unicode support, as described in the next section, the existing basic string started being called a "byte string" because it could only support one byte per character. This left it well-suited for non-textual data, like numbers and complex structures, particularly when working with files.

In Python 3.0, this distinction was made official with the new bytes type, which is equivalent to the standard str type in older versions. Therefore, to be compatible with older Python installations, you should use regular strings when working with non-textual data. Python 2.6 did add support for the b'' syntax mentioned here, but only for the sake of migrating source code; there is no actual bytes type. If you create a string using this syntax prior to 3.0, you'll just get a standard str object out of it.

Unfortunately, this is one area where it's impossible to be fully compatible with both the Python 2 line and the Python 3 line in the same code. There's simply no syntax or type that will reliably get the same behavior in both. For easiest maintenance, you can write code using the Python 2 syntax—perhaps even marking strings with b if you don't need compatibility with Python 2.5 or below—and use the 2to3 tool to convert those files to be compatible with Python 3.

The primary use of bytes is to convey non-textual information, such as numbers, dates, sets of flags and a number of other things. Even though Python doesn't directly know how to deal with those particular values, a bytes object will make sure that they pass through unchanged, so that your own code can handle each situation appropriately. Without any assumptions about the intentions of the data, bytes offer you maximum flexibility, but that means you'll need some way to convert data back and forth between bytes and something with more meaning to your application.

Simple Conversion: chr() and ord()

At a basic level, a byte is really just a number, which happens to be represented by a character of some kind. Python considers numbers and characters to be two different things, but their values are equivalent, so it's fairly easy to convert between them. Given a single byte, you can pass it into the built-in ord() function, which will return its equivalent integer value.

>>> ord(b'A')
65
>>> ord(b'!')
33
>>> list(b'Example')
[69, 120, 97, 109, 112, 108, 101]

Notice what happens when iterating over a sequence of bytes. Rather than one-character byte strings, you actually get the raw integers immediately, removing the need for ord() at all. This works well when converting single-byte values from bytes to numbers, but going in the other direction requires the built-in chr() function. As an inverse to ord(), it returns a single character based on the integer value you pass in.

>>> chr(65)
'A'
>>> chr(33)
'!'
>>> [chr(o) for o in [69, 120, 97, 109, 112, 108, 101]]
['E', 'x', 'a', 'm', 'p', 'l', 'e']
>>> ''.join(chr(o) for o in [69, 120, 97, 109, 112, 108, 101])
'Example'

There's one important thing to notice here: the string returned by chr() is a standard string, rather than a byte string, as evidenced by the lack of a b prefix. As you'll see in the section on text later in this chapter, standard strings work a bit differently from byte strings. The biggest problem for our purposes here, though, is that a standard string doesn't always equate directly to a single byte, so it's possible to get things wrong. In order to get things to work more reliably and get some extra features on top of it, we can use the struct module.
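As a brief aside, the mismatch between chr() and byte strings can be seen directly. The following is only a minimal sketch; the variable names are illustrative, and the use of the bytes() constructor is an addition here rather than something the text above relies on. It shows that chr() always produces text, while bytes() builds a byte string from integers and rejects anything that can't fit in a single byte.

```python
# Integer values for the ASCII characters in "Example"
values = [69, 120, 97, 109, 112, 108, 101]

# chr() yields text (str), one character per integer
as_text = ''.join(chr(v) for v in values)

# bytes() accepts an iterable of integers in range(256) and yields bytes
as_bytes = bytes(values)

print(type(as_text).__name__, repr(as_text))    # str 'Example'
print(type(as_bytes).__name__, repr(as_bytes))  # bytes b'Example'

# chr() happily accepts values above 255, but they can't fit in one byte
euro = chr(8364)
try:
    bytes([8364])
    too_big_rejected = False
except ValueError:
    too_big_rejected = True
print(too_big_rejected)  # True
```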

Complex Conversion: The Struct Module

In addition to the problem with chr() returning standard strings, a big problem with the ord()/chr() combination is that it can only be reliably used when working with individual bytes. When converting numbers to bytes, that limits it to values from 0 to 255. In order to support a wider range of values and some other interesting features, Python provides the struct module.

Similarly to how chr() and ord() represent a pair to convert values between byte strings and native Python values, struct.pack() writes out byte strings, while struct.unpack() reads those values back into Python. Unlike those simpler functions, though, the struct module uses a format string to specify how values should get converted. This format has its own sort of simple syntax to control what types of values to use and how they work.

Since we came by struct to overcome some difficulties with chr(), we'll start by looking at how struct.pack() can provide the intended functionality. The format to use for a single, unsigned byte is B, and here's how you'd use it in practice.

>>> import struct
>>> struct.pack(b'B', 65)
b'A'
>>> struct.pack(b'B', 33)
b'!'
>>> struct.pack(b'BBBBBBB', 69, 120, 97, 109, 112, 108, 101)
b'Example'

As you can see, the first argument is the format string itself, with one character for each argument that should get converted into the byte string. All additional arguments are used to provide the values that should be converted. Therefore, for each format specifier, you'll need to include an argument at the equivalent position.

As mentioned, B specifies an unsigned value, which means there can be no negative values. With this, you could provide values from 0 to 255, but nothing below zero. A signed value, on the other hand, allows negative values by using one of the eight bits in the byte to identify whether the value is positive or negative. There are still 256 unique values, but the range is shifted a bit so that half the values are on each side of the sign. With 0 being considered a positive value, a signed byte can contain values from −128 to 127. To complement unsigned bytes, the format specifier for signed bytes is b.

>>> struct.pack(b'b', 65)
b'A'
>>> struct.pack(b'Bb', 65, -23)
b'A\xe9'
>>> struct.pack(b'B', 130)
b'\x82'
>>> struct.pack(b'b', 130)
Traceback (most recent call last):
  ...
struct.error: byte format requires -128 <= number <= 127

Of course, B and b are only valid for single byte values, limited to 256 total values. To support larger numbers, you can use H and h for 2-byte numbers, allowing up to 65,536 values. Just like the single-byte option, the uppercase format assumes an unsigned value, while the lowercase format assumes a signed value.

>>> struct.pack(b'Hh', 42, -137)
b'*\x00w\xff'

Now that a single value can span multiple bytes, there comes the question of which byte comes first. One of the two bytes contains the 256 smallest values, while the other contains the values 0 to 255, but multiplied by 256. Therefore, getting the two mixed up can greatly affect the value that gets stored or retrieved. This is easy enough to see by taking a quick look at the inverse function, struct.unpack().

>>> struct.unpack(b'H', b'*\x00')
(42,)
>>> struct.unpack(b'H', b'\x00*')
(10752,)

As you can see, the function call for struct.unpack() looks very similar to struct.pack(), but there are a couple notable differences. First, there are always only two arguments to unpack() because the second argument is the raw byte string. This string can contain multiple values to be pulled out, but it's still passed as just one argument, unlike pack().

Second, the return value is always a tuple, which could contain multiple values. Therefore, struct.unpack() is a true inverse of struct.pack(); that is, you can pass the result from one into the call to the other and get the same value you passed in the first time. All you need is to ensure you use the same format string in each of the individual function calls.

>>> struct.unpack(b'Hh', struct.pack(b'Hh', 42, -42))
(42, -42)
>>> struct.pack(b'Hh', *struct.unpack(b'Hh', b'*\x00\x00*'))
b'*\x00\x00*'

So what's the problem with values spanning multiple bytes? After all, these examples show that values can be converted to a string and back without worrying about how those strings are created or parsed. Unfortunately, it's only easy because we're currently working only within Python, which has an implementation that's consistent with itself. If you have to work with strings, such as file contents, that need to be used with other applications, you'll need to make sure you match up with what those applications expect.

Therefore, struct formats also allow you to explicitly specify the endianness of a value. Endianness is the term for how the bytes of a value are ordered; in a big-endian value, the most significant byte—the byte that provides the largest part of the number—gets stored first. For little-endian values, the least significant byte is stored first.

To distinguish between the two, the format specification can take a prefix. If you place a < before the format, you can explicitly declare it to be little-endian. Conversely, using > will mark it as big-endian. If neither option is supplied, as in the previous examples, the default behavior is to use the same endianness as the system where Python is executing, which is typically little-endian on modern systems. This allows you to control the way values are treated for both pack() and unpack(), covering both sides of the conversion process.

>>> struct.pack(b'<H', 42)
b'*\x00'
>>> struct.pack(b'>H', 42)
b'\x00*'
>>> struct.unpack(b'<H', b'*\x00')
(42,)
>>> struct.unpack(b'>H', b'*\x00')
(10752,)
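To check which ordering a prefix-free format will use on a given machine, the result can be compared against both explicit forms. This is only a sketch: sys.byteorder and int.to_bytes() are standard features not otherwise discussed in this chapter, and the variable names are illustrative.

```python
import struct
import sys

# 'little' or 'big', depending on the machine running Python
print(sys.byteorder)

native = struct.pack('H', 42)   # no prefix: native byte order
little = struct.pack('<H', 42)  # explicitly little-endian
big = struct.pack('>H', 42)     # explicitly big-endian

# The native result matches whichever explicit order the machine uses
expected = little if sys.byteorder == 'little' else big
print(native == expected)  # True

# int.to_bytes() produces the same bytes for a given size and order
print((42).to_bytes(2, 'little') == little)  # True
print((42).to_bytes(2, 'big') == big)        # True
```

Code that reads or writes data shared with other applications is usually safer with an explicit prefix, since the result then no longer depends on where it runs.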

Now that it's possible to control the ordering of multiple-byte numbers, it's easier to work with larger values. In addition to the one- and two-byte integers discussed previously, struct supports four-byte values using I and i, while eight-byte values can be specified using Q and q. Like the others, uppercase letters indicate unsigned values, while lowercase letters indicate signed values.
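The sizes and ranges of these integer formats can be checked with struct.calcsize(), which reports how many bytes a format occupies. The sketch below is illustrative only; the ranges dictionary is a hypothetical helper, and the < prefix is used so that the sizes are the standard ones rather than platform-dependent.

```python
import struct

ranges = {}
for code in 'BbHhIiQq':
    # With an explicit byte order, struct uses standard sizes
    size = struct.calcsize('<' + code)
    bits = size * 8
    if code.isupper():
        # Unsigned: every bit contributes to the magnitude
        low, high = 0, 2 ** bits - 1
    else:
        # Signed: one bit is given over to the sign
        low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Packing both extremes confirms that they really fit
    struct.pack('<' + code, low)
    struct.pack('<' + code, high)
    ranges[code] = (size, low, high)
    print(code, size, low, high)
```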

The struct module goes beyond just conversion of integers, though. You can also convert floating point values using the f format, or the d format for greater precision. In fact, you can use struct to work with strings inside strings as well, giving you some extra flexibility. This is done with the s format code, combined with a numeric prefix to indicate the size of the string to read or write.

>>> struct.pack(b'7s', b'example')
b'example'
>>> struct.unpack(b'7s', b'example')
(b'example',)
>>> struct.pack(b'10s', b'example')
b'example\x00\x00\x00'

As you can see, pack() will add in null bytes to fill in as many bytes as necessary to match the prefix supplied in the format. But why would you want to use struct to turn a string into a string? The benefit is that you can pack and unpack multiple values at a time, so the string might just be part of the structure. Consider a simple byte string that contains a person's contact information.

>>> first_name = b'Marty'
>>> last_name = b'Alchin'
>>> age = 28
>>> data = struct.pack(b'10s10sB', last_name, first_name, age)
>>> data
b'Alchin\x00\x00\x00\x00Marty\x00\x00\x00\x00\x00\x1c'

If you're looking to work with strings in this manner, though, you're more likely working with text, where the string has meaning as a whole, rather than its characters being conversions of some other types of values.
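Reading such a record back is a matter of unpacking with the same format and then discarding the null padding. The following is a minimal sketch under a few assumptions: the RECORD name is illustrative, the < prefix is added for portability, and the names are assumed to be plain ASCII.

```python
import struct

# Two 10-byte strings followed by one unsigned byte
RECORD = '<10s10sB'

data = struct.pack(RECORD, b'Alchin', b'Marty', 28)
print(len(data))  # 21

last_raw, first_raw, age = struct.unpack(RECORD, data)

# unpack() returns the padded values, so strip the trailing null bytes
# before converting the bytes to text
last_name = last_raw.rstrip(b'\x00').decode('ascii')
first_name = first_raw.rstrip(b'\x00').decode('ascii')

print(first_name, last_name, age)  # Marty Alchin 28
```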

Text

Conceptually, text is a collection of written words. It's a linguistic concept that existed long before computing, but once it became clear that computers would need to work with text, it was necessary to determine how to represent text in a system designed for numbers. When programming was still young, text was limited to a set of characters known as the American Standard Code for Information Interchange (ASCII).

Notice the "American" part in there; this set of 128 characters (only 95 of them printable) is designed to address only the needs of the English language. ASCII only covered 7 bits of each byte, so there was some room for potential future expansion, but even another 128 values weren't enough. Some applications employed special tricks to convey additional letters by adding accents and other marks, but the standard was still very limited in scope.

Unicode

Later, the Unicode standard emerged as an alternative that could contain most of the characters used in the vast majority of the world's languages. In order for Unicode to support as many code points as it needs, each code point takes up more than one byte, unlike in ASCII. When loaded in memory, this isn't a problem because it's only used within Python, which only has one way of managing those multiple-byte values.

Note

The Unicode standard is actually made up of more than a million individual "code points" rather than characters. A code point is a number that represents some facet of written text, which can be a regular character, a symbol or a modifier, such as an accent. Some characters are even present at multiple code points for compatibility with systems in use prior to the introduction of Unicode.
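The difference between a character and a code point can be seen with an accented letter, which can be written either as one code point or as a base letter followed by a modifier. This sketch uses the standard unicodedata module, which isn't otherwise covered here; the variable names are illustrative.

```python
import unicodedata

composed = '\u00e9'     # a single code point for an accented "e"
decomposed = 'e\u0301'  # a plain "e" followed by a combining accent

# Both display the same way, but contain different code points
print(len(composed), len(decomposed))  # 1 2
print(composed == decomposed)          # False

# The second code point is a modifier, not a character in its own right
print(unicodedata.name('\u0301'))  # COMBINING ACUTE ACCENT

# Normalizing to the composed form (NFC) makes the two comparable
normalized = unicodedata.normalize('NFC', decomposed)
print(normalized == composed)  # True
```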

By default, all standard strings in Python are Unicode, supporting a wide array of languages in the process. The byte strings shown in the previous section all required the use of a b prefix to distinguish them as different from standard Unicode strings.

Compatibility: Prior to 3.0

Much like other programming languages, strings in Python used to be byte strings by default, with Unicode represented by a separate type. While byte strings were quoted without a prefix, Unicode strings used the u prefix to indicate that they support the full Unicode character set. In the switch to Python 3, Unicode strings were made the default, and the u prefix was removed; it was only accepted again, purely as a porting aid, starting with Python 3.3.

Unfortunately, like the byte strings mentioned earlier in this chapter, there's no syntax that will be compatible with both Python 2 and the Python 3 releases covered here (3.0 and 3.1). Instead, any strings marked with the u prefix in your Python 2.x code will be converted to the new syntax with the 2to3 conversion tool.

The trouble comes when writing those values out to strings that can be read by other systems because not all systems use the same internal representation of Unicode strings. Instead, there are several different encodings that can be used to collapse a Unicode string into a series of bytes for storage or distribution.

Encodings

Much like how multiple bytes can be used to store a number larger than one byte would allow, Unicode text can be stored in a multiple-byte format. Unlike numbers, though, text generally contains a large number of individual characters, so storing each as up to four bytes would mean a long passage of text could end up much larger than it may seem.

To support text as efficiently as possible, it quickly became clear that not all text requires the full range of available characters. This book, for example, is written in English, which means the vast majority of its content lies within the ASCII range. As such, most of it could go from four bytes per character down to just one.

ASCII is one example of a text encoding. In this particular case, a small set of available characters are mapped to specific values from 0 to 127. The characters chosen are intended to support English, so it contains all the available letters in uppercase and lowercase variants, all ten numerals and a variety of punctuation options. Any text that contains just these values can be converted to bytes using the ASCII encoding.

The encoding process itself is managed using a string's encode() method. Simply pass in the name of an encoding and it will return a byte string representing the text in the given encoding. In the case of ASCII, the representation of the byte string looks just like the input text, because each byte maps to exactly one character.

>>> 'This is an example, with punctuation and UPPERCASE.'.encode('ascii')
b'This is an example, with punctuation and UPPERCASE.'

By mapping each byte to a single character, ASCII is very efficient, but only if the source text only contains those characters specified in the encoding. Certain assumptions had to be made about what characters were important enough to include in such a small range. Other languages will have their own characters that take priority, so they use different encodings in order to be as efficient as ASCII is for English.

Some languages, including Chinese and Japanese, have so many characters that there's no way a single byte could hope to represent them. The encodings for these languages use two bytes for every character, further highlighting how different the various text encodings can be. Because of this, an encoding designed for a particular language often can't be used for text outside of that language.

To address this, there are some more generic Unicode-focused encodings. Because of the sheer number of available characters, these encodings use a variable-length approach. In UTF-8, the most common of these, characters within a certain range can be represented in a single byte. Other characters require two bytes, while still others can use three or even four bytes. UTF-8 is desirable because of a few particular traits it exhibits.

It can support any available Unicode code point, even if it isn't commonly used in actual text. That feature isn't unique to UTF-8, but it definitely sets it apart from language-specific encodings, such as ASCII.
The more common the character is in actual use, the less space its code point takes. In a collection of mostly English documents, for example, UTF-8 can be nearly as efficient as ASCII. Even when encoding non-English text, most languages share certain common characters, such as spaces and punctuation, which can be encoded with a single byte. When it has to use two bytes, it's still more efficient than an in-memory Unicode object.
The single-byte range precisely coincides with the ASCII standard, making UTF-8 completely backward compatible with ASCII text. All ASCII text can be read as UTF-8, without modification. Likewise, text that only contains characters that are also available in ASCII can be encoded using UTF-8 and still be accessed by applications that only understand ASCII.
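These traits are easy to observe by encoding a few individual characters. The sketch below uses four arbitrarily chosen sample characters, one for each possible length; the names samples and lengths are illustrative.

```python
# One sample character for each UTF-8 length, from one byte to four
samples = ['A', '\u00e9', '\u20ac', '\U0001f600']

lengths = []
for ch in samples:
    encoded = ch.encode('utf-8')
    lengths.append(len(encoded))
    # Show the code point, the number of bytes and the bytes themselves
    print('U+%04X' % ord(ch), len(encoded), encoded)

print(lengths)  # [1, 2, 3, 4]

# Text within the ASCII range encodes identically either way
text = 'Plain English text.'
print(text.encode('ascii') == text.encode('utf-8'))  # True
```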

For these reasons, among others, UTF-8 has emerged as a very common encoding for applications that need to support multiple languages or where the language of the application isn't known at the time it's being designed. That may seem like an odd situation to be in, but it comes up fairly frequently when looking at frameworks, libraries and other large-scale applications. They could be deployed in any environment on the planet, so they should do as much as possible to support other languages. The next chapter will describe, in more detail, the steps an application can take to support multiple languages.

The consequences of using the wrong encoding can vary, depending on the needs of the application, the encoding used and the text passed in. For example, ASCII text can be decoded using UTF-8 without a problem, yielding a perfectly valid Unicode string. Reversing that process is not always as forgiving because a Unicode string can contain code points outside the valid ASCII range.

>>> ascii = 'This is a test'.encode('ascii')
>>> ascii
b'This is a test'
>>> ascii.decode('utf-8')
'This is a test'
>>> unicode = 'This is a test: \u20ac'  # A manually encoded Euro symbol
>>> unicode.encode('utf-8')
b'This is a test: \xe2\x82\xac'
>>> unicode.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\u20ac' in position 16: ordinal not in range(128)

Other times, text can seem to be encoded or decoded properly, only to have the resulting text be gibberish. Typically, though, problems like that arise when upgrading an application to include proper Unicode support, but existing data wasn't encoded consistently. Building an application for Unicode from the ground up doesn't completely eliminate the possibility of these problems, but it greatly helps avoid them.
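A common way to end up with gibberish is to decode UTF-8 bytes with a single-byte encoding such as Latin-1, which accepts any byte value and therefore never raises an error. The following sketch is illustrative only; Latin-1 is chosen here as one example of a mismatched encoding.

```python
original = 'caf\u00e9'  # "cafe" with an accented final letter

# The accented letter takes two bytes in UTF-8
stored = original.encode('utf-8')
print(stored)  # b'caf\xc3\xa9'

# Latin-1 maps every byte to a character, so decoding "succeeds",
# but each of the two bytes becomes its own, unrelated character
garbled = stored.decode('latin-1')
print(len(original), len(garbled))  # 4 5

# If the mistake is known, it can be undone by reversing the steps
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired == original)  # True
```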

Compatibility: Prior to 3.0

One of the main reasons all of this seems confusing is that Python 2 was fairly vague on how encoding and decoding were supposed to work. Both byte strings and Unicode strings had an encode() and a decode() method, and the two types could often be used interchangeably. Python 3 clarifies the situation by only putting encode() on standard Unicode strings, while decode() is only available from byte strings. This, along with a few other differences, means the two can no longer be used interchangeably.
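The separation in Python 3 can be confirmed with a quick check of which methods each type provides. This is only a sketch, and the variable names are illustrative.

```python
text = 'example'
data = b'example'

# Text can only be encoded; bytes can only be decoded
print(hasattr(text, 'encode'), hasattr(text, 'decode'))  # True False
print(hasattr(data, 'decode'), hasattr(data, 'encode'))  # True False

# The two types can't be combined without an explicit conversion
try:
    text + data
    mixed_allowed = True
except TypeError:
    mixed_allowed = False
print(mixed_allowed)  # False

# An explicit conversion makes the intent clear
combined = text + data.decode('ascii')
print(combined)  # exampleexample
```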

Most applications do well using UTF-8, but there are a number of other encodings available. Feel free to consult the full list,^[9] in case something else is more appropriate for your needs.

Simple Substitution

There are a few different ways to produce a string with information that's only available at run-time. Perhaps the most obvious is to concatenate multiple strings together using the + operator, but that only works if all the values are strings. Python won't implicitly convert other values to strings to be concatenated, so you'd have to convert them explicitly, by first passing them into the str() function, for example.
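A short sketch of that limitation follows; the count and message names are purely illustrative.

```python
count = 3

# Concatenating a string and an integer fails outright
try:
    message = 'Found ' + count + ' items'
    failed = False
except TypeError:
    failed = True
print(failed)  # True

# Converting explicitly with str() makes it work
message = 'Found ' + str(count) + ' items'
print(message)  # Found 3 items
```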

As an alternative, Python strings also support a way to inject objects into a string. This uses placeholders inside a string to denote where objects should go, along with a collection of objects that should fill them in. This is called string substitution, and is performed using the % operator, using a custom __mod__() method, as described in Chapter 5.

Placeholders consist of a percent sign and a conversion format, optionally with some modifiers between them to specify how the conversion should take place. This scheme allows the string to specify how objects should get converted, rather than having to call a separate function explicitly. The most common of these formats is %s, which is equivalent to using the str() function directly.

>>> 'This object is %s' % 1
'This object is 1'
>>> 'This object is %s' % object()
'This object is <object object at 0x...>'

Because this is equivalent to calling str() directly, the value placed into the string is the result of calling the object's __str__() method. Similarly, if you use the %r placeholder inside the substitution string, Python will call the object's __repr__() method instead. This can be useful for logging arguments to a function, for example.

>>> def func(*args):
...     for i, arg in enumerate(args):
...         print('Argument %s: %r' % (i, arg))
...
>>> func('example', {}, [1, 2, 3], object())
Argument 0: 'example'
Argument 1: {}
Argument 2: [1, 2, 3]
Argument 3: <object object at 0x...>

This example also illustrates how multiple values can be placed into the string at once, by wrapping them in a tuple. They're matched up with their counterparts in the string according to their position, so the first object goes in the first placeholder and so on. Unfortunately, this feature can also be a stumbling block at times, if you're not careful. The most common error occurs when attempting to inject a tuple into the substitution string.

>>> def log(*args):
...     print('Logging arguments: %r' % args)
...
>>> log('test')
Logging arguments: 'test'
>>> log('test', 'ing')
Traceback (most recent call last):
  ...
TypeError: not all arguments converted during string formatting

What's going on here is that Python makes no distinction between a tuple that was written as such in the source code and one that was merely passed from somewhere else. Therefore, string substitution has no way of knowing what your intention is. In this example, the substitution works fine as long as only one argument is passed in because there's exactly one placeholder in the string. As soon as you pass in more than one argument, it breaks.

In order to resolve this, you'll need to build a one-item tuple to contain the tuple you want to place in the string. This way, the string substitution always gets a single tuple, which contains one tuple to be placed into a single placeholder.

>>> def log(*args):
...     print('Logging arguments: %r' % (args,))
...
>>> log('test')
Logging arguments: ('test',)
>>> log('test', 'ing')
Logging arguments: ('test', 'ing')

With the tuple situation sorted out, it's worth noting that objects can be inserted by keyword as well. Doing so requires the substitution string to contain the keywords in parentheses, immediately following the percent sign. Then, to pass in values to inject, simply pass in a dictionary of objects, rather than a tuple.

>>> def log(*args):
...     for i, arg in enumerate(args):
...         print('Argument %(i)s: %(arg)r' % {'i': i, 'arg': arg})
...
>>> log('test')
Argument 0: 'test'
>>> log('test', 'ing')
Argument 0: 'test'
Argument 1: 'ing'

In addition to being able to more easily rearrange placeholders in the substitution string, this feature allows you to include just those values that are important. If you have a dictionary with more values than you need in the string, you can reference only the ones you need. Python will simply ignore any values that aren't mentioned by name in the string. This is in contrast to the positional option, where supplying more values than you've marked in the string will result in a TypeError.
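The contrast between the two behaviors can be sketched as follows, using a hypothetical values dictionary that contains more than the string needs.

```python
values = {'name': 'spam', 'count': 3, 'unused': None}

# Only the keys named in the string are used; 'unused' is ignored
by_keyword = '%(count)s servings of %(name)s' % values
print(by_keyword)  # 3 servings of spam

# Keyword placeholders can also be rearranged or repeated freely
repeated = '%(name)s, %(name)s and more %(name)s' % values
print(repeated)  # spam, spam and more spam

# Positional substitution is stricter: an extra value is an error
try:
    '%s' % ('spam', 'eggs')
    error = None
except TypeError as exc:
    error = str(exc)
print(error)  # not all arguments converted during string formatting
```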

Compatibility: Going Forward

The string formatting described in this section is a bit of an interesting case with regard to compatibility. It's considered to be obsolete, having been replaced by the more robust string formatting system described in the next section. However, it was not removed from Python during the switch to version 3, so strings that use this system will work in all versions of Python covered by this book, including 3.1.

The long-term plan, though, is to remove this feature in a future version of Python, once the new formatting option takes hold. Therefore, this section exists to document something that will work in all current versions of Python as of this book's publish date, but it will be removed at some point in the future. Please use this section, along with the next one, as a guide to understanding how existing string substitution works, so that any strings using it can be converted to the newer formatting feature.

Formatting

For a more powerful alternative to the simple string substitution described in the previous section, Python also includes a robust formatting system for strings. Rather than relying on a less obvious operator, string formatting uses an explicit format() method on strings. In addition, the syntax used for the formatting string is considerably different from what was used in simple substitution previously.

Instead of using a percent sign and a format code, format() expects its placeholders to be surrounded by curly braces. What goes inside those braces depends on how you plan to pass in the values and how they should be formatted. The first portion of the placeholder determines whether it should look for a positional argument or a keyword argument. For positional arguments, the content is a number, indicating the index of the value to work with, while for keyword arguments, you supply the key that references the appropriate value.

>>> 'This is argument 0: {0}'.format('test')
'This is argument 0: test'
>>> 'This is argument key: {key}'.format(key='value')
'This is argument key: value'

This may look a lot like the older substitution technique, but it has one major advantage already. Because formatting is initiated with a method call, rather than an operator, you can specify both positional and keyword arguments together. That way, you can mix and match indexes and keys in the format string if necessary, referencing them in any order.
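A brief sketch of mixing the two styles in a single call follows; the template text and argument names are made up for illustration.

```python
template = '{greeting}, {0}! You have {1} new {kind}.'

# Positional and keyword arguments are supplied in the same call
result = template.format('Marty', 3, greeting='Hello', kind='messages')
print(result)  # Hello, Marty! You have 3 new messages.

# Indexes can be referenced in any order, and more than once
swapped = '{1} then {0} then {1}'.format('a', 'b')
print(swapped)  # b then a then b
```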

As an added bonus, that also means that not all positional arguments need to be referenced in the string in order to work properly. If you supply more than you need, format() will just ignore anything it doesn't have a placeholder for. This makes it much easier to pass a format string into an application that will call format() on it later, with arguments that may come from another source. One such example is a customizable validation function that accepts an error message during customization.

>>> def exact_match(expected, error):
...     def validator(value):
...         if value != expected:
...             raise ValueError(error.format(value, expected))
...     return validator
...
>>> validate_zero = exact_match(0, 'Expected {1}, got {0}')
>>> validate_zero(0)
>>> validate_zero(1)
Traceback (most recent call last):
  ...
ValueError: Expected 0, got 1
>>> validate_zero = exact_match(0, '{0} != {1}')
>>> validate_zero(1)
Traceback (most recent call last):
  ...
ValueError: 1 != 0
>>> validate_zero = exact_match(0, '{0} is not the right value')
>>> validate_zero(1)
Traceback (most recent call last):
  ...
ValueError: 1 is not the right value

As you can see, this feature lets the validator function call format() using all the information it has available at the time, leaving it up to the format string to determine how to lay it out. With the other string substitution, you'd be forced to use keywords to achieve the same effect because positional arguments just didn't work the same way.

Compatibility: Converting Strings

If you prefer the previous behavior of looking up positional arguments without having to number them explicitly—or if you're looking for an easy way to convert all your strings over at once—this new feature can work the same way. Simply leave out the index altogether and it'll pick up arguments according to the order in which the placeholders appear in the string. So if you used a string like '%s: %r' with the older substitution technique, the direct equivalent in the newer formatting would be '{!s}: {!r}'.

You can't use this alongside explicitly numbered placeholders, though, so you'll have to choose one style for the whole string. That works out for an automated conversion to the new style, but it requires some attention if you have complicated strings that need to reorder arguments later on. Of course, keyword arguments don't have any ambiguity in either case, so they work fine alongside either style.
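The equivalence, along with the restriction on mixing styles, can be sketched like this; the sample values are arbitrary.

```python
# The older and newer techniques produce the same result
old_style = '%s: %r' % ('key', 'value')
new_style = '{!s}: {!r}'.format('key', 'value')
print(old_style)               # key: 'value'
print(old_style == new_style)  # True

# Automatic and explicit numbering can't be mixed in one string
try:
    '{} {0}'.format('a')
    mixing_allowed = True
except ValueError:
    mixing_allowed = False
print(mixing_allowed)  # False

# Keyword placeholders work fine alongside automatic numbering
with_keyword = '{}: {name}'.format('first', name='second')
print(with_keyword)  # first: second
```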

Looking Up Values Within Objects

In addition to being able to reference the objects being passed in, the format string syntax allows you to refer to portions of those objects specifically. The syntax for this looks much like it would in regular Python code. To reference an attribute, separate its name from the object reference with a period. To use an indexed or keyword value, supply the index or keyword inside square brackets; just don't use quotes around the keyword.

>>> import datetime
>>> def format_time(time):
...     return '{0.minute} past {0.hour}'.format(time)
...
>>> format_time(datetime.time(8, 10))
'10 past 8'
>>> '{0[spam]}'.format({'spam': 'eggs'})
'eggs'
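These lookups can also be chained. As a rough illustration, the sketch below assumes a hypothetical Point type and data layout, then reaches through a dictionary key, a list index and an attribute in a single placeholder.

```python
from collections import namedtuple

# Point and the data layout are hypothetical, chosen only for illustration.
Point = namedtuple('Point', 'x y')
data = {'points': [Point(1, 2), Point(3, 4)]}

# Lookups can be chained: a dictionary key, then a list index, then an attribute.
# Keys are written without quotes; an all-digit index is used as an integer.
result = 'Second point: ({0[points][1].x}, {0[points][1].y})'.format(data)
print(result)
```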

Distinguishing Types of Strings

You may remember that simple substitution required you to specify either %s or %r to indicate whether the __str__() method or the __repr__() method should be used to convert an object to a string, while the examples given thus far haven't included such a hint. By default, format() will use __str__(), but that behavior can still be controlled as part of the format string. Immediately following the object reference, simply include an exclamation point, followed by either s or r. Python 3 also accepts a, which converts the object using ascii().

>>> validate_test = exact_match('test', 'Expected {1!r}, got {0!r}')
>>> validate_test('invalid')
Traceback (most recent call last):
  ...
ValueError: Expected 'test', got 'invalid'
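To compare the conversions directly, here is a small sketch using an arbitrary sample string. Besides s and r, it also shows a, a third flag that Python 3 accepts, which applies ascii() and so escapes any non-ASCII characters.

```python
text = 'caf\u00e9'  # 'café', written with an escape to keep the source ASCII-only

as_str = '{0!s}'.format(text)    # uses str()
as_repr = '{0!r}'.format(text)   # uses repr(), so quotes are included
as_ascii = '{0!a}'.format(text)  # uses ascii(), which escapes non-ASCII characters

for value in (as_str, as_repr, as_ascii):
    print(value)
```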

Standard Format Specification

Where this new string formatting really differs from the previous substitution feature is in the amount of flexibility available to format the output of objects. After the field reference and the string type mentioned in previous sections, you can include a colon, followed by a string that controls the formatting of the referenced object. There's a standard syntax for this format specification, which is generally applicable to most objects.

The first option controls the alignment of the output string, which is used when you need to specify a minimum number of characters to output. Supplying a left angle bracket (<) produces a left-aligned value, a right angle bracket (>) aligns to the right and a caret (^) centers the value. The total width can be specified as a number afterward.

>>> import os.path
>>> '{0:>20}{1}'.format(*os.path.splitext('contents.txt'))
'            contents.txt'
>>> for filename in ['contents.txt', 'chapter.txt', 'index.txt']:
...     print('{0:<10}{1}'.format(*os.path.splitext(filename)))
...
contents  .txt
chapter   .txt
index     .txt
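When a width is given without an alignment character, the default depends on the type of the value. The sketch below uses arbitrary sample values to show that default for a string and a number, along with centering.

```python
# With a width but no alignment character, the default depends on the type:
# strings are left-aligned, numbers are right-aligned.
default_text = '{0:10}'.format('index')
default_number = '{0:10}'.format(42)

# A caret centers the value; any odd space left over goes on the right.
centered = '{0:^10}'.format('index')

for value in (default_text, default_number, centered):
    print(repr(value))
```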

Notice here that the default behavior of the length specification is to pad the output with spaces to reach the necessary length. That too can be controlled by inserting a different character before the alignment specifier. For example, some plain-text document formats expect headings to be centered within a length of equal signs or hyphens. This is easy to accomplish using string formatting.

>>> def heading(text):
...     return '{0:=^40}'.format(text)
...
>>> heading('Standard Format Specification')
'=====Standard Format Specification======'
>>> heading('This is a longer heading, beyond 40 characters')
'This is a longer heading, beyond 40 characters'

The second call here demonstrates an important property of the length format; if the argument string is longer than the length specified, format() will lengthen the output to match, rather than truncating the text. That creates a bit of a problem with the heading example, though, because if the input was too long, the output doesn't contain any of the padding characters at all. This can be fixed by explicitly adding in one character each at the beginning and end of the string and reducing the placeholder's length by two to compensate.

>>> def heading(text):
...     return '={0:=^38}='.format(text)
...
>>> heading('Standard Format Specification')
'=====Standard Format Specification======'
>>> heading('This is a longer heading, beyond 40 characters')
'=This is a longer heading, beyond 40 characters='

Now the heading will always be at least 40 characters wide but also always have at least one equal sign on each side of the text, even if it runs long. Unfortunately, doing so now requires writing the equal sign three times in the format string, which becomes a bit of a maintenance hassle once we consider that sometimes the padding character will be a hyphen.

Solving one part of this problem is simple: because we're explicitly numbering the placeholders, we can pass in the padding character as an argument and just reference that argument twice in the format string; once at the beginning and once at the end. That alone doesn't really solve the problem, though, because it leaves the core problem untouched: how to replace just part of the argument reference for the text.

To solve that problem, the format specification also allows argument references to be nested. Inside the placeholder for the text portion, we can add another placeholder at the position reserved for the padding character, and Python will evaluate that one first, before trying to evaluate the other. While we're at it, this also allows us to control how many characters the output will fill up.

>>> def heading(text, padding='=', width=40):
...     return '{1}{0:{1}^{2}}{1}'.format(text, padding, width - 2)
...
>>> heading('Standard Format Specification')
'=====Standard Format Specification======'
>>> heading('This is a longer heading, beyond 40 characters')
'=This is a longer heading, beyond 40 characters='
>>> heading('Standard Format Specification', padding='-', width=60)
'---------------Standard Format Specification----------------'
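Nested references aren't limited to numbered arguments. As one possible variation, the sketch below rewrites heading() with named fields; the field name inner is an assumption made here for readability, not something the format syntax requires.

```python
def heading(text, padding='=', width=40):
    # Same layout as the numbered version, but with named fields.
    # 'inner' is a hypothetical name for the width minus the two end characters.
    template = '{padding}{text:{padding}^{inner}}{padding}'
    return template.format(text=text, padding=padding, inner=width - 2)

short = heading('Title', padding='-', width=20)
print(short)
```

Named fields make the template longer, but they spare the reader from matching each number to an argument position.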

Example: Plain Text Table of Contents

Though there are many forms of documentation, plain text is perhaps the most common, as it doesn't require any additional software to view. Navigating large chunks of documentation can be difficult, though, because of the lack of links or page numbers for a table of contents. Line numbers could be used instead of page numbers, but a properly formatted table of contents can still be tedious to maintain.

Consider a typical table of contents, where the title of a section is left-aligned and the page or line number is right-aligned, and the two are joined by a line of periods to help guide the eye from one to the other. Adding or removing lines from such a format is simple, but every time you change the name or location of a section, you not only have to change the relevant information; you also need to update the line of periods in between, which is less than ideal.

String formatting can come in handy here because you can specify both alignment and padding options for multiple values within a string. With this, you can set up a simple script that formats the table of contents for you automatically. The key to doing this, though, is to realize what you're working with.

On the surface, it seems like the goal is just as mentioned: to left-align the section title, right-align the line number and place a line of periods in between. Unfortunately, there's no option to do exactly that, so we'll need to look at it a bit differently. By having each part of the string be responsible for part of the padding, it's fairly easy to achieve the desired effect.

>>> '{0:.<50}'.format('Example')
'Example...........................................'
>>> '{0:.<50}'.format('Longer Example')
'Longer Example....................................'
>>> '{0:.>10}'.format(20)
'........20'
>>> '{0:.>10}'.format(1138)
'......1138'

With these two parts in place, they just need to be combined in order to create a full line in the table of contents. Most plain text documents are limited to 80 characters in a single line, so the title field can be widened beyond 50 characters to give longer titles some breathing room. In addition, 10 digits is a bit much to expect for line numbers even in extremely long documents, so that field can be reduced in order to yield more space for the titles as well.

>>> def contents_line(title, line_number=1):
...     return '{0:.<70}{1:.>5}'.format(title, line_number)
...
>>> contents_line('Installation', 20)
'Installation.............................................................20'
>>> contents_line('Usage', 112)
'Usage...................................................................112'

Calling this function one line at a time isn't a realistic solution in the long run, though, so we'll create a new function that can accept a more useful data structure to work with. It doesn't need to be complicated, so we'll just use a sequence of 2-tuples, each consisting of a section title and its corresponding line number.

>>> contents = (('Installation', 20), ('Usage', 112))
>>> def format_contents(contents):
...     for title, line_number in contents:
...         yield '{0:.<70}{1:.>5}'.format(title, line_number)
...
>>> for line in format_contents(contents):
...     print(line)
...
Installation.............................................................20
Usage...................................................................112
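If the line length needs to vary between documents, the nested references described earlier can supply the field widths as well. This is only a sketch of one way to do it; the width and number_width parameters are hypothetical additions.

```python
def format_contents(contents, width=75, number_width=5):
    # width and number_width are hypothetical parameters, not from the text.
    title_width = width - number_width
    # Arguments 2 and 3 fill in the widths of the title and number fields.
    template = '{0:.<{2}}{1:.>{3}}'
    for title, line_number in contents:
        yield template.format(title, line_number, title_width, number_width)

contents = (('Installation', 20), ('Usage', 112))
lines = list(format_contents(contents))
for line in lines:
    print(line)
```

With the default arguments, each line comes out 75 characters wide, matching the fixed-width version.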

Custom Format Specification

The true strength of the new formatting system, though, is that format() isn't actually in control of the formatting syntax described in the previous section. Like many of the features described in Chapter 4, it instead delegates that control to a method on the objects passed in as arguments.

This method, __format__(), accepts one argument, which is the format specification that was written into the format string where the object is being placed. It doesn't get the entire bracketed expression, though, just the bit after the colon. Every object has this method, as you can see by calling it directly on a string. A bare instance of object has it too, but since Python 3.4 it only accepts an empty specification and raises a TypeError for anything else.

>>> 'example'.__format__('=^40')
'================example================='

Because of this, the standard format specification options described in the previous section aren't the only way to do things. If you have a custom need, you can override that behavior by replacing that method on the class you're working with. You can either extend the existing behavior or write a completely new one.
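The standard library already relies on this hook. As a brief illustration, datetime.time interprets its format specification as a strftime() pattern instead of alignment and padding options; the sample time is arbitrary.

```python
import datetime

moment = datetime.time(8, 10)

# datetime.time defines its own __format__(), which treats the
# specification as a strftime() pattern rather than as alignment options.
# Everything after the first colon, including later colons, is the specification.
clock = '{0:%H:%M}'.format(moment)

# Calling the method directly gives the same result.
direct = moment.__format__('%H:%M')
print(clock, direct)
```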

For example, you could have a class to represent a verb, which can have a present or a past tense. This Verb class could be instantiated with a word to use for each tense, then be used in expressions to form complete sentences.

>>> class Verb:
...     def __init__(self, present, past=None):
...         self.present = present
...         self.past = past
...     def __format__(self, tense):
...         if tense == 'past':
...             return self.past
...         else:
...             return self.present
...
>>> format = Verb('format', past='formatted')
>>> message = 'You can {0:present} strings with {0:past} objects.'
>>> message.format(format)
'You can format strings with formatted objects.'
>>> save = Verb('save', past='saved')
>>> message.format(save)
'You can save strings with saved objects.'

In this example, there's no way for the placeholder string to know how to format a past tense verb, so it delegates that responsibility to the verb passed in. This way, the string can be written once and used many times with different verbs, without skipping a beat.
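A custom specification can also extend the standard behavior instead of replacing it. The sketch below assumes a made-up convention in which the tense may be followed by a colon and a standard specification, which is then handed on to the string's own formatting.

```python
class Verb:
    def __init__(self, present, past):
        self.present = present
        self.past = past

    def __format__(self, spec):
        # Hypothetical convention: '<tense>' or '<tense>:<standard spec>'.
        tense, _, standard_spec = spec.partition(':')
        word = self.past if tense == 'past' else self.present
        # Hand the chosen word to str's own formatting for width and padding.
        return format(word, standard_spec)

save = Verb('save', past='saved')
plain = 'You can {0:present} strings.'.format(save)
padded = '[{0:past:*>10}]'.format(save)
print(plain)
print(padded)
```

Because only the first colon separates the field reference from the specification, the rest of the text, colon included, reaches __format__() intact.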

Taking It With You

Because strings are so common throughout all kinds of programming, you'll find yourself with a wide range of needs. The features shown in this chapter will help you make better use of your strings, but the proper combination of techniques is something that can't be written for you. As you go forward with your code, you'll need to keep an open mind about which techniques to use, so that you can choose what's best for your needs.

So far, these chapters have focused on how to use various aspects of Python to perform complex and useful tasks so that your applications can be that much more powerful. The next chapter will show you how to verify that those tasks are being performed properly.

^[8]http://propython.com/turing_test/

^[9]http://propython.com/standard-encodings/
