The next major type on our built-in object tour is the Python string—an ordered collection of characters used to store and represent text-based information. We looked briefly at strings in Chapter 4. Here, we will revisit them in more depth, filling in some of the details we skipped then.
From a functional perspective, strings can be used to represent just about anything that can be encoded as text: symbols and words (e.g., your name), contents of text files loaded into memory, Internet addresses, Python programs, and so on. They can also be used to hold the absolute binary values of bytes, and multibyte Unicode text used in internationalized programs.
You may have used strings in other languages, too. Python’s strings serve the same role as character arrays in languages such as C, but they are a somewhat higher-level tool than arrays. Unlike in C, in Python, strings come with a powerful set of processing tools. Also unlike languages such as C, Python has no distinct type for individual characters; instead, you just use one-character strings.
Strictly speaking, Python strings are categorized as immutable sequences, meaning that the characters they contain have a left-to-right positional order and that they cannot be changed in-place. In fact, strings are the first representative of the larger class of objects called sequences that we will study here. Pay special attention to the sequence operations introduced in this chapter, because they will work the same on other sequence types we’ll explore later, such as lists and tuples.
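As a minimal sketch of what “immutable sequence” means in practice, the following script indexes a string by position and then shows that assignment to an offset fails. The sample value and variable names are made up for illustration; they are not taken from the chapter’s later examples.

```python
# Strings are ordered, so characters can be fetched by position...
text = "spam"              # hypothetical sample value
first = text[0]            # offsets count from the left, starting at 0
last = text[-1]            # negative offsets count back from the right

# ...but they cannot be changed in-place.
try:
    text[0] = "S"          # item assignment is not supported for strings
    changed_in_place = True
except TypeError:
    changed_in_place = False

# To "change" a string, build a new one from pieces of the old one.
new_text = "S" + text[1:]

# The same indexing expression works on other sequence types, such as lists.
items = ["s", "p", "a", "m"]
same_first = items[0] == text[0]

print(first, last, changed_in_place, new_text, same_first)
```

The original object is untouched by this code; `new_text` is a separate string object built from it.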
Table 7-1 previews common string literals and operations we will discuss in this chapter. Empty strings are written as a pair of quotation marks (single or double) with nothing in between, and there are a variety of ways to code strings. For processing, strings support expression operations such as concatenation (combining strings), slicing (extracting sections), indexing (fetching by offset), and so on. Besides expressions, Python also provides a set of string methods that implement common string-specific tasks, as well as modules for more advanced text-processing tasks such as pattern matching. We’ll explore all of these later in the chapter.
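Operation | Interpretation |
S = '' | Empty string |
S = "spam's" | Double quotes, same as single |
S = 's\np\ta\x00m' | Escape sequences |
S = """...""" | Triple-quoted block strings |
S = r'\temp\spam' | Raw strings |
S = b'spam' | Byte strings in 3.0 (Chapter 36) |
S = u'spam' | Unicode strings in 2.6 only (Chapter 36) |
S1 + S2, S * 3 | Concatenate, repeat |
S[i], S[i:j], len(S) | Index, slice, length |
"a %s parrot" % kind | String formatting expression |
"a {0} parrot".format(kind) | String formatting method in 2.6 and 3.0 |
S.find('pa'), S.rstrip(), S.replace('pa', 'xx'), S.split(','), S.isdigit(), S.lower(), S.endswith('spam'), 'spam'.join(strlist), S.encode('latin-1') | String method calls: search, remove whitespace, replacement, split on delimiter, content test, case conversion, end test, delimiter join, Unicode encoding, etc. |
for x in S: print(x), 'spam' in S, [c * 2 for c in S], map(ord, S) | Iteration, membership |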
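The operations previewed in Table 7-1 can be tried together in one short script. This is only a sketch: the sample values (`S`, `kind`, and the comma-separated string) are placeholders chosen for illustration, and each line is covered properly later in the chapter.

```python
S = "spam"                                # sample value, chosen for illustration
kind = "dead"                             # hypothetical substitution value

joined = S + "!"                          # concatenation
repeated = S * 2                          # repetition
indexed = S[0]                            # indexing: fetch by offset
sliced = S[1:3]                           # slicing: extract a section
length = len(S)                           # length

by_expr = "a %s parrot" % kind            # formatting expression
by_method = "a {0} parrot".format(kind)   # formatting method

position = S.find("pa")                   # search: offset of match, or -1
upper = S.upper()                         # case conversion
parts = "a,b,c".split(",")                # split on a delimiter
rejoined = "-".join(parts)                # join with a delimiter

has_pam = "pam" in S                      # membership test
doubled = [c * 2 for c in S]              # iteration over characters

print(joined, repeated, indexed, sliced, length)
print(by_expr, position, upper, parts, rejoined, has_pam, doubled)
```

None of these calls changes `S` itself; each produces a new object.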
Beyond the core set of string tools in Table 7-1, Python also supports more advanced pattern-based string processing with the standard library’s re (regular expression) module, introduced in Chapter 4, and even higher-level text processing tools such as XML parsers, discussed briefly in Chapter 36. This book’s scope, though, is focused on the fundamentals represented by Table 7-1.
To cover the basics, this chapter begins with an overview of string literal forms and string expressions, then moves on to look at more advanced tools such as string methods and formatting. Python comes with many string tools, and we won’t look at them all here; the complete story is chronicled in the Python library manual. Our goal here is to explore enough commonly used tools to give you a representative sample; methods we won’t see in action here, for example, are largely analogous to those we will.
Content note: Technically speaking, this chapter tells only part of the string story in Python: the part most programmers need to know. It presents the basic str string type, which handles ASCII text and works the same regardless of which version of Python you use. That is, this chapter intentionally limits its scope to the string processing essentials that are used in most Python scripts.
From a more formal perspective, ASCII is a simple form of Unicode text. Python addresses the distinction between text and binary data by including distinct object types:
In Python 3.0 there are three string types: str is used for Unicode text (ASCII or otherwise), bytes is used for binary data (including encoded text), and bytearray is a mutable variant of bytes.
In Python 2.6, unicode strings represent wide Unicode text, and str strings handle both 8-bit text and binary data.
The bytearray type is also available as a back-port in 2.6, but not earlier, and it’s not as closely bound to binary data as it is in 3.0. Because most programmers don’t need to dig into the details of Unicode encodings or binary data formats, though, I’ve moved all such details to the Advanced Topics part of this book, in Chapter 36.
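For a rough feel of how the three types relate, here is a small sketch that assumes a Python 3.x interpreter (in 2.6 the same literals behave differently, as described above). The sample text and the choice of UTF-8 as the encoding are assumptions made for illustration only.

```python
text = "spam"                     # str: Unicode text
data = text.encode("utf-8")       # bytes: the encoded, binary form of the text
buf = bytearray(data)             # bytearray: a mutable copy of those bytes

buf[0] = ord("S")                 # in-place change is allowed on a bytearray
decoded = buf.decode("utf-8")     # decoding gives back a str

# For non-ASCII text, character count and byte count can differ.
accented = "\u00e9"               # one character (e with acute accent)
char_count = len(accented)
byte_count = len(accented.encode("utf-8"))

print(type(text).__name__, type(data).__name__, type(buf).__name__)
print(data, decoded, char_count, byte_count)
```

For plain ASCII text such as `"spam"`, the character and byte counts are the same, which is why this chapter can set the distinction aside.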
If you do need to deal with more advanced string concepts such as alternative character sets or packed binary data and files, see Chapter 36 after reading the material here. For now, we’ll focus on the basic string type and its operations. As you’ll find, the basics we’ll study here also apply directly to the more advanced string types in Python’s toolset.
By and large, strings are fairly easy to use in Python. Perhaps the most complicated thing about them is that there are so many ways to write them in your code:
Single quotes: 'spa"m'
Double quotes: "spa'm"
Triple quotes: '''... spam ...''', """... spam ..."""
Escape sequences: "s\tp\na\0m"
Raw strings: r"C:\new\test.spm"
Byte strings in 3.0 (see Chapter 36): b'sp\x01am'
Unicode strings in 2.6 only (see Chapter 36): u'eggs\u0020spam'
The single- and double-quoted forms are by far the most common; the others serve specialized roles, and we’re postponing discussion of the last two advanced forms until Chapter 36. Let’s take a quick look at all the other options in turn.
Around Python strings, single and double quote characters are interchangeable. That is, string literals can be written enclosed in either two single or two double quotes—the two forms work the same and return the same type of object. For example, the following two strings are identical, once coded:
>>> 'shrubbery', "shrubbery"
('shrubbery', 'shrubbery')
The reason for supporting both is that it allows you to embed a quote character of the other variety inside a string without escaping it with a backslash. You may embed a single quote character in a string enclosed in double quote characters, and vice versa:
>>> 'knight"s', "knight's"
('knight"s', "knight's")
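The interchangeability of the two quote styles can also be checked in a script rather than at the interactive prompt. This is a small sketch; the variable names are invented for illustration.

```python
single = 'shrubbery'
double = "shrubbery"

same_value = single == double                 # both forms build equal strings
same_type = type(single) is type(double)      # and the same type of object

# Either quote character can be embedded inside the other kind of quotes.
inner_double = 'knight"s'
inner_single = "knight's"

# repr() gives the form the interactive echo would display.
shown_double = repr(inner_double)
shown_single = repr(inner_single)

print(same_value, same_type)
print(shown_double, shown_single)
```

The quote characters that delimit a literal are not part of the resulting string; only the embedded quote is stored.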
Incidentally, Python automatically concatenates adjacent string literals in any expression, although it is almost as simple to add a + operator between them to invoke concatenation explicitly (as we’ll see in Chapter 12, wrapping this form in parentheses also allows it to span multiple lines):
>>> title = "Meaning " 'of' " Life"        # Implicit concatenation
>>> title
'Meaning of Life'
Notice that adding commas between these strings would result in a tuple, not a string. Also notice in all of these outputs that Python prefers to print strings in single quotes, unless they embed one. You can also embed quotes by escaping them with backslashes:
>>> 'knight\'s', "knight\"s"
("knight's", 'knight"s')
To understand why, you need to know how escapes work in general.
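The points made in this section about adjacent literals, commas, and escaped quotes can be compared side by side in a short sketch. The variable names are made up for illustration.

```python
# Adjacent literals are concatenated automatically...
title = "Meaning " 'of' " Life"

# ...which gives the same result as using + explicitly.
explicit = "Meaning " + 'of' + " Life"

# Wrapping adjacent literals in parentheses lets them span lines.
spanning = ("Meaning "
            "of"
            " Life")

# Commas between the literals build a tuple instead of one string.
as_tuple = "Meaning ", 'of', " Life"

# A backslash lets a quote appear inside quotes of the same kind.
escaped = 'knight\'s', "knight\"s"

print(title, type(as_tuple).__name__, escaped)
```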
The last example embedded a quote inside a string by preceding it with a backslash. This is representative of a general pattern in strings: backslashes are used to introduce special byte codings known as escape sequences.
Escape sequences let us embed byte codes in strings that cannot easily be typed on a keyboard. The character \, and one or more characters following it in the string literal, are replaced with a single character in the resulting string object, which has the binary value specified by the escape sequence. For example, here is a five-character string that embeds a newline and a tab:
>>> s = 'a\nb\tc'
The two characters \n stand for a single character: the byte containing the binary value of the newline character in your character set (usually, ASCII code 10). Similarly, the sequence \t is replaced with the tab character. The way this string looks when printed depends on how you print it. The interactive echo shows the special characters as escapes, but print interprets them instead:
>>> s
'a\nb\tc'
>>> print(s)
a
b       c
To be completely sure how many bytes are in this string, use the built-in len function, which returns the actual number of bytes in a string, regardless of how it is displayed:
>>> len(s)
5
This string is five bytes long: it contains an ASCII a byte, a newline byte, an ASCII b byte, and so on. Note that the original backslash characters are not really stored with the string in memory; they are used to tell Python to store special byte values in the string. For coding such special bytes, Python recognizes a full set of escape code sequences, listed in Table 7-2.
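To confirm that the backslashes are not stored, the following sketch inspects the same five-character string by its character codes and compares it with equivalent spellings that use hex and octal escapes. The raw-string comparison at the end is an added illustration, not part of the example above.

```python
s = 'a\nb\tc'                     # same literal as in the text above

echoed = repr(s)                  # what the interactive echo displays
length = len(s)                   # each escape counts as one character
codes = [ord(ch) for ch in s]     # underlying character codes

# Hex and octal escapes spell the same characters by value.
same_by_hex = '\x61\x0a\x62\x09\x63' == s
same_by_octal = 'a\012b\011c' == s

# In a raw string, backslashes are kept as ordinary characters.
raw = r'a\nb\tc'
raw_length = len(raw)

print(echoed)
print(length, codes, raw_length)
```

The codes 10 and 9 are the newline and tab characters; no backslash (code 92) appears in `s`, though two appear in `raw`.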
Escape | Meaning |
\newline | Ignored (continuation line) |
\\ | Backslash (stores one \) |
\' | Single quote (stores ') |
\" | Double quote (stores ") |
\a | Bell |
\b | Backspace |
\f | Formfeed |
\n | Newline (linefeed) |
\r | Carriage return |
\t | Horizontal tab |
\v | Vertical tab |
\xhh | Character with hex value hh |
\ooo | Character with octal value ooo |