6.7. Special Features of Strings

6.7.1. Special or Control Characters

As in most other high-level or scripting languages, a backslash paired with another single character indicates a “special” character, usually a non-printable one, and the pair is replaced by that special character. These are the special characters discussed above that are not interpreted when the raw string operator precedes a string containing them.

In addition to well-known characters such as NEWLINE ( \n ) and (horizontal) TAB ( \t ), specific characters may be given by their ASCII values: \OOO or \xXX, where OOO and XX are the respective octal and hexadecimal ASCII values. Here are the base 10, 8, and 16 representations of 0, 65, and 255:

             ASCII   ASCII   ASCII
decimal      0       65      255
octal        \000    \101    \377
hexadecimal  \x00    \x41    \xFF

Special characters, including the backslash-escaped ones, can be stored in Python strings just like regular characters.

Another way that strings in Python differ from those in C is that Python strings are not terminated by the NUL ( \000 ) character (ASCII value 0). NUL characters are just like any of the other special backslash-escaped characters. Not only can NUL characters appear in Python strings, but a string can contain any number of them, anywhere within it. They are no more special than any of the other control characters. Table 6.7 summarizes the escape characters supported by most versions of Python.

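Table 6.7. String Literal Backslash Escape Characters
\X   Oct  Dec  Hex   Char  Description
\0   000  0    0x00  NUL   Null character
\a   007  7    0x07  BEL   Bell
\b   010  8    0x08  BS    Backspace
\t   011  9    0x09  HT    Horizontal Tab
\n   012  10   0x0A  LF    Linefeed/Newline
\v   013  11   0x0B  VT    Vertical Tab
\f   014  12   0x0C  FF    Form Feed
\r   015  13   0x0D  CR    Carriage Return
\e   033  27   0x1B  ESC   Escape
\"   042  34   0x22  "     Double quote
\'   047  39   0x27  '     Single quote/apostrophe
\\   134  92   0x5C  \     Backslash

Python itself does not define \e as an escape sequence; to write the ESC character in a string literal, use \033 or \x1b.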

And as mentioned before, explicit octal or hexadecimal values can be given, and a NEWLINE can be escaped to continue a statement on the next line. All valid character values lie between 0 and 255 (octal 0377, hexadecimal 0xFF); 7-bit ASCII proper ends at 127 (octal 0177).

\OOO   octal value OOO (range is 0000 to 0377)
\xXX   'x' plus hexadecimal value XX (range is 0x00 to 0xFF)
\      escape NEWLINE for statement continuation
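
The escape forms above can be checked with a short sketch. It is written for a current Python 3 interpreter (an assumption on our part, so print is called as a function); the variable names are ours and purely illustrative.

```python
# Three spellings of the same one-character string: literal, octal, hexadecimal.
letter_a = "A"
octal_a = "\101"      # octal 101 is decimal 65
hex_a = "\x41"        # hexadecimal 41 is decimal 65
same = letter_a == octal_a == hex_a == chr(65)

# NUL does not end a Python string; it is counted like any other character.
with_nuls = "ab\0cd\0\0ef"
nul_count = with_nuls.count("\0")

# A backslash at the very end of a line continues the statement.
total = 1 + \
        2

print(same)             # True
print(len(with_nuls))   # 9
print(nul_count)        # 3
print(total)            # 3
```

Note that the length of with_nuls includes all three NUL characters, which would not be the case for a NUL-terminated C string.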

One use of control characters in strings is to serve as delimiters. In database or Internet/Web processing, it is more than likely that most printable characters are allowed as data items, meaning that they would not make good delimiters.

It becomes difficult to tell whether a character is a delimiter or a data item, and by using a printable character such as a colon (:) as a delimiter, you limit the set of characters allowed in your data, which may not be desirable.

One popular solution is to employ seldom-used, non-printable ASCII values as delimiters. These make ideal delimiters, freeing up the colon and the other printable characters for more important uses.
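
Here is a minimal sketch of that idea, assuming a current Python 3 interpreter. The record layout and the choice of the ASCII unit separator (octal 037) as the delimiter are our own illustrative assumptions, not a format prescribed by Python.

```python
# Hypothetical record format: fields are separated by the ASCII unit
# separator (octal 037), a control character unlikely to occur in text.
DELIM = "\037"

fields = ["10:30", "http://example.com/a:b", "note: colons are data here"]
record = DELIM.join(fields)

# Splitting on the control character recovers every field intact.
recovered = record.split(DELIM)
print(recovered == fields)       # True

# Splitting on a colon instead cuts through the data itself.
colon_pieces = record.split(":")
print(len(colon_pieces))         # 5
```

The three fields contain four colons between them, so a colon delimiter yields five fragments rather than the three original items.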

6.7.2. Triple Quotes

Although strings can be delimited by single or double quotes, it is often difficult to manipulate strings containing special or non-printable characters, especially the NEWLINE character. Python's triple quotes come to the rescue by allowing strings to span multiple lines, including verbatim NEWLINEs, TABs, and any other special characters.

The syntax for triple quotes consists of three consecutive single or double quotes (used in pairs, naturally):

>>> para_str = """this is a long string that is made up of
... several lines and non-printable characters such as
... TAB ( \t ) and they will show up that way when displayed.
... NEWLINEs within the string, whether explicitly given like
... this within the brackets [ \n ], or just a NEWLINE within
... the variable assignment will also show up.
... """

Triple quotes let the developer avoid playing quote and escape character games, all the while bringing at least a small chunk of text closer to WYSIWYG (what you see is what you get) format.

The example below shows what happens when we use the print statement to display the contents of this string. Note how every single special character has been converted to its printed form, right down to the last NEWLINE at the end of the string between the “up.” and the closing triple quotes. Also note that NEWLINEs occur either where a line was ended by pressing RETURN or where the escape code ( \n ) appears:

>>> print para_str
this is a long string that is made up of
several lines and non-printable characters such as
TAB (    ) and they will show up that way when displayed.
NEWLINEs within the string, whether explicitly given like
this within the brackets [
 ], or just a NEWLINE within
the variable assignment will also show up.

We introduced the len() built-in sequence type function earlier, which, for strings, gives us the total number of characters in a string.

>>> len(para_str)
307

Upon applying that function to our string, we get a result of 307, which includes the NEWLINE and TAB characters. Another way to look at the string within the interactive interpreter is by just giving the interpreter the name of the object in question. Here, we will see the “internal” representation of the string, without the special characters being converted to printable ones. If that last NEWLINE we looked at above (after the final word “up” and before the closing triple quotes) is still elusive to you, take a look at the way the string is represented internally below. You will observe that the last character of the string is the aforementioned NEWLINE.

>>> para_str
'this is a long string that is made up of\012several lines
and non-printable characters such as\012TAB ( \011 ) and
they will show up that way when displayed.\012NEWLINEs
within the string, whether explicitly given like\012this
within the brackets [ \012 ], or just a NEWLINE
within\012the variable assignment will also show up.\012'
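
The relationship between a triple-quoted literal, its escaped one-line spelling, and its internal representation can be sketched as follows. This assumes a current Python 3 interpreter, which writes these characters as \n and \t in the representation; the two variable names are ours.

```python
# A triple-quoted literal keeps the line breaks typed inside it.
triple = """first line
second\tline
"""

# The same value spelled with explicit escapes in a one-line literal.
escaped = "first line\nsecond\tline\n"

print(triple == escaped)   # True
print(len(triple))         # 23, counting the TAB and both NEWLINEs
print(repr(triple))        # shows the escapes instead of acting on them
```

Calling repr() here gives the same view of the string that the interactive interpreter gives when handed just the object's name.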

6.7.3. String Immutability

In Section 4.7.2, we discussed how strings are immutable data types, meaning that their values cannot be changed or modified. This means that if you do want to update a string, either by taking a substring, concatenating another string on the end, or concatenating the string in question to the end of another string, etc., a new string object must be created for it.

This sounds more complicated than it really is. Since Python manages memory for you, you won't really notice when this occurs. Any time you modify a string or perform any operation that is contrary to immutability, Python will allocate a new string for you. In the following example, Python allocates space for the strings, 'abc' and 'def'. But when performing the addition operation to create the string 'abcdef', new space is allocated automatically for the new string.

>>> 'abc' + 'def'
'abcdef'

Assigning values to variables is no different:

>>> string = 'abc'
>>> string = string + 'def'
>>> string
'abcdef'

In the above example, it looks like we assigned the string 'abc' to string, then appended the string 'def' to string. To the naked eye, strings look mutable. What you cannot see, however, is that a new string was created when the operation "string + 'def'" was performed, and that the new object was then assigned back to string. The old string 'abc' was deallocated because nothing referred to it any longer.

Once again, we can use the id() built-in function to help show us exactly what happened. If you recall, id() returns the “identity” of an object. This value is as close to a “memory address” as we can get in Python.

>>> string = 'abc'
>>>
>>> id(string)
135060856
>>>
>>> string = string + 'def'
>>> id(string)
135057968

Note how the identities are different for the string before and after the update. Another test of mutability is to try to modify individual characters or substrings of a string. We will now show how any update of a single character or a slice is not allowed:

>>> string
'abcdef'
>>>
>>> string[2] = 'C'
Traceback (innermost last):
  File "<stdin>", line 1, in ?
AttributeError: __setitem__
>>>
>>> string[3:6] = 'DEF'
Traceback (innermost last):
  File "<stdin>", line 1, in ?
AttributeError: __setslice__

Both operations result in an error. In order to perform the actions that we want, we will have to create new strings using substrings of the existing string, then assign those new strings back to string:

>>> string
'abcdef'
>>>
>>> string = string[0:2] + 'C' + string[3:]
>>> string
'abCdef'
>>>
>>> string[0:3] + 'DEF'
'abCDEF'
>>>
>>> string = string[0:3] + 'DEF'
>>> string
'abCDEF'

So for immutable objects like strings, we observe that the only valid expression on the left-hand side of an assignment (to the left of the equals sign [ = ]) is the variable representing an entire object such as a string, not single characters or substrings. There is no such restriction for the expression on the right-hand side.
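
The slice-and-rebuild approach can be wrapped in a small function. The sketch below assumes a current Python 3 interpreter, which reports the failed assignment as a TypeError rather than an AttributeError; replace_at is a hypothetical helper of ours, not a built-in.

```python
def replace_at(text, index, new_char):
    """Return a copy of text with the character at index replaced.

    Hypothetical helper: it builds a new string from two slices of the
    old one, because the old string itself cannot be changed.
    """
    return text[:index] + new_char + text[index + 1:]


original = "abcdef"
updated = replace_at(original, 2, "C")

print(updated)     # abCdef
print(original)    # abcdef (the original object is untouched)

# Direct item assignment is still rejected.
try:
    original[2] = "C"
except TypeError as err:
    print("rejected:", err)
```

Because the function returns a new object, the caller decides whether to rebind the old name to it, exactly as in the session above.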

6.7.4. Unicode Support

Unicode string support, introduced to Python in version 1.6, is used to convert between multiple double-byte character formats and encodings, and includes as much functionality as possible to manage these strings. With the addition of string methods (see Section 6.6), Python strings are fully featured and can handle a much wider variety of applications requiring Unicode string storage, access, and manipulation. At the time of this writing, the exact Python specifications have not been finalized. We will do our best here to give an overview of native Unicode 3.0 support in Python:

unicode() Built-in Function

The unicode() built-in function should operate in a manner similar to that of the Unicode string operator (u/U). It takes a string and returns a Unicode string.

encode() Built-in Methods

The encode() built-in methods take a string and return an equivalent encoded string. encode() exists as methods for both regular and Unicode strings in 2.0, but only for Unicode strings in 1.6.

Unicode Type

There is a new Unicode type named unicode that is returned when a Unicode string is sent as an argument to type(), i.e., type(u'').

Unicode Ordinals

The standard ord() built-in function should work the same way. It was enhanced recently to support Unicode objects. The new unichr() built-in function returns a Unicode object for a character (provided it is a 32-bit value); otherwise, a ValueError exception is raised.

Coercion

Mixed-mode string operations require standard strings be converted to Unicode objects.

Exceptions

UnicodeError is defined in the exceptions module as a subclass of ValueError. All exceptions related to Unicode encoding/decoding should be subclasses of UnicodeError. Also see the string encode() method.

Table 6.8. Unicode Codecs/Encodings
codec               Description
utf-8               8-bit variable length encoding (default encoding)
utf-16              16-bit variable length encoding (little/big endian)
utf-16-le           utf-16 but explicitly little endian
utf-16-be           utf-16 but explicitly big endian
ascii               7-bit ASCII codepage
iso-8859-1          ISO 8859-1 (Latin 1) codepage
unicode-escape      (see Python Unicode Constructors for a definition)
raw-unicode-escape  (see Python Unicode Constructors for a definition)
native              dump of the internal format used by Python
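
To make a few of these codecs concrete, here is a hedged sketch. It assumes a current Python 3 interpreter, where every string is already a Unicode string and encode() returns a bytes object, so it does not reproduce the 1.6/2.0 unicode() function described above; the variable names are ours.

```python
text = "caf\u00e9"   # four characters; the last one is U+00E9

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
latin1 = text.encode("iso-8859-1")

# Same text, different numbers of bytes depending on the codec.
print(len(text), len(utf8), len(utf16_le), len(latin1))   # 4 5 8 4

# Decoding with the matching codec restores the original string.
print(utf8.decode("utf-8") == text)      # True

# ASCII cannot represent U+00E9, so encoding raises a UnicodeError subclass.
try:
    text.encode("ascii")
except UnicodeError as err:
    print(type(err).__name__)            # UnicodeEncodeError
```

The final step also illustrates the exception hierarchy mentioned earlier: the error caught is a subclass of UnicodeError, which in turn is a subclass of ValueError.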

RE Engine Unicode-aware

The new regular expression engine should be Unicode aware. See the re Code Module sidebar in the next section (6.8).

String Format Operator

For Python format strings: '%s' does str(u) for Unicode objects embedded in Python strings, so the output will be u.encode(<default encoding>). If the format string is a Unicode object, all parameters are coerced to Unicode first and then put together and formatted according to the format string. Numbers are first converted to strings and then to Unicode. Python strings are interpreted as Unicode strings using the <default encoding>. Unicode objects are taken as is. All other string formatters should work accordingly. Here is an example:

u"%s %s" % (u"abc", "abc") ⇒ u"abc abc"

Specific information regarding Python's support of Unicode strings can be found in the file Misc/unicode.txt of the distribution. The latest version of this document is always available online at:

http://starship.python.net/~lemburg/unicode-proposal.txt

For more help and information on Python's Unicode strings, see the Python Unicode Tutorial at:

http://www.reportlab.com/il8n/python_unicode_tutorial.html

6.7.5. No Characters or Arrays in Python

We mentioned in the previous section that Python does not support a character type. We can also say that C does not support string types explicitly. Instead, strings in C are merely arrays of individual characters. Our third fact is that Python does not have an “array” type as a primitive (although the array module exists if you really have to have one). Implementing strings as character arrays is also deemed unnecessary due to the sequential access ability of strings.

In choosing between single characters and strings, Python wisely uses strings as types. It is much easier manipulating the larger entity as a “blob” since most applications operate on strings as a whole rather than individual characters. Applications will convert strings to integers, ask users to input strings, perform regular expression matches on substrings, search files for specific strings, and will even sort a set of strings like names, etc. How often are individual characters operated on, except for searches (i.e., search-and-replace, search-for-delimiter, etc.)? Probably not often as far as most applications are concerned.

However, such functionality should still be available to the Python programmer. Search-and-replacing can be done with regular expressions and the re module, searching for and breaking up strings based on delimiters can be accomplished with split(), searching for substrings can be accomplished using find() and rfind(), and just plain old character membership in a string can be verified with the in and not in sequence operators.
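
Each of the operations just listed is shown once in the sketch below, which assumes a current Python 3 interpreter. The sample line and its field layout are made up for illustration.

```python
import re

# Hypothetical sample data: fields separated by semicolons.
line = "name:Guido;lang:Python;lang:C"

has_python = "Python" in line         # character/substring membership
lacks_perl = "Perl" not in line
first = line.find("lang")             # index of the first occurrence
last = line.rfind("lang")             # index of the last occurrence
pieces = line.split(";")              # break up on a delimiter
renamed = re.sub(r"lang", "language", line)   # search-and-replace

print(has_python, lacks_perl)   # True True
print(first, last)              # 11 23
print(pieces)                   # ['name:Guido', 'lang:Python', 'lang:C']
print(renamed)                  # name:Guido;language:Python;language:C
```

For a fixed pattern like this one, the string replace() method would do the same job as re.sub(); the re module earns its keep once the pattern is a real regular expression.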

We are going to quickly revisit the chr() and ord() built-in functions that convert between ASCII integer values and their equivalent characters, and describe one of the “features” of C that has been lost to Python because characters are not integer types in Python as they are in C.

One feature of C that is lost is the ability to perform numerical calculations directly on characters, e.g., 'A' + 3. This is allowed in C because both 'A' as a char and 3 as an int are integers (1 byte and 2 or 4 bytes, respectively), but it would be a type mismatch in Python because 'A' is a string, 3 is a plain integer, and no such addition ( + ) operation exists between numeric and string types.

>>> 'B'
'B'
>>> 'B' + 1
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation
>>>
>>> chr('B')
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation
>>>
>>> ord('B')
66
>>> ord('B') + 1
67
>>> chr(67)
'C'
>>> chr(ord('B') + 1)
'C'

Our failure scenario occurred when we attempted to increase the ASCII value of 'B' by 1 to get 'C' by addition. Rather than 1-byte integer arithmetic, our solution in Python involves using the chr() and ord() built-in functions.
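
The chr()/ord() round trip can be packaged as a small function. This is a sketch assuming a current Python 3 interpreter; next_char is a hypothetical helper of ours, not part of the language.

```python
def next_char(ch, step=1):
    """Return the character whose ordinal is step higher than that of ch.

    Hypothetical helper: converts to an integer with ord(), does the
    arithmetic there, and converts back with chr().
    """
    return chr(ord(ch) + step)


print(next_char("B"))       # C
print(next_char("A", 3))    # D

# Adding an integer directly to a string is still a type mismatch.
try:
    "B" + 1
except TypeError as err:
    print("TypeError:", err)
```

The arithmetic happens entirely on integers; the string type is only involved at the two conversion points.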
