Backslash in string literals

Regular expressions aren't part of the core Python language, so there is no special syntax for them and they are handled like any other string. As we saw in Chapter 1, Introducing Regular Expressions, the backslash character is used to indicate metacharacters or special forms in regular expressions. The backslash is also used in strings to escape special characters; in other words, it has a special meaning in Python. So, if we need to use the \ character, we'll have to escape it: \\. This gives the backslash its literal meaning in the string. However, in order to match a backslash inside a regular expression, we must escape it again at the regex level, effectively writing four backslashes: \\\\.

Just as an example, let's write a regular expression to match \:

>>> import re
>>> pattern = re.compile("\\\\")
>>> pattern.match("\\author")
<_sre.SRE_Match at 0x104a88e68>

As you can see, this is tedious and difficult to understand when the pattern is long.
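The two layers of escaping can be checked directly. The following is a minimal sketch, not taken from the book: the variable names pattern_text and subject are illustrative, and it is written to run on Python 3 (it should also run on Python 2.7).

```python
import re

# Python parses the literal first: each \\ pair becomes one backslash,
# so the regex engine receives a two-character string.
pattern_text = "\\\\"
print(len(pattern_text))       # 2

# The regex engine then reads those two characters as one escaped backslash.
pattern = re.compile(pattern_text)

# The subject is 7 characters long: one backslash followed by "author".
subject = "\\author"
match = pattern.match(subject)

print(match is not None)       # True
print(match.group() == "\\")   # True: exactly one backslash was matched
```

Four characters in the source code therefore shrink to two in the pattern and to one in the matched text.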

Python provides the raw string notation r, with which backslashes are treated as normal characters. So, r"\b" is not the backspace anymore; it's just the \ character followed by the character b, and the same goes for r"\n".
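A short sketch of the difference, under the assumption that comparing lengths is the clearest way to see it. The names escaped and raw and the sample text "a cat" are illustrative only.

```python
import re

escaped = "\b"   # one character: backspace (code point 8)
raw = r"\b"      # two characters: a backslash and the letter b

print(len(escaped))      # 1
print(ord(escaped))      # 8
print(len(raw))          # 2
print(raw == "\\b")      # True: same string as the doubly escaped literal
print(r"\n" == "\\n")    # True: a backslash and the letter n, not a newline

# In a pattern, the two spellings mean different things:
# r"\b" reaches the regex engine as the word-boundary form,
# while "\b" reaches it as a literal backspace character.
print(re.search(r"\bcat", "a cat") is not None)  # True
print(re.search("\bcat", "a cat") is None)       # True
```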

Python 2.x and Python 3.x treat strings differently. In Python 2, there are two types of strings, 8-bit strings and Unicode strings, while in Python 3 we have text and binary data. Text is always Unicode, and encoded Unicode is represented as binary data (http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit).
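The relationship between text and its encoded form can be sketched as follows. This is an illustration rather than the book's own example: the names text and data are assumptions, and the u and b prefixes are used so that the snippet should behave the same on Python 2.7 and on Python 3.3 or later.

```python
# Unicode text: \xf1 is the code point of the letter ñ.
text = u"Espa\xf1a"

# Encoding turns the text into binary data.
data = text.encode("utf-8")

print(len(text))                      # 6 characters
print(len(data))                      # 7 bytes: ñ takes two bytes in UTF-8
print(data == b"Espa\xc3\xb1a")       # True
print(data.decode("utf-8") == text)   # True: decoding restores the text
```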

Strings have a special notation, a prefix on the literal, to indicate what type we're using.

String types in Python 2.x

Type: String
Prefix: (none)
Description: String literals. They're encoded automatically using the default encoding (UTF-8 in our case). The backslash is necessary to escape meaningful characters.
Example:
>>> "España \n"
'Espa\xc3\xb1a \n'

Type: Raw string
Prefix: r or R
Description: They're equal to literal strings with the exception of the backslashes, which are treated as normal characters.
Example:
>>> r"España \n"
'Espa\xc3\xb1a \\n'

Type: Unicode string
Prefix: u or U
Description: These strings use the Unicode character set (ISO 10646).
Example:
>>> u"España \n"
u'Espa\xf1a \n'

Type: Unicode raw string
Prefix: ur or UR
Description: They're Unicode strings, but they treat backslashes the way raw strings do.
Example:
>>> ur"España \n"
u'Espa\xf1a \\n'

See the What's new in Python 3 section to find out what the notation looks like in Python 3.

Using raw strings is the option recommended by the official Python documentation, and that's what we will be using with Python 2.7 throughout the book. With this in mind, we can rewrite the regex as follows:

>>> pattern = re.compile(r"\\")
>>> pattern.match(r"\author")
<_sre.SRE_Match at 0x104a88f38>
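As a quick check that the raw-string version is the same pattern as the four-backslash version, here is a minimal sketch. It is not from the book and is written for Python 3, though it should also run on Python 2.7.

```python
import re

# Both literals produce the same two-character pattern text.
print(r"\\" == "\\\\")                         # True

pattern = re.compile(r"\\")

# In a raw string, \a is a backslash followed by the letter a.
print(pattern.match(r"\author") is not None)   # True
print(pattern.match("author") is None)         # True: no leading backslash
```

Note that a raw string literal cannot end in a single backslash: r"\" is a syntax error, so a pattern that matches one backslash still needs two of them, as in r"\\".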