Regular expressions aren't part of the core Python language. There is no special syntax for them, so they are handled like any other string. As we saw in Chapter 1, Introducing Regular Expressions, the backslash character \ is used to indicate metacharacters or special forms in regular expressions. The backslash is also used in Python strings to escape special characters; in other words, it has a special meaning in Python too. So, if we need a literal \ character in a string, we have to escape it: \\. This gives the backslash its literal meaning in the string. However, to match a backslash inside a regular expression, the backslash must be escaped for the regex engine as well, so we effectively write four backslashes: \\\\.
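The two levels of escaping can be checked by counting characters. The following is a minimal sketch; the variable names are illustrative and do not come from the book.

```python
import re

# What Python stores after processing each string literal.
one_backslash = "\\"       # two backslashes in the source -> one in the string
two_backslashes = "\\\\"   # four backslashes in the source -> two in the string

print(len(one_backslash))    # 1
print(len(two_backslashes))  # 2

# The regex engine receives the two-character string and reads it as
# "match one literal backslash".
pattern = re.compile(two_backslashes)
subject = "\\author"         # the 7-character text: backslash + "author"
match = pattern.match(subject)
print(match.group(0) == one_backslash)  # True
```

The string literal layer halves the backslashes first, and the regex engine then interprets what is left, which is why four are needed in the source.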
As an example, let's write a regular expression to match \:
>>> import re
>>> pattern = re.compile("\\\\")
>>> pattern.match("\\author")
<_sre.SRE_Match at 0x104a88e68>
As you can see, this is tedious and difficult to understand when the pattern is long.
Python provides the raw string notation r, with which backslashes are treated as normal characters. So, r"\b" is not the backspace anymore; it's just the character \ and the character b, and the same goes for r"\n".
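The difference between a regular and a raw literal can be seen by comparing lengths. This is a small sketch with illustrative variable names:

```python
backspace = "\b"   # escape sequence: a single backspace character (0x08)
raw_b = r"\b"      # raw string: a backslash followed by the letter b

print(len(backspace))   # 1
print(len(raw_b))       # 2
print(raw_b == "\\b")   # True

newline = "\n"     # escape sequence: a single newline character
raw_n = r"\n"      # raw string: a backslash followed by the letter n

print(len(newline))     # 1
print(len(raw_n))       # 2
```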
Python 2.x and Python 3.x treat strings differently. In Python 2, there are two types of strings, 8-bit strings and Unicode strings, while in Python 3 we have text and binary data. Text is always Unicode, and encoded Unicode is represented as binary data (http://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit).
Strings have a special notation to indicate which type we're using.
See the What's new in Python 3 section to find out what the notation looks like in Python 3.
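As a rough illustration of the text versus binary split, the sketch below assumes Python 3 (not the Python 2.7 used in the rest of the book). It compiles one pattern from a text literal and one from a bytes literal; the variable names are hypothetical.

```python
import re

# Python 3 only: a str pattern matches text, a bytes pattern matches binary data.
text_pattern = re.compile(r"\\")     # raw text (Unicode) literal
bytes_pattern = re.compile(br"\\")   # raw bytes literal

text_match = text_pattern.match("\\author")
bytes_match = bytes_pattern.match(b"\\author")

print(text_match.group(0))    # a str holding one backslash
print(bytes_match.group(0))   # a bytes object holding one backslash

# The two kinds cannot be mixed: a str pattern rejects a bytes subject.
try:
    text_pattern.match(b"\\author")
    mixed_allowed = True
except TypeError:
    mixed_allowed = False

print(mixed_allowed)  # False
```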
Using raw strings is the option recommended by the official Python documentation, and that's what we will use with Python 2.7 throughout the book. With this in mind, we can rewrite the regex as follows:
>>> pattern = re.compile(r"\\")
>>> pattern.match(r"\author")
<_sre.SRE_Match at 0x104a88f38>
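To confirm that the raw and the escaped spellings describe the same regular expression, the two compiled patterns can be compared directly. This is a minimal sketch with illustrative names:

```python
import re

escaped = re.compile("\\\\")   # four backslashes in a regular string literal
raw = re.compile(r"\\")        # two backslashes in a raw string literal

# Both literals produce the same two-character pattern text.
print(escaped.pattern == raw.pattern)  # True

subject = r"\author"           # backslash followed by "author"
result = raw.match(subject)
print(result is not None)      # True
print(len(result.group(0)))    # 1
```

The raw form needs only the escaping that the regex engine itself requires, which keeps longer patterns readable.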