In this section, we will review the differences from other regex flavors, how to deal with Unicode, and the differences in the re module between Python 2.x and Python 3.
As we mentioned at the beginning of the book, the re module has Perl-style regular expressions. However, that doesn't mean Python supports every feature the Perl engine has.
There are too many differences to cover in a short book like this. If you want to know them in depth, here are two good places to start:
When you're using Python 2.x and you want to match Unicode, the regex has to be a Unicode string that contains the Unicode escape. For example:
>>> re.findall(r"\u03a9", u"adeΩa")
[]
>>> re.findall(ur"\u03a9", u"adeΩa")
[u'\u03a9']
Note that if you use Unicode characters but the type of the string you're using is not Unicode, Python automatically encodes it using the default encoding. For example, in my case the default is UTF-8:
>>> u"Ω".encode("utf-8")
'\xce\xa9'
>>> "Ω"
'\xce\xa9'
So, you have to be careful while mixing types:
>>> re.findall(r'Ω', "adeΩa")
['\xce\xa9']
Here, you're not matching Unicode but the bytes produced by the default encoding:
>>> re.findall(r'\xce\xa9', "adeΩa")
['\xce\xa9']
So, if only one of them is Unicode, your pattern won't match anything:
>>> re.findall(r'Ω', u"adeΩa")
[]
On the other hand, if you use Unicode on both sides, it matches as expected:
>>> re.findall(ur'Ω', u"adeΩa")
[u'\u03a9']
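The same distinction between encoded bytes and Unicode text can be reproduced in Python 3, where the two types are explicit. The following is a minimal sketch, assuming Python 3 and UTF-8 as the encoding; the variable names are only illustrative.

```python
import re

text = "adeΩa"                        # str: Unicode text in Python 3
encoded_text = text.encode("utf-8")   # bytes: b'ade\xce\xa9a'

# Bytes pattern against bytes text: matches the two UTF-8 bytes of Ω.
byte_matches = re.findall("Ω".encode("utf-8"), encoded_text)

# Unicode pattern against Unicode text: matches the single code point.
unicode_matches = re.findall("Ω", text)

print(byte_matches)     # [b'\xce\xa9']
print(unicode_matches)  # ['Ω']
```

Both calls succeed because the pattern and the text have the same type in each case.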
The re module doesn't do Unicode case folding, so case-insensitive matching doesn't work on non-ASCII characters unless you also pass the UNICODE flag:
>>> re.findall(ur"ñ", ur"Ñ", re.I)
[]
There are some changes in Python 3 that affect the regex behavior, and new features have been added to the re module. First, let's review how the string notation has changed.
Type | Prefix | Description |
---|---|---|
String | | They are string literals. They're Unicode. The backslash is necessary to escape meaningful characters. `>>> "España \n"` gives `'España \n'` |
Raw string | `r` | They're equal to literal strings with the exception of the backslashes, which are treated as normal characters. `>>> r"España \n"` gives `'España \\n'` |
Byte strings | `b` | Strings represented as bytes. They can only contain ASCII characters; if the byte value is 128 or greater, it must be escaped. `>>> b"Espa\xc3\xb1a \n"` gives `b'Espa\xc3\xb1a \n'`. We can convert to Unicode in this way: `>>> str(b"Espa\xc3\xb1a \n", "utf-8")` gives `'España \n'`. The backslash is necessary to escape meaningful characters. |
Byte raw string | `br` | They are like byte strings, but the backslashes are treated as normal characters. `>>> br"Espa\xc3\xb1a \n"` gives `b'Espa\\xc3\\xb1a \\n'`. So, the backslashes used to escape bytes are escaped again, which complicates their conversion to Unicode: `>>> str(br"Espa\xc3\xb1a \n", "utf-8")` gives `'Espa\\xc3\\xb1a \\n'` |
Unicode | `u` | The |
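To make the differences between these literal types concrete, here is a small sketch (assuming Python 3) that compares their lengths and decodes the byte variants. The variable names are illustrative only.

```python
plain = "España \n"               # escape processed: ends with a newline
raw = r"España \n"                # backslash kept: ends with '\' and 'n'
data = b"Espa\xc3\xb1a \n"        # bytes: ñ written as its two UTF-8 bytes
raw_data = br"Espa\xc3\xb1a \n"   # raw bytes: every backslash is literal

lengths = (len(plain), len(raw), len(data), len(raw_data))
print(lengths)  # (8, 9, 9, 16)

# Decoding the byte string recovers the original text...
decoded = data.decode("utf-8")          # same as str(data, "utf-8")
print(decoded == plain)                 # True

# ...but decoding the raw byte string keeps the escape sequences as text.
raw_decoded = raw_data.decode("utf-8")
print(raw_decoded == r"Espa\xc3\xb1a \n")  # True
```

The raw byte string is longer because each `\xc3`-style escape is stored as four separate ASCII characters rather than one byte.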
Literal strings are Unicode by default in Python 3, which means that there is no need to use the UNICODE flag anymore.
>>> re.findall(r"\w+", "这是一个例子")
['这是一个例子']
Python 3.3 (http://docs.python.org/dev/whatsnew/3.3.html) adds more features related to Unicode and how it is treated in the language. For example, it adds support for the complete range of code points, including non-BMP (http://en.wikipedia.org/wiki/Plane_(Unicode)). So, for example:
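In Python 2.7 (on a narrow build), the character is split into a surrogate pair:
>>> re.findall(r".", u'\U0010FFFF')
[u'\udbff', u'\udfff']
In Python 3.3, it is matched as a single character:
>>> re.findall(r".", u'\U0010FFFF')
['\U0010ffff']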
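As a quick check of this behavior, the following sketch (assuming Python 3.3 or later) shows that a non-BMP character counts as one character and is matched by a single dot. The sample string is only an illustration.

```python
import re

ch = "\U0010FFFF"                  # the highest Unicode code point
print(len(ch))                     # 1
print(re.findall(r".", ch) == [ch])  # True

# A non-BMP character embedded in ordinary text is still one match for ".".
sample = "a\U0001F600b"
parts = re.findall(r".", sample)
print(len(parts))                  # 3
```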
As we've seen in the Compilation Flags section, the ASCII flag has been added.
Another important aspect to note when using Python 3 has to do with metacharacters. Because strings are Unicode by default, metacharacters such as \w match Unicode characters too, unless you use 8-bit patterns or the ASCII flag.
>>> re.findall(r"\w+", "هذا⇢مثال")
['هذا', 'مثال']
>>> re.findall(r"\w+", "هذا⇢مثال word", re.ASCII)
['word']
In the preceding example, the characters that aren't ASCII are ignored.
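The same effect applies to other character classes such as \d. The sketch below, assuming Python 3, uses Arabic-Indic digits (written as escapes to keep the source readable) and also shows the inline form of the flag, `(?a)`.

```python
import re

# "\u0663\u0664" are the Arabic-Indic digits three and four.
text = "\u0663\u0664 12"

# By default, \d matches any Unicode decimal digit.
unicode_digits = re.findall(r"\d+", text)

# With re.ASCII, \d only matches 0-9.
ascii_digits = re.findall(r"\d+", text, re.ASCII)

# The inline flag (?a) is equivalent to passing re.ASCII.
inline_digits = re.findall(r"(?a)\d+", text)

print(len(unicode_digits))  # 2
print(ascii_digits)         # ['12']
print(inline_digits)        # ['12']
```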
Take into account that Unicode patterns and 8-bit patterns cannot be mixed.
In the following example, the first call matches an 8-bit pattern against a byte string and works. The second call tries to match an 8-bit pattern against a Unicode string, which is why an exception is thrown (remember that it would work in Python 2.x):
>>> re.findall(br"\w+", b"hello world")
[b'hello', b'world']
>>> re.findall(br"\w+", "hello world")
….
TypeError: can't use a bytes pattern on a string-like object
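One way to avoid this error is to convert the input to the same type as the pattern before matching. The helper below, `findall_text`, is a hypothetical name introduced only for this sketch; it assumes Python 3 and that byte input is UTF-8 encoded unless another encoding is given.

```python
import re

def findall_text(pattern, data, encoding="utf-8"):
    """Hypothetical helper: decode bytes input so a str pattern can be used."""
    if isinstance(data, bytes):
        data = data.decode(encoding)
    return re.findall(pattern, data)

print(findall_text(r"\w+", b"hello world"))  # ['hello', 'world']
print(findall_text(r"\w+", "hello world"))   # ['hello', 'world']

# Mixing a bytes pattern with a str argument still raises TypeError.
try:
    re.findall(br"\w+", "hello world")
    mixed_failed = False
except TypeError:
    mixed_failed = True
print(mixed_failed)  # True
```

Decoding the data (rather than encoding the pattern) keeps the Unicode-aware behavior of metacharacters such as \w.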