Python and regex special considerations

In this section, we will review differences with other flavors, how to deal with Unicode, and also differences in the re module between Python 2.x and Python 3.

Differences between Python and other flavors

As we mentioned at the beginning of the book, the re module has Perl-style regular expressions. However, that doesn't mean Python supports every feature the Perl engine has.

There are too many differences to cover in a short book like this. If you want to explore them in depth, here are two good places to start:

Unicode

When you're using Python 2.x and you want to match Unicode, the pattern has to be a Unicode string (note the u prefix) so that the Unicode escape is interpreted. For example:

>>> re.findall(r"\u03a9", u"adeΩa")
[]
>>> re.findall(ur"\u03a9", u"adeΩa")
[u'\u03a9']
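
For comparison, here is a minimal sketch for Python 3.3 or later, where string literals are already Unicode and the re module itself understands \u escapes inside a pattern. The variable names are only illustrative:

```python
import re

text = "adeΩa"

# In a raw string the pattern holds the six characters \u03a9;
# the re module (Python 3.3+) interprets the escape itself.
raw_pattern_matches = re.findall(r"\u03a9", text)

# In a normal string the literal becomes Ω before re ever sees it.
plain_pattern_matches = re.findall("\u03a9", text)

print(raw_pattern_matches)    # ['Ω']
print(plain_pattern_matches)  # ['Ω']
```

Both forms find the same character, so on Python 3 the choice between them is mostly a matter of readability.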

Note that if you use Unicode characters but the type of the string you're using is not Unicode, Python automatically encodes it using the default encoding. For example, in my case the default is UTF-8:

>>> u"Ω".encode("utf-8")
'\xce\xa9'
>>> "Ω"
'\xce\xa9'

So, you have to be careful while mixing types:

>>> re.findall(r'Ω', "adeΩa")
['\xce\xa9']

Here, you're not matching Unicode but the characters in the default encoding:

>>> re.findall(r'\xce\xa9', "adeΩa")
['\xce\xa9']

So, if you use Unicode in only one of them, your pattern won't match anything:

>>> re.findall(r'Ω', u"adeΩa")
[]

On the other hand, you can use Unicode on both sides, and it matches as expected:

>>> re.findall(ur'Ω', u"adeΩa")
[u'\u03a9']
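
The same rule can be sketched in Python 3, where the two types are str and bytes and the encoding step is always explicit. This is an illustrative example, assuming UTF-8 as the encoding; the variable names are not from the original text:

```python
import re

text = "adeΩa"
data = text.encode("utf-8")   # b'ade\xce\xa9a'

# str pattern against str text: matches the single character Ω
str_matches = re.findall("Ω", text)

# bytes pattern against bytes data: matches the two UTF-8 bytes of Ω
bytes_matches = re.findall("Ω".encode("utf-8"), data)

print(str_matches)    # ['Ω']
print(bytes_matches)  # [b'\xce\xa9']
```

Keeping the pattern and the subject in the same type on both sides is what makes each call match.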

The re module doesn't do Unicode case folding, so case-insensitive matching doesn't work on non-ASCII characters by default:

>>> re.findall(ur"ñ", ur"Ñ", re.I)
[]
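
For comparison, a short sketch of the same search on Python 3, where str patterns apply Unicode case mappings unless the ASCII flag is set. The variable names are only illustrative:

```python
import re

# str patterns use Unicode case mappings, so ñ matches Ñ
unicode_matches = re.findall("ñ", "Ñ", re.IGNORECASE)

# re.ASCII restricts case-insensitive matching to ASCII letters
ascii_only_matches = re.findall("ñ", "Ñ", re.IGNORECASE | re.ASCII)

print(unicode_matches)     # ['Ñ']
print(ascii_only_matches)  # []
```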

What's new in Python 3

There are some changes in Python 3 that affect the regex behavior, and new features have been added to the re module. First, let's review how the string notation has changed.

Type: String
Prefix: none
Description: They are string literals. They're Unicode. The backslash is necessary to escape meaningful characters.

>>> "España \n"
'España \n'

Type: Raw string
Prefix: r or R
Description: They're equal to literal strings with the exception of the backslashes, which are treated as normal characters.

>>> r"España \n"
'España \\n'

Type: Byte strings
Prefix: b or B
Description: Strings represented as bytes. They can only contain ASCII characters; any byte with a value of 128 or greater must be escaped.

>>> b"Espa\xc3\xb1a \n"
b'Espa\xc3\xb1a \n'

We can convert to Unicode in this way:

>>> str(b"Espa\xc3\xb1a \n", "utf-8")
'España \n'

The backslash is necessary to escape meaningful characters.

Type: Byte raw string
Prefix: br or BR
Description: They are like byte strings, but the backslashes are treated as literal characters.

>>> br"Espa\xc3\xb1a \n"
b'Espa\\xc3\\xb1a \\n'

So, the backslashes used to escape bytes are escaped again, which complicates their conversion to Unicode:

>>> str(br"Espa\xc3\xb1a \n", "utf-8")
'Espa\\xc3\\xb1a \\n'

Type: Unicode
Prefix: u or U
Description: The u prefix was removed in the early versions of Python 3 and restored in version 3.3, where the syntax is accepted again. These literals are equal to strings.
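
The differences between these literal types can be checked with a short script. This is a minimal sketch for Python 3.3 or later; the variable names are illustrative and not part of the original text:

```python
plain = "España \n"                  # str: \n is a single newline character
raw = r"España \n"                   # raw str: backslash and n stay separate
encoded = b"Espa\xc3\xb1a \n"        # bytes: ñ written as two UTF-8 bytes
raw_encoded = br"Espa\xc3\xb1a \n"   # raw bytes: every backslash is literal

print(len(plain), len(raw))            # 8 9
print(len(encoded), len(raw_encoded))  # 9 16

decoded = encoded.decode("utf-8")
raw_decoded = raw_encoded.decode("utf-8")

print(decoded == plain)       # True
print(raw_decoded == plain)   # False: the backslashes survive as characters
print(u"España \n" == plain)  # True: the u prefix changes nothing
```

The length difference is a quick way to see which escapes were interpreted and which were kept as literal backslashes.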

Literal strings are Unicode by default in Python 3, which means that there is no need to use the UNICODE flag anymore.

>>> re.findall(r"\w+", "这是一个例子")
['这是一个例子']

Python 3.3 (http://docs.python.org/dev/whatsnew/3.3.html) adds more features related to Unicode and how it is treated in the language. For example, it adds support for the complete range of code points, including non-BMP (http://en.wikipedia.org/wiki/Plane_(Unicode)). So, for example:

  • In Python 2.7:
    >>> re.findall(r".", u'\U0010FFFF')
    [u'\udbff', u'\udfff']
  • In Python 3.3.2:
    >>> re.findall(r".", u'\U0010FFFF')
    ['\U0010ffff']
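
The two code units reported by Python 2.7 are the UTF-16 surrogate pair of the code point. A small sketch for Python 3.3 or later makes the relationship visible; the variable names are illustrative:

```python
import re

char = "\U0010FFFF"

# One code point is one character, so the dot matches exactly once
matches = re.findall(r".", char)
print(len(char), len(matches))  # 1 1

# Encoding to UTF-16 exposes the surrogate pair DBFF DFFF
utf16 = char.encode("utf-16-be")
print(utf16)  # b'\xdb\xff\xdf\xff'
```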

As we've seen in the Compilation Flags section, the ASCII flag has been added.

Another important aspect to note when using Python 3 has to do with metacharacters. Because strings are Unicode by default, metacharacters such as \w match Unicode characters too, unless you use 8-bit patterns or the ASCII flag.

>>> re.findall(r"\w+", "هذا⇢مثال")
['هذا', 'مثال']
>>> re.findall(r"\w+", "هذا⇢مثال word", re.ASCII)
['word']

In the preceding example, the characters that aren't ASCII are ignored.
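
The same applies to other character classes such as \d. The following sketch, written for Python 3, uses two Arabic-Indic digits (written as escapes to keep the source unambiguous) and also shows the inline form (?a) of the ASCII flag. The sample text and variable names are illustrative:

```python
import re

# U+0663 and U+0664 are the Arabic-Indic digits three and four
text = "\u0663\u0664 and 12"

# By default \d matches any Unicode decimal digit
unicode_digits = re.findall(r"\d+", text)

# With re.ASCII, or the inline (?a) flag, \d only matches 0-9
ascii_digits = re.findall(r"\d+", text, re.ASCII)
inline_ascii_digits = re.findall(r"(?a)\d+", text)

print(unicode_digits == ["\u0663\u0664", "12"])  # True
print(ascii_digits)                              # ['12']
print(inline_ascii_digits)                       # ['12']
```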

Take into account that Unicode patterns and 8-bit patterns cannot be mixed.

In the following example, the second call tries to match an 8-bit pattern against a Unicode string, which is why an exception is raised (remember that it would work in Python 2.x):

>>> re.findall(br"\w+", b"hello world")
[b'hello', b'world']
>>> re.findall(br"\w+", "hello world")
….
TypeError: can't use a bytes pattern on a string-like object
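
One way to handle this, sketched below for Python 3, is to bring the pattern and the subject to the same type before matching. The variable names and the choice of UTF-8 are assumptions made for illustration:

```python
import re

bytes_pattern = re.compile(br"\w+")
text = "hello world"

# Mixing a bytes pattern with a str subject raises TypeError
mixed_error = None
try:
    bytes_pattern.findall(text)
except TypeError as error:
    mixed_error = error

# Option 1: encode the subject so both sides are bytes
bytes_matches = bytes_pattern.findall(text.encode("utf-8"))

# Option 2: use a str pattern so both sides are str
str_matches = re.findall(r"\w+", text)

print(type(mixed_error).__name__)  # TypeError
print(bytes_matches)               # [b'hello', b'world']
print(str_matches)                 # ['hello', 'world']
```

Which option fits depends on where the data comes from: text that is already decoded is usually easier to match with str patterns, while raw network or file data may be better left as bytes.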