Compilation flags

When compiling a pattern string into a pattern object, we can modify the standard behavior of the pattern by passing compilation flags. Several flags can be combined with the bitwise OR operator, "|".
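As a minimal sketch of combining flags (the sample text and variable names are illustrative, not from the original), the following compiles one pattern with both re.IGNORECASE and re.MULTILINE:

```python
import re

# Illustrative input: two lines that differ in capitalization.
text = "Date: 12/01/2013\ndate: 11/01/2013"

# Combine flags with bitwise OR: case-insensitive matching plus
# line-aware anchors.
pattern = re.compile(r"^date", re.IGNORECASE | re.MULTILINE)

matches = pattern.findall(text)
print(matches)  # ['Date', 'date']
```

With only one of the two flags, the pattern would find a single match at most: re.I alone matches just the first line, and re.M alone matches just the second.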

Flag: re.IGNORECASE or re.I
Python: 2.x, 3.x
Description: The pattern matches both lowercase and uppercase letters.

Flag: re.MULTILINE or re.M
Python: 2.x, 3.x
Description: This flag changes the behavior of two metacharacters:

  • ^: Now matches at the beginning of the string and at the beginning of each new line.
  • $: Now matches at the end of the string and at the end of each line. Concretely, it matches right before the newline character.

Flag: re.DOTALL or re.S
Python: 2.x, 3.x
Description: The metacharacter "." matches any character, including the newline.

Flag: re.LOCALE or re.L
Python: 2.x, 3.x
Description: This flag makes \w, \W, \b, \B, \s, and \S dependent on the current locale.

"re.LOCALE just passes the character to the underlying C library. It really only works on bytestrings which have 1 byte per character. UTF-8 encodes code points outside the ASCII range to multiple bytes per code point, and the re module will treat each of those bytes as a separate character." (at http://www.gossamer-threads.com/lists/python/python/850772)

Note that when re.L and re.U are used together (re.L|re.U), only the locale is used. Also, note that in Python 3 the use of this flag is discouraged; see the documentation for more information.

Flag: re.VERBOSE or re.X
Python: 2.x, 3.x
Description: It allows writing regular expressions that are easier to read and understand. To do so, it treats some characters in a special way:

  • Whitespace is ignored, except when it's in a character class or preceded by a backslash.
  • All characters to the right of a # are ignored, as if they were a comment, except when the # is preceded by a backslash or is in a character class.

Flag: re.DEBUG
Python: 2.x, 3.x
Description: It displays debugging information about the compiled pattern.

Flag: re.UNICODE or re.U
Python: 2.x
Description: It makes \w, \W, \b, \B, \d, \D, \s, and \S dependent on the Unicode character properties database.

Flag: re.ASCII or re.A (only Python 3)
Python: 3.x
Description: It makes \w, \W, \b, \B, \d, \D, \s, and \S perform ASCII-only matching. This makes sense because in Python 3 the matches are Unicode by default. You can find more on this in the What's new on Python 3 section.

Let's see some examples of the most important flags.

re.IGNORECASE or re.I

As you can see, the following pattern matches both strings, even though the first one starts with an uppercase F rather than a lowercase f.

>>> import re
>>> pattern = re.compile(r"[a-z]+", re.I)
>>> pattern.search("Felix")
<_sre.SRE_Match at 0x10e27a238>
>>> pattern.search("felix")
<_sre.SRE_Match at 0x10e27a510>
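Note that search() would also find a match in "Felix" without the flag, just a shorter one. A small sketch (Python 3 syntax; the variable names are ours) that makes the difference visible by printing the matched text:

```python
import re

case_sensitive = re.compile(r"[a-z]+")
case_insensitive = re.compile(r"[a-z]+", re.I)

# Without the flag, the uppercase "F" is skipped and the match starts at "e".
print(case_sensitive.search("Felix").group())    # elix
# With re.I, the whole word is matched.
print(case_insensitive.search("Felix").group())  # Felix
```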

re.MULTILINE or re.M

In the following example, the pattern doesn't match the date after the newline because we're not using the flag:

>>> pattern = re.compile(r"^\w+: (\w+/\w+/\w+)")
>>> pattern.findall("date: 12/01/2013 \ndate: 11/01/2013")
['12/01/2013']

However, with the MULTILINE flag, it matches both dates:

>>> pattern = re.compile(r"^\w+: (\w+/\w+/\w+)", re.M)
>>> pattern.findall("date: 12/01/2013 \ndate: 11/01/2013")
['12/01/2013', '11/01/2013']

Note

This is not the best way to capture a date, because \w+ accepts letters and underscores as well as digits.
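One stricter alternative, sketched here under the assumption that the dates are in day/month/year order (the original doesn't say which), is to match digits explicitly and let datetime.strptime validate the result. The names DATE_LINE and parse_dates are hypothetical.

```python
import re
from datetime import datetime

# Digits only: two for the day, two for the month, four for the year.
# Trailing spaces or tabs before the end of the line are tolerated.
DATE_LINE = re.compile(r"^\w+: (\d{2}/\d{2}/\d{4})[ \t]*$", re.MULTILINE)


def parse_dates(text):
    """Return a datetime.date for each 'label: dd/mm/yyyy' line in text."""
    dates = []
    for raw in DATE_LINE.findall(text):
        # strptime raises ValueError for impossible dates such as 31/02/2013.
        dates.append(datetime.strptime(raw, "%d/%m/%Y").date())
    return dates


parsed = parse_dates("date: 12/01/2013 \ndate: 11/01/2013")
print(parsed)
```

The regular expression only checks the shape of the text; deciding whether the digits form a real calendar date is left to strptime.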

re.DOTALL or re.S

Let's try to match any character after a digit:

>>> re.findall(r"^\d(.)", "1\ne")
[]

We can see in the previous example that the metacharacter . with its default behavior doesn't match the newline. Let's see what happens when we use the flag:

>>> re.findall(r"^\d(.)", "1\ne", re.S)
['\n']

As expected, with the DOTALL flag the dot matches the newline.
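The same effect shows up with greedy quantifiers. In this sketch (the sample text is illustrative), .+ stops at the first newline by default and runs to the end of the string with re.DOTALL:

```python
import re

text = "first line\nsecond line"

# By default "." stops at the newline, so only the first line is matched.
default_match = re.search(r".+", text).group()
# With re.DOTALL the dot also consumes the newline.
dotall_match = re.search(r".+", text, re.DOTALL).group()

print(repr(default_match))  # 'first line'
print(repr(dotall_match))   # 'first line\nsecond line'
```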

re.LOCALE or re.L

In the following example, we build a string with the first 256 characters (code points 0 to 255) and then look for every alphanumeric character in it. Without the flag, we obtain only the expected ASCII characters:

>>> chars = ''.join(chr(i) for i in xrange(256))
>>> " ".join(re.findall(r"\w", chars))
'0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z'

After setting the locale to our system locale, we can again try to obtain every alphanumeric character:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, '')
'ru_RU.KOI8-R'

In this case, we get many more characters according to the new locale:

>>> " ".join(re.findall(r"\w", chars, re.LOCALE))
'0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z \xa3 \xb3 \xc0 \xc1 \xc2 \xc3 \xc4 \xc5 \xc6 \xc7 \xc8 \xc9 \xca \xcb \xcc \xcd \xce \xcf \xd0 \xd1 \xd2 \xd3 \xd4 \xd5 \xd6 \xd7 \xd8 \xd9 \xda \xdb \xdc \xdd \xde \xdf \xe0 \xe1 \xe2 \xe3 \xe4 \xe5 \xe6 \xe7 \xe8 \xe9 \xea \xeb \xec \xed \xee \xef \xf0 \xf1 \xf2 \xf3 \xf4 \xf5 \xf6 \xf7 \xf8 \xf9 \xfa \xfb \xfc \xfd \xfe \xff'
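The session above is Python 2 (note xrange and byte-oriented str). In Python 3.6 and later, to the best of our knowledge, re.LOCALE is accepted only with bytes patterns. A minimal sketch of that behavior follows; which bytes beyond ASCII count as word characters depends on the active locale, so no exact output is shown for that part.

```python
import locale
import re

# Try to switch to the system locale; the outcome is platform dependent.
try:
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    pass  # Keep the current locale if the environment's one is unavailable.

# In Python 3, re.LOCALE needs a bytes pattern and a bytes subject.
data = bytes(range(256))
word_bytes = re.findall(rb"\w", data, re.LOCALE)
# Digits, ASCII letters, and "_" give 63 matches; a locale may add more.
print(len(word_bytes))

# Combining re.LOCALE with a str pattern is rejected.
try:
    re.compile(r"\w", re.LOCALE)
    str_pattern_rejected = False
except ValueError:
    str_pattern_rejected = True
print(str_pattern_rejected)  # True
```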

re.UNICODE or re.U

Let's try to find all the alphanumeric characters in a string:

>>> re.findall(r"\w+", "this is an example")
['this', 'is', 'an', 'example']

But what would happen if we want to do the same with other languages? The alphanumeric characters depend on the language, so we need to indicate it to the regex engine:

>>> re.findall(ur"\w+", u"这是一个例子", re.UNICODE)
[u'\u8fd9\u662f\u4e00\u4e2a\u4f8b\u5b50']
>>> re.findall(ur"\w+", u"هذا مثال", re.UNICODE)
[u'\u0647\u0630\u0627', u'\u0645\u062b\u0627\u0644']
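The examples above use Python 2 syntax (the ur prefix does not exist in Python 3). A rough Python 3 equivalent, in which str patterns match Unicode by default and re.ASCII switches that behavior off, might look like this:

```python
import re

chinese = "这是一个例子"
arabic = "هذا مثال"

# Unicode-aware matching is the default for str patterns in Python 3.
chinese_words = re.findall(r"\w+", chinese)
arabic_words = re.findall(r"\w+", arabic)
print(chinese_words)  # one match covering the whole sentence
print(arabic_words)   # two matches, one per word

# re.ASCII restricts \w to [a-zA-Z0-9_], so nothing matches here.
ascii_only = re.findall(r"\w+", chinese, re.ASCII)
print(ascii_only)  # []
```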

re.VERBOSE or re.X

In the following pattern, we've used several spaces; the first one is ignored because it is neither in a character class nor preceded by a backslash, and the second one is part of the pattern because it is escaped. We've also used # three times outside the character class: the first and the third are ignored because they're not preceded by a backslash, and the second one is part of the pattern.

>>> pattern = re.compile(r"""[#|_] + #comment
              \ \# #comment
              \d+""", re.VERBOSE)
>>> pattern.findall("# #2")
['# #2']
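Verbose mode pays off most with longer patterns. As an illustration (the pattern and group names are ours, and the day/month/year order is an assumption), each part of a date pattern can be documented in place:

```python
import re

date_pattern = re.compile(r"""
    (?P<day>\d{2})    # two-digit day
    /                 # literal separator
    (?P<month>\d{2})  # two-digit month
    /                 # literal separator
    (?P<year>\d{4})   # four-digit year
""", re.VERBOSE)

match = date_pattern.search("date: 12/01/2013")
print(match.groupdict())  # {'day': '12', 'month': '01', 'year': '2013'}
```

Without re.VERBOSE, the same pattern would have to be written on one line with no comments, or the whitespace and # characters would be matched literally.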

re.DEBUG

>>> re.compile(r"[a-f|3-8]", re.DEBUG)
in
  range (97, 102)
  literal 124
  range (51, 56)