Understanding the Python regex engine

The re module uses a backtracking regular expression engine; although, in the very well-known book Mastering Regular Expressions by Jeffrey E. F. Friedl, it is classified as Nondeterministic Finite Automata (NFA) type. Also, according to Tim Peters (https://mail.python.org/pipermail/tutor/2006-January/044335.html), the module is not purely NFA.

These are the most common characteristics of the algorithm:

  • It supports "lazy quantifiers" such as *?, +?, and ??.
  • It matches the first coincidence, even though there are longer ones in the string.
    >>>re.search("engineer|engineering", "engineering").group()'engineer'

    This also means that order is important.

  • The algorithm tracks only one transition at one step, which means that the engine checks one character at a time.
  • Backreferences and capturing parentheses are supported.
  • Backtracking is the ability to remember the last successful position so that it can go back and retry if needed
  • In the worst case, complexity is exponential O(Cn). We'll see this later in Backtracking.

Backtracking

As we've mentioned before, backtracking allows going back and repeating the different paths of the regular expression. It does so by remembering the last successful position. This applies to alternation and quantifiers. Let's see an example:

Backtracking

Backtracking

As you can see in the preceding figure, the regex engine tries to match one character at a time until it fails, and then starts again with the next path it can retry.

The regex used in the figure is a perfect example of the importance of how the regex is built. In this case, the expression can be rebuilt as spa(in|niard) so that the regex engine doesn't have to go back to the start of the string in order to retry the second alternative.

This leads us to something called a catastrophic backtracking; a well-known problem with backtracking that can give you several problems ranging from slow regex to a crash with a stack overflow.

In the previous example, you can see that the behavior grows not only with the input but also with different paths in the regex, so the algorithm can be exponential O(Cn). With this in mind, it's easy to understand why we can end up with a stack overflow. The problem arises when the regex fails to match the string. Let's benchmark a regex with a technique we've seen previously so that we can understand the problem better.

First, let's try a simple regex:

>>> def catastrophic(n):
        print "Testing with %d characters" %n
        pat = re.compile('(a+)+c')
text = "%s" %('a' * n)
        pat.search(text)

As you can see, the text we're trying to match is always going to fail as there is no c at the end. Let's test it with different inputs:

>>> for n in range(20, 30):
        test(catastrophic, n)
Testing with 20 characters
The function catastrophic lasted: 0.130457
Testing with 21 characters
The function catastrophic lasted: 0.245125
……
The function catastrophic lasted: 14.828221
Testing with 28 characters
The function catastrophic lasted: 29.830929
Testing with 29 characters
The function catastrophic lasted: 61.110949

The behavior of this regex looks as if it is quadratic. But why? What's happening here? The problem is that (a+) starts greedy, so it tries to get as many a characters as possible. After that, it fails to match the c, that is, it backtracks to the second a, and continues consuming a characters until it fails to match c. And then, it tries the whole process again (backtracks) starting with the second a character.

Let's see another example, in this case with an exponential behavior:

>>> def catastrophic(n):
        print "Testing with %d characters" %n
        pat = re.compile('(x+)+(b+)+c')
        text = 'x' * n
        text += 'b' * n
        pat.search(text)
>>> for n in range(12, 18):
        test(catastrophic, n)
Testing with 12 characters
The function catastrophic lasted: 1.035162
Testing with 13 characters
The function catastrophic lasted: 4.084714
Testing with 14 characters
The function catastrophic lasted: 16.319145
Testing with 15 characters
The function catastrophic lasted: 65.855182
Testing with 16 characters
The function catastrophic lasted: 276.941307

As you can see, the behavior is exponential, which can lead to catastrophic scenarios. Finally, let's see what happens when the regex has a match:

>>> def non_catastrophic(n):
        print "Testing with %d characters" %n
        pat = re.compile('(x+)+(b+)+c')
        text = 'x' * n
        text += 'b' * n
        text += 'c'
        pat.search(text)
>>> for n in range(12, 18):
        test(non_catastrophic, n)
Testing with 10 characters
The function catastrophic lasted: 0.000029
……
Testing with 19 characters
The function catastrophic lasted: 0.000012
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset