2.15. Prevent Runaway Repetition

Problem

Use a single regular expression to match a complete HTML file, checking for properly nested html, head, title, and body tags. The regular expression must fail efficiently on HTML files that do not have the proper tags.

Solution

<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)↵
(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, PCRE, Perl, Ruby

JavaScript and Python do not support atomic grouping, so there is no way to eliminate the needless backtracking in these two regex flavors. When programming in JavaScript or Python, you can solve this problem by doing a literal text search for each of the tags, one by one, searching for each next tag in the part of the subject text that follows the previously found tag.
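A minimal Python sketch of that literal-search approach might look like this; the function name and the way the body tag's attributes are handled are illustrative, not part of the recipe:

def has_proper_structure(html):
    # Search for each tag literally, continuing from the end of the
    # previous match. No backtracking can occur, so failure is cheap.
    tags = ["<html>", "<head>", "<title>", "</title>",
            "</head>", "<body", "</body>", "</html>"]
    subject = html.lower()              # mimic the case-insensitive regex
    position = 0
    for tag in tags:
        position = subject.find(tag, position)
        if position == -1:
            return False                # a tag is missing: fail right away
        position += len(tag)
    return True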

Discussion

The proper solution to this problem is more easily understood if we start from this naïve solution:

<html>.*?<head>.*?<title>.*?</title>↵
.*?</head>.*?<body[^>]*>.*?</body>.*?</html>
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

When you test this regex on a proper HTML file, it works perfectly well. .*? skips over anything, because we turn on “dot matches line breaks.” The lazy asterisk makes sure the regex goes ahead only one character at a time, each time checking whether the next tag can be matched. Recipes 2.4 and 2.13 explain all this.

But this regex gets you into trouble when it needs to deal with a subject text that does not have all the HTML tags. The worst case occurs when </html> is missing.

Imagine the regex engine has matched all the preceding tags and is now busy expanding the last .*?. Since </html> can never match, the .*? expands all the way to the end of the file. When it can no longer expand, it fails.

But that is not the end of the story. The other six .*? have all remembered a backtracking position that allows them to expand further. When the last .*? fails, the one before it expands, gradually eating up the text </body> that was previously matched by the literal </body> in the regex. This .*?, too, expands all the way to the end of the file, as do all the preceding lazy dots. Only when the first one reaches the end of the file does the regex engine declare failure.

This regular expression has a worst-case complexity[4] of O(n⁷), the length of the subject text to the seventh power. There are seven lazy dots that can potentially expand all the way to the end of the file. If the file is twice the size, the regex can need up to 128 times as many steps to figure out it doesn’t match.

We call this catastrophic backtracking. So much backtracking occurs that the regex either takes forever or crashes your application. Some regex implementations are clever and will abort runaway match attempts early, but even then the regex will still kill your application’s performance.
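To watch this happen, you can time the naive regex on a subject that repeats the inner tags but omits </html>. The following Python sketch is only illustrative; the repeat counts are arbitrary, and the exact timings depend on your machine and regex engine, but the growth from one row to the next should be obvious:

import re
import time

naive = re.compile(
    r"<html>.*?<head>.*?<title>.*?</title>"
    r".*?</head>.*?<body[^>]*>.*?</body>.*?</html>",
    re.IGNORECASE | re.DOTALL)

block = "<head><title>t</title></head><body></body>"
for repeats in (5, 10, 15, 20):
    subject = "<html>" + block * repeats    # no </html> anywhere
    start = time.perf_counter()
    assert naive.search(subject) is None    # the match attempt must fail
    print(f"{repeats:2} blocks: {time.perf_counter() - start:.4f} seconds")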

Tip

Catastrophic backtracking is an instance of a phenomenon known as a combinatorial explosion, in which several orthogonal conditions intersect and all combinations have to be tried. You could also say that the regex is a Cartesian product of the various repetition operators.

The solution is to use atomic grouping to prevent needless backtracking. There is no need for the sixth .*? to expand after </body> has matched. If </html> fails, expanding the sixth lazy dot will not magically produce a closing html tag.

To make a quantified regular expression token stop when the following delimiter matches, place both the quantified part of the regex and the delimiter together in an atomic group: (?>.*?</body>). Now the regex engine throws away all the backtracking positions remembered by .*? as soon as </body> is found. If </html> later fails, the regex engine has forgotten about .*?</body>, and no further expansion occurs.

If we do the same for all the other .*? in the regex, none of them will expand further. Although there are still seven lazy dots in the regex, they will never overlap. This reduces the complexity of the regular expression to O(n), which is linear with respect to the length of the subject text. A regular expression can never be more efficient than this.
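For comparison, here is a sketch that runs the Solution regex, with its atomic groups, against the same kind of pathological subject. It assumes either the third-party regex module or a Python whose standard re module accepts (?>...) (Python 3.11 and later do; the statement about Python earlier in this recipe refers to older versions). The atomic version gives up almost immediately:

import time
try:
    import regex as re_mod      # third-party module; supports atomic groups
except ImportError:
    import re as re_mod         # the standard module accepts (?>...) in 3.11+

atomic = re_mod.compile(
    r"<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)"
    r"(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>",
    re_mod.IGNORECASE | re_mod.DOTALL)

subject = "<html>" + "<head><title>t</title></head><body></body>" * 20
start = time.perf_counter()
assert atomic.search(subject) is None       # still no </html>, so no match
print(f"atomic groups: {time.perf_counter() - start:.4f} seconds")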

Variations

If you really want to see catastrophic backtracking at work, try (x+x+)+y on xxxxxxxxxx. If it fails quickly, add one x to the subject. Repeat this until the regex starts to take very long to match or your application crashes. It won’t take many more x characters, unless you’re using Perl.

Of the regex flavors discussed in this book, only Perl is able to detect that the regular expression is too complex and then abort the match attempt without crashing.

The complexity of this regex is O(2ⁿ). When y fails to match, the regex engine tries all possible permutations of repeating each x+ and the group that contains them. For instance, one such permutation, far down the match attempt, has the first x+ matching xxx, the second x+ matching x, and then the group repeating three more times with each x+ matching a single x. With 10 x characters, there are 1,024 such permutations. If we increase the number to 32, we’re at over 4 billion permutations, enough to keep any backtracking regex engine busy practically forever (or crash it), unless it has a safety switch that allows it to give up and tell you that your regular expression is too complex.

This nonsensical regular expression is easily rewritten as xx+y, which finds exactly the same matches in linear time. In practice, the solution may not be so obvious with more complicated regexes.
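A rough Python sketch for timing the two patterns side by side follows; the subject lengths are arbitrary, and since Python’s re module has no complexity safeguard, expect each extra x to roughly double the slow pattern’s time:

import re
import time

slow = re.compile(r"(x+x+)+y")
fast = re.compile(r"xx+y")

for count in (16, 18, 20, 22):
    subject = "x" * count                   # no y, so both patterns must fail
    for pattern in (fast, slow):
        start = time.perf_counter()
        assert pattern.search(subject) is None
        print(f"{pattern.pattern:10} on {count} x's: "
              f"{time.perf_counter() - start:.4f} seconds")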

Essentially, you have to watch out when two or more parts of the regular expression can match the same text. In these cases, you may need atomic grouping to make sure the regex engine doesn’t try all possible ways of dividing the subject text between those two parts of the regex. Always test your regex on (long) test subjects that contain text that can be partially but not entirely matched by the regex.

See Also

Recipe 2.13 explains how to choose between minimal repetition and maximal repetition.

Recipe 2.14 explains how to make sure the regex engine doesn’t needlessly try different amounts of repetition.

The section Unicode grapheme in Recipe 2.7 has another example of how atomic grouping can prevent undesirable match results.

The section SDL Regex Fuzzer describes SDL Regex Fuzzer, a tool that can test (some) regular expressions for catastrophic backtracking.



[4] Complexity of computer algorithms is usually described using the “big O notation.” The article at http://en.wikipedia.org/wiki/Time_complexity provides a good overview of common time complexities for computer algorithms.
