Use a single regular expression to match a complete HTML file, checking for properly nested html, head, title, and body tags. The regular expression must fail efficiently on HTML files that do not have the proper tags.
<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, PCRE, Perl, Ruby
JavaScript and Python do not support atomic grouping. There is no way to eliminate needless backtracking with these two regex flavors. When programming in JavaScript or Python, you can solve this problem by doing a literal text search for each of the tags one by one, searching for the next tag through the remainder of the subject text after the one last found.
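That sequential search can be sketched in Python as follows; the function name, the tag list, and the simplified check for the opening ‹<body› tag (ignoring its attributes) are our own choices, not code from this recipe:

```python
def has_proper_tags(html):
    """Check for the same tag sequence the regex verifies, by searching
    for each tag literally, starting where the previous one ended."""
    # "<body" is a simplified stand-in for <body[^>]*>: we only verify
    # that an opening body tag begins here, without parsing attributes.
    tags = ("<html>", "<head>", "<title>", "</title>",
            "</head>", "<body", "</body>", "</html>")
    pos = 0
    lowered = html.lower()          # mimic case-insensitive matching
    for tag in tags:
        found = lowered.find(tag, pos)
        if found == -1:
            return False            # a tag is missing: fail right away
        pos = found + len(tag)      # continue after the tag just found
    return True
```

Each ‹find› call scans forward from where the previous tag ended, so the whole check runs in linear time, with no backtracking at all.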
The proper solution to this problem is more easily understood if we start from this naïve solution:
<html>.*?<head>.*?<title>.*?</title>.*?</head>.*?<body[^>]*>.*?</body>.*?</html>
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
When you test this regex on a proper HTML file, it works perfectly well. ‹.*?› skips over anything, because we turn on “dot matches line breaks.” The lazy asterisk makes sure the regex goes ahead only one character at a time, each time checking whether the next tag can be matched. Recipes 2.4 and 2.13 explain all this.
But this regex gets you into trouble when it needs to deal with a subject text that does not have all the HTML tags. The worst case occurs when ‹</html>› is missing. Imagine the regex engine has matched all the preceding tags and is now busy expanding the last ‹.*?›. Since ‹</html>› can never match, the ‹.*?› expands all the way to the end of the file. When it can no longer expand, it fails.
But that is not the end of the story. The other six ‹.*?› have all remembered a backtracking position that allows them to expand further. When the last ‹.*?› fails, the one before it expands, gradually matching </body>. That same text was previously matched by the literal ‹</body>› in the regex. This ‹.*?› too will expand all the way to the end of the file, as will all the preceding lazy dots. Only when the first one reaches the end of the file will the regex engine declare failure.
This regular expression has a worst-case complexity[4] of O(n⁷), the length of the subject text to the seventh power. There are seven lazy dots that can each potentially expand all the way to the end of the file. If the file is twice the size, the regex can need up to 128 times as many steps to figure out that it doesn’t match.
We call this catastrophic backtracking. So much backtracking occurs that the regex either takes forever or crashes your application. Some regex implementations are clever and will abort runaway match attempts early, but even then the regex will still kill your application’s performance.
Catastrophic backtracking is an instance of a phenomenon known as a combinatorial explosion, in which several orthogonal conditions intersect and all combinations have to be tried. You could also say that the regex is a Cartesian product of the various repetition operators.
The solution is to use atomic grouping to prevent needless backtracking. There is no need for the sixth ‹.*?› to expand after ‹</body>› has matched. If ‹</html>› fails, expanding the sixth lazy dot will not magically produce a closing html tag.
To make a quantified regular expression token stop when the following delimiter matches, place both the quantified part of the regex and the delimiter together in an atomic group: ‹(?>.*?</body>)›. Now the regex engine throws away all the matching positions for ‹.*?</body>› when ‹</body>› is found. If ‹</html>› later fails, the regex engine has forgotten about ‹.*?</body>›, and no further expansion will occur.
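In flavors without atomic grouping, the same “forgetting” effect can be emulated with a capturing lookahead followed by a backreference. A minimal Python sketch, assuming the classic ‹(?=(X))\1› workaround; the pattern and subject here are our own illustration:

```python
import re

# (?=(...)) matches .*?</body> inside a lookahead and captures it;
# the backreference \1 then consumes that exact text. Once the
# lookahead has succeeded, the engine does not backtrack into it,
# so .*? can no longer expand past the </body> it already found.
atomic_like = re.compile(r"(?=(.*?</body>))\1", re.IGNORECASE | re.DOTALL)

match = atomic_like.search("<body>some text</body> and more")
```

The overall match and the capture are the same text, because the lookahead itself consumes nothing; ‹\1› does all the consuming.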
If we do the same for all the other ‹.*?› in the regex, none of them will expand further. Although there are still seven lazy dots in the regex, they will never overlap. This reduces the complexity of the regular expression to O(n), which is linear with respect to the length of the subject text. A regular expression can never be more efficient than this.
If you really want to see catastrophic backtracking at work, try ‹(x+x+)+y› on xxxxxxxxxx. If it fails quickly, add one x to the subject. Repeat this until the regex starts to take a very long time to match or your application crashes. It won’t take many more x characters, unless you’re using Perl.
Of the regex flavors discussed in this book, only Perl is able to detect that the regular expression is too complex and then abort the match attempt without crashing.
The complexity of this regex is O(2ⁿ). When ‹y› fails to match, the regex engine will try all possible permutations of repeating each ‹x+› and the group containing them. For instance, one such permutation, far down the match attempt, is the first ‹x+› matching xxx, the second ‹x+› matching x, and then the group being repeated three more times with each ‹x+› matching x. With 10 x characters, there are 1,024 such permutations. If we increase the number to 32, we’re at over 4 billion permutations, which will surely cause any regex engine to run out of memory, unless it has a safety switch that allows it to give up and say that your regular expression is too complicated.
In this case, the nonsensical regular expression is easily rewritten as ‹xx+y›, which finds exactly the same matches in linear time. In practice, the solution may not be so obvious with more complicated regexes.
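For this particular pair, it is easy to verify in Python that the two patterns find identical matches, while only the second fails in linear time. The variable names are ours, and the subjects are kept short enough that even the exponential pattern returns quickly:

```python
import re

slow = re.compile(r"(x+x+)+y")  # exponential on non-matching input
fast = re.compile(r"xx+y")      # equivalent matches, linear failure

# Both find the same match when one exists.
assert slow.search("xxxxxy").group() == "xxxxxy"
assert fast.search("xxxxxy").group() == "xxxxxy"

# Both fail on 10 x's, but the slow pattern already tries on the
# order of 1,024 permutations; around 30 x's it effectively hangs.
assert slow.search("x" * 10) is None
assert fast.search("x" * 10) is None
```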
Essentially, you have to watch out when two or more parts of the regular expression can match the same text. In these cases, you may need atomic grouping to make sure the regex engine doesn’t try all possible ways of dividing the subject text between those two parts of the regex. Always test your regex on (long) test subjects that contain text that can be partially but not entirely matched by the regex.
Recipe 2.13 explains how to choose between minimal repetition and maximal repetition.
Recipe 2.14 explains how to make sure the regex engine doesn’t needlessly try different amounts of repetition.
The section Unicode grapheme in Recipe 2.7 has another example of how atomic grouping can prevent undesirable match results.
SDL Regex Fuzzer describes a tool of the same name that can test (some) regular expressions for catastrophic backtracking.
[4] Complexity of computer algorithms is usually described using the “big O notation.” The article at http://en.wikipedia.org/wiki/Time_complexity provides a good overview of common time complexities for computer algorithms.