You want to find all occurrences of the word TODO within (X)HTML or XML comments. For example, you want to match only the second TODO, the one inside the comment, in the following string:
This "TODO" is not within a comment, but the next one is. <!-- TODO: ↵ Come up with a cooler comment for this example. -->
There are at least two approaches to this problem, and both have their advantages. The first tactic, which we’ll call the “two-step approach,” is to find comments with an outer regex, and then search within each match using a separate regex or even a plain text search. That works best if you’re writing code to do the job, since separating the task into two steps keeps things simple and fast. However, if you’re searching through files using a text editor or grep tool, splitting the task in two won’t work unless your tool of choice offers a special option to search within matches found by another regex.[23]
When you need to find words within comments using a single regex, you can accomplish this with the help of lookaround. This second method is shown later in this recipe.
When it’s a workable option, the better solution is to split the task in two: search for comments, and then search within those comments for TODO.
Here’s how you can find comments:
<!--.*?-->
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
Standard JavaScript doesn’t have a “dot matches line breaks” option, but you can use an all-inclusive character class in place of the dot, as follows:
<!--[\s\S]*?-->
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
For each comment you find using one of the regexes just shown, you can then search within the matched text for the literal characters ‹TODO›. If you prefer, you can make it a case-insensitive regex with word boundaries on each end to make sure that only the complete word TODO is matched, like so:
\bTODO\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Follow the code in Recipe 3.13 to search within matches of an outer regex.
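As a rough illustration of the two-step approach, here is a minimal Python sketch. It is not the code from Recipe 3.13; the function name `find_todos_in_comments` and the sample subject string are assumptions made up for this example.

```python
import re

# Outer regex: one (X)HTML/XML comment. DOTALL is Python's
# "dot matches line breaks" option.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Inner regex: the complete word TODO, case insensitive.
TODO_RE = re.compile(r"\bTODO\b", re.IGNORECASE)

def find_todos_in_comments(text):
    """Return (start, end) offsets in text for each TODO inside a comment.

    Hypothetical helper written for this sketch.
    """
    spans = []
    for comment in COMMENT_RE.finditer(text):
        # Search only within the text matched by the outer regex.
        for todo in TODO_RE.finditer(comment.group()):
            offset = comment.start()
            spans.append((offset + todo.start(), offset + todo.end()))
    return spans

subject = ('This "TODO" is not within a comment, but the next one is. '
           '<!-- TODO:\nCome up with a cooler comment for this example. -->')
for start, end in find_todos_in_comments(subject):
    print(start, end, subject[start:end])
```

Only the TODO inside the comment is reported, because the inner regex never sees text outside a comment match.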
Lookahead (described in Recipe 2.16) lets you solve this problem with a single regex, albeit less efficiently. In the following regex, positive lookahead is used to make sure that the word TODO is followed by the closing comment delimiter -->. On its own, that doesn’t tell whether the word appears within a comment or is simply followed by a comment, so a nested negative lookahead is used to ensure that the opening comment delimiter <!-- does not appear before the -->:
\bTODO\b(?=(?:(?!<!--).)*?-->)
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
Since standard JavaScript doesn’t have a “dot matches line breaks” option, use ‹[\s\S]› in place of the dot:
\bTODO\b(?=(?:(?!<!--)[\s\S])*?-->)
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
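To see the single-regex version in action, the following Python sketch applies it to a sample string. The variable names and the subject text are assumptions for illustration; `re.IGNORECASE` and `re.DOTALL` stand in for the "case insensitive" and "dot matches line breaks" options.

```python
import re

# Single regex: TODO as a whole word, followed by "-->" with no "<!--" in between.
SINGLE_STEP_RE = re.compile(
    r"\bTODO\b(?=(?:(?!<!--).)*?-->)",
    re.IGNORECASE | re.DOTALL,
)

subject = ('This "TODO" is not within a comment, but the next one is. '
           '<!-- TODO:\nCome up with a cooler comment for this example. -->')

# Collect the start offset of every match.
starts = [m.start() for m in SINGLE_STEP_RE.finditer(subject)]
print(starts)
```

The first TODO is skipped because an opening `<!--` appears before the next `-->`, so the lookahead fails there.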
Recipe 3.13 shows the code you need to search within matches of another regex. It takes an inner and outer regex. The comment regex serves as the outer regex, and ‹\bTODO\b› as the inner regex.
The main thing to note here is the lazy ‹*?› quantifier that follows the dot or character class in the comment regex. As explained in Recipe 2.13, that lets you match up to the first --> (the one that ends the comment), rather than the very last occurrence of --> in your subject string.
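The difference between the lazy and greedy quantifier is easy to check with a string that contains two comments. This is a small Python sketch with a made-up subject string:

```python
import re

subject = "<!-- first --> text <!-- second -->"

# Lazy: stops at the first "-->", so each comment is matched separately.
lazy = re.findall(r"<!--.*?-->", subject, re.DOTALL)
# Greedy: runs to the last "-->", swallowing the text between the comments.
greedy = re.findall(r"<!--.*-->", subject, re.DOTALL)

print(lazy)    # ['<!-- first -->', '<!-- second -->']
print(greedy)  # ['<!-- first --> text <!-- second -->']
```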
The single-step solution is more complex, and slower. On the plus side, it combines the two steps of the previous approach into one regex. Thus, it can be used when working with a text editor, IDE, or other tool that doesn’t allow searching within matches of another regex.
Let’s break this regex down in free-spacing mode, and take a closer look at each part:
\b TODO \b        # Match the characters "TODO", as a complete word
(?=               # Followed by:
  (?:             #   Group but don't capture:
    (?! <!-- )    #     Not followed by: "<!--"
    .             #     Match any single character
  )*?             #   Repeat zero or more times, as few as possible (lazy)
  -->             #   Match the characters "-->"
)
Regex options: Dot matches line breaks, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
This commented version of the regex doesn’t work in JavaScript unless you use the XRegExp library, since standard JavaScript lacks both “free-spacing” and “dot matches line breaks” modes.
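In Python, free-spacing mode is `re.VERBOSE`. The sketch below compiles a commented version of the regex and compares it with the compact form on a sample string; the names `VERBOSE_RE`, `COMPACT_RE`, and the subject are assumptions for illustration.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Commented version of the regex, compiled in free-spacing mode.
VERBOSE_RE = re.compile(r"""
    \b TODO \b        # Match the characters "TODO", as a complete word
    (?=               # Followed by:
      (?:             #   Group but don't capture:
        (?! <!-- )    #     Not followed by: "<!--"
        .             #     Match any single character
      )*?             #   Repeat zero or more times, as few as possible (lazy)
      -->             #   Match the characters "-->"
    )
""", FLAGS | re.VERBOSE)

# The same regex written without whitespace or comments.
COMPACT_RE = re.compile(r"\bTODO\b(?=(?:(?!<!--).)*?-->)", FLAGS)

subject = "TODO outside <!-- todo inside,\nspanning lines --> TODO after"
verbose_spans = [m.span() for m in VERBOSE_RE.finditer(subject)]
compact_spans = [m.span() for m in COMPACT_RE.finditer(subject)]
print(verbose_spans == compact_spans)  # True
```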
Notice that the regex contains a negative lookahead nested within an outer, positive lookahead. That lets you require that any match of TODO is followed by --> and that <!-- does not occur in between.
If it’s clear to you how all of this works together, great: you can skip the rest of this section. But in case it’s still a little hazy, let’s take another step back and build the outer, positive lookahead in this regex step by step.
Let’s say for a moment that we simply want to match occurrences of the word TODO that are followed at some point in the string by -->. That gives us the regex ‹\bTODO\b(?=.*?-->)› (with “dot matches line breaks” enabled), which matches the TODO in <!--TODO--> just fine. We need the ‹.*?› at the beginning of the lookahead, because otherwise the regex would match only when TODO is immediately followed by -->, with no characters in between. The ‹*?› quantifier repeats the dot zero or more times, as few times as possible, which is great since we only want to match until the first following -->.
As an aside, the regex so far could be rewritten as ‹\bTODO(?=.*?-->)\b› (with the second ‹\b› moved after the lookahead) without any effect on the text that is matched. That’s because both the word boundary and the lookahead are zero-length assertions (see Lookaround). However, it’s better to place the word boundary first for readability and efficiency. In the middle of a partial match, the regex engine can more quickly test a word boundary, fail, and move forward to try the regex again at the next character in the string without having to spend time testing the lookahead when it isn’t necessary.
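The claim that moving the second word boundary does not change the matched text can be checked with a quick Python sketch. The subject string here is invented for the test, and the comparison says nothing about speed.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Word boundary before the lookahead (the recommended order).
boundary_first = re.compile(r"\bTODO\b(?=.*?-->)", FLAGS)
# Word boundary after the lookahead (same matches, tested in a different order).
boundary_last = re.compile(r"\bTODO(?=.*?-->)\b", FLAGS)

subject = "TODOS <!-- todo: x --> and TODO <!-- y -->"

spans_first = [m.span() for m in boundary_first.finditer(subject)]
spans_last = [m.span() for m in boundary_last.finditer(subject)]
print(spans_first == spans_last)  # True
```

Both versions skip `TODOS`, since the boundary after `TODO` fails there regardless of where it sits in the pattern.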
OK, so the regex ‹\bTODO\b(?=.*?-->)› seems to work fine so far, but what about when it’s applied to the subject string TODO <!--separate comment-->? The regex still matches TODO since it’s followed by -->, even though TODO is not within a comment this time. We therefore need to change the dot within the lookahead from matching any character to matching any character that is not part of the string <!--, since that would indicate the start of a new comment. We can’t use a negated character class such as ‹[^<!-]›, because we want to allow <, !, and - characters that are not grouped into the exact sequence <!--.
That’s where the nested negative lookahead comes in. ‹(?!<!--).› matches any single character that is not part of an opening comment delimiter. Placing that pattern within a noncapturing group as ‹(?:(?!<!--).)› allows us to repeat the whole sequence with the lazy ‹*?› quantifier we’d previously applied to just the dot.
Putting it all together, we get the final regex that was listed as the solution for this problem: ‹\bTODO\b(?=(?:(?!<!--).)*?-->)›. In JavaScript, which lacks the necessary “dot matches line breaks” option, ‹\bTODO\b(?=(?:(?!<!--)[\s\S])*?-->)› is equivalent.
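The step-by-step build-up above can be replayed in Python by comparing the intermediate regex with the final one on the two subject strings discussed. The variable names are assumptions made for this sketch.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Intermediate regex: TODO followed at some point by "-->".
naive = re.compile(r"\bTODO\b(?=.*?-->)", FLAGS)
# Final regex: the dot may not start an opening "<!--" delimiter.
final = re.compile(r"\bTODO\b(?=(?:(?!<!--).)*?-->)", FLAGS)

inside = "<!--TODO-->"
outside = "TODO <!--separate comment-->"

print(bool(naive.search(inside)), bool(final.search(inside)))    # True True
print(bool(naive.search(outside)), bool(final.search(outside)))  # True False
```

The intermediate regex wrongly accepts the TODO that precedes a separate comment, while the nested negative lookahead rejects it.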
Although the “single-step approach” regex ensures that any match of TODO is followed by --> without <!-- occurring in between, it doesn’t check the reverse: that the target word is also preceded by <!-- without --> in between. There are several reasons we left that rule out:
You can usually get away with not doing this double-check, especially since the single-step regex is meant to be used with text editors and the like, where you can visually verify your results.
Having less to verify means less time spent performing the verification. In other words, the regex is faster when the extra check is left out.
Most importantly, since you don’t know how far back the comment may have started, looking backward like this requires infinite-length lookbehind, which is supported by the .NET regex flavor only.
If you’re working with .NET and want to include this added check, use the following regex:
(?<=<!--(?:(?!-->).)*?)\bTODO\b(?=(?:(?!<!--).)*?-->)
Regex options: Case insensitive, dot matches line breaks
Regex flavor: .NET
This stricter, .NET-only regex adds a positive lookbehind at the front, which works just like the lookahead at the end but in reverse. Because the lookbehind works forward from the position where it finds <!--, the lookbehind contains a nested negative lookahead that lets it match any characters that are not part of the sequence -->.
Since the leading lookbehind and trailing lookahead are both zero-length assertions, the final match is just the word TODO. The strings matched within the lookarounds do not become a part of the final matched text.
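Python’s built-in `re` module accepts only fixed-width lookbehind, so the stricter regex cannot be used there as written. The following sketch, built on an invented subject string with a stray `-->`, shows the kind of false positive the lookahead-only regex allows and how the two-step approach sidesteps it without any lookbehind.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Single-step regex: checks only what follows TODO.
lookahead_only = re.compile(r"\bTODO\b(?=(?:(?!<!--).)*?-->)", FLAGS)
# Two-step regexes: comments first, then TODO inside each comment.
comment_re = re.compile(r"<!--.*?-->", re.DOTALL)
todo_re = re.compile(r"\bTODO\b", re.IGNORECASE)

# The first TODO is followed by a stray "-->" but is not inside a comment.
subject = "TODO --> stray delimiter, <!-- real TODO -->"

single_step = [m.start() for m in lookahead_only.finditer(subject)]
two_step = [
    comment.start() + todo.start()
    for comment in comment_re.finditer(subject)
    for todo in todo_re.finditer(comment.group())
]

print(len(single_step))  # 2 (includes the TODO outside any comment)
print(len(two_step))     # 1 (only the TODO inside the comment)
```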
Recipe 9.9 includes a detailed discussion of how to match XML-style comments.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.4 explains that the dot matches any character. Recipe 2.6 explains word boundaries. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.13 explains how greedy and lazy quantifiers backtrack. Recipe 2.16 explains lookaround.
[23] PowerGREP—described in Tools for Working with Regular Expressions in Chapter 1—is one tool that’s able to search within matches.