9.10. Find Words Within XML-Style Comments

Problem

You want to find all occurrences of the word TODO within (X)HTML or XML comments. For example, you want to match only the second TODO (the one inside the comment) in the following string:

        This "TODO" is not within a comment, but the next one is. <!-- TODO: Come up with a cooler comment for this example. -->

Solution

There are at least two approaches to this problem, and both have their advantages. The first tactic, which we’ll call the “two-step approach,” is to find comments with an outer regex, and then search within each match using a separate regex or even a plain text search. That works best if you’re writing code to do the job, since separating the task into two steps keeps things simple and fast. However, if you’re searching through files using a text editor or grep tool, splitting the task in two won’t work unless your tool of choice offers a special option to search within matches found by another regex.[23]

When you need to find words within comments using a single regex, you can accomplish this with the help of lookaround. This second method is shown in the upcoming “Single-step approach” section.

Two-step approach

When it’s a workable option, the better solution is to split the task in two: search for comments, and then search within those comments for TODO.

Here’s how you can find comments:

<!--.*?-->
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Standard JavaScript doesn’t have a “dot matches line breaks” option, but you can use an all-inclusive character class in place of the dot, as follows:

<!--[\s\S]*?-->
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

For each comment you find using one of the regexes just shown, you can then search within the matched text for the literal characters TODO. If you prefer, you can make it a case-insensitive regex with word boundaries on each end to make sure that only the complete word TODO is matched, like so:

\bTODO\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Follow the code in Recipe 3.13 to search within matches of an outer regex.
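As a rough illustration of the two-step approach, here is a minimal Python sketch. It is not the code from Recipe 3.13; the function name find_todos_in_comments and the sample subject string are assumptions made for this example.

```python
import re

# Outer regex: an XML-style comment. re.DOTALL lets the dot match line breaks.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Inner regex: the complete word TODO, case insensitive.
TODO = re.compile(r"\bTODO\b", re.IGNORECASE)


def find_todos_in_comments(text):
    """Return (start, end) offsets in text of each TODO inside a comment."""
    spans = []
    for comment in COMMENT.finditer(text):
        # Search only the text of this comment, then shift the offsets
        # back to positions in the full subject string.
        for word in TODO.finditer(comment.group()):
            spans.append((comment.start() + word.start(),
                          comment.start() + word.end()))
    return spans


subject = ('This "TODO" is not within a comment, but the next one is. '
           '<!-- TODO: Come up with a cooler comment for this example. -->')
spans = find_todos_in_comments(subject)
print([(start, subject[start:end]) for start, end in spans])
```

Only the TODO inside the comment is reported, because the inner regex never sees text outside the outer matches.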

Single-step approach

Lookahead (described in Recipe 2.16) lets you solve this problem with a single regex, albeit less efficiently. In the following regex, positive lookahead is used to make sure that the word TODO is followed by the closing comment delimiter -->. On its own, that doesn’t tell whether the word appears within a comment or is simply followed by a comment, so a nested negative lookahead is used to ensure that the opening comment delimiter <!-- does not appear before the -->:

\bTODO\b(?=(?:(?!<!--).)*?-->)
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Since standard JavaScript doesn’t have a “dot matches line breaks” option, use [\s\S] in place of the dot:

\bTODO\b(?=(?:(?!<!--)[\s\S])*?-->)
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
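To see the single-step regex in action, the following Python sketch applies it to the example string from the Problem section. The variable names are assumptions chosen for this example; the flags re.IGNORECASE and re.DOTALL correspond to the “case insensitive” and “dot matches line breaks” options.

```python
import re

# Single-step regex: TODO followed by "-->" with no "<!--" in between.
SINGLE_STEP = re.compile(r"\bTODO\b(?=(?:(?!<!--).)*?-->)",
                         re.IGNORECASE | re.DOTALL)

subject = ('This "TODO" is not within a comment, but the next one is. '
           '<!-- TODO: Come up with a cooler comment for this example. -->')

# Collect the start offset of every match in the subject string.
starts = [m.start() for m in SINGLE_STEP.finditer(subject)]
print(starts)
```

The quoted TODO at the beginning is skipped because the lookahead runs into <!-- before it can reach -->.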

Discussion

Two-step approach

Recipe 3.13 shows the code you need to search within matches of another regex. It takes an inner and outer regex. The comment regex serves as the outer regex, and TODO as the inner regex. The main thing to note here is the lazy *? quantifier that follows the dot or character class in the comment regex. As explained in Recipe 2.13, that lets you match up to the first --> (the one that ends the comment), rather than the very last occurrence of --> in your subject string.

Single-step approach

This solution is more complex, and slower. On the plus side, it combines the two steps of the previous approach into one regex. Thus, it can be used when working with a text editor, IDE, or other tool that doesn’t allow searching within matches of another regex.

Let’s break this regex down in free-spacing mode, and take a closer look at each part:

\b TODO \b      # Match the characters "TODO", as a complete word
(?=             # Followed by:
  (?:           #   Group but don't capture:
    (?! <!-- )  #     Not followed by: "<!--"
    .           #     Match any single character
  )*?           #   Repeat zero or more times, as few as possible (lazy)
  -->           #   Match the characters "-->"
)
Regex options: Dot matches line breaks, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

This commented version of the regex doesn’t work in JavaScript unless you use the XRegExp library, since standard JavaScript lacks both “free-spacing” and “dot matches line breaks” modes.
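In a flavor that does support both modes, the commented regex can be used directly. The sketch below shows one way to do this in Python, where re.VERBOSE provides free-spacing mode; the constant name and the sample string are assumptions made for this example.

```python
import re

# re.VERBOSE is Python's free-spacing mode: unescaped whitespace and
# "#" comments inside the pattern are ignored.
TODO_IN_COMMENT = re.compile(r"""
    \b TODO \b      # Match the characters "TODO", as a complete word
    (?=             # Followed by:
      (?:           #   Group but don't capture:
        (?! <!-- )  #     Not followed by: "<!--"
        .           #     Match any single character
      )*?           #   Repeat zero or more times, as few as possible (lazy)
      -->           #   Match the characters "-->"
    )
""", re.VERBOSE | re.DOTALL | re.IGNORECASE)

sample = "<!-- todo: fix --> TODO <!-- x -->"
found = [m.group() for m in TODO_IN_COMMENT.finditer(sample)]
print(found)
```

The free-spacing version behaves the same as the compact regex; only the layout differs.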

Notice that the regex contains a negative lookahead nested within an outer, positive lookahead. That lets you require that any match of TODO is followed by --> and that <!-- does not occur in between.

If it’s clear to you how all of this works together, great: you can skip the rest of this section. But in case it’s still a little hazy, let’s take another step back and build the outer, positive lookahead in this regex step by step.

Let’s say for a moment that we simply want to match occurrences of the word TODO that are followed at some point in the string by -->. That gives us the regex \bTODO\b(?=.*?-->) (with “dot matches line breaks” enabled), which matches the word TODO in <!--TODO--> just fine. We need the .*? at the beginning of the lookahead, because otherwise the regex would match only when TODO is immediately followed by -->, with no characters in between. The *? quantifier repeats the dot zero or more times, as few times as possible, which is great since we only want to match until the first following -->.

As an aside, the regex so far could be rewritten as \bTODO(?=.*?-->)\b (with the second \b moved after the lookahead) without any effect on the text that is matched. That’s because both the word boundary and the lookahead are zero-length assertions (see Lookaround). However, it’s better to place the word boundary first for readability and efficiency. In the middle of a partial match, the regex engine can more quickly test a word boundary, fail, and move forward to try the regex again at the next character in the string without having to spend time testing the lookahead when it isn’t necessary.

OK, so the regex \bTODO\b(?=.*?-->) seems to work fine so far, but what about when it’s applied to the subject string TODO <!--separate comment-->? The regex still matches TODO since it’s followed by -->, even though TODO is not within a comment this time. We therefore need to change the dot within the lookahead from matching any character to matching any character that is not part of the string <!--, since that would indicate the start of a new comment. We can’t use a negated character class such as [^<!-], because we want to allow <, !, and - characters that are not grouped into the exact sequence <!--.

That’s where the nested negative lookahead comes in. (?!<!--). matches any single character that is not part of an opening comment delimiter. Placing that pattern within a noncapturing group as (?:(?!<!--).) allows us to repeat the whole sequence with the lazy *? quantifier we’d previously applied to just the dot.

Putting it all together, we get the final regex that was listed as the solution for this problem: \bTODO\b(?=(?:(?!<!--).)*?-->). In JavaScript, which lacks the necessary “dot matches line breaks” option, \bTODO\b(?=(?:(?!<!--)[\s\S])*?-->) is equivalent.
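The difference between the intermediate regex and the final one can be checked with a short Python sketch. The names naive and final are assumptions made for this example; the two subject strings are the ones used in the walkthrough above.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Intermediate regex: TODO followed anywhere later by "-->".
naive = re.compile(r"\bTODO\b(?=.*?-->)", FLAGS)
# Final regex: additionally forbids "<!--" between TODO and "-->".
final = re.compile(r"\bTODO\b(?=(?:(?!<!--).)*?-->)", FLAGS)

inside = "<!--TODO-->"
outside = "TODO <!--separate comment-->"

results = {
    "naive/inside": naive.search(inside) is not None,
    "naive/outside": naive.search(outside) is not None,
    "final/inside": final.search(inside) is not None,
    "final/outside": final.search(outside) is not None,
}
for name, matched in results.items():
    print(name, matched)
```

Both regexes match the TODO inside the comment, but only the intermediate regex also matches the TODO that merely precedes a separate comment.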

Variations

Although the “single-step approach” regex ensures that any match of TODO is followed by --> without <!-- occurring in between, it doesn’t check the reverse: that the target word is also preceded by <!-- without --> in between. There are several reasons we left that rule out:

  • You can usually get away with not doing this double-check, especially since the single-step regex is meant to be used with text editors and the like, where you can visually verify your results.

  • Having less to verify means less time spent performing the verification. In other words, the regex is faster when the extra check is left out.

  • Most importantly, since you don’t know how far back the comment may have started, looking backward like this requires infinite-length lookbehind, which is supported by the .NET regex flavor only.

If you’re working with .NET and want to include this added check, use the following regex:

(?<=<!--(?:(?!-->).)*?)\bTODO\b(?=(?:(?!<!--).)*?-->)
Regex options: Case insensitive, dot matches line breaks
Regex flavor: .NET

This stricter, .NET-only regex adds a positive lookbehind at the front, which works just like the lookahead at the end but in reverse. Because the lookbehind works forward from the position where it finds <!--, the lookbehind contains a nested negative lookahead that lets it match any characters that are not part of the sequence -->.

Since the leading lookbehind and trailing lookahead are both zero-length assertions, the final match is just the word TODO. The strings matched within the lookarounds do not become a part of the final matched text.
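The following Python sketch illustrates both sides of this trade-off, under the assumption that you are using Python’s standard re module: the lenient single-step regex accepts a stray --> that no <!-- precedes, and the stricter regex cannot be compiled because re accepts only fixed-width lookbehind. The variable names and the sample string are made up for this example.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# The lenient single-step regex only looks ahead, so a stray "-->" with
# no preceding "<!--" is enough to produce a match.
lenient = re.compile(r"\bTODO\b(?=(?:(?!<!--).)*?-->)", FLAGS)
stray = "TODO: remove this arrow --> later"
lenient_matches_stray = lenient.search(stray) is not None

# The stricter regex needs a lookbehind of unbounded length. Python's re
# module rejects it at compile time with re.error.
strict_pattern = r"(?<=<!--(?:(?!-->).)*?)\bTODO\b(?=(?:(?!<!--).)*?-->)"
try:
    re.compile(strict_pattern, FLAGS)
    strict_compiles = True
except re.error:
    strict_compiles = False

print(lenient_matches_stray, strict_compiles)
```

When the extra check matters and the flavor can’t express it, the two-step approach gives the same strictness, since the inner search never looks outside a matched comment.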

See Also

Recipe 9.9 includes a detailed discussion of how to match XML-style comments.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.4 explains that the dot matches any character. Recipe 2.6 explains word boundaries. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.13 explains how greedy and lazy quantifiers backtrack. Recipe 2.16 explains lookaround.



[23] PowerGREP—described in Tools for Working with Regular Expressions in Chapter 1—is one tool that’s able to search within matches.
