9.9. Remove XML-Style Comments

Problem

You want to remove comments from an (X)HTML or XML document. For example, you want to remove development comments from a web page before it is served to web browsers, or you want to perform subsequent searches without finding any matches within comments.

Solution

Finding comments is not a difficult task, thanks to the availability of lazy quantifiers. Here is the regular expression for the job:

<!--.*?-->
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

That’s pretty straightforward. As usual, though, JavaScript’s lack of a “dot matches line breaks” option (unless you use the XRegExp library) means that you’ll need to replace the dot with an all-inclusive character class in order for the regular expression to match comments that span more than one line. Following is a version that works with standard JavaScript:

<!--[\s\S]*?-->
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

To remove the comments, replace all matches with the empty string (i.e., nothing). Recipe 3.14 lists code to replace all matches of a regex.
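
As a minimal sketch of that replacement step in Python (one of the flavors listed above), the following uses re.sub with the re.DOTALL flag, which is Python's name for “dot matches line breaks.” The function name strip_comments and the sample markup are illustrative assumptions, not part of the recipe.

```python
import re

# re.DOTALL is Python's "dot matches line breaks" option.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(markup: str) -> str:
    """Replace every comment match with the empty string."""
    return COMMENT.sub("", markup)


sample = "<p>keep</p><!-- dev note\nspanning lines --><p>also keep</p>"
print(strip_comments(sample))
```

Compiling the pattern once keeps the flag next to the regex, so the two cannot drift apart if the function is called from several places.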

Discussion

How it works

At the beginning and end of this regular expression are the literal character sequences <!-- and -->. Since none of those characters are special in regex syntax (except within character classes, where hyphens create ranges), they don’t need to be escaped. That just leaves the .*? or [\s\S]*? in the middle of the regex to examine further.

Thanks to the “dot matches line breaks” option, the dot in the regex shown first matches any single character. In the JavaScript version, the character class [\s\S] takes its place. The two regexes are exactly equivalent: \s matches any whitespace character, and \S matches everything else. Combined, they match any character.

The lazy *? quantifier repeats its preceding “any character” element zero or more times, as few times as possible. Thus, the preceding token is repeated only until the first occurrence of -->, rather than matching all the way to the end of the subject string, and then backtracking until the last -->. (See Recipe 2.13 for more on how backtracking works with lazy and greedy quantifiers.) This simple strategy works well since XML-style comments cannot be nested within each other. In other words, they always end at the first (leftmost) occurrence of -->.
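
To make the difference between the lazy and greedy quantifiers concrete, here is a small Python sketch that runs both forms against a subject string containing two comments. The subject string and variable names are made up for illustration.

```python
import re

subject = "<!-- one -->text<!-- two -->"

lazy = re.compile(r"<!--.*?-->", re.DOTALL)
greedy = re.compile(r"<!--.*-->", re.DOTALL)

lazy_matches = lazy.findall(subject)
greedy_matches = greedy.findall(subject)

print(lazy_matches)    # each comment is matched on its own
print(greedy_matches)  # a single match that also swallows "text"
```

The greedy version runs to the end of the subject and backtracks only as far as the last -->, so replacing its match would also delete the content between the two comments.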

When comments can’t be removed

Most web developers are familiar with using HTML comments within <script> and <style> elements for backward compatibility with ancient browsers. These days, it’s just a meaningless incantation, but its use lives on in part because of copy-and-paste coding. We’re going to assume that when you remove comments from an (X)HTML document, you don’t want to strip out embedded JavaScript and CSS. You probably also want to leave the contents of <textarea> elements, CDATA sections, and the values of attributes within tags alone.

Earlier, we said removing comments wasn’t a difficult task. As it turns out, that was only true if you ignore some of the tricky areas of (X)HTML or XML where the syntax rules change. In other words, if you ignore the hard parts of the problem, it’s easy.

Of course, in some cases you might evaluate the markup you’re dealing with and decide it’s OK to ignore these problem cases, maybe because you wrote the markup yourself and know what to expect. It might also be OK if you’re doing a search-and-replace in a text editor and are able to manually inspect each match before removing it.

But getting back to how to work around these issues, in “Skip Tricky (X)HTML and XML Sections” we discussed some of these same problems in the context of matching XML-style tags. We can use a similar line of attack when searching for comments. Use the code in Recipe 3.18 to first search for tricky sections using the regular expression shown next, and then replace comments found between matches with the empty string (in other words, remove the comments):

<(script|style|textarea|title|xmp)(?:[^>"']|"[^"]*"|'[^']*')*>↵
.*?</\1\s*>|<plaintext(?:[^>"']|"[^"]*"|'[^']*')*>.*|↵
<[a-z](?:[^>"']|"[^"]*"|'[^']*')*>|<!\[CDATA\[.*?]]>
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Adding some whitespace and a few comments to the regex in free-spacing mode makes this a lot easier to follow:

# Special element: tag and its content
<( script | style | textarea | title | xmp )
  (?:[^>"']|"[^"]*"|'[^']*')*
> .*? </\1\s*>
|
# <plaintext/> continues until the end of the string
<plaintext
  (?:[^>"']|"[^"]*"|'[^']*')*
> .*
|
# Standard element: tag only
<[a-z]  # Tag name initial character
  (?:[^>"']|"[^"]*"|'[^']*')*
>
|
# CDATA section
<!\[CDATA\[ .*? ]]>
Regex options: Case insensitive, dot matches line breaks, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Here’s an equivalent version for standard JavaScript, which lacks both “dot matches line breaks” and “free-spacing” options:

<(script|style|textarea|title|xmp)(?:[^>"']|"[^"]*"|'[^']*')*>↵
[\s\S]*?</\1\s*>|<plaintext(?:[^>"']|"[^"]*"|'[^']*')*>[\s\S]*|↵
<[a-z](?:[^>"']|"[^"]*"|'[^']*')*>|<!\[CDATA\[[\s\S]*?]]>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
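
The “search between matches” idea can be sketched in Python roughly as follows. This is not the code from Recipe 3.18; the function name strip_comments_outside_tricky and the sample markup are assumptions made for this illustration. The tricky-section regex is the one shown above, split across adjacent string literals for readability.

```python
import re

# The tricky-section regex from this recipe, one alternative per literal.
TRICKY = re.compile(
    r"""<(script|style|textarea|title|xmp)(?:[^>"']|"[^"]*"|'[^']*')*>.*?</\1\s*>"""
    r"""|<plaintext(?:[^>"']|"[^"]*"|'[^']*')*>.*"""
    r"""|<[a-z](?:[^>"']|"[^"]*"|'[^']*')*>"""
    r"""|<!\[CDATA\[.*?]]>""",
    re.IGNORECASE | re.DOTALL,
)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments_outside_tricky(markup: str) -> str:
    """Remove comments only from the text between tricky sections."""
    pieces = []
    last_end = 0
    for match in TRICKY.finditer(markup):
        # Text between two tricky sections: comments here are removed.
        pieces.append(COMMENT.sub("", markup[last_end:match.start()]))
        # The tricky section itself is copied through untouched.
        pieces.append(match.group(0))
        last_end = match.end()
    # Whatever follows the final tricky section.
    pieces.append(COMMENT.sub("", markup[last_end:]))
    return "".join(pieces)


page = (
    "<p>a</p><!-- remove me -->"
    "<script><!-- keep me --></script>"
    "<![CDATA[<!-- keep too -->]]>"
)
print(strip_comments_outside_tricky(page))
```

One limitation of this sketch: a comment that itself contains a tag, such as <!-- <b>x</b> -->, is split by the tricky-section match for the inner tag, so it is left in place rather than removed.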

Variations

Find valid XML comments

There are in fact a few syntax rules for XML comments that go beyond simply starting with <!-- and ending with -->. Specifically:

  • Two hyphens cannot appear in a row within a comment. For example, <!-- com--ment --> is invalid because of the two hyphens in the middle.

  • The closing delimiter cannot be preceded by a hyphen that is part of the comment. For example, <!-- comment ---> is invalid, but the completely empty comment <!----> is allowed.

  • Whitespace may occur between the closing -- and >. For example, <!-- comment -- > is a valid, complete comment.

It’s not hard to work these rules into a regex:

<!--[^-]*(?:-[^-]+)*--\s*>
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Notice that everything between the opening and closing comment delimiters is still optional, so it matches the completely empty comment <!---->. However, if a hyphen occurs between the delimiters, it must be followed by at least one nonhyphen character. And since the inner portion of the regex can no longer match two hyphens in a row, the lazy quantifier from the regexes at the beginning of this recipe has been replaced with greedy quantifiers. Lazy quantifiers would still work fine, but sticking with them here would result in unnecessary backtracking (see Recipe 2.13).
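
As a quick check of this regex against the examples from the rules listed earlier, the following Python sketch tests whether a whole string is a single valid comment. The helper name is_valid_xml_comment is an assumption for this example; re.fullmatch is used so that a string counts only if the regex matches it from start to end.

```python
import re

XML_COMMENT = re.compile(r"<!--[^-]*(?:-[^-]+)*--\s*>")


def is_valid_xml_comment(text: str) -> bool:
    """True if the entire string is one comment under the rules above."""
    return XML_COMMENT.fullmatch(text) is not None


candidates = [
    "<!-- comment -->",    # ordinary comment
    "<!---->",             # completely empty comment
    "<!-- comment -- >",   # whitespace before the closing >
    "<!-- com--ment -->",  # two hyphens in a row inside the comment
    "<!-- comment --->",   # hyphen directly before the closing delimiter
]
for candidate in candidates:
    print(repr(candidate), is_valid_xml_comment(candidate))
```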

Some readers might look at this new regex and wonder why the [^-] negated character class is used twice, rather than just making the hyphen inside the noncapturing group optional (i.e., <!--(?:-?[^-]+)*--\s*>). There’s a good reason, which brings us back to the discussion of “catastrophic backtracking” from Recipe 2.15.

So-called nested quantifiers always warrant extra attention and care in order to ensure that you’re not creating the potential for catastrophic backtracking. A quantifier is nested when it occurs within a grouping that is itself repeated by a quantifier. For example, the pattern (?:-?[^-]+)* contains two nested quantifiers: the question mark following the hyphen and the plus sign following the negated character class.

However, nesting quantifiers is not really what makes this dangerous, performance-wise. Rather, it’s that there are a potentially massive number of ways that the outer * quantifier can be combined with the inner quantifiers while attempting to match a string. If the regex engine fails to find --> at the end of a partial match (as is required when you plug this pattern segment into the comment-matching regex), the engine must try all possible repetition combinations before failing the match attempt and moving on. This number of options expands extremely rapidly with each additional character that the engine must try to match. However, there is nothing dangerous about the nested quantifiers if this situation is avoided. For example, the pattern (?:-[^-]+)* does not pose a risk even though it contains a nested + quantifier, because now that exactly one hyphen must be matched per repetition of the group, the potential number of backtracking points increases linearly with the length of the subject string.
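
One rough way to observe this is to time how long each pattern takes to fail on an unterminated comment. The Python sketch below is only an illustration: the helper name time_failed_match and the input lengths are assumptions, the lengths are deliberately small so the script finishes quickly, and actual timings depend on the machine and the regex engine.

```python
import re
import time

SAFE = re.compile(r"<!--[^-]*(?:-[^-]+)*--\s*>")
RISKY = re.compile(r"<!--(?:-?[^-]+)*--\s*>")


def time_failed_match(pattern: re.Pattern, length: int) -> float:
    """Return the seconds spent failing to match an unterminated comment."""
    subject = "<!--" + "x" * length  # no closing --> anywhere
    start = time.perf_counter()
    result = pattern.search(subject)
    elapsed = time.perf_counter() - start
    assert result is None  # neither pattern can match this subject
    return elapsed


for n in (8, 12, 16):
    print(n, time_failed_match(SAFE, n), time_failed_match(RISKY, n))
```

With a backtracking engine such as Python's re module, the time for the second pattern would be expected to grow far faster than the first as the length increases, which is why the lengths here should be raised only with care.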

Another way to avoid the potential backtracking problem we’ve just described is to use an atomic group. The following is equivalent to the first regex shown in this section, but it’s a few characters shorter and isn’t supported by JavaScript or Python:

<!--(?>-?[^-]+)*--\s*>
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby

See Recipe 2.14 for the details about how atomic groups (and their counterpart, possessive quantifiers) work.

Find valid HTML comments

HTML 4.01 officially used the XML comment rules we described earlier, but web browsers never paid much attention to the finer points. HTML5 comment syntax has two differences from XML, which brings it closer to what web browsers actually implement. First, whitespace is not allowed between the closing -- and >. Second, the text within comments is not allowed to start with > or -> (in web browsers, that ends the comment early).

Here are the HTML5 comment rules translated into regex:

<!--(?!-?>)[^-]*(?:-[^-]+)*-->
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Compared to the earlier regex for matching valid XML comments, this one doesn’t include \s* before the trailing >, and adds the negative lookahead (?!-?>) just after the opening <!--.
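
The effect of those two changes can be seen in a short Python sketch. The helper name is_valid_html5_comment and the test strings are assumptions for illustration; as before, re.fullmatch requires the regex to match the whole string.

```python
import re

HTML5_COMMENT = re.compile(r"<!--(?!-?>)[^-]*(?:-[^-]+)*-->")


def is_valid_html5_comment(text: str) -> bool:
    """True if the entire string is one comment under the HTML5 rules above."""
    return HTML5_COMMENT.fullmatch(text) is not None


candidates = [
    "<!-- comment -->",   # ordinary comment
    "<!---->",            # empty comment
    "<!-- comment -- >",  # whitespace before > is no longer allowed
    "<!--> text -->",     # comment text starts with >
    "<!---> text -->",    # comment text starts with ->
]
for candidate in candidates:
    print(repr(candidate), is_valid_html5_comment(candidate))
```

The last two candidates are the ones the negative lookahead exists for: without it, the rest of the regex would accept both strings as complete comments.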

Tip

The reality of what web browsers treat as comments is more permissive than the official HTML rules. It’s therefore typically preferable to use the simple <!--.*?--> (with “dot matches line breaks”) or <!--[\s\S]*?--> regexes shown in this recipe’s main section.

See Also

Recipe 9.10 shows how to find specific words when they occur within XML-style comments.

Recipes , , and explain how to match various styles of single- and multiline programming language comments in source code.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.1 explains which special characters need to be escaped. Recipe 2.3 explains character classes. Recipe 2.4 explains that the dot matches any character. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.10 explains backreferences. Recipe 2.12 explains repetition. Recipe 2.13 explains how greedy and lazy quantifiers backtrack. Recipe 2.16 explains lookaround.
