You want to remove comments from an (X)HTML or XML document. For example, you want to remove development comments from a web page before it is served to web browsers, or you want to perform subsequent searches without finding any matches within comments.
Finding comments is not a difficult task, thanks to the availability of lazy quantifiers. Here is the regular expression for the job:
<!--.*?-->
Regex options: Dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
That’s pretty straightforward. As usual, though, JavaScript’s lack of a “dot matches line breaks” option (unless you use the XRegExp library) means that you’ll need to replace the dot with an all-inclusive character class in order for the regular expression to match comments that span more than one line. Following is a version that works with standard JavaScript:
<!--[\s\S]*?-->
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
To remove the comments, replace all matches with the empty string (i.e., nothing). Recipe 3.14 lists code to replace all matches of a regex.
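As a rough illustration of that replacement step, here is a minimal Python sketch using the first regex. The function name strip_comments and the sample markup are invented for this example; re.DOTALL is Python's spelling of the "dot matches line breaks" option.

```python
import re

# "Dot matches line breaks" is called re.DOTALL in Python.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_comments(markup: str) -> str:
    """Replace every comment match with the empty string."""
    return COMMENT.sub("", markup)

# Hypothetical sample: one comment spans a line break.
sample = "<p>Keep</p><!-- drop\nthis --><p>Keep too</p><!-- and this -->"
print(strip_comments(sample))
```

The same call works with the [\s\S] version of the regex if the re.DOTALL flag is dropped.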
At the beginning and end of this regular expression are the literal character sequences ‹<!--› and ‹-->›. Since none of those characters are special in regex syntax (except within character classes, where hyphens create ranges), they don’t need to be escaped. That just leaves the ‹.*?› or ‹[\s\S]*?› in the middle of the regex to examine further.
Thanks to the “dot matches line breaks” option, the dot in the regex shown first matches any single character. In the JavaScript version, the character class ‹[\s\S]› takes its place. However, the two regexes are exactly equivalent. ‹\s› matches any whitespace character, and ‹\S› matches everything else. Combined, they match any character.
The lazy ‹*?› quantifier repeats its preceding “any character” element zero or more times, as few times as possible. Thus, the preceding token is repeated only until the first occurrence of -->, rather than matching all the way to the end of the subject string, and then backtracking until the last -->. (See Recipe 2.13 for more on how backtracking works with lazy and greedy quantifiers.) This simple strategy works well since XML-style comments cannot be nested within each other. In other words, they always end at the first (leftmost) occurrence of -->.
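The difference between the lazy and greedy quantifiers can be seen in a small Python sketch. The subject string is made up for this example, and the greedy variant is shown only for contrast.

```python
import re

# Hypothetical subject with two separate comments.
subject = "a<!-- one -->b<!-- two -->c"

lazy = re.search(r"<!--.*?-->", subject, re.DOTALL)
greedy = re.search(r"<!--.*-->", subject, re.DOTALL)

print(lazy.group())    # ends at the first -->
print(greedy.group())  # runs on to the last -->
```

With the greedy quantifier, the text between the two comments would be swallowed along with them.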
Most web developers are familiar with using HTML comments within <script> and <style> elements for backward compatibility with ancient browsers. These days, it’s just a meaningless incantation, but its use lives on in part because of copy-and-paste coding. We’re going to assume that when you remove comments from an (X)HTML document, you don’t want to strip out embedded JavaScript and CSS. You probably also want to leave the contents of <textarea> elements, CDATA sections, and the values of attributes within tags alone.
Earlier, we said removing comments wasn’t a difficult task. As it turns out, that was only true if you ignore some of the tricky areas of (X)HTML or XML where the syntax rules change. In other words, if you ignore the hard parts of the problem, it’s easy.
Of course, in some cases you might evaluate the markup you’re dealing with and decide it’s OK to ignore these problem cases, maybe because you wrote the markup yourself and know what to expect. It might also be OK if you’re doing a search-and-replace in a text editor and are able to manually inspect each match before removing it.
But getting back to how to work around these issues, in “Skip Tricky (X)HTML and XML Sections” we discussed some of these same problems in the context of matching XML-style tags. We can use a similar line of attack when searching for comments. Use the code in Recipe 3.18 to first search for tricky sections using the regular expression shown next, and then replace comments found between matches with the empty string (in other words, remove the comments):
<(script|style|textarea|title|xmp)(?:[^>"']|"[^"]*"|'[^']*')*>↵
.*?</\1\s*>|<plaintext(?:[^>"']|"[^"]*"|'[^']*')*>.*|↵
<[a-z](?:[^>"']|"[^"]*"|'[^']*')*>|<!\[CDATA\[.*?]]>
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
Adding some whitespace and a few comments to the regex in free-spacing mode makes this a lot easier to follow:
# Special element: tag and its content
<( script | style | textarea | title | xmp )
(?:[^>"']|"[^"]*"|'[^']*')*
>
.*?
</\1\s*>
|
# <plaintext/> continues until the end of the string
<plaintext
(?:[^>"']|"[^"]*"|'[^']*')*
>
.*
|
# Standard element: tag only
<[a-z]                          # Tag name initial character
(?:[^>"']|"[^"]*"|'[^']*')*
>
|
# CDATA section
<!\[CDATA\[ .*? ]]>
Regex options: Case insensitive, dot matches line breaks, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
Here’s an equivalent version for standard JavaScript, which lacks both “dot matches line breaks” and “free-spacing” options:
<(script|style|textarea|title|xmp)(?:[^>"']|"[^"]*"|'[^']*')*>↵
[\s\S]*?</\1\s*>|<plaintext(?:[^>"']|"[^"]*"|'[^']*')*>[\s\S]*|↵
<[a-z](?:[^>"']|"[^"]*"|'[^']*')*>|<!\[CDATA\[[\s\S]*?]]>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
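The "search between matches" idea can be sketched in Python roughly as follows. This is not the code from Recipe 3.18, only an approximation of the approach; the function name strip_comments_outside_tricky and the sample markup are assumptions made for this example.

```python
import re

# The tricky-section regex from above, split across three raw strings.
TRICKY = re.compile(
    r"""<(script|style|textarea|title|xmp)(?:[^>"']|"[^"]*"|'[^']*')*>"""
    r""".*?</\1\s*>|<plaintext(?:[^>"']|"[^"]*"|'[^']*')*>.*|"""
    r"""<[a-z](?:[^>"']|"[^"]*"|'[^']*')*>|<!\[CDATA\[.*?]]>""",
    re.IGNORECASE | re.DOTALL,
)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def strip_comments_outside_tricky(markup: str) -> str:
    """Remove comments only from the text between tricky sections."""
    pieces = []
    last_end = 0
    for match in TRICKY.finditer(markup):
        # Text before this tricky section: comments here are removed.
        pieces.append(COMMENT.sub("", markup[last_end:match.start()]))
        # The tricky section itself is copied through unchanged.
        pieces.append(match.group())
        last_end = match.end()
    # Whatever follows the last tricky section.
    pieces.append(COMMENT.sub("", markup[last_end:]))
    return "".join(pieces)

# Hypothetical sample: comments inside <script> and an attribute must survive.
sample = ('<!-- a --><script><!-- keep --></script>'
          '<p title="<!-- attr -->">x</p><!-- b -->')
print(strip_comments_outside_tricky(sample))
```

One limitation of this sketch: a comment that itself contains an opening tag is split in two by the tag match, so it is left in place rather than removed.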
There are in fact a few syntax rules for XML comments that go beyond simply starting with <!-- and ending with -->. Specifically:
Two hyphens cannot appear in a row within a comment. For example, <!-- com--ment --> is invalid because of the two hyphens in the middle.
The closing delimiter cannot be preceded by a hyphen that is part of the comment. For example, <!-- comment ---> is invalid, but the completely empty comment <!----> is allowed.
Whitespace may occur between the closing -- and >. For example, <!-- comment -- > is a valid, complete comment.
It’s not hard to work these rules into a regex:
<!--[^-]*(?:-[^-]+)*--\s*>
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Notice that everything between the opening and closing comment delimiters is still optional, so it matches the completely empty comment <!---->. However, if a hyphen occurs between the delimiters, it must be followed by at least one nonhyphen character. And since the inner portion of the regex can no longer match two hyphens in a row, the lazy quantifier from the regexes at the beginning of this recipe has been replaced with greedy quantifiers. Lazy quantifiers would still work fine, but sticking with them here would result in unnecessary backtracking (see Recipe 2.13).
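To check the XML rules against the regex, a short Python sketch can run it over the examples mentioned above. The candidate list is assembled for this illustration, and fullmatch is used so that each string must be a complete comment on its own.

```python
import re

XML_COMMENT = re.compile(r"<!--[^-]*(?:-[^-]+)*--\s*>")

candidates = [
    "<!-- comment -->",    # ordinary comment
    "<!---->",             # completely empty comment
    "<!-- comment -- >",   # whitespace before the final >
    "<!-- com--ment -->",  # two hyphens in a row
    "<!-- comment --->",   # hyphen directly before the closing delimiter
]

for text in candidates:
    verdict = "valid" if XML_COMMENT.fullmatch(text) else "invalid"
    print(f"{text!r}: {verdict}")
```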
Some readers might look at this new regex and wonder why the ‹[^-]› negated character class is used twice, rather than just making the hyphen inside the noncapturing group optional (i.e., ‹<!--(?:-?[^-]+)*--\s*>›). There’s a good reason, which brings us back to the discussion of “catastrophic backtracking” from Recipe 2.15.
So-called nested quantifiers always warrant extra attention and care in order to ensure that you’re not creating the potential for catastrophic backtracking. A quantifier is nested when it occurs within a grouping that is itself repeated by a quantifier. For example, the pattern ‹(?:-?[^-]+)*› contains two nested quantifiers: the question mark following the hyphen and the plus sign following the negated character class.
However, nesting quantifiers is not really what makes this dangerous, performance-wise. Rather, it’s that there are a potentially massive number of ways that the outer ‹*› quantifier can be combined with the inner quantifiers while attempting to match a string. If the regex engine fails to find --> at the end of a partial match (as is required when you plug this pattern segment into the comment-matching regex), the engine must try all possible repetition combinations before failing the match attempt and moving on. This number of options expands extremely rapidly with each additional character that the engine must try to match. However, there is nothing dangerous about the nested quantifiers if this situation is avoided. For example, the pattern ‹(?:-[^-]+)*› does not pose a risk even though it contains a nested ‹+› quantifier, because now that exactly one hyphen must be matched per repetition of the group, the potential number of backtracking points increases linearly with the length of the subject string.
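A rough way to observe the difference is to time both patterns on an unterminated comment. The following Python sketch makes several assumptions: the names RISKY, SAFE, and seconds_to_fail are invented here, the subject lengths are kept deliberately small, and the timings depend entirely on the machine and regex engine.

```python
import re
import time

# Optional hyphen inside the repeated group: many ways to split a run of text.
RISKY = re.compile(r"<!--(?:-?[^-]+)*--\s*>")
# Exactly one hyphen per repetition: only one way to split the text.
SAFE = re.compile(r"<!--[^-]*(?:-[^-]+)*--\s*>")

def seconds_to_fail(pattern, subject):
    """Time a search that is expected to find no match."""
    start = time.perf_counter()
    found = pattern.search(subject)
    elapsed = time.perf_counter() - start
    assert found is None, "the sample comment is never closed"
    return elapsed

# Lengths are kept small on purpose so the risky pattern finishes quickly.
for length in (10, 12, 14, 16):
    subject = "<!--" + "x" * length  # an unterminated comment
    print(length, seconds_to_fail(RISKY, subject), seconds_to_fail(SAFE, subject))
```

On a backtracking engine such as Python's re module, the time for the first pattern should grow much faster with each added character than the time for the second.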
Another way to avoid the potential backtracking problem we’ve just described is to use an atomic group. The following is equivalent to the first regex shown in this section, but it’s a few characters shorter and isn’t supported by JavaScript or by Python’s built-in re module before version 3.11:
<!--(?>-?[^-]+)*--\s*>
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby
See Recipe 2.14 for the details about how atomic groups (and their counterpart, possessive quantifiers) work.
HTML 4.01 officially used the XML comment rules we described earlier, but web browsers never paid much attention to the finer points. HTML5 comment syntax has two differences from XML, which bring it closer to what web browsers actually implement. First, whitespace is not allowed between the closing -- and >. Second, the text within comments is not allowed to start with > or -> (in web browsers, that ends the comment early).
Here are the HTML5 comment rules translated into regex:
<!--(?!-?>)[^-]*(?:-[^-]+)*-->
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Compared to the earlier regex for matching valid XML comments, this one doesn’t include ‹\s*› before the trailing ‹>›, and adds the negative lookahead ‹(?!-?>)› just after the opening ‹<!--›.
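The two HTML5 differences can be checked with a short Python sketch. The candidate strings are made up for this illustration, and fullmatch requires each one to be a complete comment.

```python
import re

HTML5_COMMENT = re.compile(r"<!--(?!-?>)[^-]*(?:-[^-]+)*-->")

candidates = [
    "<!-- comment -->",    # ordinary comment
    "<!---->",             # completely empty comment
    "<!-- comment -- >",   # whitespace before > (allowed in XML only)
    "<!--> comment -->",   # text starts with >
    "<!---> comment -->",  # text starts with ->
]

for text in candidates:
    verdict = "valid" if HTML5_COMMENT.fullmatch(text) else "invalid"
    print(f"{text!r}: {verdict}")
```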
Recipe 9.10 shows how to find specific words when they occur within XML-style comments.
Recipes , , and explain how to match various styles of single- and multiline programming language comments in source code.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.1 explains which special characters need to be escaped. Recipe 2.3 explains character classes. Recipe 2.4 explains that the dot matches any character. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.10 explains backreferences. Recipe 2.12 explains repetition. Recipe 2.13 explains how greedy and lazy quantifiers backtrack. Recipe 2.16 explains lookaround.