You want to match any HTML, XHTML, or XML tags in a string, in order to remove, modify, count, or otherwise deal with them.
The most appropriate solution depends on several factors, including the level of accuracy, efficiency, and tolerance for erroneous markup that is acceptable to you. Once you’ve determined the approach that works for your needs, there are any number of things you might want to do with the results. But whether you want to remove the tags, search within them, add or remove attributes, or replace them with alternative markup, the first step is to find them.
Be forewarned that this will be a long recipe, fraught with subtleties, exceptions, and variations. If you’re looking for a quick fix and are not willing to put in the effort to determine the best solution for your needs, you might want to jump to the “(X)HTML tags (loose)” section of this recipe, which offers a decent mix of tolerance versus precaution.
This first solution is simple and more commonly used
than you might expect, but it’s included here mostly for comparison
and for an examination of its flaws. It may be good enough when you
know exactly what type of content you’re dealing with and are not
overly concerned about the consequences of incorrect handling. This
regex matches a <
symbol, then simply continues
until the first >
occurs:
<[^>]*>
Regex options: None |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
This next regex is again rather simplistic and does not handle all cases correctly. However, it might work well for your needs if it will be used to process only snippets of valid (X)HTML. Its advantage over the previous regex is that it correctly passes over > characters that appear within attribute values:
<(?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: None |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
Here is the same regex, with added whitespace and comments for readability:
<
(?: [^>"']   # Non-quoted character
  | "[^"]*"  # Double-quoted attribute value
  | '[^']*'  # Single-quoted attribute value
)*
>
Regex options: Free-spacing |
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
The two regexes just shown work identically, so you can use whichever you prefer. JavaScripters are stuck with the first option unless using the XRegExp library, since standard JavaScript lacks a free-spacing option.
In addition to supporting >
characters embedded in attribute
values, this next regex emulates the lenient rules for (X)HTML tags
that browsers actually implement. This both improves accuracy with
poorly formed markup and lets the regex avoid content that does not
look like a tag, including comments, DOCTYPEs, and unencoded <
characters in text. To get these
improvements, two main changes are made. First, there is extra
handling that helps determine where attribute values start and end in
edge cases, such as when tags contain stray quote marks as part of an
unquoted attribute value or separate from any legit attribute. Second,
special handling is added for the tag name, including requiring the
name to begin with a letter A–Z. The tag name is captured to
backreference 1 in case you need to refer back to it:
</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)
Regex options: None |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
And in free-spacing mode:
<
/?                  # Permit closing tags
([A-Za-z][^\s>/]*)  # Capture the tag name to backreference 1
(?:                 # Attribute value branch:
  = \s*             #   Signals the start of an attribute value
  (?: "[^"]*"       #   Double-quoted attribute value
    | '[^']*'       #   Single-quoted attribute value
    | [^\s>]+       #   Unquoted attribute value
  )
|                   # Non-attribute-value branch:
  [^>]              #   Character outside of an attribute value
)*
(?:>|$)             # End of the tag or string
Regex options: Free-spacing |
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
The last two regexes work identically, although the latter cannot be used in JavaScript (without XRegExp), since it lacks a free-spacing option.
This regex is more complicated than those we’ve already seen in this recipe, because it actually follows the rules for (X)HTML tags explained in the introductory section of this chapter. This is not always desirable, since browsers don’t strictly adhere to these rules. In other words, this regex will avoid matching content that does not look like a valid (X)HTML tag, at the cost of possibly not matching some content that browsers would in fact interpret as a tag (e.g., if your markup uses an attribute name that includes characters not accounted for here, or if attributes are included in a closing tag). Both HTML and XHTML tag rules are handled together since it is common for their conventions to be mixed. The tag name is captured to backreference 1 or 2 (depending on whether it is an opening or closing tag), in case you need to refer back to it:
<(?:([A-Z][-:A-Z0-9]*)(?:\s+[A-Z][-:A-Z0-9]*(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^"'`=<>\s]+))?)*\s*/?|/([A-Z][-:A-Z0-9]*)\s*)>
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
To make it a little less cryptic, here is the same regex in free-spacing mode with comments:
<
(?:                      # Branch for opening tags:
  ([A-Z][-:A-Z0-9]*)     #   Capture the opening tag name to backreference 1
  (?:                    #   This group permits zero or more attributes
    \s+                  #     Whitespace to separate attributes
    [A-Z][-:A-Z0-9]*     #     Attribute name
    (?: \s*=\s*          #     Attribute name-value delimiter
      (?: "[^"]*"        #       Double-quoted attribute value
        | '[^']*'        #       Single-quoted attribute value
        | [^"'`=<>\s]+   #       Unquoted attribute value (HTML)
      )
    )?                   #     Permit attributes without a value (HTML)
  )*
  \s*                    #   Permit trailing whitespace
  /?                     #   Permit self-closed tags
|                        # Branch for closing tags:
  /
  ([A-Z][-:A-Z0-9]*)     #   Capture the closing tag name to backreference 2
  \s*                    #   Permit trailing whitespace
)
>
Regex options: Case insensitive, free-spacing |
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
XML is a precisely specified language, and requires that user agents strictly adhere to and enforce its rules. This is a stark change from HTML and the long-suffering browsers that process it. We’ve therefore included only a “strict” version for XML:
<(?:([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*(?:"[^"]*"|'[^']*'))*\s*/?|/([_:A-Z][-.:\w]*)\s*)>
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
Once again, here is the same regex in free-spacing mode with added comments:
<
(?:                    # Branch for opening tags:
  ([_:A-Z][-.:\w]*)    #   Capture the opening tag name to backreference 1
  (?:                  #   This group permits zero or more attributes
    \s+                #     Whitespace to separate attributes
    [_:A-Z][-.:\w]*    #     Attribute name
    \s*=\s*            #     Attribute name-value delimiter
    (?: "[^"]*"        #       Double-quoted attribute value
      | '[^']*'        #       Single-quoted attribute value
    )
  )*
  \s*                  #   Permit trailing whitespace
  /?                   #   Permit self-closed tags
|                      # Branch for closing tags:
  /
  ([_:A-Z][-.:\w]*)    #   Capture the closing tag name to backreference 2
  \s*                  #   Permit trailing whitespace
)
>
Regex options: Case insensitive, free-spacing |
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
Like the previous solution for (X)HTML tags, these regexes capture the tag name to backreference 1 or 2, depending on whether an opening or closing tag is matched. The XML tag regex is a little shorter than the (X)HTML version since it doesn’t have to deal with HTML-only syntax (minimized attributes and unquoted values). It also allows a wider range of characters to be used for element and attribute names.
Although it’s common to want to match XML-style tags using regular expressions, doing it safely requires balancing trade-offs and thinking carefully about the data you’re working with. Because of these difficulties, some people choose to forgo the use of regular expressions for any sort of XML or (X)HTML processing in favor of specialized parsers and APIs. That’s an approach you should seriously consider, since such tools are sometimes easier to use and typically include robust detection or handling for incorrect markup. In browser-land, for example, it’s usually best to take advantage of the tree-based Document Object Model (DOM) for your HTML search and manipulation needs. Elsewhere, you might be well-served by a SAX parser or XPath. However, you may occasionally find places where regex-based solutions make a lot of sense and work perfectly fine.
If you want to sterilize HTML from untrusted sources because you’re worried about specially crafted malicious HTML and cross-site scripting (XSS) attacks, your safest bet is to first convert all <, >, and & characters to their corresponding named character references (&lt;, &gt;, and &amp;), then bring back tags that are known to be safe (as long as they contain no attributes or only use those within a select list of approved attributes). For example, to bring back <p>, <em>, and <strong> tags with no attributes after replacing <, >, and & with character references, search case-insensitively using the regex ‹&lt;(/?)(p|em|strong)&gt;› and replace matches with «<$1$2>» (or in Python and Ruby, «<\1\2>»). If necessary, you can then safely search your modified string for HTML tags using the regexes in this recipe.
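As a rough illustration of the escape-then-restore approach just described, here is a minimal Python sketch. The function name `sanitize` and the sample input are hypothetical, chosen only for this example. It relies on the standard library's `html.escape` with `quote=False`, so that only <, >, and & are converted, and it restores nothing but attribute-free <p>, <em>, and <strong> tags.

```python
import html
import re

# Matches an escaped, attribute-free <p>, <em>, or <strong> tag
# (opening or closing), as suggested in the text above.
SAFE_TAG = re.compile(r"&lt;(/?)(p|em|strong)&gt;", re.IGNORECASE)


def sanitize(untrusted: str) -> str:
    """Escape all markup, then bring back a small whitelist of tags.

    `sanitize` is a hypothetical helper name used for illustration.
    """
    # Step 1: convert &, <, and > to named character references.
    escaped = html.escape(untrusted, quote=False)
    # Step 2: restore only the tags known to be safe.
    return SAFE_TAG.sub(r"<\1\2>", escaped)


print(sanitize("<p>Hi <EM>there</EM></p><script>alert(1)</script>"))
print(sanitize('<p class="x">'))
```

Because every & is escaped in the first step, text that already contains something like `&lt;p&gt;` becomes `&amp;lt;p&amp;gt;` and is therefore not mistaken for a tag to restore.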
With those disclaimers out of the way, let’s examine the regexes
we’ve already seen in this recipe. The first two solutions are overly
simplistic for most cases, but handle XML-style markup languages
equally. The latter three follow stricter rules and are tailored to
their respective markup languages. Even in the latter solutions,
however, HTML and XHTML tag conventions are handled together since
it’s common for them to be mixed, often inadvertently. For example, an
author may use an XHTML-style self-closing <br
/>
tag in an HTML4 document, or incorrectly use
an uppercase element name in a document with an XHTML DOCTYPE. HTML5
further blurs the distinction between HTML and XHTML syntax.
The advantage of this solution is its simplicity, which makes it easy to remember and type, and also fast to run. The trade-off is that it incorrectly handles certain valid and invalid XML and (X)HTML constructs. If you’re working with markup you wrote yourself and know that such cases will never appear in your subject text, or if you are not concerned about the consequences if they do, this trade-off might be OK. Another example of where this solution might be good enough is when you’re working with a text editor that lets you preview regex matches.
The regex starts off by finding a literal ‹<
› character (the start of a
tag). It then uses a negated character class and greedy asterisk
quantifier ‹[^>]*
› to
match zero or more following characters that are not >
. This takes care of matching the name
of the tag, attributes, and a leading or trailing /
. We could use a lazy quantifier (‹[^>]*?
›) instead, but that
wouldn’t change anything other than making the regex a tiny bit slower
since it would cause more backtracking (Recipe 2.13 explains why). To end the tag, the
regex then matches a literal ‹>
›.
If you prefer to use a dot instead of the negated character
class ‹[^>]
›, go for
it. A dot will work fine as long as you also use a lazy asterisk along
with it (‹.*?
›) and make
sure to enable the “dot matches line breaks” option (in JavaScript,
you could use ‹[\s\S]*?
›
instead). A dot with a greedy asterisk (making the full pattern
‹<.*>
›) would
change the regex’s meaning, causing it to incorrectly match from the
first <
until the very last >
in the subject string, even if the
regex has to swallow multiple tags along the way in order to do
so.
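The difference between these variations is easy to check for yourself. The following Python sketch, using a made-up subject string, compares the negated character class, the lazy dot, and the greedy dot. The `re.DOTALL` flag is Python's name for the “dot matches line breaks” option.

```python
import re

subject = "<b>bold</b> and <i>italic</i>"

# Negated character class with a greedy quantifier (the solution's form).
negated = re.findall(r"<[^>]*>", subject)

# Dot with a lazy quantifier: same matches as the negated class.
lazy = re.findall(r"<.*?>", subject, flags=re.DOTALL)

# Dot with a greedy quantifier: swallows everything from the first <
# to the very last > in the subject string.
greedy = re.findall(r"<.*>", subject, flags=re.DOTALL)

print(negated)
print(lazy)
print(greedy)
```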
It’s time for a few examples. The regex matches each of the following lines in full:
<div>
</div>
<div class="box">
<div id="pandoras-box" class="box" />
<!-- comment -->
<!DOCTYPE html>
<< < w00t! >
<>
Notice that the pattern matches more than just tags. Worse, it
will not correctly match the entire tags in the subject strings
<input type="button"
value=">>">
or <input type="button"
onclick="alert(2>1)">
. Instead, it will only match
until the first >
that appears
within the attribute values. It will have similar problems with
comments, XML CDATA sections, DOCTYPEs, code within <script>
elements, and anything else
that contains embedded >
symbols.
If you’re processing anything more than the most basic markup, especially if the subject text is coming from mixed or unknown sources, you will be better served by one of the more robust solutions further along in this recipe.
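To see the quick and dirty regex stop short inside an attribute value, consider this small Python sketch. The subject string is invented for the example and the constant name `QUICK_TAG` is just a label of convenience.

```python
import re

# The quick and dirty pattern: a <, anything that is not >, then a >.
QUICK_TAG = re.compile(r"<[^>]*>")

subject = '<p>Click <input type="button" value=">>"> now</p>'

matches = QUICK_TAG.findall(subject)
for match in matches:
    print(match)
```

The second match ends at the first > inside the `value` attribute, so the <input> tag is cut off rather than matched in full.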
Like the quick and dirty regex we’ve just described,
this next one is included primarily to contrast it with the later,
more robust solutions. Nevertheless, it covers the basics needed to
match XML-style tags, and thus it might work well for your needs if it
will be used to process snippets of valid markup that include only
elements and text. The difference from the last regex is that it
passes over >
characters that
appear within attribute values. For example, it will correctly match
the entire <input>
tags in
the example subject strings we’ve previously shown: <input type="button"
value=">>">
and <input type="button"
onclick="alert(2>1)">
.
As before, the regex uses literal angle bracket characters at
the edges of the regex to match the start and end of a tag. In
between, it repeats a noncapturing group containing three
alternatives, each separated by the ‹|
› alternation metacharacter.
The first alternative is the negated character class ‹[^>"']
›, which matches any
single character other than a right angle bracket (which closes the
tag), double quote, or single quote (both quote marks indicate the
start of an attribute value). This first alternative is responsible
for matching the tag and attribute names as well as any other
characters outside of quoted values. The order of the alternatives is
intentional, and written with performance in mind. Regular expression
engines attempt alternative paths through a regex from left to right,
and attempts at matching this first option will most likely succeed
more often than the alternatives for quoted values (especially since
it matches only one character at a time).
Next come the alternatives for matching double and single quoted
attribute values (‹"[^"]*"
› and ‹'[^']*'
›). Their use of negated character
classes allows them to continue matching past any included >
characters, line breaks, and anything
else that isn’t a closing quote mark.
Note that this solution has no special handling that allows it to exclude or properly match comments and other special nodes in your documents. Make sure you’re familiar with the kind of content you’re working with before putting this regex to use.
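A short Python sketch shows both the improvement and the remaining limitation. The constant name `ATTR_AWARE_TAG` and the comment example are assumptions made for illustration. The two <input> strings are the ones discussed above.

```python
import re

# Pass over quoted attribute values so that embedded > characters
# do not end the tag prematurely.
ATTR_AWARE_TAG = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")

inputs = [
    '<input type="button" value=">>">',
    '<input type="button" onclick="alert(2>1)">',
]
for tag in inputs:
    # fullmatch() succeeds only if the whole string is one match.
    print(bool(ATTR_AWARE_TAG.fullmatch(tag)), tag)

# Comments get no special treatment: the match stops at the first >
# that is outside of quotes, even though the comment continues.
print(ATTR_AWARE_TAG.findall("<!-- a > b --><p>"))
```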
Via a couple main changes, this regex gets a lot closer to emulating the easygoing rules that web browsers use to identify (X)HTML tags in source code. That makes it a good solution in cases where you’re trying to copy browser behavior or the HTML5 parsing algorithm and don’t care whether the tags you match actually follow all the rules for valid markup. Keep in mind that it’s still possible to create horrifically invalid HTML that this regex will not handle in the same way as one or more browsers, since browsers parse some edge cases of erroneous markup in their own, unique ways.
This regex’s most significant difference from the previous
solution is that it requires the character following the opening left
angle bracket (<
) to be a letter
A–Z or a–z, optionally preceded by /
(for closing tags). This constraint rules
out matching stray, unencoded <
characters in text, as well as comments, DOCTYPEs, XML declarations
and processing instructions, CDATA sections, and so on. That doesn’t
protect it from matching something that looks like a tag but is
actually within a comment, scripting language code, the content of a
<textarea>
element, or other
similar situation where text is treated literally. The upcoming
section, Skip Tricky (X)HTML and XML Sections, shows
a workaround for this issue. But first, let’s look at how this regex
works.
‹<
› starts off
the match with a literal left angle bracket. The ‹/?
› that follows allows an
optional forward slash, for closing tags. Next comes the capturing
group ‹([A-Za-z][^\s>/]*)
›, which matches the tag’s
name and remembers it as backreference 1. If you don’t need to refer
back to the tag name (e.g., if you’re simply removing all tags), you
can remove the capturing parentheses (just don’t get rid of the
pattern within them). Within the group are two character classes. The
first class, ‹[A-Za-z]
›,
matches the first character of the tag’s name. The second class,
‹[^\s>/]
›, allows
nearly any characters to follow as part of the name. The only
exceptions are whitespace (‹\s
›, which separates the tag name from any
following attributes), >
(which
ends the tag), and /
(used before
the closing >
for XHTML-style
singleton tags). Any other characters (even including quote marks and
the equals sign) are treated as part of the tag’s name. That might
seem a bit overly permissive, but it’s how most browsers operate.
Bogus tags might not have any effect on the way a page is rendered,
but they nevertheless become accessible via the DOM tree and are not
rendered as text, although any content within them will show
up.
After the tag name comes the attribute handling, which is
significantly changed from the previous solution in order to more
accurately emulate browser-style parsing of edge cases with poorly
formed markup. Since unencoded >
symbols end a tag unless they are within attribute values, it’s
important to accurately determine where attribute values start and
end. This is a bit tricky since it’s possible for stray quote marks
and equals signs to appear within a tag but separate from any
attribute value, or even as part of an unquoted attribute
value.
Consider a few examples. This regex matches each of the following lines in their entirety:
<em title=">">
<em !=">">
</em// em <em>
<em title=">"">
<em title=""em"> [18]
<em" title=">">
In each of the following lines, the regex matches only the portion that ends at the first >, leaving the trailing "> unmatched:
<em ">">
<em=">">
<em title==">"> [19]
<em title=em=">">
<em title= =">">
Keep in mind that the handling for these examples is specifically designed to match common browser behavior.
Getting back to the attribute handling, we come to the
noncapturing group ‹(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*
›.
There are two outermost alternatives here, separated by ‹|
›.
The first alternative, ‹=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)
›, is for
matching attribute values; the equals sign at the start signals their
onset. After the equals sign and optional whitespace (‹\s*
›), there is a nested
noncapturing group that includes three options: ‹"[^"]*"
› for double quoted
values, ‹'[^']*'
› for
single quoted values, and ‹[^\s>]+
› for unquoted values. The pattern for
unquoted values notably allows anything except whitespace or >
, even matching quote marks and equals
signs. This is more permissive than is officially allowed for valid
HTML, but follows browser behavior. Note that because the pattern for
unquoted values matches quote marks, it must appear last in the list
of options or the other two alternatives (for matching quoted values)
would never have a chance to match.
The second alternative in the outer group is simply ‹[^>]
›. This is used to match
(one character at a time) attribute names, the whitespace separating
attributes, the trailing /
symbol
for self-closed tags, and any other stray characters within the tag’s
boundaries. Because this character class matches equals signs (in
addition to almost everything else), it must be the latter option in
its containing group or else the alternative that matches attribute
values would never have a chance to participate.
Finally, we close out the regex with ‹(?:>|$)
›. This matches either the end of the
tag or, if it’s reached first, the end of the string.
By letting the match end successfully if the end of the string
is reached without finding the end of the tag, we’re emulating most
browsers’ behavior, but we’re also doing it to avoid potential runaway
backtracking (see Recipe 2.15). If we forced
the regex to backtrack (and ultimately fail to match) when there is no
tag-ending >
to be found, the
amount of backtracking that might be needed to try every possible
permutation of this regex’s medley of overlapping patterns and nested
repeating groups could create performance problems. However, the regex
as it’s written sidesteps this issue, and should always perform
efficiently.
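The behavior described in this section can be tried out with a few lines of Python. This is only a sketch: the helper name `find_tags` and the sample strings are hypothetical, while the pattern is the loose (X)HTML regex with its tag name captured to group 1.

```python
import re

# The loose (X)HTML tag regex; group 1 holds the tag name.
LOOSE_TAG = re.compile(
    r"""</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)"""
)


def find_tags(markup):
    """Return (full match, tag name) pairs for every loose tag match."""
    return [(m.group(0), m.group(1)) for m in LOOSE_TAG.finditer(markup)]


# A stray < in text and a comment are skipped; real tags are found.
print(find_tags('1 < 2 <!-- note --> <em title=">">hi</em>'))

# The second = starts an unquoted value, so the tag ends at the first >.
print(find_tags('<em title==">">'))

# With no closing > available, the match runs to the end of the string.
print(find_tags('text <div class="x'))
```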
The following regexes show how this pattern can be tweaked to match opening and singleton (self-closing) or closing tags only:
<([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)
Regex options: None |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
This version removes the ‹/?
› that appeared after the opening
‹<
›.
</([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)
Regex options: None |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
The forward slash after the opening ‹<
› has been made a
required part of the match here. Note that we are intentionally
allowing attributes inside closing tags, since this is based on
the “loose” solution. Although browsers don’t use attributes
that occur in closing tags, they don’t mind if such attributes
exist.
By saying that this solution is strict, we mean that it attempts to follow the HTML and XHTML syntax rules explained in the introductory section of this chapter, rather than emulating the rules browsers actually use when parsing the source code of a document. This strictness adds the following rules compared to the previous regexes:
Both tag and attribute names must start with a letter A–Z or a–z, and their names may only use the characters A–Z, a–z, 0–9, hyphen, and colon. In regex, that’s ‹^[A-Za-z][-:A-Za-z0-9]*$›.
Inappropriate, stray characters are not allowed after the tag name. Only whitespace, attributes (with or without an accompanying value), and optionally a trailing forward slash (/) may appear after the tag name.
Unquoted attribute values may not use the characters ", ', `, =, <, >, and whitespace. In regex, ‹^[^"'`=<>\s]+$›.
Closing tags cannot include attributes.
Since the pattern is split into two branches using alternation, the tag name is captured to either backreference 1 or 2, depending on what type of tag is matched. The first branch is for opening and singleton tags, and the second branch is for closing tags. Both sets of capturing parentheses may be removed if you have no need to refer back to the tag names.
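Here is a minimal Python sketch of how the two capturing groups might be used to tell opening and closing tags apart. The helper name `tag_names` and the sample markup are invented for illustration. `re.IGNORECASE` supplies the case insensitive option this regex needs.

```python
import re

# The strict (X)HTML tag regex. Group 1 is set for opening and
# singleton tags, group 2 for closing tags.
STRICT_HTML_TAG = re.compile(
    r"""<(?:([A-Z][-:A-Z0-9]*)(?:\s+[A-Z][-:A-Z0-9]*(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^"'`=<>\s]+))?)*\s*/?|/([A-Z][-:A-Z0-9]*)\s*)>""",
    re.IGNORECASE,
)


def tag_names(markup):
    """Return ("open" or "close", tag name) for each strict tag match."""
    found = []
    for m in STRICT_HTML_TAG.finditer(markup):
        if m.group(1) is not None:
            found.append(("open", m.group(1)))
        else:
            found.append(("close", m.group(2)))
    return found


# The closing tag with an attribute breaks the strict rules and is skipped.
sample = '<div class=box><br /></div foo="x"><p id="a" hidden></P >'
print(tag_names(sample))
```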
The two branches of the pattern are separated into their own regexes in the following modified versions. Both capture the tag name to backreference 1:
<([A-Z][-:A-Z0-9]*)(?:\s+[A-Z][-:A-Z0-9]*(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^"'`=<>\s]+))?)*\s*/?>
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
The ‹/?
›
that appears just before the closing ‹>
› is what allows this regex to match
both opening and singleton tags. Remove it to match opening tags
only. Remove just the question mark quantifier (making the
‹/
› required), and
it will match singleton tags only.
</([A-Z][-:A-Z0-9]*)\s*>
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
In the last couple of sections, we’ve shown how to get a
potential performance boost by adding atomic groups or possessive
quantifiers. The strictly defined paths through this regex (and the
adapted versions just shown) result in there being no potential to
match the same strings more than one way, and therefore having less
potential backtracking to worry about. These regexes don’t actually
rely on backtracking, so if you wanted to, you
could make every last one of their ‹*
›, ‹+
›, and
‹?
›
quantifiers possessive (or achieve the same effect using atomic
groups) and they would continue matching or failing to match
exactly the same strings with only slightly less backtracking along the
way. We’re therefore going to skip such variations for this (and the
next) regex, to try to keep the number of options in this recipe under
control.
See Skip Tricky (X)HTML and XML Sections for a
way to avoid matching tags within comments, <script>
tags, and so on.
XML precludes the need for a “loose” solution through its precise specification and requirement that conforming parsers do not process markup that is not well-formed. Although you could use one of the preceding regexes when processing XML documents, their simplicity won’t give you the advantage of actually providing a more reliable search, since there is no loose XML user agent behavior to emulate.
This regex is basically a simpler version of the “(X)HTML tags
(strict)” regex, since we’re able to remove support for two HTML
features that are not allowed in XML: unquoted attribute values and
minimized attributes (attributes without an accompanying value). The
only other difference is the characters that are allowed as part of
the tag and attribute names. In fact, the rules for XML names (which
govern the requirements for both tag and attribute names) are more
permissive than shown here, allowing hundreds of thousands of
additional Unicode characters. If you need to allow these characters
in your search, you can replace the three occurrences of ‹[_:A-Z][-.:\w]*
› with one of the
patterns found in Recipe 9.4. Note that the
list of characters allowed differs depending on the version of XML in
use.
As with the (X)HTML regexes, the tag name is captured to backreference 1 or 2, depending on whether an opening/singleton or closing tag is matched. And once again, you can remove the capturing parentheses if you don’t need to refer back to the tag names.
The two branches of the pattern are separated in the following modified regexes. As a result, both regexes capture the tag name to backreference 1:
<([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*(?:"[^"]*"|'[^']*'))*\s*/?>
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
The ‹/?
›
that appears just before the closing ‹>
› is what allows this regex to match
both opening and singleton tags. Remove it to match only opening
tags. Remove just the question mark quantifier, and it will
match only singleton tags.
</([_:A-Z][-.:\w]*)\s*>
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
See the next section, Skip Tricky (X)HTML and XML Sections, for a way to avoid matching tags within comments, CDATA sections, and DOCTYPEs.
When trying to match XML-style tags within a source file or string, much of the battle is avoiding content that looks like a tag, even though its placement or other context precludes it from being interpreted as a tag. The (X)HTML- and XML-specific regexes we’ve shown in this recipe avoid some problematic content by restricting the initial character of an element’s name. Some went even further, requiring tags to fulfill the (X)HTML or XML syntax rules. Still, a robust solution requires that we also avoid any content that appears within comments, scripting language code (which may use greater-than and less-than symbols for mathematical operations), XML CDATA sections, and various other constructs. We can solve this issue by first searching for these problematic sections, and then searching for tags only in the content outside of those matches.
Recipe 3.18 shows the code for searching between matches of another regex. It takes two patterns: an inner regex and outer regex. Any of the tag-matching regexes in this recipe can serve as the inner regex. The outer regex is shown next, with separate patterns for (X)HTML and XML. This approach hides the problematic sections from the inner regex’s view, and thereby lets us keep things relatively simple.
Instead of searching between matches of the outer regex, it
might be easier to simply remove all matches of the outer regex (i.e.,
replace matches with an empty string). You can then search for XML or
(X)HTML tags without worrying about skipping over tricky sections like
CDATA blocks and <script>
tags, since they’ve already been removed.
The following regex matches comments, CDATA sections,
and a number of special elements. Of the special elements, <script>
, <style>
, <textarea>
, <title>
, and <xmp>
[20] tags are matched together with their entire contents and
end tags. The <plaintext>
[21] element is also matched, and when found, the match
continues until the end of the string:
<!--.*?-->|<!\[CDATA\[.*?]]>|<(script|style|textarea|title|xmp)(?:[^>"']|"[^"]*"|'[^']*')*>.*?</\1\s*>|<plaintext(?:[^>"']|"[^"]*"|'[^']*')*>.*
Regex options: Case insensitive, dot matches line breaks |
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
In case that’s not the most readable line of code you’ve ever read, here is the regex again in free-spacing mode, with a few comments added:
# Comment
<!-- .*? -->
|
# CDATA section
<!\[CDATA\[ .*? ]]>
|
# Special element and its content
<( script | style | textarea | title | xmp )
    (?:[^>"']|"[^"]*"|'[^']*')*
>
.*?
</\1\s*>
|
# <plaintext/> continues until the end of the string
<plaintext
    (?:[^>"']|"[^"]*"|'[^']*')*
>
.*
Regex options: Case insensitive, dot matches line breaks, free-spacing |
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
Neither of the above regexes works correctly in JavaScript
without XRegExp, since standard JavaScript lacks both the “dot matches
line breaks” and “free-spacing” options. The following regex reverts
to being unreadable and replaces the dots with ‹[\s\S]
›
so it can be used in standard JavaScript:
<!--[\s\S]*?-->|<!\[CDATA\[[\s\S]*?]]>|<(script|style|textarea|title|xmp)(?:[^>"']|"[^"]*"|'[^']*')*>[\s\S]*?</\1\s*>|<plaintext(?:[^>"']|"[^"]*"|'[^']*')*>[\s\S]*
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
These regexes present a bit of a dilemma: because they match
<script>
, <style>
, <textarea>
, <title>
, <xmp>
, and <plaintext>
tags, those tags are never
matched by the second (inner) regex, even though we’re supposedly
searching for all tags. However, it should just be a matter of adding
a bit of extra procedural code to handle those tags specially, when
they are matched by the outer regex.
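The simpler alternative mentioned earlier in this section, removing every match of the outer regex before searching for tags, might look something like the following Python sketch. The function name `tags_outside_special_sections` and the sample markup are hypothetical. The inner regex used here is the loose (X)HTML pattern from this recipe, but any of the tag-matching regexes could stand in for it.

```python
import re

# Outer regex: comments, CDATA sections, and special elements.
SKIP_HTML = re.compile(
    r"""<!--.*?-->|<!\[CDATA\[.*?]]>|<(script|style|textarea|title|xmp)(?:[^>"']|"[^"]*"|'[^']*')*>.*?</\1\s*>|<plaintext(?:[^>"']|"[^"]*"|'[^']*')*>.*""",
    re.IGNORECASE | re.DOTALL,
)

# Inner regex: the loose (X)HTML tag pattern.
LOOSE_TAG = re.compile(
    r"""</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)"""
)


def tags_outside_special_sections(markup):
    """Strip tricky sections first, then collect the remaining tags."""
    visible = SKIP_HTML.sub("", markup)
    return [m.group(0) for m in LOOSE_TAG.finditer(visible)]


markup = (
    "<p>a</p><!-- <b>hidden</b> -->"
    "<script>if (a<b && c>d) {}</script><i>x</i>"
)

# Searching the raw markup picks up false positives such as "<b && c>".
naive = [m.group(0) for m in LOOSE_TAG.finditer(markup)]
print(naive)
print(tags_outside_special_sections(markup))
```

Note that this sketch discards the <script> element entirely, which is the dilemma described above. Handling such elements specially would require extra procedural code.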
This regex matches comments, CDATA sections, and
DOCTYPEs. Each of these cases is matched using a discrete pattern.
The patterns are combined into one regex using the ‹|
› alternation
metacharacter:
<!--.*?--\s*>|<!\[CDATA\[.*?]]>|<!DOCTYPE\s(?:[^<>"']|"[^"]*"|'[^']*'|<!(?:[^>"']|"[^"]*"|'[^']*')*>)*>
Regex options: Case insensitive, dot matches line breaks |
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
Here it is again in free-spacing mode:
# Comment
<!-- .*? --\s*>
|
# CDATA section
<!\[CDATA\[ .*? ]]>
|
# Document type declaration
<!DOCTYPE\s
    (?: [^<>"']   # Non-special character
      | "[^"]*"   # Double-quoted value
      | '[^']*'   # Single-quoted value
      | <!(?:[^>"']|"[^"]*"|'[^']*')*>  # Markup declaration
    )*
>
Regex options: Case insensitive, dot matches line breaks, free-spacing |
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
And here is a version that works in standard JavaScript (which lacks the “dot matches line breaks” and “free-spacing” options):
<!--[\s\S]*?--\s*>|<!\[CDATA\[[\s\S]*?]]>|<!DOCTYPE\s(?:[^<>"']|"[^"]*"|'[^']*'|<!(?:[^>"']|"[^"]*"|'[^']*')*>)*>
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
The regexes just shown allow whitespace via ‹\s*
› between the closing
--
and >
of XML comments. This differs from
the version shown earlier, because HTML5
and web browsers differ from XML on this point. See Find valid HTML comments for
a discussion of the differences between valid XML and HTML
comments.
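For XML, the same strip-then-search idea can be sketched in Python as follows. The helper name `xml_tags` and the sample document are assumptions made for this example. The outer pattern is the XML regex shown above, and the inner pattern is the strict XML tag regex from this recipe.

```python
import re

# Outer regex: XML comments, CDATA sections, and DOCTYPEs.
SKIP_XML = re.compile(
    r"""<!--.*?--\s*>|<!\[CDATA\[.*?]]>|<!DOCTYPE\s(?:[^<>"']|"[^"]*"|'[^']*'|<!(?:[^>"']|"[^"]*"|'[^']*')*>)*>""",
    re.IGNORECASE | re.DOTALL,
)

# Inner regex: the strict XML tag pattern.
XML_TAG = re.compile(
    r"""<(?:([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*(?:"[^"]*"|'[^']*'))*\s*/?|/([_:A-Z][-.:\w]*)\s*)>""",
    re.IGNORECASE,
)


def xml_tags(document):
    """Remove tricky sections, then return the tags that remain."""
    visible = SKIP_XML.sub("", document)
    return [m.group(0) for m in XML_TAG.finditer(visible)]


doc = (
    '<!DOCTYPE note [<!ENTITY a "x>y">]>'
    "<note><![CDATA[<fake>]]><!-- <gone/> -- ></note>"
)

# Without the outer regex, <fake> and <gone/> are matched as tags.
raw = [m.group(0) for m in XML_TAG.finditer(doc)]
print(raw)
print(xml_tags(doc))
```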
Matching any and all tags can be useful, but it’s also common to want to match a specific one or a few out of the bunch; Recipe 9.2 shows how to pull off these tasks. Recipe 9.3 describes how to match all except a select list of tags.
Recipe 9.4 details the characters that can be used in valid XML element and attribute names.
Recipe 9.7 shows how to find tags that contain a specific attribute. Recipe 9.8 finds tags that do not contain a specific attribute.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.1 explains which special characters need to be escaped. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.10 explains backreferences. Recipe 2.12 explains repetition. Recipe 2.13 explains how greedy and lazy quantifiers backtrack. Recipe 2.14 explains possessive quantifiers and atomic groups. Recipe 2.16 explains lookaround.
[18] The title attribute’s value is the empty string, not em.
[19] The title attribute’s value is =", not >. The second equals sign triggers the start of an unquoted value.
[20] <xmp> is a little-known but widely supported element similar to <pre>. Like <pre>, it preserves all whitespace and uses a fixed-width font by default, but it goes one step further and displays all of its contents (including HTML tags) as plain text. <xmp> was deprecated in HTML 3.2, and removed entirely from HTML 4.0.
[21] <plaintext> is like <xmp> except that it cannot be turned off by an end tag and runs until the very end of the document. Also like <xmp>, it was obsoleted in HTML 4.0 but remains widely supported.