9.1. Find XML-Style Tags

Problem

You want to match any HTML, XHTML, or XML tags in a string, in order to remove, modify, count, or otherwise deal with them.

Solution

The most appropriate solution depends on several factors, including the level of accuracy, efficiency, and tolerance for erroneous markup that is acceptable to you. Once you’ve determined the approach that works for your needs, there are any number of things you might want to do with the results. But whether you want to remove the tags, search within them, add or remove attributes, or replace them with alternative markup, the first step is to find them.

Be forewarned that this will be a long recipe, fraught with subtleties, exceptions, and variations. If you’re looking for a quick fix and are not willing to put in the effort to determine the best solution for your needs, you might want to jump to the “(X)HTML tags (loose)” section of this recipe, which offers a decent mix of tolerance versus precaution.

Quick and dirty

This first solution is simple and more commonly used than you might expect, but it’s included here mostly for comparison and for an examination of its flaws. It may be good enough when you know exactly what type of content you’re dealing with and are not overly concerned about the consequences of incorrect handling. This regex matches a < symbol, then simply continues until the first > occurs:

<[^>]*>
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Allow > in attribute values

This next regex is again rather simplistic and does not handle all cases correctly. However, it might work well for your needs if it will be used to process only snippets of valid (X)HTML. Its advantage over the previous regex is that it correctly passes over > characters that appear within attribute values:

<(?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Here is the same regex, with added whitespace and comments for readability:

<
(?: [^>"']   # Non-quoted character
  | "[^"]*"  # Double-quoted attribute value
  | '[^']*'  # Single-quoted attribute value
)*
>
Regex options: Free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

The two regexes just shown work identically, so you can use whichever you prefer. JavaScripters are stuck with the first option unless using the XRegExp library, since standard JavaScript lacks a free-spacing option.

(X)HTML tags (loose)

In addition to supporting > characters embedded in attribute values, this next regex emulates the lenient rules for (X)HTML tags that browsers actually implement. This both improves accuracy with poorly formed markup and lets the regex avoid content that does not look like a tag, including comments, DOCTYPEs, and unencoded < characters in text. To get these improvements, two main changes are made. First, there is extra handling that helps determine where attribute values start and end in edge cases, such as when tags contain stray quote marks as part of an unquoted attribute value or separate from any legit attribute. Second, special handling is added for the tag name, including requiring the name to begin with a letter A–Z. The tag name is captured to backreference 1 in case you need to refer back to it:

</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

And in free-spacing mode:

<
/?                  # Permit closing tags
([A-Za-z][^\s>/]*)  # Capture the tag name to backreference 1
(?:                 # Attribute value branch:
  = \s*             #   Signals the start of an attribute value
  (?: "[^"]*"       #   Double-quoted attribute value
    | '[^']*'       #   Single-quoted attribute value
    | [^\s>]+       #   Unquoted attribute value
  )
|                   # Non-attribute-value branch:
  [^>]              #   Character outside of an attribute value
)*
(?:>|$)             # End of the tag or string
Regex options: Free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

The last two regexes work identically, although the latter cannot be used in JavaScript (without XRegExp), since standard JavaScript lacks a free-spacing option.

(X)HTML tags (strict)

This regex is more complicated than those we’ve already seen in this recipe, because it actually follows the rules for (X)HTML tags explained in the introductory section of this chapter. This is not always desirable, since browsers don’t strictly adhere to these rules. In other words, this regex will avoid matching content that does not look like a valid (X)HTML tag, at the cost of possibly not matching some content that browsers would in fact interpret as a tag (e.g., if your markup uses an attribute name that includes characters not accounted for here, or if attributes are included in a closing tag). Both HTML and XHTML tag rules are handled together since it is common for their conventions to be mixed. The tag name is captured to backreference 1 or 2 (depending on whether it is an opening or closing tag), in case you need to refer back to it:

<(?:([A-Z][-:A-Z0-9]*)(?:\s+[A-Z][-:A-Z0-9]*(?:\s*=\s*(?:"[^"]*"|↵
'[^']*'|[^"'`=<>\s]+))?)*\s*/?|/([A-Z][-:A-Z0-9]*)\s*)>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

To make it a little less cryptic, here is the same regex in free-spacing mode with comments:

<
(?:                    # Branch for opening tags:
  ([A-Z][-:A-Z0-9]*)   #   Capture the opening tag name to backreference 1
  (?:                  #   This group permits zero or more attributes
    \s+                #   Whitespace to separate attributes
    [A-Z][-:A-Z0-9]*   #   Attribute name
    (?: \s*=\s*        #   Attribute name-value delimiter
      (?: "[^"]*"      #   Double-quoted attribute value
        | '[^']*'      #   Single-quoted attribute value
        | [^"'`=<>\s]+ #   Unquoted attribute value (HTML)
      )
    )?                 #   Permit attributes without a value (HTML)
  )*
  \s*                  #   Permit trailing whitespace
  /?                   #   Permit self-closed tags
|                      # Branch for closing tags:
  /
  ([A-Z][-:A-Z0-9]*)   #   Capture the closing tag name to backreference 2
  \s*                  #   Permit trailing whitespace
)
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

XML tags (strict)

XML is a precisely specified language, and requires that user agents strictly adhere to and enforce its rules. This is a stark change from HTML and the long-suffering browsers that process it. We’ve therefore included only a “strict” version for XML:

<(?:([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*(?:"[^"]*"|'[^']*'))*\s*↵
/?|/([_:A-Z][-.:\w]*)\s*)>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Once again, here is the same regex in free-spacing mode with added comments:

<
(?:                  # Branch for opening tags:
  ([_:A-Z][-.:\w]*)  #   Capture the opening tag name to backreference 1
  (?:                #   This group permits zero or more attributes
    \s+              #   Whitespace to separate attributes
    [_:A-Z][-.:\w]*  #   Attribute name
    \s*=\s*          #   Attribute name-value delimiter
    (?: "[^"]*"      #   Double-quoted attribute value
      | '[^']*'      #   Single-quoted attribute value
    )
  )*
  \s*                #   Permit trailing whitespace
  /?                 #   Permit self-closed tags
|                    # Branch for closing tags:
  /
  ([_:A-Z][-.:\w]*)  #   Capture the closing tag name to backreference 2
  \s*                #   Permit trailing whitespace
)
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Like the previous solution for (X)HTML tags, these regexes capture the tag name to backreference 1 or 2, depending on whether an opening or closing tag is matched. The XML tag regex is a little shorter than the (X)HTML version since it doesn’t have to deal with HTML-only syntax (minimized attributes and unquoted values). It also allows a wider range of characters to be used for element and attribute names.

Discussion

A few words of caution

Although it’s common to want to match XML-style tags using regular expressions, doing it safely requires balancing trade-offs and thinking carefully about the data you’re working with. Because of these difficulties, some people choose to forgo the use of regular expressions for any sort of XML or (X)HTML processing in favor of specialized parsers and APIs. That’s an approach you should seriously consider, since such tools are sometimes easier to use and typically include robust detection or handling for incorrect markup. In browser-land, for example, it’s usually best to take advantage of the tree-based Document Object Model (DOM) for your HTML search and manipulation needs. Elsewhere, you might be well-served by a SAX parser or XPath. However, you may occasionally find places where regex-based solutions make a lot of sense and work perfectly fine.

Tip

If you want to sterilize HTML from untrusted sources because you’re worried about specially-crafted malicious HTML and cross-site scripting (XSS) attacks, your safest bet is to first convert all <, >, and & characters to their corresponding named character references (&lt;, &gt;, and &amp;), then bring back tags that are known to be safe (as long as they contain no attributes or only use those within a select list of approved attributes). For example, to bring back <p>, <em>, and <strong> tags with no attributes after replacing <, >, and & with character references, search case-insensitively using the regex &lt;(/?)(p|em|strong)&gt; and replace matches with «<$1$2>» (or in Python and Ruby, «<\1\2>»). If necessary, you can then safely search your modified string for HTML tags using the regexes in this recipe.
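
As a rough illustration of this escape-then-restore approach, here is a minimal Python sketch. The names sanitize and SAFE_TAG are our own illustrative choices, and the whitelist covers only the three attribute-free tags mentioned above, so treat it as a starting point rather than a vetted sanitizer:

```python
import html
import re

# Hypothetical whitelist for this sketch: <p>, <em>, <strong>, no attributes.
SAFE_TAG = re.compile(r"&lt;(/?)(p|em|strong)&gt;", re.IGNORECASE)

def sanitize(untrusted: str) -> str:
    # Step 1: convert &, <, and > to named character references.
    escaped = html.escape(untrusted, quote=False)
    # Step 2: bring back only the whitelisted, attribute-free tags.
    return SAFE_TAG.sub(r"<\1\2>", escaped)

sample = "<p>Hi <em>there</em><script>alert(1)</script></p>"
result = sanitize(sample)
print(result)
# <p>Hi <em>there</em>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```

Because the ampersand is escaped in step 1, text that already contained a literal &lt;p&gt; becomes &amp;lt;p&amp;gt; and is not mistaken for a tag in step 2.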

With those disclaimers out of the way, let’s examine the regexes we’ve already seen in this recipe. The first two solutions are overly simplistic for most cases, but handle XML-style markup languages equally. The latter three follow stricter rules and are tailored to their respective markup languages. Even in the latter solutions, however, HTML and XHTML tag conventions are handled together since it’s common for them to be mixed, often inadvertently. For example, an author may use an XHTML-style self-closing <br /> tag in an HTML4 document, or incorrectly use an uppercase element name in a document with an XHTML DOCTYPE. HTML5 further blurs the distinction between HTML and XHTML syntax.

Quick and dirty

The advantage of this solution is its simplicity, which makes it easy to remember and type, and also fast to run. The trade-off is that it incorrectly handles certain valid and invalid XML and (X)HTML constructs. If you’re working with markup you wrote yourself and know that such cases will never appear in your subject text, or if you are not concerned about the consequences if they do, this trade-off might be OK. Another example of where this solution might be good enough is when you’re working with a text editor that lets you preview regex matches.

The regex starts off by finding a literal < character (the start of a tag). It then uses a negated character class and greedy asterisk quantifier [^>]* to match zero or more following characters that are not >. This takes care of matching the name of the tag, attributes, and a leading or trailing /. We could use a lazy quantifier ([^>]*?) instead, but that wouldn’t change anything other than making the regex a tiny bit slower since it would cause more backtracking (Recipe 2.13 explains why). To end the tag, the regex then matches a literal >.

If you prefer to use a dot instead of the negated character class [^>], go for it. A dot will work fine as long as you also use a lazy asterisk along with it (.*?) and make sure to enable the “dot matches line breaks” option (in JavaScript, you could use [\s\S]*? instead). A dot with a greedy asterisk (making the full pattern <.*>) would change the regex’s meaning, causing it to incorrectly match from the first < until the very last > in the subject string, even if the regex has to swallow multiple tags along the way in order to do so.
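
The difference between these three variations is easy to see in a short Python experiment. The subject string is an arbitrary example of ours:

```python
import re

subject = "<b>bold</b> and <i>italic</i>"

# Negated character class: stops at the first > after each <.
negated = re.findall(r"<[^>]*>", subject)

# Lazy dot with "dot matches line breaks": same matches as above.
lazy_dot = re.findall(r"<.*?>", subject, flags=re.DOTALL)

# Greedy dot: runs from the first < to the very last >.
greedy_dot = re.findall(r"<.*>", subject, flags=re.DOTALL)

print(negated)     # ['<b>', '</b>', '<i>', '</i>']
print(lazy_dot)    # ['<b>', '</b>', '<i>', '</i>']
print(greedy_dot)  # ['<b>bold</b> and <i>italic</i>']
```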

It’s time for a few examples. The regex matches each of the following lines in full:

  • <div>

  • </div>

  • <div class="box">

  • <div id="pandoras-box" class="box" />

  • <!-- comment -->

  • <!DOCTYPE html>

  • << < w00t! >

  • <>

Notice that the pattern matches more than just tags. Worse, it will not correctly match the entire tags in the subject strings <input type="button" value=">>"> or <input type="button" onclick="alert(2>1)">. Instead, it will only match until the first > that appears within the attribute values. It will have similar problems with comments, XML CDATA sections, DOCTYPEs, code within <script> elements, and anything else that contains embedded > symbols.

If you’re processing anything more than the most basic markup, especially if the subject text is coming from mixed or unknown sources, you will be better served by one of the more robust solutions further along in this recipe.
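
To make these strengths and weaknesses concrete, the following Python sketch runs the quick and dirty regex against a few of the subject strings discussed above. The variable names are ours:

```python
import re

QUICK_TAG = re.compile(r"<[^>]*>")

# Simple, well-behaved markup: every tag is found.
simple = QUICK_TAG.findall('<div class="box">text</div>')

# Content that is not a tag is matched too.
non_tags = QUICK_TAG.findall("<!-- comment --> <> <!DOCTYPE html>")

# A > inside an attribute value cuts the match short.
truncated = QUICK_TAG.search('<input type="button" value=">>">').group(0)

print(simple)     # ['<div class="box">', '</div>']
print(non_tags)   # ['<!-- comment -->', '<>', '<!DOCTYPE html>']
print(truncated)  # <input type="button" value=">
```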

Allow > in attribute values

Like the quick and dirty regex we’ve just described, this next one is included primarily to contrast it with the later, more robust solutions. Nevertheless, it covers the basics needed to match XML-style tags, and thus it might work well for your needs if it will be used to process snippets of valid markup that include only elements and text. The difference from the last regex is that it passes over > characters that appear within attribute values. For example, it will correctly match the entire <input> tags in the example subject strings we’ve previously shown: <input type="button" value=">>"> and <input type="button" onclick="alert(2>1)">.

As before, the regex uses literal angle bracket characters at the edges of the regex to match the start and end of a tag. In between, it repeats a noncapturing group containing three alternatives, each separated by the | alternation metacharacter.

The first alternative is the negated character class [^>"'], which matches any single character other than a right angle bracket (which closes the tag), double quote, or single quote (both quote marks indicate the start of an attribute value). This first alternative is responsible for matching the tag and attribute names as well as any other characters outside of quoted values. The order of the alternatives is intentional, and written with performance in mind. Regular expression engines attempt alternative paths through a regex from left to right, and attempts at matching this first option will most likely succeed more often than the alternatives for quoted values (especially since it matches only one character at a time).

Next come the alternatives for matching double and single quoted attribute values ("[^"]*" and '[^']*'). Their use of negated character classes allows them to continue matching past any included > characters, line breaks, and anything else that isn’t a closing quote mark.

Note that this solution has no special handling that allows it to exclude or properly match comments and other special nodes in your documents. Make sure you’re familiar with the kind of content you’re working with before putting this regex to use.
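
Here is a small Python check of both points: the regex matches the two <input> examples in full, but it still has no idea what a comment is. The comment string is an example of ours:

```python
import re

ATTR_AWARE_TAG = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")

subjects = [
    '<input type="button" value=">>">',
    '<input type="button" onclick="alert(2>1)">',
]
# Each subject is matched from its first character to its last.
whole = [ATTR_AWARE_TAG.fullmatch(s) is not None for s in subjects]

# No special handling for comments: the match ends at the first
# unquoted > rather than at the closing -->.
comment = ATTR_AWARE_TAG.search("<!-- a > b -->").group(0)

print(whole)    # [True, True]
print(comment)  # <!-- a >
```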

(X)HTML tags (loose)

Via a couple of main changes, this regex gets a lot closer to emulating the easygoing rules that web browsers use to identify (X)HTML tags in source code. That makes it a good solution in cases where you’re trying to copy browser behavior or the HTML5 parsing algorithm and don’t care whether the tags you match actually follow all the rules for valid markup. Keep in mind that it’s still possible to create horrifically invalid HTML that this regex will not handle in the same way as one or more browsers, since browsers parse some edge cases of erroneous markup in their own, unique ways.

This regex’s most significant difference from the previous solution is that it requires the character following the opening left angle bracket (<) to be a letter A–Z or a–z, optionally preceded by / (for closing tags). This constraint rules out matching stray, unencoded < characters in text, as well as comments, DOCTYPEs, XML declarations and processing instructions, CDATA sections, and so on. That doesn’t protect it from matching something that looks like a tag but is actually within a comment, scripting language code, the content of a <textarea> element, or other similar situation where text is treated literally. The upcoming section, Skip Tricky (X)HTML and XML Sections, shows a workaround for this issue. But first, let’s look at how this regex works.

< starts off the match with a literal left angle bracket. The /? that follows allows an optional forward slash, for closing tags. Next comes the capturing group ([A-Za-z][^\s>/]*), which matches the tag’s name and remembers it as backreference 1. If you don’t need to refer back to the tag name (e.g., if you’re simply removing all tags), you can remove the capturing parentheses (just don’t get rid of the pattern within them). Within the group are two character classes. The first class, [A-Za-z], matches the first character of the tag’s name. The second class, [^\s>/], allows nearly any characters to follow as part of the name. The only exceptions are whitespace (\s, which separates the tag name from any following attributes), > (which ends the tag), and / (used before the closing > for XHTML-style singleton tags). Any other characters (even including quote marks and the equals sign) are treated as part of the tag’s name. That might seem a bit overly permissive, but it’s how most browsers operate. Bogus tags might not have any effect on the way a page is rendered, but they nevertheless become accessible via the DOM tree and are not rendered as text, although any content within them will show up.

After the tag name comes the attribute handling, which is significantly changed from the previous solution in order to more accurately emulate browser-style parsing of edge cases with poorly formed markup. Since unencoded > symbols end a tag unless they are within attribute values, it’s important to accurately determine where attribute values start and end. This is a bit tricky since it’s possible for stray quote marks and equals signs to appear within a tag but separate from any attribute value, or even as part of an unquoted attribute value.

Consider a few examples. This regex matches each of the following lines in their entirety:

  • <em title=">">

  • <em !=">">

  • </em// em <em>

  • <em title=">"">

  • <em title=""em"> [18]

  • <em" title=">">

The regex matches only the first part of each of the following lines, up to and including the first > (the trailing space, quote mark, and final > are left unmatched):

  • <em "> ">

  • <em="> ">

  • <em title=="> "> [19]

  • <em title=em="> ">

  • <em title= ="> ">

Keep in mind that the handling for these examples is specifically designed to match common browser behavior.

Getting back to the attribute handling, we come to the noncapturing group (?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*. There are two outermost alternatives here, separated by |.

The first alternative, =\s*(?:"[^"]*"|'[^']*'|[^\s>]+), is for matching attribute values; the equals sign at the start signals their onset. After the equals sign and optional whitespace (\s*), there is a nested noncapturing group that includes three options: "[^"]*" for double quoted values, '[^']*' for single quoted values, and [^\s>]+ for unquoted values. The pattern for unquoted values notably allows anything except whitespace or >, even matching quote marks and equals signs. This is more permissive than is officially allowed for valid HTML, but follows browser behavior. Note that because the pattern for unquoted values matches quote marks, it must appear last in the list of options or the other two alternatives (for matching quoted values) would never have a chance to match.

The second alternative in the outer group is simply [^>]. This is used to match (one character at a time) attribute names, the whitespace separating attributes, the trailing / symbol for self-closed tags, and any other stray characters within the tag’s boundaries. Because this character class matches equals signs (in addition to almost everything else), it must be the latter option in its containing group or else the alternative that matches attribute values would never have a chance to participate.

Finally, we close out the regex with (?:>|$). This matches either the end of the tag or, if it’s reached first, the end of the string.

By letting the match end successfully if the end of the string is reached without finding the end of the tag, we’re emulating most browsers’ behavior, but we’re also doing it to avoid potential runaway backtracking (see Recipe 2.15). If we forced the regex to backtrack (and ultimately fail to match) when there is no tag-ending > to be found, the amount of backtracking that might be needed to try every possible permutation of this regex’s medley of overlapping patterns and nested repeating groups could create performance problems. However, the regex as it’s written sidesteps this issue, and should always perform efficiently.
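
The behavior described in this section can be checked with a few lines of Python. The helper name first_tag is hypothetical, and the last two subject strings are our own additions to the examples listed earlier:

```python
import re

LOOSE_TAG = re.compile(
    r"""</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)"""
)

def first_tag(subject):
    """Return (matched text, tag name) for the first tag found, or None."""
    m = LOOSE_TAG.search(subject)
    return (m.group(0), m.group(1)) if m else None

# Matched in full; the > inside the quoted value does not end the tag.
print(first_tag('<em title=">">'))    # ('<em title=">">', 'em')
# The stray quote mark becomes part of the tag name.
print(first_tag('<em" title=">">'))   # ('<em" title=">">', 'em"')
# The second = starts an unquoted value, so the first > ends the tag.
print(first_tag('<em title=="> ">'))  # ('<em title==">', 'em')
# No closing >: the match ends successfully at the end of the string.
print(first_tag('<em title="oops'))   # ('<em title="oops', 'em')
# Stray < characters and comments are not treated as tags.
print(first_tag("1 < 2 <!-- x -->"))  # None
```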

The following regexes show how this pattern can be tweaked to match opening and singleton (self-closing) or closing tags only:

Opening and singleton tags only
<([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

This version removes the /? that appeared after the opening <.

Closing tags only
</([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

The forward slash after the opening < has been made a required part of the match here. Note that we are intentionally allowing attributes inside closing tags, since this is based on the “loose” solution. Although browsers don’t use attributes that occur in closing tags, they don’t mind if such attributes exist.

(X)HTML tags (strict)

By saying that this solution is strict, we mean that it attempts to follow the HTML and XHTML syntax rules explained in the introductory section of this chapter, rather than emulating the rules browsers actually use when parsing the source code of a document. This strictness adds the following rules compared to the previous regexes:

  • Both tag and attribute names must start with a letter A–Z or a–z, and their names may only use the characters A–Z, a–z, 0–9, hyphen, and colon. In regex, that’s ^[A-Za-z][-:A-Za-z0-9]*$.

  • Inappropriate, stray characters are not allowed after the tag name. Only whitespace, attributes (with or without an accompanying value), and optionally a trailing forward slash (/) may appear after the tag name.

  • Unquoted attribute values may not use the characters ", ', `, =, <, >, and whitespace. In regex, ^[^"'`=<>\s]+$.

  • Closing tags cannot include attributes.

Since the pattern is split into two branches using alternation, the tag name is captured to either backreference 1 or 2, depending on what type of tag is matched. The first branch is for opening and singleton tags, and the second branch is for closing tags. Both sets of capturing parentheses may be removed if you have no need to refer back to the tag names.
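
In code, the two-group arrangement means you check which group participated in the match. The following Python sketch shows one way to do that; the function name describe and the sample strings are our own:

```python
import re

STRICT_HTML_TAG = re.compile(
    r"""<(?:([A-Z][-:A-Z0-9]*)"""                          # opening tag name
    r"""(?:\s+[A-Z][-:A-Z0-9]*"""                          # attribute name
    r"""(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^"'`=<>\s]+))?)*"""  # optional value
    r"""\s*/?|/([A-Z][-:A-Z0-9]*)\s*)>""",                 # or a closing tag
    re.IGNORECASE,
)

def describe(markup):
    """List (tag name, is_closing) pairs for every strictly valid tag."""
    return [
        # Group 1 is set for opening and singleton tags, group 2 for
        # closing tags; exactly one of them participates in each match.
        (m.group(1) or m.group(2), m.group(2) is not None)
        for m in STRICT_HTML_TAG.finditer(markup)
    ]

print(describe('<div id=main class="box"><br /></div>'))
# [('div', False), ('br', False), ('div', True)]

# Attributes in a closing tag and stray quote marks are rejected.
print(describe('</div class="x"> <em" title="x">'))
# []
```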

The two branches of the pattern are separated into their own regexes in the following modified versions. Both capture the tag name to backreference 1:

Opening and singleton tags only
<([A-Z][-:A-Z0-9]*)(?:\s+[A-Z][-:A-Z0-9]*(?:\s*=\s*↵
(?:"[^"]*"|'[^']*'|[^"'`=<>\s]+))?)*\s*/?>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

The /? that appears just before the closing > is what allows this regex to match both opening and singleton tags. Remove it to match opening tags only. Remove just the question mark quantifier (making the / required), and it will match singleton tags only.

Closing tags only
</([A-Z][-:A-Z0-9]*)\s*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

In the last couple of sections, we’ve shown how to get a potential performance boost by adding atomic groups or possessive quantifiers. The strictly defined paths through this regex (and the adapted versions just shown) result in there being no potential to match the same strings more than one way, and therefore having less potential backtracking to worry about. These regexes don’t actually rely on backtracking, so if you wanted to, you could make every last one of their *, +, and ? quantifiers possessive (or achieve the same effect using atomic groups) and they would continue matching or failing to match exactly the same strings with only slightly less backtracking along the way. We’re therefore going to skip such variations for this (and the next) regex, to try to keep the number of options in this recipe under control.

See Skip Tricky (X)HTML and XML Sections for a way to avoid matching tags within comments, <script> tags, and so on.

XML tags (strict)

XML precludes the need for a “loose” solution through its precise specification and requirement that conforming parsers do not process markup that is not well-formed. Although you could use one of the preceding regexes when processing XML documents, their simplicity won’t give you the advantage of actually providing a more reliable search, since there is no loose XML user agent behavior to emulate.

This regex is basically a simpler version of the “(X)HTML tags (strict)” regex, since we’re able to remove support for two HTML features that are not allowed in XML: unquoted attribute values and minimized attributes (attributes without an accompanying value). The only other difference is the characters that are allowed as part of the tag and attribute names. In fact, the rules for XML names (which govern the requirements for both tag and attribute names) are more permissive than shown here, allowing hundreds of thousands of additional Unicode characters. If you need to allow these characters in your search, you can replace the three occurrences of [_:A-Z][-.:\w]* with one of the patterns found in Recipe 9.4. Note that the list of characters allowed differs depending on the version of XML in use.

As with the (X)HTML regexes, the tag name is captured to backreference 1 or 2, depending on whether an opening/singleton or closing tag is matched. And once again, you can remove the capturing parentheses if you don’t need to refer back to the tag names.

The two branches of the pattern are separated in the following modified regexes. As a result, both regexes capture the tag name to backreference 1:

Opening and singleton tags only
<([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*↵
(?:"[^"]*"|'[^']*'))*\s*/?>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

The /? that appears just before the closing > is what allows this regex to match both opening and singleton tags. Remove it to match only opening tags. Remove just the question mark quantifier, and it will match only singleton tags.

Closing tags only
</([_:A-Z][-.:\w]*)\s*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

See the next section, Skip Tricky (X)HTML and XML Sections, for a way to avoid matching tags within comments, CDATA sections, and DOCTYPEs.
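
Since Python supports free-spacing mode, the commented version of the combined XML regex can be used as is. The sketch below does so on a sample document of our own, in which one tag uses an HTML-style minimized attribute and is therefore skipped:

```python
import re

XML_TAG = re.compile(
    r"""
    <
    (?:                      # Branch for opening tags:
      ([_:A-Z][-.:\w]*)      #   Tag name, captured to group 1
      (?:                    #   Zero or more attributes
        \s+ [_:A-Z][-.:\w]*  #   Attribute name
        \s*=\s*              #   Attribute name-value delimiter
        (?: "[^"]*" | '[^']*' )
      )*
      \s* /?                 #   Trailing whitespace, optional self-close
    |                        # Branch for closing tags:
      / ([_:A-Z][-.:\w]*)    #   Tag name, captured to group 2
      \s*
    )
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)

xml = '<note id="1"><to checked>Ann</to><empty-el/></note>'
tags = [m.group(0) for m in XML_TAG.finditer(xml)]
print(tags)
# ['<note id="1">', '</to>', '<empty-el/>', '</note>']
```

The <to checked> tag is not matched because XML requires every attribute to have a quoted value.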

Skip Tricky (X)HTML and XML Sections

When trying to match XML-style tags within a source file or string, much of the battle is avoiding content that looks like a tag, even though its placement or other context precludes it from being interpreted as a tag. The (X)HTML- and XML-specific regexes we’ve shown in this recipe avoid some problematic content by restricting the initial character of an element’s name. Some went even further, requiring tags to fulfill the (X)HTML or XML syntax rules. Still, a robust solution requires that we also avoid any content that appears within comments, scripting language code (which may use greater-than and less-than symbols for mathematical operations), XML CDATA sections, and various other constructs. We can solve this issue by first searching for these problematic sections, and then searching for tags only in the content outside of those matches.

Recipe 3.18 shows the code for searching between matches of another regex. It takes two patterns: an inner regex and outer regex. Any of the tag-matching regexes in this recipe can serve as the inner regex. The outer regex is shown next, with separate patterns for (X)HTML and XML. This approach hides the problematic sections from the inner regex’s view, and thereby lets us keep things relatively simple.

Tip

Instead of searching between matches of the outer regex, it might be easier to simply remove all matches of the outer regex (i.e., replace matches with an empty string). You can then search for XML or (X)HTML tags without worrying about skipping over tricky sections like CDATA blocks and <script> tags, since they’ve already been removed.

Outer regex for (X)HTML

The following regex matches comments, CDATA sections, and a number of special elements. Of the special elements, <script>, <style>, <textarea>, <title>, and <xmp>[20] tags are matched together with their entire contents and end tags. The <plaintext>[21] element is also matched, and when found, the match continues until the end of the string:

<!--.*?-->|<!\[CDATA\[.*?]]>|<(script|style|textarea|title|xmp)↵
(?:[^>"']|"[^"]*"|'[^']*')*>.*?</\1\s*>|<plaintext↵
(?:[^>"']|"[^"]*"|'[^']*')*>.*
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

In case that’s not the most readable line of code you’ve ever read, here is the regex again in free-spacing mode, with a few comments added:

# Comment
<!-- .*? -->
|
# CDATA section
<!\[CDATA\[ .*? ]]>
|
# Special element and its content
<( script | style | textarea | title | xmp )
  (?:[^>"']|"[^"]*"|'[^']*')*
> .*? </\1\s*>
|
# <plaintext/> continues until the end of the string
<plaintext
  (?:[^>"']|"[^"]*"|'[^']*')*
> .*
Regex options: Case insensitive, dot matches line breaks, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Neither of the above regexes works correctly in JavaScript without XRegExp, since standard JavaScript lacks both the “dot matches line breaks” and “free-spacing” options. The following regex reverts to being unreadable and replaces the dots with [\s\S] so it can be used in standard JavaScript:

<!--[\s\S]*?-->|<!\[CDATA\[[\s\S]*?]]>|<(script|style|textarea|title|xmp)↵
(?:[^>"']|"[^"]*"|'[^']*')*>[\s\S]*?</\1\s*>|<plaintext↵
(?:[^>"']|"[^"]*"|'[^']*')*>[\s\S]*
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

These regexes present a bit of a dilemma: because they match <script>, <style>, <textarea>, <title>, <xmp>, and <plaintext> tags, those tags are never matched by the second (inner) regex, even though we’re supposedly searching for all tags. However, it should just be a matter of adding a bit of extra procedural code to handle those tags specially, when they are matched by the outer regex.
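
As a sketch of the simpler remove-then-search variation mentioned in the earlier tip, the following Python code deletes every match of the outer (X)HTML regex and then runs the “(X)HTML tags (loose)” regex over what remains. The function name and the sample markup are our own, and the outer regex is assembled from pieces only to keep the lines short:

```python
import re

# Attribute section shared by the special-element branches.
ATTRS = r"""(?:[^>"']|"[^"]*"|'[^']*')*"""

OUTER_HTML = re.compile(
    "|".join([
        r"<!--.*?-->",                   # Comment
        r"<!\[CDATA\[.*?]]>",            # CDATA section
        r"<(script|style|textarea|title|xmp)" + ATTRS + r">.*?</\1\s*>",
        r"<plaintext" + ATTRS + r">.*",  # Runs to the end of the string
    ]),
    re.IGNORECASE | re.DOTALL,
)

LOOSE_TAG = re.compile(
    r"""</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)"""
)

def tags_outside_special_sections(markup):
    """Strip the tricky sections first, then collect the remaining tags."""
    cleaned = OUTER_HTML.sub("", markup)
    return [m.group(0) for m in LOOSE_TAG.finditer(cleaned)]

page = (
    "<p>a</p><!-- <b>hidden</b> -->"
    "<script>if (a < b && c > d) {}</script><i>x</i>"
)
print(tags_outside_special_sections(page))
# ['<p>', '</p>', '<i>', '</i>']
```

Note that the <script> tags themselves are gone from the result, which is the dilemma just described. If you need them, inspect the outer regex’s matches before discarding them.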

Outer regex for XML

This regex matches comments, CDATA sections, and DOCTYPEs. Each of these cases is matched using a discrete pattern. The patterns are combined into one regex using the | alternation metacharacter:

<!--.*?--\s*>|<!\[CDATA\[.*?]]>|<!DOCTYPE\s(?:[^<>"']|"[^"]*"|↵
'[^']*'|<!(?:[^>"']|"[^"]*"|'[^']*')*>)*>
Regex options: Case insensitive, dot matches line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Here it is again in free-spacing mode:

# Comment
<!-- .*? --\s*>
|
# CDATA section
<!\[CDATA\[ .*? ]]>
|
# Document type declaration
<!DOCTYPE\s
    (?: [^<>"']  # Non-special character
      | "[^"]*"  # Double-quoted value
      | '[^']*'  # Single-quoted value
      | <!(?:[^>"']|"[^"]*"|'[^']*')*>  # Markup declaration
    )*
>
Regex options: Case insensitive, dot matches line breaks, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

And here is a version that works in standard JavaScript (which lacks the “dot matches line breaks” and “free-spacing” options):

<!--[\s\S]*?--\s*>|<!\[CDATA\[[\s\S]*?]]>|<!DOCTYPE\s(?:[^<>"']|"[^"]*"|↵
'[^']*'|<!(?:[^>"']|"[^"]*"|'[^']*')*>)*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Tip

The regexes just shown allow whitespace via \s* between the closing -- and > of XML comments. This differs from the version shown earlier, because HTML5 and web browsers differ from XML on this point. See Find valid HTML comments for a discussion of the differences between valid XML and HTML comments.
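
To show how the outer and inner regexes cooperate, here is a minimal Python sketch of searching between matches. It is not the code from Recipe 3.18; the generator tags_between and the sample document are our own, and the sketch simply assumes that running the inner regex over each gap between outer matches is good enough for your data:

```python
import re

OUTER_XML = re.compile(
    "|".join([
        r"<!--.*?--\s*>",      # Comment
        r"<!\[CDATA\[.*?]]>",  # CDATA section
        # Document type declaration, with nested markup declarations
        r"""<!DOCTYPE\s(?:[^<>"']|"[^"]*"|'[^']*'|<!(?:[^>"']|"[^"]*"|'[^']*')*>)*>""",
    ]),
    re.IGNORECASE | re.DOTALL,
)

XML_TAG = re.compile(
    r"""<(?:([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*(?:"[^"]*"|'[^']*'))*\s*/?"""
    r"""|/([_:A-Z][-.:\w]*)\s*)>""",
    re.IGNORECASE,
)

def tags_between(outer, inner, text):
    """Yield inner matches found only in the gaps between outer matches."""
    pos = 0
    for skipped in outer.finditer(text):
        for m in inner.finditer(text[pos:skipped.start()]):
            yield m.group(0)
        pos = skipped.end()
    # Search whatever follows the last outer match.
    for m in inner.finditer(text[pos:]):
        yield m.group(0)

doc = (
    '<!DOCTYPE note [<!ENTITY x "a>b">]>'
    "<note><![CDATA[<fake>]]><!-- <old/> --><real/></note>"
)
print(list(tags_between(OUTER_XML, XML_TAG, doc)))
# ['<note>', '<real/>', '</note>']
```

Searching the same document with the inner regex alone would also report <fake> and <old/>, which an XML parser would not treat as tags.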

See Also

Matching any and all tags can be useful, but it’s also common to want to match a specific one or a few out of the bunch; Recipe 9.2 shows how to pull off these tasks. Recipe 9.3 describes how to match all except a select list of tags.

Recipe 9.4 details the characters that can be used in valid XML element and attribute names.

Recipe 9.7 shows how to find tags that contain a specific attribute. Recipe 9.8 finds tags that do not contain a specific attribute.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.1 explains which special characters need to be escaped. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.10 explains backreferences. Recipe 2.12 explains repetition. Recipe 2.13 explains how greedy and lazy quantifiers backtrack. Recipe 2.14 explains possessive quantifiers and atomic groups. Recipe 2.16 explains lookaround.



[18] The title attribute’s value is the empty string, not em.

[19] The title attribute’s value is =", not >. The second equals sign triggers the start of an unquoted value.

[20] <xmp> is a little-known but widely supported element similar to <pre>. Like <pre>, it preserves all whitespace and uses a fixed-width font by default, but it goes one step further and displays all of its contents (including HTML tags) as plain text. <xmp> was deprecated in HTML 3.2, and removed entirely from HTML 4.0.

[21] <plaintext> is like <xmp> except that it cannot be turned off by an end tag and runs until the very end of the document. Also like <xmp>, it was obsoleted in HTML 4.0 but remains widely supported.
