9.1. Find XML-Style Tags

Problem

You want to match any HTML, XHTML, or XML tags in a string, in order to remove, modify, count, or otherwise deal with them.

Solution

The most appropriate solution depends on several factors, including the level of accuracy, efficiency, and tolerance for erroneous markup that is acceptable to you. Once you’ve determined the approach that works for your needs, there are any number of things you might want to do with the results. But whether you want to remove the tags, search within them, add or remove attributes, or replace them with alternative markup, the first step is to find them.

Be forewarned that this will be a long recipe, fraught with subtleties, exceptions, and variations. If you’re looking for a quick fix and are not willing to put in the effort to determine the best solution for your needs, you might want to jump to the “(X)HTML tags (loose)” section of this recipe, which offers a decent mix of tolerance versus precaution.

Quick and dirty

This first solution is simple and more commonly used than you might expect, but it’s included here mostly for comparison and for an examination of its flaws. It may be good enough when you know exactly what type of content you’re dealing with and are not overly concerned about the consequences of incorrect handling. This regex matches a < symbol, then simply continues until the first > occurs:

<[^>]*>

Regex options: None

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Allow > in attribute values

This next regex is again rather simplistic and does not handle all cases correctly. However, it might work well for your needs if it will be used to process only snippets of valid (X)HTML. Its advantage over the previous regex is that it correctly passes over > characters that appear within attribute values:

<(?:[^>"']|"[^"]*"|'[^']*')*>

Regex options: None

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Here is the same regex, with added whitespace and comments for readability:

<
(?: [^>"']   # Non-quoted character
  | "[^"]*"  # Double-quoted attribute value
  | '[^']*'  # Single-quoted attribute value
)*
>

Regex options: Free-spacing

Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

The two regexes just shown work identically, so you can use whichever you prefer. JavaScripters are stuck with the first option unless using the XRegExp library, since standard JavaScript lacks a free-spacing option.
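As a rough illustration, here is a Python sketch that compiles both forms and checks that they find the same matches. Python's `re.VERBOSE` flag plays the role of the free-spacing option; the sample string and variable names are invented for this example.

```python
import re

# Compact form of the "allow > in attribute values" regex.
compact = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")

# The same regex in free-spacing mode (re.VERBOSE in Python).
verbose = re.compile(r"""
    <
    (?: [^>"']   # Non-quoted character
      | "[^"]*"  # Double-quoted attribute value
      | '[^']*'  # Single-quoted attribute value
    )*
    >
""", re.VERBOSE)

# Hypothetical sample containing > inside both kinds of quoted values.
sample = '<p>Click <input type="button" value=">>"> or <a href=\'x>y\'>here</a></p>'

compact_matches = compact.findall(sample)
verbose_matches = verbose.findall(sample)
print(compact_matches == verbose_matches)
```

Because the pattern contains no capturing groups, `findall` returns the full text of each tag.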

(X)HTML tags (loose)

In addition to supporting > characters embedded in attribute values, this next regex emulates the lenient rules for (X)HTML tags that browsers actually implement. This both improves accuracy with poorly formed markup and lets the regex avoid content that does not look like a tag, including comments, DOCTYPEs, and unencoded < characters in text. To get these improvements, two main changes are made. First, there is extra handling that helps determine where attribute values start and end in edge cases, such as when tags contain stray quote marks as part of an unquoted attribute value or separate from any legit attribute. Second, special handling is added for the tag name, including requiring the name to begin with a letter A–Z. The tag name is captured to backreference 1 in case you need to refer back to it:

</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)

Regex options: None

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

And in free-spacing mode:

<
/?                  # Permit closing tags
([A-Za-z][^\s>/]*)  # Capture the tag name to backreference 1
(?:                 # Attribute value branch:
  = \s*             #   Signals the start of an attribute value
  (?: "[^"]*"       #   Double-quoted attribute value
    | '[^']*'       #   Single-quoted attribute value
    | [^\s>]+       #   Unquoted attribute value
  )
|                   # Non-attribute-value branch:
  [^>]              #   Character outside of an attribute value
)*
(?:>|$)             # End of the tag or string

Regex options: Free-spacing

Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

The last two regexes work identically, although the latter cannot be used in JavaScript (without XRegExp), since it lacks a free-spacing option.

(X)HTML tags (strict)

This regex is more complicated than those we’ve already seen in this recipe, because it actually follows the rules for (X)HTML tags explained in the introductory section of this chapter. This is not always desirable, since browsers don’t strictly adhere to these rules. In other words, this regex will avoid matching content that does not look like a valid (X)HTML tag, at the cost of possibly not matching some content that browsers would in fact interpret as a tag (e.g., if your markup uses an attribute name that includes characters not accounted for here, or if attributes are included in a closing tag). Both HTML and XHTML tag rules are handled together since it is common for their conventions to be mixed. The tag name is captured to backreference 1 or 2 (depending on whether it is an opening or closing tag), in case you need to refer back to it:

<(?:([A-Z][-:A-Z0-9]*)(?:\s+[A-Z][-:A-Z0-9]*(?:\s*=\s*(?:"[^"]*"|↵
'[^']*'|[^"'`=<>\s]+))?)*\s*/?|/([A-Z][-:A-Z0-9]*)\s*)>

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

To make it a little less cryptic, here is the same regex in free-spacing mode with comments:

<
(?:                    # Branch for opening tags:
  ([A-Z][-:A-Z0-9]*)   #   Capture the opening tag name to backreference 1
  (?:                  #   This group permits zero or more attributes
    \s+                #   Whitespace to separate attributes
    [A-Z][-:A-Z0-9]*   #   Attribute name
    (?: \s*=\s*        #   Attribute name-value delimiter
      (?: "[^"]*"      #   Double-quoted attribute value
        | '[^']*'      #   Single-quoted attribute value
        | [^"'`=<>\s]+ #   Unquoted attribute value (HTML)
      )
    )?                 #   Permit attributes without a value (HTML)
  )*
  \s*                  #   Permit trailing whitespace
  /?                   #   Permit self-closed tags
|                      # Branch for closing tags:
  /
  ([A-Z][-:A-Z0-9]*)   #   Capture the closing tag name to backreference 2
  \s*                  #   Permit trailing whitespace
)
>

Regex options: Case insensitive, free-spacing

Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

XML tags (strict)

XML is a precisely specified language, and requires that user agents strictly adhere to and enforce its rules. This is a stark change from HTML and the long-suffering browsers that process it. We’ve therefore included only a “strict” version for XML:

<(?:([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*(?:"[^"]*"|'[^']*'))*\s*↵
/?|/([_:A-Z][-.:\w]*)\s*)>

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Once again, here is the same regex in free-spacing mode with added comments:

<
(?:                  # Branch for opening tags:
  ([_:A-Z][-.:\w]*)  #   Capture the opening tag name to backreference 1
  (?:                #   This group permits zero or more attributes
    \s+              #   Whitespace to separate attributes
    [_:A-Z][-.:\w]*  #   Attribute name
    \s*=\s*          #   Attribute name-value delimiter
    (?: "[^"]*"      #   Double-quoted attribute value
      | '[^']*'      #   Single-quoted attribute value
    )
  )*
  \s*                #   Permit trailing whitespace
  /?                 #   Permit self-closed tags
|                    # Branch for closing tags:
  /
  ([_:A-Z][-.:\w]*)  #   Capture the closing tag name to backreference 2
  \s*                #   Permit trailing whitespace
)
>

Regex options: Case insensitive, free-spacing

Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Like the previous solution for (X)HTML tags, these regexes capture the tag name to backreference 1 or 2, depending on whether an opening or closing tag is matched. The XML tag regex is a little shorter than the (X)HTML version since it doesn’t have to deal with HTML-only syntax (minimized attributes and unquoted values). It also allows a wider range of characters to be used for element and attribute names.

Discussion

A few words of caution

Although it’s common to want to match XML-style tags using regular expressions, doing it safely requires balancing trade-offs and thinking carefully about the data you’re working with. Because of these difficulties, some people choose to forgo the use of regular expressions for any sort of XML or (X)HTML processing in favor of specialized parsers and APIs. That’s an approach you should seriously consider, since such tools are sometimes easier to use and typically include robust detection or handling for incorrect markup. In browser-land, for example, it’s usually best to take advantage of the tree-based Document Object Model (DOM) for your HTML search and manipulation needs. Elsewhere, you might be well-served by a SAX parser or XPath. However, you may occasionally find places where regex-based solutions make a lot of sense and work perfectly fine.

Tip

If you want to sterilize HTML from untrusted sources because you’re worried about specially crafted malicious HTML and cross-site scripting (XSS) attacks, your safest bet is to first convert all <, >, and & characters to their corresponding named character references (&lt;, &gt;, and &amp;), then bring back tags that are known to be safe (as long as they contain no attributes or only use those within a select list of approved attributes). For example, to bring back <p>, <em>, and <strong> tags with no attributes after replacing <, >, and & with character references, search case-insensitively using the regex ‹&lt;(/?)(p|em|strong)&gt;› and replace matches with «<$1$2>» (or in Python and Ruby, «<\1\2>»). If necessary, you can then safely search your modified string for HTML tags using the regexes in this recipe.
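The escape-then-restore approach from this tip might look something like the following in Python. This is only a minimal sketch: the function name `sanitize` and the constant `SAFE_TAG` are our own, and it restores only attribute-free <p>, <em>, and <strong> tags, exactly as in the example above.

```python
import html
import re

# Escaped forms of the bare tags we are willing to bring back.
SAFE_TAG = re.compile(r"&lt;(/?)(p|em|strong)&gt;", re.IGNORECASE)

def sanitize(untrusted: str) -> str:
    """Escape all markup, then restore a short list of attribute-free tags."""
    # quote=False limits escaping to &, <, and >.
    escaped = html.escape(untrusted, quote=False)
    # \1 is the optional slash, \2 is the tag name as originally written.
    return SAFE_TAG.sub(r"<\1\2>", escaped)

print(sanitize('<p>Hi <script>alert(1)</script> <EM>x</EM></p>'))
```

Tags that carry attributes, such as `<p onclick="x()">`, stay escaped because the pattern leaves no room for anything between the tag name and the closing `&gt;`.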

With those disclaimers out of the way, let’s examine the regexes we’ve already seen in this recipe. The first two solutions are overly simplistic for most cases, but handle XML-style markup languages equally. The latter three follow stricter rules and are tailored to their respective markup languages. Even in the latter solutions, however, HTML and XHTML tag conventions are handled together since it’s common for them to be mixed, often inadvertently. For example, an author may use an XHTML-style self-closing <br /> tag in an HTML4 document, or incorrectly use an uppercase element name in a document with an XHTML DOCTYPE. HTML5 further blurs the distinction between HTML and XHTML syntax.

Quick and dirty

The advantage of this solution is its simplicity, which makes it easy to remember and type, and also fast to run. The trade-off is that it incorrectly handles certain valid and invalid XML and (X)HTML constructs. If you’re working with markup you wrote yourself and know that such cases will never appear in your subject text, or if you are not concerned about the consequences if they do, this trade-off might be OK. Another example of where this solution might be good enough is when you’re working with a text editor that lets you preview regex matches.

The regex starts off by finding a literal ‹<› character (the start of a tag). It then uses a negated character class and greedy asterisk quantifier ‹[^>]*› to match zero or more following characters that are not >. This takes care of matching the name of the tag, attributes, and a leading or trailing /. We could use a lazy quantifier (‹[^>]*?›) instead, but that wouldn’t change anything other than making the regex a tiny bit slower since it would cause more backtracking (Recipe 2.13 explains why). To end the tag, the regex then matches a literal ‹>›.

If you prefer to use a dot instead of the negated character class ‹[^>]›, go for it. A dot will work fine as long as you also use a lazy asterisk along with it (‹.*?›) and make sure to enable the “dot matches line breaks” option (in JavaScript, you could use ‹[\s\S]*?› instead). A dot with a greedy asterisk (making the full pattern ‹<.*>›) would change the regex’s meaning, causing it to incorrectly match from the first < until the very last > in the subject string, even if the regex has to swallow multiple tags along the way in order to do so.
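To see the difference between these variations, consider this small Python sketch. The subject string is made up for the example, and `re.DOTALL` is Python's name for the “dot matches line breaks” option.

```python
import re

subject = "<b>bold</b> and <i>italic</i>"

# Negated character class: stops at the first > after each <.
negated = re.findall(r"<[^>]*>", subject)

# Lazy dot: same matches as the negated class.
lazy_dot = re.findall(r"<.*?>", subject, flags=re.DOTALL)

# Greedy dot: runs from the first < to the very last >.
greedy_dot = re.findall(r"<.*>", subject, flags=re.DOTALL)

print(negated)
print(lazy_dot)
print(greedy_dot)
```

The first two lists contain the four individual tags, while the greedy version returns a single match spanning the whole string.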

It’s time for a few examples. The regex matches each of the following lines in full:

<div>
</div>
<div class="box">
<div id="pandoras-box" class="box" />

<!DOCTYPE html>
<< < w00t! >
<>

Notice that the pattern matches more than just tags. Worse, it will not correctly match the entire tags in the subject strings <input type="button" value=">>"> or <input type="button" onclick="alert(2>1)">. Instead, it will only match until the first > that appears within the attribute values. It will have similar problems with comments, XML CDATA sections, DOCTYPEs, code within <script> elements, and anything else that contains embedded > symbols.
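The following Python sketch reproduces both observations: the non-tag lines match in full, and the two <input> tags are cut short at the first >. The variable names are ours.

```python
import re

QUICK = re.compile(r"<[^>]*>")

# Lines that match in full, even though several are not tags at all.
full_lines = ["<div>", "<!DOCTYPE html>", "<< < w00t! >", "<>"]
all_full = all(QUICK.fullmatch(line) for line in full_lines)

# Tags with > inside an attribute value are truncated at that >.
cut_1 = QUICK.search('<input type="button" value=">>">').group(0)
cut_2 = QUICK.search('<input type="button" onclick="alert(2>1)">').group(0)

print(all_full)
print(cut_1)
print(cut_2)
```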

If you’re processing anything more than the most basic markup, especially if the subject text is coming from mixed or unknown sources, you will be better served by one of the more robust solutions further along in this recipe.

Allow > in attribute values

Like the quick and dirty regex we’ve just described, this next one is included primarily to contrast it with the later, more robust solutions. Nevertheless, it covers the basics needed to match XML-style tags, and thus it might work well for your needs if it will be used to process snippets of valid markup that include only elements and text. The difference from the last regex is that it passes over > characters that appear within attribute values. For example, it will correctly match the entire <input> tags in the example subject strings we’ve previously shown: <input type="button" value=">>"> and <input type="button" onclick="alert(2>1)">.

As before, the regex uses literal angle bracket characters at the edges of the regex to match the start and end of a tag. In between, it repeats a noncapturing group containing three alternatives, each separated by the ‹|› alternation metacharacter.

The first alternative is the negated character class ‹[^>"']›, which matches any single character other than a right angle bracket (which closes the tag), double quote, or single quote (both quote marks indicate the start of an attribute value). This first alternative is responsible for matching the tag and attribute names as well as any other characters outside of quoted values. The order of the alternatives is intentional, and written with performance in mind. Regular expression engines attempt alternative paths through a regex from left to right, and attempts at matching this first option will most likely succeed more often than the alternatives for quoted values (especially since it matches only one character at a time).

Next come the alternatives for matching double and single quoted attribute values (‹"[^"]*"› and ‹'[^']*'›). Their use of negated character classes allows them to continue matching past any included > characters, line breaks, and anything else that isn’t a closing quote mark.

Note that this solution has no special handling that allows it to exclude or properly match comments and other special nodes in your documents. Make sure you’re familiar with the kind of content you’re working with before putting this regex to use.
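Here is a brief Python sketch of both the improvement and the remaining blind spot. The test strings are invented for illustration; the comment example shows what happens when the regex meets a node it has no special handling for.

```python
import re

ATTR_AWARE = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')*>""")

# A > inside a quoted attribute value no longer ends the match.
whole = ATTR_AWARE.search('<input type="button" onclick="alert(2>1)">').group(0)

# Quoted values may also span line breaks.
multiline = ATTR_AWARE.fullmatch('<a title="line one\nline two">')

# Comments get no special treatment: the match stops at the first >.
comment = ATTR_AWARE.search("<!-- if (a > b) -->").group(0)

print(whole)
print(multiline is not None)
print(comment)
```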

A (Safe) Efficiency Optimization

After reading the “Allow > in attribute values” section, you might think you could make the regex a bit faster by adding a ‹*› or ‹+› quantifier after the leading negated character class (‹[^>"']›). At positions within the subject string where the regex finds matches, you’d be right. By matching more than one character at a time, you’d let the regex engine skip a lot of unnecessary steps on the way to a successful match.

What might not be as readily apparent is the negative consequence such a change could lead to in places where the regex engine finds only a partial match. When the regex matches an opening < character but there is no following > that would allow the match attempt to complete successfully, you’ll run into the “catastrophic backtracking” problem described in Recipe 2.15. This is because of the huge number of ways the new, inner quantifier could be combined with the outer quantifier (following the noncapturing group) to match the text that follows <, all of which the engine must try before giving up on the match attempt. Watch out!

With regex flavors that support possessive quantifiers or atomic groups (JavaScript and Python have neither), it’s possible to avoid this problem while still gaining the performance advantage of matching more than one nonquoted character at a time. In fact, we can go further and reduce potential backtracking elsewhere in the regex as well. If the regex flavor you’re using supports both features, possessive quantifiers (shown here in the second regex) are the better option since they keep the regex shorter and more readable.

With atomic groups:

<(?>(?:(?>[^>"']+)|"[^"]*"|'[^']*')*)>

Regex options: None

Regex flavors: .NET, Java, PCRE, Perl, Ruby

With possessive quantifiers:

<(?:[^>"']++|"[^"]*"|'[^']*')*+>

Regex options: None

Regex flavors: Java, PCRE, Perl 5.10, Ruby 1.9

(X)HTML tags (loose)

Via a couple main changes, this regex gets a lot closer to emulating the easygoing rules that web browsers use to identify (X)HTML tags in source code. That makes it a good solution in cases where you’re trying to copy browser behavior or the HTML5 parsing algorithm and don’t care whether the tags you match actually follow all the rules for valid markup. Keep in mind that it’s still possible to create horrifically invalid HTML that this regex will not handle in the same way as one or more browsers, since browsers parse some edge cases of erroneous markup in their own, unique ways.

This regex’s most significant difference from the previous solution is that it requires the character following the opening left angle bracket (<) to be a letter A–Z or a–z, optionally preceded by / (for closing tags). This constraint rules out matching stray, unencoded < characters in text, as well as comments, DOCTYPEs, XML declarations and processing instructions, CDATA sections, and so on. That doesn’t protect it from matching something that looks like a tag but is actually within a comment, scripting language code, the content of a <textarea> element, or other similar situation where text is treated literally. The upcoming section, Skip Tricky (X)HTML and XML Sections, shows a workaround for this issue. But first, let’s look at how this regex works.

‹<› starts off the match with a literal left angle bracket. The ‹/?› that follows allows an optional forward slash, for closing tags. Next comes the capturing group ‹([A-Za-z][^\s>/]*)›, which matches the tag’s name and remembers it as backreference 1. If you don’t need to refer back to the tag name (e.g., if you’re simply removing all tags), you can remove the capturing parentheses (just don’t get rid of the pattern within them). Within the group are two character classes. The first class, ‹[A-Za-z]›, matches the first character of the tag’s name. The second class, ‹[^\s>/]›, allows nearly any characters to follow as part of the name. The only exceptions are whitespace (‹\s›, which separates the tag name from any following attributes), > (which ends the tag), and / (used before the closing > for XHTML-style singleton tags). Any other characters (even including quote marks and the equals sign) are treated as part of the tag’s name. That might seem a bit overly permissive, but it’s how most browsers operate. Bogus tags might not have any effect on the way a page is rendered, but they nevertheless become accessible via the DOM tree and are not rendered as text, although any content within them will show up.

After the tag name comes the attribute handling, which is significantly changed from the previous solution in order to more accurately emulate browser-style parsing of edge cases with poorly formed markup. Since unencoded > symbols end a tag unless they are within attribute values, it’s important to accurately determine where attribute values start and end. This is a bit tricky since it’s possible for stray quote marks and equals signs to appear within a tag but separate from any attribute value, or even as part of an unquoted attribute value.

Consider a few examples. This regex matches each of the following lines in their entirety:

<em title=">">
<em !=">">
</em// em <em>
<em title=">"">
<em title=""em"> ^[18]
<em" title=">">

In each of the following lines, the regex matches only the leading portion, up to and including the first >:

<em "> ">
<em="> ">
<em title=="> "> ^[19]
<em title=em="> ">
<em title= ="> ">

Keep in mind that the handling for these examples is specifically designed to match common browser behavior.
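If you’d like to check these examples yourself, a Python sketch along the following lines should do. It runs the loose regex against each example line (minus the footnote markers) and records how much of the line is matched; the list and dictionary names are ours.

```python
import re

LOOSE = re.compile(
    r"""</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)"""
)

# Lines the regex is expected to match in their entirety.
full_lines = [
    '<em title=">">',
    '<em !=">">',
    '</em// em <em>',
    '<em title=">"">',
    '<em title=""em">',
    '<em" title=">">',
]

# Lines where the match ends early.
partial_lines = [
    '<em "> ">',
    '<em="> ">',
    '<em title=="> ">',
    '<em title=em="> ">',
    '<em title= ="> ">',
]

full_results = {line: LOOSE.match(line).group(0) for line in full_lines}
partial_results = {line: LOOSE.match(line).group(0) for line in partial_lines}

for line, matched in partial_results.items():
    print(repr(line), "->", repr(matched))
```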

Getting back to the attribute handling, we come to the noncapturing group ‹(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*›. There are two outermost alternatives here, separated by ‹|›.

The first alternative, ‹=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)›, is for matching attribute values; the equals sign at the start signals their onset. After the equals sign and optional whitespace (‹\s*›), there is a nested noncapturing group that includes three options: ‹"[^"]*"› for double quoted values, ‹'[^']*'› for single quoted values, and ‹[^\s>]+› for unquoted values. The pattern for unquoted values notably allows anything except whitespace or >, even matching quote marks and equals signs. This is more permissive than is officially allowed for valid HTML, but follows browser behavior. Note that because the pattern for unquoted values matches quote marks, it must appear last in the list of options or the other two alternatives (for matching quoted values) would never have a chance to match.

The second alternative in the outer group is simply ‹[^>]›. This is used to match (one character at a time) attribute names, the whitespace separating attributes, the trailing / symbol for self-closed tags, and any other stray characters within the tag’s boundaries. Because this character class matches equals signs (in addition to almost everything else), it must be the latter option in its containing group or else the alternative that matches attribute values would never have a chance to participate.

Finally, we close out the regex with ‹(?:>|$)›. This matches either the end of the tag or, if it’s reached first, the end of the string.

By letting the match end successfully if the end of the string is reached without finding the end of the tag, we’re emulating most browsers’ behavior, but we’re also doing it to avoid potential runaway backtracking (see Recipe 2.15). If we forced the regex to backtrack (and ultimately fail to match) when there is no tag-ending > to be found, the amount of backtracking that might be needed to try every possible permutation of this regex’s medley of overlapping patterns and nested repeating groups could create performance problems. However, the regex as it’s written sidesteps this issue, and should always perform efficiently.

The following regexes show how this pattern can be tweaked to match opening and singleton (self-closing) or closing tags only:

Opening and singleton tags only

<([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)

Regex options: None

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

This version removes the ‹/?› that appeared after the opening ‹<›.

Closing tags only

</([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)

Regex options: None

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

The forward slash after the opening ‹<› has been made a required part of the match here. Note that we are intentionally allowing attributes inside closing tags, since this is based on the “loose” solution. Although browsers don’t use attributes that occur in closing tags, they don’t mind if such attributes exist.
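As a quick illustration of the two variants, the Python sketch below runs them over a made-up string. Splitting the shared attribute-handling tail into a separate `ATTRS` string is our own convenience, not part of the recipe.

```python
import re

# Attribute handling and tag ending shared by both variants.
ATTRS = r"""(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)"""

OPENING = re.compile(r"<([A-Za-z][^\s>/]*)" + ATTRS)   # opening and singleton tags
CLOSING = re.compile(r"</([A-Za-z][^\s>/]*)" + ATTRS)  # closing tags

markup = '<div class="a"><br/></div id="ignored">'

opening_names = [m.group(1) for m in OPENING.finditer(markup)]
closing_tags = [m.group(0) for m in CLOSING.finditer(markup)]

print(opening_names)
print(closing_tags)
```

Note that the closing-tag match includes the stray attribute, in keeping with the loose rules.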

What About Backtracking Controls?

The sidebar A (Safe) Efficiency Optimization showed how to improve performance when matching tags through the use of atomic groups or possessive quantifiers. This time around, the potential performance improvement is much greater since the parts of a match that can be found by the patterns ‹[^\s>/]*›, ‹[^\s>]+›, and ‹[^>]› all overlap with each other and other parts of the regex, thereby providing a potentially crushing amount of pattern combinations to try before the regex engine can give up on a partial match.

Actually, as previously mentioned, we completely sidestepped this problem by allowing partial matches to end at the end of the subject string. However, if atomic groups or possessive quantifiers are available in the regex flavor you’re using, it might make sense to add them anyway. There are two reasons for this. First, with backtracking controls in place, it’s safe to require all matches to end with > if you want to. In other words, you could replace the ‹(?:>|$)› at the end of the regex with ‹>›, without worrying about runaway backtracking. Second, it will make the regex more resilient when modified. As it stands, even minor changes to the regex risk the introduction of backtracking related problems, and must be carefully considered and tested.

So let’s get some backtracking controls in here! The following changes can also be transferred to the opening/singleton and closing tag specific regexes just shown.

With atomic groups:

</?([A-Za-z](?>[^\s>/]*))(?>=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)

Regex options: None

Regex flavors: .NET, Java, PCRE, Perl, Ruby

With possessive quantifiers:

</?([A-Za-z][^\s>/]*+)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*+(?:>|$)

Regex options: None

Regex flavors: Java, PCRE, Perl 5.10, Ruby 1.9

JavaScript and Python don’t support atomic groups or possessive quantifiers, but we can accomplish the same thing by emulating atomic groups using backreferences to matches captured within lookahead (see Lookaround is atomic for an explanation of why this works).

With emulated atomic groups:

</?([A-Za-z](?=([^\s>/]*))\2)(?=((?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|↵
[^>])*))\3(?:>|$)

Regex options: None

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
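The following Python sketch compares the emulated-atomic version with the plain loose regex on a handful of made-up inputs. It only checks that the two agree on what they match; it makes no attempt to measure performance.

```python
import re

PLAIN = re.compile(
    r"""</?([A-Za-z][^\s>/]*)(?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*(?:>|$)"""
)

# Each lookahead captures greedily; the backreference that follows then
# consumes exactly that text, so the engine cannot backtrack into it.
EMULATED = re.compile(
    r"""</?([A-Za-z](?=([^\s>/]*))\2)"""
    r"""(?=((?:=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)|[^>])*))\3(?:>|$)"""
)

samples = [
    '<em title=">">',
    '</em// em <em>',
    '<em title= ="> ">',
    'x <div class="a',   # unterminated tag: matches to the end of the string
]

plain_results = [(m.group(0), m.group(1)) for m in map(PLAIN.search, samples)]
emulated_results = [(m.group(0), m.group(1)) for m in map(EMULATED.search, samples)]

print(plain_results == emulated_results)
```

The tag name still lands in backreference 1; backreferences 2 and 3 exist only to support the emulation.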

(X)HTML tags (strict)

By saying that this solution is strict, we mean that it attempts to follow the HTML and XHTML syntax rules explained in the introductory section of this chapter, rather than emulating the rules browsers actually use when parsing the source code of a document. This strictness adds the following rules compared to the previous regexes:

Both tag and attribute names must start with a letter A–Z or a–z, and their names may only use the characters A–Z, a–z, 0–9, hyphen, and colon. In regex, that’s ‹^[A-Za-z][-:A-Za-z0-9]*$›.
Inappropriate, stray characters are not allowed after the tag name. Only whitespace, attributes (with or without an accompanying value), and optionally a trailing forward slash (/) may appear after the tag name.
Unquoted attribute values may not use the characters ", ', `, =, <, >, and whitespace. In regex, ‹^[^"'`=<>\s]+$›.
Closing tags cannot include attributes.

Since the pattern is split into two branches using alternation, the tag name is captured to either backreference 1 or 2, depending on what type of tag is matched. The first branch is for opening and singleton tags, and the second branch is for closing tags. Both sets of capturing parentheses may be removed if you have no need to refer back to the tag names.
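To make the rules and the two backreferences concrete, here is a Python sketch. Assembling the pattern from the pieces `NAME`, `VALUE`, and `ATTR` is our own choice for readability (the resulting pattern is the same as the one shown in the Solution), and the `classify` helper is hypothetical.

```python
import re

NAME = r"[A-Z][-:A-Z0-9]*"                       # tag or attribute name
VALUE = r"""(?:"[^"]*"|'[^']*'|[^"'`=<>\s]+)"""  # quoted or unquoted value
ATTR = rf"\s+{NAME}(?:\s*=\s*{VALUE})?"          # attribute, value optional

STRICT = re.compile(
    rf"<(?:({NAME})(?:{ATTR})*\s*/?|/({NAME})\s*)>",
    re.IGNORECASE,
)

def classify(tag):
    """Return (kind, name) for a strictly valid tag, or None otherwise."""
    m = STRICT.fullmatch(tag)
    if m is None:
        return None
    if m.group(1) is not None:
        return ("opening", m.group(1))   # opening or singleton tag
    return ("closing", m.group(2))

for tag in ['<div class="box">', '<br />', '</DIV >', '</div class="x">']:
    print(tag, "->", classify(tag))
```

The last tag is rejected because closing tags cannot include attributes under the strict rules.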

The two branches of the pattern are separated into their own regexes in the following modified versions. Both capture the tag name to backreference 1:

Opening and singleton tags only

<([A-Z][-:A-Z0-9]*)(?:\s+[A-Z][-:A-Z0-9]*(?:\s*=\s*↵
(?:"[^"]*"|'[^']*'|[^"'`=<>\s]+))?)*\s*/?>

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

The ‹/?› that appears just before the closing ‹>› is what allows this regex to match both opening and singleton tags. Remove it to match opening tags only. Remove just the question mark quantifier (making the ‹/› required), and it will match singleton tags only.

Closing tags only

</([A-Z][-:A-Z0-9]*)\s*>

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

In the last couple of sections, we’ve shown how to get a potential performance boost by adding atomic groups or possessive quantifiers. The strictly defined paths through this regex (and the adapted versions just shown) result in there being no potential to match the same strings more than one way, and therefore having less potential backtracking to worry about. These regexes don’t actually rely on backtracking, so if you wanted to, you could make every last one of their ‹*›, ‹+›, and ‹?› quantifiers possessive (or achieve the same effect using atomic groups) and they would continue matching or failing to match exactly the same strings with only slightly less backtracking along the way. We’re therefore going to skip such variations for this (and the next) regex, to try to keep the number of options in this recipe under control.

See Skip Tricky (X)HTML and XML Sections for a way to avoid matching tags within comments, <script> tags, and so on.

XML tags (strict)

XML precludes the need for a “loose” solution through its precise specification and requirement that conforming parsers do not process markup that is not well-formed. Although you could use one of the preceding regexes when processing XML documents, their simplicity won’t give you the advantage of actually providing a more reliable search, since there is no loose XML user agent behavior to emulate.

This regex is basically a simpler version of the “(X)HTML tags (strict)” regex, since we’re able to remove support for two HTML features that are not allowed in XML: unquoted attribute values and minimized attributes (attributes without an accompanying value). The only other difference is the characters that are allowed as part of the tag and attribute names. In fact, the rules for XML names (which govern the requirements for both tag and attribute names) are more permissive than shown here, allowing hundreds of thousands of additional Unicode characters. If you need to allow these characters in your search, you can replace the three occurrences of ‹[_:A-Z][-.:\w]*› with one of the patterns found in Recipe 9.4. Note that the list of characters allowed differs depending on the version of XML in use.

As with the (X)HTML regexes, the tag name is captured to backreference 1 or 2, depending on whether an opening/singleton or closing tag is matched. And once again, you can remove the capturing parentheses if you don’t need to refer back to the tag names.
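The Python sketch below shows the XML regex accepting namespaced and underscore-prefixed names while rejecting the two HTML-only constructs. As before, building the pattern from `NAME` and `ATTR` pieces is our own convenience, and the sample tags are invented.

```python
import re

NAME = r"[_:A-Z][-.:\w]*"                             # XML tag or attribute name
ATTR = rf"""\s+{NAME}\s*=\s*(?:"[^"]*"|'[^']*')"""   # value is required and quoted

XML_TAG = re.compile(
    rf"<(?:({NAME})(?:{ATTR})*\s*/?|/({NAME})\s*)>",
    re.IGNORECASE,
)

singleton = XML_TAG.fullmatch('<ns:item xml:lang="en"/>')
opening = XML_TAG.fullmatch("<_a.b c='1'>")
closing = XML_TAG.fullmatch("</ns:item>")

# HTML-only syntax is rejected.
minimized = XML_TAG.fullmatch("<input disabled>")   # attribute without a value
unquoted = XML_TAG.fullmatch("<a href=x>")          # unquoted attribute value

print(singleton.group(1), opening.group(1), closing.group(2))
print(minimized, unquoted)
```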

The two branches of the pattern are separated in the following modified regexes. As a result, both regexes capture the tag name to backreference 1:

Opening and singleton tags only

<([_:A-Z][-.:\w]*)(?:\s+[_:A-Z][-.:\w]*\s*=\s*↵
(?:"[^"]*"|'[^']*'))*\s*/?>

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

The ‹/?› that appears just before the closing ‹>› is what allows this regex to match both opening and singleton tags. Remove it to match only opening tags. Remove just the question mark quantifier, and it will match only singleton tags.

Closing tags only

</([_:A-Z][-.:\w]*)\s*>

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

See the next section, Skip Tricky (X)HTML and XML Sections, for a way to avoid matching tags within comments, CDATA sections, and DOCTYPEs.

Skip Tricky (X)HTML and XML Sections

When trying to match XML-style tags within a source file or string, much of the battle is avoiding content that looks like a tag, even though its placement or other context precludes it from being interpreted as a tag. The (X)HTML- and XML-specific regexes we’ve shown in this recipe avoid some problematic content by restricting the initial character of an element’s name. Some went even further, requiring tags to fulfill the (X)HTML or XML syntax rules. Still, a robust solution requires that we also avoid any content that appears within comments, scripting language code (which may use greater-than and less-than symbols for mathematical operations), XML CDATA sections, and various other constructs. We can solve this issue by first searching for these problematic sections, and then searching for tags only in the content outside of those matches.

Recipe 3.18 shows the code for searching between matches of another regex. It takes two patterns: an inner regex and outer regex. Any of the tag-matching regexes in this recipe can serve as the inner regex. The outer regex is shown next, with separate patterns for (X)HTML and XML. This approach hides the problematic sections from the inner regex’s view, and thereby lets us keep things relatively simple.
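
The following Python sketch shows one way the two-pattern search could look. It is not the code from Recipe 3.18; the function name `find_outside` is hypothetical, and for brevity the outer regex here covers only comments while the inner regex is the “Quick and dirty” pattern from the start of this recipe.

```python
import re

def find_outside(outer, inner, text):
    """Yield matches of `inner` that lie between (not inside) matches of `outer`."""
    pos = 0
    for skipped in outer.finditer(text):
        # Search only the gap that precedes this outer match
        yield from inner.finditer(text, pos, skipped.start())
        pos = skipped.end()
    # Search whatever follows the last outer match
    yield from inner.finditer(text, pos)

# Simplified outer regex (comments only) and a simple inner regex, both assumptions
outer = re.compile(r"<!--.*?-->", re.DOTALL)
inner = re.compile(r"<[^>]*>")

sample = "<p>one</p><!-- <b>hidden</b> --><i>two</i>"
tags = [m.group(0) for m in find_outside(outer, inner, sample)]
print(tags)  # ['<p>', '</p>', '<i>', '</i>']
```

Because the inner regex only ever sees the text between outer matches, the <b> tags inside the comment are never reported.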

Tip

Instead of searching between matches of the outer regex, it might be easier to simply remove all matches of the outer regex (i.e., replace matches with an empty string). You can then search for XML or (X)HTML tags without worrying about skipping over tricky sections like CDATA blocks and <script> tags, since they’ve already been removed.
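
A minimal Python sketch of this removal approach follows, assuming the outer regex for (X)HTML from this recipe and, as the inner regex, the simple “Quick and dirty” pattern. The names `OUTER_XHTML`, `html`, and `cleaned` are made up for the example.

```python
import re

# Outer regex for (X)HTML from this recipe, split across lines for readability
OUTER_XHTML = re.compile(
    r"""<!--.*?-->|<!\[CDATA\[.*?]]>|<(script|style|textarea|title|xmp)"""
    r"""(?:[^>"']|"[^"]*"|'[^']*')*>.*?</\1\s*>|<plaintext"""
    r"""(?:[^>"']|"[^"]*"|'[^']*')*>.*""",
    re.IGNORECASE | re.DOTALL,
)

html = "<p>Hi</p><script>if (a < b) { go(); }</script><!-- <b>old</b> --><em>x</em>"

# Step 1: delete every tricky section
cleaned = OUTER_XHTML.sub("", html)
# Step 2: search for tags in what remains
tags = re.findall(r"<[^>]*>", cleaned)

print(cleaned)  # <p>Hi</p><em>x</em>
print(tags)     # ['<p>', '</p>', '<em>', '</em>']
```

Keep in mind that removing text shifts character positions, so this variant suits counting or listing tags better than editing the original string in place.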

Outer regex for (X)HTML

The following regex matches comments, CDATA sections, and a number of special elements. Of the special elements, <script>, <style>, <textarea>, <title>, and <xmp>^[20] tags are matched together with their entire contents and end tags. The <plaintext>^[21] element is also matched, and when found, the match continues until the end of the string:

<!--.*?-->|<!\[CDATA\[.*?]]>|<(script|style|textarea|title|xmp)↵
(?:[^>"']|"[^"]*"|'[^']*')*>.*?</\1\s*>|<plaintext↵
(?:[^>"']|"[^"]*"|'[^']*')*>.*

Regex options: Case insensitive, dot matches line breaks

Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

In case that’s not the most readable line of code you’ve ever read, here is the regex again in free-spacing mode, with a few comments added:

# Comment
<!-- .*? -->
|
# CDATA section
<!\[CDATA\[ .*? ]]>
|
# Special element and its content
<( script | style | textarea | title | xmp )
  (?:[^>"']|"[^"]*"|'[^']*')*
> .*? </\1\s*>
|
# <plaintext/> continues until the end of the string
<plaintext
  (?:[^>"']|"[^"]*"|'[^']*')*
> .*

Regex options: Case insensitive, dot matches line breaks, free-spacing

Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Neither of the above regexes works correctly in JavaScript without XRegExp, since standard JavaScript lacks both the “dot matches line breaks” and “free-spacing” options. The following regex reverts to being unreadable and replaces the dots with ‹[\s\S]› so it can be used in standard JavaScript:

<!--[\s\S]*?-->|<!\[CDATA\[[\s\S]*?]]>|<(script|style|textarea|title|xmp)↵
(?:[^>"']|"[^"]*"|'[^']*')*>[\s\S]*?</\1\s*>|<plaintext↵
(?:[^>"']|"[^"]*"|'[^']*')*>[\s\S]*

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

These regexes present a bit of a dilemma: because they match <script>, <style>, <textarea>, <title>, <xmp>, and <plaintext> tags, those tags are never matched by the second (inner) regex, even though we’re supposedly searching for all tags. However, it should just be a matter of adding a bit of extra procedural code to handle those tags specially, when they are matched by the outer regex.
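
One possible shape for that extra procedural code is sketched below in Python. It relies on the fact that the outer regex captures the special element’s name to backreference 1; the function name `describe` and the labels it returns are hypothetical.

```python
import re

# Outer regex for (X)HTML from this recipe, split across lines for readability
OUTER_XHTML = re.compile(
    r"""<!--.*?-->|<!\[CDATA\[.*?]]>|<(script|style|textarea|title|xmp)"""
    r"""(?:[^>"']|"[^"]*"|'[^']*')*>.*?</\1\s*>|<plaintext"""
    r"""(?:[^>"']|"[^"]*"|'[^']*')*>.*""",
    re.IGNORECASE | re.DOTALL,
)

def describe(match):
    """Label an outer match: a special element (by name) or a skipped section."""
    if match.group(1):
        # Group 1 participates only for script, style, textarea, title, and xmp
        return "element:" + match.group(1).lower()
    if match.group(0).lower().startswith("<plaintext"):
        return "element:plaintext"
    return "skipped"  # comment or CDATA section

sample = '<title>A < B</title><!-- note --><STYLE type="text/css">p > a {}</STYLE>'
labels = [describe(m) for m in OUTER_XHTML.finditer(sample)]
print(labels)  # ['element:title', 'skipped', 'element:style']
```

In a real program, the branch for special elements would pass the match on to whatever handling the ordinary tags receive.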

Outer regex for XML

This regex matches comments, CDATA sections, and DOCTYPEs. Each of these cases is matched using a discrete pattern. The patterns are combined into one regex using the ‹|› alternation metacharacter:

<!--.*?--\s*>|<!\[CDATA\[.*?]]>|<!DOCTYPE\s(?:[^<>"']|"[^"]*"|↵
'[^']*'|<!(?:[^>"']|"[^"]*"|'[^']*')*>)*>

Regex options: Case insensitive, dot matches line breaks

Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Here it is again in free-spacing mode:

# Comment
<!-- .*? --\s*>
|
# CDATA section
<!\[CDATA\[ .*? ]]>
|
# Document type declaration
<!DOCTYPE\s
    (?: [^<>"']  # Non-special character
      | "[^"]*"  # Double-quoted value
      | '[^']*'  # Single-quoted value
      | <!(?:[^>"']|"[^"]*"|'[^']*')*>  # Markup declaration
    )*
>

Regex options: Case insensitive, dot matches line breaks, free-spacing

Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
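
To show how the three options map onto a real flavor, here is a small Python sketch that compiles the free-spacing regex with `re.VERBOSE`, `re.DOTALL`, and `re.IGNORECASE`. The sample document and the variable names are invented for the example.

```python
import re

OUTER_XML = re.compile(
    r"""
    <!-- .*? --\s*>          # Comment
    |
    <!\[CDATA\[ .*? ]]>      # CDATA section
    |
    <!DOCTYPE\s              # Document type declaration
        (?: [^<>"']          # Non-special character
          | "[^"]*"          # Double-quoted value
          | '[^']*'          # Single-quoted value
          | <!(?:[^>"']|"[^"]*"|'[^']*')*>  # Markup declaration
        )*
    >
    """,
    re.IGNORECASE | re.DOTALL | re.VERBOSE,
)

xml = """<!DOCTYPE note [
  <!ENTITY gt2 ">>">
]>
<note><!-- a <fake> tag --><![CDATA[ <b>raw</b> ]]><to>x</to></note>"""

# The first nine characters of each match are enough to tell the cases apart
kinds = [m.group(0)[:9] for m in OUTER_XML.finditer(xml)]
stripped = OUTER_XML.sub("", xml).strip()

print(kinds)
print(stripped)  # <note><to>x</to></note>
```

The quoted entity value containing > characters does not end the DOCTYPE match early, since quoted values are consumed as a unit.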

And here is a version that works in standard JavaScript (which lacks the “dot matches line breaks” and “free-spacing” options):

<!--[\s\S]*?--\s*>|<!\[CDATA\[[\s\S]*?]]>|<!DOCTYPE\s(?:[^<>"']|"[^"]*"|↵
'[^']*'|<!(?:[^>"']|"[^"]*"|'[^']*')*>)*>

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Tip

The regexes just shown allow whitespace via ‹\s*› between the closing -- and > of XML comments. This differs from the (X)HTML version shown earlier, because HTML5 and web browsers differ from XML on this point. See “Find valid HTML comments” for a discussion of the differences between valid XML and HTML comments.
