9.3. Remove All XML-Style Tags Except <em> and <strong>

Problem

You want to remove all tags in a string except <em> and <strong>.

In a separate case, you want to remove not only all tags other than <em> and <strong>, but also any <em> and <strong> tags that contain attributes.

Solution

This is a perfect setting to put negative lookahead (explained in Recipe 2.16) to use. Applied to this problem, negative lookahead lets you match what looks like a tag, except when certain words come immediately after the opening < or </. If you then replace all matches with an empty string (following the code in Recipe 3.14), only the approved tags are left behind.

Solution 1: Match tags except <em> and <strong>

</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

In free-spacing mode:

< /?                   # Permit closing tags
(?!
    (?: em | strong )  # List of tags to avoid matching
    \b                 # Word boundary avoids partial word matches
)
[a-z]                  # Tag name initial character must be a-z
(?: [^>"']             # Any character except >, ", or '
  | "[^"]*"            # Double-quoted attribute value
  | '[^']*'            # Single-quoted attribute value
)*
>

Regex options: Case insensitive, free-spacing

Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
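
To see Solution 1 at work, here is a minimal Python sketch that replaces every match with an empty string. The function name `strip_tags` and the sample string are our own illustrations, not code from Recipe 3.14.

```python
import re

# Solution 1: match every tag except <em> and <strong>,
# whether or not those two tags carry attributes.
TAG_EXCEPT_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

def strip_tags(markup: str) -> str:
    """Remove every matched tag, leaving <em> and <strong> in place."""
    return TAG_EXCEPT_EM_STRONG.sub("", markup)

sample = '<p>Some <em>emphasized</em> and <b>bold</b> text with a <a href="#">link</a>.</p>'
print(strip_tags(sample))
```

The word boundary is what keeps a tag such as `<embed>` from being mistaken for `<em>`: the lookahead sees "em" followed by another word character, so the tag is still matched and removed.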

Solution 2: Match tags except <em> and <strong>, and any tags that contain attributes

With one change (replacing the ‹\b› with ‹\s*>›), you can make the regex also match any <em> and <strong> tags that contain attributes:

</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Once again, the same regex in free-spacing mode:

< /?                   # Permit closing tags
(?!
    (?: em | strong )  # List of tags to avoid matching
    \s* >              # Only avoid tags if they contain no attributes
)
[a-z]                  # Tag name initial character must be a-z
(?: [^>"']             # Any character except >, ", or '
  | "[^"]*"            # Double-quoted attribute value
  | '[^']*'            # Single-quoted attribute value
)*
>

Regex options: Case insensitive, free-spacing

Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
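
The free-spacing version can be used directly in Python through the `re.VERBOSE` flag. The following is a sketch only; the names `SOLUTION_2_VERBOSE` and `strip_disallowed` and the sample markup are hypothetical.

```python
import re

# Solution 2 in free-spacing mode. re.VERBOSE ignores unescaped whitespace
# outside character classes and treats "#" as the start of a comment.
SOLUTION_2_VERBOSE = re.compile(
    r"""
    < /?                   # Permit closing tags
    (?!
        (?: em | strong )  # List of tags to avoid matching
        \s* >              # Only avoid tags if they contain no attributes
    )
    [a-z]                  # Tag name initial character must be a-z
    (?: [^>"']             # Any character except >, ", or '
      | "[^"]*"            # Double-quoted attribute value
      | '[^']*'            # Single-quoted attribute value
    )*
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)

def strip_disallowed(markup: str) -> str:
    """Remove all tags except attribute-free <em> and <strong> tags."""
    return SOLUTION_2_VERBOSE.sub("", markup)

sample = '<p><em style="color:red">Loud</em> and <strong>clear</strong></p>'
print(strip_disallowed(sample))
```

Note that when an opening `<em>` tag is removed because of its attributes, the matching `</em>` is left behind, since the regex looks at each tag in isolation.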

Discussion

This recipe’s regular expressions have a lot in common with those we’ve included earlier in this chapter for matching XML-style tags. Apart from the negative lookahead added to prevent some tags from being matched, these regexes are nearly equivalent to the “(X)HTML tags (loose)” regex from Recipe 9.1. The other main difference here is that we’re not capturing the tag name to backreference 1.

So let’s look more closely at what’s new in this recipe. Solution 1 never matches <em> or <strong> tags, regardless of whether they have any attributes, but matches all other tags. Solution 2 matches all the same tags as Solution 1, and additionally matches <em> and <strong> tags that contain one or more attributes. Table 9-2 shows a few example subject strings that illustrate this.

Table 9-2. A few example subject strings

Subject string	Solution 1	Solution 2
`<i>`	Match	Match
`</i>`	Match	Match
`<i style="font-size:500%; color:red;">`	Match	Match
`<em>`	No match	No match
`</em>`	No match	No match
`<em style="font-size:500%; color:red;">`	No match	Match
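
The rows of Table 9-2 can be checked with a short Python sketch. The helper `check` and the constant names are hypothetical; the subject strings are the ones listed in the table.

```python
import re

# Shared tail of both regexes: attribute text followed by the closing ">".
ATTRS = r"""(?:[^>"']|"[^"]*"|'[^']*')*>"""
SOLUTION_1 = re.compile(r"</?(?!(?:em|strong)\b)[a-z]" + ATTRS, re.IGNORECASE)
SOLUTION_2 = re.compile(r"</?(?!(?:em|strong)\s*>)[a-z]" + ATTRS, re.IGNORECASE)

SUBJECTS = [
    "<i>",
    "</i>",
    '<i style="font-size:500%; color:red;">',
    "<em>",
    "</em>",
    '<em style="font-size:500%; color:red;">',
]

def check(subject: str) -> tuple:
    """Return (Solution 1 matches?, Solution 2 matches?) for one subject."""
    return (
        SOLUTION_1.fullmatch(subject) is not None,
        SOLUTION_2.fullmatch(subject) is not None,
    )

for subject in SUBJECTS:
    first, second = check(subject)
    label_1 = "Match" if first else "No match"
    label_2 = "Match" if second else "No match"
    print(f"{subject:45} {label_1:10} {label_2}")
```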

Since the point of these regexes is to replace matches with empty strings (in other words, remove the tags), Solution 2 is less prone to abuse of the allowed <em> and <strong> tags to provide unexpected formatting or other shenanigans.

Caution

This recipe has (until now) intentionally avoided the word “whitelist” when describing how only a few tags are left in place, since that word has security connotations. There are a variety of ways to work around this pattern’s constraints using specially crafted, malicious HTML strings. If you’re worried about malicious HTML and cross-site scripting (XSS) attacks, your safest bet is to convert all <, >, and & characters to their corresponding named character references (&lt;, &gt;, and &amp;), then bring back tags that are known to be safe (as long as they contain no attributes or only use those within a select list of approved attributes). style is an example of an attribute that is not safe, since some browsers let you embed scripting language code in your CSS. To bring back <em> and <strong> tags with no attributes after replacing <, >, and & with character references, search case-insensitively using the regex ‹&lt;(/?)(em|strong)&gt;› and replace matches with «<$1$2>» (or in Python and Ruby, «<\1\2>»).
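
The escape-then-restore approach described above might look like this in Python. It is a sketch under our own assumptions: `html.escape` from the standard library stands in for the character conversion, and the function name `sanitize` is hypothetical.

```python
import html
import re

# Matches an escaped, attribute-free <em>, </em>, <strong>, or </strong>.
ESCAPED_SAFE_TAG = re.compile(r"&lt;(/?)(em|strong)&gt;", re.IGNORECASE)

def sanitize(text: str) -> str:
    """Escape &, <, and >, then restore attribute-free <em>/<strong> tags."""
    # quote=False leaves quotation marks alone; only &, <, and > are converted.
    escaped = html.escape(text, quote=False)
    return ESCAPED_SAFE_TAG.sub(r"<\1\2>", escaped)

sample = '<em>Hi</em> <script>alert(1)</script> & <em style="x">there</em>'
print(sanitize(sample))
```

Everything that is not an exact, attribute-free `<em>` or `<strong>` tag stays escaped, so it is displayed as text rather than interpreted as markup.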

Variations

Whitelist specific attributes

Consider these new requirements: you need to match all tags except <a>, <em>, and <strong>, with two exceptions. Any <a> tags that have attributes other than href or title should be matched, and if <em> or <strong> tags have any attributes at all, match them too. All matched strings will be removed.

In other words, you want to remove all tags except those on your whitelist (<a>, <em>, and <strong>). The only whitelisted attributes are href and title, and they are allowed only within <a> tags. If a nonwhitelisted attribute appears in any tag, the entire tag should be removed.

Here’s a regex that can get the job done:

</?(?!(?:em|strong|a(?:\s+(?:href|title)\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>)↵
[a-z](?:[^>"']|"[^"]*"|'[^']*')*>

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

With free-spacing:

< /?          # Permit closing tags
(?!
  (?: em      # Don't match <em>
    | strong  #   or <strong>
    | a       #   or <a>
      (?:     # Only avoid matching <a> tags that use only
        \s+   #   href and/or title attributes
        (?:href|title)
        \s*=\s*
        (?:"[^"]*"|'[^']*')  # Quoted attribute value
      )*
  )
  \s* >       # Only avoid matching these tags when they're
)             #   limited to any attributes permitted above
[a-z]         # Tag name initial character must be a-z
(?: [^>"']    # Any character except >, ", or '
  | "[^"]*"   # Double-quoted attribute value
  | '[^']*'   # Single-quoted attribute value
)*
>

Regex options: Case insensitive, free-spacing

Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
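
As a rough illustration, the whitelist regex can be applied in Python as follows. The pattern includes the optional slash shown in the free-spacing listing so that closing tags are handled too; the names `WHITELIST_TAGS` and `strip_unlisted` and the sample markup are hypothetical.

```python
import re

# Matches every tag except attribute-free <em>/<strong> and <a> tags whose
# only attributes are quoted href and/or title values.
WHITELIST_TAGS = re.compile(
    r"""</?(?!(?:em|strong|a(?:\s+(?:href|title)\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>)"""
    r"""[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

def strip_unlisted(markup: str) -> str:
    """Remove every tag that the whitelist regex matches."""
    return WHITELIST_TAGS.sub("", markup)

sample = (
    '<p><a href="/x" title=\'T\'>ok</a> '
    '<a href="/y" onclick="z()">bad</a> <em>e</em></p>'
)
print(strip_unlisted(sample))
```

The second `<a>` tag is removed because of its onclick attribute, but its closing `</a>` survives, because each tag is judged on its own.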

This pushes the boundary of where it makes sense to use such a complicated regex. If your rules get any more complex than this, it would probably be better to write some code based on Recipe 3.11 or 3.16 that checks the value of each matched tag to determine how to process it (based on the tag name, included attributes, or whatever else is needed).
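
One way such code could be structured in Python is to match every tag with a simple regex and decide its fate in a replacement function. This is a sketch under our own assumptions, not the code from Recipe 3.11 or 3.16; `ALLOWED`, `TAG`, `ATTRIBUTE`, `filter_tag`, and `clean` are hypothetical names.

```python
import re

# Tag name -> set of attribute names permitted on that tag.
ALLOWED = {"em": set(), "strong": set(), "a": {"href", "title"}}

# Group 1: optional slash, group 2: tag name, group 3: the rest of the tag.
TAG = re.compile(
    r"""<(/?)([a-z][a-z0-9]*)((?:[^>"']|"[^"]*"|'[^']*')*)>""", re.IGNORECASE
)
# One attribute with a quoted value; group 1 is the attribute name.
ATTRIBUTE = re.compile(
    r"""\s+([a-z][-a-z0-9]*)\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE
)

def filter_tag(match: re.Match) -> str:
    """Return the tag unchanged if it passes the whitelist, else ''."""
    closing, name, rest = match.group(1), match.group(2).lower(), match.group(3)
    allowed = ALLOWED.get(name)
    if allowed is None:
        return ""  # Tag name is not on the whitelist
    names = [attr.lower() for attr in ATTRIBUTE.findall(rest)]
    leftover = ATTRIBUTE.sub("", rest).strip()
    if leftover or (closing and names) or any(n not in allowed for n in names):
        return ""  # Unparsed text, attributes on a closing tag, or a bad attribute
    return match.group(0)

def clean(markup: str) -> str:
    """Run every tag in the markup through filter_tag."""
    return TAG.sub(filter_tag, markup)

print(clean('<p><a href="/x">ok</a> <em style="x">e</em></p>'))
```

With the rules held in a dictionary, adding another tag or attribute is a data change rather than another branch in an already dense lookahead.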
