9.3. Remove All XML-Style Tags Except <em> and <strong>

Problem

You want to remove all tags in a string except <em> and <strong>.

In a separate case, you not only want to remove all tags other than <em> and <strong>, you also want to remove <em> and <strong> tags that contain attributes.

Solution

This is a perfect setting to put negative lookahead (explained in Recipe 2.16) to use. Applied to this problem, negative lookahead lets you match what looks like a tag, except when certain words come immediately after the opening < or </. If you then replace all matches with an empty string (following the code in Recipe 3.14), only the approved tags are left behind.

Solution 1: Match tags except <em> and <strong>

</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

In free-spacing mode:

< /?                   # Permit closing tags
(?!
    (?: em | strong )  # List of tags to avoid matching
    \b                 # Word boundary avoids partial word matches
)
[a-z]                  # Tag name initial character must be a-z
(?: [^>"']             # Any character except >, ", or '
  | "[^"]*"            # Double-quoted attribute value
  | '[^']*'            # Single-quoted attribute value
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
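
As a minimal sketch of how Solution 1 might be applied, the following Python code compiles the free-spacing regex (with the \b word boundary inside the lookahead) and replaces every match with an empty string. The names TAGS_EXCEPT_EM_STRONG, strip_tags, and sample are our own for illustration; they are not taken from Recipe 3.14.

```python
import re

# Solution 1 in free-spacing mode (Python calls this re.VERBOSE).
TAGS_EXCEPT_EM_STRONG = re.compile(
    r"""
    < /?                   # Permit closing tags
    (?!
        (?: em | strong )  # List of tags to avoid matching
        \b                 # Word boundary avoids partial word matches
    )
    [a-z]                  # Tag name initial character must be a-z
    (?: [^>"']             # Any character except >, ", or '
      | "[^"]*"            # Double-quoted attribute value
      | '[^']*'            # Single-quoted attribute value
    )*
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)


def strip_tags(html: str) -> str:
    """Remove every tag except <em> and <strong> (hypothetical helper)."""
    return TAGS_EXCEPT_EM_STRONG.sub("", html)


sample = '<p class="x">An <em>important</em> <b>point</b> about <embed src="a.swf"> files</p>'
print(strip_tags(sample))
# prints: An <em>important</em> point about  files
```

Note that <embed> is removed even though its name starts with "em". That is the job of the word boundary: without it, the lookahead would also protect any tag whose name merely begins with em or strong.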

Solution 2: Match tags except <em> and <strong>, and any tags that contain attributes

With one change (replacing the \b with \s*>), you can make the regex also match any <em> and <strong> tags that contain attributes:

</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Once again, the same regex in free-spacing mode:

< /?                   # Permit closing tags
(?!
    (?: em | strong )  # List of tags to avoid matching
    \s* >              # Only avoid tags if they contain no attributes
)
[a-z]                  # Tag name initial character must be a-z
(?: [^>"']             # Any character except >, ", or '
  | "[^"]*"            # Double-quoted attribute value
  | '[^']*'            # Single-quoted attribute value
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
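
The effect of Solution 2 can be sketched in Python as follows, using \s*> inside the lookahead. The names SOLUTION_2 and strip_tags_strict are hypothetical, chosen only for this example.

```python
import re

# Solution 2: <em> and <strong> survive only when they carry no attributes.
SOLUTION_2 = re.compile(
    r"""</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def strip_tags_strict(html: str) -> str:
    """Remove all tags except attribute-free <em> and <strong> (hypothetical helper)."""
    return SOLUTION_2.sub("", html)


print(strip_tags_strict('<em onclick="x()">hi</em> <strong >ok</strong>'))
# prints: hi</em> <strong >ok</strong>
```

Because the opening <em> tag is removed while its closing tag is not, the output can contain an orphaned </em>. Whether that matters depends on what consumes the result.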

Discussion

This recipe’s regular expressions have a lot in common with those we’ve included earlier in this chapter for matching XML-style tags. Apart from the negative lookahead added to prevent some tags from being matched, these regexes are nearly equivalent to the “(X)HTML tags (loose)” regex from Recipe 9.1. The other main difference here is that we’re not capturing the tag name to backreference 1.

So let’s look more closely at what’s new in this recipe. Solution 1 never matches <em> or <strong> tags, regardless of whether they have any attributes, but matches all other tags. Solution 2 matches all the same tags as Solution 1, and additionally matches <em> and <strong> tags that contain one or more attributes. Table 9-2 shows a few example subject strings that illustrate this.

Table 9-2. A few example subject strings

Subject string                            Solution 1   Solution 2
<i>                                       Match        Match
</i>                                      Match        Match
<i style="font-size:500%; color:red;">    Match        Match
<em>                                      No match     No match
</em>                                     No match     No match
<em style="font-size:500%; color:red;">   No match     Match

Since the point of these regexes is to replace matches with empty strings (in other words, remove the tags), Solution 2 is less prone to abuse of the allowed <em> and <strong> tags to provide unexpected formatting or other shenanigans.
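
If you want to check the results in Table 9-2 for yourself, a short script along these lines should reproduce them. It is only a sketch: TAIL, SOLUTION_1, SOLUTION_2, SUBJECTS, and table_rows are names we made up, and the two patterns differ only in what follows (?:em|strong) inside the lookahead (\b versus \s*>).

```python
import re

# Shared tail of both regexes: first letter of the tag name, then the rest of the tag.
TAIL = r"""[a-z](?:[^>"']|"[^"]*"|'[^']*')*>"""
SOLUTION_1 = re.compile(r"</?(?!(?:em|strong)\b)" + TAIL, re.IGNORECASE)
SOLUTION_2 = re.compile(r"</?(?!(?:em|strong)\s*>)" + TAIL, re.IGNORECASE)

SUBJECTS = [
    '<i>',
    '</i>',
    '<i style="font-size:500%; color:red;">',
    '<em>',
    '</em>',
    '<em style="font-size:500%; color:red;">',
]


def table_rows():
    """Return (subject, Solution 1 matched?, Solution 2 matched?) per subject."""
    return [
        (s, SOLUTION_1.fullmatch(s) is not None, SOLUTION_2.fullmatch(s) is not None)
        for s in SUBJECTS
    ]


for subject, one, two in table_rows():
    print(f"{subject:<42}{'Match' if one else 'No match':<13}{'Match' if two else 'No match'}")
```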

Caution

This recipe has (until now) intentionally avoided the word “whitelist” when describing how only a few tags are left in place, since that word has security connotations. There are a variety of ways to work around this pattern’s constraints using specially crafted, malicious HTML strings. If you’re worried about malicious HTML and cross-site scripting (XSS) attacks, your safest bet is to convert all <, >, and & characters to their corresponding named character references (&lt;, &gt;, and &amp;), then bring back tags that are known to be safe (as long as they contain no attributes or only use those within a select list of approved attributes). The style attribute is an example of one that is not safe, since some browsers let you embed scripting language code in your CSS. To bring back <em> and <strong> tags with no attributes after replacing <, >, and & with character references, search case-insensitively using the regex &lt;(/?)(em|strong)&gt; and replace matches with «<$1$2>» (or in Python and Ruby, «<\1\2>»).
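
The escape-then-restore approach just described might look like this in Python. This is a sketch rather than a vetted sanitizer: escape_then_restore is a hypothetical name, and we lean on the standard library's html.escape with quote=False to convert &, <, and > before restoring the attribute-free tags.

```python
import html
import re

# Matches an escaped, attribute-free <em> or <strong> tag (opening or closing).
ESCAPED_SAFE_TAG = re.compile(r"&lt;(/?)(em|strong)&gt;", re.IGNORECASE)


def escape_then_restore(text: str) -> str:
    """Escape all markup, then bring back plain <em> and <strong> tags."""
    # quote=False escapes only &, <, and >, leaving quote characters alone.
    escaped = html.escape(text, quote=False)
    # In Python, the replacement text uses \1 and \2 for the backreferences.
    return ESCAPED_SAFE_TAG.sub(r"<\1\2>", escaped)


print(escape_then_restore('<em>Hi</em> <script>alert("x")</script> & <strong class="a">b</strong>'))
# prints: <em>Hi</em> &lt;script&gt;alert("x")&lt;/script&gt; &amp; &lt;strong class="a"&gt;b</strong>
```

Everything that is not restored stays visible as literal text instead of being interpreted as markup, which is why this approach fails more safely than deleting unwanted tags.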

Variations

Whitelist specific attributes

Consider these new requirements: you need to match all tags except <a>, <em>, and <strong>, with two exceptions. Any <a> tags that have attributes other than href or title should be matched, and if <em> or <strong> tags have any attributes at all, match them too. All matched strings will be removed.

In other words, you want to remove all tags except those on your whitelist (<a>, <em>, and <strong>). The only whitelisted attributes are href and title, and they are allowed only within <a> tags. If a nonwhitelisted attribute appears in any tag, the entire tag should be removed.

Here’s a regex that can get the job done:

</?(?!(?:em|strong|a(?:\s+(?:href|title)\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>)↵
[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

With free-spacing:

< /?          # Permit closing tags
(?!
  (?: em      # Don't match <em>
    | strong  #   or <strong>
    | a       #   or <a>
      (?:     # Only avoid matching <a> tags that use only
        \s+   #   href and/or title attributes
        (?:href|title)
        \s*=\s*
        (?:"[^"]*"|'[^']*')  # Quoted attribute value
      )*
  )
  \s* >       # Only avoid matching these tags when they're
)             #   limited to any attributes permitted above
[a-z]         # Tag name initial character must be a-z
(?: [^>"']    # Any character except >, ", or '
  | "[^"]*"   # Double-quoted attribute value
  | '[^']*'   # Single-quoted attribute value
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
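
Here is a rough Python sketch of this variation in use. It follows the free-spacing version in permitting closing tags with /?, and it writes the whitespace tokens as \s. WHITELIST and strip_unlisted are hypothetical names used only for this example.

```python
import re

# The pattern is split across two adjacent raw strings purely for line length.
WHITELIST = re.compile(
    r"""</?(?!(?:em|strong|a(?:\s+(?:href|title)\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>)"""
    r"""[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)


def strip_unlisted(html: str) -> str:
    """Remove every tag that is not on the whitelist (hypothetical helper)."""
    return WHITELIST.sub("", html)


sample = (
    '<a href="/x" title=\'t\'>link</a> '
    '<a href="/y" onclick="evil()">bad</a> '
    '<em class="c">e</em> <span>s</span>'
)
print(strip_unlisted(sample))
# prints: <a href="/x" title='t'>link</a> bad</a> e</em> s
```

As the output shows, removing an opening tag does not remove its closing partner, so orphaned </a> and </em> tags can remain.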

This pushes the boundary of where it makes sense to use such a complicated regex. If your rules get any more complex than this, it would probably be better to write some code based on Recipe 3.11 or 3.16 that checks the value of each matched tag to determine how to process it (based on the tag name, included attributes, or whatever else is needed).
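
To give a sense of what such code could look like, here is a sketch in Python, where re.sub accepts a function that decides the replacement for each match. It is one possible design, not the code from Recipe 3.11 or 3.16. The ALLOWED policy table and the names TAG, ATTR, ATTRS_ONLY, filter_tag, and filter_html are all assumptions made for this example.

```python
import re

# Hypothetical policy: tag name -> attribute names permitted on that tag.
ALLOWED = {"em": set(), "strong": set(), "a": {"href", "title"}}

# Loose tag matcher: group 1 is the closing slash, 2 the tag name, 3 the remainder.
TAG = re.compile(r"""<(/?)([a-z][a-z0-9]*)((?:[^>"']|"[^"]*"|'[^']*')*)>""", re.IGNORECASE)

# One quoted attribute; group 1 is the attribute name.
ATTR_SOURCE = r"""\s+([a-z][-a-z0-9]*)\s*=\s*(?:"[^"]*"|'[^']*')"""
ATTR = re.compile(ATTR_SOURCE, re.IGNORECASE)
# The remainder of a tag must consist of quoted attributes and whitespace only.
ATTRS_ONLY = re.compile(r"(?:" + ATTR_SOURCE + r")*\s*", re.IGNORECASE)


def filter_tag(match):
    """Return the tag unchanged if the policy allows it, else an empty string."""
    closing, name, rest = match.group(1), match.group(2).lower(), match.group(3)
    if name not in ALLOWED:
        return ""
    if closing:
        # Closing tags may not carry anything except optional whitespace.
        return match.group(0) if not rest.strip() else ""
    if not ATTRS_ONLY.fullmatch(rest):
        return ""  # Unquoted values or stray characters: drop the tag
    names = {m.group(1).lower() for m in ATTR.finditer(rest)}
    return match.group(0) if names <= ALLOWED[name] else ""


def filter_html(html: str) -> str:
    return TAG.sub(filter_tag, html)


print(filter_html('<a href="/x">ok</a> <a href="/y" onclick="evil()">bad</a> <em class="c">e</em>'))
# prints: <a href="/x">ok</a> bad</a> e</em>
```

Keeping the rules in a plain data structure makes it easier to add tags or attributes later than editing a long lookahead by hand.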

See Also

Recipe 9.1 shows how to match all XML-style tags while balancing trade-offs including tolerance for invalid markup.

Recipe 9.2 is the opposite of this recipe, and shows how to match a select list of tags, rather than all except a few.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.
