You want to remove all tags in a string except <em> and <strong>.
In a separate case, you not only want to remove all tags other than <em> and <strong>, you also want to remove <em> and <strong> tags that contain attributes.
This is a perfect setting to put negative lookahead (explained in Recipe 2.16) to use. Applied to this problem, negative lookahead lets you match what looks like a tag, except when certain words come immediately after the opening < or </. If you then replace all matches with an empty string (following the code in Recipe 3.14), only the approved tags are left behind.
</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
In free-spacing mode:
<
/?                     # Permit closing tags
(?!
    (?: em | strong )  # List of tags to avoid matching
    \b                 # Word boundary avoids partial word matches
)
[a-z]                  # Tag name initial character must be a-z
(?: [^>"']             # Any character except >, ", or '
  | "[^"]*"            # Double-quoted attribute value
  | '[^']*'            # Single-quoted attribute value
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
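As a minimal sketch of the replace-with-empty-string step in Python, the regex with its word boundary can be applied as follows. The function name and the sample string are assumptions made for illustration and do not come from the recipe.

```python
import re

# Match every tag except <em> and <strong>, whether or not they carry
# attributes. The \b keeps "em" from also excluding names such as <embed>.
TAGS_EXCEPT_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

def strip_tags_keep_em_strong(html: str) -> str:
    """Remove every matched tag by replacing it with the empty string."""
    return TAGS_EXCEPT_EM_STRONG.sub("", html)

# Hypothetical sample input.
sample = '<p>Read <em>this</em> and <b class="x">that</b>, <EMBED src="a">now</p>'
print(strip_tags_keep_em_strong(sample))
```

In this sample, <EMBED> is removed even though its name begins with "em", because the word boundary fails between "EM" and "BED".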
With one change (replacing the ‹\b› with ‹\s*>›), you can make the regex also match any <em> and <strong> tags that contain attributes:
</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Once again, the same regex in free-spacing mode:
<
/?                     # Permit closing tags
(?!
    (?: em | strong )  # List of tags to avoid matching
    \s* >              # Only avoid tags if they contain no attributes
)
[a-z]                  # Tag name initial character must be a-z
(?: [^>"']             # Any character except >, ", or '
  | "[^"]*"            # Double-quoted attribute value
  | '[^']*'            # Single-quoted attribute value
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
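The attribute-sensitive variant can be tried the same way. This is only a sketch. The function name and the sample string are hypothetical.

```python
import re

# Match every tag except <em> and <strong> tags that have no attributes.
# The lookahead now requires optional whitespace and then ">" right after
# the tag name, so <em style="..."> is no longer protected from matching.
TAGS_EXCEPT_BARE_EM_STRONG = re.compile(
    r"""</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""",
    re.IGNORECASE,
)

def strip_tags_keep_bare_em_strong(html: str) -> str:
    """Remove matched tags; attribute-free <em>/<strong> tags survive."""
    return TAGS_EXCEPT_BARE_EM_STRONG.sub("", html)

# Hypothetical sample input.
sample = '<em>kept</em> <em style="color:red">stripped</em> <strong >kept</strong>'
print(strip_tags_keep_bare_em_strong(sample))
```

Each tag is judged on its own, so removing an opening <em> that carries attributes leaves its closing </em> in place. If unbalanced output is a concern, that needs handling beyond this regex.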
This recipe’s regular expressions have a lot in common with those we’ve included earlier in this chapter for matching XML-style tags. Apart from the negative lookahead added to prevent some tags from being matched, these regexes are nearly equivalent to the “(X)HTML tags (loose)” regex from Recipe 9.1. The other main difference here is that we’re not capturing the tag name to backreference 1.
So let’s look more closely at what’s new in this recipe. Solution 1 (the first regex above) never matches <em> or <strong> tags, regardless of whether they have any attributes, but matches all other tags. Solution 2 (the second regex) matches all the same tags as Solution 1, and additionally matches <em> and <strong> tags that contain one or more attributes. Table 9-2 shows a few example subject strings that illustrate this.
Table 9-2. A few example subject strings
Subject string | Solution 1 | Solution 2 |
---|---|---|
| Match | Match |
| Match | Match |
| Match | Match |
| No match | No match |
| No match | No match |
| No match | Match |
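The comparison can be reproduced with a short script. The subject strings below are illustrative choices made for this sketch and are an assumption, not necessarily the strings the table originally listed. The helper names are likewise hypothetical.

```python
import re

SOLUTION_1 = re.compile(
    r"""</?(?!(?:em|strong)\b)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""", re.IGNORECASE
)
SOLUTION_2 = re.compile(
    r"""</?(?!(?:em|strong)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>""", re.IGNORECASE
)

# Illustrative subjects picked for this sketch (assumed, not from the table).
SUBJECTS = [
    '<i>',
    '</i>',
    '<i style="color:red">',
    '<em>',
    '</em>',
    '<em style="color:red">',
]

def verdict(pattern, subject):
    """Report whether the whole subject string is matched as one tag."""
    return "Match" if pattern.fullmatch(subject) else "No match"

def compare(subjects):
    """Return (subject, Solution 1 verdict, Solution 2 verdict) rows."""
    return [(s, verdict(SOLUTION_1, s), verdict(SOLUTION_2, s)) for s in subjects]

for row in compare(SUBJECTS):
    print(" | ".join(row))
```

With these subjects, the two solutions differ only on the <em> tag that carries an attribute.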
Since the point of these regexes is to replace matches with empty strings (in other words, remove the tags), Solution 2 is less prone to abuse of the allowed <em> and <strong> tags to provide unexpected formatting or other shenanigans.
This recipe has (until now) intentionally avoided the word “whitelist” when describing how only a few tags are left in place, since that word has security connotations. There are a variety of ways to work around this pattern’s constraints using specially crafted, malicious HTML strings. If you’re worried about malicious HTML and cross-site scripting (XSS) attacks, your safest bet is to convert all <, >, and & characters to their corresponding named character references (&lt;, &gt;, and &amp;), then bring back tags that are known to be safe (as long as they contain no attributes or only use those within a select list of approved attributes). style is an example of an attribute that is not safe, since some browsers let you embed scripting language code in your CSS. To bring back <em> and <strong> tags with no attributes after replacing <, >, and & with character references, search case-insensitively using the regex ‹&lt;(/?)(em|strong)&gt;› and replace matches with «<$1$2>» (or in Python and Ruby, «<\1\2>»).
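The escape-then-restore approach described above might look like this in Python. It is a sketch under stated assumptions: `html.escape` from the standard library is used here for the character-reference step (the text does not prescribe a particular function), and the function name `escape_then_restore` is hypothetical.

```python
import html
import re

# Finds escaped <em>, </em>, <strong>, and </strong> tags that have no attributes.
RESTORE_BARE_EM_STRONG = re.compile(r"&lt;(/?)(em|strong)&gt;", re.IGNORECASE)

def escape_then_restore(untrusted: str) -> str:
    # Step 1: convert &, <, and > to &amp;, &lt;, and &gt;.
    # quote=False leaves quotation marks untouched.
    escaped = html.escape(untrusted, quote=False)
    # Step 2: bring back only attribute-free <em> and <strong> tags.
    return RESTORE_BARE_EM_STRONG.sub(r"<\1\2>", escaped)

# Hypothetical sample input.
sample = "<em>hi</em> <script>x</script> A & B"
print(escape_then_restore(sample))
```

Because & is escaped along with < and >, input that already contains the text `&lt;em&gt;` becomes `&amp;lt;em&amp;gt;` and is not turned into a live tag by the restore step.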
Consider these new requirements: you need to match all tags except <a>, <em>, and <strong>, with two exceptions. Any <a> tags that have attributes other than href or title should be matched, and if <em> or <strong> tags have any attributes at all, match them too. All matched strings will be removed.
In other words, you want to remove all tags except those on your whitelist (<a>, <em>, and <strong>). The only whitelisted attributes are href and title, and they are allowed only within <a> tags. If a nonwhitelisted attribute appears in any tag, the entire tag should be removed.
Here’s a regex that can get the job done:
</?(?!(?:em|strong|a(?:\s+(?:href|title)\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*>)[a-z](?:[^>"']|"[^"]*"|'[^']*')*>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
With free-spacing:
<
/?                        # Permit closing tags
(?!
    (?: em                # Don't match <em>
      | strong            # or <strong>
      | a                 # or <a>
        (?:               # Only avoid matching <a> tags that use only
          \s+             # href and/or title attributes
          (?:href|title)
          \s*=\s*
          (?:"[^"]*"|'[^']*')  # Quoted attribute value
        )*
    )
    \s* >                 # Only avoid matching these tags when they're
)                         # limited to any attributes permitted above
[a-z]                     # Tag name initial character must be a-z
(?: [^>"']                # Any character except >, ", or '
  | "[^"]*"               # Double-quoted attribute value
  | '[^']*'               # Single-quoted attribute value
)*
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
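To see which tags this regex would remove, the free-spacing form can be compiled with Python's `re.VERBOSE` flag. The helper name `tags_to_remove` and the sample string are assumptions made for this sketch.

```python
import re

TAGS_OUTSIDE_WHITELIST = re.compile(
    r"""
    < /?                          # Permit closing tags
    (?!
        (?: em | strong           # Protected only when attribute-free
          | a                     # <a> may carry href and/or title
            (?: \s+ (?:href|title) \s*=\s* (?:"[^"]*"|'[^']*') )*
        )
        \s* >
    )
    [a-z]                         # Tag name initial character must be a-z
    (?: [^>"'] | "[^"]*" | '[^']*' )*
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)

def tags_to_remove(html: str) -> list:
    """List every tag the regex matches, i.e. every tag that would be removed."""
    return TAGS_OUTSIDE_WHITELIST.findall(html)

# Hypothetical sample input.
sample = (
    """<a href="/x" title='T'>ok</a> """
    """<a href="/x" onclick="f()">no</a> <em id="e">no</em> <br>"""
)
print(tags_to_remove(sample))
```

The first <a> tag and all closing tags are left alone. The <a> tag with onclick, the <em> tag with id, and <br> are reported as matches.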
This pushes the boundary of where it makes sense to use such a complicated regex. If your rules get any more complex than this, it would probably be better to write some code based on Recipe 3.11 or 3.16 that checks the value of each matched tag to determine how to process it (based on the tag name, included attributes, or whatever else is needed).
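One way such code could look in Python is a replacement callback that inspects each matched tag. This is only a sketch, not the code from Recipe 3.11 or 3.16. The policy table `ALLOWED_ATTRS`, the helper names, and the loose tag regex with capturing groups are all assumptions introduced here.

```python
import re

# Loose tag matcher: group 1 = optional "/", group 2 = tag name,
# group 3 = everything between the name and the closing ">".
ANY_TAG = re.compile(
    r"""<(/?)([a-z][a-z0-9]*)((?:[^>"']|"[^"]*"|'[^']*')*)>""", re.IGNORECASE
)
# A single quoted attribute, with its name captured in group 1.
ONE_ATTR = re.compile(r"""\s+([a-z][\w-]*)\s*=\s*(?:"[^"]*"|'[^']*')""", re.IGNORECASE)

# Hypothetical policy: tag name -> attribute names it may carry.
ALLOWED_ATTRS = {"em": set(), "strong": set(), "a": {"href", "title"}}

def attribute_names(attrs):
    """Return the attribute names, or None if anything else is present."""
    names, pos = [], 0
    while True:
        m = ONE_ATTR.match(attrs, pos)
        if not m:
            break
        names.append(m.group(1).lower())
        pos = m.end()
    return names if attrs[pos:].strip() == "" else None

def keep_or_drop(match):
    """Return the tag unchanged if the policy allows it, else an empty string."""
    closing, name, attrs = match.group(1), match.group(2).lower(), match.group(3)
    allowed = ALLOWED_ATTRS.get(name)
    if allowed is None:
        return ""                      # Tag name is not on the list
    names = attribute_names(attrs)
    if names is None:
        return ""                      # Unquoted or malformed attribute text
    permitted = set() if closing else allowed
    return match.group(0) if all(n in permitted for n in names) else ""

def filter_tags(html: str) -> str:
    return ANY_TAG.sub(keep_or_drop, html)

# Hypothetical sample input.
sample = (
    """<a href="/x" title='T'>ok</a> """
    """<a href="/x" onclick="f()">no</a> <em id="e">no</em><br>"""
)
print(filter_tags(sample))
```

Adding a tag or attribute then means editing the policy table rather than the regex. As with the regex-only approach, each tag is judged independently, so a closing tag can outlive an opening tag that was dropped.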
Recipe 9.1 shows how to match all XML-style tags while balancing trade-offs including tolerance for invalid markup.
Recipe 9.2 is the opposite of this recipe, and shows how to match a select list of tags, rather than all except a few.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.