You want to replace all opening and closing <b> tags in a string with corresponding <strong> tags, while preserving any existing attributes.
This regex matches opening and closing <b> tags, with or without attributes:
<(/?)b\b((?:[^>"']|"[^"]*"|'[^']*')*)>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
In free-spacing mode:
<
(/?)             # Capture the optional leading slash to backreference 1
b \b             # Tag name, with word boundary
(                # Capture any attributes, etc. to backreference 2
  (?: [^>"']     # Any character except >, ", or '
    | "[^"]*"    # Double-quoted attribute value
    | '[^']*'    # Single-quoted attribute value
  )*
)
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
To preserve all attributes while changing the tag name, use the following replacement text:
<$1strong$2>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
<\1strong\2>
Replacement text flavors: Python, Ruby
If you want to discard any attributes in the same process, omit backreference 2 in the replacement string:
<$1strong>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
<\1strong>
Replacement text flavors: Python, Ruby
Recipe 3.15 shows the code needed to implement these replacements.
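As a rough illustration of the solution in one of the listed flavors, here is a minimal Python sketch. The function name b_to_strong, its keep_attributes parameter, and the sample string are hypothetical and chosen only for this example; the pattern and replacement texts are the ones given above.

```python
import re

# The recipe's regex. re.IGNORECASE corresponds to the "Case insensitive" option.
B_TAG = re.compile(r"""<(/?)b\b((?:[^>"']|"[^"]*"|'[^']*')*)>""", re.IGNORECASE)


def b_to_strong(html: str, keep_attributes: bool = True) -> str:
    """Replace <b> and </b> tags with <strong> and </strong> (hypothetical helper)."""
    # \1 restores the optional slash; \2 restores attributes when requested.
    replacement = r"<\1strong\2>" if keep_attributes else r"<\1strong>"
    return B_TAG.sub(replacement, html)


sample = '<b class="x">bold</b> <B>also</B> <br>'
print(b_to_strong(sample))
# <strong class="x">bold</strong> <strong>also</strong> <br>
print(b_to_strong(sample, keep_attributes=False))
# <strong>bold</strong> <strong>also</strong> <br>
```

Note that the <br> tag in the sample is left alone, and that uppercase <B> tags come out as lowercase <strong> because the tag name in the replacement text is a literal.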
The previous recipe (9.1) included a detailed discussion of many ways to match any XML-style tag. That frees this recipe to focus on a straightforward approach to search for a specific type of tag. <b> and its replacement <strong> are offered as examples, but you can substitute those tag names with any two others.
The regex starts by matching a literal ‹<›, the first character of any tag. It then optionally matches the forward slash found in closing tags using ‹/?›, within capturing parentheses. Capturing the result of this pattern (which will be either an empty string or a forward slash) allows you to easily restore the forward slash in the replacement string, without any conditional logic.
Next, we match the tag name itself, ‹b›. You could use any other tag name instead if you wanted to. Use the case-insensitive option to make sure that you also match an uppercase B.
The word boundary (‹\b›) that follows the tag name is easy to forget, but it’s one of the most important pieces of this regex. The word boundary lets us match only <b> tags, and not <br>, <body>, <blockquote>, or any other tags that merely start with the letter “b.” We could alternatively match a whitespace token (‹\s›) after the name as a safeguard against this same problem, but that wouldn’t work for tags that have no attributes and thus might not have any whitespace following their tag name. The word boundary solves this problem simply and elegantly.
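The effect of the word boundary can be checked with a small Python sketch. The three patterns below are deliberately shortened to the start of the tag (the attribute-matching part is left out), and the helper name matching and the list of sample tags are assumptions made for this illustration.

```python
import re

# Shortened patterns that only cover the start of a tag, for comparison.
with_boundary = re.compile(r"<(/?)b\b", re.IGNORECASE)
without_boundary = re.compile(r"<(/?)b", re.IGNORECASE)
with_whitespace = re.compile(r"<(/?)b\s", re.IGNORECASE)

tags = ["<b>", "<b class='x'>", "</b>", "<br>", "<body>", "<blockquote>"]


def matching(pattern, candidates):
    """Return the candidates that the pattern matches at their start."""
    return [tag for tag in candidates if pattern.match(tag)]


print(matching(with_boundary, tags))     # ['<b>', "<b class='x'>", '</b>']
print(matching(without_boundary, tags))  # all six tags
print(matching(with_whitespace, tags))   # ["<b class='x'>"]
```

Without the boundary, every tag starting with "b" is accepted; with ‹\s›, tags that have no attributes are missed.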
When working with XML and XHTML, be aware that the colon used for namespaces, as well as hyphens and some other characters allowed as part of XML names, create a word boundary. For example, the regex could end up matching something like <b-sharp>. If you’re worried about this, you might want to use the lookahead ‹(?=[\s/>])› instead of a word boundary. It achieves the same result of ensuring that we do not match partial tag names, and does so more reliably.
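The difference between the two approaches shows up on a name such as <b-sharp>. The following Python sketch compares them; the variable names and the sample XML string are hypothetical.

```python
import re

# Shared attribute-matching part of the recipe's regex.
ATTRS = r"""((?:[^>"']|"[^"]*"|'[^']*')*)"""

boundary = re.compile(r"<(/?)b\b" + ATTRS + ">", re.IGNORECASE)
lookahead = re.compile(r"<(/?)b(?=[\s/>])" + ATTRS + ">", re.IGNORECASE)

xml = "<b-sharp>note</b-sharp> <b>bold</b> <b/>"

# The hyphen creates a word boundary, so <b-sharp> is rewritten by mistake.
print(boundary.sub(r"<\1strong\2>", xml))
# <strong-sharp>note</strong-sharp> <strong>bold</strong> <strong/>

# The lookahead requires whitespace, "/", or ">" right after the name.
print(lookahead.sub(r"<\1strong\2>", xml))
# <b-sharp>note</b-sharp> <strong>bold</strong> <strong/>
```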
After the tag name, the pattern ‹((?:[^>"']|"[^"]*"|'[^']*')*)› is used to match anything remaining within the tag up until the closing right angle bracket. Wrapping this pattern in a capturing group as we’ve done here lets us easily bring back any attributes and other characters (such as the trailing slash for singleton tags) in our replacement string. Within the capturing parentheses, the pattern repeats a noncapturing group with three alternatives. The first, ‹[^>"']›, matches any single character except >, ", or '. The remaining two alternatives match an entire double- or single-quoted string, which lets you match attribute values that contain right angle brackets without having the regex think it has found the end of the tag.
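To see why the quoted-string alternatives matter, the Python sketch below compares the recipe's regex with a simplified pattern that uses ‹[^>]*› for the rest of the tag. The simplified pattern and the sample markup are assumptions introduced only for contrast.

```python
import re

full = re.compile(r"""<(/?)b\b((?:[^>"']|"[^"]*"|'[^']*')*)>""", re.IGNORECASE)
# Simplified comparison pattern (not from the recipe): stops at the first ">".
naive = re.compile(r"<(/?)b\b([^>]*)>", re.IGNORECASE)

html = '<b title="a > b" id=\'x>y\'>text</b>'

# Group 2 holds everything between the tag name and the closing ">".
print(repr(full.match(html).group(2)))
# ' title="a > b" id=\'x>y\''
print(repr(naive.match(html).group(2)))
# ' title="a '
```

The simplified pattern ends the tag at the ">" inside the title attribute, whereas the recipe's regex skips over both quoted values.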
If you want to match any tag from a list of tag names, a simple change is needed. Place all of the desired tag names within a group, and alternate between them.
The following regex matches opening and closing <b>, <i>, <em>, and <big> tags. The replacement text shown later replaces all of them with a corresponding <strong> or </strong> tag, while preserving any attributes:
<(/?)([bi]|em|big)\b((?:[^>"']|"[^"]*"|'[^']*')*)>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Here’s the same regex in free-spacing mode:
<
(/?)               # Capture the optional leading slash to backreference 1
([bi]|em|big) \b   # Capture the tag name to backreference 2
(                  # Capture any attributes, etc. to backreference 3
  (?: [^>"']       # Any character except >, ", or '
    | "[^"]*"      # Double-quoted attribute value
    | '[^']*'      # Single-quoted attribute value
  )*
)
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
We’ve used the character class ‹[bi]› to match both <b> and <i> tags, rather than separating them with the alternation metacharacter ‹|› as we’ve done for <em> and <big>. Character classes are faster than alternation because they are implemented using bit vectors (or other fast implementations) rather than backtracking. When the difference between two options is a single character, use a character class.
We’ve also added a capturing group for the tag name, which shifted the group that matches attributes, etc. to store its match as backreference 3. Although there’s no need to refer back to the tag name if you’re just going to replace all matches with <strong> tags, storing the tag name in its own backreference can help you check what type of tag was matched, when needed.
To preserve all attributes while replacing the tag name, use the following replacement text:
<$1strong$3>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
<\1strong\3>
Replacement text flavors: Python, Ruby
Omit backreference 3 in the replacement string if you want to discard attributes for matched tags as part of the same process:
<$1strong>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
<\1strong>
Replacement text flavors: Python, Ruby
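The variation can be tried out with the following Python sketch, which uses the free-spacing form of the regex and a replacement function so that the captured tag name (group 2) can be inspected. The function name to_strong, the returned list of names, and the sample string are hypothetical and exist only for this example.

```python
import re

TAGS = re.compile(
    r"""
    <
    (/?)               # Group 1: optional leading slash
    ([bi]|em|big) \b   # Group 2: tag name, followed by a word boundary
    (                  # Group 3: attributes, etc.
      (?: [^>"']       #   Any character except >, ", or '
        | "[^"]*"      #   Double-quoted attribute value
        | '[^']*'      #   Single-quoted attribute value
      )*
    )
    >
    """,
    re.IGNORECASE | re.VERBOSE,
)


def to_strong(html):
    """Return (new_html, names); names lists each matched tag name in lowercase."""
    names = []

    def replace(match):
        names.append(match.group(2).lower())
        # Same result as the replacement text with backreferences 1 and 3.
        return "<" + match.group(1) + "strong" + match.group(3) + ">"

    return TAGS.sub(replace, html), names


sample = '<i>one</i> <em class="k">two</em> <BIG>three</BIG> <img src="x.png"> <br>'
result, names = to_strong(sample)
print(result)
# <strong>one</strong> <strong class="k">two</strong> <strong>three</strong> <img src="x.png"> <br>
print(names)
# ['i', 'i', 'em', 'em', 'big', 'big']
```

The <img> and <br> tags are untouched because the word boundary rejects names that merely begin with "i" or "b".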
Recipe 9.1 shows how to match all XML-style tags while balancing trade-offs including tolerance for invalid markup.
Recipe 9.3 is the opposite of this recipe, and shows how to match all except a select list of tags.
Techniques used in the regular expressions and replacement text in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround. Recipe 2.21 explains how to insert text matched by capturing groups into the replacement text.