9.2. Replace <b> Tags with <strong>

Problem

You want to replace all opening and closing <b> tags in a string with corresponding <strong> tags, while preserving any existing attributes.

Solution

This regex matches opening and closing <b> tags, with or without attributes:

<(/?)b\b((?:[^>"']|"[^"]*"|'[^']*')*)>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

In free-spacing mode:

<
(/?)             # Capture the optional leading slash to backreference 1
b\b              # Tag name, with word boundary
(                # Capture any attributes, etc. to backreference 2
    (?: [^>"']   # Any character except >, ", or '
      | "[^"]*"  # Double-quoted attribute value
      | '[^']*'  # Single-quoted attribute value
    )*
)
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

To preserve all attributes while changing the tag name, use the following replacement text:

<$1strong$2>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
<\1strong\2>
Replacement text flavors: Python, Ruby

If you want to discard any attributes in the same process, omit backreference 2 in the replacement string:

<$1strong>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
<\1strong>
Replacement text flavors: Python, Ruby

Recipe 3.15 shows the code needed to implement these replacements.
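
As a rough illustration of how the pieces fit together, here is a minimal Python sketch that applies the regex and both replacement strings with re.sub. The names B_TAG and b_to_strong are invented for this example and do not come from Recipe 3.15, which remains the reference for per-language code.

```python
import re

# The recipe's regex, compiled with the case-insensitive option.
B_TAG = re.compile(r"""<(/?)b\b((?:[^>"']|"[^"]*"|'[^']*')*)>""", re.IGNORECASE)

def b_to_strong(html, keep_attributes=True):
    """Swap <b> tags for <strong>; b_to_strong is a hypothetical helper name."""
    # Group 1 is the optional slash, group 2 is everything after the tag name.
    replacement = r"<\1strong\2>" if keep_attributes else r"<\1strong>"
    return B_TAG.sub(replacement, html)

sample = '<b class="x">bold</b> and <br> and <B>shout</B>'
print(b_to_strong(sample))
# <strong class="x">bold</strong> and <br> and <strong>shout</strong>
print(b_to_strong(sample, keep_attributes=False))
# <strong>bold</strong> and <br> and <strong>shout</strong>
```

Note that the <br> tag is left alone in both cases, which is the job of the word boundary discussed below.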

Discussion

The previous recipe (9.1) included a detailed discussion of many ways to match any XML-style tag. That frees this recipe to focus on a straightforward approach to search for a specific type of tag. <b> and its replacement <strong> are offered as examples, but you can substitute those tag names with any two others.

The regex starts by matching a literal <—the first character of any tag. It then optionally matches the forward slash found in closing tags using /?, within capturing parentheses. Capturing the result of this pattern (which will be either an empty string or a forward slash) allows you to easily restore the forward slash in the replacement string, without any conditional logic.

Next, we match the tag name itself, b. You could use any other tag name instead if you wanted to. Use the case-insensitive option to make sure that you also match an uppercase B.

The word boundary (\b) that follows the tag name is easy to forget, but it’s one of the most important pieces of this regex. The word boundary lets us match only <b> tags, and not <br>, <body>, <blockquote>, or any other tags that merely start with the letter “b.” We could alternatively match a whitespace token (\s) after the name as a safeguard against this same problem, but that wouldn’t work for tags that have no attributes and thus might not have any whitespace following their tag name. The word boundary solves this problem simply and elegantly.

Tip

When working with XML and XHTML, be aware that the colon used for namespaces, as well as hyphens and some other characters allowed as part of XML names, create a word boundary. For example, the regex could end up matching something like <b-sharp>. If you’re worried about this, you might want to use the lookahead (?=[\s/>]) instead of a word boundary. It achieves the same result of ensuring that we do not match partial tag names, and does so more reliably.
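
The difference between the two approaches is easiest to see side by side. The following Python sketch is only an illustration: the variable names and the <b-sharp> sample markup are made up for the comparison.

```python
import re

# Shared tail of the pattern: attributes and other characters up to the closing >.
ATTRS = r"""((?:[^>"']|"[^"]*"|'[^']*')*)"""

with_boundary = re.compile(r"<(/?)b\b" + ATTRS + ">", re.IGNORECASE)
with_lookahead = re.compile(r"<(/?)b(?=[\s/>])" + ATTRS + ">", re.IGNORECASE)

sample = "<b-sharp>note</b-sharp> <b>bold</b> <b/>"

# The hyphen creates a word boundary, so <b-sharp> is rewritten too.
print(with_boundary.sub(r"<\1strong\2>", sample))
# <strong-sharp>note</strong-sharp> <strong>bold</strong> <strong/>

# The lookahead requires whitespace, /, or > right after the tag name.
print(with_lookahead.sub(r"<\1strong\2>", sample))
# <b-sharp>note</b-sharp> <strong>bold</strong> <strong/>
```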

After the tag name, the pattern ((?:[^>"']|"[^"]*"|'[^']*')*) is used to match anything remaining within the tag up until the closing right angle bracket. Wrapping this pattern in a capturing group as we’ve done here lets us easily bring back any attributes and other characters (such as the trailing slash for singleton tags) in our replacement string. Within the capturing parentheses, the pattern repeats a noncapturing group with three alternatives. The first, [^>"'], matches any single character except >, ", or '. The remaining two alternatives match an entire double- or single-quoted string, which lets you match attribute values that contain right angle brackets without having the regex think it has found the end of the tag.
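
To see why the quoted-string alternatives matter, the sketch below compares the recipe's pattern with a simpler one that stops at the first right angle bracket. The simpler pattern (called naive here) is not part of this recipe and is shown only for contrast. The attribute-discarding replacement is used because it makes the mismatch visible in the output.

```python
import re

# The recipe's pattern: quoted attribute values are consumed as whole units.
quote_aware = re.compile(r"""<(/?)b\b((?:[^>"']|"[^"]*"|'[^']*')*)>""", re.IGNORECASE)

# A hypothetical simpler pattern that treats the first > as the end of the tag.
naive = re.compile(r"<(/?)b\b([^>]*)>", re.IGNORECASE)

sample = '<b title="a > b">text</b>'

print(quote_aware.sub(r"<\1strong>", sample))
# <strong>text</strong>

# The naive pattern ends the tag inside the attribute value and leaves debris behind.
print(naive.sub(r"<\1strong>", sample))
# <strong> b">text</strong>
```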

Variations

Replace a list of tags

If you want to match any tag from a list of tag names, a simple change is needed. Place all of the desired tag names within a group, and alternate between them.

The following regex matches opening and closing <b>, <i>, <em>, and <big> tags. The replacement text shown later replaces all of them with a corresponding <strong> or </strong> tag, while preserving any attributes:

<(/?)([bi]|em|big)\b((?:[^>"']|"[^"]*"|'[^']*')*)>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Here’s the same regex in free-spacing mode:

<
(/?)              # Capture the optional leading slash to backreference 1
([bi]|em|big)\b   # Capture the tag name to backreference 2
(                 # Capture any attributes, etc. to backreference 3
    (?: [^>"']    # Any character except >, ", or '
      | "[^"]*"   # Double-quoted attribute value
      | '[^']*'   # Single-quoted attribute value
    )*
)
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

We’ve used the character class [bi] to match both <b> and <i> tags, rather than separating them with the alternation metacharacter | as we’ve done for <em> and <big>. Character classes are faster than alternation because they are implemented using bit vectors (or other fast implementations) rather than backtracking. When the difference between two options is a single character, use a character class.

We’ve also added a capturing group for the tag name, which shifted the group that matches attributes, etc. to store its match as backreference 3. Although there’s no need to refer back to the tag name if you’re just going to replace all matches with <strong> tags, storing the tag name in its own backreference can help you check what type of tag was matched, when needed.
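
One possible use of the captured tag name is sketched below in Python: a replacement callback reads group 2 to tally which tags were converted. The function replace_and_count and the tallying itself are assumptions made for this illustration, not something the recipe prescribes.

```python
import re
from collections import Counter

TAGS = re.compile(r"""<(/?)([bi]|em|big)\b((?:[^>"']|"[^"]*"|'[^']*')*)>""", re.IGNORECASE)

def replace_and_count(html):
    """Replace listed tags with <strong> and count opening tags by name (hypothetical helper)."""
    seen = Counter()

    def swap(match):
        slash, name, rest = match.groups()
        if not slash:                      # Count opening tags only
            seen[name.lower()] += 1
        return f"<{slash}strong{rest}>"    # Same effect as <\1strong\3>

    return TAGS.sub(swap, html), seen

result, counts = replace_and_count('<b>a</b> <I class="x">b</I> <big>c</big> <br>')
print(result)
# <strong>a</strong> <strong class="x">b</strong> <strong>c</strong> <br>
print(dict(counts))
# {'b': 1, 'i': 1, 'big': 1}
```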

To preserve all attributes while replacing the tag name, use the following replacement text:

<$1strong$3>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
<\1strong\3>
Replacement text flavors: Python, Ruby

Omit backreference 3 in the replacement string if you want to discard attributes for matched tags as part of the same process:

<$1strong>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
<\1strong>
Replacement text flavors: Python, Ruby
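
For a concrete picture, here is a short Python sketch that compiles the free-spacing version of this regex with re.VERBOSE and applies both replacement strings. The name TAG_LIST and the sample markup are invented for the example.

```python
import re

TAG_LIST = re.compile(r"""
    <
    (/?)              # Capture the optional leading slash to group 1
    ([bi]|em|big)\b   # Capture the tag name to group 2
    (                 # Capture any attributes, etc. to group 3
        (?: [^>"']    # Any character except >, ", or '
          | "[^"]*"   # Double-quoted attribute value
          | '[^']*'   # Single-quoted attribute value
        )*
    )
    >
    """, re.IGNORECASE | re.VERBOSE)

sample = '<em id="x">a</em><i>b</i><big>c</big><img src="i.png">'

# Keep attributes: reinsert group 3 after the new tag name.
print(TAG_LIST.sub(r"<\1strong\3>", sample))
# <strong id="x">a</strong><strong>b</strong><strong>c</strong><img src="i.png">

# Discard attributes: leave group 3 out of the replacement.
print(TAG_LIST.sub(r"<\1strong>", sample))
# <strong>a</strong><strong>b</strong><strong>c</strong><img src="i.png">
```

The <img> tag is untouched even though it starts with a letter from [bi], because the word boundary after the tag name fails between "i" and "m".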

See Also

Recipe 9.1 shows how to match all XML-style tags while balancing trade-offs including tolerance for invalid markup.

Recipe 9.3 is the opposite of this recipe, and shows how to match all except a select list of tags.

Techniques used in the regular expressions and replacement text in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround. Recipe 2.21 explains how to insert text matched by capturing groups into the replacement text.
