You want to search through an (X)HTML file and add cellspacing="0" to all tables that do not already include a cellspacing attribute.
This recipe serves as an example of adding an attribute to XML-style tags that do not already include it. You can modify the regexes and replacement strings in this recipe to use whatever tag and attribute names and values you prefer.
You can use negative lookahead to match <table> tags that do not contain the word cellspacing, as follows:
<table\b(?![^>]*?\scellspacing\b)([^>]*)>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Here’s the regex again in free-spacing mode:
<table \b            # Match "<table", as a complete word
(?!                  # Not followed by:
  [^>]*?             #   Any attributes, etc.
  \s cellspacing \b  #   "cellspacing", as a complete word
)
([^>]*)              # Capture attributes, etc. to backreference 1
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
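As a rough illustration, the free-spacing form can be used directly in Python via the re.VERBOSE flag. This is only a sketch: the variable names and the sample markup are assumptions made for the example, not part of the recipe.

```python
import re

# Solution 1 in free-spacing form. re.VERBOSE ignores the whitespace and
# comments inside the pattern; re.IGNORECASE mirrors the recipe's
# "Case insensitive" option.
TABLE_WITHOUT_CELLSPACING = re.compile(r"""
    <table \b            # Match "<table", as a complete word
    (?!                  # Not followed by:
      [^>]*?             #   Any attributes, etc.
      \s cellspacing \b  #   "cellspacing", as a complete word
    )
    ([^>]*)              # Capture attributes, etc. to backreference 1
    >
""", re.VERBOSE | re.IGNORECASE)

# Hypothetical sample markup, for illustration only.
html = '<table border="1"><TABLE cellspacing="2"><table>'

# Collect the full text of every tag the regex matches.
matches = [m.group(0) for m in TABLE_WITHOUT_CELLSPACING.finditer(html)]
print(matches)
```

Only the first and third tags should be reported, since the second already carries a cellspacing attribute.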
The following regex works exactly the same as Solution 1, except that both instances of the negated character class ‹[^>]› are replaced with ‹(?:[^>"']|"[^"]*"|'[^']*')›. This longer pattern passes over double- and single-quoted attribute values in one step:
<table\b(?!(?:[^>"']|"[^"]*"|'[^']*')*?\scellspacing\b)((?:[^>"']|"[^"]*"|'[^']*')*)>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
And here it is in free-spacing mode:
<table \b            # Match "<table", as a complete word
(?!                  # Not followed by: Any attributes, etc., then "cellspacing"
  (?:[^>"']|"[^"]*"|'[^']*')*?
  \s cellspacing \b
)
(                    # Capture attributes, etc. to backreference 1
  (?:[^>"']|"[^"]*"|'[^']*')*
)
>
Regex options: Case insensitive, free-spacing
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
The regexes shown as Solution 1 and Solution 2 can use the same replacement string, since they both capture attributes (if any) within the matched <table> tags to backreference 1. This lets you bring back those attributes as part of your replacement value, while adding the new cellspacing attribute. Here are the necessary replacement strings:
<table●cellspacing="0"$1>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
<table●cellspacing="0"\1>
Replacement text flavors: Python, Ruby
Recipe 3.15 shows the code for performing substitutions that use a backreference in the replacement string.
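The substitution might look like the following in Python. It is a minimal sketch that uses the Solution 1 regex with the Python-flavor replacement text. The input string is made up for the example, and the ● in the replacement text is written here as an ordinary space.

```python
import re

# Solution 1 from this recipe, with the case-insensitive option.
pattern = re.compile(r'<table\b(?![^>]*?\scellspacing\b)([^>]*)>',
                     re.IGNORECASE)

# Python-flavor replacement text: \1 reinserts the captured attributes.
replacement = r'<table cellspacing="0"\1>'

# Hypothetical input, for illustration only.
html = '<table class="grid"><tr></tr></table><table cellspacing="5">'

result = pattern.sub(replacement, html)
print(result)
# <table cellspacing="0" class="grid"><tr></tr></table><table cellspacing="5">
```

Note that the replacement always writes the tag name as lowercase <table, regardless of how it was cased in the source.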
In order to examine how these regexes work, we’ll first break down the simplistic Solution 1. As you’ll see, it has four logical parts.
The first part, ‹<table\b›, matches the literal characters <table, followed by a word boundary (‹\b›). The word boundary prevents matching tag names that merely start with “table.” Although that might seem unnecessary here when working with (X)HTML (since there are no valid elements named “tablet,” “tableau,” or “tablespoon,” for example), it’s good practice nonetheless, and can help you avoid bugs when adapting this regex to search for other tags.
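The effect of the word boundary can be seen by comparing the first part of the regex with and without ‹\b›. This is a small sketch: the names below are illustrative, and <tablet> is a deliberately invalid tag name.

```python
import re

with_boundary = re.compile(r'<table\b', re.IGNORECASE)
without_boundary = re.compile(r'<table', re.IGNORECASE)

# "<tablet>" is not a real (X)HTML element; it is a hypothetical tag name.
tags = ['<table>', '<table border="1">', '<tablet>']

# Map each tag to (matched with \b, matched without \b).
boundary_results = {
    tag: (bool(with_boundary.match(tag)), bool(without_boundary.match(tag)))
    for tag in tags
}
for tag, outcome in boundary_results.items():
    print(tag, outcome)
```

Only the version without the word boundary accepts <tablet>.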
The second part of the regex, ‹(?![^>]*?\scellspacing\b)›, is a negative lookahead. It doesn’t consume any text as part of the match, but it asserts that the match attempt should fail if the word cellspacing occurs anywhere within the opening tag. Since we’re going to add the cellspacing attribute to all matches, we don’t want to match tags that already contain it.
Because the lookahead peeks forward from the current position in the match attempt, it uses the leading ‹[^>]*?› to let it search as far forward as it needs to, up until what is assumed to be the end of the tag (the first occurrence of >). The remainder of the lookahead subpattern (‹\scellspacing\b›) simply matches the literal characters “cellspacing” as a complete word. We match a leading whitespace character (‹\s›) since whitespace must always separate an attribute name from the tag name or preceding attributes. We match a trailing word boundary instead of another whitespace character since a word boundary fulfills the need to match cellspacing as a complete word, yet works even if the attribute has no value or if the attribute name is immediately followed by an equals sign.
The way this is set up, if the regex finds cellspacing before >, the match fails. If the lookahead does not find cellspacing before it runs into a >, the rest of the match attempt can continue.
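A few sample tags make the lookahead's behavior concrete. The sketch below runs the Solution 1 regex against made-up tags. cellspacings is an invented attribute name, used only to show the effect of the trailing word boundary.

```python
import re

pattern = re.compile(r'<table\b(?![^>]*?\scellspacing\b)([^>]*)>',
                     re.IGNORECASE)

# Hypothetical tags, for illustration only.
samples = [
    '<table width="100%">',        # no cellspacing: should match
    '<table cellspacing="2">',     # name followed by "=": rejected
    '<table border cellspacing>',  # attribute with no value: rejected
    '<table cellspacings="2">',    # a different word: should match
]

# True means the regex matched the whole tag.
lookahead_results = {tag: bool(pattern.fullmatch(tag)) for tag in samples}
for tag, matched in lookahead_results.items():
    print(tag, '->', matched)
```

The second and third tags are rejected because the lookahead finds cellspacing as a complete word before the closing >.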
Moving along, we get to the third piece of the regex: ‹([^>]*)›. This is a negated character class and a following “zero or more” quantifier, wrapped in a capturing group. Capturing this part of the match allows you to easily bring back the attributes that each matched tag contained as part of the replacement string. And unlike the negative lookahead, this part actually adds the attributes within the tag to the string matched by the regex.
Finally, the regex matches the literal character ‹>› to end the tag.
Solution 2, the more reliable version, replaces both instances of the negated character class ‹[^>]› from the simplistic solution with ‹(?:[^>"']|"[^"]*"|'[^']*')›. This improves the regular expression’s reliability in two ways. First, it adds support for quoted attribute values that contain literal > characters. Second, it ensures that we don’t preclude matching tags that merely contain the word “cellspacing” within an attribute’s value.
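Both improvements can be checked side by side. The following sketch builds the two regexes and runs them against two made-up tags, one with a > inside a quoted value and one with the word “cellspacing” inside a quoted value. The helper name ATTR_CHUNK is an assumption introduced for readability.

```python
import re

# The longer subpattern from Solution 2: one non-quote character, or a
# complete double- or single-quoted value.
ATTR_CHUNK = r'''(?:[^>"']|"[^"]*"|'[^']*')'''

solution1 = re.compile(r'<table\b(?![^>]*?\scellspacing\b)([^>]*)>', re.I)
solution2 = re.compile(
    r'<table\b(?!' + ATTR_CHUNK + r'*?\scellspacing\b)(' + ATTR_CHUNK + r'*)>',
    re.I)

# Hypothetical tags, for illustration only.
gt_in_value = '<table title="a > b" border="1">'
word_in_value = '<table summary="No cellspacing here">'

# Solution 1 stops at the first ">", even inside a quoted value.
print(solution1.search(gt_in_value).group(0))
# Solution 2 steps over the quoted value and matches the whole tag.
print(solution2.search(gt_in_value).group(0))

# Solution 1 is put off by the word inside the attribute value.
print(solution1.search(word_in_value))
# Solution 2 still matches, since the tag has no cellspacing attribute.
print(solution2.search(word_in_value).group(0))
```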
As for the replacement strings, they work with both regexes, replacing each matched <table> tag with a new tag that includes cellspacing="0" as the first attribute, followed by whatever attributes occurred within the original tag (backreference 1).
Recipe 9.7 is the conceptual inverse of this recipe, and finds tags that contain a specific attribute.
Recipe 9.1 shows how to match all XML-style tags while balancing trade-offs including tolerance for invalid markup.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.