9.8. Add a cellspacing Attribute to <table> Tags That Do Not Already Include It

Problem

You want to search through an (X)HTML file and add cellspacing="0" to all tables that do not already include a cellspacing attribute.

This recipe serves as an example of adding an attribute to XML-style tags that do not already include it. You can modify the regexes and replacement strings in this recipe to use whatever tag and attribute names and values you prefer.

Solution

Solution 1, simplistic

You can use negative lookahead to match <table> tags that do not contain the word cellspacing, as follows:

<table\b(?![^>]*?\scellspacing\b)([^>]*)>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Here’s the regex again in free-spacing mode:

<table \b            # Match "<table", as a complete word
(?!                  # Not followed by:
  [^>]*?             #   Any attributes, etc.
  \s cellspacing \b  #   "cellspacing", as a complete word
)
([^>]*)              # Capture attributes, etc. to backreference 1
>
Regex options: Case insensitive
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

Solution 2, more reliable

The following regex works exactly the same as Solution 1, except that both instances of the negated character class [^>] are replaced with (?:[^>"']|"[^"]*"|'[^']*'). This longer pattern passes over double- and single-quoted attribute values in one step:

<table\b(?!(?:[^>"']|"[^"]*"|'[^']*')*?\scellspacing\b)↵
((?:[^>"']|"[^"]*"|'[^']*')*)>
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

And here it is in free-spacing mode:

<table \b  # Match "<table", as a complete word
(?!  # Not followed by: Any attributes, etc., then "cellspacing"
  (?:[^>"']|"[^"]*"|'[^']*')*?
  \s cellspacing \b
)
(  # Capture attributes, etc. to backreference 1
  (?:[^>"']|"[^"]*"|'[^']*')*
)
>
Regex options: Case insensitive
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
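
As a rough illustration of how the free-spacing version might be used, here is a minimal Python sketch. It assumes Python's re.VERBOSE flag as the free-spacing mode; the name RELIABLE and the sample markup are made up for this example.

```python
import re

# Solution 2 in free-spacing mode (re.VERBOSE). Whitespace outside
# character classes is ignored, and "#" starts a comment.
RELIABLE = re.compile(r"""
    <table \b                         # "<table" as a complete word
    (?!                               # Not followed by:
      (?:[^>"']|"[^"]*"|'[^']*')*?    #   any attributes, etc., then
      \s cellspacing \b               #   "cellspacing" as a complete word
    )
    (                                 # Capture attributes to group 1
      (?:[^>"']|"[^"]*"|'[^']*')*
    )
    >
""", re.VERBOSE | re.IGNORECASE)

# Hypothetical sample input: only the first tag lacks cellspacing.
sample = '<table class="x"><table CELLSPACING=1>'

# Collect the attributes captured from each matching tag.
captured = [m.group(1) for m in RELIABLE.finditer(sample)]
print(captured)
```

Only the first tag matches, so the list holds that tag's captured attributes. The second tag is skipped because the case-insensitive option lets the lookahead recognize CELLSPACING.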

Insert the new attribute

The regexes shown as Solution 1 and Solution 2 can use the same replacement string, since they both capture attributes (if any) within the matched <table> tags to backreference 1. This lets you bring back those attributes as part of your replacement value, while adding the new cellspacing attribute. Here are the necessary replacement strings:

<table cellspacing="0"$1>
Replacement text flavors: .NET, Java, JavaScript, Perl, PHP
<table cellspacing="0"\1>
Replacement text flavors: Python, Ruby

Recipe 3.15 shows the code for performing substitutions that use a backreference in the replacement string.
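
For a concrete picture of the substitution, here is a minimal Python sketch that pairs the Solution 1 regex with the Python-flavor replacement text. The function name add_cellspacing and the sample markup are assumptions made for this example, not part of the recipe.

```python
import re

# Solution 1 from this recipe, compiled with the case-insensitive option.
TABLE_WITHOUT_CELLSPACING = re.compile(
    r'<table\b(?![^>]*?\scellspacing\b)([^>]*)>', re.IGNORECASE)

def add_cellspacing(html):
    # \1 restores whatever attributes the original tag carried.
    return TABLE_WITHOUT_CELLSPACING.sub(r'<table cellspacing="0"\1>', html)

# Hypothetical sample: the middle tag already has the attribute.
sample = '<table border="1"><TABLE cellspacing="2"><table>'
result = add_cellspacing(sample)
print(result)
```

The first and last tags gain the new attribute, while the middle tag is left untouched. Note that the replacement text writes the tag name in lowercase, so a matched <TABLE> tag would come back as <table>.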

Discussion

In order to examine how these regexes work, we’ll first break down the simplistic Solution 1. As you’ll see, it has four logical parts.

The first part, <table\b, matches the literal characters <table, followed by a word boundary (\b). The word boundary prevents matching tag names that merely start with “table.” Although that might seem unnecessary here when working with (X)HTML (since there are no valid elements named “tablet,” “tableau,” or “tablespoon,” for example), it’s good practice nonetheless, and can help you avoid bugs when adapting this regex to search for other tags.
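
The effect of the word boundary can be checked with a short Python sketch. The variable names and the made-up <tablet> tag are purely illustrative.

```python
import re

# The same tag-name test, with and without the trailing word boundary.
with_boundary = re.compile(r'<table\b', re.IGNORECASE)
without_boundary = re.compile(r'<table', re.IGNORECASE)

# "<tablet>" is a made-up tag that merely starts with "table".
for tag in ['<table>', '<table border="1">', '<tablet>']:
    print(tag,
          bool(with_boundary.match(tag)),
          bool(without_boundary.match(tag)))
```

Both patterns accept the two genuine table tags, but only the version without \b also accepts <tablet>.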

The second part of the regex, (?![^>]*?\scellspacing\b), is a negative lookahead. It doesn’t consume any text as part of the match, but it asserts that the match attempt should fail if the word cellspacing occurs anywhere within the opening tag. Since we’re going to add the cellspacing attribute to all matches, we don’t want to match tags that already contain it.

Because the lookahead peeks forward from the current position in the match attempt, it uses the leading [^>]*? to let it search as far forward as it needs to, up until what is assumed to be the end of the tag (the first occurrence of >). The remainder of the lookahead subpattern (\scellspacing\b) simply matches the literal characters “cellspacing” as a complete word. We match a leading whitespace character (\s) since whitespace must always separate an attribute name from the tag name or preceding attributes. We match a trailing word boundary instead of another whitespace character since a word boundary fulfills the need to match cellspacing as a complete word, yet works even if the attribute has no value or if the attribute name is immediately followed by an equals sign.

The way this is set up, if the regex finds cellspacing before >, the match fails. If the lookahead does not find cellspacing before it runs into a >, the rest of the match attempt can continue.
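
The following Python sketch exercises the lookahead against a handful of tags. The sample tags, including the invented cellspacingx and data-cellspacing attributes, are assumptions chosen to probe the whitespace and word-boundary requirements.

```python
import re

# Solution 1 from this recipe.
pattern = re.compile(
    r'<table\b(?![^>]*?\scellspacing\b)([^>]*)>', re.IGNORECASE)

tags = [
    '<table>',                         # no attributes: matched
    '<table border="1">',              # no cellspacing: matched
    '<table cellspacing="0">',         # name followed by "=": skipped
    '<table border="1" cellspacing>',  # attribute without a value: skipped
    '<table cellspacingx="1">',        # longer name, \b fails: matched
    '<table data-cellspacing="1">',    # no whitespace before name: matched
]

# True means the tag would receive a new cellspacing attribute.
matches = [bool(pattern.fullmatch(tag)) for tag in tags]
for tag, matched in zip(tags, matches):
    print(matched, tag)
```

The last two cases show why the lookahead asks for both a leading whitespace character and a trailing word boundary: attribute names that merely contain “cellspacing” do not block the match.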

Moving along, we get to the third piece of the regex: ([^>]*). This is a negated character class and a following “zero or more” quantifier, wrapped in a capturing group. Capturing this part of the match allows you to easily bring back the attributes that each matched tag contained as part of the replacement string. And unlike the negative lookahead, this part actually adds the attributes within the tag to the string matched by the regex.

Finally, the regex matches the literal character > to end the tag.

Solution 2, the more reliable version, replaces both instances of the negated character class [^>] from the simplistic solution with (?:[^>"']|"[^"]*"|'[^']*'). This improves the regular expression’s reliability in two ways. First, it adds support for quoted attribute values that contain literal > characters. Second, it ensures that we don’t preclude matching tags that merely contain the word “cellspacing” within an attribute’s value.
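
To see both improvements side by side, here is a minimal Python sketch that runs the two solutions over a few awkward tags. The names ATTR, simple, reliable, and REPL, along with the sample tags, are assumptions made for this example.

```python
import re

# One unquoted character, or one complete quoted attribute value.
ATTR = r"""(?:[^>"']|"[^"]*"|'[^']*')"""

simple = re.compile(
    r'<table\b(?![^>]*?\scellspacing\b)([^>]*)>', re.IGNORECASE)
reliable = re.compile(
    r'<table\b(?!' + ATTR + r'*?\scellspacing\b)(' + ATTR + r'*)>',
    re.IGNORECASE)
REPL = r'<table cellspacing="0"\1>'

# A quoted value containing ">": Solution 1 stops the match too early.
tag_gt = '<table title="a > b">'
print(simple.search(tag_gt).group(0))
print(reliable.search(tag_gt).group(0))

# cellspacing hidden behind a quoted ">": Solution 1 adds a duplicate.
tag_dup = '<table title="a > b" cellspacing="5">'
print(simple.sub(REPL, tag_dup))
print(reliable.sub(REPL, tag_dup))

# The word "cellspacing" inside a value: Solution 1 skips the tag.
tag_word = '<table summary="no cellspacing here">'
print(simple.sub(REPL, tag_word))
print(reliable.sub(REPL, tag_word))
```

In the second case, Solution 1 never sees the existing attribute because its lookahead stops at the first >, so it inserts a second cellspacing. In the third case it leaves the tag alone even though no cellspacing attribute is present. Solution 2 handles both as intended.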

As for the replacement strings, they work with both regexes, replacing each matched <table> tag with a new tag that includes cellspacing="0" as the first attribute, followed by whatever attributes occurred within the original tag (backreference 1).

See Also

Recipe 9.7 is the conceptual inverse of this recipe, and finds tags that contain a specific attribute.

Recipe 9.1 shows how to match all XML-style tags while balancing trade-offs including tolerance for invalid markup.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.
