5.6. Find Any Word Not Preceded by a Specific Word

Problem

You want to match any word that is not immediately preceded by the word cat, ignoring any whitespace, punctuation, or other nonword characters that come between.

Solution

Lookbehind you

Lookbehind lets you check if text appears before a given position. It works by instructing the regex engine to temporarily step backward in the string, checking whether something can be found ending at the position where you placed the lookbehind. See Recipe 2.16 if you need to brush up on the details of lookbehind.

The following regexes use negative lookbehind, (?<!...). Unfortunately, the regex flavors covered by this book differ in what kinds of patterns they allow you to place within lookbehind. The solutions therefore end up working a bit differently in each case. Read on to the Discussion section of this recipe for further details.

Words not preceded by “cat”

Any number of separating nonword characters:

(?<!\bcat\W+)\b\w+
Regex options: Case insensitive
Regex flavor: .NET

Limited number of separating nonword characters:

(?<!\bcat\W{1,9})\b\w+
Regex options: Case insensitive
Regex flavors: .NET, Java

Single separating nonword character:

(?<!\bcat\W)\b\w+
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python

(?<!\Wcat\W)(?<!^cat\W)\b\w+
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

Simulate lookbehind

JavaScript engines that predate ECMAScript 2018, along with Ruby 1.8, do not support lookbehind at all, even though they do support lookahead. However, because the lookbehind for this problem appears at the very beginning of the regex, it’s possible to simulate the lookbehind by splitting the regex into two parts, as demonstrated in the following JavaScript example:

var subject = "My cat is fluffy.",
    mainRegex = /\b\w+/g,
    lookbehind = /\bcat\W+$/i,
    lookbehindType = false, // false for negative, true for positive
    matches = [],
    match,
    leftContext;

while (match = mainRegex.exec(subject)) {
    leftContext = subject.substring(0, match.index);

    if (lookbehindType == lookbehind.test(leftContext)) {
        matches.push(match[0]);
    } else {
        mainRegex.lastIndex = match.index + 1;
    }
}

// matches:  ["My", "cat", "fluffy"]

Discussion

Fixed, finite, and infinite length lookbehind

The first regular expression uses the negative lookbehind (?<!\bcat\W+). Because the + quantifier used inside the lookbehind has no upper limit on how many characters it can match, this version works with the .NET regular expression flavor only. All of the other regular expression flavors covered by this book require a fixed or maximum (finite) length for lookbehind patterns.

The second regular expression replaces the + within the lookbehind with {1,9}. As a result, it can be used with .NET and Java, both of which support variable-length lookbehind when there is a known upper limit to how many characters can be matched within them. I’ve arbitrarily chosen a maximum length of nine nonword characters to separate the words. That allows a bit of punctuation and a few blank lines to separate the words. Unless you’re working with unusual subject text, this will probably end up working exactly like the previous .NET-only solution. Even in .NET, however, providing a reasonable repetition limit for any quantifiers inside lookbehind is a good safety practice since it reduces the amount of unanticipated backtracking that can potentially occur within the lookbehind.

The third regular expression drops the quantifier after the \W nonword character inside the lookbehind entirely. Doing so lets the lookbehind test a fixed-length string, thereby adding support for PCRE, Perl, and Python. But it’s a steep price to pay: the regular expression now only avoids matching words that are preceded by “cat” and exactly one separating character. The regex correctly matches only cat in the string cat fluff, but it matches both cat and fluff in the string cat, fluff.
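The trade-off between the bounded and the single-character lookbehind can be checked directly. The following is a minimal sketch rather than part of the original recipe. It assumes a JavaScript engine with ECMAScript 2018 lookbehind support, which is newer than the engines this recipe covers, and the helper name findWords is made up for illustration.

```javascript
// Hypothetical helper: returns every match of regex in subject as an array.
// Assumes an ES2018+ engine, since the patterns below use lookbehind.
function findWords(regex, subject) {
    return subject.match(regex) || [];
}

var boundedSeparator = /(?<!\bcat\W{1,9})\b\w+/gi; // one to nine separators
var singleSeparator = /(?<!\bcat\W)\b\w+/gi;       // exactly one separator

// One separating character: both regexes skip "fluff".
var oneBounded = findWords(boundedSeparator, "cat fluff"); // ["cat"]
var oneSingle = findWords(singleSeparator, "cat fluff");   // ["cat"]

// Two separating characters (comma and space):
// only the bounded version still skips "fluff".
var twoBounded = findWords(boundedSeparator, "cat, fluff"); // ["cat"]
var twoSingle = findWords(singleSeparator, "cat, fluff");   // ["cat", "fluff"]

console.log(oneBounded, oneSingle, twoBounded, twoSingle);
```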

Since Ruby 1.9 doesn’t allow \b word boundaries in lookbehind, the fourth regular expression uses two separate lookbehinds. The first lookbehind prevents cat as the preceding word when it is itself preceded by a nonword character such as whitespace or punctuation. The second uses the ^ anchor to prevent cat as the preceding word when it appears at the start of the string.
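As a quick sanity check, the word-boundary version and the two-lookbehind version should agree on a subject where cat appears both at the start of the string and after whitespace. This sketch assumes an ES2018+ JavaScript engine (not one of the flavors the recipe targets), and the sample string and variable names are invented for illustration.

```javascript
// Assumes an ES2018+ JavaScript engine with lookbehind support.
var sampleText = "cat nap; my cat sleeps";

// Lookbehind containing a \b word boundary.
var withWordBoundary = /(?<!\bcat\W)\b\w+/gi;

// Two lookbehinds that avoid \b: one for "cat" after a nonword
// character, one for "cat" at the very start of the string.
var withoutWordBoundary = /(?<!\Wcat\W)(?<!^cat\W)\b\w+/gi;

var resultWithBoundary = sampleText.match(withWordBoundary) || [];
var resultWithoutBoundary = sampleText.match(withoutWordBoundary) || [];

// Both skip "nap" (after the leading "cat") and "sleeps" (after " cat").
console.log(resultWithBoundary);    // ["cat", "my", "cat"]
console.log(resultWithoutBoundary); // ["cat", "my", "cat"]
```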

Simulate lookbehind

JavaScript engines that predate ECMAScript 2018 do not support lookbehind, but the JavaScript example code shows how you can simulate lookbehind that appears at the beginning of a regex. It doesn’t impose any restrictions on the length of the text matched by the (simulated) lookbehind.

We start by splitting the (?<!\bcat\W+)\b\w+ regular expression from the first solution into two pieces: the pattern inside the lookbehind (\bcat\W+) and the pattern that comes after it (\b\w+). Append a $ to the end of the lookbehind pattern. If you need to use the “^ and $ match at line breaks” option (/m) with the lookbehind regex, use $(?!\s) instead of $ at the end of the lookbehind pattern to ensure that it can match only at the very end of its subject text. The lookbehindType variable controls whether we’re emulating positive or negative lookbehind. Use true for positive and false for negative.

After the variables are set up, we use mainRegex and the exec() method to iterate over the subject string (see Recipe 3.11 for a description of this process). When a match is found, the part of the subject text before the match is copied into a new string variable (leftContext), and we test whether the lookbehind regex matches that string. Because of the anchor we appended to the end of lookbehind, this can only match immediately to the left of the match found by mainRegex, or in other words, at the end of leftContext. By comparing the result of the lookbehind test to lookbehindType, we can determine whether the match meets the complete criteria for a successful match.

Finally, we take one of two steps. If we have a successful match, append the text matched by mainRegex to the matches array. If not, change the position at which to continue searching for a match (using mainRegex.lastIndex) to the position one character after the starting position of mainRegex’s last match, rather than letting the next iteration of the exec() method start at the end of the current match.

Whew! We’re done.

This is an advanced trick that takes advantage of the lastIndex property that is dynamically updated with JavaScript regular expressions that use the /g (global) flag. Usually, updating and resetting lastIndex is something that happens automagically. Here, we use it to take control of the regex’s path through the subject string, moving forward and backward as necessary. This trick only lets you emulate lookbehind that appears at the beginning of a regex. With a few changes, the code could also be used to emulate lookbehind at the very end of a regex. However, it does not serve as a full substitute for lookbehind support. Due to the interplay of lookbehind and backtracking, this approach cannot help you accurately emulate the behavior of a lookbehind that appears in the middle of a regex.
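The steps described above can be wrapped in a reusable function. This is one possible sketch, not the recipe's own code: the function name matchWithLeadingLookbehind, its parameter names, and the guard clauses are assumptions added for illustration. It uses no lookbehind syntax, so it should also run on engines that lack it.

```javascript
// Hypothetical helper that emulates a lookbehind at the start of a regex.
//   mainRegex:  pattern that follows the lookbehind; must use the /g flag
//   lookbehind: lookbehind pattern ending in $; must not use the /g flag
//   isPositive: true for positive lookbehind, false for negative
function matchWithLeadingLookbehind(subject, mainRegex, lookbehind, isPositive) {
    if (!mainRegex.global || lookbehind.global) {
        throw new Error("mainRegex needs /g; lookbehind must not use /g");
    }
    var matches = [],
        match,
        leftContext;

    mainRegex.lastIndex = 0; // start from the beginning of the subject
    while ((match = mainRegex.exec(subject)) !== null) {
        leftContext = subject.substring(0, match.index);
        if (isPositive === lookbehind.test(leftContext)) {
            matches.push(match[0]);
            // Avoid looping forever on an accepted zero-length match.
            if (match[0].length === 0) {
                mainRegex.lastIndex = match.index + 1;
            }
        } else {
            // Rejected: retry one character after the start of this match.
            mainRegex.lastIndex = match.index + 1;
        }
    }
    return matches;
}

var negativeResult = matchWithLeadingLookbehind(
    "My cat is fluffy.", /\b\w+/g, /\bcat\W+$/i, false); // ["My", "cat", "fluffy"]
var positiveResult = matchWithLeadingLookbehind(
    "My cat is fluffy.", /\b\w+/g, /\bcat\W+$/i, true);  // ["is"]
console.log(negativeResult, positiveResult);
```

Passing true as the last argument gives the positive-lookbehind behavior, so the same helper covers both cases controlled by lookbehindType in the original example.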

Variations

If you want to match words that are preceded by cat (without including the word cat and its following nonword characters as part of the matched text), change the negative lookbehind to positive lookbehind, as shown next.

Any number of separating nonword characters:

(?<=\bcat\W+)\w+
Regex options: Case insensitive
Regex flavor: .NET

Limited number of separating nonword characters:

(?<=\bcat\W{1,9})\w+
Regex options: Case insensitive
Regex flavors: .NET, Java

Single separating nonword character:

(?<=\bcat\W)\w+
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python

(?:(?<=\Wcat\W)|(?<=^cat\W))\w+
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

These adapted versions of the regexes no longer include a \b word boundary before the \w+ at the end because the positive lookbehinds already ensure that any match is preceded by a nonword character. The last regex (which adds support for Ruby 1.9) wraps its two positive lookbehinds in (?:...|...), since only one of the lookbehinds can match at a given position.
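The positive variants can be tried the same way. The sketch below assumes an ES2018+ JavaScript engine with lookbehind support, which the recipe's flavor list does not include, and uses an invented sample string. It only illustrates that the single-lookbehind and alternation forms pick out the same words.

```javascript
// Assumes an ES2018+ JavaScript engine with lookbehind support.
var positiveSample = "cat nap; my cat sleeps";

// Single positive lookbehind with a \b word boundary.
var singleLookbehind = /(?<=\bcat\W)\w+/gi;

// Alternation of two lookbehinds that avoid \b.
var alternatedLookbehinds = /(?:(?<=\Wcat\W)|(?<=^cat\W))\w+/gi;

var followersSingle = positiveSample.match(singleLookbehind) || [];
var followersAlternated = positiveSample.match(alternatedLookbehinds) || [];

// Only the words that directly follow "cat" plus one separator are returned.
console.log(followersSingle);     // ["nap", "sleeps"]
console.log(followersAlternated); // ["nap", "sleeps"]
```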

PCRE 7.2 and Perl 5.10 support the fancy \K or keep operator that resets the starting position for the part of a match that is returned in the match result (see Alternative to Lookbehind for more details). We can use this to come close to emulating leading infinite-length positive lookbehind, as shown in the next regex:

\bcat\W+\K\w+
Regex options: Case insensitive
Regex flavors: PCRE 7.2, Perl 5.10

There is a subtle but important difference between this and the .NET-only regex that allowed any number of separating nonword characters. Unlike with lookbehind, the text matched to the left of the \K is consumed by the match even though it is not included in the match result. You can see this difference by comparing the results of the regexes with \K and positive lookbehind when they’re applied to the subject string cat cat cat cat. In Perl and PHP, if you replace all matches of (?<=\bcat\W)\w+ with «dog», you’ll get the result cat dog dog dog, since only the first word is not itself preceded by cat. If you use the regex \bcat\W+\K\w+ to perform the same replacement, the result will be cat dog cat dog. After matching the leading cat cat (and replacing it with cat dog), the next match attempt can’t peek to the left of its starting position like lookbehind does. The regex matches the second cat cat, which is again replaced with cat dog.
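JavaScript has no \K operator, but the consuming behavior can be approximated with a capturing group that is reinserted in the replacement text. The following sketch uses that substitute technique, not \K itself, to reproduce the two results described above. It assumes an ES2018+ engine for the lookbehind half, and the variable names are made up for illustration.

```javascript
// Assumes an ES2018+ JavaScript engine for the lookbehind version.
var catSubject = "cat cat cat cat";

// Lookbehind does not consume the preceding "cat", so each later
// "cat" can still see the one to its left.
var withLookbehind = catSubject.replace(/(?<=\bcat\W)\w+/gi, "dog");

// Stand-in for \K: the prefix is captured, consumed, and put back via $1.
// Because it is consumed, the next match attempt starts after it.
var withConsumedPrefix = catSubject.replace(/(\bcat\W+)\w+/gi, "$1dog");

console.log(withLookbehind);     // "cat dog dog dog"
console.log(withConsumedPrefix); // "cat dog cat dog"
```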

See Also

Recipe 5.4 explains how to find all except a specific word. Recipe 5.5 explains how to find any word not followed by a specific word.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.6 explains word boundaries. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround. It also explains \K, in the section Alternative to Lookbehind.
