5.5. Find Any Word Not Followed by a Specific Word

Problem

You want to match any word that is not immediately followed by the word cat, ignoring any whitespace, punctuation, or other nonword characters that appear in between.

Solution

Negative lookahead is the secret ingredient for this recipe:

\b\w+\b(?!\W+cat\b)
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
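As a minimal sketch of how this solution might be applied, the following Python snippet uses the regex written out with its escapes as \b\w+\b(?!\W+cat\b). The helper name words_not_followed_by_cat and the sample sentence are made up for illustration and are not part of the recipe.

```python
import re

# The recipe's regex with escapes spelled out, in a raw string so that
# Python does not interpret the backslashes itself.
NOT_BEFORE_CAT = re.compile(r"\b\w+\b(?!\W+cat\b)", re.IGNORECASE)


def words_not_followed_by_cat(text):
    """Return every word in text that is not immediately followed by 'cat'."""
    return NOT_BEFORE_CAT.findall(text)


# Hypothetical sample sentence.
sample = "The cat sat, one cat; two catalogs."
print(words_not_followed_by_cat(sample))
# ['cat', 'sat', 'cat', 'two', 'catalogs']
```

"The" and "one" are left out because each is followed by cat, regardless of the space or punctuation in between. "two" is kept because the word that follows it is catalogs, not cat.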

Discussion

As with many other recipes in this chapter, word boundaries (\b) and the word character token (\w) work together to match a complete word. You can find in-depth descriptions of these features in Recipe 2.6.

The (?!…) surrounding the second part of this regex is a negative lookahead. Lookahead tells the regex engine to temporarily step forward in the string, to check whether the pattern inside the lookahead can be matched just ahead of the current position. It does not consume any of the characters matched inside the lookahead. Instead, it merely asserts whether a match is possible. Since we’re using a negative lookahead, the result of the assertion is inverted. In other words, if the pattern inside the lookahead can be matched just ahead, the match attempt fails, and the regex engine moves forward to try all over again starting from the next character in the subject string. You can find much more detail about lookahead (and its counterpart, lookbehind) in Recipe 2.16.

As for the pattern inside the lookahead, the \W+ matches one or more nonword characters, such as whitespace and punctuation, that appear before cat. The word boundary (\b) at the end of the lookahead ensures that we skip only words that are followed by cat as a complete word, rather than by any word that merely starts with cat.
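The effect of the word boundary at the end of the lookahead can be seen by comparing the regex with and without it. This is only an illustrative sketch in Python. The variable names and the test string are assumptions, and the version without the trailing \b is shown purely for contrast.

```python
import re

# Recipe regex: the lookahead ends with \b, so only the whole word "cat" counts.
WITH_BOUNDARY = re.compile(r"\b\w+\b(?!\W+cat\b)", re.IGNORECASE)

# Contrast case (not the recipe's regex): no \b at the end of the lookahead,
# so any following word that merely starts with "cat" also blocks the match.
WITHOUT_BOUNDARY = re.compile(r"\b\w+\b(?!\W+cat)", re.IGNORECASE)

text = "two catalogs"
with_boundary = WITH_BOUNDARY.findall(text)
without_boundary = WITHOUT_BOUNDARY.findall(text)

print(with_boundary)     # ['two', 'catalogs']
print(without_boundary)  # ['catalogs']
```

Without the boundary, "two" is wrongly skipped because catalogs begins with the letters cat.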

Note that this regular expression even matches the word cat, as long as the subsequent word is not also cat. If you also want to avoid matching cat, you could combine this regex with the one in Recipe 5.4 to end up with \b(?!cat\b)\w+\b(?!\W+cat\b).
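A short Python sketch of the combined regex, written here as \b(?!cat\b)\w+\b(?!\W+cat\b), may help show the difference. The name NOT_CAT_AND_NOT_BEFORE_CAT and the sample sentence are hypothetical.

```python
import re

# The lookahead at the start rejects the word "cat" itself; the lookahead
# at the end rejects any word that is immediately followed by "cat".
NOT_CAT_AND_NOT_BEFORE_CAT = re.compile(
    r"\b(?!cat\b)\w+\b(?!\W+cat\b)", re.IGNORECASE
)

sample = "The cat sat, one cat; two catalogs."
combined_matches = NOT_CAT_AND_NOT_BEFORE_CAT.findall(sample)
print(combined_matches)
# ['sat', 'two', 'catalogs']
```

Both occurrences of cat drop out of the results, while catalogs is still matched because the first lookahead only rejects cat as a complete word.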

Variations

If you want to only match words that are followed by cat (without including cat and its preceding nonword characters as part of the matched text), change the lookahead from negative to positive, then turn your frown upside-down:

\b\w+\b(?=\W+cat\b)
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
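The positive lookahead variation, written with its escapes as \b\w+\b(?=\W+cat\b), could be tried out in Python along these lines. This is a sketch with made-up names and sample text. It also prints the match positions to show that cat and the characters before it are not part of the matched text.

```python
import re

# Positive lookahead: match a word only if "cat" follows as a complete word.
BEFORE_CAT = re.compile(r"\b\w+\b(?=\W+cat\b)", re.IGNORECASE)

sample = "The cat sat, one cat; two catalogs."
before_cat_matches = [(m.group(), m.span()) for m in BEFORE_CAT.finditer(sample)]

for word, span in before_cat_matches:
    print(word, span)
# The (0, 3)
# one (13, 16)
```

Each span covers only the word itself, since the lookahead checks the following text without consuming it.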

See Also

Recipe 5.4 explains how to find all except a specific word. Recipe 5.6 explains how to find any word not preceded by a specific word.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.6 explains word boundaries. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.
