5.4. Find All Except a Specific Word

Problem

You want to use a regular expression to match any complete word except cat. Catwoman, vindicate, and other words that merely contain the letters “cat” should be matched—just not cat.

Solution

A negative lookahead can help you rule out specific words, and is key to this next regex:

\b(?!cat\b)\w+
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

Although a negated character class (written as [^…]) makes it easy to match anything except a specific character, you can’t just write [^cat] to match anything except the word cat. [^cat] is a valid regex, but it matches any character except c, a, or t. Hence, although \b[^cat]+\b would avoid matching the word cat, it wouldn’t match the word time either, because it contains the forbidden letter t. The regular expression \b[^c][^a][^t]\w* is no good either, because it would reject any word with c as its first letter, a as its second letter, or t as its third. Furthermore, that doesn’t restrict the first three letters to word characters, and it only matches words with at least three characters since none of the negated character classes are optional.
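The shortcomings of negated character classes are easy to confirm by experiment. The following is a minimal Python sketch, not part of the original recipe; the variable names and test words are chosen purely for illustration, and the three-class pattern is checked with fullmatch (so no word boundary is needed) to keep each result tied to a single candidate word.

```python
import re

# [^cat] excludes the characters c, a, and t; it knows nothing about the word "cat".
char_class_hits = re.findall(r"[^cat]+", "time")
print(char_class_hits)  # ['ime'] -- the leading t is skipped

# Three negated classes in a row, tested against whole candidate strings.
flawed = re.compile(r"[^c][^a][^t]\w*")
candidates = ["dog", "cow", "bat", "bit", "it", "a-b"]
results = {word: bool(flawed.fullmatch(word)) for word in candidates}

for word, accepted in results.items():
    print(f"{word!r}: {accepted}")
```

Only dog and a-b are accepted. cow, bat, and bit are rejected because of a single letter in the wrong position, it is rejected for being shorter than three characters, and a-b slips through because the negated classes are happy to match a hyphen.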

With all that in mind, let’s take another look at how the regular expression shown at the beginning of this recipe solved the problem:

\b     # Assert position at a word boundary.
(?!    # Not followed by:
  cat  #   Match "cat".
  \b   #   Assert position at a word boundary.
)      # End the negative lookahead.
\w+    # Match one or more word characters.
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby

The key to this pattern is its negative lookahead, (?!…). The negative lookahead disallows the sequence cat followed by a word boundary, without preventing the use of those letters when they do not appear in that exact sequence, or when they appear as part of a longer or shorter word. There’s no word boundary at the very end of the regular expression, because it wouldn’t change what the regex matches. The + quantifier in \w+ repeats the word character token as many times as possible, which means that it will always match until the next word boundary.

When applied to the subject string categorically match any word except cat, the regex will find five matches: categorically, match, any, word, and except.
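This result can be reproduced with a short Python sketch. The pattern is the one described in this recipe (a word boundary, a negative lookahead for cat followed by a word boundary, then one or more word characters); the names NOT_CAT, matches, and mixed are illustrative only.

```python
import re

# re.IGNORECASE mirrors the "case insensitive" regex option.
NOT_CAT = re.compile(r"\b(?!cat\b)\w+", re.IGNORECASE)

subject = "categorically match any word except cat"
matches = NOT_CAT.findall(subject)
print(matches)

# Words that merely contain "cat" still match; the bare word does not,
# regardless of case.
mixed = NOT_CAT.findall("Cat Catwoman vindicate")
print(mixed)
```

The first call returns the five words listed above. The second returns Catwoman and vindicate but skips Cat, since case-insensitive matching lets the lookahead reject it.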

Variations

Find words that don’t contain another word

If, instead of trying to match any word that is not cat, you are trying to match any word that does not contain cat, a slightly different approach is needed:

\b(?:(?!cat)\w)+\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

In the earlier section of this recipe, the word boundary at the beginning of the regular expression provided a convenient anchor that allowed us to simply place the negative lookahead at the beginning of the word. The solution used here is not as efficient, but it’s nevertheless a commonly used construct that allows you to match something other than a particular word or pattern. It does this by repeating a group containing a negative lookahead and a single word character. Before matching each character, the regex engine makes sure that the word cat cannot be matched starting at the current position.

Unlike the previous regular expression, this one requires a terminating word boundary. Otherwise, it could match just the first part of a word, up to where cat appears within it.

When applied to the subject string categorically match any word except cat, the regex will find four matches: match, any, word, and except.
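A small Python sketch, again with illustrative names only, shows both the four matches and why the terminating word boundary matters. The second pattern is the same construct with the trailing word boundary deliberately removed, used here only for comparison.

```python
import re

# Each word character is matched only if "cat" does not start at that position.
NO_CAT_INSIDE = re.compile(r"\b(?:(?!cat)\w)+\b", re.IGNORECASE)

subject = "categorically match any word except cat"
matches = NO_CAT_INSIDE.findall(subject)
print(matches)

# Same construct without the terminating word boundary (for comparison only).
UNTERMINATED = re.compile(r"\b(?:(?!cat)\w)+", re.IGNORECASE)

with_boundary = NO_CAT_INSIDE.findall("vindicate")
without_boundary = UNTERMINATED.findall("vindicate")
print(with_boundary)     # no match: the word contains "cat"
print(without_boundary)  # partial match that stops just before "cat"
```

With the trailing boundary, vindicate is rejected outright. Without it, the regex returns the fragment vindi, which is the first part of the word up to where cat appears.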

See Also

Recipe 5.1 explains how to find a specific word. Recipe 5.5 explains how to find any word not followed by a specific word. Recipe 5.6 explains how to find any word not preceded by a specific word. Recipe 5.11 explains how to match complete lines that do not contain a word.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.6 explains word boundaries. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.
