You want to use a regular expression to match any complete word except cat. Catwoman, vindicate, and other words that merely contain the letters “cat” should be matched, just not cat.
A negative lookahead can help you rule out specific words, and is key to this next regex:
\b(?!cat\b)\w+
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Although a negated character class (written as ‹[^⋯]›) makes it easy to match anything except a specific character, you can’t just write ‹[^cat]› to match anything except the word cat. ‹[^cat]› is a valid regex, but it matches any character except c, a, or t. Hence, although ‹\b[^cat]+\b› would avoid matching the word cat, it wouldn’t match the word time either, because it contains the forbidden letter t. The regular expression ‹\b[^c][^a][^t]\w*› is no good either, because it would reject any word with c as its first letter, a as its second letter, or t as its third. Furthermore, that doesn’t restrict the first three letters to word characters, and it only matches words with at least three characters since none of the negated character classes are optional.
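The shortcomings of the negated character classes can be checked with a short Python sketch. The word list and the `whole_word_hits` helper are made up for illustration; only the two patterns come from the discussion above.

```python
import re

# Negated character classes exclude individual characters, not a whole word.
any_but_letters = re.compile(r"\b[^cat]+\b", re.IGNORECASE)
positional = re.compile(r"\b[^c][^a][^t]\w*", re.IGNORECASE)

def whole_word_hits(pattern, words):
    """Return the test strings that the pattern matches in their entirety."""
    return [w for w in words if pattern.fullmatch(w)]

# Hypothetical test strings chosen to trigger each failure mode.
words = ["cat", "time", "dog", "cow", "dad", "bit", "at", "a--b"]

class_hits = whole_word_hits(any_but_letters, words)
positional_hits = whole_word_hits(positional, words)

print(class_hits)       # only words free of the letters c, a, and t survive
print(positional_hits)  # cow, dad, bit, and at are lost; a--b slips through
```

The first pattern rejects time along with cat, and the second rejects cow, dad, and bit, skips the two-letter word at, and accepts the non-word a--b.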
With all that in mind, let’s take another look at how the regular expression shown at the beginning of this recipe solved the problem:
\b       # Assert position at a word boundary.
(?!      # Not followed by:
  cat    #   Match "cat".
  \b     #   Assert position at a word boundary.
)        # End the negative lookahead.
\w+      # Match one or more word characters.
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
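In Python, the free-spacing option corresponds to the `re.VERBOSE` flag. The following sketch compiles the commented pattern next to its compact form and runs both on a made-up sample string to suggest that they behave the same way.

```python
import re

free_spacing = re.compile(
    r"""
    \b      # Assert position at a word boundary.
    (?!     # Not followed by:
      cat   #   Match "cat".
      \b    #   Assert position at a word boundary.
    )       # End the negative lookahead.
    \w+     # Match one or more word characters.
    """,
    re.VERBOSE | re.IGNORECASE,
)
compact = re.compile(r"\b(?!cat\b)\w+", re.IGNORECASE)

# Hypothetical sample text, not taken from the recipe.
sample = "Cat, catwoman, and vindicate"
verbose_matches = free_spacing.findall(sample)
compact_matches = compact.findall(sample)

print(verbose_matches)
print(compact_matches)
```

Both forms skip the standalone word Cat because of the case-insensitive option, while keeping catwoman and vindicate.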
The key to this pattern is its negative lookahead, ‹(?!⋯)›. The negative lookahead disallows the sequence cat followed by a word boundary, without preventing the use of those letters when they do not appear in that exact sequence, or when they appear as part of a longer or shorter word. There’s no word boundary at the very end of the regular expression, because it wouldn’t change what the regex matches. The ‹+› quantifier in ‹\w+› repeats the word character token as many times as possible, which means that it will always match until the next word boundary.
When applied to the subject string “categorically match any word except cat”, the regex will find five matches: categorically, match, any, word, and except.
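As a quick check in Python, the sketch below runs the regex on that subject string. It also tries a variant with a trailing ‹\b›, added here purely for comparison, to illustrate that the extra word boundary leaves the result unchanged.

```python
import re

subject = "categorically match any word except cat"

not_cat = re.compile(r"\b(?!cat\b)\w+", re.IGNORECASE)
# Variant with a trailing word boundary, for comparison only.
with_trailing_boundary = re.compile(r"\b(?!cat\b)\w+\b", re.IGNORECASE)

matches = not_cat.findall(subject)
matches_with_boundary = with_trailing_boundary.findall(subject)

print(matches)
print(matches == matches_with_boundary)
```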
If, instead of trying to match any word that is not cat, you are trying to match any word that does not contain cat, a slightly different approach is needed:
\b(?:(?!cat)\w)+\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
In the earlier section of this recipe, the word boundary at the beginning of the regular expression provided a convenient anchor that allowed us to simply place the negative lookahead at the beginning of the word. The solution used here is not as efficient, but it’s nevertheless a commonly used construct that allows you to match something other than a particular word or pattern. It does this by repeating a group containing a negative lookahead and a single word character. Before matching each character, the regex engine makes sure that the word cat cannot be matched starting at the current position.
Unlike the previous regular expression, this one requires a terminating word boundary. Otherwise, it could match just the first part of a word, up to where cat appears within it.
When applied to the subject string “categorically match any word except cat”, the regex will find four matches: match, any, word, and except.
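The following Python sketch runs this second regex on the same subject string. It also drops the terminating word boundary and tries the result on the single word vindicate (a test word picked for illustration) to show the partial match that the boundary prevents.

```python
import re

subject = "categorically match any word except cat"

no_cat_inside = re.compile(r"\b(?:(?!cat)\w)+\b", re.IGNORECASE)
# Same pattern without the terminating word boundary, for comparison only.
without_trailing = re.compile(r"\b(?:(?!cat)\w)+", re.IGNORECASE)

matches = no_cat_inside.findall(subject)
partial = without_trailing.findall("vindicate")
strict = no_cat_inside.findall("vindicate")

print(matches)  # words that do not contain "cat"
print(partial)  # stops right before "cat" inside the word
print(strict)   # the trailing boundary rejects the partial match
```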
Recipe 5.1 explains how to find a specific word. Recipe 5.5 explains how to find any word not followed by a specific word. Recipe 5.6 explains how to find any word not preceded by a specific word. Recipe 5.11 explains how to match complete lines that do not contain a word.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.6 explains word boundaries. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.