You want to match any word that is not immediately preceded by the word “cat”, ignoring any whitespace, punctuation, or other nonword characters that come between.
Lookbehind lets you check if text appears before a given position. It works by instructing the regex engine to temporarily step backward in the string, checking whether something can be found ending at the position where you placed the lookbehind. See Recipe 2.16 if you need to brush up on the details of lookbehind.
The following regexes use negative lookbehind, ‹(?<!⋯)›. Unfortunately, the regex flavors covered by this book differ in what kinds of patterns they allow you to place within lookbehind. The solutions therefore end up working a bit differently in each case. Read on to the discussion later in this recipe for further details.
Any number of separating nonword characters:
(?<!\bcat\W+)\b\w+
Regex options: Case insensitive
Regex flavor: .NET
Limited number of separating nonword characters:
(?<!\bcat\W{1,9})\b\w+
Regex options: Case insensitive
Regex flavors: .NET, Java
Single separating nonword character:
(?<!\bcat\W)\b\w+
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python
(?<!\Wcat\W)(?<!^cat\W)\b\w+
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9
JavaScript and Ruby 1.8 do not support lookbehind at all, even though they do support lookahead. However, because the lookbehind for this problem appears at the very beginning of the regex, it’s possible to simulate the lookbehind by splitting the regex into two parts, as demonstrated in the following JavaScript example:
var subject = "My cat is fluffy.",
    mainRegex = /\b\w+/g,
    lookbehind = /\bcat\W+$/i,
    lookbehindType = false, // false for negative, true for positive
    matches = [],
    match,
    leftContext;

while (match = mainRegex.exec(subject)) {
    leftContext = subject.substring(0, match.index);

    if (lookbehindType == lookbehind.test(leftContext)) {
        matches.push(match[0]);
    } else {
        mainRegex.lastIndex = match.index + 1;
    }
}

// matches: ["My", "cat", "fluffy"]
The first regular expression uses the negative lookbehind ‹(?<!\bcat\W+)›. Because the ‹+› quantifier used inside the lookbehind has no upper limit on how many characters it can match, this version works with the .NET regular expression flavor only. All of the other regular expression flavors covered by this book require a fixed or maximum (finite) length for lookbehind patterns.
The second regular expression replaces the ‹+› within the lookbehind with ‹{1,9}›. As a result, it can be used with .NET and Java, both of which support variable-length lookbehind when there is a known upper limit to how many characters can be matched within them. I’ve arbitrarily chosen a maximum length of nine nonword characters to separate the words. That allows a bit of punctuation and a few blank lines to separate the words. Unless you’re working with unusual subject text, this will probably end up working exactly like the previous .NET-only solution. Even in .NET, however, providing a reasonable repetition limit for any quantifiers inside lookbehind is a good safety practice since it reduces the amount of unanticipated backtracking that can potentially occur within the lookbehind.
The third regular expression drops the quantifier after the ‹\W› nonword character inside the lookbehind entirely. Doing so lets the lookbehind test a fixed-length string, thereby adding support for PCRE, Perl, and Python. But it’s a steep price to pay: the regular expression now only avoids matching words that are preceded by “cat” and exactly one separating character. The regex correctly matches only “cat” in the string “cat fluff”, but it matches both “cat” and “fluff” in the string “cat, fluff”.
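The effect of the three quantifier choices can be compared without native lookbehind by reusing the emulation loop from the JavaScript example in the solution. What follows is only a sketch: the helper name wordsNotPrecededBy and the test strings are hypothetical, and the three lookbehind patterns are the ones from the solutions with ‹$› appended.

```javascript
// Hypothetical helper: collects the words for which `lookbehind` does NOT
// match at the end of the text to their left (emulated negative lookbehind).
function wordsNotPrecededBy(subject, lookbehind) {
    const mainRegex = /\b\w+/g;
    const matches = [];
    let match;
    while ((match = mainRegex.exec(subject)) !== null) {
        const leftContext = subject.substring(0, match.index);
        if (!lookbehind.test(leftContext)) {
            matches.push(match[0]);
        } else {
            mainRegex.lastIndex = match.index + 1;
        }
    }
    return matches;
}

const unlimited = /\bcat\W+$/i;    // any number of separators
const limited = /\bcat\W{1,9}$/i;  // one to nine separators
const single = /\bcat\W$/i;        // exactly one separator

const oneSeparator = "cat fluff";
const twoSeparators = "cat, fluff";
const tenSeparators = "cat" + " ".repeat(10) + "fluff";

console.log(wordsNotPrecededBy(oneSeparator, single));     // ["cat"]
console.log(wordsNotPrecededBy(twoSeparators, single));    // ["cat", "fluff"]
console.log(wordsNotPrecededBy(twoSeparators, limited));   // ["cat"]
console.log(wordsNotPrecededBy(tenSeparators, limited));   // ["cat", "fluff"]
console.log(wordsNotPrecededBy(tenSeparators, unlimited)); // ["cat"]
```

The single-separator pattern lets “fluff” through as soon as a comma joins the space, and the limited pattern does the same once the gap grows past nine characters.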
Since Ruby 1.9 doesn’t allow ‹\b› word boundaries in lookbehind, the fourth regular expression uses two separate lookbehinds. The first lookbehind prevents “cat” as the preceding word when it is itself preceded by a nonword character such as whitespace or punctuation. The second uses the ‹^› anchor to prevent “cat” as the preceding word when it appears at the start of the string.
JavaScript does not support lookbehind, but the JavaScript example code shows how you can simulate lookbehind that appears at the beginning of a regex. It doesn’t impose any restrictions on the length of the text matched by the (simulated) lookbehind.
We start by splitting the ‹(?<!\bcat\W+)\b\w+› regular expression from the first solution into two pieces: the pattern inside the lookbehind (‹\bcat\W+›) and the pattern that comes after it (‹\b\w+›). Append a ‹$› to the end of the lookbehind pattern. If you need to use the “^ and $ match at line breaks” option (/m) with the lookbehind regex, use ‹$(?!\s)› instead of ‹$› at the end of the lookbehind pattern to ensure that it can match only at the very end of its subject text. The lookbehindType variable controls whether we’re emulating positive or negative lookbehind. Use true for positive and false for negative.
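The need for ‹$(?!\s)› under /m is easiest to see on a left context that contains a line break. The snippet below is a minimal sketch; the variable names loose and strict and the sample text are made up for illustration.

```javascript
// Text to the left of the word "fluffy" in the subject "cat \nis fluffy".
const leftContext = "cat \nis ";

// With /m, a plain $ also matches just before the line break inside
// leftContext, so the test wrongly reports "cat" as the preceding word.
const loose = /\bcat\W+$/im;

// $(?!\s) rejects any $ position that is followed by a line break,
// which leaves only the very end of leftContext.
const strict = /\bcat\W+$(?!\s)/im;

console.log(loose.test(leftContext));  // true
console.log(strict.test(leftContext)); // false
console.log(strict.test("cat \n"));    // true
```

The last call shows that the stricter anchor still accepts a line break as one of the separating nonword characters, as long as nothing follows it.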
After the variables are set up, we use mainRegex and the exec() method to iterate over the subject string (see Recipe 3.11 for a description of this process). When a match is found, the part of the subject text before the match is copied into a new string variable (leftContext), and we test whether the lookbehind regex matches that string. Because of the anchor we appended to the end of lookbehind, this can only match immediately to the left of the match found by mainRegex, or in other words, at the end of leftContext. By comparing the result of the lookbehind test to lookbehindType, we can determine whether the match meets the complete criteria for a successful match.
Finally, we take one of two steps. If we have a successful match, append the text matched by mainRegex to the matches array. If not, change the position at which to continue searching for a match (using mainRegex.lastIndex) to the position one character after the starting position of mainRegex’s last match, rather than letting the next iteration of the exec() method start at the end of the current match.
Whew! We’re done.
This is an advanced trick that takes advantage of the lastIndex property that is dynamically updated with JavaScript regular expressions that use the /g (global) flag. Usually, updating and resetting lastIndex is something that happens automagically. Here, we use it to take control of the regex’s path through the subject string, moving forward and backward as necessary. This trick only lets you emulate lookbehind that appears at the beginning of a regex. With a few changes, the code could also be used to emulate lookbehind at the very end of a regex. However, it does not serve as a full substitute for lookbehind support. Due to the interplay of lookbehind and backtracking, this approach cannot help you accurately emulate the behavior of a lookbehind that appears in the middle of a regex.
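Stepping lastIndex back matters whenever a rejected match contains a later position at which the main pattern could legitimately start. As a rough illustration, the sketch below emulates a made-up pattern, ‹(?<!#)\w+›, which is not part of this recipe; the function name emulate and its stepBack switch are hypothetical. An engine with real lookbehind would match “ag” and “word” in the sample string.

```javascript
// Emulates ‹(?<!#)\w+›; `stepBack` toggles the lastIndex adjustment.
function emulate(subject, stepBack) {
    const mainRegex = /\w+/g;
    const lookbehind = /#$/;
    const matches = [];
    let match;
    while ((match = mainRegex.exec(subject)) !== null) {
        const leftContext = subject.substring(0, match.index);
        if (!lookbehind.test(leftContext)) {
            matches.push(match[0]);
        } else if (stepBack) {
            // Retry one character after the rejected match's start.
            mainRegex.lastIndex = match.index + 1;
        }
        // Without stepBack, exec() resumes at the end of the rejected match.
    }
    return matches;
}

console.log(emulate("#tag word", true));  // ["ag", "word"]
console.log(emulate("#tag word", false)); // ["word"]
```

Without the adjustment, the rejected match “tag” takes the valid match “ag” down with it.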
If you want to match words that are preceded by “cat” (without including the word “cat” and its following nonword characters as part of the matched text), change the negative lookbehind to positive lookbehind, as shown next.
Any number of separating nonword characters:
(?<=\bcat\W+)\w+
Regex options: Case insensitive
Regex flavor: .NET
Limited number of separating nonword characters:
(?<=\bcat\W{1,9})\w+
Regex options: Case insensitive
Regex flavors: .NET, Java
Single separating nonword character:
(?<=\bcat\W)\w+
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python
(?:(?<=\Wcat\W)|(?<=^cat\W))\w+
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9
These adapted versions of the regexes no longer include a ‹\b› word boundary before the ‹\w+› at the end because the positive lookbehinds already ensure that any match is preceded by a nonword character. The last regex (which adds support for Ruby 1.9) wraps its two positive lookbehinds in ‹(?:⋯|⋯)›, since only one of the lookbehinds can match at a given position.
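The JavaScript emulation from the solution covers the positive case too, simply by setting lookbehindType to true. The following is a sketch under that assumption, with a made-up subject string; as in the regexes above, the main pattern omits ‹\b›.

```javascript
const subject = "My cat is fluffy. Cat, meow!";
const mainRegex = /\w+/g;         // no \b needed: the lookbehind ends in \W+
const lookbehind = /\bcat\W+$/i;
const lookbehindType = true;      // true emulates positive lookbehind
const matches = [];
let match;

while ((match = mainRegex.exec(subject)) !== null) {
    const leftContext = subject.substring(0, match.index);
    if (lookbehindType === lookbehind.test(leftContext)) {
        matches.push(match[0]);
    } else {
        // Rejected: retry one character after this match's start.
        mainRegex.lastIndex = match.index + 1;
    }
}

console.log(matches); // ["is", "meow"]
```

Leaving out ‹\b› is correct here but makes the loop retry inside every rejected word; putting ‹\b› back into mainRegex gives the same result with fewer iterations.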
PCRE 7.2 and Perl 5.10 support the fancy ‹\K› or keep operator that resets the starting position for the part of a match that is returned in the match result (see Alternative to Lookbehind for more details). We can use this to come close to emulating leading infinite-length positive lookbehind, as shown in the next regex:
\bcat\W+\K\w+
Regex options: Case insensitive
Regex flavors: PCRE 7.2, Perl 5.10
There is a subtle but important difference between this and the .NET-only regex that allowed any number of separating nonword characters. Unlike with lookbehind, the text matched to the left of the ‹\K› is consumed by the match even though it is not included in the match result. You can see this difference by comparing the results of the regexes with ‹\K› and positive lookbehind when they’re applied to the subject string “cat cat cat cat”. In Perl and PHP, if you replace all matches of ‹(?<=\bcat\W)\w+› with «dog», you’ll get the result “cat dog dog dog”, since only the first word is not itself preceded by “cat”. If you use the regex ‹\bcat\W+\K\w+› to perform the same replacement, the result will be “cat dog cat dog”. After matching the leading “cat cat” (and replacing it with “cat dog”), the next match attempt can’t peek to the left of its starting position like lookbehind does. The regex matches the second “cat cat”, which is again replaced with “cat dog”.
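JavaScript has no ‹\K›, but the same contrast can be reproduced there as a rough sketch. A capturing group that is put back with «$1» stands in for ‹\K› (an assumption made for illustration, since both consume the prefix), and a hypothetical helper, replaceAfterCat, applies the emulated positive lookbehind.

```javascript
const subject = "cat cat cat cat";

// Stand-in for ‹\bcat\W+\K\w+›: the prefix is consumed by each match and
// restored through the $1 backreference, so matches cannot overlap it.
const consuming = subject.replace(/(\bcat\W+)\w+/gi, "$1dog");

// Emulated ‹(?<=\bcat\W)\w+›: the left context is tested, never consumed.
function replaceAfterCat(text, replacement) {
    const mainRegex = /\b\w+/g;
    const lookbehind = /\bcat\W$/i;
    let result = "";
    let copiedUpTo = 0;
    let match;
    while ((match = mainRegex.exec(text)) !== null) {
        if (lookbehind.test(text.substring(0, match.index))) {
            // Copy the untouched text, then the replacement for this word.
            result += text.substring(copiedUpTo, match.index) + replacement;
            copiedUpTo = match.index + match[0].length;
        } else {
            mainRegex.lastIndex = match.index + 1;
        }
    }
    return result + text.substring(copiedUpTo);
}

console.log(consuming);                       // "cat dog cat dog"
console.log(replaceAfterCat(subject, "dog")); // "cat dog dog dog"
```

The helper always tests the original subject text, so an earlier replacement never hides a “cat” from a later lookbehind test.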
Recipe 5.4 explains how to find all except a specific word. Recipe 5.5 explains how to find any word not followed by a specific word.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.6 explains word boundaries. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround. It also explains ‹\K›, in the section Alternative to Lookbehind.