You’re given the simple task of finding all occurrences of
the word cat
, case insensitively. The
catch is that it must appear as a complete word. You don’t want to find
pieces of longer words, such as hellcat
, application
, or Catwoman
.
Word boundary tokens make this a very easy problem to solve:
cat
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
The word boundaries at both ends of the regular expression
ensure that cat
is matched only when it appears as a
complete word. More precisely, the word boundaries require that
cat
is set
apart from other text by the beginning or end of the string, whitespace,
punctuation, or other nonword characters.
Regular expression engines consider letters, numbers, and underscores to all be word characters. Recipe 2.6 is where we first talked about word boundaries, and covers them in greater detail.
A problem can occur when working with international text
in JavaScript, PCRE, and Ruby, since those regular expression flavors
only consider letters in the ASCII table to
create a word boundary. In other words, word boundaries are
found only at the positions
between a match of ‹[^A-Za-z0-9_]|^
› and ‹[A-Za-z0-9_]
›, or between ‹[A-Za-z0-9_]
› and ‹[^A-Za-z0-9_]|$
›. The same is true
in Python when the UNICODE
or U
flag is not set. This
prevents ‹› from
being useful for a “whole word only” search within text that contains
accented letters or words that use non-Latin scripts. For example, in
JavaScript, PCRE, and Ruby, ‹
über
› will find a match within darüber
, but not within
dar über
. In
most cases, this is the exact opposite of what you would want. The
problem occurs because ü
is
considered a nonword character, and a word boundary is therefore found
between the two characters rü
. No word boundary is found between a
space character and ü
, because they
create a contiguous sequence of nonword characters.
You can deal with this problem by using lookahead and
lookbehind (collectively, lookaround—see Recipe 2.16) instead of word boundaries. Like word
boundaries, lookarounds match zero-width positions. In PCRE (when
compiled with UTF-8 support) and Ruby 1.9, you can emulate Unicode-based
word boundaries using, for example, ‹(?<=[^p{L}p{M}]|^)cat(?=[^p{L}p{M}]|$)
›.
This regular expression also uses Unicode Letter and Mark category
tokens (‹p{L}
› and
‹p{M}
›), which are
discussed in Recipe 2.7. If you want the
lookarounds to also treat any Unicode decimal numbers and connector
punctuation (underscore and similar) as word characters, like ‹› does in regex flavors that
correctly support Unicode, replace the two instances of ‹
[^p{L}p{M}]
› with ‹[^p{L}p{M}p{Nd}p{Pc}]
›.
JavaScript and Ruby 1.8 support neither lookbehind nor Unicode
categories. You can work around the lack of lookbehind support by
matching the nonword character preceding each match, and then either
removing it from each match using procedural code or putting it back
into the string when replacing matches (see the examples of using parts
of a match in a replacement string in Recipe 3.15). The additional lack of support
for matching Unicode categories (coupled with the fact that both
programming languages’ ‹w
› and
‹W
› tokens consider only
ASCII word characters) means you might need to make do with a more
restrictive solution. Code points in the Letter and Mark categories are
scattered throughout Unicode’s character set, so it would take thousands
of characters to emulate ‹[^p{L}p{M}]
› using Unicode escape sequences and
character class ranges. A good compromise might be ‹[^A-Za-zxAAxB5xBAxC0-xD6xD8-xF6xF8-xFF]
›,
which matches all except Unicode letter characters in eight-bit address
space (i.e., the first 256 Unicode code points, from positions 0x00 to
0xFF). There are no code points in the Mark category within this range.
See Figure 5-1 for the list of
nonmatched characters. This negated character class lets you exclude (or
in nonnegated form, match) some of the most commonly used, accented
characters.
Following is an example of how to replace all instances of the
word “cat” with “dog” in JavaScript. It correctly accounts for common,
accented characters, so écat
is not altered. To do this, you’ll
need to construct your own character class instead of relying on the
built-in ‹› or ‹
w
›:
// 8-bit-wide letter characters var pL = "A-Za-zxAAxB5xBAxC0-xD6xD8-xF6xF8-xFF", pattern = "([^{L}]|^)cat([^{L}]|$)".replace(/{L}/g, pL), regex = new RegExp(pattern, "gi"); // replace cat with dog, and put back any // additional matched characters subject = subject.replace(regex, "$1dog$2");
Note that JavaScript string literals use x
(where
HH
is a
two-digit hexadecimal number) to insert special characters. Hence, the
HH
pL
variable that is passed to the
regular expression actually ends up containing the literal versions of
the characters. If you wanted the x
metasequences to be passed through to the
regex itself, you would have to escape the backslashes in the string
literal (i.e., HH
"\x
). However,
in this case it doesn’t matter and will not change what the regular
expression matches.HH
"
This chapter has a variety of recipes that deal with matching words. Recipe 5.2 explains how to find any of multiple words. Recipe 5.3 explains how to find similar words. Recipe 5.4 explains how to find all except a specific word. Recipe 5.10 explains how to match complete lines that contain a word.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.6 explains word boundaries. Recipe 2.7 explains how to match Unicode characters. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.16 explains lookaround.