Create a regex that matches cat in My cat is brown, but not in category or bobcat. Create another regex that matches cat in staccato, but not in any of the three previous subject strings.
The regular expression token ‹\b› is called a word boundary. It matches at the start or the end of a word. By itself, it results in a zero-length match. ‹\b› is an anchor, just like the tokens introduced in the previous section.
Strictly speaking, ‹\b› matches in these three positions:
Before the first character in the subject, if the first character is a word character
After the last character in the subject, if the last character is a word character
Between two characters in the subject, where one is a word character and the other is not a word character
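The three positions listed above can be checked directly. The following is a minimal sketch, assuming Python 3's re module; the helper name boundary_positions and the sample subject are made up for illustration.

```python
import re

def boundary_positions(subject):
    """Return every index in subject at which a word boundary matches."""
    # \b is zero-length, so each match starts and ends at the same index.
    return [m.start() for m in re.finditer(r"\b", subject)]

# Indexes: M=0, y=1, space=2, c=3, a=4, t=5, !=6
print(boundary_positions("My cat!"))
# → [0, 2, 3, 6]
```

Index 0 is the "before the first character" case, and 2, 3, and 6 are all "between a word character and a nonword character". Index 7, after the final ‹!›, is not reported because the last character is not a word character.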
To run a “whole words only” search using a regular expression, simply place the word between two word boundaries, as we did with ‹\bcat\b›. The first ‹\b› requires the ‹c› to occur at the very start of the string, or after a nonword character. The second ‹\b› requires the ‹t› to occur at the very end of the string, or before a nonword character.
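As a quick check of the “whole words only” search, the sketch below runs ‹\bcat\b› against the four subject strings from the problem statement. It assumes Python 3's re module; the variable names are illustrative only.

```python
import re

whole_word_cat = re.compile(r"\bcat\b")

subjects = ["My cat is brown", "category", "bobcat", "staccato"]

# Map each subject to whether "cat" occurs in it as a whole word.
results = {s: whole_word_cat.search(s) is not None for s in subjects}

for subject, found in results.items():
    print(f"{subject!r}: {found}")
```

Only My cat is brown should report a match, since that is the only subject in which cat has a nonword character (or the string edge) on both sides.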
Line break characters are nonword characters. ‹\b› will match after a line break if the line break is immediately followed by a word character. It will also match before a line break immediately preceded by a word character. So a word that occupies a whole line by itself will be found by a “whole words only” search. ‹\b› is unaffected by “multiline” mode or ‹(?m)›, which is one of the reasons why this book refers to “multiline” mode as “^ and $ match at line breaks” mode.
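To see that the mode only changes the anchors and not the word boundary, the sketch below compares ‹\bcat\b› with ‹^cat$› with and without the flag. It assumes Python 3, where re.MULTILINE plays the role of ‹(?m)›; the three-line subject is a made-up example.

```python
import re

# Hypothetical subject in which "cat" occupies the middle line by itself.
text = "dog\ncat\nbird"

# \b finds the word with or without "multiline" mode.
plain = re.findall(r"\bcat\b", text)
multiline = re.findall(r"\bcat\b", text, flags=re.MULTILINE)

# ^ and $ are the tokens whose behavior the mode actually changes.
anchored_plain = re.findall(r"^cat$", text)
anchored_multiline = re.findall(r"^cat$", text, flags=re.MULTILINE)

print(plain, multiline, anchored_plain, anchored_multiline)
# → ['cat'] ['cat'] [] ['cat']
```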
None of the flavors discussed in this book have separate tokens for matching only before or only after a word. Unless you want to create a regex that consists of nothing but a word boundary, these aren’t needed. The tokens before or after the ‹\b› in your regular expression will determine where ‹\b› can match.
The ‹\b› in ‹\bx› and ‹!\b› could match only at the start of a word. The ‹\b› in ‹x\b› and ‹\b!› could match only at the end of a word. ‹x\bx› and ‹!\b!› can never match anywhere.
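A small experiment makes the point that the neighboring tokens decide where a word boundary can match. This is only a sketch, assuming Python 3's re module; the helper can_match and the sample subjects are illustrative.

```python
import re

def can_match(pattern, subject):
    """Return True if pattern matches anywhere in subject."""
    return re.search(pattern, subject) is not None

# \b with a word character on its right, or a nonword character on its
# left, can only sit at the start of a word.
start_cases = [can_match(r"\bx", "!x"), can_match(r"!\b", "!x")]

# \b with a word character on its left, or a nonword character on its
# right, can only sit at the end of a word.
end_cases = [can_match(r"x\b", "x!"), can_match(r"\b!", "x!")]

# With the same kind of character on both sides, \b has nowhere to match.
never = [can_match(r"x\bx", s) or can_match(r"!\b!", s)
         for s in ("xx", "!!", "x!x", "!x!", "x x")]

print(start_cases, end_cases, never)
# → [True, True] [True, True] [False, False, False, False, False]
```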
If you really want to match only the position before a word or only after a word, you can do so with lookahead and lookbehind. Recipe 2.16 explains lookahead and lookbehind. This method does not work with JavaScript and Ruby 1.8 because these flavors do not support lookbehind. The regex ‹(?<!\w)(?=\w)› matches the start of a word by checking that the character before the match position is not a word character, and that the character after the match position is a word character. ‹(?<=\w)(?!\w)› does the opposite: it matches the end of the word by checking that the preceding character is a word character, and that the following character is not a word character.
It’s important to use negative lookaround with ‹\w› rather than positive lookaround with ‹\W› to check for the absence of a word character. ‹(?<!\w)› matches at the start of the string because there is no word character (or any character at all) before the start of the string. But ‹(?<=\W)› never matches at the start of the string. ‹(?!\w)› matches at the end of the string for the same reason. So our two lookaround constructs will correctly match the start of the string if the string begins with a word and the end of the string if it ends with a word.
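The difference between negative and positive lookaround shows up as soon as a word touches the edge of the subject. The sketch below assumes Python 3's re module, which supports the fixed-width lookbehind used here; the constant names and the sample subject are invented for illustration, and BROKEN_START is the deliberately flawed variant built on ‹(?<=\W)›.

```python
import re

START_OF_WORD = re.compile(r"(?<!\w)(?=\w)")
END_OF_WORD = re.compile(r"(?<=\w)(?!\w)")
# Flawed variant: positive lookbehind needs an actual nonword character.
BROKEN_START = re.compile(r"(?<=\W)(?=\w)")

# Indexes: cat = 0-2, and = 4-6, dog = 8-10, length 11
subject = "cat and dog"

starts = [m.start() for m in START_OF_WORD.finditer(subject)]
ends = [m.start() for m in END_OF_WORD.finditer(subject)]
broken = [m.start() for m in BROKEN_START.finditer(subject)]

print(starts, ends, broken)
# → [0, 4, 8] [3, 7, 11] [4, 8]
```

The flawed variant misses index 0, the word that begins the string, which is exactly the case the negative lookbehind handles.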
‹\B› matches at every position in the subject text where ‹\b› does not match. ‹\B› matches at every position that is not at the start or end of a word.
Strictly speaking, ‹\B› matches in these five positions:
Before the first character in the subject, if the first character is not a word character
After the last character in the subject, if the last character is not a word character
Between two word characters
Between two nonword characters
The empty string
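To illustrate that ‹\B› is the exact complement of ‹\b›, the sketch below lists the positions of both in one subject. It assumes Python 3's re module; the helper positions and the sample subject are illustrative. The empty-string case is deliberately left out, because behavior for that edge case may differ between Python versions.

```python
import re

def positions(pattern, subject):
    """Return the indexes at which a zero-length pattern matches."""
    return [m.start() for m in re.finditer(pattern, subject)]

# Indexes: !=0, a=1, space=2, b=3, c=4, !=5, !=6, length 7
subject = "!a bc!!"

boundaries = positions(r"\b", subject)
nonboundaries = positions(r"\B", subject)

print(boundaries, nonboundaries)
# → [1, 2, 3, 5] [0, 4, 6, 7]
```

Index 0 is "before the first character, which is not a word character", 4 lies between two word characters, 6 lies between two nonword characters, and 7 is "after the last character, which is not a word character". Together the two lists cover every index from 0 to 7 exactly once.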
‹\Bcat\B› matches cat in staccato, but not in My cat is brown, category, or bobcat.
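The same four subject strings can be run through the nonboundary regex. This is a minimal sketch assuming Python 3's re module; the variable names are illustrative.

```python
import re

inside_word_cat = re.compile(r"\Bcat\B")

subjects = ["My cat is brown", "category", "bobcat", "staccato"]

# Map each subject to whether "cat" occurs strictly inside a longer word.
inside = {s: inside_word_cat.search(s) is not None for s in subjects}

for subject, found in inside.items():
    print(f"{subject!r}: {found}")
```

Only staccato should match, because it is the only subject with word characters on both sides of cat.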
To do the opposite of a “whole words only” search (i.e., excluding My cat is brown and including staccato, category, and bobcat), you need to use alternation to combine ‹\Bcat› and ‹cat\B› into ‹\Bcat|cat\B›. ‹\Bcat› matches cat in staccato and bobcat. ‹cat\B› matches cat in category (and staccato if ‹\Bcat› hadn’t already taken care of that). Recipe 2.8 explains alternation.
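To see which alternative is responsible for each match, the sketch below wraps the two halves of ‹\Bcat|cat\B› in named groups. The group names left and right are an addition made purely for illustration and are not part of the recipe's regex; the code assumes Python 3's re module.

```python
import re

# Same alternation as in the text, with named groups added so the
# winning alternative can be reported.
partial_word_cat = re.compile(r"(?P<left>\Bcat)|(?P<right>cat\B)")

subjects = ["My cat is brown", "staccato", "bobcat", "category"]

which = {}
for s in subjects:
    m = partial_word_cat.search(s)
    # lastgroup names the alternative that matched; None means no match.
    which[s] = m.lastgroup if m else None

print(which)
```

Under these assumptions, staccato and bobcat are matched by the left alternative, category by the right one, and My cat is brown by neither.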
All this talk about word boundaries, but no talk about what a word character is. A word character is a character that can occur as part of a word. The subsection Shorthands in Recipe 2.3 discussed which characters are included in ‹\w›, which matches a single word character. Unfortunately, the story is not the same for ‹\b›.
Although all the flavors in this book support ‹\b› and ‹\B›, they differ in which characters are word characters.
.NET, JavaScript, PCRE, Perl, Python, and Ruby have ‹\b› match between two characters where one is matched by ‹\w› and the other by ‹\W›. ‹\B› always matches between two characters where both are matched by ‹\w› or ‹\W›.
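For a flavor in which ‹\b› is defined in terms of ‹\w›, the boundary test can be rebuilt by hand and compared with the real token. The sketch below does this for Python 3's re module only; the helpers is_word_char and is_boundary are hypothetical names, and treating the string edges as "no word character" is an assumption that mirrors the three positions described earlier.

```python
import re

def is_word_char(ch):
    """True if the single character ch is matched by \\w."""
    return re.fullmatch(r"\w", ch) is not None

def is_boundary(subject, i):
    """Hand-rolled \\b: exactly one side of index i is a word character."""
    before = i > 0 and is_word_char(subject[i - 1])
    after = i < len(subject) and is_word_char(subject[i])
    return before != after

def manual_positions(subject):
    return [i for i in range(len(subject) + 1) if is_boundary(subject, i)]

def regex_positions(subject):
    return [m.start() for m in re.finditer(r"\b", subject)]

samples = ["My cat!", "a_b 9", "!!", "x"]
agree = all(manual_positions(s) == regex_positions(s) for s in samples)
print(agree)
```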
JavaScript, PCRE, and Ruby view only ASCII characters as word characters. ‹\w› is identical to ‹[a-zA-Z0-9_]›. With these flavors, you can do a “whole words only” search on words in languages that use only the letters A to Z without diacritics, such as English. But these flavors cannot do “whole words only” searches on words in other languages, such as Spanish or Russian.
.NET treats letters and digits from all scripts as word characters. You can do a “whole words only” search on words in any language, including those that don’t use the Latin alphabet.
Python gives you an option. In Python 2.x, non-ASCII characters are included only if you pass the UNICODE or U flag when creating the regex. In Python 3.x, non-ASCII characters are included by default, but you can exclude them with the ASCII or A flag. This flag affects both ‹\b› and ‹\w› equally.
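The effect of the flag on both tokens can be observed side by side. This is a minimal sketch assuming Python 3.x, where re.ASCII is the long name of the A flag; the mixed Russian and English subject is a made-up example.

```python
import re

text = "кошка и cat"

# Default in Python 3: \w and \b are Unicode-aware.
unicode_words = re.findall(r"\b\w+\b", text)
unicode_hit = re.search(r"\bкошка\b", text) is not None

# With the ASCII flag, Cyrillic letters stop being word characters,
# so \w+ skips them and \b no longer sees a boundary around кошка.
ascii_words = re.findall(r"\b\w+\b", text, flags=re.ASCII)
ascii_hit = re.search(r"\bкошка\b", text, flags=re.ASCII) is not None

print(unicode_words, unicode_hit)  # → ['кошка', 'и', 'cat'] True
print(ascii_words, ascii_hit)      # → ['cat'] False
```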
In Perl, it depends on your version of Perl and the /adlu flags whether ‹\w› is pure ASCII or includes all Unicode letters, digits, and underscores. The subsection Shorthands in Recipe 2.3 explains this in more detail. In all versions of Perl, ‹\b› is consistent with ‹\w›.
Java behaves inconsistently. ‹\w› matches only ASCII characters in Java 4 to 6. In Java 7, ‹\w› matches only ASCII characters by default, but matches Unicode characters if you set the UNICODE_CHARACTER_CLASS flag. But ‹\b› is Unicode-enabled in all versions of Java, supporting any script. In Java 4 to 6, ‹\b\w\b› matches a single English letter, digit, or underscore that does not occur as part of a word in any language. ‹\bкошка\b› always correctly matches the Russian word for cat in Java, because ‹\b› supports Unicode. But ‹\w+› will not match any Russian word in Java 4 to 6, because ‹\w› is ASCII-only.
Recipe 2.3 discusses which characters are matched by the shorthand character class ‹\w›, which matches a word character.
Recipe 5.1 shows how you can use word boundaries to match complete words, and how you can work around the different behavior of word boundaries in various regex flavors.