Create a regular expression that when applied repeatedly to the text Mary, Jane, and Sue went to Mary's house will match Mary, Jane, Sue, and then Mary again. Further match attempts should fail.
The vertical bar, or pipe symbol, splits the regular expression into multiple alternatives. ‹Mary|Jane|Sue› matches Mary, or Jane, or Sue with each match attempt. Only one name matches each time, but a different name can match each time.
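As a minimal sketch of the solution in use, assuming Python's `re` module (the text itself does not prescribe a language), `finditer` applies the regex repeatedly, resuming each attempt where the previous match ended. The variable names are illustrative only.

```python
import re

subject = "Mary, Jane, and Sue went to Mary's house"
names = re.compile(r"Mary|Jane|Sue")

# finditer keeps applying the regex to the rest of the subject
# until no further match can be found.
matches = [m.group() for m in names.finditer(subject)]
print(matches)  # ['Mary', 'Jane', 'Sue', 'Mary']
```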
All regular expression flavors discussed in this book use a regex-directed engine. The engine is simply the software that makes the regular expression work. Regex-directed[3] means that all possible permutations of the regular expression are attempted at each character position in the subject text, before the regex is attempted at the next character position.
When you apply ‹Mary|Jane|Sue› to Mary, Jane, and Sue went to Mary's house, the match Mary is immediately found at the start of the string.
When you apply the same regex to the remainder of the string (e.g., by clicking “Find Next” in your text editor), the regex engine attempts to match ‹Mary› at the first comma in the string. That fails. Then, it attempts to match ‹Jane› at the same position, which also fails. Attempting to match ‹Sue› at the comma fails, too. Only then does the regex engine advance to the next character in the string. Starting at the first space, all three alternatives fail in the same way.
Starting at the J, the first alternative, ‹Mary›, fails to match. The second alternative, ‹Jane›, is then attempted starting at the J. It matches Jane. The regex engine declares victory.
Notice that Jane was found even though there is another occurrence of Mary in the subject text, and that ‹Mary› appears before ‹Jane› in the regex. At least in this case, the order of the alternatives in the regular expression does not matter. The regular expression finds the leftmost match. It scans the text from left to right, tries all alternatives in the regular expression at each step, and stops at the first position in the text where any of the alternatives produces a valid match.
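The scan described above can be traced with a small simulation. This is only a sketch under a strong simplifying assumption: each alternative is plain literal text, so `str.startswith` stands in for a real match attempt. The function `first_match` is hypothetical and is not how any actual regex engine is implemented.

```python
def first_match(alternatives, subject, start=0):
    """Return (position, text, attempts) for the leftmost match, or None.

    `attempts` records every (position, alternative) pair that was tried,
    in the order a regex-directed engine would try them.
    """
    attempts = []
    for pos in range(start, len(subject) + 1):   # left to right through the text
        for alt in alternatives:                 # left to right through the regex
            attempts.append((pos, alt))
            if subject.startswith(alt, pos):
                return pos, alt, attempts        # stop at the first success
    return None

subject = "Mary, Jane, and Sue went to Mary's house"

# Resume just after the first Mary, i.e. at the first comma (index 4).
pos, text, attempts = first_match(["Mary", "Jane", "Sue"], subject, start=4)
for attempt in attempts:
    print(attempt)
```

The trace shows all three alternatives failing at the comma and at the space before ‹Jane› succeeds at the J, where ‹Mary› is still tried first.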
If we do another search through the remainder of the string, Sue will be found. The fourth search will find Mary once more. If you tell the regex engine to do a fifth search, that will fail, because none of the three alternatives matches the remaining 's house string.
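To see the failing fifth search explicitly, here is a sketch that drives the searches by hand, again assuming Python's `re` module. Each search resumes at the end of the previous match by passing a start position to `search`.

```python
import re

names = re.compile(r"Mary|Jane|Sue")
subject = "Mary, Jane, and Sue went to Mary's house"

pos = 0
results = []
for _ in range(5):
    m = names.search(subject, pos)
    if m is None:
        results.append(None)      # the fifth search ends up here
        break
    results.append(m.group())
    pos = m.end()                 # the next search starts after this match

remainder = subject[pos:]         # what was left for the fifth search
print(results, repr(remainder))
```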
The order of the alternatives in the regex matters only when two of them can match at the same position in the string. The regex ‹Jane|Janet› has two alternatives that match at the same position in the text Her name is Janet. There are no word boundaries in the regular expression. The fact that ‹Jane› matches the word Janet in Her name is Janet only partially does not matter.
‹Jane|Janet› matches Jane in Her name is Janet because a regex-directed regular expression engine is eager. In addition to scanning the subject text from left to right, finding the leftmost match in the text, it also scans the alternatives in the regex from left to right. The engine stops as soon as it finds an alternative that matches.
When ‹Jane|Janet› reaches the J in Her name is Janet, the first alternative, ‹Jane›, matches. The second alternative is not attempted. If we tell the engine to look for a second match, the t is all that is left of the subject text. Neither alternative matches there.
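The eager behavior is easy to observe. The following sketch assumes Python's `re` module, which behaves as a regex-directed engine here; the variable names are illustrative.

```python
import re

pattern = re.compile(r"Jane|Janet")
subject = "Her name is Janet"

m = pattern.search(subject)
first = m.group()                         # the first alternative wins
leftover = subject[m.end():]              # text remaining after the match
second = pattern.search(subject, m.end()) # nothing left to match

print(first, repr(leftover), second)  # Jane 't' None
```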
There are two ways to stop Jane from stealing Janet’s limelight. One way is to put the longer alternative first: ‹Janet|Jane›. A more solid solution is to be explicit about what we’re trying to do: we’re looking for names, and names are complete words. Regular expressions don’t deal with words, but they can deal with word boundaries.
So ‹\bJane\b|\bJanet\b› and ‹\bJanet\b|\bJane\b› will both match Janet in Her name is Janet. Because of the word boundaries, only one alternative can match. The order of the alternatives is again irrelevant.
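The two fixes can be compared side by side. This is a sketch assuming Python's `re` module, with the word-boundary variants written using `\b` on both sides of each name; the dictionary `results` is just an illustrative way to collect the outcomes.

```python
import re

subject = "Her name is Janet"
patterns = [
    r"Jane|Janet",            # shorter alternative first: matches only Jane
    r"Janet|Jane",            # longer alternative first
    r"\bJane\b|\bJanet\b",    # word boundaries, shorter first
    r"\bJanet\b|\bJane\b",    # word boundaries, longer first
]

# Map each pattern to the text of its first match in the subject.
results = {p: re.search(p, subject).group() for p in patterns}
for p, text in results.items():
    print(f"{p:24} -> {text}")
```

With the word boundaries in place, ‹\bJane\b› fails at the J because no boundary follows the e, so the engine moves on to the other alternative regardless of order.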
Recipe 2.12 explains the best solution: ‹\bJanet?\b›.
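As a brief illustration of that approach, the sketch below makes the final t optional instead of using alternation. It assumes Python's `re` module and writes the pattern with `\b` word boundaries, in keeping with the preceding discussion; the sample sentence is made up for the example.

```python
import re

pattern = re.compile(r"\bJanet?\b")

# "Janette" is rejected: no word boundary follows "Jane" or "Janet" inside it.
found = pattern.findall("Jane met Janet, not Janette")
print(found)  # ['Jane', 'Janet']
```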
Recipe 2.9 explains how to group parts of a regex. You need to use a group if you want to place several alternatives in the middle of a regex.
[3] The other kind of engine is a text-directed engine. The key difference is that a text-directed engine visits each character in the subject text only once, whereas a regex-directed engine may visit each character many times. Text-directed engines are much faster, but support regular expressions only in the mathematical sense described at the beginning of Chapter 1. The fancy Perl-style regular expressions that make this book so interesting can be implemented only with a regex-directed engine.