2.8. Match One of Several Alternatives

Problem

Create a regular expression that when applied repeatedly to the text Mary, Jane, and Sue went to Mary's house will match Mary, Jane, Sue, and then Mary again. Further match attempts should fail.

Solution

Mary|Jane|Sue

Regex options: None

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

The vertical bar, or pipe symbol, splits the regular expression into multiple alternatives. ‹Mary|Jane|Sue› matches Mary, or Jane, or Sue with each match attempt. Only one name matches each time, but a different name can match each time.
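As a minimal sketch of what "applied repeatedly" means, the following uses Python's `re` module (the choice of Python and the helper name `find_all` are assumptions for illustration; the recipe itself is flavor-neutral). Each search resumes where the previous match ended.

```python
import re

subject = "Mary, Jane, and Sue went to Mary's house"
pattern = re.compile(r"Mary|Jane|Sue")


def find_all(pattern, text):
    """Apply the regex repeatedly, resuming after each match."""
    found = []
    pos = 0
    while True:
        match = pattern.search(text, pos)
        if match is None:
            break  # a further match attempt fails
        found.append((match.start(), match.group()))
        pos = match.end()  # continue with the remainder of the string
    return found


matches = find_all(pattern, subject)
print(matches)
# [(0, 'Mary'), (6, 'Jane'), (16, 'Sue'), (28, 'Mary')]
```

In everyday code, `pattern.finditer(subject)` performs the same loop; the explicit version is shown only to make the repeated match attempts visible.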

All regular expression flavors discussed in this book use a regex-directed engine. The engine is simply the software that makes the regular expression work. Regex-directed means that all possible permutations of the regular expression are attempted at each character position in the subject text, before the regex is attempted at the next character position.

When you apply ‹Mary|Jane|Sue› to Mary, Jane, and Sue went to Mary's house, the match Mary is immediately found at the start of the string.

When you apply the same regex to the remainder of the string—e.g., by clicking “Find Next” in your text editor—the regex engine attempts to match ‹Mary› at the first comma in the string. That fails. Then, it attempts to match ‹Jane› at the same position, which also fails. Attempting to match ‹Sue› at the comma fails, too. Only then does the regex engine advance to the next character in the string. Starting at the first space, all three alternatives fail in the same way.

Starting at the J, the first alternative, ‹Mary›, fails to match. The second alternative, ‹Jane›, is then attempted starting at the J. It matches Jane. The regex engine declares victory.

Notice that Jane was found even though there is another occurrence of Mary in the subject text, and that ‹Mary› appears before ‹Jane› in the regex. At least in this case, the order of the alternatives in the regular expression does not matter. The regular expression finds the leftmost match. It scans the text from left to right, tries all alternatives in the regular expression at each step, and stops at the first position in the text where any of the alternatives produces a valid match.

If we do another search through the remainder of the string, Sue will be found. The fourth search will find Mary once more. If you tell the regex engine to do a fifth search, that will fail, because none of the three alternatives matches the remaining text, 's house.
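The walk-through above can be imitated with a toy model. This is only a sketch: the function `find_next` is hypothetical, handles nothing but literal alternatives, and is not how any real regex engine is implemented. It does reproduce the order of attempts described here: every alternative is tried at one position before the position advances.

```python
def find_next(alternatives, text, start=0):
    """Toy regex-directed search for literal alternatives only.

    Tries each alternative, left to right, at one position before
    moving on to the next position. Returns (position, matched_text)
    for the leftmost match, or None if nothing matches.
    """
    for pos in range(start, len(text) + 1):
        for alt in alternatives:
            if text.startswith(alt, pos):
                return pos, alt  # stop at the first alternative that matches
    return None


subject = "Mary, Jane, and Sue went to Mary's house"
names = ["Mary", "Jane", "Sue"]

results = []
start = 0
while True:
    hit = find_next(names, subject, start)
    if hit is None:
        break  # the fifth search fails on the remaining "'s house"
    pos, name = hit
    results.append((pos, name))
    start = pos + len(name)  # resume after the match

print(results)
# [(0, 'Mary'), (6, 'Jane'), (16, 'Sue'), (28, 'Mary')]
```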

The order of the alternatives in the regex matters only when two of them can match at the same position in the string. The regex ‹Jane|Janet› has two alternatives that match at the same position in the text Her name is Janet. There are no word boundaries in the regular expression, so it does not matter that ‹Jane› matches only part of the word Janet.

‹Jane|Janet› matches Jane in Her name is Janet because a regex-directed regular expression engine is eager. In addition to scanning the subject text from left to right, finding the leftmost match in the text, it also scans the alternatives in the regex from left to right. The engine stops as soon as it finds an alternative that matches.

When ‹Jane|Janet› reaches the J in Her name is Janet, the first alternative, ‹Jane›, matches. The second alternative is not attempted. If we tell the engine to look for a second match, the t is all that is left of the subject text. Neither alternative matches there.
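The eager behavior is easy to observe. The snippet below is a small illustration using Python's `re` module (an assumed choice of flavor); it shows the first alternative winning and the second search finding nothing in the leftover t.

```python
import re

subject = "Her name is Janet"
pattern = re.compile(r"Jane|Janet")

# The first alternative matches at the J, so Janet is never attempted.
first = pattern.search(subject)
print(first.group(), first.span())  # Jane (12, 16)

# Only the final "t" remains, and neither alternative matches there.
second = pattern.search(subject, first.end())
print(second)  # None
```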

There are two ways to stop Jane from stealing Janet’s limelight. One way is to put the longer alternative first: ‹Janet|Jane›. A more solid solution is to be explicit about what we’re trying to do: we’re looking for names, and names are complete words. Regular expressions don’t deal with words, but they can deal with word boundaries.

So ‹\bJane\b|\bJanet\b› and ‹\bJanet\b|\bJane\b› will both match Janet in Her name is Janet. Because of the word boundaries, only one alternative can match. The order of the alternatives is again irrelevant.
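Both fixes can be compared side by side. This sketch assumes Python's `re` module and uses `\b` as the word boundary token; the variable names are illustrative only.

```python
import re

subject = "Her name is Janet"

# Fix 1: put the longer alternative first.
longer_first = re.search(r"Janet|Jane", subject).group()

# Fix 2: require word boundaries, in either order of alternatives.
bounded_a = re.search(r"\bJane\b|\bJanet\b", subject).group()
bounded_b = re.search(r"\bJanet\b|\bJane\b", subject).group()

print(longer_first, bounded_a, bounded_b)  # Janet Janet Janet

# With word boundaries, the shorter name is still found when it
# appears as a complete word.
print(re.search(r"\bJanet\b|\bJane\b", "Her name is Jane").group())  # Jane
```

The word-boundary version does not depend on how the alternatives are ordered, which is why the text calls it the more solid solution.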

Recipe 2.12 explains the best solution: ‹\bJanet?\b›.
