2.8. Match One of Several Alternatives

Problem

Create a regular expression that when applied repeatedly to the text Mary, Jane, and Sue went to Mary's house will match Mary, Jane, Sue, and then Mary again. Further match attempts should fail.

Solution

Mary|Jane|Sue
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

The vertical bar, or pipe symbol, splits the regular expression into multiple alternatives. Mary|Jane|Sue matches Mary, or Jane, or Sue with each match attempt. Only one name matches each time, but a different name can match each time.

All regular expression flavors discussed in this book use a regex-directed engine. The engine is simply the software that makes the regular expression work. Regex-directed[3] means that all possible permutations of the regular expression are attempted at each character position in the subject text, before the regex is attempted at the next character position.

When you apply Mary|Jane|Sue to Mary, Jane, and Sue went to Mary's house, the match Mary is immediately found at the start of the string.

When you apply the same regex to the remainder of the string—e.g., by clicking “Find Next” in your text editor—the regex engine attempts to match Mary at the first comma in the string. That fails. Then, it attempts to match Jane at the same position, which also fails. Attempting to match Sue at the comma fails, too. Only then does the regex engine advance to the next character in the string. Starting at the first space, all three alternatives fail in the same way.

Starting at the J, the first alternative, Mary, fails to match. The second alternative, Jane, is then attempted starting at the J. It matches Jane. The regex engine declares victory.

Notice that Jane was found even though there is another occurrence of Mary in the subject text, and that Mary appears before Jane in the regex. At least in this case, the order of the alternatives in the regular expression does not matter. The regular expression finds the leftmost match. It scans the text from left to right, tries all alternatives in the regular expression at each step, and stops at the first position in the text where any of the alternatives produces a valid match.
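The scanning order described above can be imitated with a few lines of Python. This is only a minimal sketch of the idea for literal alternatives, not how any real regex engine is implemented; the helper name find_first is hypothetical and exists purely for illustration.

```python
def find_first(alternatives, subject, start=0):
    """Return (position, matched_text) for the leftmost match, or None.

    Imitates a regex-directed engine for plain literal alternatives:
    at each position, try every alternative from left to right before
    advancing to the next character.
    """
    for pos in range(start, len(subject) + 1):
        for alt in alternatives:
            if subject.startswith(alt, pos):
                return pos, alt
    return None


subject = "Mary, Jane, and Sue went to Mary's house"
names = ["Mary", "Jane", "Sue"]

first = find_first(names, subject)
# Resume the search right after the end of the previous match.
second = find_first(names, subject, start=first[0] + len(first[1]))

print(first)   # (0, 'Mary')
print(second)  # (6, 'Jane')
```

The outer loop walks the subject text and the inner loop walks the alternatives, which is why Jane is found at position 6 even though Mary is listed first in the regex.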

If we do another search through the remainder of the string, Sue will be found. The fourth search will find Mary once more. If you tell the regex engine to do a fifth search, that will fail, because none of the three alternatives matches anywhere in the remaining text, ’s house.
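The sequence of searches can be reproduced with Python's re module, one of the flavors listed for this recipe. The variable names are chosen for illustration only.

```python
import re

subject = "Mary, Jane, and Sue went to Mary's house"
pattern = re.compile(r"Mary|Jane|Sue")

# finditer() applies the regex repeatedly, each time resuming
# where the previous match ended.
found = [m.group() for m in pattern.finditer(subject)]
print(found)  # ['Mary', 'Jane', 'Sue', 'Mary']

# A further search starting after the last match finds nothing.
last_end = list(pattern.finditer(subject))[-1].end()
print(pattern.search(subject, last_end))  # None
```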

The order of the alternatives in the regex matters only when two of them can match at the same position in the string. The regex Jane|Janet has two alternatives that match at the same position in the text Her name is Janet. There are no word boundaries in the regular expression, so it does not matter that Jane matches only part of the word Janet.

Jane|Janet matches Jane in Her name is Janet because a regex-directed regular expression engine is eager. In addition to scanning the subject text from left to right, finding the leftmost match in the text, it also scans the alternatives in the regex from left to right. The engine stops as soon as it finds an alternative that matches.

When Jane|Janet reaches the J in Her name is Janet, the first alternative, Jane, matches. The second alternative is not attempted. If we tell the engine to look for a second match, the t is all that is left of the subject text. Neither alternative matches there.
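A short Python check of this behavior, using the re module; the variable names are illustrative.

```python
import re

subject = "Her name is Janet"
pattern = re.compile(r"Jane|Janet")

first = pattern.search(subject)
print(first.group())           # Jane
print(subject[first.end():])   # t

# Only the trailing "t" is left, so a second search fails.
second = pattern.search(subject, first.end())
print(second)                  # None
```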

There are two ways to stop Jane from stealing Janet’s limelight. One way is to put the longer alternative first: Janet|Jane. A more solid solution is to be explicit about what we’re trying to do: we’re looking for names, and names are complete words. Regular expressions don’t deal with words, but they can deal with word boundaries.

So \bJane\b|\bJanet\b and \bJanet\b|\bJane\b will both match Janet in Her name is Janet. Because of the word boundaries, only one alternative can match. The order of the alternatives is again irrelevant.
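The effect of both fixes can be checked in Python. This sketch writes the word-boundary versions as \bJane\b|\bJanet\b and \bJanet\b|\bJane\b; the variable names are illustrative.

```python
import re

subject = "Her name is Janet"

# Fix 1: put the longer alternative first.
longer_first = re.findall(r"Janet|Jane", subject)

# Fix 2: require word boundaries around each name.
# \bJane\b fails at the J because "Jane" is followed by "t",
# which is not a word boundary, so the engine moves on to \bJanet\b.
bounded_a = re.findall(r"\bJane\b|\bJanet\b", subject)
bounded_b = re.findall(r"\bJanet\b|\bJane\b", subject)

print(longer_first)  # ['Janet']
print(bounded_a)     # ['Janet']
print(bounded_b)     # ['Janet']
```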

Recipe 2.12 explains the best solution: \bJanet?\b.

See Also

Recipe 2.9 explains how to group parts of a regex. You need to use a group if you want to place several alternatives in the middle of a regex.



[3] The other kind of engine is a text-directed engine. The key difference is that a text-directed engine visits each character in the subject text only once, whereas a regex-directed engine may visit each character many times. Text-directed engines are much faster, but support regular expressions only in the mathematical sense described at the beginning of Chapter 1. The fancy Perl-style regular expressions that make this book so interesting can be implemented only with a regex-directed engine.
