2.17. Match One of Two Alternatives Based on a Condition

Problem

Create a regular expression that matches a comma-delimited list of the words one, two, and three. Each word can occur any number of times in the list, and the words can occur in any order, but each word must appear at least once.

Solution

(?:(?:(one)|(two)|(three))(?:,|)){3,}(?(1)|(?!))(?(2)|(?!))(?(3)|(?!))
Regex options: None
Regex flavors: .NET, PCRE, Perl, Python

Java, JavaScript, and Ruby do not support conditionals. When programming in these languages (or any other language), you can use the regular expression without the conditionals, and write some extra code to check whether each of the three capturing groups matched something.

(?:(?:(one)|(two)|(three))(?:,|)){3,}
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
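
The "extra code" can be quite small. The following is a minimal sketch in Python, assuming the whole subject string should be the list (hence fullmatch). The helper name has_all_three is hypothetical and not part of the recipe.

```python
import re

# The recipe's pattern without the trailing conditionals.
LIST_PATTERN = re.compile(r"(?:(?:(one)|(two)|(three))(?:,|)){3,}")

def has_all_three(text: str) -> bool:
    """Hypothetical helper: True if text is a list containing all three words."""
    match = LIST_PATTERN.fullmatch(text)
    if match is None:
        return False
    # Python reports a group that never participated in the match as None.
    return all(match.group(i) is not None for i in (1, 2, 3))

for sample in ["one,two,three", "three,one,two,one", "one,one,two", "one,two"]:
    print(sample, has_all_three(sample))
```

The check on the groups does the job that the three conditionals do in the first solution.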

Discussion

.NET, PCRE, Perl, and Python support conditionals using numbered capturing groups. (?(1)then|else) is a conditional that checks whether the first capturing group has already matched something. If it has, the regex engine attempts to match then. If the capturing group has not participated in the match attempt thus far, the else part is attempted.

The parentheses, question mark, and vertical bar are all part of the syntax for the conditional. They don’t have their usual meaning. You can use any kind of regular expression for the then and else parts. The only restriction is that if you want to use alternation for one of the parts, you have to use a group to keep it together. Only one vertical bar is permitted directly in the conditional.
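
As an illustration of the grouping rule, here is a small Python sketch. The pattern is a hypothetical example, not taken from the recipe: its then part accepts either c or e, so that alternation is wrapped in a noncapturing group.

```python
import re

# Hypothetical pattern: optional a, then b, then (c or e) if a was captured,
# otherwise d. The group keeps the inner alternation together.
grouped = re.compile(r"(a)?b(?(1)(?:c|e)|d)")

for sample in ["abc", "abe", "bd", "be"]:
    print(sample, bool(grouped.fullmatch(sample)))

# Without the group, two vertical bars sit directly inside the conditional,
# and Python's re module refuses to compile the pattern.
try:
    re.compile(r"(a)?b(?(1)c|e|d)")
    ungrouped_compiles = True
except re.error:
    ungrouped_compiles = False
print("ungrouped compiles:", ungrouped_compiles)
```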

If you want, you can omit either the then or else part. The empty regex always finds a zero-length match. The solution for this recipe uses three conditionals that have an empty then part. If the capturing group participated, the conditional simply matches.

An empty negative lookahead, (?!), fills the else part. Since the empty regex always matches, a negative lookahead containing the empty regex always fails. Thus, the conditional (?(1)|(?!)) always fails when the first capturing group did not match anything.
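
A short Python sketch may make the always-failing lookahead concrete. The pattern (x)?y is a hypothetical stand-in chosen only to keep the example small.

```python
import re

# An empty negative lookahead can never succeed, whatever the subject is.
never = re.compile(r"(?!)")
print(never.search("anything"))
print(never.search(""))

# Hypothetical pattern: y must be preceded by a captured x, because the
# else part of the conditional is the always-failing (?!).
guard = re.compile(r"(x)?y(?(1)|(?!))")
print(bool(guard.fullmatch("xy")))
print(bool(guard.fullmatch("y")))
```

Both searches with the empty lookahead return None, and the guarded pattern matches xy but not y alone.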

By placing each of the three required alternatives in its own capturing group, we can use three conditionals at the end of the regex to test whether all the capturing groups captured something. If one of the words was not matched, the conditional referencing its capturing group will evaluate the “else” part, which will cause the conditional to fail to match because of our empty negative lookahead. Thus the regex will fail to match if one of the words was not matched.

To allow the words to appear in any order and any number of times, we place the words inside a group using alternation, and repeat this group with a quantifier. Since we have three words, and we require each word to be matched at least once, we know the group has to be repeated at least three times.
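
To see the pieces working together, the sketch below runs the solution in Python against a few sample lists. Using fullmatch is an assumption made here so that the whole subject must be the list, and the sample strings are made up for illustration.

```python
import re

# The recipe's solution: repeated alternation followed by three conditionals.
SOLUTION = re.compile(
    r"(?:(?:(one)|(two)|(three))(?:,|)){3,}"
    r"(?(1)|(?!))(?(2)|(?!))(?(3)|(?!))"
)

samples = ["one,two,three", "three,one,two,one", "one,one,two", "one,two"]
for sample in samples:
    print(sample, bool(SOLUTION.fullmatch(sample)))
```

The first two samples contain all three words and match. The third never matches three, so the last conditional fails. The fourth has only two words, so the quantifier cannot be satisfied.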

.NET, Python, and PCRE 6.7 and later allow you to specify the name of a capturing group in a conditional. (?(name)then|else) checks whether the capturing group called name has participated in the match attempt thus far. Perl 5.10 and later also support named conditionals, but Perl requires angle brackets or quotes around the name, as in (?(<name>)then|else) or (?('name')then|else). PCRE 7.0 and later supports Perl’s syntax for named conditionals alongside the syntax used by .NET and Python.
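
In Python, a named conditional might look like the sketch below. The group name prefix is an arbitrary choice for this example, and (?P<prefix>...) is Python's syntax for defining a named group.

```python
import re

# Optional a captured under a hypothetical name, then b, then c if the
# named group participated, otherwise d.
named = re.compile(r"(?P<prefix>a)?b(?(prefix)c|d)")

for sample in ["abc", "bd", "abd", "bc"]:
    print(sample, bool(named.fullmatch(sample)))
```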

To better understand how conditionals work, let’s examine the regular expression (a)?b(?(1)c|d). This is essentially a complicated way of writing abc|bd.

If the subject text starts with an a, this is captured in the first capturing group. If not, the first capturing group does not participate in the match attempt at all. It is important that the question mark is outside the capturing group because this makes the whole group optional. If there is no a, the group is repeated zero times, and never gets the chance to capture anything at all. It can’t capture a zero-length string.

If you use (a?), the group always participates in the match attempt. There’s no quantifier after the group, so it is repeated exactly once. The group will either capture a or capture nothing.
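
The difference between a group that did not participate and a group that captured nothing is visible from code. In this Python sketch, both patterns are applied to the subject b, which contains no a.

```python
import re

# Quantifier outside the group: the group is skipped entirely.
optional_group = re.match(r"(a)?b", "b")
print(optional_group.group(1))        # None: the group did not participate

# Quantifier inside the group: the group participates and captures nothing.
optional_token = re.match(r"(a?)b", "b")
print(repr(optional_token.group(1)))  # '': a zero-length capture
```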

Regardless of whether a was matched, the next token is b. The conditional is next. If the capturing group participated in the match attempt, even if it captured the zero-length string (not possible here), c will be attempted. If not, d will be attempted.

In English, (a)?b(?(1)c|d) either matches ab followed by c, or matches b followed by d.
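
A quick way to convince yourself of the equivalence is to run both regexes over the same subjects. This Python sketch uses fullmatch and a handful of sample strings chosen for illustration.

```python
import re

conditional = re.compile(r"(a)?b(?(1)c|d)")
alternation = re.compile(r"abc|bd")

for sample in ["abc", "bd", "abd", "bc", "b"]:
    with_conditional = bool(conditional.fullmatch(sample))
    with_alternation = bool(alternation.fullmatch(sample))
    print(sample, with_conditional, with_alternation)
```

Both regexes accept abc and bd and reject the other three samples.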

With .NET, PCRE, and Perl, but not with Python, conditionals can also use lookaround. (?(?=if)then|else) first tests (?=if) as a normal lookahead. Recipe 2.16 explains how this works. If the lookaround succeeds, the then part is attempted. If not, the else part is attempted. Since lookaround is zero-width, the then and else regexes are attempted at the same position in the subject text where if either matched or failed.

You can use lookbehind instead of lookahead in the conditional. You can also use negative lookaround, though we recommend against it, as it only confuses things by reversing the meaning of “then” and “else.”

Tip

A conditional using lookaround can be written without the conditional as (?=if)then|(?!if)else. If the positive lookahead succeeds, the then part is attempted. If the positive lookahead fails, the alternation kicks in. The negative lookahead then does the same test. The negative lookahead succeeds when if fails, which is already guaranteed because (?=if) failed. Thus, else is attempted. Placing the lookahead in a conditional saves time, as the conditional attempts if only once.
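
Since Python lacks lookaround conditionals, this rewrite is the way to get the same behavior there. The sketch below uses hypothetical parts: the if is four digits, the then is a year-month such as 2024-05, and the else is any run of word characters.

```python
import re

# Emulates (?(?=\d{4})\d{4}-\d{2}|\w+) without a conditional.
emulated = re.compile(r"(?:(?=\d{4})\d{4}-\d{2}|(?!\d{4})\w+)")

# Plain alternation without the lookaheads, for comparison.
plain = re.compile(r"\d{4}-\d{2}|\w+")

for sample in ["2024-05", "abc", "2024"]:
    print(sample, bool(emulated.fullmatch(sample)), bool(plain.fullmatch(sample)))
```

The subject 2024 shows why the negative lookahead matters: the if succeeds, the then part fails, and the else part is not allowed to step in. The plain alternation accepts 2024 through its second alternative.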

See Also

A conditional is essentially the combination of a lookaround (Recipe 2.16) and alternation (Recipe 2.8) inside a group (Recipe 2.9).

“Eliminate incorrect ISBN identifiers” in Recipe 4.13 and “Using a conditional” in Recipe 5.7 show how you can solve some real-world problems using conditionals.
