Create a regular expression that matches a comma-delimited
list of the words one
, two
, and three
. Each word can occur any number of
times in the list, and the words can occur in any order, but each word
must appear at least once.
(?:(?:(one)|(two)|(three))(?:,|)){3,}(?(1)|(?!))(?(2)|(?!))(?(3)|(?!))
Regex options: None |
Regex flavors: .NET, PCRE, Perl, Python |
Java, JavaScript, and Ruby do not support conditionals. When programming in these languages (or any other language), you can use the regular expression without the conditionals, and write some extra code to check if each of the three capturing groups matched something.
(?:(?:(one)|(two)|(three))(?:,|)){3,}
Regex options: None |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
.NET, PCRE, Perl, and Python support conditionals using numbered
capturing groups. ‹(?(1)
›
is a conditional that checks whether the first capturing group has
already matched something. If it has, the regex engine attempts to match
‹then
|else
)
›. If the capturing
group has not participated in the match attempt thus far, the ‹then
›
part is attempted.else
The parentheses, question mark, and vertical bar are all
part of the syntax for the conditional. They don’t have their usual
meaning. You can use any kind of regular expression for the ‹
›
and ‹then
› parts. The only
restriction is that if you want to use alternation for one of the parts,
you have to use a group to keep it together. Only one vertical bar is
permitted directly in the conditional.else
If you want, you can omit either the ‹
› or ‹then
›
part. The empty regex always finds a zero-length match. The solution for
this recipe uses three conditionals that have an empty ‹else
›
part. If the capturing group participated, the conditional simply
matches.then
An empty negative lookahead, ‹(?!)
›,
fills the ‹
› part. Since the
empty regex always matches, a negative lookahead containing the empty
regex always fails. Thus, the conditional ‹else
(?(1)|(?!))
› always fails when the first capturing
group did not match anything.
By placing each of the three required alternatives in their own capturing group, we can use three conditionals at the end of the regex to test if all the capturing groups captured something. If one of the words was not matched, the conditional referencing its capturing group will evaluate the “else” part, which will cause the conditional to fail to match because of our empty negative lookahead. Thus the regex will fail to match if one of the words was not matched.
To allow the words to appear in any order and any number of times, we place the words inside a group using alternation, and repeat this group with a quantifier. Since we have three words, and we require each word to be matched at least once, we know the group has to be repeated at least three times.
.NET, Python, and PCRE 6.7 allow you to specify the name of a
capturing group in a conditional. ‹(?(
›
checks whether the named capturing group name
)then
|else
)name
participated in the match attempt thus
far. Perl 5.10 and later also support named conditionals. But Perl
requires angle brackets or quotes around the name, as in ‹(?(<
›
or ‹name
>)then
|else
)(?('
›.
PCRE 7.0 and later also supports Perl’s syntax for named conditional,
while also supporting the syntax used by .NET and Python.name
')then
|else
)
To better understand how conditionals work, let’s examine the
regular expression ‹(a)?b(?(1)c|d)
›. This is essentially a complicated
way of writing ‹abc|bd
›.
If the subject text starts with an a
, this is captured in the first
capturing group. If not, the first capturing group does not participate
in the match attempt at all. It is important that the question mark is
outside the capturing group because this makes the whole group optional.
If there is no a
, the group is repeated zero times, and
never gets the chance to capture anything at all. It can’t capture a
zero-length string.
If you use ‹(a?)
›,
the group always participates in the match attempt. There’s no
quantifier after the group, so it is repeated exactly once. The group
will either capture a
or capture nothing.
Regardless of whether ‹a
› was matched, the next token is ‹b
›. The conditional is next. If
the capturing group participated in the match attempt, even if it
captured the zero-length string (not possible here), ‹c
› will be attempted. If not,
‹d
› will be
attempted.
In English, ‹(a)?b(?(1)c|d)
› either matches ab
followed by c
, or matches b
followed by d
.
With .NET, PCRE, and Perl, but not with Python, conditionals can
also use lookaround. ‹(?(?=
›
first tests ‹if
)then
|else
)(?=
› as a normal
lookahead. Recipe 2.16 explains how this
works. If the lookaround succeeds, the ‹if
)
› part is
attempted. If not, the ‹then
› part is
attempted. Since lookaround is zero-width, the ‹else
› and ‹then
›
regexes are attempted at the same position in the subject text where
‹else
› either matched or
failed.if
You can use lookbehind instead of lookahead in the conditional. You can also use negative lookaround, though we recommend against it, as it only confuses things by reversing the meaning of “then” and “else.”
A conditional using lookaround can be written without the
conditional as ‹(?=
›.
If the positive lookahead succeeds, the ‹if
)then
|(?!if
)else
› part is
attempted. If the positive lookahead fails, the alternation kicks in.
The negative lookahead then does the same test. The negative lookahead
succeeds when ‹then
› fails, which is
already guaranteed because ‹if
(?=
› failed. Thus,
‹if
)
› is attempted.
Placing the lookahead in a conditional saves time, as the conditional
attempts ‹else
› only
once.if
A conditional is essentially the combination of a lookaround (Recipe 2.16) and alternation (Recipe 2.8) inside a group (Recipe 2.9).
Eliminate incorrect ISBN identifiers in Recipe 4.13 and Using a conditional in Recipe 5.7 show how you can solve some real-world problems using conditionals.