Improve the regular expression for matching Mary, Jane, or Sue by forcing the match to
be a whole word. Use grouping to achieve this with one pair of word
boundaries for the whole regex, instead of one pair for each
alternative.
Create a regular expression that matches any date in yyyy-mm-dd format, and separately captures the year, month, and day. The goal is to make it easy to work with these separate values in the code that processes the match. You can assume all dates in the subject text to be valid. The regular expression does not have to exclude things like 9999-99-99, as these won’t occur in the subject text at all.
\b(Mary|Jane|Sue)\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
\b(\d\d\d\d)-(\d\d)-(\d\d)\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
The alternation operator, explained in the previous section, has
the lowest precedence of all regex operators. If you try ‹\bMary|Jane|Sue\b›, the three
alternatives are ‹\bMary›, ‹Jane›, and ‹Sue\b›. This regex matches
Jane in Her name is Janet.
If you want something in your regex to be excluded from the
alternation, you have to group the alternatives. Grouping is
done with parentheses. They have the highest precedence of all regex
operators, just as in most programming languages. ‹\b(Mary|Jane|Sue)\b› has three
alternatives (‹Mary›, ‹Jane›, and ‹Sue›) between two word boundaries.
This regex does not match anything in Her name is Janet.
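The difference between the ungrouped and grouped regexes can be checked with a short sketch. It uses Python's `re` module purely as an illustration; the variable names are hypothetical, and the subject string is the one used in the text.

```python
import re

subject = "Her name is Janet"

# Without grouping, each \b belongs to only one alternative:
# the alternatives are \bMary, Jane, and Sue\b.
ungrouped = re.compile(r"\bMary|Jane|Sue\b")

# With grouping, both word boundaries apply to every alternative.
grouped = re.compile(r"\b(Mary|Jane|Sue)\b")

ungrouped_match = ungrouped.search(subject)
grouped_match = grouped.search(subject)

print(ungrouped_match.group(0))  # the "Jane" at the start of "Janet"
print(grouped_match)             # None: no whole-word match exists

# The grouped regex still matches when the name is a whole word.
whole_word_match = grouped.search("Her name is Jane")
print(whole_word_match.group(0))
```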
When the regex engine reaches the J in Janet in the subject text, the first word
boundary matches. The engine then enters the group. The first
alternative in the group, ‹Mary›, fails. The second alternative, ‹Jane›, succeeds. The engine exits
the group. All that is left is ‹\b›. The word boundary fails to match between the
e and t at the end of the subject. The overall match attempt starting at J fails.
A pair of parentheses isn’t just a group; it’s a
capturing group. For the Mary-Jane-Sue regex, the
capture isn’t very useful, because it’s simply the overall regex match.
Captures become useful when they cover only part of the regular
expression, as in ‹\b(\d\d\d\d)-(\d\d)-(\d\d)\b›.
This regular expression matches a date in yyyy-mm-dd format. The
regex ‹\b\d\d\d\d-\d\d-\d\d\b› does exactly the same.
Because this regular expression does not use any alternation or
repetition, the grouping function of the parentheses is not needed. But
the capture function is very handy.
The regex ‹\b(\d\d\d\d)-(\d\d)-(\d\d)\b› has three capturing
groups. Groups are numbered by counting opening parentheses from left to
right. ‹(\d\d\d\d)› is group number 1. ‹(\d\d)› is number 2.
The second ‹(\d\d)› is group number 3.
During the matching process, when the regular expression engine
exits the group upon reaching the closing parenthesis, it stores the
part of the text matched by the capturing group. When our regex matches
2008-05-24, 2008 is stored in the first capture, 05 in the second capture,
and 24 in the third capture.
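As a minimal sketch of how the three captures look from code, the following uses Python's `re` module. The surrounding sentence in the subject string and the variable names are made up for illustration; Recipe 3.9 covers retrieving captures properly.

```python
import re

date_regex = re.compile(r"\b(\d\d\d\d)-(\d\d)-(\d\d)\b")

# Hypothetical subject text containing the date used in the discussion.
match = date_regex.search("The release date is 2008-05-24.")

# Group 0 is the overall match; groups 1 to 3 are the captures,
# numbered by their opening parentheses from left to right.
overall = match.group(0)
year, month, day = match.group(1), match.group(2), match.group(3)

print(overall)
print(year, month, day)
```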
There are three ways you can use the captured text. Recipe 2.10 in this chapter explains how you can match the captured text again within the same regex match. Recipe 2.21 shows how to insert the captured text into the replacement text when doing a search-and-replace. Recipe 3.9 in the next chapter describes how your application can use the parts of the regex match.
In the regex ‹\b(Mary|Jane|Sue)\b›, we need the parentheses
for grouping only. Instead of using a capturing group, we could use a
noncapturing group:
\b(?:Mary|Jane|Sue)\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
The three characters ‹(?:› open
the noncapturing group. The parenthesis ‹)› closes it. The noncapturing group provides
the same grouping functionality, but does not capture anything.
When counting opening parentheses of capturing groups to determine their numbers, do not count the parenthesis of the noncapturing group. This is the main benefit of noncapturing groups: you can add them to an existing regex without upsetting the references to numbered capturing groups.
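The effect on group numbering can be seen in a small sketch. The combined name-and-date regex below is a hypothetical example built from the two regexes in this recipe, and Python's `re` module is used only for illustration.

```python
import re

# Hypothetical regex: a name, a space, then a date.
# With a capturing group around the names, the year becomes group 2.
capturing = re.compile(r"\b(Mary|Jane|Sue) (\d\d\d\d)-(\d\d)-(\d\d)\b")

# With a noncapturing group, the date groups keep numbers 1 to 3.
noncapturing = re.compile(r"\b(?:Mary|Jane|Sue) (\d\d\d\d)-(\d\d)-(\d\d)\b")

subject = "Jane 2008-05-24"
capturing_match = capturing.search(subject)
noncapturing_match = noncapturing.search(subject)

print(capturing.groups, capturing_match.groups())
print(noncapturing.groups, noncapturing_match.groups())
print(noncapturing_match.group(1))  # the year is still group 1
```

Both regexes match the same text; only the numbering of the captures differs.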
Another benefit of noncapturing groups is performance. If you’re not going to use a backreference to a particular group (Recipe 2.10), reinsert it into the replacement text (Recipe 2.21), or retrieve its match in source code (Recipe 3.9), a capturing group adds unnecessary overhead that you can eliminate by using a noncapturing group. In practice, you’ll hardly notice the performance difference, unless you’re using the regex in a tight loop and/or on lots of data.
In the variation of Recipe 2.1, we explain that .NET, Java, PCRE, Perl, and Ruby support local
mode modifiers, using the mode toggles: ‹sensitive(?i)caseless(?-i)sensitive›.
Although this syntax also involves parentheses, a toggle such as
‹(?i)› does not involve any grouping.
Instead of using toggles, you can specify mode modifiers in a noncapturing group:
\b(?i:Mary|Jane|Sue)\b
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby
sensitive(?i:caseless)sensitive
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby
Adding mode modifiers to a noncapturing group sets that mode for the part of the regular expression inside the group. The previous settings are restored at the closing parenthesis. Since case sensitivity is the default, only the part of the regex inside ‹(?i:⋯)› is case insensitive.
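The scoping of ‹(?i:⋯)› can be tried out with the sketch below. Python is not among the flavors listed for this syntax above; the sketch assumes Python 3.6 or later, whose `re` module accepts mode modifiers inside a group. The subject strings and variable names are made up for illustration.

```python
import re

# Assumes Python 3.6+, where (?i:...) is accepted by the re module.
scoped = re.compile(r"sensitive(?i:caseless)sensitive")

# Only the middle part ignores case.
middle_differs = scoped.fullmatch("sensitiveCaSeLeSssensitive")
outer_differs = scoped.fullmatch("SENSITIVEcaselessSENSITIVE")

print(middle_differs is not None)  # True
print(outer_differs is not None)   # False

# The same idea applied to the names regex from this recipe.
names = re.compile(r"\b(?i:Mary|Jane|Sue)\b")
name_match = names.search("her name is JANE")
print(name_match.group(0))
```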
You can combine multiple modifiers, as in ‹(?ism:⋯)›. Use a hyphen to turn off modifiers:
‹(?-ism:⋯)› turns off the three options.
‹(?i-sm)› turns on case insensitivity (i), and turns off
both “dot matches line breaks” (s)
and “^ and $ match at line breaks” (m). These options are explained in Recipes
2.4 and 2.5.
Recipe 2.10 explains how to make a regex match the same text that was matched by a capturing group.
Recipe 2.11 explains named capturing groups. Naming the groups in your regex makes the regex easier to read and maintain.
Recipe 2.21 explains how to make the replacement text reinsert text matched by a capturing group when doing a search-and-replace.
Recipe 3.9 explains how to retrieve the text matched by a capturing group in procedural code.
Recipe 2.15 explains how to make sure the regex engine doesn’t needlessly try different ways of matching a group.