2.9. Group and Capture Parts of the Match

Problem

Improve the regular expression for matching Mary, Jane, or Sue by forcing the match to be a whole word. Use grouping to achieve this with one pair of word boundaries for the whole regex, instead of one pair for each alternative.

Create a regular expression that matches any date in yyyy-mm-dd format, and separately captures the year, month, and day. The goal is to make it easy to work with these separate values in the code that processes the match. You can assume all dates in the subject text to be valid. The regular expression does not have to exclude things like 9999-99-99, as these won’t occur in the subject text at all.

Solution

(Mary|Jane|Sue)
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
(dddd)-(dd)-(dd)
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

The alternation operator, explained in the previous section, has the lowest precedence of all regex operators. If you try Mary|Jane|Sue, the three alternatives are Mary, Jane, and Sue. This regex matches Jane in Her name is Janet.

If you want something in your regex to be excluded from the alternation, you have to group the alternatives. Grouping is done with parentheses. They have the highest precedence of all regex operators, just as in most programming languages. (Mary|Jane|Sue) has three alternatives—Mary, Jane, and Sue—between two word boundaries. This regex does not match anything in Her name is Janet.

When the regex engine reaches the J in Janet in the subject text, the first word boundary matches. The engine then enters the group. The first alternative in the group, Mary, fails. The second alternative, Jane, succeeds. The engine exits the group. All that is left is . The word boundary fails to match between the e and t at the end of the subject. The overall match attempt starting at J fails.

A pair of parentheses isn’t just a group; it’s a capturing group. For the Mary-Jane-Sue regex, the capture isn’t very useful, because it’s simply the overall regex match. Captures become useful when they cover only part of the regular expression, as in (dddd)-(dd)-(dd).

This regular expression matches a date in yyyy-mm-dd format. The regex dddd-dd-dd does exactly the same. Because this regular expression does not use any alternation or repetition, the grouping function of the parentheses is not needed. But the capture function is very handy.

The regex (dddd)-(dd)-(dd) has three capturing groups. Groups are numbered by counting opening parentheses from left to right. (dddd) is group number 1. (dd) is number 2. The second (dd) is group number 3.

During the matching process, when the regular expression engine exits the group upon reaching the closing parenthesis, it stores the part of the text matched by the capturing group. When our regex matches 2008-05-24, 2008 is stored in the first capture, 05 in the second capture, and 24 in the third capture.

There are three ways you can use the captured text. Recipe 2.10 in this chapter explains how you can match the captured text again within the same regex match. Recipe 2.21 shows how to insert the captured text into the replacement text when doing a search-and-replace. Recipe 3.9 in the next chapter describes how your application can use the parts of the regex match.

Variations

Noncapturing groups

In the regex (Mary|Jane|Sue), we need the parentheses for grouping only. Instead of using a capturing group, we could use a noncapturing group:

(?:Mary|Jane|Sue)
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

The three characters (?: open the noncapturing group. The parenthesis ) closes it. The noncapturing group provides the same grouping functionality, but does not capture anything.

When counting opening parentheses of capturing groups to determine their numbers, do not count the parenthesis of the noncapturing group. This is the main benefit of noncapturing groups: you can add them to an existing regex without upsetting the references to numbered capturing groups.

Another benefit of noncapturing groups is performance. If you’re not going to use a backreference to a particular group (Recipe 2.10), reinsert it into the replacement text (Recipe 2.21), or retrieve its match in source code (Recipe 3.9), a capturing group adds unnecessary overhead that you can eliminate by using a noncapturing group. In practice, you’ll hardly notice the performance difference, unless you’re using the regex in a tight loop and/or on lots of data.

Group with mode modifiers

In the variation of Recipe 2.1, we explain that .NET, Java, PCRE, Perl, and Ruby support local mode modifiers, using the mode toggles: sensitive(?i)caseless(?-i)sensitive. Although this syntax also involves parentheses, a toggle such as (?i) does not involve any grouping.

Instead of using toggles, you can specify mode modifiers in a noncapturing group:

(?i:Mary|Jane|Sue)
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby
sensitive(?i:caseless)sensitive
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Ruby

Adding mode modifiers to a noncapturing group sets that mode for the part of the regular expression inside the group. The previous settings are restored at the closing parenthesis. Since case sensitivity is the default, only the part of the regex inside:

(?i:)

is case insensitive.

You can combine multiple modifiers. (?ism:). Use a hyphen to turn off modifiers: (?-ism:) turns off the three options. (?i-sm) turns on case insensitivity (i), and turns off both “dot matches line breaks” (s) and “^ and $ match at line breaks” (m). These options are explained in Recipes 2.4 and 2.5.

See Also

Recipe 2.10 explains how to make a regex match the same text that was matched by a capturing group.

Recipe 2.11 explains named capturing groups. Naming the groups in your regex makes the regex easier to read and maintain.

Recipe 2.21 explains how to make the replacement text reinsert text matched by a capturing group when doing a search-and-replace.

Recipe 3.9 explains how to retrieve the text matched by a capturing group in procedural code.

Recipe 2.15 explains how to make sure the regex engine doesn’t needlessly try different ways of matching a group.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset