2.11. Capture and Name Parts of the Match

Problem

Create a regular expression that matches any date in yyyy-mm-dd format and separately captures the year, month, and day. The goal is to make it easy to work with these separate values in the code that processes the match. Contribute to this goal by assigning the descriptive names “year,” “month,” and “day” to the captured text.

Create another regular expression that matches “magical” dates in yyyy-mm-dd format. A date is magical if the year minus the century, the month, and the day of the month are all the same numbers. For example, 2008-08-08 is a magical date. Capture the magical number (08 in the example), and label it “magic.”

You can assume all dates in the subject text to be valid. The regular expressions don’t have to exclude things like 9999-99-99, because these won’t occur in the subject text.

Solution

Named capture

(?<year>\d\d\d\d)-(?<month>\d\d)-(?<day>\d\d)
Regex options: None
Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9
(?'year'\d\d\d\d)-(?'month'\d\d)-(?'day'\d\d)
Regex options: None
Regex flavors: .NET, PCRE 7, Perl 5.10, Ruby 1.9
(?P<year>\d\d\d\d)-(?P<month>\d\d)-(?P<day>\d\d)
Regex options: None
Regex flavors: PCRE 4 and later, Perl 5.10, Python

Named backreferences

\d\d(?<magic>\d\d)-\k<magic>-\k<magic>
Regex options: None
Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9
\d\d(?'magic'\d\d)-\k'magic'-\k'magic'
Regex options: None
Regex flavors: .NET, PCRE 7, Perl 5.10, Ruby 1.9
\d\d(?P<magic>\d\d)-(?P=magic)-(?P=magic)
Regex options: None
Regex flavors: PCRE 4 and later, Perl 5.10, Python

Discussion

Named capture

Recipes 2.9 and 2.10 illustrate capturing groups and backreferences. To be more precise: these recipes use numbered capturing groups and numbered backreferences. Each group automatically gets a number, which you use for the backreference.

Modern regex flavors support named capturing groups in addition to numbered groups. The only difference between named and numbered groups is your ability to assign a descriptive name, instead of being stuck with automatic numbers. Named groups make your regular expression more readable and easier to maintain. Inserting a capturing group into an existing regex can change the numbers assigned to all the capturing groups. Names that you assign remain the same.
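The point about stable names can be illustrated with a minimal Python sketch. The optional "AD " prefix and the group name "era" are hypothetical additions made up for this illustration; they are not part of the recipe. Inserting the new group at the front shifts every group number, while the names keep pointing at the same parts of the match.

```python
import re

# Numbered groups: group 1 is the year.
original = re.compile(r"(\d\d\d\d)-(\d\d)-(\d\d)")

# A hypothetical optional prefix group is inserted in front.
# Every existing group number shifts up by one.
revised = re.compile(r"(AD )?(\d\d\d\d)-(\d\d)-(\d\d)")

# The same change with named groups: "year" still means the year.
named = re.compile(
    r"(?P<era>AD )?(?P<year>\d\d\d\d)-(?P<month>\d\d)-(?P<day>\d\d)"
)

subject = "AD 2008-08-08"
print(original.search(subject).group(1))    # 2008
print(revised.search(subject).group(1))     # "AD " (no longer the year)
print(named.search(subject).group("year"))  # 2008
```

Code that asked for group 1 silently starts receiving the prefix after the edit, whereas code that asks for "year" is unaffected.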

Python was the first regular expression flavor to support named capture. It uses the syntax (?P<name>regex). The name must consist of word characters matched by \w. (?P<name> is the group’s opening bracket, and ) is the closing bracket.
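As a rough sketch of how the Python syntax is used in practice, the following applies the first solution with Python's standard re module. The helper name parse_date and the sample sentence are assumptions made for this example.

```python
import re

# Python's named-capture syntax: (?P<name>regex)
DATE_RE = re.compile(r"(?P<year>\d\d\d\d)-(?P<month>\d\d)-(?P<day>\d\d)")

def parse_date(text):
    """Return a dict with 'year', 'month', and 'day', or None if no date is found."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    # groupdict() maps each group name to the text that group captured.
    return match.groupdict()

parts = parse_date("Released on 2008-08-08 in Beijing")
print(parts["year"], parts["month"], parts["day"])  # 2008 08 08
```

A single group can also be read directly with match.group("year"), which avoids building the whole dictionary.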

The designers of the .NET Regex class came up with their own syntax for named capture, using two interchangeable variants. (?<name>regex) mimics Python’s syntax, minus the P. The name must consist of word characters matched by \w. (?<name> is the group’s opening bracket, and ) is the closing bracket.

The angle brackets in the named capture syntax are annoying when you’re coding in XML, or writing this book in DocBook XML. That’s the reason for .NET’s alternate named capture syntax: (?'name'regex). The angle brackets are replaced with single quotes. Choose whichever syntax is easier for you to type. Their functionality is identical.

Perhaps due to .NET’s popularity over Python, the .NET syntax seems to be the one that other regex library developers prefer to copy. Perl 5.10 and later have it, and so does the Oniguruma engine in Ruby 1.9. Perl 5.10 and Ruby 1.9 support both variants: the one using angle brackets and the one using single quotes. Java 7 also copied the .NET syntax, but only the variant using angle brackets. Standard JavaScript does not support named capture. XRegExp adds support for named capture using the .NET syntax, but only the variant with angle brackets.

PCRE copied Python’s syntax long ago, at a time when Perl did not support named capture at all. PCRE 7, the version that adds the new features in Perl 5.10, supports both the .NET syntax and the Python syntax. Perhaps as a testament to the success of PCRE, in a reverse compatibility move, Perl 5.10 also supports the Python syntax. In PCRE and Perl 5.10, the functionality of the .NET syntax and the Python syntax for named capture is identical.

Choose the syntax that is most useful to you. If you’re coding in PHP and you want your code to work with older versions of PHP that incorporate older versions of PCRE, use the Python syntax. If you don’t need compatibility with older versions and you also work with .NET or Ruby, the .NET syntax makes it easier to copy and paste between all these languages. If you’re unsure, use the Python syntax for PHP/PCRE. People recompiling your code with an older version of PCRE are going to be unhappy if the regexes in your code suddenly stop working. When copying a regex to .NET or Ruby, deleting a few Ps is easy enough.

The documentation for PCRE 7 and Perl 5.10 barely mentions the Python syntax, but it is by no means deprecated. For PCRE and PHP, we actually recommend it.

Named backreferences

With named capture comes named backreferences. Just as named capturing groups are functionally identical to numbered capturing groups, named backreferences are functionally identical to numbered backreferences. They’re just easier to read and maintain.

Python uses the syntax (?P=name) to create a backreference to the group name. Although this syntax uses parentheses, the backreference is not a group. You cannot put anything between the name and the closing parenthesis. A backreference (?P=name) is a single regex token, just like \1. PCRE and Perl 5.10 also support the Python syntax for named backreferences.
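The Python form of the "magical date" solution can be tried out with a short sketch like the one below. The function name find_magic is an assumption made for this example.

```python
import re

# (?P<magic>\d\d) captures the two-digit year within the century;
# each (?P=magic) must then match exactly that same text again.
MAGIC_RE = re.compile(r"\d\d(?P<magic>\d\d)-(?P=magic)-(?P=magic)")

def find_magic(text):
    """Return the magic number of the first magical date in text, or None."""
    match = MAGIC_RE.search(text)
    return match.group("magic") if match else None

print(find_magic("2008-08-08"))  # 08
print(find_magic("2008-08-09"))  # None
```

The second call returns None because the day does not repeat the text captured by the "magic" group.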

.NET uses the syntax \k<name> and \k'name'. The two variants are identical in functionality, and you can freely mix them. A named group created with the bracket syntax can be referenced with the quote syntax, and vice versa. Perl 5.10, PCRE 7, and Ruby 1.9 also support the .NET syntax for named backreferences. Java 7 and XRegExp support only the variant using angle brackets.

We strongly recommend you don’t mix named and numbered groups in the same regex. Different flavors follow different rules for numbering unnamed groups that appear between named groups. Perl 5.10, Ruby 1.9, Java 7, and XRegExp copied .NET’s syntax, but they do not follow .NET’s way of numbering named capturing groups or of mixing numbered capturing groups with named groups. Instead of trying to explain the differences, we simply recommend not mixing named and numbered groups. Avoid the confusion and either give all unnamed groups a name or make them noncapturing.
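To make the recommendation concrete, here is a small sketch of how Python's re module handles a mixed regex. This shows Python's behavior only; as noted above, other flavors follow different numbering rules, so the numbers printed here should not be assumed elsewhere.

```python
import re

# Mixed: the month group is unnamed, the other two are named.
mixed = re.compile(r"(?P<year>\d\d\d\d)-(\d\d)-(?P<day>\d\d)")

# In Python, named groups also receive numbers, counted left to right
# together with the unnamed groups.
print(dict(mixed.groupindex))  # {'year': 1, 'day': 3}
print(mixed.groups)            # 3

# Following the recommendation: make the unnamed group noncapturing
# (or give it a name) so that no bare numbers are left to interpret.
safer = re.compile(r"(?P<year>\d\d\d\d)-(?:\d\d)-(?P<day>\d\d)")
print(safer.groups)            # 2
```

With the noncapturing version, the code that processes the match refers to groups by name only, so it does not depend on any flavor's numbering scheme.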

Groups with the same name

Perl 5.10, Ruby 1.9, and .NET allow multiple named capturing groups to share the same name. We take advantage of this in the solutions for Recipes 4.5, 8.7, and 8.19. When a regular expression uses alternation to find different variations of certain text, using capturing groups with the same name makes it easy to extract parts from the match, regardless of which alternative actually matched the text. The section “Pure regular expression” in Recipe 4.5 uses alternation to separately match dates in months of different lengths. Each alternative matches the day and the month. By using the same group names “day” and “month” in all the alternatives, we only need to query two capturing groups to retrieve the day and the month after the regular expression finds a match.

All the other flavors in this book that support named capture treat multiple groups with the same name as an error.
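Python is one of the flavors that rejects duplicate names. The sketch below shows the error and one possible workaround using distinct names; the two date layouts (dd/mm and mm-dd), the helper compiles, and the names day1, day2, month1, and month2 are all assumptions invented for this illustration.

```python
import re

def compiles(pattern):
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Two alternatives reusing the names "day" and "month": rejected by Python.
duplicate = r"(?P<day>\d\d)/(?P<month>\d\d)|(?P<month>\d\d)-(?P<day>\d\d)"
print(compiles(duplicate))  # False

# Workaround sketch: give each alternative its own names, then pick
# whichever group took part in the match (the other one is None).
distinct = re.compile(
    r"(?P<day1>\d\d)/(?P<month1>\d\d)|(?P<month2>\d\d)-(?P<day2>\d\d)"
)
m = distinct.search("12-25")
day = m.group("day1") or m.group("day2")
month = m.group("month1") or m.group("month2")
print(day, month)  # 25 12
```

The workaround costs an extra lookup per value, which is the bookkeeping that shared group names avoid in the flavors that allow them.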

Caution

Using multiple capturing groups with the same name works reliably only when just one of the groups participates in the match. That is the case in all the recipes in this book that use capturing groups with the same name. The groups are in separate alternatives, and the alternatives are not inside a group that is repeated. Perl 5.10, Ruby 1.9, and .NET do allow two groups with the same name to participate in the match. But then the behavior of backreferences and the text retained for the group after the match will differ significantly between these flavors. It is confusing enough for us to recommend using groups with the same name only when they’re in separate alternatives in the regular expression.

See Also

Recipe 2.9 on numbered capturing groups has more fundamental information on how grouping works in regular expressions.

Recipe 2.10 explains how to make a regex match the same text that was matched by a named capturing group.

Recipe 2.11 explains named capturing groups. Naming the groups in your regex makes the regex easier to read and maintain.

Recipe 2.21 explains how to make the replacement text reinsert text matched by a capturing group when doing a search-and-replace.

Recipe 3.9 explains how to retrieve the text matched by a capturing group in procedural code.

Recipe 2.15 explains how to make sure the regex engine doesn’t needlessly try different ways of matching a group.

Many of the recipes in the later chapters use named capture to make it easier to retrieve parts of the text that was matched. Recipes 4.5, 8.7, and 8.19 show some of the more interesting solutions.
