2.21. Insert Part of the Regex Match into the Replacement Text

Problem

Match any contiguous sequence of 10 digits, such as 1234567890. Convert the sequence into a nicely formatted phone number—for example, (123) 456-7890.

Solution

Regular expression

(\d{3})(\d{3})(\d{4})
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Replacement

($1) $2-$3
Replacement text flavors: .NET, Java, JavaScript, PHP, Perl
(${1}) ${2}-${3}
Replacement text flavors: .NET, PHP, Perl
(\1) \2-\3
Replacement text flavors: PHP, Python, Ruby
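
As a rough illustration, the backslash-style replacement can be tried in Python with re.sub. This is only a sketch: the function name format_phone is made up for this example, and the space after the closing parenthesis is included so that the output matches the format given in the Problem.

```python
import re

def format_phone(text):
    # Three capturing groups: area code, exchange, and line number.
    # \1, \2, and \3 in the replacement insert the text each group matched.
    return re.sub(r"(\d{3})(\d{3})(\d{4})", r"(\1) \2-\3", text)

print(format_phone("1234567890"))
```

Raw strings (the r prefix) keep Python from interpreting the backslashes before the regex engine sees them.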

Discussion

Replacements using capturing groups

Recipe 2.10 explains how you can use capturing groups in your regular expression to match the same text more than once. The text matched by each capturing group in your regex is also available after each successful match. You can insert the text of some or all capturing groups—in any order, or even more than once—into the replacement text.
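
The point about order and repetition can be sketched in Python. The replacement text below is invented purely for illustration: it inserts the groups in reverse order and uses the first group twice.

```python
import re

PHONE = re.compile(r"(\d{3})(\d{3})(\d{4})")

def reorder_demo(text):
    # Group 3 comes first, then group 2, and group 1 is inserted twice.
    return PHONE.sub(r"\3-\2-\1 (area code \1)", text)

print(reorder_demo("1234567890"))
```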

Some flavors, such as Python and Ruby, use the same «\1» syntax for backreferences in both the regular expression and the replacement text. Other flavors use Perl’s «$1» syntax, using a dollar sign instead of a backslash. PHP supports both.

In Perl, «$1» and above are actually variables that are set after each successful regex match. You can use them anywhere in your code until the next regex match. .NET, Java, JavaScript, and PHP support «$1» only in the replacement syntax. These programming languages do offer other ways to access capturing groups in code. Chapter 3 explains that in detail.
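
Python is not one of the languages named in the paragraph above, but it offers a simple way to see what accessing capturing groups in code looks like outside the replacement text. In this minimal sketch, the helper name split_phone is hypothetical; the groups are read from the match object that re.search returns.

```python
import re

def split_phone(text):
    # re.search returns a match object, or None when nothing matches.
    match = re.search(r"(\d{3})(\d{3})(\d{4})", text)
    if match is None:
        return None
    # group(n) returns the text matched by the nth capturing group.
    return match.group(1), match.group(2), match.group(3)

print(split_phone("1234567890"))
```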

$10 and higher

All regex flavors in this book support up to 99 capturing groups in a regular expression. In the replacement text, ambiguity can occur with «$10» or «\10» and above. These can be interpreted as either the 10th capturing group, or the first capturing group followed by a literal zero.

.NET, XRegExp, PHP, and Perl allow you to put curly braces around the number to make your intention clear. «${10}» is always the 10th capturing group, and «${1}0» is always the first followed by a literal zero.

Java and JavaScript try to be clever with «$10». If a capturing group with the specified two-digit number exists in your regular expression, both digits are used for the capturing group. If fewer capturing groups exist, only the first digit is used to reference the group, leaving the second as a literal. Thus «$23» is the 23rd capturing group, if it exists. Otherwise, it is the second capturing group followed by a literal «3».

.NET, XRegExp, PHP, Perl, Python, and Ruby always treat «$10» and «\10» as the 10th capturing group, regardless of whether it exists. If it doesn’t, the behavior for nonexistent groups comes into play.
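
A small Python sketch of the two readings follows. Python has no curly-brace form; its re module instead documents «\g<1>» as an unambiguous way to write a numbered reference, so «\g<1>0» means the first group followed by a literal zero. The ten-group pattern and the function names are made up for this example.

```python
import re

# Ten single-digit capturing groups, so that group 10 exists.
TEN_DIGITS = r"(\d)(\d)(\d)(\d)(\d)(\d)(\d)(\d)(\d)(\d)"

def tenth_group(text):
    # \10 is read as a reference to the 10th capturing group.
    return re.sub(TEN_DIGITS, r"\10", text)

def first_group_then_zero(text):
    # \g<1>0 is the first capturing group followed by a literal 0.
    return re.sub(TEN_DIGITS, r"\g<1>0", text)

print(tenth_group("1234567890"))
print(first_group_then_zero("1234567890"))
```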

References to nonexistent groups

The regular expression in the solution for this recipe has three capturing groups. If you type «$4» or «\4» into the replacement text, you’re adding a reference to a capturing group that does not exist. This triggers one of three different behaviors.

Java, XRegExp, and Python will cry foul by raising an exception or returning an error message. Do not use invalid backreferences with these flavors. (Actually, you shouldn’t use invalid backreferences with any flavor.) If you want to insert «$4» or «\4» literally, escape the dollar sign or backslash. Recipe 2.19 explains this in detail.

PHP, Perl, and Ruby substitute all backreferences in the replacement text, including those that point to groups that don’t exist. Groups that don’t exist did not capture any text and therefore references to these groups are simply replaced with nothing.

Finally, .NET and JavaScript (without XRegExp) leave backreferences to groups that don’t exist as literal text in the replacement.

All flavors do replace groups that do exist in the regular expression but did not capture anything. Those are replaced with nothing.
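
The Python side of these behaviors can be sketched as follows, assuming Python 3.5 or later (earlier versions handled groups that did not participate in the match differently). The function names are hypothetical, and the exact wording of the error message is not relied upon.

```python
import re

PHONE = r"(\d{3})(\d{3})(\d{4})"

def reference_missing_group(text):
    # The pattern has only three groups, so \4 is an invalid reference
    # and Python raises re.error instead of producing a result.
    try:
        return re.sub(PHONE, r"\4", text)
    except re.error as exc:
        return "error: " + str(exc)

def literal_backslash_four(text):
    # Escaping the backslash inserts the two characters \4 literally.
    return re.sub(PHONE, r"\\4", text)

def unmatched_group_demo(text):
    # Both groups exist, but only one of them takes part in any match.
    # The group that did not capture anything is replaced with nothing.
    return re.sub(r"(a)|(b)", r"[\1|\2]", text)

print(reference_missing_group("1234567890"))
print(literal_backslash_four("1234567890"))
print(unmatched_group_demo("ab"))
```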

Solution Using Named Capture

Regular expression

(?<area>\d{3})(?<exchange>\d{3})(?<number>\d{4})
Regex options: None
Regex flavors: .NET, Java 7, XRegExp, PCRE 7, Perl 5.10, Ruby 1.9
(?'area'\d{3})(?'exchange'\d{3})(?'number'\d{4})
Regex options: None
Regex flavors: .NET, PCRE 7, Perl 5.10, Ruby 1.9
(?P<area>\d{3})(?P<exchange>\d{3})(?P<number>\d{4})
Regex options: None
Regex flavors: PCRE, Perl 5.10, Python

Replacement

(${area}) ${exchange}-${number}
Replacement text flavors: .NET, Java 7, XRegExp
(\g<area>) \g<exchange>-\g<number>
Replacement text flavor: Python
(\k<area>) \k<exchange>-\k<number>
Replacement text flavor: Ruby 1.9
(\k'area') \k'exchange'-\k'number'
Replacement text flavor: Ruby 1.9
($+{area}) $+{exchange}-$+{number}
Replacement text flavor: Perl 5.10
($1) $2-$3
Replacement text flavor: PHP
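
For a concrete picture of the Python variant, the following sketch combines the «(?P<name>…)» group syntax with «\g<name>» references in the replacement text. The function name format_phone_named is invented for this example, and the space after the closing parenthesis is included to match the format given in the Problem.

```python
import re

NAMED_PHONE = re.compile(
    r"(?P<area>\d{3})(?P<exchange>\d{3})(?P<number>\d{4})"
)

def format_phone_named(text):
    # \g<name> inserts the text matched by the named group.
    return NAMED_PHONE.sub(r"(\g<area>) \g<exchange>-\g<number>", text)

print(format_phone_named("1234567890"))
```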

Flavors that support named capture

.NET, Java 7, XRegExp, Python, and Ruby 1.9 allow you to use named backreferences in the replacement text if you used named capturing groups in your regular expression. In .NET, Java 7, XRegExp, and Python, the syntax for named backreferences in the replacement text differs from that in the regular expression.

Ruby uses the same syntax for backreferences in the replacement text as it does in the regular expression. For named capturing groups in Ruby 1.9, this syntax is «\k<group>» or «\k'group'». The choice between angle brackets and single quotes is merely a notational convenience.

Perl 5.10 and later store the text matched by named capturing groups into the hash %+. You can get the text matched by the group “name” with $+{name}. Perl interpolates variables in the replacement text, so you can treat «$+{name}» as a named backreference in the replacement text.

PHP (using PCRE) supports named capturing groups in regular expressions, but not in the replacement text. You can use numbered backreferences in the replacement text to named capturing groups in the regular expression. PCRE assigns numbers to both named and unnamed groups, from left to right.

.NET, Java 7, XRegExp, Python, and Ruby 1.9 also allow numbered references to named groups. However, .NET uses a different numbering scheme for named groups, as Recipe 2.11 explains. Mixing names and numbers with .NET, Java 7, XRegExp, Python, or Ruby is not recommended. Either give all your capturing groups names or don’t name any groups at all. Always use named backreferences for named groups.
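
In Python, for instance, named groups are also numbered from left to right, so a numbered reference to a named group works. The sketch below (with made-up function names) shows both forms producing the same result; it is meant to illustrate the behavior, not to recommend the numbered form.

```python
import re

NAMED_PHONE = re.compile(
    r"(?P<area>\d{3})(?P<exchange>\d{3})(?P<number>\d{4})"
)

def by_name(text):
    # Named backreferences: the form recommended for named groups.
    return NAMED_PHONE.sub(r"(\g<area>) \g<exchange>-\g<number>", text)

def by_number(text):
    # Numbered references to the same named groups also work in Python.
    return NAMED_PHONE.sub(r"(\1) \2-\3", text)

print(by_name("1234567890") == by_number("1234567890"))
```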

See Also

Recipe 2.9 explains the capturing groups that backreferences refer to.

Recipe 2.11 explains named capturing groups. Naming the groups in your regex and the backreferences in your replacement text makes them easier to read and maintain.

Search and Replace with Regular Expressions in Chapter 1 describes the various replacement text flavors.

Recipe 2.10 shows how to use backreferences in the regular expression itself. The syntax is different from that for backreferences in the replacement text.

Recipe 3.15 explains how to use replacement text in source code.
