2.10. Match Previously Matched Text Again

Problem

Create a regular expression that matches “magical” dates in yyyy-mm-dd format. A date is magical if the year minus the century, the month, and the day of the month are all the same numbers. For example, 2008-08-08 is a magical date. You can assume all dates in the subject text to be valid. The regular expression does not have to exclude things like 9999-99-99, as these won’t occur in the subject text. You only need to find the magical dates.

Solution

dd(dd)-1-1
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

To match previously matched text later in a regex, we first have to capture the previous text. We do that with a capturing group, as shown in Recipe 2.9. After that, we can match the same text anywhere in the regex using a backreference. You can reference the first nine capturing groups with a backslash followed by a single digit one through nine. For groups 10 through 99, use 10 to 99.

Warning

Do not use 1. That is either an octal escape or an error. We don’t use octal escapes in this book at all, because the xFF hexadecimal escapes are much easier to understand.

When the regular expression dd(dd)-1-1 encounters 2008-08-08, the first dd matches 20. The regex engine then enters the capturing group, noting the position reached in the subject text.

The dd inside the capturing group matches 08, and the engine reaches the group’s closing parenthesis. At this point, the partial match 08 is stored in capturing group 1.

The next token is the hyphen, which matches literally. Then comes the backreference. The regex engine checks the contents of the first capturing group: 08. The engine tries to match this text literally. If the regular expression is case-insensitive, the captured text is matched in this way. Here, the backreference succeeds. The next hyphen and backreference also succeed. Finally, the word boundary matches at the end of the subject text, and an overall match is found: 2008-08-08. The capturing group still holds 08.

If a capturing group is repeated, either by a quantifier (Recipe 2.12) or by backtracking (Recipe 2.13), the stored match is overwritten each time the capturing group matches something. A backreference to the group matches only the text that was last captured by the group.

If the same regex encounters 2008-05-24 2007-07-07, the first time the group captures something is when dd(dd) matches 2008, storing 08 for the first (and only) capturing group. Next, the hyphen matches itself. The backreference, which tries to match 08, fails against 05.

Since there are no other alternatives in the regular expression, the engine gives up the match attempt. This involves clearing all the capturing groups. When the engine tries again, starting at the first 0 in the subject, 1 holds no text at all.

Still processing 2008-05-24 2007-07-07, the next time the group captures something is when dd(dd) matches 2007, storing 07. Next, the hyphen matches itself. Now the backreference tries to match 07. This succeeds, as do the next hyphen, backreference, and word boundary. 2007-07-07 has been found.

Because the regex engine proceeds from start to end, you should put the capturing parentheses before the backreference. The regular expressions dd1-(dd)-1 and dd1-1-(dd) could never match anything. Since the backreference is encountered before the capturing group, it has not captured anything yet. Unless you’re using JavaScript, a backreference always fails if it points to a group that hasn’t already participated in the match attempt.

A group that hasn’t participated is not the same as a group that has captured a zero-length match. A backreference to a group with a zero-length capture always succeeds. When (^)1 matches at the start of the string, the first capturing group captures the caret’s zero-length match, causing 1 to succeed. In practice, this can happen when the contents of the capturing group are all optional.

Tip

JavaScript is the only flavor we know that goes against decades of backreference tradition in regular expressions. In JavaScript, or at least in implementations that follow the JavaScript standard, a backreference to a group that hasn’t participated always succeeds, just like a backreference to a group that captured a zero-length match. So, in JavaScript, dd1-1-(dd) can match 12--34.

See Also

Recipe 2.9 explains the capturing groups that backreferences refer to.

Recipe 2.11 explains named capturing groups and named backreferences. Naming the groups and backreferences in your regex makes the regex easier to read and maintain.

Recipe 2.21 explains how to make the replacement text reinsert text matched by a capturing group when doing a search-and-replace.

Recipe 3.9 explains how to retrieve the text matched by a capturing group in procedural code.

Recipes 5.8, 5.9, and show how you can solve some real-world problems using backreferences.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset