2.12. Repeat Part of the Regex a Certain Number of Times

Problem

Create regular expressions that match the following kinds of numbers:

  • A googol (a decimal number with 100 digits).

  • A 32-bit hexadecimal number.

  • A 32-bit hexadecimal number with an optional h suffix.

  • A floating-point number with an optional integer part, a mandatory fractional part, and an optional exponent. Each part allows any number of digits.

Solution

Googol

\d{100}
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Hexadecimal number

[a-f0-9]{1,8}
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Hexadecimal number with optional suffix

[a-f0-9]{1,8}h?
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Floating-point number

\d*\.\d+(e\d+)?
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
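
As a rough illustration, the four solutions can be tried out in Python, one of the flavors listed above. This is only a sketch: the constant names and the matches_whole helper are made up for this example, and fullmatch is used on the assumption that the whole subject string should be one number.

```python
import re

# The four patterns from the Solution, written as raw strings so the
# backslashes reach the regex engine unchanged.
GOOGOL = re.compile(r"\d{100}")
HEX32 = re.compile(r"[a-f0-9]{1,8}", re.IGNORECASE)
HEX32_SUFFIX = re.compile(r"[a-f0-9]{1,8}h?", re.IGNORECASE)
FLOAT = re.compile(r"\d*\.\d+(e\d+)?", re.IGNORECASE)


def matches_whole(pattern, text):
    """Hypothetical helper: True when the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None


print(matches_whole(GOOGOL, "7" * 100))     # 100 digits
print(matches_whole(HEX32, "DEADBEEF"))     # eight hex digits
print(matches_whole(HEX32_SUFFIX, "ffh"))   # hex digits plus the h suffix
print(matches_whole(FLOAT, ".5"))           # no integer part, no exponent
```

Without fullmatch, the patterns as written can also match inside a longer run of characters. For example, a search for \d{100} succeeds on a string of 101 digits.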

Discussion

Fixed repetition

The quantifier {n}, where n is a nonnegative integer, repeats the preceding regex token n times. The regex \d{100} matches a string of 100 digits. You could achieve the same by typing \d 100 times.

{1} repeats the preceding token once, as it would without any quantifier. ab{1}c is the same regex as abc.

{0} repeats the preceding token zero times, essentially deleting it from the regular expression. ab{0}c is the same regex as ac.
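
The equivalences for fixed repetition can be checked with a few lines of Python. This is a minimal sketch; the variable names are invented for the example, and a shorter count of 3 stands in for 100 to keep the strings readable.

```python
import re

# {3} repeats \d three times, the same as typing \d three times.
with_quantifier = re.fullmatch(r"\d{3}", "123") is not None
spelled_out = re.fullmatch(r"\d\d\d", "123") is not None

# {1} changes nothing: ab{1}c behaves like abc.
once = re.fullmatch(r"ab{1}c", "abc") is not None

# {0} removes the token: ab{0}c behaves like ac, so it rejects "abc".
zero_matches_ac = re.fullmatch(r"ab{0}c", "ac") is not None
zero_rejects_abc = re.fullmatch(r"ab{0}c", "abc") is None

print(with_quantifier, spelled_out, once, zero_matches_ac, zero_rejects_abc)
```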

Variable repetition

For variable repetition, we use the quantifier {n,m}, where n is a nonnegative integer and m is greater than or equal to n. [a-f0-9]{1,8} matches a hexadecimal number with one to eight digits. With variable repetition, the order in which the alternatives are attempted comes into play. Recipe 2.13 explains that in detail.

If n and m are equal, we have fixed repetition. \d{100,100} is the same regex as \d{100}.
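
A short Python sketch of variable repetition follows. The leading_hex helper is hypothetical and exists only for this example. It shows that {1,8} accepts anything from one to eight hexadecimal digits and stops after the eighth.

```python
import re

HEX = re.compile(r"[a-f0-9]{1,8}", re.IGNORECASE)


def leading_hex(text):
    """Hypothetical helper: the hex digits matched at the start of text, or None."""
    m = HEX.match(text)
    return m.group() if m else None


short = leading_hex("1F")                # one to eight digits are accepted
capped = leading_hex("123456789abc")     # at most eight digits are consumed
nothing = leading_hex("xyz")             # at least one digit is required

# {2,2} and {2} describe the same fixed repetition.
same_as_fixed = re.findall(r"\d{2,2}", "12345") == re.findall(r"\d{2}", "12345")

print(short, capped, nothing, same_as_fixed)
```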

Infinite repetition

The quantifier {n,}, where n is a nonnegative integer, allows for infinite repetition. Essentially, infinite repetition is variable repetition without an upper limit.

\d{1,} matches one or more digits, and \d+ does the same. A plus after a regex token that’s not a quantifier means “one or more.” Recipe 2.13 shows the meaning of a plus after a quantifier.

\d{0,} matches zero or more digits, and \d* does the same. The asterisk always means “zero or more.” In addition to allowing infinite repetition, {0,} and the asterisk also make the preceding token optional.
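
To see that the shorthand quantifiers are interchangeable with their brace forms, one can compare their results on the same subject text. The sample string below is made up for this sketch.

```python
import re

subject = "a1b22c333"  # hypothetical sample text

# {1,} and + both mean "one or more".
plus_braces = re.findall(r"\d{1,}", subject)
plus_short = re.findall(r"\d+", subject)

# {0,} and * both mean "zero or more", so they also produce zero-length
# matches wherever no digit is present.
star_braces = re.findall(r"\d{0,}", subject)
star_short = re.findall(r"\d*", subject)

print(plus_braces == plus_short)
print(star_braces == star_short)
```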

Making something optional

If we use variable repetition with n set to zero, we’re effectively making the token that precedes the quantifier optional. h{0,1} matches the h once or not at all. If there is no h, h{0,1} results in a zero-length match. If you use h{0,1} as a regular expression all by itself, it will find a zero-length match before each character in the subject text that is not an h. Each h will result in a match of one character (the h).

h? does the same as h{0,1}. A question mark after a valid and complete regex token that is not a quantifier means “zero or once.” The next recipe shows the meaning of a question mark after a quantifier.
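
The zero-length matches described above can be observed by listing the match positions. This is a minimal sketch in Python; the spans helper and the sample text "oh" are assumptions made for the example.

```python
import re


def spans(pattern, text):
    """Hypothetical helper: the (start, end) position of every match."""
    return [m.span() for m in re.finditer(pattern, text)]


braces = spans(r"h{0,1}", "oh")
question = spans(r"h?", "oh")

# The first match is zero-length, before the "o"; the second covers the "h".
# Python also reports a zero-length match at the very end of the subject.
print(braces)
print(braces == question)
```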

Tip

A question mark, or any other quantifier, after an opening parenthesis is a syntax error. Perl and the flavors that copy it use this to add “Perl extensions” to the regex syntax. Preceding recipes show noncapturing groups and named capturing groups, which all use a question mark after an opening parenthesis as part of their syntax. These question marks are not quantifiers at all; they’re simply part of the syntax for noncapturing groups and named capturing groups. Following recipes will show more styles of groups using the (? syntax.

Repeating groups

If you place a quantifier after the closing parenthesis of a group, the whole group is repeated. (?:abc){3} is the same as abcabcabc.

Quantifiers can be nested. (e\d+)? matches an e followed by one or more digits, or a zero-length match. In our floating-point regular expression, this is the optional exponent.
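
Both points can be tried in Python. In this sketch the variable names are invented; note that Python reports a group that did not take part in the match as None.

```python
import re

# A quantifier after a group repeats the whole group.
repeated_group = re.fullmatch(r"(?:abc){3}", "abcabcabc") is not None
too_short = re.fullmatch(r"(?:abc){3}", "abcabc") is None

# Nested quantifiers: the + sits inside a group that is itself optional.
FLOAT = re.compile(r"\d*\.\d+(e\d+)?", re.IGNORECASE)
with_exponent = FLOAT.fullmatch("1.5e10").group(1)
without_exponent = FLOAT.fullmatch("1.5").group(1)

print(repeated_group, too_short, with_exponent, without_exponent)
```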

Capturing groups can be repeated. As explained in Recipe 2.9, the group’s match is captured each time the engine exits the group, overwriting any text previously matched by the group. (\d\d){1,3} matches a string of two, four, or six digits. When this regex matches 123456, the engine exits the group three times, and the capturing group will hold 56, because 56 was stored by the last iteration of the group. The other two matches by the group, 12 and 34, cannot be retrieved.

(\d\d){3} captures the same text as \d\d\d\d(\d\d). If you want the capturing group to capture all two, four, or six digits rather than just the last two, you have to place the capturing group around the quantifier instead of repeating the capturing group: ((?:\d\d){1,3}). Here we used a noncapturing group to take over the grouping function from the capturing group. We also could have used two capturing groups: ((\d\d){1,3}). When this last regex matches 123456, \1 holds 123456 and \2 holds 56.
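
The difference between repeating a capturing group and capturing a repeated group is easy to check in Python. The sketch below uses the subject text 123456 from the discussion; the variable names are made up for the example.

```python
import re

subject = "123456"

# Repeating the capturing group: only the last iteration is kept.
last_only = re.fullmatch(r"(\d\d){1,3}", subject).group(1)

# Capturing group around a repeated noncapturing group: all digits are kept.
everything = re.fullmatch(r"((?:\d\d){1,3})", subject).group(1)

# Two capturing groups: group 1 holds everything, group 2 the last iteration.
both = re.fullmatch(r"((\d\d){1,3})", subject).groups()

print(last_only, everything, both)
```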

Of the regex flavors covered here, .NET’s regular expression engine is the only one that allows you to retrieve all the iterations of a repeated capturing group. If you directly query the group’s Value property, which returns a string, you’ll get 56, as with every other regular expression engine. Backreferences in the regular expression and replacement text also substitute 56, but if you use the group’s CaptureCollection, you’ll get a stack with 56, 34, and 12.

See Also

Recipe 2.9 explains how to group part of a regex, so that part can be repeated as a whole.

Recipe 2.13 explains how to choose between minimal repetition and maximal repetition.

Recipe 2.14 explains how to make sure the regex engine doesn’t needlessly try different amounts of repetition.
