Create regular expressions that match the following kinds of numbers:
A googol (a decimal number with 100 digits).
A 32-bit hexadecimal number.
A 32-bit hexadecimal number with an optional h
suffix.
A floating-point number with an optional integer part, a mandatory fractional part, and an optional exponent. Each part allows any number of digits.
[a-f0-9]{1,8}
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
The quantifier ‹{
›,
where n
}n
is a nonnegative integer, repeats
the preceding regex token n
number of
times. The ‹d{100}
› in
‹d{100}
› matches a
string of 100 digits. You could achieve the same by typing ‹d
› 100 times.
‹{1}
› repeats the
preceding token once, as it would without any quantifier. ‹ab{1}c
› is the same regex as
‹abc
›.
‹{0}
› repeats the
preceding token zero times, essentially deleting it from the regular
expression. ‹ab{0}c
› is
the same regex as ‹ac
›.
For variable repetition, we use
the quantifier ‹{
›, where
n,m
}n
is a nonnegative integer and
m
is greater than
n
. ‹[a-f0-9]{1,8}
› matches a hexadecimal number
with one to eight digits. With variable repetition, the order in which
the alternatives are attempted comes into play. Recipe 2.13 explains that in detail.
If n
and m
are equal, we have fixed repetition. ‹d{100,100}
› is the same regex as ‹d{100}
›.
The quantifier ‹{
›, where
n
,}n
is a nonnegative integer, allows for
infinite repetition. Essentially, infinite repetition is
variable repetition without an upper limit.
‹d{1,}
› matches
one or more digits, and ‹d+
› does the same. A plus after a regex token
that’s not a quantifier means “one or more.” Recipe 2.13 shows the meaning of a plus after a
quantifier.
‹d{0,}
› matches
zero or more digits, and ‹d*
› does the same. The asterisk always means
“zero or more.” In addition to allowing infinite repetition, ‹{0,}
› and the asterisk also make
the preceding token optional.
If we use variable repetition with
n
set to zero, we’re effectively making the
token that precedes the quantifier optional. ‹h{0,1}
› matches the ‹h
› once or not at all. If there is no h
, ‹h{0,1}
› results in a zero-length
match. If you use ‹h{0,1}
› as a regular expression all by itself,
it will find a zero-length match before each character in the subject
text that is not an h
. Each h
will result in a match of one
character (the h
).
‹h?
› does the same
as ‹h{0,1}
›. A question
mark after a valid and complete regex token that is not a quantifier
means “zero or once.” The next recipe shows the meaning of a question
mark after a quantifier.
A question mark, or any other quantifier, after an opening
parenthesis is a syntax error. Perl and the flavors that copy it use
this to add “Perl extensions” to the regex syntax. Preceding recipes
show noncapturing groups and named capturing groups, which all use a
question mark after an opening parenthesis as part of their syntax.
These question marks are not quantifiers at all; they’re simply part
of the syntax for noncapturing groups and named capturing groups.
Following recipes will show more styles of groups using the ‹(?
› syntax.
If you place a quantifier after the closing parenthesis
of a group, the whole group is repeated. ‹(?:abc){3}
› is the same as ‹abcabcabc
›.
Quantifiers can be nested. ‹(ed+)?
› matches an e
followed by one or more digits, or a
zero-length match. In our floating-point regular expression, this is
the optional exponent.
Capturing groups can be repeated. As explained in Recipe 2.9, the group’s match is captured each time
the engine exits the group, overwriting any text previously matched by
the group. ‹(dd){1,3}
›
matches a string of two, four, or six digits. The engine exits the
group three times. When this regex matches 123456
, the capturing group will hold
56
, because
56
was stored
by the last iteration of the group. The other two matches by the
group, 12
and
34
, cannot be
retrieved.
‹(dd){3}
›
captures the same text as ‹dddd(dd)
›. If you want the capturing group
to capture all two, four, or six digits rather than just the last two,
you have to place the capturing group around the quantifier instead of
repeating the capturing group: ‹((?:dd){1,3})
›. Here we used a noncapturing
group to take over the grouping function from the capturing group. We
also could have used two capturing groups: ‹((dd){1,3})
›. When this last regex matches
123456
,
‹1
› holds 123456
and ‹2
› holds 56
.
.NET’s regular expression engine is the only one that allows you
to retrieve all the iterations of a repeated capturing group. If you
directly query the group’s Value
property, which returns a string, you’ll get 56
, as with every other
regular expression engine. Backreferences in the regular expression
and replacement text also substitute 56
, but if you use the group’s CaptureCollection
, you’ll get a stack with
56
, 34
, and 12
.
Recipe 2.9 explains how to group part of a regex, so that part can be repeated as a whole.
Recipe 2.13 explains how to choose between minimal repetition and maximal repetition.
Recipe 2.14 explains how to make sure the regex engine doesn’t needlessly try different amounts of repetition.