You want to find various kinds of integer decimal numbers in a larger body of text, or check whether a string variable holds an integer decimal number.
Find any positive integer decimal number in a larger body of text:
\b[0-9]+\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Check whether a text string holds just a positive integer decimal number:
\A[0-9]+\Z
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^[0-9]+$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
Find any positive integer decimal number that stands alone in a larger body of text:
(?<=^|\s)[0-9]+(?=$|\s)
Regex options: None
Regex flavors: .NET, Java, PCRE, Ruby 1.9
For Perl and Python, we have to tweak the preceding solution, because they do not support alternatives of different lengths inside lookbehind:
(?:^|(?<=\s))[0-9]+(?=$|\s)
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9
Find any positive integer decimal number that stands alone in a larger body of text, allowing leading whitespace to be included in the regex match:
(^|\s)([0-9]+)(?=$|\s)
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Find any integer decimal number with an optional leading plus or minus sign:
[+-]?\b[0-9]+\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Check whether a text string holds just an integer decimal number with optional sign:
\A[+-]?[0-9]+\Z
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^[+-]?[0-9]+$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
Find any integer decimal number with optional sign, allowing whitespace between the number and the sign, but no leading whitespace without the sign:
([+-]●*)?\b[0-9]+\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
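As a rough illustration of the two signed variants, the following Python sketch runs them over a made-up sample string. It assumes that ● in the last regex stands for a literal space character; the names `signed`, `spaced`, and `text` are hypothetical.

```python
import re

# Optional sign that must sit directly in front of the digits.
signed = re.compile(r"[+-]?\b[0-9]+\b")

# Optional sign that may be followed by spaces before the digits.
# Assumption: the ● in the printed regex denotes a literal space.
spaced = re.compile(r"([+-] *)?\b[0-9]+\b")

text = "a = -12, b = + 7, c = 42"

# finditer() with group() returns the overall match even though
# the second regex contains a capturing group.
signed_matches = [m.group() for m in signed.finditer(text)]
spaced_matches = [m.group() for m in spaced.finditer(text)]

print(signed_matches)  # ['-12', '7', '42']
print(spaced_matches)  # ['-12', '+ 7', '42']
```

The first regex drops the plus sign before the 7 because a space separates them; the second one keeps the sign and the space in the match.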
An integer number is a contiguous series of one or more digits, each between zero and nine. We can easily represent this with a character class (Recipe 2.3) and a quantifier (Recipe 2.12): ‹[0-9]+›.
We prefer to use the explicit range ‹[0-9]› instead of the shorthand ‹\d›. In .NET and Perl, ‹\d› matches any digit in any script, but ‹[0-9]› always just matches the 10 digits in the ASCII table. If you know your subject text doesn’t include any non-ASCII digits, you can save a few keystrokes and use ‹\d› instead of ‹[0-9]›.
If you don’t know whether your subject will include digits outside the ASCII table, you need to think about what you want to do with the regex matches and what the user’s expectations are in order to decide whether you should use ‹\d› or ‹[0-9]›. If you plan to convert the text matched by the regular expression into an integer, check whether the string-to-integer function in your programming language can interpret non-ASCII digits. Users writing documents in their native scripts will expect your software to recognize digits in their native scripts.
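The difference between the two tokens can be checked quickly in Python. This sketch is only an illustration: it relies on the fact that Python 3's `re` module also treats ‹\d› as Unicode-aware for `str` patterns unless the `re.ASCII` flag is set, and that Python's built-in `int()` accepts Unicode decimal digits. The sample text is made up.

```python
import re

# U+0661..U+0663 are the Arabic-Indic digits one, two, three.
text = "ASCII 456, Arabic-Indic \u0661\u0662\u0663"

any_digits = re.findall(r"\d+", text)                  # digits in any script
ascii_digits = re.findall(r"[0-9]+", text)             # ASCII digits only
ascii_flag = re.findall(r"\d+", text, flags=re.ASCII)  # \d restricted to ASCII

# Python's int() happens to understand non-ASCII decimal digits;
# other languages' conversion functions may not.
converted = int("\u0661\u0662\u0663")

print(any_digits)    # two matches: '456' and the Arabic-Indic number
print(ascii_digits)  # ['456']
print(converted)     # 123
```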
Beyond being a series of digits, the number must also stand alone. A4 is a paper size, not a number. There are several ways to make sure your regex only matches pure numbers.
If you want to check whether your string holds nothing but a number, simply put start-of-string and end-of-string anchors around your regex. ‹\A› and ‹\Z› are your best option, because their meaning doesn’t change. Unfortunately, JavaScript doesn’t support them. In JavaScript, use ‹^› and ‹$›, and make sure you don’t specify the /m flag that makes the caret and dollar match at line breaks. In Ruby, the caret and dollar always match at line breaks, so you can’t reliably use them to force your regex to match the whole string.
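A small Python sketch of the anchor behavior described above. In Python, `re.MULTILINE` plays the role of the /m flag; the variable names and sample strings are illustrative only.

```python
import re

# \A and \Z always refer to the start and end of the whole string.
whole = re.compile(r"\A[0-9]+\Z")

# With MULTILINE, ^ and $ also match at line breaks, so the regex
# no longer validates the string as a whole.
per_line = re.compile(r"^[0-9]+$", re.MULTILINE)

ok = whole.search("12345") is not None                        # True
mixed = whole.search("12345\nabc") is not None                # False
line_hit = per_line.search("abc\n12345\nxyz") is not None     # True

print(ok, mixed, line_hit)
```

The last result shows why the multiline flag is unsuitable for validation: a string that is clearly not a number still produces a match.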
When searching for numbers within a larger body of text, word boundaries (Recipe 2.6) are an easy solution. When you place them before or after a regex token that matches a digit, the word boundary makes sure there is no word character before or after the matched digit. For example, ‹4› matches 4 in A4. ‹4\b› does too, because there’s no word character after the 4. ‹\b4› and ‹\b4\b› don’t match anything in A4, because ‹\b› fails between the two word characters A and 4. In regular expressions, word characters include letters, digits, and underscores.
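The four cases can be verified with a few lines of Python. This is just a sketch; `subject` and `results` are illustrative names.

```python
import re

subject = "A4"

# Map each regex to whether it finds a match anywhere in the subject.
results = {
    pattern: re.search(pattern, subject) is not None
    for pattern in (r"4", r"4\b", r"\b4", r"\b4\b")
}

for pattern, found in results.items():
    print(f"{pattern!r:10} matches: {found}")
```

Only the two patterns without a leading word boundary report a match.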
If you include nonword characters such as plus or minus signs or whitespace in your regex, you have to be careful with the placement of word boundaries. To match +4 while excluding +4B, use ‹\+4\b› instead of ‹\b\+4\b›. The latter does not match +4, because there’s no word character before the plus in the subject string to satisfy the word boundary. ‹\b\+4\b› does match +4 in the text 3+4, because 3 is a word character and + is not.
‹\+4\b› only needs one word boundary. The first ‹\b› in ‹\+\b4\b› is superfluous. When this regex matches, the first ‹\b› is always between a + and a 4, and thus never excludes anything. The first ‹\b› becomes important when the plus sign is optional. ‹\+?\b4\b› does not match the 4 in A4, whereas ‹\+?4\b› does.
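The word boundary placements discussed in the last two paragraphs can be tried out as follows. The helper `matched` is a hypothetical convenience function, not part of any library.

```python
import re

def matched(pattern, subject):
    """Return the text matched by pattern in subject, or None."""
    m = re.search(pattern, subject)
    return m.group() if m else None

plain_plus = matched(r"\+4\b", "+4")         # '+4'
plus_letter = matched(r"\+4\b", "+4B")       # None: B follows the 4
leading_b = matched(r"\b\+4\b", "+4")        # None: nothing before the plus
after_digit = matched(r"\b\+4\b", "3+4")     # '+4': 3 is a word character
optional_b = matched(r"\+?\b4\b", "A4")      # None: no boundary between A and 4
optional_nob = matched(r"\+?4\b", "A4")      # '4'

print(plain_plus, plus_letter, leading_b, after_digit, optional_b, optional_nob)
```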
Word boundaries are not always the right solution. Consider the subject text $123,456.78. If you iterate over this string with the regex ‹\b[0-9]+\b›, it’ll match 123, 456, and 78. The dollar sign, comma, and decimal point are not word characters, so the word boundary matches between a digit and any of these characters. Sometimes this is what you want, sometimes not.
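Iterating over that subject text in Python shows the three separate matches:

```python
import re

# The punctuation characters are not word characters, so each run of
# digits is bounded by word boundaries on both sides.
pieces = re.findall(r"\b[0-9]+\b", "$123,456.78")

print(pieces)  # ['123', '456', '78']
```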
If you only want to find integers surrounded by whitespace or the start or end of a string, you need to use lookaround instead of word boundaries. ‹(?=$|\s)› matches at the end of the string or before a character that is whitespace (whitespace includes line breaks). ‹(?<=^|\s)› matches either at the start of the string, or after a character that is whitespace. You can replace ‹\s› with a character class that matches any of the characters you want to allow before or after the number. See Recipe 2.16 to learn how lookaround works.
Perl and Python support lookbehind, but they don’t allow alternatives of different length inside lookbehind. Since ‹^› is zero-length and ‹\s› matches a single character, we have to put the ‹^› alternative outside the lookbehind. Thus ‹(?<=^|\s)› becomes ‹(?:^|(?<=\s))› for Perl and Python. These two regexes are functionally identical. The latter just takes a bit more effort on the keyboard.
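The following Python sketch tries both forms. It assumes the standard `re` module, which rejects the variable-length lookbehind at compile time; the sample text and variable names are made up for illustration.

```python
import re

# The variable-length lookbehind is rejected by Python's re module.
try:
    re.compile(r"(?<=^|\s)[0-9]+(?=$|\s)")
    lookbehind_error = None
except re.error as exc:
    lookbehind_error = str(exc)

# The rewritten form keeps ^ outside the lookbehind.
standalone = re.compile(r"(?:^|(?<=\s))[0-9]+(?=$|\s)")

text = "42 A4 $123,456.78 7\n2024"
found = standalone.findall(text)

print(lookbehind_error is not None)  # True
print(found)                         # ['42', '7', '2024']
```

Neither the 4 in A4 nor any part of $123,456.78 is matched, because those digits are not surrounded by whitespace or the string boundaries.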
JavaScript and Ruby 1.8 don’t support lookbehind. You can use a normal group instead of lookbehind to check if the number occurs at the start of the string, or if it is preceded by whitespace. The drawback is that the whitespace character will be included in the overall regex match if the number doesn’t occur at the start of the string. An easy solution to that is to put the part of the regex that matches the number inside a capturing group. The regex ‹(^|\s)([0-9]+)(?=$|\s)› shown earlier captures the whitespace character in the first capturing group and the matched integer in the second capturing group.
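Although Python does support lookbehind, the capturing-group workaround can be sketched in it just as well. The sample text is invented; the point is the difference between the overall match and the second group.

```python
import re

pattern = re.compile(r"(^|\s)([0-9]+)(?=$|\s)")
text = "42 A4 7 x9 100"

# group(0) is the overall match and may include one leading
# whitespace character; group(2) is the number on its own.
overall = [m.group(0) for m in pattern.finditer(text)]
numbers = [m.group(2) for m in pattern.finditer(text)]

print(overall)  # ['42', ' 7', ' 100']
print(numbers)  # ['42', '7', '100']
```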
All the other recipes in this chapter show more ways of matching different kinds of numbers with a regular expression.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.