6.1. Integer Numbers

Problem

You want to find various kinds of integer decimal numbers in a larger body of text, or check whether a string variable holds an integer decimal number.

Solution

Find any positive integer decimal number in a larger body of text:

\b[0-9]+\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Check whether a text string holds just a positive integer decimal number:

\A[0-9]+\Z
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^[0-9]+$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Find any positive integer decimal number that stands alone in a larger body of text:

(?<=^|\s)[0-9]+(?=$|\s)
Regex options: None
Regex flavors: .NET, Java, PCRE, Ruby 1.9

For Perl and Python, we have to tweak the preceding solution, because they do not support alternatives of different lengths inside lookbehind:

(?:^|(?<=\s))[0-9]+(?=$|\s)
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9

Find any positive integer decimal number that stands alone in a larger body of text, allowing leading whitespace to be included in the regex match:

(^|\s)([0-9]+)(?=$|\s)
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Find any integer decimal number with an optional leading plus or minus sign:

[+-]?\b[0-9]+\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Check whether a text string holds just an integer decimal number with optional sign:

\A[+-]?[0-9]+\Z
Regex options: None
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^[+-]?[0-9]+$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Find any integer decimal number with optional sign, allowing whitespace between the number and the sign, but no leading whitespace without the sign:

([+-] *)?\b[0-9]+\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
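
As an illustration of the last regex, here is a minimal sketch using Python's re module (the choice of Python, the function name find_signed_integers, and the sample text are assumptions made for this example, not part of the recipe). It matches an optional sign, optional spaces after the sign, and then a run of digits delimited by word boundaries: ([+-] *)?\b[0-9]+\b.

```python
import re

# Optional sign, optional spaces after the sign, then a standalone run of ASCII digits.
SIGNED_INT = re.compile(r"([+-] *)?\b[0-9]+\b")


def find_signed_integers(text):
    """Return the full text of every match: sign, spaces, and digits."""
    # finditer is used instead of findall because the pattern contains a
    # capturing group, and findall would return only the group's text.
    return [m.group(0) for m in SIGNED_INT.finditer(text)]


matches = find_signed_integers("x = - 5, y = +7, z = 9")
print(matches)  # ['- 5', '+7', '9']
```

The space before the unsigned 9 is not part of the match, because spaces are only allowed inside the optional group that starts with a sign.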

Discussion

An integer number is a contiguous series of one or more digits, each between zero and nine. We can easily represent this with a character class (Recipe 2.3) and a quantifier (Recipe 2.12): [0-9]+.

Tip

We prefer to use the explicit range [0-9] instead of the shorthand \d. In .NET and Perl, \d matches any digit in any script, but [0-9] always just matches the 10 digits in the ASCII table. If you know your subject text doesn’t include any non-ASCII digits, you can save a few keystrokes and use \d instead of [0-9].

If you don’t know whether your subject will include digits outside the ASCII table, you need to think about what you want to do with the regex matches and what the user’s expectations are in order to decide whether you should use \d or [0-9]. If you plan to convert the text matched by the regular expression into an integer, check whether the string-to-integer function in your programming language can interpret non-ASCII digits. Users writing documents in their native scripts will expect your software to recognize digits in their native scripts.
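
The difference between the shorthand and the explicit range can be checked directly. The sketch below assumes Python 3, where the \d shorthand in a str pattern also matches Unicode decimal digits unless the re.ASCII flag is set. The sample string and variable names are made up for illustration.

```python
import re

# The digits four and two written in Arabic-Indic script (U+0664, U+0662).
arabic_indic = "\u0664\u0662"

# With a str pattern, \d matches any Unicode decimal digit by default.
shorthand_matches = re.fullmatch(r"\d+", arabic_indic) is not None

# The explicit range accepts only the ten ASCII digits.
range_matches = re.fullmatch(r"[0-9]+", arabic_indic) is not None

# The re.ASCII flag restricts \d to the ASCII digits as well.
ascii_flag_matches = re.fullmatch(r"\d+", arabic_indic, re.ASCII) is not None

# Python's int() accepts Unicode decimal digits, so the conversion succeeds here.
# Other languages' string-to-integer functions may not.
converted = int(arabic_indic)

print(shorthand_matches, range_matches, ascii_flag_matches, converted)
```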

Beyond being a series of digits, the number must also stand alone. A4 is a paper size, not a number. There are several ways to make sure your regex only matches pure numbers.

If you want to check whether your string holds nothing but a number, simply put start-of-string and end-of-string anchors around your regex. \A and \Z are your best option, because their meaning doesn’t change. Unfortunately, JavaScript doesn’t support them. In JavaScript, use ^ and $, and make sure you don’t specify the /m flag that makes the caret and dollar match at line breaks. In Ruby, the caret and dollar always match at line breaks, so you can’t reliably use them to force your regex to match the whole string.
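
A minimal sketch of the anchoring advice, assuming Python's re module (the helper name is_positive_integer and the sample strings are hypothetical). In Python, \Z matches only at the very end of the string, while $ also matches just before a trailing newline, and re.MULTILINE plays the role of the /m flag.

```python
import re


def is_positive_integer(s):
    """Return True only if the whole string is one or more ASCII digits."""
    return re.search(r"\A[0-9]+\Z", s) is not None


subject = "12\nabc"

# \A and \Z keep their meaning regardless of flags, so this finds nothing.
strict = re.search(r"\A[0-9]+\Z", subject, re.MULTILINE)

# With multiline mode on, ^ and $ match at line breaks, so "12" is accepted.
loose = re.search(r"^[0-9]+$", subject, re.MULTILINE)

# Even without multiline mode, Python's $ matches before a final newline.
trailing_newline = re.search(r"^[0-9]+$", "12\n")

print(is_positive_integer("12"), strict, loose.group(0))
```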

When searching for numbers within a larger body of text, word boundaries (Recipe 2.6) are an easy solution. When you place them before or after a regex token that matches a digit, the word boundary makes sure there is no word character before or after the matched digit. For example, 4 matches 4 in A4. 4\b does too, because there’s no word character after the 4. \b4 and \b4\b don’t match anything in A4, because \b fails between the two word characters A and 4. In regular expressions, word characters include letters, digits, and underscores.
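
The four cases just described can be verified with a short sketch, here using Python's re module (the variable names are chosen for this example only):

```python
import re

subject = "A4"

# The four patterns from the text, tried against the same subject.
patterns = [r"4", r"4\b", r"\b4", r"\b4\b"]

# Map each pattern to whether it finds a match anywhere in the subject.
results = {p: re.search(p, subject) is not None for p in patterns}

for pattern, found in results.items():
    print(pattern, "matches" if found else "does not match", subject)
```

Only the two patterns without a leading word boundary find a match, because A and 4 are both word characters.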

If you include nonword characters such as plus or minus signs or whitespace in your regex, you have to be careful with the placement of word boundaries. To match +4 while excluding +4B, use \+4\b instead of \b\+4\b. The latter does not match +4, because there’s no word character before the plus in the subject string to satisfy the word boundary. \b\+4\b does match +4 in the text 3+4, because 3 is a word character and + is not.

\+4\b only needs one word boundary. The first \b in \+\b4\b is superfluous. When this regex matches, the first \b is always between a + and a 4, and thus never excludes anything. The first \b becomes important when the plus sign is optional. \+?\b4\b does not match the 4 in A4, whereas \+?4\b does.
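
The placement rules for word boundaries next to a plus sign can be sketched as follows, again assuming Python's re module. The helper matched_text is a hypothetical convenience for this example.

```python
import re


def matched_text(pattern, subject):
    """Return the text matched by pattern in subject, or None if there is no match."""
    m = re.search(pattern, subject)
    return m.group(0) if m else None


# A single trailing boundary matches +4 but rejects +4B.
print(matched_text(r"\+4\b", "+4"))      # +4
print(matched_text(r"\+4\b", "+4B"))     # None

# A leading boundary before the plus needs a word character in front of it.
print(matched_text(r"\b\+4\b", "+4"))    # None
print(matched_text(r"\b\+4\b", "3+4"))   # +4

# With an optional plus, the boundary before the digit starts to matter.
print(matched_text(r"\+?\b4\b", "A4"))   # None
print(matched_text(r"\+?4\b", "A4"))     # 4
```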

Word boundaries are not always the right solution. Consider the subject text $123,456.78. If you iterate over this string with the regex \b[0-9]+\b, it’ll match 123, 456, and 78. The dollar sign, comma, and decimal point are not word characters, so the word boundary matches between a digit and any of these characters. Sometimes this is what you want, sometimes not.
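
Iterating over that subject text looks like this in a small Python sketch (re.findall stands in for whatever match loop your language provides, and the second sample string is made up):

```python
import re

WORD_BOUNDED = re.compile(r"\b[0-9]+\b")

# Punctuation is not a word character, so every run of digits qualifies.
currency_parts = WORD_BOUNDED.findall("$123,456.78")
print(currency_parts)  # ['123', '456', '78']

# A digit glued to a letter is rejected, while a standalone number is found.
page_count = WORD_BOUNDED.findall("A4 has 42 pages")
print(page_count)  # ['42']
```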

If you only want to find integers surrounded by whitespace or the start or end of a string, you need to use lookaround instead of word boundaries. (?=$|\s) matches at the end of the string or before a character that is whitespace (whitespace includes line breaks). (?<=^|\s) matches either at the start of the string, or after a character that is whitespace. You can replace \s with a character class that matches any of the characters you want to allow before or after the number. See Recipe 2.16 to learn how lookaround works.

Perl and Python support lookbehind, but they don’t allow alternatives of different length inside lookbehind. Since ^ is zero-length and \s matches a single character, we have to put the ^ alternative outside the lookbehind. Thus (?<=^|\s) becomes (?:^|(?<=\s)) for Perl and Python. These two regexes are functionally identical. The latter just takes a bit more effort on the keyboard.
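
A sketch of the Python-compatible variant, assuming the standard re module (the sample text and variable names are made up for illustration). It also shows that Python's re module refuses to compile the version with ^ inside the lookbehind.

```python
import re

# The ^ alternative sits outside the lookbehind so that the lookbehind
# itself has a fixed width of one character.
STANDALONE = re.compile(r"(?:^|(?<=\s))[0-9]+(?=$|\s)")

text = "A4 42 $123,456.78 7 x9 10"
found = STANDALONE.findall(text)
print(found)  # ['42', '7', '10']

# The variant with alternatives of different lengths inside the lookbehind
# is rejected by Python's re module at compile time.
try:
    re.compile(r"(?<=^|\s)[0-9]+(?=$|\s)")
    lookbehind_accepted = True
except re.error:
    lookbehind_accepted = False
print(lookbehind_accepted)
```

Unlike the word-boundary regex, this one skips every piece of $123,456.78, because none of those digit runs is surrounded by whitespace or the string's edges.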

JavaScript and Ruby 1.8 don’t support lookbehind. You can use a normal group instead of lookbehind to check if the number occurs at the start of the string, or if it is preceded by whitespace. The drawback is that the whitespace character will be included in the overall regex match if the number doesn’t occur at the start of the string. An easy solution to that is to put the part of the regex that matches the number inside a capturing group. The Solution regex that allows leading whitespace in the match captures the whitespace character in the first capturing group and the matched integer in the second capturing group.
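
The capturing-group workaround can be sketched like this. Python is used only to keep the examples in one language, even though Python itself supports lookbehind. The function name standalone_integers and the sample text are assumptions for this example.

```python
import re

# Group 1 holds the leading whitespace (or nothing at the start of the string),
# group 2 holds the integer itself.
LEADING_WS = re.compile(r"(^|\s)([0-9]+)(?=$|\s)")


def standalone_integers(text):
    """Return only the digits of each match, ignoring any leading whitespace."""
    return [m.group(2) for m in LEADING_WS.finditer(text)]


sample = "42 and 7 or A4"
overall = [m.group(0) for m in LEADING_WS.finditer(sample)]
numbers = standalone_integers(sample)

print(overall)  # ['42', ' 7']
print(numbers)  # ['42', '7']
```

The overall match for the second number includes the space in front of it, which is why the second capturing group is the one to read.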

See Also

All the other recipes in this chapter show more ways of matching different kinds of numbers with a regular expression.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.
