6.9. Integer Numbers with Separators

Problem

You want to find various kinds of integer numbers in a larger body of text, or check whether a string variable holds an integer number. Underscores are allowed as separators between groups of digits, to make the integers easier to read. Numbers may not begin or end with an underscore. You want to allow decimal, octal, hexadecimal, and binary numbers. Hexadecimal and binary numbers must be prefixed with 0x and 0b, respectively.

0b0111_1111_1111_1111_1111_1111_1111_1111, 0177_7777_7777, 2_147_483_647, and 0x7fff_ffff are examples of valid numbers.

Solution

Find any decimal or octal integer with optional underscores in a larger body of text:

\b[0-9]+(_+[0-9]+)*\b
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Find any hexadecimal integer with optional underscores in a larger body of text:

\b0x[0-9A-F]+(_+[0-9A-F]+)*\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Find any binary integer with optional underscores in a larger body of text:

\b0b[01]+(_+[01]+)*\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Find any decimal, octal, hexadecimal, or binary integer with optional underscores in a larger body of text:

\b([0-9]+(_+[0-9]+)*|0x[0-9A-F]+(_+[0-9A-F]+)*|0b[01]+(_+[01]+)*)\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
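As a rough illustration, the combined search could be run from Python as sketched below. The function name find_integers and the sample string are hypothetical, and the sketch assumes the pattern is wrapped in word boundaries (\b, see Recipe 2.6) so that a match cannot begin or end in the middle of a word or number.

```python
import re

# Combined pattern for decimal/octal, hexadecimal, and binary integers.
# Assumption: word boundaries (\b) surround the alternation, so "0" in
# "0x7fff" is not reported as a decimal number on its own.
INTEGER = re.compile(
    r"\b([0-9]+(_+[0-9]+)*|0x[0-9A-F]+(_+[0-9A-F]+)*|0b[01]+(_+[01]+)*)\b",
    re.IGNORECASE,
)

def find_integers(text):
    """Return every integer literal found in text (hypothetical helper)."""
    # group(0) is the whole match, regardless of the capturing groups.
    return [m.group(0) for m in INTEGER.finditer(text)]

sample = "mask = 0x7fff_ffff, limit = 2_147_483_647, bits = 0b0111_1111, bad = 12_"
print(find_integers(sample))
```

The trailing 12_ is skipped because the underscore counts as a word character, so no word boundary exists between the 2 and the underscore.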

Check whether a text string holds just a decimal, octal, hexadecimal, or binary integer with optional underscores:

\A([0-9]+(_+[0-9]+)*|0x[0-9A-F]+(_+[0-9A-F]+)*|0b[01]+(_+[01]+)*)\Z
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^([0-9]+(_+[0-9]+)*|0x[0-9A-F]+(_+[0-9A-F]+)*|0b[01]+(_+[01]+)*)$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
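A minimal sketch of the whole-string check in Python follows. The helper name is_integer_literal is hypothetical, and the sketch assumes the \A and \Z anchors; in Python, \Z matches only at the very end of the string.

```python
import re

# Anchored pattern: the entire string must be one integer literal.
VALID_INTEGER = re.compile(
    r"\A([0-9]+(_+[0-9]+)*|0x[0-9A-F]+(_+[0-9A-F]+)*|0b[01]+(_+[01]+)*)\Z",
    re.IGNORECASE,
)

def is_integer_literal(s):
    """Return True if s is exactly one integer literal (hypothetical helper)."""
    return VALID_INTEGER.match(s) is not None

# The valid examples from the Problem section, then some strings that
# the recipe is meant to reject.
for candidate in ["0177_7777_7777", "2_147_483_647", "0x7fff_ffff",
                  "_1", "1_", "0x", "0x_ff", "0b102"]:
    print(candidate, is_integer_literal(candidate))
```

Note that 0x_ff is rejected: the pattern requires at least one digit directly after the prefix, so an underscore cannot follow 0x or 0b immediately.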

Discussion

Recipes 6.1, 6.2, and 6.3 explain in detail how to match integer numbers. These recipes do not allow underscores in the numbers, so their regular expressions can simply use [0-9]+, [0-9A-F]+, and [01]+ to match the digits of decimal, hexadecimal, and binary numbers.

If we wanted to allow underscores anywhere, we could just add the underscore to these three character classes. But we do not want to allow underscores at the start or the end. The first and last characters in the number must be digits. You might think of [0-9][0-9_]+[0-9] as an easy solution. But this requires at least three characters, so it fails to match one-digit and two-digit numbers. So we need a slightly more complex solution.
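The difference between these candidates can be checked with a short Python sketch. The variable names loose, naive, and recipe are made up for this illustration, and each pattern is anchored with \A and \Z so that the whole string is tested.

```python
import re

# Underscore simply added to the character class: accepts "_1" and "1_".
loose = re.compile(r"\A[0-9_]+\Z")
# Digit, then digits or underscores, then digit: needs three characters.
naive = re.compile(r"\A[0-9][0-9_]+[0-9]\Z")
# The pattern used in this recipe for decimal and octal numbers.
recipe = re.compile(r"\A[0-9]+(_+[0-9]+)*\Z")

def verdicts(s):
    """Return (loose, naive, recipe) match results for s as booleans."""
    return tuple(bool(p.match(s)) for p in (loose, naive, recipe))

for s in ["7", "42", "1_000", "_1", "1_"]:
    print(s, verdicts(s))
```

Only the recipe pattern both accepts short numbers such as 7 and rejects a leading or trailing underscore.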

Our solution [0-9]+(_+[0-9]+)* uses [0-9]+ to match the initial digit or digits as before. We add (_+[0-9]+)* to allow the digits to be followed by one or more underscores, as long as those underscores are followed by more digits. _+ allows one or more sequential underscores. [0-9]+ allows one or more digits after the underscores. We put those two inside a group that we repeat zero or more times with an asterisk. This allows any number of nonsequential underscores with digits in between them and after them, while also allowing numbers with no underscores at all.
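The same breakdown can be written out piece by piece using Python's free-spacing mode, as a sketch. The name DECIMAL is hypothetical, and the \A and \Z anchors are added here only so that whole strings can be tested.

```python
import re

DECIMAL = re.compile(
    r"""
    \A
    [0-9]+        # initial digit or digits
    (             # group, repeated zero or more times:
      _+          #   one or more sequential underscores
      [0-9]+      #   which must be followed by one or more digits
    )*
    \Z
    """,
    re.VERBOSE,   # whitespace and # comments in the pattern are ignored
)

for s in ["1", "1_2", "1___2_3", "1_", "_1"]:
    print(s, bool(DECIMAL.match(s)))
```

Because every run of underscores inside the group must be followed by at least one digit, a match can never end with an underscore.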

See Also

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition.
