You want to match an integer number within a certain range of numbers. You want the regular expression to specify the range accurately, rather than just limiting the number of digits.
1 to 12 (hour or month):
^(1[0-2]|[1-9])$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
1 to 24 (hour):
^(2[0-4]|1[0-9]|[1-9])$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
1 to 31 (day of the month):
^(3[01]|[12][0-9]|[1-9])$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
1 to 53 (week of the year):
^(5[0-3]|[1-4][0-9]|[1-9])$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
0 to 59 (minute or second):
^[1-5]?[0-9]$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
0 to 100 (percentage):
^(100|[1-9]?[0-9])$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
1 to 100:
^(100|[1-9][0-9]?)$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
32 to 126 (printable ASCII codes):
^(12[0-6]|1[01][0-9]|[4-9][0-9]|3[2-9])$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
0 to 127 (nonnegative signed byte):
^(12[0-7]|1[01][0-9]|[1-9]?[0-9])$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
–128 to 127 (signed byte):
^(12[0-7]|1[01][0-9]|[1-9]?[0-9]|-(12[0-8]|1[01][0-9]|[1-9]?[0-9]))$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
0 to 255 (unsigned byte):
^(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
1 to 366 (day of the year):
^(36[0-6]|3[0-5][0-9]|[12][0-9]{2}|[1-9][0-9]?)$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
1900 to 2099 (year):
^(19|20)[0-9]{2}$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
0 to 32767 (nonnegative signed word):
^(3276[0-7]|327[0-5][0-9]|32[0-6][0-9]{2}|3[01][0-9]{3}|[12][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
–32768 to 32767 (signed word):
^(3276[0-7]|327[0-5][0-9]|32[0-6][0-9]{2}|3[01][0-9]{3}|[12][0-9]{4}|[1-9][0-9]{1,3}|[0-9]|-(3276[0-8]|327[0-5][0-9]|32[0-6][0-9]{2}|3[01][0-9]{3}|[12][0-9]{4}|[1-9][0-9]{1,3}|[0-9]))$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
0 to 65535 (unsigned word):
^(6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9])$
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
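One way to gain confidence in range regexes like these is to test them against every number in and around the range. The following Python sketch is not part of the recipe: the helper name `matches_exactly` and the size of the safety margin are our own assumptions.

```python
import re

def matches_exactly(pattern, low, high, margin=1000):
    """Return True if pattern accepts exactly the integers low..high.

    Every integer from low - margin to high + margin is written out as
    decimal text and tested against the whole string.
    """
    regex = re.compile(pattern)
    for n in range(low - margin, high + margin + 1):
        accepted = regex.fullmatch(str(n)) is not None
        if accepted != (low <= n <= high):
            return False
    return True

# Two regexes from the solution list.
print(matches_exactly(r"^(1[0-2]|[1-9])$", 1, 12))  # True
print(matches_exactly(r"^(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])$", 0, 255))  # True

# Limiting only the number of digits is not accurate: 256 slips through.
print(matches_exactly(r"^[0-9]{1,3}$", 0, 255))  # False
```

The check only covers numbers in their canonical decimal form, so inputs with leading zeros or a leading plus sign would need separate tests.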
The previous recipes matched integers with any number of digits, or with a certain number of digits. They allowed the full range of digits for all the digits in the number. Such regular expressions are very straightforward.
Matching a number in a specific range (e.g., a number between 0 and 255) is not a simple task with regular expressions. You can’t write ‹[0-255]›. Well, you could, but it wouldn’t match a number between 0 and 255. This character class, which is equivalent to ‹[0125]›, matches a single character that is one of the digits 0, 1, 2, or 5.
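A quick experiment makes the point concrete. This is a small Python sketch of our own, using only the standard `re` module:

```python
import re

# [0-255] is a character class: the range 0-2 plus the digit 5 listed twice.
digits = "0123456789"
matched = [d for d in digits if re.fullmatch(r"[0-255]", d)]
print(matched)  # ['0', '1', '2', '5']

# A character class matches exactly one character, so a multi-digit
# number can never match it in full.
print(re.fullmatch(r"[0-255]", "255"))  # None
```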
Because these regular expressions are quite a bit longer, the solutions all use anchors to make the regex suitable to check whether a string, such as user input, consists of a single acceptable number. Recipe 6.1 explains how you can use word boundaries or lookaround instead of the anchors for other purposes. In the discussion, we show the regexes without any anchors, to keep the focus on dealing with numeric ranges. If you want to use any of the regexes from the discussion, you’ll have to add anchors or word boundaries to make sure your regex doesn’t match digits that are part of a longer number.
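To see why the anchors or word boundaries matter, consider this Python sketch. It uses the 0 to 255 regex from the solution without its anchors; the variable name `byte` and the sample strings are our own.

```python
import re

# The 0 to 255 regex from the solution, without anchors.
byte = r"25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9]"

# Unanchored, the regex happily matches part of a longer number.
print(re.search(byte, "2560").group())  # 25

# Anchors reject the longer number when validating a whole string.
print(re.search(r"^(?:" + byte + r")$", "2560"))  # None

# Word boundaries find complete in-range numbers inside longer text.
print(re.findall(r"\b(?:" + byte + r")\b", "values 7, 255 and 2560"))  # ['7', '255']
```

The noncapturing group is needed in both cases so that the anchors or word boundaries apply to all the alternatives rather than only the first and last.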
Regular expressions work character by character. If we want to match a number that consists of more than one digit, we have to spell out all the possible combinations for the digits. The essential building blocks are character classes (Recipe 2.3) and alternation (Recipe 2.8).
In character classes, we can use ranges for single digits, such as ‹[0-5]›. That’s because the characters for the digits 0 through 9 occupy consecutive positions in the ASCII and Unicode character tables. ‹[0-5]› matches one of six characters, just like ‹[j-o]› and ‹[\x09-\x0E]› match different ranges of six characters.
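The following Python sketch lists the characters each of these classes matches. The helper `class_members` is a hypothetical name of our own, and it only looks at the 128 ASCII characters.

```python
import re

def class_members(char_class):
    """Return the ASCII characters that the given character class matches."""
    regex = re.compile(char_class)
    return [chr(cp) for cp in range(128) if regex.fullmatch(chr(cp))]

print(class_members(r"[0-5]"))  # ['0', '1', '2', '3', '4', '5']
print(class_members(r"[j-o]"))  # ['j', 'k', 'l', 'm', 'n', 'o']

# Control characters are easier to read as code points.
print([ord(c) for c in class_members(r"[\x09-\x0E]")])  # [9, 10, 11, 12, 13, 14]

# The digits are consecutive in the character table.
print([ord(c) for c in "0123456789"])  # [48, 49, ..., 57]
```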
When a numeric range is represented as text, it consists of a number of positions. Each position allows a certain range of digits. Some ranges have a fixed number of positions, such as 12 to 24. Others have a variable number of positions, such as 1 to 12. The range of digits allowed by each position can be either interdependent or independent of the digits in the other positions. In the range 40 to 59, the positions are independent. In the range 44 to 55, the positions are interdependent.
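One way to make "independent" precise: the positions are independent when every combination of the digits seen at each position is itself a number in the range. The Python sketch below tests that definition for ranges with a fixed number of positions; the function name `positions_independent` is our own.

```python
from itertools import product

def positions_independent(low, high):
    """Check whether a fixed-width range has independent digit positions."""
    numbers = {str(n) for n in range(low, high + 1)}
    width = len(str(high))
    if any(len(text) != width for text in numbers):
        raise ValueError("range must have a fixed number of positions")
    # Collect the digits that occur at each position.
    per_position = [sorted({text[i] for text in numbers}) for i in range(width)]
    # Independent means: every combination of those digits is in the range.
    combinations = {"".join(combo) for combo in product(*per_position)}
    return combinations == numbers

print(positions_independent(40, 59))  # True
print(positions_independent(44, 55))  # False
```

For 44 to 55, both positions see more digits than they can freely combine: 4 followed by 0, for example, would give 40, which is outside the range.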
The easiest ranges are those with a fixed number of independent positions, such as 40 to 59. To code these as a regular expression, all you need to do is to string together a bunch of character classes. Use one character class for each position, specifying the range of digits allowed at that position.
[45][0-9]
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
The range 40 to 59 requires a number with two digits. Thus we need two character classes. The first digit must be a 4 or 5. The character class ‹[45]› matches either digit. The second digit can be any of the 10 digits. ‹[0-9]› does the trick.
We could also have used the shorthand ‹\d› instead of ‹[0-9]›. We use the explicit range ‹[0-9]› for consistency with the other character classes, to help maintain readability. Reducing the number of backslashes in your regexes is also very helpful if you’re working with a programming language such as Java that requires backslashes to be escaped in literal strings.
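As a side note that goes slightly beyond the recipe, the two spellings are not always interchangeable. In Python 3, for instance, the shorthand in a text pattern also matches decimal digits outside ASCII unless the `re.ASCII` flag is set. The sketch below illustrates this for Python only; other flavors may behave differently.

```python
import re

explicit = re.compile(r"[45][0-9]")
shorthand = re.compile(r"[45]\d")

# For two-digit ASCII strings, both regexes accept the same inputs.
same = all(
    bool(explicit.fullmatch(f"{n:02d}")) == bool(shorthand.fullmatch(f"{n:02d}"))
    for n in range(100)
)
print(same)  # True

# In Python 3, \d in a str pattern also matches non-ASCII decimal digits.
arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE
print(bool(shorthand.fullmatch("4" + arabic_three)))  # True
print(bool(explicit.fullmatch("4" + arabic_three)))   # False

# The re.ASCII flag restricts \d to the digits 0 through 9.
print(bool(re.fullmatch(r"[45]\d", "4" + arabic_three, re.ASCII)))  # False
```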
The numbers in the range 44 to 55 also need two positions, but they’re not independent. The first digit must be 4 or 5. If the first digit is 4, the second digit must be between 4 and 9. That covers the numbers 44 to 49. If the first digit is 5, the second digit must be between 0 and 5. That covers the numbers 50 to 55. To create our regex, we simply use alternation to combine the two ranges:
4[4-9]|5[0-5]
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
By using alternation, we’re telling the regex engine to match ‹4[4-9]› or ‹5[0-5]›. The alternation operator has the lowest precedence of all regex operators, and so we don’t need to group the digits, as in ‹(4[4-9])|(5[0-5])›.
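The low precedence of alternation has a flip side once you add anchors: without a group, each anchor belongs to only one alternative. This Python sketch, with sample strings of our own choosing, shows the difference.

```python
import re

bare = r"4[4-9]|5[0-5]"

# Without a group, this reads as "^4[4-9]" OR "5[0-5]$".
ungrouped = "^" + bare + "$"
print(re.search(ungrouped, "449").group())  # 44
print(re.search(ungrouped, "155").group())  # 55

# With a group, both anchors apply to both alternatives.
grouped = "^(" + bare + ")$"
print(re.search(grouped, "449"))          # None
print(re.search(grouped, "155"))          # None
print(re.search(grouped, "44").group())   # 44
```

This is why the regexes in the solution wrap their alternatives in parentheses before adding the anchors.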
You can string together as many ranges using alternation as you want. The range 34 to 65 also has two interdependent positions. The first digit must be between 3 and 6. If the first digit is 3, the second must be 4 to 9. If the first is 4 or 5, the second can be any digit. If the first is 6, the second must be 0 to 5:
3[4-9]|[45][0-9]|6[0-5]
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Just like we use alternation to split ranges with interdependent positions into multiple ranges with independent positions, we can use alternation to split ranges with a variable number of positions into multiple ranges with a fixed number of positions. The range 1 to 12 has numbers with one or two positions. We split this into the range 1 to 9 with one position, and the range 10 to 12 with two positions. The positions in each of these two ranges are independent, so we don’t need to split them up further:
1[0-2]|[1-9]
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
We listed the range with two digits before the one with a single digit. This is intentional because the regular expression engine is eager. It scans the alternatives from left to right, and stops as soon as one matches. If your subject text is 12, then ‹1[0-2]|[1-9]› matches 12, whereas ‹[1-9]|1[0-2]› matches just 1. The first alternative, ‹[1-9]›, is tried first. Since that alternative is happy to match just 1, the regex engine never tries to check whether ‹1[0-2]› might offer a “better” solution.
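The effect of the ordering is easy to reproduce. The Python sketch below matches both orderings at the start of the subject text 12; the last line is our own addition, showing that Python's backtracking engine does try the second alternative once something forces the whole string to match.

```python
import re

# Longer alternative first: the whole number is matched.
print(re.match(r"1[0-2]|[1-9]", "12").group())  # 12

# Shorter alternative first: the engine stops after the first success.
print(re.match(r"[1-9]|1[0-2]", "12").group())  # 1

# When the entire string must match, the first alternative fails at the
# end of the string and the engine backtracks into the second one.
print(re.fullmatch(r"[1-9]|1[0-2]", "12").group())  # 12
```

Relying on that backtracking is fragile, so listing the longer alternatives first remains the safer habit.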
The range 85 to 117 includes numbers of two different lengths. The range 85 to 99 has two positions, and the range 100 to 117 has three positions. The positions in these ranges are interdependent, and so we have to split them up further. For the two-digit range, if the first digit is 8, the second must be between 5 and 9. If the first digit is 9, the second digit can be any digit. For the three-digit range, the first position allows only the digit 1. If the second position has the digit 0, the third position allows any digit. But if the second digit is 1, then the third digit must be between 0 and 7. This gives us four ranges total: 85 to 89, 90 to 99, 100 to 109, and 110 to 117. Though things are getting long-winded, the regular expression remains as straightforward as the previous ones:
8[5-9]|9[0-9]|10[0-9]|11[0-7]
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
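To confirm that the four alternatives cover the four subranges and nothing else, we can ask each alternative which numbers it accepts. This is a Python sketch of our own; the upper test limit of 1000 is an arbitrary assumption.

```python
import re

alternatives = ["8[5-9]", "9[0-9]", "10[0-9]", "11[0-7]"]

covered = []
for alternative in alternatives:
    # Collect every number below 1000 that this alternative matches in full.
    hits = [n for n in range(1000) if re.fullmatch(alternative, str(n))]
    covered.extend(hits)
    print(alternative, hits[0], "to", hits[-1])
# 8[5-9] 85 to 89
# 9[0-9] 90 to 99
# 10[0-9] 100 to 109
# 11[0-7] 110 to 117

# Together the alternatives cover 85 to 117 with no gaps or overlaps.
print(covered == list(range(85, 118)))  # True
```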
That’s all there really is to matching numeric ranges with regular expressions: simply split up the range until you have ranges with a fixed number of positions with independent digits. This way, you’ll always get a correct regular expression that is easy to read and maintain, even if it may get a bit long-winded.
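The splitting procedure is mechanical enough to automate. The following Python sketch is our own illustration of it, not code from the recipe: the names `range_to_regex`, `fixed_width`, and `digit_class` are hypothetical, and the sketch handles only nonnegative integers without leading zeros. It first splits the range by number of positions, then splits each fixed-width range until every piece has independent positions.

```python
import re

def digit_class(x, y):
    """Regex for a single digit between x and y inclusive."""
    if x == y:
        return str(x)
    if y - x == 1:
        return f"[{x}{y}]"
    return f"[{x}-{y}]"

def fixed_width(a, b):
    """Alternatives for the range a..b, given as digit strings of equal length."""
    if a == b:
        return [a]
    if len(a) == 1:
        return [digit_class(int(a), int(b))]
    if a[0] == b[0]:
        return [a[0] + rest for rest in fixed_width(a[1:], b[1:])]
    n = len(a) - 1
    first, last = int(a[0]), int(b[0])
    head, tail = [], []
    if a[1:] != "0" * n:  # lowest leading digit has a restricted remainder
        head = [a[0] + rest for rest in fixed_width(a[1:], "9" * n)]
        first += 1
    if b[1:] != "9" * n:  # highest leading digit has a restricted remainder
        tail = [b[0] + rest for rest in fixed_width("0" * n, b[1:])]
        last -= 1
    # Leading digits in between allow any digit in the remaining positions.
    middle = [digit_class(first, last) + "[0-9]" * n] if first <= last else []
    return head + middle + tail

def range_to_regex(low, high):
    """Unanchored regex for low..high, with the longest alternatives first."""
    if not 0 <= low <= high:
        raise ValueError("need 0 <= low <= high")
    alternatives = []
    for width in range(len(str(high)), len(str(low)) - 1, -1):
        smallest = 0 if width == 1 else 10 ** (width - 1)
        largest = 10 ** width - 1
        alternatives += fixed_width(str(max(low, smallest)), str(min(high, largest)))
    return "|".join(alternatives)

print(range_to_regex(1, 12))    # 1[0-2]|[1-9]
print(range_to_regex(34, 65))   # 3[4-9]|[45][0-9]|6[0-5]
print(range_to_regex(85, 117))  # 10[0-9]|11[0-7]|8[5-9]|9[0-9]
print(bool(re.fullmatch(range_to_regex(85, 117), "117")))  # True
```

The sketch always lists the alternatives with more positions first, so for 85 to 117 the order differs from the regex shown earlier, although both accept the same numbers. The output uses repeated character classes rather than quantifiers.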
There are some extra techniques that allow for shorter regular expressions. For example, using the previous system, the range 0 to 65535 would require this regex:
6553[0-5]|655[0-2][0-9]|65[0-4][0-9][0-9]|6[0-4][0-9][0-9][0-9]|[1-5][0-9][0-9][0-9][0-9]|[1-9][0-9][0-9][0-9]|[1-9][0-9][0-9]|[1-9][0-9]|[0-9]
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
This regular expression works perfectly, and you won’t be able to come up with a regex that runs measurably faster. Any optimizations that could be made (e.g., there are various alternatives starting with a 6) are already made by the regular expression engine when it compiles your regular expression. There’s no need to waste your time to make your regex more complicated in the hopes of getting it faster. But you can make your regex shorter, to reduce the amount of typing you need to do, while still keeping it readable.
Several of the alternatives have identical character classes next to each other. You can eliminate the duplication by using quantifiers. Recipe 2.12 tells you all about those.
6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{3}|[1-9][0-9]{2}|[1-9][0-9]|[0-9]
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
The ‹[1-9][0-9]{3}|[1-9][0-9]{2}|[1-9][0-9]› part of the regex has three very similar alternatives, and they all have the same pair of character classes. The only difference is the number of times the second class is repeated. We can easily combine that into ‹[1-9][0-9]{1,3}›.
6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9]
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
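Shortening a regex by hand invites mistakes, so it is worth checking that each shorter form still accepts the same numbers. This Python sketch compares the three forms of the 0 to 65535 regex; the helper `accepted` and the test limit of 100000 are our own assumptions.

```python
import re

spelled_out = (
    "6553[0-5]|655[0-2][0-9]|65[0-4][0-9][0-9]|6[0-4][0-9][0-9][0-9]|"
    "[1-5][0-9][0-9][0-9][0-9]|[1-9][0-9][0-9][0-9]|[1-9][0-9][0-9]|"
    "[1-9][0-9]|[0-9]"
)
with_quantifiers = (
    "6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|"
    "[1-9][0-9]{3}|[1-9][0-9]{2}|[1-9][0-9]|[0-9]"
)
combined = (
    "6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|"
    "[1-9][0-9]{1,3}|[0-9]"
)

def accepted(pattern, limit=100000):
    """Return the set of integers below limit that the pattern matches in full."""
    regex = re.compile(pattern)
    return {n for n in range(limit) if regex.fullmatch(str(n))}

expected = set(range(65536))
for pattern in (spelled_out, with_quantifiers, combined):
    # Print the length of each form and whether it accepts exactly 0..65535.
    print(len(pattern), accepted(pattern) == expected)
```

Each line of output should end in True, with the length shrinking from one form to the next.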
Any further tricks will hurt readability. For example, you could isolate the leading 6 from the first four alternatives:
6(?:553[0-5]|55[0-2][0-9]|5[0-4][0-9]{2}|[0-4][0-9]{3})|[1-5][0-9]{4}|[1-9][0-9]{1,3}|[0-9]
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
But this regex is actually one character longer because we had to add a noncapturing group to isolate the alternatives with the leading 6 from the other alternatives. You won’t get a performance benefit with any of the regex flavors discussed in this book. They all make this optimization internally.
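The length claim and the equivalence of the two forms can both be checked in a few lines of Python. This is only a sketch of our own; it says nothing about performance.

```python
import re

flat = (
    "6553[0-5]|655[0-2][0-9]|65[0-4][0-9]{2}|6[0-4][0-9]{3}|[1-5][0-9]{4}|"
    "[1-9][0-9]{1,3}|[0-9]"
)
grouped = (
    "6(?:553[0-5]|55[0-2][0-9]|5[0-4][0-9]{2}|[0-4][0-9]{3})|[1-5][0-9]{4}|"
    "[1-9][0-9]{1,3}|[0-9]"
)

# Four leading 6s are removed, but "6(?:" and ")" add five characters.
print(len(grouped) - len(flat))  # 1

# Both forms accept exactly the same numbers below 100000.
flat_regex = re.compile(flat)
grouped_regex = re.compile(grouped)
same = all(
    bool(flat_regex.fullmatch(str(n))) == bool(grouped_regex.fullmatch(str(n)))
    for n in range(100000)
)
print(same)  # True
```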
All the other recipes in this chapter show more ways of matching different kinds of numbers with a regular expression. Recipe 6.8 shows how to match ranges of hexadecimal numbers.
Recipe 4.12 shows how to remove specific numbers from a valid range, using negative lookahead.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition.