You want to check whether a string represents a valid IPv6 address using the standard, compressed, and/or mixed notations.
Match an IPv6 address in standard notation, which consists of
eight 16-bit words using hexadecimal notation, delimited by colons
(e.g., 1762:0:0:0:0:B03:1:AF18). Leading zeros are optional.
Check whether the whole subject text is an IPv6 address using standard notation:
^(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}$
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python |
\A(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}\Z
Regex options: Case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
Find an IPv6 address using standard notation within a larger collection of text:
(?<![:.\w])(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}(?![:.\w])
Regex options: Case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9 |
JavaScript and Ruby 1.8 don’t support lookbehind. We have to remove the check at the start of the regex that keeps it from finding IPv6 addresses within longer sequences of hexadecimal digits and colons. A word boundary performs part of the test:
\b(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}\b
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
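As a rough illustration of how the standard-notation regexes might be used from code, here is a minimal Python sketch. The function names `is_standard_ipv6` and `find_standard_ipv6` are made up for this example; only the regular expressions come from the recipe.

```python
import re

# Whole-string check: \A and \Z anchor the regex to the start and end.
STANDARD_IPV6 = re.compile(
    r"\A(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}\Z", re.IGNORECASE
)

# Search variant: the lookarounds reject matches embedded in longer
# runs of word characters, dots, or colons.
STANDARD_IPV6_IN_TEXT = re.compile(
    r"(?<![:.\w])(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}(?![:.\w])",
    re.IGNORECASE,
)


def is_standard_ipv6(text):
    """Return True if the whole string is a standard-notation address."""
    return STANDARD_IPV6.search(text) is not None


def find_standard_ipv6(text):
    """Return all standard-notation addresses found in the text."""
    return STANDARD_IPV6_IN_TEXT.findall(text)
```

Because the regex contains only noncapturing groups, `findall` returns the full text of each match.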
Match an IPv6 address in mixed notation, which consists
of six 16-bit words using hexadecimal notation, followed by four bytes
using decimal notation. The words are delimited with colons, and the
bytes with dots. A colon separates the words from the bytes. Leading
zeros are optional for both the hexadecimal words and the decimal
bytes. This notation is used in situations where IPv4 and IPv6 are
mixed, and the IPv6 addresses are extensions of the IPv4 addresses.
1762:0:0:0:0:B03:127.32.67.15 is an example of an IPv6 address in mixed notation.
Check whether the whole subject text is an IPv6 address using mixed notation:
^(?:[A-F0-9]{1,4}:){6}(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])$
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
Find an IPv6 address using mixed notation within a larger collection of text:
(?<![:.\w])(?:[A-F0-9]{1,4}:){6}(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(?![:.\w])
Regex options: Case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9 |
JavaScript and Ruby 1.8 don’t support lookbehind. We have to remove the check at the start of the regex that keeps it from finding IPv6 addresses within longer sequences of hexadecimal digits and colons. A word boundary performs part of the test:
\b(?:[A-F0-9]{1,4}:){6}(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\b
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
Match an IPv6 address using standard or mixed notation.
Check whether the whole subject text is an IPv6 address using standard or mixed notation:
\A                                # Start of string
(?:[A-F0-9]{1,4}:){6}             # 6 words
(?:[A-F0-9]{1,4}:[A-F0-9]{1,4}    # 2 words
| (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}  # or 4 bytes
  (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
)\Z                               # End of string
Regex options: Free-spacing, case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
^(?:[A-F0-9]{1,4}:){6}(?:[A-F0-9]{1,4}:[A-F0-9]{1,4}|(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]))$
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python |
Find an IPv6 address using standard or mixed notation within a larger collection of text:
(?<![:.\w])                       # Anchor address
(?:[A-F0-9]{1,4}:){6}             # 6 words
(?:[A-F0-9]{1,4}:[A-F0-9]{1,4}    # 2 words
| (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}  # or 4 bytes
  (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
)(?![:.\w])                       # Anchor address
Regex options: Free-spacing, case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9 |
JavaScript and Ruby 1.8 don’t support lookbehind. We have to remove the check at the start of the regex that keeps it from finding IPv6 addresses within longer sequences of hexadecimal digits and colons. A word boundary performs part of the test:
\b                                # Word boundary
(?:[A-F0-9]{1,4}:){6}             # 6 words
(?:[A-F0-9]{1,4}:[A-F0-9]{1,4}    # 2 words
| (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}  # or 4 bytes
  (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
)\b                               # Word boundary
Regex options: Free-spacing, case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
\b(?:[A-F0-9]{1,4}:){6}(?:[A-F0-9]{1,4}:[A-F0-9]{1,4}|(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]))\b
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
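The combined regex is easier to read when it is assembled from named pieces. The following Python sketch builds the whole-string version for standard or mixed notation. The names `WORD`, `BYTE`, and `is_standard_or_mixed_ipv6` are assumptions introduced for this example, not part of the recipe.

```python
import re

# One 16-bit word: 1 to 4 hexadecimal digits.
WORD = r"[A-F0-9]{1,4}"
# One decimal byte, 0 to 255, without leading zeros.
BYTE = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"

# Six words, then either two more words or a dotted IPv4 address.
# Doubled braces ({{6}}) produce literal braces in an f-string.
STANDARD_OR_MIXED = re.compile(
    rf"\A(?:{WORD}:){{6}}(?:{WORD}:{WORD}|(?:{BYTE}\.){{3}}{BYTE})\Z",
    re.IGNORECASE,
)


def is_standard_or_mixed_ipv6(text):
    """Return True for a whole-string standard or mixed address."""
    return STANDARD_OR_MIXED.search(text) is not None
```

Building the pattern from `WORD` and `BYTE` keeps the IPv4 part in one place, at the cost of having to double the braces of the quantifiers.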
Match an IPv6 address using compressed notation. Compressed notation is the same as standard notation, except that one sequence of one or more words that are zero may be omitted, leaving only the colons before and after the omitted zeros. Addresses using compressed notation can be recognized by the occurrence of two adjacent colons in the address. Only one sequence of zeros may be omitted; otherwise, it would be impossible to determine how many words have been omitted in each sequence. If the omitted sequence of zeros is at the start or the end of the IP address, it will begin or end with two colons. If all numbers are zero, the compressed IPv6 address consists of just two colons, without any digits.
For example, 1762::B03:1:AF18 is the compressed form of 1762:0:0:0:0:B03:1:AF18. The regular expressions in this section will match both the compressed and the standard form of the IPv6 address.
Check whether the whole subject text is an IPv6 address using standard or compressed notation:
\A(?:
 # Standard
 (?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}
 # Compressed with at most 7 colons
 |(?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}
     \Z)  # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)
 # Compressed with 8 colons
 |(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7}
)\Z
Regex options: Free-spacing, case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
^(?:(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}|(?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}$)(([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7})$
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python |
Find an IPv6 address using standard or compressed notation within a larger collection of text:
(?<![:.\w])(?:
 # Standard
 (?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}
 # Compressed with at most 7 colons
 |(?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}
     (?![:.\w]))  # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)
 # Compressed with 8 colons
 |(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7}
)(?![:.\w])
Regex options: Free-spacing, case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9 |
JavaScript and Ruby 1.8 don’t support lookbehind, so we have to remove the check at the start of the regex that keeps it from finding IPv6 addresses within longer sequences of hexadecimal digits and colons. We cannot use a word boundary, because the address may start with a colon, which is not a word character:
(?:
 # Standard
 (?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}
 # Compressed with at most 7 colons
 |(?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}
     (?![:.\w]))  # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)
 # Compressed with 8 colons
 |(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7}
)(?![:.\w])
Regex options: Free-spacing, case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
(?:(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}|(?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}(?![:.\w]))(([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7})(?![:.\w])
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
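To see which strings the whole-string regex for standard or compressed notation accepts, it can be tried in Python's free-spacing mode (`re.VERBOSE`). This is only a sketch; the function name `is_standard_or_compressed_ipv6` is an assumption made for the example.

```python
import re

STANDARD_OR_COMPRESSED = re.compile(
    r"""\A(?:
      # Standard
      (?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}
      # Compressed with at most 7 colons
      |(?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}
          \Z)  # and anchored
       # and at most 1 double colon
       (([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)
      # Compressed with 8 colons
      |(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7}
    )\Z""",
    re.IGNORECASE | re.VERBOSE,
)


def is_standard_or_compressed_ipv6(text):
    """Return True for a whole-string standard or compressed address."""
    return STANDARD_OR_COMPRESSED.search(text) is not None
```

Addresses such as `1762::B03:1:AF18`, `::`, and `1:2:3:4:5:6:7::` are accepted, while `1::2::3` (two double colons) and `1:2:3:4::5:6:7:8` (eight words plus a double colon) are rejected.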
Match an IPv6 address using compressed mixed notation. Compressed mixed notation is the same as mixed notation, except that one sequence of one or more words that are zero may be omitted, leaving only the colons before and after the omitted zeros. The four decimal bytes must all be specified, even if they are zero. Addresses using compressed mixed notation can be recognized by the occurrence of two adjacent colons in the first part of the address and the three dots in the second part. Only one sequence of zeros may be omitted; otherwise, it would be impossible to determine how many words have been omitted in each sequence. If the omitted sequence of zeros is at the start of the IP address, it will begin with two colons rather than with a digit.
For example, the IPv6 address 1762::B03:127.32.67.15 is the compressed form of 1762:0:0:0:0:B03:127.32.67.15. The regular expressions in this section will match both compressed and noncompressed IPv6 addresses using mixed notation.
Check whether the whole subject text is an IPv6 address using compressed or noncompressed mixed notation:
\A
(?:
 # Non-compressed
 (?:[A-F0-9]{1,4}:){6}
 # Compressed with at most 6 colons
 |(?=(?:[A-F0-9]{0,4}:){0,6}
     (?:[0-9]{1,3}\.){3}[0-9]{1,3}  # and 4 bytes
     \Z)                            # and anchored
  # and at most 1 double colon
  (([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)
 # Compressed with 7 colons and 5 numbers
 |::(?:[A-F0-9]{1,4}:){5}
)
# 255.255.255.
(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}
# 255
(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
\Z
Regex options: Free-spacing, case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
^(?:(?:[A-F0-9]{1,4}:){6}|(?=(?:[A-F0-9]{0,4}:){0,6}(?:[0-9]{1,3}\.){3}[0-9]{1,3}$)(([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)|::(?:[A-F0-9]{1,4}:){5})(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])$
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python |
Find an IPv6 address using compressed or noncompressed mixed notation within a larger collection of text:
(?<![:.\w])
(?:
 # Non-compressed
 (?:[A-F0-9]{1,4}:){6}
 # Compressed with at most 6 colons
 |(?=(?:[A-F0-9]{0,4}:){0,6}
     (?:[0-9]{1,3}\.){3}[0-9]{1,3}  # and 4 bytes
     (?![:.\w]))                    # and anchored
  # and at most 1 double colon
  (([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)
 # Compressed with 7 colons and 5 numbers
 |::(?:[A-F0-9]{1,4}:){5}
)
# 255.255.255.
(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}
# 255
(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
(?![:.\w])
Regex options: Free-spacing, case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9 |
JavaScript and Ruby 1.8 don’t support lookbehind, so we have to remove the check at the start of the regex that keeps it from finding IPv6 addresses within longer sequences of hexadecimal digits and colons. We cannot use a word boundary, because the address may start with a colon, which is not a word character.
(?:
 # Non-compressed
 (?:[A-F0-9]{1,4}:){6}
 # Compressed with at most 6 colons
 |(?=(?:[A-F0-9]{0,4}:){0,6}
     (?:[0-9]{1,3}\.){3}[0-9]{1,3}  # and 4 bytes
     (?![:.\w]))                    # and anchored
  # and at most 1 double colon
  (([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)
 # Compressed with 7 colons and 5 numbers
 |::(?:[A-F0-9]{1,4}:){5}
)
# 255.255.255.
(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}
# 255
(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
(?![:.\w])
Regex options: Free-spacing, case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
(?:(?:[A-F0-9]{1,4}:){6}|(?=(?:[A-F0-9]{0,4}:){0,6}(?:[0-9]{1,3}\.){3}[0-9]{1,3}(?![:.\w]))(([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)|::(?:[A-F0-9]{1,4}:){5})(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(?![:.\w])
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
Match an IPv6 address using any of the notations explained earlier: standard, mixed, compressed, and compressed mixed.
Check whether the whole subject text is an IPv6 address:
\A(?:
 # Mixed
 (?:
  # Non-compressed
  (?:[A-F0-9]{1,4}:){6}
  # Compressed with at most 6 colons
  |(?=(?:[A-F0-9]{0,4}:){0,6}
      (?:[0-9]{1,3}\.){3}[0-9]{1,3}  # and 4 bytes
      \Z)                            # and anchored
   # and at most 1 double colon
   (([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)
  # Compressed with 7 colons and 5 numbers
  |::(?:[A-F0-9]{1,4}:){5}
 )
 # 255.255.255.
 (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}
 # 255
 (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
|# Standard
 (?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}
|# Compressed with at most 7 colons
 (?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}
    \Z)  # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)
 # Compressed with 8 colons
|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7}
)\Z
Regex options: Free-spacing, case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
^(?:(?:(?:[A-F0-9]{1,4}:){6}|(?=(?:[A-F0-9]{0,4}:){0,6}(?:[0-9]{1,3}\.){3}[0-9]{1,3}$)(([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)|::(?:[A-F0-9]{1,4}:){5})(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])|(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}|(?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}$)(([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7})$
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python |
Find an IPv6 address using any of these notations within a larger collection of text:
(?<![:.\w])(?:
 # Mixed
 (?:
  # Non-compressed
  (?:[A-F0-9]{1,4}:){6}
  # Compressed with at most 6 colons
  |(?=(?:[A-F0-9]{0,4}:){0,6}
      (?:[0-9]{1,3}\.){3}[0-9]{1,3}  # and 4 bytes
      (?![:.\w]))                    # and anchored
   # and at most 1 double colon
   (([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)
  # Compressed with 7 colons and 5 numbers
  |::(?:[A-F0-9]{1,4}:){5}
 )
 # 255.255.255.
 (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}
 # 255
 (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
|# Standard
 (?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}
|# Compressed with at most 7 colons
 (?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}
    (?![:.\w]))  # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)
 # Compressed with 8 colons
|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7}
)(?![:.\w])
Regex options: Free-spacing, case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby 1.9 |
JavaScript and Ruby 1.8 don’t support lookbehind, so we have to remove the check at the start of the regex that keeps it from finding IPv6 addresses within longer sequences of hexadecimal digits and colons. We cannot use a word boundary, because the address may start with a colon, which is not a word character.
(?:
 # Mixed
 (?:
  # Non-compressed
  (?:[A-F0-9]{1,4}:){6}
  # Compressed with at most 6 colons
  |(?=(?:[A-F0-9]{0,4}:){0,6}
      (?:[0-9]{1,3}\.){3}[0-9]{1,3}  # and 4 bytes
      (?![:.\w]))                    # and anchored
   # and at most 1 double colon
   (([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)
  # Compressed with 7 colons and 5 numbers
  |::(?:[A-F0-9]{1,4}:){5}
 )
 # 255.255.255.
 (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}
 # 255
 (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
|# Standard
 (?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}
|# Compressed with at most 7 colons
 (?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}
    (?![:.\w]))  # and anchored
 # and at most 1 double colon
 (([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)
 # Compressed with 8 colons
|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7}
)(?![:.\w])
Regex options: Free-spacing, case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
(?:(?:(?:[A-F0-9]{1,4}:){6}|(?=(?:[A-F0-9]{0,4}:){0,6}(?:[0-9]{1,3}\.){3}[0-9]{1,3}(?![:.\w]))(([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)|::(?:[A-F0-9]{1,4}:){5})(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])|(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}|(?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}(?![:.\w]))(([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)|(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7})(?![:.\w])
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
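One way to gain some confidence in the all-notations regex is to spot-check it against an independent parser. The Python sketch below compares the whole-string regex with the standard library's `ipaddress.IPv6Address` on a few sample strings. The helper names `is_ipv6` and `stdlib_accepts` are assumptions for this example, and agreement on a handful of samples is not a proof that the two accept exactly the same strings.

```python
import ipaddress
import re

ANY_IPV6 = re.compile(
    r"""\A(?:
     # Mixed
     (?:
      # Non-compressed
      (?:[A-F0-9]{1,4}:){6}
      # Compressed with at most 6 colons
      |(?=(?:[A-F0-9]{0,4}:){0,6}
          (?:[0-9]{1,3}\.){3}[0-9]{1,3}  # and 4 bytes
          \Z)                            # and anchored
       # and at most 1 double colon
       (([0-9A-F]{1,4}:){0,5}|:)((:[0-9A-F]{1,4}){1,5}:|:)
      # Compressed with 7 colons and 5 numbers
      |::(?:[A-F0-9]{1,4}:){5}
     )
     # 255.255.255.
     (?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}
     # 255
     (?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])
    |# Standard
     (?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}
    |# Compressed with at most 7 colons
     (?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}
        \Z)  # and anchored
     # and at most 1 double colon
     (([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)
     # Compressed with 8 colons
    |(?:[A-F0-9]{1,4}:){7}:|:(:[A-F0-9]{1,4}){7}
    )\Z""",
    re.IGNORECASE | re.VERBOSE,
)


def is_ipv6(text):
    """Return True if the regex accepts the whole string."""
    return ANY_IPV6.search(text) is not None


def stdlib_accepts(text):
    """Return True if the standard library parses the string as IPv6."""
    try:
        ipaddress.IPv6Address(text)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True
```

The standard library also accepts some forms the regex deliberately does not cover, so any wider comparison should expect differences at the edges.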
Because of the different notations, matching an IPv6 address isn’t nearly as simple as matching an IPv4 address. Which notations you want to accept will greatly impact the complexity of your regular expression. Basically, there are two notations: standard and mixed. You can decide to allow only one of the two notations, or both. That gives us three sets of regular expressions.
Both the standard and mixed notations have a compressed form that omits zeros. Allowing compressed notation gives us another three sets of regular expressions.
You’ll need slightly different regexes depending on whether you want to check if a given string is a valid IPv6 address, or whether you want to find IP addresses in a larger body of text. To validate the IP address, we use anchors, as Recipe 2.5 explains. JavaScript uses the ‹^› and ‹$› anchors, whereas Ruby uses ‹\A› and ‹\Z›. All other flavors support both. Ruby also supports ‹^› and ‹$›, but allows them to match at embedded line breaks in the string as well. You should use the caret and dollar in Ruby only if you know your string doesn’t have any embedded line breaks.
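The choice of anchors matters in other flavors too. As an illustration in Python (chosen for this sketch, the text above discusses JavaScript and Ruby), ‹$› without the multiline option also matches just before a line break at the very end of the string, whereas ‹\Z› matches only at the very end. The names `ADDRESS`, `CARET_DOLLAR`, `A_Z`, and `accepted_by` are assumptions for this example.

```python
import re

# Standard-notation address without any anchors.
ADDRESS = r"(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}"

CARET_DOLLAR = re.compile(rf"^{ADDRESS}$", re.IGNORECASE)
A_Z = re.compile(rf"\A{ADDRESS}\Z", re.IGNORECASE)


def accepted_by(pattern, text):
    """Return True if the anchored pattern matches the text."""
    return pattern.search(text) is not None


with_newline = "1:2:3:4:5:6:7:8\n"
print(accepted_by(CARET_DOLLAR, with_newline))  # $ tolerates the final \n
print(accepted_by(A_Z, with_newline))           # \Z does not
```

If a trailing line break must be rejected, the ‹\A› and ‹\Z› version is the safer choice in Python.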
To find IPv6 addresses within larger text, we use negative
lookbehind ‹(?<![:.\w])› and negative lookahead ‹(?![:.\w])›
to make sure the address isn’t preceded or followed by a word character
(letter, digit, or underscore) or by a dot or colon. This makes sure we
don’t match parts of longer sequences of digits and colons. Recipe 2.16 explains how lookbehind and lookahead
work. If lookaround isn’t available, word boundaries can check that the
address isn’t preceded or followed by a word character, but only if the
first and last character in the address are sure to be (hexadecimal)
digits. Compressed notation allows addresses that start and end with a
colon. If we were to put a word boundary before or after a colon, it
would require an adjacent letter or digit, which isn’t what we want.
Recipe 2.6 explains everything about
word boundaries.
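The difference between the lookaround check and the word boundary check can be seen with a string that has one word too many. The Python sketch below applies both standard-notation search regexes to the same text. The names `WITH_LOOKAROUND` and `WITH_WORD_BOUNDARY` are assumptions for this example.

```python
import re

WITH_LOOKAROUND = re.compile(
    r"(?<![:.\w])(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}(?![:.\w])",
    re.IGNORECASE,
)
WITH_WORD_BOUNDARY = re.compile(
    r"\b(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}\b", re.IGNORECASE
)

# Nine words: not a valid address in standard notation.
too_long = "1:2:3:4:5:6:7:8:9"

# The lookahead sees the colon after the eighth word and rejects it.
print(WITH_LOOKAROUND.findall(too_long))
# The word boundary between "8" and ":" is satisfied, so the first
# eight words are reported as an address.
print(WITH_WORD_BOUNDARY.findall(too_long))
```

This is why the word boundary is described as performing only part of the test: it guards against adjacent word characters, but not against adjacent colons or dots.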
Standard IPv6 notation is very straightforward to handle
with a regular expression. We need to match eight words in hexadecimal
notation, delimited by seven colons. ‹[A-F0-9]{1,4}› matches 1 to 4
hexadecimal characters, which is what we need for a 16-bit word with
optional leading zeros. The character class (Recipe 2.3) lists only the uppercase letters. The
case-insensitive matching mode takes care of the lowercase letters.
See Recipe 3.4 to learn how to set
matching modes in your programming language.
The noncapturing group ‹(?:[A-F0-9]{1,4}:){7}› matches a hexadecimal
word followed by a literal colon. The quantifier repeats the group
seven times. The first colon in this regex is part of the regex syntax
for noncapturing groups, as Recipe 2.9
explains, and the second is a literal colon. The colon is not a
metacharacter in regular expressions, except in a few very specific
situations as part of a larger regex token. Therefore, we don’t need
to use backslashes to escape literal colons in our regular
expressions. We could escape them, but it would only make the regex
harder to read.
The regex for the mixed IPv6 notation consists of two
parts. ‹(?:[A-F0-9]{1,4}:){6}› matches six hexadecimal
words, each followed by a literal colon, just like we have a sequence
of seven such words in the regex for the standard IPv6
notation.
Instead of having two more hexadecimal words at the end, we now have a full IPv4 address at the end. We match this using the “accurate” regex that disallows leading zeros shown in Recipe 8.16.
Allowing both standard and mixed notation requires a slightly longer regular expression. The two notations differ only in their representation of the last 32 bits of the IPv6 address. Standard notation uses two 16-bit words, whereas mixed notation uses 4 decimal bytes, as with IPv4.
The first part of the regex matches six hexadecimal words, as in the regex that supports mixed notation only. The second part of the regex is now a noncapturing group with the two alternatives for the last 32 bits. As Recipe 2.8 explains, the alternation operator (vertical bar) has the lowest precedence of all regex operators. Thus, we need the noncapturing group to exclude the six words from the alternation.
The first alternative, located to the left of the vertical bar, matches two hexadecimal words with a literal colon in between. The second alternative matches an IPv4 address.
Things get quite a bit more complicated when we allow
compressed notation. The reason is that compressed notation allows a
variable number of zeros to be omitted. 1:0:0:0:0:6:0:0, 1::6:0:0, and
1:0:0:0:0:6:: are three ways of writing
the same IPv6 address. The address may have at most eight words, but
it needn’t have any. If it has fewer than eight, it must have one
double-colon sequence that represents the omitted zeros.
Variable repetition is easy with regular expressions. If an IPv6 address has a double colon, there can be at most seven words before and after the double colon. We could easily write this as:
(
  ([0-9A-F]{1,4}:){1,7}  # 1 to 7 words to the left
  | :                    # or a double colon at the start
)
(
  (:[0-9A-F]{1,4}){1,7}  # 1 to 7 words to the right
  | :                    # or a double colon at the end
)
Regex options: Free-spacing, case insensitive |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
This regular expression and the ones that follow in this discussion also work with JavaScript if you eliminate the comments and extra whitespace. JavaScript supports all the features used in these regexes, except free-spacing, which we use here to make these regexes easier to understand. Or, you can use the XRegExp library which enables free-spacing regular expressions in JavaScript, among other regex syntax enhancements.
This regular expression matches all compressed IPv6 addresses, but it doesn’t match any addresses that use noncompressed standard notation.
This regex is quite simple. The first part matches 1 to 7 words followed by a colon, or just the colon for addresses that don’t have any words to the left of the double colon. The second part matches 1 to 7 words preceded by a colon, or just the colon for addresses that don’t have any words to the right of the double colon. Put together, valid matches are a double colon by itself, a double colon with 1 to 7 words at the left only, a double colon with 1 to 7 words at the right only, and a double colon with 1 to 7 words at both the left and the right.
It’s the last part that is troublesome. The regex allows 1 to 7 words at both the left and the right, as it should, but it doesn’t specify that the total number of words at the left and right must be 7 or less. An IPv6 address has 8 words. The double colon indicates we’re omitting at least one word, so at most 7 remain.
Regular expressions don’t do math. They can count if something occurs between 1 and 7 times. But they cannot count if two things occur for a total of 7 times, splitting those 7 times between the two things in any combination.
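The overmatching described above is easy to reproduce. This Python sketch anchors the simple regex to the whole string and feeds it an address with a double colon and eight words. The names `NAIVE_COMPRESSED` and `count_words` are assumptions for this example.

```python
import re

# The simple compressed-notation regex, anchored to the whole string.
NAIVE_COMPRESSED = re.compile(
    r"\A(([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)\Z",
    re.IGNORECASE,
)


def count_words(address):
    """Count the hexadecimal words, ignoring the empty double-colon gap."""
    return len([word for word in address.split(":") if word])


# Four words on each side of the double colon: eight words in total,
# which leaves no room for the omitted zero words.
too_many = "1:2:3:4::5:6:7:8"
print(NAIVE_COMPRESSED.search(too_many) is not None, count_words(too_many))
```

Each side stays within its own limit of seven words, so the regex accepts the string even though the total is too high.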
To understand this problem better, let’s examine a simple analog. Say we want to match something in the form of aaaaxbbb. The string must be between 1 and 8 characters long and consist of 0 to 7 times a, exactly one x, and 0 to 7 times b.
There are two ways to solve this problem with a regular expression. One way is to spell out all the alternatives. The next section discussing compressed mixed notation uses this. It can result in a long-winded regex, but it will be easy to understand.
\A(?:a{7}x
 | a{6}xb?
 | a{5}xb{0,2}
 | a{4}xb{0,3}
 | a{3}xb{0,4}
 | a{2}xb{0,5}
 | axb{0,6}
 | xb{0,7}
)\Z
Regex options: Free-spacing |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
This regular expression has one alternative for each of the possible number of letters a. Each alternative spells out how many letters b are allowed after the given number of letters a and the x have been matched.
The other solution is to use lookahead. This is the method used for the regex within the section that matches an IPv6 address using compressed notation. If you’re not familiar with lookahead, see Recipe 2.16 first. Using lookahead, we can essentially match the same text twice, checking it for two conditions.
\A
(?=[abx]{1,8}\Z)
a{0,7}xb{0,7}
\Z
Regex options: Free-spacing |
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby |
The ‹\A› at the start of the regex anchors it to the start of the subject text. Then the positive lookahead kicks in. It checks whether a series of 1 to 8 letters ‹a›, ‹b›, and/or ‹x› can be matched, and that the end of the string is reached when those 1 to 8 letters have been matched. The ‹\Z› inside the lookahead is crucial. In order to limit the regex to strings of eight characters or less, the lookahead must test that there aren’t any further characters after those that it matched.
In a different scenario, you might use another kind of delimiter instead of ‹\A› and ‹\Z›. If you wanted to do a “whole words only” search for aaaaxbbb and friends, you would use word boundaries. But to restrict the regex match to the right length, you have to use some kind of delimiter, and you have to put the delimiter that matches the end of the string both inside the lookahead and at the end of the regular expression. If you don’t, the regular expression will partly match a string that has too many characters.
When the lookahead has satisfied its requirement, it gives up
the characters that it has matched. Thus, when the regex engine
attempts ‹a{0,7}›, it is
back at the start of the string. The fact that the lookahead doesn’t
consume the text that it matched is the key difference between a
lookahead and a noncapturing group, and is what allows us to apply two
patterns to a single piece of text.
Although ‹a{0,7}xb{0,7}› on its own could match up to 15 letters, in this case it can match only 8, because the lookahead already made sure there are only 8 letters. All ‹a{0,7}xb{0,7}› has to do is to check that they appear in the right order. In fact, ‹a*xb*› would have the exact same effect as ‹a{0,7}xb{0,7}› in this regular expression.
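Since the alphabet in this analog is tiny, the claims above can be checked by brute force. The Python sketch below runs the spelled-out regex, the lookahead regex, and the ‹a*xb*› variant over every string of up to nine letters a, b, and x, and collects any string on which they disagree. The names `SPELLED_OUT`, `WITH_LOOKAHEAD`, `WITH_STARS`, and `disagreements` are assumptions for this example.

```python
import itertools
import re

# One alternative per possible number of letters a.
SPELLED_OUT = re.compile(
    r"\A(?:a{7}x|a{6}xb?|a{5}xb{0,2}|a{4}xb{0,3}"
    r"|a{3}xb{0,4}|a{2}xb{0,5}|axb{0,6}|xb{0,7})\Z"
)
# Lookahead limits the length; the rest checks the order.
WITH_LOOKAHEAD = re.compile(r"\A(?=[abx]{1,8}\Z)a{0,7}xb{0,7}\Z")
# Same lookahead, with unbounded quantifiers after it.
WITH_STARS = re.compile(r"\A(?=[abx]{1,8}\Z)a*xb*\Z")


def disagreements(max_length=9):
    """Return every string over 'abx' on which the regexes differ."""
    found = []
    for length in range(max_length + 1):
        for letters in itertools.product("abx", repeat=length):
            text = "".join(letters)
            verdicts = {
                pattern.search(text) is not None
                for pattern in (SPELLED_OUT, WITH_LOOKAHEAD, WITH_STARS)
            }
            if len(verdicts) > 1:
                found.append(text)
    return found


print(disagreements())
```

An empty list means all three regexes gave the same verdict on every candidate string up to the chosen length, which goes one character past the 8-character limit.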
The second ‹\Z› at the end of the regex is also essential. Just like the lookahead needs to make sure there aren’t too many letters, the second test after the lookahead needs to make sure that all the letters are in the right order. This makes sure we don’t match something like axba, even though it satisfies the lookahead by being between 1 and 8 characters long.
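To see why both end-of-string anchors are needed, we can remove one at a time and watch what slips through. In this Python sketch, the names `COMPLETE`, `NO_INNER_ANCHOR`, and `NO_OUTER_ANCHOR` are assumptions for this example.

```python
import re

# Both anchors present: length and order are both enforced.
COMPLETE = re.compile(r"\A(?=[abx]{1,8}\Z)a{0,7}xb{0,7}\Z")
# No \Z inside the lookahead: the length limit is no longer enforced.
NO_INNER_ANCHOR = re.compile(r"\A(?=[abx]{1,8})a{0,7}xb{0,7}\Z")
# No \Z at the end: the regex can stop before the end of the string.
NO_OUTER_ANCHOR = re.compile(r"\A(?=[abx]{1,8}\Z)a{0,7}xb{0,7}")

fifteen_letters = "aaaaaaaxbbbbbbb"
wrong_order = "axba"

# Without the inner anchor, the 15-letter string is accepted in full.
print(NO_INNER_ANCHOR.search(fifteen_letters).group())
# Without the outer anchor, only part of "axba" is matched.
print(NO_OUTER_ANCHOR.search(wrong_order).group())
# The complete regex rejects both strings.
print(COMPLETE.search(fifteen_letters), COMPLETE.search(wrong_order))
```

The same reasoning applies to the IPv6 regexes, where the delimiter appears both inside the lookahead and at the end of the regex.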
Mixed notation can be compressed just like standard notation. Although the four bytes at the end must always be specified, even when they are zero, the number of hexadecimal words before them again becomes variable. If all the hexadecimal words are zero, the IPv6 address could end up looking like an IPv4 address with two colons before it.
Creating a regex for compressed mixed notation involves solving the same issues as for compressed standard notation. The previous section explains all this.
The main difference between the regex for compressed mixed notation and the regex for compressed (standard) notation is that the one for compressed mixed notation needs to check for the IPv4 address after the six hexadecimal words. We do this check at the end of the regex, using the same regex for accurate IPv4 addresses from Recipe 8.16 that we used in this recipe for noncompressed mixed notation.
We have to match the IPv4 part of the address at the end of the regex, but we also have to check for it inside the lookahead that makes sure we have no more than six colons or six hexadecimal words in the IPv6 address. Since we’re already doing an accurate test at the end of the regex, the lookahead can suffice with a simple IPv4 check. The lookahead doesn’t need to validate the IPv4 part, as the main regex already does that. But it does have to match the IPv4 part, so that the end-of-string anchor at the end of the lookahead can do its job.
The final set of regular expressions puts it all together. These match an IPv6 address in any notation: standard or mixed, compressed or not.
These regular expressions are formed by alternating the ones for compressed mixed notation and compressed (standard) notation. These regexes already use alternation to match both the compressed and noncompressed variety of the IPv6 notation they support.
The result is a regular expression with three top-level alternatives, with the first alternative consisting of two alternatives of its own. The first alternative matches an IPv6 address using mixed notation, either noncompressed or compressed. The second alternative matches an IPv6 address using standard notation. The third alternative covers the compressed (standard) notation.
We have three top-level alternatives instead of two alternatives that each contain their own two alternatives because there’s no particular reason to group the alternatives for standard and compressed notation. For mixed notation, we do keep the compressed and noncompressed alternatives together, because it saves us having to spell out the IPv4 part twice.
Essentially, we combined this regex:
^(6words|compressed6words)ip4$
and this regex:
^(8words|compressed8words)$
into:
^((6words|compressed6words)ip4|8words|compressed8words)$
rather than:
^((6words|compressed6words)ip4|(8words|compressed8words))$
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.1 explains which special characters need to be escaped. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.6 explains word boundaries. Recipe 2.9 explains grouping. Recipe 2.8 explains alternation. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround. Recipe 2.18 explains how to add comments.