You want to check whether a string looks like it may be a valid, fully qualified domain name, or find such domain names in longer text.
Check whether a string looks like a valid domain name:
^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
\A([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\Z
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
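As a rough illustration of how the anchored regex might be used, here is a minimal Python sketch. The function name `looks_like_domain` and the sample strings are made up for this example; only the regular expression comes from the recipe, written with the escaped dot ‹\.› and the ‹\A› and ‹\Z› anchors.

```python
import re

# Anchored pattern from this recipe; \A and \Z tie the match to the whole string.
DOMAIN_RE = re.compile(
    r"\A([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\Z",
    re.IGNORECASE,
)

def looks_like_domain(candidate: str) -> bool:
    """Return True if candidate has the shape of a fully qualified domain name.

    Hypothetical helper for illustration; it checks syntax only and does not
    verify that the domain actually exists.
    """
    return DOMAIN_RE.search(candidate) is not None

if __name__ == "__main__":
    samples = ["example.com", "www.Example.co.uk", "-bad.com", "a--b.com", "localhost"]
    for name in samples:
        print(name, looks_like_domain(name))
```

The first two samples should be accepted. The others are rejected because of a leading hyphen, a doubled hyphen, and a missing top-level domain, respectively.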
Find valid domain names in longer text:
\b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
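To sketch how the search variant could be applied, the following Python snippet scans a sentence for domain names. The sample sentence and the name `find_domains` are assumptions made for this illustration; the pattern is the recipe's regex with a word boundary ‹\b› at each end.

```python
import re

# Search pattern from this recipe, delimited by word boundaries instead of anchors.
FIND_DOMAIN_RE = re.compile(
    r"\b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)

def find_domains(text: str) -> list[str]:
    """Return every substring of text that looks like a domain name.

    Hypothetical helper; finditer() is used so that the overall match is
    returned rather than the contents of the capturing groups.
    """
    return [match.group(0) for match in FIND_DOMAIN_RE.finditer(text)]

if __name__ == "__main__":
    sentence = "Mail admin@example.com or visit www.regex-cookbook.example.org today."
    print(find_domains(sentence))
```

The trailing "today." is not reported, because the dot is not followed by a top-level domain.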
Check whether each part of the domain is not longer than 63 characters:
\b((?=[a-z0-9-]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Allow internationalized domain names using the punycode notation:
\b((xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Check whether each part of the domain is not longer than 63 characters, and allow internationalized domain names using the punycode notation:
\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
A domain name has the form of domain.tld, or subdomain.domain.tld, or any number of additional subdomains. The top-level domain (tld) consists of two or more letters. That’s the easiest part of the regex: ‹[a-z]{2,}›.
The domain, and any subdomains, consist of letters, digits, and hyphens. Hyphens cannot appear in pairs, and cannot appear as the first or last character in the domain. We handle this with the regular expression ‹[a-z0-9]+(-[a-z0-9]+)*›. This regex allows any number of letters and digits, optionally followed by any number of groups that consist of a hyphen followed by another sequence of letters and digits. Remember that the hyphen is a metacharacter inside character classes (Recipe 2.3) but an ordinary character outside of character classes, so we don’t need to escape any hyphens in this regex.
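The hyphen rules are easier to see when the regex for a single domain part is tested on its own. The sketch below uses Python's `re.fullmatch()` so the whole label has to match; the helper name `valid_label` and the sample labels are invented for this illustration.

```python
import re

# Regex for one domain part: runs of letters and digits separated by single hyphens.
LABEL_RE = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*", re.IGNORECASE)

def valid_label(label: str) -> bool:
    """Return True if label obeys the hyphen rules (hypothetical helper)."""
    return LABEL_RE.fullmatch(label) is not None

if __name__ == "__main__":
    for label in ["my-site", "a-b-c", "-site", "site-", "my--site"]:
        print(repr(label), valid_label(label))
```

Only "my-site" and "a-b-c" should pass. Each hyphen must be followed by at least one letter or digit, which rules out a trailing hyphen and a doubled hyphen, and the regex must start with a letter or digit, which rules out a leading hyphen.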
The domain and the subdomains are delimited with a literal dot, which we match with ‹\.› in a regular expression. Since we can have any number of subdomains in addition to the domain, we place the domain name part of the regex and the literal dot in a group that we repeat: ‹([a-z0-9]+(-[a-z0-9]+)*\.)+›. Since the subdomains follow the same syntax as the domain, this one group handles both.
If you want to check whether a string represents a valid domain name, all that remains is to add anchors to the start and the end of the regex that match at the start and the end of the string. We can do this with ‹^› and ‹$› in all flavors except Ruby, and with ‹\A› and ‹\Z› in all flavors except JavaScript. Recipe 2.5 has all the details.
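The choice of anchors can matter in practice. As one example, in Python ‹$› also matches just before a line break at the very end of the string, while Python's ‹\Z› matches only at the very end. The sketch below compares the two; the variable names are made up for this illustration, and other flavors may treat these anchors differently.

```python
import re

# Shared body of the domain regex, without anchors.
BODY = r"([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}"

# Two ways of anchoring the same pattern.
caret_dollar = re.compile("^" + BODY + "$", re.IGNORECASE)
a_and_z = re.compile(r"\A" + BODY + r"\Z", re.IGNORECASE)

candidate = "example.com\n"  # note the trailing line break

# In Python, $ tolerates one trailing line break; \Z does not.
accepted_by_dollar = caret_dollar.search(candidate) is not None
accepted_by_z = a_and_z.search(candidate) is not None

if __name__ == "__main__":
    print("^...$   accepts trailing newline:", accepted_by_dollar)
    print("\\A...\\Z accepts trailing newline:", accepted_by_z)
```

If input may arrive with a stray line break, either strip it first or choose the anchors that give the behavior you want.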
If you want to find domain names in a larger body of text, you can add word boundaries (‹\b›; see Recipe 2.6).
Our first set of regular expressions doesn’t check whether each part of the domain is no longer than 63 characters. We can’t easily do this, because our regex for each domain part, ‹[a-z0-9]+(-[a-z0-9]+)*›, has three quantifiers in it. There’s no way to tell the regex engine to make these add up to 63.
We could use ‹[-a-z0-9]{1,63}› to match a domain part that is 1 to 63 characters long, or ‹\b([-a-z0-9]{1,63}\.)+[a-z]{2,63}› for the whole domain name. But then we’re no longer excluding domains with hyphens in the wrong places.
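A quick comparison shows what the simpler character class gives up. In this Python sketch, both patterns are anchored with ‹\A› and ‹\Z› so that whole strings are tested; the variable names and sample domains are assumptions made for illustration.

```python
import re

# Length-limited but hyphen-blind: any mix of letters, digits, and hyphens per part.
naive = re.compile(r"\A([-a-z0-9]{1,63}\.)+[a-z]{2,63}\Z", re.IGNORECASE)

# Hyphen-aware pattern from this recipe (no length limit).
hyphen_aware = re.compile(r"\A([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\Z", re.IGNORECASE)

samples = ["-bad-.example.com", "my--site.example.com", "my-site.example.com"]

# Map each sample to (accepted by naive, accepted by hyphen_aware).
results = {
    name: (naive.search(name) is not None, hyphen_aware.search(name) is not None)
    for name in samples
}

if __name__ == "__main__":
    for name, (by_naive, by_aware) in results.items():
        print(f"{name}: naive={by_naive} hyphen_aware={by_aware}")
```

The naive pattern accepts all three samples, including the two with misplaced hyphens, while the hyphen-aware pattern accepts only the last one.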
What we can do is use lookahead to match the same text twice. Review Recipe 2.16 first if you’re not familiar with lookahead. We use the same regex ‹[a-z0-9]+(-[a-z0-9]+)*\.› to match a domain name with valid hyphens, and add ‹[-a-z0-9]{1,63}\.› inside a lookahead to check that its length is also no more than 63 characters. The result is ‹(?=[-a-z0-9]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.›.
The lookahead ‹(?=[-a-z0-9]{1,63}\.)› first checks that there are 1 to 63 letters, digits, and hyphens until the next dot. It’s important to include the dot in the lookahead. Without it, a domain part longer than 63 characters would still satisfy the lookahead, because its first 63 characters alone are enough to match. Only by putting the literal dot inside the lookahead do we enforce the requirement that we want at most 63 characters.
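The effect of the dot inside the lookahead can be checked directly. This Python sketch applies both variants to a 64-character domain part using `match()`, which only tries a match at the start of the string; the variable names are invented for this illustration.

```python
import re

# Lookahead that includes the literal dot: enforces at most 63 characters.
with_dot = re.compile(
    r"(?=[-a-z0-9]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.", re.IGNORECASE
)

# Lookahead without the dot: satisfied by the first 63 characters of any part.
without_dot = re.compile(
    r"(?=[-a-z0-9]{1,63})[a-z0-9]+(-[a-z0-9]+)*\.", re.IGNORECASE
)

too_long = "a" * 64 + "."   # 64-character part, one over the limit
just_fits = "a" * 63 + "."  # exactly 63 characters

if __name__ == "__main__":
    print("with dot, 64 chars:   ", with_dot.match(too_long) is not None)
    print("without dot, 64 chars:", without_dot.match(too_long) is not None)
    print("with dot, 63 chars:   ", with_dot.match(just_fits) is not None)
```

Only the variant with the dot rejects the 64-character part. Note that this test relies on `match()` starting at the first character; in the full regex, the anchor or word boundary in front of the group plays that role.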
The lookahead does not consume the text that it matched. Thus, if the lookahead succeeds, ‹[a-z0-9]+(-[a-z0-9]+)*\.› is applied to the same text already matched by the lookahead. We’ve confirmed there are no more than 63 characters, and now we test that they’re the right combination of hyphens and nonhyphens.
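Putting the two checks together, a validator with the length limit might look like the following Python sketch. It anchors the recipe's regex with ‹\A› and ‹\Z› to test whole strings; the function name `valid_limited_domain` and the test strings are assumptions made for this example.

```python
import re

# Lookahead checks the length of each part; the rest checks hyphen placement.
LIMITED_RE = re.compile(
    r"\A((?=[a-z0-9-]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\Z",
    re.IGNORECASE,
)

def valid_limited_domain(candidate: str) -> bool:
    """Return True if every part is well formed and at most 63 characters long.

    Hypothetical helper for illustration only.
    """
    return LIMITED_RE.search(candidate) is not None

if __name__ == "__main__":
    tests = {
        "63-char part": "a" * 63 + ".example.com",
        "64-char part": "a" * 64 + ".example.com",
        "bad hyphen": "my--site.example.com",
    }
    for description, name in tests.items():
        print(description, valid_limited_domain(name))
```

The 63-character part passes both checks, the 64-character part fails the lookahead, and the doubled hyphen passes the lookahead but fails the hyphen-aware part of the regex.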
Internationalized domain names (IDNs) theoretically can contain pretty much any character. The actual list of characters depends on the registry that manages the top-level domain. For example, .es allows domain names with Spanish characters.
In practice, internationalized domain names are often encoded using a scheme called punycode. Although the punycode algorithm is quite complicated, what matters here is that it results in domain names that are a combination of letters, digits, and hyphens, following the rules we’re already handling with our regular expression for domain names. The only difference is that the domain name produced by punycode is prefixed with xn--. To add support for such domains to our regular expression, we only need to add ‹(xn--)?› to the group in our regular expression that matches the domain name parts.
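To see the ‹(xn--)?› addition at work, the Python sketch below converts an internationalized name to its punycode form with Python's built-in `idna` codec and tests it against both versions of the regex. The sample name and the variable names are chosen for illustration, and the codec is simply a convenient way to produce an xn-- name; it is not part of the recipe.

```python
import re

# Recipe regex without and with the optional punycode prefix, anchored for validation.
plain = re.compile(r"\A([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\Z", re.IGNORECASE)
with_idn = re.compile(r"\A((xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\Z", re.IGNORECASE)

# "b\u00fccher" is "bücher"; the idna codec returns the ASCII (punycode) form.
unicode_name = "b\u00fccher.example"
ascii_name = unicode_name.encode("idna").decode("ascii")

accepted_plain = plain.search(ascii_name) is not None
accepted_idn = with_idn.search(ascii_name) is not None

if __name__ == "__main__":
    print("punycode form:", ascii_name)
    print("accepted without (xn--)?:", accepted_plain)
    print("accepted with (xn--)?:   ", accepted_idn)
```

The version without ‹(xn--)?› rejects the encoded name because of the doubled hyphen in the prefix, while the version with it accepts the name.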
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.1 explains which special characters need to be escaped. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.6 explains word boundaries. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.