8.15. Validating Domain Names

Problem

You want to check whether a string looks like it may be a valid, fully qualified domain name, or find such domain names in longer text.

Solution

Check whether a string looks like a valid domain name:

^([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
\A([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\Z
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

Find valid domain names in longer text:

\b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Check whether each part of the domain is not longer than 63 characters:

\b((?=[a-z0-9-]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Allow internationalized domain names using the punycode notation:

\b((xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Check whether each part of the domain is not longer than 63 characters, and allow internationalized domain names using the punycode notation:

\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

A domain name has the form of domain.tld, or subdomain.domain.tld, or any number of additional subdomains. The top-level domain (tld) consists of two or more letters. That’s the easiest part of the regex: [a-z]{2,}.

The domain, and any subdomains, consist of letters, digits, and hyphens. Hyphens cannot appear in pairs, and cannot appear as the first or last character in the domain. We handle this with the regular expression [a-z0-9]+(-[a-z0-9]+)*. This regex allows any number of letters and digits, optionally followed by any number of groups that consist of a hyphen followed by another sequence of letters and digits. Remember that the hyphen is a metacharacter inside character classes (Recipe 2.3) but an ordinary character outside of character classes, so we don’t need to escape any hyphens in this regex.
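As a rough illustration of the hyphen rules, the following Python sketch applies the domain-part regex to a few strings. The helper name is_valid_label and the sample strings are ours, chosen only for this example.

```python
import re

# One domain part: runs of letters and digits joined by single hyphens.
LABEL = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*", re.IGNORECASE)

def is_valid_label(label: str) -> bool:
    """Return True if the entire string satisfies the hyphen rules."""
    # fullmatch() requires the pattern to cover the whole string.
    return LABEL.fullmatch(label) is not None

for candidate in ["regex", "my-site", "a-b-c", "-start", "end-", "dou--ble"]:
    print(candidate, is_valid_label(candidate))
```

The first three candidates should be accepted. The last three should be rejected because of a leading hyphen, a trailing hyphen, and a pair of hyphens, respectively.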

The domain and the subdomains are delimited with a literal dot, which we match with \. in a regular expression. Since we can have any number of subdomains in addition to the domain, we place the domain name part of the regex and the literal dot in a group that we repeat: ([a-z0-9]+(-[a-z0-9]+)*\.)+. Since the subdomains follow the same syntax as the domain, this one group handles both.

If you want to check whether a string represents a valid domain name, all that remains is to add anchors to the start and the end of the regex that match at the start and the end of the string. We can do this with ^ and $ in all flavors except Ruby, and with \A and \Z in all flavors except JavaScript. Recipe 2.5 has all the details.
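Here is a minimal Python sketch of whole-string validation. It assumes the \A and \Z variant, because in Python $ also matches just before a trailing line break. The function name looks_like_domain is hypothetical.

```python
import re

# Anchored variant: the whole string must be a domain name.
DOMAIN = re.compile(
    r"\A([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\Z",
    re.IGNORECASE,
)

def looks_like_domain(text: str) -> bool:
    """Return True if text, in its entirety, looks like a domain name."""
    return DOMAIN.search(text) is not None

for candidate in ["www.example.com", "EXAMPLE.Com", "example",
                  "-bad.example.com", "example.c"]:
    print(candidate, looks_like_domain(candidate))
```

Only the first two candidates should pass. The others lack a dot, start a part with a hyphen, or have a one-letter top-level domain.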

If you want to find domain names in a larger body of text, you can add word boundaries (\b; see Recipe 2.6).
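The following Python sketch shows one way to collect domain names from running text with the word-boundary variant. The function name find_domains and the sample sentence are made up for illustration.

```python
import re

# Word boundaries instead of anchors, so matches can occur inside text.
FIND_DOMAIN = re.compile(
    r"\b([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\b",
    re.IGNORECASE,
)

def find_domains(text: str) -> list[str]:
    """Return every substring of text that looks like a domain name."""
    # group(0) is the overall match; re.findall() would instead return
    # the contents of the capturing groups.
    return [match.group(0) for match in FIND_DOMAIN.finditer(text)]

sample = "Visit www.example.com or mail admin@mail.example.org today."
print(find_domains(sample))
```

Note that the sentence-ending period after "today" does not produce a match, because no top-level domain follows it.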

Our first set of regular expressions doesn’t check whether each part of the domain is no longer than 63 characters. We can’t easily do this, because our regex for each domain part, [a-z0-9]+(-[a-z0-9]+)*, has three quantifiers in it. There’s no way to tell the regex engine to make these add up to 63.

We could use [-a-z0-9]{1,63} to match a domain part that is 1 to 63 characters long, or ([-a-z0-9]{1,63}\.)+[a-z]{2,63} for the whole domain name. But then we’re no longer excluding domains with hyphens in the wrong places.

What we can do is to use lookahead to match the same text twice. Review Recipe 2.16 first if you’re not familiar with lookahead. We use the same regex [a-z0-9]+(-[a-z0-9]+)*\. to match a domain name with valid hyphens, and add [-a-z0-9]{1,63}\. inside a lookahead to check that its length is also 63 characters or less. The result is (?=[-a-z0-9]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\..

The lookahead (?=[-a-z0-9]{1,63}\.) first checks that there are 1 to 63 letters, digits, and hyphens until the next dot. It’s important to include the dot in the lookahead. Without it, domains longer than 63 characters would still satisfy the lookahead’s requirement for 63 characters. Only by putting the literal dot inside the lookahead do we enforce the requirement that we want at most 63 characters.

The lookahead does not consume the text that it matched. Thus, if the lookahead succeeds, [a-z0-9]+(-[a-z0-9]+)*\. is applied to the same text already matched by the lookahead. We’ve confirmed there are no more than 63 characters, and now we test that they’re the right combination of hyphens and nonhyphens.
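To see the two checks working together, this Python sketch compares the length-only regex with the lookahead version. Both are anchored with \A and \Z here so that they validate whole strings. The names LOOSE, LIMITED, and within_label_limit are ours.

```python
import re

# Length check only: hyphens may end up in the wrong places.
LOOSE = re.compile(r"\A([-a-z0-9]{1,63}\.)+[a-z]{2,63}\Z", re.IGNORECASE)

# The lookahead checks the length; the rest checks the hyphen rules.
LIMITED = re.compile(
    r"\A((?=[a-z0-9-]{1,63}\.)[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\Z",
    re.IGNORECASE,
)

def within_label_limit(domain: str) -> bool:
    """Return True if every part is valid and at most 63 characters."""
    return LIMITED.search(domain) is not None

longest_ok = "a" * 63 + ".com"   # a part of exactly 63 characters
too_long = "a" * 64 + ".com"     # one character over the limit

print(within_label_limit(longest_ok))
print(within_label_limit(too_long))
print(LOOSE.search("-bad-.com") is not None)
print(within_label_limit("-bad-.com"))
```

The 63-character part should be accepted and the 64-character part rejected. The string "-bad-.com" should pass the length-only regex but fail the lookahead version, since its part begins and ends with a hyphen.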

Internationalized domain names (IDNs) theoretically can contain pretty much any character. The actual list of characters depends on the registry that manages the top-level domain. For example, .es allows domain names with Spanish characters.

In practice, internationalized domain names are often encoded using a scheme called punycode. Although the punycode algorithm is quite complicated, what matters here is that it results in domain names that are a combination of letters, digits, and hyphens, following the rules we’re already handling with our regular expression for domain names. The only difference is that the domain name produced by punycode is prefixed with xn--. To add support for such domains to our regular expression, we only need to add (xn--)? to the group in our regular expression that matches the domain name parts.
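As a small illustration, the Python sketch below converts a name to its punycode form and tests it against the regexes with and without (xn--)?. It relies on Python's built-in "idna" codec for the conversion, which is our choice for this example and not part of the recipe. The helper name to_ascii and the sample name are hypothetical.

```python
import re

# Without the optional prefix, the double hyphen in "xn--" is rejected.
PLAIN = re.compile(
    r"\A([a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\Z",
    re.IGNORECASE,
)

# With (xn--)? added to the group that matches each domain part.
IDN = re.compile(
    r"\A((xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,}\Z",
    re.IGNORECASE,
)

def to_ascii(domain: str) -> str:
    """Encode each part of a domain name with Python's idna codec."""
    return domain.encode("idna").decode("ascii")

encoded = to_ascii("b\u00fccher.example")  # "bücher.example"
print(encoded)
print(PLAIN.search(encoded) is not None)
print(IDN.search(encoded) is not None)
```

The encoded name should begin with xn--, fail the plain regex, and pass the one that allows the prefix.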

See Also

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.1 explains which special characters need to be escaped. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.6 explains word boundaries. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.16 explains lookaround.
