8.10. Extracting the Host from a URL

Problem

You want to extract the host from a string that holds a URL. For example, you want to extract www.regexcookbook.com from http://www.regexcookbook.com/.

Solution

Extract the host from a URL known to be valid

\A
[a-z][a-z0-9+\-.]*://               # Scheme
([a-z0-9\-._~%!$&'()*+,;=]+@)?      # User
([a-z0-9\-._~%]+                    # Named or IPv4 host
|\[[a-z0-9\-._~%!$&'()*+,;=:]+\])   # IPv6+ host
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^[a-z][a-z0-9+\-.]*://([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|↵
\[[a-z0-9\-._~%!$&'()*+,;=:]+\])
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Extract the host while validating the URL

\A
[a-z][a-z0-9+\-.]*://                       # Scheme
([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
([a-z0-9\-._~%]+                            # Named host
|\[[a-f0-9:.]+\]                            # IPv6 host
|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
(:[0-9]+)?                                  # Port
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?         # Query
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?         # Fragment
\Z
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^[a-z][a-z0-9+\-.]*://([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|↵
\[[a-f0-9:.]+\]|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])(:[0-9]+)?↵
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?↵
(#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Discussion

Extracting the host from a URL is easy if you already know that your subject text is a valid URL. We use \A or ^ to anchor the match to the start of the string. [a-z][a-z0-9+\-.]*:// skips over the scheme, and ([a-z0-9\-._~%!$&'()*+,;=]+@)? skips over the optional user. The hostname follows right after that.

RFC 3986 allows two different notations for the host. Domain names and IPv4 addresses are specified without square brackets, whereas IPv6 and future IP addresses are specified with square brackets. We need to handle those separately because the notation with square brackets allows more punctuation than the notation without. In particular, the colon is allowed between square brackets, but not in domain names or IPv4 addresses. The colon is also used to delimit the hostname (with or without square brackets) from the port number.

[a-z0-9\-._~%]+ matches domain names and IPv4 addresses. \[[a-z0-9\-._~%!$&'()*+,;=:]+\] handles IP version 6 and later. We combine these two using alternation (Recipe 2.8) in a group. The capturing group also allows us to extract the hostname.

This regex will find a match only if the URL actually specifies a host. When it does, the regex will match the scheme, user, and host parts of the URL. When the regex finds a match, you can retrieve the text matched by the second capturing group to get the hostname without any delimiters or other URL parts. The capturing group will include the square brackets for IPv6 addresses. Recipe 2.9 tells you all about capturing groups. See Recipe 3.9 to learn how to retrieve text matched by capturing groups in your favorite programming language.
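As a rough illustration of retrieving the second capturing group, here is a minimal Python sketch. It assumes the free-spacing regex from the Solution, compiled with re.VERBOSE and re.IGNORECASE; the names HOST_RE and extract_host are hypothetical and chosen only for this example.

```python
import re

# Free-spacing regex for URLs already known to be valid.
# re.VERBOSE is Python's free-spacing mode.
HOST_RE = re.compile(r"""
    \A
    [a-z][a-z0-9+\-.]*://               # Scheme
    ([a-z0-9\-._~%!$&'()*+,;=]+@)?      # User
    ([a-z0-9\-._~%]+                    # Named or IPv4 host
    |\[[a-z0-9\-._~%!$&'()*+,;=:]+\])   # IPv6+ host
    """, re.VERBOSE | re.IGNORECASE)


def extract_host(url):
    """Return the host (hypothetical helper), or None if no host is specified."""
    match = HOST_RE.match(url)
    # Group 1 is the optional user; group 2 is the host.
    return match.group(2) if match else None


print(extract_host("http://www.regexcookbook.com/"))
print(extract_host("ftp://user@[::1]:21/"))
print(extract_host("mailto:jan@example.com"))
```

The first call yields www.regexcookbook.com, the second yields [::1] with its square brackets, and the third yields None because the mailto URL has no :// and therefore no host to capture.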

If you don’t already know that your subject text is a valid URL, you can use a simplified version of the regex from Recipe 8.7. Since we want to extract the host, we can exclude URLs that don’t specify an authority. This makes the regular expression quite a bit simpler. It’s very similar to the one we used in Recipe 8.9. The only difference is that now the user part of the authority is optional again, as it was in Recipe 8.7.

This regex also uses alternation for the various notations for the host, which is kept together by a capturing group. Retrieve the text matched by capturing group number 2 to get the URL’s host.
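The validating variant can be sketched the same way. This is only an illustration under the same assumptions (Python's re module, free-spacing mode); VALID_URL_RE and host_if_valid are hypothetical names. Note that in free-spacing mode the fragment's # must be escaped, or it would start a comment.

```python
import re

# Free-spacing regex that validates the whole URL and captures the host.
VALID_URL_RE = re.compile(r"""
    \A
    [a-z][a-z0-9+\-.]*://                       # Scheme
    ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
    ([a-z0-9\-._~%]+                            # Named host
    |\[[a-f0-9:.]+\]                            # IPv6 host
    |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
    (:[0-9]+)?                                  # Port
    (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
    (\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?         # Query
    (\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?         # Fragment
    \Z
    """, re.VERBOSE | re.IGNORECASE)


def host_if_valid(url):
    """Return the host if the whole string matches (hypothetical helper), else None."""
    match = VALID_URL_RE.match(url)
    # Group 2 still holds the host, as in the simpler regex.
    return match.group(2) if match else None


print(host_if_valid("https://jan@example.com:8080/a/b?x=1#top"))
print(host_if_valid("http://[2001:db8::1]/index.html"))
print(host_if_valid("http://example.com/a b"))
```

The last call returns None: the space is not allowed in the path, so the end-of-string anchor cannot match and the URL is rejected instead of yielding a host.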

If you want a regex that matches any valid URL, including those that don’t specify the host, you can use one of the regexes from Recipe 8.7. The first regex in that recipe captures the host, if present, in the fourth capturing group.

See Also

Recipe 3.9 shows code to get the text matched by a particular part (capturing group) of a regex. Use this to get the host address.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.11 explains named capturing groups. Recipe 2.12 explains repetition. Recipe 2.18 explains how to add comments.
