8.10. Extracting the Host from a URL

Problem

You want to extract the host from a string that holds a URL. For example, you want to extract www.regexcookbook.com from http://www.regexcookbook.com/.

Solution

Extract the host from a URL known to be valid

\A
[a-z][a-z0-9+\-.]*://               # Scheme
([a-z0-9\-._~%!$&'()*+,;=]+@)?      # User
([a-z0-9\-._~%]+                    # Named or IPv4 host
|\[[a-z0-9\-._~%!$&'()*+,;=:]+\])   # IPv6+ host

Regex options: Free-spacing, case insensitive

Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

^[a-z][a-z0-9+\-.]*://([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|↵
\[[a-z0-9\-._~%!$&'()*+,;=:]+\])

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Extract the host while validating the URL

\A
[a-z][a-z0-9+\-.]*://                       # Scheme
([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
([a-z0-9\-._~%]+                            # Named host
|\[[a-f0-9:.]+\]                            # IPv6 host
|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
(:[0-9]+)?                                  # Port
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?         # Query
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?         # Fragment
\Z

Regex options: Free-spacing, case insensitive

Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

^[a-z][a-z0-9+\-.]*://([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|↵
\[[a-f0-9:.]+\]|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])(:[0-9]+)?↵
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?↵
(#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?$

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

Extracting the host from a URL is easy if you already know that your subject text is a valid URL. We use ‹\A› or ‹^› to anchor the match to the start of the string. ‹[a-z][a-z0-9+\-.]*://› skips over the scheme, and ‹([a-z0-9\-._~%!$&'()*+,;=]+@)?› skips over the optional user. The hostname follows right after that.

RFC 3986 allows two different notations for the host. Domain names and IPv4 addresses are specified without square brackets, whereas IPv6 and future IP addresses are specified with square brackets. We need to handle those separately because the notation with square brackets allows more punctuation than the notation without. In particular, the colon is allowed between square brackets, but not in domain names or IPv4 addresses. The colon is also used to delimit the hostname (with or without square brackets) from the port number.

‹[a-z0-9\-._~%]+› matches domain names and IPv4 addresses. ‹\[[a-z0-9\-._~%!$&'()*+,;=:]+\]› handles IP version 6 and later. We combine these two using alternation (Recipe 2.8) in a group. The capturing group also allows us to extract the hostname.

This regex will find a match only if the URL actually specifies a host. When it does, the regex will match the scheme, user, and host parts of the URL. When the regex finds a match, you can retrieve the text matched by the second capturing group to get the hostname without any delimiters or other URL parts. The capturing group will include the square brackets for IPv6 addresses. Recipe 2.9 tells you all about capturing groups. See Recipe 3.9 to learn how to retrieve text matched by capturing groups in your favorite programming language.
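As a minimal sketch of retrieving the second capturing group, the following assumes Python's re module, with re.VERBOSE standing in for the free-spacing option and re.IGNORECASE for case insensitivity. The names HOST_RE and extract_host are our own and purely illustrative.

```python
import re

# Regex for a URL already known to be valid, in free-spacing form.
# Group 1 is the optional user, group 2 is the host.
HOST_RE = re.compile(r"""
    \A
    [a-z][a-z0-9+\-.]*://                 # Scheme
    ([a-z0-9\-._~%!$&'()*+,;=]+@)?        # User
    ([a-z0-9\-._~%]+                      # Named or IPv4 host
    |\[[a-z0-9\-._~%!$&'()*+,;=:]+\])     # IPv6+ host
    """, re.VERBOSE | re.IGNORECASE)


def extract_host(url):
    """Return the host of a URL assumed to be valid, or None if it has no host."""
    match = HOST_RE.match(url)
    return match.group(2) if match else None


print(extract_host("http://www.regexcookbook.com/"))  # www.regexcookbook.com
print(extract_host("ftp://jan@[2001:db8::1]:21/"))    # [2001:db8::1]
print(extract_host("mailto:jan@example.com"))         # None
```

Note that the bracketed address comes back with its square brackets, and that a URL without an authority, such as the mailto example, produces no match at all.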

If you don’t already know that your subject text is a valid URL, you can use a simplified version of the regex from Recipe 8.7. Since we want to extract the host, we can exclude URLs that don’t specify an authority. This makes the regular expression quite a bit simpler. It’s very similar to the one we used in Recipe 8.9. The only difference is that now the user part of the authority is optional again, as it was in Recipe 8.7.

This regex also uses alternation for the various notations for the host, which is kept together by a capturing group. Retrieve the text matched by capturing group number 2 to get the URL’s host.
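The validating variant can be used the same way. This is a sketch under the same assumptions (Python's re module, with re.VERBOSE for free-spacing); the names VALID_URL_RE and extract_host_validated are hypothetical. Because the pattern is anchored at both ends, a subject string that is not a complete URL with an authority yields no match rather than a partial one.

```python
import re

# Validating regex in free-spacing form. Group 2 still holds the host.
VALID_URL_RE = re.compile(r"""
    \A
    [a-z][a-z0-9+\-.]*://                       # Scheme
    ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
    ([a-z0-9\-._~%]+                            # Named host
    |\[[a-f0-9:.]+\]                            # IPv6 host
    |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
    (:[0-9]+)?                                  # Port
    (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
    (\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?         # Query
    (\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?         # Fragment
    \Z
    """, re.VERBOSE | re.IGNORECASE)


def extract_host_validated(text):
    """Return the host if text is a URL this regex accepts, else None."""
    match = VALID_URL_RE.match(text)
    return match.group(2) if match else None


print(extract_host_validated("http://www.regexcookbook.com:8080/path/page?x=1#top"))
print(extract_host_validated("http://[::1]/"))
print(extract_host_validated("http://bad host/"))
```

The first call prints www.regexcookbook.com, the second prints [::1], and the third prints None because the space is not allowed anywhere in the URL.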

If you want a regex that matches any valid URL, including those that don’t specify the host, you can use one of the regexes from Recipe 8.7. The first regex in that recipe captures the host, if present, in the fourth capturing group.
