8.8. Extracting the Scheme from a URL

Problem

You want to extract the URL scheme from a string that holds a URL. For example, you want to extract http from http://www.regexcookbook.com.

Solution

Extract the scheme from a URL known to be valid

^([a-z][a-z0-9+\-.]*):
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Extract the scheme while validating the URL

\A
([a-z][a-z0-9+\-.]*):
(# Authority & path
 //
 ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
 ([a-z0-9\-._~%]+                            # Named host
 |\[[a-f0-9:.]+\]                            # IPv6 host
 |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
 (:[0-9]+)?                                  # Port
 (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
|# Path without authority
 (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
\Z
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^([a-z][a-z0-9+\-.]*):(//([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|↵
\[[a-f0-9:.]+\]|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])(:[0-9]+)?↵
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?|(/?[a-z0-9\-._~%!$&'()*+,;=:@]+↵
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?)(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?↵
(#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Discussion

Extracting the scheme from a URL is easy if you already know that your subject text is a valid URL. A URL’s scheme always occurs at the very start of the URL. The caret (Recipe 2.5) specifies that requirement in the regex. The scheme begins with a letter, which can be followed by additional letters, digits, plus signs, hyphens, and dots. We match this with the two character classes [a-z][a-z0-9+\-.]* (Recipe 2.3).

The scheme is delimited from the rest of the URL with a colon. We add this colon to the regex to make sure we match the scheme only if the URL actually starts with a scheme. Relative URLs do not start with a scheme. The URL syntax specified in RFC 3986 makes sure that relative URLs don’t contain any colons, unless those colons are preceded by characters that aren’t allowed in schemes. That’s why we had to exclude the colon from one of the character classes for matching the path in Recipe 8.7. If you use the regexes in this recipe on a valid but relative URL, they won’t find a match at all.

Since the regex matches more than just the scheme itself (it includes the colon), we’ve added a capturing group to the regular expression. When the regex finds a match, you can retrieve the text matched by the first (and only) capturing group to get the scheme without the colon. Recipe 2.9 tells you all about capturing groups. See Recipe 3.9 to learn how to retrieve text matched by capturing groups in your favorite programming language.
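As an illustration, here is a minimal Python sketch of retrieving the first capturing group. The names SCHEME_RE and extract_scheme are our own, chosen for this example and not taken from Recipe 3.9. It assumes the subject text is already known to be a valid URL.

```python
import re

# Scheme regex from this recipe. The hyphen is escaped so that it is a
# literal character rather than part of a range inside the class.
SCHEME_RE = re.compile(r"^([a-z][a-z0-9+\-.]*):", re.IGNORECASE)


def extract_scheme(url):
    """Return the scheme without the colon, or None if url has no scheme.

    extract_scheme is a hypothetical helper name used only in this sketch.
    """
    match = SCHEME_RE.match(url)
    # Group 1 holds the scheme; the colon is matched but not captured.
    return match.group(1) if match else None


print(extract_scheme("http://www.regexcookbook.com"))  # http
print(extract_scheme("/relative/path"))                # None
```

The second call shows the behavior described earlier: a relative URL has no scheme, so the regex finds no match and the function returns None.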

If you don’t already know that your subject text is a valid URL, you can use a simplified version of the regex from Recipe 8.7. Since we want to extract the scheme, we can exclude relative URLs, which don’t specify a scheme. That makes the regular expression slightly simpler.

Since this regex matches the whole URL, we added an extra capturing group around the part of the regex that matches the scheme. Retrieve the text matched by capturing group number 1 to get the URL’s scheme.
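The validating variant might be used as in the following Python sketch, which compiles the free-spacing regex with re.VERBOSE. The names URL_RE and validated_scheme are hypothetical, introduced only for this example.

```python
import re

# Free-spacing version of the validating regex from this recipe.
# re.VERBOSE lets whitespace and # comments appear inside the pattern.
URL_RE = re.compile(r"""
    \A
    ([a-z][a-z0-9+\-.]*):
    (# Authority & path
     //
     ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
     ([a-z0-9\-._~%]+                            # Named host
     |\[[a-f0-9:.]+\]                            # IPv6 host
     |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
     (:[0-9]+)?                                  # Port
     (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
    |# Path without authority
     (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
    )
    # Query
    (\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
    # Fragment
    (\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
    \Z
""", re.IGNORECASE | re.VERBOSE)


def validated_scheme(url):
    """Return the scheme only if the whole string matches the URL regex."""
    match = URL_RE.match(url)
    return match.group(1) if match else None


print(validated_scheme("http://www.regexcookbook.com"))  # http
print(validated_scheme("http://example.com/a b"))        # None
```

The second URL contains a space, which none of the character classes allow, so the overall match fails and no scheme is returned even though the string starts with http:.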

See Also

Recipe 3.9 shows code to get the text matched by a particular part (capturing group) of a regex. Use this to get the URL scheme.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.11 explains named capturing groups. Recipe 2.12 explains repetition. Recipe 2.18 explains how to add comments.
