8.8. Extracting the Scheme from a URL

Problem

You want to extract the URL scheme from a string that holds a URL. For example, you want to extract http from http://www.regexcookbook.com.

Solution

Extract the scheme from a URL known to be valid

^([a-z][a-z0-9+\-.]*):

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Extract the scheme while validating the URL

\A
([a-z][a-z0-9+\-.]*):
(# Authority & path
 //
 ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
 ([a-z0-9\-._~%]+                            # Named host
 |\[[a-f0-9:.]+\]                            # IPv6 host
 |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
 (:[0-9]+)?                                  # Port
 (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
|# Path without authority
 (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
\Z

Regex options: Free-spacing, case insensitive

Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

^([a-z][a-z0-9+\-.]*):(//([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|↵
\[[a-f0-9:.]+\]|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])(:[0-9]+)?↵
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?|(/?[a-z0-9\-._~%!$&'()*+,;=:@]+↵
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?)(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?↵
(#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?$

Regex options: Case insensitive

Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Discussion

Extracting the scheme from a URL is easy if you already know that your subject text is a valid URL. A URL’s scheme always occurs at the very start of the URL. The caret (Recipe 2.5) specifies that requirement in the regex. The scheme begins with a letter, which can be followed by additional letters, digits, plus signs, hyphens, and dots. We match this with the two character classes ‹[a-z][a-z0-9+\-.]*› (Recipe 2.3). The hyphen in the second class is escaped so that it is taken literally instead of forming a range between the plus sign and the dot.

The scheme is delimited from the rest of the URL with a colon. We add this colon to the regex to make sure we match the scheme only if the URL actually starts with a scheme. Relative URLs do not start with a scheme. The URL syntax specified in RFC 3986 makes sure that relative URLs don’t contain any colons, unless those colons are preceded by characters that aren’t allowed in schemes. That’s why we had to exclude the colon from one of the character classes for matching the path in Recipe 8.7. If you use the regexes in this recipe on a valid but relative URL, they won’t find a match at all.
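To illustrate the point about relative URLs, here is a minimal Python sketch. The helper name `has_scheme` and the sample strings are assumptions made for this illustration and are not part of the recipe; `re.IGNORECASE` stands in for the "case insensitive" regex option.

```python
import re

# Scheme regex from this recipe, with the hyphen escaped inside the class.
SCHEME_RE = re.compile(r"^([a-z][a-z0-9+\-.]*):", re.IGNORECASE)


def has_scheme(url: str) -> bool:
    """Return True if the string starts with a scheme followed by a colon."""
    return SCHEME_RE.match(url) is not None


# Hypothetical relative references. None of them starts with
# "letter, scheme characters, colon", so the regex finds no match.
relative_refs = [
    "images/logo.png",
    "/docs/index.html",
    "./this:that",           # colon present, but preceded by "./"
    "//example.com:8080/x",  # colon present, but preceded by "//"
]

for ref in relative_refs:
    print(ref, has_scheme(ref))  # prints False for every entry

print(has_scheme("http://www.regexcookbook.com"))  # True
```

The third and fourth samples contain colons, but in each case the colon is preceded by characters that cannot appear in a scheme, so the attempt fails before the colon is reached.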

Since the regex matches more than just the scheme itself (it includes the colon), we’ve added a capturing group to the regular expression. When the regex finds a match, you can retrieve the text matched by the first (and only) capturing group to get the scheme without the colon. Recipe 2.9 tells you all about capturing groups. See Recipe 3.9 to learn how to retrieve text matched by capturing groups in your favorite programming language.
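As one possible way to retrieve the capturing group, the following Python sketch wraps the regex in a small function. The function name `extract_scheme` is hypothetical, and the choice to return `None` when there is no match is a design decision made for this example only.

```python
import re
from typing import Optional

# Scheme regex from this recipe; group 1 holds the scheme without the colon.
SCHEME_RE = re.compile(r"^([a-z][a-z0-9+\-.]*):", re.IGNORECASE)


def extract_scheme(url: str) -> Optional[str]:
    """Return the scheme of a URL assumed to be valid, or None if absent."""
    match = SCHEME_RE.match(url)
    if match is None:
        return None
    # match.group(0) is the overall match including the colon ("http:");
    # match.group(1) is only the text matched by the capturing group.
    return match.group(1)


print(extract_scheme("http://www.regexcookbook.com"))  # http
print(extract_scheme("images/logo.png"))               # None
```

Because the match is case insensitive, the scheme is returned exactly as it was written in the subject text. Lowercase the result yourself if you need to compare schemes.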

If you don’t already know that your subject text is a valid URL, you can use a simplified version of the regex from Recipe 8.7. Since we want to extract the scheme, we can exclude relative URLs, which don’t specify a scheme. That makes the regular expression slightly simpler.

Since this regex matches the whole URL, we added an extra capturing group around the part of the regex that matches the scheme. Retrieve the text matched by capturing group number 1 to get the URL’s scheme.
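The validating variant could be used from Python roughly as follows. This is a sketch under stated assumptions: the pattern is the free-spacing regex from this recipe with its backslash escapes (`\A`, `\Z`, `\-`, `\[`, `\]`, `\?`, `\#`) written out, `re.VERBOSE` supplies the free-spacing option, and the function name `extract_scheme_validated` and the sample URLs are invented for the illustration.

```python
import re
from typing import Optional

# Free-spacing regex from this recipe. re.VERBOSE ignores the whitespace
# and "#" comments outside character classes; group 1 is the scheme.
VALID_URL_RE = re.compile(r"""
\A
([a-z][a-z0-9+\-.]*):
(# Authority & path
 //
 ([a-z0-9\-._~%!$&'()*+,;=]+@)?              # User
 ([a-z0-9\-._~%]+                            # Named host
 |\[[a-f0-9:.]+\]                            # IPv6 host
 |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
 (:[0-9]+)?                                  # Port
 (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
|# Path without authority
 (/?[a-z0-9\-._~%!$&'()*+,;=:@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?)?
)
# Query
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
# Fragment
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?
\Z
""", re.VERBOSE | re.IGNORECASE)


def extract_scheme_validated(text: str) -> Optional[str]:
    """Return the scheme only if the whole string matches the URL regex."""
    match = VALID_URL_RE.match(text)
    return match.group(1) if match else None


print(extract_scheme_validated("http://www.regexcookbook.com"))  # http
print(extract_scheme_validated("mailto:jan@example.com"))        # mailto
print(extract_scheme_validated("http://exa mple.com"))           # None
```

The last call returns `None` because the space is not allowed anywhere in the URL, so the regex cannot reach the end of the string. The simpler scheme-only regex would still have reported `http` for that input.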
