8.1. Validating URLs

Problem

You want to check whether a given piece of text is a URL that is valid for your purposes.

Solution

Allow almost any URL:

^(https?|ftp|file)://.+$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
\A(https?|ftp|file)://.+\Z
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

Require a domain name, and don’t allow a username or password:

\A                          # Anchor
(https?|ftp)://             # Scheme
[a-z0-9-]+(\.[a-z0-9-]+)+   # Domain
([/?].*)?                   # Path and/or parameters
\Z                          # Anchor
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^(https?|ftp)://[a-z0-9-]+(\.[a-z0-9-]+)+([/?].*)?$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Require a domain name, and don’t allow a username or password. Allow the scheme (http or ftp) to be omitted if it can be inferred from the subdomain (www or ftp):

\A                              # Anchor
((https?|ftp)://|(www|ftp)\.)   # Scheme or subdomain
[a-z0-9-]+(\.[a-z0-9-]+)+       # Domain
([/?].*)?                       # Path and/or parameters
\Z                              # Anchor
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^((https?|ftp)://|(www|ftp)\.)[a-z0-9-]+(\.[a-z0-9-]+)+([/?].*)?$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Require a domain name and a path that points to an image file. Don’t allow a username, password, or parameters:

\A                          # Anchor
(https?|ftp)://             # Scheme
[a-z0-9-]+(\.[a-z0-9-]+)+   # Domain
(/[\w-]+)*                  # Path
/[\w-]+\.(gif|png|jpg)      # File
\Z                          # Anchor
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^(https?|ftp)://[a-z0-9-]+(\.[a-z0-9-]+)+(/[\w-]+)*/[\w-]+\.(gif|png|jpg)$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Discussion

You cannot create a regular expression that matches every valid URL without matching any invalid URLs. The reason is that pretty much anything could be a valid URL in some as-yet-uninvented scheme.

Validating URLs becomes useful only when we know the context in which those URLs have to be valid. We then can limit the URLs we accept to schemes supported by the software we’re using. All the regular expressions for this recipe are for URLs used by web browsers. Such URLs use the form:

scheme://user:password@domain:80/path/file.ext?param=value&param2=value2#fragment

All these parts are in fact optional. A file: URL has only a path. http: URLs only need a domain name.

The solutions presented in this recipe work with the generally accepted rules for valid URLs that are used by most web browsers and other applications. They do not attempt to implement RFC 3986, which is the official standard for URLs. Follow Recipe 8.7 instead of this recipe if you want a solution compliant with RFC 3986.

The first regular expression in the solution checks whether the URL begins with one of the common schemes used by web browsers: http, https, ftp, and file. The caret anchors the regex to the start of the string (Recipe 2.5). Alternation (Recipe 2.8) is used to spell out the list of schemes. https? is a clever way of saying http|https.

Because the first regex allows for rather different schemes, such as http and file, it doesn’t try to validate the text after the scheme. .+$ simply grabs everything until the end of the string, as long as the string doesn’t contain any line break characters.

By default, the dot (Recipe 2.4) matches all characters except line break characters, and the dollar (Recipe 2.5) does not match at embedded line breaks. Ruby is the exception here. In Ruby, caret and dollar always match at embedded line breaks, and so we have to use \A and \Z instead (Recipe 2.5). Strictly speaking, you’d have to make the same change for Ruby for all the other regular expressions shown in this recipe. You should…if your input could consist of multiple lines and you want to avoid matching a URL that takes up one line in several lines of text.
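The difference between the two anchor styles can be tried out in Python. The following is only a minimal sketch: the names URL_PER_LINE and URL_WHOLE_STRING and the sample strings are made up for illustration, and re.MULTILINE is switched on deliberately to imitate the way Ruby treats caret and dollar.

```python
import re

# Caret and dollar with re.MULTILINE also match at embedded line breaks,
# which imitates Ruby's default behavior for these anchors.
URL_PER_LINE = re.compile(
    r"^(https?|ftp|file)://.+$", re.IGNORECASE | re.MULTILINE
)

# \A and \Z anchor to the start and end of the whole subject string.
URL_WHOLE_STRING = re.compile(r"\A(https?|ftp|file)://.+\Z", re.IGNORECASE)

# Hypothetical sample input: a URL that takes up one line in several lines.
multi_line = "see below\nhttp://www.example.com/page\nthanks"
single_line = "HTTP://www.example.com/page"

per_line_match = URL_PER_LINE.search(multi_line)
whole_match_multi = URL_WHOLE_STRING.search(multi_line)
whole_match_single = URL_WHOLE_STRING.search(single_line)

print(per_line_match.group(0) if per_line_match else None)
print(whole_match_multi is not None)
print(whole_match_single is not None)
```

With line-oriented anchors the URL on the second line is found, whereas the \A and \Z version accepts the text only when the URL is the entire string.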

The next two regular expressions are the free-spacing (Recipe 2.18) and regular versions of the same regex. The free-spacing regex is easier to read, whereas the regular version is faster to type. JavaScript does not support free-spacing regular expressions.

These two regexes accept only web and FTP URLs, and require the HTTP or FTP scheme to be followed by something that looks like a valid domain name. The domain name must be in ASCII. Internationalized domains (IDNs) are not accepted. The domain can be followed by a path or a list of parameters, separated from the domain with a forward slash or a question mark. Since the question mark is inside a character class (Recipe 2.3), we don’t need to escape it. The question mark is an ordinary character in character classes, and the forward slash is an ordinary character anywhere in a regular expression. (If you see it escaped in source code, that’s because Perl and several other programming languages use forward slashes to delimit literal regular expressions.)

No attempt is made to validate the path or the parameters. .* simply matches anything that doesn’t include line breaks. Since the path and parameters are both optional, [/?].* is placed inside a group that is made optional with a question mark (Recipe 2.12).
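As a rough illustration, the free-spacing regex can be used in Python through the re.VERBOSE flag. This is a sketch rather than a definitive implementation: the helper name is_domain_url and the sample URLs are assumptions made for the example.

```python
import re

# Free-spacing version of the "require a domain name" regex.
# re.VERBOSE lets the pattern carry whitespace and comments.
DOMAIN_URL = re.compile(
    r"""
    \A                          # Anchor
    (https?|ftp)://             # Scheme
    [a-z0-9-]+(\.[a-z0-9-]+)+   # Domain
    ([/?].*)?                   # Path and/or parameters
    \Z                          # Anchor
    """,
    re.VERBOSE | re.IGNORECASE,
)


def is_domain_url(text):
    """Return True if text is an http, https, or ftp URL with a domain name."""
    return DOMAIN_URL.search(text) is not None


# Hypothetical sample inputs.
for candidate in [
    "http://www.example.com",
    "https://example.com/path?x=1",
    "ftp://ftp.example.com?list",
    "http://localhost",                  # no dot, so not a domain here
    "http://user:secret@example.com/",   # user information is rejected
]:
    print(candidate, is_domain_url(candidate))
```

The first three candidates are accepted and the last two are rejected: one lacks a dot in the domain, and the other carries a username and password.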

These regular expressions, and the ones that follow, don’t allow a username or password to be specified as part of the URL. Putting user information in a URL is considered bad practice for security reasons.

Most web browsers accept URLs that don’t specify the scheme, and correctly infer the scheme from the domain name. For example, www.regexbuddy.com is short for http://www.regexbuddy.com. To allow such URLs, we simply expand the list of schemes allowed by the regular expression to include the subdomains www. and ftp..

(https?|ftp)://|(www|ftp)\. does this nicely. This list has two alternatives, each of which starts with two alternatives. The first alternative allows https? and ftp, which must be followed by ://. The second alternative allows www and ftp, which must be followed by a dot. You can easily edit both lists to change the schemes and subdomains the regex should accept.
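One possible use of the capturing groups is to fill in the inferred scheme after a successful match. The sketch below assumes Python; the function name normalize_url is hypothetical, and mapping www to http and ftp to ftp follows the inference described above rather than any rule built into the regex itself.

```python
import re

SCHEME_OR_SUBDOMAIN = re.compile(
    r"^((https?|ftp)://|(www|ftp)\.)[a-z0-9-]+(\.[a-z0-9-]+)+([/?].*)?$",
    re.IGNORECASE,
)


def normalize_url(text):
    """Return text with an explicit scheme, or None if it does not match."""
    match = SCHEME_OR_SUBDOMAIN.match(text)
    if match is None:
        return None
    # Group 3 participates only when the scheme was omitted and the text
    # starts with the www or ftp subdomain instead.
    subdomain = match.group(3)
    if subdomain is None:
        return text  # An explicit scheme was already present.
    scheme = "ftp" if subdomain.lower() == "ftp" else "http"
    return f"{scheme}://{text}"


print(normalize_url("www.regexbuddy.com"))
print(normalize_url("ftp.example.com/pub"))
print(normalize_url("http://www.regexbuddy.com"))
print(normalize_url("example.com"))
```

A bare example.com is rejected because it has neither a scheme nor one of the two listed subdomains.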

The last two regular expressions require a scheme, an ASCII domain name, a path, and a filename to a GIF, PNG, or JPEG image file. The path and filename allow all letters and digits in any script, as well as underscores and hyphens. The shorthand character class \w includes all that, except the hyphens (Recipe 2.3).
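The image regex can be exercised the same way. In this Python sketch the helper name is_image_url and the sample URLs are invented for illustration. For str patterns, Python 3 treats \w as Unicode-aware by default, which fits the "any script" behavior described above.

```python
import re

IMAGE_URL = re.compile(
    r"^(https?|ftp)://[a-z0-9-]+(\.[a-z0-9-]+)+(/[\w-]+)*/[\w-]+\.(gif|png|jpg)$",
    re.IGNORECASE,
)


def is_image_url(text):
    """Return True if text is a URL whose path ends in a GIF, PNG, or JPG file."""
    return IMAGE_URL.match(text) is not None


# Hypothetical sample inputs.
for candidate in [
    "http://example.com/images/photo_1.png",
    "http://example.com/фото/снимок.jpg",   # non-ASCII letters in the path
    "http://example.com/logo.GIF",          # case insensitive extension
    "http://example.com/photo.jpeg",        # jpeg is not in the list
    "http://example.com/pic.gif?size=2",    # parameters are not allowed
    "http://example.com/",                  # no filename at all
]:
    print(candidate, is_image_url(candidate))
```

The first three candidates are accepted and the last three are rejected. To accept more file types, extend the (gif|png|jpg) alternation.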

Which of these regular expressions should you use? That really depends on what you’re trying to do. In many situations, the answer may be to not use any regular expression at all. Simply try to resolve the URL. If it returns valid content, accept it. If you get a 404 or other error, reject it. Ultimately, that’s the only real test to see whether a URL is valid.
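A resolve-and-check test along these lines could be sketched in Python with the standard urllib.request module. The function name url_resolves and the five-second timeout are assumptions for this example. Treating any error as "invalid" is a simplification, since a temporary network failure would also cause a good URL to be rejected.

```python
import urllib.error
import urllib.request


def url_resolves(url, timeout=5.0):
    """Try to open url and report whether it returned content without error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # HTTP responses carry a status code; other schemes such as
            # file: or ftp: may not, so a missing status counts as success.
            status = getattr(response, "status", None)
            return status is None or 200 <= status < 300
    except (urllib.error.URLError, ValueError, OSError):
        # A 404 or other HTTP error, an unreachable host, or text that is
        # not a URL at all ends up here.
        return False
```

Calling url_resolves("http://www.regexbuddy.com") performs a real network request, so the result depends on connectivity at the time of the call. Because the check fetches whatever it is given, one of the regexes from this recipe can serve as a cheap first filter before the text is resolved.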

See Also

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition.

Recipe 8.7 provides a solution that follows RFC 3986.
