Allow almost any URL:
^(https?|ftp|file)://.+$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
\A(https?|ftp|file)://.+\Z
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
Require a domain name, and don’t allow a username or password:
\A                          # Anchor
(https?|ftp)://             # Scheme
[a-z0-9-]+(\.[a-z0-9-]+)+   # Domain
([/?].*)?                   # Path and/or parameters
\Z                          # Anchor
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^(https?|ftp)://[a-z0-9-]+(\.[a-z0-9-]+)+([/?].*)?$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Require a domain name, and don’t allow a username or password. Allow the scheme (http or ftp) to be omitted if it can be inferred from the subdomain (www or ftp):
\A                              # Anchor
((https?|ftp)://|(www|ftp)\.)   # Scheme or subdomain
[a-z0-9-]+(\.[a-z0-9-]+)+       # Domain
([/?].*)?                       # Path and/or parameters
\Z                              # Anchor
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^((https?|ftp)://|(www|ftp)\.)[a-z0-9-]+(\.[a-z0-9-]+)+([/?].*)?$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
Require a domain name and a path that points to an image file. Don’t allow a username, password, or parameters:
\A                          # Anchor
(https?|ftp)://             # Scheme
[a-z0-9-]+(\.[a-z0-9-]+)+   # Domain
(/[\w-]+)*                  # Path
/[\w-]+\.(gif|png|jpg)      # File
\Z                          # Anchor
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^(https?|ftp)://[a-z0-9-]+(\.[a-z0-9-]+)+(/[\w-]+)*/[\w-]+\.(gif|png|jpg)$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
You cannot create a regular expression that matches every valid URL without matching any invalid URLs. The reason is that pretty much anything could be a valid URL in some as-yet-uninvented scheme.
Validating URLs becomes useful only when we know the context in which those URLs have to be valid. We then can limit the URLs we accept to schemes supported by the software we’re using. All the regular expressions for this recipe are for URLs used by web browsers. Such URLs use the form:
scheme://user:password@domain:80/path/file.ext?param=value&param2=value2#fragment
All these parts are in fact optional. A file: URL has only a path. http: URLs only need a domain name.
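To make the parts of that form concrete, here is a minimal sketch that splits URLs with `urlsplit` from Python's standard `urllib.parse` module. The URLs and the password `secret` are made up for illustration, and this is a parser, not a validator; it is not one of the recipe's solutions.

```python
from urllib.parse import urlsplit

# Illustrative URL with every part present (host and credentials are made up).
parts = urlsplit(
    "http://user:secret@www.example.com:80/path/file.ext"
    "?param=value&param2=value2#fragment"
)

print(parts.scheme)    # scheme
print(parts.username)  # user name
print(parts.password)  # password
print(parts.hostname)  # domain
print(parts.port)      # port number, as an int
print(parts.path)      # path and file
print(parts.query)     # parameters
print(parts.fragment)  # fragment

# A file: URL carries only a path; an http: URL can consist of just a domain.
file_parts = urlsplit("file:///path/file.ext")
http_parts = urlsplit("http://www.example.com")
```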
The solutions presented in this recipe work with the generally accepted rules for valid URLs that are used by most web browsers and other applications. They do not attempt to implement RFC 3986, which is the official standard for URLs. Follow Recipe 8.7 instead of this recipe if you want a solution compliant with RFC 3986.
The first regular expression in the solution checks whether the URL begins with one of the common schemes used by web browsers: http, https, ftp, and file. The caret anchors the regex to the start of the string (Recipe 2.5). Alternation (Recipe 2.8) is used to spell out the list of schemes. ‹https?› is a clever way of saying ‹http|https›.
Because the first regex allows for rather different schemes, such as http and file, it doesn’t try to validate the text after the scheme. ‹.+$› simply grabs everything until the end of the string, as long as the string doesn’t contain any line break characters.
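As a rough illustration, the first regex could be applied in Python like this. The helper name `looks_like_url` and the sample strings are assumptions made for this sketch, not part of the recipe.

```python
import re

# First regex from the solution, with the case-insensitive option it calls for.
ANY_URL = re.compile(r"^(https?|ftp|file)://.+$", re.IGNORECASE)


def looks_like_url(text):
    """Return True if text starts with an accepted scheme followed by anything."""
    return ANY_URL.match(text) is not None


for candidate in [
    "http://www.example.com",     # accepted
    "HTTPS://example.com/a b c",  # accepted: nothing after the scheme is checked
    "file:///home/user/doc.txt",  # accepted
    "mailto:user@example.com",    # rejected: scheme is not in the list
    "http://",                    # rejected: .+ needs at least one character
    "http://a\nb",                # rejected: embedded line break
]:
    print(repr(candidate), looks_like_url(candidate))
```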
By default, the dot (Recipe 2.4) matches all characters except line break characters, and the dollar (Recipe 2.5) does not match at embedded line breaks. Ruby is the exception here. In Ruby, caret and dollar always match at embedded line breaks, and so we have to use ‹\A› and ‹\Z› instead (Recipe 2.5). Strictly speaking, you’d have to make the same change for Ruby for all the other regular expressions shown in this recipe. You should do so if your input could consist of multiple lines and you want to avoid matching a URL that takes up one line in several lines of text.
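The difference between the two kinds of anchors can be sketched in Python, where the `re.MULTILINE` flag gives caret and dollar the line-based behavior that the text describes as Ruby's default. The sample text is made up. Note that in Python, ‹\Z› matches only at the very end of the string.

```python
import re

# Three lines of text; only the middle one is a URL.
text = "first line\nhttp://www.example.com\nlast line"

# With re.MULTILINE, ^ and $ also match at embedded line breaks.
line_anchored = re.compile(
    r"^(https?|ftp|file)://.+$", re.IGNORECASE | re.MULTILINE
)

# \A and \Z ignore line breaks and anchor to the whole string.
string_anchored = re.compile(r"\A(https?|ftp|file)://.+\Z", re.IGNORECASE)

line_hit = line_anchored.search(text)      # finds the URL on the second line
string_hit = string_anchored.search(text)  # None: the string is not just a URL

print(line_hit.group(0) if line_hit else None)
print(string_hit)
```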
The next two regular expressions are the free-spacing (Recipe 2.18) and regular versions of the same regex. The free-spacing regex is easier to read, whereas the regular version is faster to type. JavaScript does not support free-spacing regular expressions.
These two regexes accept only web and FTP URLs, and require the HTTP or FTP scheme to be followed by something that looks like a valid domain name. The domain name must be in ASCII. Internationalized domains (IDNs) are not accepted. The domain can be followed by a path or a list of parameters, separated from the domain with a forward slash or a question mark. Since the question mark is inside a character class (Recipe 2.3), we don’t need to escape it. The question mark is an ordinary character in character classes, and the forward slash is an ordinary character anywhere in a regular expression. (If you see it escaped in source code, that’s because Perl and several other programming languages use forward slashes to delimit literal regular expressions.)
No attempt is made to validate the path or the parameters. ‹.*› simply matches anything that doesn’t include line breaks. Since the path and parameters are both optional, ‹[/?].*› is placed inside a group that is made optional with a question mark (Recipe 2.12).
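In Python, the free-spacing version can be used through the `re.VERBOSE` flag. The following is a minimal sketch; the function name `is_web_or_ftp_url` and the test URLs are assumptions made for illustration.

```python
import re

# Free-spacing version of the domain-requiring regex, compiled with re.VERBOSE
# so that whitespace and # comments inside the pattern are ignored.
WEB_OR_FTP_URL = re.compile(
    r"""
    \A                          # Anchor
    (https?|ftp)://             # Scheme
    [a-z0-9-]+(\.[a-z0-9-]+)+   # Domain
    ([/?].*)?                   # Path and/or parameters
    \Z                          # Anchor
    """,
    re.VERBOSE | re.IGNORECASE,
)


def is_web_or_ftp_url(text):
    """Return True if text is an http, https, or ftp URL with a dotted domain."""
    return WEB_OR_FTP_URL.match(text) is not None


print(is_web_or_ftp_url("http://www.example.com"))          # domain only
print(is_web_or_ftp_url("https://example.com?q=1"))         # parameters, no path
print(is_web_or_ftp_url("http://localhost"))                # no dot in the domain
print(is_web_or_ftp_url("http://user:secret@example.com"))  # user information
```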
These regular expressions, and the ones that follow, don’t allow a username or password to be specified as part of the URL. Putting user information in a URL is considered bad practice for security reasons.
Most web browsers accept URLs that don’t specify the scheme, and correctly infer the scheme from the domain name. For example, www.regexbuddy.com is short for http://www.regexbuddy.com. To allow such URLs, we simply expand the list of schemes allowed by the regular expression to include the subdomains www. and ftp.. ‹(https?|ftp)://|(www|ftp)\.› does this nicely. This list has two alternatives, each of which starts with two alternatives. The first alternative allows ‹https?› and ‹ftp›, which must be followed by ‹://›. The second alternative allows ‹www› and ‹ftp›, which must be followed by a dot. You can easily edit both lists to change the schemes and subdomains the regex should accept.
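One way to put this regex to work is to fill in the omitted scheme after a successful match. The sketch below is an assumption-laden illustration: the helper `with_inferred_scheme` is hypothetical, and mapping www. to http and ftp. to ftp is this sketch's reading of how the scheme is inferred.

```python
import re

# Regular (non-free-spacing) version of the scheme-or-subdomain regex.
SCHEME_OR_SUBDOMAIN_URL = re.compile(
    r"^((https?|ftp)://|(www|ftp)\.)[a-z0-9-]+(\.[a-z0-9-]+)+([/?].*)?$",
    re.IGNORECASE,
)


def with_inferred_scheme(text):
    """Return text with a scheme prepended if it was omitted, or None if invalid."""
    match = SCHEME_OR_SUBDOMAIN_URL.match(text)
    if match is None:
        return None
    # Group 3 participates only when the URL began with "www." or "ftp.".
    subdomain = match.group(3)
    if subdomain is None:
        return text  # an explicit scheme was already present
    scheme = "http" if subdomain.lower() == "www" else "ftp"
    return f"{scheme}://{text}"


print(with_inferred_scheme("www.regexbuddy.com"))
print(with_inferred_scheme("ftp.example.com/pub"))
print(with_inferred_scheme("https://example.com"))
print(with_inferred_scheme("regexbuddy.com"))
```

A bare domain such as regexbuddy.com is rejected because it has neither a scheme nor one of the two listed subdomains.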
The last two regular expressions require a scheme, an ASCII domain name, a path, and a filename to a GIF, PNG, or JPEG image file. The path and filename allow all letters and digits in any script, as well as underscores and hyphens. The shorthand character class ‹\w› includes all that, except the hyphens (Recipe 2.3).
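A short Python sketch of the image-file regex follows. The helper `image_extension` and the sample URLs are invented for illustration. In Python 3, ‹\w› in a `str` pattern matches letters and digits in any script by default, which is the behavior described above.

```python
import re

# Regular version of the image-file regex; group 4 captures the extension.
IMAGE_URL = re.compile(
    r"^(https?|ftp)://[a-z0-9-]+(\.[a-z0-9-]+)+(/[\w-]+)*/[\w-]+\.(gif|png|jpg)$",
    re.IGNORECASE,
)


def image_extension(url):
    """Return the lowercased image extension of url, or None if url is rejected."""
    match = IMAGE_URL.match(url)
    return match.group(4).lower() if match else None


print(image_extension("http://www.example.com/images/logo.png"))
print(image_extension("http://example.com/photo.JPG"))
print(image_extension("http://example.com/фото.png"))             # non-ASCII filename
print(image_extension("http://example.com/photo.jpg?size=large")) # parameters
print(image_extension("http://example.com/photo.jpeg"))           # unlisted extension
```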
Which of these regular expressions should you use? That really depends on what you’re trying to do. In many situations, the answer may be to not use any regular expression at all. Simply try to resolve the URL. If it returns valid content, accept it. If you get a 404 or other error, reject it. Ultimately, that’s the only real test to see whether a URL is valid.
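Resolving a URL instead of pattern-matching it might look like the following minimal sketch, which uses Python's standard `urllib.request`. The function name `url_resolves`, the `timeout` default, and the injectable `opener` parameter (there so the function can be exercised without network access) are all assumptions of this sketch.

```python
import urllib.error
import urllib.request


def url_resolves(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if fetching url succeeds, False on an HTTP or network error."""
    try:
        # urlopen raises HTTPError for responses such as 404, so reaching
        # the body of the with statement means the server returned content.
        with opener(url, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError, ValueError):
        # URLError/OSError: HTTP errors, DNS failures, refused connections.
        # ValueError: the string is not something urlopen recognizes as a URL.
        return False
```

Calling `url_resolves("http://www.example.com/")` performs a real network request, so the result depends on connectivity at the time of the call.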
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition.
Recipe 8.7 provides a solution that follows RFC 3986.