You want to find URLs in a larger body of text. URLs may or may not be enclosed in punctuation, such as parentheses, that are not part of the URL.
URL without spaces:
\b(https?|ftp|file)://\S+
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
URL without spaces or final punctuation:
\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
URL without spaces or final punctuation. URLs that start with the www or ftp subdomain can omit the scheme:
\b((https?|ftp|file)://|(www|ftp)\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
Given the text:
Visit http://www.somesite.com/page, where you will find more information.
what is the URL?
Before you say http://www.somesite.com/page, think about
this: punctuation and spaces are valid characters in URLs. Though RFC
3986 (see Recipe 8.7) does not allow literal
spaces in URLs, all major browsers accept URLs with literal spaces just
fine. Some WYSIWYG web authoring tools even make it easy for the user to
put spaces in file and folder names, and include those spaces literally
in links to those files.
That means that if we use a regular expression that allows all valid URLs, it will find this URL in the preceding text:
http://www.somesite.com/page, where you will find more information.
The odds are small that the person who typed in this sentence
intended the spaces to be part of the URL. The first regular expression
in the solution excludes them using the shorthand character class
‹\S›, which includes all characters that are not whitespace. Though the regex
specifies the “case insensitive” option, the S must be uppercase,
because ‹\S› is not the same as ‹\s›. In fact, they’re exactly the opposite.
Recipe 2.3 has all the details.
The first regular expression is still quite crude. It will treat the comma in the example text as part of the URL. Though it’s not uncommon for URLs to include commas and other punctuation, punctuation rarely occurs at the end of the URL.
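To make the trailing-comma problem concrete, here is a minimal Python sketch. It assumes the first regex is written as `\b(https?|ftp|file)://\S+` and uses `re.IGNORECASE` for the “case insensitive” option; the variable names are only illustrative.

```python
import re

# First regex from the solution: a scheme followed by any run of
# non-whitespace characters. re.IGNORECASE plays the role of the
# "case insensitive" regex option.
URL_NO_SPACES = re.compile(r"\b(https?|ftp|file)://\S+", re.IGNORECASE)

text = "Visit http://www.somesite.com/page, where you will find more information."

match = URL_NO_SPACES.search(text)
found = match.group(0) if match else None

# \S+ stops at the first space, but it happily swallows the comma.
print(found)  # http://www.somesite.com/page,
```

The match ends at the first space, so the sentence's own comma ends up inside the URL.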
The next regular expression uses two character classes instead of
the single shorthand ‹\S›.
The first character class includes more punctuation than the second. The
second class excludes those characters that are likely to appear as
English language punctuation right after a URL when the URL is placed
into an English sentence. The first character class has the asterisk
quantifier (Recipe 2.12), to allow URLs of any
length. The second character class has no quantifier, requiring the URL
to end with one character from that class. The character classes don’t
include the lowercase letters; the “case insensitive” option takes care of those.
See Recipe 3.4 to learn how to set such
options in your programming language.
The second regex will work incorrectly with certain URLs that use odd punctuation, matching those URLs only partially. But this regex does solve the very common problem of a comma or full stop right after a URL, while still allowing commas and dots within the URL.
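The following Python sketch illustrates both behaviors of the second regex. The example.com URLs are made up for illustration, and the pattern is assumed to carry a leading `\b` word boundary.

```python
import re

# Second regex from the solution: a permissive class repeated with *,
# then one final character from a stricter class that omits the
# punctuation likely to follow a URL in an English sentence.
URL_NO_FINAL_PUNCT = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

text = (
    "Visit http://www.somesite.com/page, where you will find more information. "
    "Commas inside are kept: http://example.com/a,b.html. "
    "Parentheses cut the match short: http://example.com/wiki/Name_(topic)"
)

urls = [m.group(0) for m in URL_NO_FINAL_PUNCT.finditer(text)]
for url in urls:
    print(url)
# http://www.somesite.com/page
# http://example.com/a,b.html
# http://example.com/wiki/Name_
```

The first two results drop the trailing comma and full stop while keeping the comma and dots inside the URL. The third shows a partial match: parentheses appear in neither character class, so the match stops just before the opening parenthesis.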
Most web browsers accept URLs that don’t specify the scheme, and
correctly infer the scheme from the domain name. For example, www.regexbuddy.com
is short for http://www.regexbuddy.com. To allow such URLs,
the final regex expands the list of allowed schemes to include the
subdomains www. and ftp.
‹(https?|ftp)://|(www|ftp)\.› does this nicely.
This list has two alternatives, each of which starts with two
alternatives. The first alternative allows ‹https?› and ‹ftp›, which must be
followed by ‹://›. The second alternative allows ‹www› and ‹ftp›, which must
be followed by a dot. You can easily edit both lists to change the schemes and subdomains the regex should accept.
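As a rough illustration of the scheme-or-subdomain alternation, the Python sketch below runs the final regex over a sentence that mixes both forms. The pattern is assumed to use an escaped dot (`\.`) after the subdomain group and a leading `\b`; ftp.example.com is a hypothetical host.

```python
import re

# Final regex from the solution: either a scheme followed by ://,
# or a www./ftp. subdomain with no scheme at all.
URL_OPTIONAL_SCHEME = re.compile(
    r"\b((https?|ftp|file)://|(www|ftp)\.)"
    r"[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

text = (
    "See www.regexbuddy.com, ftp.example.com/pub/file.zip "
    "and http://www.regexbuddy.com."
)

# The pattern contains capturing groups, so take group(0) for the whole match.
urls = [m.group(0) for m in URL_OPTIONAL_SCHEME.finditer(text)]
for url in urls:
    print(url)
# www.regexbuddy.com
# ftp.example.com/pub/file.zip
# http://www.regexbuddy.com
```

Because the pattern contains capturing groups, `re.findall` would return tuples of group contents instead of whole matches, which is why the sketch uses `finditer` with `group(0)`.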
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition.
Recipe 8.5 gives a replacement text that you can use in combination with this regular expression to create a search-and-replace that converts URLs into HTML anchors.