8.2. Finding URLs Within Full Text

Problem

You want to find URLs in a larger body of text. URLs may or may not be enclosed in punctuation, such as parentheses, that are not part of the URL.

Solution

URL without spaces:

\b(https?|ftp|file)://\S+
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

URL without spaces or final punctuation:

\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*↵
[A-Z0-9+&@#/%=~_|$]
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

URL without spaces or final punctuation. URLs that start with the www or ftp subdomain can omit the scheme:

\b((https?|ftp|file)://|(www|ftp)\.)[-A-Z0-9+&@#/%?=~_|$!:,.;]*↵
[A-Z0-9+&@#/%=~_|$]
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

Given the text:

Visit http://www.somesite.com/page, where you will find more information.

what is the URL?

Before you say http://www.somesite.com/page, think about this: punctuation and spaces are valid characters in URLs. Though RFC 3986 (see Recipe 8.7) does not allow literal spaces in URLs, all major browsers accept URLs with literal spaces just fine. Some WYSIWYG web authoring tools even make it easy for the user to put spaces in file and folder names, and include those spaces literally in links to those files.

That means that if we use a regular expression that allows all valid URLs, it will find this URL in the preceding text:

http://www.somesite.com/page, where you will find more information.

The odds are small that the person who typed in this sentence intended the spaces to be part of the URL. The first regular expression in the solution excludes them using the shorthand character class \S, which includes all characters that are not whitespace. Though the regex specifies the “case insensitive” option, the S must be uppercase, because \S is not the same as \s. In fact, they’re exactly the opposite. Recipe 2.3 has all the details.

The first regular expression is still quite crude. It will include the comma in the example text as part of the URL. Though it’s not uncommon for URLs to include commas and other punctuation, punctuation rarely occurs at the end of the URL.
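
A minimal Python sketch of this behavior, assuming the first solution regex with its backslashes written out and Python's re.IGNORECASE flag standing in for the “case insensitive” option. The variable names are illustrative only:

```python
import re

# First solution regex: a scheme, "://", then any run of non-whitespace.
# re.IGNORECASE plays the role of the "case insensitive" option.
url_no_spaces = re.compile(r"\b(https?|ftp|file)://\S+", re.IGNORECASE)

text = "Visit http://www.somesite.com/page, where you will find more information."

match = url_no_spaces.search(text)
first_url = match.group(0)

# \S+ stops at the first space, so the comma is still part of the match.
print(first_url)
```

The match ends at the first whitespace character, so the comma that follows "page" is treated as part of the URL.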

The next regular expression uses two character classes instead of the single shorthand \S. The first character class includes more punctuation than the second. The second class excludes those characters that are likely to appear as English language punctuation right after a URL when the URL is placed into an English sentence. The first character class has the asterisk quantifier (Recipe 2.12), to allow URLs of any length. The second character class has no quantifier, requiring the URL to end with one character from that class. The character classes don’t include the lowercase letters; the “case insensitive” option takes care of those. See Recipe 3.4 to learn how to set such options in your programming language.

The second regex will work incorrectly with certain URLs that use odd punctuation, matching those URLs only partially. But this regex does solve the very common problem of a comma or full stop right after a URL, while still allowing commas and dots within the URL.
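
The trade-off can be sketched in Python as follows. This is only an illustration: find_urls is a hypothetical helper, and the example.com URLs are made-up test inputs rather than anything from the recipe.

```python
import re

# Second solution regex: the final character class omits characters such as
# ! : , . ; ? that often follow a URL in a sentence, so they cannot end a match.
url_no_final_punct = re.compile(
    r"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

def find_urls(text):
    """Return every substring of text that the regex matches."""
    return [m.group(0) for m in url_no_final_punct.finditer(text)]

# A comma or full stop after the URL is left out; those inside it are kept.
trailing = find_urls("Visit http://www.somesite.com/page, where you will find more.")
inner = find_urls("See http://example.com/a,b/c.html.")

# Parentheses are in neither class, so this made-up URL is matched only partially.
partial = find_urls("Read http://example.com/wiki/Tools_(Software) first.")

print(trailing, inner, partial)
```

The last call shows the partial-match weakness: the match stops just before the opening parenthesis, even though the parenthesized text was meant to be part of the URL.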

Most web browsers accept URLs that don’t specify the scheme, and correctly infer the scheme from the domain name. For example, www.regexbuddy.com is short for http://www.regexbuddy.com. To allow such URLs, the final regex expands the list of allowed schemes to include the subdomains www. and ftp..

(https?|ftp|file)://|(www|ftp)\. does this nicely. This list has two alternatives, each of which starts with its own list of alternatives. The first alternative allows https?, ftp, and file, which must be followed by ://. The second alternative allows www and ftp, which must be followed by a literal dot. You can easily edit both lists to change the schemes and subdomains the regex should accept.
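
As a rough Python sketch of the final regex, assuming the dot after www and ftp is escaped so that it matches only a literal dot. The sample sentence and the FTP.example.com host are invented for illustration:

```python
import re

# Third solution regex: a URL may start with a scheme and "://",
# or with the subdomain "www." or "ftp." and no scheme at all.
url_optional_scheme = re.compile(
    r"\b((https?|ftp|file)://|(www|ftp)\.)"
    r"[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

text = "Try www.regexbuddy.com, or download from FTP.example.com/pub/file.zip."

# finditer with group(0) returns whole matches; the pattern has capturing
# groups, so re.findall would return the group contents instead.
found = [m.group(0) for m in url_optional_scheme.finditer(text)]

print(found)
```

Both URLs are found without a scheme, and the comma and full stop that follow them in the sentence are left out of the matches.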

See Also

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.6 explains word boundaries. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition.

Recipe 8.5 gives a replacement text that you can use in combination with this regular expression to create a search-and-replace that converts URLs into HTML anchors.
