8.12. Extracting the Path from a URL

Problem

You want to extract the path from a string that holds a URL. For example, you want to extract /index.html from http://www.regexcookbook.com/index.html or from /index.html#fragment.

Solution

Extract the path from a string known to hold a valid URL. The following finds a match for all URLs, even for URLs that have no path:

\A
# Skip over scheme and authority, if any
([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?
# Path
([a-z0-9\-._~%!$&'()*+,;=:@/]*)
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?([a-z0-9\-._~%!$&'()*+,;=:@/]*)
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Extract the path from a string known to hold a valid URL. Only match URLs that actually have a path:

\A
# Skip over scheme and authority, if any
([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?
# Path
(/?[a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?|/)
# Query, fragment, or end of URL
([#?]|\Z)
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?(/?[a-z0-9\-._~%!$&'()*+,;=@]+↵
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?|/)([#?]|$)
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Extract the path from a string known to hold a valid URL. Use atomic grouping to match only those URLs that actually have a path:

\A
# Skip over scheme and authority, if any
(?>([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?)
# Path
([a-z0-9\-._~%!$&'()*+,;=:@/]+)
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Ruby

Discussion

You can use a much simpler regular expression to extract the path if you already know that your subject text is a valid URL. While the generic regex in Recipe 8.7 has three different ways to match the path, depending on whether the URL specifies a scheme and/or authority, the specific regex for extracting the path from a URL known to be valid needs to match the path only once.

We start with \A or ^ to anchor the match to the start of the string. [a-z][a-z0-9+\-.]*: skips over the scheme, and //[^/?#]+ skips over the authority. We can use this very simple regex for the authority because we already know it to be valid, and we’re not interested in extracting the user, host, or port from the authority. The authority starts with two forward slashes, and runs until the start of the path (forward slash), query (question mark), or fragment (hash). The negated character class matches everything up to the first forward slash, question mark, or hash (Recipe 2.3).

Because the authority is optional, we put it into a group followed by the question mark quantifier: (//[^/?#]+)?. The scheme is also optional. If the scheme is omitted, the authority must be omitted, too. To match this, we place the parts of the regex for the scheme and the optional authority in another group, also made optional with a question mark.

Since we know the URL to be valid, we can easily match the path with a single character class [a-z0-9\-._~%!$&'()*+,;=:@/]* that includes the forward slash. We don’t need to check for consecutive forward slashes, which aren’t allowed in paths in URLs.
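As a minimal sketch of how this first regex might be used, the following Python snippet compiles the free-spacing version and reads the path from the third capturing group. The names PATH_REGEX and extract_path are our own illustration, not part of the recipe. The backslashes before the hyphens keep them literal inside the character classes.

```python
import re

# Free-spacing version of the first regex in this recipe.
PATH_REGEX = re.compile(r"""
    \A
    # Skip over scheme and authority, if any
    ([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?
    # Path
    ([a-z0-9\-._~%!$&'()*+,;=:@/]*)
    """, re.VERBOSE | re.IGNORECASE)


def extract_path(url):
    """Return the path of a URL assumed to be valid (hypothetical helper)."""
    # Every part of the regex is optional, so match() succeeds on any string.
    match = PATH_REGEX.match(url)
    # Group 3 holds the path; it is the empty string when the URL has no path.
    return match.group(3)


print(extract_path("http://www.regexcookbook.com/index.html"))  # /index.html
print(extract_path("/index.html#fragment"))                     # /index.html
print(repr(extract_path("http://www.regexcookbook.com")))       # ''
```

Because the regex assumes valid input, this helper does no validation of its own; feeding it arbitrary text simply returns whatever leading run of path characters it finds.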

Note that we use an asterisk rather than a plus as the quantifier on the character class for the path. It may seem strange to make the path optional in a regex that only exists to extract the path from a URL. In fact, making the path optional is essential because of the shortcuts we took in skipping over the scheme and the authority.

In the generic regex for URLs in Recipe 8.7, we have three different ways of matching the path, depending on whether the scheme and/or authority are present in the URL. This makes sure the scheme isn’t accidentally matched as the path.

Now we’re trying to keep things simple by using only one character class for the path. Consider the URL http://www.regexcookbook.com, which has a scheme and an authority but no path. The first part of our regex will happily match the scheme and the authority. The regex engine then tries to match the character class for the path, but there are no characters left. If the path is optional (using the asterisk quantifier), the regex engine is perfectly happy not to match any characters for the path. It reaches the end of the regex and declares that an overall match has been found.

But if the character class for the path is not optional, the regex engine backtracks. (See Recipe 2.13 if you’re not familiar with backtracking.) It remembers that the authority and scheme parts of our regex are optional, so the engine says: let’s try this again, without allowing (//[^/?#]+)? to match anything. [a-z0-9\-._~%!$&'()*+,;=:@/]+ would then match //www.regexcookbook.com for the path, clearly not what we want. If we used a more accurate regex for the path to disallow the double forward slashes, the regex engine would simply backtrack again, and pretend the URL has no scheme. With an accurate regex for the path, it would match http as the path. To prevent that as well, we would have to add an extra check to make sure the path is followed by the query, fragment, or nothing at all. If we do all that, we end up with the regular expressions indicated as “only match URLs that actually have a path” in this recipe’s “Solution” section. These are quite a bit more complicated than the first two, all just to make the regex not match URLs without a path.
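To see this backtracking in action, here is a small Python sketch that compares the asterisk and the plus quantifier on the URL without a path. The variable names are ours. One detail worth noting: when traced in Python's re module, the first thing the engine gives back is the last character of the authority, so the plus variant reports m as the path before it ever gets to retrying without the authority. Either way, a required path ends up being carved out of text that belongs to the scheme or the authority.

```python
import re

# Scheme and authority, both optional (groups 1 and 2).
SKIP = r"([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?"
# Character class for the path, including the forward slash.
PATH_CHARS = r"[a-z0-9\-._~%!$&'()*+,;=:@/]"

# Same regex twice; only the quantifier on the path (group 3) differs.
optional_path = re.compile("^" + SKIP + "(" + PATH_CHARS + "*)", re.IGNORECASE)
required_path = re.compile("^" + SKIP + "(" + PATH_CHARS + "+)", re.IGNORECASE)

url = "http://www.regexcookbook.com"

m_opt = optional_path.match(url)
m_req = required_path.match(url)

# With *, the authority keeps all of its text and the path is empty.
print(repr(m_opt.group(2)), repr(m_opt.group(3)))
# With +, the engine backtracks into the authority to free up a path character.
print(repr(m_req.group(2)), repr(m_req.group(3)))
```

Running this prints '//www.regexcookbook.com' '' for the asterisk and '//www.regexcookbook.co' 'm' for the plus.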

If your regex flavor supports atomic grouping, there’s an easier way. All flavors discussed in this book, except JavaScript and Python, support atomic grouping (see Recipe 2.14). Essentially, an atomic group tells the regex engine not to backtrack. If we place the scheme and authority parts of our regex inside an atomic group, the regex engine will be forced to keep the matches of the scheme and authority parts once they’ve been matched, even if that allows no room for the character class for the path to match. This solution is just as efficient as making the path optional.
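For a flavor without atomic groups, one commonly used workaround is to capture inside a lookahead and then consume the capture with a backreference, since the engine does not backtrack into a lookahead once it has matched. The sketch below applies that idea in Python as an emulation of the atomic-group regex. It is our own substitution, not one of this recipe's solutions, and the extra wrapper group moves the path to group 4. (Python's re module accepts (?>...) directly from version 3.11 onward.)

```python
import re

ATOMIC_EMULATED = re.compile(r"""
    \A
    # The lookahead captures scheme and authority as group 1, then the
    # backreference consumes exactly that text. Nothing can be given back.
    (?=(([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?))\1
    # Path, required (group 4 because of the wrapper group)
    ([a-z0-9\-._~%!$&'()*+,;=:@/]+)
    """, re.VERBOSE | re.IGNORECASE)

no_path = ATOMIC_EMULATED.match("http://www.regexcookbook.com")
with_path = ATOMIC_EMULATED.match("http://www.regexcookbook.com/index.html")
relative = ATOMIC_EMULATED.match("/index.html#fragment")

print(no_path)             # None: the URL has no path, so there is no match
print(with_path.group(4))  # /index.html
print(relative.group(4))   # /index.html
```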

Regardless of which regular expression you choose from this recipe, the third capturing group will hold the path. The third capturing group may return the empty string, or null in JavaScript, if you use one of the first two regexes that allow the path to be optional.

If you don’t already know that your subject text is a valid URL, you can use the regex from Recipe 8.7. If you’re using .NET, you can use the .NET-specific regex that includes three groups named “path” to capture the three parts of the regex that could match the URL’s path. If you use another flavor that supports named capture, one of three groups will have captured it: “hostpath,” “schemepath,” or “relpath.” Since only one of the three groups will actually capture anything, a simple trick to get the path is to concatenate the strings returned by the three groups. Two of them will return the empty string, so no actual concatenation is done.

If your flavor does not support named capture, you can use the first regex in Recipe 8.7. It captures the path in group 6, 7, or 8. You can use the same trick to concatenate the text captured by these three groups, as two of them will return the empty string. In JavaScript, however, this won’t work. JavaScript returns undefined for groups that don’t participate.
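The concatenation trick needs the same kind of care in Python, where Match.group() returns None rather than the empty string for a group that did not participate. The sketch below uses a deliberately simplified, hypothetical stand-in for the Recipe 8.7 regex (it is not that regex and does not validate URLs) that borrows only the three group names, and substitutes the empty string before joining.

```python
import re

# Hypothetical, much-simplified stand-in for the Recipe 8.7 regex.
TOY_URL = re.compile(r"""
    \A(?:
        [a-z][a-z0-9+\-.]*://[^/?#]+(?P<hostpath>/[^?#]*)?  # scheme and authority
      | [a-z][a-z0-9+\-.]*:(?P<schemepath>[^?#]+)           # scheme only
      | (?P<relpath>[^?#]+)                                 # relative URL
    )
    """, re.VERBOSE | re.IGNORECASE)


def path_from_groups(url):
    """Join the three path groups; at most one of them captured anything."""
    match = TOY_URL.match(url)
    if match is None:
        return None
    parts = match.group("hostpath", "schemepath", "relpath")
    # Replace None (group did not participate) with "" before concatenating.
    return "".join(part or "" for part in parts)


print(path_from_groups("http://www.regexcookbook.com/index.html"))  # /index.html
print(path_from_groups("mailto:someone@example.com"))               # someone@example.com
print(path_from_groups("/index.html#fragment"))                     # /index.html
```

The same `part or ""` substitution would apply to numbered groups, for example when reading groups 6, 7, and 8 of the first regex in Recipe 8.7.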

Recipe 3.9 has more information on retrieving the text matched by named and numbered capturing groups in your favorite programming language.

See Also

Recipe 3.9 shows code to get the text matched by a particular part (capturing group) of a regex. Use this to get the path.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.11 explains named capturing groups. Recipe 2.12 explains repetition. Recipe 2.18 explains how to add comments.
