You want to extract the path from a string that holds a URL. For example, you want to extract /index.html from http://www.regexcookbook.com/index.html or from /index.html#fragment.
Extract the path from a string known to hold a valid URL. The following finds a match for all URLs, even for URLs that have no path:
\A
# Skip over scheme and authority, if any
([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?
# Path
([a-z0-9\-._~%!$&'()*+,;=:@/]*)
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?([a-z0-9\-._~%!$&'()*+,;=:@/]*)
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
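As a rough illustration, the single-line regex can be tried out in Python's re module. Python is an assumption here, since the recipe itself is language-neutral, and the function name extract_path is made up for this sketch. Group 3 holds the path.

```python
import re

# Single-line form of the regex; capturing group 3 is the path.
PATH_RE = re.compile(
    r"^([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?([a-z0-9\-._~%!$&'()*+,;=:@/]*)",
    re.IGNORECASE,
)

def extract_path(url: str) -> str:
    """Return the path of a URL assumed to be valid ('' if it has none)."""
    match = PATH_RE.match(url)
    # Every part of the regex is optional, so a match is always found;
    # the fallback is purely defensive.
    return match.group(3) if match else ""

print(extract_path("http://www.regexcookbook.com/index.html"))  # /index.html
print(extract_path("/index.html#fragment"))                     # /index.html
print(repr(extract_path("http://www.regexcookbook.com")))       # ''
```

The last call shows the case of a URL without a path: the regex still matches, and group 3 is simply empty.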
Extract the path from a string known to hold a valid URL. Only match URLs that actually have a path:
\A
# Skip over scheme and authority, if any
([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?
# Path
(/?[a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?|/)
# Query, fragment, or end of URL
([#?]|\Z)
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?(/?[a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?|/)([#?]|$)
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
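A minimal sketch of this second variant in Python follows. The language choice and the function name extract_required_path are assumptions made for illustration. The path is again in group 3, and the function returns None when the regex finds no match.

```python
import re

# Single-line "only match URLs that have a path" regex; group 3 is the path.
PATH_REQUIRED_RE = re.compile(
    r"^([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?"
    r"(/?[a-z0-9\-._~%!$&'()*+,;=@]+(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?|/)"
    r"([#?]|$)",
    re.IGNORECASE,
)

def extract_required_path(url):
    """Return the path in group 3, or None when the regex does not match."""
    match = PATH_REQUIRED_RE.match(url)
    return match.group(3) if match else None

print(extract_required_path("http://www.regexcookbook.com/index.html"))  # /index.html
print(extract_required_path("/index.html#fragment"))                     # /index.html
print(extract_required_path("#fragment"))                                # None
```

One thing worth checking in your own flavor: in Python's backtracking engine, the authority's ‹[^/?#]+› can still give back its last character. As a result, extract_required_path("http://www.regexcookbook.com") returns m instead of None. The atomic-group variant that follows in this recipe does not leave that escape route open.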
Extract the path from a string known to hold a valid URL. Use atomic grouping to match only those URLs that actually have a path:
\A
# Skip over scheme and authority, if any
(?>([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?)
# Path
([a-z0-9\-._~%!$&'()*+,;=:@/]+)
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Ruby
You can use a much simpler regular expression to extract the path if you already know that your subject text is a valid URL. While the generic regex in Recipe 8.7 has three different ways to match the path, depending on whether the URL specifies a scheme and/or authority, the specific regex for extracting the path from a URL known to be valid needs to match the path only once.
We start with ‹\A› or ‹^› to anchor the match to the start of the string. ‹[a-z][a-z0-9+\-.]*:› skips over the scheme, and ‹//[^/?#]+› skips over the authority. We can use this very simple regex for the authority because we already know it to be valid, and we’re not interested in extracting the user, host, or port from the authority. The authority starts with two forward slashes, and runs until the start of the path (forward slash), query (question mark), or fragment (hash). The negated character class matches everything up to the first forward slash, question mark, or hash (Recipe 2.3).
Because the authority is optional, we put it into a group followed by the question mark quantifier: ‹(//[^/?#]+)?›. The scheme is also optional. If the scheme is omitted, the authority must be omitted, too. To match this, we place the parts of the regex for the scheme and the optional authority in another group, also made optional with a question mark.
Since we know the URL to be valid, we can easily match the path with a single character class ‹[a-z0-9\-._~%!$&'()*+,;=:@/]*› that includes the forward slash. We don’t need to check for consecutive forward slashes, which aren’t allowed in paths in URLs.
We indeed use an asterisk rather than a plus as the quantifier on the character class for the path. It may seem strange to make the path optional in a regex that only exists to extract the path from a URL. Actually, making the path optional is essential because of the shortcuts we took in skipping over the scheme and the authority.
In the generic regex for URLs in Recipe 8.7, we have three different ways of matching the path, depending on whether the scheme and/or authority are present in the URL. This makes sure the scheme isn’t accidentally matched as the path.
Now we’re trying to keep things simple by using only one character class for the path. Consider the URL http://www.regexcookbook.com, which has a scheme and an authority but no path. The first part of our regex will happily match the scheme and the authority. The regex engine then tries to match the character class for the path, but there are no characters left. If the path is optional (using the asterisk quantifier), the regex engine is perfectly happy not to match any characters for the path. It reaches the end of the regex and declares that an overall match has been found.
But if the character class for the path is not optional, the regex engine backtracks. (See Recipe 2.13 if you’re not familiar with backtracking.) It remembers that the authority and scheme parts of our regex can give up what they matched. A backtracking engine first makes ‹[^/?#]+› give back its last character, which is already enough for ‹[a-z0-9\-._~%!$&'()*+,;=:@/]+› to match the trailing m of the host name as the path. If that were ruled out, the engine would say: let’s try this again, without allowing ‹(//[^/?#]+)?› to match anything. ‹[a-z0-9\-._~%!$&'()*+,;=:@/]+› would then match //www.regexcookbook.com for the path, clearly not what we want. If we used a more accurate regex for the path to disallow the double forward slashes, the regex engine would simply backtrack again, and pretend the URL has no scheme. With an accurate regex for the path, it would match http as the path. To prevent that as well, we would have to add an extra check to make sure the path is followed by the query, fragment, or nothing at all. If we do all that, we end up with the regular expressions indicated as “only match URLs that actually have a path” earlier in this recipe. These are quite a bit more complicated than the first two, all just to make the regex not match URLs without a path.
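The backtracking can be observed directly. The sketch below, which assumes Python's re module, takes the first regex and swaps the asterisk on the path class for a plus. In this engine the first backtracking step that succeeds is the authority giving back its final character, so an unwanted path is captured before the engine ever needs to drop the authority altogether.

```python
import re

# The first regex of this recipe, with + instead of * on the path class.
PATH_PLUS_RE = re.compile(
    r"^([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?([a-z0-9\-._~%!$&'()*+,;=:@/]+)",
    re.IGNORECASE,
)

match = PATH_PLUS_RE.match("http://www.regexcookbook.com")
# The authority (group 2) was forced to hand its last character to the path.
print(match.group(2))  # //www.regexcookbook.co
print(match.group(3))  # m

# A URL that does have a path needs no backtracking and is unaffected.
match = PATH_PLUS_RE.match("http://www.regexcookbook.com/index.html")
print(match.group(3))  # /index.html
```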
If your regex flavor supports atomic grouping, there’s an easier way. All flavors discussed in this book, except JavaScript and Python, support atomic grouping (see Recipe 2.14). Essentially, an atomic group tells the regex engine not to backtrack. If we place the scheme and authority parts of our regex inside an atomic group, the regex engine will be forced to keep the matches of the scheme and authority parts once they’ve been matched, even if that allows no room for the character class for the path to match. This solution is just as efficient as making the path optional.
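Python's re module only gained ‹(?>…)› in version 3.11. For older versions, a common workaround is to emulate the atomic group with a capturing group inside a lookahead, followed by a backreference: ‹(?=(…))\1›. The sketch below applies that emulation to the atomic-group regex of this recipe. It is an adaptation for illustration, not one of the recipe's own solutions, and the function name is hypothetical. The helper group shifts the numbering, so the path ends up in group 4 rather than group 3.

```python
import re

# (?=(...))\1 stands in for (?>...): once the lookahead has captured the
# scheme and authority, the engine cannot backtrack into that capture.
# Group 1 is the helper group, so the path is in group 4.
ATOMIC_PATH_RE = re.compile(
    r"\A(?=(([a-z][a-z0-9+\-.]*:(//[^/?#]+)?)?))\1"
    r"([a-z0-9\-._~%!$&'()*+,;=:@/]+)",
    re.IGNORECASE,
)

def extract_required_path_atomic(url):
    """Return the path (group 4), or None if the URL has no path."""
    match = ATOMIC_PATH_RE.match(url)
    return match.group(4) if match else None

print(extract_required_path_atomic("http://www.regexcookbook.com/index.html"))  # /index.html
print(extract_required_path_atomic("/index.html#fragment"))                     # /index.html
print(extract_required_path_atomic("http://www.regexcookbook.com"))             # None
```

Because the scheme and authority cannot give anything back, the URL without a path fails to match instead of yielding a bogus path.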
Regardless of which regular expression you choose from this recipe, the third capturing group will hold the path. The third capturing group may return the empty string, or null in JavaScript, if you use one of the first two regexes that allow the path to be optional.
If you don’t already know that your subject text is a valid URL, you can use the regex from Recipe 8.7. If you’re using .NET, you can use the .NET-specific regex that includes three groups named “path” to capture the three parts of the regex that could match the URL’s path. If you use another flavor that supports named capture, one of three groups will have captured it: “hostpath,” “schemepath,” or “relpath.” Since only one of the three groups will actually capture anything, a simple trick to get the path is to concatenate the strings returned by the three groups. Two of them will return the empty string, so no actual concatenation is done.
If your flavor does not support named capture, you can use the first regex in Recipe 8.7. It captures the path in group 6, 7, or 8. You can use the same trick to concatenate the text captured by these three groups, as two of them will return the empty string. In JavaScript, however, this won’t work. JavaScript returns undefined for groups that don’t participate.
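Python has a similar wrinkle: its re module returns None for groups that did not participate, so each group needs a fallback to the empty string before concatenating. The sketch below shows the idea with a deliberately tiny stand-in regex. It is not the regex from Recipe 8.7 and only borrows the three group names mentioned above; the function name any_path is hypothetical.

```python
import re

# Toy stand-in, NOT the Recipe 8.7 regex: three alternatives, each with
# its own path group, so at most one of the groups participates.
TOY_URL_RE = re.compile(
    r"\A(?:"
    r"[a-z]+://[^/?#]+(?P<hostpath>/[^?#]*)?"  # scheme and authority
    r"|[a-z]+:(?P<schemepath>[^?#]*)"          # scheme without authority
    r"|(?P<relpath>[^?#]*)"                    # relative reference
    r")",
    re.IGNORECASE,
)

def any_path(url):
    """Concatenate the three path groups, treating None as ''."""
    match = TOY_URL_RE.match(url)
    if match is None:
        return ""
    names = ("hostpath", "schemepath", "relpath")
    return "".join(match.group(name) or "" for name in names)

print(any_path("http://www.regexcookbook.com/index.html"))  # /index.html
print(any_path("mailto:jan@example.com"))                   # jan@example.com
print(any_path("/index.html#fragment"))                     # /index.html
```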
Recipe 3.9 has more information on retrieving the text matched by named and numbered capturing groups in your favorite programming language.
Recipe 3.9 shows code to get the text matched by a particular part (capturing group) of a regex. Use this to get the path.
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.11 explains named capturing groups. Recipe 2.12 explains repetition. Recipe 2.18 explains how to add comments.