8.9. Extracting the User from a URL

Problem

You want to extract the user from a string that holds a URL. For example, you want to extract jan from ftp://[email protected].

Solution

Extract the user from a URL known to be valid

^[a-z0-9+\-.]+://([a-z0-9\-._~%!$&'()*+,;=]+)@
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Extract the user while validating the URL

\A
[a-z][a-z0-9+\-.]*://                       # Scheme
([a-z0-9\-._~%!$&'()*+,;=]+)@               # User
([a-z0-9\-._~%]+                            # Named host
|\[[a-f0-9:.]+\]                            # IPv6 host
|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
(:[0-9]+)?                                  # Port
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?         # Query
(\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?         # Fragment
\Z
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^[a-z][a-z0-9+\-.]*://([a-z0-9\-._~%!$&'()*+,;=]+)@([a-z0-9\-._~%]+|↵
\[[a-f0-9:.]+\]|\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])(:[0-9]+)?↵
(/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?(\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?↵
(#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Discussion

Extracting the user from a URL is easy if you already know that your subject text is a valid URL. The username, if present in the URL, occurs right after the scheme and the two forward slashes that begin the “authority” part of the URL. The username is separated from the hostname that follows it with an @ sign. Since @ signs are not valid in hostnames, we can be sure that we’re extracting the username portion of a URL if we find an @ sign after the two forward slashes and before the next forward slash in the URL. Forward slashes are not valid in usernames, so we don’t need to do any special checking for them.

All these rules mean we can easily extract the username if we know the URL to be valid. We just skip over the scheme with [a-z0-9+\-.]+ and the ://. Then, we grab the username that follows. If we can match the @ sign, we know that the characters before it are the username. The character class [a-z0-9\-._~%!$&'()*+,;=] lists all the characters that are valid in usernames.

This regex will find a match only if the URL actually specifies a user. When it does, the regex will match both the scheme and the user parts of the URL. Therefore, we’ve added a capturing group to the regular expression. When the regex finds a match, you can retrieve the text matched by the first (and only) capturing group to get the username without any delimiters or other URL parts. Recipe 2.9 tells you all about capturing groups. See Recipe 3.9 to learn how to retrieve text matched by capturing groups in your favorite programming language.
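As a minimal sketch of retrieving that capturing group, here is the first regex from the solution used in Python. The function name extract_user and the host ftp.example.com are illustrative assumptions, not part of the recipe. The sketch assumes the subject text is already known to be a valid URL.

```python
import re

# First regex from the solution. re.IGNORECASE supplies the
# "case insensitive" regex option.
USER_RE = re.compile(
    r"^[a-z0-9+\-.]+://([a-z0-9\-._~%!$&'()*+,;=]+)@",
    re.IGNORECASE,
)

def extract_user(url):
    """Return the user part of a URL assumed to be valid, or None.

    extract_user is a hypothetical helper name used for illustration.
    """
    match = USER_RE.match(url)
    # Group 1 holds the username without the scheme, "://", or "@".
    return match.group(1) if match else None

# ftp.example.com is a placeholder host chosen for this example.
print(extract_user("ftp://jan@ftp.example.com/pub"))  # jan
print(extract_user("http://www.example.com/"))        # None
```

The second call returns None because the regex finds no @ sign before the first forward slash that follows the authority, so the URL specifies no user.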

If you don’t already know that your subject text is a valid URL, you can use a simplified version of the regex from Recipe 8.7. Since we want to extract the user, we can exclude URLs that don’t specify an authority. The regex in the solution actually matches only URLs that specify an authority that includes a username. Requiring the authority part of the URL makes the regular expression quite a bit simpler. It’s even simpler than the one we used in Recipe 8.8.

Since this regex matches the whole URL, we added an extra capturing group around the part of the regex that matches the user. Retrieve the text matched by capturing group number 1 to get the URL’s user.
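The validating variant can be sketched in Python along the same lines. The example below assumes the free-spacing form of the regex, with re.VERBOSE standing in for the free-spacing option. The name extract_user_strict and the sample URLs are hypothetical.

```python
import re

# Free-spacing regex from the solution. re.VERBOSE enables free-spacing
# mode and re.IGNORECASE makes the match case insensitive.
URL_WITH_USER_RE = re.compile(r"""
    \A
    [a-z][a-z0-9+\-.]*://                       # Scheme
    ([a-z0-9\-._~%!$&'()*+,;=]+)@               # User (group 1)
    ([a-z0-9\-._~%]+                            # Named host
    |\[[a-f0-9:.]+\]                            # IPv6 host
    |\[v[a-f0-9][a-z0-9\-._~%!$&'()*+,;=:]+\])  # IPvFuture host
    (:[0-9]+)?                                  # Port
    (/[a-z0-9\-._~%!$&'()*+,;=:@]+)*/?          # Path
    (\?[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?         # Query
    (\#[a-z0-9\-._~%!$&'()*+,;=:@/?]*)?         # Fragment
    \Z
    """, re.IGNORECASE | re.VERBOSE)

def extract_user_strict(text):
    """Return the user only if text is a URL that this regex accepts.

    extract_user_strict is a hypothetical helper name.
    """
    match = URL_WITH_USER_RE.match(text)
    return match.group(1) if match else None

# Placeholder URLs chosen for this example.
print(extract_user_strict("ftp://jan@ftp.example.com:21/pub/file.txt"))  # jan
print(extract_user_strict("ftp://jan@ftp.example.com/pub file"))         # None
```

The second call returns None because the space is not a valid path character, so the regex rejects the whole string rather than returning a user from an invalid URL.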

If you want a regex that matches any valid URL, including those that don’t specify the user, you can use one of the regexes from Recipe 8.7. The first regex in that recipe captures the user, if present, in the third capturing group. The capturing group will include the @ symbol. You can add an extra capturing group to the regex if you want to capture the username without the @ symbol.
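The idea of nesting an extra capturing group can be illustrated with a deliberately simplified stand-in. The regex below is not the one from Recipe 8.7. It is an assumption made for this sketch that handles only the scheme, an optional user, and a named host. The function name split_user is likewise hypothetical.

```python
import re

# Simplified stand-in regex, NOT the regex from Recipe 8.7.
# Group 1 captures the user including the trailing "@".
# Group 2, nested inside group 1, captures the user alone.
# Group 3 captures a named host.
OPTIONAL_USER_RE = re.compile(
    r"^[a-z][a-z0-9+\-.]*://"
    r"(([a-z0-9\-._~%!$&'()*+,;=]+)@)?"
    r"([a-z0-9\-._~%]+)",
    re.IGNORECASE,
)

def split_user(url):
    """Return (user_with_at, user_without_at); both are None if no user."""
    match = OPTIONAL_USER_RE.match(url)
    if not match:
        return (None, None)
    return (match.group(1), match.group(2))

# www.example.com is a placeholder host chosen for this example.
print(split_user("http://jan@www.example.com/"))  # ('jan@', 'jan')
print(split_user("http://www.example.com/"))      # (None, None)
```

Keep in mind that adding a capturing group shifts the numbers of all groups that open after it, so any code that reads later groups by number has to be adjusted.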

See Also

Recipe 3.9 shows code to get the text matched by a particular part (capturing group) of a regex. Use this to get the username.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.11 explains named capturing groups. Recipe 2.12 explains repetition. Recipe 2.18 explains how to add comments.
