8.6. Validating URNs

Problem

You want to check whether a string represents a valid Uniform Resource Name (URN), as specified in RFC 2141, or find URNs in a larger body of text.

Solution

Check whether a string consists entirely of a valid URN:

\Aurn:
# Namespace Identifier
[a-z0-9][a-z0-9-]{0,31}:
# Namespace Specific String
[a-z0-9()+,\-.:=@;$_!*'%/?#]+
\Z
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^urn:[a-z0-9][a-z0-9-]{0,31}:[a-z0-9()+,\-.:=@;$_!*'%/?#]+$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Find a URN in a larger body of text:

urn:
# Namespace Identifier
[a-z0-9][a-z0-9-]{0,31}:
# Namespace Specific String
[a-z0-9()+,\-.:=@;$_!*'%/?#]+
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
urn:[a-z0-9][a-z0-9-]{0,31}:[a-z0-9()+,\-.:=@;$_!*'%/?#]+
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Find a URN in a larger body of text, assuming that punctuation at the end of the URN is part of the (English) text in which the URN is quoted rather than part of the URN itself:

urn:
# Namespace Identifier
[a-z0-9][a-z0-9-]{0,31}:
# Namespace Specific String
[a-z0-9()+,\-.:=@;$_!*'%/?#]*[a-z0-9+=@$/]
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
urn:[a-z0-9][a-z0-9-]{0,31}:[a-z0-9()+,\-.:=@;$_!*'%/?#]*[a-z0-9+=@$/]
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

A URN consists of three parts. The first part is the four characters urn:, which we can add literally to the regular expression.

The second part is the Namespace Identifier (NID). It is between 1 and 32 characters long. The first character must be a letter or a digit. The remaining characters can be letters, digits, and hyphens. We match this using two character classes (Recipe 2.3): the first one matches a letter or a digit, and the second one matches between 0 and 31 letters, digits, and hyphens. The NID must be delimited with a colon, which we again add literally to the regex.

The third part of the URN is the Namespace Specific String (NSS). It must be at least one character long, has no upper length limit, and can include a range of punctuation characters in addition to letters and digits. We match this with another character class. The plus after the character class repeats it one or more times (Recipe 2.12).
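As a rough illustration of the three parts, the following Python sketch wraps the NID and NSS in named groups so they can be pulled out separately. The group names nid and nss and the helper split_urn are hypothetical names chosen for this example, not part of the recipe.

```python
import re

# Free-spacing pattern with one named group per variable part of the URN.
# The names "nid" and "nss" are illustrative assumptions.
URN_PARTS = re.compile(
    r"""
    urn:
    (?P<nid>[a-z0-9][a-z0-9-]{0,31}):       # Namespace Identifier
    (?P<nss>[a-z0-9()+,\-.:=@;$_!*'%/?#]+)  # Namespace Specific String
    """,
    re.VERBOSE | re.IGNORECASE,
)


def split_urn(text):
    """Return (nid, nss) if text is a URN in its entirety, else None."""
    match = URN_PARTS.fullmatch(text)
    if match is None:
        return None
    return match.group("nid"), match.group("nss")


print(split_urn("urn:isbn:0451450523"))  # ('isbn', '0451450523')
print(split_urn("urn:-bad:value"))       # None
```

In Python's free-spacing mode, a # inside a character class is a literal character rather than the start of a comment, which is why the NSS class can be written unchanged here.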

If you want to check whether a string represents a valid URN, all that remains is to add anchors to the start and the end of the regex that match at the start and the end of the string. We can do this with ^ and $ in all flavors except Ruby, and with \A and \Z in all flavors except JavaScript. Recipe 2.5 has all the details on these anchors.
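A minimal sketch of the whole-string check in Python, using the \A and \Z anchors. The function name is_valid_urn is an assumption made for this example.

```python
import re

# Anchored, case-insensitive version of the URN regex.
URN_RE = re.compile(
    r"\Aurn:[a-z0-9][a-z0-9-]{0,31}:[a-z0-9()+,\-.:=@;$_!*'%/?#]+\Z",
    re.IGNORECASE,
)


def is_valid_urn(text):
    """Return True only if the entire string is a syntactically valid URN."""
    return URN_RE.search(text) is not None


print(is_valid_urn("urn:isbn:0451450523"))    # True
print(is_valid_urn("urn::missing-nid"))       # False
print(is_valid_urn("urn:isbn:0451450523\n"))  # False
```

In Python, \Z matches only at the very end of the string, whereas $ also matches just before a trailing line break, so the \A and \Z form is the stricter choice for validation in that flavor.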

Things are a little trickier if you want to find URNs in a larger body of text. The punctuation issue with URLs discussed in Recipe 8.2 also exists for URNs. Suppose you have the text:

The URN is urn:nid:nss, isn't it?

The issue is whether the comma is part of the URN. URNs that end with commas are syntactically valid, but any human reading this English-language sentence would see the comma as English punctuation, not as part of the URN. The last regular expression in the section solves this issue by being a little more strict than RFC 2141. It restricts the last character of the URN to be a character that is valid for the NSS part, and is not likely to appear as English punctuation in a sentence mentioning a URN.

This is easily done by replacing the plus quantifier (one or more) with an asterisk (zero or more), and adding a second character class for the final character. If we added the character class without changing the quantifier, we’d require the NSS to be at least two characters long, which isn’t what we want.
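To see the difference on the sample sentence, here is a short Python sketch that runs both search regexes over it. The names PREFIX, NSS_CHARS, LENIENT, and STRICT are hypothetical and exist only for this illustration.

```python
import re

# Building blocks shared by both regexes (names are illustrative).
PREFIX = r"urn:[a-z0-9][a-z0-9-]{0,31}:"
NSS_CHARS = r"[a-z0-9()+,\-.:=@;$_!*'%/?#]"

# Allows any valid NSS character at the end, including punctuation.
LENIENT = re.compile(PREFIX + NSS_CHARS + r"+", re.IGNORECASE)
# Requires the last character to be one that is unlikely to be
# English punctuation in the surrounding sentence.
STRICT = re.compile(PREFIX + NSS_CHARS + r"*[a-z0-9+=@$/]", re.IGNORECASE)

text = "The URN is urn:nid:nss, isn't it?"
print(LENIENT.findall(text))  # ['urn:nid:nss,']
print(STRICT.findall(text))   # ['urn:nid:nss']
```

The stricter regex still matches a URN whose NSS is a single character, because the asterisk lets the first character class match nothing.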

See Also

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.12 explains repetition. Recipe 2.18 explains how to add comments.
