8.18. Validate Windows Paths

Problem

You want to check whether a string looks like a valid path to a folder or file on the Microsoft Windows operating system.

Solution

Drive letter paths

\A
[a-z]:\\                    # Drive
(?:[^\\/:*?"<>|\r\n]+\\)*   # Folder
[^\\/:*?"<>|\r\n]*          # File
\Z
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^[a-z]:\\(?:[^\\/:*?"<>|\r\n]+\\)*[^\\/:*?"<>|\r\n]*$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python
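
As an illustration, the free-spacing regex might be used from Python as follows. This is a minimal sketch, not part of the recipe itself: the names DRIVE_PATH and is_drive_path are made up for the example, and a raw triple-quoted string is assumed so that the backslashes and the double quote reach the regex engine unchanged.

```python
import re

# Free-spacing drive letter regex from this recipe.
# DRIVE_PATH and is_drive_path are hypothetical names for this sketch.
DRIVE_PATH = re.compile(r'''
    \A
    [a-z]:\\                    # Drive
    (?:[^\\/:*?"<>|\r\n]+\\)*   # Folder
    [^\\/:*?"<>|\r\n]*          # File
    \Z
''', re.VERBOSE | re.IGNORECASE)


def is_drive_path(text):
    """Return True if text looks like a full drive letter path."""
    return DRIVE_PATH.match(text) is not None


print(is_drive_path(r'C:\Windows\System32\notepad.exe'))  # True
print(is_drive_path('C:\\Windows\\'))                      # True
print(is_drive_path(r'C:\bad|name.txt'))                   # False
print(is_drive_path('notes.txt'))                          # False
```

The second call uses an ordinary string literal because a Python raw string cannot end with a backslash.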

Drive letter and UNC paths

\A
(?:[a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\  # Drive
(?:[^\\/:*?"<>|\r\n]+\\)*                      # Folder
[^\\/:*?"<>|\r\n]*                             # File
\Z
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^(?:[a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\(?:[^\\/:*?"<>|\r\n]+\\)*[^\\/:*?"<>|\r\n]*$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Drive letter, UNC, and relative paths

\A
(?:(?:[a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\|  # Drive
   \\?[^\\/:*?"<>|\r\n]+\\?)                       # Relative path
(?:[^\\/:*?"<>|\r\n]+\\)*                          # Folder
[^\\/:*?"<>|\r\n]*                                 # File
\Z
Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^(?:(?:[a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\|\\?[^\\/:*?"<>|\r\n]+\\?)(?:[^\\/:*?"<>|\r\n]+\\)*[^\\/:*?"<>|\r\n]*$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Discussion

Drive letter paths

Matching a full path to a file or folder on a drive that has a drive letter is straightforward. The drive is indicated with a single letter, followed by a colon and a backslash. We match this with [a-z]:\\. The backslash is a metacharacter in regular expressions, so we need to escape it with another backslash to match it literally.

Folder and filenames on Windows can contain all characters except these: \/:*?"<>|. Line breaks aren’t allowed either. We can match a sequence of all characters except these with the negated character class [^\\/:*?"<>|\r\n]+. The backslash is a metacharacter in character classes too, so we escape it. \r and \n are the two line break characters. See Recipe 2.3 to learn more about (negated) character classes. The plus quantifier (Recipe 2.12) specifies that we want one or more such characters.
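
The negated character class can be tried on its own against a single folder or filename. The following Python sketch assumes a hypothetical helper called is_valid_name; it only applies the character rule described above and does not check anything else about Windows names.

```python
import re

# One or more characters that are allowed in a folder or filename.
# NAME and is_valid_name are hypothetical names for this sketch.
NAME = re.compile(r'[^\\/:*?"<>|\r\n]+')


def is_valid_name(name):
    """Return True if every character in name is allowed and name is not empty."""
    return NAME.fullmatch(name) is not None


print(is_valid_name('report (final).txt'))  # True
print(is_valid_name('a:b'))                 # False, contains a colon
print(is_valid_name('two\nlines'))          # False, contains a line break
print(is_valid_name(''))                    # False, the plus needs one character
```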

Folders are delimited with backslashes. We can match a sequence of zero or more folders with (?:[^\\/:*?"<>|\r\n]+\\)*, which puts the regex for the folder name and a literal backslash inside a noncapturing group (Recipe 2.9) that is repeated zero or more times with the asterisk (Recipe 2.12).

To match the filename, we use [^\\/:*?"<>|\r\n]*. The asterisk makes the filename optional, to allow paths that end with a backslash. If you don’t want to allow paths that end with a backslash, change the last * in the regex into a +.
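
The difference between the two quantifiers can be seen by running both variants side by side. This Python sketch is only an illustration: the names NAME, ALLOW_TRAILING, and REQUIRE_NAME are invented here, and fullmatch() is assumed to take the place of the ^ and $ anchors.

```python
import re

# Character class for one character of a folder or filename.
NAME = r'[^\\/:*?"<>|\r\n]'

# Filename optional (last quantifier is *) versus required (last quantifier is +).
# Both constant names are hypothetical; fullmatch() replaces the ^ and $ anchors.
ALLOW_TRAILING = re.compile(r'[a-z]:\\(?:' + NAME + r'+\\)*' + NAME + r'*', re.IGNORECASE)
REQUIRE_NAME = re.compile(r'[a-z]:\\(?:' + NAME + r'+\\)*' + NAME + r'+', re.IGNORECASE)


def check(path):
    """Return (matches with *, matches with +) for the given path."""
    return (ALLOW_TRAILING.fullmatch(path) is not None,
            REQUIRE_NAME.fullmatch(path) is not None)


print(check(r'C:\folder\file.txt'))  # (True, True)
print(check('C:\\folder\\'))         # (True, False)
print(check('C:\\'))                 # (True, False)
print(check(r'C:\folder'))           # (True, True)
```

The last call shows that neither variant can tell a folder from a file when the path has no trailing backslash; the final name is simply matched by the “file” part.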

Drive letter and UNC paths

Files on network drives that aren’t mapped to drive letters can be accessed using Universal Naming Convention (UNC) paths. UNC paths have the form \\server\share\folder\file.

We can easily adapt the regex for drive letter paths to support UNC paths as well. All we have to do is to replace the [a-z]: part that matches the drive letter with something that matches a drive letter or server name.

(?:[a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+) does that. The vertical bar is the alternation operator (Recipe 2.8). It gives the choice between a drive letter matched with [a-z]: or a server and share name matched with \\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+. The alternation operator has the lowest precedence of all regex operators. To group the two alternatives together, we use a noncapturing group. As Recipe 2.9 explains, the characters (?: form the somewhat complicated opening bracket of a noncapturing group. The question mark does not have its usual meaning after a parenthesis.

The rest of the regular expression can remain the same. The server and share names are both matched by the UNC alternative, so the part of the regex that matches folder names only has to handle the folders below the share.
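
A short Python sketch shows the combined regex accepting both kinds of paths. The names DRIVE_OR_UNC and is_drive_or_unc_path are assumptions made for this example, and fullmatch() is used instead of the anchors.

```python
import re

# Drive letter or \\server\share prefix, then optional folders and filename.
# DRIVE_OR_UNC and is_drive_or_unc_path are hypothetical names for this sketch.
DRIVE_OR_UNC = re.compile(
    r'(?:[a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\'   # Drive or server and share
    r'(?:[^\\/:*?"<>|\r\n]+\\)*'                         # Folder
    r'[^\\/:*?"<>|\r\n]*',                               # File
    re.IGNORECASE,
)


def is_drive_or_unc_path(text):
    """Return True if text is a drive letter path or a UNC path."""
    return DRIVE_OR_UNC.fullmatch(text) is not None


print(is_drive_or_unc_path(r'C:\temp\report.doc'))               # True
print(is_drive_or_unc_path(r'\\server\share\folder\file.txt'))   # True
print(is_drive_or_unc_path('\\\\server\\share\\'))               # True
print(is_drive_or_unc_path(r'\\server\share'))                   # False
```

The last path is rejected because the regex requires a backslash after the share name, just as it requires one after the drive letter and colon.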

Drive letter, UNC, and relative paths

A relative path is one that begins with a folder name (perhaps the special folder .. to select the parent folder) or consists of just a filename. To support relative paths, we add a third alternative to the “drive” portion of our regex. This alternative matches the start of a relative path rather than a drive letter or server name.

\\?[^\\/:*?"<>|\r\n]+\\? matches the start of the relative path. The path can begin with a backslash, but it doesn’t have to. \\? matches the backslash if present, or nothing otherwise. [^\\/:*?"<>|\r\n]+ matches a folder or filename. If the relative path consists of just a filename, the final \\? won’t match anything, and neither will the “folder” and “file” parts of the regex, which are both optional. If the relative path specifies a folder, the final \\? will match the backslash that delimits the first folder in the relative path from the rest of the path. The “folder” part then matches the remaining folders in the path, if any, and the “file” part matches the filename.

The regular expression for matching relative paths no longer neatly uses distinct parts of the regex to match distinct parts of the subject text. The regex part labeled “relative path” will actually match a folder or filename if the path is relative. If the relative path specifies one or more folders, the “relative path” part matches the first folder, and the “folder” and “file” parts match what’s left. If the relative path is just a filename, it will be matched by the “relative path” part, leaving nothing for the “folder” and “file” parts. Since we’re only interested in validating the path, this doesn’t matter. The comments in the regex are just labels to help us understand it.
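
To see which part of the regex ends up matching which part of a path, the three labeled parts can be wrapped in named groups. This Python sketch is for illustration only: the group names start, folders, and file and the helper parts() are not part of the recipe, and the added groups do not change what the regex accepts.

```python
import re

# Character class for one character of a folder or filename.
NAME = r'[^\\/:*?"<>|\r\n]'

# The recipe's three labeled parts, each wrapped in a hypothetical named group.
PATH = re.compile(
    r'(?P<start>(?:[a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\|\\?' + NAME + r'+\\?)'
    r'(?P<folders>(?:' + NAME + r'+\\)*)'
    r'(?P<file>' + NAME + r'*)',
    re.IGNORECASE,
)


def parts(text):
    """Return (start, folders, file) as matched, or None if text is not a valid path."""
    m = PATH.fullmatch(text)
    if m is None:
        return None
    return (m.group('start'), m.group('folders'), m.group('file'))


print(parts(r'C:\temp\a.txt'))        # ('C:\\', 'temp\\', 'a.txt')
print(parts(r'docs\reports\q1.txt'))  # ('docs\\', 'reports\\', 'q1.txt')
print(parts('q1.txt'))                # ('q1.txt', '', '')
print(parts('a<b'))                   # None
```

For the relative paths, the first folder or the bare filename lands in the start group rather than in folders or file, which is why these labels cannot be used to extract the parts of the path reliably.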

If we wanted to extract parts of the path into capturing groups, we’d have to be more careful to match the drive, folder, and filename separately. The next recipe handles that problem.

See Also

Recipe 8.19 also validates a Windows path but adds capturing groups for the drive, folder, and file, allowing you to extract those separately.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.1 explains which special characters need to be escaped. Recipe 2.2 explains how to match nonprinting characters. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.12 explains repetition. Recipe 2.18 explains how to add comments.
