8.19. Split Windows Paths into Their Parts

Problem

You want to check whether a string looks like a valid path to a folder or file on the Microsoft Windows operating system. If the string turns out to hold a valid Windows path, then you also want to extract the drive, folder, and filename parts of the path separately.

Solution

Drive letter paths

\A
(?<drive>[a-z]:)\\
(?<folder>(?:[^\\/:*?"<>|\r\n]+\\)*)
(?<file>[^\\/:*?"<>|\r\n]*)
\Z

Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java 7, PCRE 7, Perl 5.10, Ruby 1.9
\A
(?P<drive>[a-z]:)\\
(?P<folder>(?:[^\\/:*?"<>|\r\n]+\\)*)
(?P<file>[^\\/:*?"<>|\r\n]*)
\Z

Regex options: Free-spacing, case insensitive
Regex flavors: PCRE 4 and later, Perl 5.10, Python
\A
([a-z]:)\\
((?:[^\\/:*?"<>|\r\n]+\\)*)
([^\\/:*?"<>|\r\n]*)
\Z

Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^([a-z]:)\\((?:[^\\/:*?"<>|\r\n]+\\)*)([^\\/:*?"<>|\r\n]*)$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Drive letter and UNC paths

\A
(?<drive>[a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\
(?<folder>(?:[^\\/:*?"<>|\r\n]+\\)*)
(?<file>[^\\/:*?"<>|\r\n]*)
\Z

Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java 7, PCRE 7, Perl 5.10, Ruby 1.9
\A
(?P<drive>[a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\
(?P<folder>(?:[^\\/:*?"<>|\r\n]+\\)*)
(?P<file>[^\\/:*?"<>|\r\n]*)
\Z

Regex options: Free-spacing, case insensitive
Regex flavors: PCRE 4 and later, Perl 5.10, Python
\A
([a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\
((?:[^\\/:*?"<>|\r\n]+\\)*)
([^\\/:*?"<>|\r\n]*)
\Z

Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^([a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\((?:[^\\/:*?"<>|\r\n]+\\)*)([^\\/:*?"<>|\r\n]*)$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Drive letter, UNC, and relative paths

Warning

These regular expressions can match the empty string. See the Discussion section for more details and an alternative solution.

\A
(?<drive>[a-z]:\\|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+\\|\\?)
(?<folder>(?:[^\\/:*?"<>|\r\n]+\\)*)
(?<file>[^\\/:*?"<>|\r\n]*)
\Z

Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java 7, PCRE 7, Perl 5.10, Ruby 1.9
\A
(?P<drive>[a-z]:\\|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+\\|\\?)
(?P<folder>(?:[^\\/:*?"<>|\r\n]+\\)*)
(?P<file>[^\\/:*?"<>|\r\n]*)
\Z

Regex options: Free-spacing, case insensitive
Regex flavors: PCRE 4 and later, Perl 5.10, Python
\A
([a-z]:\\|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+\\|\\?)
((?:[^\\/:*?"<>|\r\n]+\\)*)
([^\\/:*?"<>|\r\n]*)
\Z

Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^([a-z]:\\|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+\\|\\?)((?:[^\\/:*?"<>|\r\n]+\\)*)([^\\/:*?"<>|\r\n]*)$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

Discussion

The regular expressions in this recipe are very similar to the ones in the previous recipe. This discussion assumes you’ve already read and understood the discussion of the previous recipe.

Drive letter paths

We’ve made only one change to the regular expressions for drive letter paths, compared to the ones in the previous recipe. We’ve added three capturing groups that you can use to retrieve the various parts of the path: drive, folder, and file. You can use these names if your regex flavor supports named capture (Recipe 2.11). If not, you’ll have to reference the capturing groups by their numbers: 1, 2, and 3. See Recipe 3.9 to learn how to get the text matched by named and/or numbered groups in your favorite programming language.
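
As a minimal sketch of how the three groups can be read back, here is the Python-flavored named-capture regex in use. The names DRIVE_PATH and split_drive_path are ours, chosen for illustration; they are not part of the recipe.

```python
import re

# Named-capture variant of the drive letter regex (Python syntax).
# re.VERBOSE is Python's free-spacing mode; re.IGNORECASE makes it
# case insensitive.
DRIVE_PATH = re.compile(r'''
    \A
    (?P<drive>[a-z]:)\\
    (?P<folder>(?:[^\\/:*?"<>|\r\n]+\\)*)
    (?P<file>[^\\/:*?"<>|\r\n]*)
    \Z
''', re.VERBOSE | re.IGNORECASE)


def split_drive_path(path):
    """Return (drive, folder, file), or None if path does not match."""
    match = DRIVE_PATH.match(path)
    if match is None:
        return None
    # The same text is available by number: group(1), group(2), group(3).
    return match.group('drive'), match.group('folder'), match.group('file')


print(split_drive_path(r'c:\folder\sub\file.txt'))
# prints ('c:', 'folder\\sub\\', 'file.txt')
```

The backslash between the drive and the folder is matched outside the groups, so it appears in none of the three parts.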

Drive letter and UNC paths

We’ve added the same three capturing groups to the regexes for UNC paths.
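
With a UNC path, the drive group holds the server and share together. The following sketch illustrates this in Python; the server and share names are made up for the example.

```python
import re

# Drive letter and UNC variant with Python named capture.
DRIVE_OR_UNC_PATH = re.compile(r'''
    \A
    (?P<drive>[a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\
    (?P<folder>(?:[^\\/:*?"<>|\r\n]+\\)*)
    (?P<file>[^\\/:*?"<>|\r\n]*)
    \Z
''', re.VERBOSE | re.IGNORECASE)

# Hypothetical UNC path used only for illustration.
match = DRIVE_OR_UNC_PATH.match(r'\\server\share\folder\file.txt')
parts = match.groupdict() if match else None
# parts['drive'] is the server and share, parts['folder'] is 'folder\\',
# and parts['file'] is 'file.txt'.
```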

Drive letter, UNC, and relative paths

Things get a bit more complicated if we also want to allow relative paths. In the previous recipe, we could just add a third alternative to the drive part of the regex to match the start of the relative path. We can’t do that here. In case of a relative path, the capturing group for the drive should remain empty.

Instead, the literal backslash that was after the capturing group for the drives in the regex in the “drive letter and UNC paths” section is now moved into that capturing group. We add it to the end of the alternatives for the drive letter and the network share. We add a third alternative with an optional backslash for relative paths that may or may not begin with a backslash. Because the third alternative is optional, the whole group for the drive is essentially optional.

The resulting regular expression correctly matches all Windows paths. The problem is that by making the drive part optional, we now have a regex in which everything is optional. The folder and file parts were already optional in the regexes that support absolute paths only. In other words: our regular expression will match the empty string.
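
Both effects can be seen in a short Python sketch: the drive group is empty for a relative path, and the empty string itself is accepted. The helper split_path_guarded is a hypothetical name, and its up-front emptiness check is only one way to handle the problem.

```python
import re

# Drive letter, UNC, and relative paths (Python named capture).
ANY_PATH = re.compile(r'''
    \A
    (?P<drive>[a-z]:\\|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+\\|\\?)
    (?P<folder>(?:[^\\/:*?"<>|\r\n]+\\)*)
    (?P<file>[^\\/:*?"<>|\r\n]*)
    \Z
''', re.VERBOSE | re.IGNORECASE)


def split_path_guarded(path):
    """Reject the empty string first, then split the path with the regex."""
    if not path:
        return None
    match = ANY_PATH.match(path)
    return match.groupdict() if match else None


# Every part of the regex is optional, so the empty string matches.
empty_matches = ANY_PATH.match('') is not None

# A relative path leaves the drive group empty.
relative = split_path_guarded('folder\\file.txt')
```

Note that in this regex the drive group includes the trailing backslash, so c:\file.txt yields the drive c:\ rather than c:.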

If we want to make sure the regex doesn’t match empty strings, we’d have to add additional alternatives to deal with relative paths that specify a folder (in which case the filename is optional), and relative paths that don’t specify a folder (in which case the filename is mandatory):

\A
(?:
   (?<drive>[a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\
   (?<folder>(?:[^\\/:*?"<>|\r\n]+\\)*)
   (?<file>[^\\/:*?"<>|\r\n]*)
|  (?<relativefolder>\\?(?:[^\\/:*?"<>|\r\n]+\\)+)
   (?<file2>[^\\/:*?"<>|\r\n]*)
|  (?<relativefile>[^\\/:*?"<>|\r\n]+)
)
\Z

Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java 7, PCRE 7, Perl 5.10, Ruby 1.9
\A
(?:
   (?P<drive>[a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\
   (?P<folder>(?:[^\\/:*?"<>|\r\n]+\\)*)
   (?P<file>[^\\/:*?"<>|\r\n]*)
|  (?P<relativefolder>\\?(?:[^\\/:*?"<>|\r\n]+\\)+)
   (?P<file2>[^\\/:*?"<>|\r\n]*)
|  (?P<relativefile>[^\\/:*?"<>|\r\n]+)
)
\Z

Regex options: Free-spacing, case insensitive
Regex flavors: PCRE 4 and later, Perl 5.10, Python
\A
(?:
   ([a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\
   ((?:[^\\/:*?"<>|\r\n]+\\)*)
   ([^\\/:*?"<>|\r\n]*)
|  (\\?(?:[^\\/:*?"<>|\r\n]+\\)+)
   ([^\\/:*?"<>|\r\n]*)
|  ([^\\/:*?"<>|\r\n]+)
)
\Z

Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby
^(?:([a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\((?:[^\\/:*?"<>|\r\n]+\\)*)([^\\/:*?"<>|\r\n]*)|(\\?(?:[^\\/:*?"<>|\r\n]+\\)+)([^\\/:*?"<>|\r\n]*)|([^\\/:*?"<>|\r\n]+))$
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python

The price we pay for excluding zero-length strings is that we now have six capturing groups to capture the three different parts of the path. You’ll have to look at the scenario in which you want to use these regular expressions to determine whether it’s easier to do an extra check for empty strings before using the regex or to spend more effort in dealing with multiple capturing groups after a match has been found.
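
To give an idea of the extra effort involved, the following Python sketch folds the six groups back into three parts. It relies on Python returning None for groups that did not participate in the match. The helper names first_set and split_nonempty_path are hypothetical.

```python
import re

# Variant that cannot match the empty string (Python named capture).
NONEMPTY_PATH = re.compile(r'''
    \A
    (?:
       (?P<drive>[a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\
       (?P<folder>(?:[^\\/:*?"<>|\r\n]+\\)*)
       (?P<file>[^\\/:*?"<>|\r\n]*)
    |  (?P<relativefolder>\\?(?:[^\\/:*?"<>|\r\n]+\\)+)
       (?P<file2>[^\\/:*?"<>|\r\n]*)
    |  (?P<relativefile>[^\\/:*?"<>|\r\n]+)
    )
    \Z
''', re.VERBOSE | re.IGNORECASE)


def first_set(*values):
    """Return the first value that is not None, or '' if all are None."""
    for value in values:
        if value is not None:
            return value
    return ''


def split_nonempty_path(path):
    """Return (drive, folder, file), or None if path does not match."""
    match = NONEMPTY_PATH.match(path)
    if match is None:
        return None
    g = match.groupdict()
    # Only one alternative participates, so at most one group per part is set.
    drive = first_set(g['drive'])
    folder = first_set(g['folder'], g['relativefolder'])
    file = first_set(g['file'], g['file2'], g['relativefile'])
    return drive, folder, file
```

Called with file.txt, this returns an empty drive and folder and the file part file.txt; called with the empty string, it returns None.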

When using Perl 5.10, Ruby 1.9, or .NET, we can give multiple named groups the same name. See the section “Groups with the same name” in Recipe 2.11 for details. This way we can simply get the match of the folder or file group, without worrying about which of the two folder groups or three file groups actually participated in the regex match:

\A
(?:
   (?<drive>[a-z]:|\\\\[a-z0-9_.$-]+\\[a-z0-9_.$-]+)\\
   (?<folder>(?:[^\\/:*?"<>|\r\n]+\\)*)
   (?<file>[^\\/:*?"<>|\r\n]*)
|  (?<folder>\\?(?:[^\\/:*?"<>|\r\n]+\\)+)
   (?<file>[^\\/:*?"<>|\r\n]*)
|  (?<file>[^\\/:*?"<>|\r\n]+)
)
\Z

Regex options: Free-spacing, case insensitive
Regex flavors: .NET, Perl 5.10, Ruby 1.9

See Also

Recipe 8.18 validates a Windows path using simpler regular expressions without separate capturing groups for the drive, folder, and file.

Recipe 3.9 shows code to get the text matched by a particular part (capturing group) of a regex. Use this to get the parts of the path you’re interested in.

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.1 explains which special characters need to be escaped. Recipe 2.2 explains how to match nonprinting characters. Recipe 2.3 explains character classes. Recipe 2.5 explains anchors. Recipe 2.8 explains alternation. Recipe 2.9 explains grouping. Recipe 2.11 explains named capturing groups. Recipe 2.12 explains repetition.
