Regex Literals

Problem

You need a regular expression that matches regular expression literals in your source code files so you can easily find them in your text editor or with a grep tool. Your programming language uses forward slashes to delimit regular expressions. Forward slashes in the regex must be escaped with a backslash.

Your regex only needs to match whatever looks like a regular expression literal. It doesn’t need to verify that the text between a pair of forward slashes is actually a valid regular expression.

Because you will be using just one regex rather than writing a full compiler, your regular expression does need to be smart enough to know the difference between a forward slash used as a division operator and one used to start a regex. In your source code, literal regular expressions appear as part of assignments (after an equals sign), in equality or inequality tests (after an equals sign), possibly with a negation operator (exclamation point) before the regex, in literal object definitions (after a colon), and as a parameter to a function (after an opening parenthesis or a comma). Whitespace between the regex and the character that precedes it needs to be ignored.

Solution

(?<=[=:(,](?:\s*!)?\s*)/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/
Regex options: None
Regex flavors: .NET

[=:(,](?:\s*!)?\s*\K/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/
Regex options: None
Regex flavors: PCRE 7.2, Perl 5.10

(?<=[=:(,](?:\s{0,10}+!)?\s{0,10})/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/
Regex options: None
Regex flavors: .NET, Java

[=:(,](?:\s*!)?+\s*(/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/)
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Discussion

All four solutions use /[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/ to match the regular expression. This is the same regular expression that was the Solution to Strings with Escapes, except that it has forward slashes instead of quotes. A literal regular expression really is just a string quoted with forward slashes that can contain forward slashes if escaped with a backslash.
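As a rough illustration, the following Python sketch runs this pattern on its own against a few hand-written strings. The name LITERAL and the sample strings are made up for this example and do not come from the recipe.

```python
import re

# Core pattern: an opening slash, runs of ordinary characters interleaved
# with backslash escapes, and a closing slash. Line breaks are excluded.
LITERAL = re.compile(r'/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/')

samples = [
    r'/ab+c/',         # plain literal
    r'/^\/usr\/bin/',  # escaped forward slashes inside the literal
    r'/a\\/',          # escaped backslash right before the closing slash
    r'/a\/',           # final slash is escaped, so the literal never closes
]

for text in samples:
    print(text, '->', LITERAL.fullmatch(text) is not None)

# On its own the pattern cannot tell a regex from two division operators.
division = LITERAL.search('a / b / c')
print(division.group())
```

The last call matches the text between the two division operators, which is why each solution also checks the character that precedes the opening slash.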

The difference between the four solutions is how they check whether the regex is preceded by an equals sign, a colon, an opening parenthesis, or a comma, possibly with an exclamation point between that character and the regular expression. We could easily do that with lookbehind if we didn’t also want to allow any amount of whitespace between the regex and the preceding character. That complicates matters because the regex flavors in this book vary widely in their support for lookbehind.
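To see the complication concretely, here is a small sketch using Python's re module, one of the flavors that accepts only fixed-width lookbehind. The variable names and test strings are illustrative assumptions, not part of the recipe.

```python
import re

LITERAL = r'/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/'

# Fixed-width lookbehind compiles, but the punctuation must touch the slash.
strict = re.compile(r'(?<=[=:(,])' + LITERAL)
touching = strict.search('x=/ab+c/')
spaced = strict.search('x = /ab+c/')
print(touching is not None, spaced is not None)

# Allowing any amount of whitespace makes the lookbehind variable-width,
# which Python's re module refuses to compile.
try:
    re.compile(r'(?<=[=:(,]\s*)' + LITERAL)
    accepted = True
except re.error:
    accepted = False
print(accepted)
```

The simple lookbehind works only when no whitespace separates the punctuation from the regex, and the whitespace-tolerant version is rejected outright in this flavor.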

The .NET regex flavor is the only one in this book that allows infinite repetition inside lookbehind. So for .NET we have a perfect solution: (?<=[=:(,](?:\s*!)?\s*). The character class [=:(,] checks for the presence of any of the four characters. (?:\s*!)? allows the character to be followed by an exclamation point, with any amount of whitespace between the character and the exclamation point. The second \s* allows any amount of whitespace before the forward slash that opens the regex.

Perl and PCRE do not allow repetition inside lookbehind. A solution using lookbehind wouldn’t be flexible enough in Perl or PCRE. But Perl 5.10 and PCRE 7.2 added a new regex token \K that we can use instead. We use [=:(,](?:\s*!)?\s* to match any of the four characters, optionally followed by any amount of whitespace and an exclamation point, and also optionally followed by any amount of whitespace. After the regex has matched this, the \K tells the regex engine to keep what it has just matched. The punctuation characters just matched by our regex will not be included in the overall match result. The matching process will continue normally with /[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/ to match the regular expression.

Java does not allow infinite repetition in lookbehind, but does allow finite repetition. So instead of using \s* to check for absolutely any amount of whitespace, we use \s{0,10} to check for up to 10 whitespace characters. The number 10 is arbitrary; we just need something sufficiently large to make sure we don’t miss any regexes that are deeply indented. We also need to keep the number reasonably small to make sure we don’t needlessly slow down the regular expression. The greater the number of repetitions we allow, the more characters Java will scan while looking for a match to what’s inside the lookbehind.

The other regex flavors either don’t support repetition inside lookbehind or don’t support lookbehind or \K at all. For these flavors, we simply use [=:(,](?:\s*!)?+\s* to match the punctuation we want before the regex, and (/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/) to match the regex itself and store it in a capturing group. The overall regex match will include both the punctuation and the regex. The capturing group makes it easier to retrieve just the regex. This solution will work only if the application with which you’ll use this regex can work on the text matched by a capturing group rather than the whole regex match.
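A minimal Python sketch of the capturing-group approach might look like the following. The sample source lines and variable names are invented for illustration. One adaptation is assumed here: the possessive ?+ is written as a plain ?, because Python's re module accepts possessive quantifiers only from version 3.11 onward; for this pattern the set of matches should be the same.

```python
import re

# Punctuation prefix (consumed as part of the match), then the literal
# in capturing group 1.
PATTERN = re.compile(
    r'[=:(,](?:\s*!)?\s*(/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/)'
)

# Hypothetical source lines covering each context named in the Problem.
source_lines = [
    r'var re = /ab+c/;',                # assignment
    r'if (value == /x\/y/) {}',         # equality test
    r'var obj = { key: /\d+/ };',       # literal object definition
    r'test(!/foo/, /bar/);',            # function parameters, one negated
    r'var ratio = total / count / 2;',  # division, not a regex
]

# With a single capturing group, findall returns only the group's text.
found = PATTERN.findall('\n'.join(source_lines))
for literal in found:
    print(literal)
```

Because findall returns the text of group 1, the leading punctuation and whitespace never appear in the results, and the division on the last line is skipped.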

See Also

Recipe 2.16 has all the details on lookbehind and \K.
