You need a regular expression that matches regular expression literals in your source code files so you can easily find them in your text editor or with a grep tool. Your programming language uses forward slashes to delimit regular expressions. Forward slashes in the regex must be escaped with a backslash.
Your regex only needs to match whatever looks like a regular expression literal. It doesn’t need to verify that the text between a pair of forward slashes is actually a valid regular expression.
Because you will be using just one regex rather than writing a full compiler, your regular expression does need to be smart enough to know the difference between a forward slash used as a division operator and one used to start a regex. In your source code, literal regular expressions appear as part of assignments (after an equals sign), in equality or inequality tests (after an equals sign), possibly with a negation operator (exclamation point) before the regex, in literal object definitions (after a colon), and as a parameter to a function (after an opening parenthesis or a comma). Whitespace between the regex and the character that precedes it needs to be ignored.
(?<=[=:(,](?:\s*!)?\s*)/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/
Regex options: None
Regex flavors: .NET
[=:(,](?:\s*!)?\s*\K/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/
Regex options: None
Regex flavors: PCRE 7.2, Perl 5.10
(?<=[=:(,](?:\s{0,10}!)?\s{0,10})/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/
Regex options: None
Regex flavors: .NET, Java
[=:(,](?:\s*!)?\s*(/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/)
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
All four solutions use ‹/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/› to match the regular expression. This is the same regular expression that was the solution to “Strings with Escapes,” except that it has forward slashes instead of quotes. A literal regular expression really is just a string quoted with forward slashes that can contain forward slashes if escaped with a backslash.
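As a minimal sketch of how this core pattern behaves, the following Python snippet applies it to a few strings. The helper name `match_literal` and the sample inputs are illustrative assumptions, not part of the recipe.

```python
import re

# Core pattern from the solutions: a slash-delimited literal in which a
# backslash escapes the next character, including a forward slash.
REGEX_LITERAL = re.compile(r'/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/')

def match_literal(text):
    """Return the literal matched at the start of text, or None (hypothetical helper)."""
    m = REGEX_LITERAL.match(text)
    return m.group(0) if m else None

plain = match_literal(r'/abc/ rest')         # simple literal
escaped = match_literal(r'/a\/b/ rest')      # escaped slash stays inside the literal
unterminated = match_literal('/abc\n/')      # a line break ends the attempt

print(plain, escaped, unterminated)
```

The escaped slash does not end the match because ‹\\.› consumes the backslash together with the character after it, and the negated character class refuses to cross a line break.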
The difference between the four solutions is how they check whether the regex is preceded by an equals sign, a colon, an opening parenthesis, or a comma, possibly with an exclamation point between that character and the regular expression. We could easily do that with lookbehind if we didn’t also want to allow any amount of whitespace between the regex and the preceding character. That complicates matters because the regex flavors in this book vary widely in their support for lookbehind.
The .NET regex flavor is the only one in this book that allows infinite repetition inside lookbehind. So for .NET we have a perfect solution: ‹(?<=[=:(,](?:\s*!)?\s*)›. The character class ‹[=:(,]› checks for the presence of any of the four characters. ‹(?:\s*!)?› allows the character to be followed by an exclamation point, with any amount of whitespace between the character and the exclamation point. The second ‹\s*› allows any amount of whitespace before the forward slash that opens the regex.
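To see which lead-ins this sub-pattern accepts, here is a rough Python sketch. It tests the pattern on its own rather than inside lookbehind, because Python's `re` module rejects variable-width lookbehind (the `try` block illustrates that). The name `is_valid_prefix` and the sample strings are assumptions made for illustration.

```python
import re

# The lead-in sub-pattern used as an ordinary pattern, outside lookbehind.
PREFIX = re.compile(r'[=:(,](?:\s*!)?\s*')

def is_valid_prefix(text):
    """True if text is entirely an allowed lead-in to a regex literal (hypothetical helper)."""
    return PREFIX.fullmatch(text) is not None

samples = ['= ', '(', ', !', ':  !  ', 'x ', '!']
results = {s: is_valid_prefix(s) for s in samples}

# Python's re module only accepts fixed-width lookbehind, so wrapping the
# same kind of sub-pattern in (?<=...) fails to compile.
try:
    re.compile(r'(?<=[=:(,]\s*)/')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

print(results, lookbehind_ok)
```

An exclamation point alone is rejected because one of the four punctuation characters must come first.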
Perl and PCRE do not allow variable-length repetition inside lookbehind. A solution using lookbehind wouldn’t be flexible enough in Perl or PCRE. But Perl 5.10 and PCRE 7.2 added a new regex token ‹\K› that we can use instead. We use ‹[=:(,](?:\s*!)?\s*› to match any of the four characters, optionally followed by any amount of whitespace and an exclamation point, and also optionally followed by any amount of whitespace. After the regex has matched this, the ‹\K› tells the regex engine to keep what it has just matched out of the result: the punctuation characters just matched by our regex will not be included in the overall match. The matching process will continue normally with ‹/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/› to match the regular expression.
Java does not allow infinite repetition in lookbehind, but does allow finite repetition. So instead of using ‹\s*› to check for absolutely any amount of whitespace, we use ‹\s{0,10}› to check for up to 10 whitespace characters. The number 10 is arbitrary; we just need something sufficiently large to make sure we don’t miss any regexes that are deeply indented. We also need to keep the number reasonably small to make sure we don’t needlessly slow down the regular expression. The greater the number of repetitions we allow, the more characters Java will scan while looking for a match to what’s inside the lookbehind.
The other regex flavors either don’t support repetition inside lookbehind or don’t support lookbehind or ‹\K› at all. For these flavors, we simply use ‹[=:(,](?:\s*!)?\s*› to match the punctuation we want before the regex, and ‹(/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/)› to match the regex itself and store it in a capturing group. The overall regex match will include both the punctuation and the regex. The capturing group makes it easier to retrieve just the regex. This solution will work only if the application with which you’ll use this regex can work on the text matched by a capturing group rather than the whole regex match.
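The capturing-group approach might be used from Python roughly as follows. This is a sketch under stated assumptions: the function name `find_regex_literals` and the sample source lines are invented for illustration, and the sample is JavaScript-like only because that language uses slash-delimited literals.

```python
import re

# Punctuation lead-in followed by the literal, which is captured in group 1.
FIND_LITERAL = re.compile(
    r'[=:(,](?:\s*!)?\s*(/[^/\\\r\n]*(?:\\.[^/\\\r\n]*)*/)'
)

def find_regex_literals(source):
    """Return the text of group 1 for every match (hypothetical helper)."""
    return [m.group(1) for m in FIND_LITERAL.finditer(source)]

# Illustrative source text; the second line uses / only for division.
sample = '\n'.join([
    r'var re = /\d+\/\d+/;',
    r'var ratio = total / count / 2;',
    r'if (! /^a/.test(s)) { go(); }',
    r'var opts = { pattern: /x,y/ };',
])

found = find_regex_literals(sample)
for literal in found:
    print(literal)
```

The division on the second line is skipped because neither slash is preceded by one of the four punctuation characters. Reading `group(1)` rather than `group(0)` is what leaves the punctuation out of the result.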
Recipe 2.16 has all the details on lookbehind and ‹\K›.