3.1. Literal Regular Expressions in Source Code

Problem

You have been given the regular expression [$"'\n\d/\\] as the solution to a problem. This regular expression consists of a single character class that matches a dollar sign, a double quote, a single quote, a line feed, any digit between 0 and 9, a forward slash, or a backslash. You want to hardcode this regular expression into your source code as a string constant or regular expression operator.

Solution

C#

As a normal string:

"[$\"'\n\\d/\\\\]"

As a verbatim string:

@"[$""'\n\d/\\]"

VB.NET

"[$""'\n\d/\\]"

Java

"[$\"'\n\\d/\\\\]"

JavaScript

/[$"'\n\d\/\\]/

XRegExp

"[$\"'\n\\d/\\\\]"

PHP

'%[$"\'\n\d/\\\\]%'

Perl

Pattern-matching operator:

/[\$"'\n\d\/\\]/
m![\$"'\n\d/\\]!

Substitution operator:

s![\$"'\n\d/\\]!!

Python

Raw triple-quoted string:

r"""[$"'\n\d/\\]"""

Normal string:

"[$\"'\n\\d/\\\\]"

Ruby

Literal regex delimited with forward slashes:

/[$"'\n\d\/\\]/

Literal regex delimited with punctuation of your choice:

%r![$"'\n\d/\\]!

Discussion

When this book shows you a regular expression by itself (as opposed to as part of a larger source code snippet), it always shows regular expressions unadorned. This recipe is the only exception. If you’re using a regular expression tester such as RegexBuddy or RegexPal, you would type in the regex this way. If your application accepts a regular expression as user input, the user would type it in this way.

But if you want to hardcode the regular expression into your source code, you have extra work. Carelessly copying and pasting regular expressions from a regular expression tester into your source code—or vice versa—will often leave you scratching your head as to why the regular expression works in your tool but not in your source code, or why the tester fails on a regex you’ve copied from somebody else’s code. All programming languages discussed in this book require literal regular expressions to be delimited in a certain way, with some languages requiring strings and some requiring a special regex constant. If your regex includes the language’s delimiters or certain other characters with special meanings in the language, you have to escape them.

The backslash is the most commonly used escape character. That’s why most of the solutions to this problem have far more backslashes in them than the four in the original regular expression.

C#

In C#, you can pass literal regular expressions to the Regex() constructor, and to various member functions in the Regex class. The parameter that takes the regular expression is always declared as a string.

C# supports two kinds of string literals. The most common kind is the double-quoted string, well-known from languages such as C++ and Java. Within double-quoted strings, double quotes and backslashes must be escaped with a backslash. Escapes for nonprintable characters, such as \n, are also supported in strings. There is a difference between "\n" and "\\n" when using RegexOptions.IgnorePatternWhitespace (see Recipe 3.4) to turn on free-spacing mode, as explained in Recipe 2.18. "\n" is a string with a literal line break, which is ignored as whitespace. "\\n" is a string with the regex token \n, which matches a newline.

Verbatim strings start with an at sign and a double quote, and end with a double quote on its own. To include a double quote in a verbatim string, double it up. Backslashes do not need to be escaped, resulting in a significantly more readable regular expression. @"\n" is always the regex token \n, which matches a newline, even in free-spacing mode. Verbatim strings do not support \n at the string level, but can span multiple lines instead. That makes verbatim strings ideal for free-spacing regular expressions.

The choice is clear: use verbatim strings to put regular expressions into your C# source code.

VB.NET

In VB.NET, you can pass literal regular expressions to the Regex() constructor, and to various member functions in the Regex class. The parameter that takes the regular expression is always declared as a string.

Visual Basic uses double-quoted strings. Double quotes within the string must be doubled. No other characters need to be escaped.

Java

In Java, you can pass literal regular expressions to the Pattern.compile() class factory, and to various functions of the String class. The parameter that takes the regular expression is always declared as a string.

Java uses double-quoted strings. Within double-quoted strings, double quotes and backslashes must be escaped with a backslash. Escapes for nonprintable characters, such as \n, and Unicode escapes such as \uFFFF are also supported in strings.

There is a difference between "\n" and "\\n" when using Pattern.COMMENTS (see Recipe 3.4) to turn on free-spacing mode, as explained in Recipe 2.18. "\n" is a string with a literal line break, which is ignored as whitespace. "\\n" is a string with the regex token \n, which matches a newline.

JavaScript

In JavaScript, regular expressions are best created by using the special syntax for declaring literal regular expressions. Simply place your regular expression between two forward slashes. If any forward slashes occur within the regular expression itself, escape those with a backslash.

Although it is possible to create a RegExp object from a string, it makes little sense to use the string notation for literal regular expressions in your code. You would have to escape quotes and backslashes, which generally leads to a forest of backslashes.

XRegExp

If you use XRegExp to extend JavaScript’s regular expression syntax, then you will be creating XRegExp objects from strings, and you’ll need to escape quotes and backslashes.

PHP

Literal regular expressions for use with PHP’s preg functions are a curious contraption. Unlike JavaScript or Perl, PHP does not have a native regular expression type. Regular expressions must always be quoted as strings. This is true for the ereg and mb_ereg functions as well. But in their quest to mimic Perl, the developers of PHP’s wrapper functions for PCRE added an additional requirement.

Within the string, the regular expression must be quoted as a Perl-style literal regular expression. That means that where you would write /regex/ in Perl, the string for PHP’s preg functions becomes '/regex/'. As in Perl, you can use any pair of punctuation characters as the delimiters. If the regex delimiter occurs within the regex, it must be escaped with a backslash. To avoid this, choose a delimiter that does not occur in the regex. For this recipe, we used the percentage sign, because the forward slash occurs in the regex but the percentage sign does not. If the forward slash does not occur in the regex, use that, as it’s the most commonly used delimiter in Perl and the required delimiter in JavaScript and Ruby.

PHP supports both single-quoted and double-quoted strings. Both require the quote (single or double) and the backslash within a regex to be escaped with a backslash. In double-quoted strings, the dollar sign also needs to be escaped. For regular expressions, you should use single-quoted strings, unless you really want to interpolate variables in your regex.

Perl

In Perl, literal regular expressions are used with the pattern-matching operator and the substitution operator. The pattern-matching operator consists of two forward slashes, with the regex between them. Forward slashes within the regular expression must be escaped with a backslash. There’s no need to escape any other characters, except perhaps $ and @, as explained at the end of this subsection.

An alternative notation for the pattern-matching operator puts the regular expression between any pair of punctuation characters, preceded by the letter m. If you use any kind of opening and closing punctuation (parentheses, braces, or brackets) as the delimiter, they need to match up: for example, m{regex}. If you use other punctuation, simply use the same character twice. The solution for this recipe uses the exclamation point. That saves us having to escape the literal forward slash in the regular expression. Only the closing delimiter needs to be escaped with a backslash.

The substitution operator is similar to the pattern-matching operator. It starts with s instead of m, and tacks on the replacement text. When using brackets or similar punctuation as the delimiters, you need two pairs: s[regex][replace]. If you mix different delimiters, you also need two pairs: s[regex]/replace/. For all other punctuation, use it three times: s/regex/replace/.

Perl parses the pattern-matching and substitution operators as double-quoted strings. If you write m/I am $name/ and $name holds "Jan", you end up with the regular expression I am Jan. $" is also a variable in Perl, so we have to escape the literal dollar sign in the character class in our regular expression in this recipe.

Never escape a dollar sign that you want to use as an anchor (see Recipe 2.5). An escaped dollar sign is always a literal. Perl is smart enough to differentiate between dollars used as anchors, and dollars used for variable interpolation, due to the fact that anchors can be used sensibly only at the end of a group or the whole regex, or before a newline. You shouldn’t escape the dollar in m/^regex$/ if you want to check whether “regex” matches the subject string entirely.

The at sign does not have a special meaning in regular expressions, but it is used for variable interpolation in Perl. You need to escape it in literal regular expressions in Perl code, as you do for double-quoted strings.

Python

The functions in Python’s re module expect literal regular expressions to be passed as strings. You can use any of the various ways that Python provides to quote strings. Depending on the characters that occur in your regular expression, different ways of quoting it may reduce the number of characters you need to escape with backslashes.

Generally, raw strings are the best option. Python raw strings don’t require any characters to be escaped. If you use a raw string, you don’t need to double up the backslashes in your regular expression. r"\d+" is easier to read than "\\d+", particularly as your regex gets long.

The only situation where raw strings aren’t ideal is when your regular expression includes both the single quote and double quote characters. Then you can’t use a raw string delimited with one pair of single or double quotes, because there’s no way to escape the quotes inside the regular expression. In that case, you can triple-quote the raw string, as we did in the Python solution for this recipe. The normal string is shown for comparison.
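The two Python solutions can be compared directly. The following is a minimal sketch, assuming Python 3; the variable names and the sample characters are illustrative and not part of the recipe. It checks that both literals match the same characters, even though the strings themselves differ: the normal string already contains a real line feed, while the raw string contains the regex token \n.

```python
import re

# The recipe's character class, written as the two Python solutions.
raw_pattern = r"""[$"'\n\d/\\]"""      # backslashes reach the regex engine untouched
normal_pattern = "[$\"'\n\\d/\\\\]"    # string-level escapes are resolved first

# Illustrative samples: seven characters the class should match, one it should not.
samples = ["$", '"', "'", "\n", "7", "/", "\\", "a"]

raw_results = [bool(re.fullmatch(raw_pattern, ch)) for ch in samples]
normal_results = [bool(re.fullmatch(normal_pattern, ch)) for ch in samples]

# The regex text seen by the re module differs between the two literals,
# but both patterns accept exactly the same characters.
texts_differ = raw_pattern != normal_pattern
same_behavior = raw_results == normal_results
```

The raw string holds exactly the four backslashes of the original regular expression, which is why it is the easier of the two to check against a regex tester.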

If you want to use the Unicode features explained in Recipe 2.7 in your regular expression in Python 2.x, you need to use Unicode strings. You can turn a string into a Unicode string by preceding it with a u. In Python 3.0 and later, all text is Unicode.

Raw strings don’t support nonprintable character escapes such as \n. Raw strings treat escape sequences as literal text. This is not a problem for the re module. It supports these escapes as part of the regular expression syntax, and as part of the replacement text syntax. A literal \n in a raw string will still be interpreted as a newline in your regular expressions and replacement texts.
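As a small illustration of this behavior, the sketch below (assuming Python 3, with made-up subject strings and variable names) shows that r"\n" is two characters at the string level, yet the re module treats it as a newline both in a pattern and in replacement text.

```python
import re

# At the string level, the raw string holds a backslash followed by the letter n.
raw_newline = r"\n"
raw_length = len(raw_newline)

# As a regex, the re module interprets those two characters as a line break.
found = re.search(raw_newline, "line one\nline two")
newline_position = found.start() if found else None

# As replacement text, the re module also turns them into a line break.
joined = re.sub(r",\s*", r"\n", "a, b,c")
```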

There is a difference between the string "\n" on one side, and the string "\\n" and the raw string r"\n" on the other side when using re.VERBOSE (see Recipe 3.4) to turn on free-spacing mode, as explained in Recipe 2.18. "\n" is a string with a literal line break, which is ignored as whitespace. "\\n" and r"\n" are both strings with the regex token \n, which matches a newline.

When using free-spacing mode, triple-quoted raw strings such as r"""\n""" are the best solution, because they can span multiple lines. Also, \n is not interpreted at the string level, so it can be interpreted at the regex level to match a line break.
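The difference is easiest to see side by side. This is a minimal sketch, assuming Python 3; the two-line subject string and the variable names are hypothetical. The first regex keeps \n as a regex token inside a triple-quoted raw string; the second uses a normal string, so its "\n" becomes a literal line break that free-spacing mode discards.

```python
import re

# Triple-quoted raw string: spans several lines, and \n stays a regex token.
verbose_raw = re.compile(r"""
    \d+   # one or more digits
    \n    # regex token that matches a line break
    \d+   # one or more digits on the next line
""", re.VERBOSE)

# Normal string: "\n" turns into a real line break before the regex is compiled,
# and free-spacing mode ignores it as whitespace, leaving just \d+\d+.
verbose_plain = re.compile("\\d+ \n \\d+", re.VERBOSE)

two_lines = "12\n34"
one_line = "1234"

raw_matches_two_lines = verbose_raw.fullmatch(two_lines) is not None
raw_matches_one_line = verbose_raw.fullmatch(one_line) is not None
plain_matches_two_lines = verbose_plain.fullmatch(two_lines) is not None
plain_matches_one_line = verbose_plain.fullmatch(one_line) is not None
```

Only the raw-string version requires a line break between the two runs of digits; the normal-string version silently drops it.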

Ruby

In Ruby, regular expressions are best created by using the special syntax for declaring literal regular expressions. Simply place your regular expression between two forward slashes. If any forward slashes occur within the regular expression itself, escape those with a backslash.

If you don’t want to escape forward slashes in your regex, you can prefix your regular expression with %r and then use any punctuation character of your choice as the delimiter.

Although it is possible to create a Regexp object from a string, it makes little sense to use the string notation for literal regular expressions in your code. You then would have to escape quotes and backslashes, which generally leads to a forest of backslashes.

Tip

Ruby is very similar to JavaScript in this respect, except that the name of the class is Regexp as one word in Ruby, whereas it is RegExp with camel caps in JavaScript.

See Also

Recipe 2.3 explains how character classes work, and why two backslashes are needed in the regular expression to include just one in the character class.

Recipe 3.4 explains how to set regular expression options, which is done as part of literal regular expressions in some programming languages.
