You have been given the regular expression ‹[$"'\n\d/\\]› as the solution to a
problem. This regular expression consists of a single character class
that matches a dollar sign, a double quote, a single quote, a line feed,
any digit between 0 and 9, a forward slash, or a backslash. You want to
hardcode this regular expression into your source code as a string
constant or regular expression operator.
When this book shows you a regular expression by itself (as opposed to as part of a larger source code snippet), it always shows regular expressions unadorned. This recipe is the only exception. If you’re using a regular expression tester such as RegexBuddy or RegexPal, you would type in the regex this way. If your application accepts a regular expression as user input, the user would type it in this way.
But if you want to hardcode the regular expression into your source code, you have extra work. Carelessly copying and pasting regular expressions from a regular expression tester into your source code—or vice versa—will often leave you scratching your head as to why the regular expression works in your tool but not in your source code, or why the tester fails on a regex you’ve copied from somebody else’s code. All programming languages discussed in this book require literal regular expressions to be delimited in a certain way, with some languages requiring strings and some requiring a special regex constant. If your regex includes the language’s delimiters or certain other characters with special meanings in the language, you have to escape them.
The backslash is the most commonly used escape character. That’s why most of the solutions to this problem have far more backslashes in them than the four in the original regular expression.
In C#, you can pass literal regular expressions to the Regex()
constructor, and to various member functions in the Regex class. The
parameter that takes the regular expression is always declared as a
string.
C# supports two kinds of string literals. The most common kind is the
double-quoted string, well-known from languages such as C++ and Java.
Within double-quoted strings, double quotes and backslashes must be
escaped with a backslash. Escapes for nonprintable characters, such as
‹\n›, are also supported in strings. There is a difference between
"\n" and "\\n" when using RegexOptions.IgnorePatternWhitespace (see
Recipe 3.4) to turn on free-spacing mode, as explained in Recipe 2.18.
"\n" is a string with a literal line break, which is ignored as
whitespace. "\\n" is a string with the regex token ‹\n›, which matches
a newline.
Verbatim strings start with an at sign and a double quote, and end
with a double quote on its own. To include a double quote in a
verbatim string, double it up. Backslashes do not need to be escaped,
resulting in a significantly more readable regular expression. @"\n"
is always the regex token ‹\n›, which matches a newline, even in
free-spacing mode. Verbatim strings do not support ‹\n› at the string
level, but can span multiple lines instead. That makes verbatim
strings ideal for free-spacing regular expressions.
The choice is clear: use verbatim strings to put regular expressions into your C# source code.
In VB.NET, you can pass literal regular expressions to the Regex()
constructor, and to various member functions in the Regex class. The
parameter that takes the regular expression is always declared as a
string.
Visual Basic uses double-quoted strings. Double quotes within the string must be doubled. No other characters need to be escaped.
In Java, you can pass literal regular expressions to the
Pattern.compile() class factory, and to various functions of the
String class. The parameter that takes the regular expression is
always declared as a string.
Java uses double-quoted strings. Within double-quoted strings, double
quotes and backslashes must be escaped with a backslash. Escapes for
nonprintable characters, such as ‹\n›, and Unicode escapes such as
‹\uFFFF› are also supported in strings. There is a difference between
"\n" and "\\n" when using Pattern.COMMENTS (see Recipe 3.4) to turn on
free-spacing mode, as explained in Recipe 2.18. "\n" is a string with
a literal line break, which is ignored as whitespace. "\\n" is a
string with the regex token ‹\n›, which matches a newline.
In JavaScript, regular expressions are best created by using the special syntax for declaring literal regular expressions. Simply place your regular expression between two forward slashes. If any forward slashes occur within the regular expression itself, escape those with a backslash.
Although it is possible to create a RegExp object from a string, it
makes little sense to use the string notation for literal regular
expressions in your code. You would have to escape quotes and
backslashes, which generally leads to a forest of backslashes.
If you use XRegExp to extend JavaScript’s regular expression syntax,
then you will be creating XRegExp objects from strings, and you’ll
need to escape quotes and backslashes.
Literal regular expressions for use with PHP’s preg functions are a
curious contraption. Unlike JavaScript or Perl, PHP does not have a
native regular expression type. Regular expressions must always be
quoted as strings. This is true for the ereg and mb_ereg functions as
well. But in their quest to mimic Perl, the developers of PHP’s
wrapper functions for PCRE added an additional requirement.
Within the string, the regular expression must be quoted as a
Perl-style literal regular expression. That means that where you would
write /regex/ in Perl, the string for PHP’s preg functions becomes
'/regex/'. As in Perl, you can use any pair of punctuation characters
as the delimiters. If the regex delimiter occurs within the regex, it
must be escaped with a backslash. To avoid this, choose a delimiter
that does not occur in the regex. For this recipe, we used the
percentage sign, because the forward slash occurs in the regex but the
percentage sign does not. If the forward slash does not occur in the
regex, use that, as it’s the most commonly used delimiter in Perl and
the required delimiter in JavaScript and Ruby.
PHP supports both single-quoted and double-quoted strings. Both require the quote (single or double) and the backslash within a regex to be escaped with a backslash. In double-quoted strings, the dollar sign also needs to be escaped. For regular expressions, you should use single-quoted strings, unless you really want to interpolate variables in your regex.
In Perl, literal regular expressions are used with the
pattern-matching operator and the substitution operator. The
pattern-matching operator consists of two forward slashes, with the
regex between them. Forward slashes within the regular expression must
be escaped with a backslash. There’s no need to escape any other
characters, except perhaps $ and @, as explained at the end of this
subsection.
An alternative notation for the pattern-matching operator puts the
regular expression between any pair of punctuation characters,
preceded by the letter m. If you use any kind of opening and closing
punctuation (parentheses, braces, or brackets) as the delimiter, they
need to match up: for example, m{regex}. If you use other punctuation,
simply use the same character twice. The solution for this recipe uses
the exclamation point. That saves us having to escape the literal
forward slash in the regular expression. Only the closing delimiter
needs to be escaped with a backslash.
The substitution operator is similar to the pattern-matching operator.
It starts with s instead of m, and tacks on the replacement text. When
using brackets or similar punctuation as the delimiters, you need two
pairs: s[regex][replace]. If you mix different delimiters, you also
need two pairs: s[regex]/replace/. For all other punctuation, use it
three times: s/regex/replace/.
Perl parses the pattern-matching and substitution operators as
double-quoted strings. If you write m/I am $name/ and $name holds
"Jan", you end up with the regular expression ‹I●am●Jan›. $" is also a
variable in Perl, so we have to escape the literal dollar sign in the
character class in our regular expression in this recipe.
Never escape a dollar sign that you want to use as an anchor (see
Recipe 2.5). An escaped dollar sign is always a literal. Perl is smart
enough to differentiate between dollars used as anchors, and dollars
used for variable interpolation, due to the fact that anchors can be
used sensibly only at the end of a group or the whole regex, or before
a newline. You shouldn’t escape the dollar in ‹m/^regex$/› if you want
to check whether “regex” matches the subject string entirely.
The at sign does not have a special meaning in regular expressions, but it is used for variable interpolation in Perl. You need to escape it in literal regular expressions in Perl code, as you do for double-quoted strings.
The functions in Python’s re module expect literal regular expressions
to be passed as strings. You can use any of the various ways that
Python provides to quote strings. Depending on the characters that
occur in your regular expression, different ways of quoting it may
reduce the number of characters you need to escape with backslashes.
Generally, raw strings are the best option. Python raw strings don’t
require any characters to be escaped. If you use a raw string, you
don’t need to double up the backslashes in your regular expression.
r"\d+" is easier to read than "\\d+", particularly as your regex gets
long.
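As a minimal sketch of that comparison, assuming Python 3, the two literals below describe the same three-character pattern. The variable names and the sample subject string are made up for illustration.

```python
import re

# A raw string keeps the backslash as typed; a normal string needs it doubled.
raw_pattern = r"\d+"
normal_pattern = "\\d+"

# Both literals produce the same text: a backslash, the letter d, and a plus.
same_text = raw_pattern == normal_pattern

# Either one can be handed to the re module.
digits = re.findall(raw_pattern, "Recipe 3.1 and Recipe 2.18")
```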
The only situation where raw strings aren’t ideal is when your regular expression includes both the single quote and double quote characters. Then you can’t use a raw string delimited with one pair of single or double quotes, because there’s no way to escape the quotes inside the regular expression. In that case, you can triple-quote the raw string, as we did in the Python solution for this recipe. The normal string is shown for comparison.
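The following sketch, assuming Python 3, shows one way this recipe's character class could be written both ways. It is an illustration rather than a reproduction of the printed solution: the normal string here doubles every backslash so that both literals hold identical text, and the variable names and subject string are invented for the example.

```python
import re

# Triple-quoted raw string: neither the quotes nor the backslashes are escaped.
raw_pattern = r"""[$"'\n\d/\\]"""

# Normal double-quoted string: escape the double quote, double every backslash.
normal_pattern = "[$\"'\\n\\d/\\\\]"

# Both literals hand the same 12 characters to the regex engine.
same_text = raw_pattern == normal_pattern

# The subject holds each character the class should match, plus spaces and an x.
matches = re.findall(raw_pattern, "$ \" ' \n 7 / \\ x")
```

The raw version reads exactly like the unadorned regex, which is the main argument for preferring it.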
If you want to use the Unicode features explained in Recipe 2.7 in
your regular expression in Python 2.x, you need to use Unicode
strings. You can turn a string into a Unicode string by preceding it
with a u. In Python 3.0 and later, all text is Unicode.
Raw strings don’t support nonprintable character escapes such as \n.
Raw strings treat escape sequences as literal text. This is not a
problem for the re module. It supports these escapes as part of the
regular expression syntax, and as part of the replacement text syntax.
A literal \n in a raw string will still be interpreted as a newline in
your regular expressions and replacement texts.
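A short sketch of that behavior, assuming Python 3 and using made-up sample strings: the raw string holds a backslash and the letter n, and the re module turns that pair into a newline in both the pattern and the replacement text.

```python
import re

# At the string level, r"\n" is two characters: a backslash and the letter n.
raw_length = len(r"\n")

# The re module still reads those two characters as a newline token.
found_newline = re.search(r"\n", "first\nsecond") is not None

# The same applies to replacement text passed to re.sub().
replaced = re.sub(r";", r"\n", "a;b")
```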
There is a difference between the string "\n" on one side, and the
string "\\n" and the raw string r"\n" on the other side when using
re.VERBOSE (see Recipe 3.4) to turn on free-spacing mode, as explained
in Recipe 2.18. "\n" is a string with a literal line break, which is
ignored as whitespace. "\\n" and r"\n" are both strings with the regex
token ‹\n›, which matches a newline.
When using free-spacing mode, triple-quoted raw strings such as
r"""\n""" are the best solution, because they can span multiple lines.
Also, ‹\n› is not interpreted at the string level, so it can be
interpreted at the regex level to match a line break.
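The difference can be checked with a small experiment. This is a sketch assuming Python 3; the subject string and the variable names are invented for the example.

```python
import re

subject = "one\ntwo"

# "\n" puts a real line break in the pattern, which re.VERBOSE discards
# as whitespace, so the pattern effectively becomes "onetwo".
literal_break = re.findall("one\ntwo", subject, re.VERBOSE)

# "\\n" and r"\n" both pass the two-character token \n to the regex engine.
escaped_token = re.findall("one\\ntwo", subject, re.VERBOSE)
raw_token = re.findall(r"one\ntwo", subject, re.VERBOSE)

# A triple-quoted raw string can span lines and carry comments.
spaced = re.findall(r"""
    one   # first word
    \n    # the line break between the words
    two   # second word
""", subject, re.VERBOSE)
```

Only the first call comes back empty, because its line break never reaches the regex engine as a token.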
In Ruby, regular expressions are best created by using the special syntax for declaring literal regular expressions. Simply place your regular expression between two forward slashes. If any forward slashes occur within the regular expression itself, escape those with a backslash.
If you don’t want to escape forward slashes in your regex, you can
prefix your regular expression with %r and then use any punctuation
character of your choice as the delimiter.
Although it is possible to create a Regexp object from a string, it
makes little sense to use the string notation for literal regular
expressions in your code. You then would have to escape quotes and
backslashes, which generally leads to a forest of backslashes.
Recipe 2.3 explains how character classes work, and why two backslashes are needed in the regular expression to include just one in the character class.
Recipe 3.4 explains how to set regular expression options, which is done as part of literal regular expressions in some programming languages.