You want to compile a regular expression with all of the available matching modes: free-spacing, case insensitive, dot matches line breaks, and “^ and $ match at line breaks.”
C#:
    Regex regexObj = new Regex("regex pattern",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase |
        RegexOptions.Singleline | RegexOptions.Multiline);
VB.NET:
    Dim RegexObj As New Regex("regex pattern",
        RegexOptions.IgnorePatternWhitespace Or RegexOptions.IgnoreCase Or
        RegexOptions.Singleline Or RegexOptions.Multiline)
Java:
    Pattern regex = Pattern.compile("regex pattern",
        Pattern.COMMENTS | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE |
        Pattern.DOTALL | Pattern.MULTILINE);
Many of the regular expressions in this book, and those that you find elsewhere, are written to be used with certain regex matching modes. There are four basic modes that nearly all modern regex flavors support. Unfortunately, some flavors use inconsistent and confusing names for the options that implement the modes. Using the wrong modes usually breaks the regular expression.
All the solutions in this recipe use flags or options provided by the programming language or regular expression class to set the modes. Another way to set modes is to use mode modifiers within the regular expression. Mode modifiers within the regex always override options or flags set outside the regular expression.
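As a minimal sketch of the two approaches, the following Python example sets case insensitivity once through a flag and once through a mode modifier, then uses a scoped modifier to switch the mode off for part of a pattern. Python is used here only for illustration. The pattern and subject strings are made up, and the scoped form (?-i:...) assumes Python 3.6 or later.

```python
import re

# Mode set outside the regex, through a flag passed to compile().
by_flag = re.compile(r"regex", re.IGNORECASE)

# The same mode set inside the regex with a mode modifier.
by_modifier = re.compile(r"(?i)regex")

# A modifier inside the regex overrides the flag: (?-i:...) switches
# case insensitivity off for that group only (assumes Python 3.6+).
mixed = re.compile(r"(?-i:reg)ex", re.IGNORECASE)

flag_hit = by_flag.fullmatch("REGEX") is not None
modifier_hit = by_modifier.fullmatch("REGEX") is not None
mixed_lower_prefix = mixed.fullmatch("regEX") is not None
mixed_upper_prefix = mixed.fullmatch("REGex") is not None
```

In this sketch the first three checks succeed and the last one fails, because the group (?-i:reg) stays case sensitive even though the flag asks for case insensitivity.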
In .NET, the Regex() constructor takes an optional second parameter with regular expression options. You can find the available options in the RegexOptions enumeration.
Free-spacing: RegexOptions.IgnorePatternWhitespace
Case insensitive: RegexOptions.IgnoreCase
Dot matches line breaks: RegexOptions.Singleline
^ and $ match at line breaks: RegexOptions.Multiline
In Java, the Pattern.compile() class factory takes an optional second parameter with regular expression options. The Pattern class defines several constants that set the various options. You can set multiple options by combining them with the bitwise inclusive or operator |.
Free-spacing: Pattern.COMMENTS
Case insensitive: Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE
Dot matches line breaks: Pattern.DOTALL
^ and $ match at line breaks: Pattern.MULTILINE
There are indeed two options for case insensitivity, and you have to set both for full case insensitivity. If you set only Pattern.CASE_INSENSITIVE, only the English letters A to Z are matched case insensitively. If you set both options, all characters from all scripts are matched case insensitively. The only reason not to use Pattern.UNICODE_CASE is performance, in case you know in advance you’ll be dealing with ASCII text only. When using mode modifiers inside your regular expression, use ‹(?i)› for ASCII-only case insensitivity and ‹(?iu)› for full case insensitivity.
In JavaScript, you can specify options by appending one or more single-letter flags to the RegExp literal, after the forward slash that terminates the regular expression. Documentation usually writes these flags as /i and /m, even though the flag itself is only one letter. No additional slashes are added to specify regex mode flags.
When using the RegExp() constructor to compile a string into a regular expression, you can pass an optional second parameter with flags to the constructor. The second parameter should be a string with the letters of the options you want to set. Do not put any slashes into the string.
Free-spacing: Not supported by JavaScript.
Case insensitive: /i
Dot matches line breaks: Not supported by JavaScript.
^ and $ match at line breaks: /m
XRegExp extends JavaScript’s regular expression syntax, adding support for the “free-spacing” and “dot matches line breaks” modes with the letters “x” and “s” commonly used by other regular expression flavors. Pass these letters in the string with the flags in the second parameter to the XRegExp() constructor.
Free-spacing: "x"
Case insensitive: "i"
Dot matches line breaks: "s"
^ and $ match at line breaks: "m"
Recipe 3.1 explains that the PHP preg functions require literal regular expressions to be delimited with two punctuation characters, usually forward slashes, and the whole lot formatted as a string literal. You can specify regular expression options by appending one or more single-letter modifiers to the end of the string. That is, the modifier letters come after the closing regex delimiter, but still inside the string’s single or double quotes. Documentation usually writes these modifiers in the form /x, even though the flag itself is only one letter, and even though the delimiter between the regex and the modifiers doesn’t have to be a forward slash.
Free-spacing: /x
Case insensitive: /i
Dot matches line breaks: /s
^ and $ match at line breaks: /m
In Perl, you can specify regular expression options by appending one or more single-letter modifiers to the end of the pattern-matching or substitution operator. Documentation usually writes these modifiers in the form /x, even though the flag itself is only one letter, and even though the delimiter between the regex and the modifiers doesn’t have to be a forward slash.
Free-spacing: /x
Case insensitive: /i
Dot matches line breaks: /s
^ and $ match at line breaks: /m
In Python, the compile() function (explained in the previous recipe) takes an optional second parameter with regular expression options. You can build up this parameter by using the | operator to combine the constants defined in the re module. Many of the other functions in the re module that take a literal regular expression as a parameter also accept regular expression options as a final and optional parameter.
The constants for the regular expression options come in pairs. Each option can be represented either as a constant with a full name or as just a single letter. Their functionality is equivalent. The only difference is that the full name makes your code easier to read by developers who aren’t familiar with the alphabet soup of regular expression options. The basic single-letter options listed in this section are the same as in Perl.
Free-spacing: re.VERBOSE or re.X
Case insensitive: re.IGNORECASE or re.I
Dot matches line breaks: re.DOTALL or re.S
^ and $ match at line breaks: re.MULTILINE or re.M
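The following sketch combines the four Python constants with the | operator and passes them as the second parameter to compile(). The pattern and the subject string are hypothetical and exist only to exercise each mode once.

```python
import re

# All four basic modes combined with the | operator.
all_modes = re.VERBOSE | re.IGNORECASE | re.DOTALL | re.MULTILINE

pattern = re.compile(
    r"""
    ^first    # free-spacing: whitespace and comments are ignored
    .*        # dot also matches line breaks
    last$     # $ matches at the end of any line
    """,
    all_modes,
)

subject = "skip\nFIRST one\ntwo LAST\nrest"
match = pattern.search(subject)
matched_text = match.group() if match else None

# The single-letter names are aliases of the full names.
same_flags = (re.X | re.I | re.S | re.M) == all_modes
```

Here the match starts at the second line and spans a line break, ending at "LAST". Taking away any one of the four modes makes the search fail on this subject.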
In Ruby, you can specify options by appending one or more single-letter flags to the Regexp literal, after the forward slash that terminates the regular expression. Documentation usually writes these flags as /i and /m, even though the flag itself is only one letter. No additional slashes are added to specify regex mode flags.
When using the Regexp.new() factory to compile a string into a regular expression, you can pass an optional second parameter with flags to the constructor. The second parameter should be either nil to turn off all options, or a combination of constants from the Regexp class combined with the bitwise or operator |.
RegexOptions.ExplicitCapture makes all groups, except named groups, noncapturing. With this option, ‹(⋯)› is the same as ‹(?:⋯)›. If you always name your capturing groups, turn on this option to make your regular expression more efficient without the need to use the ‹(?:⋯)› syntax. Instead of using RegexOptions.ExplicitCapture, you can turn on this option by putting ‹(?n)› at the start of your regular expression. See Recipe 2.9 to learn about grouping. Recipe 2.11 explains named groups.
Specify RegexOptions.ECMAScript if you’re using the same regular expression in your .NET code and in JavaScript code, and you want to make sure it behaves in the same way. This is particularly useful when you’re developing the client side of a web application in JavaScript and the server side in ASP.NET. The most important effect is that with this option, ‹\w› and ‹\d› are restricted to ASCII characters, as they are in JavaScript.
An option unique to Java is Pattern.CANON_EQ, which enables “canonical equivalence.” As explained in the discussion in “Unicode grapheme,” Unicode provides different ways to represent characters with diacritics. When you turn on this option, your regex will match a character, even if it is encoded differently in the subject string. For instance, the regex ‹\u00E0› will match both "\u00E0" and "\u0061\u0300", because they are canonically equivalent. They both appear as “à” when displayed on screen, indistinguishable to the end user. Without canonical equivalence, the regex ‹\u00E0› does not match the string "\u0061\u0300". This is how all other regex flavors discussed in this book behave.
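To illustrate the behavior without canonical equivalence, here is a small Python sketch. Python’s re module has no counterpart to Pattern.CANON_EQ, so the second half normalizes the subject string with unicodedata.normalize() first. That is a substitute technique assumed for this example, not what Java does internally.

```python
import re
import unicodedata

precomposed = "\u00e0"        # à as a single code point
decomposed = "\u0061\u0300"   # a followed by a combining grave accent

pattern = re.compile("\u00e0")

# Without canonical equivalence, the decomposed form is not matched.
direct_match = pattern.search(decomposed) is not None

# Workaround (not Java's CANON_EQ): normalize the subject to NFC first,
# which composes the two code points into the single precomposed one.
normalized = unicodedata.normalize("NFC", decomposed)
normalized_match = pattern.search(normalized) is not None
```

Normalizing both the pattern and the subject to the same form before matching approximates canonical equivalence for simple cases like this one.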
In Java 7, you can set Pattern.UNICODE_CHARACTER_CLASS to make shorthand character classes match Unicode characters rather than just ASCII characters. See “Shorthands” in Recipe 2.3 for details.
Finally, Pattern.UNIX_LINES tells Java to treat only ‹\n› as a line break character for the dot, caret, and dollar. By default, all Unicode line breaks are treated as line break characters.
In JavaScript, if you want to apply a regular expression repeatedly to the same string (e.g., to iterate over all matches or to search and replace all matches instead of just the first), specify the /g or “global” flag.
XRegExp needs the “g” flag if you want to apply a regular expression repeatedly to the same string, just as standard JavaScript does. XRegExp also adds the “n” flag, which makes all groups, except named groups, noncapturing. With this option, ‹(⋯)› is the same as ‹(?:⋯)›. If you always name your capturing groups, turn on this option to make your regular expression more efficient without the need to use the ‹(?:⋯)› syntax. See Recipe 2.9 to learn about grouping. Recipe 2.11 explains named groups.
In PHP, /u tells PCRE to interpret both the regular expression and the subject string as UTF-8 strings. This modifier also enables Unicode regex tokens such as ‹\x{FFFF}› and ‹\p{L}›. These are explained in Recipe 2.7. Without this modifier, PCRE treats each byte as a separate character, and Unicode regex tokens cause an error.
/U flips the “greedy” and “lazy” behavior of adding an extra question mark to a quantifier. Normally, ‹.*› is greedy and ‹.*?› is lazy. With /U, ‹.*› is lazy and ‹.*?› is greedy. We strongly recommend that you never use this flag, as it will confuse programmers who read your code later and miss the extra /U modifier, which is unique to PHP. Also, don’t confuse /U with /u if you encounter it in somebody else’s code. Regex modifiers are case sensitive.
In Perl, if you want to apply a regular expression repeatedly to the same string (e.g., to iterate over all matches or to search-and-replace all matches instead of just the first one), specify the /g (“global”) flag.
If you interpolate a variable in a regex, as in m/I am $name/, then Perl will recompile the regular expression each time it needs to be used, because the contents of $name may have changed. You can suppress this with the /o modifier. m/I am $name/o is compiled the first time Perl needs to use it, and then reused the way it is after that. If the contents of $name change, the regex will not reflect the change. See Recipe 3.3 if you want to control when the regex is recompiled.
If your regex uses shorthand character classes or word boundaries, you can specify one of the /d, /u, /a, or /l flags to control whether the shorthands and word boundaries will match only ASCII characters, or whether they use Unicode or the current locale. The “Variations” sections in Recipes 2.3 and 2.6 have more details on what these flags do in Perl.
Python has two extra options that change the meaning of word boundaries (see Recipe 2.6) and the shorthand character classes ‹\w›, ‹\d›, and ‹\s›, as well as their negated counterparts (see Recipe 2.3). By default, these tokens deal only with ASCII letters, digits, and whitespace in Python 2 and in byte-string patterns in Python 3. Unicode string patterns in Python 3 follow the Unicode standard by default.
The re.LOCALE or re.L option makes these tokens dependent on the current locale. The locale then determines which characters are treated as letters, digits, and whitespace by these regex tokens. You should specify this option when the subject string is not a Unicode string and you want characters such as letters with diacritics to be treated as such.
The re.UNICODE or re.U option makes these tokens dependent on the Unicode standard. All characters defined by Unicode as letters, digits, and whitespace are then treated as such by these regex tokens. You should specify this option when the subject string you’re applying the regular expression to is a Unicode string.
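A short sketch of the difference between ASCII-only and Unicode shorthands follows. It assumes Python 3, where string patterns already follow Unicode rules unless you pass re.ASCII (re.A). This is the reverse of the Python 2 default, where you opt in with re.UNICODE. The sample text is made up.

```python
import re

# "café naïve", written with escapes so the letters are precomposed.
subject = "caf\u00e9 na\u00efve"

# Python 3 string pattern: \w follows Unicode rules by default,
# so re.UNICODE is implied and does not need to be passed.
unicode_words = re.findall(r"\w+", subject)

# re.ASCII (or re.A) restricts \w, \d, \s, and \b to ASCII characters.
ascii_words = re.findall(r"\w+", subject, re.ASCII)
```

With Unicode rules the two words come back whole. With re.ASCII the accented letters count as non-word characters and split the words apart.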
In Ruby, the Regexp.new() factory takes an optional third parameter to select the string encoding your regular expression supports. If you do not specify an encoding for your regular expression, it will use the same encoding as your source file. Most of the time, using the source file’s encoding is the right thing to do.
To select an encoding explicitly, pass a single character for this parameter. The parameter is case-insensitive. Possible values are:
When using a literal regular expression, you can set the encoding with the modifiers /n, /e, /s, and /u. Only one of these modifiers can be used for a single regular expression. They can be used in combination with any or all of the /x, /i, and /m modifiers.
The effects of the matching modes are explained in detail in Chapter 2. Those sections also explain the use of mode modifiers within the regular expression.
Free-spacing: Recipe 2.18
Case insensitive: “Case-insensitive matching” in Recipe 2.1
Dot matches line breaks: Recipe 2.4
^ and $ match at line breaks: Recipe 2.5
Recipes 3.1 and 3.3 explain how to use literal regular expressions in your source code and how to create regular expression objects. You set the regular expression options while creating a regular expression.