3.4. Set Regular Expression Options

Problem

You want to compile a regular expression with all of the available matching modes: free-spacing, case insensitive, dot matches line breaks, and “^ and $ match at line breaks.”

Solution

C#

Regex regexObj = new Regex("regex pattern",
    RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase |
    RegexOptions.Singleline | RegexOptions.Multiline);

VB.NET

Dim RegexObj As New Regex("regex pattern",
    RegexOptions.IgnorePatternWhitespace Or RegexOptions.IgnoreCase Or
    RegexOptions.Singleline Or RegexOptions.Multiline)

Java

Pattern regex = Pattern.compile("regex pattern",
    Pattern.COMMENTS | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE |
    Pattern.DOTALL | Pattern.MULTILINE);

JavaScript

Literal regular expression in your code:

var myregexp = /regex pattern/im;

Regular expression retrieved from user input, as a string:

var myregexp = new RegExp(userinput, "im");

XRegExp

var myregexp = XRegExp("regex pattern", "xism");

PHP

$regexstring = '/regex pattern/xism';

Perl

m/regex pattern/xism;

Python

reobj = re.compile("regex pattern",
    re.VERBOSE | re.IGNORECASE |
    re.DOTALL | re.MULTILINE)

Ruby

Literal regular expression in your code:

myregexp = /regex pattern/xim;

Regular expression retrieved from user input, as a string:

myregexp = Regexp.new(userinput,
    Regexp::EXTENDED | Regexp::IGNORECASE |
    Regexp::MULTILINE);

Discussion

Many of the regular expressions in this book, and those that you find elsewhere, are written to be used with certain regex matching modes. There are four basic modes that nearly all modern regex flavors support. Unfortunately, some flavors use inconsistent and confusing names for the options that implement the modes. Using the wrong modes usually breaks the regular expression.

All the solutions in this recipe use flags or options provided by the programming language or regular expression class to set the modes. Another way to set modes is to use mode modifiers within the regular expression. Mode modifiers within the regex always override options or flags set outside the regular expression.
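As a minimal sketch of the difference between the two approaches, the Python snippet below sets case insensitivity once with a flag and once with a mode modifier, and then uses a scoped modifier to switch the mode off again for part of the regex. It assumes Python 3.6 or later for the scoped (?-i:...) syntax; the variable names and sample strings are made up for illustration.

```python
import re

# Mode set from outside the regex, via a flag passed to compile().
outside = re.compile(r"regex", re.IGNORECASE)

# The same mode set from inside the regex with a mode modifier.
inside = re.compile(r"(?i)regex")

# A scoped modifier turns case insensitivity off for one group,
# overriding the flag passed to compile() for that part only.
mixed = re.compile(r"(?-i:Reg)ex", re.IGNORECASE)

print(bool(outside.fullmatch("REGEX")))  # True
print(bool(inside.fullmatch("REGEX")))   # True
print(bool(mixed.fullmatch("RegEX")))    # True
print(bool(mixed.fullmatch("REGEX")))    # False
```

The last two lines show the override: "Reg" must match exactly as typed, while "ex" still follows the flag set outside the regex.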

.NET

The Regex() constructor takes an optional second parameter with regular expression options. You can find the available options in the RegexOptions enumeration.

Free-spacing: RegexOptions.IgnorePatternWhitespace
Case insensitive: RegexOptions.IgnoreCase
Dot matches line breaks: RegexOptions.Singleline
^ and $ match at line breaks: RegexOptions.Multiline

Java

The Pattern.compile() class factory takes an optional second parameter with regular expression options. The Pattern class defines several constants that set the various options. You can set multiple options by combining them with the bitwise inclusive or operator |.

Free-spacing: Pattern.COMMENTS
Case insensitive: Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE
Dot matches line breaks: Pattern.DOTALL
^ and $ match at line breaks: Pattern.MULTILINE

There are indeed two options for case insensitivity, and you have to set both for full case insensitivity. If you set only Pattern.CASE_INSENSITIVE, only the English letters A to Z are matched case insensitively. If you set both options, all characters from all scripts are matched case insensitively. The only reason not to use Pattern.UNICODE_CASE is performance, in case you know in advance you’ll be dealing with ASCII text only. When using mode modifiers inside your regular expression, use (?i) for ASCII-only case insensitivity and (?iu) for full case insensitivity.

JavaScript

In JavaScript, you can specify options by appending one or more single-letter flags to the RegExp literal, after the forward slash that terminates the regular expression. When talking about these flags in documentation, they are usually written as /i and /m, even though the flag itself is only one letter. No additional slashes are added to specify regex mode flags.

When using the RegExp() constructor to compile a string into a regular expression, you can pass an optional second parameter with flags to the constructor. The second parameter should be a string with the letters of the options you want to set. Do not put any slashes into the string.

Free-spacing: Not supported by JavaScript.
Case insensitive: /i
Dot matches line breaks: Not supported by JavaScript.
^ and $ match at line breaks: /m

XRegExp

XRegExp extends JavaScript’s regular expression syntax, adding support for the “free-spacing” and “dot matches line breaks” modes with the letters “x” and “s” commonly used by other regular expression flavors. Pass these letters in the string with the flags in the second parameter to the XRegExp() constructor.

Free-spacing: "x"
Case insensitive: "i"
Dot matches line breaks: "s"
^ and $ match at line breaks: "m"

PHP

Recipe 3.1 explains that the PHP preg functions require literal regular expressions to be delimited with two punctuation characters, usually forward slashes, and the whole lot formatted as a string literal. You can specify regular expression options by appending one or more single-letter modifiers to the end of the string. That is, the modifier letters come after the closing regex delimiter, but still inside the string’s single or double quotes. When talking about these modifiers in documentation, they are usually written as /x, even though the flag itself is only one letter, and even though the delimiter between the regex and the modifiers doesn’t have to be a forward slash.

Free-spacing: /x
Case insensitive: /i
Dot matches line breaks: /s
^ and $ match at line breaks: /m

Perl

You can specify regular expression options by appending one or more single-letter modifiers to the end of the pattern-matching or substitution operator. When talking about these modifiers in documentation, they are usually written as /x, even though the flag itself is only one letter, and even though the delimiter between the regex and the modifiers doesn’t have to be a forward slash.

Free-spacing: /x
Case insensitive: /i
Dot matches line breaks: /s
^ and $ match at line breaks: /m

Python

The compile() function (explained in the previous recipe) takes an optional second parameter with regular expression options. You can build up this parameter by using the | operator to combine the constants defined in the re module. Many of the other functions in the re module that take a literal regular expression as a parameter also accept regular expression options as a final and optional parameter.
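The following sketch illustrates both ways of passing options: as the second parameter to compile(), and as the final parameter to a module-level function such as re.search(). The subject string and variable names are invented for the example.

```python
import re

subject = "First line\nsecond LINE"

# Compile once, passing the combined options as the second parameter.
reobj = re.compile(r"^second line$", re.IGNORECASE | re.MULTILINE)
compiled_match = reobj.search(subject)

# Module-level functions accept the same options as a final parameter.
direct_match = re.search(r"^second line$", subject,
                         re.IGNORECASE | re.MULTILINE)

# Without the options, the same regex finds nothing in this subject.
no_flags_match = re.search(r"^second line$", subject)

print(compiled_match.group())  # second LINE
print(direct_match.group())    # second LINE
print(no_flags_match)          # None
```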

The constants for the regular expression options come in pairs. Each option can be represented either as a constant with a full name or as just a single letter. Their functionality is equivalent. The only difference is that the full name makes your code easier to read by developers who aren’t familiar with the alphabet soup of regular expression options. The basic single-letter options listed in this section are the same as in Perl.

Free-spacing: re.VERBOSE or re.X
Case insensitive: re.IGNORECASE or re.I
Dot matches line breaks: re.DOTALL or re.S
^ and $ match at line breaks: re.MULTILINE or re.M
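To see what each of the four modes contributes, here is a small sketch with a regex that happens to need all of them. The pattern, the subject string, and the names PATTERN and ALL_MODES are made up for this illustration; the regex is written so that removing any single mode makes the match fail.

```python
import re

PATTERN = r"""
    ^ key     # a word at the start of a line
    .         # any character, including a line break
    value $   # a word at the end of a line
"""
ALL_MODES = re.VERBOSE | re.IGNORECASE | re.DOTALL | re.MULTILINE
subject = "intro\nKEY\nVALUE\noutro"

match = re.compile(PATTERN, ALL_MODES).search(subject)
print(repr(match.group()))  # 'KEY\nVALUE'

# The single-letter constants are aliases for the full names.
assert ALL_MODES == re.X | re.I | re.S | re.M

# Drop one mode at a time; this particular regex then fails to match.
results = [re.search(PATTERN, subject, ALL_MODES & ~mode)
           for mode in (re.VERBOSE, re.IGNORECASE, re.DOTALL, re.MULTILINE)]
print(results)  # [None, None, None, None]
```

This is the sense in which using the wrong modes breaks a regular expression: the pattern text is unchanged, yet it no longer matches the same subject.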

Ruby

In Ruby, you can specify options by appending one or more single-letter flags to the Regexp literal, after the forward slash that terminates the regular expression. When talking about these flags in documentation, they are usually written as /i and /m, even though the flag itself is only one letter. No additional slashes are added to specify regex mode flags.

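When using the Regexp.new() factory to compile a string into a regular expression, you can pass an optional second parameter with flags to the constructor. The second parameter should be either nil to turn off all options, or a combination of constants from the Regexp class combined with the bitwise or operator |. Do not use the or keyword here: it is a logical operator that returns only its first operand, so all but the first option would be silently dropped.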

Free-spacing: /x or Regexp::EXTENDED
Case insensitive: /i or Regexp::IGNORECASE
Dot matches line breaks: /m or Regexp::MULTILINE. Ruby indeed uses “m” and “multiline” here, whereas all the other flavors use “s” or “singleline” for “dot matches line breaks.”
^ and $ match at line breaks: The caret and dollar always match at line breaks in Ruby. You cannot turn this off. Use \A and \Z to match at the start or end of the subject string.

Additional Language-Specific Options

.NET

RegexOptions.ExplicitCapture makes all groups, except named groups, noncapturing. With this option, () is the same as (?:). If you always name your capturing groups, turn on this option to make your regular expression more efficient without the need to use the (?:) syntax. Instead of using RegexOptions.ExplicitCapture, you can turn on this option by putting (?n) at the start of your regular expression. See Recipe 2.9 to learn about grouping. Recipe 2.11 explains named groups.

Specify RegexOptions.ECMAScript if you’re using the same regular expression in your .NET code and in JavaScript code, and you want to make sure it behaves in the same way. This is particularly useful when you’re developing the client side of a web application in JavaScript and the server side in ASP.NET. The most important effect is that with this option, \w and \d are restricted to ASCII characters, as they are in JavaScript.

Java

An option unique to Java is Pattern.CANON_EQ, which enables “canonical equivalence.” As explained in the discussion in Unicode grapheme, Unicode provides different ways to represent characters with diacritics. When you turn on this option, your regex will match a character, even if it is encoded differently in the subject string. For instance, the regex \u00E0 will match both "\u00E0" and "\u0061\u0300", because they are canonically equivalent. They both appear as “à” when displayed on screen, indistinguishable to the end user. Without canonical equivalence, the regex \u00E0 does not match the string "\u0061\u0300". This is how all other regex flavors discussed in this book behave.
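The behavior without canonical equivalence can be observed in Python, whose re module has no counterpart to Pattern.CANON_EQ. The sketch below also shows one possible workaround, normalizing the subject to the same Unicode form (NFC here) as the pattern. That is an assumption about how you might approximate the effect, not a description of what Java does internally.

```python
import re
import unicodedata

precomposed = "\u00E0"       # "à" as a single code point
decomposed = "\u0061\u0300"  # "a" followed by a combining grave accent

# Without canonical equivalence, the two encodings do not match.
plain = re.search(precomposed, decomposed)
print(plain)  # None

# Normalizing the subject to NFC composes it into the single code point,
# after which the same regex finds it.
normalized = re.search(precomposed, unicodedata.normalize("NFC", decomposed))
print(normalized is not None)  # True
```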

In Java 7, you can set Pattern.UNICODE_CHARACTER_CLASS to make shorthand character classes match Unicode characters rather than just ASCII characters. See Shorthands in Recipe 2.3 for details.

Finally, Pattern.UNIX_LINES tells Java to treat only \n as a line break character for the dot, caret, and dollar. By default, all Unicode line breaks are treated as line break characters.

JavaScript

If you want to apply a regular expression repeatedly to the same string (e.g., to iterate over all matches or to search and replace all matches instead of just the first) specify the /g or “global” flag.

XRegExp

XRegExp needs the “g” flag if you want to apply a regular expression repeatedly to the same string just as standard JavaScript does. XRegExp also adds the “n” flag which makes all groups, except named groups, noncapturing. With this option, () is the same as (?:). If you always name your capturing groups, turn on this option to make your regular expression more efficient without the need to use the (?:) syntax. See Recipe 2.9 to learn about grouping. Recipe 2.11 explains named groups.

PHP

/u tells PCRE to interpret both the regular expression and the subject string as UTF-8 strings. This modifier also enables Unicode regex tokens such as \x{FFFF} and \p{L}. These are explained in Recipe 2.7. Without this modifier, PCRE treats each byte as a separate character, and Unicode regex tokens cause an error.

/U flips the “greedy” and “lazy” behavior of adding an extra question mark to a quantifier. Normally, .* is greedy and .*? is lazy. With /U, .* is lazy and .*? is greedy. We strongly recommend that you never use this flag, as it will confuse programmers who read your code later and miss the extra /U modifier, which is unique to PHP. Also, don’t confuse /U with /u if you encounter it in somebody else’s code. Regex modifiers are case sensitive.

Perl

If you want to apply a regular expression repeatedly to the same string (e.g., to iterate over all matches or to search-and-replace all matches instead of just the first one), specify the /g (“global”) flag.

If you interpolate a variable in a regex as in m/I am $name/ then Perl will recompile the regular expression each time it needs to be used, because the contents of $name may have changed. You can suppress this with the /o modifier. m/I am $name/o is compiled the first time Perl needs to use it, and then reused the way it is after that. If the contents of $name change, the regex will not reflect the change. See Recipe 3.3 if you want to control when the regex is recompiled.

If your regex uses shorthand character classes or word boundaries, you can specify one of the /d, /u, /a, or /l flags to control whether the shorthands and word boundaries will match only ASCII characters, or whether they use Unicode or the current locale. The “Variations” sections in Recipes 2.3 and 2.6 have more details on what these flags do in Perl.

Python

Python has two extra options that change the meaning of word boundaries (see Recipe 2.6) and the shorthand character classes \w, \d, and \s, as well as their negated counterparts (see Recipe 2.3). By default, these tokens deal only with ASCII letters, digits, and whitespace.

The re.LOCALE or re.L option makes these tokens dependent on the current locale. The locale then determines which characters are treated as letters, digits, and whitespace by these regex tokens. You should specify this option when the subject string is not a Unicode string and you want characters such as letters with diacritics to be treated as such.

The re.UNICODE or re.U option makes these tokens dependent on the Unicode standard. All characters defined by Unicode as letters, digits, and whitespace are then treated as such by these regex tokens. You should specify this option when the subject string you’re applying the regular expression to is a Unicode string.
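The ASCII-only default described above is that of Python 2. The sketch below assumes Python 3, where patterns given as str already behave as if re.UNICODE were set, and where the re.ASCII flag (not covered in the text above) restores ASCII-only matching. The subject string is invented for the example.

```python
import re

# "café" followed by three Arabic-Indic digits, written with escapes.
subject = "caf\u00E9 \u0661\u0662\u0663"

# Unicode rules: the accented letter and the non-ASCII digits are word
# characters, so both runs are found in full.
unicode_words = re.findall(r"\w+", subject, re.UNICODE)

# ASCII rules: \w stops at the first non-ASCII character.
ascii_words = re.findall(r"\w+", subject, re.ASCII)

# In Python 3, a str pattern without flags uses the Unicode rules.
default_words = re.findall(r"\w+", subject)

print(len(unicode_words))              # 2
print(ascii_words)                     # ['caf']
print(default_words == unicode_words)  # True
```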

Ruby

The Regexp.new() factory takes an optional third parameter to select the string encoding your regular expression supports. If you do not specify an encoding for your regular expression, it will use the same encoding as your source file. Most of the time, using the source file’s encoding is the right thing to do.

To select an encoding explicitly, pass a single character for this parameter. The parameter is case-insensitive. Possible values are:

n

This stands for “None.” Each byte in your string is treated as one character. Use this for ASCII text.

e

Enables the “EUC” encoding for Far East languages.

s

Enables the Japanese “Shift-JIS” encoding.

u

Enables UTF-8, which uses one to four bytes per character and supports all languages in the Unicode standard (which includes all living languages of any significance).

When using a literal regular expression, you can set the encoding with the modifiers /n, /e, /s, and /u. Only one of these modifiers can be used for a single regular expression. They can be used in combination with any or all of the /x, /i, and /m modifiers.

Caution

Do not mistake Ruby’s /s for that of Perl, Java, or .NET. In Ruby, /s forces the Shift-JIS encoding. In Perl and most other regex flavors, it turns on “dot matches line breaks” mode. In Ruby, you can do that with /m.

See Also

The effects of the matching modes are explained in detail in Chapter 2. Those sections also explain the use of mode modifiers within the regular expression.

Free-spacing: Recipe 2.18
Case insensitive: Case-insensitive matching in Recipe 2.1
Dot matches line breaks: Recipe 2.4
^ and $ match at line breaks: Recipe 2.5

Recipes 3.1 and 3.3 explain how to use literal regular expressions in your source code and how to create regular expression objects. You set the regular expression options while creating a regular expression.
