2.1. Match Literal Text

Problem

Create a regular expression to exactly match this gloriously contrived sentence: The punctuation characters in the ASCII table are: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~.

This is intended to show which characters have special meaning in regular expressions, and which characters always match themselves literally.

Solution

This regular expression matches the sentence stated in the problem:

The punctuation characters in the ASCII table are: ↵
!"#\$%&'\(\)\*\+,-\./:;<=>\?@\[\\]\^_`\{\|}~
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
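
As a quick check, the solution can be tried out in Python. This is only a sketch: the variable names are illustrative, and raw triple-quoted strings are used so that neither the quote characters nor the backslashes need extra escaping in the source code.

```python
import re

# The sentence from the problem, including the single backslash in [\].
sentence = r"""The punctuation characters in the ASCII table are: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~"""

# The solution regex: only the twelve metacharacters carry a backslash.
pattern = r"""The punctuation characters in the ASCII table are: !"#\$%&'\(\)\*\+,-\./:;<=>\?@\[\\]\^_`\{\|}~"""

# fullmatch() succeeds only if the regex matches the entire sentence.
match = re.fullmatch(pattern, sentence)
print(match is not None)
```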

Discussion

Any regular expression that does not include any of the dozen characters $()*+.?[\^{| simply matches itself. To find whether Mary had a little lamb in the text you’re editing, simply search for Mary had a little lamb. It doesn’t matter whether the “regular expression” checkbox is turned on in your text editor.

The 12 punctuation characters that make regular expressions work their magic are called metacharacters. If you want your regex to match them literally, you need to escape them by placing a backslash in front of them. Thus, the regex \$\(\)\*\+\.\?\[\\\^\{\| matches the text $()*+.?[\^{|.

Notably absent from the list are the closing square bracket ], the hyphen -, and the closing curly bracket }. The first two become metacharacters only after an unescaped [, and the } only after an unescaped {. There’s no need to ever escape }. Metacharacter rules for the blocks that appear between [ and ] are explained in Recipe 2.3.

Escaping any other nonalphanumeric character does not change how your regular expression works—at least not when working with any of the flavors discussed in this book. Escaping an alphanumeric character may give it a special meaning or throw a syntax error.
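
The escaping rules above can be checked with a short Python sketch. The helper name matches_when_escaped is made up for this illustration, and the behavior shown for escaped letters is what Python's re module does; other flavors may treat an unknown escape such as \q differently.

```python
import re

# The twelve metacharacters; "\\" is a single backslash in Python source.
METACHARACTERS = "$()*+.?[\\^{|"

# The remaining ASCII punctuation, none of which needs escaping.
OTHER_PUNCTUATION = "!\"#%&',-/:;<=>@]_`}~"


def matches_when_escaped(ch: str) -> bool:
    """Return True if a backslash followed by ch matches ch literally."""
    return re.fullmatch("\\" + ch, ch) is not None


# Every metacharacter matches itself once it is escaped.
all_meta_ok = all(matches_when_escaped(ch) for ch in METACHARACTERS)

# Escaping other punctuation is harmless: it still matches itself.
all_other_ok = all(matches_when_escaped(ch) for ch in OTHER_PUNCTUATION)

# Escaping a letter is different: \d becomes a token that matches a digit...
digit_token = re.fullmatch(r"\d", "7") is not None

# ...and an unknown letter escape such as \q is a syntax error in Python.
try:
    re.compile(r"\q")
    bad_escape_rejected = False
except re.error:
    bad_escape_rejected = True

print(all_meta_ok, all_other_ok, digit_token, bad_escape_rejected)
```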

People new to regular expressions often escape every punctuation character in sight. Don’t let anyone know you’re a newbie. Escape judiciously. A jungle of needless backslashes makes regular expressions hard to read, particularly when all those backslashes have to be doubled up to quote the regex as a literal string in source code.
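
To illustrate the doubling problem, here is a minimal Python sketch that writes the same regex twice, once as an ordinary string literal and once as a raw string. The variable names are illustrative. Raw strings are a Python feature; other languages have their own ways of avoiding the doubled backslashes.

```python
import re

# Ordinary string literal: every regex backslash must be doubled.
doubled = "\\$\\(\\)\\*\\+\\.\\?\\[\\\\\\^\\{\\|"

# Raw string literal: the regex appears exactly as you would type it.
raw = r"\$\(\)\*\+\.\?\[\\\^\{\|"

# Both literals produce the same regex...
same_regex = doubled == raw

# ...which matches the twelve metacharacters as literal text.
subject = "$()*+.?[\\^{|"
match = re.fullmatch(raw, subject)

print(same_regex, match is not None)
```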

Variations

Block escape

We can make our solution easier to read when using a regex flavor that supports a feature called block escape:

The punctuation characters in the ASCII table are: ↵
\Q!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~\E
Regex options: None
Regex flavors: Java 6, PCRE, Perl

Perl, PCRE, and Java support the regex tokens \Q and \E. \Q suppresses the meaning of all metacharacters, including the backslash, until \E. If you omit \E, all characters after the \Q until the end of the regex are treated as literals.

The only benefit of \Q...\E is that it is easier to read than \.\.\..

Warning

Though Java 4 and 5 support this feature, you should not use it. Bugs in the implementation cause regular expressions with \Q...\E to match different things from what you intended, and from what PCRE, Perl, or Java 6 would match. These bugs were fixed in Java 6, making it behave the same way as PCRE and Perl.

Case-insensitive matching

By default, regular expressions are case sensitive. regex matches regex but not Regex, REGEX, or ReGeX. To make regex match all of those, you need to turn on case insensitivity.

In most applications, that’s a simple matter of marking or clearing a checkbox. All programming languages discussed in the next chapter have a flag or property that you can set to make your regex case insensitive. Recipe 3.4 in the next chapter explains how to apply the regex options listed with each regular expression solution in this book in your source code.

ascii
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
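
In Python, for example, the case-insensitive option is the re.IGNORECASE flag. The sketch below is a small illustration with made-up variable names, not the only way to set the option.

```python
import re

# Without any option, matching is case sensitive.
sensitive = re.compile("ascii")

# The same regex with the case-insensitive option turned on.
insensitive = re.compile("ascii", re.IGNORECASE)

subjects = ["ascii", "ASCII", "Ascii"]
sensitive_hits = [s for s in subjects if sensitive.fullmatch(s)]
insensitive_hits = [s for s in subjects if insensitive.fullmatch(s)]

print(sensitive_hits)
print(insensitive_hits)
```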

If you cannot turn on case insensitivity outside the regex, you can do so within by using the (?i) mode modifier, such as (?i)regex. This works with the .NET, Java, PCRE, Perl, Python, and Ruby flavors. It works with JavaScript when using the XRegExp library.

(?i)ascii
Regex options: None
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
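
The same check can be done in Python with the mode modifier inside the regex and no flags passed to the function. This is a sketch with illustrative names. The modifier is placed at the very start of the regex, which is where Python expects it.

```python
import re

# No flags argument: case insensitivity comes from (?i) inside the regex.
pattern = "(?i)ascii"

subjects = ["ascii", "ASCII", "Ascii"]
hits = [s for s in subjects if re.fullmatch(pattern, s)]

# The modifier is reflected in the flags of the compiled regex.
has_ignorecase = bool(re.compile(pattern).flags & re.IGNORECASE)

print(hits, has_ignorecase)
```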

.NET, Java, PCRE, Perl, and Ruby support local mode modifiers, which affect only part of the regular expression. sensitive(?i)caseless(?-i)sensitive matches sensitiveCASELESSsensitive but not SENSITIVEcaselessSENSITIVE. (?i) turns on case insensitivity for the remainder of the regex, and (?-i) turns it off for the remainder of the regex. They act as toggle switches.

Recipe 2.9 shows how to use local mode modifiers with groups instead of toggles.
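
Python is not in the list of flavors that support the toggle syntax, and recent Python versions reject (?i) anywhere but at the start of the regex. As a rough sketch of the group-based form, Python 3.6 and later accept (?i:...) to limit case insensitivity to one group. The variable names below are illustrative.

```python
import re

# Only the text inside (?i:...) is matched case insensitively.
pattern = r"sensitive(?i:caseless)sensitive"

# The middle part may be in any case...
middle_any_case = re.fullmatch(pattern, "sensitiveCASELESSsensitive") is not None

# ...but the parts outside the group stay case sensitive.
outer_any_case = re.fullmatch(pattern, "SENSITIVEcaselessSENSITIVE") is not None

print(middle_any_case, outer_any_case)
```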

See Also

Recipe 2.3 explains character classes. The metacharacters inside character classes are different from those outside character classes.

Recipe 5.14 demonstrates how to use a regular expression to escape all metacharacters in a string. Doing so converts the string into a regular expression that matches the string literally.

The “Example JavaScript solution” section in Recipe 5.2 shows some sample JavaScript code for escaping all regex metacharacters. Some programming languages have a built-in command for this.
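
Python is one language with such a built-in command: re.escape() returns a copy of a string with regex metacharacters escaped. The sketch below uses illustrative variable names. The exact set of characters that re.escape() puts a backslash in front of varies between Python versions, so the code checks only that the result matches the original text literally.

```python
import re

# Literal text containing all twelve metacharacters.
text = "$()*+.?[\\^{|"

# Let the library build a regex that matches the text literally.
pattern = re.escape(text)
escaped_matches = re.fullmatch(pattern, text) is not None

# The unescaped text is not even a valid regex in Python.
try:
    re.compile(text)
    raw_text_compiles = True
except re.error:
    raw_text_compiles = False

print(escaped_matches, raw_text_compiles)
```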
