Create a regular expression to exactly match this
gloriously contrived sentence: The punctuation characters in the ASCII table are:
!"#$%&'()*+,-./:;<=>?@[]^_`{|}~
.
This is intended to show which characters have special meaning in regular expressions, and which characters always match themselves literally.
This regular expression matches the sentence stated in the problem:
The●punctuation●characters●in●the●ASCII●table●are:●↵ !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Regex options: None |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
Any regular expression that does not include any of the dozen
characters $()*+.?[^{|
simply
matches itself. To find whether Mary had a little lamb
in the text you’re
editing, simply search for ‹Mary●had●a●little●lamb
›. It doesn’t matter whether the
“regular expression” checkbox is turned on in your text editor.
The 12 punctuation characters that make regular expressions work
their magic are called metacharacters. If you want your
regex to match them literally, you need to escape them by placing a backslash
in front of them. Thus, the regex: ‹$()*+.?[\^{|
› matches the text
$()*+.?[^{|
.
Notably absent from the list are the closing square bracket
]
, the hyphen -
, and the closing curly bracket }
. The first two become metacharacters only
after an unescaped [
, and the }
only after an unescaped {
. There’s no need to
ever escape }
. Metacharacter rules
for the blocks that appear between [
and ]
are explained in Recipe 2.3.
Escaping any other nonalphanumeric character does not change how your regular expression works—at least not when working with any of the flavors discussed in this book. Escaping an alphanumeric character may give it a special meaning or throw a syntax error.
People new to regular expressions often escape every punctuation character in sight. Don’t let anyone know you’re a newbie. Escape judiciously. A jungle of needless backslashes makes regular expressions hard to read, particularly when all those backslashes have to be doubled up to quote the regex as a literal string in source code.
We can make our solution easier to read when using a regex flavor that supports a feature called block escape:
The●punctuation●characters●in●the●ASCII●table●are:●↵ Q!"#$%&'()*+,-./:;<=>?@[]^_`{|}~E
Regex options: None |
Regex flavors: Java 6, PCRE, Perl |
Perl, PCRE and Java support the regex tokens ‹Q
› and
‹E
›.
‹Q
›
suppresses the meaning of all metacharacters, including the backslash,
until ‹E
›. If
you omit ‹E
›, all
characters after the ‹Q
› until
the end of the regex are treated as literals.
The only benefit of ‹Q...E
› is that it is easier to read than
‹...
›.
Though Java 4 and 5 support this feature, you should not use
it. Bugs in the implementation cause regular expressions with
‹Q⋯E
› to match different things from
what you intended, and from what PCRE, Perl, or Java 6 would match.
These bugs were fixed in Java 6, making it behave the same way as
PCRE and Perl.
By default, regular expressions are case sensitive.
‹regex
› matches
regex
but not
Regex
, REGEX
, or ReGeX
. To make ‹regex
› match all of those, you
need to turn on case insensitivity.
In most applications, that’s a simple matter of marking or clearing a checkbox. All programming languages discussed in the next chapter have a flag or property that you can set to make your regex case insensitive. Recipe 3.4 in the next chapter explains how to apply the regex options listed with each regular expression solution in this book in your source code.
ascii
Regex options: Case insensitive |
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby |
If you cannot turn on case insensitivity outside the regex, you
can do so within by using the ‹(?i)
›
mode modifier, such as ‹(?i)regex
›. This works with the .NET, Java,
PCRE, Perl, Python, and Ruby flavors. It works with JavaScript when
using the XRegExp library.
(?i)ascii
Regex options: None |
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby |
.NET, Java, PCRE, Perl, and Ruby support local mode modifiers,
which affect only part of the regular expression. ‹sensitive(?i)caseless(?-i)sensitive
› matches
sensitiveCASELESSsensitive
but not
SENSITIVEcaselessSENSITIVE
. ‹(?i)
›
turns on case insensitivity for
the remainder of the regex, and ‹(?-i)
›
turns it off for the remainder of the regex. They act as toggle
switches.
Recipe 2.9 shows how to use local mode modifiers with groups instead of toggles.
Recipe 2.3 explains character classes. The metacharacters inside character classes are different from those outside character classes.
Recipe 5.14 demonstrates how to use a regular expression to escape all metacharacters in a string. Doing so converts the string into a regular expression that matches the string literally.
Example JavaScript solution in Recipe 5.2 shows some sample JavaScript code for escaping all regex metacharacters. Some programming languages have a built-in command for this.