15.2. Special Symbols and Characters for REs

We will now introduce the most popular of the metacharacters, special characters and symbols, which give regular expressions their power and flexibility. You will find the most common of these symbols and characters in Table 15.1.

Table 15.1. Common Regular Expression Symbols and Special Characters
Notation      Example RE                Description

Symbols
re_string     foo                       match literal string value re_string
re1|re2       foo|bar                   match regular expression re1 or re2
.             ::.+::                    match any character (except NEWLINE)
^             ^Dear                     match start of string
$             /bin/\w*sh$               match end of string
*             [A-Za-z]\w*               match 0 or more occurrences of preceding RE
+             \d+\.|\.\d+               match 1 or more occurrences of preceding RE
?             goo?                      match 0 or 1 occurrence(s) of preceding RE
{N}           \d{3}                     match N occurrences of preceding RE
{M,N}         \d{5,9}                   match from M to N occurrences of preceding RE
[…]           [aeiou]                   match any single character from character class
[..x-y..]     [0-9], [A-Za-z]           match any single character in the range from x to y
[^…]          [^aeiou], [^A-Za-z0-9_]   do not match any character from character class, including any ranges, if present
(*|+|?|{})?   .*?\w                     apply non-greedy versions of above occurrence/repetition symbols ( *, +, ?, {} )
(…)           (\d{3})?, f(oo|u)bar      match enclosed RE and save as subgroup

Special Characters
\d            data\d+.txt               match any decimal digit, same as [0-9] (\D is inverse of \d: do not match any numeric digit)
\w            [A-Za-z_]\w+              match any alphanumeric character, same as [A-Za-z0-9_] (\W is inverse of \w)
\s            of\sthe                   match any whitespace character, same as [ \n\t\r\v\f] (\S is inverse of \s)
\b            \bThe\b                   match any word boundary (\B is inverse of \b)
\nn           price: \16                match saved subgroup nn (see (…) above)
\c            \., \\, \*                match any special character c verbatim (i.e., without its special meaning, literal)
\A (\Z)       \ADear                    match start (end) of string (also see ^ and $ above)

Matching more than one RE pattern with alternation ( | )

The pipe symbol ( | ), a vertical bar on your keyboard, indicates an alternation operation, meaning that it is used to choose from one of the different regular expressions which are separated by the pipe symbol. For example, below are some patterns which employ alternation, along with the strings they match:

RE Pattern      Strings Matched
at|home         at, home
r2d2|c3po       r2d2, c3po
bat|bet|bit     bat, bet, bit

With this one symbol, we have just increased the flexibility of our regular expressions, enabling the matching of more than just one string. Alternation is also sometimes called union or logical OR.
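To see alternation at work, here is a minimal sketch using Python's re module. It assumes Python 3 (for re.fullmatch()), and the list of candidate strings is our own, chosen only for illustration.

```python
import re

# One pattern, three alternatives separated by the pipe symbol.
pattern = re.compile(r"bat|bet|bit")

# Hypothetical sample strings; "but" is not one of the alternatives.
candidates = ["bat", "bet", "bit", "but"]

# fullmatch() succeeds only if the whole string equals one alternative.
matched = [s for s in candidates if pattern.fullmatch(s)]
print(matched)  # ['bat', 'bet', 'bit']
```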

Matching any single character ( . )

The dot or period ( . ) symbol matches any single character except for NEWLINE. (Python REs have a compilation flag, S or DOTALL, which can override this to include NEWLINEs.) Whether letter, number, whitespace (not including “\n”), printable, non-printable, or a symbol, the dot can match them all.

RE Pattern    Strings Matched
f.o           any character between “f” and “o,” e.g., fao, f9o, f#o, etc.
..            any pair of characters
.end          any character before the string end

Q1:“What if I want to match the dot or period character?”
A1: In order to specify a dot character explicitly, you must escape its functionality with a backslash, as in “\.”
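The behavior of the dot, the DOTALL flag, and the escaped dot can be checked with a short sketch like the following. It assumes Python 3; the test strings are hypothetical.

```python
import re

# "f.o": exactly one character (anything but a newline) between "f" and "o".
samples = ["fao", "f9o", "f#o", "fo", "f\no"]
dot_hits = [s for s in samples if re.fullmatch(r"f.o", s)]

# With re.DOTALL (also spelled re.S) the dot matches a newline as well.
dotall_hit = re.fullmatch(r"f.o", "f\no", re.DOTALL) is not None

# An escaped dot matches only a literal period.
literal_hits = [s for s in ["3.14", "3x14"] if re.fullmatch(r"3\.14", s)]

print(dot_hits)      # ['fao', 'f9o', 'f#o']
print(dotall_hit)    # True
print(literal_hits)  # ['3.14']
```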

Matching from the beginning or end of strings or word boundaries ( ^ / $ / \b / \B )

There are also symbols and related special characters to specify searching for patterns at the beginning and ending of strings. To match a pattern starting from the beginning, you must use the caret symbol ( ^ ) or the special character \A (backslash-capital “A”). The latter is primarily for keyboards which do not have the caret symbol, i.e., international. Similarly, the dollar sign ( $ ) or \Z will match a pattern from the end of a string.

Patterns which use these symbols differ from most of the others we describe in this chapter since they dictate location or position. In the Core Note above, we noted that a distinction is made between “matching,” attempting matches of entire strings starting at the beginning, and “searching,” attempting matches from anywhere within a string. Because we are looking specifically at symbols and special characters which deal with position, they make sense only when applied to searching.

That said, here are some examples of “edge-bound” RE search patterns:

RE Pattern       Strings Matched
^From            any string which starts with From
/bin/tcsh$       any string which ends with /bin/tcsh
^Subject: hi$    any string consisting solely of the string Subject: hi

Again, if you want to match either (or both) of these characters verbatim, you must use an escaping backslash. For example, if you wanted to match any string which ended with a dollar sign, one possible RE solution would be the pattern “.*\$$”.
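As a rough illustration of the edge-bound patterns, the sketch below runs them with re.search() over a handful of made-up lines (the sample data is our assumption, not from the text). It assumes Python 3.

```python
import re

# Hypothetical input lines, none of which ends with a newline.
lines = ["From: alice", "Reply From: bob", "/bin/tcsh", "/bin/tcsh -l", "cost: 5$"]

# ^ anchors the pattern to the start of the string.
starts_with_from = [s for s in lines if re.search(r"^From", s)]

# $ anchors the pattern to the end of the string.
ends_with_tcsh = [s for s in lines if re.search(r"/bin/tcsh$", s)]

# \$ is a literal dollar sign; the final $ is the end-of-string anchor.
ends_with_dollar = [s for s in lines if re.search(r".*\$$", s)]

# \A anchors to the start of the string, like ^ does here.
a_anchor = re.search(r"\AFrom", "From: alice") is not None

print(starts_with_from)  # ['From: alice']
print(ends_with_tcsh)    # ['/bin/tcsh']
print(ends_with_dollar)  # ['cost: 5$']
```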

The \b and \B special characters will match the empty string, meaning that they can start performing the match anywhere. The difference is that \b will match a pattern to a word boundary, meaning that a pattern must be at the beginning of a word, whether there are any characters in front of it (word in the middle of a string) or not (word at the beginning of a line). And likewise, \B will match a pattern only if it appears starting in the middle of a word (i.e., not at a word boundary). Here are some examples:

RE Pattern    Strings Matched
the           any string containing the
\bthe         any word which starts with the
\bthe\b       matches only the word the
\Bthe         any string which contains but does not begin with the
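The four word-boundary patterns can be compared side by side with a small sketch such as this one. The word list is our own choice, and the code assumes Python 3.

```python
import re

# Hypothetical words, each containing the letters "the" somewhere.
words = ["the", "then", "bathe", "other"]

def hits(pattern):
    """Return the words in which the pattern is found by re.search()."""
    return [w for w in words if re.search(pattern, w)]

anywhere = hits(r"the")        # "the" may appear at any position
word_start = hits(r"\bthe")    # "the" must begin a word
whole_word = hits(r"\bthe\b")  # "the" must be a complete word
inside = hits(r"\Bthe")        # "the" must NOT begin a word

print(anywhere)    # ['the', 'then', 'bathe', 'other']
print(word_start)  # ['the', 'then']
print(whole_word)  # ['the']
print(inside)      # ['bathe', 'other']
```

Raw strings (the r prefix) are used so that Python does not read “\b” as a backspace before the pattern reaches the re module.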

Creating character classes ( [ ] )

While the dot is good for allowing matches of any symbols, there may be occasions where there are specific characters you want to match. For this reason, the bracket symbols ( [ ] ) were invented. The regular expression will match from any of the enclosed characters. Here are some examples:

RE Pattern           Strings Matched
b[aeiu]t             bat, bet, bit, but
[cr][23][dp][o2]     “r” or “c” then “2” or “3” followed by “d” or “p” and finally, either “o” or “2,” e.g., c2do, r3p2, r2d2, c3po, etc.

One side note regarding the RE “[cr][23][dp][o2]”—a more restrictive version of this RE would be required to allow only “r2d2” or “c3po” as valid strings. Because brackets merely imply “logical OR” functionality, it is not possible to use brackets to enforce such a requirement. The only solution is to use the pipe, as in “r2d2|c3po”.

For single character REs, though, the pipe and brackets are equivalent. For example, let's start with the regular expression “ab” which matches only the string with an “a” followed by a “b.” If we wanted either one-letter string, i.e., either “a” or “b,” we could use the RE “[ab]”. Because “a” and “b” are individual strings, we can also choose the RE “a|b”. However, if we wanted to match either the string “ab” or the string “cd,” we cannot use the brackets because they work only for single characters. In this case, the only solution is “ab|cd,” similar to the “r2d2|c3po” problem just mentioned.
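The difference between the bracket version and the pipe version is easy to verify. The following is a minimal sketch, assuming Python 3; the candidate strings are taken from the examples above plus a few of our own.

```python
import re

names = ["r2d2", "c3po", "c2do", "r3p2"]

# Brackets choose independently at each position, so mixed names slip through.
loose = [s for s in names if re.fullmatch(r"[cr][23][dp][o2]", s)]

# The pipe chooses between whole alternatives.
strict = [s for s in names if re.fullmatch(r"r2d2|c3po", s)]

# For single characters, [ab] and a|b accept exactly the same strings.
singles = ["a", "b", "ab", "c"]
same = all(
    bool(re.fullmatch(r"[ab]", s)) == bool(re.fullmatch(r"a|b", s))
    for s in singles
)

print(loose)   # ['r2d2', 'c3po', 'c2do', 'r3p2']
print(strict)  # ['r2d2', 'c3po']
print(same)    # True
```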

Denoting ranges ( - ) and negation ( ^ )

In addition to single characters, the brackets also support ranges of characters. A hyphen between a pair of symbols enclosed in brackets is used to indicate a range of characters, e.g., A-Z, a-z, or 0-9 for uppercase letters, lowercase letters, and numeric digits, respectively. This is a lexicographic range, so you are not restricted to using just alphanumeric characters. Additionally, if a caret ( ^ ) is the first character immediately inside the open left bracket, this symbolizes a directive to not match any of the characters in the given character set.

RE Pattern           Strings Matched
z.[0-9]              “z” followed by any character then followed by a single digit
[r-u][env-y][us]     “r,” “s,” “t,” or “u” followed by “e,” “n,” “v,” “w,” “x,” or “y” followed by “u” or “s”
[^aeiou]*            zero or more (* symbol introduced in next subsection) non-vowels (EXERCISE: Why do we say “non-vowels” rather than “consonants?”)
[^\t\n]+             one or more (+ symbol introduced in next subsection) characters up to, but not including, the first TAB or NEWLINE encountered
["-a]                in an ASCII system, all characters which fall between “"” and “a,” i.e., between ordinals 34 and 97.
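A few of these range and negation patterns are sketched below. The input strings are hypothetical, and the code assumes Python 3.

```python
import re

# "z", any one character, then a single digit.
z_hits = [s for s in ["za5", "z-0", "z5", "zab"] if re.fullmatch(r"z.[0-9]", s)]

# Ranges and single characters can be mixed inside one pair of brackets.
mixed_hits = [s for s in ["sys", "tes", "run"] if re.fullmatch(r"[r-u][env-y][us]", s)]

# Negated class: consume everything up to the first TAB or NEWLINE.
up_to_tab = re.match(r"[^\t\n]+", "name\tvalue\n").group()

# A range is defined by character ordinals: '"' is 34 and 'a' is 97.
in_range = [c for c in ["!", '"', "A", "a", "b"] if re.fullmatch(r'["-a]', c)]

print(z_hits)      # ['za5', 'z-0']
print(mixed_hits)  # ['sys', 'tes']
print(up_to_tab)   # name
print(in_range)    # ['"', 'A', 'a']
```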

Multiple occurrence/repetition using closure operators ( *, +, ?, { } )

We will now introduce the most common RE notations, namely, the special symbols *, +, and ?, all of which can be used to match single, multiple, or no occurrences of string patterns. The asterisk or star operator ( * ) will match zero or more occurrences of the RE immediately to its left (in language and compiler theory, this operation is known as the Kleene Closure). The plus operator ( + ) will match one or more occurrences of an RE (known as Positive Closure), and the question mark operator ( ? ) will match exactly 0 or 1 occurrences of an RE.

There are also brace operators ( { } ) with either a single value or a comma-separated pair of values. These indicate a match of exactly N occurrences (for {N}) or a range of occurrences, i.e., {M,N} will match from M to N occurrences. These symbols may also be escaped with the backslash, i.e., “\*” matches the asterisk, etc.

Finally, the question mark ( ? ) is overloaded so that if it follows any of the repetition symbols above ( *, +, ?, { } ), it will direct the regular expression engine to match as few repetitions as possible.

Here are some examples using the closure operators:

RE Pattern                        Strings Matched
[dn]ot?                           “d” or “n,” followed by an “o” and, at most, one “t” after that, i.e., do, no, dot, not
0?[1-9]                           any numeric digit, possibly prepended with a “0,” e.g., the set of numeric representations of the months January to September, whether single- or double-digits
[0-9]{15,16}                      fifteen or sixteen digits, e.g., credit card numbers
</?[^>]+>                         strings which match all valid (and invalid) HTML tags
[KQRBNP][a-h][1-8]-[a-h][1-8]     Legal chess move in “long algebraic” notation (move only, no capture, check, etc.), i.e., strings which start with any of “K,” “Q,” “R,” “B,” “N,” or “P” followed by a hyphenated-pair of chess board grid locations from “a1” to “h8” (and everything in between), with the first coordinate indicating the former position and the second being the new position.
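The closure operators, including the non-greedy form, can be tried out with a sketch along these lines. It assumes Python 3, and the sample strings (including the HTML fragment) are made up for illustration.

```python
import re

# "?" makes the trailing "t" optional, but allows at most one.
dn_hits = [s for s in ["do", "no", "dot", "not", "dott"] if re.fullmatch(r"[dn]ot?", s)]

# {15,16} accepts fifteen or sixteen digits, no fewer and no more.
digit_counts = [n for n in (14, 15, 16, 17) if re.fullmatch(r"[0-9]{15,16}", "4" * n)]

html = "<b>bold</b>"
tags = re.findall(r"</?[^>]+>", html)

# Greedy "+" grabs as much as it can; "+?" stops at the first possible ">".
greedy = re.search(r"<.+>", html).group()
lazy = re.search(r"<.+?>", html).group()

chess_ok = re.fullmatch(r"[KQRBNP][a-h][1-8]-[a-h][1-8]", "Pe2-e4") is not None

print(dn_hits)       # ['do', 'no', 'dot', 'not']
print(digit_counts)  # [15, 16]
print(tags)          # ['<b>', '</b>']
print(greedy)        # <b>bold</b>
print(lazy)          # <b>
```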

Special characters representing character sets

We also mentioned that there are special characters which may represent character sets. Rather than using a range of “0-9,” you may simply use “\d” to indicate the match of any decimal digit. Another special character, “\w,” can be used to denote the entire alphanumeric character class, serving as a shortcut for “A-Za-z0-9_,” and “\s” stands for whitespace characters. Uppercase versions of these strings symbolize a non-match, i.e., “\D” matches any non-digit character (same as “[^0-9]”), etc.

Using these shortcuts, we will present a few more complex examples:

RE Pattern            Strings Matched
\w+-\d+               alphanumeric string and number separated by a hyphen
[A-Za-z]\w*           alphabetic first character, additional characters (if present) can be alphanumeric (almost equivalent to the set of valid Python identifiers [see exercises])
\d{3}-\d{3}-\d{4}     (American) telephone numbers with an area code prefix, as in 800-555-1212
\w+@\w+\.com          simple e-mail addresses of the form XXX@YYY.com

Note that all special characters, including all the ones mentioned before such as “\A,” “\B,” “\d,” etc., may or may not coincide with ASCII escape sequences in ordinary strings (“\b,” for example, is also the ASCII backspace character). To be sure you are using the regular expression versions, it would be a safe bet to use raw strings, which leave backslashes uninterpreted (see the Core Note later in this chapter).

Also, the “\w” and “\W” alphanumeric character sets are affected by the L or LOCALE compilation flag and, in Python 1.6 and newer, by Unicode flags.
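The raw-string advice and the effect of Unicode handling can both be observed directly. The sketch below assumes Python 3, where string patterns are Unicode-aware by default and the re.ASCII flag restricts “\w” to [A-Za-z0-9_]; the sample strings are our own.

```python
import re

plain = "\bthe\b"   # ordinary string: backspace, "the", backspace
raw = r"\bthe\b"    # raw string: the backslashes reach the re module intact

plain_len = len(plain)  # 5 characters
raw_len = len(raw)      # 7 characters

# Only the raw version is interpreted as word boundaries around "the".
plain_found = re.search(plain, "in the end") is not None
raw_found = re.search(raw, "in the end") is not None

phone_ok = re.fullmatch(r"\d{3}-\d{3}-\d{4}", "800-555-1212") is not None

# "caf\u00e9" ends in a non-ASCII letter (e with acute accent).
unicode_word = re.fullmatch(r"\w+", "caf\u00e9") is not None
ascii_word = re.fullmatch(r"\w+", "caf\u00e9", re.ASCII) is not None

print(plain_found, raw_found)    # False True
print(unicode_word, ascii_word)  # True False
```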

Designating groups with parentheses ( ( ) )

Now, perhaps we have achieved the goal of matching a string and discarding non-matches, but in some cases, we may also be more interested in the data that we did match. Not only do we want to know whether the entire string matched our criteria, but whether we can also extract any specific strings or substrings which were part of a successful match. The answer is yes. To accomplish this, surround any RE with a pair of parentheses.

A pair of parentheses ( ( ) ) can accomplish either (or both) of the below when used with regular expressions:

  1. grouping regular expressions

  2. matching subgroups

One good example for wanting to group regular expressions is when you have two different REs with which you want to compare a string. Another reason is to group an RE in order to use a repetition operator on the entire RE (as opposed to individual characters or character classes).

One side-effect of using parentheses is that the substring which matched the pattern is saved for future use. These subgroups can be recalled for the same match or search, or extracted for post-processing. Why are matches of subgroups important? The main reason is that there are times where you want to extract the patterns you match, in addition to making a match.

For example, what if we decided to match the pattern “\w+-\d+” but wanted to save the alphabetic first part and the numeric second part individually? This may be desired because with any successful match, we may want to see just what those strings were that matched our RE patterns. If we add parentheses to both subpatterns, i.e., “(\w+)-(\d+),” then we can access each of the matched subgroups individually. Subgrouping is preferred because the alternative is to write code to determine we have a match, then execute another separate routine (which we also had to create) to parse the entire match just to extract both parts. Why not let Python do it, since it is a supported feature of the re module, instead of “reinventing the wheel”?
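Here is a minimal sketch of subgroup access using the match object's group() and groups() methods, together with a backreference and a grouped repetition. It assumes Python 3; the input strings are hypothetical.

```python
import re

m = re.fullmatch(r"(\w+)-(\d+)", "abc-123")
whole = m.group(0)    # the entire match
first = m.group(1)    # text saved by the first pair of parentheses
second = m.group(2)   # text saved by the second pair
both = m.groups()     # all subgroups as a tuple

# \1 recalls subgroup 1 within the same pattern: find a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it is is here").group(1)

# Parentheses let "+" repeat the whole of "ab" rather than just the "b".
grouped = re.fullmatch(r"(ab)+", "ababab") is not None
ungrouped = re.fullmatch(r"ab+", "ababab") is not None

print(whole, first, second)  # abc-123 abc 123
print(both)                  # ('abc', '123')
print(doubled)               # is
print(grouped, ungrouped)    # True False
```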

RE Pattern                              Strings Matched
\d+(\.\d*)?                             strings representing simple floating point number, that is, any number of digits followed optionally by a single decimal point and zero or more numeric digits, as in “0.004,” “2,” “75.,” etc.
(Mr?s?\. )?[A-Z][a-z]* [ A-Za-z-]+      first name and last name, with a restricted first name (must start with uppercase; lowercase only for remaining letters, if any), the full name prepended by an optional title of “Mr.,” “Mrs.,” “Ms.,” or “M.,” and a flexible last name, allowing for multiple words, dashes, and uppercase letters
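Both patterns in this table can be exercised with a short sketch. It assumes Python 3, and the names and numbers used as input are invented for illustration.

```python
import re

float_re = re.compile(r"\d+(\.\d*)?")
numbers = ["0.004", "2", "75.", ".5", "1.2.3"]
float_hits = [s for s in numbers if float_re.fullmatch(s)]

name_re = re.compile(r"(Mr?s?\. )?[A-Z][a-z]* [ A-Za-z-]+")

with_title = name_re.fullmatch("Mrs. Jane Smith-Jones")
title = with_title.group(1)        # the optional title, trailing space included

no_title = name_re.fullmatch("Guido van Rossum")
missing = no_title.group(1)        # None: the optional group did not take part

lowercase = name_re.fullmatch("jane doe")  # first name must start uppercase

print(float_hits)     # ['0.004', '2', '75.']
print(repr(title))    # 'Mrs. '
print(missing)        # None
print(lowercase)      # None
```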
