A regular expression is a pattern. Some parts of the pattern match single characters in the string of a particular type. Other parts of the pattern match multiple characters. First, we’ll visit the single-character patterns, and then the multiple-character patterns.
The simplest and most common pattern-matching character in regular expressions is a single character that matches itself. In other words, putting a letter a in a regular expression requires a corresponding letter a in the string.
The next most common pattern-matching character is the dot (.). This character matches any single character except newline (\n). For example, the pattern /a./ matches any two-character sequence that starts with a and is not a\n.
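As a minimal sketch of the dot's behavior, the following tries /a./ against a few made-up strings. The hash name %dot_result and the sample strings are illustrative assumptions, not part of the original text.

```perl
use strict;
use warnings;

# Try /a./ against a few short strings; "\n" is the one character a dot refuses.
my %dot_result;
for my $s ("ab", "a!", "a\n", "a") {
    $dot_result{$s} = ($s =~ /a./) ? 1 : 0;
}

for my $s (sort keys %dot_result) {
    (my $shown = $s) =~ s/\n/\\n/g;    # display the newline as \n
    print "'$shown': ", ($dot_result{$s} ? "match" : "no match"), "\n";
}
```

Only "ab" and "a!" match: a lone "a" has nothing for the dot to consume, and "a\n" offers only a newline.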
A pattern-matching character class is represented by a pair of open and close square brackets and a list of characters between the brackets. One and only one of these characters must be present at the corresponding part of the string for the pattern to match. For example,
/[abcde]/
matches a string containing any one of the first five letters of the lowercase alphabet, while
/[aeiouAEIOU]/
matches any of the five vowels in either lower- or uppercase.
If you want to put a right bracket (]) in the list, put a backslash in front of it, or put it as the first character within the list. Ranges of characters (like a through z) can be abbreviated by showing the end points of the range separated by a dash (-); to get a literal dash in the list, precede the dash with a backslash or place it at the end.
Here are some other examples:
    [0123456789]    # match any single digit
    [0-9]           # same thing
    [0-9-]          # match 0-9, or minus
    [a-z0-9]        # match any single lowercase letter or digit
    [a-zA-Z0-9_]    # match any single letter, digit, or underscore
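The following sketch exercises a range with a trailing dash and the two ways of including a right bracket. The sample strings and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my $text = "phone: 555-0199";

# In list context, /[0-9-]/g returns every single-character match in order.
my @hits   = $text =~ /[0-9-]/g;
my $joined = join "", @hits;
print "$joined\n";    # prints 555-0199

# Two ways to include a right bracket in a class.
my $escaped = ("a]b" =~ /[\]x]/) ? 1 : 0;   # backslash in front of ]
my $leading = ("a]b" =~ /[]x]/)  ? 1 : 0;   # ] as the first character in the list
print "escaped=$escaped leading=$leading\n";    # prints escaped=1 leading=1
```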
There’s also a negated character class, which is the same as a character class, but has a leading up arrow (or caret: ^) immediately after the left bracket. This character class matches any single character that is not in the list. For example:
    [^0-9]          # match any single non-digit
    [^aeiouAEIOU]   # match any single non-vowel
    [^^]            # match single character except an up-arrow
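A short sketch of negated classes in use, assuming two made-up input strings: one collects the non-digits, the other deletes them.

```perl
use strict;
use warnings;

# Collect every character that is NOT a digit.
my @non_digits = "a1b2c3" =~ /[^0-9]/g;
print "@non_digits\n";       # prints a b c

# Keep only the digits by replacing each non-digit with nothing.
(my $digits_only = "tel: 555-0199") =~ s/[^0-9]//g;
print "$digits_only\n";      # prints 5550199
```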
For your convenience, some common character classes are predefined, as described in Table 7-1.
Table 7-1. Predefined Character Class Abbreviations
Construct | Equivalent Class | Negated Construct | Equivalent Negated Class |
---|---|---|---|
\d (a digit) | [0-9] | \D (digits, not!) | [^0-9] |
\w (word char) | [a-zA-Z0-9_] | \W (words, not!) | [^a-zA-Z0-9_] |
\s (space char) | [ \r\t\n\f] | \S (space, not!) | [^ \r\t\n\f] |
The \d pattern matches one digit. The \w pattern matches one word character, although the pattern is really matching any character that is legal in a Perl variable name. The \s pattern matches one space (whitespace), defined here as spaces, carriage returns, tabs, line feeds, and form feeds. The uppercase versions match the complements of these classes. Thus, \W matches one character that can’t be in an identifier, \S matches one character that is not whitespace (including letters, punctuation marks, control characters, etc.), and \D matches any single non-digit character.
These abbreviated classes can be used as part of other character classes as well:
    [\da-fA-F]      # match one hex digit
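The sketch below counts how many characters of a made-up line fall into each predefined class. The string and variable names are assumptions; note also that on newer Perls working with Unicode text, these abbreviations may match more than the ASCII equivalents listed in the table.

```perl
use strict;
use warnings;

my $line = "id_7 = 42\n";

# Assigning a /g match to an empty list, then to a scalar, counts the matches.
my $digits     = () = $line =~ /\d/g;   # 7, 4, 2
my $word_chars = () = $line =~ /\w/g;   # i, d, _, 7, 4, 2
my $spaces     = () = $line =~ /\s/g;   # two spaces and the newline
my $non_word   = () = $line =~ /\W/g;   # two spaces, "=", and the newline

print "digits=$digits word=$word_chars space=$spaces nonword=$non_word\n";
# prints digits=3 word=6 space=3 nonword=4

# The abbreviation also works inside a larger class.
my $is_hex = ("F" =~ /[\da-fA-F]/) ? 1 : 0;
print "is_hex=$is_hex\n";               # prints is_hex=1
```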
The true power of regular expressions comes into play when you can say “one or more of these” or “up to five of those.” Let’s talk about how these cases are handled.
The first (and probably most obvious) grouping pattern is
sequence. In
using this pattern, Perl matches abc
as an
a
followed by a b
followed by a
c.
This pattern seems simple, but we’re
giving it a name so we can talk about it later.
We’ve already seen the asterisk (*
) as
a grouping pattern. The asterisk indicates zero or more of the
immediately previous character (or character class).
Two other grouping patterns that work in the same manner are the plus sign (+), meaning one or more of the immediately previous character, and the question mark (?), meaning zero or one of the immediately previous character. For example, the regular expression /fo+ba?r/ matches an f followed by one or more o’s, followed by a b, followed by an optional a, followed by an r.
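To make the /fo+ba?r/ description concrete, here is a small sketch run against four made-up strings (the hash %fobar is a hypothetical name used only here).

```perl
use strict;
use warnings;

my %fobar;
for my $s ("fobr", "foooobar", "fbar", "fobaar") {
    $fobar{$s} = ($s =~ /fo+ba?r/) ? 1 : 0;
    print "$s: ", ($fobar{$s} ? "match" : "no match"), "\n";
}
```

"fobr" and "foooobar" match; "fbar" fails because + needs at least one o, and "fobaar" fails because ? allows at most one a.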
In all three of these grouping patterns, the patterns are greedy. If such a multiplier has a chance to match between five and ten characters, it’ll pick the ten-character string every time. For example,
    $_ = "fred xxxxxxxxxx barney";
    s/x+/boom/;
always replaces all consecutive x’s with boom (resulting in fred boom barney), rather than just one or two x’s, even though a shorter set of x’s would also match the same regular expression.
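A sketch of the greedy rule, assuming a run of exactly ten x’s built with the repetition operator. Parentheses are used here only to inspect what x+ grabbed; they are explained later in the chapter.

```perl
use strict;
use warnings;

my $text = "fred " . ("x" x 10) . " barney";

# Capture the run that x+ actually consumed.
my ($run) = $text =~ /(x+)/;
print length($run), "\n";      # prints 10

# Substitute on a copy so the original string is left alone.
(my $replaced = $text) =~ s/x+/boom/;
print "$replaced\n";           # prints fred boom barney
```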
If you need to say “five to ten” x’s, you could get
away with putting five x’s followed by five x’s each
immediately followed by a question mark. But this looks ugly.
Instead, an easier way exists: the general multiplier. The general multiplier consists of a pair of matching curly braces with one or two numbers inside, as in /x{5,10}/. The immediately preceding character (in this case, the letter x) must be found within the indicated number of repetitions (five through ten here).[48] If you leave off the second number, as in /x{5,}/, you indicate “that many or more” (five or more in this case), and if you leave off the comma, as in /x{5}/, you indicate “exactly this many” (five x’s). To get five or fewer x’s, you must put the zero in, as in /x{0,5}/.
So, the regular expression /a.{5}b/ matches the letter a separated from the letter b by any five non-newline characters at any point in the string. (Recall that a period matches any single non-newline character, and we’re matching five here.) The five characters do not need to be the same. (We’ll learn how to force them to be the same in the next section.)
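The following sketch shows how many x’s /x{5,10}/ consumes from runs of different lengths, and checks /a.{5}b/ against two made-up strings. The hash %taken and the inputs are illustrative assumptions.

```perl
use strict;
use warnings;

# For each run length, record how many x's /x{5,10}/ consumed (0 means no match).
my %taken;
for my $n (4, 5, 12) {
    my $s = "x" x $n;
    $taken{$n} = ($s =~ /(x{5,10})/) ? length($1) : 0;
    print "$n x's: took $taken{$n}\n";
}
# 4 gives 0, 5 gives 5, 12 gives 10: the pattern still matches inside a longer run

my $five_between = ("a12345b" =~ /a.{5}b/) ? 1 : 0;   # exactly five characters between
my $four_between = ("a1234b"  =~ /a.{5}b/) ? 1 : 0;   # only four, so no match
print "$five_between $four_between\n";                # prints 1 0
```

The twelve-x case illustrates the point of footnote [48]: without anchors, an upper bound limits what is consumed, not what the surrounding string may contain.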
We could dispense with *, +, and ? entirely, because they are completely equivalent to {0,}, {1,}, and {0,1}. But it’s easier to type the equivalent single punctuation character, and more familiar as well.
If two multipliers occur in a single expression, the greedy rule is augmented with leftmost is greediest. For example:
    $_ = "a xxx c xxxxxxxx c xxx d";
    /a.*c.*d/;
In this case, the first .* in the regular expression matches all characters up to the second c, even though matching only the characters up to the first c would still allow the entire regular expression to match. Right now, this distinction is not important (the pattern would match either way), but later when we can look at parts of the regular expression that matched, the distinction will matter quite a bit.
We can force any multiplier to be nongreedy (or lazy) by following it with a question mark:
    $_ = "a xxx c xxxxxxxx c xxx d";
    /a.*?c.*d/;
Here, the a.*?c matches the fewest characters between the a and c, not the most characters. This means the leftmost c is matched, not the rightmost. You can put such a question-mark modifier after any of the multipliers (?, +, *, and {m,n}).
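To see the difference between the greedy and lazy forms, this sketch wraps the first part of each pattern in parentheses (covered later in the chapter) so the matched text can be printed. The variable names are assumptions.

```perl
use strict;
use warnings;

my $s = "a xxx c xxxxxxxx c xxx d";

my ($greedy) = $s =~ /(a.*c).*d/;    # the first .* runs to the last usable c
my ($lazy)   = $s =~ /(a.*?c).*d/;   # .*? stops at the first c

print "greedy: [$greedy]\n";   # prints greedy: [a xxx c xxxxxxxx c]
print "lazy:   [$lazy]\n";     # prints lazy:   [a xxx c]
```

Both patterns match the same string overall; only the division of labor between the two multipliers changes.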
What if the string and regular expression were slightly altered, say, to:
    $_ = "a xxx ce xxxxxxxx ci xxx d";
    /a.*ce.*d/;
In this case, if the .* matches the most characters possible before the next c, the next regular expression character (e) doesn’t match the next character of the string (i). In this case, we get automatic backtracking. The multiplier is unwound and retried, stopping at someplace earlier (in this case, at the earlier c, next to the e).[49] A complex regular expression may involve many such levels of backtracking, leading to long execution times. In this case, consider that making that match lazy (with a trailing ?) will actually simplify the work that Perl has to perform.
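As a rough sketch, the greedy and lazy versions of this pattern end up at the same place; the greedy one simply gets there by backing off from the later c. The capture parentheses are added here only to show the result, and the sketch does not measure how much work either version does.

```perl
use strict;
use warnings;

my $s = "a xxx ce xxxxxxxx ci xxx d";

# Greedy: .* overshoots to the later c, fails on "i", and backtracks.
my ($via_backtracking) = $s =~ /(a.*ce).*d/;

# Lazy: .*? grows one character at a time and reaches "ce" first.
my ($via_lazy) = $s =~ /(a.*?ce).*d/;

print "[$via_backtracking] [$via_lazy]\n";   # prints [a xxx ce] [a xxx ce]
```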
Another grouping operator is a pair of open and close parentheses around any part of a pattern. This operator doesn’t change whether the pattern matches, but instead causes the part of the string matched by the pattern to be remembered, so that it may be referenced later. So, for example, (a) still matches an a, and ([a-z]) still matches any single lowercase letter.
To recall a memorized part of a string, you must include a backslash followed by an integer. This pattern construct represents the same sequence of characters matched earlier in the same-numbered pair of parentheses (counting from one). For example:
    /fred(.)barney\1/;
matches a string consisting of fred, followed by any single non-newline character, followed by barney, followed by that same single character. So, the pattern matches fredxbarneyx, but not fredxbarneyy. Compare that pattern with:
    /fred.barney./;
in which the two unspecified characters can be the same, or different.
Where did the \1 come from? The \1 indicates the first parenthesized part of the regular expression. If there’s more than one, the second part (counting the left parentheses from left to right) is referenced as \2, the third as \3, and so on. For example:
    /a(.)b(.)c\2d\1/;
matches an a, a character (call it #1), a b, another character (call it #2), a c, the character #2, a d, and the character #1. So, the pattern matches axbycydx, for example.
The referenced part can be more than a single character. For example,
    /a(.*)b\1c/;
matches an a, followed by any number of characters (even zero), followed by b, followed by that same sequence of characters, followed by c. So, the pattern would match aFREDbFREDc, or even abc, but not aXXbXXXc.
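The backreference examples above can be checked with a short sketch like this one. The patterns are kept in single-quoted strings so the backslashes survive, and the table layout (@tests, %matched) is an assumption made for illustration.

```perl
use strict;
use warnings;

# Each entry pairs a pattern (single-quoted, so \1 stays intact) with a test string.
my @tests = (
    [ 'fred(.)barney\1', 'fredxbarneyx' ],
    [ 'fred(.)barney\1', 'fredxbarneyy' ],
    [ 'a(.)b(.)c\2d\1',  'axbycydx'     ],
    [ 'a(.*)b\1c',       'aFREDbFREDc'  ],
    [ 'a(.*)b\1c',       'abc'          ],
    [ 'a(.*)b\1c',       'aXXbXXXc'     ],
);

my %matched;
for my $t (@tests) {
    my ($pattern, $string) = @$t;
    $matched{$string} = ($string =~ /$pattern/) ? 1 : 0;
    print "$string: ", ($matched{$string} ? "match" : "no match"), "\n";
}
```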
Another grouping construct is alternation, as in a|b|c. This construct matches exactly one of the alternatives (a or b or c, in this case). This construct works even if the alternatives have multiple characters, as in /song|blue/, which matches either song or blue. (For single-character alternatives, you’re definitely better off with a character class like /[abc]/.)
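A brief sketch of alternation against three made-up strings, plus a check that a single-character alternation and the equivalent character class agree on one sample input.

```perl
use strict;
use warnings;

my %alt;
for my $s ("birdsong", "blues", "sang") {
    $alt{$s} = ($s =~ /song|blue/) ? 1 : 0;
    print "$s: ", ($alt{$s} ? "match" : "no match"), "\n";
}

# For single characters, a|b|c and [abc] accept the same strings.
my $by_alternation = ("xbz" =~ /a|b|c/) ? 1 : 0;
my $by_class       = ("xbz" =~ /[abc]/) ? 1 : 0;
print "$by_alternation $by_class\n";    # prints 1 1
```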
What if we wanted to match songbird or bluebird? We could write /songbird|bluebird/, but that bird part shouldn’t have to be in there twice. In fact, there’s a way out, but we have to talk about the precedence of grouping patterns, which is covered later in Section 7.3.4.
Several special notations anchor a pattern. Normally, when a pattern is matched against the string, the beginning of the pattern is dragged through the string from left to right, matching at the first possible opportunity. Anchors allow you to ensure that parts of the pattern line up with particular parts of the string.
The first pair of anchors requires that a particular part of the match be located either at a word boundary or not at a word boundary. The \b anchor requires a word boundary at the indicated point for the pattern to match. A word boundary is the place between characters that match \w and \W, or between characters matching \w and the beginning or ending of the string. Note that this description has little to do with English words and a lot more to do with C symbols, but that’s as close as we get. For example:
    /fred\b/;      # matches fred, but not Frederick
    /\bmo/;        # matches moe and mole, but not Elmo
    /\bFred\b/;    # matches Fred but not Frederick or alFred
    /\b\+\b/;      # matches "x+y" but not "++" or " + "
    /abc\bdef/;    # never matches (impossible for a boundary there)
Likewise, \B requires that there not be a word boundary at the indicated point. For example:
    /Fred\B/;      # matches "Frederick" but not "Fred Flintstone"
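The word-boundary examples can be tried out with a sketch along these lines. The list of checks and the lowercase test string frederick are assumptions added for illustration.

```perl
use strict;
use warnings;

# Each entry is a compiled pattern and a string to try it on.
my @checks = (
    [ qr/fred\b/, "fred"            ],
    [ qr/fred\b/, "frederick"       ],
    [ qr/\bmo/,   "mole"            ],
    [ qr/\bmo/,   "Elmo"            ],
    [ qr/\b\+\b/, "x+y"             ],
    [ qr/\b\+\b/, "++"              ],
    [ qr/Fred\B/, "Frederick"       ],
    [ qr/Fred\B/, "Fred Flintstone" ],
);

my @results;
for my $check (@checks) {
    my ($re, $string) = @$check;
    push @results, ($string =~ $re) ? 1 : 0;
    print "$string: ", ($results[-1] ? "match" : "no match"), "\n";
}
```

The results alternate between match and no match, in the order the pairs are listed.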
Two more anchors require that a particular part of the pattern be next to an end of the string. The caret (^) matches the beginning of the string. For example, ^a matches an a if, and only if, the a is the first character of the string. Note, however, that in Perl 5 a caret elsewhere in the pattern, as in a^, is still an anchor rather than a literal character, so that pattern cannot match an a followed by ^. If you need a literal caret anywhere in a pattern, put a backslash in front of it.
The $, like the ^, anchors the pattern, but to the end of the string, not the beginning. In other words, c$ matches a c only if it occurs at the end of the string.[50] A dollar sign anywhere else in the pattern is probably going to be interpreted as scalar variable interpolation, so you’ll most likely need to backslash it to match a literal dollar sign in the string.
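Here is a minimal sketch of the two string anchors and of backslashing the special characters to match them literally. All strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $starts_with_a = ("apple"  =~ /^a/) ? 1 : 0;   # a is the first character
my $a_inside      = ("banana" =~ /^a/) ? 1 : 0;   # a occurs, but not first

my $ends_with_c   = ("music"   =~ /c$/) ? 1 : 0;  # c is the last character
my $before_nl     = ("music\n" =~ /c$/) ? 1 : 0;  # also allowed, per footnote [50]

# Backslashed, both characters are ordinary literals.
my $literal_dollar = ('cost: $5' =~ /\$5/) ? 1 : 0;
my $literal_caret  = ("2^8"      =~ /\^/)  ? 1 : 0;

print "$starts_with_a $a_inside $ends_with_c $before_nl ",
      "$literal_dollar $literal_caret\n";         # prints 1 0 1 1 1 1
```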
Other anchors are supported, including \A, \Z, and lookahead anchors created via (?=...) and (?!...). These anchors are described fully in Chapter 2 of Programming Perl and the perlre documentation.
So what happens when we get a|b* together? Is this a or b any number of times, or is it either a single a or any number of b’s?
Well, just as operators have precedence, the grouping and anchoring patterns also have precedence. The precedence of patterns from highest to lowest is given in Table 7-2.
Table 7-2. regex Grouping Precedence [51]
Name | Representation |
---|---|
Parentheses | ( ) (?: ) |
Multipliers | ? + * {m,n} ?? +? *? {m,n}? |
Sequence and anchoring | abc ^ $ \b \B \A \Z (?= ) (?! ) |
Alternation | \| |
[51] Some of these symbols are not described in this book. See Programming Perl or perlre for details.
According to the table, * has a higher precedence than |. So /a|b*/ is interpreted as a single a, or any number of b’s.
What if we want the other meaning, as in “any number of a’s or b’s”? We simply throw in a pair of parentheses. In this case, we enclose the part of the expression that the * operator should apply to inside parentheses, and we are done, as (a|b)*. If you want to clarify the first expression, you can redundantly parenthesize it with a|(b*).
When you use parentheses to affect precedence they also trigger the memory, as shown earlier in this chapter. That is, this set of parentheses counts when you are figuring out whether something is \2, \3, or whatever. If you want to use parentheses without triggering memory, use the form (?:...) instead of (...). This form still allows for multipliers, but doesn’t cause you to throw off your counting by using up another $4 or whatever. For example, /(?:Fred|Wilma) Flintstone/ does not store anything into $1; it’s just there for grouping.
Here are some other examples of regular expressions, and the effect of parentheses:
    abc*             # matches ab, abc, abcc, abccc, abcccc, and so on
    (abc)*           # matches "", abc, abcabc, abcabcabc, and so on
    ^x|y             # matches x at the beginning of line, or y anywhere
    ^(x|y)           # matches either x or y at the beginning of a line
    a|bc|d           # a, or bc, or d
    (a|b)(c|d)       # ac, ad, bc, or bd
    (song|blue)bird  # songbird or bluebird
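The precedence rules can be observed with a sketch like the following. Because b* can match the empty string, the two readings are compared with the whole pattern anchored at both ends; the (?:...) wrapper and the test strings are assumptions made for the demonstration.

```perl
use strict;
use warnings;

# "a single a, or any number of b's" versus "any number of a's or b's".
my $a_or_bs = ("ab" =~ /^(?:a|b*)$/) ? 1 : 0;   # 0: "ab" is neither alternative
my $as_n_bs = ("ab" =~ /^(?:a|b)*$/) ? 1 : 0;   # 1: an a followed by a b
print "$a_or_bs $as_n_bs\n";                    # prints 0 1

# Ordinary parentheses group and remember.
my ($kind) = "a bluebird sang" =~ /(song|blue)bird/;
print "$kind\n";                                # prints blue

# (?:...) groups without remembering, so $1 is left undefined.
my $ok        = "Wilma Flintstone" =~ /(?:Fred|Wilma) Flintstone/;
my $stored_1  = defined($1) ? 1 : 0;
print "matched=", ($ok ? 1 : 0), " stored=$stored_1\n";   # prints matched=1 stored=0
```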
[48] Of course, /\d{3}/ doesn’t only match three-digit numbers. It would also match any number containing more than three digits. To match exactly three, you need to use anchors, described in Section 7.3.3.
[49] Well, technically, there was a lot of backtracking of the * operator to find the c’s in the first place. But that’s a little trickier to describe, and it works on the same principle.
[50] Or just before the newline at the end of the string, for historical simplicity.