Now that we've admired all the fancy cages, we can go back to looking at the critters in the cages, those funny-looking symbols you put inside the patterns. By now you'll have cottoned to the fact that these symbols aren't regular Perl code like function calls or arithmetic operators. Regular expressions are their own little language nestled inside of Perl. (There's a bit of the jungle in all of us.)
For all their power and expressivity, patterns in Perl recognize the same 12 traditional metacharacters (the Dirty Dozen, as it were) found in many other regular expression packages:
\ | ( ) [ { ^ $ * + ? .
Some of those bend the rules, making otherwise normal characters that follow them special. We don't like to call the longer sequences "characters", so when they make longer sequences, we call them metasymbols (or sometimes just "symbols"). But at the top level, those twelve metacharacters are all you (and Perl) need to think about. Everything else proceeds from there.
Some simple metacharacters stand by themselves, like . and ^ and $. They don't directly affect anything around them. Some metacharacters work like prefix operators, governing what follows them, like the backslash (\). Others work like postfix operators, governing what immediately precedes them, like *, +, and ?. One metacharacter, |, acts like an infix operator, standing between the operands it governs. There are even bracketing metacharacters that work like circumfix operators, governing something contained inside them, like (…) and […]. Parentheses are particularly important, because they specify the bounds of | on the inside, and of *, +, and ? on the outside.
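To see how parentheses set those bounds, here is a small sketch. The strings and variable names are made up for illustration; only the patterns matter.

```perl
use strict;
use warnings;

# Without parentheses, + governs only the "b" immediately before it.
my $plain   = "ababab" =~ /^ab+$/   ? 1 : 0;
# With parentheses, + governs the whole group "ab".
my $grouped = "ababab" =~ /^(ab)+$/ ? 1 : 0;

# Without parentheses, | splits the entire pattern: /^cat/ or /dog$/.
my $loose = "catalog" =~ /^cat|dog$/   ? 1 : 0;
# With parentheses, the alternation is confined to the group.
my $tight = "catalog" =~ /^(cat|dog)$/ ? 1 : 0;

print "plain=$plain grouped=$grouped loose=$loose tight=$tight\n";
```

The first pair shows the parentheses bounding a quantifier from the outside; the second pair shows them bounding an alternation from the inside.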
If you learn only one of the twelve metacharacters, choose the backslash. (Er . . . and the parentheses.) That's because backslash disables the others. When a backslash precedes a nonalphanumeric character in a Perl pattern, it always makes that next character a literal. If you need to match one of the twelve metacharacters in a pattern literally, you write it with a backslash in front. Thus, \. matches a real dot, \$ a real dollar sign, \\ a real backslash, and so on. This is known as "escaping" the metacharacter, or "quoting" it, or sometimes just "backslashing" it. (Of course, you already know that backslash is used to suppress variable interpolation in double-quoted strings.)
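A short sketch of escaping in action follows. The sample string and variable names are invented for the example.

```perl
use strict;
use warnings;

# Single quotes: the doubled backslash below is one real backslash.
my $price = 'cost: $3.50 (C:\\temp)';

my $dot    = $price =~ /3\.50/    ? 1 : 0;  # \. matches only a real dot
my $any    = '3x50' =~ /3.50/     ? 1 : 0;  # a bare . matches any character
my $dollar = $price =~ /\$3/      ? 1 : 0;  # \$ matches a real dollar sign
my $slash  = $price =~ /C:\\temp/ ? 1 : 0;  # \\ matches one real backslash
my $paren  = $price =~ /\(C:/     ? 1 : 0;  # \( matches a real parenthesis

print "dot=$dot any=$any dollar=$dollar slash=$slash paren=$paren\n";
```

Note the second line: leave off the backslash and the dot goes back to being a metacharacter, happily matching the "x".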
Although a backslash turns a metacharacter into a literal character, its effect upon a following alphanumeric character goes the other direction. It takes something that was regular and makes it special. That is, together they make a metasymbol. An alphabetical list of these metasymbols can be found below in Table 5.7.
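Here is a minimal sketch of that reversal, using the familiar \d and \w metasymbols. The sample text and variable names are ours, chosen only for illustration.

```perl
use strict;
use warnings;

my $text = 'd1 w2';

my @plain_d = $text =~ /d/g;    # a bare d is just the letter "d"
my @meta_d  = $text =~ /\d/g;   # \d is a metasymbol matching any digit
my @words   = $text =~ /\w+/g;  # \w matches a word character; + repeats it

print "plain d: @plain_d\n";
print "meta \\d: @meta_d\n";
print "words:   @words\n";
```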
In the following tables, the Atomic column says "yes" if the given metasymbol is quantifiable (if it can match something with width, more or less). Also, we've used "…" to represent "something else". (Please see the later discussion to find out what "…" means, if it is not clear from the one-line gloss in the table.)
Table 5.4 shows the basic traditional metasymbols. The first four of these are the structural metasymbols we mentioned earlier, while the last three are simple metacharacters. The . metacharacter is an example of an atom because it matches something with width (the width of a character, in this case); ^ and $ are examples of assertions, because they match something of zero width, and because they are only evaluated to see if they're true or not.
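One way to watch the difference between an atom and an assertion is to measure how much of the string each one consumes. This sketch uses Perl's @- and @+ arrays, which hold the start and end offsets of the last successful match. The variable names are illustrative.

```perl
use strict;
use warnings;

my $str = "abc";

$str =~ /./ or die "no match";
my $dot_width = $+[0] - $-[0];      # . is an atom: it consumed one character

$str =~ /^/ or die "no match";
my $caret_width = $+[0] - $-[0];    # ^ is an assertion: it consumed nothing

$str =~ /$/ or die "no match";
my $dollar_pos   = $-[0];           # where the assertion held true
my $dollar_width = $+[0] - $-[0];   # $ also consumed nothing

print "dot=$dot_width caret=$caret_width ",
      "dollar=$dollar_width at offset $dollar_pos\n";
```

Both assertions succeed, but the matches they produce are empty: they say something about a position in the string rather than about any character in it.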
Table 5-4. General Regex Metacharacters
The quantifiers, which are further described in their own section, indicate how many times the preceding atom (that is, single character or grouping) should match. These are listed in Table 5.5.
Table 5-5. Regex Quantifiers
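Quantifier | Atomic | Meaning |
---|---|---|
* | No | Match 0 or more times (maximal). |
+ | No | Match 1 or more times (maximal). |
? | No | Match 1 or 0 times (maximal). |
{COUNT} | No | Match exactly COUNT times. |
{MIN,} | No | Match at least MIN times (maximal). |
{MIN,MAX} | No | Match at least MIN but not more than MAX times (maximal). |
*? | No | Match 0 or more times (minimal). |
+? | No | Match 1 or more times (minimal). |
?? | No | Match 0 or 1 time (minimal). |
{MIN,}? | No | Match at least MIN times (minimal). |
{MIN,MAX}? | No | Match at least MIN but not more than MAX times (minimal). |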
A minimal quantifier tries to match as few characters as possible within its allowed range. A maximal quantifier tries to match as many characters as possible within its allowed range. For instance, .+ is guaranteed to match at least one character of the string, but will match all of them given the opportunity. The opportunities are discussed later in "The Little Engine That /Could(n't)?/".
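A quick sketch contrasting the two behaviors follows. The sample strings and variable names are made up for the purpose.

```perl
use strict;
use warnings;

my $html = '<b>bold</b> and <i>italic</i>';

# Maximal: .+ grabs as much as it can and still lets the final > match.
my ($maximal) = $html =~ /(<.+>)/;
# Minimal: .+? stops at the first place the final > can match.
my ($minimal) = $html =~ /(<.+?>)/;

# The same contrast with a counted range.
my ($most)  = 'aaaaa' =~ /(a{2,4})/;    # takes four
my ($least) = 'aaaaa' =~ /(a{2,4}?)/;   # settles for two

print "maximal: $maximal\n";
print "minimal: $minimal\n";
print "most=$most least=$least\n";
```

Here the maximal version swallows the whole string, since it happens to end in >, while the minimal version is content with the first tag.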
You'll note that quantifiers may never be quantified.
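If you want to see this for yourself, the following sketch tries to compile one pattern that stacks a quantifier directly on another quantifier, and one that quantifies a group instead. The patterns are built from strings so that they compile at run time, where eval can trap the failure. We check only whether compilation succeeded and make no assumption about the wording of the error. The variable names are illustrative.

```perl
use strict;
use warnings;

my $bad  = 'a**';           # a quantifier applied directly to a quantifier
my $good = '(?:a{2}){3}';   # a quantified group, which is an atom

# eval returns undef if the pattern fails to compile.
my $bad_ok  = eval { my $re = qr/$bad/;  1 } ? 1 : 0;
my $good_ok = eval { my $re = qr/$good/; 1 } ? 1 : 0;

# The grouped form means "two a's, three times over": six in all.
my $re  = qr/$good/;
my $six = 'aaaaaa' =~ /\A$re\z/ ? 1 : 0;

print "bad_ok=$bad_ok good_ok=$good_ok six=$six\n";
```

The way around the restriction, then, is to wrap the quantified thing in parentheses, which turns it back into an atom.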
We wanted to provide an extensible syntax for new kinds of metasymbols. Given that we only had a dozen metacharacters to work with, we chose a formerly illegal regex sequence to use for arbitrary syntactic extensions. These metasymbols are all of the form (?KEY…); that is, a (balanced) parenthesis followed by a question mark, followed by a KEY and the rest of the subpattern. The KEY character indicates which particular regex extension it is. See Table 5.6 for a list of these. Most of them behave structurally since they're based on parentheses, but they also have additional meanings. Again, only atoms may be quantified because they represent something that's really there (potentially).
Table 5-6. Extended Regex Sequences
Extension | Atomic | Meaning |
---|---|---|
(?#…) | No | Comment, discard. |
(?:…) | Yes | Cluster-only parentheses, no capturing. |
(?imsx-imsx) | No | Enable/disable pattern modifiers. |
(?imsx-imsx:…) | Yes | Cluster-only parentheses plus modifiers. |
(?=…) | No | True if lookahead assertion succeeds. |
(?!…) | No | True if lookahead assertion fails. |
(?<=…) | No | True if lookbehind assertion succeeds. |
(?<!…) | No | True if lookbehind assertion fails. |
(?>…) | Yes | Match nonbacktracking subpattern. |
(?{…}) | No | Execute embedded Perl code. |
(??{…}) | Yes | Match regex from embedded Perl code. |
(?(…)…|…) | Yes | Match with if-then-else pattern. |
(?(…)…) | Yes | Match with if-then pattern. |
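Here is a brief sketch exercising a handful of the extensions from the table. The sample strings and variable names are invented for illustration, and the more exotic extensions (embedded code and conditionals) are left out.

```perl
use strict;
use warnings;

# (?:…) clusters without capturing, so the first capture is the year.
my ($year) = 'Mar 2001' =~ /(?:Jan|Feb|Mar) (\d+)/;

# (?i) switches on case-insensitive matching from inside the pattern.
my $ci = 'PERL' =~ /(?i)perl/ ? 1 : 0;

# (?=…) must see "bar" ahead but does not include it in the match.
my ($la) = 'foobar' =~ /(foo(?=bar))/;

# (?!…) succeeds only if "bar" does not come next.
my $neg = 'foobaz' =~ /foo(?!bar)/ ? 1 : 0;

# (?<=…) must see a dollar sign behind but does not include it.
my ($amount) = 'cost $42' =~ /(?<=\$)(\d+)/;

# (?#…) is a comment; the Engine discards it.
my $commented = 'abc' =~ /a(?# the middle letter)b/ ? 1 : 0;

print "year=$year ci=$ci la=$la neg=$neg ",
      "amount=$amount commented=$commented\n";
```

The lookaround assertions behave like ^ and $: they are either true or false at a position, and contribute nothing to the text matched.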
And finally, Table 5.7 shows all of your favorite alphanumeric metasymbols. (Symbols that are processed by the variable interpolation pass are marked with a dash in the Atomic column, since the Engine never even sees them.)