Patterns allow you to group portions of your pattern together into subpatterns and to remember the strings matched by those subpatterns. We call the first behavior clustering and the second one capturing.
To capture a substring for later use, put parentheses
around the subpattern that matches it. The first pair of parentheses
stores its substring in $1
, the second pair in
$2
, and so on. You may use as many parentheses as
you like; Perl just keeps defining more numbered variables for you
to represent these captured strings.
Some examples:
/(d)(d)/ # Match two digits, capturing them into $1 and $2 /(d+)/ # Match one or more digits, capturing them all into $1 /(d)+/ # Match a digit one or more times, capturing the last into $1
Note the difference between the second and third patterns. The second form is usually what you want. The third form does not create multiple variables for multiple digits. Parentheses are numbered when the pattern is compiled, not when it is matched.
Captured strings are often called
backreferences because they refer back to parts
of the captured text. There are actually two ways to get at these
backreferences. The numbered variables you've seen are how you get
at backreferences outside of a pattern, but inside the pattern, that
doesn't work. You have to use 1
,
2
, etc.[9] So to find doubled words like "the
the
" or "had had
", you might use this
pattern:
/(w+) 1/i
But most often, you'll be using the $1
form, because you'll usually apply a pattern and then do something
with the substrings. Suppose you have some text (a mail header) that
looks like this:
From: [email protected] To: [email protected] Date: Mon, 17 Jul 2000 09:00:00 -1000 Subject: Eye of the needle
and you want to construct a hash that maps the text before each colon to the text afterward. If you were looping through this text line by line (say, because you were reading it from a file) you could do that as follows:
while (<>) { /^(.*?): (.*)$/; # Pre-colon text into $1, post-colon into $2 $fields{$1} = $2; }
Like $`
, $&
, and
$
', these numbered variables are dynamically
scoped through the end of the enclosing block or
eval
string, or to the next successful pattern
match, whichever comes first. You can use them in the righthand side
(the replacement part) of a substitute, too:
s/^(S+) (S+)/$2 $1/; # Swap first two words
Groupings can nest, and when they do, the groupings are counted by the location of the left parenthesis. So given the string "Primula Brandybuck", the pattern:
/^((w+) (w+))$/
would capture “Primula
Brandybuck
” into $1
,
“Primula
” into
$2
, and "Brandybuck
" into
$3
. This is depicted in Figure 5.1.
Patterns with captures are often used in list context to populate a list of values, since the pattern is smart enough to return the captured substrings as a list:
($first, $last) = /^(w+) (w+)$/; ($full, $first, $last) = /^((w+) (w+))$/;
With the /g
modifier, a pattern can return
multiple substrings from multiple matches, all in one list. Suppose
you had the mail header we saw earlier all in one string (in
$_
, say). You could do the same thing as our
line-by-line loop, but with one statement:
%fields = /^(.*?): (.*)$/gm;
The pattern matches four times, and each time it matches, it
finds two substrings. The /gm
match returns all
of these as a flat list of eight strings, which the list assignment
to %fields
will conveniently interpret as four
key/value pairs, thus restoring harmony to the universe.
Several other special variables deal with text
captured in pattern matches. $&
contains the
entire matched string, $`
everything to the left
of the match, $
' everything to the right.
$+
contains the contents of the last
backreference.
$_ = "Speak, <EM>friend</EM>, and enter."; m[ (<.*?>) (.*?) (</.*?>) ]x; # A tag, then chars, then an end tag print "prematch: $` "; # Speak, print "match: $& "; # <EM>friend</EM> print "postmatch: $' "; # , and enter. print "lastmatch: $+ "; # </EM>
For more explanation of these magical Elvish variables (and for a way to write them in English), see Chapter 28.
The @-
(@LAST_MATCH_START
) array holds the offsets of
the beginnings of any submatches, and @+
(@LAST_MATCH_END
) holds the offsets of the
ends:
#!/usr/bin/perl $alphabet = "abcdefghijklmnopqrstuvwxyz"; $alphabet =~ /(hi).*(stu)/; print "The entire match began at $-[0] and ended at $+[0] "; print "The first match began at $-[1] and ended at $+[1] "; print "The second match began at $-[2] and ended at $+[2] ";
If you really want to match a literal parenthesis character instead of having it interpreted as a metacharacter, backslash it:
/(e.g., .*?)/
This matches a parenthesized example (e.g., this statement).
But since dot is a wildcard, this also matches any parenthetical
statement with the first letter e
and third
letter g
(ergo, this statement too).
Bare parentheses both cluster and capture. But
sometimes you don't want that. Sometimes you just want to group
portions of the pattern without creating a backreference. You can
use an extended form of parentheses to suppress capturing: the
(?
:PATTERN
)
notation will cluster without capturing.
There are at least three reasons you might want to cluster without capturing:
To quantify something.
To limit the scope of interior alternation; for
example, /^cat|cow|dog$/
needs to be
/^(?:cat|cow|dog)$/
so that the cat doesn't
run away with the ^
.
To limit the scope of an embedded pattern modifier to a
particular subpattern, such as in
/foo(?-i:Case_Matters)bar/i
. (See Section 5.7.3)
In addition, it's more efficient to suppress the capture of something you're not going to use. On the minus side, the notation is a little noisier, visually speaking.
In a pattern, a left parenthesis immediately followed by a question mark denotes a regex extension. The current regular expression bestiary is relatively fixed--we don't dare create a new metacharacter, for fear of breaking old Perl programs. Instead, the extension syntax is used to add new features to the bestiary.
In the remainder of the chapter, we'll
see many more regex extensions, all of which cluster without
capturing, as well as doing something else. The
(?
:PATTERN
)
extension is just special in that it does nothing else. So if you
say:
@fields = split(/(?:a|b|c)/)
it's like:
@fields = split(/(a|b|c)/)
but doesn't spit out extra fields. (The
split
operator is a bit like
m//g
in that it will emit extra fields for all
the captured substrings within the pattern. Ordinarily,
split
only returns what it
didn't match. For more on
split
see Chapter
29.)
You may cloister the
/i
, /m
, /s
,
and /x
modifiers within a portion of your pattern
by inserting them (without the slash) between the
?
and : of the clustering notation. If you
say:
/Harry (?i:s) Truman/
it matches both "Harry S Truman
" and
"Harry s Truman
", whereas:
/Harry (?x: [A-Z] .? s )?Truman/
matches both "Harry S Truman
" and
"Harry S. Truman
", as well as "Harry
Truman
", and:
/Harry (?ix: [A-Z] .? s )?Truman/
matches all five, by combining the /i
and
/x
modifiers within the cloister.
You can also subtract modifiers from a cloister with a minus sign:
/Harry (?x-i: [A-Z] .? s )?Truman/i
This matches any capitalization of the name--but if the middle
initial is provided, it must be capitalized, since the
/i
applied to the overall pattern is suspended
inside the cloister.
By omitting the colon and PATTERN
,
you can export modifier settings to an outer cluster, turning it
into a cloister. That is, you can selectively turn modifiers on and
off for the cluster one level outside the modifiers' parentheses,
like so:
/(?i)foo/ # Equivalent to /foo/i /foo((?-i)bar)/i # "bar" must be lower case /foo((?x-i) bar)/ # Enables /x and disables /i for "bar"
Note that the second and third examples create backreferences.
If that wasn't what you wanted, then you should have been using
(?-i:bar)
and (?x-i: bar)
,
respectively.
Setting modifiers on a portion of your pattern is particularly
useful when you want "." to match newlines in part of your pattern
but not in the rest of it. Setting /s
on the
whole pattern doesn't help you there.
[9] You can't use $1
for a backreference
within the pattern because that would already have been
interpolated as an ordinary variable back when the regex was
compiled. So we use the traditional 1
backreference notation inside patterns. For two- and three-digit
backreference numbers, there is some ambiguity with octal
character notation, but that is neatly solved by considering how
many captured patterns are available. For instance, if Perl sees
a 11
metasymbol, it's equivalent to
$11
only if there are at least 11 substrings
captured earlier in the pattern. Otherwise, it's equivalent to