Going Deeper

Regular expressions are one of those things that you could write a whole book about and still not cover the extent to which they can be used. In the last two lessons, I've given you the basics of how to build and how to use regular expressions in your own programs. There are still many more features that haven't been discussed, however, including many more metacharacters and regular expression forms specific to Perl. In this section I'll give you an overview of some of these other forms.

For more information about any aspect of regular expressions in Perl, the perlre man page can be quite enlightening. If you find yourself enjoying working with regular expressions, consider the book Mastering Regular Expressions (Friedl, O'Reilly & Associates), which covers regular expressions of all kinds, Perl and otherwise, in an amazing amount of detail.

More Metacharacters, Variables, and Options

The metacharacters I described in the previous two lessons make up most of the basic set you can find in most regular expression flavors (not just those in Perl). Perl includes a number of extra metacharacters, variables, and options that provide different ways of creating complex patterns (or of processing the patterns that match).

Metacharacters

The first of these are nongreedy versions of the quantifiers *, +, and ?. As you learned throughout the lesson, these quantifiers are greedy: they match as many characters as they can, often far more than you expect, which can make it hard to figure out how the pattern actually works. As you've seen throughout the chapter, Perl also provides a second set of quantifiers that are nongreedy (sometimes called lazy quantifiers): *?, +?, and ??. These quantifiers match the minimal number of characters needed for the pattern to match, rather than the maximum like the regular quantifiers. Don't forget to use negated character classes where they will do the job, however; the lazy quantifiers are less efficient than a negated character class.
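As a quick illustration of the difference, here is a minimal sketch that runs a greedy quantifier, a lazy quantifier, and a negated character class over the same string. The sample string and the variable names are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical sample text with two pairs of angle brackets
my $html = '<b>bold</b> and <i>italic</i>';

# Greedy: .+ runs to the LAST > in the string
my ($greedy)  = $html =~ /<(.+)>/;

# Lazy: .+? stops at the FIRST > it can
my ($lazy)    = $html =~ /<(.+?)>/;

# Negated character class: same result as the lazy version here
my ($negated) = $html =~ /<([^>]+)>/;

print "greedy:  $greedy\n";    # b>bold</b> and <i>italic</i
print "lazy:    $lazy\n";      # b
print "negated: $negated\n";   # b
```

The lazy quantifier and the negated character class give the same answer for this string; the character class simply says more directly what you mean.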

  • The (?:pattern) construct is a variant on the use of parentheses to group patterns and save the results in the match variables $1, $2, $3, and so on. You can use parentheses to group an expression, but the result will get saved whether you want it to be or not. Using (?:pattern) instead, the expression will be grouped and evaluated as a unit, but the result will not be saved. It provides a slight performance advantage over regular parentheses where you don't care about the result.

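  • The (?o) construct enables you to nest pattern-matching options inside the pattern itself, for example, to make only some parts of the expression not case sensitive. The o part of the construct stands for any pattern-matching option that affects how the pattern is interpreted, such as i, m, s, or x; (?i), for example, turns on case-insensitive matching.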
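Here is a small sketch of both constructs, using made-up strings and variable names. The first match uses (?:pattern) so that the alternation is grouped without taking up $1. The second and third place (?i) inside a group, so that only the word color is matched without regard to case.

```perl
use strict;
use warnings;

# (?:...) groups the alternation but does not capture,
# so the value after the colon ends up in $1, not $2
my ($value) = 'Colour: Red' =~ /^(?:color|colour):\s*(\w+)/i;

# (?i) inside a group applies only until the end of that group:
# "color" is case-insensitive, ": Red" must match exactly
my $mixed_case = ('COLOR: Red' =~ /^(?:(?i)color): Red$/) ? 1 : 0;
my $all_caps   = ('COLOR: RED' =~ /^(?:(?i)color): Red$/) ? 1 : 0;

print "value: $value\n";            # Red
print "mixed case: $mixed_case\n";  # 1
print "all caps:   $all_caps\n";    # 0
```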

Look-ahead is a feature in Perl's regular expressions that enables Perl to peek ahead in a string and see if a pattern will match without changing the position in the string or adding anything to the text that was matched. It's sort of like saying “if the next part of this string matches X, then this part matches” without actually going anywhere. Use (?=pattern) to create a positive look-ahead pattern (the match succeeds only if pattern matches at the current position in the string). The reverse is a negative look-ahead pattern, (?!pattern), which succeeds only if pattern does not match at the current position.
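The following sketch shows both kinds of look-ahead on a made-up string. In each pattern the look-ahead only tests what comes next; because it consumes nothing, the \w+ that follows is still able to match those same characters.

```perl
use strict;
use warnings;

my $text = 'foobar foobaz';   # hypothetical sample text

# Positive look-ahead: "foo" only where "bar" comes next.
# The look-ahead does not consume "bar", so (\w+) picks it up.
my ($pos_word, $pos_rest) = $text =~ /(foo)(?=bar)(\w+)/;

# Negative look-ahead: "foo" only where "bar" does NOT come next.
# The first "foo" is skipped; the second one (in "foobaz") matches.
my ($neg_word, $neg_rest) = $text =~ /(foo)(?!bar)(\w+)/;

print "positive: $pos_word + $pos_rest\n";   # foo + bar
print "negative: $neg_word + $neg_rest\n";   # foo + baz
```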

Special Variables

In addition to the match variables $1, $2, and so on, Perl also includes the variables $`, $&, and $', which provide context for the text matched by the pattern. $` refers to the text leading up to the match (note the backquote; that's a different character from the quote '), $& is the text that was matched, and $' is the text after the match. These variables hold their values until the next successful match, regardless of whether or not the original string was changed. Using any of these variables is a significant performance hit, so consider avoiding them when at all possible.

The $+ variable holds the text matched by the highest-numbered match variable that was actually filled; for example, if both $1 and $2 were filled, but not $3, $+ will contain the same text as $2.
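To see what each of these variables contains, here is a minimal sketch with a made-up string and pattern. The pattern has three sets of parentheses, but the third sits in an alternative that is never used, so only $1 and $2 are filled.

```perl
use strict;
use warnings;

my $str = 'set color=red now';   # hypothetical sample text
my ($before, $matched, $after, $last_group);

if ($str =~ /(\w+)=(\w+)|(\d+)/) {
    $before     = $`;   # text leading up to the match
    $matched    = $&;   # the text that was matched
    $after      = $';   # text after the match
    $last_group = $+;   # text of the highest group that matched (here, group 2)
}

print "before: [$before]\n";       # [set ]
print "matched: [$matched]\n";     # [color=red]
print "after: [$after]\n";         # [ now]
print "last group: [$last_group]\n";   # [red]
```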

Options

You've learned about most of the options available to Perl regular expressions (both m// and s///) throughout the body of this lesson. Two not touched on are /x, for extended regular expressions, and /o, to avoid compiling the same regular expression over and over again.

The /x option enables you to add whitespace and comments to a regular expression, for better readability. Normally, if you add spaces to a pattern, those spaces are considered part of the pattern itself. With the /x option, Perl ignores spaces and newlines in the pattern (unless they are escaped or inside a character class), and allows comments on individual lines of the regular expression. So, for example, that regular expression in our <img> extractor script, which looked like this in the script:

while ($raw =~ /(\w+)\s*=\s*(("|')(.*?)\3|(\w*)\s*)/igs) {

Might be rewritten to look like this:

while ($raw =~ /([^ =]+)     # find the attribute name
                \s*=\s*      # find the equals, with or without whitespace
                ("([^"]+)"|  # find and extract quoted values
                [^\s]+\s*)   # or find non-quoted values
                /igx) {

The use of extended regular expressions can help quite a bit to improve the readability of a regular expression.
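If you want to try /x on its own, the following is a self-contained sketch along the same lines. It is not the extractor script itself: the sample attribute string, the %attrs hash, and the simplified pattern (double-quoted or unquoted values only) are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical attribute text, as it might appear inside an <img> tag
my $raw = 'src="logo.gif" alt="Site logo" border=0';
my %attrs;

while ($raw =~ /
        ([^\s=]+)         # attribute name (group 1)
        \s*=\s*           # equals sign, with or without whitespace
        (?: "([^"]*)"     # double-quoted value (group 2)
          | (\S+)         # or an unquoted value (group 3)
        )
    /gx) {
    # Use whichever of the two value groups was filled
    $attrs{$1} = defined $2 ? $2 : $3;
}

print "$_ => $attrs{$_}\n" for sort keys %attrs;
```

One thing to watch for: variables are still interpolated inside the comments of an extended pattern, and the pattern delimiter still ends the pattern, so keep dollar signs and slashes out of the comments.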

Finally, the /o option is used to optimize how Perl compiles and reads a regular expression interpolated via a scalar variable. Take the following code snippet:

while (<>) {
   if (/$pattern/) {
     ...
   }
}

In this snippet, the pattern stored in $pattern is interpolated and compiled into a real pattern that Perl can understand. The problem is that because this pattern is inside a while loop, that same process will occur each and every time the loop comes around. By including the /o at the end of the pattern, you're telling Perl the pattern won't change, and so it'll compile it once and reuse the same pattern each time:

if (/$pattern/o) {  # compile once
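Keep in mind that /o is a promise on your part. The sketch below, with made-up patterns and variable names, shows what happens if the variable does change after all: the pattern with /o keeps using the first value it saw.

```perl
use strict;
use warnings;

my (@with_o, @without_o);

for my $pattern ('cat', 'dog') {
    # With /o the pattern is compiled on the first pass (as "cat")
    # and reused as is on the second pass, even though $pattern changed
    push @with_o, $pattern if 'hot dog' =~ /$pattern/o;

    # Without /o the new value of $pattern is used on each pass
    push @without_o, $pattern if 'hot dog' =~ /$pattern/;
}

print "with /o:    @with_o\n";      # nothing matched
print "without /o: @without_o\n";   # dog
```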

For information on all these metacharacters, variables, and options, see the perlre man page.
