Simple Patterns

We'll start with some of the simplest patterns you can create: patterns that match specific sequences of characters, patterns that match only at specific places in a string, and patterns combined using what's called alternation.

Character Sequences

One of the simplest patterns is just a sequence of characters you want to match, like this:

/foo/
/this or that/
/   /
/Laura/
/patterns that match specific sequences/

All these patterns will match if the data contains those characters in that order. All the characters must match, including spaces. The word or in the second pattern doesn't have any special significance (it's not a logical or); that pattern will only match if the data contains the string this or that somewhere inside it.

Note that characters in patterns can be matched anywhere in a string. Word boundaries are not relevant for these patterns—the pattern /if/ will match in the string “if wishes were horses” and in the string “there is no difference.” The pattern /if /, however, because it contains a space, will only match in the first string where the characters i, f, and the one space occur in that order.
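As a minimal sketch of that difference, the following script tries both patterns against the two sample strings. The variable names (@lines, @results) are made up for this illustration.

```perl
use strict;
use warnings;

# The two sample strings discussed in the text.
my @lines = ("if wishes were horses", "there is no difference");

my @results;   # one [plain, with_space] pair per line
for my $line (@lines) {
    my $plain      = ($line =~ /if/)  ? 1 : 0;   # i then f, anywhere in the string
    my $with_space = ($line =~ /if /) ? 1 : 0;   # i, f, then exactly one space
    push @results, [$plain, $with_space];
    printf "%-26s /if/ => %d   /if / => %d\n", qq("$line"), $plain, $with_space;
}
```

Both strings satisfy /if/, but only the first also satisfies /if /, because “difference” has no space after its i and f.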

Case matters when matching characters: /kazoo/ will only match kazoo and not Kazoo or KAZOO. To turn case sensitivity off in a particular search, you can use the i option after the pattern itself (the i indicates ignore case), like this:

/kazoo/i # search for any upper and lowercase versions

Alternatively, you can create patterns that will search for either upper- or lowercase letters, as you'll learn in the next section.
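Here is a small sketch of the i option at work, using a hypothetical list of three spellings and Perl's built-in grep to keep only the strings that match:

```perl
use strict;
use warnings;

my @words = qw(kazoo Kazoo KAZOO);

# grep keeps the elements for which the pattern matches $_.
my @sensitive   = grep { /kazoo/ }  @words;   # exact case only
my @insensitive = grep { /kazoo/i } @words;   # any mix of upper and lowercase

print "case-sensitive matches: @sensitive\n";
print "ignore-case matches:    @insensitive\n";
```

Without the i option only the all-lowercase spelling is kept; with it, all three are.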

You can include most alphanumeric characters in patterns, including string escapes for binary data (octal and hex escapes). There are a number of characters that you cannot match without escaping them. These characters are called metacharacters and refer to bits of the pattern language and not to the literal character. These are the metacharacters to watch out for in patterns:

^ $
. +
? *
{ (
) \
/ |
[

If you want to actually match a metacharacter in a string—for example, search for an actual question mark—you can escape it using a backslash, just as you would in a regular string:

/\?/  # matches a literal question mark
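The sketch below shows backslash escapes in use. It also uses quotemeta, a Perl built-in that returns its argument with every non-word character backslash-escaped, which can be handy when the text to match is held in a variable. The sample text and variable names are invented for the example.

```perl
use strict;
use warnings;

my $text = "What is 3+4?";

# An escaped metacharacter matches itself.
my $has_question = ($text =~ /\?/)   ? 1 : 0;   # literal question mark
my $has_sum      = ($text =~ /3\+4/) ? 1 : 0;   # literal plus sign

# quotemeta() adds the backslashes for you.
my $literal     = "3+4?";
my $escaped     = quotemeta($literal);          # the five characters 3 \ + 4 plus \ ?
my $has_literal = ($text =~ /$escaped/) ? 1 : 0;

print "question mark found: $has_question\n";
print "3+4 found: $has_sum\n";
print "escaped form is $escaped and it was found: $has_literal\n";
```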

Matching at Word or Line Boundaries

When you create a pattern to match a sequence of characters, those characters can appear anywhere inside the string and the pattern will still match. But sometimes you want a pattern to match those characters only if they occur at a specific place—for example, match /if/ only when it's a whole word, or /kazoo/ only if it occurs at the start of the line (that is, the beginning of the string).

Note

I'm making an assumption here that the data you're searching is a line of input, where the line is a single string with no embedded newline characters. Given that assumption, the terms string, line, and data are effectively interchangeable. Tomorrow, we'll talk about how patterns deal with newlines.


To match a pattern at a specific position, you use pattern anchors. To anchor a pattern at the start of the string, use ^:

/^Kazoo/  # match only if Kazoo occurs at the start of the line

To match at the end of the string, use $:

/end$/  # match only if end occurs at the end of the line

Once again, think of the pattern as a sequence of things in which each part of the pattern must match the data you're applying it to. Perl's pattern-matching routines actually begin searching at a position just before the first character, which matches ^. They then move to each character in turn until the end of the line, where $ matches. If there's a newline at the end of the string, the position marked by $ is just before that newline character.

So, for example, let's see what happens when you try to match the pattern /^foo/ to the string “to be or not to be” (which, obviously, won't match, but let's try it anyhow). Perl starts at the beginning of the line, which matches the ^ character. That part of the pattern is true. It then tests the first character. The pattern wants to see an f there, but it got a t instead, so the pattern stops and returns false.

What happens if you try to apply the pattern to the string “fob”? The match will get farther—it'll match the start of the line, the f and the o, but then fail at the b. And keep in mind that /^foo/ will not match in the string “ foo”—the foo is not at the very start of the line where the pattern expects it to be. It will only match when all four parts of the pattern match the string.
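To see the anchors in action, this sketch runs /^foo/ and /end$/ over a few sample strings. The string “food fight” and the end-of-line samples are additions made up for the illustration.

```perl
use strict;
use warnings;

# /^foo/ needs f, o, o right at the start of the string.
my @start_tests = ("to be or not to be", "fob", " foo", "food fight");
my %starts_with_foo;
for my $s (@start_tests) {
    $starts_with_foo{$s} = ($s =~ /^foo/) ? 1 : 0;
    printf "%-22s /^foo/ => %d\n", qq("$s"), $starts_with_foo{$s};
}

# /end$/ needs e, n, d right at the end (a trailing newline is allowed).
my @end_tests     = ("the end", "the end\n", "endless");
my @ends_with_end = map { /end$/ ? 1 : 0 } @end_tests;
print "/end\$/ results: @ends_with_end\n";
```

Only “food fight” passes the first test. For the second, the string with a trailing newline still matches because $ can sit just before that newline.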

Some interesting but potentially tricky uses of ^ and $—can you guess what these patterns will match?

/^/
/^1$/
/^$/

The first pattern matches any string that has a start of the line. It would be a very weird string indeed that didn't have a start, so this pattern will match any string data whatsoever, even the empty string.

The second one wants to find the start of the line, the numeral 1, and then the end of the line. So it'll only match if the string contains 1 and only 1; it won't match “123” or “foo 1” or even “ 1” (with a leading space).

The third pattern will only match if the start of the line is immediately followed by the end of the line, that is, if there is no actual data. This pattern will only match an empty line. Keep in mind that because $ can match just before a trailing newline character, this last pattern will match both “” and “\n”.
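If you'd like to check these three patterns for yourself, a short sketch along the following lines will do it. The sample strings and hash names are chosen just for this example.

```perl
use strict;
use warnings;

# Strings to try; "\n" is a lone newline and " " is a single space.
my @samples = ("", "\n", " ", "1", "1\n", " 1", "123", "foo 1");

my (%any, %only_one, %empty);
for my $s (@samples) {
    $any{$s}      = ($s =~ /^/)   ? 1 : 0;   # start of line only
    $only_one{$s} = ($s =~ /^1$/) ? 1 : 0;   # start, the numeral 1, end
    $empty{$s}    = ($s =~ /^$/)  ? 1 : 0;   # start immediately followed by end

    (my $shown = $s) =~ s/\n/\\n/g;          # make newlines visible when printing
    printf "%-8s /^/ => %d  /^1\$/ => %d  /^\$/ => %d\n",
        qq("$shown"), $any{$s}, $only_one{$s}, $empty{$s};
}
```

Every sample satisfies /^/. Only “1” and “1\n” satisfy /^1$/, and only the empty string and the lone newline satisfy /^$/.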

Another boundary to match is a word boundary, where a word boundary is the position between a word character (a letter, number, or underscore) and some other character such as whitespace or punctuation (or the start or end of the string). A word boundary is indicated using the \b escape. So /\bif\b/ will match only when the whole word “if” exists in the string, but not when the characters i and f appear in the middle of a word (as in “difference”). You can use \b to refer to either the start or the end of a word; /\bif/, for example, will match in both “if I were king” and “that result is iffy,” and even in “As if!”, but not in “bomb the aquifer” or “the serif is obtuse.”

You can also search for a pattern that is not at a word boundary using the \B escape. With this, /\Bif/ will match only when the characters i and f occur inside a word and not at the start of a word. Table 9.1 contains a list of boundaries.

Table 9.1. Boundaries
Boundary Character   Matches
^                    The beginning of a string or line
$                    The end of a string or line
\b                   A word boundary
\B                   Anything other than a word boundary
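The following sketch compares the \b and \B escapes on a handful of phrases. The hash %hits and its keys (whole, start, inside) are names invented for the illustration.

```perl
use strict;
use warnings;

my @phrases = (
    "if I were king",
    "that result is iffy",
    "As if!",
    "there is no difference",
    "the serif is obtuse",
);

my %hits;
for my $p (@phrases) {
    $hits{$p} = {
        whole  => ($p =~ /\bif\b/ ? 1 : 0),   # "if" as a complete word
        start  => ($p =~ /\bif/   ? 1 : 0),   # "if" at the start of a word
        inside => ($p =~ /\Bif/   ? 1 : 0),   # "if" not at the start of a word
    };
    printf "%-26s whole=%d start=%d inside=%d\n",
        qq("$p"), $hits{$p}{whole}, $hits{$p}{start}, $hits{$p}{inside};
}
```

Note that “iffy” counts for the start test but not the whole-word test, while “difference” and “serif” count only for the \B test.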

Matching Alternatives

Sometimes, when you're building a pattern, you might want to search for more than one pattern in the same string, and then test based on whether all of the patterns were found, or perhaps whether any one of them was found. You could, of course, do this with the regular Perl logical expressions for boolean AND (&& or and) and OR (|| or or) with multiple pattern-matching expressions, something like this:

if (($in =~ /this/) || ($in =~ /that/)) { ...

Then, if the string contains /this/ or if it contains /that/, the whole test will return true.

In the case of an OR search (match this pattern or that pattern—either one will work), however, there is a regular expression metacharacter you can use: the pipe character (|). So, for example, the long if test in that example could just be written as

if ($in =~ /this|that/) { ...

Using the | character inside a pattern is officially known as alternation because it allows you to match alternate patterns. A true value for the pattern occurs if any of the alternatives match.
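As a quick check that the two forms agree, this sketch runs both tests over a few made-up inputs and records the results side by side:

```perl
use strict;
use warnings;

my @inputs = ("take this", "take that", "take neither");

my @pairs;   # one [two_tests, alternation] pair per input
for my $in (@inputs) {
    my $two_tests   = (($in =~ /this/) || ($in =~ /that/)) ? 1 : 0;   # two matches joined by ||
    my $alternation = ($in =~ /this|that/)                 ? 1 : 0;   # one match using |
    push @pairs, [$two_tests, $alternation];
    printf "%-16s two tests => %d   alternation => %d\n",
        qq("$in"), $two_tests, $alternation;
}
```

For each input the two columns are the same: true for the first two strings, false for the third.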

Any anchoring characters you use with an alternation character apply only to the pattern on the same side of the pipe. So, for example, the pattern /^this|that/ means “this at the start of the line” or “that anywhere,” and not “either this or that at the start of a line.” If you wanted the latter form you could use /^this|^that/, but a better way is to group your patterns using parentheses:

/^(this|that)/

For this pattern, Perl first matches the start of the line, and then tries to match all the characters in “this.” If it can't match “this,” it'll then back up to the start of the line and try to match “that.” For a pattern like /^this|that/, it'll first try to match everything on the left side of the pipe (start of line, followed by this), and if it can't do that, it'll back up and search the entire string for “that.”

An even better version factors out everything the two alternatives have in common: not just the ^ that matches the beginning of the line, but also the th characters. Only the parts that differ stay inside the parentheses, like this:

/^th(is|at)/

This last version means that Perl won't even try the alternation unless th has already been matched at the start of the line, and then there will be a minimum of backing up to match the pattern. With regular expressions, the less work Perl has to do to match something, the better.
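To compare how the anchor interacts with each form, here is a sketch that applies all three patterns to four made-up strings. It checks only whether each pattern matches, not how much work Perl does to get there.

```perl
use strict;
use warnings;

my @inputs = ("this is it", "that is it", "not that", "not this");

my %summary;   # input => three digits: loose, grouped, factored
for my $in (@inputs) {
    my $loose    = ($in =~ /^this|that/)   ? 1 : 0;   # ^ applies to "this" only
    my $grouped  = ($in =~ /^(this|that)/) ? 1 : 0;   # ^ applies to both alternatives
    my $factored = ($in =~ /^th(is|at)/)   ? 1 : 0;   # same meaning, shared "th" pulled out
    $summary{$in} = "$loose$grouped$factored";
    printf "%-14s loose=%d grouped=%d factored=%d\n",
        qq("$in"), $loose, $grouped, $factored;
}
```

The string “not that” is the telling case: the ungrouped pattern matches it because “that” may appear anywhere, while the two grouped forms reject it.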

You can use grouping for any kinds of alternation within a pattern. For example, /(1st|2nd|3rd|4th) time/ will match “1st time,” “2nd time,” and so on—as long as the data contains one of the alternations inside the parentheses and the string “ time” (note the space).
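A brief sketch of that last pattern, using a few invented test strings and grep to keep the ones that match:

```perl
use strict;
use warnings;

my @inputs = ("the 2nd time around", "4th time lucky", "5th time", "3rdtime");

# Keep only the strings containing one of the alternatives followed by " time".
my @matched = grep { /(1st|2nd|3rd|4th) time/ } @inputs;

print "matched: $_\n" for @matched;
```

“5th time” fails because 5th is not one of the alternatives, and “3rdtime” fails because the space before time is missing.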
