Matching Patterns over Multiple Lines

Up to this point we've been assuming that all the pattern matching you've been doing is on individual lines (strings) read from a file or from the keyboard. The assumption, then, is that the string you're searching has no embedded line feeds or carriage returns, and that the anchors for beginning and end of line refer to the beginning and end of the string itself. For the while (<>) code we've been writing up to this point, that's a sensible assumption to make.

Quite often, however, you might want to match a pattern across lines, particularly if the input you're working with is composed of sentences and paragraphs, where the line boundaries are arbitrary and depend on the current text formatting. If you want to, for example, search for all instances of the term “Exegetic Frobulator 5000” in a Web page, you want to be able to find the phrases that cross line boundaries as well as the ones that fit entirely on a single line.

You have to do two things to make this work. First, you have to modify your input routines so they read all the input into a single string, rather than processing it line by line. That'll give you one enormous string with the newline or carriage return characters still in place. Second, depending on the pattern you're working with, you might have to tell Perl to manage newlines in different ways.

Storing Multiple Lines of Input

You can read your entire input into memory in a number of ways. You could use <> in a list context, which reads every line into an array that you can then join into a single string:

@input = <>;

That particular line can be dangerous. If your input is very, very large, reading all of it at once could use up all the available memory in your system. There's also no way to get it to stop in the middle. A less aggressive approach, for reading paragraph-based data in particular, is to set the special $/ variable. If you set $/ to a null string ($/ = "";), Perl will read in paragraphs of text, including newlines, and stop when it gets to two or more newlines in a row. (The assumption here is that your input data has one or more empty lines between paragraphs.) Here's how that looks:

$/ = "";
while (<>) {   # read a para, not a line
   # $_ will contain the entire paragraph, not just a line
}
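If you want to try paragraph mode without a real input file, a sketch like the following reads from an in-memory filehandle instead. The sample text and the names $text and @paragraphs are made up for illustration, and local is used so the change to $/ is undone when the block ends:

```perl
use strict;
use warnings;

# Sample input (made up for this example): two paragraphs separated by a blank line.
my $text = "The Exegetic\nFrobulator 5000 is here.\n\nSecond paragraph,\nalso two lines.\n";

my @paragraphs;
{
    local $/ = "";    # paragraph mode; the old value returns when this block exits
    open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
    while (my $para = <$fh>) {
        push @paragraphs, $para;    # each element holds a whole paragraph
    }
    close($fh);
}

print scalar(@paragraphs), " paragraphs read\n";
```

Each element of @paragraphs still contains its embedded newlines, so it is ready for the multiline matching described in the rest of this section.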

A third way to read multiple lines into a single string is to use a while loop and append lines to an input string until you reach a certain delimiter. For strictly coded HTML files, for example, a paragraph ends with a </P> tag, so you could read all the input up until that point:

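while (<>) {
   if (/(.*)<\/P>/) {
      $in .= $1;     # keep the text before the closing tag
      last;          # stop reading at the end of the paragraph
   } else {
      $in .= $_;     # no closing tag yet, keep the whole line
   }
}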
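Here is a self-contained version of the same idea that you can run as is. It is only a sketch: read_paragraph is a hypothetical helper name, the sample HTML is made up, and it assumes the closing tag is written exactly as </P>:

```perl
use strict;
use warnings;

# Hypothetical helper: collect lines from a filehandle up to, but not including, </P>.
sub read_paragraph {
    my ($fh) = @_;
    my $in = '';
    while (my $line = <$fh>) {
        if ($line =~ m{(.*)</P>}) {    # m{...} avoids having to escape the slash
            $in .= $1;
            last;                      # delimiter found, stop reading
        }
        $in .= $line;
    }
    return $in;
}

# Made-up sample input, read through an in-memory filehandle.
my $html = "<P>First line\nsecond line</P>\n<P>Another paragraph</P>\n";
open(my $fh, '<', \$html) or die "Cannot open in-memory file: $!";
my $first = read_paragraph($fh);
close($fh);

print "$first\n";
```

Because the helper stops at the first </P>, you could call it again on the same filehandle to pick up the next paragraph.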

Handling Input with Newlines

After you have multiline input in a string to be searched, be it stored in $_ or in a scalar variable, you can go ahead and search that data for patterns across multiple lines. Be aware of several things regarding pattern matches with embedded newlines:

  • The \s character class includes newlines and carriage returns as whitespace, so a pattern such as /George\s+Washington/ will match with no problem regardless of whether the words George Washington are on a single line or on separate lines.

  • The ^ and $ anchoring characters refer to beginning of string or end of string—not to embedded newlines. If you want to treat ^ and $ as beginning and end of line in a string that contains multiple lines, you can use the /m option.

  • The dot (.) metacharacter will NOT match newlines by default. You can change this behavior using the /s option.

That last point is the tricky one. Take this pattern, which uses .* to extract the rest of a line after an initial “From:” heading:

/From: (.*)/

That pattern will search for the characters “From:”, and then fill $1 with the rest of the line. Normally, with a string that ends at the end of the line, this would work fine. If the string goes onto multiple lines, however, this pattern will match only up to the first newline (\n). The dot character, by default, does not match newlines.

You could get around this by changing the pattern to be one or more words or whitespace, avoiding the use of dot altogether, but that's a lot of extra work. What you want is the /s option at the end of your pattern, which tells Perl to allow dot to match \n like any other character. Using /s does not change any other pattern-matching behavior; ^ and $ continue to behave as beginning and end of string, respectively.
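To see the difference, you can run the same pattern with and without /s against a string that spans two lines. This is a small sketch; the sample string and the variable names are made up for the example:

```perl
use strict;
use warnings;

my $header = "From: Laura\nLemay\n";    # made-up sample string

my ($without_s) = $header =~ /From: (.*)/;     # dot stops at the first newline
my ($with_s)    = $header =~ /From: (.*)/s;    # dot also matches newlines

print "without /s: [$without_s]\n";
print "with /s:    [$with_s]\n";
```

Without /s, the capture holds only Laura. With /s, it holds everything up to the end of the string, embedded newlines included.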

If your regular expression contains the ^ or $ characters, you might want to treat strings differently if they stretch over multiple lines. By default, ^ and $ refer to the beginning and end of the string, and ignore newlines altogether. If you use the /m option, however, ^ will refer to either the beginning of a string or the beginning of a line (the position just after a \n), and $ will refer to the end of the string or the end of the line (the position just before the \n). In other words, if your string contains four lines of text, ^ will match four times, and similarly for $. Here's an example:

while (/^(\w+)/mg) {
   print "$1\n";
}

This while loop prints the first word of each line in $_, regardless of whether the input contains a single line or multiple lines.
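The same match in a list context returns all the captured words at once, which makes it easy to compare the behavior with and without /m. The sample text and array names below are made up for illustration:

```perl
use strict;
use warnings;

my $text = "alpha one\nbeta two\ngamma three\n";    # made-up three-line string

my @with_m    = $text =~ /^(\w+)/mg;    # ^ matches at the start of every line
my @without_m = $text =~ /^(\w+)/g;     # ^ matches only at the start of the string

print "with /m:    @with_m\n";
print "without /m: @without_m\n";
```

With /m you get alpha, beta, and gamma. Without it you get only alpha, because ^ can match in just one place.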

If you use the /m option, and you really do want to test for the beginning or end of the string, ^ and $ will no longer work for you in that respect. But fear not, Perl provides \A and \Z to refer to the beginning and end of the string, regardless of the state of /m. (\Z also matches just before a newline that ends the string; use \z if you need the absolute end.)
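As a quick illustration, the following sketch applies the line anchors and the string anchors to the same three-line string, all with /m turned on. The sample text and array names are made up for the example:

```perl
use strict;
use warnings;

my $text = "first line\nsecond row\nlast word\n";    # made-up sample string

my @line_starts  = $text =~ /^(\w+)/mg;     # first word of every line
my @string_start = $text =~ /\A(\w+)/mg;    # only the first word of the string
my @line_ends    = $text =~ /(\w+)$/mg;     # last word of every line
my @string_end   = $text =~ /(\w+)\Z/mg;    # only the last word of the string

print "^  : @line_starts\n";
print "\\A : @string_start\n";
print "\$  : @line_ends\n";
print "\\Z : @string_end\n";
```

Even with /m in effect, \A finds only first and \Z finds only word, while ^ and $ find one match per line.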

You can use both the /s and /m options together, of course, and they will coexist happily. Just keep in mind that /s affects how dot behaves, and /m affects ^ and $, and you should be fine. Beyond that, embedded newlines in strings are no problem for pattern matching.
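Here is one way the two options might be combined, as a sketch with a made-up message string. The /m option lets ^ find a heading that starts on the second line, and /s lets dot carry on past the newlines that follow it:

```perl
use strict;
use warnings;

my $message = "From: Laura\nSubject: Perl\nregex notes\n";    # made-up sample

my ($line_only) = $message =~ /^Subject: (.*)$/m;    # /m alone: rest of that line
my ($to_end)    = $message =~ /^Subject: (.*)/ms;    # /m and /s: rest of the string
my $found_no_m  = ($message =~ /^Subject: (.*)/s) ? 1 : 0;    # /s alone: ^ is start of string only

print "line only: [$line_only]\n";
print "to end:    [$to_end]\n";
print "matched without /m: $found_no_m\n";
```

The third match fails because, without /m, ^ can match only at the very start of the string, where the text is From: rather than Subject:.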
