Here Documents

Problem

You need a regex that matches here documents in source files for a scripting language in which a here document can be started with << followed by a word. The word may have single or double quotes around it. The here document ends when that word appears at the very start of a line, without any quotes, using the same case.

Solution

<<(["']?)([A-Za-z]+)\b\1.*?^\2\b
Regex options: Dot matches line breaks, ^ and $ match at line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
<<(["']?)([A-Za-z]+)\b\1[\s\S]*?^\2\b
Regex options: ^ and $ match at line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
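
As a minimal sketch of how the first regex might be applied, the following Python snippet compiles it with the two options listed above and runs it over a small sample. The name HEREDOC and the sample text are hypothetical and chosen only for illustration.

```python
import re

# The recipe's regex. re.DOTALL is "dot matches line breaks";
# re.MULTILINE is "^ and $ match at line breaks".
HEREDOC = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1.*?^\2\b""",
                     re.DOTALL | re.MULTILINE)

# Hypothetical source text containing two here documents.
source = (
    "cat <<EOF\n"
    "first body\n"
    "EOF\n"
    "print <<'END'\n"
    "second body\n"
    "END\n"
)

matches = [m.group(0) for m in HEREDOC.finditer(source)]
words = [m.group(2) for m in HEREDOC.finditer(source)]

print(words)  # ['EOF', 'END']
```

Each match runs from << up to and including the terminating word; group 2 holds that word.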

Discussion

This regex may look a bit cryptic, but it is very straightforward. << simply matches <<. (["']?) then matches an optional single or double quote. The parentheses form a capturing group to store the quote, or the lack thereof. It is important that the quantifier ? is inside the group rather than outside of it, so that the group always participates in the match. If we made the group itself optional, the group would not participate in the match when no quote can be matched, and a backreference to that group would fail to match.
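
The difference between the two placements of the question mark can be checked with a short sketch. It uses only the opening part of the regex and assumes Python's re module, where a backreference to a group that did not participate fails to match; some other flavors behave differently here. The variable names are hypothetical.

```python
import re

# Recipe form: the ? is inside the group, so group 1 always participates.
inside = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1""")
# Variant: the group itself is optional and may not participate at all.
outside = re.compile(r"""<<(["'])?([A-Za-z]+)\b\1""")

unquoted = "<<EOF"
quoted = '<<"EOF"'

inside_unquoted = inside.search(unquoted)    # matches; group 1 is ''
outside_unquoted = outside.search(unquoted)  # None in Python's re
outside_quoted = outside.search(quoted)      # matches; group 1 is '"'
```

With the quantifier inside the group, the unquoted form still matches because the group captures the empty string.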

The capturing group with character class ([A-Za-z]+) matches a word and stores it into the second backreference. The word boundary \b makes sure we match the entire word after <<. If we were to omit the word boundary, the regex engine would backtrack. It would try to match the word partially if the backreference \2 cannot be matched. We do not need a word boundary before the word, because <<(["']?) already makes sure there is a nonword character before the word.
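
The effect of that first word boundary can be seen in a small sketch. The sample text is hypothetical: it opens a here document with the word ENDING but never terminates it, while a line containing only END appears later.

```python
import re

FLAGS = re.DOTALL | re.MULTILINE
with_boundary = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1.*?^\2\b""", FLAGS)
# Same regex, but with the \b after the word removed.
without_boundary = re.compile(r"""<<(["']?)([A-Za-z]+)\1.*?^\2\b""", FLAGS)

text = "<<ENDING\nsome text\nEND\n"

strict = with_boundary.search(text)    # None: ENDING is never terminated
loose = without_boundary.search(text)  # backtracks until group 2 is 'END'
```

Without the boundary, the engine gives up characters from the word until the shortened word END happens to appear at the start of a line, which is not what the here document syntax means.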

\1 is a backreference to the first capturing group. This group will hold the quote if we matched one; otherwise, the group holds the empty string. Thus \1 matches the same quote matched by the capturing group. \1 has no effect if the capturing group holds the empty string.

.*? matches any amount of text. We turned on the option “dot matches line breaks” to allow it to span multiple lines. JavaScript does not have that option, and so for JavaScript we use [\s\S]*? to match the text. Either way, the question mark makes the asterisk lazy, telling it to match as few characters as possible. The here document should end at the first occurrence of the terminating word rather than the last occurrence. The file may have multiple here documents using the same terminating word, and the lazy quantifier makes sure we match each here document separately.
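
A short sketch, using a hypothetical sample with two here documents that share the terminating word EOF, compares the lazy quantifier with a greedy one. It also runs the [\s\S]*? variant without the "dot matches line breaks" option to show that it finds the same matches.

```python
import re

text = "<<EOF\none\nEOF\nmiddle\n<<EOF\ntwo\nEOF\n"

lazy = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1.*?^\2\b""",
                  re.DOTALL | re.MULTILINE)
# Greedy variant, for comparison only: .* instead of .*?
greedy = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1.*^\2\b""",
                    re.DOTALL | re.MULTILINE)
# Second solution: [\s\S] spans line breaks without re.DOTALL.
portable = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1[\s\S]*?^\2\b""",
                      re.MULTILINE)

lazy_matches = [m.group(0) for m in lazy.finditer(text)]
greedy_matches = [m.group(0) for m in greedy.finditer(text)]
portable_matches = [m.group(0) for m in portable.finditer(text)]

print(len(lazy_matches), len(greedy_matches))  # 2 1
```

The greedy version swallows both here documents and the line between them as a single match.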

^ matches at the start of any line because we turned on the option to make the caret and dollar match at line breaks. Ruby does not have this option. Because the caret and dollar always match at line breaks in Ruby, this does not change our solution. There is just one less option to set.
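
In flavors where the option does exist, leaving it off changes the result. The following sketch assumes Python, where the option is re.MULTILINE; the sample text is hypothetical.

```python
import re

pattern = r"""<<(["']?)([A-Za-z]+)\b\1.*?^\2\b"""
text = "<<EOF\nbody\nEOF\n"

# With the option, ^ matches after each line break.
multiline = re.search(pattern, text, re.DOTALL | re.MULTILINE)
# Without it, ^ matches only at the very start of the string,
# which can never follow the << that the regex has already consumed.
single = re.search(pattern, text, re.DOTALL)
```

Here multiline is a match object covering the whole here document, while single is None.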

\2 is a backreference to the second capturing group. This group holds the word we matched at the start of the here document. Because the here document syntax of our scripting language is case sensitive, our regex needs to be case sensitive too. That’s why we used [A-Za-z]+ to match the word rather than using [a-z]+ or [A-Z]+ and turning on case insensitivity. Backreferences also become case insensitive when the case insensitivity option is turned on.
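
To illustrate why case insensitivity would be a problem, this sketch compares the recipe's regex with a variant that uses [a-z]+ together with Python's re.IGNORECASE. The sample text is hypothetical and contains a lowercase eof line inside the here document.

```python
import re

text = "<<EOF\nbody\neof\nmore\nEOF\n"

sensitive = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1.*?^\2\b""",
                       re.DOTALL | re.MULTILINE)
# Variant for comparison: case insensitivity also applies to \2.
insensitive = re.compile(r"""<<(["']?)([a-z]+)\b\1.*?^\2\b""",
                         re.DOTALL | re.MULTILINE | re.IGNORECASE)

sensitive_match = sensitive.search(text).group(0)
insensitive_match = insensitive.search(text).group(0)
```

The case-sensitive regex runs through to the uppercase EOF, while the case-insensitive variant stops too early at the lowercase eof line.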

Finally, another word boundary \b makes sure that the regex stops only if \2 matched the word on its own, rather than as part of a longer word. We do not need a word boundary before \2, as the caret has already made sure the word is at the start of the line. Whenever \2 or the final \b fail to match, the regex engine will backtrack and let .*? match more characters.
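
The role of the final word boundary can be sketched with a hypothetical here document whose body contains a line starting with a longer word, EOFX, that begins with the terminating word.

```python
import re

FLAGS = re.DOTALL | re.MULTILINE
text = "<<EOF\nbody\nEOFX\nmore\nEOF\n"

with_final = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1.*?^\2\b""", FLAGS)
# Variant for comparison: the trailing \b is removed.
without_final = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1.*?^\2""", FLAGS)

full = with_final.search(text).group(0)
early = without_final.search(text).group(0)
```

With the boundary, the regex rejects the EOF inside EOFX and continues to the real terminator; without it, the match ends in the middle of EOFX.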

See Also

Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.4 explains that the dot matches any character. Recipe 2.5 explains anchors such as the caret. Recipe 2.6 explains word boundaries. Recipe 2.9 explains capturing groups, and Recipe 2.10 explains backreferences. Recipe 2.12 explains repetition, and Recipe 2.13 explains how to make quantifiers match as few characters as needed.
