You need a regex that matches here
documents in source files for a scripting language in which
a here document can be started with <<
followed by a word. The word may have
single or double quotes around it. The here document ends when that word
appears at the very start of a line, without any quotes, using the same
case.
<<(["']?)([A-Za-z]+)\b\1.*?^\2\b
Regex options: Dot matches line breaks, ^ and $ match at line breaks
Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python, Ruby
<<(["']?)([A-Za-z]+)\b\1[\s\S]*?^\2\b
Regex options: ^ and $ match at line breaks
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
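As a minimal sketch of how the first regex might be applied, the following Python snippet compiles it with the two options listed above and collects each here document. The sample script text and the helper name find_heredocs are invented for illustration.

```python
import re

# Pattern from the solution. DOTALL lets the dot span line breaks,
# MULTILINE lets ^ match after each line break.
HEREDOC = re.compile(
    r"""<<(["']?)([A-Za-z]+)\b\1.*?^\2\b""",
    re.DOTALL | re.MULTILINE,
)

# Hypothetical sample script, invented for illustration.
SAMPLE = 'print <<EOF\nHello\nWorld\nEOF\nprint <<"END"\nBye\nEND\n'

def find_heredocs(source):
    """Return (terminating word, full matched text) for each here document."""
    return [(m.group(2), m.group(0)) for m in HEREDOC.finditer(source)]

heredocs = find_heredocs(SAMPLE)
for word, text in heredocs:
    print(word, repr(text))
```

Each match runs from the opening << through the terminating word, so the body of the here document sits between the two.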
This regex may look a bit cryptic, but it is very straightforward.
‹<<› simply matches <<. ‹(["']?)› then matches an
optional single or double quote. The parentheses form a capturing group
to store the quote, or the lack thereof. It is important that the
quantifier ‹?› is inside
the group rather than outside of it, so that the group always
participates in the match. If we made the group itself optional, the
group would not participate in the match when no quote can be matched,
and a backreference to that group would fail to match.
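The difference can be checked with a short experiment. This sketch uses Python's re module, where a backreference to a group that did not participate in the match fails. Only the opening part of the regex is used, and the variable names are illustrative.

```python
import re

# Quantifier inside the group: group 1 always participates,
# capturing the empty string when there is no quote.
inside = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1""")

# Quantifier outside the group: group 1 does not participate
# when there is no quote, so \1 has nothing to refer to.
outside = re.compile(r"""<<(["'])?([A-Za-z]+)\b\1""")

unquoted = "<<EOF"
inside_match = inside.match(unquoted)
outside_match = outside.match(unquoted)

print(inside_match is not None)  # matched, with an empty group 1
print(outside_match is None)     # the backreference failed
```

Both variants behave the same when the word is quoted, as in <<'EOF'. They differ only when the quote is absent.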
The capturing group with character class ‹([A-Za-z]+)› matches a word
and stores it into the second backreference. The word boundary ‹\b›
makes sure we match the entire word after ‹<<›. If we were to omit the
word boundary, the regex engine would backtrack. It would try to match
the word partially if the backreference ‹\2› cannot be matched. We do
not need a word boundary before the word, because ‹<<(["']?)› already
makes sure there is a nonword character before the word.
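To see the partial match that the word boundary prevents, consider a here document whose terminator never appears while a prefix of it does. The sketch below compares the regex with and without the ‹\b› after the word. The input string is a made-up example.

```python
import re

FLAGS = re.DOTALL | re.MULTILINE

with_boundary = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1.*?^\2\b""", FLAGS)
# Same regex, but without the word boundary after the opening word.
no_boundary = re.compile(r"""<<(["']?)([A-Za-z]+)\1.*?^\2\b""", FLAGS)

# Hypothetical input: END never appears on its own line, but EN does.
unterminated = "<<END\ntext\nEN\n"

strict = with_boundary.search(unterminated)
loose = no_boundary.search(unterminated)

print(strict)          # no match: the here document is unterminated
print(loose.group(2))  # the engine backtracked to the partial word
```

Without the boundary, the engine gives up the final D of END and treats EN as the word, which then matches the line holding EN.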
‹\1› is a backreference to the first capturing group. This group will
hold the quote if we matched one; otherwise, the group holds the empty
string. Thus ‹\1› matches the same quote matched by the capturing
group. ‹\1› has no effect if the capturing group holds the empty
string.
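A brief check of the quote handling, again in Python and with invented sample strings: the helper opening_quote (a hypothetical name) reports what the first group captured, or None when the regex does not match at all.

```python
import re

HEREDOC = re.compile(
    r"""<<(["']?)([A-Za-z]+)\b\1.*?^\2\b""",
    re.DOTALL | re.MULTILINE,
)

def opening_quote(source):
    """Return the quote captured by group 1, or None if nothing matches."""
    match = HEREDOC.search(source)
    return None if match is None else match.group(1)

bare = opening_quote("<<EOF\nx\nEOF\n")           # no quote: group 1 is empty
single = opening_quote("<<'EOF'\nx\nEOF\n")       # group 1 holds the single quote
mismatched = opening_quote("<<\"EOF'\nx\nEOF\n")  # opening " but closing '

print(repr(bare), repr(single), repr(mismatched))
```

The mismatched case fails because the backreference requires the closing quote to be the same character as the opening one.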
‹.*?› matches any
amount of text. We turned on the option “dot matches line breaks” to
allow it to span multiple lines. JavaScript does not have that option,
and so for JavaScript we use ‹[\s\S]*?› to match the text. Either way, the
question mark makes the asterisk lazy, telling it to match as few
characters as possible. The here document should end at the first
occurrence of the terminating word rather than the last occurrence. The
file may have multiple here documents using the same terminating word,
and the lazy quantifier makes sure we match each here document
separately.
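The effect of the lazy quantifier shows up as soon as a file reuses a terminating word. This sketch contrasts the lazy regex with a greedy variant that differs only in dropping the question mark. The source text is a made-up example.

```python
import re

FLAGS = re.DOTALL | re.MULTILINE

lazy = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1.*?^\2\b""", FLAGS)
# Greedy variant for comparison only: the dot-star grabs as much as it can.
greedy = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1.*^\2\b""", FLAGS)

# Hypothetical source with two here documents sharing the word EOF.
source = "<<EOF\none\nEOF\nmiddle\n<<EOF\ntwo\nEOF\n"

lazy_matches = [m.group(0) for m in lazy.finditer(source)]
greedy_matches = [m.group(0) for m in greedy.finditer(source)]

print(len(lazy_matches))    # each here document matched separately
print(len(greedy_matches))  # one match swallowing both
```

The greedy variant runs from the first << to the last EOF, taking the unrelated line between the two here documents with it.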
‹^› matches
at the start of any line because we turned on the option to make the
caret and dollar match at line breaks. Ruby does not have this option.
Because the caret and dollar always match at line breaks in Ruby, this
does not change our solution. There is just one less option to
set.
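In a flavor where the option does have to be set, leaving it off makes the regex fail outright. The following Python sketch, with an invented one-document sample, shows the regex with and without re.MULTILINE.

```python
import re

PATTERN = r"""<<(["']?)([A-Za-z]+)\b\1.*?^\2\b"""
source = "<<EOF\nbody\nEOF\n"

# Without MULTILINE, ^ matches only at the very start of the string,
# which the regex can never reach after consuming <<.
without_multiline = re.search(PATTERN, source, re.DOTALL)
with_multiline = re.search(PATTERN, source, re.DOTALL | re.MULTILINE)

print(without_multiline)
print(repr(with_multiline.group(0)))
```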
‹\2› is a
backreference to the second capturing group. This group holds the word
we matched at the start of the here document. Because the here document
syntax of our scripting language is case sensitive, our regex needs to
be case sensitive too. That’s why we used ‹[A-Za-z]+› to match the
word rather than using ‹[a-z]+› or ‹[A-Z]+› and turning on case
insensitivity. Backreferences also become case insensitive when the case
insensitivity option is turned on.
Finally, another word boundary ‹\b› makes sure that the regex stops
only if ‹\2› matched the word on its own, rather than as part of a
longer word. We do not need a word boundary before ‹\2›, as the caret
has already made sure the word is at the start of the line. Whenever
‹\2› or the final ‹\b› fail to match, the regex engine will backtrack
and let ‹.*?› match more characters.
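The role of the final word boundary can be illustrated with a body line that merely begins with the terminating word. This sketch compares the regex with and without the trailing ‹\b›. The sample text is made up.

```python
import re

FLAGS = re.DOTALL | re.MULTILINE

with_final = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1.*?^\2\b""", FLAGS)
# Same regex, but without the word boundary after the second backreference.
without_final = re.compile(r"""<<(["']?)([A-Za-z]+)\b\1.*?^\2""", FLAGS)

# Hypothetical here document whose first body line starts with "EOFs".
source = "<<EOF\nEOFs are fun\nEOF\n"

full = with_final.search(source).group(0)
truncated = without_final.search(source).group(0)

print(repr(full))       # ends at the real terminator line
print(repr(truncated))  # ends inside the word "EOFs"
```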
Techniques used in the regular expressions in this recipe are discussed in Chapter 2. Recipe 2.3 explains character classes. Recipe 2.4 explains that the dot matches any character. Recipe 2.5 explains anchors such as the caret. Recipe 2.6 explains word boundaries. Recipe 2.9 explains capturing groups, and Recipe 2.10 explains backreferences. Recipe 2.12 explains repetition, and Recipe 2.13 explains how to make quantifiers match as few characters as needed.