You are working with a file format for forms in a software application. The words “end,” “in,” “inline,” “inherited,” “item,” and “object” are reserved keywords in this format.[9] You want a regular expression that matches any of these keywords.
The basic solution is very straightforward and works with all regex flavors in this book:
\b(?:end|in|inline|inherited|item|object)\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
We can optimize the regular expression for regex flavors that support atomic grouping:
\b(?>end|in(?:line|herited)?|item|object)\b
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python 3.11 and later, Ruby
Matching a word from a list of words is very easy with a regular expression. We simply use alternation to match any one of the keywords. The word boundaries at the start and the end of the regex make sure we only match entire words. The regex should match “inline” rather than “in” when the file contains “inline”, and it should fail to match when the file contains “interesting”. Because alternation has the lowest precedence of all regex operators, we have to put the list of keywords inside a group. Here we used a noncapturing group for efficiency. When using this regex as part of a larger regular expression, you may want to use a capturing group instead, so you can determine whether the regex matched a keyword or something else.
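The behavior described above can be checked with a short Python sketch. The helper name `find_keywords` is made up for illustration; the pattern is the basic solution with a word boundary on each side.

```python
import re

# Basic solution: alternation inside a noncapturing group, wrapped in
# word boundaries so that only whole words are matched.
KEYWORD_RE = re.compile(
    r"\b(?:end|in|inline|inherited|item|object)\b",
    re.IGNORECASE,
)

def find_keywords(text):
    """Return every reserved keyword in text, in order of appearance."""
    return KEYWORD_RE.findall(text)

print(find_keywords("inline"))            # ['inline'], not ['in']
print(find_keywords("interesting"))       # []
print(find_keywords("Object Item1 END"))  # ['Object', 'END']
```

“Item1” is not reported because the digit is a word character, so the trailing word boundary cannot match after “Item”.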
We can optimize this regular expression when using regular expression flavors that support atomic grouping. When the first regex from the Solution section encounters the word “interesting”, the ‹in› alternative will match. After that, the word boundary at the end of the regex will fail to match. The regex engine will then backtrack, fruitlessly attempting the remaining alternatives.
By putting the alternatives inside an atomic group, we prevent the regex from backtracking after the second ‹\b› fails to match. This allows the regex to fail faster.
Because the regex won’t backtrack, we have to make sure no backtracking is required to match any of our keywords. When the first regex encounters “inline”, it will first match “in”. The second word boundary then fails. The regex engine backtracks to match “inline”, at which point the word boundary, and thus the whole regex, can find their match. Because this backtracking won’t work with the atomic group, we changed ‹in|inline|inherited› from the first regex into ‹in(?:line|herited)?› in the second regex. The first regex attempts to match “in”, “inline”, and “inherited” in that order, because alternation is eager. The second regex matches “inline” or “inherited” if it can because the quantifier is greedy, and matches “in” otherwise. Only after “inline”, “inherited”, or “in” has been matched will the second regex proceed with the word boundary. If the word boundary cannot be matched, there is no point in trying any of the other alternatives, which we expressed with the atomic group.
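The following Python sketch illustrates why the alternatives had to be rewritten before they could go inside an atomic group. It rests on two assumptions: Python's `re` module accepts `(?>...)` only from version 3.11 onward, so on older versions the hypothetical helper `build` swaps in a lookahead followed by a backreference, which behaves the same way here because the engine never backtracks into a completed lookahead. The names `build`, `words`, `optimized`, and `naive` are made up for this example.

```python
import re
import sys

def build(alternatives):
    """Wrap the alternatives in an atomic group between word boundaries."""
    if sys.version_info >= (3, 11):
        # Native atomic group, as in the second regex of the Solution.
        body = r"(?>" + alternatives + r")"
    else:
        # Assumption: emulate atomicity with lookahead plus backreference.
        body = r"(?=(" + alternatives + r"))\1"
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)

# Alternatives rewritten so that no backtracking is ever needed.
optimized = build(r"end|in(?:line|herited)?|item|object")
# Original ordering: "in" wins first and the group cannot be re-entered.
naive = build(r"end|in|inline|inherited|item|object")

def words(regex, text):
    """Return the full text of every match."""
    return [m.group(0) for m in regex.finditer(text)]

sample = "in inline inherited interesting"
print(words(optimized, sample))  # ['in', 'inline', 'inherited']
print(words(naive, sample))      # ['in']
```

With the original ordering, the atomic group commits to “in” and the trailing word boundary fails inside “inline” and “inherited”, so those keywords are silently missed.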
Matching just the keywords may not be sufficient. The form file format won’t treat these words as reserved keywords when they appear in single-quoted strings. If the form contains a control that has a caption with the text “The end is near,” that will be stored in the file this way:
object Button1: TButton
  Caption = 'The end is near'
end
In this snippet, the second occurrence of “end” is a keyword, but the first occurrence is not. We need a more complex solution if we only want to treat the second occurrence of “end” as a keyword.
There is no easy way to make our regex match keywords only when they appear outside of strings. But we can easily make our regex match both keywords and strings.
\b(end|in|inline|inherited|item|object)\b|'[^'\r\n]*(?:''[^'\r\n]*)*'
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
When this regex encounters a single quote, it will match the whole string up to the next single quote. The next match attempt then begins after the string. This way, the regex does not separately match keywords when they appear inside strings. The whole string will be matched instead. In the previous sample, this regular expression will first match “object”, then the string 'The end is near', and finally “end” at the end of the sample.
To be able to determine whether the regex matched a keyword or a string, we’re now using a capturing group rather than a noncapturing group for the list of keywords. When the regex matches a keyword, it will be held by the first (and only) capturing group. When the regex matches a string, the first capturing group will be blank, as it didn’t participate in the match.
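As a minimal sketch of how the capturing group separates the two cases, the Python code below classifies each match. It assumes the string part of the regex excludes single quotes and line breaks (`[^'\r\n]`); the function name `classify` and the labels "keyword" and "string" are made up for this example. Note that in Python a group that did not participate in the match is reported as `None` rather than as an empty string.

```python
import re

# Keywords in capturing group 1, or a whole single-quoted string.
# A doubled quote ('') inside a string is treated as an escaped quote.
KEYWORD_OR_STRING_RE = re.compile(
    r"\b(end|in|inline|inherited|item|object)\b"
    r"|'[^'\r\n]*(?:''[^'\r\n]*)*'",
    re.IGNORECASE,
)

def classify(text):
    """Return (kind, matched_text) pairs for keywords and strings."""
    result = []
    for m in KEYWORD_OR_STRING_RE.finditer(text):
        if m.group(1) is not None:
            # Group 1 participated, so this match is a keyword.
            result.append(("keyword", m.group(1)))
        else:
            # Group 1 is None, so the string alternative matched.
            result.append(("string", m.group(0)))
    return result

sample = "object Button1: TButton\n  Caption = 'The end is near'\nend"
for kind, value in classify(sample):
    print(kind, value)
# keyword object
# string 'The end is near'
# keyword end
```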
If you’ll be constructing a parser as explained in Construct a Parser, then you will always combine the keyword regex with the string regex and the regexes for all the other tokens in the file format you’re dealing with. You will use the same technique as we used for keywords and strings here. Your regex will simply have many more alternatives to cover the whole syntax of your file format. That will automatically deal with keywords appearing inside of strings.
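A rough sketch of that idea in Python is shown here. The `identifier` and `number` alternatives are hypothetical stand-ins for the remaining tokens of a file format; they are not taken from any real specification, and a real parser would need more token types and error handling.

```python
import re

# One alternative per token type. Keywords come before identifiers so
# that reserved words are not reported as plain identifiers.
TOKEN_RE = re.compile(
    r"""
    (?P<keyword>\b(?:end|in|inline|inherited|item|object)\b)
  | (?P<string>'[^'\r\n]*(?:''[^'\r\n]*)*')
  | (?P<identifier>[A-Za-z_]\w*)   # hypothetical token type
  | (?P<number>\d+)                # hypothetical token type
    """,
    re.IGNORECASE | re.VERBOSE,
)

def tokenize(text):
    """Return (token_type, token_text) pairs; unmatched text is skipped."""
    return [(m.lastgroup, m.group(0)) for m in TOKEN_RE.finditer(text)]

sample = "object Button1: TButton\n  Left = 8\n  Caption = 'The end is near'\nend"
for token in tokenize(sample):
    print(token)
```

The word “end” inside the caption never reaches the keyword alternative, because the string alternative consumes the whole quoted text in a single match.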
When matching keywords in other file formats or programming languages, the word boundaries may not be sufficient. In many languages, “$end” is a variable, even when “end” is a keyword. In that case, the word boundaries are not sufficient to make sure that you’re not matching keywords that aren’t keywords. ‹\bend\b› matches “end” in “$end”. The dollar sign is not a word character, but a letter is. ‹\b› matches between the dollar sign and a letter.
You can solve this with lookaround. ‹(?<![$\w])(?:end|in|inline|inherited|item|object)\b› uses negative lookbehind to make sure the keyword is not preceded by a dollar sign. The negative lookbehind also includes ‹\w›, so it takes over the job of the leading word boundary and rejects keywords preceded by a word character. We still have the word boundary ‹\b› at the end to make sure the keyword is not part of a longer word.
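The difference between the two approaches can be seen in a short Python sketch. The names `PLAIN_RE` and `GUARDED_RE` are made up for this example, and the sample text is an assumed snippet from a language where a leading dollar sign marks a variable.

```python
import re

KEYWORDS = r"(?:end|in|inline|inherited|item|object)"

# Word boundaries only: "$end" still yields a keyword match.
PLAIN_RE = re.compile(r"\b" + KEYWORDS + r"\b", re.IGNORECASE)

# Negative lookbehind rejects a preceding "$" or word character.
GUARDED_RE = re.compile(r"(?<![$\w])" + KEYWORDS + r"\b", re.IGNORECASE)

text = "$end = 1 end"
print(PLAIN_RE.findall(text))    # ['end', 'end']
print(GUARDED_RE.findall(text))  # ['end']
```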
Chapter 2 discusses the techniques used in the regular expressions in this recipe. Recipe 2.6 explains word boundaries, and Recipe 2.8 explains alternation, which we used to match the keywords. Recipe 2.14 explains the atomic group, and Recipe 2.12 explains the quantifier we used to optimize the regular expression. Recipe 2.16 explains lookaround.
[9] This recipe gets its inspiration from Delphi form files, which use these exact keywords, except for “in,” which we added here to illustrate some pitfalls.