Keywords

Problem

You are working with a file format for forms in a software application. The words “end,” “in,” “inline,” “inherited,” “item,” and “object” are reserved keywords in this format.[9] You want a regular expression that matches any of these keywords.

Solution

The basic solution is very straightforward and works with all regex flavors in this book:

\b(?:end|in|inline|inherited|item|object)\b
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

We can optimize the regular expression for regex flavors that support atomic grouping:

\b(?>end|in(?:line|herited)?|item|object)\b
Regex options: Case insensitive
Regex flavors: .NET, Java, PCRE, Perl, Python 3.11+, Ruby

Discussion

Matching a word from a list of words is very easy with a regular expression. We simply use alternation to match any one of the keywords. The word boundaries at the start and the end of the regex make sure we only match entire words. The regex should match inline rather than in when the file contains inline, and it should fail to match when the file contains interesting. Because alternation has the lowest precedence of all regex operators, we have to put the list of keywords inside a group. Here we used a noncapturing group for efficiency. When using this regex as part of a larger regular expression, you may want to use a capturing group instead, so you can determine whether the regex matched a keyword or something else.
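As a minimal sketch of the basic solution, the following Python snippet applies the first regex with the word boundaries in place. The names KEYWORD_RE and find_keywords are illustrative only and are not part of the recipe.

```python
import re

# Keywords from the recipe; \b on both sides restricts matches to whole words.
KEYWORD_RE = re.compile(
    r"\b(?:end|in|inline|inherited|item|object)\b",
    re.IGNORECASE,
)

def find_keywords(text):
    """Return every whole-word keyword found in text, in order of appearance."""
    # The group is noncapturing, so findall() returns the full matches.
    return KEYWORD_RE.findall(text)

print(find_keywords("inline item"))   # "inline" is matched as a whole, not as "in"
print(find_keywords("interesting"))   # no keyword appears as a whole word here
```

Swapping the noncapturing group for a capturing one would make findall() return the group contents instead, which is the variant to prefer when this pattern is one alternative in a larger regex.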

We can optimize this regular expression when using regular expression flavors that support atomic grouping. When the first regex from the Solution section encounters the word interesting, the in alternative will match. After that, the word boundary at the end of the regex will fail to match. The regex engine will then backtrack, fruitlessly attempting the remaining alternatives.

By putting the alternatives inside an atomic group, we prevent the regex from backtracking after the second \b fails to match. This allows the regex to fail faster.

Because the regex won’t backtrack, we have to make sure no backtracking is required to match any of our keywords. When the first regex encounters inline, it will first match in. The second word boundary then fails. The regex engine backtracks to match inline, at which point the word boundary, and thus the whole regex, can find their match. Because this backtracking won’t work with the atomic group, we changed in|inline|inherited from the first regex into in(?:line|herited)? in the second regex. The first regex attempts to match in, inline, and inherited in that order, because alternation is eager. The second regex matches inline or inherited if it can because the quantifier is greedy, and matches in otherwise. Only after inline, inherited, or in has been matched will the second regex proceed with the word boundary. If the word boundary cannot be matched, there is no point in trying any of the other alternatives, which we expressed with the atomic group.
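The effect of the atomic group on alternative order can be checked with a short experiment. This is a sketch that assumes Python 3.11 or later, the first version whose re module accepts (?>...); the names NAIVE, REORDERED, and first_match are made up for illustration.

```python
import re
import sys

# Assumption about the runtime: Python's re module only accepts
# atomic groups, (?>...), from version 3.11 onward.
ATOMIC_SUPPORTED = sys.version_info >= (3, 11)

# The original alternative order, wrapped in an atomic group without changes.
NAIVE = r"\b(?>end|in|inline|inherited|item|object)\b"
# The recipe's optimized regex: "in" followed by an optional greedy suffix.
REORDERED = r"\b(?>end|in(?:line|herited)?|item|object)\b"

def first_match(pattern, text):
    """Return the first match of pattern in text, or None if there is none."""
    m = re.search(pattern, text, re.IGNORECASE)
    return m.group(0) if m else None

if ATOMIC_SUPPORTED:
    for word in ("in", "inline", "inherited", "interesting"):
        print(word, first_match(NAIVE, word), first_match(REORDERED, word))
```

With the unchanged alternative order, the atomic group commits to in and can no longer reach inline or inherited, which is why the second regex folds the three alternatives into one.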

Variations

Matching just the keywords may not be sufficient. The form file format won’t treat these words as reserved keywords when they appear in single-quoted strings. If the form contains a control that has a caption with the text “The end is near,” that will be stored in the file this way:

object Button1: TButton
    Caption = 'The end is near'
end

In this snippet, the second occurrence of end is a keyword, but the first occurrence is not. We need a more complex solution if we only want to treat the second occurrence of end as a keyword.

There is no easy way to make our regex match keywords only when they appear outside of strings. But we can easily make our regex match both keywords and strings.

\b(end|in|inline|inherited|item|object)\b|'[^'\r\n]*(?:''[^'\r\n]*)*'
Regex options: Case insensitive
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

When this regex encounters a single quote, it will match the whole string up to the next single quote. The next match attempt then begins after the string. This way, the regex does not separately match keywords when they appear inside strings. The whole string will be matched instead. In the previous sample, this regular expression will first match object, then 'The end is near', and finally end at the end of the sample.

To be able to determine whether the regex matched a keyword or a string, we’re now using a capturing group rather than a noncapturing group for the list of keywords. When the regex matches a keyword, it will be held by the first (and only) capturing group. When the regex matches a string, the first capturing group will be blank, as it didn’t participate in the match.
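The following Python sketch shows one way to tell the two kinds of match apart by testing the capturing group. TOKEN_RE, classify, and SAMPLE are illustrative names, not part of the recipe; the sample text is the form snippet shown earlier.

```python
import re

TOKEN_RE = re.compile(
    r"\b(end|in|inline|inherited|item|object)\b"  # group 1: a keyword
    r"|'[^'\r\n]*(?:''[^'\r\n]*)*'",               # or a whole single-quoted string
    re.IGNORECASE,
)

def classify(text):
    """Return a list of (kind, lexeme) pairs; kind is 'keyword' or 'string'."""
    result = []
    for m in TOKEN_RE.finditer(text):
        # group(1) is None when the string alternative matched,
        # because the capturing group did not participate in the match.
        kind = "keyword" if m.group(1) is not None else "string"
        result.append((kind, m.group(0)))
    return result

SAMPLE = "object Button1: TButton\n    Caption = 'The end is near'\nend\n"
for kind, lexeme in classify(SAMPLE):
    print(kind, lexeme)
```

In Python a nonparticipating group is reported as None rather than an empty string, so the check uses "is not None". Other flavors report this differently.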

If you’ll be constructing a parser as explained in Construct a Parser, then you will always combine the keyword regex with the string regex and the regexes for all the other tokens in the file format you’re dealing with. You will use the same technique as we used for keywords and strings here. Your regex will simply have many more alternatives to cover the whole syntax of your file format. That will automatically deal with keywords appearing inside of strings.

When matching keywords in other file formats or programming languages, the word boundaries may not be sufficient. In many languages, $end is a variable, even when end is a keyword. In that case, the word boundaries alone cannot keep the regex from matching a keyword that is really part of a variable name. \bend\b matches end in $end. The dollar sign is not a word character, but a letter is, so \b matches between the dollar sign and a letter.

You can solve this with lookaround. (?<![$\w])(?:end|in|inline|inherited|item|object)\b uses negative lookbehind to make sure the keyword is not preceded by a dollar sign. The negative lookbehind includes \w, and we still have word boundary \b at the end to make sure the keyword is not part of a longer word.
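A small Python comparison illustrates the difference between the two patterns. The names NAIVE_RE, GUARDED_RE, keyword_starts, and SOURCE are hypothetical, and the input line is an invented example of a language in which $end is a variable.

```python
import re

# Word boundaries only: also finds "end" inside "$end".
NAIVE_RE = re.compile(
    r"\b(?:end|in|inline|inherited|item|object)\b",
    re.IGNORECASE,
)
# Negative lookbehind: the keyword must not follow "$" or a word character.
GUARDED_RE = re.compile(
    r"(?<![$\w])(?:end|in|inline|inherited|item|object)\b",
    re.IGNORECASE,
)

def keyword_starts(regex, text):
    """Return the start offset of every match of regex in text."""
    return [m.start() for m in regex.finditer(text)]

SOURCE = "$end = 1; end"
print(keyword_starts(NAIVE_RE, SOURCE))
print(keyword_starts(GUARDED_RE, SOURCE))
```

The lookbehind here contains a single character class, so it has a fixed width of one character, which Python's re module requires of lookbehind.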

See Also

Chapter 2 discusses the techniques used in the regular expressions in this recipe. Recipe 2.6 explains word boundaries, and Recipe 2.8 explains alternation, which we used to match the keywords. Recipe 2.14 explains the atomic group, and Recipe 2.12 explains the quantifier we used to optimize the regular expression. Recipe 2.16 explains lookaround.



[9] This recipe gets its inspiration from Delphi form files, which use these exact keywords, except for “in,” which we added here to illustrate some pitfalls.