You are developing a syntax coloring scheme for your
favorite text editor. You need a regular expression that matches any of
the characters that can be used as operators in the programming language
for which you’re creating the scheme: -, +, *, /, =, <, >, %, &, ^, |, !, ~, and ?. The
regex doesn’t need to check whether the combination of characters forms
a valid operator. That is not a job for a syntax coloring scheme;
instead, it should simply highlight all operator characters as
such.
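As a minimal sketch of this requirement, a single character class can match each listed operator character on its own. The example assumes Python's `re` module; the sample line and the variable names are hypothetical and serve only as illustration.

```python
import re

# One character class listing every operator character from the problem
# statement. The hyphen comes first so that it is treated as a literal.
OPERATOR_CHAR = re.compile(r"[-+*/=<>%&^|!~?]")

sample = "a += b << 2 ? c : !d"  # hypothetical source line

# Each operator character is matched individually, with no check on
# whether adjacent characters combine into a valid operator.
operator_chars = OPERATOR_CHAR.findall(sample)
print(operator_chars)  # ['+', '=', '<', '<', '?', '!']
```

The colon is not highlighted because it is not among the listed operator characters.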
If you read Recipe 2.3, the solution is obvious. You may wonder why we included this as a separate recipe.
The focus of this chapter is on regular expressions that will be used in larger systems, such as syntax coloring schemes. Such systems will often combine regular expressions using alternation. That can lead to unexpected pitfalls that may not be obvious when you see a regular expression in isolation.
One pitfall is that a system using this regular expression will
likely have other regular expressions that match the same characters.
Many programming languages use / as the division operator and // to start a comment. If you
combine the regular expression from this recipe with the one from Single-Line Comments into ‹(?<operator>[-+*/=<>%&^|!~?])|(?<comment>//.*)›,
then you will find that your system never matches any comments. All
forward slashes will be matched as operators.
The solution is to reverse the alternatives: ‹(?<comment>//.*)|(?<operator>[-+*/=<>%&^|!~?])›.
This regex will always match two adjacent forward slashes as a
single-line comment. It will not attempt to match any operators until
the first half of the regex has failed to match a single-line comment.
If you have an application that combines multiple regular expressions,
such as a text editor with regex-based syntax coloring, you will need to
know the order in which the application combines the regular
expressions.
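The effect of the order of the alternatives can be checked with a short script. This is a sketch under stated assumptions: it uses Python's `re` module, which spells named groups `(?P<name>...)` instead of `(?<name>...)`, and the `tokenize` helper and the sample line are hypothetical names introduced for illustration.

```python
import re

# Python writes named groups as (?P<name>...) rather than (?<name>...).
OPERATOR_FIRST = re.compile(r"(?P<operator>[-+*/=<>%&^|!~?])|(?P<comment>//.*)")
COMMENT_FIRST = re.compile(r"(?P<comment>//.*)|(?P<operator>[-+*/=<>%&^|!~?])")


def tokenize(pattern, text):
    """Return (group name, matched text) pairs for every match in text."""
    return [(m.lastgroup, m.group()) for m in pattern.finditer(text)]


line = "x = a / b // ratio"  # hypothetical source line

# Operators first: every slash is consumed as an operator, so the
# "comment" alternative never gets a chance to match.
operator_first = tokenize(OPERATOR_FIRST, line)
print(operator_first)
# [('operator', '='), ('operator', '/'), ('operator', '/'), ('operator', '/')]

# Comments first: the two adjacent slashes start a comment, while the
# lone slash is still matched as an operator.
comment_first = tokenize(COMMENT_FIRST, line)
print(comment_first)
# [('operator', '='), ('operator', '/'), ('comment', '// ratio')]
```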
Another pitfall is that you may try to be clever and “optimize”
your regex by adding a quantifier after the character class: ‹[-+*/=<>%&^|!~?]+›.
Because the syntax coloring scheme needs to highlight all operator
characters, it should be more efficient to highlight all successive
operator characters in one go. And it would be if highlighting operators
were the scheme’s only task. But it will fail in some situations, even
when the regular expressions are combined in the order we determined to
be correct in the previous paragraph: ‹(?<comment>//.*)|(?<operator>[-+*/=<>%&^|!~?]+)›.
This regex will correctly highlight operators and single-line comments,
unless the single-line comment is immediately preceded by an operator.
When the regex encounters !//bang, the “comment” alternative will
fail to match the !. The regex then tries the “operator”
alternative. This will match not just !; instead, it will match all of
!// because the ‹+› after the character
class makes it match as many operator characters as it can. After this
match has been found, the regex will be attempted again on bang. The regex fails to
match because the characters that started the comment have already been
consumed by the previous match.
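The failure on !//bang can be reproduced in a few lines. This sketch assumes Python's `re` module, so the named groups are written `(?P<name>...)`; the variable names are hypothetical.

```python
import re

# The "optimized" regex with a + quantifier after the character class,
# using Python's (?P<name>...) spelling for named groups.
GREEDY = re.compile(r"(?P<comment>//.*)|(?P<operator>[-+*/=<>%&^|!~?]+)")

greedy_tokens = [(m.lastgroup, m.group()) for m in GREEDY.finditer("!//bang")]

# The operator alternative swallows the two slashes along with the !,
# so nothing is left to start a comment and "bang" is not matched at all.
print(greedy_tokens)  # [('operator', '!//')]
```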
If we leave off the quantifier and use ‹(?<comment>//.*)|(?<operator>[-+*/=<>%&^|!~?])›,
the operator part of the regex will only match ! when encountering
!//bang. The next match attempt will then see //bang, which will be matched by the
“comment” alternative in the regex.
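A sketch of how a highlighter might consume these matches, assuming Python's `re` module with its `(?P<name>...)` named groups. The `highlight_spans` helper is a hypothetical name, not part of any editor's API; it returns the group name together with the start and end offsets of each match.

```python
import re

# Comment alternative first, and no quantifier on the operator class.
TOKEN = re.compile(r"(?P<comment>//.*)|(?P<operator>[-+*/=<>%&^|!~?])")


def highlight_spans(text):
    """Return (kind, start, end) for every operator character or comment."""
    return [(m.lastgroup, m.start(), m.end()) for m in TOKEN.finditer(text)]


spans = highlight_spans("!//bang")

# The ! is matched alone; the next attempt starts at // and matches the
# rest of the line as a comment.
print(spans)  # [('operator', 0, 1), ('comment', 1, 7)]
```

Successive operator characters still end up highlighted; they are simply reported as separate one-character matches rather than one combined match.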