Operators

Problem

You are developing a syntax coloring scheme for your favorite text editor. You need a regular expression that matches any of the characters that can be used as operators in the programming language for which you’re creating the scheme: -, +, *, /, =, <, >, %, &, ^, |, !, ~, and ?. The regex doesn’t need to check whether the combination of characters forms a valid operator. That is not a job for a syntax coloring scheme; instead, it should simply highlight all operator characters as such.

Solution

[-+*/=<>%&^|!~?]
Regex options: None
Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
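
As a minimal sketch of how a coloring scheme might apply this character class, the following Python snippet reports each operator character and its position in a line. The names OPERATOR_CHARS and operator_positions are ours, chosen for illustration; a real editor would hand the pattern to its own highlighting engine rather than call a helper like this.

```python
import re

# The character class from the solution. The hyphen comes first so it is
# taken literally, and the caret is not first so it does not negate the class.
OPERATOR_CHARS = re.compile(r"[-+*/=<>%&^|!~?]")

def operator_positions(line):
    """Return (index, character) pairs for every operator character in line."""
    return [(m.start(), m.group()) for m in OPERATOR_CHARS.finditer(line)]

print(operator_positions("a += b * -c"))
# prints [(2, '+'), (3, '='), (7, '*'), (9, '-')]
```

Note that += is reported as two separate characters. That is consistent with the problem statement: the regex highlights operator characters and does not check whether they form a valid operator.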

Discussion

If you have read Recipe 2.3, the solution is obvious. You may wonder why we included this as a separate recipe.

The focus of this chapter is on regular expressions that will be used in larger systems, such as syntax coloring schemes. Such systems will often combine regular expressions using alternation. That can lead to unexpected pitfalls that may not be obvious when you see a regular expression in isolation.

One pitfall is that a system using this regular expression will likely have other regular expressions that match the same characters. Many programming languages use / as the division operator and // to start a comment. If you combine the regular expression from this recipe with the one from Single-Line Comments into (?<operator>[-+*/=<>%&^|!~?])|(?<comment>//.*), then you will find that your system never matches any comments. All forward slashes will be matched as operators.

The solution is to reverse the alternatives: (?<comment>//.*)|(?<operator>[-+*/=<>%&^|!~?]). This regex will always match two adjacent forward slashes as a single-line comment. It will not attempt to match any operators until the first half of the regex has failed to match a single-line comment. If you have an application that combines multiple regular expressions, such as a text editor with regex-based syntax coloring, you will need to know the order in which the application combines the regular expressions.
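
The effect of the order of the alternatives can be checked with a short Python sketch. Python's re module writes named groups as (?P<name>...) rather than (?<name>...), so the patterns are adapted accordingly. The names OPERATOR_FIRST, COMMENT_FIRST, and tokens are hypothetical helpers for this illustration, not part of any editor's API.

```python
import re

# Same two alternatives, combined in both orders.
OPERATOR_FIRST = re.compile(r"(?P<operator>[-+*/=<>%&^|!~?])|(?P<comment>//.*)")
COMMENT_FIRST = re.compile(r"(?P<comment>//.*)|(?P<operator>[-+*/=<>%&^|!~?])")

def tokens(pattern, text):
    """Return (group name, matched text) for each successive match."""
    return [(m.lastgroup, m.group()) for m in pattern.finditer(text)]

line = "x = a / b // ratio"

print(tokens(OPERATOR_FIRST, line))
# prints [('operator', '='), ('operator', '/'), ('operator', '/'), ('operator', '/')]

print(tokens(COMMENT_FIRST, line))
# prints [('operator', '='), ('operator', '/'), ('comment', '// ratio')]
```

With the operator alternative first, each slash of the comment is consumed as an operator before the comment alternative is ever tried. With the comment alternative first, the lone division slash still falls through to the operator alternative, because //.* cannot match a single slash.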

Another pitfall is that you may try to be clever and “optimize” your regex by adding a quantifier after the character class: [-+*/=<>%&^|!~?]+. Because the syntax coloring scheme needs to highlight all operator characters, it seems more efficient to highlight all successive operator characters in one go. And it would be, if highlighting operators were the scheme’s only task. But it will fail in some situations, even when the regular expressions are combined in the order we determined to be correct in the previous paragraph: (?<comment>//.*)|(?<operator>[-+*/=<>%&^|!~?]+). This regex will correctly highlight operators and single-line comments, unless the single-line comment is immediately preceded by an operator. When the regex encounters !//bang, the “comment” alternative will fail to match the !. The regex then tries the “operator” alternative. This will match not just !; instead, it will match all of !// because the + after the character class makes it match as many operator characters as it can. After this match has been found, the regex will be attempted again on bang. That attempt fails to find a comment because the characters that started the comment have already been consumed by the previous match.

If we leave off the quantifier and use (?<comment>//.*)|(?<operator>[-+*/=<>%&^|!~?]), the operator part of the regex will only match ! when encountering !//bang. The next match attempt will then see //bang, which will be matched by the “comment” alternative in the regex.
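
The !//bang case can be reproduced in a few lines of Python. This is only a sketch under the same assumptions as before: named groups use Python's (?P<name>...) spelling, and WITH_QUANTIFIER, WITHOUT_QUANTIFIER, and tokens are illustrative names of our own.

```python
import re

# Comment alternative first in both patterns; only the quantifier differs.
WITH_QUANTIFIER = re.compile(r"(?P<comment>//.*)|(?P<operator>[-+*/=<>%&^|!~?]+)")
WITHOUT_QUANTIFIER = re.compile(r"(?P<comment>//.*)|(?P<operator>[-+*/=<>%&^|!~?])")

def tokens(pattern, text):
    """Return (group name, matched text) for each successive match."""
    return [(m.lastgroup, m.group()) for m in pattern.finditer(text)]

print(tokens(WITH_QUANTIFIER, "!//bang"))
# prints [('operator', '!//')]

print(tokens(WITHOUT_QUANTIFIER, "!//bang"))
# prints [('operator', '!'), ('comment', '//bang')]
```

With the quantifier, the operator match swallows the two slashes, so nothing is left to start a comment and bang goes unhighlighted. Without it, the match stops after !, and the next attempt begins at //bang.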
