A simple method of tokenizing—or breaking up a string into its discrete elements—was presented in Recipe 2.6. However, this is not powerful enough to handle all your string-tokenizing needs. You need a tokenizer—also referred to as a lexer—that can split up a string based on a well-defined set of characters.
Using the Split method of the Regex class, we can use a regular expression to indicate the types of tokens and separators that we are interested in gathering. This technique works especially well with equations, since the tokens of an equation are well defined. For example, the code:
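using System;
using System.Text.RegularExpressions;

public static string[] Tokenize(string equation)
{
    // The capturing parentheses make Split return the delimiters as tokens.
    // Inside the character class, - and \ are escaped so they are literals.
    Regex RE = new Regex(@"([+\-*()^\\])");
    return RE.Split(equation);
}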
will divide up a string according to the regular expression specified in the Regex constructor. In other words, the string passed in to the Tokenize method will be divided up based on the delimiters +, -, *, (, ), ^, or \.
The following method will call the Tokenize method to tokenize the equation (y - 3)(3111*x^21 + x + 320):
public void TestTokenize()
{
    foreach (string token in Tokenize("(y - 3)(3111*x^21 + x + 320)"))
    {
        // Trim removes the spaces left around operands such as "y " and " 3".
        Console.WriteLine("String token = " + token.Trim());
    }
}
which displays the following output:
String token =
String token = (
String token = y
String token = -
String token = 3
String token = )
String token =
String token = (
String token = 3111
String token = *
String token = x
String token = ^
String token = 21
String token = +
String token = x
String token = +
String token = 320
String token = )
String token =
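Notice that each individual operator, parenthesis, number, and variable has been broken out into its own separate token. The empty tokens appear wherever a delimiter sits at the very start or end of the string, or where two delimiters are adjacent, as with the ")(" in the middle of the equation.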
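If the empty entries in the output above are not wanted, one option is to filter them out after splitting. The following is a minimal sketch of that idea rather than part of the recipe itself; the class name TokenizerSketch and the method name TokenizeNonEmpty are hypothetical, and the pattern assumes the same delimiter set as Tokenize.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TokenizerSketch
{
    // Hypothetical helper: same delimiter pattern as Tokenize, but each
    // token is trimmed and tokens that end up empty are skipped.
    public static string[] TokenizeNonEmpty(string equation)
    {
        Regex re = new Regex(@"([+\-*()^\\])");
        List<string> tokens = new List<string>();

        foreach (string raw in re.Split(equation))
        {
            string token = raw.Trim();
            if (token.Length > 0)
            {
                tokens.Add(token);
            }
        }

        return tokens.ToArray();
    }
}
```

Calling TokenizerSketch.TokenizeNonEmpty("(y - 3)(3111*x^21 + x + 320)") should return only the parentheses, operators, numbers, and variables, with no blank entries.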
The tokenizer created in Recipe 2.6 would be useful in specific controlled circumstances. However, in real-world projects, we do not always have the luxury of being able to control the set of inputs to our code. By making use of regular expressions, we can take the original tokenizer and make it flexible enough to allow it to be applied to any type or style of input we desire.
The key method used here is the Split instance method of the Regex class. The return value of this method is a string array whose elements include each individual token of the source string (the equation, in this case).
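The delimiters show up as tokens because the pattern wraps the character class in capturing parentheses; Regex.Split adds captured text to the array it returns. The sketch below contrasts the two forms. The class and method names are hypothetical and exist only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitCaptureSketch
{
    // Capturing parentheses around the class: delimiters are returned as tokens.
    public static string[] SplitKeepingDelimiters(string input)
    {
        Regex re = new Regex(@"([+\-*()^\\])");
        return re.Split(input);
    }

    // Same character class without the parentheses: delimiters are discarded.
    public static string[] SplitDroppingDelimiters(string input)
    {
        Regex re = new Regex(@"[+\-*()^\\]");
        return re.Split(input);
    }
}
```

For an input such as "x^21+y", the first method should yield x, ^, 21, +, y, while the second should yield only x, 21, y. For a lexer, the capturing form is usually the one you want, since the operators carry meaning.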
The Regex class also provides static overloads of Split. Notice that the static method allows RegexOptions enumeration values to be passed in, while the instance method allows a starting position and a limit on how many times the string is split to be specified. This may have some bearing on whether you choose the static or instance method.
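As a rough illustration of that difference, the sketch below calls the static Split overload that takes RegexOptions and the instance overload that takes a count and a starting position. The class name, method names, and sample patterns are hypothetical; only the two Split overloads come from the Regex class.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitOverloadSketch
{
    // Static overload: the options are supplied on the call itself.
    // Here IgnoreCase lets the pattern "x" match both "x" and "X".
    public static string[] SplitIgnoringCase(string input)
    {
        return Regex.Split(input, "x", RegexOptions.IgnoreCase);
    }

    // Instance overload: count caps the number of substrings returned,
    // and startAt is the index at which the search for "+" begins.
    public static string[] SplitLimited(string input, int count, int startAt)
    {
        Regex re = new Regex(@"\+");
        return re.Split(input, count, startAt);
    }
}
```

With these definitions, SplitIgnoringCase("1x2X3") should return 1, 2, 3, and SplitLimited("a+b+c+d", 2, 2) should return a+b and c+d, because the search skips the first + and stops after producing two substrings. If you need both options and a starting position, the options can instead be passed to the Regex constructor and the instance overload used for the split.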
See Recipe 2.6; see the “.NET Framework Regular Expressions” topic in the MSDN documentation.