Chapter 30

Regular Expressions and Context-Free Parsers

Parsing text is a task that comes up in many different programs. Whether it is pulling data out of text files, dealing with user commands, evaluating formulas, or understanding programming languages, some form of processing is typically needed when dealing with text values. Our approach to this so far has been fairly ad hoc. When faced with a particular format of input, we have written specialized code to handle it. The one exception to this was using the built-in parser to deal with files that were formatted in eXtensible Markup Language (XML). In this chapter we will learn about some of the formal approaches that computer scientists have developed to deal with parsing text. These will give you the ability to deal with more complex forms of text input in fairly easy, standard, and flexible ways.

30.1 Chomsky Grammars

Our discussion begins with the theoretical basis. The hierarchy of grammars that underpins what will be discussed in this chapter, as well as a fair bit of the theory of languages in Computer Science, appeared in an article written by the linguist Noam Chomsky, and the theory was developed further with Marcel-Paul Schützenberger [4].

In a formal sense, a Chomsky grammar consists of four elements:

  • a finite set of non-terminal symbols, typically denoted by capital letters,
  • a finite set of terminal symbols, typically denoted by lowercase letters,
  • a finite set of production rules, each of which has a left-hand side and a right-hand side that are composed of those symbols,
  • a special non-terminal denoted as the start symbol, often the character S.

Each grammar defines a language that is all the strings of terminal symbols that can be generated by that grammar. The way a string is generated is that a section of the string matching the left-hand side of a production is converted to the right-hand side of that same production. At any time, any production can be selected for any symbol or sequence of symbols in the current string. This process continues until the string we are left with contains only terminal symbols.

To illustrate this, consider the grammar that has the set S, A, B as non-terminal symbols, with S as the start symbol, a, b as terminal symbols, and the following productions.

S → AB
A → aA
A → a
B → bB
B → b

To see what language this grammar generates, we can run through a series of productions to generate one string.

S ⇒ AB          (apply S → AB)
  ⇒ aAB         (apply A → aA)
  ⇒ aaAB        (apply A → aA)
  ⇒ aaAbB       (apply B → bB)
  ⇒ aaabB       (apply A → a)
  ⇒ aaabb       (apply B → b)

If you play with this grammar for a little while, you should be able to see that the language it generates is one or more a symbols followed by one or more b symbols.
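To make the derivation process concrete, here is a small Scala sketch of our own (the object and method names are invented for illustration, not code from this chapter) that repeatedly applies productions from this grammar to the leftmost non-terminal:

```scala
object GrammarDemo {
  // The productions of the example grammar, keyed by non-terminal.
  val productions: Map[Char, List[String]] = Map(
    'S' -> List("AB"),
    'A' -> List("aA", "a"),
    'B' -> List("bB", "b")
  )

  // Repeatedly replace the leftmost non-terminal (an uppercase letter) using
  // one of its productions until only terminals remain. The choose function
  // picks among the alternatives; always picking the last one terminates.
  def derive(s: String, choose: List[String] => String): String =
    s.indexWhere(_.isUpper) match {
      case -1 => s
      case i  => derive(s.take(i) + choose(productions(s(i))) + s.drop(i + 1), choose)
    }

  def main(args: Array[String]): Unit =
    println(derive("S", _.last))  // ab, the shortest string in the language
}
```

Passing a choose function that sometimes picks aA or bB produces the longer strings in the language.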

For the sake of brevity, when the same left-hand side can produce multiple different values on the right-hand side, it is common to list them on a single line using a single pipe, |, between them. This can shorten the above set of productions down to the following form.

S → AB
A → aA | a
B → bB | b

While this is a very simple example, hopefully it gives you the idea of how these systems work. Anyone who completed the L-System projects in chapters 14 and 15 will find the Chomsky grammars to be somewhat familiar. L-Systems are a different type of grammar. The rules are slightly different, but the general ideas are very similar.

Perhaps the most interesting thing about these grammars is that by limiting the complexity of productions in certain ways, one can subdivide all possible grammars into a hierarchy where each level in the hierarchy is fundamentally more powerful than the one below it. There are four specific levels of capability in the Chomsky hierarchy. We will look briefly at each of these in the following four subsections, starting with the least powerful set of grammars.

30.1.1 Regular Grammars

In a regular grammar, all productions must have the form of a single non-terminal going to a terminal (A → a) or a single non-terminal going to a terminal and a non-terminal (A → aB). With some slight modifications to the example grammar made above, we can produce a regular grammar that generates the same language.

S → aA
A → aA | bB | b
B → bB | b

Run through some examples with this grammar to prove to yourself that it does generate the language of one or more a symbols followed by one or more b symbols.

Given these restrictions on the rules, the evolution of a string under a regular grammar is very straightforward. At any given time there is only one non-terminal, and it is the last symbol in the string. All productions either grow the length of the string by one, inserting a new terminal in front of a non-terminal, or they replace the one non-terminal by a terminal and finish the process.

Despite these limitations, regular grammars are still fairly powerful. Their primary limitation is that they have no memory. For example, it is impossible to make a regular grammar that generates the language with a certain number of a symbols followed by the same number of b symbols. We refer to this language as aⁿbⁿ. This is due to the fact that the system has no memory of how many a symbols it has generated and can not enforce that the number of b symbols matches. This is true at least for an arbitrary number of symbols. By employing multiple non-terminals, it is possible to make a grammar that can match a and b symbols up to a certain number, but no finite regular grammar can do this for an arbitrary value of n.

30.1.2 Context-Free Grammars

The second rung up the Chomsky hierarchy is the context-free grammars. The productions in a context-free grammar have a single non-terminal on the left-hand side and any string of terminals and/or non-terminals on the right-hand side (A → γ). The string on the right can potentially be the empty string, typically written as ε.

The fact that context-free grammars are fundamentally more powerful than regular grammars can be seen with the following example grammar. The non-terminal set includes only the start symbol, S. The terminal set is a, b. The productions are S → aSb | ab. It should not take long to convince yourself that this grammar produces the language aⁿbⁿ for n ≥ 1. Context-free grammars can do this because they effectively have a limited memory in the form of a stack. Due to the limitations of the stack, however, there is no context-free grammar that generates the language aⁿbⁿcⁿ. This is because the work of making sure that there are n b symbols consumes the contents of the stack and it is then impossible to produce the same number of c symbols.
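The recursive structure of S → aSb | ab maps directly onto a recursive function. As a sketch (our own illustration, with invented names), a membership test for aⁿbⁿ can be written as:

```scala
object AnBn {
  // Mirrors S -> aSb | ab: either the whole string is "ab", or it is an 'a',
  // followed by a string in the language, followed by a 'b'.
  def inLanguage(s: String): Boolean =
    s == "ab" ||
      (s.length >= 4 && s.head == 'a' && s.last == 'b' &&
        inLanguage(s.substring(1, s.length - 1)))

  def main(args: Array[String]): Unit = {
    println(inLanguage("aaabbb"))  // true
    println(inLanguage("aabbb"))   // false: counts do not match
  }
}
```

The call stack here plays exactly the role of the stack described above: each recursive call remembers one matched a that must later be paired with a b.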

The term context-free indicates the limitation of these grammars. While it is possible to have productions that produce whatever you want, the nature of the production can not depend on the context of the non-terminal that appears on the left-hand side. Despite this limitation, context-free grammars are remarkably useful in the field of Computer Science. The syntaxes of programming languages are typically specified as context-free grammars. In that situation, the symbols are the tokens of the language. The vast majority of sentences in natural language can also be generated by context-free grammars.

30.1.3 Context-Sensitive Grammars

The next step up the hierarchy removes the limitation that choice of production can not depend on what is around the non-terminal. The general form of the rule is αAβ → αγβ, where α, β, and γ are all arbitrary strings of terminals and non-terminals and are potentially empty. Note that the α and β are not really involved in the production. Instead, they simply limit when the production can be applied.

Context-sensitive grammars are significantly harder to work with than context-free grammars and the number of applications that require context-sensitivity is fairly limited. For that reason, we will not go into any detail on their usage.

30.1.4 Recursively Enumerable Grammars

The most powerful class of grammars in the Chomsky hierarchy is the set of recursively enumerable grammars. There are no restrictions on the productions of these grammars. You can have anything on the left-hand side produce whatever you want on the right-hand side. The truly remarkable thing about this class of grammars is that it is a full model of computation. Anything that is computable, in the theoretical sense of the Turing machine or the lambda calculus, can be computed with a properly designed recursively enumerable grammar.

This brief introduction to Chomsky grammars and the Chomsky hierarchy serves as a foundation for the rest of this chapter. However, it is a much broader topic and interested readers should do a bit more exploration of the topic. If nothing else, a proper course on the theory of Computer Science should cover these topics in significantly greater detail than has been presented here.

30.2 Regular Expressions

The theoretical underpinnings of Computer Science are extremely important for a variety of reasons, and they play a significant role in major developments in the field. However, the primary focus of this book is on the development of problem solving and programming skills. For that reason, we now turn our focus to the more practical applications of grammars and languages, beginning with regular expressions.

As the name implies, regular expressions are related to regular grammars. They provide a simple mechanism for matching and parsing text. As with regular grammars, they have their limitations. They are still quite powerful and for the problems they are good at, they can make life a whole lot easier.

Regular expressions are typically written as strings with a particular format where certain characters have special meaning. They are used by quite a few different tools, including some that you have been using for a while now. Both grep and the search feature of vi make use of regular expressions. Their tight integration into the programming language Perl was a significant part of why many people used that language for small text processing tasks. Even the split method we have been using on strings uses regular expressions.

While most of the aspects of regular expressions are fairly uniform, there are some features that differ from one implementation to another. Regular expressions in Scala are supported by the scala.util.matching.Regex class and companion object and they are built on top of the Java implementation in java.util.regex. The following subsections will go through a number of the major features of regular expressions. For a more definitive description of the features that are available, one can look at the Application Programming Interface (API) documentation for java.util.regex.Pattern. To turn a String into a Regex in Scala, simply call the r method.

scala> val str = "This is a string."
str: java.lang.String = This is a string.


  
scala> val regex = "This is a regular expression.".r
regex: scala.util.matching.Regex = This is a regular expression.

30.2.1 Characters and Character Classes

The reason that you have been able to use regular expressions in programs without even knowing you were using regular expressions is that most characters in regular expressions act just like themselves. So the regular expression "Fred" matches the string "Fred" and nothing else. However, imagine that you want to use grep to find all the method invocations in a Scala program that are called on objects ending with "ing". Here is some of the output from attempting this on the Drawing.scala file.

 > grep ing. scalabook/drawing/Drawing.scala
import scala.swing._
import javax.swing.event._
 * This type represents a Drawing that you can have open in the program.
class Drawing(private val root:DrawTransform,private[drawing]
val vars:mutable.Map[String,Double]) extends Serializable {
@transient private var lTree = new javax.swing.JTree(root)
root.drawing = this
if(lTree==null) lTree = new javax.swing.JTree(root)
private[drawing] def refresh {
private[drawing] def refreshTree {
case m:javax.swing.tree.DefaultTreeModel => m.reload

Clearly this finds everything that we wanted, but it is matching a lot of other things as well, things that do not include any period. This is a result of the fact that a period is a wild card that can match any character in regular expressions. So what we really asked for is any sequence of four characters that starts with "ing". In order to force the period to actually match only a period, we need to put a backslash in front of it. For all the special characters in regular expressions, you can force them to be treated as normal characters by putting a backslash in front of them. The fact that normal string literals in Scala also treat the backslash as an escape character becomes rather a pain when dealing with regular expressions. For this reason, it is standard to use the triple-quote string literals for regular expressions.

The period is actually one of many character classes in regular expressions. A character class matches one character as long as it comes from the right set. In the case of a period, the set is anything. You can also make your own character classes by putting characters you want to match in square brackets. For example, [abc] will match an ’a’, a ’b’, or a ’c’. A slightly more complex example is the regular expression b[aei]d which will match the strings "bad", "bed", or "bid". For consecutive characters in ASCII, you can use a hyphen. So [0-9] matches any digit and [a-zA-Z] matches any letter, upper or lowercase. If you put a ^ as the first character in the square brackets, the character class matches anything except what is specified. So [^aeiou] will match anything that is not a lowercase vowel. Of course, this means that square brackets are special characters in regular expressions and must be escaped if you want to use them for purposes other than defining a character class. The same is true for the caret in certain usages and for the minus sign inside of a character class.
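To see character classes in action with the Scala Regex class introduced above, here is a brief, self-contained sketch (our own example, not from the text):

```scala
object CharClassDemo {
  // [0-9]+ matches runs of digits; triple-quoted strings keep backslashes literal.
  val digits = """[0-9]+""".r
  // b[^aeiou]d matches a 'b' and a 'd' with anything but a lowercase vowel between.
  val notVowel = """b[^aeiou]d""".r

  def main(args: Array[String]): Unit = {
    println(digits.findAllIn("abc 42 def 7").toList)  // List(42, 7)
    println(notVowel.findFirstIn("bad byd bed"))      // Some(byd)
  }
}
```

The findAllIn and findFirstIn methods are covered in more detail later in the chapter.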

Some character classes are used so frequently that there are shortcut names for them. In addition to the period, the following are defined in the Java regular expression library.

  • \d - a digit, same as [0-9]
  • \D - not a digit, same as [^0-9]
  • \s - a whitespace character, same as [ \t\n\x0B\f\r]
  • \S - a non-whitespace character
  • \w - a word character, same as [a-zA-Z_0-9]
  • \W - not a word character, same as [^a-zA-Z_0-9]

If you spend much time working with regular expressions these character classes will likely be committed to memory.

30.2.2 Logical Operators and Capturing Groups

You have already seen that characters or character groups that are adjacent to one another match consecutive characters. So the regular expression cat can be read as ’c’ and ’a’ and ’t’. Character classes can be used in any place that would take an individual character as well. If adjacency is like "and", how does one say "or"? With a single pipe, |. So the regular expression cat|dog will match either "cat" or "dog". As with logical expressions, the "and" operation has higher precedence than the "or" does.

Sections of a regular expression can also be grouped using parentheses. This gives you a way to control the parts of the regular expression you are taking the "or" of, such as a(bc|de)f, it lets you bundle sections together for the quantifiers discussed in the next subsection, and it allows you to capture sections of the match. When you put parentheses in a regular expression, they define a capturing group. The code keeps track of all of the sections of the string that wind up being in parentheses so that you can pull them back out. These groups are numbered starting at one and they are ordered by the location of the opening parentheses. So it does not matter how long groups are or how they are nested, it is only the location of the opening parentheses that matters.

Consider the simple example of a phone number in the format "(xxx) yyy-zzzz" where you want to capture each of the groups denoted by different letters and capture the standard 7-digit part of the number independently of the area code. To do this, you could use the regular expression \((\d\d\d)\) ((\d\d\d)-(\d\d\d\d)). This regular expression has four capturing groups in it numbered 1 through 4. The first one is the area code. The second one is the 7-digit phone number. The third and fourth groups are the subparts of the 7-digit number. They are numbered by the opening parentheses. Here is that same regular expression, with subscripts to make that clear, \((₁\d\d\d)\) (₂(₃\d\d\d)-(₄\d\d\d\d)). This example regular expression also illustrates that because parentheses are special characters used for grouping in regular expressions, they have to be escaped if you want to match one.

30.2.3 Greedy Quantifiers

The real power of regular expressions comes in with quantifiers. Quantifiers allow you to have some control over the number of times an expression will occur. The default quantifiers are called "greedy" quantifiers because they will match as many characters as possible while still allowing the whole pattern to be matched. There are six options for the greedy quantifiers.

  • X? - Matches the pattern X or nothing. Think of this as being 0 or 1 times.
  • X* - Matches nothing or the pattern X repeated any number of times. Think of this as 0 or more times.
  • X+ - Matches the pattern X 1 or more times.
  • X{n} - Matches the pattern X repeated exactly n times.
  • X{n,} - Matches the pattern X repeated n or more times.
  • X{n,m} - Matches the pattern X repeated at least n times, but not more than m times.

The regular expression for the phone number could have been expressed with quantifiers as \((\d{3})\) ((\d{3})-(\d{4})). These quantifiers also allow us to express the languages for the grammars described in subsection 30.1.1. For example, a+b+ matches one or more "a"s followed by one or more "b"s. Note that there is not a good way, using these quantifiers, to allow the number of characters of each type to match without specifying how many times they occur.
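Here is a quick sketch showing these quantifiers at work. It uses String.matches, a method Scala strings inherit from Java that requires the whole string to match the pattern:

```scala
object QuantifierDemo {
  def main(args: Array[String]): Unit = {
    println("aaabb".matches("a+b+"))  // true: one or more a's then one or more b's
    println("ba".matches("a+b+"))     // false: the a's must come first
    println("aaa".matches("a{2,3}"))  // true: between two and three a's
  }
}
```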

This should now make it clear why the argument passed to split when we wanted to break a string on spaces was " +". The + here is a quantifier so that this matches one or more spaces. Just in case the user also used tabs, it might be safer to use """\s+""".
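A quick illustration of the difference (our own example):

```scala
object SplitDemo {
  def main(args: Array[String]): Unit = {
    // """\s+""" splits on any run of whitespace, including the tab.
    println("one\t two   three".split("""\s+""").toList)  // List(one, two, three)
    // " +" leaves the tab attached to the first word.
    println("one\t two   three".split(" +").toList)
  }
}
```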

There are also options for quantifiers that are not greedy. The interested reader is directed to the Java API and other sources for information on reluctant and possessive quantifiers.

30.2.4 Boundary Requirements

The last option of regular expressions that we will consider here is boundary requirements. In the general usage, regular expressions can be used to match any part of a string and can be used to do multiple matches inside of a single string. There are situations where you want to restrict the generality of the matches. For this you can put characters at the beginning or the end of the regular expression to specify where it needs to begin or end. The following are some of the allowed options.

  • ^ - As the first character in a regular expression it requires the match to start at the beginning of a line.
  • $ - As the last character in a regular expression it requires the match to end at the end of a line.
  • \b - Can be placed at the beginning or end of a regular expression forcing the match to start or end on a word boundary.
  • \B - Like \b except it forces a non-word boundary.
  • \A - As the first character in a regular expression it requires the match to start at the beginning of the input.
  • \G - As the first character in a regular expression it requires the match to start where the previous match ended.
  • \z - As the last character in a regular expression it requires the match to end at the end of the input.

So if you want to match words that have four characters in them you could use the regular expression \b\w{4}\b.
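As a sketch of this in Scala (our own example), the boundary classes combine with findAllIn as follows:

```scala
object FourLetterWords {
  // \b\w{4}\b matches exactly four word characters bounded on both sides,
  // so longer words like "letter" are not matched in part.
  val fourLetter = """\b\w{4}\b""".r

  def main(args: Array[String]): Unit =
    println(fourLetter.findAllIn("this is some text with four letter words").toList)
    // List(this, some, text, with, four)
}
```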

30.2.5 Using Regular Expressions in Code

Now that you know the basics of regular expressions we can see how they are used in Scala. We already saw that you can produce a scala.util.matching.Regex by calling the r method on a string and that triple-quote strings are particularly useful for defining regular expressions because they do not handle the backslash as an escape character. The next question is what can you do with one of these Regex objects.

The two most useful methods in the Regex class are def findAllIn(source: CharSequence): MatchIterator and def replaceAllIn(target: CharSequence, replacer: (Match) => String): String. The type CharSequence is a supertype of String that is more flexible. For our examples we will just be using normal strings. The Match and MatchIterator types are declared in the Regex companion object. A Match object defines basic information from a single match of the regular expression to a string. The API can give you full details, but one of the things you will want to do a lot is find the parts of the string that were in the different capturing groups. This can be done with the def group(i: Int): String method. The argument is the number of the group you want the text for. If nothing matched that group then the method returns null.

The MatchIterator extends Iterator[String] with MatchData. It also has a method called matchData that returns an Iterator[Match]. Putting these things together we can take a block of text that has several embedded phone numbers along with the regular expression we made above and pull out just the parts of the phone numbers.

scala> val phoneNumbers = """For help you can try to following numbers:
 | Main line: (123) 555-4567
 | Secondary line: (123) 555-7022
 | Fax: (123) 555-5847"""
phoneNumbers: java.lang.String =
For help you can try to following numbers:
Main line: (123) 555-4567
Secondary line: (123) 555-7022
Fax: (123) 555-5847


  
scala> val phoneRegex = """\((\d{3})\) ((\d{3})-(\d{4}))""".r
phoneRegex: scala.util.matching.Regex = \((\d{3})\) ((\d{3})-(\d{4}))


  
scala> for(m <- phoneRegex.findAllIn(phoneNumbers).matchData) {
 | println(m.group(1)+" "+m.group(3)+" "+m.group(4))
 |}
123 555 4567
123 555 7022
123 555 5847

The for loop runs through the Match objects produced by calling matchData on the result of findAllIn. The groups with indexes 1, 3, and 4 are printed from each match.

Hopefully it is clear how the use of regular expressions made this code much easier to write. Finding the phone numbers embedded in that text would have been rather challenging using other techniques that we have covered. Phone numbers have a nice, well-defined format to them, but it is not a specific string sequence that we can search for.

The replaceAllIn method has a second argument that is a function which takes a Match and returns a String that should replace the match in the first argument. We could use this method to mask the phone numbers, replacing their last four digits with the character ’x’, using the following line.

scala> phoneRegex.replaceAllIn(phoneNumbers,m => {
 | "("+m.group(1)+") "+m.group(3)+"-xxxx"
 |})
res2: String =
For help you can try to following numbers:
Main line: (123) 555-xxxx
Secondary line: (123) 555-xxxx
Fax: (123) 555-xxxx

Passing in the Match gives you the ability to use the details of the match in determining the replacement string. In this case, we wanted to preserve the first six digits so groups 1 and 3 were used in the result. Again, if you think for a while about what would have been required to do this transformation using other techniques, you should realize the power provided by regular expressions.

There is one other approach to using regular expressions in Scala: patterns. We have previously seen patterns used with tuples, arrays, lists, and XML. They work with regular expressions as well. When a regular expression is used as a match on a string, it pulls out the groups and only works if there is a match for the whole string. The following code can be put into the Commands.scala file in place of more traditional string parsing code that we had before that split on the first space.

 val CommandSplit = """\s*(\w+)(\s+(.*))?\s*""".r
 def apply(input:String,drawing:Drawing):Any = {
 val CommandSplit(command,_,rest) = input
 if(commands.contains(command)) commands(command)(rest,drawing)
  else "Not a valid command."
}

This does what we did previously with a call to indexOf, but this code is a bit more robust in how it handles whitespace, particularly leading whitespace. With the old code, if the user put spaces in front of the command, it would fail. This fixes that problem by leading with \s*.

You might notice something else odd about this code. The variable CommandSplit starts with a capital letter. This is contrary to the standard naming practices used for everything else in this text. The reason for this is its use in the pattern. Recall that names beginning with lowercase letters in patterns are assumed to be variable names that should be bound to values in the match. This is the role played by command and rest in this example, and that behavior is essential to this working. However, we want CommandSplit to refer to the variable declared above the method. If we wanted to do this using a variable with a name that starts with a lowercase letter, we would have to use backticks in the pattern like this.

  val `commandSplit`(command,_,rest) = input

The choice to make variables used in patterns start with capital letters is simply one of style. You can decide which style you prefer.

The place where regular expression patterns are probably most useful is in for loops. Recall that what you type before <- in a for loop is matched as a val pattern. Anything that does not match is skipped. This is remarkably helpful for regular expressions if you are running through a collection of strings where some might not match what you are looking for.

To see this, consider having a file where some of the lines are numbered and others are not. You only care about the ones that are numbered and they all have, for the first non-space characters, digits followed by a period. After the period is text you care about. You want to build a Map[Int,String] that lets you quickly look up a string based on the number of its line. The following code will do this for you.

val NumberedLine = """\s*(\d+)\.(.+)""".r
val source = Source.fromFile(fileName)
val lines = source.getLines
val numberedLines = (for(NumberedLine(num,text) <- lines) yield {
  num.toInt -> text
}).toMap
source.close

The for loop will only execute on lines matching the pattern and skip anything that does not match. This produces compact and fairly easy to maintain code.
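Because the file and fileName are not available here, the following self-contained sketch (our own variation) substitutes Source.fromString for Source.fromFile and wraps the loop in a method so it can be exercised directly:

```scala
import scala.io.Source

object NumberedLines {
  // Optional leading spaces, the line number, a period, then the text.
  val NumberedLine = """\s*(\d+)\.(.+)""".r

  // Source.fromString stands in for Source.fromFile so this runs without a
  // file; with a real file you would use Source.fromFile(fileName) instead.
  def parse(input: String): Map[Int, String] = {
    val source = Source.fromString(input)
    val numbered = (for (NumberedLine(num, text) <- source.getLines()) yield
      num.toInt -> text.trim).toMap
    source.close()
    numbered
  }

  def main(args: Array[String]): Unit =
    println(parse("1. alpha\nnot numbered\n2. beta\n"))  // Map(1 -> alpha, 2 -> beta)
}
```

The trim call is an addition of ours: the second group starts right after the period, so it usually captures the space that follows it.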

30.2.6 Drawback of Regular Expressions

The primary drawback of regular expressions is that they can be cryptic and somewhat difficult to maintain. Indeed, languages that highlight the use of regular expressions are often derided for their lack of readability. Coming up with exactly the right regular expression for a problem can also be quite challenging. This is expressed in this oft-repeated quote.

Some people, when confronted with a problem, think

"I know, I’ll use regular expressions." Now they have two problems.

- Jamie Zawinski (1997 Usenet post)

There is inevitably some truth to this quote. However, hopefully the examples above have shown you that in situations where regular expressions work well, they can make life much easier. Just do not try to use them to solve every problem you encounter.

30.3 Context-Free Parsers

What options do you have if you need to parse text that is beyond the power of regular expressions? Scala includes the package scala.util.parsing which has a number of different subpackages that provide different parsing capabilities. We will focus on the scala.util.parsing.combinator package, which gives us the ability to parse context-free grammars.

The introduction to context-free, CF, grammars given in subsection 30.1.2 used a notation fitting with theoretical Computer Science. Like mathematicians, theoretical computer scientists are quite happy to name things using single letters or symbols. In practical application, it is typically helpful to use longer, more informative names. The symbology is also altered a bit when expressing CF grammars for things like programming languages. To illustrate this, we will go back to the example of formula parsing that we have utilized in the last two chapters. The CF grammar for a basic mathematical expression using only the four primary operators can be written this way.

form ::= term {("+" | "-") term}

term ::= factor {("*" | "/") factor}

factor ::= floatingPointNumber | "(" form ")"

In this notation, the curly braces indicate elements that can be repeated zero or more times. It is also possible to use square brackets to indicate things that are optional.

This grammar says that a formula is a term followed by zero or more + or - signs, each followed by another term. A term is a factor followed by factors with * or / between them. Lastly, a factor can be a number or a full formula in parentheses. Play with this to convince yourself that it can represent the full range of possibilities that are desired. It is also worth noting that this grammar enforces the proper order of operations, with * and / binding more tightly than + and -.

Using the combinator parser library, this converts fairly directly to the following Scala code.

object FormulaP extends JavaTokenParsers {
  def form:Parser[Any] = term ~ rep(("+" | "-") ~ term)
  def term:Parser[Any] = factor ~ rep(("*" | "/") ~ factor)
  def factor:Parser[Any] = floatingPointNumber | "(" ~ form ~ ")"
}

Each production in the original grammar becomes a method that returns a Parser, a class found inside scala.util.parsing.combinator.Parsers. Currently this is a Parser[Any] to be completely general. Consecutive symbols in the original grammar are joined using the ~ operator. The curly braces are replaced with a call to rep. Had there been square brackets, they would have been replaced with a call to opt. There is also a method called repsep that handles repetitions with a separator between the elements. This is useful for things like comma-separated lists.

Using these rules, you can take any CF grammar that you might be interested in using and build a parser for it with little effort. You can use any of these parsers by calling the parseAll method that FormulaP gets via JavaTokenParsers in this example. The following main will do just that and print the value output produced by the parser.

def main(args:Array[String]) {
  println(parseAll(form,"3+5*2").get)
}

30.3.1 Default Output

If you run that main method you get the following output.

((3~List())~List((+~(5~List((*~2))))))

The string parses, but the default output is a bit challenging to understand. To do so, we need to know what the default output of each parser element is.

  • String or regular expression - Gives back the String that was matched.
  • P~Q - Gives back ~(p,q), where p is the match for P and q is the match for Q. Note that ~(p,q) prints out as p~q.
  • rep(P) - Gives back a list of the matches: List(p1, p2, p3, ...).
  • repsep(P,Q) - Gives back a list of the matches for P: List(p1, p2, p3, ...). Note that while Q can be a full parser, its output is ignored. This is why it is best used for things like comma-separated lists where you do not care about the separators.
  • opt(P) - Gives back an Option on the match. If that optional part was not used you get None. Otherwise you get Some(p).

Using this information and the original grammar, you can figure out how this parser produced the output that it did.

30.3.2 Specified Output

While the default output provides the full information from the parse, and you could run through it using match statements to extract the desired information, it is often more useful to specify your own output from the parser. This is done using the ^^ method. The signature of ^^ in Parser[+T] is def ^^[U](f: (T) => U): Parser[U]. So it takes a function that converts from the normal output of this parser to some other type, U, and gives back a Parser[U].

We can utilize this in the parser we wrote above to make it so that it outputs a numeric solution instead of the default parse output we had above. The first step in doing this is to change the return types of the various productions to Parser[Double]. That will introduce errors that we can fix by introducing the appropriate transformations. The resulting code looks like this.

object FormulaP extends JavaTokenParsers {
  def form:Parser[Double] = term ~ rep(("+" | "-") ~ term) ^^ {
    case d ~ lst => lst.foldLeft(d)((n,t) => if(t._1=="+") n+t._2 else n-t._2)
  }
  def term:Parser[Double] = factor ~ rep(("*" | "/") ~ factor) ^^ {
    case d ~ lst => lst.foldLeft(d)((n,t) => if(t._1=="*") n*t._2 else n/t._2)
  }
  def factor:Parser[Double] = floatingPointNumber ^^ (_.toDouble) | "(" ~> form <~ ")"
  def main(args:Array[String]) {
    println(parseAll(form,"3+5*2").get)
  }
}

The simplest application of ^^ in this example is after floatingPointNumber. That is a Parser[String], so the normal output is just a String that we can convert to a Double with a call to toDouble. The other two productions follow the ^^ with partial functions that get converted to full functions for us. Each of these has a case that matches a number followed by a tilde and a list. The number will be the result of the first term or factor. The list will be the result of rep, which is a List[~[String,Double]]. To get an answer, we need to run through the list accumulating a value. At each step, the proper operation should be applied to the current value and the parse of what follows the operator. This can be done in short form using foldLeft. Note that the ~ type works with infix pattern matching and has _1 and _2 methods like a tuple for getting to the first and second elements.
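The fold in form and term can be tried in isolation. Here the rep output is stood in for by a plain list of (operator, value) pairs, so this sketch runs without the parser library.

```scala
// Folding the operator list for "3+5-2": start from the first term's
// value and apply each operator/value pair from left to right.
val first = 3.0
val rest = List(("+", 5.0), ("-", 2.0))
val result = rest.foldLeft(first)((n, t) => if (t._1 == "+") n + t._2 else n - t._2)
println(result)  // 6.0
```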

The parser for factor throws in something else to keep this code compact. It uses operators where a < or > is put before or after the ~. These operators give you a shorthand way of ignoring parts of the match in the output. The direction the arrow points is toward the thing you actually care about. Without these, we would have had to follow that last case with a ^^ operator and a function that pulls off the parentheses and discards them.

With these changes, you can run this program and see that now the output is 13.0. Compare this to our original recursive parser that worked with these same operations and you will find that this is significantly shorter. However, this version is also completely opaque to anyone who is not familiar with CF grammars or combinator parsers.

There are, of course, many other options and possibilities with combinator parsers. Looking through the API at classes such as scala.util.parsing.combinator.Parsers and scala.util.parsing.combinator.Parsers.Parser can give you a quick view of some of the possibilities.

30.4 Project Integration

Now it is time to put everything together, including some concepts from the last chapter. The FormulaP class can be expanded to parse to a tree instead of a Double. This will allow the use of variables with good speed. To show the true power of the combinator parsers, this implementation will include not only the four basic math operations and a few extra functions, it will also include exponentiation and an if conditional. These two features present interesting challenges. Exponentiation is right associative while the other operators are left associative. In order to have an if, we have to have comparison operators that give back Booleans. It would also be nice to have the standard Boolean operators as well. All of this needs to be integrated into a system that preserves proper order of operation.
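The associativity difference matters for the tree that gets built. It can be sketched with plain folds before any parsing is involved: a right fold groups exponents to the right, a left fold groups subtraction to the left.

```scala
// 2^3^2 must group to the right: 2^(3^2) = 2^9 = 512.
val pow = List(2.0, 3.0, 2.0).reduceRight(math.pow)
// 10-4-3 must group to the left: (10-4)-3 = 3.
val diff = List(10.0, 4.0, 3.0).reduceLeft(_ - _)
println(pow)   // 512.0
println(diff)  // 3.0
```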

To start off, we need to develop a CF grammar that will parse this little language that we are trying to implement. Here is the one we will use in the code.6

cond ::= "if(" bform ")" cond "else" cond | form

form ::= term {("+" | "-") term}

term ::= exp {("*" | "/") exp}

exp ::= func {"^" func}

func ::= ("sin" | "cos" | "tan" | "sqrt") "(" cond ")" | factor

factor ::= floatingPointNumber | ident | "(" cond ")"

bform ::= bterm {"||" bterm}

bterm ::= bnot {"&&" bnot}

bnot ::= "!(" bform ")" | bcomp

bcomp ::= cond ("==" | "!=" | "<=" | ">=" | "<" | ">") cond

We convert this grammar into code, add some Node traits and classes, then make the parsers output nodes. The result is the following file.

package scalabook.util

import scala.util.parsing.combinator._

class FormulaP(val formula:String) {
  private val root = FormulaP.parseAll(FormulaP.cond, formula).get
  def eval(vars:collection.Map[String,Double]) = root eval vars
}

object FormulaP extends JavaTokenParsers {
  def apply(f:String) = new FormulaP(f)

  def eval(f:String, vars:collection.Map[String,Double] = null) =
    new FormulaP(f) eval vars

  private def cond:Parser[DNode] = "if(" ~> bform ~ """\)\s*""".r ~ cond ~
      """\s*else\s*""".r ~ cond ^^ {
    case b ~ _ ~ e1 ~ _ ~ e2 => new IfNode(b,e1,e2)
  } | form
  private def form:Parser[DNode] = term ~ rep(("+" | "-") ~ term) ^^ {
    case d ~ lst => new LeftAssocBinaryOpDNode(d,lst)
  }
  private def term:Parser[DNode] = exp ~ rep(("*" | "/") ~ exp) ^^ {
    case d ~ lst => new LeftAssocBinaryOpDNode(d,lst)
  }
  private def exp:Parser[DNode] = func ~ rep("^" ~> func) ^^ {
    case d ~ lst => new PowBinaryOpDNode(d,lst)
  }
  private def func:Parser[DNode] = """(sin|cos|tan|sqrt)\(""".r ~ cond <~ ")" ^^ {
    case f ~ n => new FunctionDNode(f,n)
  } | factor
  private def factor:Parser[DNode] = floatingPointNumber ^^ (s => new NumNode(s.toDouble)) |
    ident ^^ (s => new VarNode(s)) | "(" ~> cond <~ ")"
  private def bform:Parser[BNode] = bterm ~ rep("||" ~> bterm) ^^ {
    case b ~ lst => new LeftAssocBinaryOpBNode(b,lst,_ || _)
  }
  private def bterm:Parser[BNode] = bnot ~ rep("&&" ~> bnot) ^^ {
    case b ~ lst => new LeftAssocBinaryOpBNode(b,lst,_ && _)
  }
  private def bnot:Parser[BNode] = "!(" ~> bform <~ ")" ^^ (b => new BNotNode(b)) |
    bcomp
  private def bcomp:Parser[BNode] = cond ~ ("""[=!><]=|<|>""".r) ~ cond ^^ {
    case c1 ~ op ~ c2 => new CompNode(c1,op,c2)
  }
  private trait DNode {
    def eval(vars:collection.Map[String,Double]):Double
  }
  private trait BNode {
    def eval(vars:collection.Map[String,Double]):Boolean
  }
  private class LeftAssocBinaryOpDNode(first:DNode, restStr:List[~[String,DNode]])
      extends DNode {
    val rest = for(~(op,n) <- restStr) yield (op match {
      case "+" => (_:Double)+(_:Double)
      case "-" => (_:Double)-(_:Double)
      case "*" => (_:Double)*(_:Double)
      case "/" => (_:Double)/(_:Double)
    }, n)
    def eval(vars:collection.Map[String,Double]):Double =
      rest.foldLeft(first eval vars)((d,t) => t._1(d,t._2 eval vars))
  }
  private class PowBinaryOpDNode(first:DNode, rest:List[DNode]) extends DNode {
    def eval(vars:collection.Map[String,Double]):Double =
      math.pow(first eval vars, rest.foldRight(1.0)((n,d) => math.pow(n eval vars, d)))
  }
 private class NumNode(num:Double) extends DNode {
 def eval(vars:collection.Map[String,Double]):Double = num
}
 private class VarNode(name:String) extends DNode {
 def eval(vars:collection.Map[String,Double]):Double = vars(name)
}
 private class FunctionDNode(name:String, arg:DNode) extends DNode {
 val f:Double=>Double = name match {
  case "sin(" => math.sin
  case "cos(" => math.cos
  case "tan(" => math.tan
  case "sqrt(" => math.sqrt
}
 def eval(vars:collection.Map[String,Double]):Double = f(arg eval vars)
}
 private class IfNode(cond:BNode, e1:DNode, e2:DNode) extends DNode {
 def eval(vars:collection.Map[String,Double]):Double =
  if(cond eval vars) e1 eval vars else e2 eval vars
}
 private class LeftAssocBinaryOpBNode(first:BNode, rest:List[BNode],
  op:(Boolean,Boolean)=>Boolean) extends BNode {
 def eval(vars:collection.Map[String,Double]):Boolean =
  rest.foldLeft(first eval vars)((tf,b) => op(tf,b eval vars))
 }
 private class BNotNode(arg:BNode) extends BNode {
 def eval(vars:collection.Map[String,Double]):Boolean = !(arg eval vars)
}
 private class CompNode(left:DNode, compStr:String, right:DNode) extends BNode {
 val comp:(Double,Double)=>Boolean = compStr match {
  case "<" => _<_
  case ">" => _>_
  case "<=" => _<=_
  case ">=" => _>=_
  case "==" => _==_
  case "!=" => _!=_
}
 def eval(vars:collection.Map[String,Double]):Boolean =
  comp(left eval vars,right eval vars)
}
}

One of the first things to note about this code is that regular expressions are used in the parsers to keep things short and flexible. The other thing to note is that there are two top level node traits called DNode and BNode. These both have an eval method, but they differ in that they return a Double and a Boolean, respectively.

The binary operator nodes are a bit more complex than their counterparts from chapter 29. This is because these have the ability to take many operators that are at the same precedence level. Despite some parts being complex, the entire file is under 120 lines of code and it contains a significant amount of functionality. What is more, it is fairly easy to extend, particularly when adding single-argument functions. This FormulaP class can now be substituted for the Formula class developed in the last chapter.
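The trick used in CompNode and LeftAssocBinaryOpDNode of converting an operator string into a function once, at construction time, can be seen on its own. This sketch repeats that match outside of the node classes so it runs without the rest of the file.

```scala
// Map a comparison operator string to a (Double, Double) => Boolean
// function once, so evaluation does no repeated string matching.
def compFor(op: String): (Double, Double) => Boolean = op match {
  case "<"  => _ < _
  case ">"  => _ > _
  case "<=" => _ <= _
  case ">=" => _ >= _
  case "==" => _ == _
  case "!=" => _ != _
}

val lessEq = compFor("<=")
println(lessEq(3.0, 5.0))  // true
println(lessEq(7.0, 5.0))  // false
```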

30.5 End of Chapter Material

30.5.1 Summary of Concepts

  • Chomsky Grammars
    • Formal approach to generating languages and parsing languages. Form a hierarchy with four different levels of complexity. A grammar is defined as a set of terminals, a set of non-terminals, a set of productions, and a start symbol.
    • Regular grammars are the lowest level. Productions restricted to the form A → a | aB.
    • Context-free grammars are the next level up in complexity. Their productions have to be of the form A → γ, where A is a single non-terminal.
    • Context-sensitive grammars can take the symbols around a non-terminal into account when determining if a production is allowed. The productions must have the form αAβ → αγβ.
    • Recursively enumerable grammars have no restriction on their productions. They are a complete model of computation.
  • Regular expressions are a syntax for string parsing that have a power roughly equal to regular grammars.
    • Character classes can represent a set of characters that can be matched. You can build character classes with square brackets. There are also a number of built-in character classes for commonly used sets of characters.
    • Quantifiers specify that a character or grouping can be present a certain number of times. Quantifiers include ?, *, and +.
    • Strings can be turned into a Regex using the r method. A Regex object has methods called findAllIn and replaceAllIn. Regex objects can also be used as patterns.
  • Context-Free Parsers
    • The combinator parser library in Scala makes it easy to build a parser from a CF grammar.
    • Default output of parsers includes a mix of strings, ~ objects, and lists. This can be challenging to work with.
    • Using the ^^ operator you can force parsers to output specific types that you want.
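The Regex bullet above can be illustrated with a short, self-contained example using only the standard library.

```scala
// Build a Regex from a string with r, then search and replace with it.
val num = """\d+""".r
val found = num.findAllIn("3+52*7").toList
val cleaned = num.replaceAllIn("3+52*7", "#")
println(found)    // List(3, 52, 7)
println(cleaned)  // #+#*#
```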

30.5.2 Exercises

  1. Write a simple program that will take a Chomsky grammar and allow the user to generate strings from the language.

  2. For each of the following grammars, try to figure out what languages they generate. For those that are regular, create an appropriate regular expression. For those that are context-free, write appropriate parsing code.

    • S → A
      A → aA|B
      B → bB|b

    • S → aA
      A → aA|bC
      C → cC|c

    • S → ABC
      A → aA|a
      B → bB|b
      C → cC|c

    • S → AB
      A → aA|a
      B → AbB|b

    • S → eLAe
      eLA → eRA
      ALA → AAL
      ARe → ALAe
      ARA → RAA

  3. Try to come up with, or look up, a simple context-free grammar for English. Write a parser for it and include a simple dictionary.
  4. Write a grammar for roman numerals. Have it parse to a tree such that the output has an eval method that will return the numeric value as an Int.

30.5.3 Projects

  1. If you have been working on the MUD project, this chapter gives you the ability to use commands with significantly greater complexity. This can be done with either regular expressions or a parser. You can decide which you want to use, but the suggestion would be to write a simple grammar for more complex commands that you could put into your MUD and then write a parser that will take those and build usable expression trees from them.

  2. Regular expressions and parsers can be extremely useful in pulling data out of text files for the web spider. You likely wrote up a fair bit of ad hoc code for getting the data you want out of HTML files. Convert that code to use a parser with regular expressions.

  3. Both the networked and non-networked graphical game programs avoid having significant text input. However, in both cases, there are often reasons when it is nice to be able to give commands to the program that go beyond what a normal user should be able to do in order to help with testing and debugging.

    For this project, you should add a small command language to your game. If there is a good reason to do this for normal users, feel free to do so. Networked games could integrate this with chatting features. Non-networked games could put it into the Graphical User Interface (GUI), or you could use the logging socket that was added back in chapter 23.

  4. The material from this chapter can be particularly beneficial to the math worksheet project. Using combinator parsers, you can create a more complex, powerful formula parser, introduce additional commands with relative ease, and even consider putting in a simple programming language. You can also include conditional expressions in your parser so that your mathematical functions can use recursive definitions.

    For all of these additions, you should have the parser build an expression tree for faster evaluation. Add whatever set of features you feel takes the project in the direction you want to go.

  5. Photoshop® is another project that included mathematical formulas, and which can benefit from having commands in a simple language. Given what you learned in this chapter, you should be able to add much more powerful parsing to your formulas that are used with filters.

  6. The simulation workbench project will also benefit from having real formulas. All of the applications of formulas from chapter 29 apply when using a combinator parser as well. In addition, the combinator parser makes it reasonable to extend formulas into more of a real language.

1In some usages the period will not match a newline.

2Like regular string literals in Scala, the Linux command-line treats a backslash specially. As such, you have to include two backslashes on the command-line for this to work.

3Hopefully the reader appreciates the fact that there is some similarity in the symbols used for expressing similar ideas across different tools.

4Technically there are limitations on the grammars that can be parsed with these parsers, but it is not a limitation on power. Grammars just have to be written in a way that facilitates the parsing. The details of this are beyond the scope of this book.

5This example uses JavaTokenParsers because that is where floatingPointNumber is defined. It also defines ident, which we will use later for identifier names.

6This grammar includes some of those elements that go beyond the scope of this book. For example, if you change the first production to cond ::= form | "if(" bform ")" cond "else" cond it runs into problems because the string "if" is a valid identifier which matches form.
