Chapter 16. Lexing and Parsing

In this chapter, you take a closer look at lexing and parsing, topics introduced briefly in Chapters 9 and 11. In particular, the chapter introduces the lexer and parser generators, fslex and fsyacc, that come with the F# distribution. A typical scenario in which these techniques and tools come in handy is the following:

  • You want to read user-readable input that has a well-defined syntax.

  • You have a type (often an abstract syntax tree [AST] type) to represent this input.

The typical task is to parse the user input into your internal representation by breaking down the input string into a sequence of tokens (a process called lexical analysis) and then constructing an instance of your internal representation based on a grammar (via syntactic analysis). Lexing and parsing don't have to be separated, and there are often convenient .NET methods for extracting information from text in particular formats, as shown in this chapter. Nevertheless, it's often best to treat the two processes separately.

The goal of this chapter is to provide the background you need to use the built-in lexing and parsing facilities of .NET and F# effectively and to understand the options for lexing and parsing other input formats:

  • For lexing, we cover simple line-based techniques to crack data formats using the .NET libraries directly. You then learn how to use fslex to break text into simple tokens, strings with escape characters, and nested comments, and you see table-based token generation and stateful lexing (passing state as a parameter as opposed to using mutable state).

  • For parsing, we cover the core parsing topics of languages, grammar formalisms, and various parser types. We explain the typical problems for recursive-descent parsing and introduce fsyacc by giving a parser for Kitty, a small BASIC-like language. This highlights how to parse lists of symbols, how to assign precedence and associativity to your tokens and production rules, and how you can recover from parsing errors. You also learn about conflicts that can arise in fsyacc specifications and how to resolve them.

Finally, the chapter covers combinator-based techniques, which are particularly useful for writing parsers for binary formats without relying on fslex and fsyacc.

Processing Line-Based Input

A common, simple case of parsing and lexing occurs when you're working with an existing line-based text-file format. In this case, parsing is often as easy as splitting each line of input at a particular separator character and trimming whitespace off the resulting partial strings:

> let line = "Smith, John, 20 January 1986, Software Developer";;
val line : string = "Smith, John, 20 January 1986, Software Developer"

> line.Split [| ',' |];;
val it : string [] = [|"Smith"; " John"; " 20 January 1986"; " Software Developer"|]

> line.Split [| ',' |] |> Array.map (fun s -> s.Trim());;
val it : string [] = [|"Smith"; "John"; "20 January 1986"; "Software Developer"|]

You can then process each column in the data format:

let splitLine (line: string) =
    line.Split [| ',' |] |> Array.map (fun s -> s.Trim())

let parseEmployee (line: string) =
    match splitLine line with
    | [| last; first; startDate; title |] ->
        last, first, System.DateTime.Parse(startDate), title
    | _ ->
        failwithf "invalid employee format: '%s'" line

The type of this function is as follows:

val parseEmployee : string -> string * string * System.DateTime * string

Here is an example use:

> parseEmployee line;;
val it : string * string * System.DateTime * string
       = ("Smith", "John", 20/01/1986 00:00:00 { ... }, "Software Developer")

On-Demand Reading of Files

You can turn a file into an on-demand sequence of results using a sequence builder:

open System.IO

let readEmployees (fileName : string) =
    seq {
        use reader = File.OpenText fileName
        while not reader.EndOfStream do
            yield reader.ReadLine() |> parseEmployee
    }

The following example takes the first three entries from an artificially generated file containing 10,000 copies of the same employee:

> File.WriteAllLines("employees.txt", Array.create 10000 line);;
val it : unit

> let firstThree = readEmployees("employees.txt") |> Seq.take 3;;
val firstThree : seq<string * string * System.DateTime * string>

> for (last,first,startDate,title) in firstThree do
      printfn "%s %s started on %A" first last startDate;;
John Smith started on 20/01/1986 00:00:00
John Smith started on 20/01/1986 00:00:00
John Smith started on 20/01/1986 00:00:00

This technique is often used to do exploratory analysis of large data files. After the algorithm is refined using a prefix of the data, the analysis can then easily be run directly over the full data file.
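
For example, once the parsing logic has been checked on a prefix, a quick aggregate over the full file might look like this sketch, which counts employees by job title using the readEmployees function above:

let titleCounts fileName =
    readEmployees fileName
    |> Seq.countBy (fun (_, _, _, title) -> title)
    |> Seq.toList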

Using Regular Expressions

Another technique that's frequently used to extract information from strings is to use regular expressions. The System.Text.RegularExpressions namespace provides convenient string-matching and -replacement functions. For example, let's say you have a log file containing a record of HTTP GET requests. Here is a sample request:

GET /favicon.ico HTTP/1.1

The following code captures the name of the requested resource (favicon.ico) and the minor version number of the HTTP protocol (1) used:

open System.Text.RegularExpressions

let parseHttpRequest line =
    let result = Regex.Match(line, @"GET (.*?) HTTP/1.([01])$")
    let file = result.Groups.[1].Value
    let version = result.Groups.[2].Value
    file, version

The relevant fields are extracted by using the Groups property of the regular expression match to access the matched strings for each parenthesized group in the regular expression.
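
For example, applying this function to the sample request line shown earlier extracts the resource name and the minor version:

> parseHttpRequest "GET /favicon.ico HTTP/1.1";;
val it : string * string = ("/favicon.ico", "1")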

Tokenizing with FsLex

Although it's possible to hand-code lexers by using a range of ad hoc techniques such as those discussed in the previous section or by writing functions that explicitly manipulate lists of characters, doing so can be boring and time consuming. Instead, it's often easier to rely on a lexer generator to do this job for you. This section looks at how to use the fslex tool that comes with the F# Power Pack to perform lexical analysis. The F# Power Pack also includes fsyacc, and it is available as a separate download from MSDN and CodePlex. When you use fslex- or fsyacc-generated code in your projects, you must include a library reference to FSharp.PowerPack.dll from the F# Power Pack.

Let's start with a simple example. Listing 16-1 shows a lexer that replaces all < and > characters in an input stream with their HTML equivalents, &lt; and &gt;. Listing 16-2 shows a small program that uses this generated lexer.

Example 16.1. Replacing Characters with Their HTML Equivalents: text2htmllex.fsl

{
module Text2HtmlLex

let lexeme = Lexing.LexBuffer<_>.LexemeString

(* You can add your helper functions here *)
}
rule convertHtml chan = parse
 | '<'    { fprintf chan "&lt;"
            convertHtml chan lexbuf }
 | '>'    { fprintf chan "&gt;"
            convertHtml chan lexbuf }
 | eof    { () }
 | _      { fprintf chan "%s" (lexeme lexbuf)
            convertHtml chan lexbuf }

Example 16.2. Replacing Characters with Their HTML Equivalents: text2html.fs

open System.IO
open System.Text

let main() =
   let args = System.Environment.GetCommandLineArgs()
   if args.Length <= 2 then
         let exe = Path.GetFileName(args.[0])
         eprintfn "Usage: %s dir pattern" exe
         exit 1
   let directory = args.[1]
   let pattern = args.[2]

   for fileName in Directory.GetFiles(directory, pattern) do

      // Open a file stream for the file name
      use inputReader = File.OpenText fileName

      // Create a lex buffer for use with the generated lexer. The lex buffer
      // reads the inputReader stream.
      let lexBuffer = Lexing.LexBuffer<_>.FromTextReader inputReader

      // Open an output channel
      let outputFile = Path.ChangeExtension(fileName,"html")
      use outputWriter = (new StreamWriter(outputFile) :> TextWriter)

      // Write the header
      fprintfn outputWriter "<html>
<head></head>
<pre>"

      // Run the generated lexer
      Text2HtmlLex.convertHtml outputWriter lexBuffer

      // Write the footer
      fprintfn outputWriter "</pre>
</html>
"

main()

You can produce an F# source file from the previous lexer definition by running the following command at the Windows command prompt:

fslex text2htmllex.fsl --unicode

This produces text2htmllex.fs, which contains the implementation of the lexer convertHtml. This lexer is imperative, in that it prints to an output stream instead of returning tokens. The signature of the entry point to the generated lexer is as follows:

val Text2HtmlLex.convertHtml: System.IO.TextWriter -> Lexing.LexBuffer<char> -> unit

You can now compile the driver and the lexer together:

fsc text2htmllex.fs text2html.fs -r FSharp.PowerPack.dll

You can run the resulting program as follows, giving a source directory and a file pattern and producing an .html version of each file that matches by applying the HTML conversion:

text2html . *.txt

Let's look at the previous example more closely. The rule section of text2htmllex.fsl defines the lexer, which takes the output channel as an argument before the lexing buffer. It says that if you encounter the < or > character, you should output its HTML equivalent and recursively call your lexer to process the remaining input. If you find the end of the file, you stop; and for any other character, you print it to the output channel. In each rule, you can refer to a predefined variable (visible only inside the rule) named lexbuf, an instance of the Microsoft.FSharp.Text.Lexing.LexBuffer type. You can access various bits of information about the lexing state through this variable; some of these are collected in Table 16-1.

The driver is all F# code. You check the input arguments and then iterate through files in the directory given by the first argument whose name matches the pattern given by the second argument. You then open each file and instantiate your generated lexer with the following lines:

use inputReader = File.OpenText fileName
let lexBuffer = Lexing.LexBuffer<_>.FromTextReader inputReader
...
Text2HtmlLex.convertHtml outputWriter lexBuffer

This code uses some important functions from the LexBuffer type. Table 16-1 shows the notable static members and some of the properties in this type.

Table 16.1. Some Static Members and Properties of the LexBuffer Type

Member                  Type                               Description

FromString              string -> LexBuffer<char>          Makes a LexBuffer for the given string
FromTextReader          TextReader -> LexBuffer<char>      Makes a LexBuffer for the given text reader
FromBinaryReader        BinaryReader -> LexBuffer<byte>    Makes a LexBuffer for the given binary reader
LexemeString            LexBuffer<char> -> string          Returns the matched string
this.EndPos             Lexing.Position                    Gets/sets the current position associated with the end of the matched token
this.IsPastEndOfStream  bool                               True if the LexBuffer has exhausted the available input
this.StartPos           Lexing.Position                    Gets/sets the current position associated with the start of the matched token

Note

FsLex works by constructing a table-driven finite automaton that consumes the input character sequence one character at a time until a full token can be returned; if the input is exhausted before a token is complete, the automaton blocks until further input is received. The states of this machine are derived from the regular expressions defined by each rule. Single-character literals advance the machine to a new state, and repetitions cause it to remain in the same state. These states form a graph, and the edges between states are labeled with the symbols that move the machine from one state to the next.

The FsLex Input in More Detail

You saw the basic structure of lexer files in the preceding example, which contained only a handful of rules. In general, fslex input files have the following simple structure:

// Preamble – any user code you need for the lexer, such as opening modules, etc.
{ [Code] }

// Definitions – named patterns that you can use in the rules or other definitions
let [Ident_1] = [Pattern]
let ...

// Rules – text patterns that trigger certain actions
rule [Rule_1] [arg1... argn] = parse
   | [Pattern] { [Action] }
   | ...
   | [Pattern] { [Action] }
and [Rule_2] [arg1... argn] = parse
    ...
and [Rule_3] ...

// Epilogue – code that can call the lexer rules defined above
{ [Code] }

Each rule defined in the lexer becomes an F# function that can be accessed from other modules and the lexer itself. You can place comments between (* and *), and you can also use // comments in the actions just like in any other F# code. Patterns can be any of the forms listed in Table 16-2.

Lexical actions are pieces of F# code enclosed in braces; they're executed when a lexer match is made. You can put any logic here that you like; typically, you construct a token. Tokens are specified in the parser definition using the %token directive (you see this later in this chapter); if you don't have a parser, any user-defined type will do. If your lexer rules don't construct tokens, or if your lexer is simple enough, often you may want to put all driver code in the epilogue section to create a stand-alone lexer.
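
For instance, a stand-alone version of the lexer in Listing 16-1 could put a small driver in the epilogue section. The following is a minimal sketch, not part of the original listing, that assumes input is read from standard input and written to standard output:

{
let main () =
    // Lex standard input and write the converted text to standard output
    let lexbuf = Lexing.LexBuffer<_>.FromTextReader System.Console.In
    convertHtml System.Console.Out lexbuf

main ()
}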

Table 16.2. Patterns in Lexer Rules

Pattern Form           Description

'c'                    Character constant, in single quotes, such as '+' and '.'
['a' 'b' 'c']          Character set; matches any character in the given set
['a'-'z']              Character range; matches any character in the given range, in ASCII ordering
[^'a' 'b' 'c']         Complementary character set; matches any character except those in the given set
"abc"                  Matches the given string of characters
_                      Matches any character
eof                    Matches the end of the stream
identifier             A predefined named regular expression (named earlier in the file using a let binding)
pattern?               Zero or one occurrences of pattern
pattern+               One or more occurrences of pattern
pattern*               Zero or more occurrences of pattern
pattern1 | pattern2    Either pattern1 or pattern2
pattern1 pattern2      Concatenation; pattern1 followed by pattern2

Generating a Simple Token Stream

Listing 16-3 shows a lexer that turns input text into a stream of tokens, which are then printed. It can recognize integers, floats, identifiers, and the symbols ^, *, -, and +. Any other character causes a runtime exception.

Example 16.3. SimpleTokensLex.fsl: Lexing Simple Tokens: Integers, Floats, and Identifiers

{
module SimpleTokensLex

open System

type token =
    | INT    of int
    | FLOAT  of float
    | ID     of string
    | STRING of string
    | PLUS | MINUS | TIMES | HAT
    | EOF

let lexeme = Lexing.LexBuffer<_>.LexemeString
}

let num        = ['0'-'9']+
let intNum     = '-'? num
let floatNum   = '-'? num ('.' num)? (['e' 'E'] num)?
let ident      = ['a'-'z']+
let whitespace = ' ' | '\t'
let newline    = '\n' | '\r' '\n'

rule token = parse
    | intNum     { INT (Convert.ToInt32 (lexeme lexbuf)) }
    | floatNum   { FLOAT (Convert.ToDouble (lexeme lexbuf)) }
    | ident      { ID (lexeme lexbuf)  }
    | '+'        { PLUS }
    | '-'        { MINUS }
    | '*'        { TIMES }
    | '^'        { HAT }
    | whitespace { token lexbuf }
    | newline    { token lexbuf }
    | eof        { EOF }
    | _          { failwithf "unrecognized input: '%s'" (lexeme lexbuf) }

You can generate the lexer using this:

fslex simpleTokensLex.fsl --unicode

The generated lexer contains a single module SimpleTokensLex with one entry-point function for each rule. In this case, the type of this function is as follows:

val token: Lexing.LexBuffer<char> -> SimpleTokensLex.token

The following indicates how you can imperatively generate a simple token stream from a string and print the results in F# Interactive:

> #r "FSharp.PowerPack.dll";;
...
> #load "SimpleTokensLex.fs";;
> let lexbuf = Lexing.LexBuffer<_>.FromString "3.4 x 34 xyx";;
val lexbuf : Lexing.LexBuffer<char>

> SimpleTokensLex.token lexbuf;;
val it : SimpleTokensLex.token = FLOAT 3.4

> SimpleTokensLex.token lexbuf;;
val it : SimpleTokensLex.token = ID "x"

> SimpleTokensLex.token lexbuf;;
val it : SimpleTokensLex.token = INT 34

> SimpleTokensLex.token lexbuf;;
val it : SimpleTokensLex.token = ID "xyx"

> SimpleTokensLex.token lexbuf;;
val it : SimpleTokensLex.token = EOF

> SimpleTokensLex.token lexbuf;;
System.Exception: End of file on lexing stream
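
If you prefer to collect all the tokens at once, a small helper (a sketch, not part of the generated code) can drive the rule in a loop until EOF is returned:

let tokensOf (input: string) =
    let lexbuf = Lexing.LexBuffer<_>.FromString input
    let rec loop acc =
        // Call the generated rule repeatedly, accumulating tokens
        match SimpleTokensLex.token lexbuf with
        | SimpleTokensLex.EOF -> List.rev acc
        | tok -> loop (tok :: acc)
    loop []

// tokensOf "3.4 x 34 xyx" returns [FLOAT 3.4; ID "x"; INT 34; ID "xyx"]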

Tracking Position Information Correctly

Lexers generated by fslex keep track of partial information about the position of the most recently accepted token within the source stream of characters. In particular, the StartPos and EndPos properties on the LexBuffer type return Lexing.Position values. A partial signature of this position type is as follows:

type Position with
     // The file name associated with the input stream.
     member FileName : string

     // The line number in the input stream, assuming fresh position
     // information has been updated by modifying the EndPos property
     // of the LexBuffer.
     member Line : int

     // The character number in the input stream
     member AbsoluteOffset : int

     // The column number marked by the position, i.e. the difference
     // between the AbsoluteOffset and the StartOfLineAbsoluteOffset.
     member Column : int

     // Convert a position just beyond the end of a line to a
     // position at the start of the next line.
     member NextLine : Position
     ...
end

In some cases, certain lexer actions must perform extra bookkeeping. In particular, the lexer should update the EndPos property of the LexBuffer each time a newline marker is processed (this is left up to the lexer because the interpretation of newline characters can differ between various lexers). For example, you can change the newline case of the token rule in the lexer in Listing 16-3 to make this update:

| newline    { lexbuf.EndPos <- lexbuf.EndPos.NextLine;
               token lexbuf }

You can now experiment with this updated lexer in F# Interactive and examine the StartPos and EndPos properties after fetching each token:

> let lexbuf = Lexing.LexBuffer<_>.FromString "3.4 \n 34 xyx";;
val lexbuf : Lexing.LexBuffer<char>

> SimpleTokensLex.token lexbuf;;
val it : SimpleTokensLex.token = FLOAT 3.4

> (lexbuf.StartPos.Line, lexbuf.StartPos.Column);;
val it : int * int = (0,0)

> (lexbuf.EndPos.Line, lexbuf.EndPos.Column);;
val it : int * int = (0,3)

> SimpleTokensLex.token lexbuf;;
val it : SimpleTokensLex.token = INT 34

> (lexbuf.StartPos.Line, lexbuf.StartPos.Column);;
val it : int * int = (1,1)

Often, you may need to attach position information to each lexer token. However, when you use lexers in conjunction with fsyacc parser generators, the position information is automatically read after each token is processed and then stored in the parser's state. We return to this topic later in this chapter.
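
If you do need the positions alongside each token outside of an fsyacc-generated parser, a small wrapper such as the following sketch suffices (the getTokenStream function later in this chapter uses the same idea):

let tokenWithPos (lexbuf: Lexing.LexBuffer<char>) =
    // Read one token and capture the positions recorded for it
    let tok = SimpleTokensLex.token lexbuf
    tok, lexbuf.StartPos, lexbuf.EndPos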

Handling Comments and Strings

So far, you've seen examples with one lexing rule only. This is because the main lexer rule was sufficient for all tokens and you haven't yet come across the need to lex input that can't be described by a regular expression. To illustrate this point, say you want to lex comments enclosed by (* and *). Formally, you have an opening delimiter, followed by the body of the comment, followed by the closing delimiter. The first attempt, shown here

"(*" _* "*)"

fails because the middle pattern matches everything and you never reach the closing *). So, the best compromise could be as follows

"(*" [^ '*']* "*)"

where you match the inside of the comment as long as you don't see an asterisk, and then you try to match the closing *). This of course fails on any comment that contains an asterisk. You can play with this regular expression a little more. The inside of the comment is either anything but an asterisk or all asterisks that aren't followed by another asterisk or right parenthesis:

"(*" ([^ '*'] | ('*'+ ([^ '*' ')'])))* '*'+ ')'

This is about as close as you can get, and yet even this pattern has a problem: it can't match nested comments; it always stops at the first closing delimiter, ignoring all nested comment openers.

You can handle this problem by using a multirule lexer. The following rules show the additions you can make to the SimpleTokensLex.fsl lexer from Listing 16-3 in order to properly handle comments and strings:

rule token = parse
    ...
    | "(*"       { comment lexbuf; token lexbuf }
    | "\""       { STRING (string lexbuf.StartPos "" lexbuf) }
    | _          { failwithf "unrecognized input: '%s'" (lexeme lexbuf) }

and comment = parse
    | "(*"       { comment lexbuf; comment lexbuf }
    | "*)"       { () }
    | "\n"       { lexbuf.EndPos <- lexbuf.EndPos.NextLine
                   comment lexbuf }
    | eof        { failwith "Unterminated comment" }
    | _          { comment lexbuf }

and string pos s = parse
    | "\\" ('"' | 'n' | 'r' | 't' | '\\')
                 { let s' = s + (match lexeme lexbuf with
                                 | "\\\"" -> "\""
                                 | "\\n"  -> "\n"
                                 | "\\r"  -> "\r"
                                 | "\\t"  -> "\t"
                                 | "\\\\" -> "\\"
                                 | _      -> "")
                   string pos s' lexbuf }
    | "\""       { s }
    | "\n"       { lexbuf.EndPos <- lexbuf.EndPos.NextLine
                   string pos (s + "\n") lexbuf }
    | eof        { failwithf "end of file in string started at or near %A" pos }
    | _          { string pos (s + (lexeme lexbuf)) lexbuf }

Comment processing begins when you encounter (* in the token rule. When the closing *) is encountered, you exit one invocation of the comment rule. The idea is that you deal with nested comments by recursively applying the lexer when a nested comment is reached. Note the double invocation of comment lexbuf for nested comment delimiters: once to tokenize the comment that belongs to the opener and again to tokenize the rest of the enclosing comment. There are further matches within the comment rule: newlines update the position information; if you hit the end of the source stream, you have an unterminated comment and raise an exception; and in every other case, you move forward inside the comment.

In the example, strings are also handled by a separate lexer rule string that is invoked by the token lexer when you encounter the double-quote character. This rule takes two parameters: the string consumed so far and the start position of the string. The latter is used to report a nicer error for an unterminated string (you could also use this technique for the matching case for comments). You can also check whether you have an escaped character in the input. If so, you append the appropriate escape sequence to the string already accumulated and advance the current position. Upon encountering the closing character, you return the overall string. An imperative System.Text.StringBuilder object can also be used to accumulate the string, which is more efficient if strings get very long.
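
For example, a StringBuilder-based variant of the string rule might look like the following partial sketch (only the closing-quote and default cases are shown; the escape and newline cases carry over unchanged apart from appending to buf):

and string pos (buf: System.Text.StringBuilder) = parse
    | "\""       { buf.ToString() }
    | _          { buf.Append(lexeme lexbuf) |> ignore
                   string pos buf lexbuf }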

Note

Because lexer rules can pass arguments, there is little need to use mutable state in a lexer—instead, pass additional arguments.

Recursive-Descent Parsing

You can now turn your attention to parsing. Let's assume for the moment that you're writing an application that performs simple symbolic differentiation, say on polynomials only. You want to read polynomials such as x^5-2x^3+20 as input from your users and convert them to your internal polynomial representation so that you can perform symbolic differentiation and pretty-print the result to the screen. One way to represent polynomials is as a list of terms that are added or subtracted to form the polynomial:

type Term =
    | Term  of int * string * int
    | Const of int

type Polynomial = Term list

For instance, the polynomial in this example is as follows:

[Term (1,"x",5); Term (−2,"x",3); Const 20]

In Listing 16-3, you built a lexer and a token type suitable for generating a token stream for the input text (shown as a list of tokens here):

[ID "x"; HAT; INT 5; MINUS; INT 2; ID "x"; HAT; INT 3; PLUS; INT 20]

Listing 16-4 shows a recursive-descent parser that consumes this token stream and converts it into the internal representation of polynomials. The parser works by generating a lazy list for the token stream. Lazy lists are a data structure in the F# library module Microsoft.FSharp.Collections.LazyList, and they're a lot like sequences with one major addition: lazy lists effectively allow you to pattern-match on a sequence and return a residue lazy list for the tail of the sequence.

Example 16.4. Recursive-Descent Parser for Polynomials

open SimpleTokensLex
open Microsoft.FSharp.Text.Lexing

type Term =
    | Term  of int * string * int
    | Const of int

type Polynomial = Term list
type TokenStream = LazyList<token * Position * Position>

let tryToken (src: TokenStream ) =
    match src with
    | LazyList.Cons ((tok, startPos, endPos), rest) -> Some(tok, rest)
    | _ -> None

let parseIndex src =
    match tryToken src with
    | Some (HAT, src) ->
        match tryToken src with
        | Some (INT num2, src) ->
            num2, src
        | _ -> failwith "expected an integer after '^'"
    | _ -> 1, src

let parseTerm src =
    match tryToken src with
    | Some (INT num, src) ->
        match tryToken src with
        | Some (ID id, src) ->
           let idx, src = parseIndex src
           Term (num, id, idx), src
        | _ -> Const num, src
    | Some (ID id, src) ->
         let idx, src = parseIndex src
         Term(1, id, idx), src
    | _ -> failwith "end of token stream in term"

let rec parsePolynomial src =
    let t1, src = parseTerm src
    match tryToken src with
    | Some (PLUS, src) ->
        let p2, src = parsePolynomial src
        (t1 :: p2), src
    | _ -> [t1], src

The functions here have the following types (using the type aliases you defined):

val tryToken        : TokenStream -> (token * TokenStream) option
val parseIndex      : TokenStream -> int * TokenStream
val parseTerm       : TokenStream -> Term * TokenStream
val parsePolynomial : TokenStream -> Polynomial * TokenStream

You can turn the fslex-generated lexer for the lexer specification in Listing 16-3 into a TokenStream using the following code:

let getTokenStream inp : TokenStream =
    // Generate the token stream as a seq<token>
    seq { let lexbuf = LexBuffer<_>.FromString inp
          while not lexbuf.IsPastEndOfStream do
              match SimpleTokensLex.token lexbuf with
              | EOF -> yield! []
              | token -> yield (token, lexbuf.StartPos, lexbuf.EndPos) }

    // Convert to a lazy list
    |> LazyList.ofSeq

let parse input =
    let src = getTokenStream  input
    let result, src = parsePolynomial src
    match tryToken src with
    | Some _ -> failwith "unexpected input at end of token stream!"
    | None -> result

These functions have the following types:

val getTokenStream : string -> TokenStream
val parse : string -> Polynomial

Note in the previous examples that you can successfully parse either constants or complete terms; but after you locate a HAT symbol, a number must follow. This sort of parsing, in which you look only at the next token to guide the parsing process, is referred to as LL(1), which stands for left-to-right, leftmost derivation parsing; 1 means that only one look-ahead symbol is used. The parser approach you used earlier is called recursive-descent. This has various advantages and disadvantages, and we discuss those in a bit. To conclude, you can look at the parse function in action:

> parse "1+3";;
val it : Term list = [Const 1; Const 3]

> parse "2x^2+3x+5";;
val it : Term list = [Term (2,"x",2); Term (3,"x",1); Const 5]

Limitations of Recursive-Descent Parsers

There are various grammar constructs that you can't handle with recursive-descent parsers. For instance, if you translate a left-recursive production into code, you get an infinite recursion, like so:

<polynomial> ::= <polynomial> '+' <term> | <term>

This corresponds to the following:

let rec parsePolynomial src =
    let poly, src = parsePolynomial src
    ...

Another common problem with LL(k) parsing for some k>=1 is that the grammar rules for a given nonterminal can't start with the same symbols, or else there is no easy way to decide which rule to apply (making that decision could require more than the k symbols of look-ahead that are available). In such cases, left-factoring can be applied: you move the symbols after the common part into another production, as shown here:

<polynomial> ::= <term> | <term> '+' <polynomial>
can be refactored as
<polynomial> ::= <term> <polynomialTail>
<polynomialTail> ::= EPSILON  | '+' <polynomial>

Here, EPSILON is the empty symbol, so the function that parses polynomialTail checks for a plus symbol: if one is present, it issues a call to parse a polynomial; in the absence of an initial plus symbol, it simply exits, leaving the input unchanged. Although this case is relatively simple (you could iteratively parse terms as long as each is followed by a plus sign), coding such grammar rules in the general case is cumbersome.
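
As an illustration, the following sketch shows how the left-factored grammar maps onto recursive-descent functions, reusing tryToken and parseTerm from Listing 16-4:

let rec parsePolynomial src =
    let t1, src = parseTerm src
    let rest, src = parsePolynomialTail src
    (t1 :: rest), src

and parsePolynomialTail src =
    match tryToken src with
    | Some (PLUS, src) -> parsePolynomial src   // '+' <polynomial>
    | _ -> [], src                              // EPSILON: leave the input unchanged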

Problems such as these arise because LL parsers must make parsing decisions early (such as deciding which grammar rule to pursue when there are multiple choices for a given nonterminal), because they construct the parse tree from the top and proceed downward. LR parsers, on the other hand, postpone these decisions as much as possible and construct the parse tree bottom-up, resulting in much more flexibility, both in terms of how naturally grammar rules can be expressed and how they can be mapped into code.

Parsing with FsYacc

The tool fsyacc generates LALR(1) parsers, which are a special subset of LR(1) parsers where the state table is compressed by merging similar states. In practice, this doesn't limit the languages that can be parsed, but it does result in significant savings in parse-table size. The generated parser automaton performs one of four distinct operations in any state based on the look-ahead token, and these are important to understand if you have various grammar conflicts to fix. It can shift the look-ahead (always a terminal symbol) onto its stack, reduce a number of stack entries by a grammar rule leaving the head symbol in their place, accept the input as syntactically correct, or reject it otherwise. Parsing proceeds until an accept or a reject state is reached.

You first see how to develop a parser for a simple programming language. A sample fragment of the BASIC-like language you want to parse is shown here:

a := 1;
b := 0;
if a then d := 20 + 20;
if b then d := 40 * 20 + 20;
print d;
while d do
    begin
        d := d + 1;
        print d
    end;
print d

For simplicity, let's call this language Kitty. As the previous example shows, Kitty supports naming values, printing values, basic arithmetic operators, and while and conditional constructs. The Ast module (shown in Listing 16-5) defines the internal representation for Kitty programs.

Example 16.5. KittyAst.fs: Defining the AST for Kitty Programs

module Ast

type expr =
    | Val   of string
    | Int   of int
    | Plus  of expr * expr
    | Minus of expr * expr
    | Times of expr * expr

type stmt =
    | Assign     of string * expr
    | While      of expr * stmt
    | Seq        of stmt list
    | IfThen     of expr * stmt
    | IfThenElse of expr * stmt * stmt
    | Print      of expr

type prog = Prog of stmt list

The Lexer for Kitty

Listing 16-6 shows a lexer for the language in the file KittyLexer.fsl. It's similar to lexers developed earlier in this chapter. The one exception is that you use a keyword table. Matching against lexemes to identify tokens is a sensible solution only if there are relatively few cases. Tokenizing a large set of keywords and operators using explicit rules can lead to large lexers. This situation is often handled using tables that contain the possible lexeme matches and the tokens to be returned. Listing 16-6 uses simple dictionaries (maps).

Example 16.6. KittyLexer.fsl: Lexer for Kitty

{
module KittyLexer

open System
open KittyParser
open Microsoft.FSharp.Text.Lexing

let lexeme = LexBuffer<_>.LexemeString

let ids = [ ("while",   WHILE);
            ("begin",   BEGIN);
            ("end",     END);
            ("do",      DO);
            ("if",      IF);
            ("then",    THEN);
            ("else",    ELSE);
            ("print",   PRINT);]

let idsMap = Map.ofList ids

let ident lexbuf tokenText =
   if Map.containsKey tokenText idsMap then Map.find tokenText idsMap
   else ID tokenText
}

let num        = ['0'-'9']+
let alpha      = ['a'-'z' 'A'-'Z']
let ident      = alpha+ (alpha | ['_' '$'])*
let integer    = '-'? num
let whitespace = ' ' | '\t'
let newline    = '\n' | '\r' '\n'

rule token = parse
    | whitespace { token lexbuf }
    | newline    { lexbuf.EndPos <- lexbuf.EndPos.NextLine; token lexbuf }
    | "("        { LPAREN }
    | ")"        { RPAREN }
    | "+"        { PLUS }
    | "-"        { MINUS }
    | "*"        { TIMES }
    | ";"        { SEMI }
    | ":="       { ASSIGN }
    | ident      { ident lexbuf (lexeme lexbuf) }
    | integer    { INT (Int32.Parse(lexeme lexbuf)) }
    | eof        { EOF }

Note that at compilation time, the lexer depends on the parser defined later in Listing 16-7. This is because the lexer must return the type of tokens required by the parser.

You can generate the lexer by calling fslex:

fslex KittyLexer.fsl --unicode

This produces KittyLexer.fs, which contains the implementation of the lexer.

The Parser for Kitty

Listing 16-7 shows the parser specification for the Kitty language in the file KittyParser.fsy.

Example 16.7. KittyParser.fsy: Parser for Kitty

%{
open Ast
%}

// The start token becomes a parser function in the compiled code.
%start start

// These are the terminal tokens of the grammar along with the types of
// the data carried by each token:
%token <string> ID
%token <int> INT
%token PLUS MINUS TIMES LPAREN RPAREN IF THEN ELSE
%token WHILE DO BEGIN END PRINT SEMI ASSIGN EOF

// Associativity and Precedences - Lowest precedence comes first
%left PLUS MINUS
%left TIMES

// This is the type of the data produced by a successful reduction
// of the 'start' symbol:
%type <prog> start

%%

start: Prog                   { $1 }

Prog: StmtList                { Prog (List.rev $1) }

Expr: ID                      { Val $1 }
    | INT                     { Int $1 }
    | Expr PLUS Expr          { Plus ($1, $3) }
    | Expr MINUS Expr         { Minus ($1, $3) }
    | Expr TIMES Expr         { Times ($1, $3) }
    | LPAREN Expr RPAREN      { $2 }


Stmt: ID ASSIGN Expr              { Assign ($1, $3) }
    | WHILE Expr DO Stmt          { While ($2, $4) }
    | BEGIN StmtList END          { Seq (List.rev $2) }
    | IF Expr THEN Stmt           { IfThen ($2, $4) }
    | IF Expr THEN Stmt ELSE Stmt { IfThenElse ($2, $4, $6) }
    | PRINT Expr                  { Print $2 }

StmtList:
    | Stmt               { [$1] }
    | StmtList SEMI Stmt { $3 :: $1  }

You can generate the parser by calling fsyacc:

fsyacc KittyParser.fsy --module KittyParser

This produces KittyParser.fs, which contains the implementations of the parser, along with a signature file KittyParser.fsi. The generated parser has one entry point for each %start symbol (here there is only one). The type of this entry point is as follows:

val start : (LexBuffer<'a> -> token) -> LexBuffer<'a> -> Ast.prog

You see how to use this function a little later in this chapter. In a roundabout way, the type says, "If you give me a lexing function that generates tokens and give me a LexBuffer to supply to that lexing function, then I'll generate you an Ast.prog." The generic type variable indicates that the parser is independent of the characters manipulated by the LexBuffer.

We now cover in more detail the different aspects of fsyacc parsing illustrated by this example.

Parsing Lists

In Kitty, statements can be separated by semicolons. This is handled in the StmtList grammar production, whose semantic value is a list of statements. Note that you could have written this rule in a head-recursive way:

StmtList:
    | Stmt               { [$1] }
    | StmtList SEMI Stmt { $1 @ [ $3 ] }

Unlike with recursive-descent or any other LL parsing technique, this left-recursive rule doesn't pose a problem for fsyacc, and thus no left-factoring or left-recursion elimination is needed. However, it does create a copy of the statement list each time a new statement is appended to it. You eliminate this by using the following productions

StmtList:
    | Stmt               { [$1] }
    | StmtList SEMI Stmt { $3 :: $1 }

combined with a List.rev where the rule is used. This rule consumes all statements and inserts them, one by one, into the singleton list that contains the first statement. As a result, the return list is in reverse order, which is why you need to apply List.rev. You may want to define a separate rule to perform this operation. Another feature that is often needed is the ability to parse empty or optional lists. This can be easily accomplished using an empty (epsilon) symbol, as in the following example:

StmtListOpt:
               { [] }
    | StmtList { $1 }

This rule matches an optional list of statements and returns an empty list if no statements can be parsed.

Resolving Conflicts, Operator Precedence, and Associativity

As usual with arithmetic operators, division and multiplication should take precedence over addition and subtraction: 1+2*3 should be parsed as 1+(2*3). With fsyacc, this can be expressed easily using the associativity directives—or, to be more precise, their ordering:

// Associativity and Precedences - Lowest precedence comes first
%left PLUS MINUS
%left TIMES

By specifying how tokens associate and how strongly they bind, you can control how parse derivations are performed. For instance, if you give left-associativity to the addition operator (PLUS), then for an input 1+2+3 the parser automatically generates an unambiguous derivation in the form of (1+2)+3. The basic arithmetic operators are left-associative and should be listed from the lowest precedence to the highest; in the example, the addition and subtraction operators have lower precedence than multiplication, the way it should be. Other associativity specifications include %nonassoc and %right, which are used to denote that a given symbol doesn't associate or associates to the right, respectively. The former is useful for relational and equality operators such as <, >, or !=, where the operator isn't applicable if applied multiple times: 1 > 2 > 3 yields a syntax error.

You can also give precedence to a rule by using %prec at the end of the rule and giving a token whose precedence is to be applied. You can list arbitrary tokens in the associativity and precedence declarations, even if they haven't been declared as tokens, and use them in such situations. You can find more details about specifying precedence at www.expert-fsharp.com/Topics/FsYacc.
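
For example, a hypothetical extension of the Kitty grammar with unary minus could declare a pseudo-token UMINUS purely to carry a precedence and then apply it to the unary rule with %prec. This is a sketch, not part of the grammar in Listing 16-7:

// Added after the existing %left declarations so that it binds most tightly:
%nonassoc UMINUS

// ... and in the rules section:
Expr: MINUS Expr %prec UMINUS     { Minus (Int 0, $2) }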

Tip

One option for fsyacc.exe is -v, which causes fsyacc to produce a readable extract of the parser's states. This is useful when there are various conflicts to resolve.

Each state in this extract corresponds to one or more items, which are productions annotated with a marker for how much of the production has been seen so far; this current position within a rule is marked with a period (.). Furthermore, to each state belong various actions that are triggered by certain look-ahead symbols. For instance, an action in some state such as

action 'ID' (noprec):   shift 7

indicates that if the ID token, which has no defined precedence, is encountered as look-ahead, the parser pushes this token to the parsing stack and shifts to state 7. For reduce actions, the rule that is reduced is shown. An error (reject) or accept action is shown for tokens that trigger a syntax error or acceptance, respectively.

The parser state extract also provides useful information about conflicts in your grammar. Conflicts arise when your grammar is ambiguous (which translates to having more than one choice for a parser action at some point), ultimately meaning more than one derivation can accept a given input. You can do a number of things to disambiguate your grammar: you can apply precedence to various tokens or rules, or you can rewrite your rules to be consistent and unambiguous. There are two main kinds of grammar conflicts: reduce-reduce and shift-reduce conflicts. Reduce-reduce conflicts are considered really bad, because there are multiple rules to reduce by in a given situation. Although fsyacc applies a disambiguation strategy (reducing by the grammar rule that was defined earlier), you should avoid reduce-reduce conflicts as much as possible.

Shift-reduce conflicts arise when the parser has the choice to shift a token or reduce by a rule. Unless you fix this conflict, fsyacc favors the shift action and defers the reduction to a later point. In some situations (for instance, the dangling-else problem), this yields the expected behavior; but in general, any such conflict is also considered a serious problem.

Putting It Together

You can generate the parsers and lexers by calling fslex and fsyacc:

fslex KittyLexer.fsl --unicode
fsyacc KittyParser.fsy --module KittyParser

This produces KittyLexer.fs and KittyParser.fs, which contain the implementations of the parser and lexer. You can test these using F# Interactive by loading the files directly using the #load directive. The following code creates a LexBuffer called lexbuf. It then calls the KittyParser.start entry point for the parser, passing KittyLexer.token as the lexical analysis engine and lexbuf as the LexBuffer. This connects the parser and the lexer:

open Ast
open KittyParser
open KittyLexer

let parseText text =
    let lexbuf = Lexing.LexBuffer<_>.FromString text
    try
        KittyParser.start KittyLexer.token lexbuf
    with e ->
        let pos = lexbuf.EndPos
        failwithf "Error near line %d, character %d
" pos.Line pos.Column

You can now test this function interactively:

> let sample = "counter := 100; accum := 0; 

                while counter do 

                begin 

                    counter := counter - 1; 

                    accum := accum + counter 

                end; 

                print accum";;
val sample : string = "..."

> parseText sample;;
val it : Ast.prog
= Prog
   [Assign ("counter",Int 100); Assign ("accum",Int 0);
    While
     (Val "counter",
      Seq
       [Assign ("counter",Minus (Val "counter",Int 1));
        Assign ("accum",Plus (Val "accum",Val "counter"))]); Print Val "accum"]

Writing an evaluator for Kitty is straightforward. Here, you use an environment that maps variable names to the integer values they store. As expected, assignments in the source language add a binding for a given variable, and evaluating variables reads a value from this environment. Because of the lack of other types in Kitty, you use a nonzero value for the Boolean true and zero for false and wire the logic of the conditional and looping construct accordingly:

let rec evalE (env: Map<string, int>) = function
    | Val v          -> if env.ContainsKey v then env.[v]
                        else failwith ("unbound variable: " + v)
    | Int i          -> i
    | Plus  (e1, e2) -> evalE env e1 + evalE env e2
    | Minus (e1, e2) -> evalE env e1 - evalE env e2
    | Times (e1, e2) -> evalE env e1 * evalE env e2

and eval (env: Map<string, int>) = function
    | Assign (v, e) ->
         env.Add(v, evalE env e)
    | While (e, body) ->
         let rec loop env e body =
             if evalE env e <> 0 then
                 loop (eval env body) e body
             else env
         loop env e body
    | Seq stmts ->
         List.fold eval env stmts
    | IfThen (e, stmt) ->
         if evalE env e <> 0 then eval env stmt else env
    | IfThenElse (e, stmt1, stmt2) ->
         if evalE env e <> 0 then eval env stmt1 else eval env stmt2
    | Print e ->
         printf "%d" (evalE env e); env

With these at hand, continuing the same interactive session, you can now evaluate the sample Kitty program:

> match parseText sample with
    | Prog stmts ->
        eval Map.empty (Seq stmts) |> ignore;;
4950
val it : unit = ()

Binary Parsing and Pickling Using Combinators

One final case of parsing is common when working with binary data. That is, say you want to work with a format that is conceptually relatively easy to parse and generate (such as a binary format) but where the process of actually writing the code to crack and encode the format is somewhat tedious. This section covers a useful set of techniques to write readers and writers for binary data quickly and reliably.

The running example shows a set of pickling (also called marshalling) and unpickling combinators to generate and read a binary format of our design. You can easily adapt the combinators to work with existing binary formats such as those used for network packets. Picklers and unpicklers for different data types are function values that have signatures as follows:

type outstate = System.IO.BinaryWriter
type instate  = System.IO.BinaryReader

type pickler<'T> = 'T -> outstate -> unit
type unpickler<'T> = instate -> 'T

Here, instate and outstate are types that record information during the pickling or parsing process. In this section, these are just binary readers and writers; but more generally, they can be any type that can collect information and help compact the data during the writing process, such as by ensuring that repeated strings are given unique identifiers during the pickling process.

At the heart of every such library lies a set of primitive leaf functions for the base cases of aggregate data structures. For example, when you're working with binary streams, this is the usual set of primitive read/write functions:

let byteP (b: byte) (st: outstate) = st.Write(b)
let byteU (st: instate) = st.ReadByte()

You can now begin to define additional pickler/unpickler pairs:

let boolP b st = byteP (if b then 1uy else 0uy) st
let boolU st = let b = byteU st in (b = 1uy)

let int32P i st =
    byteP (byte (i &&& 0xFF)) st
    byteP (byte ((i >>> 8) &&& 0xFF)) st
    byteP (byte ((i >>> 16) &&& 0xFF)) st
    byteP (byte ((i >>> 24) &&& 0xFF)) st

let int32U st =
    let b0 = int (byteU st)
    let b1 = int (byteU st)
    let b2 = int (byteU st)
    let b3 = int (byteU st)
    b0 ||| (b1 <<< 8) ||| (b2 <<< 16) ||| (b3 <<< 24)

These functions have the following types:

val byteP  : pickler<byte>
val byteU  : unpickler<byte>
val boolP  : pickler<bool>
val boolU  : unpickler<bool>
val int32P : pickler<int>
val int32U : unpickler<int>

So far, so simple. One advantage of this approach comes as you write combinators that put these together in useful ways. For example, for tuples

let tup2P p1 p2 (a, b) (st: outstate) =
    (p1 a st : unit)
    (p2 b st : unit)

let tup3P p1 p2 p3 (a, b, c) (st: outstate) =
    (p1 a st : unit)
    (p2 b st : unit)
    (p3 c st : unit)

let tup2U p1 p2 (st: instate) =
    let a = p1 st
    let b = p2 st
    (a, b)

let tup3U p1 p2 p3 (st: instate) =
    let a = p1 st
    let b = p2 st
    let c = p3 st
    (a, b, c)

and for lists:

// Outputs a list into the given output stream by pickling each element via f.
let rec listP f lst st =
    match lst with
    | [] ->     byteP 0uy st
    | h :: t -> byteP 1uy st; f h st; listP f t st

// Reads a list from a given input stream by unpickling each element via f.
let listU f st =
    let rec ulist_aux acc =
        let tag = byteU st
        match tag with
        | 0uy -> List.rev acc
        | 1uy -> let a = f st in ulist_aux (a::acc)
        | n ->   failwithf "listU: found number %d" n
    ulist_aux []

These functions conform to the following types:

val tup2P : 'a pickler -> 'b pickler -> ('a * 'b) pickler
val tup3P : 'a pickler -> 'b pickler -> 'c pickler -> ('a * 'b * 'c) pickler
val tup2U : 'a unpickler -> 'b unpickler -> ('a * 'b) unpickler
val tup3U : 'a unpickler -> 'b unpickler -> 'c unpickler -> ('a * 'b * 'c) unpickler
val listP : 'a pickler -> 'a list pickler
val listU : 'a unpickler -> 'a list unpickler

It's now beginning to be easy to pickle and unpickle aggregate data structures using a consistent format. For example, imagine that the internal data structure is a list of integers and Booleans:

type format = list<int32 * bool>

let formatP = listP (tup2P int32P boolP)
let formatU = listU (tup2U int32U boolU)

open System.IO

let writeData file data =
    use outStream = new BinaryWriter(File.OpenWrite(file))
    formatP data outStream

let readData file  =
    use inStream = new BinaryReader(File.OpenRead(file))
    formatU inStream

You can now invoke the pickle/unpickle process as follows:

> writeData "out.bin" [(102, true); (108, false)] ;;
val it : unit = ()

> readData "out.bin";;
val it : (int * bool) list = [(102, true); (108, false)]

Combinator-based pickling is a powerful technique and can be taken well beyond what has been shown here. For example, it's possible to do the following:

  • Ensure data is compressed and shared during the pickling process by keeping tables in the input and output states. Sometimes this requires two or more phases in the pickling and unpickling process.

  • Build in extra-efficient primitives that compress leaf nodes, such as writing out all integers using BinaryWriter.Write7BitEncodedInt and BinaryReader.Read7BitEncodedInt.

  • Build extra combinators for arrays, sequences, and lazy values and for lists stored in other binary formats than the 0/1 tag scheme used here (a sketch of an array combinator pair follows this list).

  • Build combinators that allow dangling references to be written to the pickled data, usually written as a symbolic identifier. When the data is read, the identifiers must be resolved and relinked, usually by providing a function parameter that performs the resolution. This can be a useful technique when processing independent compilation units.
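
For the array case, a sketch in the same style (assuming the pickler and unpickler definitions above) writes a length prefix followed by the elements:

let arrayP f (arr: 'T[]) st =
    int32P arr.Length st                  // length prefix
    arr |> Array.iter (fun x -> f x st)   // then each element in order

let arrayU f st =
    let n = int32U st
    [| for _ in 1 .. n -> f st |]         // read the elements back in order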

Combinator-based pickling is used mainly because it allows data formats to be created and read in a relatively bug-free manner. It isn't always possible to build a single pickling library suitable for all purposes, and you should be willing to customize and extend code samples such as those listed previously in order to build a set of pickling functions suitable for your needs.

Note

Combinator-based parsing borders on a set of techniques called parser combinators that we don't cover in this book. The idea is very much the same as the combinators presented here; parsing is described using a compositional set of functions. You also can write parser combinators using the workflow notation described in Chapter 9.

Summary

This chapter covered lexing and parsing, which are tasks that can be tiresome if you don't use the right tools and techniques for the job. You learned about working with simple line-oriented formats, including on-demand reading of large data files, and then with the fslex and fsyacc tools, which are particularly suited for parsing programming languages and other languages with structured, rather than line-oriented, formats. Finally, we covered some combinator-based approaches for generating and reading binary data, which is also a form of parsing. The next chapter switches to a different area of programming and looks at how to write F# code that interoperates with native code.
