In this chapter, you take a closer look at lexing and parsing, topics introduced briefly in Chapter 9 and 11. In particular, the chapter introduces the lexer and parser generators, fslex
and fsyacc
, that come with the F# distribution. A typical scenario when these techniques and tools can come in handy is the following:
You want to read user-readable input that has a well-defined syntax.
You have a type (often an abstract syntax tree [AST] type) to represent this input.
The typical task is to parse the user input into your internal representation by breaking down the input string into a sequence of tokens (a process called lexical analysis) and then constructing an instance of your internal representation based on a grammar (via syntactic analysis). Lexing and parsing don't have to be separated, and there are often convenient .NET methods for extracting information from text in particular formats, as shown in this chapter. Nevertheless, it's often best to treat the two processes separately.
The goal of this chapter is to provide the background you need to use the built-in lexing and parsing facilities of .NET and F# effectively and to understand the options for lexing and parsing other input formats:
For lexing, we cover simple line-based techniques to crack data formats using the .NET libraries directly. You then learn how to use fslex
to break text into simple tokens, strings with escape characters, and nested comments, and you see table-based token generation and stateful lexing (passing state as a parameter as opposed to using mutable state).
For parsing, we cover the core parsing topics of languages, grammar formalisms, and various parser types. We explain the typical problems for recursive-descent parsing and introduce fsyacc
by giving a parser for Kitty, a small BASIC-like language. This highlights how to parse lists of symbols, how to assign precedence andassociativity to your tokens and production rules, and how you can recover from parsing errors. You also learn about conflicts that can arise in the fsyacc
specifications and how to resolve them.
Finally, the chapter covers combinator-based techniques, which are particularly useful for writing parsers for binary formats without relying on fslex
and fsyacc
.
A common, simple case of parsing and lexing occurs when you're working with an existing line-based text-file format. In this case, parsing is often as easy as splitting each line of input at a particular separator character and trimming whitespace off the resulting partial strings:
> let line = "Smith, John, 20 January 1986, Software Developer";;val line : string
= "Smith, John, 20 January 1986, Software Developer"
> line.Split [| ',' |];;val it : string [] = [|"Smith"; " John"; " 20 January 1986"; " Software Developer"|]
> line.Split [| ',' |] |> Array.map (fun s -> s.Trim());; val it : string [] = [|"Smith"; "John"; "20 January 1986"; "Software Developer"|]
You can then process each column in the data format:
let splitLine (line: string) = line.Split [| ',' |] |> Array.map (fun s -> s.Trim()) let parseEmployee (line: string) = match splitLine line with | [| last; first; startDate; title |] -> last, first, System.DateTime.Parse(startDate), title | _ -> failwithf "invalid employee format: '%s'" line
The type of this function is as follows:
val parseEmployee : string -> string * string * System.DateTime * string
> parseEmployee line;;val it : string * string * System.DateTime * string
= ("Smith", "John", 20/01/1986 00:00:00 { ... }, "Software Developer")
You can turn a file into an on-demand sequence of results using a sequence builder:
open System.IO let readEmployees (fileName : string) = seq { use reader = File.OpenText fileName while not reader.EndOfStream do yield reader.ReadLine() |> parseEmployee }
The following example takes the first three entries from an artificially generated file containing 10,000 copies of the same employee:
> File.WriteAllLines("employees.txt", Array.create 10000 line);;val it : unit
> let firstThree = readEmployees("employees.txt") |> Seq.take 3;;val firstThree : (string * string * System.DateTime * string) list
> for (last,first,startDate,title) in firstThree do printfn "%s %s started on %A" first last startDate;;John Smith started on 20/01/1986 00:00:00
John Smith started on 20/01/1986 00:00:00
John Smith started on 20/01/1986 00:00:00
This technique is often used to do exploratory analysis of large data files. After the algorithm is refined using a prefix of the data, the analysis can then easily be run directly over the full data file.
Another technique that's frequently used to extract information from strings is to use regular expressions. The System.Text.RegularExpressions
namespace provides convenient string-matching and -replacement functions. For example, let's say you have a log file containing a record of HTML GET
requests. Here is a sample request:
GET /favicon.ico HTTP/1.1
The following code captures the name of the requested resource (favicon.ico
) and the lower version number of the HTML protocol (1
) used:
open System.Text.RegularExpressions let parseHttpRequest line = let result = Regex.Match(line, @"GET (.*?) HTTP/1.([01])$") let file = result.Groups.[1].Value let version = result.Groups.[2].Value file, version
The relevant fields are extracted by using the Groups
attribute of the regular expression match to access the matched strings for each parenthesized group in the regular expression.
Although it's possible to hand-code lexers by using a range of ad hoc techniques such as those discussed in the previous section or by writing functions that explicitly manipulate lists of characters, doing so can be boring and time consuming. Instead, it's often easier to rely on a lexer generator to do this job for you. This section looks at how to use the fslex
tool that comes with the F# Power Pack to perform lexical analysis. The F# Power Pack also includes fsyacc
, which is also available as a separate download from MSDN and CodePlex. When you use fslex
or fsyacc
generated code in your projects, you must include a library reference to FSharp.PowerPack.dll
from the F# Power Pack.
Let's start with a simple example. Listing 16-1 shows a lexer that replaces all <
and >
characters in an input stream with their HTML equivalents, <
and >
. Listing 16-2 shows a small program that uses this generated lexer.
Example 16.1. Replacing Characters with Their HTML Equivalents: text2htmllex.fsl
{
module Text2HtmlLex
let lexeme = Lexing.LexBuffer<_>.LexemeString
(* You can add your helper functions here *)
}
rule convertHtml chan = parse
| '<' { fprintf chan "<"
convertHtml chan lexbuf }
| '>' { fprintf chan ">"
convertHtml chan lexbuf }
| eof { () }
| _ { fprintf chan "%s" (lexeme lexbuf)
convertHtml chan lexbuf }
Example 16.2. Replacing Characters with Their HTML Equivalents: text2html.fs
open System.IO open System.Text let main() =
let args = System.Environment.GetCommandLineArgs() if args.Length <= 2 then let exe = Path.GetFileName(args.[0]) eprintfn "Usage: %s dir pattern" exe exit 1 let directory = args.[1] let pattern = args.[2] for fileName in Directory.GetFiles(directory, pattern) do// Open a file stream for the file name
use inputReader = File.OpenText fileName// Create a lex buffer for use with the generated lexer. The lex buffer
// reads the inputReader stream.
let lexBuffer = Lexing.LexBuffer<_>.FromTextReader inputReader// Open an output channel
let outputFile = Path.ChangeExtension(fileName,"html") use outputWriter = (new StreamWriter(outputFile) :> TextWriter)// Write the header
fprintfn outputWriter "<html> <head></head> <pre>"// Run the generated lexer
Text2htmllex.convertHtml outputWriter lexBuffer// Write the footer
fprintfn outputWriter "</pre> </html> " main()
You can produce an F# source file from the previous lexer definition by running the following command at the Windows command prompt:
fslex text2htmllex.fsl --unicode
This produces text2htmllex.fs
, which contains the implementation of the lexer convertHtml
. This lexer is imperative, in that it prints to an output stream instead of returning tokens. The signature of the entry point to the generated lexer is as follows:
val Text2HtmlLex.convertHtml: System.IO.TextWriter -> Lexing.Lexbuffer<char> -> unit
You can now compile the driver and the lexer together:
fsc text2htmllex.fs text2html.fs –r FSharp.PowerPack.dll
You can run the resulting program as follows, giving a source directory and a file pattern and producing an .html
version of each file that matches by applying the HTML conversion:
text2html . *.txt
Let's look at the previous example more closely. The rule
section of text2htmllex.fsl
defines the lexer, which takes the output channel as an argument before the lexing buffer. It says that if you encounter the <
or >
character, you should output its HTML equivalent and recursively call your lexer to process the remaining input. If you find the end of the file, you stop; and for any other character, you print it to the output channel. In each rule, you can refer to a predefined variable (visible only inside the rule) named lexbuf
that is the instantiation of the Microsoft.FSharp.Text.Lexing.LexBuffer
type. You can access various bits of information through this variable about the lexing state; some of these are collected in Table 16-1.
The driver is all F# code. You check the input arguments and then iterate through files in the directory given by the first argument whose name matches the pattern given by the second argument. You then open each file and instantiate your generated lexer with the following lines:
use inputReader = File.OpenText fileName let lexBuffer = Lexing.LexBuffer<_>.FromTextReader inputReader ... Text2HtmlLex.convertHtml outputWriter lexBuffer
This code uses some important functions from the LexBuffer
type. Table 16-1 shows the notable static members and some of the properties in this type.
Table 16.1. Some Static Members and Properties of the LexBuffer
Type
Member | Type | Description |
---|---|---|
|
| Makes a |
|
| Makes a |
|
| Makes a |
|
| Returns the matched string |
|
| Gets/sets the current position associated with the end of the matched token |
|
| True if the |
|
| Gets/sets the current position associated with the start of the matched token |
FsLex works by constructing a table-driven finite automaton that is executed to consume the input character sequence one by one until a full token can be returned. The automaton blocks until further input is received. The states of this machine are derived from the regular expressions defined by each rule. Single-character literals advance the machine to a new state, and repetitions cause it to remain in the same state. These states form a graph, and the edges between the states are those symbols that advance between the states.
You saw the basic structure of lexer files in the preceding example, which contained only a handful of rules. In general, fslex
input files have the following simple structure:
// Preamble – any user code you need for the lexer, such as opening modules, etc. { [Code] }// Definitions – named patterns that you can use in the rules or other definitions
let [Ident_1] = [Pattern] let ...// Rules – text patterns that trigger certain actions
rule [Rule_1] [arg1... argn] = parse | [Pattern] { [Action] } | ... | [Pattern] { [Action] } and [Rule_2] [arg1... argn] = parse ... rule [Rule_3] ...// Epilogue – code that can call the lexer rules defined above
{ [Code] }
Each rule defined in the lexer becomes an F# function that can be accessed from other modules and the lexer itself. You can place comments between (*
and *)
, and you can also use //
comments in the actions just like in any other F# code. Patterns can be any of the forms listed in Table 16-2.
Lexical actions are pieces of F# code enclosed in braces; they're executed when a lexer match is made. You can put any logic here that you like; typically, you construct a token. Tokens are specified in the parser definition using the %token
directive (you see this later in this chapter); if you don't have a parser, any user-defined type will do. If your lexer rules don't construct tokens, or if your lexer is simple enough, often you may want to put all driver code in the epilogue section to create a stand-alone lexer.
Table 16.2. Patterns in Lexer Rules
Pattern Form | Description |
---|---|
| Character constants; in single quotes, such as |
| Character sets; matches any character in the given set |
| Character ranges; matches any character in the given range, in ASCII ordering |
| Complementary character sets; matches any character except those in the given character set |
| Matches the given string of characters |
| Matches any character |
Matches the end of the stream | |
| A predefined named regular expression (named earlier in the file using a |
| Zero or one occurrences of |
| One of more occurrences of |
| Zero or more occurrences of |
| Either |
| Concatenation; |
Listing 16-3 shows a lexer that constructs a list of tokens that then is printed. It can recognize integers, floats, identifiers, and the symbols ^
, *
, -
, and +
. Any other character causes a runtime exception.
Example 16.3. SimpleTokensLex.fsl
: Lexing Simple Tokens: Integers, Floats, and Identifiers
{ module SimpleTokensLex open System type token = | INT of int | FLOAT of float | ID of string | STRING of string | PLUS | MINUS | TIMES | HAT | EOF let lexeme = Lexing.LexBuffer<_>.LexemeString } let num = ['0'-'9']+ let intNum = '-'? num let floatNum = '-'? num ('.' num)? (['e' 'E'] num)? let ident = ['a'-'z']+ let whitespace = ' ' | ' ' let newline = ' ' | ' ' ' ' rule token = parse | intNum { INT (Convert.ToInt32 (lexeme lexbuf)) }
| floatNum { FLOAT (Convert.ToDouble (lexeme lexbuf)) } | ident { ID (lexeme lexbuf) } | '+' { PLUS } | '-' { MINUS } | '*' { TIMES } | '^' { HAT } | whitespace { token lexbuf } | newline { token lexbuf } | eof { EOF } | _ { failwithf "unrecognized input: '%s'" (lexeme lexbuf) }
You can generate the lexer using this:
fslex simpleTokensLex.fsl --unicode
The generated lexer contains a single module SimpleTokensLex
with one entry-point function for each rule. In this case, the type of this function is as follows:
val token: Lexing.LexBuffer<char> -> SimpleTokensLex.token
The following indicates how you can imperatively generate a simple token stream from a string and print the results in F# Interactive:
> #r "FSharp.PowerPack.dll";; ... > #load "SimpleTokensLex.fs";; > let lexbuf = Lexing.LexBuffer<_>.FromString "3.4 x 34 xyx";;val lexbuf : Lexing.LexBuffer<char>
> SimpleTokensLex.token lexbuf;;val it : SimpleTokensLex.token = FLOAT 3.4
> SimpleTokensLex.token lexbuf;;val it : SimpleTokensLex.token = ID "x"
> SimpleTokensLex.token lexbuf;;val it : SimpleTokensLex.token = INT 34
> SimpleTokensLex.token lexbuf;;val it : SimpleTokensLex.token = ID "xyx"
> SimpleTokensLex.token lexbuf;;val it : SimpleTokensLex.token = EOF
> SimpleTokensLex.token lexbuf;; System.Exception: End of file on lexing stream
Lexers generated by fslex
keep track of partial information about the position of the most recently accepted token within the source stream of characters. In particular, the StartPos
and EndPos
properties on the LexBuffer
type return Lexing.Position
values. A partial signature of this position type is as follows:
type Position with // The file name associated with the input stream. member FileName : string // The line number in the input stream, assuming fresh // positions have been updated modifying the // EndPos property of the LexBuffer. member Line : int // The character number in the input stream member AbsoluteOffset : int // Return the column number marked by the position, e.g. the // difference between the AbsoluteOffset and the StartOfLineAbsoluteOffset. member Column : int // Convert a position just beyond the end of a line to a // position at the start of the next line. member NextLine : Position ... end
In some cases, certain lexer actions must perform extra bookkeeping. In particular, the lexer should update the EndPos
property of the LexBuffer
each time a newline marker is processed (this is left up to the lexer because the interpretation of newline characters can differ between various lexers). In particular, you can change the endOfLine
rule in the lexer in Listing 16-3 to make this update:
| newline { lexbuf.EndPos <- lexbuf.EndPos.NextLine; token lexbuf }
You can now experiment with this updated lexer in F# Interactive and examine the StartPos
and EndPos
properties after fetching each token:
> let lexbuf = Lexing.LexBuffer<_>.FromString "3.4 34 xyx";;val lexbuf : Lexing.LexBuffer<char>
> SimpleTokensLex.token lexbuf;;val it : SimpleTokensLex.token = FLOAT 3.4
> (lexbuf.StartPos.Line, lexbuf.StartPos.Column);;val it : int * int = (0,0)
> (lexbuf.EndPos.Line, lexbuf.EndPos.Column);;
val it : int * int = (0,3)
> SimpleTokensLex.token lexbuf;;val it : SimpleTokensLex.token = INT 34
> (lexbuf.StartPos.Line, lexbuf.StartPos.Column);; val it : int * int = (1,1)
Often, you may need to attach position information to each lexer token. However, when you use lexers in conjunction with fsyacc
parser generators, the position information is automatically read after each token is processed and then stored in the parser's state. We return to this topic later in this chapter.
So far, you've seen examples with one lexing rule only. This is because the main lexer rule was sufficient for all tokens and you haven't yet come across the need to lex input that can't be described by a regular expression. To illustrate this point, for instance, say you want to lex comments enclosed by (*
and *)
. Formally, you have an opening delimiter, followed by the body of the comment, and finally enclosed by the closing delimiter. The first attempt, shown here
"(*" _* "*)"
fails because the middle pattern matches everything and you never reach the closing *)
. So, the best compromise could be as follows
"(*" [^ '*']* "*)"
where you match the inside of the comment as long as you don't see an asterisk, and then you try to match the closing *)
. This of course fails on any comment that contains an asterisk. You can play with this regular expression a little more. The inside of the comment is either anything but an asterisk or all asterisks that aren't followed by another asterisk or right parenthesis:
"(*" ([^ '*'] | ('*'+ ([^ '*' ')'])))* '*'+ ')'
This is about as close as you can get, and yet even this pattern has a problem: it can't match nested comments; it always stops at the first closing delimiter, ignoring all nested comment openers.
You can handle this problem by using a multirule lexer. The following rules show the additions you can make to the SimpleTokensLex.fsl
lexer from Listing 16-3 in order to properly handle comments and strings:
rule token = ... | "(*" { comment lexbuf; token lexbuf } | """ { STRING (string lexbuf.StartPos "" lexbuf) } | _ { failwith "unrecognized input: '%s'" (lexeme lexbuf) } and comment = parse | "(*" { comment lexbuf; comment lexbuf } | "*)" { () } | " " { lexbuf.EndPos <- lexbuf.EndPos.NextLine
comment lexbuf } | eof { failwith "Unterminated comment" } | _ { comment lexbuf } and string pos s = parse | "\" ('"' | 'n' | 'r' | 't') { let s' = s + (match lexeme lexbuf with | "\"" -> """ | "\n" -> " " | "\r" -> " " | "\t" -> " " | "\\" -> "\" | _ -> "") string pos s' lexbuf } | """ { s } | " " { lexbuf.EndPos <- lexbuf.EndPos.NextLine string pos (s + " ") lexbuf } | eof { failwithf "end of file in string started at or near %A" pos } | _ { string pos (s + (lexeme lexbuf)) lexbuf }
Comment processing begins when you encounter (*
in the token
rule. When the closing *)
is encountered, you exit one invocation of the comment
rule. The idea is that you deal with nested comments by recursively applying the lexer when a nested comment is reached. Note the double invocation of comment lexbuf
for nested comment delimiters: once to tokenize the comment that belongs to the opener and again to tokenize the rest of the enclosing comment. There are also two further matches within the comment rule. If you hit the end of the source stream, you have an unterminated comment, and you raise an exception; in every other case, you move forward inside the comment.
In the example, strings are also handled by a separate lexer rule string
that is invoked by the token
lexer when you encounter the double-quote character. This rule takes two parameters: the string consumed so far and the start position of the string. The latter is used to report a nicer error for an unterminated string (you could also use this technique for the matching case for comments). You can also check whether you have an escaped character in the input. If so, you append the appropriate escape sequence to the string already accumulated and advance the current position. Upon encountering the closing character, you return the overall string. An imperative System.Text.StringBuffer
object can also be used to accumulate the string, which is more efficient if strings get very long.
Because lexer rules can pass arguments, there is little need to use mutable state in a lexer—instead, pass additional arguments.
You can now turn your attention to parsing. Let's assume for the moment that you're writing an application that performs simple symbolic differentiation, say on polynomials only. You want to read polynomials such as x^5-2x^3+20
as input from your users, which in turn is converted to your internal polynomial representation so that you can perform symbolic differentiation and pretty-print the result to the screen. One way to represent polynomials is as a list of terms that are added or subtracted to form the polynomial:
type Term = | Term of int * string * int | Const of int type Polynomial = term list
For instance, the polynomial in this example is as follows:
[Term (1,"x",5); Term (−2,"x",3); Const 20]
In Listing 16-3, you built a lexer and a token
type suitable for generating a token stream for the input text (shown as a list of tokens here):
[ID "x"; HAT; INT 5; MINUS; INT 2; ID "x"; HAT; INT 3; PLUS; INT 20]
Listing 16-4 shows a recursive-decent parser that consumes this token stream and converts it into the internal representation of polynomials. The parser works by generating a lazy list for the token stream. Lazy lists are a data structure in the F# library module Microsoft.FSharp.Collections.LazyList
, and they're a lot like sequences with one major addition: lazy lists effectively allow you to pattern-match on a sequence and return a residue lazy list for the tail of the sequence.
Example 16.4. Recursive-Descent Parser for Polynomials
open SimpleTokensLex open Microsoft.FSharp.Text.Lexing type Term = | Term of int * string * int | Const of int type Polynomial = term list type TokenStream = LazyList<token * Position * Position> let tryToken (src: TokenStream ) = match src with | LazyList.Cons ((tok, startPos, endPos), rest) -> Some(tok, rest) | _ -> None let parseIndex src = match tryToken src with | Some (HAT, src) -> match tryToken src with | Some (INT num2, src) -> num2, src | _ -> failwith "expected an integer after '^'" | _ -> 1, src let parseTerm src = match tryToken src with | Some (INT num, src) -> match tryToken src with | Some (ID id, src) -> let idx, src = parseIndex src
Term (num, id, idx), src | _ -> Const num, src | Some (ID id, src) -> let idx, src = parseIndex src Term(1, id, idx), src | _ -> failwith "end of token stream in term" let rec parsePolynomial src = let t1, src = parseTerm src match tryToken src with | Some (PLUS, src) -> let p2, src = parsePolynomial src (t1 :: p2), src | _ -> [t1], src
The functions here have the following types (using the type aliases you defined):
val tryToken : TokenStream -> (token * TokenStream ) option
val parseIndex : TokenStream -> int * TokenStream
val parseTerm : TokenStream -> Term * TokenStream
val parsePolynomial : TokenStream -> Polynomial * TokenStream
You can turn the fslex
-generated lexer for the lexer specification in Listing 16-3 into a TokenStream
using the following code:
let getTokenStream inp : TokenStream =// Generate the token stream as a seq<token>
seq { let lexbuf = LexBuffer<_>.FromString inp while not lexbuf.IsPastEndOfStream do match SimpleTokensLex.token lexbuf with | EOF -> yield! [] | token -> yield (token, lexbuf.StartPos, lexbuf.EndPos) }// Convert to a lazy list
|> LazyList.ofSeq let parse input = let src = getTokenStream input let result, src = parsePolynomial src match tryToken src with | Some _ -> failwith "unexpected input at end of token stream!" | None -> result
These functions have the following types:
val getTokenStream : string -> TokenStream
val parse: string -> polynomial
Note in the previous examples that you can successfully parse either constants or complete terms; but after you locate a HAT symbol, a number must follow. This sort of parsing, in which you look only at the next token to guide the parsing process, is referred to as LL(1), which stands for left-to-right, leftmost derivation parsing; 1 means that only one look-ahead symbol is used. The parser approach you used earlier is called recursive-descent. This has various advantages and disadvantages, and we discuss those in a bit. To conclude, you can look at the parse
function in action:
> parse "1+3";;
val it : term list = [Const 1; Const 3]
> parse "2x^2+3x+5";;
val it : term list = [Term (2,"x",2); Term (3,"x",1); Const 5]
There are various limitations that you can't handle with recursive-descent parsers. For instance, if you translate a left-recursive production into code, you get an infinite recursion, like so:
<polynomial> ::= <polynomial> '+' <term> | <term>
This corresponds to the following:
let rec parsePolynomial src = let poly, src = parsePolynomial src ...
Another common problem with LL(k) parsing for some k
>=1
is that the grammar rules for a given nonterminal can't start with the same symbols, or else there is no easy way to decide which rule to apply (because each is determined to be applicable upon checking k
number of symbols). In such cases, left-factoring can be applied. For example, you can move the symbols after the common part into another production, as shown here:
<polynomial> ::= <term> | <term> '+' <polynomial> can be refactored as <polynomial> ::= <term> <polymomialTail> <polynomialTail> ::= EPSILON | '+' <polynomial>
Here, EPSILON
is the empty symbol, so the function that parses polynomialTail
both checks for a plus symbol and issues a call to parse a polynomial or exit in the absence of an initial plus symbol, leaving the input unchanged. Although this case is relatively simple (you can parse terms until they're followed by a plus sign iteratively), coding such grammar rules in the general case is cumbersome.
Problems such as these arise because you have to make parsing decisions early on (such as deciding which grammar rule you're pursuing if there are multiple choices for a given nonterminal), because they construct the parse tree from the top and proceed downward. LR parsers, on the other hand, postpone these decisions as much as possible and construct the parse tree bottom-up, resulting in much more flexibility both in terms of how naturally grammar rules can be expressed and how they can be mapped into code.
The tool fsyacc
generates LALR(1) parsers, which are a special subset of LR(1) parsers where the state table is compressed by merging similar states. This in practice doesn't limit the languages that can be parsed, but it does result in significant savings over the parse-table size. The generated parser automaton performs one of four distinct operations in any state based on the look-ahead token, and these are important to understand if you have various grammar conflicts to fix. It can shift the look-ahead (always a terminal symbol) to its stack, reduce a number of stack entries by a grammar rule leaving the head symbol in their place, accept the input as syntactically correct, or reject in the opposite case. Parsing proceeds until an accept or a reject state is reached.
You first see how to develop a parser for a simple programming language. A sample fragment of the BASIC-like language you want to parse is shown here:
a := 1; b := 0; if a then d := 20 + 20; if b then d := 40 * 20 + 20; print d; while d do begin d := d + 1; print d end; print d
For simplicity, let's call this language Kitty. As the previous example shows, Kitty supports naming values, printing values, basic arithmetic operators, and a while
and conditional construct. The Ast
module (shown in Listing 16-5) defines the internal representation for Kitty programs.
Example 16.5. KittyAst.fs
: Defining the AST for Kitty Programs
module Ast type expr = | Val of string | Int of int | Plus of expr * expr | Minus of expr * expr | Times of expr * expr type stmt = | Assign of string * expr | While of expr * stmt | Seq of stmt list | IfThen of expr * stmt | IfThenElse of expr * stmt * stmt | Print of expr type prog = Prog of stmt list
Listing 16-6 shows a lexer for the language in the file KittyLexer.fsl
. It's similar to lexers developed earlier in this chapter. The one exception is that you use a keyword table. Matching against lexemes to identify tokens is a sensible solution only if there are relatively few cases. Tokenizing a large set of keywords and operators using explicit rules can lead to large lexers. This situation is often handled using tables that contain the possible lexeme matches and the tokens to be returned. Listing 16-6 uses simple dictionaries (maps).
Example 16.6. KittyLexer.fsl
: Lexer for Kitty
{ module KittyLexer open System open KittyParser open Microsoft.FSharp.Text.Lexing let lexeme = LexBuffer<_>.LexemeString let ids = [ ("while", WHILE); ("begin", BEGIN); ("end", END); ("do", DO); ("if", IF); ("then", THEN); ("else", ELSE); ("print", PRINT);] let idsMap = Map.ofList ids let ident lexbuf tokenText = if Map.containsKey tokenText idsMap then Map.find tokenText idsMap else ID tokenText } let num = ['0'-'9']+ let alpha = ['a'-'z' 'A'-'Z'] let ident = alpha+ (alpha | ['_' '$'])* let integer = '-'? num let whitespace = ' ' | ' ' let newline = ' ' | ' ' ' ' rule token = parse | whitespace { token lexbuf } | newline { lexbuf.EndPos <- lexbuf.EndPos.NextLine; token lexbuf } | "(" { LPAREN } | ")" { RPAREN } | "+" { PLUS } | "-" { MINUS } | "*" { TIMES } | ";" { SEMI } | ":=" { ASSIGN } | ident { ident lexbuf (lexeme lexbuf) } | integer { INT (Int32.Parse(lexeme lexbuf)) } | eof { EOF }
Note that at compilation time, the lexer depends on the parser defined later in Listing 16-7. This is because the lexer must return the type of tokens required by the parser.
You can generate the lexer by calling fslex
:
fslex KittyLexer.fsl --unicode
This produces KittyLexer.fs
, which contains the implementation of the lexer.
Listing 16-7 shows the parser specification for the Kitty language in the file KittyParser.fsy
.
Example 16.7. KittyParser.fsy
: Parser for Kitty
%{ open Ast %}// The start token becomes a parser function in the compiled code.
%start start// These are the terminal tokens of the grammar along with the types of
// the data carried by each token:
%token <string> ID %token <int> INT %token PLUS MINUS TIMES LPAREN RPAREN IF THEN ELSE %token WHILE DO BEGIN END PRINT SEMI ASSIGN EOF// Associativity and Precedences - Lowest precedence comes first
%left PLUS MINUS %left TIMES// This is the type of the data produced by a successful reduction
// of the 'start' symbol:
%type <prog> start %% start: Prog { $1 } Prog: StmtList { Prog (List.rev $1) } Expr: ID { Val $1 } | INT { Int $1 } | Expr PLUS Expr { Plus ($1, $3) } | Expr MINUS Expr { Minus ($1, $3) } | Expr TIMES Expr { Times ($1, $3) } | LPAREN Expr RPAREN { $2 } Stmt: ID ASSIGN Expr { Assign ($1, $3) } | WHILE Expr DO Stmt { While ($2, $4) }
| BEGIN StmtList END { Seq (List.rev $2) } | IF Expr THEN Stmt { IfThen ($2, $4) } | IF Expr THEN Stmt ELSE Stmt { IfThenElse ($2, $4, $6) } | PRINT Expr { Print $2 } StmtList: | Stmt { [$1] } | StmtList SEMI Stmt { $3 :: $1 }
You can generate the parser by calling fsyacc
:
fsyacc KittyParser.fsy --module KittyParser
This produces KittyParser.fs
, which contains the implementations of the parser, along with a signature file KittyParser.fsi
. The generated parser has one entry point for each %start
symbol (here there is only one). The type of this entry point is as follows:
val start : (LexBuffer<'a> -> token) -> LexBuffer<'a> -> Ast.prog
You see how to use this function a little later in this chapter. In a roundabout way, the type says, "If you give me a lexing function that generates tokens and give me a LexBuffer
to supply to that lexing function, then I'll generate you an Ast.prog
." The generic type variable indicates that the parser is independent of the characters manipulated by the LexBuffer
.
We now cover in more detail the different aspects of fsyacc
parsing illustrated by this example.
In Kitty, statements can be separated by semicolons. This is handled in the StmtList
grammar production, whose semantic extract is a list of statements. Note that you could have written this rule in a head-recursive way:
StmtList: | Stmt { [$1] } | Stmt SEMI StmtList { $1 @ [ $3 ] }
Unlike in recursive-descent or any other LL parsing technique, the previous rule doesn't pose a problem for fsyacc
, and thus no left-factoring is needed. However, it does create a copy of the statement list each time a new expression is appended to it. You eliminate this by using the following productions
StmtList: | Stmt { [$1] } | StmtList SEMI Stmt { $3 :: $1 }
combined with a List.rev
where the rule is used. This rule consumes all statements and inserts them, one by one, into the singleton list that contains the first statement. As a result, the return list is in reverse order, which is why you need to apply List.rev
. You may want to define a separate rule to perform this operation. Another feature that is often needed is the ability to parse empty or optional lists. This can be easily accomplished using an empty (epsilon) symbol, as in the following example:
StmtListOpt: { [] } | StmtList { $1 }
This rule matches an optional list of statements and returns an empty list if no statements can be parsed.
As usual with arithmetic operators, division and multiplication should take precedence over addition and subtraction: 1+2*3
should be parsed as 1+(2*3)
. With fsyacc
, this can be expressed easily using the associativity directives—or, to be more precise, their ordering:
// Associativity and Precedences - Lowest precedence comes first
%left PLUS MINUS
%left TIMES
By specifying what tokens associate and where (how strongly they bind), you can control how parse derivations are performed. For instance, giving left-associativity to the addition operator (PLUS
), given an input 1+2+3
, the parser automatically generates a nonambiguous derivation in the form of (1+2)+3
. The basic arithmetic operators are left-associative and should be listed from the lowest precedence to the highest; in the example, the addition and subtraction operators have lower precedence than multiplication—the way it should be. Other associativity specifications include %nonassoc
and %right
, which are used to denote that a given symbol doesn't associate or associates to the right, respectively. The former is useful for relational and equality operators such as <
, >
, or !=
, where the operator isn't applicable if applied multiple times: 1 > 2 > 3
yields a syntax error.
You can also give precedence to a rule by using %prec
at the end of the rule and giving a token whose precedence is to be applied. You can list arbitrary tokens in the associativity and precedence declarations, even if they haven't been declared as tokens, and use them in such situations. You can find more details about specifying precedence at www.expert-fsharp.com/Topics/FsYacc
.
One option for fsyacc.exe
is -v
, which causes fsyacc
to produce a readable extract of the parser's states. This is useful when there are various conflicts to resolve.
Each state in this extract corresponds to one or more items, which are productions that indicate what has been seen while parsing them. This current position with respect to a rule is marked with a period (.
). Furthermore, to each state belong various actions that are triggered by certain look-ahead symbols. For instance, the action in some state as follows
action 'ID' (noprec): shift 7
indicates that if the ID token, which has no defined precedence, is encountered as look-ahead, the parser pushes this token to the parsing stack and shifts to state 7. For reduce actions, the rule that is reduced is shown. An error (reject) or accept action is shown for tokens that trigger a syntax error or acceptance, respectively.
The parser state extract also provides useful information about conflicts in your grammar. Conflicts arise when your grammar is ambiguous (which translates to having more than one choice for a parser action at any time), ultimately meaning more than one derivation can accept a given input. You can do a number of things to disambiguate your grammar. You can apply precedence to various tokens or rules, or rewrite your rules to be consistent and unambiguous. There are two main sources of grammar conflicts: reduce-reduce and shift-reduce conflicts. Reduce-reduce conflicts are considered really bad because there are multiple rules to reduce by at a given situation. Although fsyacc
applies a disambiguation strategy (reducing by the grammar rule that was defined earlier), you should avoid reduce-reduce conflicts as much as possible.
Shift-reduce conflicts arise when the parser has the choice to shift a token or reduce by a rule. Unless you fix this conflict, fsyacc
favors the shift action and defers the reduction to a later point. In some situations (for instance, the dangling-else problem), this yields the expected behavior; but in general, any such conflict is also considered a serious problem.
You can generate the parsers and lexers by calling fslex
and fsyacc
:
fslex KittyLexer.fsl --unicode fsyacc KittyParser.fsy --module KittyParser
This produces KittyLexer.fs
and KittyParser.fs
, which contain the implementations of the parser and lexer. You can test these using F# Interactive by loading the files directly using the #load
directive. The following code creates a LexBuffer
called lexbuf
. It then calls the KittyParser.start
entry point for the parser, passing KittyLexer.token
as the lexical analysis engine and lexbuf
as the LexBuffer
. This connects the parser and the lexer:
open Ast open KittyParser open KittyLexer let parseText text = let lexbuf = Lexing.LexBuffer<_>.FromString text try KittyParser.start KittyLexer.token lexbuf with e -> let pos = lexbuf.EndPos failwithf "Error near line %d, character %d " pos.Line pos.Column
You can now test this function interactively:
> let sample = "counter := 100; accum := 0; while counter do begin counter := counter - 1; accum := accum + counter end; print accum";;val sample : string = "..."
> parseText sample;;val it : Ast.prog
= Prog
[Assign ("counter",Int 100); Assign ("accum",Int 0);
While
(Val "counter",
Seq
[Assign ("counter",Minus (Val "counter",Int 1));
Assign ("accum",Plus (Val "accum",Val "counter"))]); Print Val "accum"]
Writing an evaluator for Kitty is straightforward. Here, you use an environment that maps variable names to the integer values they store. As expected, assignments in the source language add a binding for a given variable, and evaluating variables reads a value from this environment. Because of the lack of other types in Kitty, you use a nonzero value for the Boolean true and zero for false and wire the logic of the conditional and looping construct accordingly:
let rec evalE (env: Map<string, int>) = function | Val v -> if env.ContainsKey v then env.[v] else failwith ("unbound variable: " + v) | Int i -> i | Plus (e1, e2) -> evalE env e1 + evalE env e2 | Minus (e1, e2) -> evalE env e1 - evalE env e2
| Times (e1, e2) -> evalE env e1 * evalE env e2 and eval (env: Map<string, int>) = function | Assign (v, e) -> env.Add(v, evalE env e) | While (e, body) -> let rec loop env e body = if evalE env e <> 0 then loop (eval env body) e body else env loop env e body | Seq stmts -> List.fold eval env stmts | IfThen (e, stmt) -> if evalE env e <> 0 then eval env stmt else env | IfThenElse (e, stmt1, stmt2) -> if evalE env e <> 0 then eval env stmt1 else eval env stmt2 | Print e -> printf "%d" (evalE env e); env
With these at hand, continuing the same interactive session, you can now evaluate the sample Kitty program:
> match parseText sample with | Prog stmts -> eval Map.empty (Seq stmts) |> ignore;; 4950 val it : unit = ()
One final case of parsing is common when working with binary data. That is, say you want to work with a format that is conceptually relatively easy to parse and generate (such as a binary format) but where the process of actually writing the code to crack and encode the format is somewhat tedious. This section covers a useful set of techniques to write readers and writers for binary data quickly and reliably.
The running example shows a set of pickling (also called marshalling) and unpickling combinators to generate and read a binary format of our design. You can easily adapt the combinators to work with existing binary formats such as those used for network packets. Picklers and unpicklers for different data types are function values that have signatures as follows:
type outstate = System.IO.BinaryWriter type instate = System.IO.BinaryReader type pickler<'T> = 'T -> outstate -> unit type unpickler<'T> = instate -> 'T
Here, instate
and outstate
are types that record information during the pickling or parsing process. In this section, these are just binary readers and writers; but more generally, they can be any type that can collect information and help compact the data during the writing process, such as by ensuring that repeated strings are given unique identifiers during the pickling process.
At the heart of every such library lies a set of primitive leaf functions for the base cases of aggregate data structures. For example, when you're working with binary streams, this is the usual set of primitive read/write functions:
let byteP (b: byte) (st: outstate) = st.Write(b) let byteU (st: instate) = st.ReadByte()
You can now begin to define additional pickler/unpickler pairs:
let boolP b st = byteP (if b then 1uy else 0uy) st let boolU st = let b = byteU st in (b = 1uy) let int32P i st = byteP (byte (i &&& 0xFF)) st byteP (byte ((i >>> 8) &&& 0xFF)) st byteP (byte ((i >>> 16) &&& 0xFF)) st byteP (byte ((i >>> 24) &&& 0xFF)) st let int32U st = let b0 = int (byteU st) let b1 = int (byteU st) let b2 = int (byteU st) let b3 = int (byteU st) b0 ||| (b1 <<< 8) ||| (b2 <<< 16) ||| (b3 <<< 24)
These functions have the following types:
val byteP : pickler<byte>
val byteU : unpickler<byte>
val boolP : pickler<bool>
val boolU : unpickler<bool>
val int32P : pickler<int>
val int32U : unpickler<int>
So far, so simple. One advantage of this approach comes as you write combinators that put these together in useful ways. For example, for tuples
let tup2P p1 p2 (a, b) (st: outstate) = (p1 a st : unit) (p2 b st : unit) let tup3P p1 p2 p3 (a, b, c) (st: outstate) = (p1 a st : unit) (p2 b st : unit) (p3 c st : unit) let tup2U p1 p2 (st: instate) = let a = p1 st let b = p2 st (a, b)
let tup3U p1 p2 p3 (st: instate) = let a = p1 st let b = p2 st let c = p3 st (a, b, c)
and for lists:
// Outputs a list into the given output stream by pickling each element via f.
let rec listP f lst st = match lst with | [] -> byteP 0uy st | h :: t -> byteP 1uy st; f h st; listP f t st// Reads a list from a given input stream by unpickling each element via f.
let listU f st = let rec ulist_aux acc = let tag = byteU st match tag with | 0uy -> List.rev acc | 1uy -> let a = f st in ulist_aux (a::acc) | n -> failwithf "listU: found number %d" n ulist_aux []
These functions conform to the following types:
val tup2P : 'a pickler -> 'b pickler -> ('a * 'b) pickler
val tup3P : 'a pickler -> 'b pickler -> 'c pickler -> ('a * 'b * 'c) pickler
val tup2U : 'a unpickler -> 'b unpickler -> ('a * 'b) unpickler
val tup3U : 'a unpickler -> 'b unpickler -> 'c unpickler -> ('a* 'b* 'c) unpickler
val listP : 'a pickler -> 'a list pickler
val listU : 'a unpickler -> 'a list unpickler
It's now beginning to be easy to pickle and unpickle aggregate data structures using a consistent format. For example, imagine that the internal data structure is a list of integers and Booleans:
type format = list<int32 * bool> let formatP = listP (tup2P int32P boolP) let formatU = listU (tup2U int32U boolU) open System.IO let writeData file data = use outStream = new BinaryWriter(File.OpenWrite(file)) formatP data outStream let readData file = use inStream = new BinaryReader(File.OpenRead(file)) formatU inStream
You can now invoke the pickle/unpickle process as follows:
> writeData "out.bin" [(102, true); (108, false)] ;;
val it : unit = ()
> readData "out.bin";;
val it : (int * bool) list = [(102, true); (108, false)]
Combinator-based pickling is a powerful technique and can be taken well beyond what has been shown here. For example, it's possible to do the following:
Ensure data is compressed and shared during the pickling process by keeping tables in the input and output states. Sometimes this requires two or more phases in the pickling and unpickling process.
Build in extra-efficient primitives that compress leaf nodes, such as writing out all integers using BinaryWriter.Write7BitEncodedInt
and BinaryReader.Read7BitEncodedInt
.
Build extra combinators for arrays, sequences, and lazy values and for lists stored in other binary formats than the 0/1 tag scheme used here.
Build combinators that allow dangling references to be written to the pickled data, usually written as a symbolic identifier. When the data is read, the identifiers must be resolved and relinked, usually by providing a function parameter that performs the resolution. This can be a useful technique when processing independent compilation units.
Combinator-based pickling is used mainly because it allows data formats to be created and read in a relatively bug-free manner. It isn't always possible to build a single pickling library suitable for all purposes, and you should be willing to customize and extend code samples such as those listed previously in order to build a set of pickling functions suitable for your needs.
Combinator-based parsing borders on a set of techniques called parser combinators that we don't cover in this book. The idea is very much the same as the combinators presented here; parsing is described using a compositional set of functions. You also can write parser combinators using the workflow notation described in Chapter 9.
This chapter covered lexing and parsing, which are tasks that can be tiresome if you don't use the right tools and techniques for the job. You learned about working with simple line-oriented formats, including on-demand reading of large data files, and then with the fslex
and fsyacc
tools, which are particularly suited for parsing programming languages and other languages with structured, rather than line-oriented, formats. Finally, we covered some combinator-based approaches for generating and reading binary data, which is also a form of parsing. The next chapter switches to a different area of programming and looks at how to write F# code that interoperates with native code.