In this chapter, we examine the various ways in which an F# program can transfer data, including printing to the screen and saving and loading information on disc. In particular, we examine some sophisticated tools for F# which greatly simplify the task of designing and implementing programs to load files in well-defined formats.
The ability to print information to the screen, or into a string or file, is useful as a means of conveying the result of a program (if the result is simple enough), providing run-time information on the current state of the program and providing extra information to aid debugging. Naturally, F# provides several functions to print to the screen. In particular, the printf function can be used to print a variety of different types and perform some simple formatting. The printf function understands several format specifiers:
%s print a string.
%d print an int.
%f print a float.
%g print a float in scientific notation.
%a print a value using a custom print function.
%O print a value of any type using the ToString() method.
%A print a value of any type using the built-in structural pretty printer.
Special characters can also be printed:

\n newline.
\" double-quote.
\t tab.
For example, the following prints a string in quotes, an int and then a float:

> printf "String: \"%s\", int: %d, float: %f" "foo" 3 7.4;;
String: "foo", int: 3, float: 7.4
The following uses the generic %A format specifier to print a value as a running F# interactive session would:

> printf "%A" (1, 2.3, "foo");;
(1, 2.3, "foo")
The ability to pretty print values of any type at run time is particularly useful during debugging. There are also related functions that use the same syntax to print into a string or to a channel.
Values can be printed into strings using the sprintf function, which accepts the same format specifiers as the printf function. For example, the following produces the same result as the previous example but returns it as a string:

> sprintf "String: \"%s\", int: %d, float: %f" "foo" 3 7.4;;
val it : string = "String: \"foo\", int: 3, float: 7.4"
The %a format specifier is most easily elucidated using an example based upon the sprintf function. This format specifier expects two corresponding arguments, a print function g and its argument x:

> let f g x = sprintf "%a" g x;;
val f : (unit -> 'a -> string) -> 'a -> string
This is particularly useful when printing recursive data structures such as lists and trees as a print function can pass itself or another related function as an argument to sprintf.
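For example, a printer for lists of ints can pass itself to sprintf via %a to handle the tail of the list recursively. The following is a sketch in the same style as the expr printer below; the name string_of_int_list is our own:

```fsharp
// A recursive printer for int lists: the %a specifier applies the given
// print function (here, string_of_int_list itself) to the tail.
let rec string_of_int_list () = function
  | [] -> ""
  | [x] -> sprintf "%d" x
  | x :: t -> sprintf "%d; %a" x string_of_int_list t
```

For example, sprintf "[%a]" string_of_int_list [1; 2; 3] returns the string "[1; 2; 3]".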
The F# interactive mode can be supplemented with custom pretty printers for user-defined types by passing a function, which converts a value of the type into the corresponding string, to the fsi.AddPrinter function. A function to convert a value of the type expr to a string may be written using the sprintf function and the %d and %a format specifiers:
> let rec string_of_expr () = function
    | Int n -> sprintf "%d" n
    | Var v -> v
    | Add(f, g) -> sprintf "%a + %a" string_of_expr f string_of_expr g
    | Mul(f, g) -> sprintf "%a * %a" string_of_mul f string_of_mul g
  and string_of_mul () = function
    | Int _ | Mul _ as f -> string_of_expr () f
    | Add _ as f -> sprintf "(%a)" string_of_expr f;;
val string_of_expr : unit -> expr -> string
val string_of_mul : unit -> expr -> string
The printing of subexpressions inside a product is dispatched to the auxiliary function string_of_mul in order to bracket subexpressions that have lower precedence. Specifically, any addition inside a multiplication must be bracketed when pretty printing an expression.
The previous example expression is now printed more comprehensibly by the F# interactive mode:

> (Int 1 + Var "x") * Int 3;;
val it : expr = (1 + x) * 3
Note that the subexpression 1 + x is bracketed correctly when it is printed because the string_of_expr function takes the relative precedence of multiplication over addition into account.
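Such a printer can then be registered with the interactive session. The following one-liner is a sketch assuming the string_of_expr function defined above:

```fsharp
// Register a custom pretty printer for the expr type with F# interactive.
fsi.AddPrinter (fun (e : expr) -> string_of_expr () e)
```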
In addition to printing values of specific types, F# also provides a print_any function (equivalent to printf "%A") to print values of any type. For example, to print a list of 2-tuples in F# syntax:

> print_any [1, 2; 2, 3];;
[(1, 2); (2, 3)]
Analogously, there is an any_to_string function:

> any_to_string [1, 2; 2, 3];;
val it : string = "[(1, 2); (2, 3)]"
This is particularly useful when debugging, to avoid having to write conversion functions for each and every type.
The act of saving data in a file is performed by opening the file, writing data to it and then closing the file. F# provides an elegant and efficient way to do this using functionality provided by .NET in the System.IO namespace combined with F#'s sequence expressions.
The following lines_of_file function returns a sequence of the lines in the file with the given name:

> open System.IO;;
> let lines_of_file filename =
    seq { use stream = File.OpenRead filename
          use reader = new StreamReader(stream)
          while not reader.EndOfStream do
            yield reader.ReadLine() };;
val lines_of_file : string -> seq<string>
The sequence expression in this function loops until the end of the file, reading and yielding each line of the file in turn. The use bindings are analogous to let bindings except that they also dispose of the created objects when enumeration over the sequence is complete. In this case, the use bindings cause the file to be closed when enumeration over the sequence is complete.
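A complementary function for writing can be defined in the same style. The following sketch (the name lines_to_file is our own) uses the same System.IO classes to write a sequence of lines to a file, with the use binding closing the file when writing is complete:

```fsharp
open System.IO

// Write each line of a sequence to the named file, overwriting any
// existing contents. The writer is disposed of, and the file closed,
// when the function returns.
let lines_to_file filename (lines : seq<string>) =
  use writer = new StreamWriter(File.Create filename)
  for line in lines do
    writer.WriteLine line
```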
The .NET platform actually provides similar functions for reading whole files, such as the ReadToEnd() method of the StreamReader class, but this sequence expression has an important advantage: the file is only read as the sequence is enumerated over. This is a form of lazy evaluation and is one of the most important benefits of sequences in F#. For example, this lines_of_file function can be used to read only the first three lines of a file without having to read the entire file first:

> lines_of_file "fib.ml" |> Seq.take 3;;
val it : seq<string>
= ["let rec fib = function"; " | 0 | 1 as n -> n"; " | n -> fib(n-1) + fib(n-2)"]
The lines_of_file function can be used to count the number of space-separated words in a file as follows:

> lines_of_file @"C:\bible13.txt"
  |> Seq.map (String.split [' '] >> Seq.length)
  |> Seq.fold (+) 0;;
val it : int = 823647
An important characteristic of this word counter is that it handles a single line at a time and, consequently, has very low memory requirements compared to more obvious solutions that read in the whole file. This kind of programming makes F# interactive sessions a very powerful tool for data dissection and manipulation.
This functionality is of particular importance when handling large amounts of data, as in many database applications. This function will be used to write a simple word frequency counter in chapter 6. The use of sequence expressions to transparently interrogate databases on demand is covered in chapter 10.
The following save function uses the built-in .NET serialization routines to save a value of any type to the given file:

> open System.Runtime.Serialization.Formatters.Binary;;
> let save filename x =
    use stream = new FileStream(filename, FileMode.Create)
    (new BinaryFormatter()).Serialize(stream, x);;
val save : string -> 'a -> unit
For example, the following saves a 3-tuple to the "test.dat" file:

> save "test.dat" (1, 3.0, ["piece"]);;
val it : unit = ()
The following load function reads back a value that was saved by the save function:

> let load filename =
    use stream = new FileStream(filename, FileMode.Open)
    (new BinaryFormatter()).Deserialize(stream) |> unbox;;
val load : string -> 'a
For example, the saved 3-tuple can be loaded back with:

> (load "test.dat" : int * float * string list);;
val it : int * float * string list = (1, 3.0, ["piece"])
Note that we are careful to annotate the type of the data. Although such type annotations can be omitted in compiled code under certain circumstances, it is always a good idea to make this information explicit in the source code as we have done.
The standard .NET serialization routines can store and retrieve arbitrary data in a binary format. Data stored in non-trivial formats is typically read using techniques known as lexing and parsing. Serialization is simpler but often an order of magnitude slower than using a custom format, even a readable text-based format.
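For example, a minimal readable text format for a list of floats might store one number per line, reusing the lines_of_file function from section 5.3 for loading. The following is only a sketch; the function names save_floats and load_floats are our own:

```fsharp
open System.IO

// Save a list of floats, one per line, in a human-readable text file.
let save_floats filename (xs : float list) =
  use writer = new StreamWriter(File.Create filename)
  for x in xs do
    writer.WriteLine(sprintf "%g" x)

// Load the floats back by converting each line of the file.
let load_floats filename =
  [ for line in lines_of_file filename -> float line ]
```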
In addition to the primitive input and output functions offered by F#, the language is bundled with very powerful tools, called fslex and fsyacc, for deciphering the content of files according to a formal grammar. This aspect of the language has been particularly well honed due to the widespread use of this family of languages for writing compilers and interpreters, i.e. programs which understand, and operate on, other programs.
In the context of scientific computing, providing data in a human readable format with a formally defined grammar is highly desirable. This allows a data file to convey useful information both to a human and to a computer audience. In the case of a human, the file can be read using a text editor. In the case of a computer, the file can be parsed to produce a data structure which reflects the information in the file (illustrated in figure 5.1). In the latter case, the program could then go on to perform computations on the data.
The ability to use fslex and fsyacc is, therefore, likely to be of great benefit to scientists. We shall now examine the use of these tools in more detail.
The first step in using these tools to interpret a file is called lexing. This stage involves reading characters from the input, matching them against patterns and outputting a stream of tokens. A token is a value which represents some of the input. For example, a sequence of space-separated digits could be lexed into a stream of integer-valued tokens.
String.split
The task of lexing can be elucidated by the simple example of splitting a file into lines and words, expected to represent the rows and elements of a matrix, respectively.
A function to read a matrix by splitting each line into words, converting each word into a float and constructing an F# Matrix may be written in terms of the lines_of_file function that was defined in section 5.3:

> let read_matrix filename =
    lines_of_file filename
    |> Seq.map (String.split [' '])
    |> Seq.map (Seq.map float)
    |> Math.Matrix.of_seq;;
val read_matrix : string -> Math.Matrix
Note that reading from the file is deferred until the call to Matrix.of_seq because Seq is lazy. The String.split function is lazily mapped over this sequence to obtain a sequence of string lists representing the space-separated words on each line. Two nested sequence maps are used to apply the float conversion function to each element. Finally, the Matrix.of_seq function enumerates the elements of the sequences (opening, reading and closing the file) and generates the matrix result.
For example, reading a 3 × 3 matrix from a file:

> read_matrix @"C:\vectors.txt";;
val it : Math.Matrix
= matrix [[1.0; 2.0; 3.0];
          [4.0; 5.0; 6.0];
          [7.0; 8.0; 9.0]]
Splitting strings can be a useful way to lex and parse very simple formats. However, this approach is not very powerful or extensible. In particular, lexing mathematical expressions with a sophisticated syntax would be very tedious. Fortunately, the fslex tool allows more complicated lexing to be performed easily.
fslex
The fslex tool is a lexer generator that takes F#-like source code in a file with the suffix ".fsl" that describes a lexer, and compiles it into efficient F# source code that implements that lexer, which can then be compiled with the F# compiler as an ordinary source file. The resulting lexer translates an incoming stream of characters into an outgoing stream of tokens. A token is a value of a variant type which represents some of the input. In order to produce tokens, patterns are spotted in the stream of characters by matching them against regular expressions, also known as regexps. Like pattern matches, the regexps handled by fslex may contain several kinds of structure:
'x' match the specified single character.
'a'-'z' match any character in the given range.
_ match any single character.
"string" match the given string of characters.
Several other constructs are specific to regexps:
['a' 'c' 'e'] match any character in the given set.
[^ 'a' 'c' 'e'] match any character not in the given set.
regexp* match zero or more repetitions of a string matching regexp.
regexp+ match one or more repetitions of a string matching regexp.
regexp? match regexp or the empty string.
regexp1 # regexp2 match any string which matches regexp1 and does not match regexp2.
regexp1 | regexp2 match any string which matches either regexp1 or regexp2.
regexp1 regexp2 concatenate a string matching regexp1 with a string matching regexp2.
eof match the end of file.
In an fslex file (with the suffix ".fsl"), regular expressions can be named using the let construct. A regular expression named digit that matches any decimal digit may be written:
let digit = ['0'-'9']
String representations of floating-point numbers are somewhat more adventurous. An initial attempt at a regular expression might match a sequence of digits followed by a full-stop followed by another sequence of digits:

let floating = digit+ '.' digit+
This will match 12.3 and 1.23 correctly but will fail to match 123. and .123. These can be matched by splitting the regular expression into two variants, one which allows zero or more digits before the full-stop and one which allows zero or more digits after the full-stop:
let floating = digit+ '.' digit* | digit* '.' digit+
Note that a single '.' does not match this floating regexp.
As we have already seen, a conventional notation (e.g. 1e12 for 1,000,000,000,000) exists for decimal exponents. The exponent portion of the string ("e12") may be represented by the regexp:

let exponent = ['e' 'E'] ['+' '-']? digit+
Thus, a regular expression matching positive floating-point numbers represented as strings may be written:
let digit = ['0'-'9']
let mantissa = digit+ '.' digit* | digit* '.' digit+
let exponent = ['e' 'E'] ['+' '-']? digit+
let floating = mantissa exponent?
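The behaviour of this grammar can be checked outside fslex by transliterating it into a .NET regular expression. Note that the System.Text.RegularExpressions syntax differs from fslex's; this translation is our own sketch:

```fsharp
open System.Text.RegularExpressions

// mantissa = digit+ '.' digit* | digit* '.' digit+, with optional exponent
let floating = Regex(@"^([0-9]+\.[0-9]*|[0-9]*\.[0-9]+)([eE][+-]?[0-9]+)?$")

floating.IsMatch "12.3"    // true
floating.IsMatch "123."    // true
floating.IsMatch ".123e7"  // true
floating.IsMatch "."       // false: a lone full-stop contains no digits
```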
On the basis of these example regular expressions for integer and floating-point number representations, we shall now develop a lexer. A file giving the description of a lexer for fslex has the suffix ".fsl". Although lexer definitions depend upon external declarations, we shall examine the description of the lexer first. Specifically, we shall consider an example file "lexer.fsl":

{
open Parser
open Lexing
}

let digit = ['0'-'9']
let mantissa = digit+ '.' digit* | digit* '.' digit+
let exponent = ['e' 'E'] ['+' '-']? digit+
let floating = mantissa exponent?

rule token = parse
  | [' ' '\t' '\n'] { token lexbuf }
  | floating | digit+ { NUMBER(float(lexeme lexbuf)) }
  | '+' { ADD }
  | '*' { MUL }
  | eof { EOF }
This lexer defines regular expressions for digits and floating-point numbers (as a mantissa followed by an optional exponent) and uses these regular expressions to convert a stream of characters into a stream of four different kinds of token: NUMBER, ADD, MUL and EOF.
The lexer contains a single rule called token. The first pattern in this rule matches whitespace (spaces, tabs and newlines) and the corresponding action simply calls the token rule recursively to continue lexing, i.e. it skips all whitespace.
The next two patterns match either the floating regular expression or sequences of one or more digits, converting the matched string (given by applying the Lexing.lexeme function to the lexbuf variable) into a token containing the number as a value of the type float. The order of the regular expressions is important in this case. If the stream were searched for digits first then the front of a floating-point number would be matched up to the decimal point or e, leaving either an incorrect float to be parsed next (the fractional part) or an erroneous result. For example, the regular expression digit+ would match 123 in 123.456, leaving .456 to be matched next.
The next two patterns match the + and * symbols in the input, generating tokens called ADD and MUL, respectively.
The final pattern generates an EOF token when the input character stream ends (denoted by the built-in regexp eof).
This lexer can be compiled into an F# program using the fslex compiler which, at the time of writing, must be invoked manually from a DOS prompt. For example:

C:...> fslex lexer.fsl
compiling to dfas (can take a while...)
148 NFA nodes
17 states
writing output
The resulting F# source code which implements a lexer of this description is placed in the "lexer.fs" file. Before this file can be compiled, we must create the Parser module which it depends upon.
In many cases, a parser using a lexer would itself be generated from a parser description using the fsyacc compiler. We shall describe this approach in the next section but, before this, we shall demonstrate how the functionality of a generated lexer may be exploited without using fsyacc.
Before compiling the F# program "lexer.fs", which implements our lexer, we must create the Parser module which it depends upon:

> module Parser =
    type token =
      | NUMBER of float
      | ADD
      | MUL
      | EOF;;
Note that the NUMBER, ADD, MUL and EOF tokens used by the lexer are actually nothing more than type constructors, in this case for the Parser.token type. Having defined the Parser module and, in particular, the Parser.token variant type, we can include the Lexer module in the interactive session using the #load directive:

> #load "lexer.fs";;
...
A function to lex a string into a list of tokens may be written:

> let rec lex lexbuf =
    match Lexer.token lexbuf with
    | Parser.EOF as token -> [token]
    | token -> token :: lex lexbuf;;
val lex : Tools.FsLex.LexBuffer<Lexing.position, byte> -> Parser.token list
For example, the expression 1 + 2 × 3 is lexed into six tokens:

> lex (Lexing.from_string "1 + 2 * 3");;
val it : Parser.token list
= [NUMBER 1.0; ADD; NUMBER 2.0; MUL; NUMBER 3.0; EOF]
The capabilities of a lexer can clearly be useful in a stand-alone configuration. In particular, programs using lexers, such as the one we have just described, will validate their input to some degree. In contrast, many current scientific applications silently produce erroneous results. However, the capabilities of a lexer can be greatly supplemented by an associated parser, as we shall now demonstrate.
The parsing stage of interpreting input converts the sequence of tokens from a lexer into a hierarchical representation (illustrated in figure 5.1) - the abstract syntax tree (AST). This is performed by accumulating tokens from the lexer either until a valid piece of grammar is recognised and can be acted upon, or until the sequence of tokens is clearly invalid, in which case a Parsing.Parse_error exception is raised.
The fsyacc compiler can be used to create F# programs which implement a specified parser. The specification for a parser is given by a description in a file with the suffix ".fsy". Formally, these parsers implement LALR(1) grammars described by rules provided in Backus-Naur form (BNF).
For example, consider a parser, based upon the lexer described in the previous section, which interprets textual data files. These files are expected to contain simple mathematical expressions composed of numbers, additions and multiplications.
A program to parse these files can be generated by fsyacc from a grammar which we shall now describe. Given a file "name.fsy" describing a grammar, fsyacc produces by default a file "name.fs" containing an F# program implementing that grammar. Therefore, in order to generate a Parser module for the lexer, we shall place our grammar description in a "parser.fsy" file.
Grammar description files begin by listing the definitions of the tokens which the lexer may generate. In this case, the possible tokens are NUMBER, ADD, MUL and EOF.
Tokens which carry no associated values are defined with:

%token token1 token2 ...
Tokens which carry an associated value of type type are defined by:

%token <type> token1 token2 ...
Thus, the tokens for our lexer can be defined by:

%token ADD MUL EOF
%token <float> NUMBER
The token definitions are followed by declarations of the associativities and relative precedences (listed in order of increasing precedence) of tokens. In this case, both ADD and MUL are left associative and MUL has the higher precedence:

%left ADD
%left MUL
Then comes a declaration of the entry point into the parser. In this case, we shall use a parsing rule called expr to parse expressions:

%start expr
This is followed by a declaration of the type returned by the entry point. In this case, we shall use a type expr:

%type <expr> expr
Before the rules and corresponding actions describing the grammar and parser are given, the token and entry-point definitions are followed by a separator:
%%
Like ordinary pattern matching, the guts of a parser is represented by a sequence of groupings of rules and their corresponding actions:

group:
  | rule1 { action1 }
  | rule2 { action2 }
  ...
  | rulen { actionn }
;
A grouping represents several possible grammatical constructs, all of which are used to produce F# values of the same type, i.e. the expressions action1 to actionn must have the same type. Rules are simply a list of expected tokens and groupings. In particular, rules may be recursive, i.e. they may refer to themselves, which is useful when building up arbitrarily long lists.
Tokens and rule groups carry values. The value of the nth-matched subexpression of a rule is referred to by the variable $n in the corresponding action.
This example parser contains a single, recursive group of rules that parses simple symbolic expressions:
expr:
  | NUMBER { Num $1 }
  | OPEN expr CLOSE { $2 }
  | expr ADD expr { $1 + $3 }
  | expr MUL expr { $1 * $3 }
;
The first rule matches a NUMBER token and generates a Num node in the AST with the float value carried by the token (denoted by $1). The second rule returns the AST of a bracketed expression. The third and fourth rules handle the infix addition and multiplication operators, generating Add and Mul nodes in the AST using the static member functions of the expr type.
An F# program "parser.fs" and interface "parser.fsi" that implement this parser, described in this "parser.fsy" file, may be compiled using the fsyacc program:

$ fsyacc parser.fsy
building tables
computing first function
building kernels
computing lookahead relations
building lookahead table
building action table
building goto table
10 states
2 nonterminals
9 terminals
5 productions
#rows in action table: 10
#unique rows in action table: 7
maximum #different actions per state: 4
average #different actions per state: 2
The lexer and parser can be tested from the F# interactive mode. The parser must be loaded before the lexer, and the expr type must be loaded before the parser. Therefore, we begin with the definition of the expr type:

> type expr =
    | Num of float
    | Add of expr * expr
    | Mul of expr * expr
    with
      static member (+) (f, g) = Add(f, g)
      static member (*) (f, g) = Mul(f, g);;
The Parser module that uses this expr type can be loaded from the generated file using the #load directive:

> #load "parser.fs";;
...
The Lexer module that uses the Parser module can be loaded in the same way:

> #load "lexer.fs";;
...
Finally, the lexer and parser can be used to parse a simple symbolic expression into an abstract syntax tree by applying the higher-order Parser.expr function to the Lexer.token function and the lex buffer:

> Lexing.from_string "1 + 2 * 3" |> Parser.expr Lexer.token;;
val it : expr = Add(Num 1.0, Mul(Num 2.0, Num 3.0))
Note that the lexer handled the whitespace and the parser handled the operator precedences.
Thus our lexer and parser have worked together to interpret the integer and floating-point numbers contained in the given text as well as the structure of the text in order to convert the information contained in the text into a data structure that can then be manipulated by further computation.
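Having obtained an abstract syntax tree, further computation over it is straightforward. For example, a minimal evaluator for this expr type (our own addition, by structural recursion over the tree) may be written:

```fsharp
// Evaluate an expr AST to a float.
let rec eval = function
  | Num x -> x
  | Add(f, g) -> eval f + eval g
  | Mul(f, g) -> eval f * eval g
```

Applying eval to the tree Add(Num 1.0, Mul(Num 2.0, Num 3.0)) from the previous example gives 7.0.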
Moreover, the lexer and parser help to validate input. For example, input that erroneously contains letters is caught by the lexer. In this case, the lexer will raise a Failure exception.
A file which erroneously contains a floating-point value where an integer was expected will be lexed into tokens without fault, but the parser will spot the grammatical error and raise the Parsing.Parse_error exception.
In an application, the call to the entry-point function in the Parser module can be wrapped in a try ... with construct to catch the Parsing.Parse_error exception and handle it.
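For example, a sketch of such a wrapper (the helper name parse_string is our own) might return an option type, yielding None on a syntax error:

```fsharp
// Parse a string into an expr AST, returning None if the input is
// grammatically invalid.
let parse_string s =
  try
    Some(Lexing.from_string s |> Parser.expr Lexer.token)
  with
  | Parsing.Parse_error -> None
```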
As we have seen, the fslex and fsyacc compilers can be indispensable in both designing and implementing programs that use a non-trivial file format. These tools are likely to be of great benefit to scientists wishing to create unambiguous, human-readable formats.