Chapter 5. INPUT AND OUTPUT

In this chapter, we examine the various ways in which an F# program can transfer data, including printing to the screen and saving and loading information on disc. In particular, we examine some sophisticated tools for F# which greatly simplify the task of designing and implementing programs to load files in well-defined formats.

PRINTING

The ability to print information to the screen, into a string or into a file is useful for conveying the result of a program (if the result is simple enough), for reporting run-time information on the current state of the program and for providing extra information to aid debugging. Naturally, F# provides several functions to print to the screen. In particular, the printf function can be used to print a variety of different types and perform some simple formatting.

The printf function understands several format specifiers:

  • %s print a string

  • %d print an int

  • %f print a float

  • %g print a float in scientific notation

  • %a print a value using a custom print function

  • %O print a value of any type using its ToString() method.

  • %A print a value of any type using the built-in structural pretty printer.

Special characters can also be printed:

  • \n newline

  • \" double-quote

  • \t tab

For example, the following prints a string in quotes, an int and then a float:

> printf "String: \"%s\", int: %d, float: %f\n" "foo" 3 7.4;;
String: "foo", int: 3, float: 7.4

The following function uses the generic %A format specifier to print a value as a running F# interactive session would:

> printf "%A" (1, 2.3, "foo");;
(1, 2.3, "foo")

The ability to pretty print values of any type at run time is particularly useful during debugging. There are also related functions that use the same syntax to print into a string or to a channel.
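For example, the fprintf function prints to any .NET TextWriter "channel" using the same format specifiers (a brief sketch; the StringWriter used here is simply a convenient in-memory channel):

```fsharp
open System.IO

// fprintf writes to any TextWriter using printf-style format specifiers.
let writer = new StringWriter()
fprintf writer "int: %d, float: %f" 3 7.4

// The accumulated text can then be retrieved from the writer.
printf "%s" (writer.ToString())
```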

Generating strings

Values can be printed into strings using the sprintf function. This function accepts the same format specifiers as the printf function.

For example, the following prints the same result as the previous example but returns a string containing the result:

> sprintf "String: \"%s\", int: %d, float: %f\n" "foo" 3 7.4;;
val it : string = "String: \"foo\", int: 3, float: 7.4\n"

The %a format specifier is most easily elucidated using an example based upon the sprintf function. This format specifier expects two consecutive arguments: a print function g and its argument x:

> let f g x = sprintf "%a" g x;;
val f : (unit -> 'a -> string) -> 'a -> string

This is particularly useful when printing recursive data structures such as lists and trees as a print function can pass itself or another related function as an argument to sprintf.
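For example, a printer for int lists can pass itself back to sprintf through %a (a sketch using the same unit-accepting signature shown above; the function name string_of_list is illustrative only):

```fsharp
// A recursive list printer: string_of_list passes itself to sprintf
// via the %a format specifier to print the tail of the list.
let rec string_of_list () = function
  | [] -> ""
  | [h] -> sprintf "%d" h
  | h :: t -> sprintf "%d; %a" h string_of_list t

let s = sprintf "[%a]" string_of_list [1; 2; 3]
// s is "[1; 2; 3]"
```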

The F# interactive mode can be supplemented with custom pretty printers for user-defined types by supplying a function that converts a value of that type into the corresponding string to the fsi.AddPrinter function. A function to convert a value of the type expr to a string may be written using the sprintf function and the %d and %a format specifiers:

> let rec string_of_expr () = function
    | Int n -> sprintf "%d" n
    | Var v -> v
    | Add(f, g) ->
        sprintf "%a + %a"
          string_of_expr f string_of_expr g
    | Mul(f, g) ->
        sprintf "%a * %a"
          string_of_mul f string_of_mul g
  and string_of_mul () = function
    | Int _ | Mul _ as f -> string_of_expr () f
    | Add _ as f -> sprintf "(%a)" string_of_expr f;;
val string_of_expr : unit -> expr -> string
val string_of_mul : unit -> expr -> string

Printing of subexpressions inside a product is dispatched to the auxiliary function string_of_mul in order to bracket subexpressions that have a lower precedence. Specifically, any addition inside a multiplication must be bracketed when pretty printing an expression.

The previous example expression is now printed more comprehensibly by the F# interactive mode:

> (Int 1 + Var "x") * Int 3;;
val it : expr = (1 + x) * 3

Note that the subexpression 1 + x is bracketed correctly when it is printed because the string_of_expr function takes the relative precedence of multiplication over addition into account.

GENERIC PRINTING

In addition to printing values of specific types, the F# programming language also provides a print_any function (equivalent to printf "%A") to print values of any type. For example, to print a list of 2-tuples in F# syntax:

> print_any [1, 2; 2, 3];;
[(1, 2); (2, 3)]

Analogously, there is an any_to_string function:

> any_to_string [1, 2; 2, 3];;
val it : string = "[(1, 2); (2, 3)]"

This is particularly useful when debugging, to avoid having to write conversion functions for each and every type.

READING FROM AND WRITING TO FILES

The act of saving data in a file is performed by opening the file, writing data to it and then closing the file. F# provides an elegant and efficient way to do this using functionality provided by .NET in the System.IO namespace combined with F#'s sequence expressions.

The following lines_of_file function returns a sequence of the lines in the file with the given name:

> open System.IO;;

> let lines_of_file filename =
    seq { use stream = File.OpenRead filename
          use reader = new StreamReader(stream)
          while not reader.EndOfStream do
            yield reader.ReadLine() };;
val lines_of_file : string -> seq<string>

The sequence expression in this function loops until the end of the file, reading and yielding each line of the file in turn. The use bindings are analogous to let bindings except that they also dispose of the created objects when enumeration over the sequence is complete. In this case, the use bindings cause the file to be closed when enumeration over the sequence is complete.

The .NET platform actually provides similar functions for reading whole files, such as the ReadToEnd method of a StreamReader, but this sequence expression has an important advantage: the file is only read as the sequence is enumerated over. This is a form of lazy evaluation and is one of the most important benefits of sequences in F#. For example, this lines_of_file function can be used to read only the first three lines of a file without having to read the entire file first:

> lines_of_file "fib.ml"
  |> Seq.take 3;;
val it : seq<string> =
  ["let rec fib = function";
   " | 0 | 1 as n -> n";
   " | n -> fib(n-l) + fib(n-2)"]

The lines_of_file function can be used to count the number of space-separated words in a file as follows:

> lines_of_file @"C:iblel3.txt"
  |> Seq.map (String.split [' '] >> Seq.length)
|> Seq.fold (+) 0;;
val it : int = 823647

An important characteristic of this word counter is that it handles a single line at a time and, consequently, has very low memory requirements compared to more obvious solutions that read in the whole file. This kind of programming makes F# interactive sessions a very powerful tool for data dissection and manipulation.

This functionality is of particular importance when handling large amounts of data, such as many database applications. This function will be used to write a simple word frequency counter in chapter 6. The use of sequence expressions to transparently interrogate databases on demand is covered in chapter 10.

SERIALIZATION

The following save function uses the built-in .NET serialization routines to save any value of any type to the given file:

> open System.Runtime.Serialization.Formatters.Binary;;

> let save filename x =
    use stream =
      new FileStream(filename, FileMode.Create)
    (new BinaryFormatter()).Serialize(stream, x);;
val save : string -> 'a -> unit

For example, the following saves a 3-tuple to the "test.dat" file:

> save "test.dat" (1, 3.0, ["piece"]);;
val it : unit = ()

The following load function reads back a value that was saved by the save function:

> let load filename =
    use stream = new FileStream(filename, FileMode.Open)
    (new BinaryFormatter()).Deserialize(stream)
    |> unbox;;
val load : string -> 'a

For example, the saved 3-tuple can be loaded back with:

> (load "test.dat" : int * float * string list);;
val it : int * float * string list = (1, 3.0, ["piece"])

Note that we are careful to annotate the type of the data. Although such type annotations can be omitted in compiled code under certain circumstances, it is always a good idea to make this information explicit in the source code as we have done.
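The same pair of functions round-trips other serializable values unchanged; for example (a sketch assuming the save and load functions defined above and the hypothetical file name pairs.dat):

```fsharp
// Save and reload a list of tuples, annotating the type on load.
save "pairs.dat" [1, 2.0; 3, 4.0]
let pairs : (int * float) list = load "pairs.dat"
```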

The standard .NET serialization routines can store and retrieve arbitrary data in a binary format. Data stored in non-trivial formats is typically read using techniques known as lexing and parsing. Serialization is simpler but often an order of magnitude slower than using a custom format, even a readable text-based format.


Figure 5.1. Parsing character sequences often entails lexing into a token stream and then parsing to convert patterns of tokens into grammatical constructs represented hierarchically by a tree data structure.

LEXING AND PARSING

In addition to the primitive input and output functions offered by F#, the language is bundled with very powerful tools, called fslex and fsyacc, for deciphering the content of files according to a formal grammar. This aspect of the language has been particularly well honed due to the widespread use of this family of languages for writing compilers and interpreters, i.e. programs which understand, and operate on, other programs.

In the context of scientific computing, providing data in a human readable format with a formally defined grammar is highly desirable. This allows a data file to convey useful information both to a human and to a computer audience. In the case of a human, the file can be read using a text editor. In the case of a computer, the file can be parsed to produce a data structure which reflects the information in the file (illustrated in figure 5.1). In the latter case, the program could then go on to perform computations on the data.

The ability to use fslex and fsyacc is, therefore, likely to be of great benefit to scientists. We shall now examine the use of these tools in more detail.

Lexing

The first step in using these tools to interpret a file is called lexing. This stage involves reading characters from the input, matching them against patterns and outputting a stream of tokens. A token is a value which represents some of the input. For example, a sequence of space-separated digits could be lexed into a stream of integer-valued tokens.
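This idea can be sketched directly in F# before introducing any tools. The following lex_ints function splits a string on spaces and emits integer-valued tokens (the token type and function name are illustrative only):

```fsharp
// An integer-valued token type and a minimal hand-written lexer.
type token = INT of int

let lex_ints (s : string) =
  s.Split([|' '|], System.StringSplitOptions.RemoveEmptyEntries)
  |> Array.toList
  |> List.map (fun word -> INT(int word))

// lex_ints "12 3 456" gives [INT 12; INT 3; INT 456]
```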

Using String.split

The task of lexing can be elucidated by the simple example of splitting a file into lines and words, expected to represent the rows and elements of a matrix, respectively.

A function to read a matrix by splitting each line into words, converting each word into a float and constructing an F# Matrix may be written in terms of the lines_of_file function that was defined in section 5.3:

> let read_matrix filename =
    lines_of_file filename
    |> Seq.map (String.split [' '])
    |> Seq.map (Seq.map float)
    |> Math.Matrix.of_seq;;
val read_matrix : string -> Math.Matrix

Note that reading from the file is deferred until the call to Matrix.of_seq because Seq is lazy.

The String.split function is lazily mapped over this sequence to obtain a sequence of string lists representing the space-separated words on each line. Two sequence maps are used to apply the float conversion function to each element. Finally, the Matrix.of_seq function enumerates the elements of the sequences (opening, reading and closing the file) and generates the matrix result.

For example, reading a 3 × 3 matrix from a file:

> read_matrix @"C:vectors.txt";;
val it : Math.Matrix = matrix [[1.0; 2.0; 3.0];
                               [4.0; 5.0; 6.0];
                               [7.0; 8.0; 9.0];]

Splitting strings can be a useful way to lex and parse very simple formats. However, this approach is not very powerful or extensible. In particular, lexing mathematical expressions with a sophisticated syntax would be very tedious. Fortunately, the fslex tool allows more complicated lexing to be performed easily.

Using fslex

The fslex tool is a lexer generator that takes F#-like source code in a file with the suffix ".fsl" that describes a lexer and compiles it into efficient F# source code that implements that lexer, which can then be compiled with the F# compiler as an ordinary source file. The resulting lexer translates an incoming stream of characters into an outgoing stream of tokens. A token is a variant type. In order to produce tokens, patterns are spotted in the stream of characters by matching them against regular expressions, also known as regexps. Like pattern matches, the regexps handled by fslex may contain several kinds of structure:

'x' match the specified, single character.

'a'-'z' match any character from a range.

_ match any single character.

"string " match the given string of characters.

Several other constructs are specific to regexps:

['a' 'c' 'e'] match any character in the given set.

[^ 'a' 'c' 'e'] match any character not in the given set.

regexp * match zero or more repetitions of a string matching regexp.

regexp + match one or more repetitions of a string matching regexp.

regexp ? match regexp or the empty string.

regexp1 # regexp2 match any string which matches regexp1 and does not match regexp2.

regexp1 | regexp2 match any string which either matches regexp1 or matches regexp2.

regexp1 regexp2 concatenate a string matching regexp1 with a string matching regexp2.

eof match the end of file.

In an fslex file (with the suffix ".fsl"), regular expressions can be named using the let construct. A regular expression named digit that matches any decimal digit may be written:

let digit = ['0'-'9']

String representations of floating-point numbers are somewhat more adventurous. An initial attempt at a regular expression might match a sequence of digits followed by a full-stop followed by another sequence of digits:

let floating = digit+ '.' digit+

This will match 12.3 and 1.23 correctly but will fail to match 123. and .123. These can be matched by splitting the regular expression into two variants, one which allows zero or more digits before the full-stop and one which allows zero or more digits after the full-stop:

let floating = digit+ '.' digit* | digit* '.' digit+

Note that a single '.' does not match this floating regexp.
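This behaviour can be checked experimentally with the corresponding .NET regular expression (an illustration only; fslex regexps use their own syntax, not that of System.Text.RegularExpressions):

```fsharp
open System.Text.RegularExpressions

// The same alternation expressed as a .NET regexp:
// digits '.' optional-digits, or optional-digits '.' digits.
let floating = Regex(@"^(\d+\.\d*|\d*\.\d+)$")

floating.IsMatch "12.3"  // true
floating.IsMatch "123."  // true
floating.IsMatch ".123"  // true
floating.IsMatch "."     // false
```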

As we have already seen, a conventional notation (e.g. 1,000,000,000,000 = 1e12) exists for decimal exponents. The exponent portion of the string ("e12") may be represented by the regexp:

let exponent = ['e' 'E'] ['+' '-']? digit+

Thus, a regular expression matching positive floating-point numbers represented as strings may be written:

let digit = ['0'-'9']
let mantissa = digit+ '.' digit* | digit* '.' digit+
let exponent = ['e' 'E'] ['+' '-']? digit+
let floating = mantissa exponent?

On the basis of these example regular expressions for integer and floating-point number representations, we shall now develop a lexer. A file giving the description of a lexer for fslex has the suffix ".fsl". Although lexer definitions depend upon external declarations, we shall examine the description of the lexer first. Specifically, we shall consider an example file "lexer.fsl":

{
  open Parser
  open Lexing
}
let digit = ['0'-'9']
let mantissa = digit+ '.' digit* | digit* '.' digit+
let exponent = ['e' 'E'] ['+' '-']? digit+
let floating = mantissa exponent?
rule token = parse
| [' ' '\t' '\n'] { token lexbuf }
| floating
| digit+ { NUMBER(float(lexeme lexbuf)) }
| '+' { ADD }
| '*' { MUL }
| eof { EOF }

This lexer defines regular expressions for digits and floating point numbers (as a mantissa followed by an optional exponent) and uses these regular expressions to convert a stream of characters into a stream of four different kinds of token: NUMBER, ADD, MUL and EOF.

The lexer contains a single rule called token. The first pattern in this rule matches whitespace (spaces, tabs and newlines) and the corresponding action simply calls the token rule recursively to continue lexing, i.e. it skips all whitespace.

The next two patterns match either the floating regular expression or sequences of one or more digits, converting the matched string (given by applying the Lexing.lexeme function to the lexbuf variable) into a token containing the number as a value of the type float. The order of the regular expressions is important in this case. If the stream were searched for digits first then the front of a floating-point number would be matched up to the decimal point or e, leaving either an incorrect float to be parsed next (the fractional part) or an erroneous result. For example, the regular expression digit+ would match 123 in 123.456, leaving .456 to be matched next.

The next two patterns match the + and * symbols in the input, generating tokens called ADD and MUL, respectively.

The final pattern generates an EOF token when the input character stream ends (denoted by the built-in regexp eof).

This lexer can be compiled into an F# program using the fslex compiler which, at the time of writing, must be invoked manually from a DOS prompt. For example:

C:...> fslex lexer.fsl
compiling to dfas (can take a while...)
148 NFA nodes
17 states
writing output

The resulting F# source code which implements a lexer of this description is placed in the "lexer.fs" file. Before this file can be compiled, we must create the Parser module which it depends upon.

In many cases, a parser using a lexer would itself be generated from a parser description, using the fsyacc compiler. We shall describe this approach in the next section but, before this, we shall demonstrate how the functionality of a generated lexer may be exploited without using fsyacc.

Before compiling the F# program "lexer.fs", which implements our lexer, we must create the Parser module which it depends upon:

> module Parser =
    type token =
      | NUMBER of float
      | ADD
      | MUL
      | EOF;;

Note that the NUMBER, ADD, MUL and EOF tokens used by the lexer are actually nothing more than type constructors, in this case for the Parser.token type. Having defined the Parser module and, in particular, the Parser.token variant type, we can include the Lexer module in the interactive session using the #load directive:

> #load "lexer .ml";;
...

A function to lex a string into a list of tokens may be written:

> let rec lex lexbuf =
    match Lexer.token lexbuf with
    | Parser.EOF as token -> [token]
    | token -> token :: lex lexbuf;;
val lex :
  Tools.FsLex.LexBuffer<Lexing.position, byte> ->
    Parser.token list

For example, the expression 1 + 2 × 3 is lexed into six tokens:

> lex (Lexing.from_string "1 + 2 * 3");;
val it : Parser.token list
  = [NUMBER 1.0; ADD; NUMBER 2.0; MUL; NUMBER 3.0; EOF]

The capabilities of a lexer can clearly be useful in a stand-alone configuration. In particular, programs using lexers, such as the one we have just described, will validate their input to some degree. In contrast, many current scientific applications silently produce erroneous results. However, the capabilities of a lexer can be greatly supplemented by an associated parser, as we shall now demonstrate.

Parsing

The parsing stage of interpreting input converts the sequence of tokens from a lexer into a hierarchical representation (illustrated in figure 5.1) - the abstract syntax tree (AST). This is performed by accumulating tokens from the lexer either until a valid piece of grammar is recognised and can be acted upon, or until the sequence of tokens is clearly invalid, in which case a Parsing.Parse_error exception is raised.

The fsyacc compiler can be used to create F# programs which implement a specified parser. The specification for a parser is given by a description in a file with the suffix ".fsy". Formally, these parsers implement LALR(1) grammars described by rules provided in Backus-Naur form (BNF).

For example, consider a parser, based upon the lexer described in the previous section, which interprets textual data files. These files are expected to contain simple mathematical expressions composed of numbers, additions and multiplications.

A program to parse these files can be generated by fsyacc from a grammar which we shall now describe. Given a file "name.fsy" describing a grammar, by default fsyacc produces a file "name.fs" containing an F# program implementing that grammar. Therefore, in order to generate a Parser module for the lexer, we shall place our grammar description in a "parser.fsy" file.

Grammar description files begin by listing the definitions of the tokens which the lexer may generate. In this case, the possible tokens are NUMBER, ADD, MUL and EOF.

Tokens which carry no associated values are defined with:

%token token1 token2 ...

Tokens which carry an associated value of type type are defined by:

%token <type> token1 token2 ...

Thus, the tokens for our lexer can be defined by:

%token ADD MUL EOF
%token <float> NUMBER

The token definitions are followed by declarations of the associativities and relative precedences (listed in order of increasing precedence) of tokens. In this case, both ADD and MUL are left associative and MUL has higher precedence:

%left ADD
%left MUL

Then a declaration of the entry point into the parser. In this case, we shall use a parsing rule called expr to parse expressions:

%start expr

This is followed by a declaration of the type returned by the entry point. In this case, we shall use a type expr:

%type <expr> expr

Before the rules and corresponding actions describing the grammar and parser are given, the token and entry-point definitions are followed by a separator:

%%

Like ordinary pattern matching, the guts of a parser is represented by a sequence of groupings of rules and their corresponding actions:

group:
| rule1 { action1 }
| rule2 { action2 }
...
| rulen { actionn }
;

A grouping represents several possible grammatical constructs, all of which are used to produce F# values of the same type, i.e. the expressions action1 to actionn must have the same type. Rules are simply a list of expected tokens and groupings. In particular, rules may be recursive, i.e. they may refer to themselves, which is useful when building up arbitrarily long lists.
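For example, a recursive grouping that accumulates one or more numbers into a list might be written as follows (a hypothetical fragment, not part of this chapter's grammar; note that the list is accumulated in reverse order, so a List.rev in the action of the consuming rule would restore the original order):

```
numbers:
| NUMBER         { [$1] }
| numbers NUMBER { $2 :: $1 }
;
```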

Tokens and rule groups carry values. The value of the nth matched subexpression of a rule is referred to by the variable $n in the corresponding action.

This example parser contains a single, recursive group of rules that parses simple symbolic expressions:

expr:
| NUMBER { Num $1 }
| OPEN expr CLOSE { $2 }
| expr ADD expr { $1 + $3 }
| expr MUL expr { $1 * $3 }
;

The first rule matches a NUMBER token and generates a Num node in the AST with the float value carried by the token (denoted by $1). The second rule returns the AST of a bracketed expression. The third and fourth rules handle the infix addition and multiplication operators, generating Add and Mul nodes in the AST using the static member functions of the expr type.

An F# program "parser.fs" and interface "parser.fsi" that implement this parser, described in this "parser.fsy" file, may be compiled using the fsyacc program:

$ fsyacc parser.fsy
building tables
computing first function
building kernels
computing lookahead relations
building lookahead table
building action table
building goto table
10 states
2 nonterminals
9 terminals
5 productions
#rows in action table: 10
#unique rows in action table: 7
maximum #different actions per state: 4
average #different actions per state: 2

The lexer and parser can be tested from the F# interactive mode. The parser must be loaded before the lexer and the expr type must be loaded before the parser. Therefore, we begin with the definition of the expr type:

> type expr =
    | Num of float
    | Add of expr * expr
    | Mul of expr * expr
  with
    static member (+) (f, g) = Add(f, g)
    static member (*) (f, g) = Mul(f, g);;

The Parser module that uses this expr type can be loaded from the generated file using the #load directive:

> #load "parser.fs";;
...

The Lexer module that uses the Parser module can be loaded from the generated file using the #load directive:

> #load "lexer.fs";;
...

Finally, the lexer and parser can be used to parse a simple symbolic expression to produce an abstract syntax tree by applying the Lexer.token function and the lex buffer to the higher-order Parser.expr function:

> Lexing.from_string "1 + 2 * 3"
  |> Parser.expr Lexer.token;;
val it : expr = Add(Num 1.0, Mul(Num 2.0, Num 3.0))

Note that the lexer handled the whitespace and the parser handled the operator precedences.

Thus our lexer and parser have worked together to interpret the integer and floating-point numbers contained in the given text as well as the structure of the text in order to convert the information contained in the text into a data structure that can then be manipulated by further computation.

Moreover, the lexer and parser help to validate input. For example, input that erroneously contains letters is caught by the lexer, which raises an exception. Input that lexes into tokens without fault but violates the grammar, such as two numbers with no intervening operator, is spotted by the parser, which raises the Parsing.Parse_error exception.

In an application, the call to the main function in the Parser module can be wrapped in a try ... with expression to catch the Parsing.Parse_error exception and handle it.
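For example, a wrapper returning an option type might look like this (a sketch following the module and function names used in this chapter):

```fsharp
// Parse a string, returning None on a grammatical error.
let try_parse (s : string) =
  try
    Some(Parser.expr Lexer.token (Lexing.from_string s))
  with
  | Parsing.Parse_error -> None
```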

As we have seen, the fslex and fsyacc compilers can be indispensable in both designing and implementing programs that use a non-trivial file format. These tools are likely to be of great benefit to scientists wishing to create unambiguous, human-readable formats.
