Iteratee I/O

Why Functional Programming Matters [REF 1990, John Hughes] has been described as the manifesto for lazy programming, written at a time when enthusiasm for this style was running very high. Less than 20 years later, Oleg Kiselyov published the obituary of Lazy I/O in a series of writings; for more information, visit http://okmij.org/ftp/Streams.html.

In the late 2000s, Kiselyov championed a new way of doing I/O that combines the best of Handle-based I/O (precise control over resources and predictable space requirements) with the best of lazy I/O (decoupling of producers and consumers, high level of abstraction).

Let's get to the root of this style of programming by rephrasing the example we saw earlier. Recall the Chunk data type and the parseChunk function:

data Chunk = Chunk {chunk :: String}
                    | LineEnd {chunk :: String,
                               remainder :: String}

  deriving Show
parseChunk :: ByteString -> Chunk
parseChunk chunk
 = if rightS == B8.pack ""
      then Chunk   (toS leftS)
      else LineEnd (toS leftS) ((toS . B8.tail) rightS)
 where
     (leftS, rightS) = B8.break (== '
') chunk

-– this time we extract this helper function to the top level:
toS = map (chr . fromEnum) . B.unpack

We want to iterate through the chunks in a file and accumulate them into lines. Each time a new line is accumulated, we will do some I/O with that line.

Let's model the iteration step as a function that processes a file chunk and returns an IterResult value:

  -- iterF :: String -> IterResult

Note

This example was inspired by http://www.scs.stanford.edu/11au-cs240h/notes/iteratee.html and https://themonadreader.files.wordpress.com/2010/05/issue16.pdf; look for Iteratee: teaching an old fold new tricks.

Because we want to refer to this function signature in IterResult, we capture it as a data type:

  newtype Iter
    = Iter {runIter :: B8.ByteString -> IterResult}

This type represents the concept of an "iteratee". IterResult can indicate two possibilities, HaveLine and NeedChunk:

data IterResult
  = HaveLine {line :: String, residual :: String}
  -- a line has been accumulated (with possible residual)
  | NeedChunk Iter
  -- need more chunk data

instance Show IterResult where
  show (HaveLine l r) = "HaveLine " ++ l ++ "|" ++ r
  show (NeedChunk _) = "NeedChunk"

The HaveLine option indicates that a line has been accumulated, along with a possible residual. The NeedChunk option specifies the step function that must be run as the next iteration step.

  NeedChunk (String -> IterResult)

The step function returns another IterResult, which may be a HaveLine result or another step function (in the form of NeedChunk). This is at the heart of Iteratee I/O: a step function can return another step function. In other words, the Iteratee provides the step function to be run at each iteration step of some process. The easiest way to make sense of this is to continue a bit further down this road.

Iteratee

Let's write an iteratee step function for processing chunks of data:

chunkIter :: Iter
chunkIter = Iter (go "")
    where
      go :: String -> B8.ByteString -> IterResult
      go acc chunk =
       case (parseChunk chunk) of
        (Chunk chunk')
          -> NeedChunk (Iter (go (acc ++ chunk')))
        (LineEnd chunk' residual')
          -> HaveLine (acc ++ chunk') residual'

Note that when we curry the go function with the acc parameter in go acc, we get the Iter type signature:

  (B8.ByteString -> IterResult)

Let's run the chunkIter iteratee function on a chunk of file. To start with the simple case, we make the chunk size large enough so that we can expect IterResult to return a HaveLine:

main = do
  h <- openFile "jabberwocky.txt" ReadMode
  chunk1 <- B.hGet h 50
  print $ runIter chunkIter chunk1
-- HaveLine "'Twas brillig, and the slithy toves"
--                  "Did gy"

With a chunk size smaller than the first file line, our IterResult is of type NeedChunk:

main = do
    h <- openFile "jabberwocky.txt" ReadMode
    chunk1 <- B.hGet h 25
    print $ runIter chunkIter chunk1
    -- NeedChunk

The first run of chunkIter returns a NeedChunk result. This is the iteratee's way of communicating to us (the caller) that we need to perform another iteration. Instead of asking us to simply call the chunkIter function again (which has no knowledge of the chunks of data accumulated so far), the iteratee provides us with the next iteratee to run (which has embedded knowledge of accumulated chunks):

main = do
    h <- openFile "jabberwocky.txt" ReadMode

    chunk1 <- B.hGet h 25
    let (NeedChunk iter1) = runIter chunkIter chunk1

    chunk2 <- B.hGet h 25
    let (HaveLine line residual) = runIter iter1 chunk2
    putStrLn line

Enumerator

Let's write a function that will loop through the chunks of a file and accumulate lines. An enumerator feeds data to an iteratee to get a result:

enumerateFile path initIter =
  withFile path ReadMode $ h ->
     let
      go iter = do
       isEOF <- hIsEOF h
       if isEOF
        then return (HaveLine "End Of File" "")
        -- we're cheating
        else do
          chunk <- B.hGet h 8
          check $ runIter iter chunk
check (NeedChunk iterNext) = go iterNext
check (HaveLine line residual)
  = do 
    putStrLn line
    check $ runIter initIter (B8.pack residual)
    in go initIter

The enumerator takes an initial iteratee, runs it, and checks the result. If the iteratee needs another chunk, the enumerator runs the step function embedded in the iteratee result. If the iteratee signals that it has accumulated a line, the enumerator processes the line and passes the residual to the original iteratee (since, in our design, the HaveLine case does not return a step function to run).

We can use our enumerator like this:

  main = do enumerateFile "jabberwocky.txt" chunkIter

The enumerator has a type:

  enumerateFile :: FilePath -> Iter -> IO IterResult

If we apply the first argument to enumerateFile, we get the type:

  (enumerateFile file) :: Iter -> IO IterResult

We can concretize this type as follows:

  type Enumerator = Iter -> IO IterResult

This shows that an enumerator takes an iteratee, performs the iteration with possible side effects, and then returns the iteratee result:

  enumerateFile :: FilePath -> Enumerator

Enumerators work like folds, in the sense that they apply a function (with an accumulator) over an input stream element by element.

Generalized iteratees, enumerators, and enumeratees

Our code is flawed, and a very long way from what we would need for a robust and composable Iteratee I/O library.

For instance, with the preceding code, the last line of the file is lost. This is easily solved by passing a richer input type to the iteratee. Instead of the chunk ByteString, we can have the following:

  data IterInput = Chunk' String | EndOfFile

At the end of the file, our enumerator can pass an EndOfFile to the iteratee and ask it to return the last line as HaveLine result.

To address the critical issue of dealing with failure, the iteratee can also signal failure to the enumerator by extending the IterResult function:

data IterResult
  = HaveLine {line :: String, residual :: String}
  | NeedChunk Iter
  | Failure {errMsg :: String}

In our example, the iteratee and enumerator types were specialized to the types we needed, but these can easily be generalized by parameterizing our types.

We also want to be able to compose iteratees and enumerators; this can be achieved by making them implement the Monad type class.

Another major omission is that, in practice, we also need abstractions that enable transforming the output of an enumerator or iteratee and feeding that into another iteratee. This allows for very flexible processing of pipelines:

--  ITERATEE  -> ENUMERATOR
--  ITERATEE1 -> ITERATEE2  -> ENUMERATOR1 -> ENUMERATOR2 ...
--  ITERATEE  -> ENUMERATEE -> ENUMERATOR
--  ... -> ENUMERATEE1 -> ENUMERATEE2 -> ENUMERATEE3 ...

Iteratees produce data, enumeratees serve as pipeline transformers of data, and enumerators consume data and drive the pipeline process.

While the enumerator drives processing, the iteratee can also influence evaluation by signaling when it is done processing, when it needs more data, when it can yield a result, or when it has encountered a failure. The producer collaborates with the consumer.

With this flexibility, resource management in the face of exceptions becomes much more tractable than with lazy I/O, and it can be abstracted at a much higher level than with Handle-based I/O.

 

"Iteratee I/O is a style of incremental input processing with precise resource control. The style encourages building input processors from a user-extensible set of primitives by chaining, layering, pairing and other modes of compositions.

The style is especially suitable for processing of communication streams, large amounts of data, and data that has undergone several levels of encoding such as pickling, compression, chunking, framing.

It has been used for programming high-performance (HTTP) servers and web frameworks, in computational linguistics and financial trading."

 
 --Iteratees, Kiselyov

Iteratee I/O libraries

Iteratee libraries differ markedly in how they model types for iteratees, enumerators, and enumeratees. Even the meaning and necessity of enumeratees varies between libraries. Libraries also differ in the type of signals the iteratee can send back to the consumer (that is, the kind of influence the iteratee can have on processing).

There are several libraries to choose from; the choice is made simpler by understanding the chronology and historical development:

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset