Why Functional Programming Matters [REF 1990, John Hughes] has been described as the manifesto for lazy programming, written at a time when enthusiasm for this style was running very high. Less than 20 years later, Oleg Kiselyov published the obituary of Lazy I/O in a series of writings; for more information, visit http://okmij.org/ftp/Streams.html.
In the late 2000s, Kiselyov championed a new way of doing I/O that combines the best of Handle-based I/O (precise control over resources and predictable space requirements) with the best of lazy I/O (decoupling of producers and consumers, high level of abstraction).
Let's get to the root of this style of programming by rephrasing the example we saw earlier. Recall the Chunk
data type and the parseChunk
function:
data Chunk = Chunk {chunk :: String} | LineEnd {chunk :: String, remainder :: String} deriving Show parseChunk :: ByteString -> Chunk parseChunk chunk = if rightS == B8.pack "" then Chunk (toS leftS) else LineEnd (toS leftS) ((toS . B8.tail) rightS) where (leftS, rightS) = B8.break (== ' ') chunk -– this time we extract this helper function to the top level: toS = map (chr . fromEnum) . B.unpack
We want to iterate through the chunks in a file and accumulate them into lines. Each time a new line is accumulated, we will do some I/O with that line.
Let's model the iteration step as a function that processes a file chunk and returns an IterResult
value:
-- iterF :: String -> IterResult
This example was inspired by http://www.scs.stanford.edu/11au-cs240h/notes/iteratee.html and https://themonadreader.files.wordpress.com/2010/05/issue16.pdf; look for Iteratee: teaching an old fold new tricks.
Because we want to refer to this function signature in IterResult
, we capture it as a data type:
newtype Iter = Iter {runIter :: B8.ByteString -> IterResult}
This type represents the concept of an "iteratee". IterResult
can indicate two possibilities, HaveLine
and NeedChunk
:
data IterResult = HaveLine {line :: String, residual :: String} -- a line has been accumulated (with possible residual) | NeedChunk Iter -- need more chunk data instance Show IterResult where show (HaveLine l r) = "HaveLine " ++ l ++ "|" ++ r show (NeedChunk _) = "NeedChunk"
The HaveLine
option indicates that a line has been accumulated, along with a possible residual. The NeedChunk
option specifies the step function that must be run as the next iteration step.
NeedChunk (String -> IterResult)
The step function returns another IterResult
, which may be a HaveLine
result or another step function (in the form of NeedChunk
). This is at the heart of Iteratee I/O: a step function can return another step function. In other words, the Iteratee provides the step function to be run at each iteration step of some process. The easiest way to make sense of this is to continue a bit further down this road.
Let's write an iteratee step function for processing chunks of data:
chunkIter :: Iter chunkIter = Iter (go "") where go :: String -> B8.ByteString -> IterResult go acc chunk = case (parseChunk chunk) of (Chunk chunk') -> NeedChunk (Iter (go (acc ++ chunk'))) (LineEnd chunk' residual') -> HaveLine (acc ++ chunk') residual'
Note that when we curry the go
function with the acc
parameter in go acc
, we get the Iter
type signature:
(B8.ByteString -> IterResult)
Let's run the chunkIter
iteratee function on a chunk of file. To start with the simple case, we make the chunk size large enough so that we can expect IterResult
to return a HaveLine
:
main = do h <- openFile "jabberwocky.txt" ReadMode chunk1 <- B.hGet h 50 print $ runIter chunkIter chunk1 -- HaveLine "'Twas brillig, and the slithy toves" -- "Did gy"
With a chunk size smaller than the first file line, our IterResult
is of type NeedChunk
:
main = do h <- openFile "jabberwocky.txt" ReadMode chunk1 <- B.hGet h 25 print $ runIter chunkIter chunk1 -- NeedChunk
The first run of chunkIter
returns a NeedChunk
result. This is the iteratee's way of communicating to us (the caller) that we need to perform another iteration. Instead of asking us to simply call the chunkIter
function again (which has no knowledge of the chunks of data accumulated so far), the iteratee provides us with the next iteratee to run (which has embedded knowledge of accumulated chunks):
main = do h <- openFile "jabberwocky.txt" ReadMode chunk1 <- B.hGet h 25 let (NeedChunk iter1) = runIter chunkIter chunk1 chunk2 <- B.hGet h 25 let (HaveLine line residual) = runIter iter1 chunk2 putStrLn line
Let's write a function that will loop through the chunks of a file and accumulate lines. An enumerator feeds data to an iteratee to get a result:
enumerateFile path initIter = withFile path ReadMode $ h -> let go iter = do isEOF <- hIsEOF h if isEOF then return (HaveLine "End Of File" "") -- we're cheating else do chunk <- B.hGet h 8 check $ runIter iter chunk check (NeedChunk iterNext) = go iterNext check (HaveLine line residual) = do putStrLn line check $ runIter initIter (B8.pack residual) in go initIter
The enumerator takes an initial iteratee, runs it, and checks the result. If the iteratee needs another chunk, the enumerator runs the step function embedded in the iteratee result. If the iteratee signals that it has accumulated a line, the enumerator processes the line and passes the residual to the original iteratee (since, in our design, the HaveLine
case does not return a step function to run).
We can use our enumerator like this:
main = do enumerateFile "jabberwocky.txt" chunkIter
The enumerator has a type:
enumerateFile :: FilePath -> Iter -> IO IterResult
If we apply the first argument to enumerateFile
, we get the type:
(enumerateFile file) :: Iter -> IO IterResult
We can concretize this type as follows:
type Enumerator = Iter -> IO IterResult
This shows that an enumerator takes an iteratee, performs the iteration with possible side effects, and then returns the iteratee result:
enumerateFile :: FilePath -> Enumerator
Enumerators work like folds, in the sense that they apply a function (with an accumulator) over an input stream element by element.
Our code is flawed, and a very long way from what we would need for a robust and composable Iteratee I/O library.
For instance, with the preceding code, the last line of the file is lost. This is easily solved by passing a richer input type to the iteratee. Instead of the chunk ByteString
, we can have the following:
data IterInput = Chunk' String | EndOfFile
At the end of the file, our enumerator can pass an EndOfFile
to the iteratee and ask it to return the last line as HaveLine
result.
To address the critical issue of dealing with failure, the iteratee can also signal failure to the enumerator by extending the IterResult
function:
data IterResult
= HaveLine {line :: String, residual :: String}
| NeedChunk Iter
| Failure {errMsg :: String}
In our example, the iteratee and enumerator types were specialized to the types we needed, but these can easily be generalized by parameterizing our types.
We also want to be able to compose iteratees and enumerators; this can be achieved by making them implement the Monad
type class.
Another major omission is that, in practice, we also need abstractions that enable transforming the output of an enumerator or iteratee and feeding that into another iteratee. This allows for very flexible processing of pipelines:
-- ITERATEE -> ENUMERATOR -- ITERATEE1 -> ITERATEE2 -> ENUMERATOR1 -> ENUMERATOR2 ... -- ITERATEE -> ENUMERATEE -> ENUMERATOR -- ... -> ENUMERATEE1 -> ENUMERATEE2 -> ENUMERATEE3 ...
Iteratees produce data, enumeratees serve as pipeline transformers of data, and enumerators consume data and drive the pipeline process.
While the enumerator drives processing, the iteratee can also influence evaluation by signaling when it is done processing, when it needs more data, when it can yield a result, or when it has encountered a failure. The producer collaborates with the consumer.
With this flexibility, resource management in the face of exceptions becomes much more tractable than with lazy I/O, and it can be abstracted at a much higher level than with Handle-based I/O.
Iteratee libraries differ markedly in how they model types for iteratees, enumerators, and enumeratees. Even the meaning and necessity of enumeratees varies between libraries. Libraries also differ in the type of signals the iteratee can send back to the consumer (that is, the kind of influence the iteratee can have on processing).
There are several libraries to choose from; the choice is made simpler by understanding the chronology and historical development: