Lazy I/O

Of the three main glues of Haskell (HOFs, the type system, and laziness), laziness is different in that it is not a concrete thing in the language, but is instead related to the way the code will be evaluated in the runtime. Laziness is something that we have to know about rather than something we can always see in the code.

Of course, laziness does show up in the language, for example, wherever we want to enforce evaluation of a list of actions, as with the sequence_ function:

main = do
  -- lazy IO stream
  let ios = map putStrLn ["this", "won't", "run"]

  putStrLn "until ios is 'sequenced'..."
  sequence_ ios -- perform actions

Where:

  sequence_  :: [IO ()] -> IO ()
  sequence_  =  foldr (>>) (return ())

The sequence_ function discards the action results because the (>>) operator discards the result of its first argument action.
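
For reference, here are the type and the default definition of the (>>) operator from the Monad class, which sequences two actions and keeps only the second result:

  (>>) :: Monad m => m a -> m b -> m b
  m >> k = m >>= \_ -> k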

In contrast, the sequence function retains the results:

main = do
  h <- openFile "jabberwocky.txt" ReadMode
  line1 <- hGetLine h                 -- perform action
  let getLines = [hGetLine h, hGetLine h]
  [line2, line3] <- sequence getLines -- perform actions
  hClose h
  putStrLn line2

The line1 string is read eagerly with the first hGetLine h action, while line2 and line3 are only read when "sequenced".

While the hGetLine function returns a strict string, the hGetContents function returns a lazy one. Put another way, hGetContents streams the file contents on demand, while hGetLine reads a line eagerly:

main = do
  h <- openFile "jabberwocky.txt" ReadMode
  contents  <- hGetContents h
  putStrLn (take 10 contents) -- lazily fetch 10 chars
  hClose h

Using lines together with hGetContents, we get a lazy stream of file lines:

lineStream h = hGetContents h >>= return . lines

main = do
  h <- (openFile "jabberwocky.txt" ReadMode)
  lines' <- lineStream h
  sequence_ (map putStrLn lines')
  hClose h

The mapM_ function captures the common pattern of mapping and sequencing:

main = do
  h <- (openFile "jabberwocky.txt" ReadMode)
  lines' <- lineStream h
  mapM_ putStrLn lines'
  hClose h

Here, mapM_ f is equivalent to sequence_ . map f.
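
As a sketch, specialized to lists and IO (the library version is more general), this equivalence can be written directly; mapM_' is just a hypothetical name to avoid clashing with the real mapM_:

  mapM_' :: (a -> IO ()) -> [a] -> IO ()
  mapM_' f = sequence_ . map f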

The forM_ function is just mapM_ with flipped arguments, which is useful when you want to pass a "trailing lambda":

main = do
  h <- (openFile "jabberwocky.txt" ReadMode)
  lines' <- lineStream h
  forM_ lines' $ \line -> do
        let reversed = reverse line
        putStrLn reversed
  hClose h

When performing lazy I/O, we need to make the distinction between an I/O action and performing an I/O action. Also, we need to know the lazy/strict characteristics of the functions we are working with (for example, hGetLine and hGetContents).
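
A minimal illustration of that distinction: binding an I/O action to a name does not perform it; only sequencing it into main does (the name action here is arbitrary):

main = do
  let action = putStrLn "hello" -- an I/O action as a plain value; nothing printed yet
  action                        -- performing the action prints "hello"
  action                        -- performing it again prints "hello" a second time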

Let's return to our imperative style I/O code from the previous section and rephrase it in the style of lazy I/O. We'll retain the Chunk data type and the parseChunk function from the imperative example:

data Chunk = Chunk   {chunk :: String}
           | LineEnd {chunk :: String,
                      remainder :: String}

parseChunk :: L8.ByteString -> Chunk
-- parseChunk (L8.pack "gimble in the wabe:\nAll")
-- gives: LineEnd "gimble in the wabe:"
--                "All"
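
The body of parseChunk is carried over from the imperative example in the previous section; a minimal sketch consistent with the comment above, assuming lazy Char8 ByteStrings imported as L8, might look like this:

-- import qualified Data.ByteString.Lazy.Char8 as L8
parseChunk :: L8.ByteString -> Chunk
parseChunk bs
  = let (left, right) = L8.break (== '\n') bs
    in if L8.null right
         then Chunk   (L8.unpack left)
         else LineEnd (L8.unpack left)
                      (L8.unpack (L8.tail right))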

To write this in a lazy I/O style, we'll start by defining a stream of file chunks:

-- import qualified Data.ByteString.Lazy as LB
-- import qualified Data.ByteString.Lazy.Char8 as L8
chunkStream :: Handle -> IO [L8.ByteString]
chunkStream h
  = do
    isEof <- hIsEOF h
    if isEof
      then return []
      else do
        chunk <- LB.hGet h 8
        rest  <- (chunkStream h)
        return (chunk:rest)

Now we can produce a stream and consume it:

main = do
  h <- openFile "jabberwocky.txt" ReadMode
  chunks <- chunkStream h
  print $ take 10 chunks

The chunk stream produces data. Next, we write a consumer:

processChunk' :: String -> [L8.ByteString] -> IO ()
processChunk' acc []
  = putStrLn acc -- terminate recursion

processChunk' acc (chunk:chunks)
  = case (parseChunk chunk) of
      (Chunk chunk')
        -> processChunk' (acc ++ chunk') chunks
      (LineEnd chunk' remainder)
        -> do
          let line = acc ++ chunk'
          putStrLn line -- do something with line
          processChunk' remainder chunks

processChunk :: [L8.ByteString] -> IO ()
processChunk = processChunk' ""

The processChunk function recursively consumes our stream and accumulates chunks into lines. It is tail recursive and uses constant space:

main = do
  h <- openFile "jabberwocky.txt" ReadMode
  chunkStream h >>= processChunk
  hClose h

We have decoupled the producer from the consumer, and the consumer drives the materialization of the source stream.

In contrast, in the imperative example, the loop function drives the iteration through the chunks. Also, in the imperative case, the consumer is not explicit.

The processChunk function is an I/O action with side effects. As it loops through the file chunks (via recursion), it keeps accumulating chunks until it has captured a whole line.

Then it performs some I/O (putStrLn line) and starts accumulating chunks for the next line. Iteration and I/O processing are interleaved. We can decouple this further by making a pure function, lineStream, that produces a stream of lines:

lineStream accChunks [] = [accChunks]
lineStream accChunks (chunk:chunks)
  = case (parseChunk chunk) of
      (Chunk chunk')
        -> lineStream (accChunks ++ chunk') chunks
      (LineEnd chunk' remainder)
        -> (accChunks ++ chunk') :
           (lineStream remainder chunks)

toLines = lineStream ""

Now we can feed chunkStream to the toLines function:

main = do
  h <- openFile "jabberwocky.txt" ReadMode
  lines' <- liftM toLines (chunkStream h)
  mapM_ putStrLn lines'
  hClose h

The toLines function is pure in its input stream, while chunkStream is a stream wrapped in the I/O monad. This is why we need liftM to lift toLines into the monad. We could have said this instead:

  chunks <- (chunkStream h)
  let lines' = toLines chunks
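
Since liftM is just fmap restricted to monads, the same lifting can also be written with fmap (or its operator form, <$>):

  lines' <- fmap toLines (chunkStream h)
  -- or: lines' <- toLines <$> chunkStream h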

It is only when we do mapM_ putStrLn lines' that the stream produced by toLines starts to materialize, which in turn drives the evaluation of chunkStream. When we attempt to print one line, the line is materialized just in time, lazily. Underneath the mapM_ function, it is the sequence_ function that drives the evaluation of the stream.

This brings us back full circle to the lines library function, which lazily returns lines from file chunks:

main = do
  h <- openFile "jabberwocky.txt" ReadMode
  lines' <- hGetContents h >>= return . lines
  -- vs
  -- lines' <- liftM toLines (chunkStream h)
  mapM_ putStrLn lines'
  hClose h

In the preceding code, we composed pure functional streams with I/O streams using monadic operators and functions. This is lazy I/O, the purely functional way of doing composable I/O. In the same way that we can often express functions as pipelines of simpler functions, the same is true for I/O. Much practical I/O can be modeled as processing pipelines of streams.

Let's try to describe the essence of lazy I/O, starting with the advantages:

  • I/O is expressed at a relatively high level of abstraction
  • It is very composable, enabling the decoupling of producers from consumers

The disadvantages of lazy I/O are that it provides:

  • Poor control over when something is evaluated
  • Poor control of resources

Lazy evaluation is elegant and has far-reaching implications. It is an integral part of functional programming that unfortunately does not translate well to performing I/O.

The problems with lazy I/O

Let's use the hGetLine function alongside hGetContents:

main = do
  h <- openFile "jabberwocky.txt" ReadMode
  firstLine <- hGetLine h     -- returns a string
  contents  <- hGetContents h -- returns a "promise"

  hClose h           -- close file
  print $ words firstLine
  print $ words contents

We close the file before consuming the firstLine string and the contents stream:

  print $ words firstLine
    ["'Twas","brillig,","and","the","slithy","toves"]
  print $ words contents
    []

The contents stream is a live stream that gets turned off when the file is closed. The firstLine is an eager string and survives the closing of the file.

The preceding example points to some serious problems with lazy I/O:

  • The order of the side effects is tied to the order of lazy evaluation. Because the order of lazy evaluation is not explicit, the order of effects isn't either. The sequence of side effects can become hard to predict.
  • It can be difficult to reason about the space requirements of a lazy program. For example, in the previous example, when we use print $ words contents, all the file contents will be held in memory at once. We could have used hGetContents h >>= return . lines to process the lines incrementally, thereby using only constant space (as sketched after this list). This shows that the space requirements are contextual to how the stream is used.
  • Poor resource management and the lack of an explicit order of effects can make it difficult to know when to clean up resources. Since demand drives lazy evaluation, which in turn drives effects, we inherently have little opportunity to "intercept" evaluation for resource management purposes. Also, resource management is made more difficult by the possibility of errors. We will see more about this in the next section, Resource management with bracket.
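
Here is a rough sketch of the incremental variant mentioned in the second bullet: by splitting the lazy contents into lines and printing the words of one line at a time, only the current line needs to be resident in memory:

-- import System.IO
main = do
  h <- openFile "jabberwocky.txt" ReadMode
  contents <- hGetContents h
  mapM_ (print . words) (lines contents) -- consume and print one line at a time
  hClose h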

Despite this, lazy I/O remains an attractive option in simple situations, where space requirements and order of execution are sufficiently predictable and where resource management is easy enough. However, when we need precise resource management or predictable space usage, lazy I/O is not an option; for example, when writing networking code, handling many files, or handling many HTTP requests in a web server.

"Extensive experience in Haskell has, however, exposed severe drawbacks of lazy evaluation, which are especially grievous for stream processing of large amounts of data.

Lazy evaluation is fundamentally incompatible with computational effects, can cause fatal memory leaks, and greatly inhibits modular reasoning, especially about termination and space consumption.

Seemingly innocuous and justified changes to the code or code compositions may lead to divergence, or explosion in memory consumption."

Lazy v. Yield: Incremental, Linear Pretty-printing - Kiselyov et al

Before we look at an established solution to the problem of lazy I/O (Iteratee I/O), we'll briefly explore the basic lazy I/O approach to resource management in the face of exceptions.

Resource management with bracket

So far, we have been explicitly opening and closing files. This is what we call explicit resource management:

main = do
  h <- (openFile "jabberwocky.txt" ReadMode)
  useResource h
  hClose h
  where
    useResource h'
      = (stream h') >>= mapM_ putStrLn
    stream h'
      = hGetContents h' >>= return . lines

Let's look at some higher-level abstractions that capture this pattern: open a resource, use it, and then clean it up. The crudest solution is to just ignore the problem and rely on the garbage collector for cleanup:

main = do
  contents <- readFile "jabberwocky.txt"
  mapM_ putStrLn (lines contents)

The readFile function encapsulates the file handle, which is only released when the contents have been entirely consumed or when they are garbage collected. This is very poor resource management!
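
Indeed, readFile is little more than a convenience wrapper over openFile and hGetContents; roughly (a sketch, not necessarily the exact library definition):

-- import System.IO
readFile' :: FilePath -> IO String
readFile' path = openFile path ReadMode >>= hGetContents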

It would be more idiomatic to use the wrapper function withFile:

main = do
  withFile "jabberwocky.txt" ReadMode enumerateLines
  where
    enumerateLines h = lines' h >>= mapM_ putStrLn
    lines' h' = hGetContents h' >>= return . lines

The withFile function cleanly decouples the producer from the consumer and gives better control over resource management. The file will be closed in case of completion or error because withFile makes use of the bracket function:

bracket
  (openFile "filename" ReadMode) -- acquire resource
  hClose                         -- release resource
  (\h -> doSomeWork h)           -- use resource (doSomeWork is a placeholder)

where

bracket ::  IO a        -- before action
        -> (a -> IO b)  -- after action
        -> (a -> IO c)  -- do action
        -> IO c         -- result

The bracket function relies on higher order functions to express a specific kind of wrapper pattern: "acquire and release". For more information, visit https://wiki.haskell.org/Bracket_pattern.
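
For instance, withFile itself can be understood as a thin wrapper over bracket; a sketch (not the exact library definition):

-- import System.IO
-- import Control.Exception (bracket)
withFile' :: FilePath -> IOMode -> (Handle -> IO r) -> IO r
withFile' path mode = bracket (openFile path mode) hClose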

The finally function is a special form of bracket:

finally :: IO a    -- some action
        -> IO b    -- final action: runs afterwards
        -> IO a    -- result of the first action
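
For example, finally can be used to make sure a handle is closed even if the work in between throws an exception (a sketch; processLines is a hypothetical helper):

-- import System.IO
-- import Control.Exception (finally)
main = do
  h <- openFile "jabberwocky.txt" ReadMode
  processLines h `finally` hClose h
  where
    processLines h' = hGetContents h' >>= mapM_ putStrLn . lines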

The bracket family of functions helps us clean up resources more reliably, but by no means definitively solves the problem of closing resources in a timely manner. If we need more precise resource management than this (and more predictable space requirements and ordering of effects), then we must use a more sophisticated pattern for stream programming called Iteratee I/O.
