Chapter 24

Quarantine

image

24.1 Constraints

  • Core program functions have no side effects of any kind, including IO.
  • All IO actions must be contained in computation sequences that are clearly separated from the pure functions.
  • All sequences that have IO must be called from the main program.

24.2 A Program in this Style

  1 #!/usr/bin/env python
  2 import sys, re, operator, string
  3
  4 #
  5 # The Quarantine class for this example
  6 #
  7 class TFQuarantine:
  8 def __init__(self, func):
  9    self._funcs = [func]
 10
 11 def bind(self, func):
 12    self._funcs.append(func)
 13    return self
 14
 15 def execute(self):
 16    def guard_callable(v):
 17        return v() if hasattr(v, '__call__') else v
 18
 19    value = lambda : None
 20    for func in self._funcs:
 21        value = func(guard_callable(value))
 22    print guard_callable(value)
 23
 24 #
 25 # The functions
 26 #
 27 def get_input(arg):
 28 def _f():
 29    return sys.argv[1]
 30 return _f
 31
 32 def extract_words(path_to_file):
 33 def _f():
 34    with open(path_to_file) as f:
 35        data = f.read()
 36    pattern = re.compile('[W_]+')
 37    word_list = pattern.sub(' ', data).lower().split()
 38    return word_list
 39 return _f
 40
 41 def remove_stop_words(word_list):
 42 def _f():
 43    with open('../stop_words.txt') as f:
 44        stop_words = f.read().split(',')
 45    # add single-letter words
 46    stop_words.extend(list(string.ascii_lowercase))
 47    return [w for w in word_list if not w in stop_words]
 48 return _f
 49
 50 def frequencies(word_list):
 51 word_freqs = {}
 52 for w in word_list:
 53    if w in word_freqs:
 54        word_freqs[w] += 1
 55    else:
 56        word_freqs[w] = 1
 57 return word_freqs
 58
 59 def sort(word_freq):
 60 return sorted(word_freq.iteritems(), key=operator.itemgetter
      (1), reverse=True)
 61
 62 def top25_freqs(word_freqs):
 63 top25 = ""
 64 for tf in word_freqs[0:25]:
 65    top25 += str(tf[0]) + ' - ' + str(tf[1]) + '
'
 66 return top25
 67
 68 #
 69 # The main function
 70 #
 71 TFQuarantine(get_input)
 72 .bind(extract_words)
 73 .bind(remove_stop_words)
 74 .bind(frequencies)
 75 .bind(sort)
 76 .bind(top25_freqs)
 77 .execute()

24.3 Commentary

THIS STYLE makes use of another variation of function composition. Its constraints are very interesting, namely the first one: the core program cannot do IO. For our term-frequency task, that reads files and outputs a result on the screen, this constraint poses a puzzling problem: how can we do it if the functions can't read files such as the Pride and Prejudice text and can't print things on the screen? Indeed, how can one write programs these days that don't interact one way or another with the user, the file system, and the network while the program is executing?

Before explaining how to do it, let's first look into why anyone would want to program under such seemingly unreasonable constraints. Whenever program functions need to interact with the outside world, they lose the "purity" of mathematical functions – they stop being just relations of inputs to outputs, and they either get or leak data through some other means. These "impure" functions are harder to handle, from a software engineering perspective. Take, for example, the function extract_words defined in the Passive Aggressive style (line #34 in that example). There are absolutely no guarantees that two different calls to that function with the exact same path_to_file argument produce the exact same word list: for example, someone could replace the file in between the two calls. Because of the unpredictability of the outside world, "impure" functions are more difficult than "pure" functions to reason about (to test, for example). As such, a certain philosophy of program design calls for avoiding, or at least minimizing, IO.1 The style in this chapter is inspired by this design philosophy and Haskell's IO monad: it is a flat-out quarantine of all IO operations. Here is how it works.

The core program functions, i.e. the first-order functions, cannot do IO; they need to be "pure," in the sense that one or more calls to them with the same arguments should always produce the same results. However, higher-order functions can do IO. So the overall approach is to wrap all "IO-infected" code in higher-order functions, chain those in a contained sequence without executing them, and execute that chain only in the main program, when we absolutely need to do IO.

Let's take a look at the example program. First, let's look at the program's functions between lines #27 and #66. There are two kinds of functions: those that do IO and those that don't. The functions that do IO are: (1) get_input (lines #27–30), which reads from the command line; (2) extract_words (lines #32–39), which opens and reads a file; and (3) remove_stop_words (lines #41–48), which opens and reads the stop words file. The other three functions – frequencies, sort and top25_freqs – are "pure" in the sense that we have used that word before: given the same inputs, they will always produce the same outputs, without interaction with the outside world. In order to identify the functions that do IO, and separate them from the rest, we have abstracted the bodies of those functions into higher-order functions:

 1 def func(arg):
 2 def _f():
 3    ...body...
 4 return _f

Doing so makes the first-order program functions be "pure", in the sense that any calls to any of them will always return the same value (their inner function) without any side effects. Here is what the Python interpreter will say if we call get_input:

>>> get_input(1)
<function _f at 0x01E4FC70>
>>> get_input([1, 2, 3])
<function _f at 0x01E4FC30>
>>> get_input(1)
<function _f at 0x01E4FC70>
>>> get_input([1, 2, 3])
<function _f at 0x01E4FC30>

We are quarantining the IO code so that it doesn't get executed at the first level of our program's functions. At this first level, these functions are again easy to deal with – they don't do anything and just return a function. Calling them is perfectly safe: nothing in the outside world will be affected, because the inner function is not being executed [yet]. The other three "pure" functions are written in the normal, direct way.

But if we are delaying the application of the IO-infected functions, how can we actually compose the functions so that they read files, count words and print characters on the screen? The first thing to notice is that this will not work:

 1 top25_freqs(sort(frequencies(remove_stop_words(extract_words(
     get_input(None))))))

We need another way to define a sequence of functions. Furthermore, we need to hold on to that sequence until the time comes to interact with the outside world. As per constraint of this style, that time is in the main function: IO cannot be done in arbitrary parts of the program's execution.

The chaining is done in the main program, line #71 onward, using an instance of the Quarantine class defined in lines #7–22. We've seen these chains before in Style #9, The One. But this chain is a bit different; let's take a look at the TFQuarantine class.

Like the other monad-inspired classes that we have seen before, this class also consists of a constructor, a bind method and a third method, this time called execute, that discloses what's inside. The idea for instances of this class is to hold on to a list of functions without calling them until execute is called. As such, bind simply appends the given function to the list of functions (line #12), returning the instance for further bindings or for execution (line #13). execute is where the action takes place: it iterates through the list of functions (line #20), calling them one by one (line #21). The argument for each call is the return value of the previous call. At the end of the iteration, we print out the last value (line #22).

The TFQuarantine does lazy evaluation of the chain of functions: it first stores them without calling them, and only when execute is called in main does it call them.

In implementing execute we need to be careful, because, by voluntary choice, we have set ourselves for having two kinds of functions: those that return higher-order functions (the IO-infected code) and those that have normal function bodies. Since we can have both kinds of functions in these chains, the execute method needs to know whether it needs to apply the value or whether it simply needs to reference it (the argument in the function call in line #21). Hence the guard_callable inner function in lines #16–17: this function calls the value or references it, depending on whether that value is callable (functions are callable) or not (simple data types like string and dictionaries aren't callable).

It should be noted that the style in this chapter, as well as the particular implementation shown to exemplify it, don't reflect Haskell's IO monad in important ways. Faithful reproduction is not the goal here; we are focusing on the most important constraints of each style. But it's important to understand what those differences are.

First of all, Haskell is a strongly typed language, and its implementation and use of monads is very much tied to types. Functions that perform IO are of certain IO types that are part of the language. That is not the case with Python and with the implementation of the Quarantine style shown here. Haskell provides some syntactic sugar to chain functions using the donotation, which makes these chains look like sequences of imperative commands. We didn't do that here either. More importantly, in Haskell this style is not a voluntary choice on the part of the programmers; it's part of the language design, and it's strongly enforced via type inference. That is, IO must be done this way.2 For example, we cannot execute an IO monad in some arbitrary function; it needs to have been called from the main function. In our example, all the choices – like, for example, deciding to flag IO functions by having them return a higher-order function – are voluntary, and established with the sole purpose of making the constraints of the style visible in the code. As usual in this book, other implementation options could have been taken that would also honor the constraints.

A final comment about whether this style really achieves its ultimate purpose of minimizing IO. Clearly, it falls short. Programmers can still write programs with as much IO as with any other style, just as we did for term frequency. However this style does one thing on the way to that ultimate goal: it forces programmers to think carefully about what functions do IO and what functions don't do IO; by thinking about it, they may be more responsible in separating IO code from the rest, and that can only be good.

24.4 This Style in Systems Design

The issue of IO being problematic goes well beyond designing programs at the small scale. In fact, its problematic nature is more visible in large distributed systems, where disk access, network latency and server load can have a tremendous effect on the user experience.

Consider, for example, a Web server that is part of a multi-user game, and that returns an image for the following kinds of URLs: http://example.com/images/fractal?minx=-2&maxx=1&miny=-1&maxy=1 This fractal service can be implemented in at least 2 different ways: (1) it can retrieve the image from a database where fractal images are stored, possibly generating and storing the image first if it doesn't exist on the database yet; or (2) it can always generate the image on the fly without accessing the disk. The first approach is the classical compute & cache value approach that we see pervasively used in computing systems, and is equivalent to "impure" functions described above. The second approach is the equivalent of "pure" functions, and seems, at first glance, worse, because for every request with the same parameters we are recomputing the image again, using more CPU cycles.

In both cases, and given that the Web is designed with explicit caches in mind, the image server can tag these images as cacheable for a very long time, which allows Web caches on the Internet to decrease the load on the original server. This weakens the argument about approach 2 being worse.

A second aspect to consider is the time of disk access vs. the time for computing the image. CPUs are so fast these days, that disk accesses have become bottlenecks in many applications. In many cases, there are substantial performance increases when using procedural generation of data vs. retrieving the pre-generated data from disk. In the case of our image server, that may or may not be the case, but if responsiveness is important, this tradeoff would need to be checked.

A third aspect to consider is the variety of images to be served and the amount of disk that it will take to store them. Our image service can generate an infinite amount of different fractal images, one for every combination of parameters minx, maxx, miny, and maxy. Images usually take up a significant amount of bytes. So if we expect thousands of clients to request hundreds of thousands of different fractal images, storing them may be a bad idea.

Finally, another aspect to consider is the consequences of changes in the service's specification. For example, the first implementation of this service might generate images in the red part of the spectrum, but we may want to change it at some point to generate images in the blue part of the spectrum (say, the graphic designers changed their minds about the color scheme). If that happened using approach 1 (the database), we would need to delete these images from the database – something that may be problematic in itself, depending on how those images were stored and whether there were other images stored in the same table or not. Using approach 2, this change would be trivial to handle, since images are always generated on the fly. In either case, if we had previously set the Web cache expiration of these images to a future date, some clients would not see the changes, possibly for a long time – this shows the problem with using caches in general.3

If we believe that generation on the fly, i.e. "pure" functions, is beneficial for our image service, a third, even more radical, approach would be to send the server-side fractal generation function to the client, and let the client do the work, therefore unloading the server from those computations. This can only be done if that function doesn't do IO.

All this analysis goes to show that IO is, indeed, a non-trivial issue in large distributed systems. Any programming techniques and styles that shine a spotlight on this issue at the small scale are worth studying for understanding the tradeoffs at the systems design level.

24.5 Historical Notes

Monads were brought to programming languages in the early 1990s in the context of the Haskell programming language. IO was the main reason why they were introduced, as IO has always been a contentious issue in pure functional languages.

24.6 Further Reading

Peyton-Jones, S. and Wadler, P. (1993). Imperative functional programming. 20th Symposium on Principles of Programming Languages ACM Press.
Synopsis: Another take on monads.

Wadler, P. (1997). How to declare an imperative. ACM Computing Surveys 29(3): 240–263.
Synopsis: More monads. Philip Wadler's papers are always fun and interesting to read.

24.7 Glossary

Pure function: A function whose result is always the same for the same input value(s), that does not depend on any data other than its explicit parameters and that doesn't have any observable effect in the external world.

Impure function: A function that, in addition to mapping inputs to outputs, depends on data other than its explicit parameters and/or changes the observable state of the external world.

Lazy evaluation: A program execution strategy which delays the evaluation of expressions until their values are absolutely needed.

24.8 Exercises

24.1 Another language. Implement the example program in another language, but preserve the style.

24.2 Bound but not committed. Find a way to demonstrate that the function chain defined in line #71 is, indeed, just creating the chain of functions without executing them.

24.3 Top 25. Change the top25_freqs function so that instead of accumulating the output on a string, it prints the data onto the screen directly, one word-frequency pair at a time. Do this without violating the constraints of this style.

24.4 True to the style. The goal of this style is to coerce programmers into isolating their IO code from the rest. Two of the three IO-infected functions in the example program, namely extract_words and remove_stop_words, end up doing more than just IO. Refactor the program so that it does a better job at separating IO code from the rest.

24.5 A different task. Write one of the tasks proposed in the Prologue using this style.

1Yes, one could write an essay with the title IO Considered Harmful, and someone already has, albeit in the context of CS education.

2Ignoring unsafePerformIO.

3"There are only two hard things in Computer Science: cache invalidation and naming things." – quote usually attributed to Phil Karlton.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset