Chapter 27

Lazy Rivers


27.1 Constraints

  • Data is available in streams, rather than as a complete whole.
  • Functions are filters/transformers from one kind of data stream to another.
  • Data is processed from upstream on a need basis from downstream.

27.2 A Program in this Style

  1 #!/usr/bin/env python
  2 import sys, operator, string
  3
  4 def characters(filename):
  5     for line in open(filename):
  6         for c in line:
  7             yield c
  8
  9 def all_words(filename):
 10     start_char = True
 11     for c in characters(filename):
 12         if start_char == True:
 13             word = ""
 14             if c.isalnum():
 15                 # We found the start of a word
 16                 word = c.lower()
 17                 start_char = False
 18             else: pass
 19         else:
 20             if c.isalnum():
 21                 word += c.lower()
 22             else:
 23                 # We found the end of a word, emit it
 24                 start_char = True
 25                 yield word
 26
 27 def non_stop_words(filename):
 28     stopwords = set(open('../stop_words.txt').read().split(',') + list(string.ascii_lowercase))
 29     for w in all_words(filename):
 30         if w not in stopwords:
 31             yield w
 32
 33 def count_and_sort(filename):
 34     freqs, i = {}, 1
 35     for w in non_stop_words(filename):
 36         freqs[w] = 1 if w not in freqs else freqs[w]+1
 37         if i % 5000 == 0:
 38             yield sorted(freqs.items(), key=operator.itemgetter(1), reverse=True)
 39         i = i+1
 40     yield sorted(freqs.items(), key=operator.itemgetter(1), reverse=True)
 41 #
 42 # The main function
 43 #
 44 for word_freqs in count_and_sort(sys.argv[1]):
 45     print("-----------------------------")
 46     for (w, c) in word_freqs[0:25]:
 47         print(w, '-', c)

27.3 Commentary

THIS STYLE focuses on the problem of processing data that comes into the application continuously and may not even have an end. The same issues arise when processing data whose size is known but larger than the available memory. The style establishes a flow of data from upstream (data sources) to downstream (data sinks), with processing units along the way. Data is allowed to flow through the stream only when the sinks need it. At any point in time, the only data present in the stream is the data needed to produce whatever the sinks currently request, thereby avoiding the problems raised by holding too much data at once.

The example program consists of 4 functions, all of them generators. Generators are simplified coroutines that allow us to iterate through sequences of data on a need-basis. A generator is a function containing a yield statement where a return statement might normally be found. In the example program the data flow is established from top to bottom, textually: the top function, characters (lines #4–7), connects to the data source (the file) while the main instructions (lines #44–47) drive the fetching and flow of data.
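As a minimal illustration of the mechanism, separate from the example program, consider a generator that counts down from n; the for loop drives it one value at a time:

# A sketch, not part of the example program: yield appears where a
# regular function would use return, and each iteration of the for
# loop resumes the function body at the point of the last yield.
def countdown(n):
    while n > 0:
        yield n
        n -= 1

for x in countdown(3):
    print(x)    # prints 3, then 2, then 1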

Before explaining the data flow control at the bottom, let's take a look at each of the functions, from top to bottom.

  • characters (lines #4–7) iterates through the file one line at a time (line #5). It then iterates on the line one character at a time (line #6), yielding each character downstream (line #7).
  • all_words (lines #9–25) iterates through the characters passed to it by the function above (line #11) looking for words. The logic in this function is related to the identification of the beginning and ending of words. When the end of a word is detected, this function yields that word downstream (line #25).
  • non_stop_words (lines #27–31) iterates through the words passed to it by the function above (line #29). For each word, it checks whether it is a stop word, yielding it downstream only if it is not (lines #30–31).
  • count_and_sort (lines #33–40) iterates through the non-stop words passed to it by the previous function (line #35) and increments the word counts (line #36). For every 5,000 words that it processes, it yields its current word-frequency dictionary, sorted (lines #37–38). The dictionary is also yielded at the end (line #40), because the last batch of words coming from upstream may not be an exact multiple of 5,000.
  • The main instructions (lines #44–47) iterate through the word-frequency dictionaries passed by the previous function (line #44), printing them on the screen.

Functions that contain iterations yielding values are special: the next time that they are called, rather than starting from the beginning, they resume from where they yielded. So, for example, the iteration through characters in line #11 doesn't open the file (line #5) multiple times; instead, after the first call, every subsequent request for the next character in line #11 simply resumes the characters function from where it left off in line #7. The same happens in all the other generators of our program.
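A small sketch, outside the example program, makes this resumption visible; the file name here is just a placeholder:

# The print executes only once, on the first next(); every later
# next() resumes right after the innermost yield.
def characters_demo(filename):
    print("opening file")
    for line in open(filename):
        for c in line:
            yield c

stream = characters_demo('input.txt')   # placeholder file name
print(next(stream))   # prints "opening file", then the first character
print(next(stream))   # the second character; the file is not reopened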

Having seen what each generator does, let's now look at the flow control. In line #44, there is an iteration over a sequence of word-frequency dictionaries. In each iteration, a dictionary is requested from the count_and_sort generator. That request prompts that generator to action: it starts iterating through the non-stop words provided to it by the non_stop_words generator until it gets to 5,000 words, at which point it yields the dictionary downstream. Each iteration that count_and_sort makes through a non-stop word prompts the non_stop_words generator to action: it fetches the next word passed to it from upstream, yielding it downstream if it is a non-stop word, or fetching another word if the one it got was a stop word. Similarly, each time that non_stop_words requests the next word from all_words, it prompts the all_words generator to action: it requests characters from upstream until it identifies a word, at which point it yields that word downstream.
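A traced two-stage sketch, independent of the example program, shows this demand-driven interleaving; the prints reveal that each upstream step runs only when the sink asks for the next value:

def source():
    for i in range(3):
        print("source: producing", i)
        yield i

def doubler(xs):
    for x in xs:
        print("doubler: got", x)
        yield 2 * x

# The output interleaves the stages, one item at a time:
# source: producing 0 / doubler: got 0 / sink: consumed 0 / ...
for value in doubler(source()):
    print("sink: consumed", value)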

The data flow control is, then, driven by the sink code at the bottom: data is only fetched from the source, flowing through the generators in the middle, for as long as the main instructions need it. Hence the adjective lazy in the name of the style, in contrast to the eager form of normal functions. For example, if instead of the iteration in line #44 we had this:

word_freqs = next(count_and_sort(sys.argv[1]))

only one word-frequency dictionary, the first, would be printed out, and only one portion of the input file would be read, instead of the entire file.
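The same laziness can be exploited with the standard itertools module. As a sketch, assuming the count_and_sort generator from the listing is in scope, the following consumes only the first two batches of 5,000 words and leaves the rest of the file unread:

import itertools, sys

# islice bounds the lazy stream: the upstream generators are pulled
# only far enough to produce the first two word-frequency dictionaries.
for word_freqs in itertools.islice(count_and_sort(sys.argv[1]), 2):
    print(word_freqs[0:25])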

There are many similarities between the Lazy Rivers style and the Pipeline style described in Chapter 5. The important difference is in the way that the data flows through the functions: in the Pipeline style, the data is one single "blob" that passes from one function to another, all at once (e.g. the entire list of words); in the Lazy Rivers style, the data flows, lazily, piece by piece; it is only fetched from the source as needed by the sink.
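As a rough sketch of that difference, here is the same filtering stage written both ways; the stopwords set is assumed to be in scope, as in the example program:

# Pipeline style: the whole list is built and returned at once.
def non_stop_words_eager(words):
    return [w for w in words if w not in stopwords]   # stopwords: assumed in scope

# Lazy Rivers style: one word flows per downstream request.
def non_stop_words_lazy(words):
    for w in words:
        if w not in stopwords:
            yield w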

The Lazy Rivers style is nicely expressed when programming languages support generators. Some programming languages, e.g. Java, don't support generators; the style can be implemented with iterators instead, but the resulting code is considerably more verbose. When the programming language supports neither generators nor iterators, it is still possible to honor the goals of this style, but expressing that intent is considerably more complicated. In that case, the next best mechanism for implementing the constraints underlying this style is threads; the style presented in the next chapter, Actors, would be a good fit for this data-centric problem.
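For comparison, here is a sketch of the more verbose iterator protocol in Python, doing explicitly what the countdown generator above does implicitly; the class name is ours:

class Countdown:
    def __init__(self, n):
        self.n = n
    def __iter__(self):
        return self           # the object is its own iterator
    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1     # iteration state kept explicitly in self.n

for x in Countdown(3):
    print(x)   # prints 3, then 2, then 1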

27.4 This Style in Systems Design

The Lazy Rivers style is of great value in data-intensive applications, especially those where the data is either live-streamed, or very large, or both. Its strength comes from the fact that only a portion of data needs to be held in memory at any point in time, that amount being driven by the end-goal need for the data.

Within components, language support for generators makes for elegant programs in this style. As will be seen in Chapter 28, threads used in a special way are a viable alternative for implementing Lazy Rivers. Threads, however, tend to be much heavier than generators in terms of creation and context switching.

27.5 Historical Notes

Coroutines were first introduced in the context of a compiler for COBOL in 1963. However, they have not been consistently incorporated into programming languages since then. Several mainstream languages – most notably C/C++ and Java – lack support for any flavor of coroutines.

Generators were first described in the context of the language CLU circa 1977, where they were called iterators. These days, the word iterator is used to denote the object-oriented flavor of the concept, i.e. an object that traverses a container. The word generator has converged to denoting the specialized coroutines that support iteration.

27.6 Further Reading

Conway, M. (1963). Design of a separable transition-diagram compiler. Communications of the ACM 6(7): 396–408.
Synopsis: The description of the COBOL compiler design, presenting coroutines.

Liskov, B., Snyder, A., Atkinson, R. and Schaffert, C. (1977). Abstraction mechanisms in CLU. Communications of the ACM 20(8): 564–576.
Synopsis: This paper describes the language CLU, which featured the early concept of iterators.

27.7 Glossary

Coroutine: A procedure that allows multiple entry and exit points for suspending and resuming execution.

Generator: (aka Semicoroutine) A special kind of coroutine used to control iteration over a sequence of values. A generator always yields control back to the caller, rather than to an arbitrary place of the program.

Iterator: An object that is used to traverse a sequence of values.

27.8 Exercises

27.1 Another language. Implement the example program in another language, but preserve the style.

27.2 Lines vs. characters. The example program, in its eagerness to demonstrate data flows of all kinds, ends up doing something monolithic – the function all_words (yuk!). It would be much better to use Python's facilities to handle words (e.g. split). Change the program, without changing the style, so that the first generator yields an entire line, and the second yields words from those lines using the proper library functions.

27.3 Iterators. Some languages that don't support generators support their more verbose cousins, iterators (e.g. Java). Python supports both. Change the example program so that it uses iterators instead of generators.

27.4 A different task. Write one of the tasks proposed in the Prologue using this style.
