Chapter 6. Generators and Coroutines – Infinity, One Step at a Time

A generator is a specific type of iterator that generates values through a function. Whereas a traditional function builds and returns a complete list of items, a generator simply yields each value separately at the moment it is requested by the caller. This approach has several benefits:

  • Generators pause execution completely after yielding a value and resume only when the next one is requested, which makes them completely lazy. If you fetch five items from a generator, only five items will be generated, so no other computation is performed.
  • Generators have no need to save values. Whereas a traditional function would require creating a list and storing all results until they are returned, a generator only needs to store a single value.
  • Generators can have infinite size. There is no requirement to stop at a certain point.

These benefits come at a price, however, and bring a few disadvantages with them:

  • Until you are done processing, you never know how many values are left; it could even be infinite. This makes certain usage dangerous: executing list(some_infinite_generator) will eventually run out of memory.
  • You cannot slice generators.
  • You cannot get a specific item without consuming all of the values before that index.
  • You cannot restart a generator. All values are yielded exactly once.

In addition to generators, there is a variation on the generator syntax that creates coroutines. Coroutines are functions that allow for multitasking without requiring multiple threads or processes. Whereas generators can only yield values to the caller, coroutines can also receive values from the caller while they are running. While this technique has a few limitations, if it suits your purpose, it can result in great performance at very little cost.
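
To give a first impression before the full treatment later in this chapter, here is a minimal sketch of a generator-based coroutine (the names are purely illustrative): the value passed to send() shows up as the result of the yield expression inside the function:

>>> def echo():
...     while True:
...         # Whatever the caller passes to send() is returned by yield
...         received = yield
...         print('Received: %s' % received)

>>> coroutine = echo()
>>> # A single next() call is needed to advance to the first yield
>>> next(coroutine)
>>> coroutine.send('spam')
Received: spam
>>> coroutine.send('eggs')
Received: eggs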

In short, the topics covered in this chapter are:

  • The characteristics and uses of generators
  • Generator comprehensions
  • Generator functions
  • Generator classes
  • Bundled generators
  • Coroutines

What are generators?

A generator, in its simplest form, is a function that returns elements one at a time instead of returning a collection of items. The most important advantage of this is that it requires very little memory and that it doesn't need to have a predefined size. Creating an endless generator (such as the itertools.count iterator discussed in Chapter 4, Functional Programming – Readability Versus Brevity) is actually quite easy, but it does come with a cost, of course. Not having the size of an object available makes certain patterns difficult to achieve.

The basic trick in writing generators (as functions) is using the yield statement. Let's use the itertools.count generator as an example and extend it with a stop variable:

>>> def count(start=0, step=1, stop=10):
...     n = start
...     while n <= stop:
...         yield n
...         n += step

>>> for x in count(10, 2.5, 20):
...     print(x)
10
12.5
15.0
17.5
20.0

Due to the potentially infinite nature of generators, caution is required. Without the stop variable, simply doing list(count()) would result in an out-of-memory situation quite fast.
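
If you do need a bounded number of values from a potentially infinite generator, the itertools.islice function offers a safe way to take only a few items. As a minimal sketch, applied to the endless itertools.count:

>>> import itertools

>>> # Take only the first five values from the infinite itertools.count
>>> list(itertools.islice(itertools.count(10, 2.5), 5))
[10, 12.5, 15.0, 17.5, 20.0]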

So how does the count generator above work? The body is essentially just a normal loop, but the big difference between this and the regular method of returning a list of items is that the yield statement hands the items back one at a time. An important thing to note here is that a return statement inside a generator results in a StopIteration, and anything passed along to return becomes the argument of that StopIteration. It should be noted that this behavior changed in Python 3.3; in Python 3.2 and earlier versions, it was simply not possible to return anything other than None. Here is an example:

>>> def generator():
...     yield 'this is a generator'
...     return 'returning from a generator'

>>> g = generator()
>>> next(g)
'this is a generator'
>>> next(g)
Traceback (most recent call last):
    ...
StopIteration: returning from a generator

Of course, as always, there are multiple ways of creating generators with Python. Other than functions, there are also generator comprehensions and classes that can do the same thing. Generator comprehensions are pretty much identical to list comprehensions but use parentheses instead of brackets. For example:

>>> generator = (x ** 2 for x in range(4))

>>> for x in generator:
...    print(x)
0
1
4
9

For completeness, the class version of the count function is as follows:

>>> class Count(object):
...     def __init__(self, start=0, step=1, stop=10):
...         self.n = start
...         self.step = step
...         self.stop = stop
...
...     def __iter__(self):
...         return self
...
...     def __next__(self):
...         n = self.n
...         if n > self.stop:
...             raise StopIteration()
...
...         self.n += self.step
...         return n

>>> for x in Count(10, 2.5, 20):
...     print(x)
10
12.5
15.0
17.5
20.0

The biggest difference between the class-based and the function-based approach is that you are required to raise StopIteration explicitly instead of simply returning from the function. Beyond that, they are quite similar, although the class-based version obviously adds some verbosity.

Advantages and disadvantages of generators

You have seen a few examples of generators and know the basics of what you can do with them. However, it is important to keep their advantages and disadvantages in mind.

The following are the most important pros:

  • Memory usage. Items can be processed one at a time, so there is generally no need to keep the entire list in memory.
  • The results can depend on outside factors, instead of having a static list. Think of processing a queue/stack for example.
  • Generators are lazy. This means that if you're using only the first five results of a generator, the rest won't even be calculated.
  • Generally, generators are simpler to write than the equivalent list-generating functions.

The most important cons:

  • The results are available only once. After processing the results of a generator, it cannot be used again.
  • The size is unknown until you are done processing, which can be detrimental to certain algorithms.
  • Generators are not indexable, which means that some_generator[5] will not work; this and the single-use behavior are demonstrated in the short sketch after this list.
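
A minimal sketch demonstrating the first and last of these points; itertools.islice is shown as a way to reach a specific index, although it still consumes everything up to that point:

>>> import itertools

>>> some_generator = (x ** 2 for x in range(10))

>>> # Indexing a generator raises a TypeError
>>> some_generator[5]
Traceback (most recent call last):
    ...
TypeError: 'generator' object is not subscriptable

>>> # islice can fetch a specific item, but it consumes everything
>>> # up to (and including) that item along the way
>>> next(itertools.islice(some_generator, 5, 6))
25

>>> # The remaining results are available only once
>>> list(some_generator)
[36, 49, 64, 81]
>>> list(some_generator)
[]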

Considering all the advantages and disadvantages, my general advice would be to use generators where possible and only return a list or tuple when you actually need to. Converting a generator to a list is as simple as list(some_generator), so that shouldn't stop you, especially since generator functions tend to be simpler than their list-producing equivalents.

The memory usage advantage is understandable; one item requires less memory than many items. The lazy part, however, needs some additional explanation as it has a small snag:

>>> def generator():
...     print('Before 1')
...     yield 1
...     print('After 1')
...     print('Before 2')
...     yield 2
...     print('After 2')
...     print('Before 3')
...     yield 3
...     print('After 3')

>>> g = generator()
>>> print('Got %d' % next(g))
Before 1
Got 1

>>> print('Got %d' % next(g))
After 1
Before 2
Got 2

As you can see, the generator effectively freezes right after the yield statement, so even After 2 won't print until 3 is requested.

This has important advantages, but it's definitely something you need to take into consideration. You can't put your cleanup code right after a yield, as it won't be executed until the next value is requested.
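
If you do need guaranteed cleanup, one option is sketched below: wrapping the yield statements in a try/finally block makes the finally clause run when the generator is exhausted or closed (close() raises GeneratorExit inside the generator, which also happens when it is garbage collected):

>>> def generator_with_cleanup():
...     print('Setup')
...     try:
...         yield 1
...         yield 2
...     finally:
...         # Runs on exhaustion, on close() and on garbage collection
...         print('Cleanup')

>>> g = generator_with_cleanup()
>>> next(g)
Setup
1
>>> g.close()
Cleanup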

Pipelines – an effective use of generators

The theoretical possibilities of generators are infinite (no pun intended), but their practical uses can be difficult to find. If you are familiar with the Unix/Linux shell, you have probably used pipes before, for example ps aux | grep python to list all Python processes. There are many ways to do this, of course, but let's emulate something similar in Python to see a practical example. To create an easy and consistent output, we will create a file called lines.txt with the following lines:

spam
eggs
spam spam
eggs eggs
spam spam spam
eggs eggs eggs

Now, let's take the following Linux/Unix/Mac shell command to read the file with some modifications:

# cat lines.txt | grep spam | sed 's/spam/bacon/g'
bacon
bacon bacon
bacon bacon bacon

This reads the file using cat, outputs all lines that contain spam using grep, and replaces spam with bacon using the sed command. Now let's see how we can recreate this with the use of Python generators:

>>> def cat(filename):
...     for line in open(filename):
...         yield line.rstrip()
...
>>> def grep(sequence, search):
...     for line in sequence:
...         if search in line:
...             yield line
...
>>> def replace(sequence, search, replace):
...     for line in sequence:
...         yield line.replace(search, replace)
...
>>> lines = cat('lines.txt')
>>> spam_lines = grep(lines, 'spam')
>>> bacon_lines = replace(spam_lines, 'spam', 'bacon')

>>> for line in bacon_lines:
...     print(line)
...
bacon
bacon bacon
bacon bacon bacon

# Or the one-line version, fits within 78 characters:
>>> for line in replace(grep(cat('lines.txt'), 'spam'),
...                     'spam', 'bacon'):
...     print(line)
...
bacon
bacon bacon
bacon bacon bacon

That's the big advantage of generators. You can wrap a list or sequence multiple times with very little performance impact. Not a single one of the functions involved executes anything until a value is requested.

tee – using an output multiple times

As mentioned before, one of the biggest disadvantages of generators is that the results are usable only once. Luckily, Python has a function that allows you to copy the output to several generators. The name tee might be familiar to you if you are used to working in a command-line shell. The tee program allows you to write outputs to both the screen and a file, so you can store an output while still maintaining a live view of it.

The Python version, itertools.tee, does a similar thing except that it returns several iterators, allowing you to process the results separately.

By default, tee will split your generator into a tuple containing two different generators, which is why tuple unpacking works nicely here. By passing along the n parameter, this can easily be changed to support more than two generators (a short sketch of that follows the example below). Here is an example:

>>> import itertools

>>> def spam_and_eggs():
...     yield 'spam'
...     yield 'eggs'

>>> a, b = itertools.tee(spam_and_eggs())
>>> next(a)
'spam'
>>> next(a)
'eggs'
>>> next(b)
'spam'
>>> next(b)
'eggs'
>>> next(b)
Traceback (most recent call last):
    ...
StopIteration
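
As a quick sketch of the n parameter mentioned above, the same generator can just as easily be split into three independent iterators:

>>> a, b, c = itertools.tee(spam_and_eggs(), 3)
>>> next(a), next(b), next(c)
('spam', 'spam', 'spam')
>>> next(a)
'eggs'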

After seeing this code, you might be wondering about the memory usage of tee. Does it need to store the entire list for you? Luckily, no. The tee function is pretty smart in handling this. Assume you have a generator that contains 1,000 items, and you read the first 100 items from a and the first 75 items from b simultaneously. Then tee will only keep the difference (100 - 75 = 25 items) in memory and drop the rest while you are iterating over the results.

Whether tee is the best solution depends on your case, of course. If instance a is read from the beginning to (nearly) the end before instance b is read, then it would not be a great idea to use tee; simply converting the generator to a list would be faster, since it involves far fewer operations.

Generating from generators

As we have seen before, we can use generators to filter, modify, add, and remove items. In many cases, however, you'll notice that when writing generators, you'll be yielding values from sub-generators and/or other sequences. An example of this is creating a powerset using the itertools library:

>>> import itertools

>>> def powerset(sequence):
...     for size in range(len(sequence) + 1):
...         for item in itertools.combinations(sequence, size):
...             yield item

>>> for result in powerset('abc'):
...     print(result)
()
('a',)
('b',)
('c',)
('a', 'b')
('a', 'c')
('b', 'c')
('a', 'b', 'c')

This pattern was so common that the yield syntax was actually enhanced to make this even easier. Instead of manually looping over the results, Python 3.3 introduced the yield from syntax, which makes this common pattern even simpler:

>>> import itertools

>>> def powerset(sequence):
...     for size in range(len(sequence) + 1):
...         yield from itertools.combinations(sequence, size)

>>> for result in powerset('abc'):
...     print(result)
()
('a',)
('b',)
('c',)
('a', 'b')
('a', 'c')
('b', 'c')
('a', 'b', 'c')

And that's how you create a powerset in only three lines of code.

Perhaps a more useful example of this is flattening a sequence recursively:

>>> def flatten(sequence):
...     for item in sequence:
...         try:
...             yield from flatten(item)
...         except TypeError:
...             yield item
...
>>> list(flatten([1, [2, [3, [4, 5], 6], 7], 8]))
[1, 2, 3, 4, 5, 6, 7, 8]

Note that this code uses TypeError to detect non-iterable objects. The result is that if the sequence (which could be a generator) raises a TypeError itself, that error will be silently hidden.

Also note that this is a very basic flattening function with no type checking whatsoever. An iterable containing a str, for example, will be flattened recursively until the maximum recursion depth is reached, since every item of a str is itself a str.
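
A slightly more defensive variation, sketched below (and still far from production-ready), checks for collections.abc.Iterable explicitly instead of relying on TypeError, and leaves strings alone:

>>> from collections.abc import Iterable

>>> def flatten(sequence):
...     for item in sequence:
...         # Strings are iterable too, so they need an explicit exception
...         if isinstance(item, Iterable) and not isinstance(item, str):
...             yield from flatten(item)
...         else:
...             yield item
...
>>> list(flatten([1, [2, [3, 'spam'], 4], 5]))
[1, 2, 3, 'spam', 4, 5]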

Context managers

As with most of the techniques described in this book, Python also comes bundled with a few useful generators. Some of these (itertools and contextlib.contextmanager for example) have already been discussed in Chapter 4, Functional Programming – Readability Versus Brevity and Chapter 5, Decorators – Enabling Code Reuse by Decorating but we can use some extra examples to demonstrate how simple and powerful they can be.

The Python context managers do not appear to be directly related to generators, but generators are a large part of what they use internally:

>>> import datetime
>>> import contextlib

# Context manager that shows how long a context was active
>>> @contextlib.contextmanager
... def timer(name):
...     start_time = datetime.datetime.now()
...     yield
...     stop_time = datetime.datetime.now()
...     print('%s took %s' % (name, stop_time - start_time))

# The write to log function writes all stdout (regular print data) to
# a file. The contextlib.redirect_stdout context wrapper
# temporarily redirects standard output to a given file handle, in
# this case the file we just opened for writing.
>>> @contextlib.contextmanager
... def write_to_log(name):
...     with open('%s.txt' % name, 'w') as fh:
...         with contextlib.redirect_stdout(fh):
...             with timer(name):
...                 yield

# Use the context manager as a decorator
>>> @write_to_log('some function')
... def some_function():
...     print('This function takes a bit of time to execute')
...     ...
...     print('Do more...')

>>> some_function()

While all this works just fine, the three levels of context managers tend to get a bit unreadable. Generally, decorators can solve this. In this case, however, we need the output from one context manager as the input for the next.

That's where ExitStack comes in. It allows easy combining of multiple context managers:

>>> import contextlib


>>> @contextlib.contextmanager
... def write_to_log(name):
...     with contextlib.ExitStack() as stack:
...         fh = stack.enter_context(open('stdout.txt', 'w'))
...         stack.enter_context(contextlib.redirect_stdout(fh))
...         stack.enter_context(timer(name))
...
...         yield

>>> @write_to_log('some function')
... def some_function():
...     print('This function takes a bit of time to execute')
...     ...
...     print('Do more...')

>>> some_function()

Looks at least a bit simpler, doesn't it? While the necessity is limited in this case, the convenience of ExitStack quickly becomes apparent when you need to handle specific teardowns. In addition to the automatic handling seen before, it's also possible to transfer the contexts to a new ExitStack and handle the closing manually:

>>> import contextlib


>>> with contextlib.ExitStack() as stack:
...     spam_fh = stack.enter_context(open('spam.txt', 'w'))
...     eggs_fh = stack.enter_context(open('eggs.txt', 'w'))
...     spam_bytes_written = spam_fh.write('writing to spam')
...     eggs_bytes_written = eggs_fh.write('writing to eggs')
...     # Move the contexts to a new ExitStack and store the
...     # close method
...     close_handlers = stack.pop_all().close

>>> spam_bytes_written = spam_fh.write('still writing to spam')
>>> eggs_bytes_written = eggs_fh.write('still writing to eggs')

# After closing we can't write anymore
>>> close_handlers()
>>> spam_bytes_written = spam_fh.write('cant write anymore')
Traceback (most recent call last):
    ...
ValueError: I/O operation on closed file.

Most of the contextlib functions have extensive documentation available in the Python manual. ExitStack in particular is documented using many examples at https://docs.python.org/3/library/contextlib.html#contextlib.ExitStack. I recommend keeping an eye on the contextlib documentation as it is improving greatly with every Python version.
