A generator is a specific type of iterator that produces values through a function. While traditional methods build and return a list of items, a generator simply yields each value separately, at the moment it is requested by the caller. This approach has several benefits; most importantly, instead of building a complete list and storing all results until they are returned, a generator only needs to store a single value at a time.

These benefits come at a price, however, in the form of a few disadvantages: the results of a generator are usable only once, the size is not known in advance, and with an endless generator, something like list(some_infinite_generator) will run out of memory.

In addition to generators, there is a variation of the generator syntax that creates coroutines. Coroutines are functions that allow for multitasking without requiring multiple threads or processes. Whereas generators can only yield values to the caller, coroutines can also receive values from the caller while they are still running. While this technique has a few limitations, if it suits your purpose, it can result in great performance at very little cost.
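To illustrate the idea (a minimal sketch of my own, not this chapter's example), a coroutine can receive values through the yield expression via the generator's send() method:

```python
def averager():
    # A coroutine: `value = yield average` both yields the current
    # average to the caller and receives the next number from it.
    total = 0.0
    count = 0
    average = None
    while True:
        value = yield average
        total += value
        count += 1
        average = total / count

avg = averager()
next(avg)            # Prime the coroutine: run it up to the first yield
print(avg.send(10))  # 10.0
print(avg.send(20))  # 15.0
print(avg.send(30))  # 20.0
```

Note the priming call to next(); a coroutine must be advanced to its first yield before it can accept values through send().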
In short, the topics covered in this chapter are:
A generator, in its simplest form, is a function that returns elements one at a time instead of returning a collection of items. The most important advantages of this are that it requires very little memory and that it doesn't need a predefined size. Creating an endless generator (such as the itertools.count iterator discussed in Chapter 4, Functional Programming – Readability Versus Brevity) is actually quite easy, but it does come with a cost, of course: not having the size of an object available makes certain patterns difficult to achieve.
The basic trick in writing generators (as functions) is using the yield
statement. Let's use the itertools.count
generator as an example and extend it with a stop
variable:
>>> def count(start=0, step=1, stop=10):
...     n = start
...     while n <= stop:
...         yield n
...         n += step

>>> for x in count(10, 2.5, 20):
...     print(x)
10
12.5
15.0
17.5
20.0
Due to the potentially infinite nature of generators, caution is required. Without the stop
variable, simply doing list(count())
would result in an out-of-memory situation quite fast.
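When you do want a bounded number of items from a potentially infinite generator, itertools.islice provides a safe alternative to list(). A small sketch, using an infinite variant of count without the stop variable:

```python
import itertools

def count(start=0, step=1):
    # An infinite counter, similar to itertools.count
    n = start
    while True:
        yield n
        n += step

# islice stops requesting values after five items, so the
# infinite generator is never exhausted (and never needs to be).
print(list(itertools.islice(count(), 5)))  # [0, 1, 2, 3, 4]
```

Because islice is itself lazy, only the requested values are ever generated.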
So how does this work? It's just a normal for loop, but the big difference between this and the regular method of returning a list of items is that the yield statement returns the items one at a time. An important thing to note here is that a return statement inside a generator raises StopIteration, and any value passed along to return becomes the argument of that StopIteration. Note that this behavior changed in Python 3.3; in Python 3.2 and earlier versions, it was simply not possible to return anything other than None. Here is an example:
>>> def generator():
...     yield 'this is a generator'
...     return 'returning from a generator'

>>> g = generator()
>>> next(g)
'this is a generator'
>>> next(g)
Traceback (most recent call last):
  ...
StopIteration: returning from a generator
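This return value is also available programmatically: since Python 3.3 it travels along as the value attribute of the raised StopIteration. A small sketch:

```python
def generator():
    yield 'this is a generator'
    return 'returning from a generator'

g = generator()
print(next(g))  # this is a generator
try:
    next(g)
except StopIteration as exc:
    # The generator's return value rides along on the exception
    print(exc.value)  # returning from a generator
```

This is exactly the mechanism that yield from uses to capture the result of a sub-generator.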
Of course, as always, there are multiple ways of creating generators in Python. Besides functions, there are also generator comprehensions and classes that can do the same thing. Generator comprehensions are pretty much identical to list comprehensions but use parentheses instead of brackets; for example:
>>> generator = (x ** 2 for x in range(4))
>>> for x in generator:
...     print(x)
0
1
4
9
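As a side note, when a generator comprehension is the sole argument to a function call, the extra parentheses can be omitted. A small sketch:

```python
# The squares of 0, 1, 2 and 3 are 0, 1, 4 and 9; summing them gives 14.
# A generator expression as the sole argument needs no extra parentheses.
total = sum(x ** 2 for x in range(4))
print(total)  # 14
```

Since sum consumes the values one at a time, no intermediate list of squares is ever built.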
For completeness, the class version of the count
function is as follows:
>>> class Count(object):
...     def __init__(self, start=0, step=1, stop=10):
...         self.n = start
...         self.step = step
...         self.stop = stop
...
...     def __iter__(self):
...         return self
...
...     def __next__(self):
...         n = self.n
...         if n > self.stop:
...             raise StopIteration()
...
...         self.n += self.step
...         return n

>>> for x in Count(10, 2.5, 20):
...     print(x)
10
12.5
15.0
17.5
20.0
The biggest difference between the class-based and the function-based approach is that you are required to raise StopIteration explicitly instead of simply returning from the function. Beyond that, they are quite similar, although the class-based version obviously adds some verbosity.
You have seen a few examples of generators and know the basics of what you can do with them. However, it is important to keep their advantages and disadvantages in mind.
The following are the most important pros: generators are memory efficient, since they only need to keep a single value at a time, and they are lazy, computing values only at the moment they are requested.

The most important cons mirror these benefits: the results are usable only once, the size is not known in advance, and indexing or slicing does not work, so some_generator[5] will not work.

Considering all the advantages and disadvantages, my general advice would be to use generators where possible and only convert to a list or tuple when you actually need to. Converting a generator to a list is as simple as list(some_generator), so that shouldn't stop you, since generator functions tend to be simpler than their list-producing equivalents.
The memory usage advantage is understandable; one item requires less memory than many items. The lazy part, however, needs some additional explanation as it has a small snag:
>>> def generator():
...     print('Before 1')
...     yield 1
...     print('After 1')
...     print('Before 2')
...     yield 2
...     print('After 2')
...     print('Before 3')
...     yield 3
...     print('After 3')

>>> g = generator()
>>> print('Got %d' % next(g))
Before 1
Got 1
>>> print('Got %d' % next(g))
After 1
Before 2
Got 2
As you can see, the generator effectively freezes right after the yield statement, so even After 2 won't print until 3 is yielded.
This has important advantages, but it's definitely something you need to take into consideration. You can't have your cleanup right after the yield, as it won't be executed until the next yield.
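If cleanup must happen regardless of where consumption stops, one common approach (my own sketch, not from the original text) is a try/finally inside the generator; exhausting, closing, or garbage collecting the generator then runs the finally block:

```python
def reader(log):
    # The hypothetical `log` list stands in for real setup/teardown work
    log.append('setup')
    try:
        yield 1
        yield 2
    finally:
        # Runs when the generator is exhausted, closed, or collected
        log.append('cleanup')

log = []
g = reader(log)
print(next(g))  # 1
g.close()       # Raises GeneratorExit inside the generator
print(log)      # ['setup', 'cleanup']
```

The close() method throws GeneratorExit into the generator at the point where it is frozen, which makes the finally block execute immediately rather than at some unspecified later time.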
The theoretical possibilities of generators are infinite (no pun intended), but their practical uses can be difficult to find. If you are familiar with the Unix/Linux shell, you have probably used pipes before, for example ps aux | grep python to list all Python processes. There are many ways to do this, of course, but let's emulate something similar in Python to see a practical example. To create an easy and consistent output, we will create a file called lines.txt with the following lines:
spam
eggs
spam spam
eggs eggs
spam spam spam
eggs eggs eggs
Now, let's take the following Linux/Unix/Mac shell command to read the file with some modifications:
# cat lines.txt | grep spam | sed 's/spam/bacon/g'
bacon
bacon bacon
bacon bacon bacon
This reads the file using cat, outputs all lines that contain spam using grep, and replaces spam with bacon using the sed command. Now let's see how we can recreate this with the use of Python generators:
>>> def cat(filename):
...     for line in open(filename):
...         yield line.rstrip()
...
>>> def grep(sequence, search):
...     for line in sequence:
...         if search in line:
...             yield line
...
>>> def replace(sequence, search, replace):
...     for line in sequence:
...         yield line.replace(search, replace)
...
>>> lines = cat('lines.txt')
>>> spam_lines = grep(lines, 'spam')
>>> bacon_lines = replace(spam_lines, 'spam', 'bacon')
>>> for line in bacon_lines:
...     print(line)
...
bacon
bacon bacon
bacon bacon bacon

# Or the one-line version, fits within 78 characters:
>>> for line in replace(grep(cat('lines.txt'), 'spam'),
...                     'spam', 'bacon'):
...     print(line)
...
bacon
bacon bacon
bacon bacon bacon
That's the big advantage of generators. You can wrap a list or sequence multiple times with very little performance impact. Not a single one of the functions involved executes anything until a value is requested.
As mentioned before, one of the biggest disadvantages of generators is that the results are usable only once. Luckily, Python has a function that allows you to copy the output to several generators. The name tee
might be familiar to you if you are used to working in a command-line shell. The tee
program allows you to write outputs to both the screen and a file, so you can store an output while still maintaining a live view of it.
The Python version, itertools.tee
, does a similar thing except that it returns several iterators, allowing you to process the results separately.
By default, tee will split your generator into a tuple containing two different generators, which is why tuple unpacking works nicely here. By passing along the n parameter, this can easily be changed to support more than two generators. Here is an example:
>>> import itertools
>>> def spam_and_eggs():
...     yield 'spam'
...     yield 'eggs'

>>> a, b = itertools.tee(spam_and_eggs())
>>> next(a)
'spam'
>>> next(a)
'eggs'
>>> next(b)
'spam'
>>> next(b)
'eggs'
>>> next(b)
Traceback (most recent call last):
  ...
StopIteration
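Passing 3 as the second argument yields three independent iterators instead of two. A small sketch:

```python
import itertools

def spam_and_eggs():
    yield 'spam'
    yield 'eggs'

# tee(iterable, n) returns a tuple of n independent iterators
a, b, c = itertools.tee(spam_and_eggs(), 3)
print(next(a))  # spam
print(list(b))  # ['spam', 'eggs']
print(list(c))  # ['spam', 'eggs']
```

Each iterator keeps its own position; advancing a has no effect on b or c.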
After seeing this code, you might be wondering about the memory usage of tee. Does it need to store the entire list for you? Luckily, no. The tee function is pretty smart in handling this. Assume you have a generator that contains 1,000 items, and you read the first 100 items from a and the first 75 items from b simultaneously. Then tee will only keep the difference (100 - 75 = 25 items) in memory and drop the rest while you are iterating the results.
Whether tee is the best solution for your case depends, of course. If instance a is read from the beginning to (nearly) the end before instance b is read, then it would not be a great idea to use tee; simply converting the generator to a list would be faster, since it involves far fewer operations.
As we have seen before, we can use generators to filter, modify, add, and remove items. In many cases, however, you'll notice that when writing generators, you'll be yielding values from sub-generators and/or sequences. An example of this is creating a powerset using the itertools library:
>>> import itertools
>>> def powerset(sequence):
...     for size in range(len(sequence) + 1):
...         for item in itertools.combinations(sequence, size):
...             yield item

>>> for result in powerset('abc'):
...     print(result)
()
('a',)
('b',)
('c',)
('a', 'b')
('a', 'c')
('b', 'c')
('a', 'b', 'c')
This pattern is so common that the yield syntax was actually enhanced to support it. Instead of manually looping over the results, Python 3.3 introduced the yield from syntax, which makes this pattern even simpler:
>>> import itertools
>>> def powerset(sequence):
...     for size in range(len(sequence) + 1):
...         yield from itertools.combinations(sequence, size)

>>> for result in powerset('abc'):
...     print(result)
()
('a',)
('b',)
('c',)
('a', 'b')
('a', 'c')
('b', 'c')
('a', 'b', 'c')
And that's how you create a powerset in only three lines of code.
Perhaps a more useful example of this is flattening a sequence recursively:
>>> def flatten(sequence):
...     for item in sequence:
...         try:
...             yield from flatten(item)
...         except TypeError:
...             yield item
...
>>> list(flatten([1, [2, [3, [4, 5], 6], 7], 8]))
[1, 2, 3, 4, 5, 6, 7, 8]
Note that this code uses TypeError to detect non-iterable objects. The downside is that if the sequence itself (which could be a generator) raises a TypeError, the error will be silently hidden.
Also note that this is a very basic flattening function with no type checking whatsoever. An iterable containing a str, for example, will be flattened recursively until the maximum recursion depth is reached, since every item of a str is itself a str.
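A slightly more defensive variant (my own sketch, not from the original text) avoids both pitfalls: it treats strings and bytes as atoms, and it only catches the TypeError raised by iter() itself, so errors raised from within the items are not swallowed:

```python
def flatten(sequence):
    for item in sequence:
        # Strings are iterables of single-character strings, so they
        # must be treated as atoms to avoid infinite recursion.
        if isinstance(item, (str, bytes)):
            yield item
            continue
        try:
            iterator = iter(item)
        except TypeError:
            # Not iterable at all: yield the item itself
            yield item
        else:
            # Iterable: recurse, without a try block around the
            # iteration so genuine errors still propagate
            yield from flatten(iterator)

print(list(flatten([1, [2, ['three', [4, 5]], 6]])))
# [1, 2, 'three', 4, 5, 6]
```

Only the iter() call sits inside the try block, so a TypeError raised while producing items propagates to the caller instead of being misinterpreted as "not iterable".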
As with most of the techniques described in this book, Python also comes bundled with a few useful generators. Some of these (itertools and contextlib.contextmanager, for example) have already been discussed in Chapter 4, Functional Programming – Readability Versus Brevity and Chapter 5, Decorators – Enabling Code Reuse by Decorating, but we can use some extra examples to demonstrate how simple and powerful they can be.
The Python context managers do not appear to be directly related to generators, but generators are a large part of what they use internally:
>>> import datetime
>>> import contextlib

# Context manager that shows how long a context was active
>>> @contextlib.contextmanager
... def timer(name):
...     start_time = datetime.datetime.now()
...     yield
...     stop_time = datetime.datetime.now()
...     print('%s took %s' % (name, stop_time - start_time))

# The write to log function writes all stdout (regular print data) to
# a file. The contextlib.redirect_stdout context wrapper
# temporarily redirects standard output to a given file handle, in
# this case the file we just opened for writing.
>>> @contextlib.contextmanager
... def write_to_log(name):
...     with open('%s.txt' % name, 'w') as fh:
...         with contextlib.redirect_stdout(fh):
...             with timer(name):
...                 yield

# Use the context manager as a decorator
>>> @write_to_log('some function')
... def some_function():
...     print('This function takes a bit of time to execute')
...     ...
...     print('Do more...')

>>> some_function()
While all this works just fine, the three levels of context managers tend to get a bit unreadable. Generally, decorators can solve this. In this case, however, we need the output from one context manager as the input for the next.
That's where ExitStack comes in. It allows easy combining of multiple context managers:
>>> import contextlib

>>> @contextlib.contextmanager
... def write_to_log(name):
...     with contextlib.ExitStack() as stack:
...         fh = stack.enter_context(open('stdout.txt', 'w'))
...         stack.enter_context(contextlib.redirect_stdout(fh))
...         stack.enter_context(timer(name))
...
...         yield

>>> @write_to_log('some function')
... def some_function():
...     print('This function takes a bit of time to execute')
...     ...
...     print('Do more...')

>>> some_function()
Looks at least a bit simpler, doesn't it? While the necessity is limited in this case, the convenience of ExitStack quickly becomes apparent when you need to do specific teardowns. In addition to the automatic handling seen before, it's also possible to transfer the contexts to a new ExitStack and handle the closing manually:
>>> import contextlib

>>> with contextlib.ExitStack() as stack:
...     spam_fh = stack.enter_context(open('spam.txt', 'w'))
...     eggs_fh = stack.enter_context(open('eggs.txt', 'w'))
...     spam_bytes_written = spam_fh.write('writing to spam')
...     eggs_bytes_written = eggs_fh.write('writing to eggs')
...     # Move the contexts to a new ExitStack and store the
...     # close method
...     close_handlers = stack.pop_all().close

>>> spam_bytes_written = spam_fh.write('still writing to spam')
>>> eggs_bytes_written = eggs_fh.write('still writing to eggs')

# After closing we can't write anymore
>>> close_handlers()
>>> spam_bytes_written = spam_fh.write('cant write anymore')
Traceback (most recent call last):
  ...
ValueError: I/O operation on closed file.
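Besides enter_context, ExitStack can also register plain functions as teardown steps through its callback() method; these run in reverse (LIFO) order when the stack unwinds. A small sketch, using a list to record the order:

```python
import contextlib

log = []
with contextlib.ExitStack() as stack:
    # Callbacks run last-in, first-out when the with-block exits
    stack.callback(log.append, 'first registered, runs last')
    stack.callback(log.append, 'last registered, runs first')
    log.append('inside the block')

print(log)
# ['inside the block', 'last registered, runs first',
#  'first registered, runs last']
```

This mirrors how nested with-statements unwind: the most recently entered context is the first to be torn down.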
Most of the contextlib functions have extensive documentation in the Python manual. ExitStack in particular is documented with many examples at https://docs.python.org/3/library/contextlib.html#contextlib.ExitStack. I recommend keeping an eye on the contextlib documentation, as it improves greatly with every Python version.