Generator expressions

Let's now talk about the other techniques to generate values one at a time.

The syntax is exactly the same as list comprehensions, only, instead of wrapping the comprehension with square brackets, you wrap it with round brackets. That is called a generator expression.

In general, generator expressions behave like equivalent list comprehensions, but there is one very important thing to remember: generators allow for one iteration only, then they will be exhausted. Let's see an example:

>>> cubes = [k**3 for k in range(10)] # regular list
>>> cubes
[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
>>> type(cubes)
<class 'list'>
>>> cubes_gen = (k**3 for k in range(10)) # create as generator
>>> cubes_gen
<generator object <genexpr> at 0x103fb5a98>
>>> type(cubes_gen)
<class 'generator'>
>>> _(cubes_gen) # this will exhaust the generator
[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]
>>> _(cubes_gen) # nothing more to give
[]

Look at the line in which the generator expression is created and assigned the name cubes_gen. You can see it's a generator object. In order to see its elements, we can use a for loop, a manual set of calls to next, or simply, feed it to a list constructor, which is what I did (remember I'm using _ as an alias).

Notice how, once the generator has been exhausted, there is no way to recover the same elements from it again. We need to recreate it if we want to use it from scratch again.
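As a minimal sketch of the other two ways to consume a generator mentioned above, the following uses manual calls to next and then a for loop. The name squares_gen and the helper variables are made up for this illustration.

```python
# squares_gen is a hypothetical name used only for this illustration.
squares_gen = (k**2 for k in range(4))

# Pull the first two values manually with next().
first = next(squares_gen)
second = next(squares_gen)

# A for loop picks up where next() left off and consumes the rest.
rest = []
for value in squares_gen:
    rest.append(value)

# The generator is now exhausted. A bare next() would raise StopIteration,
# so we pass a default value as the second argument instead.
leftover = next(squares_gen, None)

# To iterate again, recreate the generator from scratch.
squares_gen = (k**2 for k in range(4))
again = list(squares_gen)

print(first, second, rest, leftover, again)
# prints: 0 1 [4, 9] None [0, 1, 4, 9]
```

Note that the for loop does not restart the sequence: it only sees the values that next has not already consumed.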

In the next few examples, let's see how to reproduce map and filter using generator expressions:

def adder(*n):
    return sum(n)
s1 = sum(map(lambda *n: adder(*n), range(100), range(1, 101)))
s2 = sum(adder(*n) for n in zip(range(100), range(1, 101)))

In the previous example, s1 and s2 are exactly the same: they are the sum of adder(0, 1), adder(1, 2), adder(2, 3), and so on, which translates to 1 + 3 + 5 + ... + 199. The syntax is different, though, and I find the generator expression to be much more readable:

cubes = [x**3 for x in range(10)]

odd_cubes1 = filter(lambda cube: cube % 2, cubes)
odd_cubes2 = (cube for cube in cubes if cube % 2)

In the previous example, odd_cubes1 and odd_cubes2 yield the same values: they generate a sequence of odd cubes. Yet again, I prefer the generator syntax. This should be evident when things get a little more complicated:

N = 20
cubes1 = map(
    lambda n: (n, n**3),
    filter(lambda n: n % 3 == 0 or n % 5 == 0, range(N))
)
cubes2 = (
    (n, n**3) for n in range(N) if n % 3 == 0 or n % 5 == 0)

The preceding code creates two lazy iterators, cubes1 and cubes2. They produce exactly the same values: two-tuples (n, n**3) for every n that is a multiple of 3 or 5.

If you print list(cubes1), you get: [(0, 0), (3, 27), (5, 125), (6, 216), (9, 729), (10, 1000), (12, 1728), (15, 3375), (18, 5832)].

See how much better the generator expression reads? It may be debatable when things are very simple, but as soon as you start nesting functions a bit, like we did in this example, the superiority of the generator syntax is evident. It's shorter, simpler, and more elegant.
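If you want to check that the two spellings really are equivalent, here is a small sketch that materializes both and compares them. Each object can be iterated only once, so both are turned into lists first. The names via_map_filter and via_genexp are placeholders introduced for this example.

```python
N = 20

# map/filter version, materialized into a list so it can be compared.
via_map_filter = list(map(
    lambda n: (n, n**3),
    filter(lambda n: n % 3 == 0 or n % 5 == 0, range(N)),
))

# Generator expression version, also materialized.
via_genexp = list(
    (n, n**3) for n in range(N) if n % 3 == 0 or n % 5 == 0
)

print(via_map_filter == via_genexp)  # prints: True
print(via_genexp[:3])                # prints: [(0, 0), (3, 27), (5, 125)]
```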

Now, let me ask you a question: what is the difference between the following lines of code?

s1 = sum([n**2 for n in range(10**6)])
s2 = sum((n**2 for n in range(10**6)))
s3 = sum(n**2 for n in range(10**6))

Strictly speaking, they all produce the same sum. The expressions to get s2 and s3 are exactly the same because the extra parentheses in s2 are redundant. They are both generator expressions inside the sum function. The expression to get s1 is different though. Inside sum, we find a list comprehension. This means that in order to calculate s1, the sum function has to iterate over a list of a million elements.

Do you see where we're losing time and memory? Before sum can start calling next on that list, the list needs to have been created, which is a waste of time and space. It's much better for sum to call next on a simple generator expression. There is no need to have all the numbers from range(10**6) stored in a list.
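One rough way to see the space difference is to compare the size of the two objects with sys.getsizeof. This is only a sketch: the names squares_list and squares_gen are invented here, and N is smaller than the 10**6 used above so that it runs quickly.

```python
import sys

N = 10**5  # assumption: a smaller N than in the text, to keep this quick

squares_list = [n**2 for n in range(N)]
squares_gen = (n**2 for n in range(N))

# sys.getsizeof reports the size of the container object itself.
# For the list, the integers it refers to are not included, so the
# real footprint of the list is even larger than this number.
list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)

print(list_size > gen_size)  # prints: True

# Both still produce the same total.
print(sum(squares_list) == sum(squares_gen))  # prints: True
```

The generator object stays the same small size no matter how large N gets, because it only remembers where it is in the iteration.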

So, watch out for extra parentheses when you write your expressions: sometimes it's easy to skip over these details, which makes our code very different. If you don't believe me, check out the following code:

s = sum([n**2 for n in range(10**8)]) # this is killed
# s = sum(n**2 for n in range(10**8)) # this succeeds
print(s) # prints: 333333328333333350000000

Try running the preceding example. If I run the first line on my old Linux box with 8 GB RAM, this is what I get:

$ python

On the other hand, if I comment out the first line, and uncomment the second one, this is the result:

$ python

Sweet generator expressions. The difference between the two lines is that in the first one, a list with the squares of the first hundred million numbers must be made before being able to sum them up. That list is huge, and we ran out of memory (at least, my box did; if yours doesn't, try a bigger number), therefore the operating system kills the process for us. Sad face.

But when we remove the square brackets, we don't have a list any more. The sum function receives 0, 1, 4, 9, and so on until the last one, and sums them up. No problems, happy face.
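If you would rather not push your machine until the process gets killed, a gentler way to observe the same effect is to measure peak memory with the standard library's tracemalloc module. The following is a sketch under stated assumptions: peak_bytes is a helper written for this illustration (it is not part of any library), and N is deliberately far smaller than 10**8.

```python
import tracemalloc

N = 10**5  # assumption: small enough to run comfortably on any machine


def peak_bytes(func):
    """Call func() and return (result, peak traced memory in bytes)."""
    tracemalloc.start()
    try:
        result = func()
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


# With square brackets: the whole list exists before sum sees any value.
list_total, list_peak = peak_bytes(lambda: sum([n**2 for n in range(N)]))

# Without them: sum receives the values one at a time.
gen_total, gen_peak = peak_bytes(lambda: sum(n**2 for n in range(N)))

print(list_total == gen_total)  # prints: True
print(list_peak > gen_peak)     # prints: True
```

The exact byte counts depend on the Python version and platform, which is why the sketch compares the two peaks rather than printing them as fixed numbers.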

