Many programmers with experience in another language are taken aback, when they start learning Python, by the difference in for loop notation. That is to say, instead of writing:
# Other languages
for (i = 0; i < N; i++) {
    do_work(i);
}
they are instead introduced to a new function called range (or, in Python 2, xrange):
# Python
for i in range(N):
    do_work(i)
It seems that in the Python code sample we are calling a function, range, which creates all of the data the for loop needs in order to run. Intuitively, this can be quite a time-consuming process: if we are trying to loop over the numbers 1 through 100,000,000, then we need to spend a lot of time creating that array! This is where generators come into play: they allow us to lazily evaluate these sorts of functions, so we keep the code readability of these special-purpose functions without paying for them in performance.
To understand this concept, let us implement a function that calculates several Fibonacci numbers, both by filling a list and by using a generator:
def fibonacci_list(num_items):
    numbers = []
    a, b = 0, 1
    while len(numbers) < num_items:
        numbers.append(a)
        a, b = b, a + b
    return numbers

def fibonacci_gen(num_items):
    a, b = 0, 1
    while num_items:
        yield a
        a, b = b, a + b
        num_items -= 1
# In Python 2, range builds the full list of 10,000 numbers before looping,
# whereas xrange produces them lazily (Python 3's range behaves like xrange):
for i in range(10000):
    pass

for i in xrange(10000):
    pass
The fibonacci_gen function will yield many values instead of returning a single one. This turns a regular-looking function into a generator that can be repeatedly polled for the next available value.
The first thing to note is that the fibonacci_list implementation must create and store the list of all the relevant Fibonacci numbers. So, if we want to have 10,000 numbers of the sequence, the function will do 10,000 appends to the numbers list (which, as we discussed in Chapter 3, has overhead associated with it) and then return it.
On the other hand, the generator is able to "return" many values. Every time the code gets to the yield, the function emits its value, and when another value is requested the function resumes running (maintaining its previous state) and emits the new value. When the function reaches its end, a StopIteration exception is raised, indicating that the generator has no more values. As a result, even though both functions must, in the end, do the same number of calculations, the fibonacci_list version of the preceding loop uses 10,000x more memory (or num_items times more memory).
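To make the yield/StopIteration behavior concrete, here is a small interactive sketch using the fibonacci_gen function defined above (values shown in comments):

gen = fibonacci_gen(3)
print(next(gen))  # 0
print(next(gen))  # 1
print(next(gen))  # 1
next(gen)         # raises StopIteration: the generator is exhausted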
With this code in mind, we can decompose the for loops that use our implementations of fibonacci_list and fibonacci_gen. In Python, for loops require that the object we are looping over supports iteration. This means that we must be able to create an iterator out of the object we want to loop over. To create an iterator from almost any object, we can simply use Python's built-in iter function. For lists, tuples, dictionaries, and sets, this function returns an iterator over the items or keys in the object. For more complex objects, iter returns the result of calling the object's __iter__ method. Since fibonacci_gen already returns an iterator, calling iter on it is a trivial operation, and it simply returns the original object (so type(fibonacci_gen(10)) == type(iter(fibonacci_gen(10)))). However, since fibonacci_list returns a list, we must create a new object, a list iterator, that will iterate over all values in the list. In general, once an iterator is created, we simply call the next() function with it, retrieving new values until a StopIteration exception is raised. This gives us a good deconstructed view of for loops, as illustrated in Example 5-1.
# The Python loop
for i in object:
    do_work(i)

# Is equivalent to
object_iterator = iter(object)
while True:
    try:
        i = next(object_iterator)
        do_work(i)
    except StopIteration:
        break
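To make the difference concrete, here is a quick check (assuming the fibonacci_list and fibonacci_gen definitions above):

fib_gen = fibonacci_gen(10)
print(iter(fib_gen) is fib_gen)    # True: a generator is already its own iterator

fib_list = fibonacci_list(10)
print(iter(fib_list) is fib_list)  # False: a separate list_iterator is created
print(type(iter(fib_list)))        # <class 'list_iterator'>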
The for loop code shows that we do extra work calling iter when using fibonacci_list instead of fibonacci_gen. When using fibonacci_gen, we create a generator that is trivially transformed into an iterator (since it is already an iterator!); however, for fibonacci_list we need to allocate a new list and precompute its values, and then we still must create an iterator.
More importantly, precomputing the fibonacci_list list requires allocating enough space for the full dataset and setting each element to the correct value, even though we only ever require one value at a time. This makes the list allocation largely wasted effort. In fact, it may even make the loop impossible to run, because it may be trying to allocate more memory than is available (fibonacci_list(100_000_000) would create a list 3.1 GB large!). By timing the results, we can see this very explicitly:
def test_fibonacci_list():
    """
    >>> %timeit test_fibonacci_list()
    332 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    >>> %memit test_fibonacci_list()
    peak memory: 492.82 MiB, increment: 441.75 MiB
    """
    for i in fibonacci_list(100000):
        pass

def test_fibonacci_gen():
    """
    >>> %timeit test_fibonacci_gen()
    126 ms ± 905 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    >>> %memit test_fibonacci_gen()
    peak memory: 51.13 MiB, increment: 0.00 MiB
    """
    for i in fibonacci_gen(100000):
        pass
As we can see, the generator version is over twice as fast and requires no measurable memory, compared to fibonacci_list's 441 MB. It may seem at this point that you should simply use generators everywhere in place of creating lists, but that introduces many complications.
What if, for example, you needed to reference the list of Fibonacci numbers multiple times? In this case, fibonacci_list would provide a precomputed list of these numbers, while fibonacci_gen would have to recompute them over and over again. In general, changing to generators instead of precomputed arrays requires algorithmic changes that are sometimes not so easy to understand.1
An important choice that must be made when architecting your code is whether you are going to optimize for CPU speed or for memory efficiency. In some cases, using extra memory so that you have values precalculated and ready for future reference will save in overall speed. Other times, memory may be so constrained that the only solution is to recalculate values as opposed to saving them in memory. Every problem has its own considerations for this CPU/memory trade-off.
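As one small illustration of trading memory for speed (a sketch, not part of the examples above), Python's functools.lru_cache stores previously computed results so they never need to be recalculated:

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Every computed result is kept in memory, collapsing the exponential
    # recursion into linear time at the cost of a growing cache.
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)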
One simple example of this that is often seen in source code is using a generator to create a sequence of numbers, only to use a list comprehension to calculate the length of the result:
divisible_by_three = len([n for n in fibonacci_gen(100000) if n % 3 == 0])
While we are still using fibonacci_gen to produce the Fibonacci sequence as a generator, we are then saving all values divisible by 3 into an array, only to take the length of that array and then throw away the data. In the process, we consume 105 MB of memory for no reason. In fact, if we were doing this for a long enough Fibonacci sequence, the preceding code wouldn't be able to run because of memory issues, even though the calculation itself is quite simple!
Recall that we can create a list comprehension using a statement of the form [<value> for <item> in <sequence> if <condition>]. This will create a list of all the <value> items. Alternatively, we can use similar syntax to create a generator of the <value> items instead of a list by writing (<value> for <item> in <sequence> if <condition>).
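A minimal illustration of the two forms (the variable names here are just for the example):

squares_list = [n * n for n in range(5)]  # list: [0, 1, 4, 9, 16], fully materialized
squares_gen = (n * n for n in range(5))   # generator: values produced only on demand
print(type(squares_list))                 # <class 'list'>
print(type(squares_gen))                  # <class 'generator'>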
Using this subtle difference between a list comprehension and a generator comprehension, we can optimize the preceding code for divisible_by_three. However, generators do not have a length (we cannot call len() on them), so we will have to be a bit clever:
divisible_by_three = sum(1 for n in fibonacci_gen(100000) if n % 3 == 0)
Here, we have a generator that emits a value of 1
whenever it encounters a
number divisible by 3, and nothing otherwise. By summing all elements in this
generator we are essentially doing the same as the list comprehension version
and consuming no significant memory.
Many of Python's built-in functions that operate on sequences are generators themselves (albeit sometimes a special type of generator). For example, range returns a generator of values as opposed to the actual list of numbers within the specified range. Similarly, map, zip, filter, reversed, and enumerate all perform the calculation as needed and don't store the full result. This means that the operation zip(range(100000), range(100000)) will only ever have two numbers in memory in order to return its corresponding values, instead of precalculating the result for the entire range beforehand.
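A quick sketch of this laziness (values shown in comments):

pairs = zip(range(100000), range(100000))
print(next(pairs))  # (0, 0): only the first pair has been produced so far
print(next(pairs))  # (1, 1): each pair is generated on request, never the full list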
The performance of the two versions of this code is almost equivalent for these smaller sequence lengths, but the memory impact of the generator version is far less than that of the list comprehension. Furthermore, we can transform the list version into a generator because all that matters for each element of the list is its current value: either the number is divisible by 3 or it is not; it doesn't matter where it sits in the list of numbers or what the previous/next values are. More complex functions can also be transformed into generators, but depending on their reliance on state, this can become a difficult thing to do.
Instead of calculating a known number of Fibonacci numbers, what if we instead attempted to calculate all of them?
def fibonacci():
    i, j = 0, 1
    while True:
        yield j
        i, j = j, i + j
In this code we are doing something that wouldn't be possible with the previous fibonacci_list code: we are encapsulating an infinite series of numbers into a function. This allows us to take as many values as we'd like from this stream and terminate when our code decides it has had enough.
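For instance, a small sketch of pulling a finite number of values from the infinite stream (using itertools.islice, covered in more detail below):

from itertools import islice

first_ten = list(islice(fibonacci(), 10))
print(first_ten)  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]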
One reason generators aren't used as much as they could be is that much of the logic within them can instead be written in the code that uses them. This means that generators are really a way of organizing your code and having smarter loops. For example, we could answer the question "How many Fibonacci numbers below 5,000 are odd?" in multiple ways:
def fibonacci_naive():
    i, j = 0, 1
    count = 0
    while j <= 5000:
        if j % 2:
            count += 1
        i, j = j, i + j
    return count

def fibonacci_transform():
    count = 0
    for f in fibonacci():
        if f > 5000:
            break
        if f % 2:
            count += 1
    return count

from itertools import takewhile

def fibonacci_succinct():
    first_5000 = takewhile(lambda x: x <= 5000, fibonacci())
    return sum(1 for x in first_5000 if x % 2)
All of these methods have similar runtime properties (as measured by their memory footprint and runtime performance), but the fibonacci_transform function benefits in several ways. First, it is much more verbose than fibonacci_succinct, which means it will be easy for another developer to debug and understand. The latter mainly stands as a warning for the next section, where we cover some common workflows using itertools: while the module greatly simplifies many simple actions with iterators, it can also quickly make Python code very un-Pythonic. Conversely, fibonacci_naive is doing multiple things at a time, which hides the actual calculation it is doing! While it is obvious in the generator function that we are iterating over the Fibonacci numbers, we are not overencumbered by the actual calculation. Lastly, fibonacci_transform is more generalizable. This function could be renamed num_odd_under_5000 and take a generator as an argument, and thus work over any series, as sketched below.
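A possible sketch of that generalization (the name and signature are hypothetical, following the renaming suggested above):

def num_odd_under_5000(series):
    # Works over any iterable of increasing numbers, not just the Fibonacci stream
    count = 0
    for value in series:
        if value > 5000:
            break
        if value % 2:
            count += 1
    return count

# num_odd_under_5000(fibonacci()) gives the same answer as fibonacci_transform()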
One last benefit of the fibonacci_transform and fibonacci_succinct functions is that they support the notion that in computation there are two phases: generating data and transforming data. These functions very clearly perform a transformation on data, while the fibonacci function generates it. This demarcation adds extra clarity and functionality: we can move a transformative function to work on a new set of data, or perform multiple transformations on existing data. This paradigm has always been important when creating complex programs; however, generators facilitate it cleanly by making generators responsible for creating the data and normal functions responsible for acting on the generated data.
As touched on previously, the way we get the memory benefits with a generator is by dealing only with the current values of interest. At any point in our calculation with a generator, we only have the current value and cannot reference any other items in the sequence (algorithms that perform this way are generally called single pass or online). This can sometimes make generators more difficult to use, but there are many modules and functions that can help.
The main library of interest is itertools, in the standard library. It supplies many useful functions, including the following:
islice: Allows slicing a potentially infinite generator.
chain: Chains together multiple generators.
takewhile: Adds a condition that will end a generator.
cycle: Makes a finite generator infinite by constantly repeating it.
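A few quick illustrations of these helpers, assuming the infinite fibonacci() generator from earlier (values shown in comments):

from itertools import islice, chain, takewhile, cycle

print(list(islice(fibonacci(), 5)))                    # [1, 1, 2, 3, 5]
print(list(chain([1, 2], [3, 4])))                     # [1, 2, 3, 4]
print(list(takewhile(lambda x: x < 10, fibonacci())))  # [1, 1, 2, 3, 5, 8]
repeating = cycle([1, 2, 3])                           # yields 1, 2, 3, 1, 2, 3, ...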
Let’s build up an example of using generators to analyze a large dataset. Let’s say we’ve had an analysis routine going over temporal data, one piece of data per second, for the last 20 years—that’s 631,152,000 data points! The data is stored in a file, one second per line, and we cannot load the entire dataset into memory. As a result, if we wanted to do some simple anomaly detection we’d have to use generators to save memory!
The problem will be: given a data file of the form "timestamp, value", find days whose values differ from a normal distribution. We start by writing the code that will read the file, line by line, and output each line's value as a Python object. We will also create a read_fake_data generator to produce fake data we can test our algorithms with. For this function we still take the argument filename, so as to have the same function signature as read_data; however, we will simply disregard it. These two functions, shown in Example 5-2, are indeed lazily evaluated: we only read the next line in the file, or generate new fake data, when the next() function is called.
from random import normalvariate, uniform
from itertools import count
from datetime import datetime

def read_data(filename):
    with open(filename) as fd:
        for line in fd:
            data = line.strip().split(',')
            timestamp, value = map(int, data)
            yield datetime.fromtimestamp(timestamp), value

def read_fake_data(filename):
    for timestamp in count():
        # We insert an anomalous data point approximately once a week
        if uniform(0, 1) > 1.0 / (7 * 60 * 60 * 24):
            value = normalvariate(0, 1)
        else:
            value = 100
        yield datetime.fromtimestamp(timestamp), value
Now, we’d like to create a function that outputs groups of data that occur in
the same day. For this, we can use the groupby
function in itertools
(Example 5-3). This function works by taking in a sequence of items
and a key used to group these items. The output is a generator that produces
tuples whose items are the key for the group and a generator for the items in
the group. As our key function, we will output the calendar day that the data
was recorded. This “key” function could be anything—we could group our data by
hour, by year, or by some property in the actual value. The only limitation is
that groups will only be formed for data that is sequential. So, if we had the input A A A A B B A A and had groupby group by the letter, we would get three groups: (A, [A, A, A, A]), (B, [B, B]), and (A, [A, A]).
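As a tiny sketch of that grouping behavior:

from itertools import groupby

letters = ["A", "A", "A", "A", "B", "B", "A", "A"]
groups = [(key, list(items)) for key, items in groupby(letters)]
print(groups)  # [('A', ['A', 'A', 'A', 'A']), ('B', ['B', 'B']), ('A', ['A', 'A'])]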
from itertools import groupby

def groupby_day(iterable):
    key = lambda row: row[0].day
    for day, data_group in groupby(iterable, key):
        yield list(data_group)
Now to do the actual anomaly detection. We will do this
by creating a function that, given one group of data, returns whether or not it
follows the normal distribution (using scipy.stats.normaltest
). We can use
this check with itertools.filterfalse
to filter down the full dataset to only
inputs which don’t pass the test. These inputs are what we consider to be
anomalous.
In Example 5-3 we cast data_group into a list, even though it is provided to us as an iterator. This is because the normaltest function requires an array-like object. We could, however, write our own normaltest function that is "one-pass" and could operate on a single view of the data. This could be done without too much trouble by using Knuth's online averaging algorithm to calculate the skew and kurtosis of the numbers. This would save us even more memory by only ever storing a single value of the dataset in memory at once, instead of storing a full day at a time. However, performance regressions and development time should be taken into consideration: is storing one day of data in memory at a time sufficient for this problem, or does it need to be further optimized?
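As a rough illustration of the "one-pass" idea (only a sketch: it computes a running mean and variance in the Knuth/Welford style, and would need to be extended to higher moments for skew and kurtosis):

def online_mean_variance(stream):
    # Welford's online algorithm: a single pass with O(1) memory
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance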
from scipy.stats import normaltest
from itertools import filterfalse

def is_normal(data, threshold=1e-3):
    _, values = zip(*data)
    k2, p_value = normaltest(values)
    if p_value < threshold:
        return False
    return True

def filter_anomalous_groups(data):
    yield from filterfalse(is_normal, data)
Finally, we can put together the chain of generators to get the days that had anomalous data (Example 5-5).
from itertools import islice

def filter_anomalous_data(data):
    data_group = groupby_day(data)
    yield from filter_anomalous_groups(data_group)

data = read_data(filename)
anomaly_generator = filter_anomalous_data(data)
first_five_anomalies = islice(anomaly_generator, 5)

for data_anomaly in first_five_anomalies:
    start_date = data_anomaly[0][0]
    end_date = data_anomaly[-1][0]
    print(f"Anomaly from {start_date} - {end_date}")
This method very simply allows us to get the list of days that are anomalous without having to load the entire dataset. Only enough data is read to generate the first five anomalies. Furthermore, the anomaly_generator object can be read further to continue retrieving anomalous data. This is called lazy evaluation: only the calculations that are explicitly requested are performed, which can drastically reduce overall runtime if there is an early termination condition.
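For example, one way to keep pulling anomalies from the same generator afterward (a small sketch; next accepts a default so it won't raise StopIteration):

# The generator picks up where the earlier loop left off; only as much of the
# file as is needed to find the next anomaly gets read.
sixth_anomaly = next(anomaly_generator, None)  # None if no further anomalies exist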
Another nicety of organizing the analysis this way is that it allows us to do more expansive calculations easily, without having to rework large parts of the code. For example, if we want to have a moving window of one day instead of chunking up by days, we can simply replace the groupby_day in Example 5-5 with something like:
from datetime import datetime

def groupby_window(data, window_size=3600):
    window = tuple(islice(data, window_size))
    for item in data:
        yield window
        window = window[1:] + (item,)
In this version we also see very explicitly the memory guarantee of this and the previous method: they store only one window's worth of data as state (a full day of samples for groupby_day, and window_size data points, 3,600 by default, for groupby_window). Note that the first item retrieved by the for loop is the window_size-th value. This is because data is an iterator, and in the previous line we consumed the first window_size values.
A final note: In the groupby_window
function, we are constantly creating new
tuples, filling them with data and yielding them to the caller. We can greatly
optimize this by using the deque
object in the collections
module. This
object gives us O(1)
appends and removals to and from the beginning or end of
a list (while normal lists are O(1)
for appends or removals to/from the end of
the list and O(n)
for the same operations at the beginning of the list).
Using the deque
object, we can append
the new data to the right (or end)
of the list and use deque.popleft()
to delete data from the left (or
beginning) of the list without having to allocate more space or perform long
O(n)
operations. However, we would have to work on the deque
object in-place
and destroy previous views to the rolling window (see [Link to Come] for
more about in-place operations). The only way around this would be to copy the
data into a tuple before yielding it back to the caller, which gets rid of any
benefit of the change!
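A minimal sketch of that in-place deque variant (the helper name is hypothetical; note the caveat above about callers holding onto old views):

from collections import deque
from itertools import islice

def groupby_window_deque(data, window_size=3600):
    # maxlen makes append() drop the oldest element in O(1), but every yield
    # hands out the *same* deque object, so callers must not keep references
    # to previously yielded windows.
    window = deque(islice(data, window_size), maxlen=window_size)
    for item in data:
        yield window
        window.append(item)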
By formulating our anomaly-finding algorithm with iterators, we can
process much more data than could fit into memory. What’s more, we can
do it faster than if we had used lists, since we avoid all the costly append
operations.
Since iterators are a primitive type in Python, this should always be a go-to method for trying to reduce the memory footprint of an application. The benefits are that results are lazily evaluated, so you only ever process the data you need, and memory is saved since we don’t store previous results unless explicitly required to. In [Link to Come], we will talk about other methods that can be used for more specific problems and introduce some new ways of looking at problems when RAM is an issue.
Another benefit of solving problems using iterators is that it prepares your
code to be used on multiple CPUs or multiple computers, as we will see in
Chapters [Link to Come] and [Link to Come]. As we discussed in
“Iterators for Infinite Series”, when working with iterators you must always think about the
various states that are necessary for your algorithm to work. Once you figure
out how to package the state necessary for the algorithm to run, it doesn’t
matter where it runs. We can see this sort of paradigm, for example, with the
multiprocessing
and ipython
modules, both of which use a map
-like
function to launch parallel tasks.
1 In general, algorithms that are online or single pass are a great fit for generators. When making the change yourself, care should be taken in terms of how many times you'll need to reference the data being calculated.