Chapter 5. Iterators and Generators

When many people with experience in another language start learning Python, they are taken aback by the difference in for loop notation. That is to say, instead of writing:

# Other languages
for (i=0; i<N; i++) {
    do_work(i);
}

they are instead introduced to a new function called range:

# Python
for i in range(N):
    do_work(i)

In the Python code sample, it seems that we are calling a function, range, which creates all of the data the for loop needs before it can run. Intuitively, this can be quite a time-consuming process: if we are trying to loop over the numbers 1 through 100,000,000, then we would need to spend a lot of time creating that array! However, this is where generators come into play: they allow us to lazily evaluate these sorts of functions, so we can have the code readability of these special-purpose functions without the performance impacts.

In order to understand this concept, let us calculate several Fibonacci numbers in two ways: by filling a list and by using a generator.

def fibonacci_list(num_items):
    numbers = []
    a, b = 0, 1
    while len(numbers) < num_items:
        numbers.append(a)
        a, b = b, a+b
    return numbers

def fibonacci_gen(num_items):
    a, b = 0, 1
    while num_items:
        yield a
        a, b = b, a+b
        num_items -= 1

for i in fibonacci_list(10000):
    pass

for i in fibonacci_gen(10000):
    pass

The yield statement is what makes fibonacci_gen yield many values instead of returning one value. It turns this regular-looking function into a generator that can be repeatedly polled for the next available value.

The first thing to note is that the fibonacci_list implementation must create and store the list of all the relevant fibonacci numbers. So, if we want to have 10,000 numbers of the sequence, the function will do 10,000 appends to the numbers list (which, as we discussed in Chapter 3, has overhead associated with it), and then return it.

On the other hand, the generator is able to “return” many values. Every time the code gets to the yield, the function emits its value, and when another value is requested the function resumes running (maintaining its previous state) and emits the new value. When the function reaches its end, a StopIteration exception is thrown, indicating that the given generator has no more values. As a result, even though both functions must, in the end, do the same number of calculations, the fibonacci_list version of the preceding loop uses 10,000x more memory (or num_items times more memory).
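To make this polling behavior concrete, here is a minimal interactive sketch (ours, not from the text) of driving fibonacci_gen by hand with next():

gen = fibonacci_gen(3)
print(next(gen))  # 0
print(next(gen))  # 1
print(next(gen))  # 1
next(gen)         # raises StopIteration: the generator is exhausted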

With this code in mind, we can decompose the for loops that use our implementations of fibonacci_list and fibonacci_gen. In Python, for loops require that the object we are looping over supports iteration. This means that we must be able to create an iterator out of the object we want to loop over. To create an iterator from almost any object, we can simply use Python’s built-in iter function. This function, for lists, tuples, dictionaries, and sets, returns an iterator over the items or keys in the object. For more complex objects, iter returns the result of calling the object’s __iter__ special method. Since fibonacci_gen already returns an iterator, calling iter on it is a trivial operation, and it simply returns the original object (so type(fibonacci_gen(10)) == type(iter(fibonacci_gen(10)))). However, since fibonacci_list returns a list, we must create a new object, a list iterator, that will iterate over all values in the list. In general, once an iterator is created, we simply call the next() function with it, retrieving new values until a StopIteration exception is thrown. This gives us a good deconstructed view of for loops, as illustrated in Example 5-1.

Example 5-1. Python for loop deconstructed
# The python loop
for i in object:
    do_work(i)

# Is equivalent to
object_iterator = iter(object)
while True:
    try:
        i = next(object_iterator)
        do_work(i)
    except StopIteration:
        break

The for loop code shows that we are doing extra work calling iter when using fibonacci_list instead of fibonacci_gen. When using fibonacci_gen, we create a generator that is trivially transformed into an iterator (since it is already an iterator!); however, for fibonacci_list we need to allocate a new list and precompute its values, and then we still must create an iterator.
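As a quick check of this difference, here is a small sketch (our own illustration) of what iter returns for each implementation:

gen = fibonacci_gen(10)
print(iter(gen) is gen)              # True: a generator is already its own iterator
numbers = fibonacci_list(10)
print(iter(numbers) is numbers)      # False: a separate iterator object is created
print(type(iter(numbers)).__name__)  # list_iterator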

More importantly, pre-computing the fibonacci_list list requires allocating enough space for the full dataset and setting each element to the correct value, even though we only ever require one value at a time. This makes the list allocation largely wasted effort. It may even make the loop impossible to run, because it may require more memory than is available (fibonacci_list(100_000_000) would create a list roughly 3.1 GB in size!). By timing the results, we can see this very explicitly:

def test_fibonacci_list():
    """
    >>> %timeit test_fibonacci_list()
    332 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    >>> %memit test_fibonacci_list()
    peak memory: 492.82 MiB, increment: 441.75 MiB
    """
    for i in fibonacci_list(100000):
        pass

def test_fibonacci_gen():
    """
    >>> %timeit test_fibonacci_gen()
    126 ms ± 905 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

    >>> %memit test_fibonacci_gen()
    peak memory: 51.13 MiB, increment: 0.00 MiB
    """
    for i in fibonacci_gen(100000):
        pass

As we can see, the generator version is over twice as fast and requires no measurable memory, compared to fibonacci_list’s 441 MB. It may seem at this point that you should simply use generators everywhere in place of creating lists, but there are many complications.

What if, for example, you needed to reference the list of Fibonacci numbers multiple times? In that case, fibonacci_list would provide a precomputed list of these numbers, while fibonacci_gen would have to recompute them over and over again. In general, switching to generators instead of precomputed lists requires algorithmic changes that are sometimes not so easy to understand. 1
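A short sketch (our own illustration) of why the generator can’t simply be reused: a generator is exhausted after a single pass, while a list can be traversed as many times as needed.

gen = fibonacci_gen(5)
print(sum(gen))  # 7 (0 + 1 + 1 + 2 + 3)
print(sum(gen))  # 0: the generator is already exhausted

numbers = fibonacci_list(5)
print(sum(numbers))  # 7
print(sum(numbers))  # 7: the list can be reused without recomputation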

Note

An important choice must be made when architecting your code: whether to optimize for CPU speed or for memory efficiency. In some cases, using extra memory so that you have values precalculated and ready for future reference will save overall time. Other times, memory may be so constrained that the only solution is to recalculate values as opposed to saving them in memory. Every problem has its own considerations for this CPU/memory tradeoff.

One simple example of this, often seen in source code, is using a generator to create a sequence of numbers, only to use a list comprehension to calculate the length of the result:

divisible_by_three = len([n for n in fibonacci_gen(100000) if n % 3 == 0])

While we are still using fibonacci_gen to generate the Fibonacci sequence lazily, we are then saving all values divisible by 3 into a list, only to take the length of that list and then throw away the data. In the process, we consume 105 MB of memory for no reason. In fact, for a long enough Fibonacci sequence, the preceding code wouldn’t be able to run at all because of memory issues, even though the calculation itself is quite simple!

Recall that we can create a list comprehension using a statement of the form [<value> for <item> in <sequence> if <condition>]. This will create a list of all the <value> items. Alternatively, we can use similar syntax to create a generator of the <value> items instead of a list by doing (<value> for <item> in <sequence> if <condition>).

Using this subtle difference between list comprehension and generator comprehension, we can optimize the preceding code for divisible_by_three. However, generators do not have a length property. As a result, we will have to be a bit clever:

divisible_by_three = sum((1 for n in fibonacci_gen(100000) if n % 3 == 0))

Here, we have a generator that emits a value of 1 whenever it encounters a number divisible by 3, and nothing otherwise. By summing all elements in this generator we are essentially doing the same as the list comprehension version and consuming no significant memory.

Note

Many of python’s builtin functions that operate on sequences are generators themselves (albeit sometimes a special type of generator). For example, range returns a generator of values as opposed to the actual list of numbers within the specified range. Similarly, map, zip, filter, reversed and enumerate all perform the calculation as needed and don’t store the full result. This means that the operation zip(range(100000), range(100000)) will only ever have two numbers in memory in order to return it’s corresponding values, instead of pre-calculating the result for the entire range beforehand.

The performance of the two versions of this code is almost equivalent for these smaller sequence lengths, but the memory impact of the generator version is far less than that of the list comprehension. Furthermore, we can transform the list version into a generator because all that matters for each element is its current value: either the number is divisible by 3 or it is not; it doesn’t matter where it falls in the sequence or what the previous/next values are. More complex functions can also be transformed into generators, but depending on their reliance on state, this can become a difficult thing to do.

Iterators for Infinite Series

Instead of calculating a known number of Fibonacci numbers, what if we instead attempted to calculate all of them?

def fibonacci():
    i, j = 0, 1
    while True:
        yield j
        i, j = j, i + j

In this code we are doing something that wouldn’t be possible with the previous fibonacci_list code: we are encapsulating an infinite series of numbers within a function. This allows us to take as many values as we’d like from this stream and terminate only when our code decides it has had enough.

One reason why generators aren’t used as much as they could be is that a lot of the logic within them can be encapsulated in your logic code. This means that generators are really a way of organizing your code and having smarter loops. For example, we could answer the question, “How many Fibonacci numbers below 5,000 are odd?” in multiple ways:

def fibonacci_naive():
    i, j = 0, 1
    count = 0
    while j <= 5000:
        if j % 2:
            count += 1
        i, j = j, i + j
    return count

def fibonacci_transform():
    count = 0
    for f in fibonacci():
        if f > 5000:
            break
        if f % 2:
            count += 1
    return count

from itertools import takewhile
def fibonacci_succinct():
    first_5000 = takewhile(lambda x: x <= 5000,
                           fibonacci())
    return sum(1 for x in first_5000
               if x % 2)

All of these methods have similar runtime properties (as measured by their memory footprint and runtime performance), but the fibonacci_transform function benefits from several things. First, it is much more verbose than fibonacci_succinct, which means it will be easy for another developer to debug and understand. The latter mainly stands as a warning for the next section, where we cover some common workflows using itertools—while the module greatly simplifies many simple actions with iterators, it can also quickly make Python code very un-Pythonic. Conversely, fibonacci_naive is doing multiple things at a time, which hides the actual calculation it is doing! While it is obvious in the generator function that we are iterating over the Fibonacci numbers, we are not overencumbered by the actual calculation. Lastly, fibonacci_transform is more generalizable. This function could be renamed num_odd_under_5000 and take in the generator by argument, and thus work over any series.
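As a sketch of that last point (using the num_odd_under_5000 name suggested above; this exact function isn’t spelled out in the text), the transformation can take any series as an argument:

def num_odd_under_5000(series):
    count = 0
    for value in series:
        if value > 5000:
            break
        if value % 2:
            count += 1
    return count

print(num_odd_under_5000(fibonacci()))  # same result as fibonacci_transform()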

One last benefit of the fibonacci_transform and fibonacci_succinct functions is that they support the notion that, in computation, there are two phases: generating data and transforming data. These functions very clearly perform a transformation on data, while the fibonacci function generates it. This demarcation adds extra clarity and functionality: we can move a transformative function to work on a new set of data, or perform multiple transformations on existing data. This paradigm has always been important when creating complex programs; however, generators facilitate it explicitly by making the generator responsible for creating the data and normal functions responsible for acting on the generated data.

Lazy Generator Evaluation

As touched on previously, the way we get the memory benefits with a generator is by dealing only with the current values of interest. At any point in our calculation with a generator, we only have the current value and cannot reference any other items in the sequence (algorithms that perform this way are generally called single pass or online). This can sometimes make generators more difficult to use, but there are many modules and functions that can help.

The main library of interest is itertools, in the standard library. It supplies many useful functions, including the following (a short usage sketch appears after the list):

islice

Allows slicing a potentially infinite generator

chain

Chains together multiple generators

takewhile

Adds a condition that will end a generator

cycle

Makes a finite generator infinite by constantly repeating it
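As a brief illustration (our own sketch, reusing the infinite fibonacci generator defined earlier), here is each of these helpers in action:

from itertools import islice, chain, takewhile, cycle

print(list(islice(fibonacci(), 5)))                    # [1, 1, 2, 3, 5]
print(list(chain([0], islice(fibonacci(), 3))))        # [0, 1, 1, 2]
print(list(takewhile(lambda x: x < 10, fibonacci())))  # [1, 1, 2, 3, 5, 8]
print(list(islice(cycle([1, 2, 3]), 7)))               # [1, 2, 3, 1, 2, 3, 1]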

Let’s build up an example of using generators to analyze a large dataset. Let’s say we’ve had an analysis routine going over temporal data, one piece of data per second, for the last 20 years—that’s 631,152,000 data points! The data is stored in a file, one second per line, and we cannot load the entire dataset into memory. As a result, if we wanted to do some simple anomaly detection we’d have to use generators to save memory!

The problem will be: given a data file of the form “timestamp, value”, find days whose values do not follow a normal distribution. We start by writing the code that will read the file, line by line, and output each line’s value as a Python object. We will also create a read_fake_data generator to generate fake data with which we can test our algorithms. For this function we still take the argument filename, so as to have the same function signature as read_data; however, we will simply disregard it. These two functions, shown in Example 5-2, are indeed lazily evaluated: we only read the next line in the file, or generate new fake data, when the next() function is called.

Example 5-2. Lazily reading data
from random import normalvariate, uniform
from itertools import count
from datetime import datetime

def read_data(filename):
    with open(filename) as fd:
        for line in fd:
            data = line.strip().split(',')
            timestamp, value = map(int, data)
            yield datetime.fromtimestamp(timestamp), value

def read_fake_data(filename):
    for timestamp in count():
        #  We insert an anomalous data-point approximately once a week
        if uniform(0, 1) > 1.0 / (7 * 60 * 60 * 24):
            value = normalvariate(0, 1)
        else:
            value = 100
        yield datetime.fromtimestamp(timestamp), value

Now, we’d like to create a function that outputs groups of data that occur in the same day. For this, we can use the groupby function in itertools (Example 5-3). This function works by taking in a sequence of items and a key used to group these items. The output is a generator that produces tuples whose items are the key for the group and a generator for the items in the group. As our key function, we will output the calendar day that the data was recorded. This “key” function could be anything—we could group our data by hour, by year, or by some property in the actual value. The only limitation is that groups will only be formed for data that is sequential. So, if we had the input A A A A B B A A and had groupby group by the letter, we would get three groups, (A, [A, A, A, A]), (B, [B, B]), and (A, [A, A]).

Example 5-3. Grouping our data
from itertools import groupby

def groupby_day(iterable):
    key = lambda row: row[0].day
    for day, data_group in groupby(iterable, key):
        yield list(data_group)
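To make the sequential-grouping caveat concrete, here is a quick sketch (ours) of the A A A A B B A A example from above:

from itertools import groupby

letters = ["A", "A", "A", "A", "B", "B", "A", "A"]
for key, group in groupby(letters):
    print(key, list(group))
# A ['A', 'A', 'A', 'A']
# B ['B', 'B']
# A ['A', 'A']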

Now to do the actual anomaly detection. We will do this by creating a function that, given one group of data, returns whether or not it follows the normal distribution (using scipy.stats.normaltest). We can use this check with itertools.filterfalse to filter down the full dataset to only inputs which don’t pass the test. These inputs are what we consider to be anomalous.

Note

In Example 5-3 we cast data_group into a list, even though it is provided to us as an iterator. This is because the normaltest function requires an array-like object. We could, however, write our own normaltest function that is “one-pass” and could operate on a single view of the data. This could be done without too much trouble by using Knuth’s online averaging algorithm to calculate the skew and kurtosis of the numbers. This would save us even more memory by only ever storing a single value of the dataset in memory at once, instead of storing a full day at a time. However, performance regressions and development time should be taken into consideration: is storing one day of data in memory at a time sufficient for this problem, or does it need to be further optimized?
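A minimal sketch of the kind of one-pass statistics this note alludes to, using Welford’s online algorithm for the mean and variance (the OnlineStats name is ours; extending it to third and fourth moments, not shown here, would give the skew and kurtosis a normality test needs):

class OnlineStats:
    """Accumulate mean and variance in a single pass over the data."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        self.n += 1
        delta = value - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0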

Example 5-4. Generator-based anomaly detection
from scipy.stats import normaltest
from itertools import filterfalse

def is_normal(data, threshold=1e-3):
    _, values = zip(*data)
    k2, p_value = normaltest(values)
    if p_value < threshold:
        return False
    return True

def filter_anomalous_groups(data):
    yield from filterfalse(is_normal, data)

Finally, we can put together the chain of generators to get the days that had anomalous data (Example 5-5).

Example 5-5. Chaining together our generators
from itertools import islice


def filter_anomalous_data(data):
    data_group = groupby_day(data)
    yield from filter_anomalous_groups(data_group)

data = read_data(filename)
anomaly_generator = filter_anomalous_data(data)
first_five_anomalies = islice(anomaly_generator, 5)

for data_anomaly in first_five_anomalies:
    start_date = data_anomaly[0][0]
    end_date = data_anomaly[-1][0]
    print(f"Anomaly from {start_date} - {end_date}")

This method very simply allows us to get the list of days that are anomalous without having to load the entire dataset. Only enough data is read to generate the first five anomalies. Furthermore, the anomaly_generator object can be read further to continue retrieving anomalous data. This is called lazy evaluation: only the calculations that are explicitly requested are performed, which can drastically reduce overall runtime if there is an early termination condition.
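For example, assuming the dataset contains further anomalous days, the same generator can be polled again and iteration resumes exactly where the first islice left off:

next_five_anomalies = islice(anomaly_generator, 5)
for data_anomaly in next_five_anomalies:
    print(f"Anomaly from {data_anomaly[0][0]} - {data_anomaly[-1][0]}")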

Another nicety about organizing analysis this way is that it allows us to do more expansive calculations easily, without having to rework large parts of the code. For example, if we want a moving window of data points instead of chunking up by days, we can simply replace groupby_day in Example 5-5 with something like this:

from itertools import islice

def groupby_window(data, window_size=3600):
    # Prime the window with the first window_size data points
    window = tuple(islice(data, window_size))
    for item in data:
        yield window
        # Slide forward by one: drop the oldest value, append the newest
        window = window[1:] + (item,)

In this version we also see very explicitly the memory guarantee of this and the previous method: each stores only one window’s worth of data as state (a full day, or 86,400 points, for groupby_day; window_size points, 3,600 by default, for groupby_window). Note that the first item retrieved by the for loop is the window_size-th value (counting from zero). This is because data is an iterator, and in the previous line we consumed the first window_size values.

A final note: In the groupby_window function, we are constantly creating new tuples, filling them with data and yielding them to the caller. We can greatly optimize this by using the deque object in the collections module. This object gives us O(1) appends and removals to and from the beginning or end of a list (while normal lists are O(1) for appends or removals to/from the end of the list and O(n) for the same operations at the beginning of the list). Using the deque object, we can append the new data to the right (or end) of the list and use deque.popleft() to delete data from the left (or beginning) of the list without having to allocate more space or perform long O(n) operations. However, we would have to work on the deque object in-place and destroy previous views to the rolling window (see [Link to Come] for more about in-place operations). The only way around this would be to copy the data into a tuple before yielding it back to the caller, which gets rid of any benefit of the change!
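For reference, here is a sketch (ours) of the deque-based variant described above; the tuple() copy before each yield is precisely the step that gives the caller a stable snapshot but cancels the benefit of the O(1) append/popleft operations:

from collections import deque
from itertools import islice

def groupby_window_deque(data, window_size=3600):
    # With maxlen set, appending automatically discards the oldest item
    # (the equivalent of an explicit popleft).
    window = deque(islice(data, window_size), maxlen=window_size)
    for item in data:
        yield tuple(window)  # copy so the caller gets an independent view
        window.append(item)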

Wrap-Up

By formulating our anomaly-finding algorithm with iterators, we can process much more data than could fit into memory. What’s more, we can do it faster than if we had used lists, since we avoid all the costly append operations.

Since iterators are a primitive type in Python, this should always be a go-to method for trying to reduce the memory footprint of an application. The benefits are that results are lazily evaluated, so you only ever process the data you need, and memory is saved since we don’t store previous results unless explicitly required to. In [Link to Come], we will talk about other methods that can be used for more specific problems and introduce some new ways of looking at problems when RAM is an issue.

Another benefit of solving problems using iterators is that it prepares your code to be used on multiple CPUs or multiple computers, as we will see in Chapters [Link to Come] and [Link to Come]. As we discussed in “Iterators for Infinite Series”, when working with iterators you must always think about the various states that are necessary for your algorithm to work. Once you figure out how to package the state necessary for the algorithm to run, it doesn’t matter where it runs. We can see this sort of paradigm, for example, with the multiprocessing and ipython modules, both of which use a map-like function to launch parallel tasks.

1 In general, algorithms that are online or single pass are a great fit for generators. When making the change yourself, care should be taken in terms of how many times you’ll need to reference the data being calculated.
