3
Functions, Iterators, and Generators

The core of functional programming is the use of pure functions to map values from the input domain to the output range. Avoiding side effects reduces any dependence on variable assignment to maintain the state of a computation. We can’t purge the assignment statement from the Python language, but we can reduce our dependence on stateful objects. This means choosing among the available Python built-in functions and data structures to select those that don’t require stateful operations.

This chapter will present several Python features from a functional viewpoint, as follows:

  • Pure functions, free of side effects

  • Functions as objects that can be passed as arguments or returned as results

  • The use of Python’s object-oriented suffix notation and prefix notation

  • Using tuples as a way to create immutable objects, which avoid the confusion of state changes

  • Using iterable collections as our primary design tool for functional programming

We’ll look at generators and generator expressions, since these are ways to work with collections of objects. As we noted in Chapter 2, Introducing Essential Functional Concepts, there are some boundary issues when trying to replace all generator expressions with recursions. Python imposes a recursion limit and doesn’t automatically handle Tail-Call Optimization (TCO): we must optimize recursions manually using a generator expression.

We’ll write generator expressions that will perform the following tasks:

  • Conversions

  • Restructuring

  • Complex calculations

We’ll take a quick survey of many of the built-in Python collections and how we can work with collections while pursuing a functional paradigm. This may change our approach to working with lists, dicts, and sets. Writing functional Python encourages us to focus on tuples and immutable collections. In the next chapter, we’ll emphasize more functional ways to work with specific kinds of collections.

3.1 Writing pure functions

In Chapter 2, Introducing Essential Functional Concepts, we looked at pure functions. In this section, we’ll look at a common problem with non-functional programming: a function that has a reference to a global variable. When a global variable is assigned, the global statement will be used. When a global variable is read, however, this is called a free variable, and there’s no obvious marker in the Python code.

Any references to values in the Python global namespace (using a free variable) is something we can rework into a proper parameter. In most cases, it’s quite easy. Here is an example that depends on a free variable:

global_adjustment: float 
 
def some_function(a: float, b: float, t: float) -> float: 
    return a+b*t+global_adjustment

After refactoring the function, we would need to change each reference to this function. This may have a ripple effect through a complex application. We’ll leave the refactoring as an exercise.

There are many internal Python objects that are stateful. Objects used for input and output are generally called file objects or file-like objects; these are examples of stateful objects in common use. See the io module for more information on file objects. We observe that some of the commonly used stateful objects in Python generally behave as context managers. In a few cases, stateful objects don’t completely implement the context manager interface; in these cases, there’s often a close() method. We can use the contextlib.closing() function to provide these objects with the proper context manager interface.
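As a small sketch of this technique, the LegacyReader class below is an invented stand-in for an object that offers a close() method but doesn't implement the full context manager protocol; contextlib.closing() supplies the missing behavior:

from contextlib import closing 
 
class LegacyReader: 
    """A stand-in for an object with close() but no context manager support.""" 
    def read(self) -> str: 
        return "some data" 
    def close(self) -> None: 
        print("closing") 
 
with closing(LegacyReader()) as reader: 
    print(reader.read()) 
# closing() guarantees reader.close() is called when the with block exits

The design point is that the stateful object's lifetime is confined to the with statement, even though the class itself knows nothing about context managers.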

A context manager provides a way to perform operations on entry to and exit from a block of code. The with statement uses a context manager to perform the entry operation, execute the indented block of code, and perform the exit operation. It’s important to note the exit operation is always performed, even if an exception is raised in the indented block of code. This makes for a tidy way to perform state-changing operations, making the code easier to reason about. In practice, it looks like the following example:

from pathlib import Path 
 
def write_file(some_path: Path) -> None: 
    result = "Hello, world!" 
    with some_path.open('w') as output_file: 
        output_file.write(result + "\n")

The file is only open for writing inside the with statement. This makes it easier to see where state-changing operations are performed.

We can’t easily eliminate all stateful Python objects. Therefore, we must strike a balance between managing state while still exploiting the strengths of functional design. To this end, we should always use the with statement to encapsulate stateful file objects into a well-defined scope.

Always use file objects in a with context. This defines a context in which state-changing operations will be performed.

We should always avoid global file objects, global database connections, and the associated stateful object issues. A global file object is a common pattern for handling open files or databases. We may have a function as shown in the following example:

from typing import TextIO 
ifile: TextIO 
ofile: TextIO 
 
def open_files(iname: str, oname: str) -> None: 
    """A bad idea...""" 
    global ifile, ofile 
    ifile = open(iname, "r") 
    ofile = open(oname, "w")

This function creates an easily overlooked pair of global variables. Other functions can use the ifile and ofile variables, hoping they properly refer to the global files, which are left open and will endure a difficult-to-understand series of state changes.

This is not a very functional design, and we need to avoid it. The files should be proper parameters to functions, and the open files should be nested in a with statement to ensure that their stateful behavior is handled properly. This is an important rewrite to change these variables from globals to formal parameters: it makes the file operations more visible.

The rewrite will involve locating every function using ifile or ofile as free variables. For example, we might have a function like the following:

def next_line_with(prefix: str) -> str | None: 
    """Also a bad idea...""" 
    line = ifile.readline() 
    while (line is not None and not line.startswith(prefix)): 
        line = ifile.readline() 
    return line

We’ll need to make the ifile global variable reference into a parameter to this function. This will create ripples of change to the functions that call next_line_with(). This can become an extensive rewrite to identify and localize the state changes. This may lead to rethinking the design to replace a function like next_line_with().
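One possible reworking, shown here only as a sketch, makes the file object an explicit parameter and iterates over it directly:

from typing import TextIO 
 
def next_line_with(source: TextIO, prefix: str) -> str | None: 
    """Return the next line starting with prefix, or None if there isn't one.""" 
    for line in source: 
        if line.startswith(prefix): 
            return line 
    return None

Callers then supply an open file object, typically from inside a with statement, which keeps the stateful file object in a well-defined scope.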

This context manager design pattern also applies to databases. A database connection object should generally be provided as a formal argument to an application’s functions. This is contrary to the way some popular web frameworks work: some frameworks rely on a global database connection in an effort to make the database a transparent feature of the application. This transparency obscures a dependency between a web operation and the database; it can make unit testing more complex than necessary. Additionally, a multithreaded web server may not benefit from sharing a single database connection: a connection pool is often better. This suggests that there are some benefits of a hybrid approach that uses functional design with a few isolated stateful features.

3.2 Functions as first-class objects

In Chapter 2, Introducing Essential Functional Concepts, we looked at ways in which Python functions are first-class objects. In Python, function objects have a number of attributes. The reference manual lists a number of special member names that apply to functions. Since functions are objects with attributes, we can extract the docstring or the name of a function, using special attributes such as __doc__ or __name__. We can also extract the body of the function through the __code__ attribute. In compiled languages, this introspection can be either impossible or quite complicated.

Additionally, a callable object helps us to create functions. We can consider the callable class definition as a higher-order function. We do need to be judicious in how we use the __init__() method of a callable object; we should avoid setting stateful class variables. One common application is to use an __init__() method to create objects that fit the Strategy design pattern.

A class following the Strategy design pattern depends on other objects to provide an algorithm or parts of an algorithm. This allows us to inject algorithmic details at runtime, rather than compiling the details into the class.

To focus on the overall design principles, we’ll look at a function that does a tiny computation. This computes one of the Mersenne prime numbers. See https://www.mersenne.org/primes/ for ongoing research into this topic.

Here is an example of the definition of a callable object class with an embedded Strategy object:

from collections.abc import Callable 
 
class Mersenne1: 
 
    def __init__( 
            self, 
            algorithm : Callable[[int], int] 
    ) -> None: 
        self.pow2 = algorithm 
 
    def __call__(self, arg: int) -> int: 
        return self.pow2(arg) - 1

This class uses __init__() to save a reference to another function, algorithm, as self.pow2. We’re not creating any stateful instance variables; the value of self.pow2 isn’t expected to change. It’s a common practice to use a name like _pow2 to suggest this attribute isn’t expected to be used by a client of this class. The algorithm parameter has a type hint of Callable[[int], int], which describes a function that takes an integer argument and returns an integer value.

We’ve used the Callable type hint from the collections.abc module, where it is defined. An alias is available in the typing module, but since PEP 585 was implemented, the use of typing.Callable is deprecated. We’ll use a number of generic types from the collections.abc module throughout this chapter.

The function given as a Strategy object must raise 2 to the given power. We can plug in any function that performs this computation. Three candidate objects that we can plug into this class are as follows:

def shifty(b: int) -> int: 
    return 1 << b 
 
def multy(b: int) -> int: 
    if b == 0: return 1 
    return 2 * multy(b - 1) 
 
def faster(b: int) -> int: 
    if b == 0: return 1 
    if b % 2 == 1: return 2 * faster(b-1) 
    t = faster(b // 2) 
    return t * t

The shifty() function raises 2 to the desired power using a left shift of the bits. The multy() function uses a naive recursive multiplication. The faster() function uses a divide and conquer strategy that will perform log₂(b) multiplications instead of b multiplications.

All three of these functions have identical function signatures. Each of them can be summarized as Callable[[int], int], which matches the parameter, algorithm, of the Mersenne1.__init__() method.
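A quick doctest-style check confirms the three implementations agree on a small value:

>>> shifty(17) == multy(17) == faster(17) == 2 ** 17 
True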

We can create instances of our Mersenne1 class with an embedded Strategy algorithm, as follows:

m1s = Mersenne1(shifty) 
 
m1m = Mersenne1(multy) 
 
m1f = Mersenne1(faster)

Each of the resulting functions, m1s(), m1m(), and m1f(), is built from another function. The functions shifty(), multy(), and faster() are incorporated into resulting functions. This shows how we can define alternative functions that produce the same result but use different algorithms.

The callable objects created by this class behave as ordinary Python functions, as shown in the following example:

>>> m1s(17) 
131071 
>>> m1f(89) 
618970019642690137449562111

Python allows us to compute M_89 = 2^89 - 1, since this doesn’t even come close to the recursion limits in Python. This is quite a large prime number, as it has 27 digits. In order to exceed the limits of the multy() function, we’d have to ask for the value of M_1,279 = 2^1,279 - 1, a number with 386 digits.

3.3 Using strings

Since Python strings are immutable, they’re an excellent example of functional programming objects. A Python str object has a number of methods, all of which produce a new string as the result. These methods are pure functions with no side effects.

The syntax for methods is postfix, where most functions are prefix. This mixture of syntax styles means complex string operations can be hard to read when they’re co-mingled with conventional functions. For example, in this expression, len(variable.title()), the title() method is in postfix notation and the len() function is in prefix notation. (We touched on this in Chapter 2, Introducing Essential Functional Concepts, in the Familiar territory section.)

When scraping data from a web page, we may have a function to clean the data. This could apply a number of transformations to a string to clean up the punctuation and return a Decimal object for use by the rest of the application. This will involve a mixture of prefix and postfix syntax.

It could look like the following code snippet:

from decimal import Decimal 
 
def clean_decimal(text: str | None) -> Decimal | None: 
    if text is None: return None 
    return Decimal( 
        text.replace("$", "").replace(",", "") 
    )

This function does two replacements on the string to remove $ and , string values. The resulting string is used as an argument to the Decimal class constructor, which returns the desired object. If the input value is None, this is preserved; this is why the str | None type hint is used.
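Here’s an informal demonstration of this function’s behavior, using an invented dollar amount:

>>> clean_decimal("$1,234.56") 
Decimal('1234.56') 
>>> clean_decimal(None) is None 
True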

To make the syntax look more consistent, we can consider defining our own prefix functions for the string methods, as follows:

def replace(text: str, a: str, b: str) -> str: 
    return text.replace(a, b)

This can allow us to use Decimal(replace(replace(text, "$", ""), ",", "")) with consistent-looking prefix syntax. It’s not clear whether this kind of consistency is a significant improvement over the mixed prefix and postfix notation. This may be an example of a foolish consistency.

A slightly better approach may be to define a more meaningful prefix function to strip punctuation, such as the following code snippet:

def remove(text: str, chars: str) -> str: 
    if chars: 
        return remove( 
            text.replace(chars[0], ""), 
            chars[1:] 
        ) 
    return text

This function will recursively remove each of the characters from the chars variable. We can use it as Decimal(remove(text, "$,")) to make the intent of our string cleanup more clear.
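A quick check of the recursive removal, again with an invented value:

>>> remove("$12,345.00", "$,") 
'12345.00' 
>>> Decimal(remove("$12,345.00", "$,")) 
Decimal('12345.00')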

3.4 Using tuples and named tuples

Since Python tuples are immutable objects, they’re another excellent example of objects suitable for functional programming. A Python tuple has very few methods, so almost everything is done using prefix syntax. There are a number of use cases for tuples, particularly when working with list-of-tuple, tuple-of-tuple, and generator-of-tuple constructs.

The typing.NamedTuple class adds an essential feature to a tuple: names to use instead of cryptic index numbers. We can exploit named tuples to create objects that are accretions of data. This allows us to write pure functions based on stateless objects, yet keep data bound into tidy object-like packages. The collections.namedtuple() can also be used to define an immutable class of objects. This lacks a mechanism for providing type hints, making it less desirable than the typing.NamedTuple class.

The decision to use a tuple or typing.NamedTuple object is entirely a matter of convenience. As an example, consider working with a sequence of color values as a three-tuple of the form (number, number, number). It’s not clear that these are in red, green, blue order. We have a number of approaches for making the tuple structure explicit.

One purely functional approach to expose the triple structure is by creating functions to pick a three-tuple apart, as shown in the following code snippet:

from collections.abc import Callable 
from typing import TypeAlias 
 
Extractor: TypeAlias = Callable[[tuple[int, int, int, str]], int] 
 
red: Extractor = lambda color: color[0] 
 
green: Extractor = lambda color: color[1] 
 
blue: Extractor = lambda color: color[2]

Given a tuple, item, we can use red(item) to select the item that has the red component. This style is used in a number of purely functional languages; it has a structure that matches mathematical abstractions nicely.

In Python, it can sometimes help to provide a more formal type hint on each variable, as follows:

from collections.abc import Callable 
from typing import TypeAlias 
 
RGB: TypeAlias = tuple[int, int, int, str] 
 
redt: Callable[[RGB], int] = lambda color: color[0]

This defines a new type alias, RGB, as a four-tuple. The redt() function is provided with a type hint of Callable[[RGB], int] to indicate it should be considered to be a function that accepts an argument value of the RGB class and produces an integer result. This follows other styles of functional programming and adds type hints that can be checked by mypy.

A somewhat better technique is to use Python’s typing.NamedTuple class. This uses a class definition instead of function definitions and looks like this:

from typing import NamedTuple 
class Color(NamedTuple): 
    """An RGB color.""" 
    red: int 
    green: int 
    blue: int 
    name: str

The Color class defines a tuple with specific names and type hints for each position within the tuple. This preserves the advantages of performance and immutability. It adds the ability for the mypy program to confirm that the tuple is used properly.

This also means we’ll use color.red instead of red(color). Using an attribute name to access a member of a tuple seems to add clarity.
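A short illustration of attribute access, using an invented color value:

>>> ivory = Color(red=255, green=255, blue=240, name="Ivory") 
>>> ivory.red 
255 
>>> ivory[0]  # index access still works, since Color is still a tuple 
255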

There are a number of additional approaches for working with immutable tuples. We’ll look at all of these immutable class techniques in Chapter 7, Complex Stateless Objects.

3.5 Using generator expressions

We’ve shown some examples of generator expressions already in the Lazy and eager evaluation section of Chapter 2, Introducing Essential Functional Concepts. We’ll show some more later in this chapter. In this section, we’ll introduce some more generator techniques.

Python collections are described as iterable. We can use a for statement to iterate over the values. The key mechanism is the ability of a collection to create an iterator object to be used by the for statement. This concept generalizes to encompass a function that is an iterator over values. We call these generator functions. We can also write generator expressions.

It’s common to see generator expressions used to create the list or dict literals through list comprehension or a dictionary comprehension syntax. This is a list comprehension example, [x**2 for x in range(10)], a kind of list display. A list comprehension is one of several places in Python where generator expressions are used. In this example, the list literal [] characters wrap the generator expression, x**2 for x in range(10). This list comprehension creates a list object from the enclosed generator expression.

The underlying x**2 for x in range(10) expression yields a sequence of values. These must be consumed by a client function. The list() function can consume the values. This means we have two ways to create a list object from a generator expression, as shown in the following example:

>>> list(x**2 for x in range(10)) == [x**2 for x in range(10)] 
True

There are other kinds of comprehensions to create dictionaries and sets. When the enclosing characters are {}, this is a set comprehension. When the enclosing characters are {} and a : separates each key from its value, this is a dictionary comprehension. In this section, we’re going to focus on the generator expressions, separate from the specific kind of collection object they might create.

A collection object and a generator expression have some similar behaviors because both are iterable. They’re not equivalent, as we’ll see in the following code. Using a display object has the disadvantage of creating a (potentially large) collection of objects. A generator expression is lazy and creates objects only as required; this can improve performance.

We have to provide two important caveats on generator expressions, as follows:

  • Generators implement the iterator protocol, so functions that accept any iterable, such as sorted() and iter(), can be applied to either generators or lists. An exception is the len() function, which needs to know the size of the collection and won’t work for a generator.

  • Generators can be used only once. After that, they appear empty.

A generator function is a function that has a yield expression in it. This makes the function act as an iterator. Each individual yield value must be individually consumed by a client function. For a tutorial introduction, see https://wiki.python.org/moin/Generators.

Also, see https://docs.python.org/3/howto/functional.html#generator-expressions-and-list-comprehensions.

We can use something like the following to create a sequence of possible prime numbers:

from collections.abc import Iterator 
 
def candidates() -> Iterator[int]: 
    for i in range(2, 1024): 
        yield m1f(i)

This function produces a result for each value of i from 2 to 1,023. It doesn’t compute them eagerly, however. It is lazy, and computes values as they’re requested. The built-in next() function is one way to consume values. Here’s an example of consuming values from a generator function:

>>> c = candidates() 
>>> next(c) 
3 
>>> next(c) 
7 
>>> next(c) 
15 
>>> next(c) 
31

When the candidates() function is evaluated, it creates a generator object, which is saved in the variable c. Each time we use next(c), the generator function computes one more value and yields it. In this example, it will get a new value from the range object, and evaluate the m1f() function to compute a new value.

The yield from expression extends the yield expression. This will consume values from some iterator, yielding each of the values it consumes. As a small example, consider the following function:

from collections.abc import Iterator 
 
def bunch_of_numbers() -> Iterator[int]: 
    for i in range(5): 
        yield from range(i)

Each time a value is requested, it will be produced by the yield from expression nested inside the for statement. For each value of i set by the containing for statement, the yield from expression yields the i values of range(i), one per request. This produces ever longer runs of numbers.

Here’s what the result looks like:

>>> list(bunch_of_numbers()) 
[0, 0, 1, 0, 1, 2, 0, 1, 2, 3]

Here is a generator function that we’ll use for some more examples:

from collections.abc import Iterator 
import math 
 
def pfactorsl(x: int) -> Iterator[int]: 
    if x % 2 == 0: 
        yield 2 
        if x // 2 > 1: 
            yield from pfactorsl(x // 2) 
        return 
    for i in range(3, int(math.sqrt(x) + .5) + 1, 2): 
        if x % i == 0: 
            yield i 
            if x // i > 1: 
                yield from pfactorsl(x // i) 
            return 
    yield x

We’re locating the prime factors of a number. If the number, x, is even, we’ll yield 2 and then recursively yield all prime factors of x // 2.

For odd numbers, we’ll step through odd values greater than or equal to 3 to locate a candidate factor of the number. When we locate a factor, i, we’ll yield that factor, and then recursively yield all prime factors of x // i.

In the event that we can’t locate a factor, the number, x, must be prime, so we can yield the number.

We handle 2 as a special case to cut the number of iterations in half. All prime numbers, except 2, are odd.

We’ve used one important for statement in addition to recursion. This is an optimization, and it’s a teaser for the content of Chapter 6, Recursions and Reductions. This optimization allows us to easily handle numbers that have as many as 1,000 factors. (As an example, 2^1,000, a number with over 300 digits, will have 1,000 factors.) Since the for variable, i, is not used outside the indented body of the statement, the stateful nature of the i variable won’t lead to confusion if we make any changes to the body of the for statement.

Because the function, as a whole, is a generator, the yield from statement is used to consume values from the recursive call and yield them to the caller. It provides an iterable sequence of values as a result.

In a recursive generator function, be careful of the return statement.

Do not use the following statement: return recursive_iter(args). It returns only a generator object; it doesn’t evaluate the recursive_iter() function to return the generated values. Use any of the following alternatives:

  • A yield expression:

    for result in recursive_iter(args): 
        yield result
  • A yield from expression:

    yield from recursive_iter(args)

A function that implements the Iterator protocol is often described as being a generator function. There is a separate Generator protocol, which extends the essential Iterator definition. We often find that functional Python programs can be structured around the generator expression construct. This tends to focus the design effort on functions and stateless objects.

3.5.1 Exploring the limitations of generators

We noted that there are some limitations of generator expressions and generator functions. The limitations can be observed by executing the following command snippet:

>>> pfactorsl(1560) 
<generator object pfactorsl at ...> 
 
>>> list(pfactorsl(1560)) 
[2, 2, 2, 3, 5, 13] 
 
>>> len(pfactorsl(1560)) 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
TypeError: object of type 'generator' has no len()

In the first example, we saw that the generator function, pfactorsl(), created a generator object. The generator is lazy, and doesn’t have a proper value until we consume the results yielded by the generator. By itself, this isn’t a limitation; lazy evaluation is an important reason why generator expressions fit with functional programming in Python.

In the second example, we materialized a list object from the results yielded by the generator function. This is handy for seeing the output and writing unit test cases.

In the third example, we saw one limitation of generator functions: there’s no len(). Because the generator is lazy, the size can’t be known until after all of the values are consumed.

The other limitation of a generator object is that they can only be used once. For example, look at the following command snippet:

>>> result = pfactorsl(1560) 
>>> sum(result) 
27 
 
>>> sum(result) 
0

The first evaluation of the sum() function performed evaluation of the generator object, result. All of the values were consumed. The second evaluation of the sum() function found that the generator object was now empty. We can only consume the values of a generator object once.

The generator function, pfactorsl(), can produce an indefinite number of generator objects. In many cases, we’ll define generator functions that consume the results yielded by other generators. In these cases, we can’t trivially re-create a single generator object; we must re-create the whole pipeline of generators.

Generators have a stateful life in Python. While they’re very nice for some aspects of functional programming, they’re not quite perfect.

We can try to use the itertools.tee() function to overcome the once-only limitation. We’ll look at this in depth in Chapter 8, The Itertools Module. It’s not a great idea because it can consume a great deal of memory.

Here is a quick example of its usage:

import itertools 
from typing import Any 
from collections.abc import Iterable 
 
def limits(iterable: Iterable[Any]) -> Any: 
    max_tee, min_tee = itertools.tee(iterable, 2) 
    return max(max_tee), min(min_tee)

We created two clones of the iterable parameter, max_tee and min_tee. We can consume these two clones to get maximum and minimum values from the iterable. Interestingly, because the two clones are used serially, this leads to consuming a great deal of memory to cache items. This specific example often works out better using a list object instead of using tee() to clone an iterator.
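For example, given a small, invented iterator:

>>> limits(iter([3, 1, 4, 1, 5, 9, 2, 6])) 
(9, 1)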

Once consumed, a generator object will not provide any more values. When we want to compute multiple kinds of reductions—for example, sums and counts, or minimums and maximums—we need to design with this one-pass-only limitation in mind.

3.5.2 Combining generator expressions

The essence of functional programming comes from the ways we can easily combine generator expressions and generator functions to create very sophisticated composite processing sequences. When working with generator expressions, we can combine generators in several ways.

One common way to combine generator functions is when we create a composite function. We may have a generator that computes (f(x) for x in some_iterable). If we want to compute g(f(x)), we have several ways to combine two generators.

We can tweak the original generator expression as follows:

g_f_x = (g(f(x)) for x in some_iterable)

While technically correct, this defeats any idea of reuse. Rather than reusing an expression, we rewrote it.

We can also substitute one expression within another expression, as follows:

g_f_x = (g(y) for y in (f(x) for x in some_iterable))

This has the advantage of allowing us to use simple substitution. We can revise this slightly to emphasize reuse, using the following commands:

f_x = (f(x) for x in some_iterable) 
g_f_x = (g(y) for y in f_x)

This has the advantage of leaving the initial expression, (f(x) for x in some_iterable), essentially unchanged. All we did was assign the expression to a variable without altering the syntax.

The resulting composite function is also a generator expression, which is also lazy. This means that extracting the next value from g_f_x will extract one value from f_x, which will extract one value from the source some_iterable object.
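As a concrete, if artificial, illustration, here are f() and g() as small functions; nothing is computed until list() consumes the composite:

>>> f = lambda x: x + 1 
>>> g = lambda y: y * 2 
>>> some_iterable = range(4) 
>>> f_x = (f(x) for x in some_iterable) 
>>> g_f_x = (g(y) for y in f_x) 
>>> list(g_f_x) 
[2, 4, 6, 8]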

3.6 Cleaning raw data with generator functions

One of the tasks that arise in exploratory data analysis is cleaning up raw source data. This is often done as a composite operation applying several scalar functions to each piece of input data to create a usable dataset.

Let’s look at a simplified set of data. This data is commonly used to show techniques in exploratory data analysis. It’s called Anscombe’s quartet, and it comes from the article Graphs in Statistical Analysis, by F. J. Anscombe, that appeared in American Statistician in 1973. The following are the first few rows of a downloaded file with this dataset:

Anscombe's quartet 
I II III IV 
x y x y x y x y 
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58 
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76 
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71

Since the data is properly tab-delimited, we can use the csv.reader() function to iterate through the various rows. Sadly, we can’t trivially process this with the csv module. We have to do a little bit of parsing to extract the useful information from this file. We can define a function to iterate over the raw data as follows:

import csv 
from typing import TextIO 
from collections.abc import Iterator, Iterable 
 
def row_iter(source: TextIO) -> Iterator[list[str]]: 
    return csv.reader(source, delimiter="\t")

We wrapped a file in a csv.reader() function to create an iterator over the rows of raw data. The typing module provides a handy definition, TextIO, for file objects that read (or write) string values. Each row is a list of text values. It can be helpful to define an additional type, Row = list[str], to make this more explicit.

We can use this row_iter() function in the following context:

>>> from pathlib import Path 
>>> source_path = Path("Anscombe.txt") 
>>> with source_path.open() as source: 
...     print(list(row_iter(source)))

While this will display useful information, the problem is the first three items in the resulting iterable aren’t data. The Anscombe’s quartet file starts with the following header rows:

[["Anscombe’s quartet"], 
 [’I’, ’II’, ’III’, ’IV’], 
 [’x’, ’y’, ’x’, ’y’, ’x’, ’y’, ’x’, ’y’],

We need to filter these three non-data rows from the iterable. There are several possible approaches. Here is a function that will excise three expected title rows, validate they are the expected headers, and return an iterator over the remaining rows:

from collections.abc import Iterator 
 
def head_split_fixed( 
        row_iter: Iterator[list[str]] 
) -> Iterator[list[str]]: 
    title = next(row_iter) 
    assert (len(title) == 1 
        and title[0] == "Anscombe's quartet") 
    heading = next(row_iter) 
    assert (len(heading) == 4 
        and heading == ['I', 'II', 'III', 'IV']) 
 
    columns = next(row_iter) 
    assert (len(columns) == 8 
        and columns == ['x','y', 'x','y', 'x','y', 'x','y']) 
    return row_iter

This function plucks three rows from the source data, an iterator. It asserts that each row has an expected value. If the file doesn’t meet these basic expectations, it’s a sign that the file was damaged or perhaps our analysis is focused on the wrong file.

Since both the row_iter() and the head_split_fixed() functions expect an iterator as an argument value, they can be combined, as follows:

from pathlib import Path 
from collections.abc import Iterator 
 
def get_rows(path: Path) -> Iterator[list[str]]: 
    with path.open() as source: 
        yield from head_split_fixed(row_iter(source))

We’ve applied one iterator to the results of another iterator. In effect, this defines a composite function. We’re not done, of course; we still need to convert the string values to the float values, and we also need to pick apart the four parallel series of data in each row.

The final conversions and data extractions are more easily done with higher-order functions, such as map() and filter(). We’ll return to those in Chapter 5, Higher-Order Functions.

3.7 Applying generators to built-in collections

We’ll now look at ways to apply generator expressions to a number of Python’s built-in collections. This section will cover the following topics:

  • Generators for lists, dicts, and sets

  • Working with stateful collections

  • Using the bisect module to create a mapping

  • Using stateful sets

Each of these looks at some specialized cases of Python collections and generator functions. In particular, we’ll look at ways to produce a collection, and consume the collection in later processing.

This is a lead-in to the next chapter, Chapter 4, Working with Collections, which covers the Python collections in considerably more detail.

3.7.1 Generators for lists, dicts, and sets

A Python sequence object, such as a list, is iterable. However, it has some additional features. We can think of a list as a materialized iterable. We’ve used the tuple() function in several examples to collect the output of a generator expression or generator function into a single tuple object. We can use the list() function to materialize a sequence to create a list object.

In Python, a list display, or list comprehension, offers simple syntax to materialize a generator: we add the [] brackets. This is ubiquitous to the point where the distinction between generator expression and list comprehension can be lost. We need to disentangle the idea of a generator expression from a list display that uses a generator expression.

The following is an example to enumerate the cases:

>>> range(10) 
range(0, 10) 
 
>>> [range(10)] 
[range(0, 10)] 
 
>>> [x for x in range(10)] 
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] 
 
>>> list(range(10)) 
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

The first example is the range object. It’s lazy, so it doesn’t produce any values until they are consumed.

The second example shows a list containing a single range object. The [] syntax creates a list literal holding the range(10) object without consuming any of the values it can produce.

The third example shows a list comprehension built from a generator expression that consumes a range object. The values of range(10) are processed by the generator expression, x for x in range(10). The resulting values are collected into a list object.

We can also use the list() function to build a list from an iterable or a generator expression. This also works for set(), tuple(), and dict().

The list(range(10)) expression consumes the values produced by the range object. The [range(10)] list literal does not consume those values.

While there’s shorthand syntax for list, dict, and set using [] and {}, there’s no shorthand syntax for a tuple. To materialize a tuple, we must use the tuple() function. For this reason, it often seems most consistent to use the list(), tuple(), and set() functions as the preferred syntax.

In the data cleansing code in the previous section, we used a composite function, get_rows(), to create a list of rows. Using it looks as follows:

>>> data = list(get_rows(Path("Anscombe.txt"))) 
>>> data[0] 
['10.0', '8.04', '10.0', '9.14', '10.0', '7.46', '8.0', '6.58'] 
>>> data[1] 
['8.0', '6.95', '8.0', '8.14', '8.0', '6.77', '8.0', '5.76'] 
>>> data[-1] 
['5.0', '5.68', '5.0', '4.74', '5.0', '5.73', '8.0', '6.89']

We assigned the results of the get_rows() composite function to a name, data. Each of these rows is a collection of four (x,y) pairs.

To extract one of the (x,y) pairs, we’ll need to do a little bit more processing to make this useful. First, we need to pick pairs of columns from the eight-tuple. Since the pairs are always adjacent, we can select a pair of columns with a slice operation of the form row[2 * n: 2 * n + 2]. The idea is that pair n is in positions 2 × n and 2 × n + 1. The slice expression 2 * n: 2 * n + 2 includes the start element, 2 * n, and stops just before the stop element, 2 * n + 2. We can wrap this with a reusable function, as shown in the following definition:

from typing import cast, TypeVar 
from collections.abc import Iterator, Iterable 
 
SrcT = TypeVar("SrcT") 
 
def series( 
        n: int, 
        row_iter: Iterable[list[SrcT]] 
) -> Iterator[tuple[SrcT, SrcT]]: 
    for row in row_iter: 
        yield cast(tuple[SrcT, SrcT], tuple(row[n * 2: n * 2 + 2]))

This function picks two adjacent columns based on a number between 0 and 3. It creates a tuple object from those two columns. The cast() function does nothing at runtime; it informs the mypy tool that the result will be a two-tuple where both items have the source type. This is required because it’s difficult for the mypy tool to determine that the expression tuple(row[n * 2: n * 2 + 2]) will select exactly two elements from the row collection.

This example uses a type variable, SrcT, to make a deeper claim about the transformation. Specifically, the type variable tells people reading the code (and tools like mypy) that the input object types will be the resulting object types. If the source is, for example, an iterable of lists of str, then SrcT = str, and the output will be an iterator over tuples with two str values.

We can use the series() function to extract collections from the source file as follows:

>>> from pathlib import Path 
>>> source_path = Path("Anscombe.txt") 
>>> with source_path.open() as source: 
...     data = tuple(head_split_fixed(row_iter(source))) 
>>> series_I = tuple(series(0, data)) 
>>> series_II = tuple(series(1, data)) 
>>> series_III = tuple(series(2, data)) 
>>> series_IV = tuple(series(3, data))

We applied the tuple() function to a composite function based on the series(), head_split_fixed(), and row_iter() functions. Each of these expressions will create an object that we can reuse in several other functions. We can then do analysis on subsets of the source data.

The series_I sequence looks as follows:

>>> series_I 
(('10.0', '8.04'), ('8.0', '6.95'), ... ('5.0', '5.68'))

The other three sequences are similar in structure. The values, however, are quite different.

The final thing we’ll need to do is create proper numeric values from the strings that we’ve accumulated so that we can compute some statistical summary values. We can apply the float() function conversion as the last step. There are many alternative places to apply the float() function, and we’ll look at some choices in Chapter 5, Higher-Order Functions.
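As a sketch of one possible spot, a generator expression can convert each pair of strings as it’s consumed:

>>> pairs = list((float(x), float(y)) for x, y in series_I) 
>>> pairs[0] 
(10.0, 8.04)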

To reduce memory use and increase performance, we prefer to use generator expressions and functions as much as possible. These iterate through collections in a lazy manner, computing values only when required. Since iterators can only be used once, we’re sometimes forced to materialize a collection as a tuple (or list) object. Materializing a collection costs memory and time, so we do it reluctantly.

Programmers familiar with Clojure can match Python’s lazy generators with the lazy-seq and lazy-cat functions. The idea is that we can specify a potentially infinite sequence, but only take values from it as needed.

3.7.2 Using stateful mappings

Python offers several stateful collections; the various mappings include the dict class and a number of related mappings defined in the collections module. We need to emphasize the stateful nature of these mappings and use them carefully.

For our purposes, learning functional programming techniques in Python, there are two use cases for mapping: a stateful dictionary that accumulates a mapping and a frozen dictionary that can’t be updated. Python doesn’t provide an easy-to-use definition of an immutable mapping. We can use the abstract base class Mapping from the collections.abc module. We can also create an immutable MappingProxyType object from a mutable mapping. For more information, see the types module.

The stateful dictionary can be further decomposed into the following two typical use cases:

  • A dictionary built once and never updated. In this case, we want to exploit the hashed keys feature of the dict class to optimize performance. We can create a dictionary from any iterable sequence of (key, value) two-tuples using the expression dict(sequence).

  • A dictionary built incrementally. This is an optimization we can use to avoid materializing and sorting a list object. We’ll look at this in Chapter 6, Recursions and Reductions, where we’ll look at the collections.Counter class as a sophisticated reduction. Incremental building is particularly helpful for memoization. We’ll defer memoization until Chapter 10, The Functools Module.

The first example, building a dictionary once, stems from an application with three operating phases: gather some input, create a dict object, and then process input based on the mappings in the dictionary. As an example of this kind of application, we may be doing some image processing and have a specific palette of colors, represented by names and (R, G, B) tuples. If we use the GNU Image Manipulation Program (GIMP) file format, the color palette may look like the following command snippet:

GIMP Palette 
Name: Small 
Columns: 3 
# 
0 0 0 Black 
255 255 255 White 
238 32 77 Red 
28 172 120 Green 
31 117 254 Blue

The details of parsing this file are the subject of Chapter 6, Recursions and Reductions. What’s important is the results of the parsing.

First, we’ll define a typing.NamedTuple class Color as follows:

from typing import NamedTuple 
 
class Color(NamedTuple): 
    red: int 
    green: int 
    blue: int 
    name: str

Second, we’ll assume that we have a parser that produces an iterable of Color objects. If we materialized it as a tuple, it would look like the following:

>>> palette = [ 
...     Color(red=239, green=222, blue=205, name='Almond'), 
...     Color(red=205, green=149, blue=117, name='Antique Brass'), 
...     Color(red=253, green=217, blue=181, name='Apricot'), 
...     Color(red=197, green=227, blue=132, name='Yellow Green'), 
...     Color(red=255, green=174, blue=66, name='Yellow Orange') 
... ]

To locate a given color name quickly, we will create a frozen dictionary from this sequence. This is not the only way to get fast lookups of a color by name. We’ll look at another option later.

To create a mapping from an iterable sequence of tuples, we will use the process(wrap(iterable)) design pattern. The following command shows how we can create the color name mapping:

>>> name_map = dict((c.name, c) for c in palette)

Here are the three parts of the design pattern:

  • The source iterable is the palette. We could formalize this with the hint Iterable[Color].

  • The wrap is the (c.name, c) expression to transform a Color object to a tuple[str, Color] pair.

  • The process is the dict() function to create a mapping.

The resulting dictionary looks as follows:

>>> name_map['Antique Brass'] 
Color(red=205, green=149, blue=117, name='Antique Brass') 
>>> name_map['Yellow Orange'] 
Color(red=255, green=174, blue=66, name='Yellow Orange')

This can also be done using a dictionary comprehension. We leave that as an exercise for the reader.

Now that we’ve materialized the mapping, we can use this dict() object in some later processing for repeated transformations from color name to (R, G, B) color numbers. The lookup will be blazingly fast because a dictionary does a rapid transformation from key to hash value followed by lookup in the dictionary.

3.7.3 Using the bisect module to create a mapping

In the previous example, we created a dict object to achieve a fast mapping from a color name to a Color object. This isn’t the only choice; we can use the bisect module instead. Using the bisect module means that we have to create a sorted sequence, which we can then search. To be perfectly compatible with the dictionary implementation, we can use collections.abc.Mapping as the base class.

The dict class uses a hash computation to locate items almost immediately. However, this requires allocating a fairly large block of memory. The bisect mapping does a search, which doesn’t require as much memory, but performance cannot be described as immediate. The performance drops from O(1) to O(log n). While this is dramatic, the savings in memory can be critical for processing large collections of data.

A static mapping class looks like the following command snippet:

import bisect 
from collections.abc import Mapping, Iterable, Iterator 
 
class StaticMapping(Mapping[str, Color]): 
    def __init__(self, 
            iterable: Iterable[tuple[str, Color]] 
    ) -> None: 
        # Sort once so positions in _keys and _data correspond. 
        self._data: tuple[tuple[str, Color], ...] = tuple(sorted(iterable)) 
        self._keys: tuple[str, ...] = tuple(key for key, _ in self._data) 
 
    def __getitem__(self, key: str) -> Color: 
        ix = bisect.bisect_left(self._keys, key) 
        if (ix != len(self._keys) and self._keys[ix] == key): 
            return self._data[ix][1] 
        raise KeyError(f"{key!r} not found") 
 
    def __iter__(self) -> Iterator[str]: 
        return iter(self._keys) 
 
    def __len__(self) -> int: 
        return len(self._keys)

This class extends the abstract superclass collections.abc.Mapping. It provides an initialization and implementations for the three methods missing from the abstract definition. The type of tuple[str, Color] defines a specific kind of two-tuple expected by this mapping.

The __getitem__() method uses the bisect.bisect_left() function to search the collection of keys. If the key is found, the appropriate value is returned. The __iter__() method returns an iterator, as required by the superclass. The __len__() method, similarly, provides the required length of the collection.

This class may not seem to embody too many functional programming principles. Our goal here is to support a larger application that minimizes the use of stateful variables. This class saves a static collection of key-value pairs. As an optimization, it materializes two objects.

An application would create an instance of this class to perform relatively rapid lookups of values associated with keys. The superclass does not support updates to the object. The collection, as a whole, is stateless. It’s not as fast as the built-in dict class, but it uses less memory and, through the formality of being a subclass of the Mapping class, we can be assured that this object is not used to contain a processing state.
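A brief sketch of how an application might use this class, reusing the small palette sequence from the Using stateful mappings section:

>>> color_map = StaticMapping((c.name, c) for c in palette) 
>>> color_map['Apricot'].red 
253 
>>> len(color_map) 
5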

3.7.4 Using stateful sets

Python offers several stateful collections, including the set collection. For our purposes, there are two use cases for a set:

  • A stateful set that accumulates items

  • frozenset, which can be used to optimize searches for an item

We can create a frozenset from an iterable in the same way we create a tuple object from an iterable, via a frozenset(some_iterable) expression; this will create a structure that has the advantage of a very fast in operator. This can be used in an application that gathers data, creates a set, and then uses a frozenset to match other data items against the set.

We may have a set of colors that we will use as a kind of chroma-key: we will use this color to create a mask that will be used to combine two images. Pragmatically, a single color isn’t appropriate, but a small set of very similar colors works best. In this case, we can examine each pixel of an image file to see if the pixel is in the chroma-key set or not. For this kind of processing, the chroma-key colors can be loaded into frozenset before processing the target images. The set lookup is extremely fast.
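A tiny sketch of such a membership test, with invented (R, G, B) values in the chroma-key set:

>>> chroma_key = frozenset([(0, 255, 0), (1, 254, 1), (2, 255, 2)]) 
>>> (1, 254, 1) in chroma_key 
True 
>>> (200, 10, 10) in chroma_key 
False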

As with mappings—specifically the Counter class—there are some algorithms that can benefit from a memoized set of values. Some functions benefit from memoization because a function is a mapping between domain values and range values, a job where mapping works well. A few algorithms benefit from a memoized set, which is stateful and grows as data is processed.

We’ll return to memoization in Chapter 10, The Functools Module.

3.8 Summary

In this chapter, we looked again at writing pure functions free of side effects. We looked at generator functions and how we can use these as the backbone of functional programming to process collections of items. We also examined a number of the built-in collection classes to show how they’re used in the functional paradigm. While the general idea behind functional programming is to limit the use of stateful variables, the collection objects have a stateful implementation. For many algorithms, they’re often essential. Our goal is to be judicious in our use of Python’s non-functional features.

In the next two chapters, we’ll look at functions for processing collections. After that, we’ll look closely at higher-order functions: functions that accept functions as arguments as well as returning functions. In later chapters, we’ll look at techniques for defining our own higher-order functions. We’ll also look at the itertools and functools modules and their higher-order functions in later chapters.

3.9 Exercises

This chapter’s exercises are based on code available from Packt Publishing on GitHub. See https://github.com/PacktPublishing/Functional-Python-Programming-3rd-Edition.

In some cases, the reader will notice that the code provided on GitHub includes partial solutions to some of the exercises. These serve as hints, allowing the reader to explore alternative solutions.

In many cases, exercises will need unit test cases to confirm they actually solve the problem. These are often identical to the unit test cases already provided in the GitHub repository. The reader should replace the book’s example function name with their own solution to confirm that it works.

3.9.1 Rewrite the some_function() function

In the Writing pure functions section, a function was shown that relied on a global variable.

Create a small application that sets the global variable and calls the function. The application can expand on the following example:

def some_function ... 
 
def main(): 
    """ 
    >>> main() 
    some_function(2, 3, 5)=30 
    some_function(2, 3, 5)=34 
    """ 
    global global_adjustment 
    global_adjustment = 13 
    print(f"{some_function(2, 3, 5)=}") 
    global_adjustment = 17 
    print(f"{some_function(2, 3, 5)=}") 
 
if __name__ == "__main__": 
    main()

First, create a test suite for some_function() and main(). A doctest suite embedded in the docstring is shown in the example.

Second, rewrite some_function() to make global_adjustment into a parameter. This will lead to revising main() and all of the test cases.

3.9.2 Alternative Mersenne class definition

The examples in the Functions as first-class objects section show a Mersenne1 class that accepts a function as a parameter to the __init__() method.

An alternative is to provide a plug-in Strategy function as part of the class definition.

This would permit the following kinds of object definitions:

 
>>> class ShiftyMersenne(Mersenne2): 
...     pow2 = staticmethod(shifty) 
 
>>> m2s = ShiftyMersenne() 
>>> m2s(17) 
131071

The use of staticmethod() is essential because the shifty() function does not expect the self argument when it is evaluated. The staticmethod() wrapper makes sure this function is understood as being "static"—that is, not using a self parameter.

3.9.3 Alternative algorithm implementations

Consider the following algorithm:

Algorithm 4: Imperative iteration

As we’ve seen in this chapter, there are three ways to write this in Python:

  • As a for statement that updates stateful variables

  • As a generator expression

  • As a map() operation to apply the function

Write all three versions in Python.

A test case is the following data:

V ← {7.46,6.77,12.74,7.11,7.81,8.84,6.08,5.39,8.15,6.42,5.73}

And the following scaling function:

f(v) = (v − 7.5) / 2.031

The computed value for m is approximately zero.

3.9.4 Map and filter

The built-in map() and filter() functions always have an equivalent generator expression. In order to have consistent-looking code, a project team is wrestling with the idea of insisting that all code use generator expressions, avoiding the built-in map() and filter() functions.

  1. Take the side of only using generator expressions and provide reasons for which this is advantageous.

  2. Take the side of only using built-in map() and filter() functions and provide reasons why this alternative might be advantageous.

  3. In looking at the reasons for the first two parts of this exercise, is there a clearly articulated decision on which approach is better? If not, why not? If so, what rule should the team use?

3.9.5 Dictionary comprehension

In the Using stateful mappings section, we built a mapping from a list of two-tuples. We can also build a mapping using a dictionary comprehension. Rewrite the expression dict((c.name, c) for c in palette) as a dictionary comprehension.

3.9.6 Raw data cleanup

A file, Anscombe.txt, is almost a valid CSV format file. The problem is there are three lines of useless text at the beginning. The lines are easy to recognize because applying the float() function to the values in those header rows will raise a ValueError exception.

Some team members suggest using a regular expression to examine the values to see if they are valid numbers. This can be called Look Before You Leap (LBYL):

Algorithm: LBYL filtering of the header rows (pseudocode not reproduced here)

Other team members suggest using a simpler try: statement to uncover the invalid non-numeric headers and discard them. This can be called Easier to Ask Forgiveness Than Permission (EAFP):

Algorithm: EAFP filtering of the header rows (pseudocode not reproduced here)

Both algorithms work. It’s instructive to implement each one in Python to compare them. Here are a few starting points for comparing the algorithms:

  • The LBYL variant can rely entirely on generator expressions. However, it requires writing a regular expression that recognizes all possible floating-point values. Is this a responsibility that should be part of this application?

  • The EAFP variant needs a separate function to implement the try: statement processing. Otherwise, it seems amenable to also being written via generator expressions or the map() function.

After building the two variants, which seems to be more expressive of the purpose of filtering and acquiring data?

Join our community Discord space

Join our Python Discord workspace to discuss and know more about the book: https://packt.link/dHrHU
