The core of functional programming is the use of pure functions to map values from the input domain to the output range. Avoiding side effects reduces any dependence on variable assignment to maintain the state of a computation. We can’t purge the assignment statement from the Python language, but we can reduce our dependence on stateful objects. This means choosing among the available Python built-in functions and data structures to select those that don’t require stateful operations.
This chapter will present several Python features from a functional viewpoint, as follows:
Pure functions, free of side effects
Functions as objects that can be passed as arguments or returned as results
The use of Python’s object-oriented suffix notation and prefix notation
Using tuples as a way to create immutable objects, which avoid the confusion of state changes
Using iterable collections as our primary design tool for functional programming
We’ll look at generators and generator expressions, since these are ways to work with collections of objects. As we noted in Chapter 2, Introducing Essential Functional Concepts, there are some boundary issues when trying to replace all generator expressions with recursions. Python imposes a recursion limit and doesn’t automatically handle Tail-Call Optimization (TCO): we must optimize recursions manually using a generator expression.
We’ll write generator expressions that will perform the following tasks:
Conversions
Restructuring
Complex calculations
We’ll take a quick survey of many of the built-in Python collections and how we can work with collections while pursuing a functional paradigm. This may change our approach to working with lists, dicts, and sets. Writing functional Python encourages us to focus on tuples and immutable collections. In the next chapter, we’ll emphasize more functional ways to work with specific kinds of collections.
In Chapter 2, Introducing Essential Functional Concepts, we looked at pure functions. In this section, we’ll look at a common problem with non-functional programming: a function that has a reference to a global variable. When a global variable is assigned, the global statement must be used. When a global variable is only read, however, it is called a free variable, and there’s no obvious marker in the Python code.
Any reference to a value in the Python global namespace (using a free variable) is something we can rework into a proper parameter. In most cases, it’s quite easy. Here is an example that depends on a free variable:
global_adjustment: float

def some_function(a: float, b: float, t: float) -> float:
    return a + b * t + global_adjustment
After refactoring the function, we would need to change each reference to this function. This may have a ripple effect through a complex application. We’ll leave the refactoring as an exercise.
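As a sketch of where this refactoring leads, the free variable can become an ordinary parameter; the parameter name used here is our own choice, not taken from the original:

```python
# A possible refactoring: the former free variable global_adjustment
# becomes an explicit parameter, so the dependency is visible at each call.
def some_function(a: float, b: float, t: float, adjustment: float) -> float:
    return a + b * t + adjustment
```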
There are many internal Python objects that are stateful. Objects used for input and output are generally called file objects or file-like objects; these are examples of stateful objects in common use. See the io module for more information on file objects. We observe that some of the commonly used stateful objects in Python generally behave as context managers. In a few cases, stateful objects don’t completely implement the context manager interface; in these cases, there’s often a close() method. We can use the contextlib.closing() function to provide these objects with the proper context manager interface.
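A minimal sketch of this pattern, using a hypothetical stateful class that offers only a close() method:

```python
import contextlib

class Gizmo:
    """A hypothetical stateful object with close() but no __enter__/__exit__."""
    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True

g = Gizmo()
with contextlib.closing(g):
    pass  # work with g here
# contextlib.closing() guarantees g.close() is called on exit.
```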
A context manager provides a way to perform operations on entry to and exit from a block of code. The with statement uses a context manager to perform the entry operation, execute the indented block of code, and perform the exit operation. It’s important to note the exit operation is always performed, even if an exception is raised in the indented block of code. This makes for a tidy way to perform state-changing operations, making the code easier to reason about. In practice, it looks like the following example:
from pathlib import Path

def write_file(some_path: Path) -> None:
    result = "Hello, world!"
    with some_path.open('w') as output_file:
        output_file.write(result + "\n")
The file is only open for writing inside the with statement. This makes it easier to see where state-changing operations are performed.
We can’t easily eliminate all stateful Python objects. Therefore, we must strike a balance between managing state while still exploiting the strengths of functional design. To this end, we should always use the with statement to encapsulate stateful file objects into a well-defined scope.

Always use file objects in a with context. This defines a context in which state-changing operations will be performed.
We should always avoid global file objects, global database connections, and the associated stateful object issues. A global file object is a common pattern for handling open files or databases. We may have a function as shown in the following example:
from typing import TextIO

ifile: TextIO
ofile: TextIO

def open_files(iname: str, oname: str) -> None:
    """A bad idea..."""
    global ifile, ofile
    ifile = open(iname, "r")
    ofile = open(oname, "w")
This function creates an easily overlooked pair of global variables. Other functions can use the ifile and ofile variables, hoping they properly refer to the global files, which are left open and will endure a difficult-to-understand series of state changes.
This is not a very functional design, and we need to avoid it. The files should be proper parameters to functions, and the open files should be nested in a with statement to ensure that their stateful behavior is handled properly. This is an important rewrite to change these variables from globals to formal parameters: it makes the file operations more visible.
The rewrite will involve locating every function using ifile or ofile as free variables. For example, we might have a function like the following:
def next_line_with(prefix: str) -> str | None:
    """Also a bad idea..."""
    line = ifile.readline()
    # Note: readline() returns "" at end of file, never None.
    while line and not line.startswith(prefix):
        line = ifile.readline()
    return line or None
We’ll need to make the ifile global variable reference into a parameter to this function. This will create ripples of change to the functions that call next_line_with(). This can become an extensive rewrite to identify and localize the state changes. This may lead to rethinking the design to replace a function like next_line_with().
This context manager design pattern also applies to databases. A database connection object should generally be provided as a formal argument to an application’s functions. This is contrary to the way some popular web frameworks work: some frameworks rely on a global database connection in an effort to make the database a transparent feature of the application. This transparency obscures a dependency between a web operation and the database; it can make unit testing more complex than necessary. Additionally, a multithreaded web server may not benefit from sharing a single database connection: a connection pool is often better. This suggests that there are some benefits of a hybrid approach that uses functional design with a few isolated stateful features.
In Chapter 2, Introducing Essential Functional Concepts, we looked at ways in which Python functions are first-class objects. In Python, function objects have a number of attributes. The reference manual lists a number of special member names that apply to functions. Since functions are objects with attributes, we can extract the docstring or the name of a function, using special attributes such as __doc__ or __name__. We can also extract the body of the function through the __code__ attribute. In compiled languages, this introspection can be either impossible or quite complicated.
Additionally, a callable object helps us to create functions. We can consider the callable class definition as a higher-order function. We do need to be judicious in how we use the __init__() method of a callable object; we should avoid setting stateful class variables. One common application is to use an __init__() method to create objects that fit the Strategy design pattern.
A class following the Strategy design pattern depends on other objects to provide an algorithm or parts of an algorithm. This allows us to inject algorithmic details at runtime, rather than compiling the details into the class.
To focus on the overall design principles, we’ll look at a function that does a tiny computation. This computes one of the Mersenne prime numbers. See https://www.mersenne.org/primes/ for ongoing research into this topic.
Here is an example of the definition of a callable object class with an embedded Strategy object:
from collections.abc import Callable

class Mersenne1:
    def __init__(
            self,
            algorithm: Callable[[int], int]
    ) -> None:
        self.pow2 = algorithm

    def __call__(self, arg: int) -> int:
        return self.pow2(arg) - 1
This class uses __init__() to save a reference to another function, algorithm, as self.pow2. We’re not creating any stateful instance variables; the value of self.pow2 isn’t expected to change. It’s a common practice to use a name like _pow2 to suggest this attribute isn’t expected to be used by a client of this class. The algorithm parameter has a type hint of Callable[[int], int], which describes a function that takes an integer argument and returns an integer value.
We’ve used the Callable type hint from the collections.abc module, where it is defined. An alias is available in the typing module, but since PEP 585 was implemented, the use of typing.Callable is deprecated. We’ll use a number of generic types from the collections.abc module throughout this chapter.
The function given as a Strategy object must raise 2 to the given power. We can plug in any function that performs this computation. Three candidate objects that we can plug into this class are as follows:
def shifty(b: int) -> int:
    return 1 << b

def multy(b: int) -> int:
    if b == 0:
        return 1
    return 2 * multy(b - 1)

def faster(b: int) -> int:
    if b == 0:
        return 1
    if b % 2 == 1:
        return 2 * faster(b - 1)
    t = faster(b // 2)
    return t * t
The shifty() function raises 2 to the desired power using a left shift of the bits. The multy() function uses a naive recursive multiplication. The faster() function uses a divide and conquer strategy that will perform log2(b) multiplications instead of b multiplications.
All three of these functions have identical function signatures. Each of them can be summarized as Callable[[int], int], which matches the parameter, algorithm, of the Mersenne1.__init__() method.
We can create instances of our Mersenne1 class with an embedded Strategy algorithm, as follows:
m1s = Mersenne1(shifty)
m1m = Mersenne1(multy)
m1f = Mersenne1(faster)
Each of the resulting functions, m1s(), m1m(), and m1f(), is built from another function. The functions shifty(), multy(), and faster() are incorporated into the resulting functions. This shows how we can define alternative functions that produce the same result but use different algorithms.
The callable objects created by this class behave as ordinary Python functions, as shown in the following example:
>>> m1s(17)
131071
>>> m1f(89)
618970019642690137449562111
Python allows us to compute M89 = 2^89 − 1, since this doesn’t even come close to the recursion limits in Python. This is quite a large prime number, as it has 27 digits. In order to exceed the limits of the multy() function, we’d have to ask for the value of M1279, a number with 386 digits.
Since Python strings are immutable, they’re an excellent example of functional programming objects. A Python str object has a number of methods, all of which produce a new string as the result. These methods are pure functions with no side effects.
The syntax for methods is postfix, where most functions are prefix. This mixture of syntax styles means complex string operations can be hard to read when they’re co-mingled with conventional functions. For example, in the expression len(variable.title()), the title() method is in postfix notation and the len() function is in prefix notation. (We touched on this in Chapter 2, Introducing Essential Functional Concepts, in the Familiar territory section.)
When scraping data from a web page, we may have a function to clean the data. This could apply a number of transformations to a string to clean up the punctuation and return a Decimal object for use by the rest of the application. This will involve a mixture of prefix and postfix syntax.
It could look like the following code snippet:
from decimal import Decimal

def clean_decimal(text: str | None) -> Decimal | None:
    if text is None:
        return None
    return Decimal(
        text.replace("$", "").replace(",", "")
    )
This function does two replacements on the string to remove the "$" and "," characters. The resulting string is used as an argument to the Decimal class constructor, which returns the desired object. If the input value is None, this is preserved; this is why the str | None type hint is used.
To make the syntax look more consistent, we can consider defining our own prefix functions for the string methods, as follows:
def replace(text: str, a: str, b: str) -> str:
    return text.replace(a, b)
This can allow us to use Decimal(replace(replace(text, "$", ""), ",", "")) with consistent-looking prefix syntax. It’s not clear whether this kind of consistency is a significant improvement over the mixed prefix and postfix notation. This may be an example of a foolish consistency.
A slightly better approach may be to define a more meaningful prefix function to strip punctuation, such as the following code snippet:
def remove(str: str, chars: str) -> str:
    if chars:
        return remove(
            str.replace(chars[0], ""),
            chars[1:]
        )
    return str
This function will recursively remove each of the characters in the chars argument. We can use it as Decimal(remove(text, "$,")) to make the intent of our string cleanup more clear.
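To exercise remove(), here is a self-contained sketch; we’ve renamed the first parameter to text to avoid shadowing the built-in str:

```python
from decimal import Decimal

def remove(text: str, chars: str) -> str:
    # Recursively strip the first character of chars, then recurse on the rest.
    if chars:
        return remove(text.replace(chars[0], ""), chars[1:])
    return text

print(Decimal(remove("$1,234.56", "$,")))  # 1234.56
```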
Since Python tuples are immutable objects, they’re another excellent example of objects suitable for functional programming. A Python tuple has very few methods, so almost everything is done using prefix syntax. There are a number of use cases for tuples, particularly when working with list-of-tuple, tuple-of-tuple, and generator-of-tuple constructs.
The typing.NamedTuple class adds an essential feature to a tuple: names to use instead of cryptic index numbers. We can exploit named tuples to create objects that are accretions of data. This allows us to write pure functions based on stateless objects, yet keep data bound into tidy object-like packages. The collections.namedtuple() function can also be used to define an immutable class of objects. This lacks a mechanism for providing type hints, making it less desirable than the typing.NamedTuple class.
The decision to use a tuple or a typing.NamedTuple object is entirely a matter of convenience. As an example, consider working with a sequence of color values as a three-tuple of the form (number, number, number). It’s not clear that these are in red, green, blue order. We have a number of approaches for making the tuple structure explicit.
One purely functional approach to expose the triple structure is by creating functions to pick a three-tuple apart, as shown in the following code snippet:
from collections.abc import Callable
from typing import TypeAlias
Extractor: TypeAlias = Callable[[tuple[int, int, int, str]], int]
red: Extractor = lambda color: color[0]
green: Extractor = lambda color: color[1]
blue: Extractor = lambda color: color[2]
Given a tuple, item, we can use red(item) to select the item that has the red component. This style is used in a number of purely functional languages; it has a structure that matches mathematical abstractions nicely.
In Python, it can sometimes help to provide a more formal type hint on each variable, as follows:
from collections.abc import Callable
from typing import TypeAlias
RGB: TypeAlias = tuple[int, int, int, str]
redt: Callable[[RGB], int] = lambda color: color[0]
This defines a new type alias, RGB, as a four-tuple. The redt() function is provided with a type hint of Callable[[RGB], int] to indicate it should be considered to be a function that accepts an argument value of the RGB class and produces an integer result. This follows other styles of functional programming and adds type hints that can be checked by mypy.
A somewhat better technique is to use Python’s typing.NamedTuple class. This uses a class definition instead of function definitions and looks like this:
from typing import NamedTuple

class Color(NamedTuple):
    """An RGB color."""
    red: int
    green: int
    blue: int
    name: str
The Color class defines a tuple with specific names and type hints for each position within the tuple. This preserves the advantages of performance and immutability. It adds the ability for the mypy program to confirm that the tuple is used properly.
This also means we’ll use color.red instead of red(color). Using an attribute name to access a member of a tuple seems to add clarity.
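For example (the field values here are illustrative):

```python
from typing import NamedTuple

class Color(NamedTuple):
    """An RGB color."""
    red: int
    green: int
    blue: int
    name: str

ivory = Color(red=255, green=255, blue=240, name="Ivory")
print(ivory.red)    # access by name
print(ivory[0])     # ordinary tuple indexing still works
r, g, b, n = ivory  # so does tuple unpacking
```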
There are a number of additional approaches for working with immutable tuples. We’ll look at all of these immutable class techniques in Chapter 7, Complex Stateless Objects.
We’ve shown some examples of generator expressions already in the Lazy and eager evaluation section of Chapter 2, Introducing Essential Functional Concepts. We’ll show some more later in this chapter. In this section, we’ll introduce some more generator techniques.
Python collections are described as iterable. We can use a for statement to iterate over the values. The key mechanism is the ability of a collection to create an iterator object to be used by the for statement. This concept generalizes to encompass a function that is an iterator over values. We call these generator functions. We can also write generator expressions.
It’s common to see generator expressions used to create list or dict literals through list comprehension or dictionary comprehension syntax. This is a list comprehension example, [x**2 for x in range(10)], a kind of list display. A list comprehension is one of several places in Python where generator expressions are used. In this example, the list literal [] characters wrap the generator expression, x**2 for x in range(10). This list comprehension creates a list object from the enclosed generator expression.
The underlying x**2 for x in range(10) expression yields a sequence of values. These must be consumed by a client function. The list() function can consume the values. This means we have two ways to create a list object from a generator expression, as shown in the following example:
>>> list(x**2 for x in range(10)) == [x**2 for x in range(10)]
True
There are other kinds of comprehensions to create dictionaries and sets. When the enclosing characters are {}, this is a set comprehension. When the enclosing characters are {} and there are : characters to separate keys and values, this is a dictionary comprehension. In this section, we’re going to focus on generator expressions, separate from the specific kind of collection object they might create.
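For completeness, here are small examples of the other comprehension forms:

```python
# Set comprehension: {} wrapping a generator expression.
# Duplicate values collapse: (-3)**2 == 3**2.
squares_set = {x**2 for x in range(-3, 4)}

# Dictionary comprehension: {} with key: value pairs.
squares_map = {x: x**2 for x in range(4)}

print(sorted(squares_set))  # [0, 1, 4, 9]
print(squares_map)          # {0: 0, 1: 1, 2: 4, 3: 9}
```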
A collection object and a generator expression have some similar behaviors because both are iterable. They’re not equivalent, as we’ll see in the following code. Using a display object has the disadvantage of creating a (potentially large) collection of objects. A generator expression is lazy and creates objects only as required; this can improve performance.
We have to provide two important caveats on generator expressions, as follows:
Generators share some behaviors with lists. This means we can apply functions like sorted() and iter() to either generators or lists. An exception is the len() function, which needs to know the size of the collection and won’t work for a generator.
Generators can be used only once. After that, they appear empty.
A generator function is a function that has a yield expression in it. This makes the function act as an iterator. Each individual yield value must be individually consumed by a client function. For a tutorial introduction, see https://wiki.python.org/moin/Generators. Also, see https://docs.python.org/3/howto/functional.html#generator-expressions-and-list-comprehensions.
We can use something like the following to create a sequence of possible prime numbers:
from collections.abc import Iterator

def candidates() -> Iterator[int]:
    for i in range(2, 1024):
        yield m1f(i)
This function can yield as many as 1,022 result values, one for each number in range(2, 1024). It doesn’t compute them eagerly, however. It is lazy, and computes values as they’re requested. The built-in next() function is one way to consume values. Here’s an example of consuming values from a generator function:
>>> c = candidates()
>>> next(c)
3
>>> next(c)
7
>>> next(c)
15
>>> next(c)
31
When the candidates() function is evaluated, it creates a generator object, which is saved in the variable c. Each time we use next(c), the generator function computes one more value and yields it. In this example, it will get a new value from the range object, and evaluate the m1f() function to compute a new value.
The yield from expression extends the yield expression. This will consume values from some iterator, yielding each of the values it consumes. As a small example, consider the following function:
from collections.abc import Iterator

def bunch_of_numbers() -> Iterator[int]:
    for i in range(5):
        yield from range(i)
Each time a value is requested, it will be produced by the yield from expression nested inside the for statement. This will yield i distinct values, one from each request. Since i is set by the containing for statement, this will be used to produce ever longer sequences of numbers.
Here’s what the result looks like:
>>> list(bunch_of_numbers())
[0, 0, 1, 0, 1, 2, 0, 1, 2, 3]
Here is a generator function that we’ll use for some more examples:
from collections.abc import Iterator
import math

def pfactorsl(x: int) -> Iterator[int]:
    if x % 2 == 0:
        yield 2
        if x // 2 > 1:
            yield from pfactorsl(x // 2)
        return
    for i in range(3, int(math.sqrt(x) + .5) + 1, 2):
        if x % i == 0:
            yield i
            if x // i > 1:
                yield from pfactorsl(x // i)
            return
    yield x
We’re locating the prime factors of a number. If the number, x, is even, we’ll yield 2 and then recursively yield all prime factors of x // 2.
For odd numbers, we’ll step through odd values greater than or equal to 3 to locate a candidate factor of the number. When we locate a factor, i, we’ll yield that factor, and then recursively yield all prime factors of x // i.
In the event that we can’t locate a factor, the number, x, must be prime, so we can yield the number.
We handle 2 as a special case to cut the number of iterations in half. All prime numbers, except 2, are odd.
We’ve used one important for statement in addition to recursion. This is an optimization, and it’s a teaser for the content of Chapter 6, Recursions and Reductions. This optimization allows us to easily handle numbers that have as many as 1,000 factors. (As an example, 2^1,000, a number with about 300 digits, will have 1,000 factors.) Since the for variable, i, is not used outside the indented body of the statement, the stateful nature of the i variable won’t lead to confusion if we make any changes to the body of the for statement.
Because the function, as a whole, is a generator, the yield from statement is used to consume values from the recursive call and yield them to the caller. It provides an iterable sequence of values as a result.
In a recursive generator function, be careful of the return statement. Do not use the following statement: return recursive_iter(args). It returns only a generator object; it doesn’t evaluate the recursive_iter() function to return the generated values. Use either of the following alternatives:
A yield expression:

for result in recursive_iter(args):
    yield result
A yield from expression:

yield from recursive_iter(args)
A function that implements the Iterator protocol is often described as being a generator function. There is a separate Generator protocol, which extends the essential Iterator definition. We often find that functional Python programs can be structured around the generator expression construct. This tends to focus the design effort on functions and stateless objects.
We noted that there are some limitations of generator expressions and generator functions. The limitations can be observed by executing the following command snippet:
>>> pfactorsl(1560)
<generator object pfactorsl at ...>
>>> list(pfactorsl(1560))
[2, 2, 2, 3, 5, 13]
>>> len(pfactorsl(1560))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: object of type 'generator' has no len()
In the first example, we saw that the generator function, pfactorsl(), created a generator. The generator is lazy, and doesn’t have a proper value until we consume the results yielded by the generator. By itself, this isn’t a limitation; lazy evaluation is an important reason why generator expressions fit with functional programming in Python.

In the second example, we materialized a list object from the results yielded by the generator function. This is handy for seeing the output and writing unit test cases.

In the third example, we saw one limitation of generator functions: there’s no len(). Because the generator is lazy, the size can’t be known until all of the values are consumed.
The other limitation of a generator object is that they can only be used once. For example, look at the following command snippet:
>>> result = pfactorsl(1560)
>>> sum(result)
27
>>> sum(result)
0
The first evaluation of the sum() function performed evaluation of the generator object, result. All of the values were consumed. The second evaluation of the sum() function found that the generator object was now empty. We can only consume the values of a generator object once.
The generator function, pfactorsl(), can produce an indefinite number of generator objects. In many cases, we’ll define generator functions that consume the results yielded by other generators. In these cases, we may not be able to trivially create generators, but must create a whole pipeline of generators.
Generators have a stateful life in Python. While they’re very nice for some aspects of functional programming, they’re not quite perfect.
We can try to use the itertools.tee() function to overcome the once-only limitation. We’ll look at this in depth in Chapter 8, The Itertools Module. It’s not a great idea because it can consume a great deal of memory.
Here is a quick example of its usage:
import itertools
from typing import Any
from collections.abc import Iterable

def limits(iterable: Iterable[Any]) -> Any:
    max_tee, min_tee = itertools.tee(iterable, 2)
    return max(max_tee), min(min_tee)
We created two clones of the parameter generator expression, max_tee and min_tee. We can consume these two clones to get maximum and minimum values from the iterable. Interestingly, because the two clones are used serially, this leads to consuming a great deal of memory to cache items. This specific example often works out better using a list object instead of using tee() to clone an iterator.
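The serial consumption is easy to see when limits() is applied to a one-shot generator expression (the definition is repeated so the snippet stands alone):

```python
import itertools
from typing import Any
from collections.abc import Iterable

def limits(iterable: Iterable[Any]) -> Any:
    max_tee, min_tee = itertools.tee(iterable, 2)
    return max(max_tee), min(min_tee)

# The source generator is consumed once, yet both reductions are computed;
# tee() buffers every item for the second clone while max() drains the first.
print(limits(n**2 % 7 for n in range(10)))  # (4, 0)
```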
Once consumed, a generator object will not provide any more values. When we want to compute multiple kinds of reductions—for example, sums and counts, or minimums and maximums—we need to design with this one-pass-only limitation in mind.
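One sketch of such a design is to fold several summaries together in a single pass rather than consuming the generator twice; the function name here is our own:

```python
from collections.abc import Iterable

def sum_count(values: Iterable[float]) -> tuple[float, int]:
    # Both reductions are accumulated in one traversal of the iterable.
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total, count

total, count = sum_count(x * 0.5 for x in range(5))
print(total, count)  # 5.0 5
```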
The essence of functional programming comes from the ways we can easily combine generator expressions and generator functions to create very sophisticated composite processing sequences. When working with generator expressions, we can combine generators in several ways.
One common way to combine generator functions is when we create a composite function. We may have a generator that computes (f(x) for x in some_iterable). If we want to compute g(f(x)), we have several ways to combine two generators.
We can tweak the original generator expression as follows:
g_f_x = (g(f(x)) for x in some_iterable)
While technically correct, this defeats any idea of reuse. Rather than reusing an expression, we rewrote it.
We can also substitute one expression within another expression, as follows:
g_f_x = (g(y) for y in (f(x) for x in some_iterable))
This has the advantage of allowing us to use simple substitution. We can revise this slightly to emphasize reuse, using the following commands:
f_x = (f(x) for x in some_iterable)
g_f_x = (g(y) for y in f_x)
This has the advantage of leaving the initial expression, (f(x) for x in some_iterable), essentially unchanged. All we did was assign the expression to a variable without altering the syntax.
The resulting composite function is also a generator expression, which is also lazy. This means that extracting the next value from g_f_x will extract one value from f_x, which will extract one value from the source some_iterable object.
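To make the laziness concrete, here is the composition with two hypothetical functions, f() and g():

```python
def f(x: int) -> int:
    return x + 1

def g(y: int) -> int:
    return y * y

some_iterable = range(4)
f_x = (f(x) for x in some_iterable)
g_f_x = (g(y) for y in f_x)

# No values are computed until the composite generator is consumed:
print(list(g_f_x))  # [1, 4, 9, 16]
```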
One of the tasks that arise in exploratory data analysis is cleaning up raw source data. This is often done as a composite operation applying several scalar functions to each piece of input data to create a usable dataset.
Let’s look at a simplified set of data. This data is commonly used to show techniques in exploratory data analysis. It’s called Anscombe’s quartet, and it comes from the article Graphs in Statistical Analysis, by F. J. Anscombe, that appeared in American Statistician in 1973. The following are the first few rows of a downloaded file with this dataset:
Anscombe’s quartet
I II III IV
x y x y x y x y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
Since the data is properly tab-delimited, we can use the csv.reader() function to iterate through the various rows. Sadly, we can’t trivially process this with the csv module. We have to do a little bit of parsing to extract the useful information from this file. We can define a function to iterate over the raw data as follows:
import csv
from typing import TextIO
from collections.abc import Iterator

def row_iter(source: TextIO) -> Iterator[list[str]]:
    return csv.reader(source, delimiter="\t")
We wrapped a file in the csv.reader() function to create an iterator over the rows of raw data. The typing module provides a handy definition, TextIO, for file objects that read (or write) string values. Each row is a list of text values. It can be helpful to define an additional type, Row = list[str], to make this more explicit.
We can use this row_iter() function in the following context:
>>> from pathlib import Path
>>> source_path = Path("Anscombe.txt")
>>> with source_path.open() as source:
... print(list(row_iter(source)))
While this will display useful information, the problem is the first three items in the resulting iterable aren’t data. The Anscombe’s quartet file starts with the following header rows:

[["Anscombe's quartet"],
['I', 'II', 'III', 'IV'],
['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y'],
We need to filter these three non-data rows from the iterable. There are several possible approaches. Here is a function that will excise three expected title rows, validate they are the expected headers, and return an iterator over the remaining rows:
from collections.abc import Iterator

def head_split_fixed(
        row_iter: Iterator[list[str]]
) -> Iterator[list[str]]:
    title = next(row_iter)
    assert (len(title) == 1
            and title[0] == "Anscombe's quartet")
    heading = next(row_iter)
    assert (len(heading) == 4
            and heading == ['I', 'II', 'III', 'IV'])
    columns = next(row_iter)
    assert (len(columns) == 8
            and columns == ['x', 'y', 'x', 'y', 'x', 'y', 'x', 'y'])
    return row_iter
This function plucks three rows from the source data, an iterator. It asserts that each row has an expected value. If the file doesn’t meet these basic expectations, it’s a sign that the file was damaged or perhaps our analysis is focused on the wrong file.
Since both the row_iter() and the head_split_fixed() functions expect an iterator as an argument value, they can be combined, as follows:
from pathlib import Path
from collections.abc import Iterator

def get_rows(path: Path) -> Iterator[list[str]]:
    with path.open() as source:
        yield from head_split_fixed(row_iter(source))
We’ve applied one iterator to the results of another iterator. In effect, this defines a composite function. We’re not done, of course; we still need to convert the string values to the float values, and we also need to pick apart the four parallel series of data in each row.
The final conversions and data extractions are more easily done with higher-order functions, such as map() and filter(). We'll return to those in Chapter 5, Higher-Order Functions.
We’ll now look at ways to apply generator expressions to a number of Python’s built-in collections. This section will cover the following topics:
Generators for lists, dicts, and sets
Working with stateful collections
Using the bisect module to create a mapping
Using stateful sets
Each of these looks at some specialized cases of Python collections and generator functions. In particular, we’ll look at ways to produce a collection, and consume the collection in later processing.
This is a lead-in to the next chapter, Chapter 4, Working with Collections, which covers the Python collections in considerably more detail.
A Python sequence object, such as a list, is iterable. However, it has some additional features. We can think of a list as a materialized iterable. We've used the tuple() function in several examples to collect the output of a generator expression or generator function into a single tuple object. We can use the list() function to materialize a sequence to create a list object.

In Python, a list display, or list comprehension, offers simple syntax to materialize a generator: we add the [] brackets. This is ubiquitous to the point where the distinction between generator expression and list comprehension can be lost. We need to disentangle the idea of a generator expression from a list display that uses a generator expression.
The following is an example to enumerate the cases:
>>> range(10)
range(0, 10)
>>> [range(10)]
[range(0, 10)]
>>> [x for x in range(10)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
The first example is the range object, which is lazy like a generator: it doesn't produce any values until they are consumed.
The second example shows a list composed of a single instance of the generator function. The [] syntax creates a list literal holding the range() object without consuming any values created by the iterator.
The third example shows a list comprehension built from a generator expression that includes a generator function. The function, range(10), is evaluated by a generator expression, x for x in range(10). The resulting values are collected into a list object.
We can also use the list() function to build a list from an iterable or a generator expression. This also works for set(), tuple(), and dict().
The list(range(10)) function evaluates the generator object. The [range(10)] list literal does not evaluate the range(10) generator object.
While there's shorthand syntax for list, dict, and set using [] and {}, there's no shorthand syntax for a tuple. To materialize a tuple, we must use the tuple() function. For this reason, it often seems most consistent to use the list(), tuple(), and set() functions as the preferred syntax.
In the data cleansing code in the previous section, we used a composite function to create a list of rows, each holding four (x, y) pairs as strings. Using the function looks as follows:
>>> data = list(get_rows(Path("Anscombe.txt")))
>>> data[0]
['10.0', '8.04', '10.0', '9.14', '10.0', '7.46', '8.0', '6.58']
>>> data[1]
['8.0', '6.95', '8.0', '8.14', '8.0', '6.77', '8.0', '5.76']
>>> data[-1]
['5.0', '5.68', '5.0', '4.74', '5.0', '5.73', '8.0', '6.89']
We assigned the results of the get_rows() composite function to a name, data. Each of these rows is a collection of four (x, y) pairs.
To extract one of the (x, y) pairs, we'll need to do a little bit more processing to make this useful. First, we need to pick pairs of columns from the eight-tuple. Since the pairs are always adjacent, we can select a pair of columns with a slice operation of the form row[2 * n: 2 * n + 2]. The idea is that pair n is in positions 2 × n and 2 × n + 1. The slice expression 2 * n: 2 * n + 2 includes the start element, 2 * n, and stops just before the stop element, 2 * n + 2. We can wrap this with a reusable function, as shown in the following definition:
from typing import cast, TypeVar
from collections.abc import Iterator, Iterable

SrcT = TypeVar("SrcT")

def series(
    n: int,
    row_iter: Iterable[list[SrcT]]
) -> Iterator[tuple[SrcT, SrcT]]:
    for row in row_iter:
        yield cast(tuple[SrcT, SrcT], tuple(row[n * 2: n * 2 + 2]))
This function picks two adjacent columns based on a number between 0 and 3. It creates a tuple object from those two columns. The cast() function is a type hint to inform the mypy tool that the result will be a two-tuple where both items are strings. This is required because it's difficult for the mypy tool to determine that the expression tuple(row[n * 2: n * 2 + 2]) will select exactly two elements from the row collection.
This example uses a type variable, SrcT, to make a deeper claim about the transformation. Specifically, the type variable tells people reading the code (and tools like mypy) that the input object types will be the resulting object types. If the source is, for example, an iterable of lists of str, then SrcT = str, and the output will be an iterator over tuples with two str values.
We can use the series() function to extract collections from the source file as follows:
>>> from pathlib import Path
>>> source_path = Path("Anscombe.txt")
>>> with source_path.open() as source:
... data = tuple(head_split_fixed(row_iter(source)))
>>> series_I = tuple(series(0, data))
>>> series_II = tuple(series(1, data))
>>> series_III = tuple(series(2, data))
>>> series_IV = tuple(series(3, data))
We applied the tuple() function to a composite function based on the series(), head_split_fixed(), and row_iter() functions. Each of these expressions will create an object that we can reuse in several other functions. We can then do analysis on subsets of the source data.
The series_I sequence looks as follows:
>>> series_I
(('10.0', '8.04'), ('8.0', '6.95'), ... ('5.0', '5.68'))
The other three sequences are similar in structure. The values, however, are quite different.
The final thing we'll need to do is create proper numeric values from the strings that we've accumulated so that we can compute some statistical summary values. We can apply the float() function conversion as the last step. There are many alternative places to apply the float() function, and we'll look at some choices in Chapter 5, Higher-Order Functions.
To reduce memory use and increase performance, we prefer to use generator expressions and functions as much as possible. These iterate through collections in a lazy manner, computing values only when required. Since iterators can only be used once, we’re sometimes forced to materialize a collection as a tuple (or list) object. Materializing a collection costs memory and time, so we do it reluctantly.
Programmers familiar with Clojure can match Python's lazy generators with the lazy-seq and lazy-cat functions. The idea is that we can specify a potentially infinite sequence, but only take values from it as needed.
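The same idea can be sketched in Python with itertools.count(): the generator below describes an infinite sequence of squares, but only the requested values are ever computed.

```python
from itertools import count, islice

# A potentially infinite sequence of squares; nothing is computed yet.
squares = (n * n for n in count())

# Take only the first five values; the rest are never produced.
print(list(islice(squares, 5)))  # [0, 1, 4, 9, 16]
```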
Python offers several stateful collections; the various mappings include the dict class and a number of related mappings defined in the collections module. We need to emphasize the stateful nature of these mappings and use them carefully.

For our purposes, learning functional programming techniques in Python, there are two use cases for mapping: a stateful dictionary that accumulates a mapping and a frozen dictionary that can't be updated. Python doesn't provide an easy-to-use definition of an immutable mapping. We can use the abstract base class Mapping from the collections.abc module. We can also create an immutable MappingProxyType object from a mutable mapping. For more information, see the types module.
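For example (the color values here are hypothetical), types.MappingProxyType wraps an ordinary dict in a read-only view:

```python
from types import MappingProxyType

base = {"red": (255, 0, 0), "green": (0, 255, 0)}
frozen = MappingProxyType(base)

print(frozen["red"])  # (255, 0, 0)

# Attempting to update the proxy raises a TypeError.
try:
    frozen["blue"] = (0, 0, 255)  # type: ignore[index]
except TypeError:
    print("mappingproxy is read-only")
```

Note that the proxy is a view: changes to the underlying base dict remain visible through it, so the base should be discarded or kept private.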
The stateful dictionary can be further decomposed into the following two typical use cases:

A dictionary built once and never updated. In this case, we want to exploit the hashed keys feature of the dict class to optimize performance. We can create a dictionary from any iterable sequence of (key, value) two-tuples using the expression dict(sequence).

A dictionary built incrementally. This is an optimization we can use to avoid materializing and sorting a list object. We'll look at this in Chapter 6, Recursions and Reductions, where we'll look at the collections.Counter class as a sophisticated reduction. Incremental building is particularly helpful for memoization. We'll defer memoization until Chapter 10, The Functools Module.
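As a brief preview of the incremental case (Chapter 6 covers this in detail), collections.Counter builds a frequency mapping directly from an iterable, with no intermediate sorted list; the sample data here is made up:

```python
from collections import Counter

# Build a frequency mapping incrementally from raw observations.
votes = Counter(["red", "green", "red", "blue", "red"])

print(votes["red"])          # 3
print(votes.most_common(1))  # [('red', 3)]
```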
The first example, building a dictionary once, stems from an application with three operating phases: gather some input, create a dict object, and then process input based on the mappings in the dictionary. As an example of this kind of application, we may be doing some image processing and have a specific palette of colors, represented by names and (R, G, B) tuples. If we use the GNU Image Manipulation Program (GIMP) file format, the color palette may look like the following file snippet:
GIMP Palette
Name: Small
Columns: 3
#
0 0 0 Black
255 255 255 White
238 32 77 Red
28 172 120 Green
31 117 254 Blue
The details of parsing this file are the subject of Chapter 6, Recursions and Reductions. What’s important is the results of the parsing.
First, we'll define a typing.NamedTuple class, Color, as follows:
from typing import NamedTuple

class Color(NamedTuple):
    red: int
    green: int
    blue: int
    name: str
Second, we'll assume that we have a parser that produces an iterable of Color objects. If we materialized it as a list, it would look like the following:
>>> palette = [
...     Color(red=239, green=222, blue=205, name='Almond'),
...     Color(red=205, green=149, blue=117, name='Antique Brass'),
...     Color(red=253, green=217, blue=181, name='Apricot'),
...     Color(red=197, green=227, blue=132, name='Yellow Green'),
...     Color(red=255, green=174, blue=66, name='Yellow Orange')
... ]
To locate a given color name quickly, we will create a frozen dictionary from this sequence. This is not the only way to get fast lookups of a color by name. We’ll look at another option later.
To create a mapping from an iterable sequence of tuples, we will use the process(wrap(iterable)) design pattern. The following command shows how we can create the color name mapping:
>>> name_map = dict((c.name, c) for c in palette)
Here are the three parts of the design pattern:

The source iterable is the palette. We could formalize this with the hint Iterable[Color].

The wrap is the (c.name, c) expression to transform a Color object into a tuple[str, Color] pair.

The process is the dict() function to create a mapping.
The resulting dictionary looks as follows:
>>> name_map['Antique Brass']
Color(red=205, green=149, blue=117, name='Antique Brass')
>>> name_map['Yellow Orange']
Color(red=255, green=174, blue=66, name='Yellow Orange')
This can also be done using a dictionary comprehension. We leave that as an exercise for the reader.
Now that we've materialized the mapping, we can use this dict() object in some later processing for repeated transformations from color name to (R, G, B) color numbers. The lookup will be blazingly fast because a dictionary does a rapid transformation from key to hash value, followed by a lookup in the dictionary.
In the previous example, we created a dict object to achieve a fast mapping from a color name to a Color object. This isn't the only choice; we can use the bisect module instead. Using the bisect module means that we have to create a sorted sequence, which we can then search. To be perfectly compatible with the dictionary implementation, we can use collections.abc.Mapping as the base class.
The dict class uses a hash computation to locate items almost immediately. However, this requires allocating a fairly large block of memory. The bisect mapping does a search, which doesn't require as much memory, but performance cannot be described as immediate. The performance drops from O(1) to O(log n). While this is dramatic, the savings in memory can be critical for processing large collections of data.
A static mapping class looks like the following code snippet:
import bisect
from collections.abc import Iterable, Iterator, Mapping

class StaticMapping(Mapping[str, Color]):
    def __init__(
        self,
        iterable: Iterable[tuple[str, Color]]
    ) -> None:
        # Sort the pairs by key so the bisect search and the
        # parallel data tuple stay aligned.
        self._data: tuple[tuple[str, Color], ...] = tuple(
            sorted(iterable, key=lambda pair: pair[0])
        )
        self._keys: tuple[str, ...] = tuple(
            key for key, _ in self._data
        )

    def __getitem__(self, key: str) -> Color:
        ix = bisect.bisect_left(self._keys, key)
        if ix != len(self._keys) and self._keys[ix] == key:
            return self._data[ix][1]
        # Mapping subclasses must raise KeyError for missing keys.
        raise KeyError(key)

    def __iter__(self) -> Iterator[str]:
        return iter(self._keys)

    def __len__(self) -> int:
        return len(self._keys)
This class extends the abstract superclass collections.abc.Mapping. It provides an initialization and implementations for the three methods missing from the abstract definition. The type tuple[str, Color] defines the specific kind of two-tuple expected by this mapping.
The __getitem__() method uses the bisect.bisect_left() function to search the collection of keys. If the key is found, the appropriate value is returned. The __iter__() method returns an iterator, as required by the superclass. The __len__() method, similarly, provides the required length of the collection.
This class may not seem to embody too many functional programming principles. Our goal here is to support a larger application that minimizes the use of stateful variables. This class saves a static collection of key-value pairs. As an optimization, it materializes two objects.
An application would create an instance of this class to perform relatively rapid lookups of values associated with keys. The superclass does not support updates to the object. The collection, as a whole, is stateless. It's not as fast as the built-in dict class, but it uses less memory and, through the formality of being a subclass of the Mapping class, we can be assured that this object is not used to contain a processing state.
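The underlying technique can be sketched without the class machinery. The names and color values here are hypothetical; the point is the sorted keys, the parallel data, and the bisect_left() search:

```python
import bisect

# Pairs sorted by key, with a parallel list of keys for searching.
pairs = sorted([("Apricot", (253, 217, 181)), ("Almond", (239, 222, 205))])
keys = [name for name, _ in pairs]

def lookup(key: str) -> tuple[int, int, int]:
    # bisect_left finds the insertion point; a match means the key exists.
    ix = bisect.bisect_left(keys, key)
    if ix != len(keys) and keys[ix] == key:
        return pairs[ix][1]
    raise KeyError(key)

print(lookup("Almond"))  # (239, 222, 205)
```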
Python offers several stateful collections, including the set collection. For our purposes, there are two use cases for a set:

A stateful set that accumulates items

A frozenset, which can be used to optimize searches for an item
We can create a frozenset from an iterable in the same way we create a tuple object from an iterable, via a frozenset(some_iterable) expression; this will create a structure that has the advantage of a very fast in operator. This can be used in an application that gathers data, creates a set, and then uses a frozenset to match other data items against the set.
We may have a set of colors that we will use as a kind of chroma-key: we will use this color to create a mask that will be used to combine two images. Pragmatically, a single color isn't appropriate, but a small set of very similar colors works best. In this case, we can examine each pixel of an image file to see if the pixel is in the chroma-key set or not. For this kind of processing, the chroma-key colors can be loaded into a frozenset before processing the target images. The set lookup is extremely fast.
As with mappings, specifically the Counter class, there are some algorithms that can benefit from a memoized set of values. Some functions benefit from memoization because a function is a mapping between domain values and range values, a job where a mapping works well. A few algorithms benefit from a memoized set, which is stateful and grows as data is processed.
We’ll return to memoization in Chapter 10, The Functools Module.
In this chapter, we looked again at writing pure functions free of side effects. We looked at generator functions and how we can use these as the backbone of functional programming to process collections of items. We also examined a number of the built-in collection classes to show how they’re used in the functional paradigm. While the general idea behind functional programming is to limit the use of stateful variables, the collection objects have a stateful implementation. For many algorithms, they’re often essential. Our goal is to be judicious in our use of Python’s non-functional features.
In the next two chapters, we'll look at functions for processing collections. After that, we'll look closely at higher-order functions: functions that accept functions as arguments as well as returning functions. In later chapters, we'll look at techniques for defining our own higher-order functions. We'll also look at the itertools and functools modules and their higher-order functions in later chapters.
This chapter’s exercises are based on code available from Packt Publishing on GitHub. See https://github.com/PacktPublishing/Functional-Python-Programming-3rd-Edition.
In some cases, the reader will notice that the code provided on GitHub includes partial solutions to some of the exercises. These serve as hints, allowing the reader to explore alternative solutions.
In many cases, exercises will need unit test cases to confirm they actually solve the problem. These are often identical to the unit test cases already provided in the GitHub repository. The reader should replace the book’s example function name with their own solution to confirm that it works.
In the Writing pure functions section, a function was shown that relied on a global variable.
Create a small application that sets the global variable and calls the function. The application can expand on the following example:
def some_function ...

def main():
    """
    >>> main()
    some_function(2, 3, 5)=30
    some_function(2, 3, 5)=34
    """
    global global_adjustment
    global_adjustment = 13
    print(f"{some_function(2, 3, 5)=}")
    global_adjustment = 17
    print(f"{some_function(2, 3, 5)=}")

if __name__ == "__main__":
    main()
First, create a test suite for some_function() and main(). A doctest suite embedded in a docstring is shown in the example.
Second, rewrite some_function() to make global_adjustment into a parameter. This will lead to revising main() and all of the test cases.
The examples in the Functions as first-class objects section show a Mersenne1 class that accepts a function as a parameter to the __init__() method.
An alternative is to provide a plug-in Strategy function as part of the class definition.
This would permit the following kinds of object definitions:
>>> class ShiftyMersenne(Mersenne2):
... pow2 = staticmethod(shifty)
>>> m2s = ShiftyMersenne()
>>> m2s(17)
131071
The use of staticmethod() is essential because the shifty() function does not expect the self argument when it is evaluated. We must make sure this function is understood as being "static", that is, not using a self parameter.
Consider the following algorithm:
As we’ve seen in this chapter, there are three ways to write this in Python:
As a for statement that updates stateful variables
As a generator expression
As a map() operation to apply the function
Write all three versions in Python.
A test case is the following data:
And the following scaling function:
The computed value for m is approximately zero.
The built-in map() and filter() functions always have an equivalent generator expression. In order to have consistent-looking code, a project team is wrestling with the idea of insisting that all code use generator expressions, avoiding the built-in map() and filter() functions.
Take the side of only using generator expressions and provide reasons for which this is advantageous.
Take the side of only using the built-in map() and filter() functions and provide reasons why this alternative might be advantageous.
In looking at the reasons for the first two parts of this exercise, is there a clearly articulated decision on which approach is better? If not, why not? If so, what rule should the team use?
In the Using stateful mappings section, we built a mapping from a list of two-tuples. We can also build a mapping using a dictionary comprehension. Rewrite the expression dict((c.name, c) for c in palette) as a dictionary comprehension.
A file, Anscombe.txt, is almost a valid CSV format file. The problem is there are three lines of useless text at the beginning. The lines are easy to recognize because applying the float() function to the values in those header rows will raise a ValueError exception.
Some team members suggest using a regular expression to examine the values to see if they are valid numbers. This can be called Look Before You Leap (LBYL):
Other team members suggest using a simpler try: statement to uncover the invalid non-numeric headers and discard them. This can be called Easier to Ask Forgiveness Than Permission (EAFP):
Both algorithms work. It’s instructive to implement each one in Python to compare them. Here are a few starting points for comparing the algorithms:
The LBYL variant can rely entirely on generator expressions. However, it requires writing a regular expression that recognizes all possible floating-point values. Is this a responsibility that should be part of this application?
The EAFP variant needs a separate function to implement the try: statement processing. Otherwise, it seems amenable to also being written via generator expressions or the map() function.
After building the two variants, which seems to be more expressive of the purpose of filtering and acquiring data?
Join our Python Discord workspace to discuss and know more about the book: https://packt.link/dHrHU