Partitioning an iterator with groupby()

We can use the groupby() function to partition an iterator into smaller iterators. This works by evaluating the given key() function for each item in the given iterable. If the key value matches the previous item's key, the two items are part of the same partition. If the key does not match the previous item's key, the previous partition is ended and a new partition is started.
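As a minimal sketch of this partitioning rule, the following example groups a short list of integers by their tens digit. The sample data and the key lambda are hypothetical and chosen only for illustration; the data is already ordered so that equal keys are adjacent.

```python
from itertools import groupby

# Hypothetical sample data, already ordered by tens digit.
data = [11, 14, 17, 21, 25, 32, 38, 39]

# The key function maps each item to its tens digit. A new partition
# starts whenever the key differs from the previous item's key.
partitions = [
    (key, list(items))
    for key, items in groupby(data, key=lambda x: x // 10)
]
print(partitions)
# [(1, [11, 14, 17]), (2, [21, 25]), (3, [32, 38, 39])]
```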

The output from the groupby() function is a sequence of two-tuples. Each tuple has the group's key value and an iterator over the items in the group. Each group's items can be preserved by collecting them into a tuple, or processed to reduce them to some summary value. The group iterators themselves can't be preserved: they all share the underlying iterable, so advancing groupby() to the next group invalidates the previous group's iterator.
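The difference between saving a group's items and saving its iterator can be sketched as follows. The word list is a hypothetical sample. The names kept and stale are used only for this illustration.

```python
from itertools import groupby

# Hypothetical sample data, already ordered by first letter.
words = ["apple", "avocado", "banana", "blueberry", "cherry"]

def first_letter(word: str) -> str:
    return word[0]

# Materialize each group into a tuple while it is still the current group.
kept = [(k, tuple(g)) for k, g in groupby(words, key=first_letter)]

# Save the raw group iterators and read them only after groupby() has
# moved on. The shared source has already been consumed, so the saved
# iterators no longer produce their groups' full contents.
saved = [(k, g) for k, g in groupby(words, key=first_letter)]
stale = [(k, tuple(g)) for k, g in saved]

print(kept)
print(stale)
```

The first result holds every word under its key; the second does not, which is why each group should be consumed or copied before moving to the next one.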

In the Running totals with accumulate() section, earlier in the chapter, we showed how to compute quartile values for an input sequence.

Given the trip variable with the raw data and the quartile variable with the quartile assignments, we can group the data using the following commands:

from itertools import groupby

# Pair each quartile number with its raw value, then group by the quartile.
groups = groupby(
    zip(quartile, trip),
    key=lambda q_raw: q_raw[0])
for group_key, group_iter in groups:
    print(group_key, tuple(group_iter))

This starts by zipping the quartile numbers with the raw trip data, producing an iterator over two-tuples. The groupby() function uses the given lambda to group by the quartile number. We used a for loop to examine the results of the groupby() function. This shows how we get a group key value and an iterator over members of the group.

The input to the groupby() function must be sorted by the key values. This ensures that all of the items in a group are adjacent.
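A small sketch of why the sort matters, using hypothetical (quartile, value) pairs that are deliberately out of order:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical (quartile, value) pairs, not sorted by quartile.
pairs = [(1, 2.5), (2, 7.1), (1, 3.0), (2, 6.4)]
by_quartile = itemgetter(0)

# Without sorting, each run of adjacent equal keys becomes its own group,
# so the same key can show up more than once.
unsorted_keys = [k for k, _ in groupby(pairs, key=by_quartile)]
print(unsorted_keys)  # [1, 2, 1, 2]

# Sorting by the same key first makes each key's items adjacent.
sorted_keys = [
    k for k, _ in groupby(sorted(pairs, key=by_quartile), key=by_quartile)
]
print(sorted_keys)  # [1, 2]
```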

Note that we can also create groups using a defaultdict(list) object, as follows:

from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple, TypeVar

D_ = TypeVar("D_")
K_ = TypeVar("K_")
def groupby_2(
        iterable: Iterable[D_],
        key: Callable[[D_], K_]
    ) -> Iterator[Tuple[K_, Iterator[D_]]]:
    groups: Dict[K_, List[D_]] = defaultdict(list)
    for item in iterable:
        groups[key(item)].append(item)
    for g in groups:
        yield g, iter(groups[g])

We created a defaultdict object that uses list as its default factory, so each new key starts with an empty list. The type hints clarify the relationship between the key() function, which emits objects of some arbitrary type associated with the type variable K_, and the dictionary, which uses the same type, K_, for the keys.

Each item will have the given key() function applied to create a key value. The item is appended to the list that the defaultdict object holds for that key.

Once all of the items are partitioned, we can then return each partition as an iterator over the items that share a common key. This is similar to the groupby() function, except that the input iterator isn't required to be sorted by key. Given the same data, the groups will have the same members, but the order of the groups may differ: with dictionaries that preserve insertion order (Python 3.7 and later), the groups are emitted in the order in which each key first appears.

The type hints clarify that the source is some arbitrary type, associated with the variable D_. The result will be an iterator that includes iterators of the type D_. This makes a strong statement that no transformation is happening: the range type matches the input domain type.
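The following sketch compares the two approaches on unsorted input. It inlines the same defaultdict(list) technique rather than calling groupby_2(), so that it runs on its own; the sample pairs are hypothetical.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Hypothetical (key, label) pairs, not sorted by key.
pairs = [(2, "b1"), (1, "a1"), (2, "b2"), (1, "a2")]
by_key = itemgetter(0)

# Dictionary-based grouping: no sort needed, keys appear in first-seen order.
groups = defaultdict(list)
for item in pairs:
    groups[by_key(item)].append(item)
dict_based = dict(groups)

# groupby()-based grouping: sort first, so keys appear in sorted order.
sorted_based = {
    k: list(g) for k, g in groupby(sorted(pairs, key=by_key), key=by_key)
}

print(list(dict_based))            # [2, 1]
print(list(sorted_based))          # [1, 2]
print(dict_based == sorted_based)  # True: same members in every group
```

The dictionary approach holds every item in memory before yielding the first group, while groupby() works lazily on input that is already sorted; which trade-off is preferable depends on the data.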
