Using the groupby() and reduce() functions

A common requirement is to summarize data after partitioning it into groups. We can use a defaultdict(list) object to partition data, then analyze each partition separately. In Chapter 4, Working with Collections, we looked at some ways to group and partition. In Chapter 8, The Itertools Module, we looked at others.

The following is some sample data that we need to analyze:

>>> data = [('4', 6.1), ('1', 4.0), ('2', 8.3), ('2', 6.5),
... ('1', 4.6), ('2', 6.8), ('3', 9.3), ('2', 7.8),
... ('2', 9.2), ('4', 5.6), ('3', 10.5), ('1', 5.8),
... ('4', 3.8), ('3', 8.1), ('3', 8.0), ('1', 6.9),
... ('3', 6.9), ('4', 6.2), ('1', 5.4), ('4', 5.8)]

We've got a sequence of raw data values with a key and a measurement for each key.

One way to produce usable groups from this data is to build a dictionary that maps each key to a list of the members of that group, as follows:

from collections import defaultdict
from typing import (
    Iterable, Callable, Dict, List, TypeVar,
    Iterator, Tuple, cast)

D_ = TypeVar("D_")
K_ = TypeVar("K_")

def partition(
        source: Iterable[D_],
        key: Callable[[D_], K_] = lambda x: cast(K_, x)
) -> Iterable[Tuple[K_, Iterator[D_]]]:
    pd: Dict[K_, List[D_]] = defaultdict(list)
    for item in source:
        pd[key(item)].append(item)
    for k in sorted(pd):
        yield k, iter(pd[k])

This will separate each item in the iterable into a group based on the key. The iterable source of data is described using a type variable of D_, representing the type of each data item. The key() function is used to extract a key value from each item. This function produces an object of some type, K_, that is generally distinct from the original data item type, D_. When looking at the sample data, the type of each data item is a tuple. The keys are of type str. The callable function for extracting a key transforms a tuple into a string.

This key() value extracted from each data item is used to append each item to a list in the pd dictionary. The defaultdict object is defined as mapping each key, K_, to a list of the data items, List[D_].

The result of this function matches the results of the itertools.groupby() function. It's an iterable sequence of (group key, iterator) two-tuples. The group key will be of the type produced by the key() function. The iterator will provide a sequence of the original data items.

Following is the same feature defined with the itertools.groupby() function:

from itertools import groupby

def partition_s(
        source: Iterable[D_],
        key: Callable[[D_], K_] = lambda x: cast(K_, x)
) -> Iterable[Tuple[K_, Iterator[D_]]]:
    return groupby(sorted(source, key=key), key)

The important difference in the inputs to each function is that the groupby() function version requires data to be sorted by the key, whereas the defaultdict version doesn't require sorting. For very large sets of data, the sort can be expensive, measured in both time and storage.
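To see why the groupby() version needs sorted input, consider a minimal sketch using a few rows of the sample data: itertools.groupby() only merges adjacent items with equal keys, so an unsorted sequence can yield the same key more than once.

```python
from itertools import groupby

rows = [('2', 8.3), ('1', 4.0), ('2', 6.5)]

# Unsorted input: key '2' appears as two separate runs.
unsorted_keys = [k for k, g in groupby(rows, key=lambda x: x[0])]

# Sorted input: each key appears exactly once.
sorted_keys = [
    k for k, g in groupby(sorted(rows, key=lambda x: x[0]), key=lambda x: x[0])]

print(unsorted_keys)  # ['2', '1', '2']
print(sorted_keys)    # ['1', '2']
```

This is why partition_s() applies sorted() before calling groupby(); the defaultdict version avoids the sort by accumulating each group's members wherever they occur in the input.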

Here's the core partitioning operation. This might be used prior to filtering out a group, or it might be used prior to computing statistics for each group:

>>> for key, group_iter in partition(data, key=lambda x: x[0]):
... print(key, tuple(group_iter))
1 (('1', 4.0), ('1', 4.6), ('1', 5.8), ('1', 6.9), ('1', 5.4))
2 (('2', 8.3), ('2', 6.5), ('2', 6.8), ('2', 7.8), ('2', 9.2))
3 (('3', 9.3), ('3', 10.5), ('3', 8.1), ('3', 8.0), ('3', 6.9))
4 (('4', 6.1), ('4', 5.6), ('4', 3.8), ('4', 6.2), ('4', 5.8))

We can summarize the grouped data as follows:

mean = lambda seq: sum(seq)/len(seq)
var = lambda mean, seq: sum((x-mean)**2/mean for x in seq)

Item = Tuple[K_, float]

def summarize(
        key_iter: Tuple[K_, Iterable[Item]]
) -> Tuple[K_, float, float]:
    key, item_iter = key_iter
    values = tuple(v for k, v in item_iter)
    m = mean(values)
    return key, m, var(m, values)

The result of the partition() function will be a sequence of (key, iterator) two-tuples. The summarize() function accepts such a two-tuple and decomposes it into the key and the iterator over the original data items. In this function, each data item is defined as Item: a key of some type, K_, and a numeric value that can be coerced to float. From each two-tuple in the item_iter iterator we want the value portion, so we use a generator expression to create a tuple of only the values.

We can also use the expression map(snd, item_iter) to pick the second item from each of the two-tuples. This requires a definition of snd = lambda x: x[1]; the name snd is a short form of second.
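A small sketch of this alternative, using one of the groups from the sample data:

```python
# snd() picks the second element of a two-tuple; map(snd, ...) is then
# equivalent to the generator expression tuple(v for k, v in item_iter).
snd = lambda x: x[1]

pairs = [('1', 4.0), ('1', 4.6), ('1', 5.8)]
values = tuple(map(snd, pairs))
print(values)  # (4.0, 4.6, 5.8)
```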

We can use the following command to apply the summarize() function to each partition:

>>> partition1 = partition(data, key=lambda x: x[0])
>>> groups1 = map(summarize, partition1)

The alternative commands are as follows:

>>> partition2 = partition_s(data, key=lambda x: x[0])
>>> groups2 = map(summarize, partition2)

Both will provide us with summary values for each group. The resulting group statistics, rounded to two decimal places, look as follows:

1 5.34 0.93
2 7.72 0.63
3 8.56 0.89
4 5.5 0.7
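As a sanity check, here is a minimal sketch that recomputes the first row by hand, using the same mean and var lambdas defined earlier and rounding to two decimal places:

```python
# The values from group '1' of the sample data.
values = (4.0, 4.6, 5.8, 6.9, 5.4)

mean = lambda seq: sum(seq)/len(seq)
var = lambda mean, seq: sum((x-mean)**2/mean for x in seq)

m = mean(values)
print(round(m, 2), round(var(m, values), 2))  # 5.34 0.93
```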

The variance can be used as part of a test to determine if the null hypothesis holds for this data. The null hypothesis asserts that there's nothing to see: the variance in the data is essentially random. We can also compare the data between the four groups to see if the various means are consistent with the null hypothesis or if there is some statistically significant variation.
