Creating a new kind of mapping

Python has a built-in mapping called dict, and numerous library mappings. In addition to the collections module extensions to dict (defaultdict, Counter, and ChainMap), there are several other library modules that contain mapping-like structures.

The shelve module is an important example of another mapping. We'll look at this in Chapter 11, Storing and Retrieving Objects via Shelve. The dbm module is similar to shelve, in that it also maps a key to a value.

The mailbox module and email.message modules both have classes that provide an interface that is similar to dict for the mailbox structure used to manage local e-mails.

As far as design strategies go, we can extend or wrap one of the existing mappings to add even more features.

We could upgrade Counter to add the mean and standard deviation to the data stored as a frequency distribution. Indeed, we can also calculate the median and mode very easily from this class.

Here's a StatsCounter extension to Counter that adds some statistical functions:

import math
from collections import Counter

class StatsCounter(Counter):
@property
def mean(self) -> float:
sum0 = sum(v for k, v in self.items())
sum1 = sum(k * v for k, v in self.items())
return sum1 / sum0

@property
def stdev(self) -> float:
sum0 = sum(v for k, v in self.items())
sum1 = sum(k * v for k, v in self.items())
sum2 = sum(k * k * v for k, v in self.items())
return math.sqrt(sum0 * sum2 - sum1 * sum1) / sum0

We extended the Counter class with two new methods to compute the mean and standard deviation from the frequency distributions. The formulae are similar to the examples shown earlier for the eager calculations on a list object, even though they're lazy calculations on a Counter object.

We used sum0 = sum(v for k,v in self.items()) to compute a sum of the values, v, ignoring the k keys. We could use an underscore (_) instead of k to emphasize that we're ignoring the keys. We could also use sum(v for v in self.values()) to emphasize that we're not using the keys. We prefer obvious parallel structures for sum0 and sum1.

We can use this class to efficiently gather statistics and to perform quantitative analysis on the raw data. We might run a number of simulations, using a Counter object to gather the results.

Here's an interaction with a list of sample data that stands in for real results:

>>> sc = StatsCounter( [2, 4, 4, 4, 5, 5, 7, 9] ) 
>>> sc.mean 
5.0 
>>> sc.stdev 
2.0 
>>> sc.most_common(1) 
[(4, 3)] 
>>> list(sorted(sc.elements())) 
[2, 4, 4, 4, 5, 5, 7, 9] 

The results of most_common() are reported as a sequence of two-tuples with the mode value (4) and the number of times the value occurred (3). We might want to get the top three values to bracket the mode with the next two less-popular items. We get several popular values with an evaluation such as sc.most_common(3).

The elements() method reconstructs a list that's like the original data with the items repeated properly.

From the sorted elements, we can extract the median, the middle-most item:

@property
def median(self) -> Any:
all = list(sorted(self.elements()))
return all[len(all) // 2]

This method is not only lazy, it's rather extravagant with memory; it creates an entire sequence of the available values merely to find the middle-most item.

While it is simple, this is often an expensive way to use Python.

A smarter approach would be to compute the effective length and mid-point via sum(self.values())//2. Once this is known, the keys can be visited in that order, using the counts to compute the range of positions for a given key. Eventually, a key will be found with a range that includes the midpoint.

The code would look something like the following:

@property
def median2(self) -> Optional[float]:
mid = sum(self.values()) // 2
low = 0
for k, v in sorted(self.items()):
if low <= mid < low + v: return k
low += v
return None

We stepped through the keys and the number of times they occur to locate the key that is midmost. Note that this uses the internal sorted() function, which is not without its own cost.

Via timeit, we can learn that the extravagant version takes 9.5 seconds; the smarter version takes 5.2 seconds.

Let's take a look at how to create a new kind of set in the next section.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset