Choosing eager versus lazy calculation

Note that the calculations in the previous example are lazy; they are only done when requested. This also means that they're performed each and every time they're requested. This can be a considerable overhead, depending on the context in which objects of these classes are used.

It's may be sensible to transform these statistical summaries into eager calculations, as we know when elements are added and removed from a list. Although there's a hair more programming to create eager versions of these functions, it has a net impact of improving performance when there's a lot of data being accumulated.

The point of eager statistical calculations is to avoid the loops that compute sums. If we compute the sums eagerly, as the list is being created, we avoid extra looping through the data.

When we look at the special methods for a Sequence class, we can see all of the places where data is added to, removed from, and modified in the sequence. We can use this information to recompute the two sums that are involved. We start with the collections.abc section of the Python Standard Library documentation, section 8.4.1, at http://docs.python.org/3.4/library/collections.abc.html#collections-abstract-base-classes.

Here are the required methods for a MutableSequence class: __getitem__,
__setitem__, __delitem__, __len__, insert, append, reverse, extend, pop, remove, and __iadd__. The documentation also mentions the Inherited Sequence methods. However, as they are for immutable sequences, we can certainly ignore them.

Here are the details of what must be done to update the statistical results in each method:

  • __getitem__: There's no change in the state.
  • __setitem__: This changes an item. We need to take the old item out of each sum and fold the new item into each sum.
  • __delitem__: This removes an item. We need to take the old item out of each sum.
  • __len__: There's no change in the state.
  • insert: This adds a new item. We need to fold it into each sum.
  • append: This also adds a new item. We need to fold it into each sum.
  • reverse: There's no change in the value of the mean or standard deviation.
  • extend: This adds many new items. All of the items must be folded into the sums.
  • pop: This removes an item. We need to take the old item out of each sum.
  • remove: This removes an item. We need to take the old item out of each sum.
  • __iadd__: This is the += augmented assignment statement, the in-place addition. It's effectively the same as the extend keyword.

We won't look at each method in detail, because there are really combinations of two use cases:

  • Fold in one new value
  • Remove one old value

The replacement case is a combination of the remove and fold in operations.

Here are the elements of an eager StatsList2 class. We're going to see the insert() and pop() method:

class StatsList2(list):
"""Eager Stats."""

def __init__(self, iterable: Optional[Iterable[float]]) -> None:
self.sum0 = 0 # len(self), sometimes called "N"
self.sum1 = 0.0 # sum(self)
self.sum2 = 0.0 # sum(x**2 for x in self)
super().__init__(cast(Iterable[Any], iterable))
for x in self:
self._new(x)

def _new(self, value: float) -> None:
self.sum0 += 1
self.sum1 += value
self.sum2 += value * value

def _rmv(self, value: float) -> None:
self.sum0 -= 1
self.sum1 -= value
self.sum2 -= value * value

def insert(self, index: int, value: float) -> None:
super().insert(index, value)
self._new(value)

def pop(self, index: int = 0) -> None:
value = super().pop(index)
self._rmv(value)
return value

We provided three internal variables with comments to show the invariants that this class will maintain. We'll call these the sum invariants because each of them contains a particular kind of sum that is maintained as invariant (always true) after each kind of state change. The essence of this eager calculation are the _rmv() and _new() methods, which update our three internal sums based on changes to the list, so that the relationships really remain invariant.

When we remove an item, that is, after a successful pop() operation, we have to adjust our sums. When we add an item (either initially, or via the insert() method), we also have to adjust our sums. The other methods we need to implement will make use of these two methods to ensure that the three sum invariants hold. For a given list of values, L, we ensure that L.sum0 is always , L.sum1 is always , and L.sum2 is always . We can use the sums to compute the mean and standard deviation.

Other methods, such as append(), extend(), and remove(), are similar in many ways to these methods. We haven't shown them because they're similar.

We can see how this list works by playing with some data:

>>> sl2 = StatsList2( [2, 4, 3, 4, 5, 5, 7, 9, 10] ) 
>>> sl2.sum0, sl2.sum1, sl2.sum2 
(9, 49, 325) 
>>> sl2[2]= 4 
>>> sl2.sum0, sl2.sum1, sl2.sum2 
(9, 50, 332) 
>>> del sl2[-1] 
>>> sl2.sum0, sl2.sum1, sl2.sum2 
(8, 40, 232) 
>>> sl2.insert( 0, -1 ) 
>>> sl2.pop()                             
-1 
>>> sl2.sum0, sl2.sum1, sl2.sum2 
(8, 40, 232) 

We can create a list and the sums are computed initially. Each subsequent change eagerly updates the various sums. We can change, remove, insert, and pop an item; each change results in a new set of sums.

All that's left, is to add our mean and standard deviation calculations, which we can do as follows:

@property
def mean(self) -> float:
return self.sum1 / self.sum0

@property
def stdev(self) -> float:
return math.sqrt(
self.sum0*self.sum2 - self.sum1*self.sum1
) / self.sum0

These make use of the sums that have already been computed. There's no additional looping over the data to compute these two statistics.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset