Design considerations and tradeoffs

When working with containers and collections, we have a multistep design strategy:

  1. Consider the built-in versions of sequence, mapping, and set.
  2. Consider the library extensions in the collections module, as well as extras such as heapq, bisect, and array.
  3. Consider a composition of existing class definitions. In many cases, a list of tuple objects or a dict of lists provides the needed features.
  4. Consider extending one of the earlier mentioned classes to provide additional methods or attributes.
  5. Consider wrapping an existing structure as another way to provide additional methods or attributes.
  6. Finally, consider a novel data structure. A great deal of careful analysis is already available; start with Wikipedia articles such as this one: http://en.wikipedia.org/wiki/List_of_data_structures.
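As an illustration of step 3, a composition of built-in structures often covers the need without a custom class. The following sketch (with hypothetical event data) groups values under keys using a dict of lists:

```python
from collections import defaultdict

# Hypothetical data: (day, event name) pairs.
events = [("mon", "standup"), ("wed", "review"), ("mon", "planning")]

# A dict of lists: each key accumulates the values that share it.
by_day = defaultdict(list)
for day, name in events:
    by_day[day].append(name)

print(dict(by_day))  # {'mon': ['standup', 'planning'], 'wed': ['review']}
```

The defaultdict removes the need to test whether a key is already present before appending.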

Once the design alternatives have been identified, there are two parts of the evaluation left:

  • How well the interface fits with the problem domain. This is a relatively subjective determination.
  • How well the data structure performs as measured with timeit. This is an entirely objective result.

It's important to avoid analysis paralysis; we need to find the proper collection efficiently.

In most cases, it is best to profile a working application to see which data structure is the performance bottleneck. In some cases, consideration of the complexity factors for a data structure will reveal its suitability for a particular kind of problem before starting the implementation.
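Profiling a working application can be sketched with the standard cProfile and pstats modules. The function below is a hypothetical stand-in for application code; its quadratic duplicate search is the kind of data-structure bottleneck a profile run exposes:

```python
import cProfile
import io
import pstats

def find_dupes(data):
    """Quadratic duplicate search -- a likely bottleneck on large inputs."""
    return [x for i, x in enumerate(data) if x in data[:i]]

profiler = cProfile.Profile()
profiler.enable()
find_dupes(list(range(200)) * 2)
profiler.disable()

# Print the five most expensive calls by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The report's cumulative-time column points at the hot spot; only then is it worth swapping the list scan for a set or mapping.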

Perhaps the most important consideration is this: for the highest performance, avoid search.

Avoiding search is the reason sets and mappings require hashable objects. A hashable object can be located in a set or mapping with almost no processing. Locating an item by value (not by index) in a list can take a great deal of time.
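The cost difference shows up even in a plain membership test. A small sketch (the absolute timings are illustrative; only the comparison matters):

```python
import timeit

# Membership test by value: the list scans every element, the set hashes once.
list_time = timeit.timeit("9999 in c", setup="c = list(range(10000))", number=1000)
set_time = timeit.timeit("9999 in c", setup="c = set(range(10000))", number=1000)
print(list_time > set_time)  # True: the linear scan is dramatically slower.
```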

Here's a comparison of a bad set-like use of a list and a proper use of a set:

>>> import timeit
>>> timeit.timeit('l.remove(10); l.append(10)', 'l = list(range(20))')
0.8182099789992208
>>> timeit.timeit('l.remove(10); l.add(10)', 'l = set(range(20))')
0.30278149300283985

We removed and re-added an item, first in a list and then in a set.

Clearly, abusing a list to get it to perform set-like operations makes the collection take 2.7 times as long to run.

As a second example, we'll abuse a list to make it mapping-like. This is based on a real-world example where the original code had two parallel lists to mimic the keys and values of a mapping.

We'll compare a proper mapping with two parallel lists, as follows:

>>> timeit.timeit('i = k.index(10); v[i] = 0', 'k = list(range(20)); v = list(range(20))')
0.6549435159977293
>>> timeit.timeit('m[10] = 0', 'm = dict(zip(list(range(20)), list(range(20))))')
0.0764331009995658

The first example uses two parallel lists. One list is used to look up a value, and then a change is made to the parallel list. In the other case, we simply updated a mapping.

Clearly, performing an index and update on two parallel lists is a horrifying mistake. It takes 8.6 times as long to locate something via list.index() as it does to locate it via a mapping and the hash value.
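The refactoring that fixes the parallel-list anti-pattern is mechanical: zip the two lists into a single dict, then update by key. A sketch (keys and values are hypothetical names for the two parallel lists):

```python
# Before: two parallel lists; every update pays for a linear search.
keys = list(range(20))
values = [k * 2 for k in keys]
i = keys.index(10)  # O(n) scan by value
values[i] = 0

# After: one mapping built from the pair; updates use the hash directly.
mapping = dict(zip(keys, values))
mapping[10] = 0  # O(1) on average
print(mapping[10])  # 0
```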
