Chapter 3. Containers and Collections – Storing Data the Right Way

Python comes bundled with several very useful collections: a handful of basic built-in data types, plus more advanced types that combine them. In this chapter, we will look at these collections, how to use them, and the pros and cons of each.

Before we can properly discuss data structures and the related performance, a basic understanding of time complexity (and specifically the big O notation) is required. No need to worry! The concept is really simple, but without it, we cannot easily explain the performance characteristics of operations.

Once the big O notation is clear, we will discuss the basic data structures:

  • list
  • dict
  • set
  • tuple

Building on the basic data structures, we will continue with more advanced collections, such as the following:

  • Dictionary-like types:
    • ChainMap
    • Counter
    • defaultdict
    • OrderedDict
  • List types:
    • deque
    • heapq
  • Tuple types:
    • namedtuple
  • Other types:
    • Enum

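As a quick preview of where these types live: all of the names above come from the standard library, spread over the collections, heapq, and enum modules. One small example with Counter, which tallies hashable items:

```python
from collections import ChainMap, Counter, OrderedDict, defaultdict, deque, namedtuple
import heapq
from enum import Enum

# Counter counts every occurrence of each item in one pass
print(Counter('abracadabra').most_common(1))  # [('a', 5)]
```

Each of these types is covered in detail later in this chapter.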
Time complexity – the big O notation

Before we can begin with this chapter, there is a simple notation that you need to understand. This chapter heavily uses the big O notation to indicate the time complexity of an operation. Feel free to skip this section if you are already familiar with the notation. While it sounds complicated, the concept is actually quite simple.

When we say that a function takes O(1) time, it means that it takes a constant number of steps to execute, regardless of the input size. Similarly, a function with O(n) takes a number of steps proportional to n, where n is generally the size of the object. This time complexity is only a rough indication of what to expect when executing the code, but it is generally what matters most.

The purpose of this notation is to indicate the approximate scaling of an operation; it says nothing about raw speed, but it is often more important. A piece of code whose individual steps run 1,000 times faster, but which needs O(2**n) steps, will still eventually lose to a version that takes only O(n) steps. With that 1,000x per-step advantage, the break-even point lies around n = 14: 2**14 = 16,384 cheap steps already cost more than 14 expensive ones, and beyond that the gap explodes. This makes choosing the right algorithm very important. Even though C code is generally faster than Python, using the wrong algorithm can easily negate that advantage.
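That trade-off can be checked with simple arithmetic; the 1,000x per-step advantage below is hypothetical:

```python
# Hypothetical costs: each step of the O(2**n) code is 1000x cheaper
for n in (10, 13, 14, 20):
    exponential_cost = 2 ** n / 1000  # O(2**n) steps at 1/1000 cost each
    linear_cost = n                   # O(n) steps at full cost per step
    print(n, exponential_cost, linear_cost)
```

At n = 13 the exponential version still wins (8.192 versus 13), but from n = 14 on it loses, and by n = 30 it needs over a million cost units versus just 30.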

For example, suppose you have a list of 1,000 items and you walk through all of them. This takes O(n) time because there are n = 1,000 items. Checking whether an item exists in the list also takes O(n), so up to 1,000 steps. Doing this 100 times costs 100 * O(n) = 100 * 1,000 = 100,000 steps. Compare this to a dict, where checking whether an item exists takes only O(1) time: 100 * O(1) = 100 * 1 = 100 steps. So, for membership tests on an object with 1,000 items, a dict is roughly 1,000 times faster than a list:

n = 1000
a = list(range(n))
b = dict.fromkeys(range(n))
for i in range(100):
    i in a  # takes up to n = 1000 steps
    i in b  # takes 1 step on average
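The step counts above can be made concrete with a rough timing sketch using the standard timeit module; the exact numbers vary per machine, but the ratio between the two should be large:

```python
import timeit

setup = 'n = 1000; a = list(range(n)); b = dict.fromkeys(range(n))'
# Searching for the last value is close to the worst case for the list
list_time = timeit.timeit('n - 1 in a', setup=setup, number=10_000)
dict_time = timeit.timeit('n - 1 in b', setup=setup, number=10_000)
print(f'list: {list_time:.4f}s, dict: {dict_time:.4f}s')
```

The dict lookup hashes the key and jumps straight to it, while the list has to compare every element until it finds a match.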

To illustrate O(1), O(n), and O(n**2) functions:

def o_one(items):
    return 1  # 1 operation so O(1)

def o_n(items):
    total = 0
    # Walks through all items once so O(n)
    for item in items:
        total += item
    return total

def o_n_squared(items):
    total = 0
    # Walks through all items n*n times so O(n**2)
    for a in items:
        for b in items:
            total += a * b
    return total

n = 10
items = range(n)
o_one(items)  # 1 operation
o_n(items)  # n = 10 operations
o_n_squared(items)  # n*n = 10*10 = 100 operations
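For completeness, here is a logarithmic-time sketch as well: the standard bisect module implements binary search, which halves the remaining search space at every step. This example assumes the input is sorted:

```python
import bisect

def o_log_n(sorted_items, value):
    # Binary search halves the remaining range each step, so O(log n)
    index = bisect.bisect_left(sorted_items, value)
    return index < len(sorted_items) and sorted_items[index] == value

o_log_n(list(range(1000)), 500)  # about log2(1000), roughly 10 operations
```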

It should be noted that the big O used in this chapter describes the average case, not the worst case. For some operations the worst case is much worse (for example, a dict lookup can degrade to O(n) with many hash collisions), but such cases are rare enough to be ignored in the general case.
