5

Built-In Data Structures Part 2: Dictionaries

Python has a rich collection of built-in data structures. These data structures are sometimes called "containers" or "collections" because they contain a collection of individual items. These structures cover a wide variety of common programming situations.

In this chapter, we'll build on some of the basics introduced in Chapter 4, Built-In Data Structures Part 1: Lists and Sets. This chapter covers the dictionary structure. This is a mapping from keys to values, sometimes called an associative array.

This chapter will also look at some more advanced topics related to how Python handles references to mutable collection objects. This has consequences in the way functions need to be defined.

In this chapter, we'll look at the following recipes, all related to Python's built-in data structures:

  • Creating dictionaries – inserting and updating
  • Removing items from dictionaries – the pop() method and the del statement
  • Controlling the order of dictionary keys
  • Writing dictionary-related type hints
  • Understanding variables, references, and assignment
  • Making shallow and deep copies of objects
  • Avoiding mutable default values for function parameters

We'll start with how to create a dictionary.

Creating dictionaries – inserting and updating

A dictionary is one kind of Python mapping. The built-in type dict class provides a number of common features. There are some common variations on these features defined in the collections module.

As we noted in the Choosing a data structure recipe at the beginning of the previous chapter, we'll use a dictionary when we have a key that we need to map to a given value. For example, we might want to map a single word to a long, complex definition of the word, or perhaps some value to a count of the number of times that value has occurred in a dataset.

A dictionary with keys and integer counts is very common. We'll look at a detailed recipe that shows how to initialize the dictionary and update the counter.

Getting ready

We'll look at an algorithm for locating the various stages in transaction processing. This relies on assigning a unique ID to each request and including that ID with each log record written during the transaction starting with that request. Because a multi-threaded server may be handling a number of requests concurrently, the stages for each request's transaction will be interleaved unpredictably. Reorganizing the log by request ID helps segregate the various threads.

Here's a simulated sequence of log entries for three concurrent requests:

[2019/11/12:08:09:10,123] INFO #PJQXB^eRwnEGG?2%32U path="/openapi.yaml" method=GET
[2019/11/12:08:09:10,234] INFO 9DiC!B^nXxnEGG?2%32U path="/items?limit=x" method=GET
[2019/11/12:08:09:10,235] INFO 9DiC!B^nXxnEGG?2%32U error="invalid query"
[2019/11/12:08:09:10,345] INFO #PJQXB^eRwnEGG?2%32U status="200" bytes="11234"
[2019/11/12:08:09:10,456] INFO 9DiC!B^nXxnEGG?2%32U status="404" bytes="987"
[2019/11/12:08:09:10,567] INFO >UL>PB_R>&nEGG?2%32U path="/category/42" method=GET

Each line has a timestamp. The severity level is INFO for each record shown in the example. The next string of 20 characters is a transaction ID. This is followed by log information for a particular step in the transaction.

The following regular expression defines the log records:

log_parser = re.compile(r"\[(.*?)\] (\w+) (\S+) (.*)")

This defines four fields. The timestamp is enclosed by []. This is followed by a word (\w+) and a sequence without any spaces (\S+). The balance of the line is the fourth group of characters.

Parsing these lines will produce a sequence of four-tuples. The data looks like this:

[('2019/11/12:08:09:10,123', 'INFO', '#PJQXB^eRwnEGG?2%32U', 'path="/openapi.yaml" method=GET'), 
 ('2019/11/12:08:09:10,234', 'INFO', '9DiC!B^nXxnEGG?2%32U', 'path="/items?limit=x" method=GET'), 
 ('2019/11/12:08:09:10,235', 'INFO', '9DiC!B^nXxnEGG?2%32U', 'error="invalid query"'), 
 ('2019/11/12:08:09:10,345', 'INFO', '#PJQXB^eRwnEGG?2%32U', 'status="200" bytes="11234"'), 
 ('2019/11/12:08:09:10,456', 'INFO', '9DiC!B^nXxnEGG?2%32U', 'status="404" bytes="987"'), 
 ('2019/11/12:08:09:10,567', 'INFO', '>UL>PB_R>&nEGG?2%32U', 'path="/category/42" method=GET')]

We need to know how often each unique path is requested. This means ignoring some log records and collecting data from the other records. A mapping from the path string to a count is an elegant way to gather this data. It's so elegant that the collections module provides some alternatives for handling this use case.

How to do it...

We have a number of ways to build dictionary objects:

  • Literal: We can create a display of a dictionary by using a sequence of key/value pairs surrounded by {} characters. The {} characters overlap with the way set literals are created. The difference is the use of : between keys and values. Literals look like this: {"num": 355, "den": 113}.
  • Conversion function: A sequence of two-tuples can be turned into a dictionary like this: dict([('num', 355), ('den', 113)]). Each two-tuple becomes a key/value pair. The keys must be immutable objects like strings, numbers, or tuples of immutable objects. We can also build dictionaries like this: dict(num=355, den=113). Each of the parameter names becomes a key. This limits the dictionary keys to strings that are also valid Python variable names.
  • Insertion: We can use the dictionary [key] = value syntax to set or replace a value in a dictionary. We'll look at this later in this recipe.
  • Comprehensions: Similar to lists and sets, we can write a dictionary comprehension to build a dictionary from some source of data.
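
Here's a small REPL sketch showing all four of these techniques side by side; the comprehension uses an arbitrary expression purely to illustrate the syntax:

>>> {"num": 355, "den": 113}
{'num': 355, 'den': 113}
>>> dict([("num", 355), ("den", 113)])
{'num': 355, 'den': 113}
>>> dict(num=355, den=113)
{'num': 355, 'den': 113}
>>> {name: len(name) for name in ("num", "den")}
{'num': 3, 'den': 3}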

Building a dictionary by setting items

We build a dictionary by creating an empty dictionary and then setting items in it:

  1. Create an empty dictionary to map paths to counts with {}. We can also use dict() to create an empty dictionary. Since we're going to create a histogram that counts the number of times a path is used, we'll call it histogram. The type hint is something we'll look at in the Writing dictionary-related type hints recipe later in this chapter:
    >>> histogram = {}
    
  2. For each of the log lines, filter out the ones that do not have a value that starts with path in the item with an index of 3:
    >>> for line in log_lines:
    ...     path_method = line[3]  # group(4) of the original match
    ...     if path_method.startswith("path"):
    
  3. If the path is not in the dictionary, we need to add it. Once the value of the path_method string is in the dictionary, we can increment the value in the dictionary, based on the key from the data:
    ...         if path_method not in histogram:
    ...             histogram[path_method] = 0
    ...         histogram[path_method] += 1
    

This technique adds each new path_method value to the dictionary. Once it has been established that the path_method key is in the dictionary, we can increment the value associated with the key.

Building a dictionary as a comprehension

The last field of each log line had one or two fields inside. There may have been a value like path="/openapi.yaml" method=GET with two attributes, path and method; or a value like error="invalid query" with only one attribute, error.

Use the following regular expression to decompose this final field:

param_parser = re.compile(r'(\w+)=(".*?"|\w+)')

This regular expression matches the word in front of the =, saving that as group one. The text after the = sign has one of two forms: it's either a quote and an arbitrary string of characters to the next quote, or it's a simple word of one or more characters without quotes. This will become group two.

The findall() method of this regular expression object can decompose the fields. Each matching group becomes a two-tuple with the name and the value separated by the =. We can then build a dictionary from the list of matched groups:

  1. For each of the log lines, apply the regular expression to create a list of groups:
    >>> for line in log_lines:
    ...     name_value_pairs = param_parser.findall(line[3])
    
  2. Use a dictionary comprehension to use the name as the key and the value as the value of a dictionary:
    ...     params = {match[0]: match[1] for match in name_value_pairs}
    

We can print the params values and we'll see the following dictionaries:

{'path': '"/openapi.yaml"', 'method': 'GET'}
{'path': '"/items?limit=x"', 'method': 'GET'}
{'error': '"invalid query"'}
{'status': '"200"', 'bytes': '"11234"'}
{'status': '"404"', 'bytes': '"987"'}
{'path': '"/category/42"', 'method': 'GET'}

Using a dictionary for the final fields of each log record makes it easy to separate the important pieces of information.

How it works...

The core feature of a dictionary is a mapping from an immutable key to a value object of any kind. In the first example, we've used an immutable string as the key, and an integer as the value. We describe it as Dict[str, int] in the type hint.

As we count, we replace each value associated with a given key. It's important to understand how the following statement works:

histogram[path_method] += 1

The implementation is essentially this:

histogram[path_method] = histogram[path_method] + 1

The expression histogram[path_method] + 1 computes a new integer object from two other integer objects. This new object replaces the old value in the dictionary. This construct works elegantly because the dictionary as a whole is mutable.

It's essential that dictionary key objects be immutable. We cannot use a list, set, or dict as the key in a dictionary mapping. We can, however, transform a list into an immutable tuple, or make a set into a frozenset so that we can use one of these more complex objects as a key. In both examples, we had immutable strings as the keys to each dictionary.
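
Here's a minimal REPL sketch (the dictionary and keys are purely illustrative) showing a tuple and a frozenset used as keys, and the error raised when a mutable list is tried:

>>> locations = {}
>>> locations[(3, 5)] = "corner marker"
>>> locations[frozenset({"read", "write"})] = "rw access"
>>> locations[(3, 5)]
'corner marker'
>>> locations[[3, 5]] = "corner marker"
Traceback (most recent call last):
...
TypeError: unhashable type: 'list'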

There's more...

We don't have to use an if statement to add missing keys. We can use the setdefault() method of a dictionary instead. Our code to compute the path and method histogram would look like this:

>>> histogram = {}
>>> for line in log_lines:
...     path_method = line[3]  # group(4) of the match
...     if path_method.startswith("path"):
...         histogram.setdefault(path_method, 0)
...         histogram[path_method] += 1

If the key, path_method, doesn't exist, a default value of zero is provided. If the key does exist, the setdefault() method does nothing.

The collections module provides a number of alternative mappings that we can use instead of the default dict mapping:

  • defaultdict: This collection saves us from having to write step two explicitly. We provide an initialization function as part of creating a defaultdict instance. We'll look at an example soon.
  • Counter: This collection does the entire key-and-count algorithm as it is being created. We'll look at this soon too.

Here's the version using the defaultdict class:

>>> from collections import defaultdict
>>> histogram = defaultdict(int)
>>> for line in log_lines:
...     path_method = line[3]  # group(4) of the match
...     if path_method.startswith("path"):
...         histogram[path_method] += 1

We've created a defaultdict instance that will initialize any unknown key's value using the int() function. We provide int, the function itself, to the defaultdict constructor; the defaultdict instance calls int() to create a default value whenever a missing key is encountered.

This allows us to simply use histogram[path_method] += 1. If the value associated with the path_method key was previously in the dictionary, it will be incremented. If the path_method value was not in the dictionary, the int function is called with no argument; the return value, 0, is used to create a new entry in the dictionary. Once the new default value is present, it can be incremented.

The other way we can accumulate frequency counts is by creating a Counter object. We need to import the Counter class so that we can build the Counter object from the raw data:

>>> from collections import Counter
>>> histogram = Counter(line[3] 
...     for line in log_lines
...     if line[3].startswith("path")
... )

When we create a Counter from a source of data, the Counter class will scan the data and count the distinct occurrences.
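
A Counter offers more than the raw counts. For example, its most_common() method returns the keys and counts ordered from most frequent to least frequent, which is handy for reporting. Here's a tiny, self-contained example:

>>> from collections import Counter
>>> Counter("abracadabra").most_common(2)
[('a', 5), ('b', 2)]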

See also

  • In the Removing from dictionaries – the pop() method and the del statement recipe, we'll look at how dictionaries can be modified by removing items.
  • In the Controlling the order of dictionary keys recipe, we'll look at how we can control the order of keys in a dictionary.

Removing from dictionaries – the pop() method and the del statement

A common use case for a dictionary is as an associative store: we can keep an association between key and value objects. This means that we may be doing any of the CRUD operations on an item in the dictionary:

  • Create a new key and value pair
  • Retrieve the value associated with a key
  • Update the value associated with a key
  • Delete the key (and the corresponding value) from the dictionary

We have two common variations on this theme:

  • We have the in-memory dictionary, dict, and the variations on this theme in the collections module. The collection only exists while our program is running.
  • We also have persistent storage in the shelve and dbm modules. The data collection is a persistent file in the filesystem, with a dict-like mapping interface.

These are similar; the distinctions between a shelve.Shelf object and a dict object are minor. This allows us to experiment with a dict object and switch to a Shelf object without making dramatic changes to a program.
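
As a minimal sketch of that similarity, a Shelf can be opened and used with the same item syntax as a dict; the file name here is purely illustrative:

>>> import shelve
>>> db = shelve.open('/tmp/sessions_demo')  # hypothetical file name
>>> db['#PJQXB^eRwnEGG?2%32U'] = ['request received', 'response sent']
>>> db['#PJQXB^eRwnEGG?2%32U']
['request received', 'response sent']
>>> db.close()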

A server process will often have multiple concurrent sessions. When sessions are created, they can be placed into dict or shelf. When the session exits, the item can be deleted or perhaps archived.

We'll simulate this concept of a service that handles multiple requests. We'll define a service that works in a simulated environment with a single processing thread. We'll avoid concurrency and multi-processing considerations.

Getting ready

A great deal of processing supports the need to group items around one (or more) different common values. We might have web log records grouped by time, or by the resource requested. With more sophisticated websites, we might use cookies to group transactions around sessions defined by the value of a cookie.

We'll look at an algorithm for locating the various stages in transaction processing. This relies on assigning a unique ID to each request and including that ID with each log record written while handling a request. Because a multi-threaded server may be handling a number of sessions concurrently, the steps for each of a number of requests will be interleaved unpredictably. Reorganizing the log by request ID helps segregate the various threads.

Here's a simulated sequence of log entries for three concurrent requests:

[2019/11/12:08:09:10,123] INFO #PJQXB^eRwnEGG?2%32U path="/openapi.yaml" method=GET
[2019/11/12:08:09:10,234] INFO 9DiC!B^nXxnEGG?2%32U path="/items?limit=x" method=GET
[2019/11/12:08:09:10,235] INFO 9DiC!B^nXxnEGG?2%32U error="invalid query"
[2019/11/12:08:09:10,345] INFO #PJQXB^eRwnEGG?2%32U status="200" bytes="11234"
[2019/11/12:08:09:10,456] INFO 9DiC!B^nXxnEGG?2%32U status="404" bytes="987"
[2019/11/12:08:09:10,567] INFO >UL>PB_R>&nEGG?2%32U path="/category/42" method=GET

Each line has a timestamp. The severity level is INFO for each record shown in the example. The next string of 20 characters is a transaction ID. This is followed by log information for a particular step in the transaction.

The following regular expression defines the log records:

log_parser = re.compile(r"\[(.*?)\] (\w+) (\S+) (.*)")

This defines four fields. The timestamp is enclosed by []. This is followed by a word (\w+) and a sequence without any spaces (\S+). The balance of the line is the fourth group of characters.

A transaction's end is marked with a status= in the fourth group of characters. This shows the final disposition of the web request.

We'll use an algorithm that uses the transaction ID as a key in a dictionary. The value is the sequence of steps for the transaction. With a very long log, we generally don't want to save every transaction in a gigantic dictionary. When we reach the final log record for a transaction, we can yield the list of log entries for that complete transaction.

How to do it...

The context will include match := log_parser.match(line) to apply the regular expression. Given that context, the processing to use each match to update or delete from a dictionary is as follows:

  1. Define a defaultdict object to hold transaction steps. The keys are 20-character strings. The values are lists of log records. In this case, each log record will have been parsed from the source text into a tuple of individual strings:
    LogRec = Tuple[str, ...]
    requests: DefaultDict[str, List[LogRec]] = collections.defaultdict(list)
    
  2. Define the key for each cluster of log entries:
    id = match.group(3)
    
  3. Update a dictionary item with a log record:
    requests[id].append(match.groups())
    
  4. If this log record completes a transaction, yield the group as part of a generator function. Then remove the transaction from the dictionary, since it's complete:
    if match.group(4).startswith('status'):
        yield requests[id]
        del requests[id]
    

Here's the essential algorithm wrapped up as a generator function:

def request_iter_t(source: Iterable[str]) -> Iterator[List[LogRec]]:
    requests: DefaultDict[str, List[LogRec]] = collections.defaultdict(list)
    for line in source:
        if match := log_parser.match(line):
            id = match.group(3)
            requests[id].append(tuple(match.groups()))
            if match.group(4).startswith('status'):
                yield requests[id]
                del requests[id]
    if requests:
        print("Dangling", requests)

This shows how the individual lines are parsed by the regular expression, then organized into clusters around the id value in group number three.

How it works...

Because a dictionary is a mutable object, we can remove keys from a dictionary. This will delete both the key and the value object associated with the key. We do this when a transaction is complete. A moderately busy web server handling an average of ten transactions per second will see 864,000 transactions in a 24-hour period. If there are an average of 2.5 log entries per transaction, there will be at least 2,160,000 lines in the file.

If we only want to know the elapsed time per resource, we don't want to keep the entire dictionary of 864,000 transactions in memory. We'd rather transform the log into a clustered intermediate file for further analysis.

This idea of transient data leads us to accumulate the parsed log lines into a list instance. Each new line is appended to the appropriate list for the transaction to which the line belongs. When the final line has been found, the group of lines can be purged from the dictionary. In the example, we used the del statement, but the pop() method can also be used.
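
The difference between the two is small: del raises a KeyError for a missing key and returns nothing, while pop() returns the removed value and can be given a default for missing keys. Here's a minimal sketch:

>>> d = {'a': 1, 'b': 2}
>>> del d['a']
>>> d.pop('b')
2
>>> d.pop('c', None) is None
True
>>> d
{}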

There's more...

The example relies on the way a regular expression Match object's groups() method produces a Tuple[str, ...] object. This isn't ideal because it leads to the opaque group(3) and group(4) references. It's not at all clear what these groups mean in the overall log record.

If we change the regular expression pattern, we can get a dictionary for each row. The match object's groupdict() method produces a dictionary. Here's the revised regular expression, with named groups:

log_parser_d = re.compile(
    r"\[(?P<time>.*?)\] "
    r"(?P<sev>\w+) "
    r"(?P<id>\S+) "
    r"(?P<msg>.*)"
)

This leads to a small but helpful variation on the preceding example, as shown in the following example:

LogRecD = Dict[str, str]
def request_iter_d(source: Iterable[str]) -> Iterator[List[LogRecD]]:
    requests: DefaultDict[str, List[LogRecD]] = collections.defaultdict(list)
    for line in source:
        if match := log_parser_d.match(line):
            record = match.groupdict()
            id = record.pop('id')
            requests[id].append(record)
            if record['msg'].startswith('status'):
                yield requests[id]
                del requests[id]
    if requests:
        print("Dangling", requests)

This example has a slightly different type for the log records. In the example, each log record is a dictionary created by a Match object's groupdict() method.

The id field is popped from each record. There's no benefit in keeping this as a field in the dictionary as well as the key used to find the list of record dictionaries.

The result of running this is a sequence of List[Dict[str, str]] instances. It looks like the following example:

>>> for r in request_iter_d(log.splitlines()):
...     print(r)
[{'time': '2019/11/12:08:09:10,123', 'sev': 'INFO', 'msg': 'path="/openapi.yaml" method=GET'}, {'time': '2019/11/12:08:09:10,345', 'sev': 'INFO', 'msg': 'status="200" bytes="11234"'}]
[{'time': '2019/11/12:08:09:10,234', 'sev': 'INFO', 'msg': 'path="/items?limit=x" method=GET'}, {'time': '2019/11/12:08:09:10,235', 'sev': 'INFO', 'msg': 'error="invalid query"'}, {'time': '2019/11/12:08:09:10,456', 'sev': 'INFO', 'msg': 'status="404" bytes="987"'}]
Dangling defaultdict(<class 'list'>, {'>UL>PB_R>&nEGG?2%32U': [{'time': '2019/11/12:08:09:10,567', 'sev': 'INFO', 'msg': 'path="/category/42" method=GET'}]})

This shows the two complete transactions, assembled from multiple log records. It also shows the Dangling log record, where the completion of the transaction was not found in the log. This may represent a problem, but it's more likely to represent either a transaction that was in process when one log file was closed and the next one opened, or a user who walked away from the browser and never finished the transaction.

See also

  • In the Creating dictionaries – inserting and updating recipe, we look at how we create dictionaries and fill them with keys and values.
  • In the Controlling the order of dictionary keys recipe, we look at how we can control the order of keys in a dictionary.

Controlling the order of dictionary keys

In the Creating dictionaries – inserting and updating recipe, we looked at the basics of creating a dictionary object. In many cases, we'll put items into a dictionary and fetch items from a dictionary individually. The idea of an order to the keys isn't central to using a dictionary.

Here are two common cases where key order may be important:

  • Entry order: When working with documents in JSON or YAML notation, it can help to preserve the order of keys from the source documents.
  • Sorted order: When summarizing data, it often helps to provide the keys in some easier-to-visualize order.

A Python dictionary's keys are saved in the order in which they were first inserted; this has been true since version 3.6, and it became a guaranteed language feature in version 3.7. The ordering can be made explicit by using the collections.OrderedDict class instead of the built-in dict class.

In the case when a dictionary contains a summary, we may want to sort the keys before displaying the contents of the dictionary. This is done with the sorted() function.

Getting ready

We'll look at the process of working with spreadsheet data. When we use csv.DictReader, each row becomes a dictionary. It can be very helpful for people using the data to preserve the order of the original columns. Here's a simple table of data from a spreadsheet:

date,engine on,fuel height on,engine off,fuel height off
10/25/13,08:24:00,29,13:15:00,27
10/26/13,09:12:00,27,18:25:00,22
10/28/13,13:21:00,22,06:25:00,14

When displaying this data, it's helpful to keep the columns in the original order.

How to do it...

We have two common cases where we want to control the ordering of the keys of a dictionary:

  • Keys in the order the items were inserted:
    • Use the built-in dict class, which preserves the order in which keys are inserted.
    • Use collections.OrderedDict to state explicitly that the keys are kept in the order they were inserted.
  • Keys in some other order: The ordering can be imposed by using sorted() to iterate over the keys in the desired order. This can be done when making a copy of a dictionary, but it's more commonly done when emitting a final report or summary from dictionary data.

Here's how using the built-in dict class will preserve the order of the columns in the original .csv file:

import csv
from pathlib import Path
from typing import List, Dict
def get_fuel_use(source_path: Path) -> List[Dict[str, str]]:
    with source_path.open() as source_file:
        rdr = csv.DictReader(source_file)
        data: List[Dict[str, str]] = list(rdr)
    return data

Each row in the list collection that's created is a dictionary. The order of the keys in each dictionary will match the order of the columns in the source file. The items in the list will match the original rows.

The output looks like this:

>>> source_path = Path("data/fuel2.csv")
>>> fuel_use = get_fuel_use(source_path)
>>> for row in fuel_use:
...      print(row)
{'date': '10/25/13', 'engine on': '08:24:00', 'fuel height on': '29', 'engine off': '13:15:00', 'fuel height off': '27'}
{'date': '10/26/13', 'engine on': '09:12:00', 'fuel height on': '27', 'engine off': '18:25:00', 'fuel height off': '22'}
{'date': '10/28/13', 'engine on': '13:21:00', 'fuel height on': '22', 'engine off': '06:25:00', 'fuel height off': '14'}

How it works...

Since Python 3.6, the dict class keeps the keys in the order they are inserted. This property of the built-in dict collection class is very handy for ensuring keys remain in an order that's easy to understand.

This is a change from other, older implementations of the dict class. It's also different from the way a set object works. Old implementations of dict (and the current implementation of set) rely on the hash values of the keys. This tends to put keys into a difficult-to-predict order.

We can impose a specific order by creating a dictionary object from a list of two-tuples. This gives us complete control over the ordering of the keys. In these cases, it can be clearer to use the collections.OrderedDict class, but this is no longer necessary.

For example, we might have a row like this:

>>> row = {'columns': 42, 'data': 3.14, 'of': 2.718, 'some': 1.618}

From this dictionary, we can build a new dictionary with keys in a specific order:

>>> import collections
>>> key_order = ['some', 'columns', 'of', 'data']
>>> collections.OrderedDict(
...     [(name, row[name]) for name in key_order]
... )
OrderedDict([('some', 1.618), ('columns', 42), ('of', 2.718), ('data', 3.14)])

This provides ways to create dictionaries that can be serialized with the attributes in a predictable order. When we do JSON serialization, for example, this can be helpful.
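
As a small illustration of the serialization point, json.dumps() emits the keys in the dictionary's own order unless we ask it to sort them:

>>> import json
>>> json.dumps({'some': 1.618, 'columns': 42, 'of': 2.718, 'data': 3.14})
'{"some": 1.618, "columns": 42, "of": 2.718, "data": 3.14}'
>>> json.dumps({'some': 1.618, 'columns': 42, 'of': 2.718, 'data': 3.14}, sort_keys=True)
'{"columns": 42, "data": 3.14, "of": 2.718, "some": 1.618}'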

There's more...

There's another time when we can enforce an ordering of dictionary keys. That happens when we iterate through the keys. Implicitly, the keys() method is used as an iterator over a dictionary. Consider this code snippet:

for field in row:
    print(field, row[field])

By definition, the statement for field in row.keys(): has the same behavior as the statement for field in row:. We can use this to explicitly sort the keys into a more useful order for presentation.

Consider this summary of simplified web server log entries:

>>> import collections
>>> import datetime
>>> log_rows = [
...     {'date': '2019-11-12T13:14:15', 'path': '/path/to/resource'},
...     {'date': '2019-11-14T15:16:17', 'path': '/path/to/resource'},
...     {'date': '2019-11-19T20:21:11', 'path': '/path/to/resource'},
...     {'date': '2019-11-20T21:22:23', 'path': '/path/to/resource'},
...     {'date': '2019-11-26T07:08:09', 'path': '/path/to/resource'},
... ]
>>> summary = collections.Counter()
>>> for row in log_rows:
...     date = datetime.datetime.strptime(row['date'], "%Y-%m-%dT%H:%M:%S")
...     summary[date.weekday()] += 1
>>> summary
Counter({1: 3, 3: 1, 2: 1})

The summary object has day of week, as a number, and the number of times a particular path was referenced. We'd like to change this to show the names of the days of the week. And, also very important for understanding the data, we'd like the days to be in proper calendar order, not in alphabetical order by day name.

We can sort the Counter dictionary keys and use the calendar module to provide day names from the day numbers. The following snippet shows the essence of this operation:

>>> import calendar
>>> for k in sorted(summary):
...      print(calendar.day_name[k], summary[k])
Tuesday 3
Wednesday 1
Thursday 1

This shows how the sorted() function can be applied to the keys of a collections.Counter dictionary to produce the keys in a meaningful order.

See also

  • In the Creating dictionaries – inserting and updating recipe, we'll look at how we can create dictionaries.
  • In the Removing from dictionaries – the pop() method and the del statement recipe, we'll look at how dictionaries can be modified by removing items.

Writing dictionary-related type hints

When we look at sets and lists, we generally expect each item within a list (or a set) to be the same type. When we look at object-oriented class designs, in Chapter 7, Basics of Classes and Objects, we'll see how a common superclass can be the common type for a closely related family of object types. While it's possible to have heterogeneous types in a list or set collection, it often becomes quite complex to process.

Dictionaries are used in a number of different ways:

  • Homogeneous types for values: This is common for dictionaries based on collections.Counter or collections.defaultdict. The input from a csv.DictReader will also be homogeneous, since every value from a CSV file is a string.
  • Heterogeneous types for values: This is common for dictionaries used to represent complex objects that have been serialized into JSON notation. It's also common for internal data structures created as part of more complex processing.

This flexibility leads to two kinds of type hints for dictionary objects.

Getting ready

We'll look at two kinds of dictionary type hints for homogeneous value types as well as heterogeneous value types. We'll look at data that starts out as one of these kinds of dictionaries but is transformed to have more complex type definitions.

We'll be starting with the following CSV file:

date,engine on,fuel height on,engine off,fuel height off
10/25/13,08:24:00,29,13:15:00,27
10/26/13,09:12:00,27,18:25:00,22
10/28/13,13:21:00,22,06:25:00,14

This describes three separate legs of a multi-day trip in a sailboat. The fuel is measured by the height in the tank, rather than some indirect method using a float or other gauges. Because the tank is approximately rectangular, there's a relatively simple transformation from height to gallons. For this tank, 31 inches of depth is about 75 gallons of fuel.
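
This recipe doesn't need the conversion itself, but as a sketch, the height-to-gallons rule described above could be written as a small helper function. The function name and the assumption of a strictly linear relationship are ours, not part of the source data:

def height_to_gallons(height_in_inches: float) -> float:
    """Approximate fuel volume, assuming a roughly rectangular tank
    where 31 inches of depth corresponds to about 75 gallons."""
    return height_in_inches * 75 / 31

>>> round(height_to_gallons(29), 1)
70.2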

How to do it…

The initial use of csv.DictReader will lead to dictionaries with homogeneous type definitions:

  1. Locate the type of the keys in the dictionary. When reading CSV files, the keys are strings, with the type str.
  2. Locate the type of the values in the dictionary. When reading CSV files, the values are strings, with the type str.
  3. Combine the types using the typing.Dict type hint. This yields Dict[str, str].

Here's an example function for reading data from a CSV file:

def get_fuel_use(source_path: Path) -> List[Dict[str, str]]:
    with source_path.open() as source_file:
        rdr = csv.DictReader(source_file)
        data: List[Dict[str, str]] = list(rdr)
    return data

The get_fuel_use() function yields values that match the source data. In this case, it's a dictionary that maps string column names to string cell values.

This data, by itself, is difficult to work with. A common second step is to apply transformations to the source rows to create more useful data types. We can describe the results with a type hint:

  1. Identify the various value types that will be needed. In this example, there are five fields with three different types, shown here:
    • The date field is a datetime.date object.
    • The engine on field is a datetime.time object.
    • The fuel height on field is an integer, but we know that it will be used in a float context, so we'll create a float directly.
    • The engine off field is a datetime.time object.
    • The fuel height off field is also a float value.
  2. Import the TypedDict type definition from the mypy_extensions module. (From Python 3.8 onward, TypedDict is also available directly from the typing module.)
  3. Define TypedDict with the new heterogeneous dictionary types:
    from mypy_extensions import TypedDict
    History = TypedDict(
        'History',
        {
            'date': datetime.date,
            'start_time': datetime.time,
            'start_fuel': float,
            'end_time': datetime.time,
            'end_fuel': float
        }
    )
    

In this example, we've also renamed the fields to make them shorter and slightly easier to understand.

The function to perform the transformation can look like the following example:

def make_history(source: Iterable[Dict[str, str]]) -> Iterator[History]:
    for row in source:
        yield dict(
            date=datetime.datetime.strptime(
                row['date'], "%m/%d/%y").date(),
            start_time=datetime.datetime.strptime(
                row['engine on'], '%H:%M:%S').time(),
            start_fuel=float(row['fuel height on']),
            end_time=datetime.datetime.strptime(
                row['engine off'], '%H:%M:%S').time(),
            end_fuel=float(row['fuel height off']),
        )

This function consumes instances of the initial Dict[str, str] dictionary and creates instances of the dictionary described by the History type hint. Here's how these two functions work together:

>>> source_path = Path("data/fuel2.csv")
>>> fuel_use = make_history(get_fuel_use(source_path))
>>> for row in fuel_use:
...      print(row)
{'date': datetime.date(2013, 10, 25), 'start_time': datetime.time(8, 24), 'start_fuel': 29.0, 'end_time': datetime.time(13, 15), 'end_fuel': 27.0}
{'date': datetime.date(2013, 10, 26), 'start_time': datetime.time(9, 12), 'start_fuel': 27.0, 'end_time': datetime.time(18, 25), 'end_fuel': 22.0}
{'date': datetime.date(2013, 10, 28), 'start_time': datetime.time(13, 21), 'start_fuel': 22.0, 'end_time': datetime.time(6, 25), 'end_fuel': 14.0}

This shows how the output from the get_fuel_use() function can be processed by the make_history() function to create an iterable sequence of dictionaries. Each of the resulting dictionaries has the source data converted to a more useful type.

How it works…

The core type hint for a dictionary names the key type and the value type, in the form Dict[key, value]. The mypy extension type, TypedDict, lets us be more specific about bindings between dictionary keys and a very broad domain of values.

It's important to note that type hints are only checked by the mypy program. These hints have no runtime impact. We could, for example, write a statement like the following:

result: History = {'date': 42}

This statement claims that the result dictionary will match the type hints in the History type definition. The dictionary literal, however, has the wrong type for the date field, and a number of other fields are missing. While this will execute at runtime, mypy will report errors when it checks the code.

There's more…

One of the common cases for heterogeneity is optional items. The type hint Optional[str] is the same as Union[str, None]. This optional or union specification is a very specialized kind of heterogeneity. This is rarely needed with a dictionary, since it can be simpler to omit the key: value pair entirely.
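
When a key may be absent, the dictionary's get() method (with an optional default) is often simpler than declaring an Optional value type. A quick sketch, using one of the parsed log dictionaries from earlier as an example:

>>> record = {'status': '"200"', 'bytes': '"11234"'}
>>> record.get('error') is None
True
>>> record.get('error', 'no error reported')
'no error reported'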

There are some parallels between TypedDict and the NamedTuple type definitions. We can, for example, make a few small syntax changes to create a named tuple instead of a typed dictionary. The alternative looks like this:

from typing import NamedTuple
HistoryT = NamedTuple(
    'HistoryT',
    [
        ('date', datetime.date),
        ('start_time', datetime.time),
        ('start_fuel', float),
        ('end_time', datetime.time),
        ('end_fuel', float)
    ]
)

Because a HistoryT object has an _asdict() method, it's possible to produce a dictionary that matches the History typed dict shown above.

The similarities allow us the flexibility to define types with similar syntax. A dictionary that matches the TypedDict hint is a dictionary and is mutable. The HistoryT subclass of NamedTuple, however, is immutable. This is one central difference between these two type hints. More importantly, a dictionary uses row['date'] syntax to refer to one item using the key. NamedTuple uses row.date syntax to refer to one item using a name.
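
Here's a small sketch of the NamedTuple alternative in use, assuming the HistoryT definition shown above; note the attribute access and the _asdict() conversion:

>>> import datetime
>>> leg = HistoryT(
...     date=datetime.date(2013, 10, 25),
...     start_time=datetime.time(8, 24),
...     start_fuel=29.0,
...     end_time=datetime.time(13, 15),
...     end_fuel=27.0,
... )
>>> leg.start_fuel
29.0
>>> leg._asdict()['start_fuel']
29.0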

See also

  • The Using NamedTuples to simplify item access in tuples recipe provides more details on the NamedTuple type hint.
  • See the Writing list-related type hints recipe in this chapter for more about type hints for lists.
  • The Writing set-related type hints recipe covers this from the view of set types.

Understanding variables, references, and assignment

How do variables really work? What happens when we assign a mutable object to two variables? We can easily have two variables that share references to a common object; this can lead to potentially confusing results when the shared object is mutable.

We'll focus on this principle: Python shares references; it doesn't copy data.

To see what this rule on reference sharing means, we'll create two data structures: one is mutable and one is immutable.

Getting ready

We'll create two data structures; one is mutable and one is immutable. We'll use two kinds of sequences, although we could do something similar with two kinds of sets:

>>> mutable = [1, 1, 2, 3, 5, 8]
>>> immutable = (5, 8, 13, 21)

We'll look at what happens when references to these objects are shared.

We can do a similar comparison with a set and a frozenset. We can't easily do this with a mapping because Python doesn't offer a handy immutable mapping.

How to do it...

This recipe will show how to observe the spooky action at a distance when there are two references to an underlying mutable object. We'll look at ways to prevent this in the Making shallow and deep copies of objects recipe. Here are the steps for seeing the difference between mutable and immutable collections:

  1. Assign each collection to an additional variable. This will create two references to the structure:
    >>> mutable_b = mutable
    >>> immutable_b = immutable
    

    We now have two references to the list [1, 1, 2, 3, 5, 8] and two references to the tuple (5, 8, 13, 21).

  2. We can confirm this using the is operator. This determines if two variables refer to the same underlying object:
    >>> mutable_b is mutable
    True
    >>> immutable_b is immutable
    True
    
  3. Make a change to one of the two references to the collection. For the list type, we can use methods like append() and extend(), or the += operator:
    >>> mutable += [mutable[-2] + mutable[-1]]
    
  4. We can do a similar thing with the immutable structure:
    >>> immutable += (immutable[-2] + immutable[-1],)
    
  5. Look at the two variables that reference the mutable structure. Because both variables refer to the same underlying list object, either variable shows the current state:
    >>> mutable_b
    [1, 1, 2, 3, 5, 8, 13]
    >>> mutable is mutable_b
    True
    
  6. Look at the two variables referring to the immutable structure. Initially, the two variables shared a common object. When the assignment statement was executed, a new tuple was created and only one variable changed to refer to the new tuple:
    >>> immutable_b
    (5, 8, 13, 21)
    >>> immutable
    (5, 8, 13, 21, 34)
    

How it works...

The two variables mutable and mutable_b still refer to the same underlying object. Because of that, we can use either variable to change the object and see the change reflected in the other variable's value.

The two variables, immutable_b and immutable, started out referring to the same object. Because the object cannot be mutated in place, a change to one variable means that a new object is assigned to that variable. The other variable remains firmly attached to the original object.

In Python, a variable is a label that's attached to an object. We can think of variables as brightly colored adhesive notes that we stick onto an object temporarily. Multiple labels can be attached to a single object. An assignment statement attaches a name to an object.

Consider the following statement:

immutable += (immutable[-2] + immutable[-1],)

The expression on the right side of += creates a new tuple from the previous value of the immutable tuple. The assignment statement then assigns the label immutable to the resulting object.

When we use a variable on an assignment statement there are two possible actions:

  • For mutable objects that provide definitions for appropriate assignment operators like +=, the assignment is transformed into a special method; in this case, __iadd__(). The special method will mutate the object's internal state.
  • For immutable objects that do not provide definitions for assignment like +=, the assignment is transformed into = and +. A new object is built by the + operator and the variable name is attached to that new object. Other variables that previously referred to the object being replaced are not affected; they will continue to refer to old objects.

Python counts the number of places from which an object is referenced. When the count of references becomes zero, the object is no longer used anywhere, and can be removed from memory.
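
We can peek at the reference count with sys.getrefcount(). The absolute number includes temporary references created by the call itself, so only the change is meaningful; here's a minimal sketch:

>>> import sys
>>> data = [1, 1, 2, 3, 5, 8]
>>> before = sys.getrefcount(data)
>>> alias = data
>>> sys.getrefcount(data) == before + 1
True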

There's more...

Languages like C++ or Java have primitive types in addition to objects. In these languages, a += statement leverages a feature of the hardware instructions or the Java Virtual Machine instructions to tweak the value of a primitive type.

Python doesn't have this kind of optimization. Numbers are immutable objects. When we do something like this:

>>> a = 355
>>> a += 113

we're not tweaking the internal state of the object 355. This does not rely on the internal __iadd__() special method. This behaves as if we had written:

>>> a = a + 113 

The expression a + 113 is evaluated, and a new immutable integer object is created. This new object is given the label a. The old value previously assigned to a is no longer needed, and the storage can be reclaimed.
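
We can watch the rebinding happen by checking the object's identity before and after the += statement; the id() values themselves are arbitrary, so we only compare them:

>>> a = 355
>>> before = id(a)
>>> a += 113
>>> a
468
>>> id(a) == before
False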

See also

  • In the Making shallow and deep copies of objects recipe, we'll look at ways we can copy mutable structures to prevent shared references.
  • Also, see Avoiding mutable default values for function parameters for another consequence of the way references are shared in Python.

Making shallow and deep copies of objects

Throughout this chapter, we've talked about how assignment statements share references to objects. Objects are not normally copied. When we write:

a = b

we now have two references to the same underlying object. If the object referred to by b has a mutable type, like the list, set, or dict types, then both a and b are references to the same mutable object.

As we saw in the Understanding variables, references, and assignment recipe, a change made through the a variable is visible through b, because both names refer to the same list object.

Most of the time, this is the behavior we want. There are rare situations in which we want to actually have two independent objects created from one original object.

There are two ways to break the connection that exists when two variables are references to the same underlying object:

  • Making a shallow copy of the structure
  • Making a deep copy of the structure

Getting ready

Python does not automatically make a copy of an object. We've seen several kinds of syntax for making a copy:

  • Sequences – list, as well as str, bytes, and tuple: We can use sequence[:] to copy a sequence by using an empty slice expression. This is a special case for sequences.
  • Almost all collections have a copy() method; for a variable named object, the expression object.copy() creates a shallow copy of the collection that variable refers to.
  • Calling a type, with an instance of the type as the only argument, returns a copy. For example, if d is a dictionary, dict(d) creates a shallow copy of d.

What's important is that these are all shallow copies. When two collections are shallow copies, they each contain references to the same underlying objects. If the underlying objects are immutable, such as tuples, numbers, or strings, this distinction doesn't matter.

For example, if we have a = [1, 1, 2, 3], we can't perform any mutation on a[0]. The number 1 in a[0] has no internal state. We can only replace the object.

Questions arise, however, when we have a collection that involves mutable objects. First, we'll create an object, then we'll create a copy:

>>> some_dict = {'a': [1, 1, 2, 3]}
>>> another_dict = some_dict.copy()

This example created a shallow copy of the dictionary. The two copies will look alike because they both contain references to the same objects. There's a shared reference to the immutable string a and a shared reference to the mutable list [1, 1, 2, 3]. We can display the value of another_dict to see that it looks like the some_dict object we started with:

>>> another_dict
{'a': [1, 1, 2, 3]}

Here's what happens when we update the shared list through the original dictionary; the change shows up in the copy as well:

>>> some_dict['a'].append(5)
>>> another_dict
{'a': [1, 1, 2, 3, 5]}

We made a change to a mutable list object that's shared between the two dict instances, some_dict and another_dict. We can see that the item is shared by using the id() function:

>>> id(some_dict['a']) == id(another_dict['a'])
True

Because the two id() values are the same, these are the same underlying object. The value associated with the key a is the same mutable list in both some_dict and another_dict. We can also use the is operator to see that they're the same object.

This sharing of items in a shallow copy also applies to list collections that contain other list objects as items:

>>> some_list = [[2, 3, 5], [7, 11, 13]]
>>> another_list = some_list.copy()
>>> some_list is another_list
False
>>> some_list[0] is another_list[0]
True

We've made a copy of an object, some_list, and assigned it to the variable another_list. The top-level list object is distinct, but the items within the list are shared references. We used the is operator to show that item zero in each list are both references to the same underlying objects.

Because we can't make a set of mutable objects, we don't really have to consider making shallow copies of sets that share items.

What if we want to completely disconnect two copies? How do we make a deep copy instead of a shallow copy?

How to do it...

Python generally works by sharing references. It only makes copies of objects reluctantly. The default behavior is to make a shallow copy, sharing references to the items within a collection. Here's how we make deep copies:

  1. Import the copy module:
    >>> import copy
    
  2. Use the copy.deepcopy() function to duplicate an object and all of the mutable items contained within that object:
    >>> some_dict = {'a': [1, 1, 2, 3]}
    >>> another_dict = copy.deepcopy(some_dict)
    

This will create copies that have no shared references. A change to one copy's mutable internal items won't have any effect anywhere else:

>>> some_dict['a'].append(5)
>>> some_dict
{'a': [1, 1, 2, 3, 5]}
>>> another_dict
{'a': [1, 1, 2, 3]}

We updated an item in some_dict and it had no effect on the copy in another_dict. We can see that the objects are distinct with the id() function:

>>> id(some_dict['a']) == id(another_dict['a'])
False

Since the id() values are different, these are distinct objects. We can also use the is operator to see that they're distinct objects.

How it works...

Making a shallow copy is relatively easy. We can write our own version of the algorithm using comprehensions (containing generator expressions):

>>> copy_of_list = [item for item in some_list]
>>> copy_of_dict = {key:value for key, value in some_dict.items()}

In the list case, the items for the new list are references to the items in the source list. Similarly, in the dict case, the keys and values are references to the keys and values of the source dictionary.

The deepcopy() function uses a recursive algorithm to look inside each mutable collection.

For an object with a list type, the conceptual algorithm is something like this:

import numbers
from copy import deepcopy

immutable = (numbers.Number, tuple, str, bytes)

def deepcopy_list(some_list: list) -> list:
    result = []
    for item in some_list:
        if isinstance(item, immutable):
            result.append(item)  # share references to immutable items
        else:
            result.append(deepcopy(item))  # recursively copy mutable items
    return result

The actual code doesn't look like this, of course. It's a bit more clever in the way it handles each distinct Python type. This does, however, provide some insight as to how the deepcopy() function works.

See also

  • In the Understanding variables, references, and assignment recipe, we'll look at how Python prefers to create references to objects.

Avoiding mutable default values for function parameters

In Chapter 3, Function Definitions, we looked at many aspects of Python function definitions. In the Designing functions with optional parameters recipe, we showed a recipe for handling optional parameters. At the time, we didn't dwell on the issue of providing a reference to a mutable structure as a default. We'll take a close look at the consequences of a mutable default value for a function parameter.

Getting ready

Let's imagine a function that either creates or updates a mutable Counter object. We'll call it gather_stats().

Ideally, a small data gathering function could look like this:

import collections
from random import randint, seed
from typing import Counter, Optional, Callable
def gather_stats_bad(
    n: int,
    samples: int = 1000,
    summary: Counter[int] = collections.Counter()
) -> Counter[int]:
    summary.update(
        sum(randint(1, 6)
            for d in range(n)) for _ in range(samples)
    )
    return summary

This shows a bad design for a function with two stories. The first story offers no argument value for the summary parameter. When this is omitted, the function creates and returns a collection of statistics. Here's the example of this story:

>>> seed(1)
>>> s1 = gather_stats_bad(2)
>>> s1
Counter({7: 168, 6: 147, 8: 136, 9: 114, 5: 110, 10: 77, 11: 71, 4: 70, 3: 52, 12: 29, 2: 26})

The second story allows us to provide an explicit argument value for the summary parameter. When this argument is provided this function updates the given object. Here's an example of this story:

>>> seed(1)
>>> mc = Counter()
>>> gather_stats_bad(2, summary=mc)
Counter...
>>> mc
Counter({7: 168, 6: 147, 8: 136, 9: 114, 5: 110, 10: 77, 11: 71, 4: 70, 3: 52, 12: 29, 2: 26})

We've set the random number seed to be sure that the two sequences of random values are identical. This makes it easy to confirm that the results are the same in both stories. If we provide a Counter object or use the default Counter object, we get identical results.

The gather_stats_bad() function returns a value. When writing a script, we'd simply ignore the returned value. When working with Python's interactive REPL the output is printed. We've shown Counter... instead of the lengthy output.

The problem arises when we do the following operation:

>>> seed(1)
>>> s3b = gather_stats_bad(2)
>>> s3b
Counter({7: 336, 6: 294, 8: 272, 9: 228, 5: 220, 10: 154, 11: 142, 4: 140, 3: 104, 12: 58, 2: 52})

The count values in this example are doubled. Something has gone wrong. This only happens when we use the default story more than once. This code can pass a unit test suite and appear correct.

As we saw in the Making shallow and deep copies of objects recipe, Python prefers to share references. A consequence of that sharing is the object referenced by the s1 variable and the object referenced by the s3b variable are the same object:

>>> s1 is s3b
True

This means the value of the s1 variable changed when the value for the s3b variable was created. From this, it should be apparent the function is updating some shared collection and returning the reference to the shared collection.

The default value used for the summary parameter of this gather_stats_bad() function seems to be sharing a single object. How can we avoid this?

How to do it...

There are two approaches to solving this problem of a mutable default parameter:

  • Provide an immutable default
  • Change the design

We'll look at the immutable default first. Changing the design is generally a better idea. In order to see why it's better to change the design, we'll show the purely technical solution.

When we provide default values for functions, the default object is created exactly once and shared forever after. Here's the alternative:

  1. Replace any mutable default parameter value with None:
    def gather_stats_good(
        n: int, samples: int = 1000, summary: Optional[Counter[int]] = None
    ) -> Counter[int]:
    
  2. Add an if statement to check for an argument value of None and replace it with a fresh, new mutable object:
    if summary is None:
        summary = Counter()
    
  3. This will assure us that every time the function is evaluated with no argument value for a parameter, we create a fresh, new mutable object. We will avoid sharing a single mutable object over and over again.

How it works...

As we noted earlier, Python prefers to share references. It rarely creates copies of objects without explicit use of the copy module or the copy() method of an object. Therefore, default values for function parameter values will be shared objects. Python does not create fresh, new objects for default parameter values.

The rule is very important and often confuses programmers new to Python.

Don't use mutable defaults for functions. A mutable object (set, list, dict) should not be a default value for a function parameter.

Providing a mutable object as a default parameter value is often a very bad idea. In most cases, we should consider changing the design, and not offering a default value at all. The best approach is to use two separate functions, distinguished by the user stories. One function creates a value, the second function updates the value.

We'd refactor one function to create two functions, create_stats() and update_stats(), with unambiguous parameters:

def create_stats(n: int, samples: int = 1000) -> Counter[int]:
    return update_stats(n, samples, Counter())
def update_stats(
    n: int, samples: int, summary: Counter[int]
) -> Counter[int]:
    summary.update(
        sum(randint(1, 6)
            for d in range(n)) for _ in range(samples))
    return summary

We've created two separate functions. This will separate the two stories so that there's no confusion. The idea of optional mutable arguments was not a good idea. The mutable object provided as a default value is reused. This reuse of a mutable object means the default value will be updated, a potential source of confusion. It's very unlikely to want a default value that is updated as an application executes.

There's more...

In the standard library, there are some examples of a useful technique that shows how we can create fresh default objects. A number of functions use a key function to create comparable values from a complex object. We can look at sorted(), min(), and max() for examples of this. The default key function does nothing; it behaves as if it were lambda item: item. A non-default key function accepts an item and produces some other object that is a better basis for comparison.
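
Here's a small reminder of how that technique looks with sorted() and max(); the key function is applied to each item to produce the value that is actually compared:

>>> sorted(["Bob", "alice", "Charlie"], key=str.lower)
['alice', 'Bob', 'Charlie']
>>> max([3, -7, 2], key=abs)
-7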

In order to leverage this technique, we need to modify the design of our example function. We will no longer update an existing counter object in the function. We'll always create a fresh, new object. We can modify what class of object is created.

Here's a function that allows us to plug in a different class in the case where we don't want the default Counter class to be used:

from typing import Iterable, TypeVar

T = TypeVar('T')
Summarizer = Callable[[Iterable[T]], Counter[T]]
def gather_stats_flex(
    n: int,
    samples: int = 1000,
    summary_func: Summarizer = collections.Counter
) -> Counter[int]:
    summary = summary_func(
        sum(randint(1, 6)
            for d in range(n)) for _ in range(samples))
    return summary

For this version, we've defined an initialization value to be a function of one argument. The default will apply this one-argument function to a generator function for the random samples. We can override this function with another one-argument function that will collect data. This will build a fresh object using any kind of object that can gather data.

Here's an example using list instead of collections.Counter:

>>> seed(1)
>>> gather_stats_flex(2, 12, summary_func=list)
[7, 4, 5, 8, 10, 3, 5, 8, 6, 10, 9, 7]

In this case, we provided the list() function to create a list with the individual random samples in it.

Here's an example without an argument value. It will create a collections.Counter object:

>>> seed(1)
>>> gather_stats_flex(2, 12)
Counter({7: 2, 5: 2, 8: 2, 10: 2, 4: 1, 3: 1, 6: 1, 9: 1})

In this case, we've used the default value. The function created a Counter() object from the random samples.

See also

  • See the Creating dictionaries – inserting and updating recipe, which shows how defaultdict works.