Chapter 2. Advanced Basics

Like any other book on programming, the remainder of this book relies on quite a few features that may or may not be considered commonplace by readers. You, the reader, are expected to know a good deal about Python and programming in general, but there are a variety of lesser-used features that are extremely useful in the operation of many techniques shown throughout the book.

Therefore, as unusual as it may seem, this chapter will focus on a concept of advanced basics. The tools and techniques in this chapter aren't necessarily common knowledge, but they form a solid foundation for more advanced implementations to follow. Let's start off with some of the general concepts that tend to come up often in Python development.

General Concepts

Before getting into more concrete details, it's important to get a feel for the concepts that lurk behind the specifics covered later in this chapter. These are different from the principles and philosophies from the first chapter in that they are concerned more with actual programming techniques, while those discussed previously are more generic design goals.

In this regard, you can look at the first chapter as a design guide, while the concepts presented here are more of an implementation guide. Of course, there's only so specific a description like this can get without bogging down in too many details, so this section will defer to more detailed chapters throughout the rest of the book for more information.

Iteration

Although there is a nearly infinite number of different types of sequences that might come up in Python code—more on that later in this chapter and in Chapter 5—most code that uses them can be placed in one of two categories: those that actually use the sequence as a whole and those that just need the items within it. Most functions use both approaches in various ways, but the distinction is important in order to understand what tools Python makes available and how they should be used.

Looking at things from a purely object-oriented perspective, it's easy to understand how to work with sequences that your code actually needs to use. You'll have a concrete object, such as a list, set or dictionary, which not only has data associated with it, but also has methods that allow for accessing and modifying that data. You may need to iterate over it multiple times, access individual items out of order or return it from the function for other code to use, all of which works well with more traditional object usage.

On the other hand, you may not actually need to work with the entire sequence as a whole; you may be interested solely in each item within it. This is often the case when looping over a range of numbers, for instance, because what's important is having each number available within the loop, not having the whole list of numbers available.

The difference between the two approaches is primarily about intention, but there are technological implications as well. Not all sequences need to be loaded in their entirety in advance, and many don't even need to have a finite upper limit at all. This category includes such things as the set of positive odd numbers, squares of integers and the Fibonacci sequence, all of which are infinite in length and easily computable. Therefore, they're best suited for pure iteration, without the need to populate a list in advance.

The main benefit to this is memory allocation. A program designed to print out the entire range of the Fibonacci sequence only needs to keep a few variables in memory at any given time, because each value in the sequence can be calculated from the two previous values. Populating a list of the values, even with a limited length, requires loading all the included values into memory before iterating over them. If the full list will never be acted on as a whole, it's far more efficient to simply generate each item as it's necessary and discard it once it's no longer required in order to produce new items.
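As a brief illustration of that trade-off, here's a minimal sketch of the generator-based approach; the fibonacci() function shown here is purely illustrative, and generators themselves are covered in more detail later in this chapter and in Chapter 5.

import itertools

def fibonacci():
    """Generate Fibonacci numbers indefinitely, one at a time."""
    previous, current = 0, 1
    while True:
        yield current
        # Only two values are ever held in memory, no matter how
        # far into the sequence the iteration gets.
        previous, current = current, previous + current

# Print the first ten values without ever building a full list.
for value in itertools.islice(fibonacci(), 10):
    print(value)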

Python as a language offers a few different ways to iterate over a sequence without pushing all its values into memory at once. As a library, Python uses those techniques in many of its provided features, which may sometimes lead to confusion. After all, both approaches allow you to write a for loop without a problem, but many sequences won't have the methods and attributes you might expect to see on a list.

The section on iteration later in this chapter will cover some of the more common ways to create iterable sequences and also a simple way to coerce those sequences to lists when you truly do need to operate on the sequence as a whole. Sometimes, though, it's useful to have an object that can function in both respects, which requires the use of caching.

Caching

Outside of computing, a cache is a hidden collection, typically of items either too dangerous or too valuable to be made publicly accessible. The definition in computing is related, with caches storing data in a way that doesn't impact a public-facing interface. Perhaps the most common real-world example is a Web browser, which downloads a document from the Web when it's first requested, but keeps a copy of that document. When the user requests that same document again at a later time, the browser loads the private copy and displays it to the user instead of hitting the remote server again.

In the browser example, the public interface could be the address bar, an entry in the user's favorites or a link from another Web site, where the user never has to indicate whether the document should be retrieved remotely or accessed from a local cache. Instead, the software uses the cache to reduce the number of remote requests that need to be made, as long as the document doesn't change quickly. The details of Web document caching are beyond the scope of this book, but it's a good example of how caching works in general.

More specifically, a cache should be looked at as a time-saving utility that doesn't explicitly need to exist in order for a feature to work properly. If the cache gets deleted or is otherwise unavailable, the function that utilizes it should continue to work properly, perhaps with a dip in performance because it needs to re-create the items that were lost. That also means that code utilizing a cache must always accept enough information to generate a valid result without the use of the cache.
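As a minimal sketch of that idea, consider a hypothetical get_document() function that keeps finished results in a plain dictionary; expensive_download() here is just a stand-in for whatever slow operation actually produces the value.

_cache = {}

def expensive_download(url):
    """Stand-in for a slow retrieval operation."""
    return 'contents of %s' % url

def get_document(url):
    """Return a document, consulting the cache before doing real work."""
    if url not in _cache:
        # Cache miss: the function accepts enough information (the
        # url) to produce a valid result without the cache's help.
        _cache[url] = expensive_download(url)
    return _cache[url]

# Clearing the cache never breaks anything; it only costs time.
_cache.clear()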

The nature of caching also means that you need to be careful about ensuring the cache is as up-to-date as your needs demand. In the Web browser example, servers can specify how long a browser should hold on to a cached copy of a document before destroying the local copy and requesting a fresh one from the server. In simple mathematical examples, the result can be cached theoretically forever, because the result should always be the same, given the same input. Chapter 3 covers a technique called memoization that does exactly that.

A useful compromise is to cache a value indefinitely, but update it immediately when the value is updated. This isn't always an option, particularly if values are retrieved from an external source, but when the value is updated within your application, updating the cache is an easy step to include, which saves the trouble of having to invalidate the cache and retrieve the value from scratch later on. Doing so can incur a performance penalty, though, so you'll have to weigh the merits of live updates against the speed you might lose by doing so.
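Continuing the hypothetical sketch above, a write-through update might look like this; save_to_storage() is again just a stand-in for the real persistence step.

def save_to_storage(url, contents):
    """Stand-in for whatever actually persists the document."""

def set_document(url, contents):
    """Save a document, refreshing the cached copy at the same time."""
    save_to_storage(url, contents)
    # Writing through to the cache keeps later reads current, without
    # having to invalidate the entry and retrieve it from scratch.
    _cache[url] = contents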

Transparency

Whether describing building materials, image formats or government actions, transparency refers to the ability to see through or inside of something, and its use in programming is no different. For our purposes, transparency refers to the ability of your code to see—and in many cases, even edit—nearly everything that the computer has access to.

Python doesn't support the notion of private variables in the typical manner, so all attributes are accessible to any object that requests them. Some languages consider that type of openness to be a risk to stability, instead allowing the code that powers an object to be solely responsible for that object's data. While that does prevent some occasional misuses of internal data structures, Python doesn't take any measures to restrict access to that data.

Although the most obvious use of transparent access is in object attributes—which is where many other languages allow more privacy—Python allows you to inspect a wide range of aspects of objects and the code that powers them. In fact, you can even get access to the compiled bytecode that Python uses to execute functions. Here are just a few examples of information available at runtime.

  • Attributes on an object

  • The names of attributes available on an object

  • The type of an object

  • The module where a class or function was defined

  • The filename where a module was loaded

  • The bytecode for a given function

Most of this information is only used internally, but it's available because there are potential uses that can't be accounted for when code is first written. Retrieving that information at run-time is called introspection and is a common tactic in systems that implement principles like DRY.
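To give a sense of what's available, here's a quick interactive session using only standard attributes and built-in functions; exact values, such as the filename, will vary with how the code was loaded.

>>> class Example:
...     attribute = 'value'
...
>>> type(Example)
<class 'type'>
>>> Example.__module__
'__main__'
>>> getattr(Example, 'attribute')
'value'
>>> hasattr(Example, 'other')
False
>>> def example():
...     pass
...
>>> example.__code__.co_filename
'<stdin>'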

The rest of this book will contain many different introspection techniques in the sections where such information is available. For those rare occasions where data should indeed be protected, Chapters 3 and 4 show how data can be marked as private by intent or hidden from view entirely.

Control Flow

Generally speaking, the control flow of a program is the path the program takes during execution. The more common examples of control flow are the if, for and while blocks, which are used to manage the most fundamental branches your code could need. Those blocks are also some of the first things a Python programmer will learn, so this section will instead focus on some of the lesser-used and under-utilized control flow mechanisms.

Catching Exceptions

Chapter 1 explained how Python philosophy encourages the use of exceptions wherever an expectation is violated, but expectations often vary between different uses. This is especially common when one application relies on another, but it's also quite common within a single application. Essentially, any time one function calls another, it can add its own expectations on top of what the called function already handles.

Exceptions are raised with a simple syntax using the raise keyword, but catching them is slightly more complicated because it uses a combination of keywords. The try keyword begins a block where you expect exceptions to occur, while the except keyword marks a block to execute when an exception is raised. The first part is easy, since try doesn't have anything to go along with it, and the simplest form of except also doesn't require any additional information.

def count_lines(filename):
    """
    Count the number of lines in a file. If the file can't be
    opened, it should be treated the same as if it was empty.
    """
    try:
        return len(open(filename, 'r').readlines())
    except:
        # Something went wrong reading the file
        # or calculating the number of lines.
        return 0

Any time an exception gets raised inside the try block, the code in the except block will be executed. As it stands, this doesn't make any distinction among the various exceptions that could be raised; no matter what happens, the function will always return a number. It's actually fairly rare that you'd want to do that, though, because many exceptions should in fact propagate up to the rest of the system—errors should never pass silently. Some notable examples are SystemExit and KeyboardInterrupt, both of which should usually cause the program to stop running.

In order to account for those and other exceptions that your code shouldn't interfere with, the except keyword can accept one or more exception types that should be caught explicitly. Any others will simply be raised as if you didn't have a try block at all. This focuses the except block on just those situations that should definitely be handled, so your code only has to deal with what it's supposed to manage.

def count_lines(filename):
    """
    Count the number of lines in a file. If the file can't be
    opened, it should be treated the same as if it was empty.
    """
    try:
        return len(open(filename, 'r').readlines())
    except IOError:
        # Something went wrong reading the file.
        return 0

By changing the code to accept IOError explicitly, the except block will only execute if there was a problem accessing the file from the filesystem. Any other errors, such as a filename that's not even a string, will simply propagate outside of this function, to be handled by some other piece of code.

If you need to catch multiple exception types, there are two approaches. The first and easiest is to simply catch some base class that all the necessary exceptions derive from. Since exception handling matches against the specified class and all its subclasses, this approach works quite well when all the types you need to catch do have a common base class. In the line counting example, you could encounter either IOError or OSError, both of which are descendants of EnvironmentError.

def count_lines(filename):
    """
    Count the number of lines in a file. If the file can't be
    opened, it should be treated the same as if it was empty.
    """
    try:
        return len(open(filename, 'r').readlines())
    except EnvironmentError:
        # Something went wrong reading the file.
        return 0

Note

Even though we're only interested in IOError and OSError, all subclasses of EnvironmentError will get caught as well. In this case, that's fine because those are the only subclasses of EnvironmentError, but in general you'll want to make sure you're not catching too many exceptions.

Other times, you may want to catch multiple exception types that don't share a common base class or perhaps limit it to a smaller list of types. In these cases, you need to specify each type individually, separated by commas. In the case of count_lines(), there's also the possibility of a TypeError that could be raised if the filename passed in isn't a valid string.

def count_lines(filename):
    """
    Count the number of lines in a file. If the file can't be
    opened, it should be treated the same as if it was empty.
    """
    try:
        return len(open(filename, 'r').readlines())
    except (EnvironmentError, TypeError):
        # Something went wrong reading the file.
        return 0

If you need access to the exception itself, perhaps to log the message for later, you can get it by adding an as clause and supplying a variable to contain the exception object.

import logging

def count_lines(filename):
    """
    Count the number of lines in a file. If the file can't be
    opened, it should be treated the same as if it was empty.
    """
    try:
        return len(open(filename, 'r').readlines())
    except (EnvironmentError, TypeError) as e:
        # Something went wrong reading the file.
        logging.error(e)
        return 0

Multiple except clauses can also be combined, allowing you to handle different types of exceptions in different ways. For example, EnvironmentError uses two arguments, an error code and an error message, that combine to form its complete string representation. In order to log just the error message in that case, but still correctly handle the TypeError case, two except clauses could be used.

import logging

def count_lines(filename):
    """
    Count the number of lines in a file. If the file can't be
    opened, it should be treated the same as if it was empty.
    """
    try:
        return len(open(filename, 'r').readlines())
    except TypeError as e:
        # The filename wasn't valid for use with the filesystem.
        logging.error(e)
        return 0
    except EnvironmentError as e:
        # Something went wrong reading the file.
        logging.error(e.args[1])
        return 0

Exception Chains

Sometimes, while handling one exception, another exception might get raised along the way. This can happen either explicitly with a raise keyword or implicitly through some other code that gets executed as part of the handling. Either way, this situation brings up a question of which exception is important enough to present itself to the rest of the application. Exactly how that question is answered depends on how the code is laid out, so let's take a look at a simple example, where the exception handling code opens and writes to a log file.

>>> def get_value(dictionary, name):
...     try:
...         return dictionary[name]
...     except Exception as e:
...         log = open('logfile.txt', 'w')
...         log.write('%s\n' % e)
...         log.close()
...
>>>

If anything should go wrong when writing to the log, a separate exception will be raised. Even though this new exception is important, there was already an exception in play that shouldn't be forgotten. To retain the original information, the new exception gains an attribute, called __context__, which holds the original exception object. Each exception can reference one other, forming a chain that represents everything that went wrong, in order. Consider what happens when get_value() fails, but logfile.txt is a read-only file.

>>> get_value({}, 'test')
Traceback (most recent call last):
  ...
KeyError: 'test'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  ...
IOError: [Errno 13] Permission denied: 'logfile.txt'
>>>

This is an implicit chain, because the exceptions are linked only by how they're encountered during execution. Sometimes you'll be generating an exception yourself, and you may need to include an exception that was generated elsewhere. One common example of this is validating values using a function that was passed in. Validation functions, as will be described in Chapters 3 and 4, generally raise a ValueError, regardless of what was wrong.

This is a great opportunity to form an explicit chain, so we can raise a ValueError directly, while retaining the actual exception behind the scenes. Python allows this by including the from keyword at the end of the raise statement.

>>> def validate(value, validator):
...     try:
...         return validator(value)
...     except Exception as e:
...         raise ValueError('Invalid value: %s' % value) from e
...
>>> def validator(value):
...     if len(value) > 10:
...         raise ValueError("Value can't exceed 10 characters")
...
>>> validate('test', validator)
>>> validate(False, validator)
Traceback (most recent call last):
  ...
TypeError: object of type 'bool' has no len()
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  ...
ValueError: Invalid value: False

Since this wraps multiple exceptions up together into a single object, it may seem ambiguous as to which exception is really being passed around. A simple rule to remember is that the most recent exception is the one being raised, with any others available by way of the __context__ attribute. This is easy to test by wrapping one of these functions in a new try block and checking the type of the exception.

>>> try:
...     validate(False, validator)
... except Exception as e:
...     print(type(e))
...
<class 'ValueError'>
>>>
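The original exception hasn't gone anywhere, though. Continuing the same session, it's still attached to the ValueError by way of the __context__ attribute described earlier.

>>> try:
...     validate(False, validator)
... except Exception as e:
...     print(type(e.__context__))
...
<class 'TypeError'>
>>>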

When Everything Goes Right

On the other end of the spectrum, you may find that you have a complex block of code, where you need to catch exceptions that may crop up from part of it, but code after that part should proceed without any error handling. The obvious approach is to simply add that code outside of the try/except blocks. Here's how we might adjust the count_lines() function to contain the error-generating code inside the try block, while the line counting takes place after the exceptions have been handled.

import logging

def count_lines(filename):
    """
    Count the number of lines in a file. If the file can't be
    opened, it should be treated the same as if it was empty.
    """
    try:
        file = open(filename, 'r')
    except TypeError as e:
        # The filename wasn't valid for use with the filesystem.
        logging.error(e)
        return 0
    except EnvironmentError as e:
        # Something went wrong reading the file.
        logging.error(e.args[1])
        return 0
    return len(file.readlines())

In this particular case, the function will work as expected, so all seems fine. Unfortunately, it's misleading because of the nature of this specific case. Because each of the except blocks explicitly returns a value from the function, the code after the error handling will only be reached if no exceptions were raised.

Note

We could place the file reading code directly after the file is opened, but then if any exceptions are raised there, they'd get caught using the same error handling as the file opening. Separating them is a way to better control how exceptions are handled overall. You may also notice that the file isn't closed anywhere here. That will be handled in later sections, as this function continues expanding.

If, however, the except blocks simply logged the error and moved on, Python would try to count the lines in the file, even though no file was ever opened. Instead, we need a way to specify that a block of code should be run only if no exceptions were raised at all, regardless of how the except blocks execute. Python provides this feature by way of the else keyword, which defines a separate block.

import logging

def count_lines(filename):
    """
    Count the number of lines in a file. If the file can't be
    opened, it should be treated the same as if it was empty.
    """
    try:
        file = open(filename, 'r')
    except TypeError as e:
        # The filename wasn't valid for use with the filesystem.
        logging.error(e)
        return 0
    except EnvironmentError as e:
        # Something went wrong reading the file.
        logging.error(e.args[1])
        return 0
    else:
        return len(file.readlines())

Warning

Raising an exception isn't the only thing that tells Python to avoid the else block. If the function returns a value at any time inside the try block, Python will simply return the value as instructed, skipping the else block altogether.

Proceeding Regardless of Exceptions

On yet another hand, many functions perform some kind of setup or resource allocation that must be cleaned up before returning control to external code. In the face of exceptions, the cleanup code might not always be executed, which can leave files or sockets open or perhaps leave large objects in memory when they're no longer needed.

To facilitate this, Python also allows the use of a finally block, which gets executed every time the associated try, except and else blocks finish. Since count_lines() opens a file, best practice would suggest that it also explicitly close the file, rather than waiting for garbage collection to deal with it later. Using finally provides a way to make sure the file always gets closed.

There is still one thing to consider, though. So far, count_lines() only anticipates exceptions that could occur while trying to open the file, even though there's a common one that comes up when reading the file: UnicodeDecodeError. Chapter 7 covers a bit of Unicode and how Python deals with it, but for now, just know that it comes up fairly often. In order to catch this new exception, it's necessary to move the readlines() call back into the try block, but we can still leave the line counting in the else block.

import logging

def count_lines(filename):
    """
    Count the number of lines in a file. If the file can't be
    opened, it should be treated the same as if it was empty.
    """
    file = None  # file must always have a value
    try:
        file = open(filename, 'r')
        lines = file.readlines()
    except TypeError as e:
        # The filename wasn't valid for use with the filesystem.
        logging.error(e)
        return 0
    except EnvironmentError as e:
        # Something went wrong reading the file.
        logging.error(e.args[1])
        return 0
    except UnicodeDecodeError as e:
        # The contents of the file were in an unknown encoding.
        logging.error(e)
        return 0
    else:
        return len(lines)
    finally:
        if file:
            file.close()

Of course, it's not very likely that you'd have this much error handling in a simple line counting function. After all, it really only exists because we wanted to return 0 in the event of any errors. In the real world, you're much more likely to just let the exceptions run their course outside of count_lines(), letting other code be responsible for how to handle it.

Tip

Some of this handling can be made a bit simpler using a with block, described later in this chapter.

Optimizing Loops

Since loops of some kind or another are very common in most types of code, it's important to make sure they can run as efficiently as possible. The iteration section later in this chapter covers a variety of ways to optimize the design of any loops, while Chapter 5 explains how you can control the behavior of for loops. Instead, this section focuses on the optimization of the while loop.

Typically, while is used to check a condition that may change during the course of the loop, so that the loop can finish executing once the condition evaluates to false. When that condition is too complicated to distill into a single expression or when the loop is expected to break due to an exception, it makes more sense to keep the while expression always true and end the loop using a break statement where appropriate.

Although any expression that evaluates to true will induce the intended functionality, there is one specific value you can use to make it even better. Python knows that True will always evaluate to true, so it makes some additional optimizations behind the scenes to speed up the loop. Essentially, it doesn't even bother checking the condition each time; it just runs the code inside the loop indefinitely, until it encounters an exception, a break statement or a return statement.

def echo():
    """Returns everything you type until you press Ctrl-C"""

    while True:
        try:
            print(input('Type Something: '))
        except KeyboardInterrupt:
            print()  # Make sure the prompt appears on a new line.
            break

The with Statement

The finally block covered in the exception handling section previously in this chapter is a convenient way to clean up after a function, but sometimes that's the only reason to use a try block in the first place. Sometimes you don't want to silence any exceptions, but you still want to make sure the cleanup code executes, regardless of what happens. Working solely with exception handling, a simpler version of count_lines() might look something like this.

def count_lines(filename):
    """Count the number of lines in a file."""

    file = open(filename, 'r')
    try:
        return len(file.readlines())
    finally:
        file.close()

If the file fails to open, it'll raise an exception before even entering the try block, while everything else that could go wrong would do so inside the try block, which will cause the finally block to clean up the file. Unfortunately, it's something of a waste to use the power of the exception handling system just for that. Instead, Python provides another option that has some other advantages over exception handling as well.

The with keyword can be used to start a new block of code, much like try, but with a very different purpose in mind. By using a with block, you're defining a specific context, in which the contents of the block should execute. The beauty of it, though, is that the object you provide in the with statement gets to determine what that context means.

For example, you can use open() in a with statement to run some code in the context of that file. In this case, with also provides an as clause, which allows an object to be returned for use while executing in the current context. Here's how we could rewrite the new version of count_lines() to take advantage of all of this.

def count_lines(filename):
    """Count the number of lines in a file."""

    with open(filename, 'r') as file:
        return len(file.readlines())

That's really all that's left of count_lines() after switching to use the with statement. The exception handling gets done by the code that manages the with statement, while the file closing behavior is actually provided by the file itself, by way of a context manager. Context managers are special objects that know about the with statement and can define exactly what it means to have code executed in their context.

In a nutshell, the context manager gets a chance to run its own code before the with block executes; then gets to run some more cleanup code after it's finished. Exactly what happens at each of those stages will vary; in the case of open(), it opens the file and closes it automatically when the block finishes executing.
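Here's a minimal sketch of that protocol in action; the timed class is purely illustrative and not part of the standard library.

import time

class timed:
    """Report how long the body of a with block takes to run."""
    def __enter__(self):
        # Runs just before the block's body begins executing.
        self.start = time.time()
        return self  # whatever is returned feeds the as clause

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs after the body finishes, even if it raised an exception.
        print('Block took %.3f seconds' % (time.time() - self.start))

with timed():
    sum(range(1000000))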

With files, the context obviously always revolves around an open file object, which is made available to the block using the name given in the as clause. Sometimes, though, the context is entirely environmental, so there is no such object to use during execution. To support those cases, the as clause is optional.

In fact, you can even leave off the as clause in the case of open() without causing any errors. Of course, you also won't have the file available to your code, so it'd be of little use, but there's nothing in Python that prevents you from doing so. If you include an as clause when using a context manager that doesn't provide an object, the variable you define will simply be populated with None instead, because all functions return None if no other value is specified.

There are several context managers available in Python, some of which will be detailed throughout the rest of this book. In addition, Chapter 5 will show how you can write your own context managers, so you can customize the contextual behavior to match the needs of your own code.

Conditional Expressions

Fairly often, you may find yourself needing to access one of two values, and which one you use depends on evaluating an expression. For instance, it's quite common to display one string to a user if a value exceeds a particular threshold and a different one otherwise. Typically, this would be done using an if/else combination, as follows.

def test_value(value):
    if value < 100:
        return 'The value is just right.'
    else:
        return 'The value is too big!'

Rather than writing this out into four separate lines, it's possible to condense it down to a single line using a conditional expression. By converting the if and else blocks into clauses in an expression, Python can achieve the same effect much more concisely.

def test_value(value):
    return 'The value is ' + ('just right.' if value < 100 else 'too big!')

There's another approach that is sometimes used to simulate the behavior of the conditional expression described in this section. This was often used in older Python installations where the if/else expression wasn't yet available. In its place, many programmers relied on the behavior of the and and or operators, which could be made to do something very similar. Here's how the previous example could be rewritten using only these operators.

def test_value(value):
    return 'The value is ' + (value < 100 and 'just right.' or 'too big!')

This puts the order of components more in line with the form used in other programming languages. That fact may make it more comfortable for programmers used to working with those languages, and it certainly maintains compatibility with even older versions of Python. Unfortunately, it comes with a hidden danger that is often left unknown until it breaks an otherwise working program with little explanation. To understand why, let's examine what's going on.

The and operator works like the && operator in many languages, checking to see if the value to the left of the operator evaluates to true. If it doesn't, and returns the value to its left; otherwise, the value to the right is evaluated and returned. So, if a value of 50 was passed into test_value(), the left side evaluates to true, so the and clause evaluates to the string, 'just right.' Factoring in that process, here's how the code would look.

return 'The value is ' + ('just right.' or 'too big!')

From here, the or operator works similarly to and, checking the value to its left to see if it evaluates to true. The difference is that if the value is true, that value is returned, without even evaluating the right-hand side of the operator at all. Looking at the condensed code here, it's clear that or would then return the string, 'just right.'

On the other hand, if the value passed into the test_value() function was 150, the behavior is changed. Since 150 < 100 evaluates to false, the and operator returns that value, without evaluating the right-hand side. In that case, here's the resulting expression.

return 'The value is ' + (False or 'too big!')

Since False is obviously false, the or operator returns the value to its right instead, 'too big!' This behavior has led many people to rely on the and/or combination for conditional expressions. But have you noticed the problem? One of the assumptions being made here causes the whole thing to break down in many situations.

The problem is in the or clause when the left side of the and clause is true. In that case, the behavior of the or clause depends entirely on the value to the left of the operator. In the case shown here, it's a non-empty string, which will always evaluate to true, but what happens if you supply it an empty string, the number 0 or, worst of all, a variable that could contain a value you can't be sure of until the code executes?

What essentially happens is that the left side of the and clause evaluates to true, but the right side evaluates to false, so the end result of that clause is a false value. Then, when the or clause evaluates, its left side is false, so it returns the value to its right. In the end, the expression will always return the item to the right of the or operator, regardless of the value at the beginning of the expression.
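A quick shell session makes the breakdown visible. Here the true-case string is empty, so even though the value is well under 100, the and/or form produces the wrong message, while the conditional expression handles the same values correctly.

>>> value = 50
>>> 'The value is ' + (value < 100 and '' or 'too big!')
'The value is too big!'
>>> 'The value is ' + ('' if value < 100 else 'too big!')
'The value is '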

Because no exceptions are raised, it doesn't look like anything is actually broken in the code. Instead, it simply looks like the first value in the expression was false, because it's returning the value that you would expect in that case. This may lead you to try to debug whatever code defines that value, rather than looking at the real problem, which is the value between the two operators.

Ultimately, what makes it so hard to pin down is that you have to distrust your own code, removing any assumptions you may have had about how it should work. You have to really look at it the way Python sees it, rather than how a human would see it.

Iteration

There are generally two ways of looking at sequences: as a collection of items or as a way to access a single item at a time. These two aren't mutually exclusive, but it's useful to separate them in order to understand the different features available in each case. Working on the collection as a whole requires that all the items be in memory at once, but accessing them one at a time can often be done much more efficiently.

Iteration refers to this more efficient form of traversing a collection, working with just one item at a time before moving on to the next. Iteration is an option for any type of sequence, but the real advantage comes in special types of objects that don't need to load everything in memory all at once. The canonical example of this is Python's built-in range() function, which iterates over the integers that fall within a given range.

>>> for x in range(5):
...     print(x)
...
0
1
2
3
4

At a glance, it may look like range() returns a list containing the appropriate values, but it doesn't. You can see this if you examine its return value on its own, without iterating over it.

>>> range(5)
range(0, 5)
>>> list(range(5))
[0, 1, 2, 3, 4]

The range object itself doesn't contain any of the values in the sequence. Instead, it generates them one at a time, on demand, during iteration. If you truly want a list that you can add or remove items from, you can coerce one by passing the range object into a new list object. This internally iterates just like a for loop, so the generated list uses the same values that are available when iterating over the range itself.

Chapter 5 shows how you can write your own iterable objects that work similarly to range(). In addition to providing iterable objects, there are a number of ways to iterate over these objects in different situations, for different purposes. The for loop is the most obvious technique, but Python offers other forms of syntax as well, which are outlined in this section.

Sequence Unpacking

Generally, you would assign one value to one variable at a time, so when you have a sequence, you would assign the entire sequence to a single variable. When the sequence is small and you know how many items it contains and what each item will be, this is fairly limiting, because you'll often end up just accessing each item individually, rather than dealing with them as a sequence.

This is particularly common when working with tuples, where the sequence often has a fixed length and each item in the sequence has a pre-determined meaning. Tuples of this type are also the preferred way to return multiple values from a function, which makes it all the more annoying to have to bother with them as a sequence. Ideally, you should be able to retrieve them as individual items directly when getting the function's return value.

To allow for this, Python supports a special syntax called sequence unpacking. Rather than specifying a single name to assign a value, you can specify a number of names as a tuple on the left side of the = operator. This will cause Python to unpack the sequence on the right side of the operator, assigning each value to the related name on the left side.

>>> 'propython.com'.split('.')
['propython', 'com']
>>> components = 'propython.com'.split('.')
>>> components
['propython', 'com']
>>> domain, tld = 'propython.com'.split('.')
>>> domain
'propython'
>>> tld
'com'
>>> domain, tld = 'www.propython.com'.split('.')
Traceback (most recent call last):
  ...
ValueError: too many values to unpack

The error shown at the end of this example illustrates the only significant limitation of this approach: the number of variables to assign must match the number of items in the sequence. If they don't match, Python can't properly assign the values. If you look at the tuple as being similar to an argument list, though, there's another option available.

If you add an asterisk before the final name in the variable list, Python will gather any values that couldn't be assigned to one of the other variables into a list, which is stored in that final variable. As a result, you can still assign a sequence that contains more items than you have explicit variables to hold them. The starred variable will happily accept an empty list, so the only remaining failure case is a sequence with fewer items than there are unstarred variables, which raises the ValueError shown previously.

>>> domain, *path = 'propython.com/example/url'.split('/')
>>> domain
'propython.com'
>>> path
['example', 'url']

Note

Chapter 3 will show how a similar syntax applies to function arguments as well.

List Comprehensions

When you have a sequence with more items than you really need, it's often useful to generate a new list containing just those items that meet certain criteria. There are a few ways to do that, the most obvious being to use a simple for loop, adding each item in turn.

>>> output = []
>>> for value in range(10):
...     if value > 5:
...         output.append(str(value))
...
>>> output
['6', '7', '8', '9']

Unfortunately, that adds four lines and two levels of indentation to your code, even though it's an extremely common pattern to use. Instead, Python offers a more concise syntax for this case, which allows you to express the three main aspects of that code in a single line.

  • A sequence to retrieve values from

  • An expression that's used to determine whether a value should be included

  • An expression that's used to provide a value to the new list

These are all combined into a syntax called list comprehensions. Here's how the preceding example would look when rewritten to use this construct; the three basic segments described above map directly onto its three clauses.

>>> output = [str(value) for value in range(10) if value > 5]
>>> output
['6', '7', '8', '9']

As you can see, the three portions of the overall form have been rearranged slightly, with the expression for the final value coming first, followed by the iteration and ending with the condition for deciding which items are included. You may also consider the variable that contains the new list to be its own fourth portion of the form, but since the comprehension is really just an expression, it doesn't have to be assigned to a variable. It could just as easily be used to feed a list into a function.

>>> min([value for value in range(10) if value > 5])
6

Of course, this seems to violate the whole point of iteration that was pointed out earlier. After all, the comprehension returns a full list, only to have it thrown away when min() processes the values. For these situations, Python provides a different option: generator expressions.

Generator Expressions

Instead of creating an entire list based on certain criteria, it's often more useful to leverage the power of iteration for this process as well. Rather than surrounding the comprehension in brackets, which would indicate the creation of a proper list, you can surround it in parentheses, which will create a generator. Here's how it looks in action.

>>> gen = (value for value in range(10) if value > 5)
>>> gen
<generator object <genexpr> at 0x...>
>>> min(gen)
6
>>> min(gen)
Traceback (most recent call last):
  ...
ValueError: min() arg is an empty sequence
>>> min(value for value in range(10) if value > 5)
6

Okay, so there are a few things going on here, but it's easier to understand once you've seen the output, so you have a frame of reference. First off, a generator is really just an iterable object that you don't have to create using the explicit interface. Chapter 5 shows how you can create iterators manually and even how to create generators with more flexibility, but the generator expression is the simplest way to deal with them.

When you create a generator—whether a generator expression or some other form—you don't immediately have access to the sequence. The generator object doesn't yet know what values it'll need to iterate over; it won't know that until it actually starts generating them. So if you view or inspect a generator without iterating over it, you won't have access to the full range of values.

In order to retrieve those values, all you need to do is iterate over the generator like you ordinarily would and it'll happily spit out values as needed. This step is implicitly performed inside many built-in functions, such as min(). If those functions are able to operate without building a complete list, you can use generators to dramatically improve performance over the use of other options. If they do have to create a new list, you're not losing anything by delaying until the function really needs to create it.

But notice what happens if you iterate over the generator twice. The second time through, you get an error that you tried to pass in an empty sequence. Remember, a generator doesn't contain all the values; it just iterates over them when asked to do so. Once the iteration is complete and there are no more values left to iterate, the generator doesn't restart. Instead, it simply behaves like an empty sequence each time it's iterated thereafter.

There are two main reasons behind this behavior. First, it's not always obvious how it should restart the sequence. Some iterables, such as range(), do have an obvious way to restart themselves, so those do restart when iterated multiple times. Unfortunately, because there are any number of ways to create generators—and iterators in general—it's up to the iterable itself to determine when and how the sequence gets reset. Chapter 5 will explain this behavior, and how you can customize it for your own needs, in more detail.

Second, not all sequences should be reset once they complete. For example, you might implement an interface for cycling through a collection of active users, which may change over time. Once your code finishes iterating over the available users, it shouldn't simply reset to the same sequence over and over again. The nature of that ever-changing set of users means that Python itself can't possibly guess at how to control it. Instead, that behavior is controlled by more complex iterators.

One final note to point out about generator expressions is that, even though they must always be surrounded by parentheses, those parentheses don't always need to be unique to the expression. The last expression in this section's example simply uses the parentheses from the function call to enclose the generator expression, which also works just fine.

This form may seem a little odd at first, but in this simple case, it saves you from having an extra set of parentheses hanging around. However, if the generator expression is just one of multiple arguments or if it's part of a more complex expression, you still need to include explicit parentheses around the generator expression itself, to make sure Python knows your intent.
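For example, when the generator expression is a function's sole argument, the call's own parentheses suffice, but adding a second argument means the expression needs its own set.

>>> sum(value for value in range(10) if value > 5)
30
>>> sum((value for value in range(10) if value > 5), 100)
130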

Set Comprehensions

Sets—described in more detail in their own section under Collections—are very similar to lists in their construction, so you can build a set using a comprehension in basically the same way as lists. The only significant difference between the two is the use of curly braces instead of brackets surrounding the expression.

>>> {str(value) for value in range(10) if value > 5}
{'6', '7', '8', '9'}

Note

Unlike sequences, sets are unordered, so different platforms may display the items in a different order. The only guarantee is that the same items will be present in the set, regardless of the platform.

Dictionary Comprehensions

There's certainly a theme developing with the construction of comprehensions for different types, and it's not limited solely to one-dimensional sequences. Dictionaries can also be a form of sequence, but each item is really a pair of a key and its value. This is reflected in the literal form, by separating each key from its value by the use of a colon.

Since that colon is the factor that distinguishes the syntax for dictionaries from that of sets, the same colon is what separates dictionary comprehensions from set comprehensions. Where you would ordinarily include a single value, simply supply a key/value pair, separated by a colon. The rest of the comprehension follows the same rules as the other types.

>>> {value: str(value) for value in range(10) if value > 5}
{8: '8', 9: '9', 6: '6', 7: '7'}

Note

Remember, dictionaries are unordered, so their keys work a lot like sets. If you need a dictionary with keys that can be reliably ordered, see the Ordered Dictionaries section later in this chapter.

Chaining Iterables Together

Working with one iterable is useful enough in most situations, but sometimes you'll need to access one right after another, performing the same operation on each. The simple approach would be to just use two separate loops, duplicating the code block for each loop. The logical next step would be to factor that code out into a function, but now you have an extra function call in the mix for something that really only needs to be done inside the loop.

Instead, Python provides the chain() function, as part of its itertools module. The itertools module includes a number of different utilities, some of which are described in the following sections. The chain() function, in particular, accepts any number of iterables and returns a new generator that will iterate over each one in turn.

>>> import itertools
>>> list(itertools.chain(range(3), range(4), range(5)))
[0, 1, 2, 0, 1, 2, 3, 0, 1, 2, 3, 4]

Zipping Iterables Together

Another common operation involving multiple iterables is to merge them together, side by side. The first items from each iterable would come together to form a single tuple as the first value returned by a new generator. All the second items become part of the second tuple in the generator, and so on. The built-in zip() function provides this functionality when needed.

>>> list(zip(range(3), reversed(range(5))))
[(0, 4), (1, 3), (2, 2)]

Notice here that even though the second iterable has five values, the resulting sequence only contains three values. When given iterators of varying lengths, zip() goes with the least common denominator, so to speak. Essentially, zip() makes sure that each tuple in the resulting sequence has exactly as many values as there are iterators to join together. Once the smallest sequence has been exhausted, zip() simply stops looking through the others.

This functionality is particularly useful in creating dictionaries, because one sequence can be used to supply the keys, while another supplies the values. Using zip() can join these together into the proper pairings, which can then be passed directly into a new dict().

>>> keys = map(chr, range(97, 102))
>>> values = range(1, 6)
>>> dict(zip(keys, values))
{'a': 1, 'c': 3, 'b': 2, 'e': 5, 'd': 4}
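As a final aside on zip(): if truncation isn't the behavior you need, the itertools module provides zip_longest(), which pads the shorter iterables with a fill value (None by default) instead of stopping early.

>>> import itertools
>>> list(itertools.zip_longest(range(3), reversed(range(5))))
[(0, 4), (1, 3), (2, 2), (None, 1), (None, 0)]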

Collections

There are a number of well-known objects that come standard with the Python distribution, both as built-ins available to all modules and as part of the standard package library. Objects such as integers, strings, lists, tuples and dictionaries are in common use among nearly all Python programs, but others, including sets, named tuples and some special types of dictionaries, are used less often and may be unfamiliar to those who haven't already needed to discover them.

Some of these are built-in types that are always available to every module, while others are part of the standard library included with every Python installation. There are still more that are provided by third-party applications, some of which have become fairly commonly installed, but this section will only cover those included with Python itself.

Sets

Typically, collections of objects are represented in Python by tuples and lists, but sets provide another way to work with the same data. Essentially, a set works much like a list, but without allowing any duplicates, making it useful for identifying the unique objects in a collection. For example, here's how a simple function might use a set to determine which letters are used in a given string.

>>> def unique_letters(word):
...     return set(word.lower())
...
>>> unique_letters('spam')
{'a', 'p', 's', 'm'}
>>> unique_letters('eggs')
{'s', 'e', 'g'}

There are a few things to notice here. First, the built-in set type takes a sequence as its argument, which populates the set with all the unique elements found in that sequence. This is valid for any sequence, such as a string as shown in the example as well as lists, tuples, dictionary keys or custom iterable objects.

In addition, notice that the items in the set aren't ordered the same way they appeared in the original string. Sets are concerned solely with membership. They keep track of items that are in the set, without any notion of ordering. That seems like a limitation, but if you need ordering, you probably want a list anyway. Sets are very efficient when you only need to know if an item is a member of a collection, without regard to where it is in the collection or how many times it has otherwise appeared.

The third thing to notice is the representation shown when displaying the set in the interactive shell. Since these representations are intended to match what you could type into a source file, this indicates a syntax for declaring sets as literals in your code. It looks very similar to a dictionary, but without any values associated with the keys. That's actually a fairly accurate analogy, because a set works very much like the collection of keys in a dictionary.

Since sets are designed for a different purpose than sequences and dictionaries, the available operations and methods are a bit different than you might be used to. To start, though, let's look at the way sets behave fairly similarly to other types. Perhaps the most common use of sets is to determine membership, a task often asked of both lists and dictionaries as well. In the spirit of matching expectations, this uses the in keyword, familiar from other types.

>>> example = {1, 2, 3, 4, 5}
>>> 4 in example
True
>>> 6 in example
False

In addition, items can be added to or removed from the set later on. The list's append() method isn't suitable for sets, because to append an item is to add it at the end, which then implies that the order of items in the collection is important. Since sets aren't at all concerned with ordering, they instead use the add() method, which just makes sure that the specified item ends up in the set. If it was already there, add() does nothing; otherwise, it adds the item to the set, so there are never any duplicates.

>>> example.add(6)
>>> example
{1, 2, 3, 4, 5, 6}
>>> example.add(6)
>>> example
{1, 2, 3, 4, 5, 6}

Dictionaries have the useful update() method, which adds the contents of a new dictionary to one that already exists. Sets have an update() method as well, performing the same task.

>>> example.update({6, 7, 8, 9})
>>> example
{1, 2, 3, 4, 5, 6, 7, 8, 9}

On the other side, removing items from the set can be done in a few different ways, each serving a different need. The most direct complement to add() is the remove() method, which removes a specific item from the set. If that item wasn't in the set in the first place, it raises a KeyError.

>>> example.remove(9)
>>> example.remove(9)
Traceback (most recent call last):
  ...
KeyError: 9
>>> example
{1, 2, 3, 4, 5, 6, 7, 8}

Many times, though, it doesn't matter whether the item was already in the set or not; you may only care that it's not in the set when you're done with it. For this purpose, sets also have a discard() method, which works just like remove() but without raising an exception if the specified item wasn't in the set.

>>> example.discard(8)
>>> example.discard(8)
>>> example
{1, 2, 3, 4, 5, 6, 7}

Of course, remove() and discard() both assume you already know what object you want to remove from the set. To simply remove any item from a set, use the pop() method, which again is borrowed from the list API, but differs slightly. Since sets aren't explicitly ordered, there's no real end of the set for an item to be popped off. Instead, the set's pop() method picks one arbitrarily, returning it for use outside the set.

>>> example.pop()
1
>>> example
{2, 3, 4, 5, 6, 7}

Lastly, sets also provide a way to remove all items in one shot, resetting it to an empty state. The clear() method is used for this purpose.

>>> example.clear()
>>> example
set()

Note

The representation of an empty set is set(), rather than {}, because Python needs to maintain a distinction between sets and dictionaries. In order to preserve compatibility with older code written before the introduction of set literals, empty curly braces remain dedicated to dictionaries, so sets use their name instead.

In addition to methods for modifying the contents in-place, sets also provide operations where two sets combine in some way to return a new set. The most common of these is a union, where the contents of two sets are joined together, so the resulting new set contains all items that were in either of the original sets. It's essentially the same as using the update() method, except that neither of the original sets is altered.

The union of two sets is a lot like a bit-wise OR operation, so Python represents it with the pipe character (|), which is the same as is used for bit-wise OR. In addition, sets offer the same functionality using the union() method, which can be called from either set involved.

>>> {1, 2, 3} | {4, 5, 6}
{1, 2, 3, 4, 5, 6}
>>> {1, 2, 3}.union({4, 5, 6})
{1, 2, 3, 4, 5, 6}

The logical complement to that operation is the intersection, where the result is the set of all items common to the original sets. Again, this is analogous to a bit-wise operation, but this time it's the bit-wise AND, and again, Python uses the ampersand (&) to represent the operation as it pertains to sets. Sets also have an intersection() method which performs the same task.

>>> {1, 2, 3, 4, 5} & {4, 5, 6, 7, 8}
{4, 5}
>>> {1, 2, 3, 4, 5}.intersection({4, 5, 6, 7, 8})
{4, 5}

You can also determine the difference between two sets, resulting in a set of all the items that exist in the first set but not the second. Because it removes the contents of one set from another, it works a lot like subtraction, so Python uses the subtraction operator (-) to perform this operation, along with the difference() method.

>>> {1, 2, 3, 4, 5} - {2, 4, 6}
{1, 3, 5}
>>> {1, 2, 3, 4, 5}.difference({2, 4, 6})
{1, 3, 5}

In addition to that basic difference, Python sets offer a variation called a symmetric difference, using the symmetric_difference() method. Here, the resulting set contains all items that were in either set, but not in both. This is equivalent to the bit-wise exclusive OR operation, commonly referred to as XOR. Since Python uses the caret (^) to represent XOR elsewhere, sets support the same operator in addition to the method.

>>> {1, 2, 3, 4, 5} ^ {4, 5, 6}
{1, 2, 3, 6}
>>> {1, 2, 3, 4, 5}.symmetric_difference({4, 5, 6})
{1, 2, 3, 6}

Lastly, it's possible to determine whether all the items in one set also exist in another. If one set contains all the items of another, the first is considered to be a superset of the other, even if the first contains additional items not present in the second. The inverse, where all the items in the first set are contained in the second, even if the second has more items, means the first set is a subset of the second.

Testing whether one set is a subset or a superset of another is performed by the issubset() and issuperset() methods, respectively. The same test can be performed manually, by subtracting one set from the other and checking whether any items remain: if nothing is left, the empty difference evaluates to False, and the first set is definitely a subset of the second. Testing for a superset is as simple as swapping the two sets in the operation. Using the methods, however, avoids creating a new set just to have it reduce to a Boolean anyway.

>>> {1, 2, 3}.issubset({1, 2, 3, 4, 5})
True
>>> {1, 2, 3, 4, 5}.issubset({1, 2, 3})
False
>>> {1, 2, 3}.issuperset({1, 2, 3, 4, 5})
False
>>> {1, 2, 3, 4, 5}.issuperset({1, 2, 3})
True

>>> not ({1, 2, 3} - {1, 2, 3, 4, 5})
True
>>> not ({1, 2, 3, 4, 5} - {1, 2, 3})
False

Note

Looking at how subsets and supersets can be determined using subtraction, you might notice that two identical sets will always subtract to an empty set, and the order of the two sets is irrelevant. This is correct, and because {1, 2, 3} - {1, 2, 3} is always empty, each set is both a subset and a superset of the other.
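You can verify this at the interactive prompt:

>>> {1, 2, 3} - {1, 2, 3}
set()
>>> {1, 2, 3}.issubset({1, 2, 3})
True
>>> {1, 2, 3}.issuperset({1, 2, 3})
True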

Named Tuples

Dictionaries are extremely useful, but sometimes you have a fixed set of possible keys, so you don't need that much flexibility. For those cases, Python offers named tuples, which provide some of the same functionality but are much more efficient, because individual instances don't need to store the keys at all, only the values associated with them.

Named tuples are created using a factory function from the collections module, called namedtuple(). Rather than returning an individual object, namedtuple() returns a new class, which is customized for a given set of names. The first argument is the name of the tuple class itself, but the second is, unfortunately, less straightforward. It takes a string of attribute names, which are separated by either a space or a comma.

>>> from collections import namedtuple
>>> Point = namedtuple('Point', 'x y')
>>> point = Point(13, 25)
>>> point
Point(x=13, y=25)
>>> point.x, point.y
(13, 25)
>>> point[0], point[1]
(13, 25)

As an efficient trade-off between tuples and dictionaries, named tuples are a convenient way for a function to return multiple values: there's no need to populate a full dictionary, but the values can still be referenced by meaningful names rather than integer indexes.
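As a brief sketch of that pattern, a function might bundle the smallest and largest items of a sequence into a named tuple. The MinMax class and min_max() function here are purely illustrative names.

from collections import namedtuple

MinMax = namedtuple('MinMax', 'min max')

def min_max(values):
    # Both extremes come back as a single named, indexable object
    return MinMax(min(values), max(values))

Calling min_max([3, 1, 4, 1, 5]) would return MinMax(min=1, max=5), so callers can use result.min and result.max without memorizing positional order.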

Ordered Dictionaries

If you've ever iterated over the keys of a dictionary or printed its contents to the interactive prompt, as has been done previously in this chapter, you'll notice that its keys don't always follow a predictable order. Sometimes they may look like they're sorted numerically or alphabetically, but other times it seems completely random.

Dictionary keys, like sets, are considered to be unordered. Even though there may occasionally appear to be patterns, these are merely a byproduct of the implementation and aren't formally defined. Not only is the ordering inconsistent from one dictionary to another, but the variations are even more significant when using a different Python implementation, such as Jython or IronPython.

Most of the time, what you're really looking for from a dictionary is a way to map specific keys to associated values, so the ordering of the keys is irrelevant. Sometimes, though, it's also useful to be able to iterate over those keys in a reliable manner. To offer the best of both worlds, Python offers the OrderedDict class by way of its collections module. This provides all the features of a dictionary, but with reliable ordering of keys.

>>> from collections import OrderedDict
>>> d = OrderedDict((value, str(value)) for value in range(10) if value > 5)
>>> d
OrderedDict([(6, '6'), (7, '7'), (8, '8'), (9, '9')])
>>> d[10] = '10'
>>> d
OrderedDict([(6, '6'), (7, '7'), (8, '8'), (9, '9'), (10, '10')])
>>> del d[7]
>>> d
OrderedDict([(6, '6'), (8, '8'), (9, '9'), (10, '10')])

As you can see, the same construction used previously now results in a properly ordered dictionary that does the right thing even as you add and remove items.

Warning

In the example here, notice that the values for the dictionary are provided using a generator expression. If you supply a standard dictionary instead, its keys are unordered before they ever reach the ordered dictionary, which will then assume that arbitrary order was intentional and preserve it. The same thing happens if you supply values as keyword arguments, because those are passed internally as a regular dictionary. The only reliable way to supply ordering to OrderedDict() is to use an ordered sequence of key/value pairs, such as a list of 2-tuples or a generator expression.
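A list of 2-tuples, for example, preserves exactly the order you write:

>>> OrderedDict([('b', 2), ('a', 1)])
OrderedDict([('b', 2), ('a', 1)])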

Dictionaries with Defaults

Another common pattern using dictionaries is to always assume some default value in the event that a key can't be found in the mapping. This behavior can be achieved either by explicitly catching the KeyError raised when accessing the key or by using the available get() method, which can return a suitable default if the key wasn't found. One such example of this pattern is using a dictionary to track how many times each word appears in some text.

def count_words(text):
    count = {}
    for word in text.split(' '):
        current = count.get(word, 0)  # Make sure we always have a number
        count[word] = current + 1
    return count

Instead of having to deal with that extra get() call, the collections module provides a defaultdict class that can handle that step for you. When you create it, you can pass in a callable as the single argument, which will be used to create a new value when a requested key doesn't exist. In most cases, you can just supply one of the built-in types, which will provide a useful basic value to work with. In the case of count_words(), we can use int.

from collections import defaultdict

def count_words(text):
    count = defaultdict(int)
    for word in text.split(' '):
        count[word] += 1
    return count

Essentially any callable can be used, but the built-in types tend to provide optimal default values for whatever you need to work with. Using list will give you an empty list, str returns an empty string, int returns 0 and dict returns an empty dictionary. If you have more specialized needs, any callable that can be used without any arguments will work. Chapter 3 will introduce lambda functions, which are convenient for cases like this.
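As a quick sketch of how a different callable changes the default, a defaultdict built with list lets you group items without ever checking whether a key exists yet; group_by_length() is just an illustrative name.

from collections import defaultdict

def group_by_length(words):
    groups = defaultdict(list)  # Missing keys start out as empty lists
    for word in words:
        groups[len(word)].append(word)
    return groups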

Importing Code

Complex Python applications are typically made up of a number of different modules, often separated into packages to supply more granular namespaces. Importing code from one module to another is a simple matter, but that's only part of the story. There are several additional features available for more specific situations that you're likely to run into.

Fallback Imports

By now, you've seen several points where Python changes over time, sometimes in backward-incompatible ways. One particular change that tends to come up occasionally is when a module gets moved or renamed, but still does essentially the same thing as before. The only update needed to make your code work with it is to change the import location, but you'll often need to maintain compatibility with versions both before and after the change.

The solution to this problem exploits Python's exception handling to determine whether the module exists at the new location. Since imports are processed at run-time, like any other statement, you can wrap them in a try block and catch an ImportError, which is raised if the import failed. Here's how you might import a common hash algorithm both before and after the change in Python 2.5, which moved its import location.

try:
    # Use the new library if available. Added in Python 2.5
    from hashlib import md5
except ImportError:
    # Compatible functionality provided prior to Python 2.5
    from md5 import new as md5

Notice here that the import prefers the newer library. That's because changes like this usually have a grace period, during which the old location is still available but deprecated. If you check for the older module first, you'll keep finding it long after the new module becomes available. By checking for the new one first, you take advantage of any newer features or added behaviors as soon as they're available, falling back to older functionality only when necessary. Using the as keyword allows the rest of the module to simply reference the name md5 either way.

This technique is just as applicable to third-party modules as it is to Python's own standard library, but third-party applications often require different handling. Rather than determining which module to use, it's often necessary to distinguish whether the application is available at all. This is determined the same way as the previous example, by wrapping the import statement in a try block.

What happens next, however, depends on how your application should behave if the module is unavailable. Some modules are strictly required, so if it's missing, you should raise an exception directly inside the except ImportError block or simply forgo exception handling altogether. Other times, a missing third-party module simply means a reduction in functionality. In this case, the most common approach is to assign None to the variable that would otherwise contain the imported module.

try:
    import docutils  # Common Python-based documentation tools
except ImportError:
    docutils = None

Then, when your code needs to utilize features in the imported module, it can use something like if docutils to see if the module is available, without having to re-import it.
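Here's a minimal sketch of that pattern. The render_html() helper is hypothetical, though docutils.core.publish_parts() is the documented convenience function for producing HTML fragments; any reduced-functionality fallback would follow the same shape.

def render_html(text):
    if docutils is None:
        # Reduced functionality: pass the text through unchanged
        return text
    from docutils.core import publish_parts
    # 'html' selects the HTML writer; 'body' holds the rendered fragment
    return publish_parts(text, writer_name='html')['body']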

Importing from the Future

Python's release schedule often incorporates new features, but it's not always a good idea to just introduce them out of nowhere. In particular, syntax additions and behavior changes may break existing code, so it's often necessary to provide a bit of a grace period. During the transition, these new features are made available by way of a special kind of import, letting you choose which features are updated for each module.

The special __future__ module allows you to name specific features that you'd like to use in a given module. This provides a simple compatibility path for your code, since some modules can rely on new features while other modules continue using existing behavior. Typically, in the release after a feature is added to __future__, it becomes a standard feature available to all modules.

As a quick example, Python 3.0 changed the way integer division worked. In earlier versions, dividing one integer by another always resulted in an integer, which often meant a loss of precision when the result would normally produce a remainder. That makes sense to programmers who are familiar with the underlying C implementation, but it's different from what happens if you perform the same calculation on a standard calculator, so it caused a lot of confusion.

The behavior of division was changed to return floating point values if the division would contain a remainder, thus matching how a standard calculator would work. Before making the change across all of Python, however, the division option was added to the __future__ module, allowing the behavior to be changed earlier if necessary. Here's how an interactive interpreter session might look in Python 2.5.

>>> 5 / 2  # Python 2.5 uses integer-only division by default
2
>>> from __future__ import division  # This updates the behavior of division
>>> 5 / 2
2.5

There are a number of such features made available through the __future__ module, with new options being added with each release of Python. Rather than trying to list them all here, the remainder of this book will mention them when the features being described were recent enough to need a __future__ import in older versions of Python, back to Python 2.5. Full details on these feature changes can always be found on the "What's New" page of the Python documentation.[5]

Note

If you try to import a feature from __future__ that already exists in the version of Python you're using, it doesn't do anything. The feature is already available, so no changes have to be made, but it also doesn't raise any exceptions.

Using __all__ to Customize Imports

One of the lesser-used features of Python imports is the ability to import the namespace from one module into that of another. This is achieved by using an asterisk as the portion of the module to import.

>>> from itertools import *
>>> list(chain([1, 2, 3], [4, 5, 6]))
[1, 2, 3, 4, 5, 6]

Ordinarily, this would just take all the entries in the imported module's namespace that don't begin with an underscore and dump them into the current module's namespace. In modules that make heavy use of the imported module, this can save some typing, because you no longer have to include the module name every time you access one of its attributes.

Sometimes, though, it doesn't make sense for every object to be made available in this way. In particular, frameworks often include a number of utility functions and classes that are useful within the framework's module, but don't make much sense when exported out into external code. In order to control what objects get exported when you import a module like this, you can specify __all__ somewhere in the module.

All you need to do is supply a list—or some other sequence—that contains the names of objects that should get imported when the module is imported using an asterisk. Additional objects can still be imported by either importing the name directly or by just importing the module itself, rather than anything inside of it. Here's how an example module might supply its __all__ option.

__all__ = ['public_func']

def public_func():
    pass

def utility_func():
    pass

Of course, there would be useful code in both of those functions in the real world. For the purposes of illustration, though, here's a quick run-down of the different ways you could import that module, which we'll call example.

>>> import example
>>> example.public_func
<function public_func at 0x...>
>>> example.utility_func
<function utility_func at 0x...>
>>> from example import *
>>> public_func
<function public_func at 0x...>
>>> utility_func
Traceback (most recent call last):
  ...
NameError: name 'utility_func' is not defined
>>> from example import utility_func
>>> utility_func
<function utility_func at 0x...>

Notice how, in the final case, you can still import it directly using the from syntax, as long as you specify it explicitly. The only time __all__ comes into play is if you use an asterisk.

Relative Imports

When starting out with a project, you'll spend most of your time importing from external packages, so every import is absolute; its path is rooted in your system's PYTHONPATH. Once your projects start growing to several modules, you'll be importing from one another regularly. And once you establish a hierarchy, you might realize that you don't want to include the full import path when sharing code between two modules at similar parts of the tree.

Python allows you to specify a relative path to the module you'd like to import, so you can move around an entire package, if necessary, with minimal modifications required. The preferred syntax for this is to specify part of the module's path with one or more periods, indicating how far up the path to look for the module. For example, if the acme.shopping.cart module needs to import from acme.billing, the two following import patterns are identical.

from acme import billing
from .. import billing

A single period allows you to import from the current package, so acme.shopping.gallery could be imported as from . import gallery. Alternatively, if you're looking to just import something from that module, you could instead simply prefix the module path with the necessary periods, then specify the names to import as usual: from .gallery import Image.
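Putting those forms together, the cart module from the example above might contain imports like the following; gallery and Image are just the hypothetical names used in this section.

# acme/shopping/cart.py
from .. import billing          # the acme.billing module
from . import gallery           # the sibling acme.shopping.gallery module
from .gallery import Image      # a single name from that sibling module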

The __import__() function

You don't always have to place your imports at the top of a module. In fact, sometimes you might not be able to write some of your imports in advance at all. You might be making decisions about which module to import based on user-supplied settings or perhaps you're even allowing users to specify modules directly. These user-supplied settings are a convenient way to allow for extensibility without resorting to automatic discovery.

In order to support this functionality, Python allows you to import code manually, using the __import__() function. It's a built-in function, so it's available everywhere, but using it requires some explanation, because it's not as straightforward as some of the other features provided by Python. There are five possible arguments that can be used to customize how a module gets imported and what contents are retrieved.

  • name—The only argument that is always required, this accepts the name of the module that should be loaded. If it's part of a package, just separate each part of the path with a period, just like when using import path.to.module.

  • globals—A namespace dictionary that is used to define the context in which the module name is resolved. In standard import cases, the return value from the built-in globals() function is used to populate this argument.

  • locals—Another namespace dictionary, ideally used to help define the context in which the module name is resolved. In reality, however, current implementations of Python simply ignore it. In the event of future support, the standard import provides the return value from the built-in locals() function for this argument.

  • fromlist—A list of individual names that should be imported from the module, rather than importing the full module.

  • level—An integer indicating how the path should be resolved with respect to the module that calls __import__(). A value of −1 allows both absolute and implicit relative imports; 0 allows only absolute imports; positive values indicate how many levels up the path to use for an explicit relative import.

Even though that may seem simple enough, the return value contains a few traps that can cause quite a bit of confusion. It always returns a module object, but it can be surprising to see which module is returned and what attributes are available on it. Since there are a number of different ways to import modules, these variations are worth understanding. First, let's examine how different types of module names impact the return value.

In the simplest case, you'd pass in a single module name to __import__(), and the return value is just what you'd expect: the module referenced by the name provided. The attributes available on that module object are the same as you'd have available if you imported that name directly in your code: the entire namespace that was declared in that module's code.

When you pass in a more complex module path, however, the return value may not match expectations. Complex paths are provided using the same dot-separated syntax used in your source files, so importing os.path, for instance, would be achieved by passing in 'os.path'. The returned value in that case is os, but the path attribute lets you access the module you're really looking for.

The reason for that variation is that __import__() mimics the behavior of Python source files, where import os.path makes the os module available under that name. You can still access os.path, but the module that goes into the main namespace is os. Since __import__() works essentially the same way as a standard import, what you get in the return value is what you would have in the main module namespace ordinarily.
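You can see this behavior directly at the interactive prompt; the exact paths in the output will vary by platform.

>>> module = __import__('os.path')
>>> module
<module 'os' from 'C:\Python31\lib\os.py'>
>>> module.path
<module 'ntpath' from 'C:\Python31\lib\ntpath.py'>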

In order to get just the module at the end of the module path, there are a couple different approaches you can take. The most obvious, though not necessarily direct, would be to split the given module name on periods, using each portion of the path to get each attribute layer from the module returned by __import__(). Here's a simple function that would do the job.

>>> def import_child(module_name):
...     module = __import__(module_name)
...     for layer in module_name.split('.')[1:]:
...         module = getattr(module, layer)
...     return module
...
>>> import_child('os.path')
<module 'ntpath' from 'C:\Python31\lib\ntpath.py'>
>>> import_child('os')
<module 'os' from 'C:\Python31\lib\os.py'>

Note

The exact name of the module referenced by os.path will vary based on the operating system under which it's imported. For example, it's called ntpath on Windows, while most Linux systems use posixpath. Most of the contents are the same, but they may behave slightly differently depending on the needs of the operating system, and each may have additional attributes that are unique to that environment.

As you can see, it works for the simple case as well as more complex situations, but it still goes through a bit more work than is really necessary to do the job. Of course, the time spent on the loop is fairly insignificant compared to the import itself, but if the module had already been imported, that loop makes up most of what import_child() does. An alternate approach takes advantage of Python's own module caching mechanism to take the extra processing out of the picture.

>>> import sys
>>> def import_child(module_name):
...     __import__(module_name)
...     return sys.modules[module_name]
...
>>> import_child('os.path')
<module 'ntpath' from 'C:\Python31\lib\ntpath.py'>
>>> import_child('os')
<module 'os' from 'C:\Python31\lib\os.py'>

The sys.modules dictionary maps import paths to the module objects that were generated when importing them. By looking up the module in that dictionary, there's no need to mess around with the particulars of the module name.
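You can inspect that cache directly once a module has been imported:

>>> import sys
>>> import os.path
>>> sys.modules['os.path']
<module 'ntpath' from 'C:\Python31\lib\ntpath.py'>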

Of course, this is really only applicable to absolute imports. Relative imports, no matter how they are referenced, are resolved relative to the module where the import statement—or in this case, the __import__() function call—is located. Since the most common case is to place a function like import_child() in a common location, relative imports would be resolved relative to that module, rather than the one that called import_child(). That could mean importing the completely wrong module.

The importlib module

In order to address the issues that are raised by using __import__() directly, Python also includes the importlib module, which provides a more intuitive interface to import modules. The import_module() function is a much simpler way to achieve the same effect as __import__(), but in a way that more closely matches expectations.

For absolute imports, import_module() accepts the module path, just like __import__(). The difference, however, is that import_module() always returns the last module in the path, while __import__() returns the first one. The extra handling that was added in the previous section is made completely unnecessary because of this functionality, so this is a much better approach to use.

>>> from importlib import import_module
>>> import_module('os.path')
<module 'ntpath' from 'C:\Python31\lib\ntpath.py'>
>>> import_module('os')
<module 'os' from 'C:\Python31\lib\os.py'>

In addition, import_module() takes relative imports into account by also accepting a package argument that defines the reference point from which the relative path should be resolved. This is easily done when calling the function, simply by passing in the always-available __name__ variable, which holds the module path that was used to import the current module in the first place.

import_module('.utils', package=__name__)

Warning

Relative imports don't work directly inside the interactive interpreter. The module the interpreter runs in isn't actually in the filesystem, so there are no relative paths to work with.

Taking It With You

The features laid out in this chapter are just a taste of what Python has to offer if you're willing to take the time to learn it. The rest of this book will rely heavily on what was laid out here, but each chapter will add another layer for future chapters to build on as well. In that spirit, let's continue on to what you thought was one of the most basic, unassuming features of Python: functions.



[5] http://propython.com/whats-new/
