Chapter 6. Modules

A typical Python program is made up of several source files. Each source file is a module, grouping code and data for reuse. Modules are normally independent of each other, so that other programs can reuse the specific modules they need. Sometimes, to manage complexity, you group together related modules into a package—a hierarchical, tree-like structure.

A module explicitly establishes dependencies upon other modules by using import or from statements. In some programming languages, global variables provide a hidden conduit for coupling between modules. In Python, global variables are not global to all modules, but rather are attributes of a single module object. Thus, Python modules always communicate in explicit and maintainable ways.

Python also supports extension modules—modules coded in other languages such as C, C++, Java, or C#—for use in Python. For the Python code importing a module, it does not matter whether the module is pure Python or an extension. You can always start by coding a module in Python. Later, should you need more speed, you refactor and recode some parts of modules in lower-level languages, without changing the client code that uses those modules. Chapter 24 shows how to write extensions in C and Cython.

This chapter discusses module creation and loading. It also covers grouping modules into packages, and using Python’s distribution utilities (distutils and setuptools) to install distributed packages, and to prepare packages for distribution; this latter subject is more thoroughly covered in Chapter 25. This chapter closes with a discussion on how best to manage your Python environment(s).

Module Objects

A module is a Python object with arbitrarily named attributes that you can bind and reference. The Python code for a module named aname usually lives in a file named aname.py, as covered in “Module Loading”.

In Python, modules are objects (values), handled like other objects. Thus, you can pass a module as an argument in a call to a function. Similarly, a function can return a module as the result of a call. A module, just like any other object, can be bound to a variable, an item in a container, or an attribute of an object. Modules can be keys or values in a dictionary, and can be members of a set. For example, the sys.modules dictionary, covered in “Module Loading”, holds module objects as its values. The fact that modules can be treated like other values in Python is often expressed by saying that modules are first-class objects.

The import Statement

You can use any Python source file as a module by executing an import statement in another Python source file. import has the following syntax:

import modname [as varname][,...]

After the import keyword comes one or more module specifiers separated by commas. In the simplest, most common case, a module specifier is just modname, an identifier—a variable that Python binds to the module object when the import statement finishes. In this case, Python looks for the module of the same name to satisfy the import request. For example:

import mymodule

looks for the module named mymodule and binds the variable named mymodule in the current scope to the module object. modname can also be a sequence of identifiers separated by dots (.) to name a module in a package, as covered in “Packages”.

When as varname is part of a module specifier, Python looks for a module named modname and binds the module object to the variable varname. For example:

import mymodule as alias

looks for the module named mymodule and binds the module object to variable alias in the current scope. varname must always be a simple identifier.

Module body

The body of a module is the sequence of statements in the module’s source file. There is no special syntax required to indicate that a source file is a module; any valid Python source file can be used as a module. A module’s body executes immediately the first time the module is imported in a given run of a program. During execution of the body, the module object has already been created, and an entry in sys.modules is already bound to the module object. The module’s (global) namespace is gradually populated as the module’s body executes.

Attributes of module objects

An import statement creates a new namespace containing all the attributes of the module. To access an attribute in this namespace, use the name or alias of the module as a prefix:

import mymodule
a = mymodule.f()

or:

import mymodule as alias
a = alias.f()

Attributes of a module object are normally bound by statements in the module body. When a statement in the module body binds a (global) variable, what gets bound is an attribute of the module object.

A module body exists to bind the module’s attributes

The normal purpose of a module body is to create the module’s attributes: def statements create and bind functions, class statements create and bind classes, and assignment statements can bind attributes of any type. For clarity and cleanliness of your code, be wary about doing anything else in the top logical level of the module’s body except binding the module’s attributes.

You can also bind module attributes outside the body (i.e., in other modules), generally using the attribute reference syntax M.name (where M is any expression whose value is the module, and identifier name is the attribute name). For clarity, however, it’s usually best to bind module attributes only in the module’s own body.

The import statement sets some module attributes as soon as it creates the module object, before the module’s body executes. The __dict__ attribute is the dict object that the module uses as the namespace for its attributes. Unlike other attributes of the module, __dict__ is not available to code in the module as a global variable. All other attributes in the module are items in __dict__ and are available to code in the module as global variables. Attribute __name__ is the module’s name; attribute __file__ is the filename from which the module was loaded; other dunder-named attributes hold other module metadata (v3 adds yet more dunder-named attributes to all modules).

For any module object M, any object x, and any identifier string S (except __dict__), binding M.S = x is equivalent to binding M.__dict__['S'] = x. An attribute reference such as M.S is also substantially equivalent to M.__dict__['S']. The only difference is that, when S is not a key in M.__dict__, accessing M.__dict__['S'] raises KeyError, while accessing M.S raises AttributeError. Module attributes are also available to all code in the module’s body as global variables. In other words, within the module body, S used as a global variable is equivalent to M.S (i.e., M.__dict__['S']) for both binding and reference (when S is not a key in M.__dict__, however, referring to S as a global variable raises NameError).
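
For instance, here is a minimal interactive sketch of these equivalences (mymodule is, as before, just a hypothetical module found on sys.path):

import mymodule                          # hypothetical module on sys.path

mymodule.version = '1.0'                 # binds an attribute of the module...
print(mymodule.__dict__['version'])      # ...which is also an item of __dict__: prints 1.0

# mymodule.__dict__['missing']           # would raise KeyError
# mymodule.missing                       # would raise AttributeError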

Python built-ins

Python supplies several built-in objects (covered in Chapter 7). All built-in objects are attributes of a preloaded module named builtins (in v2, this module’s name is __builtin__). When Python loads a module, the module automatically gets an extra attribute named __builtins__, which refers either to the module builtins (in v2, __builtin__) or to its dictionary. Python may choose either, so don’t rely on __builtins__. If you need to access the module builtins directly (a rare need), use an import builtins statement (in v2, import __builtin__ as builtins). When you access a variable not found in either the local namespace or the global namespace of the current module, Python looks for the identifier in the current module’s __builtins__ before raising NameError.

The lookup is the only mechanism that Python uses to let your code access built-ins. The built-ins’ names are not reserved, nor are they hardwired in Python itself. The access mechanism is simple and documented, so your own code can use the mechanism directly (do so in moderation, or your program’s clarity and simplicity will suffer). Since Python accesses built-ins only when it cannot resolve a name in the local or module namespace, it is usually sufficient to define a replacement in one of those namespaces. You can, however, add your own built-ins or substitute your functions for the normal built-in ones, in which case all modules see the added or replaced one. The following v3 toy example shows how you can wrap a built-in function with your own function, allowing abs() to take a string argument (and return a rather arbitrary mangling of the string):

# abs takes a numeric argument; let's make it accept a string as well
import builtins
_abs = builtins.abs                          # save original built-in
def abs(str_or_num):
    if isinstance(str_or_num, str):          # if arg is a string
        return ''.join(sorted(set(str_or_num)))  # get this instead
    return _abs(str_or_num)                  # call real built-in
builtins.abs = abs                    # override built-in w/wrapper 

The only change needed to make this example work in v2 is to replace import builtins with import __builtin__ as builtins.

Module documentation strings

If the first statement in the module body is a string literal, Python binds that string as the module’s documentation string attribute, named __doc__. Documentation strings are also called docstrings; we cover them in “Docstrings”.
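
For example, a hypothetical file walrus.py whose body starts with a string literal:

# walrus.py -- a hypothetical module whose first statement is a string literal
"""Utilities for handling walrus data."""

def feed():
    """Feed the walrus."""

# elsewhere, after: import walrus
# walrus.__doc__ is 'Utilities for handling walrus data.'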

Module-private variables

No variable of a module is truly private. However, by convention, every identifier starting with a single underscore (_), such as _secret, is meant to be private. In other words, the leading underscore communicates to client-code programmers that they should not access the identifier directly.

Development environments and other tools rely on the leading-underscore naming convention to discern which attributes of a module are public (i.e., part of the module’s interface) and which are private (i.e., to be used only within the module).

Respect the “leading underscore means private” convention

It’s important to respect the “leading underscore means private” convention, particularly when you write client code that uses modules written by others. Avoid using any attributes in such modules whose names start with _. Future releases of the modules will presumably maintain their public interface, but are quite likely to change private implementation details, and private attributes are meant exactly for such implementation details.

The from Statement

Python’s from statement lets you import specific attributes from a module into the current namespace. from has two syntax variants:

from modname import attrname [as varname][,...]
from modname import *

A from statement specifies a module name, followed by one or more attribute specifiers separated by commas. In the simplest and most common case, an attribute specifier is just an identifier attrname, which is a variable that Python binds to the attribute of the same name in the module named modname. For example:

from mymodule import f

modname can also be a sequence of identifiers separated by dots (.) to name a module within a package, as covered in “Packages”.

When as varname is part of an attribute specifier, Python gets from the module the value of attribute attrname and binds it to variable varname. For example:

from mymodule import f as foo

attrname and varname are always simple identifiers.

You may optionally enclose in parentheses all the attribute specifiers that follow the keyword import in a from statement. This is sometimes useful when you have many attribute specifiers, in order to split the single logical line of the from statement into multiple logical lines more elegantly than by using backslashes (\):

from some_module_with_a_long_name import (
    another_name, and_another as x, one_more, and_yet_another as y)

from ... import *

Code that is directly inside a module body (not in the body of a function or class) may use an asterisk (*) in a from statement:

from mymodule import *

The * requests that “all” attributes of module modname be bound as global variables in the importing module. When module modname has an attribute named __all__, the attribute’s value is the list of the attribute names that are bound by this type of from statement. Otherwise, this type of from statement binds all attributes of modname except those beginning with underscores.
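
For example, a hypothetical module mymodule might control what from mymodule import * binds like this:

# mymodule.py -- hypothetical module
__all__ = ['f', 'K']     # only these names get bound by: from mymodule import *

K = 23
def f(): return K
def g(): return -K       # public, but not listed in __all__, so * skips it
def _helper(): pass      # leading underscore: skipped by * even without __all__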

Beware using from M import * in your code

Since from M import * may bind an arbitrary set of global variables, it can often have unforeseen and undesired side effects, such as hiding built-ins and rebinding variables you still need. Use the * form of from very sparingly, if at all, and only to import modules that are explicitly documented as supporting such usage. Your code is most likely better off never using this form, meant just as a convenience for occasional use in interactive Python sessions.

from versus import

The import statement is often a better choice than the from statement. Think of the from statement, particularly from M import *, as a convenience meant only for occasional use in interactive Python sessions. When you always access module M with the statement import M and always access M’s attributes with explicit syntax M.A, your code is slightly less concise but far clearer and more readable. One good use of from is to import specific modules from a package, as we discuss in “Packages”. But, in most other cases, import is better than from.

Module Loading

Module-loading operations rely on attributes of the built-in sys module (covered in “The sys Module”) and are implemented in the built-in function __import__. Your code could call __import__ directly, but this is strongly discouraged in modern Python; rather, import importlib and call importlib.import_module with the module name string as the argument. import_module returns the module object or, should the import fail, raises ImportError. However, it’s best to have a clear understanding of the semantics of __import__, because import_module and import statements both depend on it.
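
For example, here is a minimal sketch of importing a module whose name is known only at runtime (mymodule is again hypothetical):

import importlib, sys

name = 'mymodule'                        # module name computed at runtime
try:
    mod = importlib.import_module(name)
except ImportError:
    mod = None                           # the import failed
else:
    print(mod is sys.modules[name])      # prints: True -- the cached module object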

To import a module named M, __import__ first checks dictionary sys.modules, using string M as the key. When key M is in the dictionary, __import__ returns the corresponding value as the requested module object. Otherwise, __import__ binds sys.modules[M] to a new empty module object with a __name__ of M, then looks for the right way to initialize (load) the module, as covered in “Searching the Filesystem for a Module”.

Thanks to this mechanism, the relatively slow loading operation takes place only the first time a module is imported in a given run of the program. When a module is imported again, the module is not reloaded, since __import__ rapidly finds and returns the module’s entry in sys.modules. Thus, all imports of a given module after the first one are very fast: they’re just dictionary lookups. (To force a reload, see “Reloading Modules”.)

Built-in Modules

When a module is loaded, __import__ first checks whether the module is built-in. The tuple sys.builtin_module_names names all built-in modules, but rebinding that tuple does not affect module loading. When Python loads a built-in module, as when it loads any other extension, Python calls the module’s initialization function. The search for built-in modules also looks for modules in platform-specific locations, such as the Registry in Windows.

Searching the Filesystem for a Module

If module M is not built-in, __import__ looks for M’s code as a file on the filesystem. __import__ looks at the strings, which are the items of list sys.path, in order. Each item is the path of a directory, or the path of an archive file in the popular ZIP format. sys.path is initialized at program startup, using the environment variable PYTHONPATH (covered in “Environment Variables”), if present. The first item in sys.path is always the directory from which the main program is loaded. An empty string in sys.path indicates the current directory.

Your code can mutate or rebind sys.path, and such changes affect which directories and ZIP archives __import__ searches to load modules. Changing sys.path does not affect modules that are already loaded (and thus already recorded in sys.modules) when you change sys.path.
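
For instance, a minimal sketch (the extra directory is just a placeholder):

import sys

sys.path.append('/opt/mylibs')        # hypothetical extra directory to search
import mymodule                       # this import also searches /opt/mylibs;
                                      # modules already loaded are unaffected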

If a text file with the extension .pth is found in the PYTHONHOME directory at startup, the file’s contents are added to sys.path, one item per line. .pth files can contain blank lines and comment lines starting with the character #; Python ignores any such lines. .pth files can also contain import statements (which Python executes before your program starts to execute), but no other kinds of statements.

When looking for the file for the module M in each directory and ZIP archive along sys.path, Python considers the following extensions in this order:

  1. .pyd and .dll (Windows) or .so (most Unix-like platforms), which indicate Python extension modules. (Some Unix dialects use different extensions; e.g., .sl on HP-UX.) On most platforms, extensions cannot be loaded from a ZIP archive—only source or bytecode-compiled Python modules can.

  2. .py, which indicates Python source modules.

  3. .pyc (or .pyo, in v2, if Python is run with option -O), which indicates bytecode-compiled Python modules.

  4. If a .py file is found, in v3 only, Python also looks for a directory called __pycache__; if such a directory is found, Python looks in it for a bytecode file with the extension .<tag>.pyc, where <tag> is a string specific to the version of Python that is looking for the module.

Lastly, Python looks for a file named __init__.py in a directory named M; that is, it checks whether M is a package, as covered in “Packages”.

Upon finding source file M.py, Python (v3) compiles it to M.<tag>.pyc, unless the bytecode file is already present, is newer than M.py, and was compiled by the same version of Python. If M.py is compiled from a writable directory, Python creates a __pycache__ directory if necessary and saves the bytecode file to the filesystem in that subdirectory so that future runs won’t needlessly recompile. (In v2, M.py compiles to M.pyc or M.pyo, and Python saves the bytecode file in the same directory as M.py.) When the bytecode file is newer than the source file (based on an internal timestamp in the bytecode file, not on trusting the date as recorded in the filesystem), Python does not recompile the module.

Once Python has the bytecode, whether just built by compilation or read from the filesystem, Python executes the module body to initialize the module object. If the module is an extension, Python calls the module’s initialization function.

The Main Program

Execution of a Python application normally starts with a top-level script (also known as the main program), as explained in “The python Program”. The main program executes like any other module being loaded, except that Python keeps the bytecode in memory without saving it to disk. The module name for the main program is always '__main__', both as the __name__ global variable (module attribute) and as the key in sys.modules.

Don’t import the .py file you’re using as the main program

You should not import the same .py file that is the main program. If you do, the module is loaded again, and the body executes once more from the top in a separate module object with a different __name__.

Code in a Python module can test if the module is being used as the main program by checking if global variable __name__ has the value '__main__'. The idiom:

if __name__ == '__main__':

is often used to guard some code so that it executes only when the module is run as the main program. If a module is meant only to be imported, it should normally execute unit tests when it is run as the main program, as covered in “Unit Testing and System Testing”.
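
A minimal sketch of the idiom:

def f():
    return 42

if __name__ == '__main__':
    # runs only when this file is the main program, not when it is imported;
    # a module meant only for import would typically run its unit tests here
    print('self-test passed:', f() == 42)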

Reloading Modules

Python loads a module only the first time you import the module during a program run. When you develop interactively, you need to make sure you explicitly reload your modules each time you edit them (some development environments provide automatic reloading).

To reload a module, in v3, pass the module object (not the module name) as the only argument to the function reload from the importlib module (in v2, call the built-in function reload instead, to the same effect). importlib.reload(M) ensures the reloaded version of M is used by client code that relies on import M and accesses attributes with the syntax M.A. However, importlib.reload(M) has no effect on other existing references bound to previous values of M’s attributes (e.g., with a from statement). In other words, already-bound variables remain bound as they were, unaffected by reload. reload’s inability to rebind such variables is a further incentive to use import rather than from.

reload is not recursive: when you reload module M, this does not imply that other modules imported by M get reloaded in turn. You must arrange to reload, by explicit calls to the reload function, each and every module you have modified.
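
For instance, a minimal interactive sketch (mymodule, as usual, is hypothetical):

import importlib
import mymodule                 # hypothetical module under development

# ...edit and save mymodule.py, then:
importlib.reload(mymodule)      # re-executes the module body in the same module object
# in v2, the built-in reload(mymodule) has the same effect
# note: names bound earlier with from mymodule import f still refer to the old f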

Circular Imports

Python lets you specify circular imports. For example, you can write a module a.py that contains import b, while module b.py contains import a.

If you decide to use a circular import for some reason, you need to understand how circular imports work in order to avoid errors in your code.

Avoid circular imports

In practice, you are nearly always better off avoiding circular imports, since circular dependencies are fragile and hard to manage.

Say that the main script executes import a. As discussed earlier, this import statement creates a new empty module object as sys.modules['a'] and then the body of module a starts executing. When a executes import b, this creates a new empty module object as sys.modules['b'], and then the body of module b starts executing. The execution of a’s module body cannot proceed until b’s module body finishes.

Now, when b executes import a, the import statement finds sys.modules['a'] already bound, and therefore binds global variable a in module b to the module object for module a. Since the execution of a’s module body is currently blocked, module a is usually only partly populated at this time. Should the code in b’s module body immediately try to access some attribute of module a that is not yet bound, an error results.

If you insist on keeping a circular import, you must carefully manage the order in which each module binds its own globals, imports other modules, and accesses globals of other modules. You can have greater control on the sequence in which things happen by grouping your statements into functions, and calling those functions in a controlled order, rather than just relying on sequential execution of top-level statements in module bodies. Usually, removing circular dependencies is easier than ensuring bomb-proof ordering with circular dependencies.
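
One common way to manage the ordering problem is to defer an import into the function that needs it, so the import runs at call time, when both modules are fully populated. A minimal sketch, with hypothetical modules a and b:

# b.py -- avoids importing a at the top level (a and b are hypothetical modules)
def use_a():
    import a                     # deferred: executes only when use_a is called,
    return a.some_attribute      # normally long after a's module body has finished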

sys.modules Entries

__import__ never binds anything other than a module object as a value in sys.modules. However, if __import__ finds an entry already in sys.modules, it returns that value, whatever type it may be. The import and from statements internally rely on __import__, so they too can end up using objects that are not modules. This lets you set class instances as entries in sys.modules, in order to exploit features such as __getattr__ and __setattr__ special methods, covered in “General-Purpose Special Methods”. This rarely needed advanced technique lets you import module-like objects whose attributes you can compute on the fly. Here’s a toy-like example:

class TT(object):
    def __getattr__(self, name): return 23
import sys
sys.modules[__name__] = TT()

The first import of this code as a module overwrites the module’s sys.modules entry with an instance of class TT. Any attribute name you try to get from it then appears to have the integer value 23.

Custom Importers

An advanced, rarely needed functionality that Python offers is the ability to change the semantics of some or all import and from statements.

Rebinding __import__

You can rebind the __import__ attribute of the module builtins (in v2, __builtin__) to your own custom importer function—for example, one using the generic built-in-wrapping technique shown in “Python built-ins”. Such a rebinding affects all import and from statements that execute after the rebinding and thus can have undesired global impact. A custom importer built by rebinding __import__ must implement the same interface and semantics as the built-in __import__, and, in particular, it is responsible for supporting the correct use of sys.modules.

Beware rebinding builtin __import__

While rebinding __import__ may initially look like an attractive approach, in most cases where custom importers are necessary, you’re better off implementing them via import hooks (described next).

Import hooks

Python offers rich support for selectively changing the details of imports. Custom importers are an advanced and rarely needed technique, yet some applications may need them for purposes such as importing code from archives other than ZIP files, databases, network servers, and so on.

The most suitable approach for such highly advanced needs is to record importer factory callables as items in the attributes meta_path and/or path_hooks of the module sys, as detailed in PEP 302 (and, for v3, PEP 451). This is how Python hooks up the standard library module zipimport to allow seamless importing of modules from ZIP files, as previously mentioned. A full study of the details of PEP 302 and 451 is indispensable for any substantial use of sys.path_hooks and friends, but here’s a toy-level example to help understand the possibilities, should you ever need them.

Suppose that, while developing a first outline of some program, you want to be able to use import statements for modules that you haven’t written yet, getting just messages (and empty modules) as a consequence. You can obtain such functionality (leaving aside the complexities connected with packages, and dealing with simple modules only) by coding a custom importer module as follows:

import sys, types

class ImporterAndLoader(object):
    '''importer and loader are often a single class'''
    fake_path = '!dummy!'
    def __init__(self, path):
        # only handle our own fake-path marker
        if path != self.fake_path: raise ImportError
    def find_module(self, fullname):
        # don't even try to handle any qualified module name
        if '.' in fullname: return None
        return self
    def create_module(self, spec):
        # run in v3 only: create module "the default way"
        return None
    def exec_module(self, mod):
        # run in v3 only: populate the already-initialized module
        # just print a message in this toy example
        print('NOTE: module {!r} not yet written'.format(mod))
    def load_module(self, fullname):
        # run in v2 only: make and populate the module
        if fullname in sys.modules: return sys.modules[fullname]
        # just print a message in this toy example
        print('NOTE: module {!r} not written yet'.format(fullname))
        # make new empty module, put it in sys.modules
        mod = sys.modules[fullname] = types.ModuleType(fullname)
        # minimally initialize new module and return it
        mod.__file__ = 'dummy_{}'.format(fullname)
        mod.__loader__ = self
        return mod
# add the class to the hook and its fake-path marker to the path
sys.path_hooks.append(ImporterAndLoader)
sys.path.append(ImporterAndLoader.fake_path)

if __name__ == '__main__':      # self-test when run as main script
    import missing_module       # importing a simple missing module
    print(missing_module)       # ...should succeed
    print(sys.modules.get('missing_module'))  # ...should also succeed

In v2, we have to implement the method load_module, which must also perform some boilerplate tasks such as dealing with sys.modules and setting dunder attributes such as __file__ on the module object. In v3, the system does such boilerplate for us, so we just write trivial versions of create_module (which in this case just returns None, asking the system to create the module object in the default way) and exec_module (which receives the module object already initialized with dunder attributes, and would normally populate it appropriately).

In v3, we could use the powerful new module spec concept. However, that requires the standard library module importlib, which is mostly missing in v2; moreover, for this toy example we don’t need such power. Therefore, we choose instead to implement the method find_module, which we do need anyway in v2—and, although now deprecated, it also works in v3 for backward compatibility.

Packages

A package is a module containing other modules. Some or all of the modules in a package may be subpackages, resulting in a hierarchical tree-like structure. A package named P resides in a subdirectory, also called P, of some directory in sys.path. Packages can also live in ZIP files; in this section, we explain the case in which the package lives on the filesystem, since the case in which a package is in a ZIP file is similar, relying on the hierarchical filesystem structure within the ZIP file.

The module body of P is in the file P/__init__.py. This file must exist (except, in v3, for namespace packages, covered in “Namespace Packages (v3 Only)”), even if it’s empty (representing an empty module body), in order to tell Python that directory P is indeed a package. The module body of a package is loaded when you first import the package (or any of the package’s modules) and behaves in all respects like any other Python module. The other .py files in directory P are the modules of package P. Subdirectories of P containing __init__.py files are sub-packages of P. Nesting can proceed to any depth.

You can import a module named M in package P as P.M. More dots let you navigate a hierarchical package structure. (A package’s module body is always loaded before any module in the package is loaded.) If you use the syntax import P.M, the variable P is bound to the module object of package P, and the attribute M of object P is bound to the module P.M. If you use the syntax import P.M as V, the variable V is bound directly to the module P.M.

Using from P import M to import a specific module M from package P is a perfectly acceptable, indeed highly recommended practice: the from statement is specifically okay in this case. from P import M as V is also just fine, and perfectly equivalent to import P.M as V.

In v2, by default, when module M in package P executes import X, Python searches for X in M, before searching in sys.path. However, this does not apply in v3: indeed, to avoid the ambiguity of these semantics, we strongly recommend that, in v2, you use from __future__ import absolute_import to make v2 behave like v3 in this respect. Module M in package P can then explicitly import its “sibling” module X (also in package P) with from . import X.

Sharing objects among modules in a package

The simplest, cleanest way to share objects (e.g., functions or constants) among modules in a package P is to group the shared objects in a module conventionally named P/common.py. That way, you can from . import common in every module in the package that needs to access some of the common objects, and then refer to the objects as common.f, common.K, and so on.
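
For instance, a minimal sketch with hypothetical contents:

# P/common.py -- shared objects for package P
K = 23
def f(): return K

# P/mod1.py -- any other module of package P
from . import common
print(common.f(), common.K)      # prints: 23 23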

Special Attributes of Package Objects

A package P’s __file__ attribute is the string that is the path of P’s module body—that is, the path of the file P/__init__.py. P’s __package__ attribute is the name of P’s package.

A package P’s module body—that is, the Python source that is in the file P/__init__.py—can optionally set a global variable named __all__ (just like any other module can) to control what happens if some other Python code executes the statement from P import *. In particular, if __all__ is not set, from P import * does not import P’s modules, but only names that are set in P’s module body and lack a leading _. In any case, this is not recommended usage.

A package P’s __path__ attribute is the list of strings that are the paths to the directories from which P’s modules and subpackages are loaded. Initially, Python sets __path__ to a list with a single element: the path of the directory containing the file __init__.py that is the module body of the package. Your code can modify this list to affect future searches for modules and subpackages of this package. This advanced technique is rarely necessary, but can be useful when you want to place a package’s modules in several disjoint directories. In v3 only, a namespace package, as covered next, is the usual way to accomplish this goal.

Namespace Packages (v3 Only)

In v3 only, on import foo, when one or more directories that are immediate children of sys.path members are named foo, and none of them contains a file named __init__.py, Python deduces that foo is a namespace package. As a result, Python creates (and assigns to sys.modules['foo']) a package object foo without a __file__ attribute; foo.__path__ is the list of all the various directories that make up the package (and, like for a normal package, your code may optionally choose to further alter it). This advanced approach is rarely needed.

Absolute Versus Relative Imports

As mentioned in “Packages”, an import statement normally expects to find its target somewhere on sys.path, a behavior known as an absolute import (to ensure this reliable behavior in v2, start your module with from __future__ import absolute_import). Alternatively, you can explicitly use a relative import, meaning an import of an object from within the current package. Relative imports use module or package names beginning with one or more dots, and are only available in the from statement. from . import X looks for the module or object named X in the current package; from .X import y looks in module or subpackage X within the current package for the module or object named y. If your package has subpackages, their code can access higher-up objects in the package by using multiple dots at the start of the module or subpackage name you place between from and import. Each additional dot ascends the directory hierarchy one level. Getting too fancy with this feature can easily damage your code’s clarity, so use it with care, and only when necessary.
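
For instance, given a hypothetical package layout with pkg/util.py and pkg/sub/mod.py (each directory with its own __init__.py), code in pkg/sub/mod.py can use:

from .. import util              # one extra dot: up one level, to package pkg
from ..util import helper        # a specific (hypothetical) object from pkg/util.py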

Distribution Utilities (distutils) and setuptools

Python modules, extensions, and applications can be packaged and distributed in several forms:

Compressed archive files

Generally .zip or .tar.gz (AKA .tgz) files—both forms are portable, and many other forms of compressed archives of trees of files and directories exist

Self-unpacking or self-installing executables

Normally .exe for Windows

Self-contained, ready-to-run executables that require no installation

For example, .exe for Windows, ZIP archives with a short script prefix on Unix, .app for the Mac, and so on

Platform-specific installers

For example, .msi on Windows, .rpm and .srpm on many Linux distributions, .deb on Debian GNU/Linux and Ubuntu, .pkg on macOS

Python Wheels (and Eggs)

Popular third-party extensions, covered in “Python Wheels (and Eggs)”

When you distribute a package as a self-installing executable or platform-specific installer, a user installs the package simply by running the installer. How to run such an installer program depends on the platform, but it no longer matters which language the program was written in. We cover building self-contained, runnable executables for various platforms in Chapter 25.

When you distribute a package as an archive file or as an executable that unpacks but does not install itself, it does matter that the package was coded in Python. In this case, the user must first unpack the archive file into some appropriate directory, say C:\Temp\MyPack on a Windows machine or ~/MyPack on a Unix-like machine. Among the extracted files there should be a script, conventionally named setup.py, which uses the Python facility known as the distribution utilities (the standard library package distutils) or the more popular and powerful third-party package setuptools. The distributed package is then almost as easy to install as a self-installing executable. The user opens a command prompt window and changes to the directory into which the archive is unpacked. Then the user runs, for example:

C:\Temp\MyPack> python setup.py install

(pip is the preferred way to install packages nowadays, and is briefly discussed in “Python Environments”.) The setup.py script, run with this install command, installs the package as a part of the user’s Python installation, according to the options specified by the package’s author in the setup script. Of course, the user needs appropriate permissions to write into the directories of the Python installations, so permission-raising commands such as sudo may also be needed, or better yet, you can install into a virtual environment, covered in “Python Environments”. distutils and setuptools, by default, print information when the user runs setup.py. Option --quiet, right before the install command, hides most details (the user still sees error messages, if any). The following command gives detailed help on distutils or setuptools, depending on which toolset the package author used in their setup.py:

C:\Temp\MyPack> python setup.py --help
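
For context, here is a minimal setup.py sketch using setuptools (the names and the dependency are, of course, hypothetical):

# setup.py -- minimal sketch
from setuptools import setup

setup(
    name='mypack',                   # distribution name
    version='0.1',
    py_modules=['mymodule'],         # or, for a package: packages=['mypack']
    install_requires=['requests'],   # third-party dependencies, if any
)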

Recent versions of both v2 and v3 come with the excellent installer pip (a recursive acronym for “pip installs packages”), copiously documented online, yet very simple to use in most cases. pip install package finds the online version of package (usually on the huge PyPI repository, hosting almost 100,000 packages at the time of writing), downloads it, and installs it for you (in a virtual environment, if one is active—see “Python Environments”). This book’s authors have been using that simple, powerful approach for well over 90% of their installs for quite a while now.

Even if you have downloaded the package locally (say to /tmp/mypack), for whatever reason (maybe it’s not on PyPI, or you’re trying out an experimental version not yet there), pip can still install it for you: just run pip install --no-index --find-links=/tmp/mypack and pip does the rest.

Python Wheels (and Eggs)

Python wheels (like their precursor, eggs, still supported but not recommended for future development) are an archive format including structured metadata as well as Python code. Both formats, especially wheels, offer excellent ways to package and distribute your Python packages, and setuptools (with the wheel extension, easily installed with, of course, pip install wheel) works seamlessly with them. Read all about them online and in Chapter 25.

Python Environments

A typical Python programmer works on several projects concurrently, each with its own list of dependencies (typically, third-party libraries and data files). When the dependencies for all projects are installed into the same Python interpreter, it is very difficult to determine which projects use which dependencies, and impossible to handle projects with conflicting versions of certain dependencies.

Early Python interpreters were built on the assumption that each computer system would have “a Python interpreter” installed on it, which would be used to process all Python that ran on that system. Operating system distributions started to include Python in their base installation, but, because Python was being actively developed, users often complained that they would like to use a more up-to-date version of the language than their operating system provided.

Techniques arose to let multiple versions of the language be installed on a system, but installation of third-party software remained nonstandard and intrusive. This problem was eased by the introduction of the site-packages directory as the repository for modules added to a Python installation, but it was still not possible to maintain projects with conflicting requirements using the same interpreter.

Programmers accustomed to command-line operations are familiar with the concept of a shell environment. A shell program running in a process has a current directory, variables that can be set by shell commands (very similar to a Python namespace), and various other pieces of process-specific state data. Python programs have access to the shell environment through os.environ.

Various aspects of the shell environment affect Python’s operation, as mentioned in “Environment Variables”. For example, the interpreter executed in response to python and other commands is determined by the PATH environment variable. You can think of those aspects of your shell environment that affect Python’s operation as your Python environment. By modifying it you can determine which Python interpreter runs in response to the python command, which packages and modules are available under certain names, and so on.

Leave the system’s Python to the system

We recommend taking control of your Python environment. In particular, do not build applications on top of a system’s distributed Python. Instead, install another Python distribution independently and adjust your shell environment so that the python command runs your locally installed Python rather than the system’s Python.

Enter the Virtual Environment

The introduction of the pip utility created a simple way to install (and, for the first time, to uninstall) packages and modules in a Python environment. Modifying the system Python’s site-packages still requires administrative privileges, and hence so does pip (although it can optionally install somewhere other than site-packages). Installed modules are still visible to all programs.

The missing piece is the ability to make controlled changes to the Python environment, to direct the use of a specific interpreter and a specific set of Python libraries. That is just what virtual environments (virtualenvs) give you. Creating a virtualenv based on a specific Python interpreter copies or links to components from that interpreter’s installation. Critically, though, each one has its own site-packages directory, into which you can install the Python resources of your choice.

Creating a virtualenv is much simpler than installing Python, and requires far less system resources (a typical newly created virtualenv takes less than 20 MB). You can easily create and activate them on demand, and deactivate and destroy them just as easily. You can activate and deactivate a virtualenv as many times as you like during its lifetime, and if necessary use pip to update the installed resources. When you are done with it, removing its directory tree reclaims all storage occupied by the virtualenv. A virtualenv’s lifetime can be from minutes to months.

What Is a Virtual Environment?

A virtualenv is essentially a self-contained subset of your Python environment that you can switch in or out on demand. For a Python X.Y interpreter it includes, among other things, a bin directory containing a Python X.Y interpreter and a lib/pythonX.Y/site-packages directory containing preinstalled versions of easy-install, pip, pkg_resources, and setuptools. Maintaining separate copies of these important distribution-related resources lets you update them as necessary rather than forcing reliance on the base Python distribution.

A virtualenv has its own copies of (on Windows), or symbolic links to (on other platforms), Python distribution files. It adjusts the values of sys.prefix and sys.exec_prefix, from which the interpreter and various installation utilities determine the location of some libraries. This means that pip can install dependencies in isolation from other environments, in the virtualenv’s site-packages directory. In effect the virtualenv redefines which interpreter runs when you run the python command and which libraries are available to it, but leaves most aspects of your Python environment (such as the PYTHONPATH and PYTHONHOME variables) alone. Since its changes affect your shell environment they also affect any subshells in which you run commands.
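
For example, a quick sketch of how your code can check whether it is running inside a virtualenv, by comparing prefixes:

import sys

# v3's venv records the base installation in sys.base_prefix;
# the v2 third-party virtualenv tool sets sys.real_prefix instead
in_venv = (getattr(sys, 'base_prefix', sys.prefix) != sys.prefix
           or hasattr(sys, 'real_prefix'))
print('virtualenv active:', in_venv)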

With separate virtualenvs you can, for example, test two different versions of the same library with a project, or test your project with multiple versions of Python (very useful to check v2/v3 compatibility of your code). You can also add dependencies to your Python projects without needing any special privileges, since you normally create your virtualenvs somewhere you have write permission.

For a long time the only way to create virtual environments was the third-party virtualenv package, with or without help from virtualenvwrapper, both of which are still available for v2. You can read more about these tools in the Python Packaging User Guide. They also work with v3, but the 3.3 release added the venv module, making virtual environments a native feature of Python for the first time. New in 3.6: use python -m venv envpath in preference to the pyvenv command, which is now deprecated.

Creating and Deleting Virtual Environments

In v3 the command python -m venv envpath creates a virtual environment (in the envpath directory, which it also creates if necessary) based on the Python interpreter used to run the command. You can give multiple directory arguments to create with a single command several virtual environments, into which you then install different sets of dependencies. venv can take a number of options, as shown in Table 6-1.

Table 6-1. venv options

Option                   Purpose
--clear                  Removes any existing directory content before installing the virtual environment
--copies                 Installs files by copying on the Unix-like platforms where using symbolic links is the default
-h, --help               Prints out a command-line summary and a list of available options
--system-site-packages   Adds the standard system site-packages directory to the environment’s search path, making modules already installed in the base Python available inside the environment
--symlinks               Installs files by using symbolic links on platforms where copying is the system default
--upgrade                Installs the running Python in the virtual environment, replacing whichever version the environment was created with
--without-pip            Inhibits the usual behavior of calling ensurepip to bootstrap the pip installer utility into the environment

v2 users must use the python -m virtualenv command, which does not accept multiple directory arguments.

The following terminal session shows the creation of a virtualenv and the structure of the directory tree created. The listing of the bin subdirectory shows that this particular user by default uses a v3 interpreter installed in /usr/local/bin.

machine:~ user$ python3 -m venv /tmp/tempenv
machine:~ user$ tree -dL 4 /tmp/tempenv
/tmp/tempenv
├── bin
├── include
└── lib
    └── python3.5
        └── site-packages
            ├── __pycache__
            ├── pip
            ├── pip-8.1.1.dist-info
            ├── pkg_resources
            ├── setuptools
            └── setuptools-20.10.1.dist-info

11 directories
machine:~ user$ ls -l /tmp/tempenv/bin/
total 80
-rw-r--r-- 1 sh wheel 2134 Oct 24 15:26 activate
-rw-r--r-- 1 sh wheel 1250 Oct 24 15:26 activate.csh
-rw-r--r-- 1 sh wheel 2388 Oct 24 15:26 activate.fish
-rwxr-xr-x 1 sh wheel  249 Oct 24 15:26 easy_install
-rwxr-xr-x 1 sh wheel  249 Oct 24 15:26 easy_install-3.5
-rwxr-xr-x 1 sh wheel  221 Oct 24 15:26 pip
-rwxr-xr-x 1 sh wheel  221 Oct 24 15:26 pip3
-rwxr-xr-x 1 sh wheel  221 Oct 24 15:26 pip3.5
lrwxr-xr-x 1 sh wheel    7 Oct 24 15:26 python -> python3
lrwxr-xr-x 1 sh wheel   22 Oct 24 15:26 python3 -> /usr/local/bin/python3

Deletion of the virtualenv is as simple as removing the directory in which it resides (and all subdirectories and files in the tree: rm -rf envpath in Unix-like systems). Ease of removal is a helpful aspect of using virtualenvs.

The venv module includes features to help the programmed creation of tailored environments (e.g., by preinstalling certain modules in the environment or performing other post-creation steps). It is comprehensively documented online, and we therefore do not cover the API further in this book.
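
For a flavor of that API, here is a minimal v3 sketch (the target directory and the post-setup step are just placeholders):

import venv

class MyEnvBuilder(venv.EnvBuilder):
    def post_setup(self, context):
        # called once the environment exists; context.env_dir is its path
        print('created virtualenv in', context.env_dir)

builder = MyEnvBuilder(with_pip=True)    # also bootstrap pip via ensurepip
builder.create('/tmp/tempenv2')          # hypothetical target directory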

Working with Virtual Environments

To use a virtualenv you activate it from your normal shell environment. Only one virtualenv can be active at a time—activations don’t “stack” like function calls. Activation conditions your Python environment to use the virtualenv’s Python interpreter and site-packages (along with the interpreter’s full standard library). When you want to stop using those dependencies, deactivate the virtualenv and your standard Python environment is once again available. The virtualenv directory tree continues to exist until deleted, so you can activate and deactivate it at will.

Activating a virtualenv in Unix-based environments requires use of the source shell command so that the commands in the activation script make changes to the current shell environment. Simply running the script would mean its commands were executed in a subshell, and the changes would be lost when the subshell terminated. For bash and similar shells, you activate an environment located at path envpath with the command:

source envpath/bin/activate

Users of other shells are accommodated with activate.csh and activate.fish scripts located in the same directory. On Windows systems, use activate.bat:

envpath\Scripts\activate.bat

Activation does several things, most importantly:

  • Adds the virtualenv’s bin directory at the beginning of the shell’s PATH environment variable, so its commands get run in preference to anything of the same name already on the PATH

  • Defines a deactivate command to remove all effects of activation and return the Python environment to its former state

  • Modifies the shell prompt to include the virtualenv’s name at the start

  • Defines a VIRTUAL_ENV environment variable as the path to the virtualenv’s root directory (scripting can use this to introspect the virtualenv)

As a result of these actions, once a virtualenv is activated the python command runs the interpreter associated with that virtualenv. The interpreter sees the libraries (modules and packages) that have been installed in that environment, and pip—now the one from the virtualenv, since installing the module also installed the command in the virtualenv’s bin directory—by default installs new packages and modules in the environment’s site-packages directory.

Those new to virtualenvs should understand that a virtualenv is not tied to any project directory. It’s perfectly possible to work on several projects, each with its own source tree, using the same virtualenv. Activate it, then move around your filestore as necessary to accomplish your programming tasks, with the same libraries available (because the virtualenv determines the Python environment).

When you want to disable the virtualenv and stop using that set of resources, simply issue the command deactivate.

This undoes the changes made on activation, removing the virtualenv’s bin directory from your PATH, so the python command once again runs your usual interpreter. As long as you don’t delete it, the virtualenv remains available for future use by repeating the invocation to activate it.

Managing Dependency Requirements

Since virtualenvs were designed to complement installation with pip, it should come as no surprise that pip is the preferred way to maintain dependencies in a virtualenv. Because pip is already extensively documented, we mention only enough here to demonstrate its advantages in virtual environments. Having created a virtualenv, activated it, and installed dependencies, you can use the pip freeze command to learn the exact versions of those dependencies:

(tempenv) machine:~ user$ pip freeze
appnope==0.1.0
decorator==4.0.10
ipython==5.1.0
ipython-genutils==0.1.0
pexpect==4.2.1
pickleshare==0.7.4
prompt-toolkit==1.0.8
ptyprocess==0.5.1
Pygments==2.1.3
requests==2.11.1
simplegeneric==0.8.1
six==1.10.0
traitlets==4.3.1
wcwidth==0.1.7

If you redirect the output of this command to a file called filename, you can re-create the same set of dependencies in a different virtualenv with the command pip install -r filename.

To distribute code for use by others, Python developers conventionally include a requirements.txt file listing the necessary dependencies. When you are installing software from the Python Package Index, pip installs the packages you request along with any indicated dependencies. When you’re developing software it’s convenient to have a requirements file, as you can use it to add the necessary dependencies to the active virtualenv (unless they are already installed) with a simple pip install -r requirements.txt.

To maintain the same set of dependencies in several virtualenvs, use the same requirements file to add dependencies to each one. This is a convenient way to develop projects to run on multiple Python versions: create virtualenvs based on each of your required versions, then install from the same requirements file in each. While the preceding example uses exactly versioned dependency specifications as produced by pip freeze, in practice you can specify dependencies and constrain version requirements in quite complex ways.
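
For example, a hypothetical requirements.txt need not pin exact versions the way pip freeze output does:

requests>=2.11,<3.0        # any 2.x release from 2.11 onward
six                        # any available version
traitlets==4.3.1           # exactly this version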

Best Practices with virtualenvs

There is remarkably little advice on how best to manage your work with virtualenvs, though there are several sound tutorials: any good search engine gives you access to the most current ones. We can, however, offer a modest amount of advice that we hope will help you to get the most out of them.

When you are working with the same dependencies in multiple Python versions, it is useful to indicate the version in the environment name and use a common prefix. So for project mutex you might maintain environments called mutex_35 and mutex_27 for v3 and v2 development. When it’s obvious which Python is involved (and remember you see the environment name in your shell prompt), there’s less chance of testing with the wrong version. You maintain dependencies using common requirements to control resource installation in both.

Keep the requirements file(s) under source control, not the whole environment. Given the requirements file it’s easy to re-create a virtualenv, which depends only on the Python release and the requirements. You distribute your project, and let your consumers decide which version(s) of Python to run it on and create the appropriate (hopefully virtual) environment(s).

Keep your virtualenvs outside your project directories. This avoids the need to explicitly force source code control systems to ignore them. It really doesn’t matter where you store them—the virtualenvwrapper system keeps them all in a central location.

Your Python environment is independent of your process’s location in the filesystem. You can activate a virtual environment and then switch branches and move around a change-controlled source tree to use it wherever convenient.

To investigate a new module or package, create and activate a new virtualenv and then pip install the resources that interest you. You can play with this new environment to your heart’s content, confident in the knowledge that you won’t be installing rogue dependencies into other projects.

You may find that experiments in a virtualenv require installation of resources that aren’t currently project requirements. Rather than pollute your development environment, fork it: create a new virtualenv from the same requirements plus the testing functionality. Later, to make these changes permanent, use change control to merge your source and requirements changes back in from the fork.

If you are so inclined, you can create virtual environments based on debug builds of Python, giving you access to a wealth of instrumentation information about the performance of your Python code (and, of course, the interpreter).

Developing your virtual environment itself requires change control, and their ease of creation helps here too. Suppose that you recently released version 4.3 of a module, and you want to test your code with new versions of two of its dependencies. You could, with sufficient skill, persuade pip to replace the existing copies of dependencies in your existing virtualenv.

It’s much easier, though, to branch your project, update the requirements, and create an entirely new virtual environment based on the updated requirements. You still have the original virtualenv intact, and you can switch between virtualenvs to investigate specific aspects of any migration issues that might arise. Once you have adjusted your code so that all tests pass with the updated dependencies, you check in your code and requirement changes, and merge into version 4.4 to complete the update, advising your colleagues that your code is now ready for the updated versions of the dependencies.

Virtual environments won’t solve all of a Python programmer’s problems. Tools can always be made more sophisticated, or more general. But, by golly, they work, and we should take all the advantage of them that we can.
