So far in this book, we’ve used Python in the system programming, GUI development, and Internet scripting domains—three of Python’s most common applications, and representative of its use as an application programming language at large. In the next four chapters, we’re going to take a quick look at other major Python programming topics: persistent data, data structure techniques, text and language processing, and Python/C integration.
These four topics are not really application areas themselves, but they are techniques that span domains. The database topics in this chapter, for instance, can be applied on the Web, in desktop GUI applications, and so on. Text processing is a similarly general tool. Moreover, none of these final four topics is covered exhaustively (each could easily fill a book alone), but we’ll sample Python in action in these domains and highlight their core concepts and tools. If any of these chapters spark your interest, additional resources are readily available in the Python world.
In this chapter, our focus is on persistent data—the kind that outlives a program that creates it. That’s not true by default for objects a script constructs, of course; things like lists, dictionaries, and even class instance objects live in your computer’s memory and are lost as soon as the script ends. To make data live longer, we need to do something special. In Python programming, there are today at least six traditional ways to save information in between program executions:
Flat files: text and bytes stored directly on your computer
DBM keyed files: keyed access to strings stored in dictionary-like files
Pickled objects: serialized Python objects saved to files and streams
Shelve files: pickled Python objects stored in DBM keyed files
Object-oriented databases: persistent Python objects stored in persistent dictionaries (ZODB, Durus)
SQL relational databases: table-based storage that supports SQL queries (SQLite, MySQL, PostgreSQL, etc.)
Object relational mappers: mediators that map Python classes to relational tables (SQLObject, SQLAlchemy)
In some sense, Python’s interfaces to network-based object transmission protocols such as SOAP, XML-RPC, and CORBA also offer persistence options, but they are beyond the scope of this chapter. Here, our interest is in techniques that allow a program to store its data directly and, usually, on the local machine. Although some database servers may operate on a physically remote machine on a network, this is largely transparent to most of the techniques we’ll study here.
We studied Python’s simple (or “flat”) file interfaces in earnest in Chapter 4, and we have been using them ever since. Python provides standard access to both the stdio filesystem (through the built-in open function), as well as lower-level descriptor-based files (with the built-in os module). For simple data storage tasks, these are all that many scripts need. To save for use in a future program run, simply write data out to a newly opened file on your computer in text or binary mode, and read it back from that file later. As we’ve seen, for more advanced tasks, Python also supports other file-like interfaces such as pipes, fifos, and sockets.
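For instance, a minimal flat-file round trip of this sort might look like the following sketch; the file name here is arbitrary:

data = 'spam,eggs,42'
f = open('datafile.txt', 'w')        # save text between program runs
f.write(data)
f.close()

f = open('datafile.txt')             # reload in a later run or session
fields = f.read().split(',')         # parse back into parts manually
f.close()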
Since we’ve already explored flat files, I won’t say more about them here. The rest of this chapter introduces the remaining topics on the preceding list. At the end, we’ll also meet a GUI program for browsing the contents of things such as shelves and DBM files. Before that, though, we need to learn what manner of beast these are.
Fourth edition coverage note: The prior edition of this book used the mysql-python interface to the MySQL relational database system, as well as the ZODB object database system. As I update this chapter in June 2010, neither of these is yet available for Python 3.X, the version of Python used in this edition. Because of that, most ZODB information has been trimmed, and the SQL database examples here were changed to use the SQLite in-process database system that ships with Python 3.X as part of its standard library. The prior edition’s ZODB and MySQL examples and overviews are still available in the examples package, as described later. Because Python’s SQL database API is portable, though, the SQLite code here should work largely unchanged on most other systems.
Flat files are handy for simple persistence tasks, but they are generally geared toward a sequential processing mode. Although it is possible to jump around to arbitrary locations with seek calls, flat files don’t provide much structure to data beyond the notion of bytes and text lines.
DBM files, a standard tool in the Python library for database management, improve on that by providing key-based access to stored text strings. They implement a random-access, single-key view on stored data. For instance, information related to objects can be stored in a DBM file using a unique key per object and later can be fetched back directly with the same key. DBM files are implemented by a variety of underlying modules (including one coded in Python), but if you have Python, you have a DBM.
Although DBM filesystems have to do a bit of work to map chunks of stored data to keys for fast retrieval (technically, they generally use a technique called hashing to store data in files), your scripts don’t need to care about the action going on behind the scenes. In fact, DBM is one of the easiest ways to save information in Python—DBM files behave so much like in-memory dictionaries that you may forget you’re actually dealing with a file at all. For instance, given a DBM file object:
Indexing by key fetches data from the file.
Assigning to an index stores data in the file.
DBM file objects also support common dictionary methods such as keys-list fetches and tests and key deletions. The DBM library itself is hidden behind this simple model. Since it is so simple, let’s jump right into an interactive example that creates a DBM file and shows how the interface works:
C:\...\PP4E\Dbase> python
>>> import dbm                                  # get interface: bsddb, gnu, ndbm, dumb
>>> file = dbm.open('movie', 'c')               # make a DBM file called 'movie'
>>> file['Batman'] = 'Pow!'                     # store a string under key 'Batman'
>>> file.keys()                                 # get the file's key directory
[b'Batman']
>>> file['Batman']                              # fetch value for key 'Batman'
b'Pow!'
>>> who  = ['Robin', 'Cat-woman', 'Joker']
>>> what = ['Bang!', 'Splat!', 'Wham!']
>>> for i in range(len(who)):                   # add 3 more "records"
...     file[who[i]] = what[i]
...
>>> file.keys()
[b'Cat-woman', b'Batman', b'Joker', b'Robin']
>>> len(file), 'Robin' in file, file['Joker']
(4, True, b'Wham!')
>>> file.close()                                # close sometimes required
Internally, importing the dbm standard library module automatically loads whatever DBM interface is available in your Python interpreter (attempting alternatives in a fixed order), and opening the new DBM file creates one or more external files with names that start with the string 'movie' (more on the details in a moment). But after the import and open, a DBM file is virtually indistinguishable from a dictionary. In effect, the object called file here can be thought of as a dictionary mapped to an external file called movie; the only obvious differences are that keys must be strings (not arbitrary immutables), and we need to remember to open to access and close after changes.

Unlike normal dictionaries, though, the contents of file are retained between Python program runs. If we come back later and restart Python, our dictionary is still available. Again, DBM files are like dictionaries that must be opened:
C:\...\PP4E\Dbase> python
>>> import dbm
>>> file = dbm.open('movie', 'c')               # open existing DBM file
>>> file['Batman']
b'Pow!'
>>> file.keys()                                 # keys gives an index list
[b'Cat-woman', b'Batman', b'Joker', b'Robin']
>>> for key in file.keys(): print(key, file[key])
...
b'Cat-woman' b'Splat!'
b'Batman' b'Pow!'
b'Joker' b'Wham!'
b'Robin' b'Bang!'
Notice how DBM files return a real list for the keys call; not shown here, their values method instead returns an iterable view like dictionaries. Further, DBM files always store both keys and values as bytes objects; interpretation as arbitrary types of Unicode text is left to the client application. We can use either bytes or str strings in our code when accessing or storing keys and values—using bytes allows your keys and values to retain arbitrary Unicode encodings, but str objects in our code will be encoded to bytes internally using the UTF-8 Unicode encoding by Python’s DBM implementation.

Still, we can always decode to Unicode str strings to display in a more friendly fashion if desired, and DBM files have a keys iterator just like dictionaries. Moreover, assigning and deleting keys updates the DBM file, and we should close after making changes (this ensures that changes are flushed to disk):
>>> for key in file: print(key.decode(), file[key].decode())
...
Cat-woman Splat!
Batman Pow!
Joker Wham!
Robin Bang!
>>> file['Batman'] = 'Ka-Boom!'                 # change Batman slot
>>> del file['Robin']                           # delete the Robin entry
>>> file.close()                                # close it after changes
Apart from having to import the interface and open and close the DBM file, Python programs don’t have to know anything about DBM itself. DBM modules achieve this integration by overloading the indexing operations and routing them to more primitive library tools. But you’d never know that from looking at this Python code—DBM files look like normal Python dictionaries, stored on external files. Changes made to them are retained indefinitely:
C:\...\PP4E\Dbase> python
>>> import dbm                                  # open DBM file again
>>> file = dbm.open('movie', 'c')
>>> for key in file: print(key.decode(), file[key].decode())
...
Cat-woman Splat!
Batman Ka-Boom!
Joker Wham!
As you can see, this is about as simple as it can be. Table 17-1 lists the most commonly used DBM file operations. Once such a file is opened, it is processed just as though it were an in-memory Python dictionary. Items are fetched by indexing the file object by key and are stored by assigning to a key.
| Python code | Action | Description |
| import dbm | Import | Get DBM implementation |
| file = dbm.open('filename', 'c') | Open | Create or open an existing DBM file for I/O |
| file['key'] = 'value' | Store | Create or change the entry for 'key' |
| value = file['key'] | Fetch | Load the value for the entry 'key' |
| count = len(file) | Size | Return the number of entries stored |
| index = file.keys() | Index | Fetch the stored keys list (not a view) |
| found = 'key' in file | Query | See if there's an entry for 'key' |
| del file['key'] | Delete | Remove the entry for 'key' |
| for key in file: | Iterate | Iterate over stored keys |
| file.close() | Close | Manual close, not always needed |
Despite the dictionary-like interface, DBM files really do map to one or more external files. For instance, the underlying default dbm interface used by Python 3.1 on Windows writes two files—movie.dir and movie.dat—when a DBM file called movie is made, and saves a movie.bak on later opens. If your Python has access to a different underlying keyed-file interface, different external files might show up on your computer.
Technically, the module dbm is really an interface to whatever DBM-like filesystem you have available in your Python:

When creating a new file, dbm today tries a set of keyed-file interface modules in a fixed order. According to its documentation, it attempts to import the interfaces dbm.bsd, dbm.gnu, dbm.ndbm, or dbm.dumb, and uses the first that succeeds. Pythons without any of these automatically fall back on an all-Python and always-present implementation called dbm.dumb, which is not really “dumb,” of course, but may not be as fast or robust as other options.

When opening an already existing DBM file, dbm instead tries to determine the system that created it with the dbm.whichdb function. This determination is based upon the content of the database itself.
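For example, dbm.whichdb can report which system created an existing file. A quick sketch, assuming the 'movie' file from the earlier session exists in the current directory:

import dbm

print(dbm.whichdb('movie'))     # e.g. 'dbm.dumb' or 'dbm.gnu';
                                # '' if format unknown, None if unreadable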
Future Pythons are free to change this selection order, and may even add additional alternatives to it. You normally don’t need to care about any of this, though, unless you delete any of the files your DBM creates, or transfer them between machines with different configurations—if you need to care about the portability of your DBM files (and as we’ll see later, by proxy, that of your shelve files), you should configure machines such that all have the same DBM interface installed or rely upon the dumb fallback. For example, the Berkeley DB package (a.k.a. bsddb) used by dbm.bsd is widely available and portable.
Note that DBM files may or may not need to be explicitly closed, per the last entry in Table 17-1. Some DBM files don’t require a close call, but some depend on it to flush changes out to disk. On such systems, your file may be corrupted if you omit the close call. Unfortunately, the default DBM in some older Windows Pythons, dbhash (a.k.a. bsddb), is one of the DBM systems that requires a close call to avoid data loss. As a rule of thumb, always close your DBM files explicitly after making changes and before your program exits to avoid potential problems; it’s essentially a “commit” operation for these files. This rule extends by proxy to shelves, a topic we’ll meet later in this chapter.
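One simple way to make this rule hard to forget is to wrap changes in a try/finally statement, so that the close (and hence the flush) runs even if an update raises an exception. A sketch, reusing the earlier 'movie' file:

import dbm

file = dbm.open('movie', 'c')
try:
    file['Batgirl'] = 'Kapow!'      # make changes
finally:
    file.close()                    # always flush changes to disk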
Recent changes: Be sure to also pass a string 'c' as a second argument when calling dbm.open, to force Python to create the file if it does not yet exist and to simply open it for reads and writes otherwise. This used to be the default behavior but is no longer. You do not need the 'c' argument when opening shelves discussed ahead—they still use an “open or create” 'c' mode by default if passed no open mode argument. Other open mode strings can be passed to dbm, including 'n' to always create the file, and 'r' for read-only of an existing file—the new default. See the Python library manual for more details.
In addition, Python 3.X stores both key and value strings as bytes instead of str as we’ve seen (which turns out to be convenient for pickled data in shelves, discussed ahead) and no longer ships with bsddb as a standard component—it’s available independently on the Web as a third-party extension, but in its absence Python falls back on its own DBM file implementation. Since the underlying DBM implementation rules are prone to change with time, you should always consult Python’s library manuals as well as the dbm module’s standard library source code for more information.
Probably the biggest limitation of DBM keyed files is in what they can store: data stored under a key must be a simple string. If you want to store Python objects in a DBM file, you can sometimes manually convert them to and from strings on writes and reads (e.g., with str and eval calls), but this takes you only so far. For arbitrarily complex Python objects such as class instances and nested data structures, you need something more. Class instance objects, for example, cannot usually be later re-created from their standard string representations. Moreover, custom to-string conversions and from-string parsers are error prone and not general.
The Python pickle module, a standard part of the Python system, provides the conversion step needed. It’s a sort of super-general data formatting and de-formatting tool—pickle converts nearly arbitrary Python in-memory objects to and from a single linear string format, suitable for storing in flat files, shipping across network sockets between trusted sources, and so on. This conversion from object to string is often called serialization—arbitrary data structures in memory are mapped to a serial string form.
The string representation used for objects is also sometimes referred to as a byte stream, due to its linear format. It retains all the content and references structure of the original in-memory object. When the object is later re-created from its byte string, it will be a new in-memory object identical in structure and value to the original, though located at a different memory address.
The net effect is that the re-created object is effectively a copy of the original; in Python-speak, the two will be == but not is. Since the re-creation typically happens in an entirely new process, this difference is often irrelevant (though as we saw in Chapter 5, this generally precludes using pickled objects directly as cross-process shared state).
Pickling works on almost any Python datatype—numbers, lists, dictionaries, class instances, nested structures, and more—and so is a general way to store data. Because pickles contain native Python objects, there is almost no database API to be found; the objects stored with pickling are processed with normal Python syntax when they are later retrieved.
Pickling may sound complicated the first time you encounter it, but the good news is that Python hides all the complexity of object-to-string conversion. In fact, the pickle module’s interfaces are incredibly simple to use. For example, to pickle an object into a serialized string, we can either make a pickler and call its methods or use convenience functions in the module to achieve the same effect:
P = pickle.Pickler(file)
Make a new pickler for pickling to an open output file object file.

P.dump(object)
Write an object onto the pickler’s file/stream.

pickle.dump(object, file)
Same as the last two calls combined: pickle an object onto an open file.

string = pickle.dumps(object)
Return the pickled representation of object as a character string.
Unpickling from a serialized string back to the original object is similar—both object and convenience function interfaces are available:
U = pickle.Unpickler(file)
Make an unpickler for unpickling from an open input file object file.

object = U.load()
Read an object from the unpickler’s file/stream.

object = pickle.load(file)
Same as the last two calls combined: unpickle an object from an open file.

object = pickle.loads(string)
Read an object from a character string rather than a file.
Pickler and Unpickler are exported classes. In all of the preceding cases, file is either an open file object or any object that implements the same attributes as file objects:

Pickler calls the file’s write method with a string to be written.

Unpickler calls the file’s read method with a byte count, and readline without arguments.
Any object that provides these attributes can be passed in to the file parameters. In particular, file can be an instance of a Python class that provides the read/write methods (i.e., the expected file-like interface). This lets you map pickled streams to in-memory objects with classes, for arbitrary use. For instance, the io.BytesIO class in the standard library discussed in Chapter 3 provides an interface that maps file calls to and from in-memory byte strings and is an alternative to the pickler’s dumps/loads string calls.
This hook also lets you ship Python objects across a network, by providing sockets wrapped to look like files in pickle calls at the sender, and unpickle calls at the receiver (see Making Sockets Look Like Files and Streams for more details). In fact, for some, pickling Python objects across a trusted network serves as a simpler alternative to network transport protocols such as SOAP and XML-RPC, provided that Python is on both ends of the communication (pickled objects are represented with a Python-specific format, not with XML text).
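To make this concrete, here is a minimal sketch of the sender’s side of such an exchange; the host and port are hypothetical, and a receiver would call pickle.load on a matching 'rb' wrapper:

import pickle, socket

sock = socket.socket()
sock.connect(('localhost', 50007))        # hypothetical server address
wfile = sock.makefile('wb')               # socket wrapped to look like a file
pickle.dump(['spam', ('ham', 4)], wfile)  # receiver unpickles with pickle.load
wfile.close()
sock.close()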
Recent changes: In Python 3.X, pickled objects are always represented as bytes, not str, regardless of the protocol level you request (even the oldest ASCII protocol yields bytes). Because of this, files used to store pickled Python objects should always be opened in binary mode. Moreover, in 3.X an optimized _pickle implementation module is also selected and used automatically if present. More on both topics later.
Although pickled objects can be shipped in exotic ways, in more typical use, to pickle an object to a flat file, we just open the file in write mode and call the dump function:
C:\...\PP4E\Dbase> python
>>> table = {'a': [1, 2, 3],
...          'b': ['spam', 'eggs'],
...          'c': {'name': 'bob'}}
>>>
>>> import pickle
>>> mydb = open('dbase', 'wb')
>>> pickle.dump(table, mydb)
Notice the nesting in the object pickled here—the pickler handles arbitrary structures. Also note that we’re using binary mode files here; in Python 3.X, we really must, because the pickled representation of an object is always a bytes object in all cases. To unpickle later in another session or program run, simply reopen the file and call load:
C:\...\PP4E\Dbase> python
>>> import pickle
>>> mydb = open('dbase', 'rb')
>>> table = pickle.load(mydb)
>>> table
{'a': [1, 2, 3], 'c': {'name': 'bob'}, 'b': ['spam', 'eggs']}
The object you get back from unpickling has the same value and reference structure as the original, but it is located at a different address in memory. This is true whether the object is unpickled in the same or a future process. Again, the unpickled object is == but is not is:
C:\...\PP4E\Dbase> python
>>> import pickle
>>> f = open('temp', 'wb')
>>> x = ['Hello', ('pickle', 'world')]          # list with nested tuple
>>> pickle.dump(x, f)
>>> f.close()                                   # close to flush changes
>>>
>>> f = open('temp', 'rb')
>>> y = pickle.load(f)
>>> y
['Hello', ('pickle', 'world')]
>>>
>>> x == y, x is y                              # same value, diff objects
(True, False)
To make this process simpler still, the module in Example 17-1 wraps pickling and unpickling calls in functions that also open the files where the serialized form of the object is stored.
"Pickle to/from flat file utilities" import pickle def saveDbase(filename, object): "save object to file" file = open(filename, 'wb') pickle.dump(object, file) # pickle to binary file file.close() # any file-like object will do def loadDbase(filename): "load object from file" file = open(filename, 'rb') object = pickle.load(file) # unpickle from binary file file.close() # re-creates object in memory return object
To store and fetch now, simply call these module functions; here they are in action managing a fairly complex structure with multiple references to the same nested object—the nested list called L at first is stored only once in the file:
C:\...\PP4E\Dbase> python
>>> from filepickle import *
>>> L = [0]
>>> D = {'x': 0, 'y': L}
>>> table = {'A': L, 'B': D}                    # L appears twice
>>> saveDbase('myfile', table)                  # serialize to file

C:\...\PP4E\Dbase> python
>>> from filepickle import *
>>> table = loadDbase('myfile')                 # reload/unpickle
>>> table
{'A': [0], 'B': {'y': [0], 'x': 0}}
>>> table['A'][0] = 1                           # change shared object
>>> saveDbase('myfile', table)                  # rewrite to the file

C:\...\PP4E\Dbase> python
>>> from filepickle import *
>>> print(loadDbase('myfile'))                  # both L's updated as expected
{'A': [1], 'B': {'y': [1], 'x': 0}}
Besides built-in types like the lists, tuples, and dictionaries of the examples so far, class instances may also be pickled to file-like objects. This provides a natural way to associate behavior with stored data (class methods process instance attributes) and provides a simple migration path (class changes made in module files are automatically picked up by stored instances). Here’s a brief interactive demonstration:
>>> class Rec:
...     def __init__(self, hours):
...         self.hours = hours
...     def pay(self, rate=50):
...         return self.hours * rate
...
>>> bob = Rec(40)
>>> import pickle
>>> pickle.dump(bob, open('bobrec', 'wb'))
>>>
>>> rec = pickle.load(open('bobrec', 'rb'))
>>> rec.hours
40
>>> rec.pay()
2000
We’ll explore how this works in more detail in conjunction with shelves later in this chapter—as we’ll see, although the pickle module can be used directly this way, it is also the underlying translation engine in both shelves and ZODB databases.
In general, Python can pickle just about anything, except for:
Compiled code objects: functions and classes record just their names and those of their modules in pickles, to allow for later reimport and automatic acquisition of changes made in module files.
Instances of classes that do not follow class importability rules: in short, the class must be importable on object loads (more on this at the end of the section Shelve Files).
Instances of some built-in and user-defined types that are coded in C or depend upon transient operating system states (e.g., open file objects cannot be pickled).
A PicklingError is raised if an object cannot be pickled. Again, we’ll revisit the pickler’s constraints on pickleable objects and classes when we study shelves.
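For instance, attempting to pickle an open file fails. A quick sketch (the exact exception type raised can vary with the object being pickled):

import pickle

try:
    pickle.dumps(open('temp', 'w'))              # open files can't be pickled
except (pickle.PicklingError, TypeError) as exc:
    print('not pickleable:', exc)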
In later Python releases, the pickler introduced the notion of protocols—storage formats for pickled data. Specify the desired protocol by passing an extra parameter to the pickling calls (but not to unpickling calls: the protocol is automatically determined from the pickled data):
pickle.dump(object, file, protocol)     # or protocol=N keyword argument
Pickled data may be created in either text or binary protocols; the binary protocols’ format is more efficient, but it cannot be readily understood if inspected. By default, the storage protocol in Python 3.X is a 3.X-only binary bytes format (also known as protocol 3). In text mode (protocol 0), the pickled data is printable ASCII text, which can be read by humans (it’s essentially instructions for a stack machine), but it is still a bytes object in Python 3.X. The alternative protocols (protocols 1 and 2) create the pickled data in binary format as well.
For all protocols, pickled data is a bytes object in 3.X, not a str, and therefore implies binary-mode reads and writes when stored in flat files (see Chapter 4 if you’ve forgotten why). Similarly, we must use a bytes-oriented object when forging the file object’s interface:
>>> import io, pickle
>>> pickle.dumps([1, 2, 3])                     # default=binary protocol
b'\x80\x03]q\x00(K\x01K\x02K\x03e.'
>>> pickle.dumps([1, 2, 3], protocol=0)         # ASCII format protocol
b'(lp0\nL1L\naL2L\naL3L\na.'
>>> pickle.dump([1, 2, 3], open('temp', 'wb'))  # same if protocol=0, ASCII
>>> pickle.dump([1, 2, 3], open('temp', 'w'))   # must use 'rb' to read too
TypeError: must be str, not bytes
>>> pickle.dump([1, 2, 3], open('temp', 'w'), protocol=0)
TypeError: must be str, not bytes
>>> B = io.BytesIO()                            # use bytes streams/buffers
>>> pickle.dump([1, 2, 3], B)
>>> B.getvalue()
b'\x80\x03]q\x00(K\x01K\x02K\x03e.'
>>> B = io.BytesIO()                            # also bytes for ASCII
>>> pickle.dump([1, 2, 3], B, protocol=0)
>>> B.getvalue()
b'(lp0\nL1L\naL2L\naL3L\na.'
>>> S = io.StringIO()                           # it's not a str anymore
>>> pickle.dump([1, 2, 3], S)                   # same if protocol=0, ASCII
TypeError: string argument expected, got 'bytes'
>>> pickle.dump([1, 2, 3], S, protocol=0)
TypeError: string argument expected, got 'bytes'
Refer to Python’s library manual for more information on the pickler; it supports additional interfaces that classes may use to customize its behavior, which we’ll bypass here in the interest of space. Also check out marshal, a module that serializes an object too, but can handle only simple object types. pickle is more general than marshal and is normally preferred.
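To contrast the two, here is a brief sketch: marshal handles simple built-in objects such as this list, but unlike pickle it rejects class instances:

import marshal

data = marshal.dumps([1, 2, 3])     # works: simple built-in types only
print(marshal.loads(data))          # [1, 2, 3]

class C: pass
try:
    marshal.dumps(C())              # fails, unlike pickle
except ValueError as exc:
    print('marshal cannot:', exc)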
An additional related module, _pickle, is a C-coded optimization of pickle, and is automatically used by pickle internally if available; it need not be selected or used directly. The shelve module inherits this optimization automatically by proxy. I haven’t explained shelve yet, but I will now.
Pickling allows you to store arbitrary objects on files and file-like objects, but it’s still a fairly unstructured medium; it doesn’t directly support easy access to members of collections of pickled objects. Higher-level structures can be added to pickling, but they are not inherent:
You can sometimes craft your own higher-level pickle file organizations with the underlying filesystem (e.g., you can store each pickled object in a file whose name uniquely identifies the object), but such an organization is not part of pickling itself and must be manually managed; see the sketch after this list.
You can also store arbitrarily large dictionaries in a pickled file and index them by key after they are loaded back into memory, but this will load and store the entire dictionary all at once when unpickled and pickled, not just the entry you are interested in.
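For illustration, here is one way such a file-per-object layout might be hand-rolled; the directory name and helper functions here are hypothetical, not a standard tool:

import os, pickle

dbdir = 'pickledb'                  # hypothetical storage directory
if not os.path.exists(dbdir):
    os.mkdir(dbdir)

def store(key, obj):                # one pickle file per key
    pickle.dump(obj, open(os.path.join(dbdir, key), 'wb'))

def fetch(key):                     # load just the entry requested
    return pickle.load(open(os.path.join(dbdir, key), 'rb'))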
Shelves provide structure for collections of pickled objects that removes some of these constraints. They are a type of file that stores arbitrary Python objects by key for later retrieval, and they are a standard part of the Python system. Really, they are not much of a new topic—shelves are simply a combination of the DBM files and object pickling we just met:
To store an in-memory object by key, the shelve module first serializes the object to a string with the pickle module, and then it stores that string in a DBM file by key with the dbm module.

To fetch an object back by key, the shelve module first loads the object’s serialized string by key from a DBM file with the dbm module, and then converts it back to the original in-memory object with the pickle module.
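In fact, you can approximate what shelve does with a few lines of dbm and pickle code yourself. A rough sketch of the equivalence, not shelve’s actual implementation:

import dbm, pickle

db = dbm.open('mydbm', 'c')
db['key'] = pickle.dumps(['any', 'object'])     # serialize, then store by key
obj = pickle.loads(db['key'])                   # fetch by key, then deserialize
db.close()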
Because shelve uses pickle internally, it can store any object that pickle can: strings, numbers, lists, dictionaries, cyclic objects, class instances, and more. Because shelve uses dbm internally, it inherits all of that module’s capabilities, as well as its portability constraints.
In other words, shelve is just a go-between; it serializes and deserializes objects so that they can be placed in string-based DBM files. The net effect is that shelves let you store nearly arbitrary Python objects on a file by key and fetch them back later with the same key.
Your scripts never see all of this interfacing, though. Like DBM files, shelves provide an interface that looks like a dictionary that must be opened. In fact, a shelve is simply a persistent dictionary of persistent Python objects—the shelve dictionary’s content is automatically mapped to a file on your computer so that it is retained between program runs. This is quite a feat, but it’s simpler to your code than it may sound. To gain access to a shelve, import the module and open your file:
import shelve
dbase = shelve.open("mydbase")
Internally, Python opens a DBM file with the name mydbase, or creates it if it does not yet exist (it uses the DBM 'c' input/output open mode by default). Assigning to a shelve key stores an object:
dbase['key'] = object # store object
Internally, this assignment converts the object to a serialized byte stream with pickling and stores it by key on a DBM file. Indexing a shelve fetches a stored object:
value = dbase['key'] # fetch object
Internally, this index operation loads a string by key from a DBM file and unpickles it into an in-memory object that is the same as the object originally stored. Most dictionary operations are supported here, too:
len(dbase)          # number of items stored
dbase.keys()        # stored item key index iterable
And except for a few fine points, that’s really all there is to using a shelve. Shelves are processed with normal Python dictionary syntax, so there is no new database API to learn. Moreover, objects stored and fetched from shelves are normal Python objects; they do not need to be instances of special classes or types to be stored away. That is, Python’s persistence system is external to the persistent objects themselves. Table 17-2 summarizes these and other commonly used shelve operations.
| Python code | Action | Description |
| import shelve | Import | Get the shelve wrapper module |
| dbase = shelve.open('filename') | Open | Create or open an existing shelve's DBM file |
| dbase['key'] = object | Store | Create or change the entry for 'key' |
| object = dbase['key'] | Fetch | Load the value for the entry 'key' |
| count = len(dbase) | Size | Return the number of entries stored |
| index = dbase.keys() | Index | Fetch the stored keys list (an iterable view) |
| found = 'key' in dbase | Query | See if there's an entry for 'key' |
| del dbase['key'] | Delete | Remove the entry for 'key' |
| for key in dbase: | Iterate | Iterate over stored keys |
| dbase.close() | Close | Manual close, not always needed |
Because shelves export a dictionary-like interface, too, this table is almost identical to the DBM operation table. Here, though, the module name dbm is replaced by shelve, open calls do not require a second 'c' argument, and stored values can be nearly arbitrary kinds of objects, not just strings. Keys are still strings, though (technically, keys are always a str which is encoded to and from bytes automatically per UTF-8), and you still should close shelves explicitly after making changes to be safe: shelves use dbm internally, and some underlying DBMs require closes to avoid data loss or damage.
Recent changes: The shelve module now has an optional writeback argument; if passed True, all entries fetched are cached in memory, and written back to disk automatically at close time. This obviates the need to manually reassign changed mutable entries to flush them to disk, but can perform poorly if many items are fetched—it may require a large amount of memory for the cache, and it can make the close operation slow, since all fetched entries must be written back to disk (Python cannot tell which of the objects may have been changed).
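A brief sketch of the trade-off; the shelve and record names here are hypothetical:

import shelve

db = shelve.open('mydbase', writeback=True)     # cache fetched entries
db['rec'] = {'hits': 0}
db['rec']['hits'] += 1      # in-place change retained: entry is cached
db.close()                  # all cached entries written back to disk here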
Besides allowing values to be arbitrary objects instead of just strings, in Python 3.X the shelve interface differs from the DBM interface in two subtler ways. First, the keys method returns an iterable view object (not a physical list). Second, the values of keys are always str in your code, not bytes—on fetches, stores, deletes, and other contexts, the str keys you use are encoded to the bytes expected by DBM using the UTF-8 Unicode encoding. This means that unlike dbm, you cannot use bytes for shelve keys in your code to employ arbitrary encodings.
Shelve keys are also decoded from bytes to str per UTF-8 whenever they are returned from the shelve API (e.g., keys iteration). Stored values are always the bytes object produced by the pickler to represent a serialized object. We’ll see these behaviors in action in the examples of this section.
Let’s run an interactive session to experiment with shelve interfaces. As mentioned, shelves are essentially just persistent dictionaries of objects, which you open and close:
C:\...\PP4E\Dbase> python
>>> import shelve
>>> dbase = shelve.open("mydbase")
>>> object1 = ['The', 'bright', ('side', 'of'), ['life']]
>>> object2 = {'name': 'Brian', 'age': 33, 'motto': object1}
>>> dbase['brian']  = object2
>>> dbase['knight'] = {'name': 'Knight', 'motto': 'Ni!'}
>>> dbase.close()
Here, we open a shelve and store two fairly complex dictionary and list data structures away permanently by simply assigning them to shelve keys. Because shelve uses pickle internally, almost anything goes here—the trees of nested objects are automatically serialized into strings for storage. To fetch them back, just reopen the shelve and index:
C:\...\PP4E\Dbase> python
>>> import shelve
>>> dbase = shelve.open("mydbase")
>>> len(dbase)                                  # entries
2
>>> dbase.keys()                                # index
KeysView(<shelve.DbfilenameShelf object at 0x0181F630>)
>>> list(dbase.keys())
['brian', 'knight']
>>> dbase['knight']                             # fetch
{'motto': 'Ni!', 'name': 'Knight'}
>>> for row in dbase.keys():                    # .keys() is optional
...     print(row, '=>')
...     for field in dbase[row].keys():
...         print('  ', field, '=', dbase[row][field])
...
brian =>
   motto = ['The', 'bright', ('side', 'of'), ['life']]
   age = 33
   name = Brian
knight =>
   motto = Ni!
   name = Knight
The nested loops at the end of this session step through nested dictionaries—the outer scans the shelve, and the inner scans the objects stored in the shelve (both could use key iterators and omit their .keys() calls). The crucial point to notice is that we’re using normal Python syntax, both to store and to fetch these persistent objects, as well as to process them after loading. It’s persistent Python data on disk.
One of the more useful kinds of objects to store in a shelve is a class instance. Because its attributes record state and its inherited methods define behavior, persistent class objects effectively serve the roles of both database records and database-processing programs. We can also use the underlying pickle module to serialize instances to flat files and other file-like objects (e.g., network sockets), but the higher-level shelve module also gives us a convenient keyed-access storage medium. For instance, consider the simple class shown in Example 17-2, which is used to model people in a hypothetical work scenario.
"a person object: fields + behavior" class Person: def __init__(self, name, job, pay=0): self.name = name self.job = job self.pay = pay # real instance data def tax(self): return self.pay * 0.25 # computed on call def info(self): return self.name, self.job, self.pay, self.tax()
Nothing about this class suggests it will be used for database records—it can be imported and used independent of external storage. It’s easy to use it for a database’s records, though: we can make some persistent objects from this class by simply creating instances as usual, and then storing them by key on an opened shelve:
C:\...\PP4E\Dbase> python
>>> from person import Person
>>> bob   = Person('bob', 'psychologist', 70000)
>>> emily = Person('emily', 'teacher', 40000)
>>>
>>> import shelve
>>> dbase = shelve.open('cast')                 # make new shelve
>>> for obj in (bob, emily):                    # store objects
...     dbase[obj.name] = obj                   # use name for key
...
>>> dbase.close()                               # need for bsddb
Here we used the instance objects’ name attribute as their key in the shelve database. When we come back and fetch these objects in a later Python session or script, they are re-created in memory exactly as they were when they were stored:
C:\...\PP4E\Dbase> python
>>> import shelve
>>> dbase = shelve.open('cast')                 # reopen shelve
>>>
>>> list(dbase.keys())                          # both objects are here
['bob', 'emily']
>>> print(dbase['emily'])
<person.Person object at 0x0197EF70>
>>>
>>> print(dbase['bob'].tax())                   # call: bob's tax
17500.0
Notice that calling Bob’s tax method works even though we didn’t import the Person class in this last session. Python is smart enough to link this object back to its original class when unpickled, such that all the original methods are available through fetched objects.
Technically, Python reimports a class to re-create its stored instances as they are fetched and unpickled. Here’s how this works:
When Python pickles a class instance to store it in a shelve, it saves the instance’s attributes plus a reference to the instance’s class. In effect, pickled class instances in the prior example record the self attributes assigned in the class. Really, Python serializes and stores the instance’s __dict__ attribute dictionary along with enough source file information to be able to locate the class’s module later—the names of the instance’s class as well as its class’s enclosing module.

When Python unpickles a class instance fetched from a shelve, it re-creates the instance object in memory by reimporting the class using the saved class and module name strings, assigning the saved attribute dictionary to a new empty instance, and linking the instance back to the class. This is the default behavior, and it can be tailored by defining special methods that will be called by pickle to fetch and store instance state (see the Python library manual for details).
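For instance, classes can define __getstate__ and __setstate__ methods to control what pickle saves and restores. A sketch with a hypothetical transient attribute:

class Record:
    def __init__(self):
        self.data = {}
        self.tempfile = open('scratch', 'w')    # transient, unpicklable
    def __getstate__(self):
        state = self.__dict__.copy()
        del state['tempfile']                   # drop what can't be pickled
        return state
    def __setstate__(self, state):
        self.__dict__.update(state)
        self.tempfile = open('scratch', 'w')    # re-create on unpickle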
The key point in this is that the class and stored instance data are separate. The class itself is not stored with its instances, but is instead located in the Python source file and reimported later when instances are fetched.
The downside of this model is that the class must be importable to load instances off a shelve (more on this in a moment). The upside is that by modifying external classes in module files, we can change the way stored objects’ data is interpreted and used without actually having to change those stored objects. It’s as if the class is a program that processes stored records.
To illustrate, suppose the Person class from the previous section was changed to the source code in Example 17-3.
""" a person object: fields + behavior change: the tax method is now a computed attribute """ class Person: def __init__(self, name, job, pay=0): self.name = name self.job = job self.pay = pay # real instance data def __getattr__(self, attr): # on person.attr if attr == 'tax': return self.pay * 0.30 # computed on access else: raise AttributeError() # other unknown names def info(self): return self.name, self.job, self.pay, self.tax
This revision has a new tax rate (30 percent), introduces a __getattr__ qualification overload method, and deletes the original tax method. Because this new version of the class is reimported when its existing instances are loaded from the shelve file, they acquire the new behavior automatically—their tax attribute references are now intercepted and computed when accessed:
C:\...\PP4E\Dbase> python
>>> import shelve
>>> dbase = shelve.open('cast')                 # reopen shelve
>>>
>>> print(list(dbase.keys()))                   # both objects are here
['bob', 'emily']
>>> print(dbase['emily'])
<person.Person object at 0x019AEE90>
>>>
>>> print(dbase['bob'].tax)                     # no need to call tax()
21000.0
Because the class has changed, tax is now simply qualified, not called. In addition, because the tax rate was changed in the class, Bob pays more this time around. Of course, this example is artificial, but when used well, this separation of classes and persistent instances can eliminate many traditional database update programs. In most cases, you can simply change the class, not each stored instance, for new behavior.
Although shelves are generally straightforward to use, there are a few rough edges worth knowing about.
First, although they can store arbitrary objects, keys must still be strings. The following fails, unless you convert the integer 42 to the string '42' manually first:
dbase[42] = value # fails, but str(42) will work
This is different from in-memory dictionaries, which allow any immutable object to be used as a key, and derives from the shelve’s use of DBM files internally. As we’ve seen, keys must further be str strings in Python 3.X, not bytes, because the shelve will attempt to encode them in all cases.
Although the shelve module is smart enough to detect multiple occurrences of a nested object and re-create only one copy when fetched, this holds true only within a given slot:
dbase[key] = [object, object]   # OK: only one copy stored and fetched

dbase[key1] = object
dbase[key2] = object            # bad?: two copies of object in the shelve
When key1 and key2 are fetched, they reference independent copies of the original shared object; if that object is mutable, changes from one won’t be reflected in the other. This really stems from the fact that each key assignment runs an independent pickle operation—the pickler detects repeated objects, but only within each pickle call. This may or may not be a concern in your practice, and it can be avoided with extra support logic, but an object can be duplicated if it spans keys.
Because objects fetched from a shelve don’t know that they came from a shelve, operations that change components of a fetched object change only the in-memory copy, not the data on a shelve:
dbase[key].attr = value # shelve unchanged
To really change an object stored on a shelve, fetch it into memory, change its parts, and then write it back to the shelve as a whole by key assignment:
object = dbase[key]         # fetch it
object.attr = value         # modify it
dbase[key] = object         # store back: shelve changed (unless writeback)
As noted earlier, the shelve.open call’s optional writeback argument can be used to avoid the last step here, by automatically caching objects fetched and writing them to disk when the shelve is closed, but this can require substantial memory resources and make close operations slow.
The shelve module does not currently support simultaneous updates. Simultaneous readers are OK, but writers must be given exclusive access to the shelve. You can trash a shelve if multiple processes write to it at the same time, which is a common potential in things such as server-side scripts on the Web. If your shelves may be updated by multiple processes, be sure to wrap updates in calls to the os.open standard library function to lock files and provide exclusive access.
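For example, here is one simple exclusive-access scheme built on os.open; this is a sketch, assuming a lock file name agreed on by all writers, not a standard utility:

import os, time

def lockedUpdate(lockname, update):
    while True:
        try:                                    # O_EXCL: fails if file exists
            fd = os.open(lockname, os.O_CREAT | os.O_EXCL)
            break
        except OSError:
            time.sleep(0.1)                     # another writer holds the lock
    try:
        update()                                # open, change, close the shelve
    finally:
        os.close(fd)
        os.remove(lockname)                     # release for the next writer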
With shelves, the files created by an underlying DBM system used to store your persistent objects are not necessarily compatible with all possible DBM implementations or Pythons. For instance, a file generated by gdbm on Linux, or by the bsddb library on Windows, may not be readable by a Python with other DBM modules installed.
This is really the same portability issue we discussed for DBM files earlier. As you’ll recall, when a DBM file (or by proxy, a shelve) is created, the dbm module tries to import all possible DBM system modules in a predefined order and uses the first that it finds. When dbm later opens an existing file, it attempts to determine which DBM system created it by inspecting the file(s). Because the bsddb system is tried first at file creation time and is available on both Windows and many Unix-like systems, your DBM file is portable as long as your Pythons support BSD on both platforms. This is also true if all platforms you’ll use fall back on Python’s own dbm.dumb implementation. If the system used to create a DBM file is not available on the underlying platform, though, the DBM file cannot be used.
If DBM file portability is a concern, make sure that all the Pythons that will read your data use compatible DBM modules. If that is not an option, use the pickle module directly and flat files for storage (thereby bypassing both shelve and dbm), or use the OODB systems we’ll meet later in this chapter. Such systems may also offer a more complete answer to transaction processing, with calls to commit changes, and automatic rollback to prior commit points on errors.
In addition to these shelve constraints, storing class instances in a shelve adds a set of additional rules you need to be aware of. Really, these are imposed by the pickle module, not by shelve, so be sure to follow these if you store class instance objects with pickle directly too:
As we’ve seen, the Python pickler stores instance attributes only when pickling an instance object, and it reimports the class later to re-create the instance. Because of that, the classes of stored objects must be importable when objects are unpickled—they must be coded unnested at the top level of a module file that is accessible on the module import search path at load time (e.g., named in PYTHONPATH or in a .pth file, or the current working directory or that of the top-level script).
Further, the class usually must be associated with a real imported module when instances are pickled, not with a top-level script (with the module name __main__), unless they will only ever be used in the top-level script. You also need to be careful about moving class modules after instances are stored. When an instance is unpickled, Python must find its class’s module on the module search path using the original module name (including any package path prefixes) and fetch the class from that module using the original class name. If the module or class has been moved or renamed, it might not be found.
In applications where pickled objects are shipped over network sockets, it’s possible to satisfy this constraint by shipping the text of the class along with stored instances; recipients may simply store the class in a local module file on the import search path prior to unpickling received instances. Where this is inconvenient or impossible, simpler pickled objects such as lists and dictionaries with nesting may be transferred instead, as they require no source file to be reconstructed.
Although Python lets you change a class while instances of it are stored on a shelve, those changes must be backward compatible with the objects already stored. For instance, you cannot change the class to expect an attribute not associated with already stored persistent instances unless you first manually update those stored instances or provide extra conversion protocols on the class.
Shelves also inherit the pickling systems’ nonclass limitations. As discussed earlier, some types of objects (e.g., open files and sockets) cannot be pickled, and thus cannot be stored in a shelve.
In an early Python release, persistent object classes also had to either use constructors with no arguments or provide defaults for all constructor arguments (much like the notion of a C++ copy constructor). This constraint was dropped as of Python 1.5.2—classes with nondefaulted constructor arguments now work as is in the pickling system.[69]
Finally, although shelves store objects persistently, they are not really object-oriented database systems. Such systems also implement features such as immediate automatic write-through on changes, transaction commits and rollbacks, safe concurrent updates, and object decomposition and delayed (“lazy”) component fetches based on generated object IDs. Parts of larger objects may be loaded into memory only as they are accessed. It’s possible to extend shelves to support such features manually, but you don’t need to—the ZODB system, among others, provides an implementation of a more complete object-oriented database system. It is constructed on top of Python’s built-in pickling persistence support, but it offers additional features for advanced data stores. For more on ZODB, let’s move on to the next section.
ZODB, the Zope Object Database, is a full-featured and Python-specific object-oriented database (OODB) system. ZODB can be thought of as a more powerful alternative to Python’s shelves of the preceding section. It allows you to store nearly arbitrary Python objects persistently by key, like shelves, but it adds a set of additional features in exchange for a small amount of extra interface code.
ZODB is not the only OODB available for Python: the Durus system is generally seen as a simpler OODB which was inspired by ZODB. While Durus offers some advantages, it does not provide all the features of ZODB today, and it has not been as widely deployed (though perhaps in part because it is newer). Because of that, this section focuses on ZODB to introduce OODB concepts in general.
ZODB is an open source, third-party add-on for Python. It was originally developed as the database mechanism for websites developed with the Zope web framework mentioned in Chapter 12, but it is now available as a standalone package. It’s useful outside the context of both Zope and the Web as a general database management system in any domain.
Although ZODB does not support SQL queries, objects stored in ZODB can leverage the full power of the Python language. Moreover, in some applications, stored data is more naturally represented as a structured Python object. Table-based relational systems often must represent such data as individual parts scattered across multiple tables and associate them with complex and potentially slow key-based joins, or otherwise map them to and from the Python class model. Because OODBs store native Python objects directly, they can often provide a simpler model in systems which do not require the full power of SQL.
Using a ZODB database is very similar to Python’s standard library shelves, described in the prior section. Just like shelves, ZODB uses the Python pickling system to implement a persistent dictionary of persistent Python objects. In fact, there is almost no database interface to be found—objects are made persistent simply by assigning them to keys of the root ZODB dictionary object, or embedding them in objects stored in the database root. And as in a shelve, “records” take the form of native Python objects, processed with normal Python syntax and tools.
Unlike shelves, though, ZODB adds features critical to some types of programs:
You don’t need to manually lock files to avoid data corruption if there are potentially many concurrent writers, the way you would for shelves.
If your program crashes, your changes are not retained unless you explicitly commit them to the database.
Objects in ZODB derived from a persistence superclass are smart enough to know the database must be updated when an attribute is assigned.
Objects are cached in memory for efficiency and are automatically removed from the cache when they haven’t been used.
Because ZODB stores your database in a single flat file with large-file support, it is immune to the potential size constraints and DBM filesystem format differences of shelves. As we saw earlier in this chapter, a shelve created on Windows using bsddb may not be accessible to a script running with gdbm on Linux.
Because of such advantages, ZODB is probably worth your attention if you need to store Python objects in a database persistently in a production environment. The only significant price you’ll pay for using ZODB is a small amount of extra code:
Accessing the database requires a small amount of extra boilerplate code to interface with ZODB—it’s not a simple open call.
Classes are derived from a persistence superclass if you want them to take advantage of automatic updates on changes—persistent classes are generally not as completely independent of the database as in shelves, though they can be.
Considering the extra functionality ZODB provides beyond shelves, these trade-offs are usually more than justified for many applications.
Unfortunately, as I write this edition in June 2010, ZODB is not yet available for Python 3.X, the version used in this book. Because of that, the prior edition’s Python 2.X examples and material have been removed from this section. However, in deference to Python 2.X users, as well as 3.X readers of some bright future where a 3.X-based ZODB has materialized, I’ve made the prior edition’s ZODB materials and examples available in this edition’s examples package.
See the Preface for details on the examples package, and see these locations within it for more on ZODB:
C:\...\Dbase\odb-2.x                    # ZODB examples code, third edition
C:\...\Dbase\odb-2.x\Documentaion       # the third edition's ZODB tutorial
Although I cannot predict the future, ZODB will likely become available for Python 3.X eventually. In the absence of this, other Python-based OODBs may offer additional 3.X options.
To give you a brief sample of ZODB’s flavor, though, here’s a quick spin through its operation in Python 2.X. Once we’ve installed a compatible ZODB, we begin by first creating a database:
...\PP4E\Dbase\odb-2.x> python
>>> from ZODB import FileStorage, DB
>>> storage = FileStorage.FileStorage(r'C:\temp\mydb.fs')
>>> db = DB(storage)
>>> connection = db.open()
>>> root = connection.root()
This is mostly standard “boilerplate” code for connecting to a ZODB database: we import its tools, create a FileStorage and a DB from it, and then open the database and create the root object. The root object is the persistent dictionary in which objects are stored. FileStorage is an object that maps the database to a flat file. Other storage interface options, such as relational database-based storage, are also possible.
Adding objects to a ZODB database is as simple as in shelves. Almost any Python object will do, including tuples, lists, dictionaries, class instances, and nested combinations thereof. As for shelve, simply assign your objects to a key in the database root object to make them persistent:
>>> object1 = (1, 'spam', 4, 'YOU')
>>> object2 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> object3 = {'name': ['Bob', 'Doe'],
...            'age':  42,
...            'job':  ('dev', 'mgr')}
>>> root['mystr']   = 'spam' * 3
>>> root['mytuple'] = object1
>>> root['mylist']  = object2
>>> root['mydict']  = object3
>>> root['mylist']
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Because ZODB supports transaction rollbacks, you must commit your changes to the database to make them permanent. Ultimately, this transfers the pickled representation of your objects to the underlying file storage medium—here, three files that include the name of the file we gave when opening:
>>> import transaction
>>> transaction.commit()
>>> storage.close()

...\PP4E\Dbase\odb-2.x> dir /B c:\temp\mydb*
mydb.fs
mydb.fs.index
mydb.fs.tmp
Without the final commit in this session, none of the changes we made would be saved. This is what we want in general—if a program aborts in the middle of an update task, none of the partially complete work it has done is retained. In fact, ZODB supports general database undo operations.
Pulling persistent objects back from ZODB in another session or program is just as straightforward: reopen the database as before and index the root to fetch objects back into memory. Like shelves, the database root supports dictionary interfaces—it may be indexed, has dictionary methods and a length, and so on:
...\PP4E\Dbase\odb-2.x> python
>>> from ZODB import FileStorage, DB
>>> storage = FileStorage.FileStorage(r'C:\temp\mydb.fs')
>>> db = DB(storage)
>>> connection = db.open()
>>> root = connection.root()                    # connect
>>> len(root), root.keys()                      # size, index
(4, ['mylist', 'mystr', 'mytuple', 'mydict'])
>>> root['mylist']                              # fetch objects
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> root['mydict']
{'job': ('dev', 'mgr'), 'age': 42, 'name': ['Bob', 'Doe']}
>>> root['mydict']['name'][-1]                  # Bob's last name
'Doe'
Because the database root looks just like a dictionary, we can process it with normal dictionary code—stepping through the keys list to scan record by record, for instance:
>>> for key in root.keys():
...     print('%s => %s' % (key.ljust(10), root[key]))
...
mylist     => [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
mystr      => spamspamspam
mytuple    => (1, 'spam', 4, 'YOU')
mydict     => {'job': ('dev', 'mgr'), 'age': 42, 'name': ['Bob', 'Doe']}
Also like pickling and shelves, ZODB supports storage and retrieval of class instance objects, though they must inherit from a superclass which provides required protocol and intercepts attribute changes in order to flush them to disk automatically:
from persistent import Persistent

class Person(Persistent):
    def __init__(self, name, job=None, rate=0):
        self.name = name
        self.job  = job
        self.rate = rate
    def changeRate(self, newrate):
        self.rate = newrate         # automatically updates database
When changing ZODB persistent class instances, in-memory attribute changes are automatically written back to the database. Other types of changes, such as in-place appends and key assignments, still require reassignment to the original key as in shelves to force the change to be written to disk (built-in list and dictionary objects do not know that they are persistent).
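To make the contrast concrete, here is a small hedged sketch (a hypothetical Python 2.X session, since ZODB predates 3.X support here) showing the two kinds of changes side by side:

>>> bob = Person('Bob Smith', job='dev', rate=100)
>>> root['bob'] = bob                  # insert a persistent instance
>>> bob.changeRate(200)                # attribute change: detected automatically

>>> root['mylist'].append(10)          # in-place change: NOT detected
>>> temp = root['mylist']
>>> temp.append(10)
>>> root['mylist'] = temp              # reassign to the key to force a write
>>> transaction.commit()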
Because ZODB does not yet work with Python 3.X, that’s as much as we can say about it in this book. For more details, search for ZODB and Zope resources on the Web, and see the examples package resources listed earlier. Here, let’s move on to see how Python programs can make use of a very different sort of database interface—relational databases and SQL.
The shelve module and ZODB package of the prior sections are powerful tools. Both allow scripts to throw nearly arbitrary Python objects on a keyed-access file and load them back later—in a single step for shelves and with a small amount of administrative code for ZODB. Especially for applications that record highly structured data, object databases can be convenient and efficient—there is no need to split and later join together the parts of large objects, and stored data is processed with normal Python syntax because it is normal Python objects.
Shelves and ZODB aren’t relational database systems, though; objects (records) are accessed with a single key, and there is no notion of SQL queries. Shelves, for instance, are essentially databases with a single index and no other query-processing support. Although it’s possible to build a multiple-index interface to store data with multiple shelves, it’s not a trivial task and requires manually coded extensions.
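For instance, here is one minimal sketch of such a manually coded extension (the file and field names are hypothetical, not a standard recipe): a second shelve serving as an index that maps job names to lists of keys in the main shelve:

import shelve
db    = shelve.open('people-main')       # key -> record dictionary
index = shelve.open('people-byjob')      # job -> list of main-shelve keys

def store(key, record):
    db[key] = record
    keys = index.get(record['job'], [])
    keys.append(key)
    index[record['job']] = keys          # reassign so the change is written

store('bob', dict(name='Bob', job='dev', pay=5000))
print([db[k] for k in index['dev']])     # "query" records by job via the index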
ZODB supports some types of searching beyond shelve (e.g., its cataloging feature), and persistent objects may be traversed with all the power of the Python language. However, neither shelves nor ZODB object-oriented databases provide the full generality of SQL queries. Moreover, especially for data that has a naturally tabular structure, relational databases may sometimes be a better fit.
For programs that can benefit from the power of SQL, Python also broadly supports relational database management systems (RDBMSs). Relational databases are not necessarily mutually exclusive with the object persistence topics we studied earlier in this chapter—it is possible, for example, to store the serialized string representation of a Python object produced by pickling in a relational database. ZODB also supports the notion of mapping an object database to a relational storage medium.
The databases we’ll meet in this section, though, are structured and processed in very different ways:
They store data in related tables of columns (rather than in persistent dictionaries of arbitrarily structured persistent Python objects).
They support the SQL query language for accessing data and exploiting relationships among it (instead of Python object traversals).
For some applications, the end result can be a potent combination. Moreover, some SQL-based database systems provide industrial-strength persistence support for enterprise-level data.
Today, there are freely available interfaces that let Python scripts utilize all common relational database systems, both free and commercial: MySQL, Oracle, Sybase, Informix, InterBase, PostgreSQL (Postgres), SQLite, ODBC, and more. In addition, the Python community has defined a database API specification that works portably with a variety of underlying database packages. Scripts written for this API can be migrated to different database vendor packages, with minimal or no source code changes.
As of Python 2.5, Python itself includes built-in support for the SQLite relational database system as part of its standard library. Because this system supports the portable database API, it serves as a tool for both program storage and prototyping—systems developed with SQLite work largely unchanged when a more feature-rich database such as MySQL or Oracle is deployed.
Moreover, the popular SQLObject and SQLAlchemy third-party systems both provide an Object Relational Mapper (ORM), which grafts an object interface onto your database, in which tables are modeled by Python classes, rows by instances of those classes, and columns by instance attributes. Since ORMs largely just wrap SQL databases in Python classes, we’ll defer their coverage until later in this chapter; for now, let’s explore SQL basics in Python.
Like ZODB, and unlike the pickle and shelve persistence modules presented earlier, most SQL databases are optional extensions that are not part of Python itself. SQLite is the only relational database package that comes with Python. Moreover, you need to know SQL to fully understand their interfaces. Because we don’t have space to teach SQL in this text, this section gives a brief overview of the API; please consult other SQL references and the database API resources mentioned in the next section for more details that we’ll skip here.
The good news is that you can access SQL databases from Python, through a straightforward and portable model. The Python database API specification defines an interface for communicating with underlying database systems from Python scripts. Vendor-specific database interfaces for Python may or may not conform to this API completely, but all database extensions for Python in common use are minor variations on a theme. Under the database API, SQL databases in Python are grounded on three core concepts:
Connection objects
Represent a connection to a database, are the interface to rollback and commit operations, provide package implementation details, and generate cursor objects.

Cursor objects
Represent an SQL statement submitted as a string and can be used to access and step through SQL statement results.

Query results of select statements
Are returned to scripts as Python sequences of sequences (e.g., a list of tuples), representing database tables of rows. Within these row sequences, column field values are normal Python objects such as strings, integers, and floats (e.g., [('bob', 48), ('emily', 47)]). Column values may also be special types that encapsulate things such as date and time, and database NULL values are returned as the Python None object.
Beyond this, the API defines a standard set of database exception types, special database type object constructors, and informational top-level calls including thread safety and replacement style checks.
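For example, the standard exception classes let scripts trap database errors generically. A brief hedged sketch using the sqlite3 module we’ll meet in a moment (the table name here is made up, chosen to fail):

import sqlite3
try:
    conn = sqlite3.connect('dbase1')
    conn.execute('select * from nonesuch')       # no such table: raises
except sqlite3.DatabaseError as exc:             # DB API exception classes
    print('database error:', exc)                # are module attributes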
For instance, to establish a database connection under the Python API-compliant Oracle interface, install the commonly used Python Oracle extension module as well as Oracle itself, and then run a statement of this form:
connobj = connect("user/password@system")
This call’s arguments may vary per database and vendor (e.g., some may require network details or a local file’s name), but they generally contain what you provide to log in to your database system. Once you have a connection object, there are a variety of things you can do with it, including:
connobj.close()       close connection now (not at object __del__ time)
connobj.commit()      commit any pending transactions to the database
connobj.rollback()    roll database back to start of pending transactions
But one of the most useful things to do with a connection object is to generate a cursor object:
cursobj = connobj.cursor() return a new cursor object for running SQL
Cursor objects have a set of methods, too (e.g., close to close the cursor before its destructor runs, and callproc to call a stored procedure), but the most important may be this one:
cursobj.execute(sqlstring [, parameters])    run SQL query or command string
Parameters are passed in as a sequence or mapping of values, and are substituted into the SQL statement string according to the interface module’s replacement target conventions. The execute method can be used to run a variety of SQL statement strings:
DDL definition statements (e.g., CREATE TABLE)
DML modification statements (e.g., UPDATE or INSERT)
DQL query statements (e.g., SELECT)
After running an SQL statement, the cursor’s rowcount attribute gives the number of rows changed (for DML changes) or fetched (for DQL queries), and the cursor’s description attribute gives column names and types after a query; execute also returns the number of rows affected or fetched in most vendor interfaces. For DQL query statements, you must call one of the fetch methods to complete the operation:
tuple       = cursobj.fetchone()          fetch next row of a query result
listoftuple = cursobj.fetchmany([size])   fetch next set of rows of query result
listoftuple = cursobj.fetchall()          fetch all remaining rows of the result
And once you’ve received fetch method results, table information is processed using normal Python sequence operations; for example, you can step through the tuples in a fetchall result list with a simple for loop or comprehension expression. Most Python database interfaces also allow you to provide values to be passed to SQL statement strings, by providing targets and a tuple of parameters. For instance:
query = 'SELECT name, shoesize FROM spam WHERE job = ? AND age = ?'
cursobj.execute(query, (value1, value2))
results = cursobj.fetchall()
for row in results: ...
In this event, the database interface utilizes prepared statements (an optimization and convenience) and correctly passes the parameters to the database regardless of their Python types. The notation used to code targets in the query string may vary in some database interfaces (e.g., :p1 and :p2, or two %s, rather than the two ?s used by the Oracle interface); in any event, this is not the same as Python’s % string formatting operator, as it sidesteps security issues along the way.
Finally, if your database supports stored procedures, you can call them with the callproc method or by passing an SQL CALL or EXEC statement string to the execute method. callproc may generate a result table retrieved with a fetch variant, and returns a modified copy of the input sequence—input parameters are left untouched, and output and input/output parameters are replaced with possibly new values. Additional API features, including support for database blobs (roughly, sized results), are described in the API’s documentation. For now, let’s move on to do some real SQL processing in Python.
We don’t have space to provide an exhaustive reference for the database API in this book. To sample the flavor of the interface, though, let’s step through a few simple examples. We’ll use the SQLite database system for this tutorial. SQLite is a standard part of Python itself, which you can reasonably expect to be available in all Python installations. Although SQLite implements a complete relational database system, it takes the form of an in-process library instead of a server. This generally makes it better suited for program storage than for enterprise-level data needs.
Thanks to Python’s portable database API, though, other popular database packages such as PostgreSQL, MySQL, and Oracle are used almost identically; the initial call to log in to the database will be all that normally requires different argument values for scripts that use standard SQL code. Because of this, we can use the SQLite system both as a prototyping tool in applications development and as an easy way to get started with the Python SQL database API in this book.
As mentioned earlier, the third edition’s coverage of MySQL had to be replaced here because the interface used is not yet ported to Python 3.X. However, the third edition’s MySQL-based examples and overview are available in the book examples package, in directory C:\...\PP4E\Dbase\Sql\MySql-2.X and its Documentation subdirectory. The examples are in Python 2.X form, but their database-related code is largely version neutral. Since that code is also largely database neutral, it is probably of limited value to most readers; the scripts listed in this book should work on other database packages like MySQL with only trivial changes.
Regardless of which database system your scripts talk to, the basic SQL interface in Python is very simple. In fact, it’s hardly object-oriented at all—queries and other database commands are sent as strings of SQL. If you know SQL, you already have most of what you need to use relational databases in Python. That’s good news if you fall into this category, but adds a prerequisite if you don’t.
This isn’t a book on the SQL language, so we’ll defer to other resources for details on the commands we’ll be running here (O’Reilly has a suite of books on the topic). In fact, the databases we’ll use are tiny, and the commands we’ll use are deliberately simple as SQL goes—you’ll want to extrapolate from what you see here to the more realistic tasks you face. This section is just a brief look at how to use the Python language in conjunction with an SQL database.
Whether large or small, though, the Python code needed to process your database turns out to be surprisingly straightforward. To get started, the first thing we need to do is open a connection to the database and create a table for storing records:
C:\...\PP4E\Dbase\Sql> python
>>> import sqlite3
>>> conn = sqlite3.connect('dbase1')        # use a full path for files elsewhere
We start out by importing the Python SQLite interface here—it’s a standard library module known as sqlite3 to our scripts. Next we create a connection object, passing in the items our database requires at start-up time—here, the name of the local file where our databases will be stored. This file is what you’ll want to back up to save your database. The call will create the file if needed, or open its current content; SQLite also accepts the special string “:memory:” to create a temporary database in memory instead.
As long as a script sticks to using standard SQL code, the connect call’s arguments are usually the only thing that can vary across different database systems. For example, in the MySQL interface this call accepts a network host’s domain name, user name, and password, passed as keyword arguments instead, and the Oracle example sketched earlier expects a more specific string syntax. Once we’ve gotten past this platform-specific call, though, the rest of the API is largely database neutral.
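To illustrate, here is a hedged side-by-side sketch; argument names vary by interface and version, and the non-SQLite calls here are untested assumptions rather than verified code:

conn = sqlite3.connect(r'C:\temp\dbase1')                 # SQLite: a local file path
# conn = MySQLdb.connect(host='localhost', user='root',   # MySQL (Python 2.X only)
#                        passwd='XXXX', db='peopledb')    # keyword network details
# conn = connect('user/password@system')                  # the Oracle form sketched earlier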
Next, let’s make a cursor for submitting SQL statements to the database server, and submit one to create a first table:
>>> curs = conn.cursor()
>>> tblcmd = 'create table people (name char(30), job char(10), pay int(4))'
>>> curs.execute(tblcmd)
The last command here creates the table called “people” within the database; the name, job, and pay information specifies the columns in this table, as well as their datatypes, using a “type(size)” syntax—two strings and an integer. Datatypes can be more sophisticated than ours, but we’ll ignore such details here (see SQL references). In SQLite, the file is the database, so there’s no notion of creating or using a specific database within it, as there is in some systems. At this point, there is a simple flat file in the current working directory named dbase1, which contains binary data and contains our people database table.
So far, we’ve logged in (which just means opening a local file in SQLite) and created a table. Next let’s start a new Python session and create some records. There are three basic statement-based approaches we can use here: inserting one row at a time, inserting multiple rows with a single call, or inserting multiple rows with a Python loop. Here is the simple case (I’m omitting some call return values here if they are irrelevant to the story):
C:\...\PP4E\Dbase\Sql> python
>>> import sqlite3
>>> conn = sqlite3.connect('dbase1')
>>> curs = conn.cursor()
>>> curs.execute('insert into people values (?, ?, ?)', ('Bob', 'dev', 5000))
>>> curs.rowcount
1
>>> sqlite3.paramstyle
'qmark'
Create a cursor object to submit SQL statements to the database server as before. The SQL insert command adds a single row to the table. After an execute call, the cursor’s rowcount attribute gives the number of rows produced or affected by the last statement run. This is also available as the return value of an execute call in some database interface modules, but this is not defined in the database API specification, and isn’t provided in SQLite; in other words, don’t depend on it if you want your database scripts to work on other database systems.

Parameters to substitute into the SQL statement string are generally passed in as a sequence (e.g., list or tuple). Notice the module’s paramstyle—this tells us what style it uses for substitution targets in the statement string. Here, qmark means this module accepts ? for replacement targets. Other database modules might use styles such as format (meaning a %s target), or numeric indexes or mapping keys; see the DB API for more details.
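As a hedged sketch of how a script might smooth over that difference (a hypothetical helper, not from this book’s examples, and covering only the qmark and format styles):

def insert_row(dbmodule, cursor, table, values):
    marker = '?' if dbmodule.paramstyle == 'qmark' else '%s'
    slots  = ', '.join([marker] * len(values))      # one target per column
    cursor.execute('insert into %s values (%s)' % (table, slots), values)

# usage (not run in this session, to keep the table contents as shown):
# insert_row(sqlite3, curs, 'people', ('Flo', 'dev', 45000))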
To insert multiple rows with a single statement, use the executemany method and a sequence of row sequences (e.g., a list of lists). This call is like calling execute once for each row sequence in the argument, and in fact may be implemented as such; database interfaces may also use database-specific techniques to make this run quicker, though:
>>> curs.executemany('insert into people values (?, ?, ?)',
...          [ ('Sue', 'mus', '70000'),
...            ('Ann', 'mus', '60000')])
>>> curs.rowcount
2
We inserted two rows at once in the last statement. It’s hardly any more work to achieve the same result by inserting one row at a time with a Python loop:
>>> rows = [['Tom', 'mgr', 100000],
...         ['Kim', 'adm', 30000],
...         ['pat', 'dev', 90000]]
>>> for row in rows:
...     curs.execute('insert into people values (? , ?, ?)', row)
...
>>> conn.commit()
Blending Python and SQL like this starts to open up all sorts of interesting possibilities. Notice the last command; we always need to call the connection’s commit method to write our changes out to the database. Otherwise, when the connection is closed, our changes may be lost. In fact, until we call the commit method, none of our inserts may be visible from other database connections.
Technically, the API suggests that a connection object should automatically call its rollback method to back out changes that have not yet been committed, when it is closed (which happens manually when its close method is called, or automatically when the connection object is about to be garbage collected). For database systems that don’t support transaction commit and rollback operations, these calls may do nothing. SQLite implements both the commit and rollback methods; the latter rolls back any changes made since the last commit.
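As a hedged aside (a sqlite3 convenience, not part of the portable DB API, and not run as part of this session’s transcript): sqlite3 connection objects can also be used as context managers that commit if the block succeeds and roll back if it raises an exception:

import sqlite3
conn2 = sqlite3.connect('dbase1')
try:
    with conn2:                                     # commits if the block succeeds
        conn2.execute('insert into people values (?, ?, ?)',
                      ('Flo', 'dev', 45000))
        raise ValueError('abandon this update')     # forces a rollback instead
except ValueError:
    pass                                            # the insert was backed out
conn2.close()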
OK, we’ve now added six records to our database table. Let’s run an SQL query to see how we did:
>>> curs.execute('select * from people')
>>> curs.fetchall()
[('Bob', 'dev', 5000), ('Sue', 'mus', 70000), ('Ann', 'mus', 60000), ('Tom', 'mgr', 100000), ('Kim', 'adm', 30000), ('pat', 'dev', 90000)]
Run an SQL select statement with a cursor object to grab all rows and call the cursor’s fetchall to retrieve them. They come back to our script as a sequence of sequences. In this module, it’s a list of tuples—the outer list represents the result table, the nested tuples are that table’s rows, and the nested tuples’ contents are the column data. Because it’s all Python data, once we get the query result, we process it with normal Python code. For example, to make the display a bit more coherent, loop through the query’s result as usual:
>>> curs.execute('select * from people')
>>> for row in curs.fetchall():
...     print(row)
...
('Bob', 'dev', 5000)
('Sue', 'mus', 70000)
('Ann', 'mus', 60000)
('Tom', 'mgr', 100000)
('Kim', 'adm', 30000)
('pat', 'dev', 90000)
Tuple unpacking comes in handy in loops here, too, to pick out column values as we go. Here’s a simple formatted display of two of the columns’ values:
>>> curs.execute('select * from people')
>>> for (name, job, pay) in curs.fetchall():
...     print(name, ':', pay)
...
Bob : 5000
Sue : 70000
Ann : 60000
Tom : 100000
Kim : 30000
pat : 90000
Because the query result is a sequence, we can use Python’s powerful sequence and iteration tools to process it. For instance, to select just the name column values, we can run a more specific SQL query and get a list of tuples:
>>> curs.execute('select name from people')
>>> names = curs.fetchall()
>>> names
[('Bob',), ('Sue',), ('Ann',), ('Tom',), ('Kim',), ('pat',)]
Or we can use a Python list comprehension to pick out the fields we want—by using Python code, we have more control over the data’s content and format:
>>> curs.execute('select * from people')
>>> names = [rec[0] for rec in curs.fetchall()]
>>> names
['Bob', 'Sue', 'Ann', 'Tom', 'Kim', 'pat']
The fetchall call we’ve used so far fetches the entire query result table all at once, as a single sequence (an empty sequence comes back if the result is empty). That’s convenient, but it may be slow enough to block the caller temporarily for large result tables, or generate substantial network traffic if the server is running remotely (something that could easily require a parallel thread in a GUI). To avoid such a bottleneck, we can also grab just one row, or a bunch of rows, at a time with fetchone and fetchmany. The fetchone call returns the next result row, or a None false value at the end of the table:
>>> curs.execute('select * from people')
>>> while True:
...     row = curs.fetchone()
...     if not row: break
...     print(row)
...
('Bob', 'dev', 5000)
('Sue', 'mus', 70000)
('Ann', 'mus', 60000)
('Tom', 'mgr', 100000)
('Kim', 'adm', 30000)
('pat', 'dev', 90000)
The fetchmany call returns a sequence of rows from the result, but not the entire table; you can specify how many rows to grab each time with a parameter, or rely on the default as given by the cursor’s arraysize attribute. Each call gets at most that many more rows from the result, or an empty sequence at the end of the table:
>>> curs.execute('select * from people')
>>> while True:
...     rows = curs.fetchmany()          # size=N optional argument
...     if not rows: break
...     for row in rows:
...         print(row)
...
('Bob', 'dev', 5000)
('Sue', 'mus', 70000)
('Ann', 'mus', 60000)
('Tom', 'mgr', 100000)
('Kim', 'adm', 30000)
('pat', 'dev', 90000)
For this module at least, the result table is exhausted after a fetchone or fetchmany returns a false value. The DB API says that fetchall returns “all (remaining) rows,” so you generally need to call execute again to regenerate results before fetching new data:
>>> curs.fetchone()
>>> curs.fetchmany()
[]
>>> curs.fetchall()
[]
Naturally, we can do more than fetch an entire table; the full power of the SQL language is at your disposal in Python:
>>> curs.execute('select name, job from people where pay > 60000')
>>> curs.fetchall()
[('Sue', 'mus'), ('Tom', 'mgr'), ('pat', 'dev')]
The last query fetches name and job fields for people who earn more than $60,000. The next is similar, but passes in the selection value as a parameter and orders the result table:
>>> query = 'select name, job from people where pay >= ? order by name'
>>> curs.execute(query, [60000])
>>> for row in curs.fetchall(): print(row)
...
('Ann', 'mus')
('Sue', 'mus')
('Tom', 'mgr')
('pat', 'dev')
Cursor objects also are used to submit SQL update statements to the database server—updates, deletes, and inserts. We’ve already seen the insert statement at work. Let’s start a new session to perform some other kinds of updates; we begin with the same data we had in the prior session:
C:\...\PP4E\Dbase\Sql> python
>>> import sqlite3
>>> conn = sqlite3.connect('dbase1')
>>> curs = conn.cursor()
>>> curs.execute('select * from people')
>>> curs.fetchall()
[('Bob', 'dev', 5000), ('Sue', 'mus', 70000), ('Ann', 'mus', 60000), ('Tom', 'mgr', 100000), ('Kim', 'adm', 30000), ('pat', 'dev', 90000)]
The SQL update statement changes records—the following changes three records’ pay column values to 65000 (Bob, Ann, and Kim), because their pay was no more than $60,000. As usual, the cursor’s rowcount gives the number of records changed:
>>> curs.execute('update people set pay=? where pay <= ?', [65000, 60000])
>>> curs.rowcount
3
>>> curs.execute('select * from people')
>>> curs.fetchall()
[('Bob', 'dev', 65000), ('Sue', 'mus', 70000), ('Ann', 'mus', 65000), ('Tom', 'mgr', 100000), ('Kim', 'adm', 65000), ('pat', 'dev', 90000)]
The SQL delete statement removes records, optionally according to a condition (to delete all records, omit the condition). In the following, we delete Bob’s record, as well as any record with a pay that is at least $90,000:
>>> curs.execute('delete from people where name = ?', ['Bob'])
>>> curs.execute('delete from people where pay >= ?', (90000,))
>>> curs.execute('select * from people')
>>> curs.fetchall()
[('Sue', 'mus', 70000), ('Ann', 'mus', 65000), ('Kim', 'adm', 65000)]
>>> conn.commit()
Finally, remember to commit your changes to the database before exiting Python, assuming you wish to keep them. Without a commit, a connection rollback or close call, as well as the connection’s __del__ deletion method, will back out uncommitted changes. Connection objects are automatically closed if they are still open when they are garbage collected, which in turn triggers a __del__ and a rollback; garbage collection happens automatically on program exit, if not sooner.
Now that we’ve seen the basics in action, let’s move on and apply them to a few larger tasks. The SQL API defines query results to be sequences of sequences. One of the more common features that people seem to miss from the API is the ability to get records back as something more structured—a dictionary or class instance, for example, with keys or attributes giving column names. The ORMs we’ll meet at the end of this chapter map to class instances, but because this is Python, it’s easy to code this kind of transformation in other ways. Moreover, the API already gives us the tools we need.
For example, after a query execute call, the DB API specifies that the cursor’s description attribute gives the names and (for some databases) types of the columns in the result table. To see how, let’s continue with the database in the state in which we left it in the prior section:
>>> curs.execute('select * from people')
>>> curs.description
(('name', None, None, None, None, None, None), ('job', None, None, None, None, None, None), ('pay', None, None, None, None, None, None))
>>> curs.fetchall()
[('Sue', 'mus', 70000), ('Ann', 'mus', 65000), ('Kim', 'adm', 65000)]
Formally, the description is a sequence of column-description sequences, each of the following form. See the DB API for more on the meaning of the type code slot—it maps to objects at the top level of the database interface module, but the sqlite3 module implements only the name component:
(name, type_code, display_size, internal_size, precision, scale, null_ok)
Now, we can use this metadata anytime we want to label the columns—for instance, in a formatted records display (be sure to regenerate a query result first, since the prior result has been fetched):
>>> curs.execute('select * from people')
>>> colnames = [desc[0] for desc in curs.description]
>>> colnames
['name', 'job', 'pay']
>>> for row in curs.fetchall():
...     for name, value in zip(colnames, row):
...         print(name, '\t=>', value)
...     print()
...
name    => Sue
job     => mus
pay     => 70000

name    => Ann
job     => mus
pay     => 65000

name    => Kim
job     => adm
pay     => 65000
Notice how a tab character is used to try to make this output align; a better approach might be to determine the maximum field name length (we’ll see how in a later example).
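Here is a quick hedged sketch of that better approach, reusing curs and colnames from this session (the later dumpdb example does the same thing more completely):

>>> width = max(len(name) for name in colnames)     # widest column name
>>> curs.execute('select * from people')
>>> for row in curs.fetchall():
...     for name, value in zip(colnames, row):
...         print('%-*s => %s' % (width, name, value))
...     print()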
It’s a minor extension of our formatted display code to create a dictionary for each record, with field names for keys—we just need to fill in the dictionary as we go:
>>> curs.execute('select * from people')
>>> colnames = [desc[0] for desc in curs.description]
>>> rowdicts = []
>>> for row in curs.fetchall():
...     newdict = {}
...     for name, val in zip(colnames, row):
...         newdict[name] = val
...     rowdicts.append(newdict)
...
>>> for row in rowdicts: print(row)
...
{'pay': 70000, 'job': 'mus', 'name': 'Sue'}
{'pay': 65000, 'job': 'mus', 'name': 'Ann'}
{'pay': 65000, 'job': 'adm', 'name': 'Kim'}
Because this is Python, though, there are more powerful ways to build up these record dictionaries. For instance, the dictionary constructor call accepts the zipped name/value pairs to fill out the dictionaries for us:
>>> curs.execute('select * from people')
>>> colnames = [desc[0] for desc in curs.description]
>>> rowdicts = []
>>> for row in curs.fetchall():
...     rowdicts.append(dict(zip(colnames, row)))
...
>>> rowdicts[0]
{'pay': 70000, 'job': 'mus', 'name': 'Sue'}
And finally, a list comprehension will do the job of collecting the dictionaries into a list—not only is this less to type, but it probably runs quicker than the original version:
>>> curs.execute('select * from people')
>>> colnames = [desc[0] for desc in curs.description]
>>> rowdicts = [dict(zip(colnames, row)) for row in curs.fetchall()]
>>> rowdicts[0]
{'pay': 70000, 'job': 'mus', 'name': 'Sue'}
One of the things we lose when moving to dictionaries is record field order—if you look back at the raw result of fetchall, you’ll notice that record fields are in the name, job, and pay order in which they were stored. Our dictionary’s fields come back in the pseudorandom order of Python mappings. As long as we fetch fields by key, this is irrelevant to our script. Tables still maintain their order, and dictionary construction works fine because the description result tuple is in the same order as the fields in row tuples returned by queries.
We’ll leave the task of translating record tuples into class instances as a suggested exercise (though a sketch follows), except for two hints: Python’s standard library collections module implements more exotic data types, such as named tuples and ordered dictionaries; and we can access fields as attributes rather than as keys, by simply creating an empty class instance and assigning to attributes with the Python setattr function. Classes would also provide a natural place to code inheritable tools such as standard display methods. In fact, this is part of the utility that the upcoming ORMs can provide for us.
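As a hedged sketch of that exercise (the function and class names here are hypothetical, not this book’s code), both hints can be applied in a few lines:

from collections import namedtuple

def makerecords(cursor, query, params=()):
    cursor.execute(query, params)
    colnames = [desc[0] for desc in cursor.description]
    Rec = namedtuple('Rec', colnames)               # fields become attributes
    return [Rec(*row) for row in cursor.fetchall()]

# or, with a plain class and setattr, leaving room for inherited tools:
class Row:
    def display(self):                              # a standard display method
        for key in sorted(self.__dict__):
            print('%s => %s' % (key, getattr(self, key)))

def makeinstances(cursor, query, params=()):
    cursor.execute(query, params)
    colnames = [desc[0] for desc in cursor.description]
    objs = []
    for row in cursor.fetchall():
        obj = Row()
        for name, value in zip(colnames, row):
            setattr(obj, name, value)               # fill attributes from columns
        objs.append(obj)
    return objs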
Up to this point, we’ve essentially used Python as a command-line SQL client—our queries have been typed and run interactively. All the kinds of code we’ve run, though, can be used as the basis of database access in script files. Working interactively requires retyping things such as multiline loops, which can become tedious. With scripts, we can automate our work.
To demonstrate, let’s make the prior section’s example into a utility module—Example 17-4 is a reusable module that knows how to translate the result of a query from row tuples to row dictionaries.
""" convert list of row tuples to list of row dicts with field name keys this is not a command-line utility: hardcoded self-test if run """ def makedicts(cursor, query, params=()): cursor.execute(query, params) colnames = [desc[0] for desc in cursor.description] rowdicts = [dict(zip(colnames, row)) for row in cursor.fetchall()] return rowdicts if __name__ == '__main__': # self test import sqlite3 conn = sqlite3.connect('dbase1') cursor = conn.cursor() query = 'select name, pay from people where pay < ?' lowpay = makedicts(cursor, query, [70000]) for rec in lowpay: print(rec)
As usual, we can run this file from the system command line as a script to invoke its self-test code:
...\PP4E\Dbase\Sql> makedicts.py
{'pay': 65000, 'name': 'Ann'}
{'pay': 65000, 'name': 'Kim'}
Or we can import it as a module and call its function from another context, like the interactive prompt. Because it is a module, it has become a reusable database tool:
...\PP4E\Dbase\Sql> python
>>> from makedicts import makedicts
>>> from sqlite3 import connect
>>> conn = connect('dbase1')
>>> curs = conn.cursor()
>>> curs.execute('select * from people')
>>> curs.fetchall()
[('Sue', 'mus', 70000), ('Ann', 'mus', 65000), ('Kim', 'adm', 65000)]

>>> rows = makedicts(curs, "select name from people where job = 'mus'")
>>> rows
[{'name': 'Sue'}, {'name': 'Ann'}]
Our utility handles arbitrarily complex queries—they are simply passed through the DB API to the database server or library. The order by clause here sorts the result on the name field:
>>> query = 'select name, pay from people where job = ? order by name'
>>> musicians = makedicts(curs, query, ['mus'])
>>> for row in musicians: print(row)
...
{'pay': 65000, 'name': 'Ann'}
{'pay': 70000, 'name': 'Sue'}
So far, we’ve learned how to make databases and tables, insert records into tables, query table contents, and extract column names. For reference, and to show how these techniques are combined, Example 17-5 collects them into a single script.
from sqlite3 import connect

conn = connect('dbase1')
curs = conn.cursor()
try:
    curs.execute('drop table people')
except:
    pass                                   # did not exist
curs.execute('create table people (name char(30), job char(10), pay int(4))')
curs.execute('insert into people values (?, ?, ?)', ('Bob', 'dev', 50000))
curs.execute('insert into people values (?, ?, ?)', ('Sue', 'dev', 60000))

curs.execute('select * from people')
for row in curs.fetchall():
    print(row)

curs.execute('select * from people')
colnames = [desc[0] for desc in curs.description]
while True:
    print('-' * 30)
    row = curs.fetchone()
    if not row: break
    for (name, value) in zip(colnames, row):
        print('%s => %s' % (name, value))

conn.commit()                              # save inserted records
Refer to prior sections in this tutorial if any of the code in this script is unclear. When run, it creates a two-record database and lists its content to the standard output stream:
C:\...\PP4E\Dbase\Sql> testdb.py
('Bob', 'dev', 50000)
('Sue', 'dev', 60000)
------------------------------
name => Bob
job => dev
pay => 50000
------------------------------
name => Sue
job => dev
pay => 60000
------------------------------
As is, this example is really just meant to demonstrate the database API. It hardcodes database names, and it re-creates the database from scratch each time. We could turn this code into generally useful tools by refactoring it into reusable parts, as we’ll see later in this section. First, though, let’s explore techniques for getting data into our databases.
One of the nice things about using Python in the database domain is that you can combine the power of the SQL query language with the power of the Python general-purpose programming language. They naturally complement each other.
Suppose, for example, that you want to load a database table from a flat file, where each line in the file represents a database row, with individual field values separated by commas. Examples 17-6 and 17-7 list two such datafiles we’re going to be using here.
data.txt:

bob,devel,50000
sue,music,60000
ann,devel,40000
tim,admin,30000
kim,devel,60000

data2.txt (reconstructed here from the dump of this file later in this section):

bob,developer,80000
sue,music,90000
ann,manager,80000
Now, some database systems like MySQL have a handy SQL statement for loading such a table quickly. Its load data statement parses and loads data from a text file, located on either the client or the server machine. In the following, the first command deletes all records in the table, and we’re using the fact that Python automatically concatenates adjacent string literals to split the SQL statement over multiple lines:
# Using MySQL (currently available for Python 2.X only)
...log into MySQL first...
>>> curs.execute('delete from people')               # all records
>>> curs.execute(
...     "load data local infile 'data.txt' "
...     "into table people fields terminated by ','")
>>> curs.execute('select * from people')
>>> for row in curs.fetchall(): print(row)
...
('bob', 'devel', 50000L)                             # 2.X long integers
('sue', 'music', 60000L)
('ann', 'devel', 40000L)
('tim', 'admin', 30000L)
('kim', 'devel', 60000L)
>>> conn.commit()
This works as expected. But what if you must use a system like the SQLite database used in this book, which lacks this specific SQL statement? Or, perhaps you just need to do something more custom than this MySQL statement allows. Not to worry—a small amount of simple Python code can easily accomplish the same result with SQLite and Python 3.X (again, some irrelevant output lines are omitted here):
C:\...\PP4E\Dbase\Sql> python
>>> from sqlite3 import connect
>>> conn = connect('dbase1')
>>> curs = conn.cursor()
>>> curs.execute('delete from people')               # empty the table
>>> curs.execute('select * from people')
>>> curs.fetchall()
[]

>>> file = open('data.txt')
>>> rows = [line.rstrip().split(',') for line in file]
>>> rows[0]
['bob', 'devel', '50000']
>>> for rec in rows:
...     curs.execute('insert into people values (?, ?, ?)', rec)
...
>>> curs.execute('select * from people')
>>> for rec in curs.fetchall(): print(rec)
...
('bob', 'devel', 50000)
('sue', 'music', 60000)
('ann', 'devel', 40000)
('tim', 'admin', 30000)
('kim', 'devel', 60000)
This code makes use of a list comprehension to collect string split results for all lines in the file after removing any newline characters, and file iterators to step through the file line by line. Its Python loop does the same work as the MySQL load statement, and it will work on more database types, including SQLite. We could get a similar result from an executemany DB API call shown earlier as well, but the Python for loop here has the potential to be more general.
In fact, you have the entire Python language at your disposal for processing database results, and a little Python can often duplicate or go beyond SQL syntax. For instance, SQL has special aggregate function syntax for computing things such as sums and averages:
>>>curs.execute("select sum(pay), avg(pay) from people where job = 'devel'")
>>>curs.fetchall()
[(150000, 50000.0)]
By shifting the processing to Python, we can sometimes simplify and do more than SQL’s syntax allows (albeit potentially sacrificing any query performance optimizations the database may perform). Computing pay sums and averages with Python can be accomplished with a simple loop:
>>>curs.execute("select name, pay from people where job = 'devel'")
>>>result = curs.fetchall()
>>>result
(('bob', 50000L), ('ann', 40000L), ('kim', 60000L)) >>>tot = 0
>>>for (name, pay) in result: tot += pay
... >>>print('total:', tot, 'average:', tot / len(result))
# use // to truncate total: 150000 average: 50000.0
Or we can use more advanced tools such as comprehensions and generator expressions to calculate sums, averages, maximums, and the like:
>>> print(sum(rec[1] for rec in result))                  # generator expr
150000
>>> print(sum(rec[1] for rec in result) / len(result))
50000.0
>>> print(max(rec[1] for rec in result))
60000
The Python approach is more general, but it doesn’t buy us much until things become more complex. For example, here are a few more advanced comprehensions that collect the names of people whose pay is above and below the average in the query result set:
>>> avg = sum(rec[1] for rec in result) / len(result)
>>> print([rec[0] for rec in result if rec[1] > avg])
['kim']
>>> print([rec[0] for rec in result if rec[1] < avg])
['ann']
We may be able to do some of these kinds of tasks with more advanced SQL techniques such as nested queries, but we eventually reach a complexity threshold where Python’s general-purpose nature makes it attractive and potentially more portable. For comparison, here is the equivalent SQL:
>>>query = ("select name from people where job = 'devel' and "
..."pay > (select avg(pay) from people where job = 'devel')")
>>>curs.execute(query)
>>>curs.fetchall()
[('kim',)] >>>query = ("select name from people where job = 'devel' and "
..."pay < (select avg(pay) from people where job = 'devel')")
>>>curs.execute(query)
>>>curs.fetchall()
[('ann',)]
This isn’t the most complex SQL you’re likely to meet, but beyond this point, SQL can become more involved. Moreover, unlike Python, SQL is limited to database-specific tasks by design. Imagine a query that compares a column’s values to data fetched off the Web or from a user in a GUI—simple with Python’s Internet and GUI support, but well beyond the scope of a special-purpose language such as SQL. By combining Python and SQL, you get the best of both and can choose which is best suited to your goals.
With Python, you also have access to utilities you’ve already coded: your database tool set is arbitrarily extensible with functions, modules, and classes. To illustrate, here are some of the same operations coded in a more mnemonic fashion with the dictionary-record module we wrote earlier:
>>> from makedicts import makedicts
>>> recs = makedicts(curs, "select * from people where job = 'devel'")
>>> print(len(recs), recs[0])
3 {'pay': 50000, 'job': 'devel', 'name': 'bob'}

>>> print([rec['name'] for rec in recs])
['bob', 'ann', 'kim']
>>> print(sum(rec['pay'] for rec in recs))
150000

>>> avg = sum(rec['pay'] for rec in recs) / len(recs)
>>> print([rec['name'] for rec in recs if rec['pay'] > avg])
['kim']
>>> print([rec['name'] for rec in recs if rec['pay'] >= avg])
['bob', 'kim']
Similarly, Python’s set object type provides operations such as intersection, union, and difference, which can serve as alternatives to other SQL database operations (in the interest of space, we’ll largely leave their exploration as a suggested side trip, though a brief taste follows). For more advanced database extensions, see the SQL-related tools available for Python in the third-party domain. For example, a variety of packages add an OOP flavor to the DB API—the ORMs we’ll explore near the end of this chapter.
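As a brief hedged taste of that side trip (assuming the curs cursor and the data.txt rows loaded above; the expected set contents are shown in comments since display order varies):

curs.execute("select name from people where job = 'devel'")
devs = {row[0] for row in curs.fetchall()}          # {'bob', 'ann', 'kim'}
curs.execute("select name from people where pay >= 60000")
rich = {row[0] for row in curs.fetchall()}          # {'sue', 'kim'}
print(devs & rich)                                  # intersection: {'kim'}
print(devs | rich)                                  # union of both sets
print(devs - rich)                                  # difference: {'bob', 'ann'}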
At this point in our SQL DB API tour, we’ve started to stretch the interactive prompt to its breaking point—we wind up retyping the same boilerplate code again every time we start a session and every time we run a test. Moreover, the code we’re writing is substantial enough to be reused in other programs. Let’s wrap up by transforming our code into reusable scripts that automate tasks and support reuse.
To illustrate more of the power of the Python/SQL mix, this section presents a handful of utility scripts that perform common tasks—the sorts of things you’d otherwise have to recode often during development. As an added bonus, most of these files are both command-line utilities and modules of functions that can be imported and called from other programs. Most of the scripts in this section also allow a database file name to be passed in on the command line; this allows us to use different databases for different purposes during development—changes in one won’t impact others.
Let’s take a quick look at code first, before seeing it in action; feel free to skip ahead to correlate the code here with its behavior. As a first (and less than ideal) step, Example 17-8 shows a simple way to script-ify the table-loading logic of the prior section.
""" load table from comma-delimited text file; equivalent to this nonportable SQL: load data local infile 'data.txt' into table people fields terminated by ','" """ import sqlite3 conn = sqlite3.connect('dbase1') curs = conn.cursor() file = open('data.txt') rows = [line.rstrip().split(',') for line in file] for rec in rows: curs.execute('insert into people values (?, ?, ?)', rec) conn.commit() # commit changes now, if db supports transactions conn.close() # close, __del__ call rollback if changes not committed yet
As is, Example 17-8 is a top-level script geared toward one particular case. It’s hardly any extra work to generalize this into a function that can be imported and used in a variety of scenarios, as in Example 17-9—a much more widely useful module and command-line script.
""" load table from comma-delimited text file: reusable/generalized version Importable functions; command-line usage: loaddb.py dbfile? datafile? table? """ def login(dbfile): import sqlite3 conn = sqlite3.connect(dbfile) # create or open db file curs = conn.cursor() return conn, curs def loaddb(curs, table, datafile, conn=None, verbose=True): file = open(datafile) # x,x,x x,x,x rows = [line.rstrip().split(',') for line in file] # [[x,x,x], [x,x,x]] rows = [str(tuple(rec)) for rec in rows] # ["(x,x,x)", "(x,x,x)"] for recstr in rows: curs.execute('insert into ' + table + ' values ' + recstr) if conn: conn.commit() if verbose: print(len(rows), 'rows loaded') if __name__ == '__main__': import sys dbfile, datafile, table = 'dbase1', 'data.txt', 'people' if len(sys.argv) > 1: dbfile = sys.argv[1] if len(sys.argv) > 2: datafile = sys.argv[2] if len(sys.argv) > 3: table = sys.argv[3] conn, curs = login(dbfile) loaddb(curs, table, datafile, conn)
Notice the way this code uses two list comprehensions to build a string of record values for the insert statement (see its comments for the transforms applied). We could also use an executemany call as we did earlier, but we want to be general and avoid hardcoding the fields insertion template—this function might be used for tables with any number of columns.
This file also defines a login function to automate the initial connection calls—after retyping this four-command sequence enough times, it seemed a prime candidate for a function. In addition, this reduces code redundancy; in the future, such logic need only be changed in a single location if we change database systems, as long as the login function is used everywhere.
Once we load data, we probably will want to display it. Example 17-10 allows us to display results as we go—it prints an entire table with either a simple display (which could be parsed by other tools) or a formatted display (generated with the dictionary-record utility we wrote earlier). Notice how it computes the maximum field-name size for alignment with a generator expression; the size is passed in to a string formatting expression by specifying an asterisk (*) for the field size in the format string.
""" display table contents as raw tuples, or formatted with field names command-line usage: dumpdb.py dbname? table? [-] (dash=formatted display) """ def showformat(recs, sept=('-' * 40)): print(len(recs), 'records') print(sept) for rec in recs: maxkey = max(len(key) for key in rec) # max key len for key in rec: # or: align print('%-*s => %s' % (maxkey, key, rec[key])) # -ljust, *len print(sept) def dumpdb(cursor, table, format=True): if not format: cursor.execute('select * from ' + table) while True: rec = cursor.fetchone() if not rec: break print(rec) else: from makedicts import makedicts recs = makedicts(cursor, 'select * from ' + table) showformat(recs) if __name__ == '__main__': import sys dbname, format, table = 'dbase1', False, 'people' cmdargs = sys.argv[1:] if '-' in cmdargs: # format if '-' in cmdline args format = True # dbname if other cmdline arg cmdargs.remove('-') if cmdargs: dbname = cmdargs.pop(0) if cmdargs: table = cmdargs[0] from loaddb import login conn, curs = login(dbname) dumpdb(curs, table, format)
While we’re at it, let’s code some utility scripts to initialize and erase the database, so we do not have to type these by hand at the interactive prompt again every time we want to start from scratch. Example 17-11 completely deletes and re-creates the database, to reset it to an initial state (we did this manually at the start of the tutorial).
""" physically delete and re-create database files usage: makedb.py dbname? tablename? """ import sys if input('Are you sure?').lower() not in ('y', 'yes'): sys.exit() dbname = (len(sys.argv) > 1 and sys.argv[1]) or 'dbase1' table = (len(sys.argv) > 2 and sys.argv[2]) or 'people' from loaddb import login conn, curs = login(dbname) try: curs.execute('drop table ' + table) except: print('database table did not exist') command = 'create table %s (name char(30), job char(10), pay int(4))' % table curs.execute(command) conn.commit() # commit may be optional here print('made', dbname, table)
Next, the clear script in Example 17-12 deletes all rows in the table, instead of dropping and re-creating them entirely. For testing purposes, either approach is sufficient. Minor caveat: the rowcount attribute doesn’t always reflect the number of rows deleted in SQLite; see its library manual entry for details.
""" delete all rows in table, but don't drop the table or database it is in usage: cleardb.py dbname? tablename? """ import sys if input('Are you sure?').lower() not in ('y', 'yes'): sys.exit() dbname = sys.argv[1] if len(sys.argv) > 1 else 'dbase1' table = sys.argv[2] if len(sys.argv) > 2 else 'people' from loaddb import login conn, curs = login(dbname) curs.execute('delete from ' + table) #print(curs.rowcount, 'records deleted') # conn closed by its __del__ conn.commit() # else rows not really deleted
Finally, Example 17-13 provides a command-line tool that runs a query and prints its result table in formatted style. There’s not much to this script; because we’ve automated most of its tasks already, this is largely just a combination of existing tools. Such is the power of code reuse in Python.
""" run a query string, display formatted result table example: querydb.py dbase1 "select name, job from people where pay > 50000" """ import sys database, querystr = 'dbase1', 'select * from people' if len(sys.argv) > 1: database = sys.argv[1] if len(sys.argv) > 2: querystr = sys.argv[2] from makedicts import makedicts from dumpdb import showformat from loaddb import login conn, curs = login(database) rows = makedicts(curs, querystr) showformat(rows)
Last but not least, here is a log of a session that makes use of these scripts in command-line mode, to illustrate their operation. Most of the files also have functions that can be imported and called from a different program; the scripts simply map command-line arguments to the functions’ arguments when run standalone. The first thing we do is initialize a testing database and load its table from a text file:
...\PP4E\Dbase\Sql> makedb.py testdb
Are you sure?y
database table did not exist
made testdb people

...\PP4E\Dbase\Sql> loaddb.py testdb data2.txt
3 rows loaded
Next, let’s check our work with the dump utility (use a - argument to force a formatted display):
...\PP4E\Dbase\Sql> dumpdb.py testdb
('bob', 'developer', 80000)
('sue', 'music', 90000)
('ann', 'manager', 80000)

...\PP4E\Dbase\Sql> dumpdb.py testdb -
3 records
----------------------------------------
pay  => 80000
job  => developer
name => bob
----------------------------------------
pay  => 90000
job  => music
name => sue
----------------------------------------
pay  => 80000
job  => manager
name => ann
----------------------------------------
The dump script is an exhaustive display; to be more specific about which records to view, use the query script and pass in a query string on the command line (the command lines are split here to fit in this book):
...\PP4E\Dbase\Sql> querydb.py testdb
                      "select name, job from people where pay = 80000"
2 records
----------------------------------------
job  => developer
name => bob
----------------------------------------
job  => manager
name => ann
----------------------------------------

...\PP4E\Dbase\Sql> querydb.py testdb
                      "select * from people where name = 'sue'"
1 records
----------------------------------------
pay  => 90000
job  => music
name => sue
----------------------------------------
Now, let’s erase and start again with a new data set file. The clear script erases all records but doesn’t reinitialize the database completely:
...\PP4E\Dbase\Sql> cleardb.py testdb
Are you sure?y

...\PP4E\Dbase\Sql> dumpdb.py testdb -
0 records
----------------------------------------

...\PP4E\Dbase\Sql> loaddb.py testdb data.txt
5 rows loaded

...\PP4E\Dbase\Sql> dumpdb.py testdb
('bob', 'devel', 50000)
('sue', 'music', 60000)
('ann', 'devel', 40000)
('tim', 'admin', 30000)
('kim', 'devel', 60000)
In closing, here are three queries in action on this new data set: they fetch names of developers, jobs that pay above an amount, and records with a given pay level sorted by job. We could run these at the Python interactive prompt, of course, but we’re getting a lot of setup and boilerplate code for free here:
...\PP4E\Dbase\Sql> querydb.py testdb
                      "select name from people where job = 'devel'"
3 records
----------------------------------------
name => bob
----------------------------------------
name => ann
----------------------------------------
name => kim
----------------------------------------

...\PP4E\Dbase\Sql> querydb.py testdb
                      "select job from people where pay >= 60000"
2 records
----------------------------------------
job => music
----------------------------------------
job => devel
----------------------------------------

...\PP4E\Dbase\Sql> querydb.py testdb
                      "select * from people where pay >= 60000 order by job"
2 records
----------------------------------------
pay  => 60000
job  => devel
name => kim
----------------------------------------
pay  => 60000
job  => music
name => sue
----------------------------------------
Before we move on, some context: the scripts in this section illustrate the benefits of code reuse, accomplish their purpose (which was partly demonstrating the SQL API), and serve as a model for canned database utilities. But they are still not as general as they could be; support for sorting options in the dump script, for example, may be a useful extension. Although we could generalize to support more options, at some point we may need to revert to typing SQL commands in a client—part of the reason SQL is a language is because it must support so much generality. Further extensions to these scripts are left as exercises. Change this code as you like; it’s Python, after all.
Although the examples we’ve seen in this section are simple, their techniques scale up to much more realistic databases and contexts. The websites we studied in the prior part of the book, for instance, can make use of SQL-based systems such as MySQL to store page state information as well as long-lived client information. Because MySQL (among others) supports both large databases and concurrent updates, it’s a natural for website implementation.
There is more to database interfaces than we’ve seen, but additional API documentation is readily available on the Web. To find the full database API specification, search the Web for “Python Database API.” You’ll find the formal API definition—really just a text file describing the PEP (the Python Enhancement Proposal) under which the API was hashed out.
Perhaps the best resource for additional information about database extensions today is the home page of the Python database SIG. Go to http://www.python.org, click on the Community and SIGs links there, and navigate to the database group’s page, or run a search. There, you’ll find API documentation (this is where it is officially maintained), links to database vendor–specific extension modules, and more. And as always, see the PyPI website and search the Web at large for related third-party tools and extensions.
In this chapter, we’ve seen OODBs that store native Python objects persistently, as well as SQL databases that store information in tables. It turns out that there is another class of system that attempts to bridge the object and table worlds, which I’ve hinted at earlier in this chapter: ORMs graft the Python class model onto the tables of relational databases. They combine the power of relational database systems with the simplicity of Python class-based syntax—you don’t need to forgo SQL-based databases, but you can still store data that seems like Python objects to your scripts.
Today, there are two leading open source third-party systems that implement this mapping: SQLObject and SQLAlchemy. Both are fairly complex systems that we cannot do full justice to in this text, and you’re best off researching their documentation on the Web for the full story (there are also dedicated books covering SQLAlchemy today). Moreover, neither is completely Python 3.X ready as I write these words, so we can’t run live examples with them in this text.
To give you a slightly more concrete flavor of the ORM model, though, here is a very quick look at how you might use it to create and process database records in the SQLObject system. In brief, SQLObject maps:
Python classes to database tables
Python class instances to rows in the table
Python instance attributes to row columns
For example, to create a table, we define it with a class, with class attributes that define columns, and call its creation method (this code is derived from a more complete example at SQLObject’s website):
from sqlobject import *

sqlhub.processConnection = connectionForURI('sqlite:/:memory:')

class Person(SQLObject):                     # class: describes table
    first = StringCol()                      # class attributes: row columns
    mid   = StringCol(length=1, default=None)
    last  = StringCol()

Person.createTable()                         # create a database table
Once created, making an instance automatically inserts a row into the database, and attribute fetches and assignments are automatically mapped to fetches and updates of the corresponding table row’s column:
p = Person(first='Bob', last='Smith')        # new instance: makes and inserts row
p                                            # prints all attributes by name
p.first                                      # attribute: fetches row column
p.mid = 'M'                                  # attribute: updates record
Existing rows/instances may be fetched by methods calls, and we can assign multiple columns/attributes with a single update operation:
p2 = Person.get(1)                           # fetch existing record/instance: p2 is p
p.set(first='Tom', last='Jones')             # update two attributes/fields at once
In addition, we can select by column values by creating a query object and executing it:
ts = Person.select(Person.q.first == 'Tom')     # query: select by column value
list(ts)                                        # run the query: list of instances
tjs = Person.selectBy(first='Tom', last='Jones')  # alternative query form (AND)
Naturally, this barely scratches the surface of the available functionality. Even at this level of complexity, though, this is quite a trick—SQLObject automatically issues all the SQL required to fetch, store, and query the table and rows implied by the Python class syntax here. Again, the net effect allows systems to leverage the power of enterprise-level relational databases, but still use familiar Python class syntax to process stored data in Python scripts.
The code used with the SQLAlchemy ORM is of course very different, but the end result is functionally similar. For more details on ORMs for Python, consult your friendly neighborhood web search engine. You can also learn more about such systems from their roles in some larger web development frameworks; Django, for instance, includes an ORM that is another variation on this theme.
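Although we can’t run SQLAlchemy live here either, a rough sketch may help you compare the two systems. The following is my own approximation, not code from this book’s examples package, of the same Person table rendered in SQLAlchemy’s declarative style; it assumes a reasonably recent SQLAlchemy release (in older releases, declarative_base lives in sqlalchemy.ext.declarative rather than sqlalchemy.orm):

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Person(Base):                             # class: describes table
    __tablename__ = 'person'
    id    = Column(Integer, primary_key=True)   # ORM tables need a primary key
    first = Column(String)                      # class attributes: row columns
    mid   = Column(String(1), default=None)
    last  = Column(String)

engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)                # create the database table

Session = sessionmaker(bind=engine)
session = Session()
session.add(Person(first='Bob', last='Smith'))  # new instance: a pending row
session.commit()                                # issue the INSERT

for person in session.query(Person).filter_by(first='Bob'):
    print(person.first, person.last)            # query: select by column value

As in SQLObject, the class stands for the table and instances stand for rows, but SQLAlchemy makes its unit-of-work session explicit: changes accumulate in the session and are written to the database on commit.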
Instead of going into additional database interface details that are freely available on the Web, I’m going to close out this chapter by directing you to a supplemental example that shows one way to combine the GUI technology we met earlier in the text with the persistence techniques introduced in this chapter. This example is named PyForm—a Python/tkinter GUI designed to let you browse and edit tables of records:
Tables browsed may be shelves, DBM files, in-memory dictionaries, or any other object that looks and feels like a dictionary.
Records within tables browsed can be class instances, simple dictionaries, strings, or any other object that can be translated to and from a dictionary.
Although this example is about GUIs and persistence, it also illustrates Python design techniques. To keep its implementation both simple and type-independent, the PyForm GUI is coded to expect tables to look like dictionaries of dictionaries. To support a variety of table and record types, PyForm relies on separate wrapper classes to translate tables and records to the expected protocol:
At the top table level, the translation is easy—shelves, DBM files, and in-memory dictionaries all have the same key-based interface.
At the nested record level, the GUI is coded to assume that stored items have a dictionary-like interface, too, but classes intercept dictionary operations to make records compatible with the PyForm protocol. Records stored as strings are converted to and from dictionary objects on fetches and stores; records stored as class instances are translated to and from attribute dictionaries. More specialized translations can be added in new table wrapper classes; a minimal sketch of the record-level idea follows this list.
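As that sketch of the record-level idea (my own illustration, not PyForm’s actual source; the class name InstanceRecord is hypothetical), a wrapper can make a stored class instance answer to the dictionary protocol the GUI expects:

class InstanceRecord:
    def __init__(self, obj):
        self.obj = obj                          # the wrapped class instance
    def __getitem__(self, key):                 # rec['name'] fetches obj.name
        return getattr(self.obj, key)
    def __setitem__(self, key, value):          # rec['name'] = v sets obj.name
        setattr(self.obj, key, value)
    def keys(self):                             # field names for the GUI's form
        return list(self.obj.__dict__)

Because the GUI only ever indexes records and asks for their keys, anything wrapped this way can be displayed and edited without the GUI knowing whether it began life as a dictionary, a string, or a class instance.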
The net effect is that PyForm can be used to browse and edit a wide variety of table types, despite its dictionary interface expectations. When PyForm browses shelves and DBM files, table changes made within the GUI are persistent—they are saved in the underlying files. When used to browse a shelve of class instances, PyForm essentially becomes a GUI frontend to a simple object database that is built using standard Python persistence tools. To view and update a shelve of objects with PyForm, for example, code like the following will suffice:
import shelve
from formgui import FormGui                     # after initcast
db = shelve.open('../data/castfile')            # reopen shelve file
FormGui(db).mainloop()                          # browse existing shelve-of-dicts
To view or update a shelve of instances of an imported Actor class, we can use code like this:
from PP4E.Dbase.testdata import Actor
from formgui import FormGui                     # run in TableBrowser dir
from formtable import ShelveOfInstance

testfile = '../data/shelve'                     # external filename
table = ShelveOfInstance(testfile, Actor)       # wrap shelf in Table object
FormGui(table).mainloop()
table.close()                                   # close needed for some dbm
Figure 17-1 captures the scene under Python 3.1 and Windows 7 when viewing a shelve of persistent class instance objects. This PyForm session was kicked off by a command line described in its form table module’s self-test code: formtable.py shelve 1. Omit the 1 (or pass it as 0) to avoid reinitializing the shelve at the start of each session, so that changes are retained.
PyForm’s GUI can also be started from the PyDemos launcher we met in Chapter 10, though it does not save changes persistently in this mode. Run the example on your own computer to get a better feel for its operation. Though not a fully general Python persistent object table viewer, PyForm serves as a simple object database front end.
Because we are short on time and space in this edition, I’m going to omit both the source code for this example and its description here. To study PyForm, see the following directory in the book’s examples package distribution described in the Preface:
C:\...\PP4E\Dbase\TableBrowser
See especially the Documentation subdirectory there, which contains the original PyForm overview material from the third edition in a PDF file. PyForm’s source code files are ported to Python 3.X form, though code in the overview document still shows its 2.X third edition roots. For the purposes of the published portions of this book, let’s move on to the next chapter and our next tools topic: data structure implementations.
[69] Interestingly, Python avoids calling the class to re-create a pickled instance and instead simply makes a class object generically, inserts instance attributes, and sets the instance’s __class__ pointer to the original class directly. This avoids the need for defaults, but it also means that class __init__ constructors are no longer called as objects are unpickled, unless you provide extra methods to force the call. See the library manual for more details, and see the pickle module’s source code (pickle.py in the source library) if you’re curious about how this works. Or see the PyForm example later in this chapter—it does something very similar with __class__ links to build an instance object from a class and dictionary of attributes, without calling the class’s __init__ constructor. This makes constructor argument defaults unnecessary in classes used for records browsed by PyForm, but it’s the same idea.
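As a rough sketch of the trick this note describes (my own illustration, not pickle’s actual implementation, which supports several pickling protocols), an instance can be rebuilt from a class and a saved attribute dictionary without ever running __init__:

class Actor:
    def __init__(self, name, job):              # normally required arguments
        self.name, self.job = name, job

attrs = {'name': 'Bob', 'job': 'manager'}       # attribute dictionary saved earlier
obj = object.__new__(Actor)                     # bare instance: class attached, no __init__
obj.__dict__.update(attrs)                      # insert instance attributes
print(obj.name, obj.job)                        # usable, though __init__ never ran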