So far in this book, we’ve used Python in the system programming, GUI development, and Internet scripting domains—three of Python’s most common applications, and representative of its use as an application programming language at large. In the next four chapters, we’re going to take a quick look at other major Python programming topics: persistent data, data structure techniques, text and language processing, and Python/C integration.
These four topics are not really application areas themselves, but they are techniques that span domains. The database topics in this chapter, for instance, can be applied on the Web, in desktop GUI applications, and so on. Text processing is a similarly general tool. Moreover, none of these final four topics is covered exhaustively (each could easily fill a book alone), but we’ll sample Python in action in these domains and highlight their core concepts and tools. If any of these chapters spark your interest, additional resources are readily available in the Python world.
In this chapter, our focus is on persistent data—the kind that outlives a program that creates it. That’s not true by default for objects a script constructs, of course; things like lists, dictionaries, and even class instance objects live in your computer’s memory and are lost as soon as the script ends. To make data live longer, we need to do something special. In Python programming, there are today at least six traditional ways to save information in between program executions:
Flat files: text and bytes stored directly on your computer
DBM keyed files: keyed access to strings stored in dictionary-like files
Pickled objects: serialized Python objects saved to files and streams
Shelve files: pickled Python objects stored in DBM keyed files
Object-oriented databases: persistent Python objects stored in persistent dictionaries (ZODB, Durus)
SQL relational databases: table-based storage that supports SQL queries (SQLite, MySQL, PostgreSQL, etc.)
Object relational mappers: mediators that map Python classes to relational tables (SQLObject, SQLAlchemy)
In some sense, Python’s interfaces to network-based object transmission protocols such as SOAP, XML-RPC, and CORBA also offer persistence options, but they are beyond the scope of this chapter. Here, our interest is in techniques that allow a program to store its data directly and, usually, on the local machine. Although some database servers may operate on a physically remote machine on a network, this is largely transparent to most of the techniques we’ll study here.
We studied Python’s simple (or “flat”) file interfaces in earnest in Chapter 4, and we have been using them ever since. Python provides standard access to both the stdio filesystem (through the built-in open function), as well as lower-level descriptor-based files (with the built-in os module). For simple data storage tasks, these are all that many scripts need. To save for use in a future program run, simply write data out to a newly opened file on your computer in text or binary mode, and read it back from that file later. As we’ve seen, for more advanced tasks, Python also supports other file-like interfaces such as pipes, fifos, and sockets.
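For instance, a minimal flat-file round trip of this sort might look like the following sketch; the file name here is arbitrary:

data = 'spam,eggs,42'
f = open('datafile.txt', 'w')        # save text between program runs
f.write(data)
f.close()

f = open('datafile.txt')             # reload in a later run or session
fields = f.read().split(',')         # parse back into parts manually
f.close()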
Since we’ve already explored flat files, I won’t say more about them here. The rest of this chapter introduces the remaining topics on the preceding list. At the end, we’ll also meet a GUI program for browsing the contents of things such as shelves and DBM files. Before that, though, we need to learn what manner of beast these are.
Fourth edition coverage note: The prior edition of this book used the mysql-python interface to the MySQL relational database system, as well as the ZODB object database system. As I update this chapter in June 2010, neither of these is yet available for Python 3.X, the version of Python used in this edition. Because of that, most ZODB information has been trimmed, and the SQL database examples here were changed to use the SQLite in-process database system that ships with Python 3.X as part of its standard library. The prior edition’s ZODB and MySQL examples and overviews are still available in the examples package, as described later. Because Python’s SQL database API is portable, though, the SQLite code here should work largely unchanged on most other systems.
Flat files are handy for simple persistence tasks, but they are generally geared toward a sequential processing mode. Although it is possible to jump around to arbitrary locations with seek calls, flat files don’t provide much structure to data beyond the notion of bytes and text lines.
DBM files, a standard tool in the Python library for database management, improve on that by providing key-based access to stored text strings. They implement a random-access, single-key view on stored data. For instance, information related to objects can be stored in a DBM file using a unique key per object and later can be fetched back directly with the same key. DBM files are implemented by a variety of underlying modules (including one coded in Python), but if you have Python, you have a DBM.
Although DBM filesystems have to do a bit of work to map chunks of stored data to keys for fast retrieval (technically, they generally use a technique called hashing to store data in files), your scripts don’t need to care about the action going on behind the scenes. In fact, DBM is one of the easiest ways to save information in Python—DBM files behave so much like in-memory dictionaries that you may forget you’re actually dealing with a file at all. For instance, given a DBM file object:
Indexing by key fetches data from the file.
Assigning to an index stores data in the file.
DBM file objects also support common dictionary methods such as keys-list fetches and tests and key deletions. The DBM library itself is hidden behind this simple model. Since it is so simple, let’s jump right into an interactive example that creates a DBM file and shows how the interface works:
C:\...\PP4E\Dbase> python
>>> import dbm                                  # get interface: bsddb, gnu, ndbm, dumb
>>> file = dbm.open('movie', 'c')               # make a DBM file called 'movie'
>>> file['Batman'] = 'Pow!'                     # store a string under key 'Batman'
>>> file.keys()                                 # get the file's key directory
[b'Batman']
>>> file['Batman']                              # fetch value for key 'Batman'
b'Pow!'
>>> who  = ['Robin', 'Cat-woman', 'Joker']
>>> what = ['Bang!', 'Splat!', 'Wham!']
>>> for i in range(len(who)):                   # add 3 more "records"
...     file[who[i]] = what[i]
...
>>> file.keys()
[b'Cat-woman', b'Batman', b'Joker', b'Robin']
>>> len(file), 'Robin' in file, file['Joker']
(4, True, b'Wham!')
>>> file.close()                                # close sometimes required
Internally, importing the dbm standard library module automatically loads whatever DBM interface is available in your Python interpreter (attempting alternatives in a fixed order), and opening the new DBM file creates one or more external files with names that start with the string 'movie' (more on the details in a moment). But after the import and open, a DBM file is virtually indistinguishable from a dictionary. In effect, the object called file here can be thought of as a dictionary mapped to an external file called movie; the only obvious differences are that keys must be strings (not arbitrary immutables), and we need to remember to open to access and close after changes.

Unlike normal dictionaries, though, the contents of file are retained between Python program runs. If we come back later and restart Python, our dictionary is still available. Again, DBM files are like dictionaries that must be opened:
C:\...\PP4E\Dbase> python
>>> import dbm
>>> file = dbm.open('movie', 'c')               # open existing DBM file
>>> file['Batman']
b'Pow!'
>>> file.keys()                                 # keys gives an index list
[b'Cat-woman', b'Batman', b'Joker', b'Robin']
>>> for key in file.keys(): print(key, file[key])
...
b'Cat-woman' b'Splat!'
b'Batman' b'Pow!'
b'Joker' b'Wham!'
b'Robin' b'Bang!'
Notice how DBM files return a real list for the keys call; not shown here, their values method instead returns an iterable view like dictionaries. Further, DBM files always store both keys and values as bytes objects; interpretation as arbitrary types of Unicode text is left to the client application. We can use either bytes or str strings in our code when accessing or storing keys and values—using bytes allows your keys and values to retain arbitrary Unicode encodings, but str objects in our code will be encoded to bytes internally using the UTF-8 Unicode encoding by Python’s DBM implementation.

Still, we can always decode to Unicode str strings to display in a more friendly fashion if desired, and DBM files have a keys iterator just like dictionaries. Moreover, assigning and deleting keys updates the DBM file, and we should close after making changes (this ensures that changes are flushed to disk):
>>> for key in file: print(key.decode(), file[key].decode())
...
Cat-woman Splat!
Batman Pow!
Joker Wham!
Robin Bang!
>>> file['Batman'] = 'Ka-Boom!'                 # change Batman slot
>>> del file['Robin']                           # delete the Robin entry
>>> file.close()                                # close it after changes
Apart from having to import the interface and open and close the DBM file, Python programs don’t have to know anything about DBM itself. DBM modules achieve this integration by overloading the indexing operations and routing them to more primitive library tools. But you’d never know that from looking at this Python code—DBM files look like normal Python dictionaries, stored on external files. Changes made to them are retained indefinitely:
C:\...\PP4E\Dbase> python
>>> import dbm                                  # open DBM file again
>>> file = dbm.open('movie', 'c')
>>> for key in file: print(key.decode(), file[key].decode())
...
Cat-woman Splat!
Batman Ka-Boom!
Joker Wham!
As you can see, this is about as simple as it can be. Table 17-1 lists the most commonly used DBM file operations. Once such a file is opened, it is processed just as though it were an in-memory Python dictionary. Items are fetched by indexing the file object by key and are stored by assigning to a key.
| Python code | Action | Description |
| import dbm | Import | Get DBM implementation |
| file = dbm.open('filename', 'c') | Open | Create or open an existing DBM file for I/O |
| file['key'] = 'value' | Store | Create or change the entry for 'key' |
| value = file['key'] | Fetch | Load the value for the entry 'key' |
| count = len(file) | Size | Return the number of entries stored |
| index = file.keys() | Index | Fetch the stored keys list (not a view) |
| found = 'key' in file | Query | See if there's an entry for 'key' |
| del file['key'] | Delete | Remove the entry for 'key' |
| for key in file: | Iterate | Iterate over stored keys |
| file.close() | Close | Manual close, not always needed |
Despite the dictionary-like interface, DBM files really do map to one or more external files. For instance, the underlying default dbm interface used by Python 3.1 on Windows writes two files—movie.dir and movie.dat—when a DBM file called movie is made, and saves a movie.bak on later opens. If your Python has access to a different underlying keyed-file interface, different external files might show up on your computer.
Technically, the module dbm is really an interface to whatever DBM-like filesystem you have available in your Python:

When creating a new file, dbm today tries a set of keyed-file interface modules in a fixed order. According to its documentation, it attempts to import the interfaces dbm.bsd, dbm.gnu, dbm.ndbm, or dbm.dumb, and uses the first that succeeds. Pythons without any of these automatically fall back on an all-Python and always-present implementation called dbm.dumb, which is not really “dumb,” of course, but may not be as fast or robust as other options.

When opening an already existing DBM file, dbm instead tries to determine the system that created it with the dbm.whichdb function. This determination is based upon the content of the database itself.
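For example, dbm.whichdb can report which system created an existing file. A quick sketch, assuming the 'movie' file from the earlier session exists in the current directory:

import dbm

print(dbm.whichdb('movie'))     # e.g. 'dbm.dumb' or 'dbm.gnu';
                                # '' if format unknown, None if unreadable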
Future Pythons are free to change this selection order, and may even add additional alternatives to it. You normally don’t need to care about any of this, though, unless you delete any of the files your DBM creates, or transfer them between machines with different configurations—if you need to care about the portability of your DBM files (and as we’ll see later, by proxy, that of your shelve files), you should configure machines such that all have the same DBM interface installed or rely upon the dumb fallback. For example, the Berkeley DB package (a.k.a. bsddb) used by dbm.bsd is widely available and portable.
Note that DBM files may or may not need to be explicitly closed, per the last entry in Table 17-1. Some DBM files don’t require a close call, but some depend on it to flush changes out to disk. On such systems, your file may be corrupted if you omit the close call. Unfortunately, the default DBM in some older Windows Pythons, dbhash (a.k.a. bsddb), is one of the DBM systems that requires a close call to avoid data loss. As a rule of thumb, always close your DBM files explicitly after making changes and before your program exits to avoid potential problems; it’s essentially a “commit” operation for these files. This rule extends by proxy to shelves, a topic we’ll meet later in this chapter.
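One simple way to make this rule hard to forget is to wrap changes in a try/finally statement, so that the close (and hence the flush) runs even if an update raises an exception. A sketch, reusing the earlier 'movie' file:

import dbm

file = dbm.open('movie', 'c')
try:
    file['Batgirl'] = 'Kapow!'      # make changes
finally:
    file.close()                    # always flush changes to disk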
Recent changes: Be sure to also pass a string 'c' as a second argument when calling dbm.open, to force Python to create the file if it does not yet exist and to simply open it for reads and writes otherwise. This used to be the default behavior but is no longer. You do not need the 'c' argument when opening shelves discussed ahead—they still use an “open or create” 'c' mode by default if passed no open mode argument. Other open mode strings can be passed to dbm, including 'n' to always create the file, and 'r' for read-only of an existing file—the new default. See the Python library manual for more details.
In addition, Python 3.X stores both key and value strings as bytes instead of str as we’ve seen (which turns out to be convenient for pickled data in shelves, discussed ahead) and no longer ships with bsddb as a standard component—it’s available independently on the Web as a third-party extension, but in its absence Python falls back on its own DBM file implementation. Since the underlying DBM implementation rules are prone to change with time, you should always consult Python’s library manuals as well as the dbm module’s standard library source code for more information.
Probably the biggest limitation of DBM keyed files is in what they can store: data stored under a key must be a simple string. If you want to store Python objects in a DBM file, you can sometimes manually convert them to and from strings on writes and reads (e.g., with str and eval calls), but this takes you only so far. For arbitrarily complex Python objects such as class instances and nested data structures, you need something more. Class instance objects, for example, cannot usually be later re-created from their standard string representations. Moreover, custom to-string conversions and from-string parsers are error prone and not general.
The Python pickle module, a standard part of the Python system, provides the conversion step needed. It’s a sort of super-general data formatting and de-formatting tool—pickle converts nearly arbitrary Python in-memory objects to and from a single linear string format, suitable for storing in flat files, shipping across network sockets between trusted sources, and so on. This conversion from object to string is often called serialization—arbitrary data structures in memory are mapped to a serial string form.
The string representation used for objects is also sometimes referred to as a byte stream, due to its linear format. It retains all the content and references structure of the original in-memory object. When the object is later re-created from its byte string, it will be a new in-memory object identical in structure and value to the original, though located at a different memory address.
The net effect is that the re-created object is effectively a copy of the original; in Python-speak, the two will be == but not is. Since the re-creation typically happens in an entirely new process, this difference is often irrelevant (though as we saw in Chapter 5, this generally precludes using pickled objects directly as cross-process shared state).
Pickling works on almost any Python datatype—numbers, lists, dictionaries, class instances, nested structures, and more—and so is a general way to store data. Because pickles contain native Python objects, there is almost no database API to be found; the objects stored with pickling are processed with normal Python syntax when they are later retrieved.
Pickling may sound complicated the first time you encounter it, but the good news is that Python hides all the complexity of object-to-string conversion. In fact, the pickle module’s interfaces are incredibly simple to use. For example, to pickle an object into a serialized string, we can either make a pickler and call its methods or use convenience functions in the module to achieve the same effect:
P = pickle.Pickler(file)
Make a new pickler for pickling to an open output file object file.

P.dump(object)
Write an object onto the pickler’s file/stream.

pickle.dump(object, file)
Same as the last two calls combined: pickle an object onto an open file.

string = pickle.dumps(object)
Return the pickled representation of object as a character string.
Unpickling from a serialized string back to the original object is similar—both object and convenience function interfaces are available:
U = pickle.Unpickler(file)
Make an unpickler for unpickling from an open input file object file.

object = U.load()
Read an object from the unpickler’s file/stream.

object = pickle.load(file)
Same as the last two calls combined: unpickle an object from an open file.

object = pickle.loads(string)
Read an object from a character string rather than a file.
Pickler and Unpickler are exported classes. In all of the preceding cases, file is either an open file object or any object that implements the same attributes as file objects:

Pickler calls the file’s write method with a string to be written.

Unpickler calls the file’s read method with a byte count, and readline without arguments.
Any object that provides these attributes can be passed in to the file parameters. In particular, file can be an instance of a Python class that provides the read/write methods (i.e., the expected file-like interface). This lets you map pickled streams to in-memory objects with classes, for arbitrary use. For instance, the io.BytesIO class in the standard library discussed in Chapter 3 provides an interface that maps file calls to and from in-memory byte strings and is an alternative to the pickler’s dumps/loads string calls.
This hook also lets you ship Python objects across a network, by providing sockets wrapped to look like files in pickle calls at the sender, and unpickle calls at the receiver (see Making Sockets Look Like Files and Streams for more details). In fact, for some, pickling Python objects across a trusted network serves as a simpler alternative to network transport protocols such as SOAP and XML-RPC, provided that Python is on both ends of the communication (pickled objects are represented with a Python-specific format, not with XML text).
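To make this concrete, here is a minimal sketch of the sender’s side of such an exchange; the host and port are hypothetical, and a receiver would call pickle.load on a matching 'rb' wrapper:

import pickle, socket

sock = socket.socket()
sock.connect(('localhost', 50007))        # hypothetical server address
wfile = sock.makefile('wb')               # socket wrapped to look like a file
pickle.dump(['spam', ('ham', 4)], wfile)  # receiver unpickles with pickle.load
wfile.close()
sock.close()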
Recent changes: In Python 3.X, pickled objects are always represented as bytes, not str, regardless of the protocol level you request (even the oldest ASCII protocol yields bytes). Because of this, files used to store pickled Python objects should always be opened in binary mode. Moreover, in 3.X an optimized _pickle implementation module is also selected and used automatically if present. More on both topics later.
Although pickled objects can be shipped in exotic ways, in more typical use, to pickle an object to a flat file, we just open the file in write mode and call the dump function:
C:\...\PP4E\Dbase> python
>>> table = {'a': [1, 2, 3],
...          'b': ['spam', 'eggs'],
...          'c': {'name': 'bob'}}
>>>
>>> import pickle
>>> mydb = open('dbase', 'wb')
>>> pickle.dump(table, mydb)
Notice the nesting in the object pickled here—the pickler handles arbitrary structures. Also note that we’re using binary mode files here; in Python 3.X, we really must, because the pickled representation of an object is always a bytes object in all cases. To unpickle later in another session or program run, simply reopen the file and call load:
C:\...\PP4E\Dbase> python
>>> import pickle
>>> mydb = open('dbase', 'rb')
>>> table = pickle.load(mydb)
>>> table
{'a': [1, 2, 3], 'c': {'name': 'bob'}, 'b': ['spam', 'eggs']}
The object you get back from unpickling has the same value and reference structure as the original, but it is located at a different address in memory. This is true whether the object is unpickled in the same or a future process. Again, the unpickled object is == but is not is:
C:\...\PP4E\Dbase> python
>>> import pickle
>>> f = open('temp', 'wb')
>>> x = ['Hello', ('pickle', 'world')]          # list with nested tuple
>>> pickle.dump(x, f)
>>> f.close()                                   # close to flush changes
>>>
>>> f = open('temp', 'rb')
>>> y = pickle.load(f)
>>> y
['Hello', ('pickle', 'world')]
>>>
>>> x == y, x is y                              # same value, diff objects
(True, False)
To make this process simpler still, the module in Example 17-1 wraps pickling and unpickling calls in functions that also open the files where the serialized form of the object is stored.
"Pickle to/from flat file utilities" import pickle def saveDbase(filename, object): "save object to file" file = open(filename, 'wb') pickle.dump(object, file) # pickle to binary file file.close() # any file-like object will do def loadDbase(filename): "load object from file" file = open(filename, 'rb') object = pickle.load(file) # unpickle from binary file file.close() # re-creates object in memory return object
To store and fetch now, simply call these module functions; here they are in action managing a fairly complex structure with multiple references to the same nested object—the nested list called L at first is stored only once in the file:
C:\...\PP4E\Dbase> python
>>> from filepickle import *
>>> L = [0]
>>> D = {'x': 0, 'y': L}
>>> table = {'A': L, 'B': D}                    # L appears twice
>>> saveDbase('myfile', table)                  # serialize to file

C:\...\PP4E\Dbase> python
>>> from filepickle import *
>>> table = loadDbase('myfile')                 # reload/unpickle
>>> table
{'A': [0], 'B': {'y': [0], 'x': 0}}
>>> table['A'][0] = 1                           # change shared object
>>> saveDbase('myfile', table)                  # rewrite to the file

C:\...\PP4E\Dbase> python
>>> from filepickle import *
>>> print(loadDbase('myfile'))                  # both L's updated as expected
{'A': [1], 'B': {'y': [1], 'x': 0}}
Besides built-in types like the lists, tuples, and dictionaries of the examples so far, class instances may also be pickled to file-like objects. This provides a natural way to associate behavior with stored data (class methods process instance attributes) and provides a simple migration path (class changes made in module files are automatically picked up by stored instances). Here’s a brief interactive demonstration:
>>> class Rec:
...     def __init__(self, hours):
...         self.hours = hours
...     def pay(self, rate=50):
...         return self.hours * rate
...
>>> bob = Rec(40)
>>> import pickle
>>> pickle.dump(bob, open('bobrec', 'wb'))
>>>
>>> rec = pickle.load(open('bobrec', 'rb'))
>>> rec.hours
40
>>> rec.pay()
2000
We’ll explore how this works in more detail in conjunction with shelves later in this chapter—as we’ll see, although the pickle module can be used directly this way, it is also the underlying translation engine in both shelves and ZODB databases.
In general, Python can pickle just about anything, except for:
Compiled code objects: functions and classes record just their names and those of their modules in pickles, to allow for later reimport and automatic acquisition of changes made in module files.
Instances of classes that do not follow class importability rules: in short, the class must be importable on object loads (more on this at the end of the section Shelve Files).
Instances of some built-in and user-defined types that are coded in C or depend upon transient operating system states (e.g., open file objects cannot be pickled).
A PicklingError is raised if an object cannot be pickled. Again, we’ll revisit the pickler’s constraints on pickleable objects and classes when we study shelves.
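For instance, attempting to pickle an open file fails. A quick sketch (the exact exception type raised can vary with the object being pickled):

import pickle

try:
    pickle.dumps(open('temp', 'w'))              # open files can't be pickled
except (pickle.PicklingError, TypeError) as exc:
    print('not pickleable:', exc)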
In later Python releases, the pickler introduced the notion of protocols—storage formats for pickled data. Specify the desired protocol by passing an extra parameter to the pickling calls (but not to unpickling calls: the protocol is automatically determined from the pickled data):
pickle.dump(object, file, protocol)     # or protocol=N keyword argument
Pickled data may be created in either text or binary protocols; the binary protocols’ format is more efficient, but it cannot be readily understood if inspected. By default, the storage protocol in Python 3.X is a 3.X-only binary bytes format (also known as protocol 3). In text mode (protocol 0), the pickled data is printable ASCII text, which can be read by humans (it’s essentially instructions for a stack machine), but it is still a bytes object in Python 3.X. The alternative protocols (protocols 1 and 2) create the pickled data in binary format as well.
For all protocols, pickled data is a bytes object in 3.X, not a str, and therefore implies binary-mode reads and writes when stored in flat files (see Chapter 4 if you’ve forgotten why). Similarly, we must use a bytes-oriented object when forging the file object’s interface:
>>> import io, pickle
>>> pickle.dumps([1, 2, 3])                     # default=binary protocol
b'\x80\x03]q\x00(K\x01K\x02K\x03e.'
>>> pickle.dumps([1, 2, 3], protocol=0)         # ASCII format protocol
b'(lp0\nL1L\naL2L\naL3L\na.'
>>> pickle.dump([1, 2, 3], open('temp', 'wb'))  # same if protocol=0, ASCII
>>> pickle.dump([1, 2, 3], open('temp', 'w'))   # must use 'rb' to read too
TypeError: must be str, not bytes
>>> pickle.dump([1, 2, 3], open('temp', 'w'), protocol=0)
TypeError: must be str, not bytes
>>> B = io.BytesIO()                            # use bytes streams/buffers
>>> pickle.dump([1, 2, 3], B)
>>> B.getvalue()
b'\x80\x03]q\x00(K\x01K\x02K\x03e.'
>>> B = io.BytesIO()                            # also bytes for ASCII
>>> pickle.dump([1, 2, 3], B, protocol=0)
>>> B.getvalue()
b'(lp0\nL1L\naL2L\naL3L\na.'
>>> S = io.StringIO()                           # it's not a str anymore
>>> pickle.dump([1, 2, 3], S)                   # same if protocol=0, ASCII
TypeError: string argument expected, got 'bytes'
>>> pickle.dump([1, 2, 3], S, protocol=0)
TypeError: string argument expected, got 'bytes'
Refer to Python’s library manual for more information on the pickler; it supports additional interfaces that classes may use to customize its behavior, which we’ll bypass here in the interest of space. Also check out marshal, a module that serializes an object too, but can handle only simple object types. pickle is more general than marshal and is normally preferred.
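To contrast the two, here is a brief sketch: marshal handles simple built-in objects such as this list, but unlike pickle it rejects class instances:

import marshal

data = marshal.dumps([1, 2, 3])     # works: simple built-in types only
print(marshal.loads(data))          # [1, 2, 3]

class C: pass
try:
    marshal.dumps(C())              # fails, unlike pickle
except ValueError as exc:
    print('marshal cannot:', exc)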
An additional related module, _pickle, is a C-coded optimization of pickle, and is automatically used by pickle internally if available; it need not be selected or used directly. The shelve module inherits this optimization automatically by proxy. I haven’t explained shelve yet, but I will now.
Pickling allows you to store arbitrary objects on files and file-like objects, but it’s still a fairly unstructured medium; it doesn’t directly support easy access to members of collections of pickled objects. Higher-level structures can be added to pickling, but they are not inherent:
You can sometimes craft your own higher-level pickle file organizations with the underlying filesystem (e.g., you can store each pickled object in a file whose name uniquely identifies the object), but such an organization is not part of pickling itself and must be manually managed; see the sketch after this list.
You can also store arbitrarily large dictionaries in a pickled file and index them by key after they are loaded back into memory, but this will load and store the entire dictionary all at once when unpickled and pickled, not just the entry you are interested in.
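For illustration, here is one way such a file-per-object layout might be hand-rolled; the directory name and helper functions here are hypothetical, not a standard tool:

import os, pickle

dbdir = 'pickledb'                  # hypothetical storage directory
if not os.path.exists(dbdir):
    os.mkdir(dbdir)

def store(key, obj):                # one pickle file per key
    pickle.dump(obj, open(os.path.join(dbdir, key), 'wb'))

def fetch(key):                     # load just the entry requested
    return pickle.load(open(os.path.join(dbdir, key), 'rb'))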
Shelves provide structure for collections of pickled objects that removes some of these constraints. They are a type of file that stores arbitrary Python objects by key for later retrieval, and they are a standard part of the Python system. Really, they are not much of a new topic—shelves are simply a combination of the DBM files and object pickling we just met:
To store an in-memory object by key, the shelve module first serializes the object to a string with the pickle module, and then it stores that string in a DBM file by key with the dbm module.

To fetch an object back by key, the shelve module first loads the object’s serialized string by key from a DBM file with the dbm module, and then converts it back to the original in-memory object with the pickle module.
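In fact, you can approximate what shelve does with a few lines of dbm and pickle code yourself. A rough sketch of the equivalence, not shelve’s actual implementation:

import dbm, pickle

db = dbm.open('mydbm', 'c')
db['key'] = pickle.dumps(['any', 'object'])     # serialize, then store by key
obj = pickle.loads(db['key'])                   # fetch by key, then deserialize
db.close()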
Because shelve uses pickle internally, it can store any object that pickle can: strings, numbers, lists, dictionaries, cyclic objects, class instances, and more. Because shelve uses dbm internally, it inherits all of that module’s capabilities, as well as its portability constraints.
In other words, shelve is just a go-between; it serializes and deserializes objects so that they can be placed in string-based DBM files. The net effect is that shelves let you store nearly arbitrary Python objects on a file by key and fetch them back later with the same key.
Your scripts never see all of this interfacing, though. Like DBM files, shelves provide an interface that looks like a dictionary that must be opened. In fact, a shelve is simply a persistent dictionary of persistent Python objects—the shelve dictionary’s content is automatically mapped to a file on your computer so that it is retained between program runs. This is quite a feat, but it’s simpler to your code than it may sound. To gain access to a shelve, import the module and open your file:
import shelve
dbase = shelve.open("mydbase")
Internally, Python opens a DBM file with the name mydbase, or creates it if it does not yet exist (it uses the DBM 'c' input/output open mode by default). Assigning to a shelve key stores an object:
dbase['key'] = object # store object
Internally, this assignment converts the object to a serialized byte stream with pickling and stores it by key on a DBM file. Indexing a shelve fetches a stored object:
value = dbase['key'] # fetch object
Internally, this index operation loads a string by key from a DBM file and unpickles it into an in-memory object that is the same as the object originally stored. Most dictionary operations are supported here, too:
len(dbase)          # number of items stored
dbase.keys()        # stored item key index iterable
And except for a few fine points, that’s really all there is to using a shelve. Shelves are processed with normal Python dictionary syntax, so there is no new database API to learn. Moreover, objects stored and fetched from shelves are normal Python objects; they do not need to be instances of special classes or types to be stored away. That is, Python’s persistence system is external to the persistent objects themselves. Table 17-2 summarizes these and other commonly used shelve operations.
| Python code | Action | Description |
| import shelve | Import | Get the shelve wrapper module |
| dbase = shelve.open('filename') | Open | Create or open an existing shelve's DBM file |
| dbase['key'] = object | Store | Create or change the entry for 'key' |
| object = dbase['key'] | Fetch | Load the value for the entry 'key' |
| count = len(dbase) | Size | Return the number of entries stored |
| index = dbase.keys() | Index | Fetch the stored keys list (an iterable view) |
| found = 'key' in dbase | Query | See if there's an entry for 'key' |
| del dbase['key'] | Delete | Remove the entry for 'key' |
| for key in dbase: | Iterate | Iterate over stored keys |
| dbase.close() | Close | Manual close, not always needed |
Because shelves export a dictionary-like interface, too, this table is almost identical to the DBM operation table. Here, though, the module name dbm is replaced by shelve, open calls do not require a second 'c' argument, and stored values can be nearly arbitrary kinds of objects, not just strings. Keys are still strings, though (technically, keys are always a str which is encoded to and from bytes automatically per UTF-8), and you still should close shelves explicitly after making changes to be safe: shelves use dbm internally, and some underlying DBMs require closes to avoid data loss or damage.
Recent changes: The shelve module now has an optional writeback argument; if passed True, all entries fetched are cached in memory, and written back to disk automatically at close time. This obviates the need to manually reassign changed mutable entries to flush them to disk, but can perform poorly if many items are fetched—it may require a large amount of memory for the cache, and it can make the close operation slow, since all fetched entries must be written back to disk (Python cannot tell which of the objects may have been changed).
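A brief sketch of the trade-off; the shelve and record names here are hypothetical:

import shelve

db = shelve.open('mydbase', writeback=True)     # cache fetched entries
db['rec'] = {'hits': 0}
db['rec']['hits'] += 1      # in-place change retained: entry is cached
db.close()                  # all cached entries written back to disk here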
Besides allowing values to be arbitrary objects instead of just strings, in Python 3.X the shelve interface differs from the DBM interface in two subtler ways. First, the keys method returns an iterable view object (not a physical list). Second, the values of keys are always str in your code, not bytes—on fetches, stores, deletes, and other contexts, the str keys you use are encoded to the bytes expected by DBM using the UTF-8 Unicode encoding. This means that unlike dbm, you cannot use bytes for shelve keys in your code to employ arbitrary encodings.
Shelve keys are also decoded from bytes to str per UTF-8 whenever they are returned from the shelve API (e.g., keys iteration). Stored values are always the bytes object produced by the pickler to represent a serialized object. We’ll see these behaviors in action in the examples of this section.
Let’s run an interactive session to experiment with shelve interfaces. As mentioned, shelves are essentially just persistent dictionaries of objects, which you open and close:
C:\...\PP4E\Dbase> python
>>> import shelve
>>> dbase = shelve.open("mydbase")
>>> object1 = ['The', 'bright', ('side', 'of'), ['life']]
>>> object2 = {'name': 'Brian', 'age': 33, 'motto': object1}
>>> dbase['brian']  = object2
>>> dbase['knight'] = {'name': 'Knight', 'motto': 'Ni!'}
>>> dbase.close()
Here, we open a shelve and store two fairly complex dictionary and list data structures away permanently by simply assigning them to shelve keys. Because shelve uses pickle internally, almost anything goes here—the trees of nested objects are automatically serialized into strings for storage. To fetch them back, just reopen the shelve and index:
C:\...\PP4E\Dbase> python
>>> import shelve
>>> dbase = shelve.open("mydbase")
>>> len(dbase)                                  # entries
2
>>> dbase.keys()                                # index
KeysView(<shelve.DbfilenameShelf object at 0x0181F630>)
>>> list(dbase.keys())
['brian', 'knight']
>>> dbase['knight']                             # fetch
{'motto': 'Ni!', 'name': 'Knight'}
>>> for row in dbase.keys():                    # .keys() is optional
...     print(row, '=>')
...     for field in dbase[row].keys():
...         print('  ', field, '=', dbase[row][field])
...
brian =>
   motto = ['The', 'bright', ('side', 'of'), ['life']]
   age = 33
   name = Brian
knight =>
   motto = Ni!
   name = Knight
The nested loops at the end of this session step through nested dictionaries—the outer scans the shelve, and the inner scans the objects stored in the shelve (both could use key iterators and omit their .keys() calls). The crucial point to notice is that we’re using normal Python syntax, both to store and to fetch these persistent objects, as well as to process them after loading. It’s persistent Python data on disk.
One of the more useful kinds of objects to store in a shelve is a class instance. Because its attributes record state and its inherited methods define behavior, persistent class objects effectively serve the roles of both database records and database-processing programs. We can also use the underlying pickle module to serialize instances to flat files and other file-like objects (e.g., network sockets), but the higher-level shelve module also gives us a convenient keyed-access storage medium. For instance, consider the simple class shown in Example 17-2, which is used to model people in a hypothetical work scenario.
"a person object: fields + behavior" class Person: def __init__(self, name, job, pay=0): self.name = name self.job = job self.pay = pay # real instance data def tax(self): return self.pay * 0.25 # computed on call def info(self): return self.name, self.job, self.pay, self.tax()
Nothing about this class suggests it will be used for database records—it can be imported and used independent of external storage. It’s easy to use it for a database’s records, though: we can make some persistent objects from this class by simply creating instances as usual, and then storing them by key on an opened shelve:
C:\...\PP4E\Dbase> python
>>> from person import Person
>>> bob   = Person('bob', 'psychologist', 70000)
>>> emily = Person('emily', 'teacher', 40000)
>>>
>>> import shelve
>>> dbase = shelve.open('cast')                 # make new shelve
>>> for obj in (bob, emily):                    # store objects
...     dbase[obj.name] = obj                   # use name for key
...
>>> dbase.close()                               # need for bsddb
Here we used the instance objects’ name attribute as their key in the shelve database. When we come back and fetch these objects in a later Python session or script, they are re-created in memory exactly as they were when they were stored:
C:\...\PP4E\Dbase> python
>>> import shelve
>>> dbase = shelve.open('cast')                 # reopen shelve
>>>
>>> list(dbase.keys())                          # both objects are here
['bob', 'emily']
>>> print(dbase['emily'])
<person.Person object at 0x0197EF70>
>>>
>>> print(dbase['bob'].tax())                   # call: bob's tax
17500.0
Notice that calling Bob’s tax method works even though we didn’t import the Person class in this last session. Python is smart enough to link this object back to its original class when unpickled, such that all the original methods are available through fetched objects.
Technically, Python reimports a class to re-create its stored instances as they are fetched and unpickled. Here’s how this works:
When Python pickles a class instance to store it in a shelve, it saves the instance’s attributes plus a reference to the instance’s class. In effect, pickled class instances in the prior example record the self attributes assigned in the class. Really, Python serializes and stores the instance’s __dict__ attribute dictionary along with enough source file information to be able to locate the class’s module later—the names of the instance’s class as well as its class’s enclosing module.

When Python unpickles a class instance fetched from a shelve, it re-creates the instance object in memory by reimporting the class using the saved class and module name strings, assigning the saved attribute dictionary to a new empty instance, and linking the instance back to the class. This is the default behavior, and it can be tailored by defining special methods that will be called by pickle to fetch and store instance state (see the Python library manual for details).
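For instance, classes can define __getstate__ and __setstate__ methods to control what pickle saves and restores. A sketch with a hypothetical transient attribute:

class Record:
    def __init__(self):
        self.data = {}
        self.tempfile = open('scratch', 'w')    # transient, unpicklable
    def __getstate__(self):
        state = self.__dict__.copy()
        del state['tempfile']                   # drop what can't be pickled
        return state
    def __setstate__(self, state):
        self.__dict__.update(state)
        self.tempfile = open('scratch', 'w')    # re-create on unpickle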
The key point in this is that the class and stored instance data are separate. The class itself is not stored with its instances, but is instead located in the Python source file and reimported later when instances are fetched.
The downside of this model is that the class must be importable to load instances off a shelve (more on this in a moment). The upside is that by modifying external classes in module files, we can change the way stored objects’ data is interpreted and used without actually having to change those stored objects. It’s as if the class is a program that processes stored records.
To illustrate, suppose the Person class from the previous section was changed to the source code in Example 17-3.
""" a person object: fields + behavior change: the tax method is now a computed attribute """ class Person: def __init__(self, name, job, pay=0): self.name = name self.job = job self.pay = pay # real instance data def __getattr__(self, attr): # on person.attr if attr == 'tax': return self.pay * 0.30 # computed on access else: raise AttributeError() # other unknown names def info(self): return self.name, self.job, self.pay, self.tax
This revision has a new tax rate (30 percent), introduces a __getattr__ qualification overload method, and deletes the original tax method. Because this new version of the class is reimported when its existing instances are loaded from the shelve file, they acquire the new behavior automatically—their tax attribute references are now intercepted and computed when accessed:
C:\...\PP4E\Dbase> python
>>> import shelve
>>> dbase = shelve.open('cast')                 # reopen shelve
>>>
>>> print(list(dbase.keys()))                   # both objects are here
['bob', 'emily']
>>> print(dbase['emily'])
<person.Person object at 0x019AEE90>
>>>
>>> print(dbase['bob'].tax)                     # no need to call tax()
21000.0
Because the class has changed, tax is now simply qualified, not called. In addition, because the tax rate was changed in the class, Bob pays more this time around. Of course, this example is artificial, but when used well, this separation of classes and persistent instances can eliminate many traditional database update programs. In most cases, you can simply change the class, not each stored instance, for new behavior.
Although shelves are generally straightforward to use, there are a few rough edges worth knowing about.
First, although they can store arbitrary objects, keys must still be strings. The following fails, unless you convert the integer 42 to the string '42' manually first:
dbase[42] = value # fails, but str(42) will work
This is different from in-memory dictionaries, which allow any immutable object to be used as a key, and derives from the shelve’s use of DBM files internally. As we’ve seen, keys must further be str strings in Python 3.X, not bytes, because the shelve will attempt to encode them in all cases.
Although the shelve module is smart enough to detect multiple occurrences of a nested object and re-create only one copy when fetched, this holds true only within a given slot:
dbase[key] = [object, object]   # OK: only one copy stored and fetched

dbase[key1] = object
dbase[key2] = object            # bad?: two copies of object in the shelve
When key1 and key2 are fetched, they reference independent copies of the original shared object; if that object is mutable, changes from one won’t be reflected in the other. This really stems from the fact that each key assignment runs an independent pickle operation—the pickler detects repeated objects, but only within each pickle call. This may or may not be a concern in your practice, and it can be avoided with extra support logic, but an object can be duplicated if it spans keys.
Because objects fetched from a shelve don’t know that they came from a shelve, operations that change components of a fetched object change only the in-memory copy, not the data on a shelve:
dbase[key].attr = value # shelve unchanged
To really change an object stored on a shelve, fetch it into memory, change its parts, and then write it back to the shelve as a whole by key assignment:
object = dbase[key]         # fetch it
object.attr = value         # modify it
dbase[key] = object         # store back: shelve changed (unless writeback)
As noted earlier, the shelve.open call’s optional writeback argument can be used to avoid the last step here, by automatically caching objects fetched and writing them to disk when the shelve is closed, but this can require substantial memory resources and make close operations slow.
The shelve module does not currently support simultaneous updates. Simultaneous readers are OK, but writers must be given exclusive access to the shelve. You can trash a shelve if multiple processes write to it at the same time, which is a common potential in things such as server-side scripts on the Web. If your shelves may be updated by multiple processes, be sure to wrap updates in calls to the os.open standard library function to lock files and provide exclusive access.
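For example, here is one simple exclusive-access scheme built on os.open; this is a sketch, assuming a lock file name agreed on by all writers, not a standard utility:

import os, time

def lockedUpdate(lockname, update):
    while True:
        try:                                    # O_EXCL: fails if file exists
            fd = os.open(lockname, os.O_CREAT | os.O_EXCL)
            break
        except OSError:
            time.sleep(0.1)                     # another writer holds the lock
    try:
        update()                                # open, change, close the shelve
    finally:
        os.close(fd)
        os.remove(lockname)                     # release for the next writer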
With shelves, the files created by an underlying DBM system used to store your persistent objects are not necessarily compatible with all possible DBM implementations or Pythons. For instance, a file generated by gdbm on Linux, or by the bsddb library on Windows, may not be readable by a Python with other DBM modules installed.
This is really the same portability issue we discussed for DBM files earlier. As you’ll recall, when a DBM file (or by proxy, a shelve) is created, the dbm module tries to import all possible DBM system modules in a predefined order and uses the first that it finds. When dbm later opens an existing file, it attempts to determine which DBM system created it by inspecting the file(s). Because the bsddb system is tried first at file creation time and is available on both Windows and many Unix-like systems, your DBM file is portable as long as your Pythons support BSD on both platforms. This is also true if all platforms you’ll use fall back on Python’s own dbm.dumb implementation. If the system used to create a DBM file is not available on the underlying platform, though, the DBM file cannot be used.
If DBM file portability is a concern, make sure that all the Pythons that will read your data use compatible DBM modules. If that is not an option, use the pickle module directly and flat files for storage (thereby bypassing both shelve and dbm), or use the OODB systems we’ll meet later in this chapter. Such systems may also offer a more complete answer to transaction processing, with calls to commit changes, and automatic rollback to prior commit points on errors.
In addition to these shelve constraints, storing class instances in a shelve adds a set of additional rules you need to be aware of. Really, these are imposed by the pickle module, not by shelve, so be sure to follow these if you store class instance objects with pickle directly too:
As we’ve seen, the Python pickler stores instance attributes only when pickling an instance object, and it reimports the class later to re-create the instance. Because of that, the classes of stored objects must be importable when objects are unpickled—they must be coded unnested at the top level of a module file that is accessible on the module import search path at load time (e.g., named in PYTHONPATH or in a .pth file, or the current working directory or that of the top-level script).
Further, the class usually must be associated with a real imported module when instances are pickled, not with a top-level script (with the module name __main__), unless they will only ever be used in the top-level script. You also need to be careful about moving class modules after instances are stored. When an instance is unpickled, Python must find its class’s module on the module search path using the original module name (including any package path prefixes) and fetch the class from that module using the original class name. If the module or class has been moved or renamed, it might not be found.
In applications where pickled objects are shipped over network sockets, it’s possible to satisfy this constraint by shipping the text of the class along with stored instances; recipients may simply store the class in a local module file on the import search path prior to unpickling received instances. Where this is inconvenient or impossible, simpler pickled objects such as lists and dictionaries with nesting may be transferred instead, as they require no source file to be reconstructed.
Although Python lets you change a class while instances of it are stored on a shelve, those changes must be backward compatible with the objects already stored. For instance, you cannot change the class to expect an attribute not associated with already stored persistent instances unless you first manually update those stored instances or provide extra conversion protocols on the class.
Shelves also inherit the pickling systems’ nonclass limitations. As discussed earlier, some types of objects (e.g., open files and sockets) cannot be pickled, and thus cannot be stored in a shelve.
In an early Python release, persistent object classes also had to either use constructors with no arguments or provide defaults for all constructor arguments (much like the notion of a C++ copy constructor). This constraint was dropped as of Python 1.5.2—classes with nondefaulted constructor arguments now work as is in the pickling system.[69]
Finally, although shelves store objects persistently, they are not really object-oriented database systems. Such systems also implement features such as immediate automatic write-through on changes, transaction commits and rollbacks, safe concurrent updates, and object decomposition and delayed (“lazy”) component fetches based on generated object IDs. Parts of larger objects may be loaded into memory only as they are accessed. It’s possible to extend shelves to support such features manually, but you don’t need to—the ZODB system, among others, provides an implementation of a more complete object-oriented database system. It is constructed on top of Python’s built-in pickling persistence support, but it offers additional features for advanced data stores. For more on ZODB, let’s move on to the next section.
ZODB, the Zope Object Database, is a full-featured and Python-specific object-oriented database (OODB) system. ZODB can be thought of as a more powerful alternative to Python’s shelves of the preceding section. It allows you to store nearly arbitrary Python objects persistently by key, like shelves, but it adds a set of additional features in exchange for a small amount of extra interface code.
ZODB is not the only OODB available for Python: the Durus system is generally seen as a simpler OODB which was inspired by ZODB. While Durus offers some advantages, it does not provide all the features of ZODB today, and it has not been as widely deployed (though perhaps in part because it is newer). Because of that, this section focuses on ZODB to introduce OODB concepts in general.
ZODB is an open source, third-party add-on for Python. It was originally developed as the database mechanism for websites developed with the Zope web framework mentioned in Chapter 12, but it is now available as a standalone package. It’s useful outside the context of both Zope and the Web as a general database management system in any domain.
Although ZODB does not support SQL queries, objects stored in ZODB can leverage the full power of the Python language. Moreover, in some applications, stored data is more naturally represented as a structured Python object. Table-based relational systems often must represent such data as individual parts scattered across multiple tables and associate them with complex and potentially slow key-based joins, or otherwise map them to and from the Python class model. Because OODBs store native Python objects directly, they can often provide a simpler model in systems which do not require the full power of SQL.
Using a ZODB database is very similar to Python’s standard library shelves, described in the prior section. Just like shelves, ZODB uses the Python pickling system to implement a persistent dictionary of persistent Python objects. In fact, there is almost no database interface to be found—objects are made persistent simply by assigning them to keys of the root ZODB dictionary object, or embedding them in objects stored in the database root. And as in a shelve, “records” take the form of native Python objects, processed with normal Python syntax and tools.
Unlike shelves, though, ZODB adds features critical to some types of programs:
You don’t need to manually lock files to avoid data corruption if there are potentially many concurrent writers, the way you would for shelves.
If your program crashes, your changes are not retained unless you explicitly commit them to the database.
Objects in ZODB derived from a persistence superclass are smart enough to know the database must be updated when an attribute is assigned.
Objects are cached in memory for efficiency and are automatically removed from the cache when they haven’t been used.
Because ZODB stores your database in a single flat file with large-file support, it is immune to the potential size constraints and DBM filesystem format differences of shelves. As we saw earlier in this chapter, a shelve created on Windows using bsddb may not be accessible to a script running with gdbm on Linux.
Because of such advantages, ZODB is probably worth your attention if you need to store Python objects in a database persistently in a production environment. The only significant price you’ll pay for using ZODB is a small amount of extra code:
Accessing the database requires a small amount of extra boilerplate code to interface with ZODB—it’s not a simple open call.
Classes are derived from a persistence superclass if you want them to take advantage of automatic updates on changes—persistent classes are generally not as completely independent of the database as in shelves, though they can be.
Considering the extra functionality ZODB provides beyond shelves, these trade-offs are usually more than justified for many applications.
Unfortunately, as I write this edition in June 2010, ZODB is not yet available for Python 3.X, the version used in this book. Because of that, the prior edition’s Python 2.X examples and material have been removed from this section. However, in deference to Python 2.X users, as well as 3.X readers of some bright future where a 3.X-based ZODB has materialized, I’ve made the prior edition’s ZODB materials and examples available in this edition’s examples package.
See the Preface for details on the examples package, and see these locations within it for more on ZODB:
C:\...\Dbase\odb-2.x                    # ZODB examples code, third edition
C:\...\Dbase\odb-2.x\Documentaion       # the third edition's ZODB tutorial
Although I cannot predict the future, ZODB will likely become available for Python 3.X eventually. In the absence of this, other Python-based OODBs may offer additional 3.X options.
To give you a brief sample of ZODB’s flavor, though, here’s a quick spin through its operation in Python 2.X. Once we’ve installed a compatible ZODB, we begin by first creating a database:
...\PP4E\Dbase\odb-2.x> python
>>> from ZODB import FileStorage, DB
>>> storage = FileStorage.FileStorage(r'C:\temp\mydb.fs')
>>> db = DB(storage)
>>> connection = db.open()
>>> root = connection.root()
This is mostly standard “boilerplate” code for connecting to a ZODB database: we import its tools, create a FileStorage and a DB from it, and then open the database and create the root object. The root object is the persistent dictionary in which objects are stored. FileStorage is an object that maps the database to a flat file. Other storage interface options, such as relational database-based storage, are also possible.
Adding objects to a ZODB database is as simple as in shelves. Almost any Python object will do, including tuples, lists, dictionaries, class instances, and nested combinations thereof. As for shelve, simply assign your objects to a key in the database root object to make them persistent:
>>> object1 = (1, 'spam', 4, 'YOU')
>>> object2 = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> object3 = {'name': ['Bob', 'Doe'],
...            'age':  42,
...            'job':  ('dev', 'mgr')}
>>> root['mystr']   = 'spam' * 3
>>> root['mytuple'] = object1
>>> root['mylist']  = object2
>>> root['mydict']  = object3
>>> root['mylist']
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
Because ZODB supports transaction rollbacks, you must commit your changes to the database to make them permanent. Ultimately, this transfers the pickled representation of your objects to the underlying file storage medium—here, three files that include the name of the file we gave when opening:
>>> import transaction
>>> transaction.commit()
>>> storage.close()

...\PP4E\Dbase\odb-2.x> dir /B c:\temp\mydb*
mydb.fs
mydb.fs.index
mydb.fs.tmp
Without the final commit in this session, none of the changes we made would be saved. This is what we want in general—if a program aborts in the middle of an update task, none of the partially complete work it has done is retained. In fact, ZODB supports general database undo operations.
Pulling persistent objects back from ZODB in another session or program is just as straightforward: reopen the database as before and index the root to fetch objects back into memory. Like shelves, the database root supports dictionary interfaces—it may be indexed, has dictionary methods and a length, and so on:
...\PP4E\Dbase\odb-2.x> python
>>> from ZODB import FileStorage, DB
>>> storage = FileStorage.FileStorage(r'C:\temp\mydb.fs')
>>> db = DB(storage)
>>> connection = db.open()
>>> root = connection.root()                    # connect
>>> len(root), root.keys()                      # size, index
(4, ['mylist', 'mystr', 'mytuple', 'mydict'])
>>> root['mylist']                              # fetch objects
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> root['mydict']
{'job': ('dev', 'mgr'), 'age': 42, 'name': ['Bob', 'Doe']}
>>> root['mydict']['name'][-1]                  # Bob's last name
'Doe'
Because the database root looks just like a dictionary, we can process it with normal dictionary code—stepping through the keys list to scan record by record, for instance:
>>> for key in root.keys():
...     print('%s => %s' % (key.ljust(10), root[key]))
...
mylist     => [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
mystr      => spamspamspam
mytuple    => (1, 'spam', 4, 'YOU')
mydict     => {'job': ('dev', 'mgr'), 'age': 42, 'name': ['Bob', 'Doe']}
Also like pickling and shelves, ZODB supports storage and retrieval of class instance objects, though they must inherit from a superclass which provides required protocol and intercepts attribute changes in order to flush them to disk automatically:
from persistent import Persistent

class Person(Persistent):
    def __init__(self, name, job=None, rate=0):
        self.name = name
        self.job  = job
        self.rate = rate
    def changeRate(self, newrate):
        self.rate = newrate         # automatically updates database
When changing ZODB persistent class instances, in-memory attribute changes are automatically written back to the database. Other types of changes, such as in-place appends and key assignments, still require reassignment to the original key as in shelves to force the change to be written to disk (built-in list and dictionary objects do not know that they are persistent).
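To make the contrast concrete, here is a small hedged sketch (a hypothetical Python 2.X session, since ZODB predates 3.X support here) showing the two kinds of changes side by side:

>>> bob = Person('Bob Smith', job='dev', rate=100)
>>> root['bob'] = bob                  # insert a persistent instance
>>> bob.changeRate(200)                # attribute change: detected automatically

>>> root['mylist'].append(10)          # in-place change: NOT detected
>>> temp = root['mylist']
>>> temp.append(10)
>>> root['mylist'] = temp              # reassign to the key to force a write
>>> transaction.commit()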
Because ZODB does not yet work with Python 3.X, that’s as much as we can say about it in this book. For more details, search for ZODB and Zope resources on the Web, and see the examples package resources listed earlier. Here, let’s move on to see how Python programs can make use of a very different sort of database interface—relational databases and SQL.
The shelve module and ZODB package of the prior sections are powerful tools. Both allow scripts to throw nearly arbitrary Python objects on a keyed-access file and load them back later—in a single step for shelves and with a small amount of administrative code for ZODB. Especially for applications that record highly structured data, object databases can be convenient and efficient—there is no need to split and later join together the parts of large objects, and stored data is processed with normal Python syntax because it is normal Python objects.
Shelves and ZODB aren’t relational database systems, though; objects (records) are accessed with a single key, and there is no notion of SQL queries. Shelves, for instance, are essentially databases with a single index and no other query-processing support. Although it’s possible to build a multiple-index interface to store data with multiple shelves, it’s not a trivial task and requires manually coded extensions.
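For instance, here is one minimal sketch of such a manually coded extension (the file and field names are hypothetical, not a standard recipe): a second shelve serving as an index that maps job names to lists of keys in the main shelve:

import shelve
db    = shelve.open('people-main')       # key -> record dictionary
index = shelve.open('people-byjob')      # job -> list of main-shelve keys

def store(key, record):
    db[key] = record
    keys = index.get(record['job'], [])
    keys.append(key)
    index[record['job']] = keys          # reassign so the change is written

store('bob', dict(name='Bob', job='dev', pay=5000))
print([db[k] for k in index['dev']])     # "query" records by job via the index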
ZODB supports some types of searching beyond shelve (e.g., its cataloging feature), and persistent objects may be traversed with all the power of the Python language. However, neither shelves nor ZODB object-oriented databases provide the full generality of SQL queries. Moreover, especially for data that has a naturally tabular structure, relational databases may sometimes be a better fit.
For programs that can benefit from the power of SQL, Python also broadly supports relational database management systems (RDBMSs). Relational databases are not necessarily mutually exclusive with the object persistence topics we studied earlier in this chapter—it is possible, for example, to store the serialized string representation of a Python object produced by pickling in a relational database. ZODB also supports the notion of mapping an object database to a relational storage medium.
The databases we’ll meet in this section, though, are structured and processed in very different ways:
They store data in related tables of columns (rather than in persistent dictionaries of arbitrarily structured persistent Python objects).
They support the SQL query language for accessing data and exploiting relationships among it (instead of Python object traversals).
For some applications, the end result can be a potent combination. Moreover, some SQL-based database systems provide industrial-strength persistence support for enterprise-level data.
Today, there are freely available interfaces that let Python scripts utilize all common relational database systems, both free and commercial: MySQL, Oracle, Sybase, Informix, InterBase, PostgreSQL (Postgres), SQLite, ODBC, and more. In addition, the Python community has defined a database API specification that works portably with a variety of underlying database packages. Scripts written for this API can be migrated to different database vendor packages, with minimal or no source code changes.
As of Python 2.5, Python itself includes built-in support for the SQLite relational database system as part of its standard library. Because this system supports the portable database API, it serves as a tool for both program storage and prototyping—systems developed with SQLite work largely unchanged when a more feature-rich database such as MySQL or Oracle is deployed.
Moreover, the popular SQLObject and SQLAlchemy third-party systems both provide an Object Relational Mapper (ORM), which grafts an object interface onto your database, in which tables are modeled by Python classes, rows by instances of those classes, and columns by instance attributes. Since ORMs largely just wrap SQL databases in Python classes, we’ll defer their coverage until later in this chapter; for now, let’s explore SQL basics in Python.
Like ZODB, and unlike the pickle and shelve persistence modules presented earlier, most SQL databases are optional extensions that are not part of Python itself. SQLite is the only relational database package that comes with Python. Moreover, you need to know SQL to fully understand their interfaces. Because we don’t have space to teach SQL in this text, this section gives a brief overview of the API; please consult other SQL references and the database API resources mentioned in the next section for more details that we’ll skip here.
The good news is that you can access SQL databases from Python, through a straightforward and portable model. The Python database API specification defines an interface for communicating with underlying database systems from Python scripts. Vendor-specific database interfaces for Python may or may not conform to this API completely, but all database extensions for Python in common use are minor variations on a theme. Under the database API, SQL databases in Python are grounded on three core concepts:
Connection objects
Represent a connection to a database, are the interface to rollback and commit operations, provide package implementation details, and generate cursor objects.

Cursor objects
Represent an SQL statement submitted as a string and can be used to access and step through SQL statement results.

Query results of select statements
Are returned to scripts as Python sequences of sequences (e.g., a list of tuples), representing database tables of rows. Within these row sequences, column field values are normal Python objects such as strings, integers, and floats (e.g., [('bob', 48), ('emily', 47)]). Column values may also be special types that encapsulate things such as date and time, and database NULL values are returned as the Python None object.
Beyond this, the API defines a standard set of database exception types, special database type object constructors, and informational top-level calls including thread safety and replacement style checks.
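For example, the standard exception classes let scripts trap database errors generically. A brief hedged sketch using the sqlite3 module we’ll meet in a moment (the table name here is made up, chosen to fail):

import sqlite3
try:
    conn = sqlite3.connect('dbase1')
    conn.execute('select * from nonesuch')       # no such table: raises
except sqlite3.DatabaseError as exc:             # DB API exception classes
    print('database error:', exc)                # are module attributes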
For instance, to establish a database connection under the Python API-compliant Oracle interface, install the commonly used Python Oracle extension module as well as Oracle itself, and then run a statement of this form:
connobj = connect("user/password@system")
This call’s arguments may vary per database and vendor (e.g., some may require network details or a local file’s name), but they generally contain what you provide to log in to your database system. Once you have a connection object, there are a variety of things you can do with it, including:
connobj.close()       close connection now (not at object __del__ time)
connobj.commit()      commit any pending transactions to the database
connobj.rollback()    roll database back to start of pending transactions
But one of the most useful things to do with a connection object is to generate a cursor object:
cursobj = connobj.cursor() return a new cursor object for running SQL
Cursor objects have a set of methods, too (e.g., close to close the cursor before its destructor runs, and callproc to call a stored procedure), but the most important may be this one:
cursobj.execute(sqlstring [, parameters])    run SQL query or command string
Parameters are passed in as a sequence or mapping of values, and are substituted into the SQL statement string according to the interface module’s replacement target conventions. The execute method can be used to run a variety of SQL statement strings:
DDL definition statements (e.g., CREATE TABLE)
DML modification statements (e.g., UPDATE or INSERT)
DQL query statements (e.g., SELECT)
After running an SQL statement, the cursor’s rowcount attribute gives the number of rows changed (for DML changes) or fetched (for DQL queries), and the cursor’s description attribute gives column names and types after a query; execute also returns the number of rows affected or fetched in most vendor interfaces. For DQL query statements, you must call one of the fetch methods to complete the operation:
tuple       = cursobj.fetchone()          fetch next row of a query result
listoftuple = cursobj.fetchmany([size])   fetch next set of rows of query result
listoftuple = cursobj.fetchall()          fetch all remaining rows of the result
And once you’ve received fetch method results, table information is processed using normal Python sequence operations; for example, you can step through the tuples in a fetchall result list with a simple for loop or comprehension expression. Most Python database interfaces also allow you to provide values to be passed to SQL statement strings, by providing targets and a tuple of parameters. For instance:
query = 'SELECT name, shoesize FROM spam WHERE job = ? AND age = ?'
cursobj.execute(query, (value1, value2))
results = cursobj.fetchall()
for row in results: ...
In this event, the database interface utilizes prepared statements (an optimization and convenience) and correctly passes the parameters to the database regardless of their Python types. The notation used to code targets in the query string may vary in some database interfaces (e.g., :p1 and :p2, or two %s, rather than the two ?s used by the Oracle interface); in any event, this is not the same as Python’s % string formatting operator, as it sidesteps security issues along the way.
Finally, if your database supports stored procedures, you can call them with the callproc method or by passing an SQL CALL or EXEC statement string to the execute method. callproc may generate a result table retrieved with a fetch variant, and returns a modified copy of the input sequence—input parameters are left untouched, and output and input/output parameters are replaced with possibly new values. Additional API features, including support for database blobs (roughly, sized results), are described in the API’s documentation. For now, let’s move on to do some real SQL processing in Python.
We don’t have space to provide an exhaustive reference for the database API in this book. To sample the flavor of the interface, though, let’s step through a few simple examples. We’ll use the SQLite database system for this tutorial. SQLite is a standard part of Python itself, which you can reasonably expect to be available in all Python installations. Although SQLite implements a complete relational database system, it takes the form of an in-process library instead of a server. This generally makes it better suited for program storage than for enterprise-level data needs.
Thanks to Python’s portable database API, though, other popular database packages such as PostgreSQL, MySQL, and Oracle are used almost identically; the initial call to log in to the database will be all that normally requires different argument values for scripts that use standard SQL code. Because of this, we can use the SQLite system both as a prototyping tool in applications development and as an easy way to get started with the Python SQL database API in this book.
As mentioned earlier, the third edition’s coverage of MySQL had to be replaced here because the interface used is not yet ported to Python 3.X. However, the third edition’s MySQL-based examples and overview are available in the book examples package, in directory C:\...\PP4E\Dbase\Sql\MySql-2.X and its Documentation subdirectory. The examples are in Python 2.X form, but their database-related code is largely version neutral. Since that code is also largely database neutral, it is probably of limited value to most readers; the scripts listed in this book should work on other database packages like MySQL with only trivial changes.
Regardless of which database system your scripts talk to, the basic SQL interface in Python is very simple. In fact, it’s hardly object-oriented at all—queries and other database commands are sent as strings of SQL. If you know SQL, you already have most of what you need to use relational databases in Python. That’s good news if you fall into this category, but adds a prerequisite if you don’t.
This isn’t a book on the SQL language, so we’ll defer to other resources for details on the commands we’ll be running here (O’Reilly has a suite of books on the topic). In fact, the databases we’ll use are tiny, and the commands we’ll use are deliberately simple as SQL goes—you’ll want to extrapolate from what you see here to the more realistic tasks you face. This section is just a brief look at how to use the Python language in conjunction with an SQL database.
Whether large or small, though, the Python code needed to process your database turns out to be surprisingly straightforward. To get started, the first thing we need to do is open a connection to the database and create a table for storing records:
C:\...\PP4E\Dbase\Sql> python
>>> import sqlite3
>>> conn = sqlite3.connect('dbase1')        # use a full path for files elsewhere
We start out by importing the Python SQLite interface here—it’s a standard library module known as sqlite3 to our scripts. Next we create a connection object, passing in the items our database requires at start-up time—here, the name of the local file where our databases will be stored. This file is what you’ll want to back up to save your database. The call will create the file if needed, or open its current content; SQLite also accepts the special string “:memory:” to create a temporary database in memory instead.
As long as a script sticks to using standard SQL code, the connect call’s arguments are usually the only thing that can vary across different database systems. For example, in the MySQL interface this call accepts a network host’s domain name, user name, and password, passed as keyword arguments instead, and the Oracle example sketched earlier expects a more specific string syntax. Once we’ve gotten past this platform-specific call, though, the rest of the API is largely database neutral.
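To illustrate, here is a hedged side-by-side sketch; argument names vary by interface and version, and the non-SQLite calls here are untested assumptions rather than verified code:

conn = sqlite3.connect(r'C:\temp\dbase1')                 # SQLite: a local file path
# conn = MySQLdb.connect(host='localhost', user='root',   # MySQL (Python 2.X only)
#                        passwd='XXXX', db='peopledb')    # keyword network details
# conn = connect('user/password@system')                  # the Oracle form sketched earlier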
Next, let’s make a cursor for submitting SQL statements to the database server, and submit one to create a first table:
>>> curs = conn.cursor()
>>> tblcmd = 'create table people (name char(30), job char(10), pay int(4))'
>>> curs.execute(tblcmd)
The last command here creates the table called “people” within the database; the name, job, and pay information specifies the columns in this table, as well as their datatypes, using a “type(size)” syntax—two strings and an integer. Datatypes can be more sophisticated than ours, but we’ll ignore such details here (see SQL references). In SQLite, the file is the database, so there’s no notion of creating or using a specific database within it, as there is in some systems. At this point, there is a simple flat file in the current working directory named dbase1, which contains binary data and contains our people database table.
So far, we’ve logged in (which just means opening a local file in SQLite) and created a table. Next let’s start a new Python session and create some records. There are three basic statement-based approaches we can use here: inserting one row at a time, inserting multiple rows with a single call, or inserting multiple rows with a Python loop. Here is the simple case (I’m omitting some call return values here if they are irrelevant to the story):
C:\...\PP4E\Dbase\Sql> python
>>> import sqlite3
>>> conn = sqlite3.connect('dbase1')
>>> curs = conn.cursor()
>>> curs.execute('insert into people values (?, ?, ?)', ('Bob', 'dev', 5000))
>>> curs.rowcount
1
>>> sqlite3.paramstyle
'qmark'
Create a cursor object to submit SQL statements to the database server as before. The SQL insert command adds a single row to the table. After an execute call, the cursor’s rowcount attribute gives the number of rows produced or affected by the last statement run. This is also available as the return value of an execute call in some database interface modules, but this is not defined in the database API specification, and isn’t provided in SQLite; in other words, don’t depend on it if you want your database scripts to work on other database systems.

Parameters to substitute into the SQL statement string are generally passed in as a sequence (e.g., list or tuple). Notice the module’s paramstyle—this tells us what style it uses for substitution targets in the statement string. Here, qmark means this module accepts ? for replacement targets. Other database modules might use styles such as format (meaning a %s target), or numeric indexes or mapping keys; see the DB API for more details.
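As a hedged sketch of how a script might smooth over that difference (a hypothetical helper, not from this book’s examples, and covering only the qmark and format styles):

def insert_row(dbmodule, cursor, table, values):
    marker = '?' if dbmodule.paramstyle == 'qmark' else '%s'
    slots  = ', '.join([marker] * len(values))      # one target per column
    cursor.execute('insert into %s values (%s)' % (table, slots), values)

# usage (not run in this session, to keep the table contents as shown):
# insert_row(sqlite3, curs, 'people', ('Flo', 'dev', 45000))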
To insert multiple rows with a single statement, use the executemany method and a sequence of row sequences (e.g., a list of lists). This call is like calling execute once for each row sequence in the argument, and in fact may be implemented as such; database interfaces may also use database-specific techniques to make this run quicker, though:
>>> curs.executemany('insert into people values (?, ?, ?)',
...          [ ('Sue', 'mus', '70000'),
...            ('Ann', 'mus', '60000')])
>>> curs.rowcount
2
We inserted two rows at once in the last statement. It’s hardly any more work to achieve the same result by inserting one row at a time with a Python loop:
>>> rows = [['Tom', 'mgr', 100000],
...         ['Kim', 'adm', 30000],
...         ['pat', 'dev', 90000]]
>>> for row in rows:
...     curs.execute('insert into people values (? , ?, ?)', row)
...
>>> conn.commit()
Blending Python and SQL like this starts to open up all sorts of interesting possibilities. Notice the last command; we always need to call the connection’s commit method to write our changes out to the database. Otherwise, when the connection is closed, our changes may be lost. In fact, until we call the commit method, none of our inserts may be visible from other database connections.
Technically, the API suggests that a connection object should automatically call its rollback method to back out changes that have not yet been committed, when it is closed (which happens manually when its close method is called, or automatically when the connection object is about to be garbage collected). For database systems that don’t support transaction commit and rollback operations, these calls may do nothing. SQLite implements both the commit and rollback methods; the latter rolls back any changes made since the last commit.
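As a hedged aside (a sqlite3 convenience, not part of the portable DB API, and not run as part of this session’s transcript): sqlite3 connection objects can also be used as context managers that commit if the block succeeds and roll back if it raises an exception:

import sqlite3
conn2 = sqlite3.connect('dbase1')
try:
    with conn2:                                     # commits if the block succeeds
        conn2.execute('insert into people values (?, ?, ?)',
                      ('Flo', 'dev', 45000))
        raise ValueError('abandon this update')     # forces a rollback instead
except ValueError:
    pass                                            # the insert was backed out
conn2.close()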
OK, we’ve now added six records to our database table. Let’s run an SQL query to see how we did:
>>> curs.execute('select * from people')
>>> curs.fetchall()
[('Bob', 'dev', 5000), ('Sue', 'mus', 70000), ('Ann', 'mus', 60000), ('Tom', 'mgr', 100000), ('Kim', 'adm', 30000), ('pat', 'dev', 90000)]
Run an SQL select statement with a cursor object to grab all rows and call the cursor’s fetchall to retrieve them. They come back to our script as a sequence of sequences. In this module, it’s a list of tuples—the outer list represents the result table, the nested tuples are that table’s rows, and the nested tuples’ contents are the column data. Because it’s all Python data, once we get the query result, we process it with normal Python code. For example, to make the display a bit more coherent, loop through the query’s result as usual:
>>> curs.execute('select * from people')
>>> for row in curs.fetchall():
...     print(row)
...
('Bob', 'dev', 5000)
('Sue', 'mus', 70000)
('Ann', 'mus', 60000)
('Tom', 'mgr', 100000)
('Kim', 'adm', 30000)
('pat', 'dev', 90000)
Tuple unpacking comes in handy in loops here, too, to pick out column values as we go. Here’s a simple formatted display of two of the columns’ values:
>>> curs.execute('select * from people')
>>> for (name, job, pay) in curs.fetchall():
...     print(name, ':', pay)
...
Bob : 5000
Sue : 70000
Ann : 60000
Tom : 100000
Kim : 30000
pat : 90000
Because the query result is a sequence, we can use Python’s powerful sequence and iteration tools to process it. For instance, to select just the name column values, we can run a more specific SQL query and get a list of tuples:
>>> curs.execute('select name from people')
>>> names = curs.fetchall()
>>> names
[('Bob',), ('Sue',), ('Ann',), ('Tom',), ('Kim',), ('pat',)]
Or we can use a Python list comprehension to pick out the fields we want—by using Python code, we have more control over the data’s content and format:
>>> curs.execute('select * from people')
>>> names = [rec[0] for rec in curs.fetchall()]
>>> names
['Bob', 'Sue', 'Ann', 'Tom', 'Kim', 'pat']
The fetchall call we’ve used so far fetches the entire query result table all at once, as a single sequence (an empty sequence comes back if the result is empty). That’s convenient, but it may be slow enough to block the caller temporarily for large result tables, or generate substantial network traffic if the server is running remotely (something that could easily require a parallel thread in a GUI). To avoid such a bottleneck, we can also grab just one row, or a bunch of rows, at a time with fetchone and fetchmany. The fetchone call returns the next result row, or a None false value at the end of the table:
>>> curs.execute('select * from people')
>>> while True:
...     row = curs.fetchone()
...     if not row: break
...     print(row)
...
('Bob', 'dev', 5000)
('Sue', 'mus', 70000)
('Ann', 'mus', 60000)
('Tom', 'mgr', 100000)
('Kim', 'adm', 30000)
('pat', 'dev', 90000)
The fetchmany call returns a sequence of rows from the result, but not the entire table; you can specify how many rows to grab each time with a parameter, or rely on the default as given by the cursor’s arraysize attribute. Each call gets at most that many more rows from the result, or an empty sequence at the end of the table:
>>> curs.execute('select * from people')
>>> while True:
...     rows = curs.fetchmany()          # size=N optional argument
...     if not rows: break
...     for row in rows:
...         print(row)
...
('Bob', 'dev', 5000)
('Sue', 'mus', 70000)
('Ann', 'mus', 60000)
('Tom', 'mgr', 100000)
('Kim', 'adm', 30000)
('pat', 'dev', 90000)
For this module at least, the result table is exhausted after a fetchone or fetchmany returns a false value. The DB API says that fetchall returns “all (remaining) rows,” so you generally need to call execute again to regenerate results before fetching new data:
>>> curs.fetchone()
>>> curs.fetchmany()
[]
>>> curs.fetchall()
[]
Naturally, we can do more than fetch an entire table; the full power of the SQL language is at your disposal in Python:
>>> curs.execute('select name, job from people where pay > 60000')
>>> curs.fetchall()
[('Sue', 'mus'), ('Tom', 'mgr'), ('pat', 'dev')]
The last query fetches name and job fields for people who earn more than $60,000. The next is similar, but passes in the selection value as a parameter and orders the result table:
>>> query = 'select name, job from people where pay >= ? order by name'
>>> curs.execute(query, [60000])
>>> for row in curs.fetchall(): print(row)
...
('Ann', 'mus')
('Sue', 'mus')
('Tom', 'mgr')
('pat', 'dev')
Cursor objects also are used to submit SQL update statements to the database server—updates, deletes, and inserts. We’ve already seen the insert statement at work. Let’s start a new session to perform some other kinds of updates; we begin with the same data we had in the prior session:
C:\...\PP4E\Dbase\Sql> python
>>> import sqlite3
>>> conn = sqlite3.connect('dbase1')
>>> curs = conn.cursor()
>>> curs.execute('select * from people')
>>> curs.fetchall()
[('Bob', 'dev', 5000), ('Sue', 'mus', 70000), ('Ann', 'mus', 60000), ('Tom', 'mgr', 100000), ('Kim', 'adm', 30000), ('pat', 'dev', 90000)]
The SQL update statement changes records—the following changes three records’ pay column values to 65000 (Bob, Ann, and Kim), because their pay was no more than $60,000. As usual, the cursor’s rowcount gives the number of records changed:
>>> curs.execute('update people set pay=? where pay <= ?', [65000, 60000])
>>> curs.rowcount
3
>>> curs.execute('select * from people')
>>> curs.fetchall()
[('Bob', 'dev', 65000), ('Sue', 'mus', 70000), ('Ann', 'mus', 65000), ('Tom', 'mgr', 100000), ('Kim', 'adm', 65000), ('pat', 'dev', 90000)]
The SQL delete statement removes records, optionally according to a condition (to delete all records, omit the condition). In the following, we delete Bob’s record, as well as any record with a pay that is at least $90,000:
>>> curs.execute('delete from people where name = ?', ['Bob'])
>>> curs.execute('delete from people where pay >= ?', (90000,))
>>> curs.execute('select * from people')
>>> curs.fetchall()
[('Sue', 'mus', 70000), ('Ann', 'mus', 65000), ('Kim', 'adm', 65000)]
>>> conn.commit()
Finally, remember to commit your changes to the database before exiting Python, assuming you wish to keep them. Without a commit, a connection rollback or close call, as well as the connection’s __del__ deletion method, will back out uncommitted changes. Connection objects are automatically closed if they are still open when they are garbage collected, which in turn triggers a __del__ and a rollback; garbage collection happens automatically on program exit, if not sooner.
Now that we’ve seen the basics in action, let’s move on and apply them to a few larger tasks. The SQL API defines query results to be sequences of sequences. One of the more common features that people seem to miss from the API is the ability to get records back as something more structured—a dictionary or class instance, for example, with keys or attributes giving column names. The ORMs we’ll meet at the end of this chapter map to class instances, but because this is Python, it’s easy to code this kind of transformation in other ways. Moreover, the API already gives us the tools we need.
For example, after a query execute call, the DB API specifies that the cursor’s description attribute gives the names and (for some databases) types of the columns in the result table. To see how, let’s continue with the database in the state in which we left it in the prior section:
>>> curs.execute('select * from people')
>>> curs.description
(('name', None, None, None, None, None, None), ('job', None, None, None, None, None, None), ('pay', None, None, None, None, None, None))
>>> curs.fetchall()
[('Sue', 'mus', 70000), ('Ann', 'mus', 65000), ('Kim', 'adm', 65000)]
Formally, the description is a sequence of column-description sequences, each of the following form. See the DB API for more on the meaning of the type code slot—it maps to objects at the top level of the database interface module, but the sqlite3 module implements only the name component:
(name, type_code, display_size, internal_size, precision, scale, null_ok)
Now, we can use this metadata anytime we want to label the columns—for instance, in a formatted records display (be sure to regenerate a query result first, since the prior result has been fetched):
>>> curs.execute('select * from people')
>>> colnames = [desc[0] for desc in curs.description]
>>> colnames
['name', 'job', 'pay']
>>> for row in curs.fetchall():
...     for name, value in zip(colnames, row):
...         print(name, '\t=>', value)
...     print()
...
name    => Sue
job     => mus
pay     => 70000

name    => Ann
job     => mus
pay     => 65000

name    => Kim
job     => adm
pay     => 65000
Notice how a tab character is used to try to make this output align; a better approach might be to determine the maximum field name length (we’ll see how in a later example).
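Here is a quick hedged sketch of that better approach, reusing curs and colnames from this session (the later dumpdb example does the same thing more completely):

>>> width = max(len(name) for name in colnames)     # widest column name
>>> curs.execute('select * from people')
>>> for row in curs.fetchall():
...     for name, value in zip(colnames, row):
...         print('%-*s => %s' % (width, name, value))
...     print()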
It’s a minor extension of our formatted display code to create a dictionary for each record, with field names for keys—we just need to fill in the dictionary as we go:
>>> curs.execute('select * from people')
>>> colnames = [desc[0] for desc in curs.description]
>>> rowdicts = []
>>> for row in curs.fetchall():
...     newdict = {}
...     for name, val in zip(colnames, row):
...         newdict[name] = val
...     rowdicts.append(newdict)
...
>>> for row in rowdicts: print(row)
...
{'pay': 70000, 'job': 'mus', 'name': 'Sue'}
{'pay': 65000, 'job': 'mus', 'name': 'Ann'}
{'pay': 65000, 'job': 'adm', 'name': 'Kim'}
Because this is Python, though, there are more powerful ways to build up these record dictionaries. For instance, the dictionary constructor call accepts the zipped name/value pairs to fill out the dictionaries for us:
>>> curs.execute('select * from people')
>>> colnames = [desc[0] for desc in curs.description]
>>> rowdicts = []
>>> for row in curs.fetchall():
...     rowdicts.append(dict(zip(colnames, row)))
...
>>> rowdicts[0]
{'pay': 70000, 'job': 'mus', 'name': 'Sue'}
And finally, a list comprehension will do the job of collecting the dictionaries into a list—not only is this less to type, but it probably runs quicker than the original version:
>>> curs.execute('select * from people')
>>> colnames = [desc[0] for desc in curs.description]
>>> rowdicts = [dict(zip(colnames, row)) for row in curs.fetchall()]
>>> rowdicts[0]
{'pay': 70000, 'job': 'mus', 'name': 'Sue'}
One of the things we lose when moving to dictionaries is record field order—if you look back at the raw result of fetchall, you’ll notice that record fields are in the name, job, and pay order in which they were stored. Our dictionary’s fields come back in the pseudorandom order of Python mappings. As long as we fetch fields by key, this is irrelevant to our script. Tables still maintain their order, and dictionary construction works fine because the description result tuple is in the same order as the fields in row tuples returned by queries.
We’ll leave the task of translating record tuples into class instances as a suggested exercise (though a sketch follows), except for two hints: Python’s standard library collections module implements more exotic data types, such as named tuples and ordered dictionaries; and we can access fields as attributes rather than as keys, by simply creating an empty class instance and assigning to attributes with the Python setattr function. Classes would also provide a natural place to code inheritable tools such as standard display methods. In fact, this is part of the utility that the upcoming ORMs can provide for us.
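As a hedged sketch of that exercise (the function and class names here are hypothetical, not this book’s code), both hints can be applied in a few lines:

from collections import namedtuple

def makerecords(cursor, query, params=()):
    cursor.execute(query, params)
    colnames = [desc[0] for desc in cursor.description]
    Rec = namedtuple('Rec', colnames)               # fields become attributes
    return [Rec(*row) for row in cursor.fetchall()]

# or, with a plain class and setattr, leaving room for inherited tools:
class Row:
    def display(self):                              # a standard display method
        for key in sorted(self.__dict__):
            print('%s => %s' % (key, getattr(self, key)))

def makeinstances(cursor, query, params=()):
    cursor.execute(query, params)
    colnames = [desc[0] for desc in cursor.description]
    objs = []
    for row in cursor.fetchall():
        obj = Row()
        for name, value in zip(colnames, row):
            setattr(obj, name, value)               # fill attributes from columns
        objs.append(obj)
    return objs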
Up to this point, we’ve essentially used Python as a command-line SQL client—our queries have been typed and run interactively. All the kinds of code we’ve run, though, can be used as the basis of database access in script files. Working interactively requires retyping things such as multiline loops, which can become tedious. With scripts, we can automate our work.
To demonstrate, let’s make the prior section’s example into a utility module—Example 17-4 is a reusable module that knows how to translate the result of a query from row tuples to row dictionaries.
""" convert list of row tuples to list of row dicts with field name keys this is not a command-line utility: hardcoded self-test if run """ def makedicts(cursor, query, params=()): cursor.execute(query, params) colnames = [desc[0] for desc in cursor.description] rowdicts = [dict(zip(colnames, row)) for row in cursor.fetchall()] return rowdicts if __name__ == '__main__': # self test import sqlite3 conn = sqlite3.connect('dbase1') cursor = conn.cursor() query = 'select name, pay from people where pay < ?' lowpay = makedicts(cursor, query, [70000]) for rec in lowpay: print(rec)
As usual, we can run this file from the system command line as a script to invoke its self-test code:
...\PP4E\Dbase\Sql> makedicts.py
{'pay': 65000, 'name': 'Ann'}
{'pay': 65000, 'name': 'Kim'}
Or we can import it as a module and call its function from another context, like the interactive prompt. Because it is a module, it has become a reusable database tool:
...\PP4E\Dbase\Sql> python
>>> from makedicts import makedicts
>>> from sqlite3 import connect
>>> conn = connect('dbase1')
>>> curs = conn.cursor()
>>> curs.execute('select * from people')
>>> curs.fetchall()
[('Sue', 'mus', 70000), ('Ann', 'mus', 65000), ('Kim', 'adm', 65000)]

>>> rows = makedicts(curs, "select name from people where job = 'mus'")
>>> rows
[{'name': 'Sue'}, {'name': 'Ann'}]
Our utility handles arbitrarily complex queries—they are simply passed through the DB API to the database server or library. The order by clause here sorts the result on the name field:
>>> query = 'select name, pay from people where job = ? order by name'
>>> musicians = makedicts(curs, query, ['mus'])
>>> for row in musicians: print(row)
...
{'pay': 65000, 'name': 'Ann'}
{'pay': 70000, 'name': 'Sue'}
So far, we’ve learned how to make databases and tables, insert records into tables, query table contents, and extract column names. For reference, and to show how these techniques are combined, Example 17-5 collects them into a single script.
from sqlite3 import connect

conn = connect('dbase1')
curs = conn.cursor()
try:
    curs.execute('drop table people')
except:
    pass                                   # did not exist
curs.execute('create table people (name char(30), job char(10), pay int(4))')
curs.execute('insert into people values (?, ?, ?)', ('Bob', 'dev', 50000))
curs.execute('insert into people values (?, ?, ?)', ('Sue', 'dev', 60000))

curs.execute('select * from people')
for row in curs.fetchall():
    print(row)

curs.execute('select * from people')
colnames = [desc[0] for desc in curs.description]
while True:
    print('-' * 30)
    row = curs.fetchone()
    if not row: break
    for (name, value) in zip(colnames, row):
        print('%s => %s' % (name, value))

conn.commit()                              # save inserted records
Refer to prior sections in this tutorial if any of the code in this script is unclear. When run, it creates a two-record database and lists its content to the standard output stream:
C:\...\PP4E\Dbase\Sql> testdb.py
('Bob', 'dev', 50000)
('Sue', 'dev', 60000)
------------------------------
name => Bob
job => dev
pay => 50000
------------------------------
name => Sue
job => dev
pay => 60000
------------------------------
As is, this example is really just meant to demonstrate the database API. It hardcodes database names, and it re-creates the database from scratch each time. We could turn this code into generally useful tools by refactoring it into reusable parts, as we’ll see later in this section. First, though, let’s explore techniques for getting data into our databases.
One of the nice things about using Python in the database domain is that you can combine the power of the SQL query language with the power of the Python general-purpose programming language. They naturally complement each other.
Suppose, for example, that you want to load a database table from a flat file, where each line in the file represents a database row, with individual field values separated by commas. Examples 17-6 and 17-7 list two such datafiles we’re going to be using here.
data.txt:

bob,devel,50000
sue,music,60000
ann,devel,40000
tim,admin,30000
kim,devel,60000

data2.txt (reconstructed here from the dump of this file later in this section):

bob,developer,80000
sue,music,90000
ann,manager,80000
Now, some database systems like MySQL have a handy SQL statement for loading such a table quickly. Its load data statement parses and loads data from a text file, located on either the client or the server machine. In the following, the first command deletes all records in the table, and we’re using the fact that Python automatically concatenates adjacent string literals to split the SQL statement over multiple lines:
# Using MySQL (currently available for Python 2.X only)
...log into MySQL first...
>>> curs.execute('delete from people')               # all records
>>> curs.execute(
...     "load data local infile 'data.txt' "
...     "into table people fields terminated by ','")
>>> curs.execute('select * from people')
>>> for row in curs.fetchall(): print(row)
...
('bob', 'devel', 50000L)                             # 2.X long integers
('sue', 'music', 60000L)
('ann', 'devel', 40000L)
('tim', 'admin', 30000L)
('kim', 'devel', 60000L)
>>> conn.commit()
This works as expected. But what if you must use a system like the SQLite database used in this book, which lacks this specific SQL statement? Or, perhaps you just need to do something more custom than this MySQL statement allows. Not to worry—a small amount of simple Python code can easily accomplish the same result with SQLite and Python 3.X (again, some irrelevant output lines are omitted here):
C:\...\PP4E\Dbase\Sql> python
>>> from sqlite3 import connect
>>> conn = connect('dbase1')
>>> curs = conn.cursor()
>>> curs.execute('delete from people')               # empty the table
>>> curs.execute('select * from people')
>>> curs.fetchall()
[]

>>> file = open('data.txt')
>>> rows = [line.rstrip().split(',') for line in file]
>>> rows[0]
['bob', 'devel', '50000']
>>> for rec in rows:
...     curs.execute('insert into people values (?, ?, ?)', rec)
...
>>> curs.execute('select * from people')
>>> for rec in curs.fetchall(): print(rec)
...
('bob', 'devel', 50000)
('sue', 'music', 60000)
('ann', 'devel', 40000)
('tim', 'admin', 30000)
('kim', 'devel', 60000)
This code makes use of a list comprehension to collect string split results for all lines in the file after removing any newline characters, and file iterators to step through the file line by line. Its Python loop does the same work as the MySQL load statement, and it will work on more database types, including SQLite. We could get a similar result from an executemany DB API call shown earlier as well, but the Python for loop here has the potential to be more general.
In fact, you have the entire Python language at your disposal for processing database results, and a little Python can often duplicate or go beyond SQL syntax. For instance, SQL has special aggregate function syntax for computing things such as sums and averages:
>>>curs.execute("select sum(pay), avg(pay) from people where job = 'devel'")
>>>curs.fetchall()
[(150000, 50000.0)]
By shifting the processing to Python, we can sometimes simplify and do more than SQL’s syntax allows (albeit potentially sacrificing any query performance optimizations the database may perform). Computing pay sums and averages with Python can be accomplished with a simple loop:
>>>curs.execute("select name, pay from people where job = 'devel'")
>>>result = curs.fetchall()
>>>result
(('bob', 50000L), ('ann', 40000L), ('kim', 60000L)) >>>tot = 0
>>>for (name, pay) in result: tot += pay
... >>>print('total:', tot, 'average:', tot / len(result))
# use // to truncate total: 150000 average: 50000.0
Or we can use more advanced tools such as comprehensions and generator expressions to calculate sums, averages, maximums, and the like:
>>> print(sum(rec[1] for rec in result))                  # generator expr
150000
>>> print(sum(rec[1] for rec in result) / len(result))
50000.0
>>> print(max(rec[1] for rec in result))
60000
The Python approach is more general, but it doesn’t buy us much until things become more complex. For example, here are a few more advanced comprehensions that collect the names of people whose pay is above and below the average in the query result set:
>>> avg = sum(rec[1] for rec in result) / len(result)
>>> print([rec[0] for rec in result if rec[1] > avg])
['kim']
>>> print([rec[0] for rec in result if rec[1] < avg])
['ann']
We may be able to do some of these kinds of tasks with more advanced SQL techniques such as nested queries, but we eventually reach a complexity threshold where Python’s general-purpose nature makes it attractive and potentially more portable. For comparison, here is the equivalent SQL:
>>>query = ("select name from people where job = 'devel' and "
..."pay > (select avg(pay) from people where job = 'devel')")
>>>curs.execute(query)
>>>curs.fetchall()
[('kim',)] >>>query = ("select name from people where job = 'devel' and "
..."pay < (select avg(pay) from people where job = 'devel')")
>>>curs.execute(query)
>>>curs.fetchall()
[('ann',)]
This isn’t the most complex SQL you’re likely to meet, but beyond this point, SQL can become more involved. Moreover, unlike Python, SQL is limited to database-specific tasks by design. Imagine a query that compares a column’s values to data fetched off the Web or from a user in a GUI—simple with Python’s Internet and GUI support, but well beyond the scope of a special-purpose language such as SQL. By combining Python and SQL, you get the best of both and can choose which is best suited to your goals.
With Python, you also have access to utilities you’ve already coded: your database tool set is arbitrarily extensible with functions, modules, and classes. To illustrate, here are some of the same operations coded in a more mnemonic fashion with the dictionary-record module we wrote earlier:
>>> from makedicts import makedicts
>>> recs = makedicts(curs, "select * from people where job = 'devel'")
>>> print(len(recs), recs[0])
3 {'pay': 50000, 'job': 'devel', 'name': 'bob'}

>>> print([rec['name'] for rec in recs])
['bob', 'ann', 'kim']
>>> print(sum(rec['pay'] for rec in recs))
150000

>>> avg = sum(rec['pay'] for rec in recs) / len(recs)
>>> print([rec['name'] for rec in recs if rec['pay'] > avg])
['kim']
>>> print([rec['name'] for rec in recs if rec['pay'] >= avg])
['bob', 'kim']
Similarly, Python’s set object type provides operations such as intersection, union, and difference, which can serve as alternatives to other SQL database operations (in the interest of space, we’ll largely leave their exploration as a suggested side trip, though a brief taste follows). For more advanced database extensions, see the SQL-related tools available for Python in the third-party domain. For example, a variety of packages add an OOP flavor to the DB API—the ORMs we’ll explore near the end of this chapter.
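As a brief hedged taste of that side trip (assuming the curs cursor and the data.txt rows loaded above; the expected set contents are shown in comments since display order varies):

curs.execute("select name from people where job = 'devel'")
devs = {row[0] for row in curs.fetchall()}          # {'bob', 'ann', 'kim'}
curs.execute("select name from people where pay >= 60000")
rich = {row[0] for row in curs.fetchall()}          # {'sue', 'kim'}
print(devs & rich)                                  # intersection: {'kim'}
print(devs | rich)                                  # union of both sets
print(devs - rich)                                  # difference: {'bob', 'ann'}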
At this point in our SQL DB API tour, we’ve started to stretch the interactive prompt to its breaking point—we wind up retyping the same boilerplate code again every time we start a session and every time we run a test. Moreover, the code we’re writing is substantial enough to be reused in other programs. Let’s wrap up by transforming our code into reusable scripts that automate tasks and support reuse.
To illustrate more of the power of the Python/SQL mix, this section presents a handful of utility scripts that perform common tasks—the sorts of things you’d otherwise have to recode often during development. As an added bonus, most of these files are both command-line utilities and modules of functions that can be imported and called from other programs. Most of the scripts in this section also allow a database file name to be passed in on the command line; this allows us to use different databases for different purposes during development—changes in one won’t impact others.
Let’s take a quick look at code first, before seeing it in action; feel free to skip ahead to correlate the code here with its behavior. As a first (and less than ideal) step, Example 17-8 shows a simple way to script-ify the table-loading logic of the prior section.
""" load table from comma-delimited text file; equivalent to this nonportable SQL: load data local infile 'data.txt' into table people fields terminated by ','" """ import sqlite3 conn = sqlite3.connect('dbase1') curs = conn.cursor() file = open('data.txt') rows = [line.rstrip().split(',') for line in file] for rec in rows: curs.execute('insert into people values (?, ?, ?)', rec) conn.commit() # commit changes now, if db supports transactions conn.close() # close, __del__ call rollback if changes not committed yet
As is, Example 17-8 is a top-level script geared toward one particular case. It’s hardly any extra work to generalize this into a function that can be imported and used in a variety of scenarios, as in Example 17-9—a much more widely useful module and command-line script.
""" load table from comma-delimited text file: reusable/generalized version Importable functions; command-line usage: loaddb.py dbfile? datafile? table? """ def login(dbfile): import sqlite3 conn = sqlite3.connect(dbfile) # create or open db file curs = conn.cursor() return conn, curs def loaddb(curs, table, datafile, conn=None, verbose=True): file = open(datafile) # x,x,x x,x,x rows = [line.rstrip().split(',') for line in file] # [[x,x,x], [x,x,x]] rows = [str(tuple(rec)) for rec in rows] # ["(x,x,x)", "(x,x,x)"] for recstr in rows: curs.execute('insert into ' + table + ' values ' + recstr) if conn: conn.commit() if verbose: print(len(rows), 'rows loaded') if __name__ == '__main__': import sys dbfile, datafile, table = 'dbase1', 'data.txt', 'people' if len(sys.argv) > 1: dbfile = sys.argv[1] if len(sys.argv) > 2: datafile = sys.argv[2] if len(sys.argv) > 3: table = sys.argv[3] conn, curs = login(dbfile) loaddb(curs, table, datafile, conn)
Notice the way this code uses two list comprehensions to build a string of record values for the insert statement (see its comments for the transforms applied). We could also use an executemany call as we did earlier, but we want to be general and avoid hardcoding the fields insertion template—this function might be used for tables with any number of columns.
This file also defines a login function to automate the initial connection calls—after retyping this four-command sequence enough times, it seemed a prime candidate for a function. In addition, this reduces code redundancy; in the future, such logic need only be changed in a single location if we change database systems, as long as the login function is used everywhere.
Once we load data, we probably will want to display it. Example 17-10 allows us to display results as we go—it prints an entire table with either a simple display (which could be parsed by other tools) or a formatted display (generated with the dictionary-record utility we wrote earlier). Notice how it computes the maximum field-name size for alignment with a generator expression; the size is passed in to a string formatting expression by specifying an asterisk (*) for the field size in the format string.
""" display table contents as raw tuples, or formatted with field names command-line usage: dumpdb.py dbname? table? [-] (dash=formatted display) """ def showformat(recs, sept=('-' * 40)): print(len(recs), 'records') print(sept) for rec in recs: maxkey = max(len(key) for key in rec) # max key len for key in rec: # or: align print('%-*s => %s' % (maxkey, key, rec[key])) # -ljust, *len print(sept) def dumpdb(cursor, table, format=True): if not format: cursor.execute('select * from ' + table) while True: rec = cursor.fetchone() if not rec: break print(rec) else: from makedicts import makedicts recs = makedicts(cursor, 'select * from ' + table) showformat(recs) if __name__ == '__main__': import sys dbname, format, table = 'dbase1', False, 'people' cmdargs = sys.argv[1:] if '-' in cmdargs: # format if '-' in cmdline args format = True # dbname if other cmdline arg cmdargs.remove('-') if cmdargs: dbname = cmdargs.pop(0) if cmdargs: table = cmdargs[0] from loaddb import login conn, curs = login(dbname) dumpdb(curs, table, format)
While we’re at it, let’s code some utility scripts to initialize and erase the database, so we do not have to type these by hand at the interactive prompt again every time we want to start from scratch. Example 17-11 completely deletes and re-creates the database, to reset it to an initial state (we did this manually at the start of the tutorial).
""" physically delete and re-create database files usage: makedb.py dbname? tablename? """ import sys if input('Are you sure?').lower() not in ('y', 'yes'): sys.exit() dbname = (len(sys.argv) > 1 and sys.argv[1]) or 'dbase1' table = (len(sys.argv) > 2 and sys.argv[2]) or 'people' from loaddb import login conn, curs = login(dbname) try: curs.execute('drop table ' + table) except: print('database table did not exist') command = 'create table %s (name char(30), job char(10), pay int(4))' % table curs.execute(command) conn.commit() # commit may be optional here print('made', dbname, table)
Next, the clear script in Example 17-12 deletes all rows in the table, instead of dropping and re-creating them entirely. For testing purposes, either approach is sufficient. Minor caveat: the rowcount attribute doesn’t always reflect the number of rows deleted in SQLite; see its library manual entry for details.
""" delete all rows in table, but don't drop the table or database it is in usage: cleardb.py dbname? tablename? """ import sys if input('Are you sure?').lower() not in ('y', 'yes'): sys.exit() dbname = sys.argv[1] if len(sys.argv) > 1 else 'dbase1' table = sys.argv[2] if len(sys.argv) > 2 else 'people' from loaddb import login conn, curs = login(dbname) curs.execute('delete from ' + table) #print(curs.rowcount, 'records deleted') # conn closed by its __del__ conn.commit() # else rows not really deleted
Finally, Example 17-13 provides a command-line tool that runs a query and prints its result table in formatted style. There’s not much to this script; because we’ve automated most of its tasks already, this is largely just a combination of existing tools. Such is the power of code reuse in Python.
""" run a query string, display formatted result table example: querydb.py dbase1 "select name, job from people where pay > 50000" """ import sys database, querystr = 'dbase1', 'select * from people' if len(sys.argv) > 1: database = sys.argv[1] if len(sys.argv) > 2: querystr = sys.argv[2] from makedicts import makedicts from dumpdb import showformat from loaddb import login conn, curs = login(database) rows = makedicts(curs, querystr) showformat(rows)
Last but not least, here is a log of a session that makes use of these scripts in command-line mode, to illustrate their operation. Most of the files also have functions that can be imported and called from a different program; the scripts simply map command-line arguments to the functions’ arguments when run standalone. The first thing we do is initialize a testing database and load its table from a text file:
...\PP4E\Dbase\Sql> makedb.py testdb
Are you sure?y
database table did not exist
made testdb people

...\PP4E\Dbase\Sql> loaddb.py testdb data2.txt
3 rows loaded
Next, let’s check our work with the dump utility (use a - argument to force a formatted display):
...\PP4E\Dbase\Sql> dumpdb.py testdb
('bob', 'developer', 80000)
('sue', 'music', 90000)
('ann', 'manager', 80000)

...\PP4E\Dbase\Sql> dumpdb.py testdb -
3 records
----------------------------------------
pay  => 80000
job  => developer
name => bob
----------------------------------------
pay  => 90000
job  => music
name => sue
----------------------------------------
pay  => 80000
job  => manager
name => ann
----------------------------------------
The dump script is an exhaustive display; to be more specific about which records to view, use the query script and pass in a query string on the command line (the command lines are split here to fit in this book):
...\PP4E\Dbase\Sql> querydb.py testdb
                      "select name, job from people where pay = 80000"
2 records
----------------------------------------
job  => developer
name => bob
----------------------------------------
job  => manager
name => ann
----------------------------------------

...\PP4E\Dbase\Sql> querydb.py testdb
                      "select * from people where name = 'sue'"
1 records
----------------------------------------
pay  => 90000
job  => music
name => sue
----------------------------------------
Now, let’s erase and start again with a new data set file. The clear script erases all records but doesn’t reinitialize the database completely:
...\PP4E\Dbase\Sql> cleardb.py testdb
Are you sure?y

...\PP4E\Dbase\Sql> dumpdb.py testdb -
0 records
----------------------------------------

...\PP4E\Dbase\Sql> loaddb.py testdb data.txt
5 rows loaded

...\PP4E\Dbase\Sql> dumpdb.py testdb
('bob', 'devel', 50000)
('sue', 'music', 60000)
('ann', 'devel', 40000)
('tim', 'admin', 30000)
('kim', 'devel', 60000)
In closing, here are three queries in action on this new data set: they fetch names of developers, jobs that pay above an amount, and records with a given pay level sorted by job. We could run these at the Python interactive prompt, of course, but we’re getting a lot of setup and boilerplate code for free here:
...\PP4E\Dbase\Sql> querydb.py testdb
                      "select name from people where job = 'devel'"
3 records
----------------------------------------
name => bob
----------------------------------------
name => ann
----------------------------------------
name => kim
----------------------------------------

...\PP4E\Dbase\Sql> querydb.py testdb
                      "select job from people where pay >= 60000"
2 records
----------------------------------------
job => music
----------------------------------------
job => devel
----------------------------------------

...\PP4E\Dbase\Sql> querydb.py testdb
                      "select * from people where pay >= 60000 order by job"
2 records
----------------------------------------
pay  => 60000
job  => devel
name => kim
----------------------------------------
pay  => 60000
job  => music
name => sue
----------------------------------------
Before we move on, some context: the scripts in this section illustrate the benefits of code reuse, accomplish their purpose (which was partly demonstrating the SQL API), and serve as a model for canned database utilities. But they are still not as general as they could be; support for sorting options in the dump script, for example, may be a useful extension. Although we could generalize to support more options, at some point we may need to revert to typing SQL commands in a client—part of the reason SQL is a language is because it must support so much generality. Further extensions to these scripts are left as exercises. Change this code as you like; it’s Python, after all.
Although the examples we’ve seen in this section are simple, their techniques scale up to much more realistic databases and contexts. The websites we studied in the prior part of the book, for instance, can make use of SQL-based systems such as MySQL to store page state information as well as long-lived client information. Because MySQL (among others) supports both large databases and concurrent updates, it’s a natural for website implementation.
There is more to database interfaces than we’ve seen, but additional API documentation is readily available on the Web. To find the full database API specification, search the Web for “Python Database API.” You’ll find the formal API definition—really just a text file describing the PEP (the Python Enhancement Proposal) under which the API was hashed out.
Perhaps the best resource for additional information about database extensions today is the home page of the Python database SIG. Go to http://www.python.org, click on the Community and SIGs links there, and navigate to the database group’s page, or run a search. There, you’ll find API documentation (this is where it is officially maintained), links to database vendor–specific extension modules, and more. And as always, see the PyPI website and search the Web at large for related third-party tools and extensions.
In this chapter, we’ve seen OODBs that store native Python objects persistently, as well as SQL databases that store information in tables. It turns out that there is another class of system that attempts to bridge the object and table worlds, which I’ve hinted at earlier in this chapter: ORMs graft the Python class model onto the tables of relational databases. They combine the power of relational database systems with the simplicity of Python class-based syntax—you don’t need to forgo SQL-based databases, but you can still store data that seems like Python objects to your scripts.
Today, there are two leading open source third-party systems that implement this mapping: SQLObject and SQLAlchemy. Both are fairly complex systems that we cannot do full justice to in this text, and you’re best off researching their documentation on the Web for the full story (there are also dedicated books covering SQLAlchemy today). Moreover, neither is completely Python 3.X ready as I write these words, so we can’t run live examples with them in this text.
To give you a slightly more concrete flavor of the ORM model, though, here is a very quick look at how you might use it to create and process database records in the SQLObject system. In brief, SQLObject maps:
Python classes to database tables
Python class instances to rows in the table
Python instance attributes to row columns
For example, to create a table, we define it with a class, with class attributes that define columns, and call its creation method (this code is derived from a more complete example at SQLObject’s website):
from sqlobject import *

sqlhub.processConnection = connectionForURI('sqlite:/:memory:')

class Person(SQLObject):                     # class: describes table
    first = StringCol()                      # class attributes: row columns
    mid   = StringCol(length=1, default=None)
    last  = StringCol()

Person.createTable()                         # create a database table
Once created, making an instance automatically inserts a row into the database, and attribute fetches and assignments are automatically mapped to fetches and updates of the corresponding table row’s column:
p = Person(first='Bob', last='Smith')        # new instance: makes and inserts row
p                                            # prints all attributes by name
p.first                                      # attribute: fetches row column
p.mid = 'M'                                  # attribute: updates record
Existing rows/instances may be fetched by methods calls, and we can assign multiple columns/attributes with a single update operation:
p2 = Person.get(1)                           # fetch existing record/instance: p2 is p
p.set(first='Tom', last='Jones')             # update two attributes/fields at once
In addition, we can select by column values by creating a query object and executing it:
ts = Person.select(Person.q.first == 'Tom')     # query: select by column value
list(ts)                                        # run the query: list of instances
tjs = Person.selectBy(first='Tom', last='Jones')  # alternative query form (AND)
Naturally, this barely scratches the surface of the available functionality. Even at this level of complexity, though, this is quite a trick—SQLObject automatically issues all the SQL required to fetch, store, and query the table and rows implied by the Python class syntax here. Again, the net effect allows systems to leverage the power of enterprise-level relational databases, but still use familiar Python class syntax to process stored data in Python scripts.
The code used with the SQLAlchemy ORM is of course very different, but the end result is functionally similar. For more details on ORMs for Python, consult your friendly neighborhood web search engine. You can also learn more about such systems from their roles in some larger web development frameworks; Django, for instance, includes an ORM that is another variation on this theme.
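Although we can’t run SQLAlchemy live here either, a rough sketch may help you compare the two systems. The following is my own approximation, not code from this book’s examples package, of the same Person table rendered in SQLAlchemy’s declarative style; it assumes a reasonably recent SQLAlchemy release (in older releases, declarative_base lives in sqlalchemy.ext.declarative rather than sqlalchemy.orm):

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Person(Base):                             # class: describes table
    __tablename__ = 'person'
    id    = Column(Integer, primary_key=True)   # ORM tables need a primary key
    first = Column(String)                      # class attributes: row columns
    mid   = Column(String(1), default=None)
    last  = Column(String)

engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)                # create the database table

Session = sessionmaker(bind=engine)
session = Session()
session.add(Person(first='Bob', last='Smith'))  # new instance: a pending row
session.commit()                                # issue the INSERT

for person in session.query(Person).filter_by(first='Bob'):
    print(person.first, person.last)            # query: select by column value

As in SQLObject, the class stands for the table and instances stand for rows, but SQLAlchemy makes its unit-of-work session explicit: changes accumulate in the session and are written to the database on commit.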
Instead of going into additional database interface details that are freely available on the Web, I’m going to close out this chapter by directing you to a supplemental example that shows one way to combine the GUI technology we met earlier in the text with the persistence techniques introduced in this chapter. This example is named PyForm—a Python/tkinter GUI designed to let you browse and edit tables of records:
Tables browsed may be shelves, DBM files, in-memory dictionaries, or any other object that looks and feels like a dictionary.
Records within tables browsed can be class instances, simple dictionaries, strings, or any other object that can be translated to and from a dictionary.
Although this example is about GUIs and persistence, it also illustrates Python design techniques. To keep its implementation both simple and type-independent, the PyForm GUI is coded to expect tables to look like dictionaries of dictionaries. To support a variety of table and record types, PyForm relies on separate wrapper classes to translate tables and records to the expected protocol:
At the top table level, the translation is easy—shelves, DBM files, and in-memory dictionaries all have the same key-based interface.
At the nested record level, the GUI is coded to assume that stored items have a dictionary-like interface, too, but classes intercept dictionary operations to make records compatible with the PyForm protocol. Records stored as strings are converted to and from dictionary objects on fetches and stores; records stored as class instances are translated to and from attribute dictionaries. More specialized translations can be added in new table wrapper classes; a minimal sketch of the record-level idea follows this list.
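As that sketch of the record-level idea (my own illustration, not PyForm’s actual source; the class name InstanceRecord is hypothetical), a wrapper can make a stored class instance answer to the dictionary protocol the GUI expects:

class InstanceRecord:
    def __init__(self, obj):
        self.obj = obj                          # the wrapped class instance
    def __getitem__(self, key):                 # rec['name'] fetches obj.name
        return getattr(self.obj, key)
    def __setitem__(self, key, value):          # rec['name'] = v sets obj.name
        setattr(self.obj, key, value)
    def keys(self):                             # field names for the GUI's form
        return list(self.obj.__dict__)

Because the GUI only ever indexes records and asks for their keys, anything wrapped this way can be displayed and edited without the GUI knowing whether it began life as a dictionary, a string, or a class instance.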
The net effect is that PyForm can be used to browse and edit a wide variety of table types, despite its dictionary interface expectations. When PyForm browses shelves and DBM files, table changes made within the GUI are persistent—they are saved in the underlying files. When used to browse a shelve of class instances, PyForm essentially becomes a GUI frontend to a simple object database that is built using standard Python persistence tools. To view and update a shelve of objects with PyForm, for example, code like the following will suffice:
import shelve
from formgui import FormGui                     # after initcast
db = shelve.open('../data/castfile')            # reopen shelve file
FormGui(db).mainloop()                          # browse existing shelve-of-dicts
To view or update a shelve of instances of an imported Actor class, we can use code like this:
from PP4E.Dbase.testdata import Actor
from formgui import FormGui                     # run in TableBrowser dir
from formtable import ShelveOfInstance

testfile = '../data/shelve'                     # external filename
table = ShelveOfInstance(testfile, Actor)       # wrap shelf in Table object
FormGui(table).mainloop()
table.close()                                   # close needed for some dbm
Figure 17-1 captures the scene under Python 3.1 and Windows 7 when viewing a shelve of persistent class instance objects. This PyForm session was kicked off by a command line described in its form table module’s self-test code: formtable.py shelve 1. Omit the 1 (or pass it as 0) to avoid reinitializing the shelve at the start of each session, so that changes are retained.
PyForm’s GUI can also be started from the PyDemos launcher we met in Chapter 10, though it does not save changes persistently in this mode. Run the example on your own computer to get a better feel for its operation. Though not a fully general Python persistent object table viewer, PyForm serves as a simple object database front end.
Because we are short on time and space in this edition, I’m going to omit both the source code for this example and its description here. To study PyForm, see the following directory in the book’s examples package distribution described in the Preface:
C:\...\PP4E\Dbase\TableBrowser
See especially the Documentation subdirectory there, which contains the original PyForm overview material from the third edition in a PDF file. PyForm’s source code files are ported to Python 3.X form, though code in the overview document still shows its 2.X third edition roots. For the purposes of the published portions of this book, let’s move on to the next chapter and our next tools topic: data structure implementations.
[69] Interestingly, Python avoids calling the class to re-create a pickled instance and instead simply makes a class object generically, inserts instance attributes, and sets the instance’s __class__ pointer to the original class directly. This avoids the need for defaults, but it also means that class __init__ constructors are no longer called as objects are unpickled, unless you provide extra methods to force the call. See the library manual for more details, and see the pickle module’s source code (pickle.py in the source library) if you’re curious about how this works. Or see the PyForm example later in this chapter—it does something very similar with __class__ links to build an instance object from a class and dictionary of attributes, without calling the class’s __init__ constructor. This makes constructor argument defaults unnecessary in classes used for records browsed by PyForm, but it’s the same idea.
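As a rough sketch of the trick this note describes (my own illustration, not pickle’s actual implementation, which supports several pickling protocols), an instance can be rebuilt from a class and a saved attribute dictionary without ever running __init__:

class Actor:
    def __init__(self, name, job):              # normally required arguments
        self.name, self.job = name, job

attrs = {'name': 'Bob', 'job': 'manager'}       # attribute dictionary saved earlier
obj = object.__new__(Actor)                     # bare instance: class attached, no __init__
obj.__dict__.update(attrs)                      # insert instance attributes
print(obj.name, obj.job)                        # usable, though __init__ never ran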