It is a capital mistake to theorize before one has data.
Arthur Conan Doyle
An active program accesses data that is stored in Random Access Memory, or RAM. RAM is very fast, but it is expensive and requires a constant supply of power; if the power goes out, all the data in memory is lost. Disk drives are slower than RAM but have more capacity, cost less, and retain data even after someone trips over the power cord. Thus, a huge amount of effort in computer systems has been devoted to making the best tradeoffs between storing data on disk and RAM. As programmers, we need persistence: storing and retrieving data using nonvolatile media such as disks.
This chapter is all about the different flavors of data storage, each optimized for different purposes: flat files, structured files, and databases. File operations other than input and output are covered in “Files”.
This is also the first chapter to show examples
of nonstandard Python modules;
that is, Python code apart from the standard library.
You’ll install them by using the
pip
command,
which is painless.
There are more details on its usage in Appendix D.
The simplest kind of persistence is a plain old file, sometimes called a flat file. This is just a sequence of bytes stored under a filename. You read from a file into memory and write from memory to a file. Python makes these jobs easy. Its file operations were modeled on the familiar and popular Unix equivalents.
Before reading or writing a file, you need to open it:
fileobj
= open(filename
,mode
)
Here’s a brief explanation of the pieces of this call:
fileobj
is the file object returned by open()
filename
is the string name of the file
mode
is a string indicating the file’s type and what you want to do with it
The first letter of mode
indicates
the operation:
r
means read.
w
means write. If the file doesn’t exist, it’s created. If the file does exist, it’s overwritten.
x
means write, but only if the file does not already exist.
a
means append (write after the end) if the file exists.
The second letter of mode
is the file’s type:
t
(or nothing) means text.
b
means binary.
After opening the file, you call functions to read or write data; these will be shown in the examples that follow.
Last, you need to close the file.
Let’s create a file from a Python string in one program and then read it back in the next.
For some reason, there aren’t many limericks about special relativity. This one will just have to do for our data source:
>>>
poem
=
'''There was a young lady named Bright,
...
Whose speed was far faster than light;
...
She started one day
...
In a relative way,
...
And returned on the previous night.'''
>>>
len
(
poem
)
150
The following code writes the entire poem to the file 'relativity'
in one call:
>>>
fout
=
open
(
'relativity'
,
'wt'
)
>>>
fout
.
write
(
poem
)
150
>>>
fout
.
close
()
The write()
function returns the number of bytes written.
It does not add any spaces or newlines, as print()
does.
You can also print()
to a text file:
>>>
fout
=
open
(
'relativity'
,
'wt'
)
>>>
(
poem
,
file
=
fout
)
>>>
fout
.
close
()
This brings up the question:
should I use write()
or print()
?
By default,
print()
adds a space after each argument and
a newline at the end.
In the previous example,
it appended a newline to the relativity
file.
To make print()
work like write()
,
pass the following two arguments:
sep
(separator, which defaults to a space, ' '
)
end
(end string, which defaults to a newline, '
'
)
print()
uses the defaults unless you pass something else.
We’ll pass
empty strings to suppress all of the fussiness normally added by print()
:
>>>
fout
=
open
(
'relativity'
,
'wt'
)
>>>
(
poem
,
file
=
fout
,
sep
=
''
,
end
=
''
)
>>>
fout
.
close
()
If you have a large source string, you can also write chunks until the source is done:
>>>
fout
=
open
(
'relativity'
,
'wt'
)
>>>
size
=
len
(
poem
)
>>>
offset
=
0
>>>
chunk
=
100
>>>
while
True
:
...
if
offset
>
size
:
...
break
...
fout
.
write
(
poem
[
offset
:
offset
+
chunk
])
...
offset
+=
chunk
...
100
50
>>>
fout
.
close
()
This wrote 100 characters on the first try and the last 50 characters on the next.
If the relativity
file is precious to us,
let’s see if using mode x
really protects us from overwriting it:
>>>
fout
=
open
(
'relativity'
,
'xt'
)
Traceback (most recent call last):
File"<stdin>"
, line1
, in<module>
FileExistsError
:[Errno 17] File exists: 'relativity'
You can use this with an exception handler:
>>>
try
:
...
fout
=
open
(
'relativity'
,
'xt'
)]
...
fout
.
write
(
'stomp stomp stomp'
)
...
except
FileExistsError
:
...
(
'relativity already exists!. That was a close one.'
)
...
relativity already exists!. That was a close one.
You can call read()
with no arguments to slurp up the entire file at once, as shown in the example that follows.
Be careful when doing this with large files; a gigabyte file will consume a gigabyte of memory.
>>>
fin
=
open
(
'relativity'
,
'rt'
)
>>>
poem
=
fin
.
read
()
>>>
fin
.
close
()
>>>
len
(
poem
)
150
You can provide a maximum character count to limit
how much read()
returns at one time.
Let’s read 100 characters at a time
and append each chunk to a poem
string to rebuild the original:
>>>
poem
=
''
>>>
fin
=
open
(
'relativity'
,
'rt'
)
>>>
chunk
=
100
>>>
while
True
:
...
fragment
=
fin
.
read
(
chunk
)
...
if
not
fragment
:
...
break
...
poem
+=
fragment
...
>>>
fin
.
close
()
>>>
len
(
poem
)
150
After you’ve read all the way to the end,
further calls to read()
will return
an empty string (''
), which is treated as
False
in if not fragment
.
This breaks out of the while True
loop.
You can also read the file a line at a time by using readline()
.
In this next example, we’ll append each line to the poem
string to rebuild the original:
>>>
poem
=
''
>>>
fin
=
open
(
'relativity'
,
'rt'
)
>>>
while
True
:
...
line
=
fin
.
readline
()
...
if
not
line
:
...
break
...
poem
+=
line
...
>>>
fin
.
close
()
>>>
len
(
poem
)
150
For a text file, even a blank line has a length of one
(the newline character), and is evaluated as True
.
When the file has been read,
readline()
(like read()
) also returns an empty
string, which is also evaluated as False
.
The easiest way to read a text file is by using an iterator. This returns one line at a time. It’s similar to the previous example, but with less code:
>>>
poem
=
''
>>>
fin
=
open
(
'relativity'
,
'rt'
)
>>>
for
line
in
fin
:
...
poem
+=
line
...
>>>
fin
.
close
()
>>>
len
(
poem
)
150
All of the preceding examples eventually built the single string poem
.
The readlines()
call reads a line at a time,
and returns a list of one-line strings:
>>>
fin
=
open
(
'relativity'
,
'rt'
)
>>>
lines
=
fin
.
readlines
()
>>>
fin
.
close
()
>>>
(
len
(
lines
),
'lines read'
)
5 lines read
>>>
for
line
in
lines
:
...
(
line
,
end
=
''
)
...
There was a young lady named Bright,
Whose speed was far faster than light;
She started one day
In a relative way,
And returned on the previous night.>>>
We told print()
to suppress the automatic newlines
because the first four lines already had them.
The last line did not, causing the interactive prompt >>>
to occur right after the last line.
If you include a 'b'
in the mode string,
the file is opened in binary mode.
In this case, you read and write bytes
instead of a string.
We don’t have a binary poem lying around, so we’ll just generate the 256 byte values from 0 to 255:
>>>
bdata
=
bytes
(
range
(
0
,
256
))
>>>
len
(
bdata
)
256
Open the file for writing in binary mode and write all the data at once:
>>>
fout
=
open
(
'bfile'
,
'wb'
)
>>>
fout
.
write
(
bdata
)
256
>>>
fout
.
close
()
Again, write()
returns the number of bytes written.
As with text, you can write binary data in chunks:
>>>
fout
=
open
(
'bfile'
,
'wb'
)
>>>
size
=
len
(
bdata
)
>>>
offset
=
0
>>>
chunk
=
100
>>>
while
True
:
...
if
offset
>
size
:
...
break
...
fout
.
write
(
bdata
[
offset
:
offset
+
chunk
])
...
offset
+=
chunk
...
100
100
56
>>>
fout
.
close
()
If you forget to close a file that you’ve opened, it will be closed by Python after it’s no longer referenced. This means that if you open a file within a function and don’t close it explicitly, it will be closed automatically when the function ends. But you might have opened the file in a long-running function or the main section of the program. The file should be closed to force any remaining writes to be completed.
Python has context managers
to clean up things such as open files.
You use the form with
expression
as
variable
:
>>>
with
open
(
'relativity'
,
'wt'
)
as
fout
:
...
fout
.
write
(
poem
)
...
That’s it. After the block of code under the context manager (in this case, one line) completes (normally or by a raised exception), the file is closed automatically.
As you read and write,
Python keeps track of where you are in the file.
The tell()
function returns your current offset from the beginning of the file, in bytes.
The seek()
function lets you jump to another byte offset in the file.
This means that you don’t have to read every byte in a file to read
the last one; you can seek()
to the last one and just read one byte.
For this example, use the 256-byte binary file 'bfile'
that you wrote earlier:
>>>
fin
=
open
(
'bfile'
,
'rb'
)
>>>
fin
.
tell
()
0
Use seek()
to one byte before the end of the file:
>>>
fin
.
seek
(
255
)
255
Read until the end of the file:
>>>
bdata
=
fin
.
read
()
>>>
len
(
bdata
)
1
>>>
bdata
[
0
]
255
seek()
also returns the current offset.
You can call seek()
with a second argument:
seek(
offset
, origin
)
:
If origin
is 0
(the default), go offset
bytes from the start
If origin
is 1
, go offset
bytes from the current position
If origin
is 2
, go offset
bytes relative to the end
These values are also defined in the standard os
module:
>>>
import
os
>>>
os
.
SEEK_SET
0
>>>
os
.
SEEK_CUR
1
>>>
os
.
SEEK_END
2
So, we could have read the last byte in different ways:
>>>
fin
=
open
(
'bfile'
,
'rb'
)
One byte before the end of the file:
>>>
fin
.
seek
(
-
1
,
2
)
255
>>>
fin
.
tell
()
255
Read until the end of the file:
>>>
bdata
=
fin
.
read
()
>>>
len
(
bdata
)
1
>>>
bdata
[
0
]
255
You don’t need to call tell()
for seek()
to work. I just wanted to show that they both report the same offset.
Here’s an example of seeking from the current position in the file:
>>>
fin
=
open
(
'bfile'
,
'rb'
)
This next example ends up two bytes before the end of the file:
>>>
fin
.
seek
(
254
,
0
)
254
>>>
fin
.
tell
()
254
Now, go forward one byte:
>>>
fin
.
seek
(
1
,
1
)
255
>>>
fin
.
tell
()
255
Finally, read until the end of the file:
>>>
bdata
=
fin
.
read
()
>>>
len
(
bdata
)
1
>>>
bdata
[
0
]
255
These functions are most useful for binary files. You can use them with text files, but unless the file is ASCII (one byte per character), you would have a hard time calculating offsets. These would depend on the text encoding, and the most popular encoding (UTF-8) uses varying numbers of bytes per character.
With simple text files, the only level of organization is the line. Sometimes, you want more structure than that. You might want to save data for your program to use later, or send data to another program.
There are many formats, and here’s how you can distinguish them:
A separator, or delimiter, character
like tab (' '
),
comma (','
), or vertical bar ('|'
). This is an example of the
comma-separated values (CSV) format.
'<'
and '>'
around tags. Examples include XML and HTML.
Punctuation. An example is JavaScript Object Notation (JSON).
Indentation. An example is YAML (which depending on the source you use means “YAML Ain’t Markup Language;” you’ll need to research that one yourself).
Miscellaneous, such as configuration files for programs.
Each of these structured file formats can be read and written by at least one Python module.
Delimited files are often used as an exchange format
for spreadsheets and databases.
You could read CSV files manually,
a line at a time,
splitting each line into fields at comma separators,
and adding the results to data structures such as lists
and dictionaries.
But it’s better to use the standard csv
module, because parsing these files
can get more complicated than you think.
Some have alternate delimiters besides a comma: '|'
and ' '
(tab) are common.
Some have escape sequences. If the delimiter character can occur within a field, the entire field might be surrounded by quote characters or preceded by some escape character.
Files have different line-ending characters. Unix uses '
'
,
Microsoft uses '
'
, and Apple used to use '
'
but now uses '
'
.
There can be column names in the first line.
First, we’ll see how to read and write a list of rows, each containing a list of columns:
>>>
import
csv
>>>
villains
=
[
...
[
'Doctor'
,
'No'
],
...
[
'Rosa'
,
'Klebb'
],
...
[
'Mister'
,
'Big'
],
...
[
'Auric'
,
'Goldfinger'
],
...
[
'Ernst'
,
'Blofeld'
],
...
]
>>>
with
open
(
'villains'
,
'wt'
)
as
fout
:
# a context manager
...
csvout
=
csv
.
writer
(
fout
)
...
csvout
.
writerows
(
villains
)
This creates the file villains
with these lines:
Doctor,No Rosa,Klebb Mister,Big Auric,Goldfinger Ernst,Blofeld
Now, we’ll try to read it back in:
>>>
import
csv
>>>
with
open
(
'villains'
,
'rt'
)
as
fin
:
# context manager
...
cin
=
csv
.
reader
(
fin
)
...
villains
=
[
row
for
row
in
cin
]
# This uses a list comprehension
...
>>>
(
villains
)
[['Doctor', 'No'], ['Rosa', 'Klebb'], ['Mister', 'Big'],
['Auric', 'Goldfinger'], ['Ernst', 'Blofeld']]
Take a moment to think about list comprehensions (feel free to go to “Comprehensions” and brush up on that syntax).
We took advantage of
the structure created by the reader()
function.
It obligingly created rows
in the cin
object that we can extract in a for
loop.
Using reader()
and writer()
with their default options,
the columns are separated by commas
and the rows by line feeds.
The data can be a list of dictionaries
rather than a list of lists.
Let’s read the villains
file again,
this time using the new DictReader()
function and specifying the column names:
>>>
import
csv
>>>
with
open
(
'villains'
,
'rt'
)
as
fin
:
...
cin
=
csv
.
DictReader
(
fin
,
fieldnames
=
[
'first'
,
'last'
])
...
villains
=
[
row
for
row
in
cin
]
...
>>>
(
villains
)
[{'last': 'No', 'first': 'Doctor'},
{'last': 'Klebb', 'first': 'Rosa'},
{'last': 'Big', 'first': 'Mister'},
{'last': 'Goldfinger', 'first': 'Auric'},
{'last': 'Blofeld', 'first': 'Ernst'}]
Let’s rewrite the CSV file by using the new
DictWriter()
function.
We’ll also call writeheader()
to write an initial line of column names
to the CSV file:
import csv
villains = [
{'first': 'Doctor', 'last': 'No'},
{'first': 'Rosa', 'last': 'Klebb'},
{'first': 'Mister', 'last': 'Big'},
{'first': 'Auric', 'last': 'Goldfinger'},
{'first': 'Ernst', 'last': 'Blofeld'},
]
with open('villains', 'wt') as fout:
cout = csv.DictWriter(fout, ['first', 'last'])
cout.writeheader()
cout.writerows(villains)
That creates a villains
file with a header line:
first,last Doctor,No Rosa,Klebb Mister,Big Auric,Goldfinger Ernst,Blofeld
Now we’ll read it back.
By omitting the fieldnames
argument
in the DictReader()
call, we instruct it to
use the values in the first line of the file
(first,last
)
as column labels and matching
dictionary keys:
>>>
import
csv
>>>
with
open
(
'villains'
,
'rt'
)
as
fin
:
...
cin
=
csv
.
DictReader
(
fin
)
...
villains
=
[
row
for
row
in
cin
]
...
>>>
(
villains
)
[{'last': 'No', 'first': 'Doctor'},
{'last': 'Klebb', 'first': 'Rosa'},
{'last': 'Big', 'first': 'Mister'},
{'last': 'Goldfinger', 'first': 'Auric'},
{'last': 'Blofeld', 'first': 'Ernst'}]
Delimited files convey only two dimensions: rows (lines) and columns (fields within a line). If you want to exchange data structures among programs, you need a way to encode hierarchies, sequences, sets, and other structures as text.
XML is the most prominent markup format that suits the bill. It uses tags to delimit data, as in this sample menu.xml file:
<?xml version="1.0"?> <menu> <breakfast hours="7-11"> <item price="$6.00">breakfast burritos</item> <item price="$4.00">pancakes</item> </breakfast> <lunch hours="11-3"> <item price="$5.00">hamburger</item> </lunch> <dinner hours="3-10"> <item price="8.00">spaghetti</item> </dinner> </menu>
Following are a few important characteristics of XML:
Tags begin with a <
character. The tags in this sample were menu
, breakfast
, lunch
, dinner
, and item
.
Whitespace is ignored.
Usually a start tag such as <menu>
is followed by other content and then a final matching
end tag such as </menu>
.
Tags can nest within other tags to any level. In this example, item
tags are children of the breakfast
, lunch
,
and dinner
tags;
they, in turn, are children of menu
.
Optional attributes can occur within the start tag. In this example, price
is an attribute of item
.
Tags can contain values. In this example, each item
has a value,
such as pancakes
for the second breakfast item.
If a tag named thing
has no values or children,
it can be expressed as the single tag by including a forward slash just before the closing angle bracket, such as <thing/>
, rather than a start and end tag, like
<thing></thing>
.
The choice of where to put data—attributes, values,
child tags—is somewhat arbitrary. For instance,
we could have written
the last item
tag as
<item price="$8.00" food="spaghetti"/>
.
XML is often used for data feeds and messages, and has subformats like RSS and Atom. Some industries have many specialized XML formats, such as the finance field.
XML’s über-flexibility has inspired multiple Python libraries that differ in approach and capabilities.
The simplest way to parse XML
in Python is by using ElementTree
.
Here’s a little program to parse
the menu.xml file and print
some tags and attributes:
>>>
import
xml.etree.ElementTree
as
et
>>>
tree
=
et
.
ElementTree
(
file
=
'menu.xml'
)
>>>
root
=
tree
.
getroot
()
>>>
root
.
tag
'menu'
>>>
for
child
in
root
:
...
(
'tag:'
,
child
.
tag
,
'attributes:'
,
child
.
attrib
)
...
for
grandchild
in
child
:
...
(
'
tag:'
,
grandchild
.
tag
,
'attributes:'
,
grandchild
.
attrib
)
...
tag: breakfast attributes: {'hours': '7-11'}
tag: item attributes: {'price': '$6.00'}
tag: item attributes: {'price': '$4.00'}
tag: lunch attributes: {'hours': '11-3'}
tag: item attributes: {'price': '$5.00'}
tag: dinner attributes: {'hours': '3-10'}
tag: item attributes: {'price': '8.00'}
>>>
len
(
root
)
# number of menu sections
3
>>>
len
(
root
[
0
])
# number of breakfast items
2
For each element in the nested lists,
tag
is the tag string
and attrib
is a dictionary of its attributes.
ElementTree
has many other ways of searching
XML-derived data,
modifying it,
and even writing XML files.
The ElementTree
documentation
has the details.
Other standard Python XML libraries include:
xml.dom
The Document Object Model (DOM), familiar to JavaScript developers, represents Web documents as hierarchical structures. This module loads the entire XML file into memory and lets you access all the pieces equally.
xml.sax
Simple API for XML, or SAX, parses XML on the fly, so it does not have to load everything into memory at once. Therefore, it can be a good choice if you need to process very large streams of XML.
Enormous amounts of data are saved as Hypertext Markup Language (HTML), the basic document format of the Web. The problem is so much of it doesn’t follow the HTML rules, which can make it difficult to parse. Also, much of HTML is intended more to format output than interchange data. Because this chapter is intended to describe fairly well-defined data formats, I have separated out the discussion about HTML to Chapter 9.
JavaScript Object Notation (JSON) has become a very popular data interchange format, beyond its JavaScript origins. The JSON format is a subset of JavaScript, and often legal Python syntax as well. Its close fit to Python makes it a good choice for data interchange among programs. You’ll see many examples of JSON for web development in Chapter 9.
Unlike the variety of XML modules,
there’s one main JSON module,
with the unforgettable name json
.
This program encodes (dumps) data to a JSON string
and decodes (loads) a JSON string back to data.
In this next example, let’s build a Python data structure containing
the data from the earlier XML example:
>>>
menu
=
...
{
...
"breakfast"
:
{
...
"hours"
:
"7-11"
,
...
"items"
:
{
...
"breakfast burritos"
:
"$6.00"
,
...
"pancakes"
:
"$4.00"
...
}
...
},
...
"lunch"
:
{
...
"hours"
:
"11-3"
,
...
"items"
:
{
...
"hamburger"
:
"$5.00"
...
}
...
},
...
"dinner"
:
{
...
"hours"
:
"3-10"
,
...
"items"
:
{
...
"spaghetti"
:
"$8.00"
...
}
...
}
...
}
.
Next, encode the data structure (menu
) to a JSON string
(menu_json
) by using dumps()
:
>>>
import
json
>>>
menu_json
=
json
.
dumps
(
menu
)
>>>
menu_json
'{"dinner": {"items": {"spaghetti": "$8.00"}, "hours": "3-10"},
"lunch": {"items": {"hamburger": "$5.00"}, "hours": "11-3"},
"breakfast": {"items": {"breakfast burritos": "$6.00", "pancakes":
"$4.00"}, "hours": "7-11"}}'
And now, let’s turn the JSON string menu_json
back into a Python
data structure (menu2
) by using loads()
:
>>>
menu2
=
json
.
loads
(
menu_json
)
>>>
menu2
{'breakfast': {'items': {'breakfast burritos': '$6.00', 'pancakes':
'$4.00'}, 'hours': '7-11'}, 'lunch': {'items': {'hamburger': '$5.00'},
'hours': '11-3'}, 'dinner': {'items': {'spaghetti': '$8.00'}, 'hours': '3-10'}}
menu
and menu2
are both dictionaries with the same keys and values.
As always with standard dictionaries, the order in which you get the keys
varies.
You might get an exception while trying to encode
or decode some objects, including objects such as datetime
(covered in detail in “Calendars and Clocks”), as demonstrated here.
>>>
import
datetime
>>>
now
=
datetime
.
datetime
.
utcnow
()
>>>
now
datetime.datetime(2013, 2, 22, 3, 49, 27, 483336)
>>>
json
.
dumps
(
now
)
Traceback (most recent call last):
# ... (deleted stack trace to save trees)
TypeError: datetime.datetime(2013, 2, 22, 3, 49, 27, 483336) is not JSON serializable
>>>
This can happen because the JSON standard does not define
date or time types; it expects you to define how to handle them.
You could convert the datetime
to something
JSON understands, such as a string
or an epoch value (coming in Chapter 10):
>>>
now_str
=
str
(
now
)
>>>
json
.
dumps
(
now_str
)
'"2013-02-22 03:49:27.483336"'
>>>
from
time
import
mktime
>>>
now_epoch
=
int
(
mktime
(
now
.
timetuple
()))
>>>
json
.
dumps
(
now_epoch
)
'1361526567'
If the datetime
value could occur in the middle
of normally converted data types,
it might be annoying to make these special
conversions.
You can modify how JSON is encoded
by using inheritance,
which is described in “Inheritance”.
Python’s JSON documentation
gives an example of this for complex numbers,
which also makes JSON play dead.
Let’s modify it for datetime
:
>>>
class
DTEncoder
(
json
.
JSONEncoder
):
...
def
default
(
self
,
obj
):
...
# isinstance() checks the type of obj
...
if
isinstance
(
obj
,
datetime
.
datetime
):
...
return
int
(
mktime
(
obj
.
timetuple
()))
...
# else it's something the normal decoder knows:
...
return
json
.
JSONEncoder
.
default
(
self
,
obj
)
...
>>>
json
.
dumps
(
now
,
cls
=
DTEncoder
)
'1361526567'
The new class DTEncoder
is a subclass, or child class,
of JSONEncoder
.
We only need to override its default()
method
to add datetime
handling.
Inheritance ensures that everything else will be
handled by the parent class.
The isinstance()
function checks whether the object obj
is of the class datetime.datetime
.
Because everything in Python is an object,
isinstance()
works everywhere:
>>>
type
(
now
)
<class 'datetime.datetime'>
>>>
isinstance
(
now
,
datetime
.
datetime
)
True
>>>
type
(
234
)
<class 'int'>
>>>
isinstance
(
234
,
int
)
True
>>>
type
(
'hey'
)
<class 'str'>
>>>
isinstance
(
'hey'
,
str
)
True
For JSON and other structured text formats,
you can load from a file into data structures
without knowing anything about the structures ahead of time.
Then, you can walk through the structures by
using isinstance()
and type-appropriate methods to examine their values.
For example, if one of the items is a dictionary,
you can extract contents through keys()
, values()
, and items()
.
Similar to JSON, YAML has keys and values,
but handles more data types such as dates and times.
The standard Python library does not yet include
YAML handling, so you need to install a third-party library named
yaml
to manipulate it.
load()
converts a YAML string to Python data,
whereas dump()
does the opposite.
The following YAML file, mcintyre.yaml, contains information on the Canadian poet James McIntyre, including two of his poems:
name: first: James last: McIntyre dates: birth: 1828-05-25 death: 1906-03-31 details: bearded: true themes: [cheese, Canada] books: url: http://www.gutenberg.org/files/36068/36068-h/36068-h.htm poems: - title: 'Motto' text: | Politeness, perseverance and pluck, To their possessor will bring good luck. - title: 'Canadian Charms' text: | Here industry is not in vain, For we have bounteous crops of grain, And you behold on every field Of grass and roots abundant yield, But after all the greatest charm Is the snug home upon the farm, And stone walls now keep cattle warm.
Values such as true
, false
, on
, and off
are converted to Python
Booleans. Integers and strings are converted to their
Python equivalents. Other syntax creates lists and dictionaries:
>>>
import
yaml
>>>
with
open
(
'mcintyre.yaml'
,
'rt'
)
as
fin
:
>>>
text
=
fin
.
read
()
>>>
data
=
yaml
.
load
(
text
)
>>>
data
[
'details'
]
{'themes': ['cheese', 'Canada'], 'bearded': True}
>>>
len
(
data
[
'poems'
])
2
The data structures that are created match those in the YAML file, which in this case are more than one level deep in places. You can get the title of the second poem with this dict/list/dict reference:
>>>
data
[
'poems'
][
1
][
'title'
]
'Canadian Charms'
PyYAML can load Python objects from strings,
and this is dangerous.
Use safe_load()
instead of load()
if you’re
importing YAML that you don’t trust.
Better yet, always use safe_load()
.
Read war is peace
for a description of how unprotected YAML loading
compromised the Ruby on Rails platform.
You can use all the formats described in this chapter to save objects to files and read them back again. It’s possible to exploit this process and cause security problems.
For example, the following XML snippet from the billion laughs Wikipedia page defines ten nested entities, each expanding the lower level ten times for a total expansion of one billion:
<?xml version="1.0"?> <!DOCTYPE lolz [ <!ENTITY lol "lol"> <!ENTITY lol1 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;"> <!ENTITY lol2 "&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;&lol1;"> <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;"> <!ENTITY lol4 "&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;&lol3;"> <!ENTITY lol5 "&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;&lol4;"> <!ENTITY lol6 "&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;&lol5;"> <!ENTITY lol7 "&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;&lol6;"> <!ENTITY lol8 "&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;&lol7;"> <!ENTITY lol9 "&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;&lol8;"> ]> <lolz>&lol9;</lolz>
The bad news: billion laughs would
blow up all of the XML libraries
mentioned in the previous sections.
Defused XML
lists this attack and others,
along with the vulnerability of Python libraries.
The link shows how to change the settings
for many of the libraries to avoid these problems.
Also, you can use the defusedxml
library as a security
frontend for the other libraries:
>>>
# insecure:
>>>
from
xml.etree.ElementTree
import
parse
>>>
et
=
parse
(
xmlfile
)
>>>
# protected:
>>>
from
defusedxml.ElementTree
import
parse
>>>
et
=
parse
(
xmlfile
)
Most programs offer various options or settings. Dynamic ones can be provided as program arguments, but long-lasting ones need to be kept somewhere. The temptation to define your own quick and dirty config file format is strong—but resist it. It often turns out to be dirty, but not so quick. You need to maintain both the writer program and the reader program (sometimes called a parser). There are good alternatives that you can just drop into your program, including those in the previous sections.
Here, we’ll use the standard configparser
module,
which handles Windows-style .ini files.
Such files have sections
of key = value definitions.
Here’s a minimal settings.cfg file:
[english] greeting = Hello [french] greeting = Bonjour [files] home = /usr/local # simple interpolation: bin = %(home)s/bin
Here’s the code to read it into Python data structures:
>>>
import
configparser
>>>
cfg
=
configparser
.
ConfigParser
()
>>>
cfg
.
read
(
'settings.cfg'
)
['settings.cfg']
>>>
cfg
<configparser.ConfigParser object at 0x1006be4d0>
>>>
cfg
[
'french'
]
<Section: french>
>>>
cfg
[
'french'
][
'greeting'
]
'Bonjour'
>>>
cfg
[
'files'
][
'bin'
]
'/usr/local/bin'
Other options are available, including fancier interpolation.
See the configparser
documentation.
If you need deeper nesting than two levels,
try YAML or JSON.
Saving data structures to a file is called serializing.
Formats such as JSON might require some custom converters
to serialize all the data types from a Python program.
Python provides the pickle
module to save and restore any object
in a special binary format.
Remember how JSON lost its mind when
encountering a datetime
object?
Not a problem for pickle
:
>>>
import
pickle
>>>
import
datetime
>>>
now1
=
datetime
.
datetime
.
utcnow
()
>>>
pickled
=
pickle
.
dumps
(
now1
)
>>>
now2
=
pickle
.
loads
(
pickled
)
>>>
now1
datetime.datetime(2014, 6, 22, 23, 24, 19, 195722)
>>>
now2
datetime.datetime(2014, 6, 22, 23, 24, 19, 195722)
pickle
works with your own classes and objects, too.
We’ll define a little class called Tiny
that returns the string 'tiny'
when treated as a string:
>>>
import
pickle
>>>
class
Tiny
():
...
def
__str__
(
self
):
...
return
'tiny'
...
>>>
obj1
=
Tiny
()
>>>
obj1
<__main__.Tiny object at 0x10076ed10>
>>>
str
(
obj1
)
'tiny'
>>>
pickled
=
pickle
.
dumps
(
obj1
)
>>>
pickled
b'x80x03c__main__ Tiny qx00)x81qx01.'
>>>
obj2
=
pickle
.
loads
(
pickled
)
>>>
obj2
<__main__.Tiny object at 0x10076e550>
>>>
str
(
obj2
)
'tiny'
pickled
is the pickled binary string made from the object obj1
.
We converted that back to the object obj2
to make a copy of obj1
.
Use dump()
to pickle to a file, and load()
to unpickle from one.
Some file formats were designed to store particular data structures but are neither relational nor NoSQL databases. The sections that follow present some of them.
Spreadsheets, notably Microsoft Excel,
are widespread binary data formats.
If you can save your spreadsheet to a CSV file,
you can read it by using the standard csv
module
that was described earlier.
If you have a binary xls
file,
xlrd
is a third-party package for reading and writing.
HDF5 is a binary data format for multidimensional or hierarchical numeric data. It’s used mainly in science, where fast random access to large datasets (gigabytes to terabytes) is a common requirement. Even though HDF5 could be a good alternative to databases in some cases, for some reason HDF5 is almost unknown in the business world. It’s best suited to WORM (write once/read many) applications for which database protection against conflicting writes is not needed. Here are a couple of modules that you might find useful:
h5py
is a full-featured low-level interface.
Read the
documentation
and
code.
PyTables
is a bit higher-level,
with database-like features. Read the
documentation
and
code.
Both of these are discussed in terms of scientific applications of Python in Appendix C. I’m mentioning HDF5 here in case you have a need to store and retrieve large amounts of data and are willing to consider something outside the box, as well as the usual database solutions. A good example is the Million Song dataset, which has downloadable song data in HDF5 format.
Relational databases are only about 40 years old but are ubiquitous in the computing world. You’ll almost certainly have to deal with them at one time or another. When you do, you’ll appreciate what they provide:
Access to data by multiple simultaneous users
Protection from corruption by those users
Efficient methods to store and retrieve the data
Joins to find relationships across diverse types of data
A declarative (rather than imperative) query language: SQL (Structured Query Language)
These are called relational because they show relationships among different kinds of data in the form of tables (as they are usually called nowadays). For instance, in our menu example earlier, there is a relationship between each item and its price.
A table is a grid of rows and columns, similar to a spreadsheet. To create a table, name it and specify the order, names, and types of its columns. Each row has the same columns, although a column may be defined to allow missing data (called nulls). In the menu example, you could create a table with one row for each item being sold. Each item has the same columns, including one for the price.
A column or group of columns is usually the table’s primary key; its values must be unique in the table. This prevents adding the same data to the table more than once. This key is indexed for fast lookups during queries. An index works a little like a book index, making it fast to find a particular row.
Each table lives within a parent database, like a file within a directory. Two levels of hierarchy help keep things organized a little better.
Yes, the word database is used in multiple ways: as the server, the table container, and the data stored therein. If you’ll be referring to all of them at the same time, it might help to call them database server, database, and data.
If you want to find rows by some non-key column value, define a secondary index on that column. Otherwise, the database server must perform a table scan—a brute-force search of every row for matching column values.
Tables can be related to each other with foreign keys, and column values can be constrained to these keys.
SQL is not an API or a protocol, but a declarative language: you say what you want rather than how to do it. It’s the universal language of relational databases. SQL queries are text strings, that a client sends to the database server, which figures out what to do with them.
There have been various SQL standard definitions, but all database vendors have added their own tweaks and extensions, resulting in many SQL dialects. If you store your data in a relational database, SQL gives you some portability. Still, dialect and operational differences can make it difficult to move your data to another type of database.
There are two main categories of SQL statements:
Handles creation, deletion, constraints, and permissions for tables, databases, and uses
Table 8-1 lists the basic SQL DDL commands.
Operation | SQL pattern | SQL example |
---|---|---|
Create a database |
|
|
Select current database |
|
|
Delete a database and its tables |
|
|
Create a table |
|
|
Delete a table |
|
|
Remove all rows from a table |
|
|
Why all the CAPITAL LETTERS? SQL is case-insensitive, but it’s tradition (don’t ask me why) to SHOUT its keywords in code examples to distinguish them from column names.
The main DML operations of a relational database are often known by the acronym CRUD:
Create by using the SQL INSERT
statement
Read by using SELECT
Update by using UPDATE
Delete by using DELETE
Table 8-2 looks at the commands available for SQL DML.
Operation | SQL pattern | SQL example |
---|---|---|
Add a row |
|
|
Select all rows and columns |
|
|
Select all rows, some columns |
|
|
Select some rows, some columns |
|
|
Change some rows in a column |
|
|
Delete some rows |
|
|
An application programming interface (API) is a set of functions that you can call to get access to some service. DB-API is Python’s standard API for accessing relational databases. Using it, you can write a single program that works with multiple kinds of relational databases instead of writing a separate program for each one. It’s similar to Java’s JDBC or Perl’s dbi.
Its main functions are the following:
connect()
Make a connection to the database; this can include arguments such as username, password, server address, and others.
cursor()
execute()
and executemany()
fetchone()
, fetchmany()
, and fetchall()
The Python database modules in the coming sections conform to DB-API, often with extensions and some differences in details.
SQLite is a good, light, open source relational database. It’s implemented as a standard Python library, and stores databases in normal files. These files are portable across machines and operating systems, making SQLite a very portable solution for simple relational database applications. It isn’t as full-featured as MySQL or PostgreSQL, but it does support SQL, and it manages multiple simultaneous users. Web browsers, smart phones, and other applications use SQLite as an embedded database.
You begin with a connect()
to the local SQLite database file that
you want to use or create.
This file is the equivalent of the directory-like
database that parents tables in other servers.
The special string ':memory:'
creates the database in
memory only; this is fast and useful for testing
but will lose data when your program terminates or
if your computer goes down.
For the next example, let’s make a database called enterprise.db
and the table zoo
to manage our
thriving roadside petting zoo business.
The table columns are as follows:
critter
A variable length string, and our primary key
count
An integer count of our current inventory for this animal
damages
The dollar amount of our current losses from animal-human interactions
>>>
import
sqlite3
>>>
conn
=
sqlite3
.
connect
(
'enterprise.db'
)
>>>
curs
=
conn
.
cursor
()
>>>
curs
.
execute
(
'''CREATE TABLE zoo
(critter VARCHAR(20) PRIMARY KEY,
count INT,
damages FLOAT)''')
<sqlite3.Cursor object at 0x1006a22d0>
Python’s triple quotes are handy when creating long strings such as SQL queries.
Now, add some animals to the zoo:
>>>
curs
.
execute
(
'INSERT INTO zoo VALUES("duck", 5, 0.0)'
)
<sqlite3.Cursor object at 0x1006a22d0>
>>>
curs
.
execute
(
'INSERT INTO zoo VALUES("bear", 2, 1000.0)'
)
<sqlite3.Cursor object at 0x1006a22d0>
There’s a safer way to insert data, using a placeholder:
>>>
ins
=
'INSERT INTO zoo (critter, count, damages) VALUES(?, ?, ?)'
>>>
curs
.
execute
(
ins
,
(
'weasel'
,
1
,
2000.0
))
<sqlite3.Cursor object at 0x1006a22d0>
This time, we used three question marks in the SQL to indicate that we plan
to insert three values, and then pass those three values as a tuple to
the execute()
function.
Placeholders handle tedious details such as quoting. They
protect you against SQL injection—a kind of external attack
that is common on the Web that
inserts malicious SQL commands into the system.
Now, let’s see if we can get all our animals out again:
>>>
curs
.
execute
(
'SELECT * FROM zoo'
)
<sqlite3.Cursor object at 0x1006a22d0>
>>>
rows
=
curs
.
fetchall
()
>>>
(
rows
)
[('duck', 5, 0.0), ('bear', 2, 1000.0), ('weasel', 1, 2000.0)]
Let’s get them again, but ordered by their counts:
>>>
curs
.
execute
(
'SELECT * from zoo ORDER BY count'
)
<sqlite3.Cursor object at 0x1006a22d0>
>>>
curs
.
fetchall
()
[('weasel', 1, 2000.0), ('bear', 2, 1000.0), ('duck', 5, 0.0)]
Hey, we wanted them in descending order:
>>>
curs
.
execute
(
'SELECT * from zoo ORDER BY count DESC'
)
<sqlite3.Cursor object at 0x1006a22d0>
>>>
curs
.
fetchall
()
[('duck', 5, 0.0), ('bear', 2, 1000.0), ('weasel', 1, 2000.0)]
Which type of animal is costing us the most?
>>>
curs
.
execute
(
'''SELECT * FROM zoo WHERE
...
damages = (SELECT MAX(damages) FROM zoo)'''
)
<sqlite3.Cursor object at 0x1006a22d0>
>>>
curs
.
fetchall
()
[('weasel', 1, 2000.0)]
You would have thought it was the bears. It’s always best to check the actual data.
Before we leave SQLite, we need to clean up. If we opened a connection and a cursor, we need to close them when we’re done:
>>>
curs
.
close
()
>>>
conn
.
close
()
MySQL is a very popular open source relational database. Unlike SQLite, it’s an actual server, so clients can access it from different devices across the network.
MysqlDB has been the most popular MySQL driver, but it has not yet been ported to Python 3. Table 8-3 lists the drivers you can use to access MySQL from Python.
Name | Link | Pypi package | Import as | Notes |
---|---|---|---|---|
MySQL Connector |
mysql-connector-python |
|
||
PYMySQL |
pymysql |
|
||
oursql |
oursql |
|
Requires the MySQL C client libraries. |
PostgreSQL is a full-featured open source relational database, in many ways more advanced than MySQL. Table 8-4 presents the Python drivers you can use to access it.
Name | Link | Pypi package | Import as | Notes |
---|---|---|---|---|
psycopg2 |
psycopg2 |
psycopg2 |
Needs |
|
py-postgresql |
py-postgresql |
postgresql |
The most popular driver is psycopg2
,
but its installation requires the PostgreSQL client libraries.
SQL is not quite the same for all relational databases, and DB-API takes you only so far. Each database implements a particular dialect reflecting its features and philosophy. Many libraries try to bridge these differences in one way or another. The most popular cross-database Python library is SQLAlchemy.
It isn’t in the standard library, but it’s well known and used by many people. You can install it on your system by using this command:
$ pip install sqlalchemy
You can use SQLAlchemy on several levels:
The lowest level handles database connection pools, executing SQL commands, and returning results. This is closest to the DB-API.
Next up is the SQL expression language, a Pythonic SQL builder.
Highest is the ORM (Object Relational Model) layer, which uses the SQL Expression Language and binds application code with relational data structures.
As we go along, you’ll understand what the terms mean in those levels. SQLAlchemy works with the database drivers documented in the previous sections. You don’t need to import the driver; the initial connection string you provide to SQLAlchemy will determine it. That string looks like this:
dialect
+
driver
://
user
:
password
@
host
:
port
/
dbname
The values you put in this string are as follows:
dialect
The database type
driver
The particular driver you want to use for that database
user
and password
Your database authentication strings
host
and port
The database server’s location (: port
is only needed if it’s not the standard one for this server)
dbname
The database to initially connect to on the server
Table 8-5 lists the dialects and drivers.
dialect | driver |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
First, we’ll try the lowest level of SQLAlchemy, which does little more than the base DB-API functions.
Let’s try it with SQLite, which is already built into Python.
The connection string for SQLite
skips the host
, port
, user
, and password
.
The dbname
informs SQLite as to what file to use to store your database. If you omit the dbname
, SQLite builds a database in memory.
If the dbname
starts with a slash (/),
it’s an absolute
filename on your computer
(as in Linux and OS X;
for example, C:\ on Windows).
Otherwise, it’s relative to your current
directory.
The following segments are all part of one program, separated here for explanation.
To begin, you need to import what we need.
The following is an example of an import alias,
which lets us use the string sa
to refer to SQLAlchemy methods.
I do this mainly because sa
is a lot easier to type than sqlalchemy
:
>>>
import
sqlalchemy
as
sa
Connect to the database
and create the storage
for it in memory (the argument string 'sqlite:///:memory:'
also works):
>>>
conn
=
sa
.
create_engine
(
'sqlite://'
)
Create a database table called zoo
that comprises three columns:
>>>
conn
.
execute
(
'''CREATE TABLE zoo
...
(critter VARCHAR(20) PRIMARY KEY,
...
count INT,
...
damages FLOAT)'''
)
<sqlalchemy.engine.result.ResultProxy object at 0x1017efb10>
Running conn.execute()
returns a SQLAlchemy object called a ResultProxy
.
You’ll soon see what to do with it.
By the way, if you’ve never made a database table before, congratulations. Check that one off your bucket list.
Now, insert three sets of data into your new empty table:
>>>
ins
=
'INSERT INTO zoo (critter, count, damages) VALUES (?, ?, ?)'
>>>
conn
.
execute
(
ins
,
'duck'
,
10
,
0.0
)
<sqlalchemy.engine.result.ResultProxy object at 0x1017efb50>
>>>
conn
.
execute
(
ins
,
'bear'
,
2
,
1000.0
)
<sqlalchemy.engine.result.ResultProxy object at 0x1017ef090>
>>>
conn
.
execute
(
ins
,
'weasel'
,
1
,
2000.0
)
<sqlalchemy.engine.result.ResultProxy object at 0x1017ef450>
Next, ask the database for everything that we just put in:
>>>
rows
=
conn
.
execute
(
'SELECT * FROM zoo'
)
In SQLAlchemy, rows
is not a list; it’s that special ResultProxy
thing
that we can’t print directly:
>>>
(
rows
)
<sqlalchemy.engine.result.ResultProxy object at 0x1017ef9d0>
However, you can iterate over it like a list, so we can get a row at a time:
>>>
for
row
in
rows
:
...
(
row
)
...
('duck', 10, 0.0)
('bear', 2, 1000.0)
('weasel', 1, 2000.0)
That was almost the same as the SQLite DB-API example that you saw earlier. The one advantage is that we didn’t need to import the database driver at the top; SQLAlchemy figured that out from the connection string. Just changing the connection string would make this code portable to another type of database. Another plus is SQLAlchemy’s connection pooling, which you can read about at its documentation site.
The next level up is SQLAlchemy’s SQL Expression Language. It introduces functions to create the SQL for various operations. The Expression Language handles more of the SQL dialect differences than the lower-level engine layer does. It can be a handy middle-ground approach for relational database applications.
Here’s how to create and populate the zoo
table.
Again, these are successive fragments of a single program.
The import and connection are the same as before:
>>>
import
sqlalchemy
as
sa
>>>
conn
=
sa
.
create_engine
(
'sqlite://'
)
To define the zoo
table,
we’ll begin using some of the Expression Language
instead of SQL:
>>>
meta
=
sa
.
MetaData
()
>>>
zoo
=
sa
.
Table
(
'zoo'
,
meta
,
...
sa
.
Column
(
'critter'
,
sa
.
String
,
primary_key
=
True
),
...
sa
.
Column
(
'count'
,
sa
.
Integer
),
...
sa
.
Column
(
'damages'
,
sa
.
Float
)
...
)
>>>
meta
.
create_all
(
conn
)
Check out the parentheses in that multiline call in the preceding example.
The structure of the Table()
method matches
the structure of the table.
Just as our table contains three columns,
there are three calls to Column()
inside the parentheses of the
Table()
method call.
Meanwhile, zoo
is some magic object that bridges the SQL database world
and the Python data structure world.
Insert the data with more Expression Language functions:
...
conn
.
execute
(
zoo
.
insert
((
'bear'
,
2
,
1000.0
)))
<sqlalchemy.engine.result.ResultProxy object at 0x1017ea910>
>>>
conn
.
execute
(
zoo
.
insert
((
'weasel'
,
1
,
2000.0
)))
<sqlalchemy.engine.result.ResultProxy object at 0x1017eab10>
>>>
conn
.
execute
(
zoo
.
insert
((
'duck'
,
10
,
0
)))
<sqlalchemy.engine.result.ResultProxy object at 0x1017eac50>
Next, create the SELECT statement
(zoo.select()
selects everything from the table
represented by the zoo
object,
such as SELECT * FROM zoo
would do in plain SQL):
>>>
result
=
conn
.
execute
(
zoo
.
select
())
Finally, get the results:
>>>
rows
=
result
.
fetchall
()
>>>
(
rows
)
[('bear', 2, 1000.0), ('weasel', 1, 2000.0), ('duck', 10, 0.0)]
In the last section,
the zoo
object was a mid-level connection between
SQL and Python.
At the top layer of SQLAlchemy, the Object-Relational Mapper (ORM)
uses the SQL Expression Language
but tries to make the actual database
mechanisms invisible.
You define classes, and the ORM handles how
to get their data in and out of the database.
The basic idea behind that complicated phrase,
“object-relational mapper,” is that you can refer to
objects in your code, and thus stay close to the way Python
likes to operate, while still using a relational
database.
We’ll define a Zoo
class and hook it into the ORM.
This time, we’ll make SQLite use the file zoo.db
so that we can confirm that the ORM worked.
As in the previous two sections, the snippets that follow are actually one program separated by explanations. Don’t worry if you don’t understand some if it. The SQLAlchemy documentation has all the details, and this stuff can get complex. I just want you to get an idea of how much work it is to do this, so that you can decide which of the approaches discussed in this chapter suits you.
The initial import is the same, but this time we need another something also:
>>>
import
sqlalchemy
as
sa
>>>
from
sqlalchemy.ext.declarative
import
declarative_base
Here, we make the connection:
>>>
conn
=
sa
.
create_engine
(
'sqlite:///zoo.db'
)
Now, we get into SQLAlchemy’s ORM.
We define the Zoo
class and associate
its attributes with table columns:
>>>
Base
=
declarative_base
()
>>>
class
Zoo
(
Base
):
...
__tablename__
=
'zoo'
...
critter
=
sa
.
Column
(
'critter'
,
sa
.
String
,
primary_key
=
True
)
...
count
=
sa
.
Column
(
'count'
,
sa
.
Integer
)
...
damages
=
sa
.
Column
(
'damages'
,
sa
.
Float
)
...
def
__init__
(
self
,
critter
,
count
,
damages
):
...
self
.
critter
=
critter
...
self
.
count
=
count
...
self
.
damages
=
damages
...
def
__repr__
(
self
):
...
return
"<Zoo({}, {}, {})>"
.
format
(
self
.
critter
,
self
.
count
,
self
.
damages
)
The following line magically creates the database and table:
>>>
Base
.
metadata
.
create_all
(
conn
)
You can then insert data by creating Python objects. The ORM manages these internally:
>>>
first
=
Zoo
(
'duck'
,
10
,
0.0
)
>>>
second
=
Zoo
(
'bear'
,
2
,
1000.0
)
>>>
third
=
Zoo
(
'weasel'
,
1
,
2000.0
)
>>>
first
<Zoo(duck, 10, 0.0)>
Next, we get the ORM to take us to SQL land. We create a session to talk to the database:
>>>
from
sqlalchemy.orm
import
sessionmaker
>>>
Session
=
sessionmaker
(
bind
=
conn
)
>>>
session
=
Session
()
Within the session, we write the three objects that we created
to the database.
The add()
function adds one object, and add_all()
adds a list:
>>>
session
.
add
(
first
)
>>>
session
.
add_all
([
second
,
third
])
Finally, we need to force everything to complete:
>>>
session
.
commit
()
Did it work?
Well, it created a zoo.db file
in the current directory.
You can use the command-line sqlite3
program to check it:
$ sqlite3 zoo.db SQLite version 3.6.12 Enter ".help" for instructions Enter SQL statements terminated with a ";" sqlite> .tables zoo sqlite> select * from zoo; duck|10|0.0 bear|2|1000.0 weasel|1|2000.0
The purpose of this section was to show what an ORM is and how it works at a high level. The author of SQLAlchemy has written a full tutorial. After reading this, decide which of the following levels would best fit your needs:
Plain DB-API, as in the earlier SQLite section
The SQLAlchemy engine room
The SQLAlchemy Expression Language
The SQLAlchemy ORM
It seems like a natural choice to use an ORM to avoid the complexities of SQL. Should you use one? Some people think ORMs should be avoided, but others think the criticism is overdone. Whoever’s right, an ORM is an abstraction, and all abstractions break down at some point; they’re leaky. When your ORM doesn’t do what you want, you must figure out both how it works and how to fix it in SQL. To borrow an Internet meme: Some people, when confronted with a problem, think, “I know, I’ll use an ORM.” Now they have two problems. Use ORMs sparingly, and mostly for simple applications. If the application is that simple, maybe you can just use straight SQL (or the SQL Expression Language), anyhow.
Or, you can try something simpler such as
dataset
.
It’s built on SQLAlchemy and provides a simple
ORM for SQL, JSON, and CSV storage.
Some databases are not relational and don’t support SQL. These were written to handle very large data sets, allow more flexible data definitions, or support custom data operations. They’ve been collectively labeled NoSQL (formerly meaning no SQL; now the less confrontational not only SQL).
The dbm
formats were around long before
NoSQL was coined.
They’re key-value stores,
often embedded in applications such as web browsers
to maintain various settings.
A dbm database is like a Python dictionary in the following ways:
You can assign a value to a key, and it’s automatically saved to the database on disk.
You can get a value from a key.
The following is a quick example.
The second argument to the following open()
method
is 'r'
to read, 'w'
to write,
and 'c'
for both, creating the file if it doesn’t exist:
>>>
import
dbm
>>>
db
=
dbm
.
open
(
'definitions'
,
'c'
)
To create key-value pairs, just assign a value to a key just as you would a dictionary:
>>>
db
[
'mustard'
]
=
'yellow'
>>>
db
[
'ketchup'
]
=
'red'
>>>
db
[
'pesto'
]
=
'green'
Let’s pause and check what we have so far:
>>>
len
(
db
)
3
>>>
db
[
'pesto'
]
b'green'
Now close, then reopen to see if it actually saved what we gave it:
>>>
db
.
close
()
>>>
db
=
dbm
.
open
(
'definitions'
,
'r'
)
>>>
db
[
'mustard'
]
b'yellow'
Keys and values are stored as bytes
.
You cannot iterate over the database object db
,
but you can get the
number of keys by using len()
. Note that get()
and setdefault()
work as they do for dictionaries.
memcached
is a fast in-memory key-value cache server.
It’s often put in front of a database,
or used to store web server session data.
You can download versions for Linux and OS X,
and for Windows.
If you want to try out this section,
you’ll need a memcached server and Python driver.
There are many Python drivers;
one that works with Python 3 is
python3-memcached
,
which you can install by using this command:
$ pip install python-memcached
To use it, connect to a memcached server, after which you can do the following:
Set and get values for keys
Increment or decrement a value
Delete a key
Data is not persistent, and data that you wrote earlier might disappear. This is inherent in memcached, being that it’s a cache server. It avoids running out of memory by discarding old data.
You can connect to multiple memcached servers at the same time. In this next example, we’re just talking to one on the same computer:
>>>
import
memcache
>>>
db
=
memcache
.
Client
([
'127.0.0.1:11211'
])
>>>
db
.
set
(
'marco'
,
'polo'
)
True
>>>
db
.
get
(
'marco'
)
'polo'
>>>
db
.
set
(
'ducks'
,
0
)
True
>>>
db
.
get
(
'ducks'
)
0
>>>
db
.
incr
(
'ducks'
,
2
)
2
>>>
db
.
get
(
'ducks'
)
2
Redis is a data structure server. Like memcached, all of the data in a Redis server should fit in memory (although there is now an option to save the data to disk). Unlike memcached, Redis can do the following:
Save data to disk for reliability and restarts
Keep old data
Provide more data structures than simple strings
The Redis data types are a close match to Python’s, and a Redis server can be a useful intermediary for one or more Python applications to share data. I’ve found it so useful that it’s worth a little extra coverage here.
The Python driver redis-py
has its source code and tests
on
GitHub, as
well as online documentation.
You can install it by using this command:
$ pip install redis
The Redis server itself has good documentation.
If you install and start the Redis server
on your local computer
(with the network nickname localhost
),
you can try the programs in the following sections.
A key with a single value is a Redis string.
Simple Python data types are automatically converted.
Connect to a Redis server at some host
(default is localhost
)
and port (default is 6379
):
>>>
import
redis
>>>
conn
=
redis
.
Redis
()
redis.Redis('localhost')
or redis.Redis('localhost', 6379)
would have given the same result.
List all keys (none so far):
>>>
conn
.
keys
(
'*'
)
[]
Set a simple string (key 'secret'
),
integer (key 'carats'
),
and float (key 'fever'
):
>>>
conn
.
set
(
'secret'
,
'ni!'
)
True
>>>
conn
.
set
(
'carats'
,
24
)
True
>>>
conn
.
set
(
'fever'
,
'101.5'
)
True
Get the values back by key:
>>>
conn
.
get
(
'secret'
)
b'ni!'
>>>
conn
.
get
(
'carats'
)
b'24'
>>>
conn
.
get
(
'fever'
)
b'101.5'
Here, the setnx()
method sets a value
only if the key does not exist:
>>>
conn
.
setnx
(
'secret'
,
'icky-icky-icky-ptang-zoop-boing!'
)
False
It failed because we had already defined 'secret'
:
>>>
conn
.
get
(
'secret'
)
b'ni!'
The getset()
method
returns the old value and sets it to a new one at the same time:
>>>
conn
.
getset
(
'secret'
,
'icky-icky-icky-ptang-zoop-boing!'
)
b'ni!'
Let’s not get too far ahead of ourselves. Did it work?
>>>
conn
.
get
(
'secret'
)
b'icky-icky-icky-ptang-zoop-boing!'
Now, get a substring by using getrange()
(as in Python, offset 0=start, -1=end):
>>>
conn
.
getrange
(
'secret'
,
-
6
,
-
1
)
b'boing!'
Replace a substring
by using setrange()
(using a zero-based offset):
>>>
conn
.
setrange
(
'secret'
,
0
,
'ICKY'
)
32
>>>
conn
.
get
(
'secret'
)
b'ICKY-icky-icky-ptang-zoop-boing!'
Next, set multiple keys at once by using mset()
:
>>>
conn
.
mset
({
'pie'
:
'cherry'
,
'cordial'
:
'sherry'
})
True
Get more than one value at once by using mget()
:
>>>
conn
.
mget
([
'fever'
,
'carats'
])
[b'101.5', b'24']
Delete a key by using delete()
:
>>>
conn
.
delete
(
'fever'
)
True
Increment
by using the incr()
or incrbyfloat()
commands,
and decrement with decr()
:
>>>
conn
.
incr
(
'carats'
)
25
>>>
conn
.
incr
(
'carats'
,
10
)
35
>>>
conn
.
decr
(
'carats'
)
34
>>>
conn
.
decr
(
'carats'
,
15
)
19
>>>
conn
.
set
(
'fever'
,
'101.5'
)
True
>>>
conn
.
incrbyfloat
(
'fever'
)
102.5
>>>
conn
.
incrbyfloat
(
'fever'
,
0.5
)
103.0
There’s no decrbyfloat()
.
Use a negative increment to reduce the fever:
>>>
conn
.
incrbyfloat
(
'fever'
,
-
2.0
)
101.0
Redis lists can contain only strings.
The list is created when you do your first insertion.
Insert at the beginning by using lpush()
:
>>>
conn
.
lpush
(
'zoo'
,
'bear'
)
1
Insert more than one item at the beginning:
>>>
conn
.
lpush
(
'zoo'
,
'alligator'
,
'duck'
)
3
Insert before or after a value by using linsert()
:
>>>
conn
.
linsert
(
'zoo'
,
'before'
,
'bear'
,
'beaver'
)
4
>>>
conn
.
linsert
(
'zoo'
,
'after'
,
'bear'
,
'cassowary'
)
5
Insert at an offset by using lset()
(the list must exist already):
>>>
conn
.
lset
(
'zoo'
,
2
,
'marmoset'
)
True
Insert at the end by using rpush()
:
>>>
conn
.
rpush
(
'zoo'
,
'yak'
)
6
Get the value at an offset by using lindex()
:
>>>
conn
.
lindex
(
'zoo'
,
3
)
b'bear'
Get the values in an offset range
by using lrange()
(0 to -1 for all):
>>>
conn
.
lrange
(
'zoo'
,
0
,
2
)
[b'duck', b'alligator', b'marmoset']
Trim the list with ltrim()
,
keeping only those in a range of offsets:
>>>
conn
.
ltrim
(
'zoo'
,
1
,
4
)
True
Get a range of values (use 0
to -1
for all) by using lrange()
:
>>>
conn
.
lrange
(
'zoo'
,
0
,
-
1
)
[b'alligator', b'marmoset', b'bear', b'cassowary']
Chapter 10 shows you how you can use Redis lists and publish-subscribe to implement job queues.
Redis hashes are similar to Python dictionaries
but can contain only strings.
Thus, you can go only one level deep,
not make deep-nested structures.
Here are examples that create and play with a Redis hash called song
:
Set the fields do
and re
in hash song
at once
by using hmset()
:
>>>
conn
.
hmset
(
'song'
,
{
'do'
:
'a deer'
,
're'
:
'about a deer'
})
True
Set a single field value in a hash by using hset()
:
>>>
conn
.
hset
(
'song'
,
'mi'
,
'a note to follow re'
)
1
Get one field’s value by using hget()
:
>>>
conn
.
hget
(
'song'
,
'mi'
)
b'a note to follow re'
Get multiple field values by using hmget()
:
>>>
conn
.
hmget
(
'song'
,
're'
,
'do'
)
[b'about a deer', b'a deer']
Get all field keys for the hash by using hkeys()
:
>>>
conn
.
hkeys
(
'song'
)
[b'do', b're', b'mi']
Get all field values for the hash by using hvals()
:
>>>
conn
.
hvals
(
'song'
)
[b'a deer', b'about a deer', b'a note to follow re']
Get the number of fields in the hash by using hlen()
:
>>>
conn
.
hlen
(
'song'
)
3
Get all field keys and values in the hash by using hgetall()
:
>>>
conn
.
hgetall
(
'song'
)
{
b
'do'
:
b
'a deer'
,
b
're'
:
b
'about a deer'
,
b
'mi'
:
b
'a note to follow re'
}
Set a field if its key doesn’t exist by using hsetnx()
:
>>>
conn
.
hsetnx
(
'song'
,
'fa'
,
'a note that rhymes with la'
)
1
Redis sets are similar to Python sets, as you can see in the series of examples that follow.
Add one or more values to a set:
>>>
conn
.
sadd
(
'zoo'
,
'duck'
,
'goat'
,
'turkey'
)
3
Get the number of values from the set:
>>>
conn
.
scard
(
'zoo'
)
3
Get all the set’s values:
>>>
conn
.
smembers
(
'zoo'
)
{b'duck', b'goat', b'turkey'}
Remove a value from the set:
>>>
conn
.
srem
(
'zoo'
,
'turkey'
)
True
Let’s make a second set to show some set operations:
>>>
conn
.
sadd
(
'better_zoo'
,
'tiger'
,
'wolf'
,
'duck'
)
0
Intersect (get the common members of)
the zoo
and better_zoo
sets:
>>>
conn
.
sinter
(
'zoo'
,
'better_zoo'
)
{b'duck'}
Get the intersection of zoo
and better_zoo
,
and store the result in the set fowl_zoo
:
>>>
conn
.
sinterstore
(
'fowl_zoo'
,
'zoo'
,
'better_zoo'
)
1
Who’s in there?
>>>
conn
.
smembers
(
'fowl_zoo'
)
{b'duck'}
Get the union (all members) of zoo
and better_zoo
:
>>>
conn
.
sunion
(
'zoo'
,
'better_zoo'
)
{b'duck', b'goat', b'wolf', b'tiger'}
Store that union result in the set fabulous_zoo
:
>>>
conn
.
sunionstore
(
'fabulous_zoo'
,
'zoo'
,
'better_zoo'
)
4
>>>
conn
.
smembers
(
'fabulous_zoo'
)
{b'duck', b'goat', b'wolf', b'tiger'}
What does zoo
have that better_zoo
doesn’t?
Use sdiff()
to get the set difference,
and sdiffstore()
to save it in the zoo_sale
set:
>>>
conn
.
sdiff
(
'zoo'
,
'better_zoo'
)
{b'goat'}
>>>
conn
.
sdiffstore
(
'zoo_sale'
,
'zoo'
,
'better_zoo'
)
1
>>>
conn
.
smembers
(
'zoo_sale'
)
{b'goat'}
One of the most versatile Redis data types is the sorted set, or zset. It’s a set of unique values, but each value has an associated floating point score. You can access each item by its value or score. Sorted sets have many uses:
Leader boards
Secondary indexes
Timeseries, using timestamps as scores
We’ll show the last use case,
tracking user logins via timestamps.
We’re using the Unix epoch value
(more on this in Chapter 10)
that’s returned by the Python time()
function:
>>>
import
time
>>>
now
=
time
.
time
()
>>>
now
1361857057.576483
Let’s add our first guest, looking nervous:
>>>
conn
.
zadd
(
'logins'
,
'smeagol'
,
now
)
1
Five minutes later, another guest:
>>>
conn
.
zadd
(
'logins'
,
'sauron'
,
now
+
(
5
*
60
))
1
Two hours later:
>>>
conn
.
zadd
(
'logins'
,
'bilbo'
,
now
+
(
2
*
60
*
60
))
1
One day later, not hasty:
>>>
conn
.
zadd
(
'logins'
,
'treebeard'
,
now
+
(
24
*
60
*
60
))
1
In what order did bilbo
arrive?
>>>
conn
.
zrank
(
'logins'
,
'bilbo'
)
2
When was that?
>>>
conn
.
zscore
(
'logins'
,
'bilbo'
)
1361864257.576483
Let’s see everyone in login order:
>>>
conn
.
zrange
(
'logins'
,
0
,
-
1
)
[b'smeagol', b'sauron', b'bilbo', b'treebeard']
With their times, please:
>>>
conn
.
zrange
(
'logins'
,
0
,
-
1
,
withscores
=
True
)
[(b'smeagol', 1361857057.576483), (b'sauron', 1361857357.576483),
(b'bilbo', 1361864257.576483), (b'treebeard', 1361943457.576483)]
This is a very space-efficient and fast way to deal with large sets of numbers. Suppose that you have a website with registered users. You’d like to track how often people log in, how many users visit on a particular day, how often the same user visits on following days, and so on. You could use Redis sets, but if you’ve assigned increasing numeric user IDs, bits are more compact and faster.
Let’s begin by creating a bitset for each day. For this test, we’ll just use three days and a few user IDs:
>>>
days
=
[
'2013-02-25'
,
'2013-02-26'
,
'2013-02-27'
]
>>>
big_spender
=
1089
>>>
tire_kicker
=
40459
>>>
late_joiner
=
550212
Each date is a separate key.
Set the bit for a particular user ID for that date.
For example,
on the first date (2013-02-25
),
we had visits from big_spender
(ID 1089
)
and tire_kicker
(ID 40459
):
>>>
conn
.
setbit
(
days
[
0
],
big_spender
,
1
)
0
>>>
conn
.
setbit
(
days
[
0
],
tire_kicker
,
1
)
0
The next day, big_spender
came back:
>>>
conn
.
setbit
(
days
[
1
],
big_spender
,
1
)
0
The next day had yet another visit from
our friend, big_spender
,
and a new person whom we’re calling late_joiner
:
>>>
conn
.
setbit
(
days
[
2
],
big_spender
,
1
)
0
>>>
conn
.
setbit
(
days
[
2
],
late_joiner
,
1
)
0
Let’s get the daily visitor count for these three days:
>>>
for
day
in
days
:
...
conn
.
bitcount
(
day
)
...
2
1
2
Did a particular user visit on a particular day?
>>>
conn
.
getbit
(
days
[
1
],
tire_kicker
)
0
So, tire_kicker
did not visit on the second day.
How many users visited every day?
>>>
conn
.
bitop
(
'and'
,
'everyday'
,
*
days
)
68777
>>>
conn
.
bitcount
(
'everyday'
)
1
I’ll give you three guesses who it was:
>>>
conn
.
getbit
(
'everyday'
,
big_spender
)
1
Finally, what was the number of total unique users in these three days?
>>>
conn
.
bitop
(
'or'
,
'alldays'
,
*
days
)
68777
>>>
conn
.
bitcount
(
'alldays'
)
3
All Redis keys have a time-to-live,
or expiration date.
By default, this is forever.
We can use the expire()
function to instruct Redis how long to keep the
key. As is demonstrated here, the value is a number of seconds:
>>>
import
time
>>>
key
=
'now you see it'
>>>
conn
.
set
(
key
,
'but not for long'
)
True
>>>
conn
.
expire
(
key
,
5
)
True
>>>
conn
.
ttl
(
key
)
5
>>>
conn
.
get
(
key
)
b'but not for long'
>>>
time
.
sleep
(
6
)
>>>
conn
.
get
(
key
)
>>>
The expireat()
command expires a key at
a given epoch time. Key expiration is useful to keep caches fresh
and to limit login sessions.
The NoSQL servers listed here handle data larger than memory, and many of them use multiple computers. Table 8-6 presents notable servers and their Python libraries.
Site | Python API |
---|---|
Finally, there’s a special category of databases for full-text search. They index everything, so you can find that poem that talks about windmills and giant wheels of cheese. You can see some popular open source examples, and their Python APIs, in Table 8-7.
Site | Python API |
---|---|
(written in Python, includes an API) |
8.1. Assign the string 'This is a test of the emergency text system'
to the variable test1
, and write test1
to a file called test.txt.
8.2. Open the file test.txt and read its contents into the string test2
. Are test1
and test2
the same?
8.3. Save these text lines to a file called books.csv. Notice that if the fields are separated by commas, you need to surround a field with quotes if it contains a comma.
author,book J R R Tolkien,The Hobbit Lynne Truss,"Eats, Shoots & Leaves"
8.4. Use the csv
module and its DictReader
method to read books.csv to the variable books
. Print the values in books
. Did DictReader
handle the quotes and commas in the second book’s title?
8.5. Create a CSV file called books.csv by using these lines:
title,author,year The Weirdstone of Brisingamen,Alan Garner,1960 Perdido Street Station,China Miéville,2000 Thud!,Terry Pratchett,2005 The Spellman Files,Lisa Lutz,2007 Small Gods,Terry Pratchett,1992
8.6. Use the sqlite3
module to create a SQLite database called books.db, and a table called books
with these fields: title
(text), author
(text), and year
(integer).
8.7. Read books.csv and insert its data into the book
table.
8.8. Select and print the title
column from the book
table in alphabetical order.
8.9. Select and print all columns from the book
table in order of publication.
8.10. Use the sqlalchemy
module to connect to the sqlite3 database books.db that you just made in exercise 8.6. As in 8.8, select and print the title
column from the book
table in alphabetical order.
8.11. Install the Redis server and the Python redis
library (pip install redis
) on your computer. Create a Redis hash called test
with the fields count
(1
) and name
('Fester Bestertester'
). Print all the fields for test
.
8.12. Increment the count
field of test
and print it.