WHAT YOU WILL LEARN IN THIS CHAPTER:
WROX.COM DOWNLOADS FOR THIS CHAPTER
For this chapter the wrox.com code downloads are found at www.wrox.com/go/pythonprojects on the Download Code tab. The code is in the Chapter 3 download, called Chapter3.zip, and individually named according to the names throughout the chapter.
In many scenarios you need to store data between executions of your program. The data you want to store could be local status information, such as the current location in an e-book reader or the current working filename, or it could be administrative data such as usernames and passwords or server addresses. Often it will be large volumes of business-oriented data such as customer orders, inventory, or address information. Many business applications consist of little more than a mechanism to create, view, and edit stored data.
This capability to store data in such a way that it is available for use on subsequent invocations of your program is known as data persistence because the data persists beyond the lifetime of the process that created it. To implement data persistence you need to store the data somewhere, either in a file or in a database.
This chapter is a bit like a history of computing storage technologies. That’s because the need to store data has grown—and continues to grow—ever more complex with the passage of time. You now have a broad range of technologies available covering every storage need, from a few simple configuration settings to sophisticated distributed data sources representing thousands of logical entities. In this chapter you learn about the different options available for storing data and the advantages and disadvantages of each. Along the way, you see how Python modules assist in this fundamental programming task.
The simplest storage is a plaintext file. You have already seen in Chapter 2 how to use a text file to store data in various formats, such as CSV and XML, as well as how to store unformatted text. These formats are fine if you need to store the data only when the program closes and read it back when the program is started again. That makes them well suited to configuration data or application status information. These flat-file formats are less useful when you need to handle large volumes of data non-sequentially or search for specific records or fields. For that, you need a database.
A database is just a data storage system that enables you to create, read, update, and delete individual records—this set of four fundamental data management functions is often referred to as a CRUD interface. Database records consist of one or more key fields that uniquely identify the record, plus any other fields needed to represent the attributes of the entity that the record represents.
A Python dictionary can be used as a kind of non-persistent database in that you can use a key to create, read, update, or delete an associated value; that value could be a tuple of fields, in other words a record. All that’s missing is the ability to store the data between sessions. The concept of a dictionary as a database has been exploited over the years, and various forms of persistent dictionaries exist. The oldest are the database management (DBM) family of files.
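As an illustrative sketch (the field layout here is invented, not the chapter's actual schema), a dictionary keyed by a unique ID already gives you all four CRUD operations:

```python
# A dictionary keyed by a unique ID acts as an in-memory CRUD store.
items = {}

# Create: add a record, here stored as a tuple of fields.
items['1'] = ('Lawnmower', 'Tool', '$150')

# Read: fetch the record back by its key.
name, category, price = items['1']

# Update: reassign the key to a modified record.
items['1'] = ('Lawnmower', 'Tool', '$175')

# Delete: remove the record entirely.
del items['1']
```

The only thing this lacks is persistence: close the interpreter and the dictionary is gone.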
DBM files originated in UNIX but have been developed over the years for other platforms as well. Python supports several variations. These variations are hidden by the dbm module, which automatically determines the best solution based on which libraries are supported by the OS installation at hand. If no native DBM library can be found, a basic, pure Python version is used.
The DBM system is a simplified version of a dictionary in that both the keys and values must be strings, which means that some data conversion and string formatting is necessary if you are using non-string data. The advantages of a DBM database are that it is simple to use, fast, and fairly compact.
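A small sketch of that conversion requirement (the file name and the data stored are my own examples, not part of the chapter download):

```python
import dbm

# Keys and values must be strings (stored as bytes), so non-string
# data has to be formatted on the way in and parsed on the way out.
db = dbm.open('pricedb', 'c')   # 'c' creates the file if it does not exist
try:
    db['widget'] = str(4.99)                 # float -> string for storage
    price = float(db['widget'].decode())     # bytes -> string -> float
finally:
    db.close()

print(price)
```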
You can see how DBM works by revisiting the tool-hire example from Chapter 2. When you last looked at it, you were working from a spreadsheet as the master data source. Suppose you decide to migrate the solution to a pure Python application; you would then need a storage mechanism for the various data elements.
Recall that the spreadsheet had two sheets, one representing the tools for hire and the other the actual loans by the members. The record formats are shown in Table 3.1.
Table 3.1 Tool-Hire Data Entities
TOOL | LOAN |
ItemID | ItemID |
Name | Name |
Description | Description |
Owner | Owner |
Price | Borrower |
Condition | Date Borrowed |
Date Registered | Date Returned |
That design is fine for a human working with a spreadsheet, but if you want to convert it into a full-blown data application you need to overcome a number of issues with it:
The Name, Description, and Owner fields are all duplicated, and therefore need to be changed in two places whenever they are edited.
Both sheets use ItemID as a key, which suggests the ItemID represents both a Tool and a Loan, which is confusing.
Because the entities being lent need not be tools, it makes sense to rename the Tool entity as Item. And in keeping with that, you can rename the application to reflect its more generic approach. Call it LendyDB.
With very little effort, you can rearrange things to overcome the issues with the spreadsheet. Table 3.2 shows the resulting database design.
Table 3.2 LendyDB Data Design
ITEM | MEMBER | LOAN |
ItemID | MemberID | LoanID |
Name | Name | ItemID |
Description | | BorrowerID |
OwnerID | | Date Borrowed |
Price | | Date Returned |
Condition | | |
Date Registered | | |
You now have three entities, so you need to store the data in three data files. You can use the DBM format for this because each entity now has a unique identifier field, which, in string format, works well as a DBM key. You need to populate these files with data, and that means reformatting the data from the spreadsheet. You could write a Python program to do that but, because the sample data set is small, it’s easier to just cut and paste the data into the new format. (Or you can extract the files from the LendyDB folder of the Chapter3.zip file from the download site.) Once you have the data you can save it into DBM files quite easily, as shown in the following Try It Out.
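If you prefer to script the save step rather than paste data by hand, it might look something like this sketch (only two sample item records are shown here; the download contains the full data set):

```python
import dbm

# Each record is a comma-separated string keyed by its ID field.
items = {
    '1': '1,Lawnmower,Tool,1,$150,Excellent,2012-01-05',
    '2': '2,Lawnmower,Tool,2,$370,Fair,2012-04-01',
}

db = dbm.open('itemdb', 'n')   # 'n' always creates a new, empty database
try:
    for key, record in items.items():
        db[key] = record
finally:
    db.close()

# Read one record back to confirm the save worked.
db = dbm.open('itemdb')
try:
    print(db['2'].decode())
finally:
    db.close()
```

The same loop, pointed at the member and loan data, produces the memberdb and loandb files.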
Having created your database, you can now use it to read or edit the contents. This is best demonstrated from an interactive session at the Python prompt, so fire up the Python interpreter from the folder where you saved the data files and type the following:
>>> import dbm
>>> items = dbm.open('itemdb')
>>> members = dbm.open('memberdb')
>>> loans = dbm.open('loandb','w')
>>> loan2 = loans['2'].decode()
>>> loan2
'2,2,5,9/5/2012,1/5/2013'
>>> loan2 = loan2.split(',')
>>> loan2
['2', '2', '5', '9/5/2012', '1/5/2013']
>>> item2 = items[loan2[1]].decode().split(',')
>>> item2
['2', 'Lawnmower', 'Tool', '2', '$370', 'Fair', '2012-04-01']
>>> member2 = members[loan2[2]].decode().split(',')
>>> member2
['5', 'Anne', '[email protected]']
>>> print('{} borrowed a {} on {}'.format(
... member2[1],item2[1],loan2[3]))
Anne borrowed a Lawnmower on 9/5/2012
With the preceding commands, you opened the three databases, extracted loan number 2 (using decode()
to convert from the dbm
bytes format into a normal Python string), and split it into its separate fields. You then extracted the corresponding member and item records by using the loan record values as keys. Finally, you printed a message reporting the data in human-readable form.
Of course, you can create new records in the data set, too. Here is how you create a new loan record:
>>> max(loans.keys()).decode()
'5'
>>> key = int(max(loans.keys()).decode()) + 1
>>> newloan = [str(key),'2','1','4/5/2014']
>>> loans[str(key)] = ','.join(newloan)
>>> loans[str(key)]
b'6,2,1,4/5/2014'
With the preceding code, you used the built-in max()
function to find the highest existing key value in the loans
database. You then created a new key by converting that maximum value to an integer and adding one. Next, you used the string version of the new key value to create a new loan record. You then wrote that record out to the database using the new key field. Finally, you checked that the new record existed by using the new key value to read the record back.
You can see that DBM files can be used as a database, even with multiple entities. However, if the data is not naturally string-based, or has many fields, extracting the fields and converting to the appropriate format becomes tedious. You can write helper functions or methods to do that conversion for you, but there is an easier way. Python has a module that can store arbitrary Python objects to files and read them back without you having to do any data conversion. It’s time to meet the pickle
module.
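Before moving on to pickle, here is a sketch of what such helper functions might look like (the function names and the choice of which fields to convert are my own):

```python
# Hypothetical helpers converting between a comma-separated DBM record
# and a list of typed fields for the item entity.
def itemFromString(raw):
    """Split a stored record into its fields, converting the numeric ones."""
    fields = raw.split(',')
    fields[0] = int(fields[0])      # ItemID
    fields[3] = int(fields[3])      # OwnerID
    return fields

def itemToString(fields):
    """Join the fields back into a single storable string."""
    return ','.join(str(f) for f in fields)

record = itemFromString('2,Lawnmower,Tool,2,$370,Fair,2012-04-01')
print(record[0] + 1)            # the ID is now a real integer
print(itemToString(record))     # round-trips back to the original string
```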
The pickle
module is designed to convert Python objects into sequences of binary bytes. The object types converted include the basic data types, such as integers and boolean values, as well as system- and user-defined classes and even collections such as lists, tuples, and functions (except those defined using lambda
). A few restrictions exist on objects that can be pickled; these are described in the module documentation.
pickle
is not of itself a data management solution; it merely converts objects into binary sequences. These sequences can be stored in binary files and read back again so they can be used as a form of data persistence. But pickle
does not provide any means to search the stored objects or retrieve a single object from among many stored items. You must read the entire stored inventory back into memory and access the objects that way. pickle
is ideal when you just want to save the state of a program so that you can start it up and continue from the same position as before (for example, if you were playing a game).
The pickle
module provides several functions and classes, but you normally only use the dump()
and load()
functions. The dump()
function dumps an object (or objects) to a file and the load()
function reads an object from a file (usually an object previously written with dump
).
To see how this works, you can use the interactive prompt and experiment with the Item
data definition from LendyDB
in the previous section. You start off by creating a single item and this time, instead of using a single string for all the values, you use a tuple, like this:
>>> import pickle
>>> anItem = ['1','Lawnmower','Tool','1','$150','Excellent','2012-01-05']
>>> with open('item.pickle','wb') as pf:
... pickle.dump(anItem,pf)
...
>>> with open('item.pickle','rb') as pf:
... itemCopy = pickle.load(pf)
...
>>> print(itemCopy)
['1', 'Lawnmower', 'Tool', '1', '$150', 'Excellent', '2012-01-05']
>>>
Notice that you have to use binary file modes for the pickle file. Most importantly, notice that you got a list back from the file, not just a string. Of course, these elements are all strings, so just for fun try pickling some different data types:
>>> funData = ('a string', True, 42, 3.14159, ['embedded', 'list'])
>>> with open('data.pickle','wb') as pf:
... pickle.dump(funData, pf)
...
>>> with open('data.pickle','rb') as pf:
... copyData = pickle.load(pf)
...
>>> print (copyData)
('a string', True, 42, 3.14159, ['embedded', 'list'])
>>>
That all worked as expected, and you got back the same data that you put in. The only other thing you need to know about pickle is that it is not secure. Unpickling data can execute arbitrary code embedded in the byte stream, so you should never use pickle to read data received from untrusted sources. But for local object persistence in a controlled environment, it does a great job very simply. If you are using pickle in your own projects, you should be aware that pickle-specific exceptions can be raised, so you might want to wrap your code inside a try/except construct.
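For example, a defensive loader along these lines (the function name and filenames are my own invention):

```python
import pickle

def loadState(filename):
    """Return the unpickled object, or None if the file is missing or corrupt."""
    try:
        with open(filename, 'rb') as pf:
            return pickle.load(pf)
    except FileNotFoundError:
        return None
    except pickle.UnpicklingError:
        # The file exists but does not contain valid pickle data.
        return None

print(loadState('no-such-file.pickle'))
```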
For your LendyDB
project, the big problem with pickle
is that you can only access the data by reading the whole file into memory. Wouldn’t it be great if you could have an indexed file of arbitrary Python objects by combining the features of pickle
and dbm
? It turns out that you can, and the work has been done for you in the shelve
module.
The shelve
module combines the dbm
module’s ability to provide random access to files with pickle
’s ability to serialize Python objects. It is not a perfect solution in that the key field must still be a string and the security issue with pickle
also applies to shelve
, so you must ensure your data sources are safe. Also, like dbm
files the module cannot tell if you modify data read into memory, so you must explicitly write any changes back to the file by reassigning to the same key. Finally, dbm
files impose some limits around the size of objects they can store and are not designed for concurrent access from, for example, multiple threads or users. However, for many projects, shelve
provides a simple, lightweight, and fairly fast solution for storing and accessing data.
So as far as you are concerned, shelve
acts just like a dictionary. Almost everything you do with a dictionary you can also do with shelve
instances. The only difference is that the data remains on the disk rather than being in memory. This has obvious speed implications, but on the other hand, it means you can work with very large dictionaries even when memory is limited.
Before you build LendyDB
with shelve
, you’ll experiment with some dummy data that includes a bigger selection of data types, including a user-defined class. The first thing you do is create the shelve
database file (or, as they are sometimes known, a shelf):
>>> import shelve
>>> shelf = shelve.open('fundata.shelve','c')
The open
function takes the same arguments as the dbm
version discussed earlier. Because you are creating a new shelf, you use mode c
for create. Now you can start adding items to the shelf:
>>> shelf['tuple'] = (1,2,'a','b',True,False)
>>> shelf['lists'] = [[1,2,3],[True,False],[3.14159, -66]]
With these commands, you saved two items, each of which contains a mix of Python data types, and shelve
happily stored them without any data conversion required by you. You can check that shelve
saved the items by reading the values back:
>>> shelf['tuple']
(1, 2, 'a', 'b', True, False)
>>> shelf['lists']
[[1, 2, 3], [True, False], [3.14159, -66]]
To make the data changes permanent you need to call close
(normally you would use a try/finally
construct or, unlike dbm
, you can use a context manager style):
>>> shelf.close()
>>> shelf['tuple']
Traceback (most recent call last):
File "C:\Python33\lib\shelve.py", line 111, in __getitem__
value = self.cache[key]
KeyError: 'tuple'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python33\lib\shelve.py", line 113, in __getitem__
f = BytesIO(self.dict[key.encode(self.keyencoding)])
File "C:\Python33\lib\shelve.py", line 70, in closed
raise ValueError('invalid operation on closed shelf')
ValueError: invalid operation on closed shelf
>>>
You can see that after closing the shelf you can no longer access the data; you need to reopen the shelf.
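Because shelf objects support the context management protocol, you can also let a with block do the closing for you. Here is a short sketch reusing the fundata.shelve file:

```python
import shelve

# The with block closes (and therefore flushes) the shelf automatically.
with shelve.open('fundata.shelve', 'c') as shelf:
    shelf['tuple'] = (1, 2, 'a', 'b', True, False)

# Reopening shows that the data really was saved to disk.
with shelve.open('fundata.shelve') as shelf:
    print(shelf['tuple'])
```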
Now you can try something slightly more complex. First, you define a class, create some instances, and then store them to a shelf:
>>> class Test:
... def __init__(self,x,y):
... self.x = x
... self.y = y
... def show(self):
... print(self.x, self.y)
...
>>> shelf = shelve.open('test.shelve','c')
>>> a = Test(1,2)
>>> a.show()
1 2
>>> b = Test('a','b')
>>> b.show()
a b
>>> shelf['12'] = a
>>> shelf['ab'] = b
So far, so good. You have saved two instances of the class. Getting them back is just as easy:
>>> shelf['12']
<__main__.Test object at 0x01BD1570>
>>> shelf['ab']
<__main__.Test object at 0x01BD1650>
>>> c = shelf['12']
>>> c.show()
1 2
>>> d = shelf['ab']
>>> d.show()
a b
>>> shelf.close()
Notice that the object returned was reported as a __main__.Test object. That raises one very important caveat about saving and restoring user-defined classes. You must make sure that the very same class definition used when writing to the shelf is also available to the code that reads the objects back, and the two definitions must be identical. If the class definition changes between writing the data and reading it back, the result is unpredictable. The usual way to make the class visible is to put it into its own module. That module can then be imported and used both by the code that writes the shelf and by the code that reads it.
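A sketch of that layout (the module is written out here only so the example is self-contained; normally entities.py would simply be a file in your project):

```python
import os
import sys
import shelve

# The shared class definition lives in its own module, entities.py,
# which both the writer and the reader import.
with open('entities.py', 'w') as f:
    f.write(
        "class Test:\n"
        "    def __init__(self, x, y):\n"
        "        self.x = x\n"
        "        self.y = y\n"
    )

sys.path.insert(0, os.getcwd())   # make sure entities.py is importable
from entities import Test

# The writer imports Test from the shared module before saving...
with shelve.open('test.shelve', 'c') as shelf:
    shelf['12'] = Test(1, 2)

# ...and the reader imports the very same definition before loading;
# pickle locates the class by its module-qualified name.
with shelve.open('test.shelve') as shelf:
    obj = shelf['12']
    print(obj.x, obj.y)
```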
It’s time to revisit your lending library, LendyDB
. This time you replicate what you did with the dbm
database, but use the shelve
module instead.
You’ve now seen the various options Python offers for storing objects and retrieving them. The shelve
module, in particular, offers a persistence mechanism that is compact, fairly fast, and simple to use. If you have a solution that uses Python dictionaries in memory, switching to a shelve
solution is almost a trivial task. However, this is still a long way short of what is needed for complex data handling. Operations like finding a set of records based on non-key values or sorting the data essentially still require a complete read of the data into memory. The only way to avoid that is to move to a full-blown database solution. However, before you look at that you should consider some aids that Python provides to make data analysis of in-memory data sets easier.
Once you have a set of data, you usually want to ask questions about it. For example, in the lending library example, you might want to know the total cost of the items or even the average cost of an item. You might want to know who contributed the most items, which items are out on loan at any given time, and so on. You can do that using Python, and you could write functions using all the standard Python features that would answer those questions. However, Python has some powerful features that often get overlooked that are especially useful for analyzing data sets.
In this section you look at some of the built-in features you can use, especially the functional programming features of the language. Then you turn your attention to the itertools
module, which offers more advanced features that often save time and computing resources when compared with the standard alternatives.
When you analyze data, it is important to select the right data structure. For example, Python includes a set
data type that automatically eliminates duplicates. If you care only about unique values, converting (or extracting) the data to a set can simplify the process considerably. Similarly, using Python dictionaries to provide keyword access rather than numeric indices often improves code readability, and thus reliability (you saw an example of that in Chapter 2 that compared the CSV dictionary-based reader with the standard tuple-based reader). If you are finding that your code is getting complicated, it’s often worthwhile to stop and consider whether a different data structure would help.
In addition to the wide variety of data structures, Python also offers many built-in and standard library functions that you can use, such as any, all, map, sorted, and slicing. (Slicing isn’t technically a function but an operation; however, it does return a value in much the same way that a function does.) When you combine these functions with Python generator expressions and list comprehensions, you have a powerful toolkit for slicing and dicing your data.
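Here is a short, invented example showing a few of these building blocks applied to a small price list:

```python
prices = [150, 370, 25, 80, 45]

# any() and all() answer yes/no questions about the whole data set.
print(any(p > 300 for p in prices))   # is anything expensive?
print(all(p > 10 for p in prices))    # does everything cost more than $10?

# sorted() plus slicing extracts the three cheapest items.
print(sorted(prices)[:3])

# A generator expression feeds sum() without building an intermediate list.
print(sum(p for p in prices if p < 100))
```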
You can apply these techniques to your LendyDB
data to answer the questions raised in the opening paragraph of this “Analyzing Data with Python” section. You can try that now.
In the preceding Try It Out you saw that you can use the built-in functions and data structures combined with loops and generators to answer most questions about data. The problem is that for large volumes of data this technique requires storing large lists in memory and may involve looping over those lists many times. This can become very slow and resource-intensive. The Python itertools module provides several functions that can reduce the load significantly.
The itertools
module of the standard Python library provides a set of tools that utilize functional programming principles to consume iterable objects and produce other iterables as results. This means that the functions can be combined to build sophisticated data filters.
Before looking at how itertools can be used on the LendyDB data, you should look at some of the functions provided, using simpler data sets. These functions are powerful but operate in slightly different ways than most of the functions you have dealt with in the past. In particular, they are all geared around processing iterables. You should recall that all the standard Python collections, as well as objects such as files, are iterable. You can also create your own custom iterables by defining methods that adhere to the Python iterator protocol. The simplest iterables are strings, so that’s mainly what the documentation uses to demonstrate the itertools functions, but remember that these functions work with any kind of iterable, not just strings.
The first group of functions you look at includes relatively simple functions that you typically use to provide input to the other functions in the module. The count()
function works a lot like the built-in range()
function, except where range()
generates numbers up to a limit, count()
generates an indefinite series of numbers from a start point, incrementing by a given, optional, stepsize
. It looks like this:
>>> import itertools as it
>>> for n in it.count(15,2):
... if n < 40: print(n, end=' ')
... else: break
...
15 17 19 21 23 25 27 29 31 33 35 37 39
The repeat()
function is even simpler; it just repeats its argument continuously, or for the number of repetitions specified, like this:
>>> for n in range(7):
... print(next(it.repeat('yes ')), end='')
...
yes yes yes yes yes yes yes >>>
>>> list(it.repeat(6,3))
[6, 6, 6]
>>>
The cycle()
function iterates over the input sequence repeatedly. This is useful for building round-robin-style iterations for load balancing or resource allocation. Consider the case where you have a number of resources and want to allocate data to each resource in turn. You can build a list of resources, then cycle over that list until you run out of data. You can simulate this technique using lists as resources, like this:
>>> res1 = []
>>> res2 = []
>>> res3 = []
>>> resources = it.cycle([res1,res2,res3])
>>> for n in range(30):
... res = next(resources)
... res.append(n)
...
>>> res1
[0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
>>> res2
[1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
>>> res3
[2, 5, 8, 11, 14, 17, 20, 23, 26, 29]
>>>
The chain()
function concatenates all the input arguments into a single collection and then returns each element. If the arguments were all of the same type, you could achieve the same result by adding the collections together with the plus operator, but chain()
also works for types of collections that are not compatible with the plus operator. Here is an example using a list, a string, and a set:
>>> items = it.chain([1,2,3],'astring',{'a','set','of','strings'})
>>> for item in items:
... print(item)
...
1
2
3
a
s
t
r
i
n
g
a
of
set
strings
Finally, there is the islice()
function that works like the slice operator but, because it uses a generator, is more memory efficient. It does have one significant difference from the normal slice: You cannot use negative indices to count backward from the end, because iterators do not always have well-defined endpoints.
You could use islice()
like this:
>>> data = list(range(20))
>>> data[3:12:2]
[3, 5, 7, 9, 11]
>>> for d in it.islice(data,3,12,2): print(d, end=' ')
...
3 5 7 9 11
itertools
can do much more than just generate data. It can also help analyze data using a variety of data processing functions.
itertools
has many data processing functions that either take input data and transform the elements, or filter the contents in some way. By combining these functions you can build sophisticated data processing tools. One feature that many of these functions have in common is that they accept a function object as a parameter.
The compress()
function acts a little bit like a higher-order version of the bitmasks that you explored in Chapters 1 and 2. It takes a collection of data as its first argument and a collection of boolean values as its second. It returns those items of the first collection that correspond to the True
values of the second collection. Here is a basic example:
>>> for item in it.compress([1,2,3,4,5],[False,True,False,0,1]):
... print (item)
...
2
5
Note that the boolean values do not need to be pure boolean values; they can be anything that Python can convert to boolean, even expressions. (The related itertools.filterfalse() function takes a predicate function rather than a sequence of flags; it returns those elements of its input for which the predicate evaluates to False, making it the complement of the built-in filter().)
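A small sketch of filterfalse() at work (the predicate here is my own example):

```python
import itertools as it

# filterfalse() keeps the elements for which the predicate is false,
# so here it returns the numbers that are NOT even.
def isEven(n):
    return n % 2 == 0

print(list(it.filterfalse(isEven, [1, 2, 3, 4, 5])))   # [1, 3, 5]
```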
Likewise, the dropwhile()
and takewhile()
functions have related, but opposite, effects. Both take an input function and a collection, or iterator, as arguments and then apply the function to the input data elements one at a time. dropwhile()
ignores all of the input elements until the function argument evaluates to False
, whereas takewhile()
returns the elements until the result is False
. You can see the difference in these examples that use the same input data and argument function:
>>> def singleDigit(n): return n < 10
...
>>> for n in it.dropwhile(singleDigit,range(20)): print(n,end=' ')
...
10 11 12 13 14 15 16 17 18 19
>>> for n in it.takewhile(singleDigit,range(20)): print(n,end=' ')
...
0 1 2 3 4 5 6 7 8 9
Note that both of these functions stop processing the data after the first time a trigger is detected. Consider this example:
>>> for n in it.dropwhile(singleDigit,[1,2,12,4,20,7,999]): print(n,end=' ')
...
12 4 20 7 999
Notice that the output includes the single-digit numbers following the first non-single-digit number, for the reason just indicated: once dropwhile() stops dropping, nothing else is dropped. (takewhile() is analogous: once it stops taking, nothing more is taken.)
The accumulate()
function applies its input function to each element of the input data along with the result of the previous operation. (The default function is addition and the first result is always the first element.) Thus, for an input data set of [1,2,3,4]
the initial value, result1
, is 1
, followed by the function applied to result1
and 2
to produce result2
, and to result2
and 3
to create result3
, and to result3
and 4
to create result4
. The output is result1
, result2
, result3
, and result4
. (The final result value is the same as applying the reduce()
function from the functools
module.) Here is an example using accumulate()'s
default addition operator:
>>> for n in it.accumulate([1,2,3,4,]): print(n, end=' ')
...
1 3 6 10
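accumulate() also accepts an explicit function as its second argument; for example, operator.mul turns the running totals into running products:

```python
import itertools as it
import operator

# Running totals (the default) versus running products.
print(list(it.accumulate([1, 2, 3, 4])))                # [1, 3, 6, 10]
print(list(it.accumulate([1, 2, 3, 4], operator.mul)))  # [1, 2, 6, 24]
```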
groupby()
is one of the most useful and powerful of the itertools
functions, but it has a number of little foibles that can catch you out. Its basic role is to collect the input data into groups based on a key derived by an input function and return those groups as iterators in their own right.
The first problem is that the function only groups for as long as it finds the same key, but it creates a new group if a new key is found. Then, if the original key is found later in the sequence, it creates a new group with the same key rather than adding the new element to the original group. To avoid this behavior, it is best if the input data is sorted using the same key function used by groupby()
.
The second snag is that the groups generated by groupby()
are not really independent iterators; they are effectively views into the original input collection. Thus, if the function moves on to the next group of data, the previous groups become invalid. The only way to retain the groups for later processing is to copy them into a separate container—a list, for example.
To reinforce these concepts, you look at an example that produces a set of data groups that can be processed independently. The example is built up to the final, correct solution starting from an initial, naïve, but broken solution.
First, you define several groups of data and use the built-in all()
function as a key. The all()
function returns True when all of its input data items are True.
>>> data = [[1,2,3,4,5],[6,7,8,9,0],[0,2,4,6,8],[1,3,5,7,9]]
>>> for d in data: print(all(d))
...
True
False
False
True
Next, you apply the groupby()
function to your data:
>>> for ky,grp in it.groupby(data,key=all):
... print(ky, grp)
...
True <itertools._grouper object at 0x7fd3ee2c>
False <itertools._grouper object at 0x7fd3ee8c>
True <itertools._grouper object at 0x7fd3ee2c>
You can see that groupby()
returned two separate groups both keyed on True
. To avoid that you must sort the data before processing it with groupby()
, like this:
>>> for ky,grp in it.groupby(sorted(data,key=all), key=all):
... print(ky, grp)
...
False <itertools._grouper object at 0x7fd3ef4c>
True <itertools._grouper object at 0x7fd3ee2c>
Now you want to try to access these groups, so you store each one in a variable:
>>> for ky,grp in it.groupby(sorted(data,key=all), key=all):
... if ky: trueset = grp
... else: falseset=grp
...
>>> for item in falseset: print(item)
...
>>>
As you can see, falseset
is empty. That’s because the falseset
group was created first and then the underlying iterator (grp) moved on, thus invalidating the value just stored in falseset
. To save the sets for later access, you need to store them as lists, like this:
>>> groups = {True:[], False:[]}
>>> for ky,grp in it.groupby(sorted(data,key=all), key=all):
... groups[ky].append(list(grp))
...
>>> groups
{False: [[[6, 7, 8, 9, 0], [0, 2, 4, 6, 8]]],
True: [[[1, 2, 3, 4, 5], [1, 3, 5, 7, 9]]]}
>>>
Notice that you created a dictionary whose keys are the expected ones (True and False), and whose values are lists. You then had to append the groups, converted to lists, as you found them. This may seem complex, but if you remember to sort the input data first and copy the groups into lists as groupby()
generates them, you will find that groupby()
is a powerful and useful tool.
You’ve seen what itertools
has to offer, so now it’s time to try using it with your LendyDB
data. You want to repeat the analysis that you did using the standard tools, but see how the itertools
functions can be brought to bear, too. Remember, the real point of the itertools
module is not so much that it gives you new features, but rather that it lets you process large volumes of data more efficiently. Given the tiny amount of data you are using in this chapter, you won’t actually see any efficiency improvements, but as you scale the data volumes up, it does make a difference.
In this section you have seen how a mix of the conventional Python data structures, functions, and operators, combined with the functional techniques of the itertools
module, enable you to perform complex analysis of quite large data sets. However, there comes a point when the volume and complexity of the data call for another approach, and that means introducing a new technology: relational databases powered by the Structured Query Language (SQL).
In this section you are introduced to some of the concepts behind SQL and relational databases. You find out how to use SQL to create data tables and populate them with data, and how to manipulate the data contained within those tables. You go on to link tables to capture the relationships between data and finally apply all of these techniques to your lending library data.
The basic principle of a relational database is very simple: it is a set of two-dimensional tables. Columns are known as fields and rows as records. Field values can refer to other records, either in the same table or in another table—that’s the “relational” part.
A table holding data about employees might look like Table 3.3.
Table 3.3 Employee Data
EMPID | NAME | HIREDATE | GRADE | MANAGERID |
1020304 | John Brown | 20030623 | Foreman | 1020311 |
1020305 | Fred Smith | 20040302 | Laborer | 1020304 |
1020307 | Anne Jones | 19991125 | Laborer | 1020304 |
Notice a convention demonstrated by this data: the ManagerID field is simply a reference to another EmpID entry in the same table. Looking at your data, you see that both Fred and Anne are managed by John who is, in turn, managed by someone else, whose details are not visible in this section of the table.
You are not restricted to linking data within a single table. You can create another table for Salary. A salary can be related to Grade, and so you get a second table like Table 3.4.
Table 3.4 Salary Data
SALARYID | GRADE | AMOUNT |
000010 | Foreman | 60000 |
000011 | Laborer | 35000 |
To determine employee John Brown’s salary, you would first look up John’s grade in the main employee data table. You would then consult the Salary table to learn what an employee of that grade is paid. Thus you can see that John, a foreman, is paid $60,000.
Relational databases take their name from this ability to link table rows together in relationships. Other database types include network databases, hierarchical databases, and flat-file databases (which includes the DBM databases you looked at earlier in the chapter). For large volumes of data, relational databases are by far the most common.
You can do much more sophisticated queries, too, and you look at how to do this in the next few sections. But before you can query anything, you need to create a database and insert some data.
The Structured Query Language, or SQL (pronounced either as Sequel or as the letters S-Q-L), is the standard software tool for manipulating relational databases. In SQL an expression is often referred to as a query, regardless of whether it actually returns any data.
SQL comprises two parts. The first is the data definition language (DDL). This is the set of commands used to create and alter the shape of the database itself: its structure. DDL tends to be quite specific to each database, with each vendor's DDL having a slightly different syntax.
The other part of SQL is the data manipulation language (DML). DML, used to manipulate database content rather than structure, is much more highly standardized between databases. You spend the majority of your time using DML rather than DDL.
You only look briefly at DDL, just enough to create (with the CREATE command) and destroy (with the DROP command) your database tables so that you can move on to filling them with data, retrieving that data in interesting ways, and even modifying it, using the DML commands (INSERT, SELECT, UPDATE, and DELETE).
To create a table in SQL you use the CREATE command. It is quite easy to use and takes the following form:
CREATE TABLE tablename (fieldName, fieldName,....);
SQL is not case-sensitive and, unlike Python, does not care about whitespace or indentation levels. An informal style convention is used, but it is not rigidly adhered to, and SQL itself cares not a jot!
Try creating your Employee and Salary tables in SQLite. The first thing to do is start the interpreter, which you do simply by invoking sqlite3 with a single command-line argument, the database filename. If that database file exists, the interpreter will open it; otherwise it will create a new database file by that name. (If you omit the database filename entirely, the interpreter will still process your commands, but your data will exist only in RAM, and will disappear irretrievably when you exit the interpreter.)
Thus, to create an employee database you execute the SQLite interpreter like this:
$ sqlite3 employee.db
SQLite version 3.8.2 2013-12-06 14:53:30
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>
The interpreter creates an empty database called employee.db and leaves you at the sqlite> prompt, ready to type SQL commands. You are now ready to create some tables:
sqlite> create table Employee
...> (EmpID,Name,HireDate,Grade,ManagerID);
sqlite> create table Salary
...> (SalaryID, Grade,Amount);
sqlite> .tables
Employee Salary
sqlite>
Note that you moved the list of fields into a separate line, making it easier to see them. The fields are listed by name but have no other defining information such as data type. This is a peculiarity of SQLite; most databases require you to specify the type along with the name. It is possible to specify types in SQLite too, but it is not essential (you look at this in more detail later in the chapter).
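You can drive the same DDL from Python using the standard library's sqlite3 module. The following is a minimal sketch, not part of the chapter's download: it uses an in-memory database rather than employee.db so it leaves no file behind, and it queries the sqlite_master catalog as a stand-in for the interpreter's .tables command.

```python
import sqlite3

# In-memory database; pass a filename such as 'employee.db' to persist to disk.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# SQLite lets you omit the column types, just as at the sqlite> prompt.
cur.execute("CREATE TABLE Employee (EmpID, Name, HireDate, Grade, ManagerID)")
cur.execute("CREATE TABLE Salary (SalaryID, Grade, Amount)")

# Querying the sqlite_master catalog mimics the interpreter's .tables command.
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
tables = [row[0] for row in cur.fetchall()]
print(tables)  # ['Employee', 'Salary']
```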
Also note that you tested that the create statements had worked by using the .tables command to list all the tables in the database. The SQLite interpreter supports several of these dot commands that you can use to find out about your database. .help provides a list of the commands along with a brief description of their functions.
You can do lots of other things when you create a table. As well as declaring the types of data in each column, you can also specify constraints on the values. Constraints are rules that the database enforces to help ensure the data remains consistent. For example, NOT NULL means the value is mandatory and must be filled in, and UNIQUE means that no other record can have the same value in that field. Usually you specify the primary key field to be NOT NULL and UNIQUE. You can also specify which field is the PRIMARY KEY. You look more closely at these more advanced creation options later in the chapter.
For now you leave the basic table definition as it is and move on to the more interesting topic of actually creating some data.
The first thing to do after creating the tables is to fill them with data. You do this using the SQL INSERT statement. The structure is very simple:
INSERT INTO Tablename ( column1, column2... ) VALUES ( value1, value2... );
An alternate form of INSERT uses a query to select data from elsewhere in the database, but that's too advanced at this stage. You can read about it in the SQLite manual, which you can find at http://sqlite.org/lang.html.
To insert some rows into your Employee table, do the following:
sqlite> insert into Employee (EmpID, Name, HireDate, Grade, ManagerID)
...> values ('1020304','John Brown','20030623','Foreman','1020311');
sqlite> insert into Employee (EmpID, Name, HireDate, Grade, ManagerID)
...> values ('1020305','Fred Smith','20040302','Laborer','1020304');
sqlite> insert into Employee (EmpID, Name, HireDate, Grade, ManagerID)
...> values ('1020307','Anne Jones','19991125','Laborer','1020304');
And for the Salary table:
sqlite> insert into Salary (SalaryID, Grade,Amount)
...> values('000010','Foreman',60000);
sqlite> insert into Salary (SalaryID, Grade,Amount)
...> values('000011','Laborer',35000);
And that's it. You have created two tables and populated them with data corresponding to the values described in the introduction. Notice that you used actual numbers for the salary amount, not just string representations. SQLite tries to determine the correct data type based on the INSERT values you provide. Because it makes the most sense to have SQLite maintain salary data as a numeric type, it behooves you to inform SQLite of that preference in your INSERT statements by specifying salary data in numeric, not string, format.
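The same point about numeric types applies when inserting from Python. The sketch below (assumed, not from the download) uses sqlite3 with an in-memory database; binding Python ints through ? placeholders stores the Amount values as numbers, which you can confirm with SQLite's typeof() function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Salary (SalaryID, Grade, Amount)")

# Binding Python ints through ? placeholders stores Amount as a number,
# not as a string representation of a number.
rows = [("000010", "Foreman", 60000), ("000011", "Laborer", 35000)]
cur.executemany("INSERT INTO Salary (SalaryID, Grade, Amount) VALUES (?, ?, ?)", rows)

cur.execute("SELECT typeof(Amount) FROM Salary")
types = cur.fetchall()
print(types)  # [('integer',), ('integer',)]
```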
Now you are ready to start experimenting with the data. This is where the fun starts!
You read data from a database using SQL's SELECT command. SELECT is the very heart of SQL and has the most complex structure of all the SQL commands. You start with the most basic form and add additional features as you go. The most basic SELECT statement looks like this:
SELECT column1, column2... FROM table1,table2...;
To select the names of all employees in your database, you use:
sqlite> SELECT Name from Employee;
John Brown
Fred Smith
Anne Jones
You are rewarded with a list of all of the names in the Employee table. In this case that's only three, but if you have a big database, that's probably going to be more information than you want. To restrict the output, you need to be able to limit your search somewhat. SQL enables you to do this by adding a WHERE clause to your SELECT statement, like this:
SELECT col1,col2... FROM table1,table2... WHERE condition;
The condition is an arbitrarily complex boolean expression that can even include nested SELECT statements within it.
Now, add a WHERE clause to refine your search of names. This time you only look for names of employees who are laborers:
sqlite> SELECT Name
...> FROM Employee
...> WHERE Employee.Grade = 'Laborer';
Fred Smith
Anne Jones
You only get two names back, not three (because John Brown is not a laborer). You can extend the WHERE condition using boolean operators such as AND, OR, NOT, and so on. Note that = in a WHERE condition performs a case-sensitive test, so the case of the string is important; testing for 'laborer' would not have worked!
SQLite has some functions that can be used to manipulate strings, but it also has a comparison operator called LIKE that uses % as a wildcard character for more flexible searching. The example just shown, written using LIKE, looks like this:
sqlite> SELECT Name FROM Employee
...> WHERE lower(employee.grade) LIKE 'lab%';
Fred Smith
Anne Jones
After converting the grade to lowercase, you then tested it for an initial substring of 'lab'. When used in conjunction with lower(), upper(), and SQLite's other string-manipulation functions, LIKE can greatly increase the scope of your text-based searches. The SQLite documentation has a full list of the functions available.
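As a sketch of how such a LIKE query looks when run from Python's sqlite3 module (an in-memory copy of the sample data is assumed here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Employee (EmpID, Name, HireDate, Grade, ManagerID)")
cur.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?, ?, ?)",
    [("1020304", "John Brown", "20030623", "Foreman", "1020311"),
     ("1020305", "Fred Smith", "20040302", "Laborer", "1020304"),
     ("1020307", "Anne Jones", "19991125", "Laborer", "1020304")])

# lower() folds the grade to lowercase; the trailing % wildcard then
# matches any grade beginning with 'lab'.
cur.execute("SELECT Name FROM Employee WHERE lower(Grade) LIKE 'lab%'")
names = [row[0] for row in cur.fetchall()]
print(names)  # ['Fred Smith', 'Anne Jones']
```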
Notice too that in the WHERE clause you used dot notation (employee.grade) to signify the Grade field. In this case it was not really needed because you were only working with a single table (Employee, as specified in the FROM clause) but, where multiple tables are specified, you need to make clear which table the field belongs to. As an example, change your query to find the names of all employees paid more than $50,000. To do that, you need to consider data in both tables:
sqlite> SELECT Name, Amount
...> FROM Salary, Employee
...> WHERE Employee.Grade = Salary.Grade
...> AND Salary.Amount > 50000;
John Brown|60000
As expected, you only get one name back, that of the foreman. But notice that you also got back the salary, because you added Amount to the list of columns selected. Also note that you have two parts to your WHERE clause, combined using an AND boolean operator. The first part links the two tables by ensuring that the common fields are equal; this is known as a join in SQL. A couple of other features of this query are worth noting.
Because the fields that you are selecting exist in two separate tables, you have to specify both of the tables from which the result comes. The order of the field names is the order in which you get the data back, but the order of the tables doesn't matter so long as the specified fields appear in those tables.
You specified two unique field names, so SQLite can figure out which table to take them from. If you had also wanted to display the Grade, which appears in both tables, you would have had to use dot notation to specify which table's Grade you wanted, like this:
sqlite> SELECT Employee.Grade, Name, Amount
...> FROM Employee, Salary
...
Note in particular that SQL requires such qualification even though the choice of table for the Grade field does not really matter here, because the WHERE condition guarantees that the grades of the two tables have identical values in any displayed result row.
The final feature of SELECT discussed here (although you can read about several more in the SQL documentation for SELECT) is the capability to sort the output. Databases generally hold data either in the order that makes it easiest to find things, or in the order in which it was inserted; in either case that's not usually the order you want things displayed! To deal with that you can use the ORDER BY clause of the SELECT statement. It looks like this:
SELECT columns FROM tables WHERE expression ORDER BY columns;
Notice that the final ORDER BY clause can take multiple columns; this enables you to have primary, secondary, tertiary, and so on sort orders.
You can use this to get a list of names of employees sorted by HireDate:
sqlite> SELECT Name
...> FROM Employee
...> ORDER BY HireDate;
Anne Jones
John Brown
Fred Smith
(It is interesting to note that HireDate was perfectly acceptable as an ORDER BY column, even though HireDate is not a column selected for display.)
And that's really all there is to it; you can't get much easier than that! The only thing worthy of mention is that you didn't use a WHERE clause. If you had used one, it would have had to come before the ORDER BY clause. Thus, although SQL doesn't require that all components of a SELECT statement be present, it does require that those elements that are present appear in a prescribed order.
That’s enough about reading data; you now look at how to modify your data in place.
You can change the data in your database in two ways. You can alter the contents of one or more records or, more drastically, you can delete a record or even the contents of a whole table. Changing the content of an existing record is the more common case, and you do that using SQL's UPDATE command.
The basic format is:
UPDATE table SET column = value WHERE condition;
You can try it out on the Employee database by changing the salary of a foreman to $70,000:
sqlite> UPDATE Salary
...> SET Amount = 70000
...> WHERE Grade = 'Foreman';
Be careful to get the WHERE clause right. If you don't specify one, every row in the table is modified, and that's not usually a good idea. Similarly, if the WHERE clause is not specific enough, you end up changing more rows than you want. One way to check you have it right is to do a SELECT using the same WHERE clause and check that only the rows you want to change are found. If all is well, you can repeat the WHERE clause from the SELECT in your UPDATE statement.
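This check-then-update pattern is easy to script from Python's sqlite3 module. Here is a minimal sketch (assumed, using an in-memory copy of the Salary table); the cursor's rowcount attribute reports how many rows the UPDATE touched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Salary (SalaryID, Grade, Amount)")
cur.executemany("INSERT INTO Salary VALUES (?, ?, ?)",
                [("000010", "Foreman", 60000), ("000011", "Laborer", 35000)])

# Step 1: preview exactly which rows the WHERE clause matches.
cur.execute("SELECT * FROM Salary WHERE Grade = 'Foreman'")
matched = cur.fetchall()
print(len(matched))  # 1, so only the intended row will change

# Step 2: reuse the same WHERE clause in the UPDATE.
cur.execute("UPDATE Salary SET Amount = 70000 WHERE Grade = 'Foreman'")
changed = cur.rowcount
cur.execute("SELECT Amount FROM Salary WHERE Grade = 'Foreman'")
new_amount = cur.fetchone()[0]
print(changed, new_amount)  # 1 70000
```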
A more drastic change you might need to make to your database table, rather than merely modifying certain fields of a given row or rows, is to delete one or more rows from the table entirely. You do this using SQL's DELETE FROM command, whose basic form looks like this:
DELETE FROM Table WHERE condition;
So, if you want to delete Anne Jones from your Employee table, you can do it like this:
sqlite> DELETE FROM Employee
...> WHERE Name = 'Anne Jones';
If more than one row matches your WHERE condition, all of the matching rows are deleted. SQL always operates on all the rows that match the specified WHERE condition. In this respect SQL is quite different from, say, a text editor's find-and-replace, which typically modifies only the first occurrence found unless you specifically request otherwise.
An even more drastic change you might want to make to your database is to delete not only all of a table's rows, but the entire table itself. You do this using SQL's DROP command.
Obviously, destructive commands like DELETE and DROP must be used with extreme caution.
The possibility of linking data between tables was mentioned earlier, in the section on SELECT. However, this is such a fundamental part of relational database theory that you consider it in more depth here. The links between tables represent the relationships between data entities that give a relational database such as SQLite its name. The database maintains not only the raw data about the entities, but also information about the relationships.
The information about the relationships is stored in the form of database constraints, applied when you define the database structure using the CREATE statement. Before you see how to use constraints to model relationships, you first need to look deeper into the kinds of constraints available in SQLite.
You normally express the constraints on a field-by-field basis within the CREATE statement. This means you can expand the basic CREATE definition from,
CREATE TABLE Tablename (Column, Column,...);
to:
CREATE TABLE Tablename (
ColumnName Type Constraints,
ColumnName Type Constraints,
...);
The most common constraints are:
NOT NULL
PRIMARY KEY [AUTOINCREMENT]
UNIQUE
DEFAULT value
NOT NULL is fairly self-explanatory; it indicates that the value must exist and not be NULL. A NULL value is simply one that has no specified value, rather like None in Python. If no suitable value is provided for a field governed by a NOT NULL constraint, the insertion fails, not just for that particular field but for the entire row. (Far worse, violation of the constraint can cause a very large database update transaction, possibly involving hundreds or thousands of rows, to fail in its entirety.)
PRIMARY KEY tells SQLite to use this column as the main key for lookups (in practice this means it is optimized for faster searches). The optional AUTOINCREMENT keyword means that an INTEGER type value is automatically assigned on each INSERT and the value automatically incremented by one. This saves a lot of work for the programmer in maintaining separate counts. Note that the AUTOINCREMENT keyword is not normally used; rather, the behavior is implied by a type/constraint combination of INTEGER PRIMARY KEY. This not-so-obvious quirk of the SQLite documentation trips up enough people for it to appear at the top of the SQLite Frequently Asked Questions page, found here: http://sqlite.org/faq.html.
The UNIQUE constraint means that the value of the field must be unique within the specified column. If you try to insert a duplicate value into a column that has a UNIQUE constraint, an error results and the row is not inserted. UNIQUE is often used for non-INTEGER type PRIMARY KEY columns.
DEFAULT is always accompanied by a value. The value is what SQLite inserts into that field if the user does not explicitly provide one.
Here is a short example showing some of these constraints, including the use of DEFAULT:
sqlite> CREATE table test
...> (Id INTEGER PRIMARY KEY,
...> Name NOT NULL,
...> Value INTEGER DEFAULT 42);
sqlite> INSERT INTO test (Name, Value) VALUES ('Alan', 24);
sqlite> INSERT INTO test (Name) VALUES ('Heather');
sqlite> INSERT INTO test (Name, Value) VALUES ('Laura', NULL);
sqlite> SELECT * FROM test;
1|Alan|24
2|Heather|42
3|Laura|
The first thing to notice is that although none of the INSERT statements had an Id value, there is an Id value in the SELECT output. That's because, by specifying Id to be an INTEGER PRIMARY KEY, it is automatically generated by SQLite. Notice too how the entry for Heather has the default Value set. Also note that the Value for Laura is nonexistent, or NULL. There is an important difference between NOT NULL and DEFAULT. The former does not allow NULL values, either by default or explicitly. The DEFAULT constraint prevents unspecified NULLs, but does not prevent deliberate creation of NULL values.
You can also apply constraints to the table itself, such as how to handle data conflicts like duplicates in a UNIQUE column. For example, a table constraint could specify that where a conflict occurs the entire database query is cancelled, or it could specify that only the changes to the conflicting row are cancelled. Table constraints are not discussed further in this chapter; you should consult the documentation for details.
The other kind of constraint that you can apply, as already mentioned, is to specify the column type. This is very like the concept of data types in a programming language. The valid types in SQLite are as follows:
TEXT
INTEGER
REAL
NUMERIC
BLOB
NULL
These types should be self-evident, with the possible exceptions of NUMERIC, which enables the storage of floating-point numbers as well as integers, and BLOB, which stands for Binary Large Object and is typically used for media data such as images. NULL is not really a type, but simply indicates that no explicit type has been specified. Most databases come with a much wider set of types including, crucially, a DATE type. As you are about to see, however, SQLite has a somewhat unconventional approach to types that renders such niceties less relevant.
Most databases strictly apply the types specified. However, SQLite employs a more dynamic scheme, where the type specified is more like a hint and any type of data can be stored in the table. When data of a different type is loaded into a field, SQLite uses the declared type to try to convert the data, but if it cannot be converted, it is stored in its original form. Thus, if a field is declared as INTEGER but the TEXT value '123' is passed in, SQLite converts the string '123' to the number 123. However, if the TEXT value 'Freddy' is passed in, the conversion fails, so SQLite simply stores the string 'Freddy' in the field! This can cause some strange behavior if you are not aware of this foible. Most databases treat the type declaration as a strict constraint and fail if an illegal value is passed.
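You can observe this foible directly using SQLite's typeof() function. Here is a minimal sketch from Python's sqlite3 module (in-memory database assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (n INTEGER)")

cur.execute("INSERT INTO t VALUES ('123')")     # convertible: stored as the number 123
cur.execute("INSERT INTO t VALUES ('Freddy')")  # not convertible: stored as text

cur.execute("SELECT n, typeof(n) FROM t")
stored = cur.fetchall()
print(stored)  # [(123, 'integer'), ('Freddy', 'text')]
```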
Having seen the various kinds of constraints available and how you can use them in your database, it’s time to return to the topic of modeling relationships. So how do constraints help you to model data and, in particular, relationships? Look again at your simple two-table database, as summarized in Table 3.5 and Table 3.6.
Table 3.5 Employee Database Table
EMPID | NAME | HIREDATE | GRADE | MANAGERID |
1020304 | John Brown | 20030623 | Foreman | 1020311 |
1020305 | Fred Smith | 20040302 | Laborer | 1020304 |
1020307 | Anne Jones | 19991125 | Laborer | 1020304 |
Table 3.6 Salary Database Table
SALARYID | GRADE | AMOUNT |
000010 | Foreman | 60000 |
000011 | Laborer | 35000 |
Looking at the Employee table first, you see that the EmpID value should be of INTEGER type and have a PRIMARY KEY constraint; the other columns, with the possible exception of the ManagerID, should be NOT NULL. ManagerID should also be of type INTEGER.
For the Salary table you see that, once more, the SalaryID should be an INTEGER with PRIMARY KEY. The Amount column should also be an INTEGER, and you should apply a DEFAULT value of, say, 20000. Finally, the Grade column should be constrained as UNIQUE because you don't want more than one salary per grade! (Actually, this is a bad idea because normally salary varies with things like length of service as well as grade, but you ignore such niceties for now. In fact, in the real world, you should probably call this a Grade table and not Salary.)
The modified SQL looks like this:
sqlite> CREATE TABLE Employee (
...> EmpID INTEGER PRIMARY KEY,
...> Name NOT NULL,
...> HireDate NOT NULL,
...> Grade NOT NULL,
...> ManagerID INTEGER
...> );
sqlite> CREATE TABLE Salary (
...> SalaryID INTEGER PRIMARY KEY,
...> Grade UNIQUE,
...> Amount INTEGER DEFAULT 20000
...> );
You can try out these constraints by attempting to enter data that breaks them and see what happens. Hopefully you see an error message!
One thing to point out here is that the INSERT statements you used previously are no longer adequate. You previously inserted your own values for the ID fields, but these are now auto-generated, so you can (and should) omit them from the inserted data. But this gives rise to a new difficulty: How can you populate the ManagerID field if you don't know the EmpID of the manager? The answer is that you can use a nested select statement. In this example, you do this in two stages, using NULL fields initially and then using an UPDATE statement after creating all the rows.
To avoid a lot of repeat typing, you can put all of the commands in a couple of files, called employee.sql for the table creation commands and load_employee.sql for the INSERT statements. This is the same idea as creating a Python script file ending in .py to save typing everything at the >>> prompt.
The employee.sql file looks like this (and is in the SQL folder of the Chapter3.zip download):
DROP TABLE Employee;
CREATE TABLE Employee (
EmpID INTEGER PRIMARY KEY,
Name NOT NULL,
HireDate NOT NULL,
Grade NOT NULL,
ManagerID INTEGER
);
DROP TABLE Salary;
CREATE TABLE Salary (
SalaryID INTEGER PRIMARY KEY,
Grade UNIQUE,
Amount INTEGER DEFAULT 10000
);
Notice that you drop the tables before creating them. The DROP TABLE command, as mentioned earlier, deletes the table and any data within it. This ensures the database is in a completely clean state before you start creating your new table. (You will get some errors reported the first time you run this script because no tables exist to DROP, but you can ignore them. Subsequent executions should be error free.)
The load_employee.sql script looks like this (and is also available in the SQL folder of the .zip file):
INSERT INTO Employee (Name, HireDate, Grade, ManagerID)
VALUES ('John Brown','20030623','Foreman', NULL);
INSERT INTO Employee (Name, HireDate, Grade, ManagerID)
VALUES ('Fred Smith','20040302','Labourer',NULL);
INSERT INTO Employee (Name, HireDate, Grade, ManagerID)
VALUES ('Anne Jones','19991125','Labourer',NULL);
UPDATE Employee
SET ManagerID = (SELECT EmpID
From Employee
WHERE Name = 'John Brown')
WHERE Name IN ('Fred Smith','Anne Jones');
INSERT INTO Salary (Grade, Amount)
VALUES('Foreman','60000');
INSERT INTO Salary (Grade, Amount)
VALUES('Labourer','35000');
Notice the use of the nested SELECT statement inside the UPDATE command, and also the fact that you used a single UPDATE to modify both employee rows at the same time by using the SQL IN operator, which works like the Python in keyword for testing membership of a collection. By extending the set of names being tested, you can easily add more employees with the same manager.
This is typical of the problems you can have when populating a database when constraints are being used. You need to plan the order of the statements carefully to ensure that, for every row that contains a reference to another table, you have already provided the data for it to reference! It’s a bit like starting at the leaves of a tree and working back to the trunk. Always create, or insert, the data with no references first, then the data that references that data, and so on. If you are adding data after the initial creation, you need to use queries to check that the data you need already exists, and add it if it doesn’t. At this point a scripting language like Python becomes invaluable!
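Here is how that two-stage load, NULL manager IDs first and then an UPDATE with a nested SELECT, might look when scripted from Python's sqlite3 module. This is a sketch (in-memory database assumed); SQLite's generated EmpID values start at 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""CREATE TABLE Employee (
                   EmpID INTEGER PRIMARY KEY,
                   Name NOT NULL,
                   HireDate NOT NULL,
                   Grade NOT NULL,
                   ManagerID INTEGER)""")

# Stage 1: insert everyone with a NULL ManagerID; EmpID is auto-generated.
staff = [("John Brown", "20030623", "Foreman"),
         ("Fred Smith", "20040302", "Labourer"),
         ("Anne Jones", "19991125", "Labourer")]
cur.executemany("INSERT INTO Employee (Name, HireDate, Grade, ManagerID) "
                "VALUES (?, ?, ?, NULL)", staff)

# Stage 2: the nested SELECT looks up the manager's generated EmpID.
cur.execute("""UPDATE Employee
               SET ManagerID = (SELECT EmpID FROM Employee
                                WHERE Name = 'John Brown')
               WHERE Name IN ('Fred Smith', 'Anne Jones')""")

cur.execute("SELECT Name, ManagerID FROM Employee ORDER BY EmpID")
result = cur.fetchall()
print(result)  # [('John Brown', None), ('Fred Smith', 1), ('Anne Jones', 1)]
```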
Finally, you run these from the SQLite prompt like this:
sqlite> .read employee.sql
sqlite> .read load_employee.sql
Make sure you have the path issues sorted out, though: either start sqlite3 from wherever the .sql scripts are stored (as you did earlier) or provide the full path to the script.
Now try a query to check that everything is as it should be:
sqlite> SELECT Name
...> FROM Employee
...> WHERE Grade IN
...> (SELECT Grade FROM Salary WHERE amount >50000)
...> ;
John Brown
That seems to work; John Brown is the only employee earning more than $50,000. Notice that you used an IN condition combined with another embedded SELECT statement. This is a variation on a similar query that you performed previously using a cross-table join. Both techniques work, but usually the join approach is faster.
Although this is an improvement over the original unconstrained definition, and it ensures that the ManagerID is an integer, it does not ensure that the integer is a valid EmpID key. You need the embedded SELECT statement for that. However, SQLite offers one more constraint that helps you ensure the data is consistent, and that is the REFERENCES constraint. This tells SQLite that a given field references another specified field somewhere in the database. You can apply the REFERENCES constraint to the ManagerID field by modifying the CREATE statement like this:
CREATE TABLE Employee (
...
ManagerID INTEGER REFERENCES Employee(EmpID)
);
You see that the REFERENCES constraint specifies the table and the key field within it. At the time of writing, SQLite does not actually enforce this constraint by default (although it is worth including in your code as documentation of your intentions). However, you can turn checking on by using a pragma statement. A pragma is a special command that is used to control the way the interpreter works. It looks like this:
PRAGMA Foreign_Keys=True;
If you now modify the employee.sql file to add the pragma and modify the create statement, it should look like this:
PRAGMA Foreign_Keys=True;
DROP TABLE Employee;
CREATE TABLE Employee (
EmpID INTEGER PRIMARY KEY,
Name NOT NULL,
HireDate NOT NULL,
Grade NOT NULL,
ManagerID INTEGER REFERENCES Employee(EmpID)
);
DROP TABLE Salary;
CREATE TABLE Salary (
SalaryID INTEGER PRIMARY KEY,
Grade UNIQUE,
Amount INTEGER DEFAULT 10000
);
After running the script and reloading the data with the load_employee.sql script, you can check that it works by trying to insert a new employee with a ManagerID not already in the table, like this:
sqlite> .read employee.sql
sqlite> .read load_employee.sql
sqlite> insert into Employee (Name,HireDate,Grade,ManagerID)
...> values('Fred', '20140602','Laborer',999);
Error: FOREIGN KEY constraint failed
sqlite>
This is a big advantage in keeping your data consistent. It is now impossible for a non-valid worker-to-manager relationship to be created (although you can still use NULL values to indicate that no manager relationship exists).
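From Python, the same pragma is issued as an ordinary statement, and a violated REFERENCES constraint surfaces as sqlite3.IntegrityError. A minimal sketch with a trimmed-down Employee table (in-memory database assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("PRAGMA foreign_keys = ON")  # enforcement is off unless switched on

cur.execute("""CREATE TABLE Employee (
                   EmpID INTEGER PRIMARY KEY,
                   Name NOT NULL,
                   ManagerID INTEGER REFERENCES Employee(EmpID))""")
cur.execute("INSERT INTO Employee (Name, ManagerID) VALUES ('John Brown', NULL)")

try:
    # 999 is not an existing EmpID, so the REFERENCES constraint is violated.
    cur.execute("INSERT INTO Employee (Name, ManagerID) VALUES ('Fred', 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```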
One scenario you haven’t covered is where two tables are linked in a many-to-many relationship. That is, a row in one table can be linked to several rows in a second table and a row in the second table can at the same time be linked to many rows in the first table.
Consider an example. Imagine creating a database to support a book publishing company. It needs lists of authors and lists of books. Each author may write one or more books, and each book may have one or more authors. How do you represent that in a database? The solution is to represent the relationship between books and authors as a table in its own right. Such a table is often called an intersection table or a mapping table. Each row of this table represents a book/author relationship. Now each book has potentially many book/author relationships, but each relationship has only one book and one author, so you have converted a many-to-many relationship into two one-to-many relationships. And you already know how to build one-to-many relationships using IDs. It looks like this (you can find the code in the file books.sql in the SQL folder of the .zip file):
PRAGMA Foreign_Keys=True;
drop table author;
create table author (
ID Integer PRIMARY KEY,
Name Text NOT NULL
);
drop table book;
create table book (
ID Integer PRIMARY KEY,
Title Text NOT NULL
);
drop table book_author;
create table book_author (
bookID Integer NOT NULL REFERENCES book(ID),
authorID Integer NOT NULL REFERENCES author(ID)
);
insert into author (Name) values ('Jane Austin');
insert into author (Name) values ('Grady Booch');
insert into author (Name) values ('Ivar Jacobson');
insert into author (Name) values ('James Rumbaugh');
insert into book (Title) values ('Pride & Prejudice');
insert into book (Title) values ('Emma');
insert into book (Title) values ('Sense & Sensibility');
insert into book (Title) values ('Object Oriented Design with Applications');
insert into book (Title) values ('The UML User Guide');
insert into book_author (BookID,AuthorID) values (
(select ID from book where title = 'Pride & Prejudice'),
(select ID from author where Name = 'Jane Austin')
);
insert into book_author (BookID,AuthorID) values (
(select ID from book where title = 'Emma'),
(select ID from author where Name = 'Jane Austin')
);
insert into book_author (BookID,AuthorID) values (
(select ID from book where title = 'Sense & Sensibility'),
(select ID from author where Name = 'Jane Austin')
);
insert into book_author (BookID,AuthorID) values (
(select ID from book where title = 'Object Oriented Design with Applications'),
(select ID from author where Name = 'Grady Booch')
);
insert into book_author (BookID,AuthorID) values (
(select ID from book where title = 'The UML User Guide'),
(select ID from author where Name = 'Grady Booch')
);
insert into book_author (BookID,AuthorID) values (
(select ID from book where title = 'The UML User Guide'),
(select ID from author where Name = 'Ivar Jacobson')
);
insert into book_author (BookID,AuthorID) values (
(select ID from book where title = 'The UML User Guide'),
(select ID from author where Name = 'James Rumbaugh')
);
If you look at the values inserted into the tables, you see that Jane Austin has three books to her credit, while the book The UML User Guide has three authors.
If you load that into SQLite in a database called books.db (or just use the file books.db found in the SQL folder of the Chapter3.zip file), you can try some queries to see how it works:
$ sqlite3 books.db
SQLite version 3.8.2 2013-12-06 14:53:30
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .read books.sql
Error: near line 3: no such table: author
Error: near line 9: no such table: book
Error: near line 15: no such table: book_author
sqlite> .tables
author book book_author
Notice the errors resulting from the DROP statements. You always get those the first time you run the script because the tables don’t exist yet. Now you can find out which Jane Austin books are published:
sqlite> SELECT title FROM book, book_author
...> WHERE book_author.bookID = book.ID
...> AND book_author.authorID = (
...> SELECT ID from Author
...> WHERE name='Jane Austin');
Pride & Prejudice
Emma
Sense & Sensibility
Things are getting a bit more complex, but if you sit and work through it you’ll get the idea soon enough. Notice you need to include both of the referenced tables—book and book_author—in the table list after the SELECT. (The third table, author, is not listed there because it appears only within its own embedded SELECT statement.) Now try it the other way around—see who wrote The UML User Guide:
sqlite> SELECT name FROM author, book_author
...> WHERE book_author.authorID = author.ID
...> AND book_author.bookID = (
...> SELECT ID FROM book
...> WHERE title='The UML User Guide');
Grady Booch
Ivar Jacobson
James Rumbaugh
If you look closely you see that the structure of the two queries is identical—you just swapped around the table and field names a little.
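The same queries can be issued from Python through the sqlite3 module. The following sketch is self-contained: rather than reading books.db, it builds a one-book miniature of the books schema in an in-memory database, but the intersection-table query is the same one used at the sqlite3 prompt.

```python
import sqlite3

# A one-book miniature of the books schema, built in memory.
db = sqlite3.connect(':memory:')
db.executescript('''
create table author (ID Integer PRIMARY KEY, Name Text NOT NULL);
create table book (ID Integer PRIMARY KEY, Title Text NOT NULL);
create table book_author (
    bookID Integer NOT NULL REFERENCES book(ID),
    authorID Integer NOT NULL REFERENCES author(ID));
insert into author (Name) values ('Jane Austin');
insert into book (Title) values ('Emma');
insert into book_author (bookID, authorID) values (1, 1);
''')

# The same intersection-table query as at the sqlite3 prompt.
titles = db.execute('''
    SELECT title FROM book, book_author
    WHERE book_author.bookID = book.ID
    AND book_author.authorID = (
        SELECT ID FROM author WHERE name = 'Jane Austin')''').fetchall()
print(titles)   # [('Emma',)]
```

Swapping the table and field names around, exactly as in the transcript, gives you the authors-for-a-book query instead.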
That’s enough for that example; now you can return to your lending library. You see how to convert it from file-based storage to a full SQL database, and then go on to build an accompanying Python module that enables application writers to ignore the SQL and just call Python functions.
In this section you re-create the LendyDB database using a combination of SQL and Python code. Before getting into the nitty-gritty, you need to see how Python and SQL come together.
SQLite provides an application programming interface, or API, consisting of a number of standard functions that allow programmers to perform all the same operations that you have been doing without using the interactive SQL prompt. The SQLite API is written in C, but wrappers have been provided for other languages, including Python. Python has similar interfaces to many other databases, and they all expose a standard set of functions with very similar functionality. This interface is called the Python DBAPI, and its existence makes porting code between databases much easier than if each database had its own interface.
The DBAPI defines a couple of useful conceptual objects that form the core of the interface. These are connections and cursors.
A connection is the conduit between your application code and the database’s SQL engine. The name comes from the client-server architecture used by most SQL databases whereby the client must connect to the server over a network. For SQLite the connection is actually to the data file via the SQLite library. The arguments passed to create the connection are of necessity database-specific. For example, many databases require a user ID and password, and some require IP addresses and ports, whereas SQLite just requires a filename. It looks like this:
>>> import sqlite3
>>> db = sqlite3.connect('D:/PythonCode/Chapter3/SQL/lendy.db')
Once a connection has been established, you can go on to create cursors, which are the mechanism used to issue SQL commands and receive their output. In principle you can have more than one cursor per connection, but in practice that is rarely needed.
When accessing a database from within a program, one important consideration is how to access the many rows of data potentially returned by a SELECT statement without running out of memory. The answer is to use what is known in SQL as a cursor. In Python a cursor is an iterator, so it can be accessed one row at a time. Thus, by selecting data into a cursor and using the cursor methods to extract the results either as a list (for smaller volumes of data) or row by row, you can process large collections of data. You now try that out.
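As a quick illustration of the two access styles, the following sketch fetches a small result set as a list and then iterates a cursor row by row. The in-memory database and throwaway table are invented purely for this example:

```python
import sqlite3

db = sqlite3.connect(':memory:')
cur = db.cursor()
cur.execute('create table num (n Integer)')
cur.executemany('insert into num values (?)', [(i,) for i in range(5)])

# Small result sets: pull everything into a list in one go.
cur.execute('select n from num')
all_rows = cur.fetchall()           # [(0,), (1,), (2,), (3,), (4,)]

# Large result sets: iterate the cursor one row at a time instead,
# so only the current row needs to be held in memory.
total = 0
for row in cur.execute('select n from num'):
    total += row[0]
print(total)                        # 10
```

The second form is the one to reach for when a query might return thousands of rows.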
The database design is not significantly different from its previous incarnations. You just have to translate it into SQL syntax and add in some constraints to improve data integrity.
The initial database setup is usually easiest to do using raw SQL commands as you did in the earlier sections. Although it is possible to do it all from Python, occasionally another tool is better suited to the task at hand.
The code looks like this (and is available in the file lendydb.sql in the SQL folder of the .zip file):
PRAGMA Foreign_Keys=True;
drop table loan;
drop table item;
drop table member;
create table member (
ID INTEGER PRIMARY KEY,
Name TEXT NOT NULL,
Email TEXT);
create table item (
ID INTEGER PRIMARY KEY,
Name TEXT NOT NULL,
Description TEXT NOT NULL,
OwnerID INTEGER NOT NULL REFERENCES member(ID),
Price NUMERIC,
Condition TEXT,
DateRegistered TEXT);
create table loan (
ID INTEGER PRIMARY KEY,
ItemID INTEGER NOT NULL REFERENCES item(ID),
BorrowerID INTEGER NOT NULL REFERENCES member(ID),
DateBorrowed TEXT NOT NULL,
DateReturned TEXT);
You need to drop all the tables at the top of the script, and the order is important because otherwise the referential constraints fail, and an error results. Notice also that you used the NUMERIC type for the item price because this caters for both integer and floating-point values. It also means you don’t need to worry about those pesky dollar signs that you had to strip off in the previous incarnations of the data. Apart from the various key fields, the other fields are all declared as TEXT.
Now that the database is ready, you can insert the test data.
The database design now includes some referential constraints so you need to think about the order in which you populate the data. The member data has no references, so it can be loaded first. The item data only references members, so it comes next. Finally, the loan data references both members and items, so it must come last.
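You can see why the loading order matters with a small experiment. This sketch uses a stripped-down two-table version of the schema, invented here for illustration: with the Foreign_Keys pragma enabled, inserting a loan before the member it references is rejected with an IntegrityError.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('PRAGMA Foreign_Keys=True')
db.executescript('''
create table member (ID Integer PRIMARY KEY, Name Text NOT NULL);
create table loan (
    ID Integer PRIMARY KEY,
    BorrowerID Integer NOT NULL REFERENCES member(ID));
''')

# Loading a loan before the member it references violates the constraint...
try:
    db.execute('insert into loan (BorrowerID) values (1)')
    outcome = 'accepted'
except sqlite3.IntegrityError:
    outcome = 'rejected'
print(outcome)   # rejected

# ...but the same insert succeeds once the referenced member exists.
db.execute("insert into member (Name) values ('Fred')")
db.execute('insert into loan (BorrowerID) values (1)')
```

Loading members, then items, then loans avoids this problem in the real script.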
For the data insertion, you are essentially repeating the same INSERT operation over and over with different data values. This time Python is the better solution because you can write the SQL once and execute it many times, using either the executemany() method of the cursor object or by calling the execute() method from within a Python for loop. It looks like this (and the code is in the file lendydata-sql.py):
import sqlite3
members = [
    ['Fred', '[email protected]'],
    ['Mike', '[email protected]'],
    ['Joe', '[email protected]'],
    ['Rob', '[email protected]'],
    ['Anne', '[email protected]'],
]
member_sql = '''insert into member (Name, Email) values (?, ?)'''
items = [
    ['Lawnmower', 'Tool', 0, 150, 'Excellent', '2012-01-05'],
    ['Lawnmower', 'Tool', 0, 370, 'Fair', '2012-04-01'],
    ['Bike', 'Vehicle', 0, 200, 'Good', '2013-03-22'],
    ['Drill', 'Tool', 0, 100, 'Good', '2013-10-28'],
    ['Scarifier', 'Tool', 0, 200, 'Average', '2013-09-14'],
    ['Sprinkler', 'Tool', 0, 80, 'Good', '2014-01-06']
]
item_sql = '''
insert into item
(Name, Description, ownerID, Price, Condition, DateRegistered)
values (?, ?, ?, ?, ?, date(?))'''
set_owner_sql = '''
update item
set OwnerID = (SELECT ID from member where name = ?)
where item.id = ?
'''
loans = [
    [1, 3, '2012-01-04', '2012-04-26'],
    [2, 5, '2012-09-05', '2013-01-05'],
    [3, 4, '2013-07-03', '2013-07-22'],
    [4, 1, '2013-11-19', '2013-11-29'],
    [5, 2, '2013-12-05', None]
]
loan_sql = '''
insert into loan
(itemID, BorrowerID, DateBorrowed, DateReturned )
values (?, ?, date(?), date(?))'''
db = sqlite3.connect('lendy.db')
cur = db.cursor()
cur.executemany(member_sql, members)
cur.executemany(item_sql, items)
cur.executemany(loan_sql, loans)
owners = ('Fred','Mike','Joe','Rob','Anne','Fred')
for item in cur.execute("select id from item").fetchall():
    itemID = item[0]
    cur.execute(set_owner_sql, (owners[itemID-1], itemID))
cur.close()
db.commit()
db.close()
Several things are noteworthy in this script. The first point is the format of the dates. Although SQLite does not have a Date data type, it does have a number of functions that can be used to create standardized date strings and values that it then stores as text (or in some cases as floating-point numbers). The date() function used here requires the date string to be in the format shown, and an invalid date string will be stored as NULL. Using date() therefore improves data quality by ensuring only valid and consistently formatted dates get stored.
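You can verify this behavior directly: a correctly formatted string passes through date() unchanged, while a string in any other format (the day/month/year form below is just an illustrative example) yields NULL, which Python sees as None.

```python
import sqlite3

db = sqlite3.connect(':memory:')

# A string in the expected YYYY-MM-DD format passes through unchanged...
good = db.execute("select date('2013-11-19')").fetchone()[0]
print(good)   # 2013-11-19

# ...but any other format produces NULL, seen from Python as None.
bad = db.execute("select date('19/11/2013')").fetchone()[0]
print(bad)    # None
```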
The item OwnerID field is specified as NOT NULL, so you filled it with a dummy value (0) that is then overwritten by the UPDATE code later in the script.
You could have used a for loop to process the INSERT statements, but instead you used the executemany() method, which takes the statement and a sequence (or iterator or generator) and repeatedly applies the statement until the iteration is completed.
The variable values are inserted into the query strings by using a question mark as a placeholder. This is very similar to the string formatting method that uses {} as a place marker.
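Here is the placeholder mechanism in isolation, using an in-memory copy of the member table and made-up sample data. Unlike {} string formatting, the ? substitution is performed by the SQL engine itself, which quotes the values safely and so also guards against SQL injection.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('create table member (ID Integer PRIMARY KEY, Name Text, Email Text)')

# Sample values invented for this sketch; the ? placeholders are
# filled from the tuple by the SQL engine, never by string formatting.
name, email = 'Fred', '[email protected]'
db.execute('insert into member (Name, Email) values (?, ?)', (name, email))

stored = db.execute('select Name, Email from member').fetchone()
print(stored)   # ('Fred', '[email protected]')
```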
At this point you have created the database and populated it with some test data. The next stage is to make that data accessible to applications via an application programming interface (API).
When defining an API for a database, it’s normal to start off by thinking about the basic entities and providing functions to create, read, update, and delete the items. (This is often known as a CRUD interface after the initials of the operations.) The temptation is to provide these functions as a thin wrapper around the corresponding SQL commands. However, to the application programmer it is not particularly useful to just be given an ID of a member—for example, it would be much better in most cases to get the member name instead. Otherwise, the programmer must perform multiple reads of the member database to create a useful display for the user. On the other hand, there may be times when an ID is the best option because the programmer may want to get more specific details about the member. The skill in building a good API comes from being able to resolve these contradictions in a way that makes the application programmer effective while retaining full access to the data.
Although the CRUD operations are a good foundation, it is often more useful to the application programmer if some higher-level operations are also available that span entities. To do this you need to think about how your data is likely to be used in an application. What kinds of questions will the designer want to ask? If you are also the application programmer, this is relatively easy, but if you are providing the database as part of a bigger project, it gets more difficult.
For the lending library example, you focus only on the CRUD interface to items and members. The principles should be obvious, and you have the opportunity to extend the API to cover loans in Exercise 3.3 at the end of the chapter. (The code for items and members is in the lendydata.py file in the SQL folder of the .zip file.)
'''
Lending library database API
Provides a CRUD interface to item and member entities
and init and close functions for database control.
'''
import sqlite3 as sql
db=None
cursor = None
##### CRUD functions for items ######

def insert_item(Name, Description, OwnerID, Price, Condition):
    query = '''
        insert into item
        (Name, Description, OwnerID, Price, Condition, DateRegistered)
        values (?, ?, ?, ?, ?, date('now'))'''
    cursor.execute(query, (Name, Description, OwnerID, Price, Condition))

def get_items():
    query = '''
        select ID, Name, Description, OwnerID, Price, Condition, DateRegistered
        from item'''
    return cursor.execute(query).fetchall()

def get_item_details(id):
    query = '''
        select Name, Description, OwnerID, Price, Condition, DateRegistered
        from item
        where id = ?'''
    return cursor.execute(query, (id,)).fetchall()[0]

def get_item_name(id):
    return get_item_details(id)[0]

def update_item(id, Name=None, Description=None,
                OwnerID=None, Price=None, Condition=None):
    query = '''
        update item
        set Name=?, Description=?, OwnerID=?, Price=?, Condition=?
        where id=?'''
    data = get_item_details(id)
    if not Name: Name = data[0]
    if not Description: Description = data[1]
    if not OwnerID: OwnerID = data[2]
    if not Price: Price = data[3]
    if not Condition: Condition = data[4]
    cursor.execute(query, (Name, Description, OwnerID, Price, Condition, id))

def delete_item(id):
    query = '''
        delete from item
        where id = ?'''
    cursor.execute(query, (id,))
##### CRUD functions for members ######

def insert_member(name, email):
    query = '''
        insert into member (name, email)
        values (?, ?)'''
    cursor.execute(query, (name, email))

def get_members():
    query = '''
        select id, name, email
        from member'''
    return cursor.execute(query).fetchall()

def get_member_details(id):
    query = '''
        select name, email
        from member
        where id = ?'''
    return cursor.execute(query, (id,)).fetchall()[0]

def get_member_name(id):
    return get_member_details(id)[0]

def update_member(id, Name=None, Email=None):
    query = '''
        update member
        set name=?, email=?
        where id = ?'''
    data = get_member_details(id)
    if not Name: Name = data[0]
    if not Email: Email = data[1]
    cursor.execute(query, (Name, Email, id))

def delete_member(id):
    query = '''
        delete from member
        where id = ?'''
    cursor.execute(query, (id,))
##### Database init and close #######

def initDB(filename=None):
    global db, cursor
    if not filename:
        filename = 'lendy.db'
    try:
        db = sql.connect(filename)
        cursor = db.cursor()
        cursor.execute('PRAGMA Foreign_Keys=True')  # enforce referential integrity
    except:
        print("Error connecting to", filename)
        cursor = None
        raise

def closeDB():
    try:
        cursor.close()
        db.commit()
        db.close()
    except:
        print("problem closing database...")
        raise
if __name__ == "__main__":
    initDB()  # use default file
    print("Members:\n", get_members())
    print("Items:\n", get_items())
In this module you create two global variables for the database connection and the cursor. The initDB() function initializes these variables, and the same cursor is used in each of the query functions within the module. The closeDB() function then closes these objects when you are finished using the module. This approach means that the initDB()/closeDB() functions must be called to initialize and finalize the database, which adds a little bit of complexity for the user, but it means that the individual queries are much simpler because you do not need to manage the database and cursor objects each time a query is called. Notice, too, that you set the Foreign_Keys pragma in the initDB() function to ensure that all the transactions are checked for referential integrity.
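The pattern is easier to see stripped of the CRUD functions. This sketch distills it to a shared connection and cursor, an initDB() that sets the pragma, and a closeDB() that commits and tidies up; it uses an in-memory database and a throwaway table so it can run standalone.

```python
import sqlite3 as sql

db = None
cursor = None

def initDB(filename=':memory:'):
    """Open the shared connection and cursor used by every query function."""
    global db, cursor
    db = sql.connect(filename)
    cursor = db.cursor()
    cursor.execute('PRAGMA Foreign_Keys=True')  # referential integrity on

def closeDB():
    """Close the cursor, commit any pending changes, and close the connection."""
    cursor.close()
    db.commit()
    db.close()

# Query functions can now use the module-level cursor directly.
initDB()
cursor.execute('create table t (n Integer)')
cursor.execute('insert into t values (1)')
result = cursor.execute('select n from t').fetchone()[0]
closeDB()
print(result)   # 1
```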
The query functions all follow the same pattern. A SQL query string is created, using triple quotes to allow multiple line layouts, and then that string is passed to the cursor.execute() method. The retrieved values are passed back where appropriate. The cursor.execute() parameter substitution mechanism is used throughout.
The two update methods have an extra twist in that they have defaulted input parameters. This means that the user can provide only those fields that are changing. The other fields are populated based on the existing values that are retrieved by using the appropriate get-details function.
The two get-name functions are provided for the convenience of the user to easily map from the database identifiers returned in the get-details queries to meaningful names. Typically, the application programmer should use these functions before displaying the results to the end user.
The final section of the module, inside the if test, is a very basic test function just to check that the connection and cursor objects are working as expected. It does not comprehensively test all of the API functions; you will do that in the Try It Out that follows.
You have now built a high-level API for application programmers to use and have seen that by making the underlying cursor object accessible you enable the user to issue arbitrary low-level SQL commands, too. This kind of flexibility is usually appreciated by application programmers, although if they find that they must use SQL extensively, they should request an addition to the functional API.
Of course SQLite is not the only tool available for working with large data volumes, and Python can provide support for these alternatives, too. In the next section, you look at some alternatives to SQLite and how Python works in these environments.
You have many options for managing large amounts of data. In this section you look at traditional client-server databases, newer databases known as “NoSQL” databases, and how cloud storage is changing approaches to data management. Finally, you consider some powerful data analysis tools that can be accessed from Python using third-party modules.
The traditional SQL database is rather different from SQLite in the way it is constructed. In these databases you have a database server process accessed over a network by multiple clients. The Python DBAPI is designed so that you can work with these databases just as easily as you did with SQLite. Only minor changes in the way you initialize the database connection are usually all that is necessary. Occasionally you’ll see minor changes to the SQL syntax, and the parameter substitution symbol is sometimes different, too. But, in general, swapping from one SQL interface to another is a relatively painless experience.
Several reasons exist for wanting to adopt a client-server database or to migrate to one from SQLite. The first is capacity; most client-server databases can scale up to much larger volumes of data than SQLite. In addition, they can be configured to distribute their processing over several servers and disks, which greatly improves performance when multiple users are accessing the data concurrently. The second reason for migrating from SQLite is that larger databases usually come with a much richer set of SQL commands and query types as well as more data types. Many even support object-oriented techniques and have their own built-in programming languages that enable you to write stored procedures, effectively putting the API into the database itself. Usually, a much richer set of database constraints can be used to tighten up data integrity far beyond the simple foreign key checks that SQLite performs.
The biggest downside of selecting a client-server database is the extra complexity of administering the database. Usually a dedicated administrator is needed to set up users, adjust their access privileges, tune the SQL query engine performance, and do backups, data extracts, and loads.
Popular client-server databases include commercial offerings such as SQL Server, Oracle, and DB2, as well as open source projects such as MySQL, PostgreSQL, and Firebird.
As data storage needs have expanded both in size and in variety, several projects have been exploring alternatives to SQL. Many of these projects are associated with what is called “Big Data,” which usually relates to the harvesting of large volumes of, often unstructured, data from places such as social media sites or from sensors in factories or hospitals and so on. One of the features of this kind of data is that most of it is of little significance to the database user, but amongst the detritus are gems to be had that can influence sales strategy or alert you to imminent failure of components or processes so that action can be taken in advance. The nature and volume of such data, and the need for rapid access to it, mean that traditional SQL databases are not well suited to the task.
The solution has been a variety of technologies that go under the collective term NoSQL. NoSQL does not mean that no SQL is used, but rather it stands for Not Only SQL. SQL may still be available for traditional queries on these new databases, but alternative query techniques are also used, especially for searching unstructured data. Some examples of NoSQL approaches (with typical implementations) are Document (MongoDB), Key-Value (Dynamo), Columnar (HBase), and Graphs (Allegro). All of these diverge to some degree or other from the traditional relational model of multiple, two-dimensional tables with cross-references. To get the best from these systems, you need to consider which architecture best suits the nature of your data. Many solutions are open source, but commercial offerings exist, too.
Although NoSQL databases do provide the potential for faster and more flexible access to a wider variety of data, they generally sacrifice qualities like data integrity, transactional control, and usability. The products are all evolving rapidly; although at the time of writing they require lower-level programming skills to utilize the data than traditional SQL databases do, that could change quite quickly. Most of the popular databases offer Python modules that facilitate programming them from Python.
Cloud computing has become popular in recent years with its promise of computing on-demand. This potentially offers lower costs, more flexibility, and lower risks than traditional data center–based solutions. It brings its own set of concerns, of course, especially for sensitive data or where network reliability is an issue. The biggest use of cloud technologies in the database world has been in combination with the NoSQL solutions, discussed in the previous section, and other Big Data solutions such as Hadoop. Many cloud storage providers offer these technologies on a software-as-a-service (SaaS) basis. This offers an attractive option for those just dipping a toe into the Big Data or NoSQL world.
The advantage of cloud computing for the application programmer is that the data is abstracted into a virtual database located at a single consistent network address. The physical location of the data may change over time, or the amount of storage available may grow or shrink, but to the programmer it doesn’t matter. (This does not mean that normal error handling can be ignored, but it does mean that the code can be largely decoupled from the physical design of the data storage.)
One of the biggest providers of cloud storage and database technology is the online retailer Amazon. It provides storage and an API (Amazon Web Services, or AWS) as well as a proprietary NoSQL database called SimpleDB, in addition to several other open source offerings. Other cloud suppliers are starting to offer similar products. Many providers offer small amounts of free storage and access as an incentive to try the product before committing to a significant investment.
Amazon AWS has Python support available, including a comprehensive tutorial to both Python and the AWS interface, at http://boto.readthedocs.org/en/latest/. Similar interfaces are available, or will likely soon become available, from the other providers.
Though client-server SQL, NoSQL, and cloud computing all provide solutions for handling large data volumes or many users, you have other data management issues to consider. Often, the processing of large volumes of data is more important than the storage or retrieval. If that processing involves a high degree of statistical exploration or manipulation, Python offers a basic statistics module (introduced to the standard library in version 3.4). If that is not enough, there is the R programming language. R is a specialized language designed for statistical number crunching on a large scale. Like Python it has accumulated a large library of add-in modules, and many statistical researchers use R as their platform of choice, publishing their research using R.
The good news for Python programmers is that there is an interface from Python to R called rpy2 that opens up this processing power without having to become an expert in R. Knowing the basics of R, especially its data handling concepts, is pretty much essential, but much of your Python knowledge can be applied, too. You can find rpy2 on the Python Package Index and install it via pip.
Data management has many facets, and this chapter has reviewed how Python supports you, from small volumes to large, from simple data persistence through to complex data analysis. Once you have control of your data, you are in a much stronger position to create powerful, user-friendly applications whether on the desktop or on the web. That’s the subject of the next two chapters.
In this chapter you learned how to store and retrieve data so that your application can pick up from where it left off or so that you can work on multiple different projects. You saw how you could do this with flat files using indexed files (DBM), pickles, and shelves. You then looked at SQL and relational databases using SQLite and concluded with a review of some alternative technologies.
Flat files are good for storing small amounts of data or for saving the program’s state information when it shuts down. They are less useful where multiple queries have to be made, especially if querying on more than one key value. DBM files act like persistent dictionaries that can only store strings. Pickles turn binary objects into strings. Shelves combine DBM and pickle to act like persistent dictionaries, albeit limited to using strings for the keys.
SQL is used to manage data in a relational database management system (RDBMS). Relational databases are conceptually made up of one or more two-dimensional tables, each representing one logical entity. Cross-references between tables act as a model for the relationships between entities. SQL provides commands to build and manage the tables in a database as well as operations to create, delete, query, and update the data within those tables.
Python provides the DBAPI, which is a standard protocol for accessing SQL-based databases. The DBAPI primarily consists of two objects: a connection and a cursor. The cursor is the primary object used for executing SQL commands and retrieving their results. Results are returned as a list of tuples.
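The whole protocol can be distilled into a few lines: connect, obtain a cursor, execute, and collect the results as a list of tuples. The table and data here are invented purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')   # the connection...
cur = conn.cursor()                  # ...provides a cursor
cur.execute('create table lang (name Text, year Integer)')
cur.execute('insert into lang values (?, ?)', ('Python', 1991))

# Results come back as a list of tuples, one tuple per row.
rows = cur.execute('select name, year from lang').fetchall()
print(rows)   # [('Python', 1991)]

cur.close()
conn.commit()
conn.close()
```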
The SQLite database interface is provided in Python’s standard library, and you can download a standalone interpreter. SQLite, as the name suggests, is a lightweight database that stores the entire database in a single file and supports a slightly reduced subset of the SQL language. SQLite can be used on small- to medium-sized projects, and it is then relatively easy to migrate to a larger database if the project expands in scale. DBAPI libraries are available for most major databases and can be downloaded from third parties.
By using the DBAPI it is possible to build a data abstraction layer that hides both the details of the database design and the SQL language from the application programmer, further facilitating the migration process if it becomes necessary.
Several other database technologies exist, and this is an active area of development with several NoSQL projects vying for popular support. These databases tend to have quite specific areas of applicability, and no single solution suits all scenarios. Many are focused on the challenges of managing “Big Data” and are well suited to cloud-based solutions.
To appreciate the work that pickle does for you, try building a simple serialization function for numbers, called ser_num(). It should accept any valid integer or float number as an argument and convert it into a byte string. You should also write a function to perform the reverse operation to read a byte string produced by your ser_num() function and convert it back to a number of the appropriate type. (Hint: You may find the struct module useful for this exercise.)
Write a version of the employee database example using shelve instead of SQLite. Populate the shelf with the sample data and write a function that lists the names of all employees earning more than a specified amount.
Extend the lendydata.py module to provide CRUD functions for the loan table. Add an extra function, get_active_loans(), to list those loans that are still active. (Hint: That means the DateReturned field is NULL.)
Explore the Python statistics module to see what it offers (only available in Python 3.4 or later).
KEY CONCEPT | DESCRIPTION |
Persistence | The ability to store data between program executions such that when a program is restarted it can restore its data to the state it was in when the program was last stopped. |
Flat file | A standard file containing data. The data may be in text form or binary form. |
DBM | A form of file storage that uses a combination of a flat file to store the data and index files to store the location of individual data records for rapid retrieval. Both keys and data must be in string format. In Python a dictionary-like interface is provided by the dbm module. |
Serialization | The process of converting binary data into a string of bytes for storage or transmission over a network. |
Pickle | A Python-specific format for serializing binary data. Most Python objects can be serialized using Pickle. Because it poses some security risks, care must be exercised when reading data from untrusted sources. |
Shelve | A combination of DBM and Pickle technologies to provide a persistent dictionary. Arbitrary binary data can be stored against a string-based key. |
Relational Database | A database comprising one or more tables. Table rows represent records, and columns represent the fields of the record. Field values can refer to other records within the database, thus representing relationships between entities. |
Relationship | Each record in a relational database has a unique “primary key,” and other records can store references to the primary key of another record, thus establishing a relationship between the two records and their two tables. Relationships can span tables. |
Constraint | Various rules can be defined to ensure data integrity is maintained within the database. These rules are known as constraints and regulate such things as data types, whether a value is required or optional, and whether a cross-reference must contain a valid key from another entity. |
Cardinality | Relationships in a database can represent various types of mapping. The numbers of each entity in the mapping are called its cardinality. For example, if one entity refers to exactly one other entity, that is a 1-1 mapping. If several entities refer to one other entity, that is a 1-N mapping. If many entities refer to many other entities, that is a many-to-many mapping. |
Structured Query Language (SQL) | A standardized mechanism for working with relational databases. The language contains commands for defining the database structure, known as the data definition language (DDL), and commands for manipulating and retrieving the data within the database, known as data manipulation language (DML). |
DBAPI | A standard programming protocol for accessing relational databases from Python. The DBAPI implementations vary slightly in details, but porting code between DBAPI libraries is much easier than porting at the native database library level. |
NoSQL | Not Only SQL is a term used to describe several database technologies that do not conform to the traditional relational model. Many of these technologies are focused on managing very large volumes of data with a variety of data types, much of it unstructured, such as documents and social media data. SQL is not well suited to handling such unstructured data, and hence NoSQL technology has become increasingly important. |