Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

3
Managing Data

WHAT YOU WILL LEARN IN THIS CHAPTER:

What data persistence means
How to store data in files
How to store data in a database
How databases search, sort, and access data
Other options for data storage

WROX.COM DOWNLOADS FOR THIS CHAPTER

For this chapter the wrox.com code downloads are found at www.wrox.com/go/pythonprojects on the Download Code tab. The code is in the Chapter 3 download, called Chapter3.zip, and individually named according to the names throughout the chapter.

In many scenarios you need to store data between executions of your program. The data you want to store could be local status information, such as the current location in an e-book reader or the current working filename, or it could be administrative data such as usernames and passwords or server addresses. Often it will be large volumes of business-oriented data such as customer orders, inventory, or address information. Many business applications consist of little more than a mechanism to create, view, and edit stored data.

This capability to store data in such a way that it is available for use on subsequent invocations of your program is known as data persistence because the data persists beyond the lifetime of the process that created it. To implement data persistence you need to store the data somewhere, either in a file or in a database.

This chapter is a bit like a history of computing storage technologies. That’s because the need to store data has grown—and continues to grow—ever more complex with the passage of time. You now have a broad range of technologies available covering every storage need, from a few simple configuration settings to sophisticated distributed data sources representing thousands of logical entities. In this chapter learn about the different options available for storing data and the advantages and disadvantages of each. Along the way, you see how Python modules assist in this fundamental programming task.

Storing Data Using Python

The simplest storage is a plaintext file. You have already seen in Chapter 2 how to use a text file to store data in various formats, such as CSV and XML, as well as how to store unformatted text. These formats are fine if you need to store the data only when the program closes and read it back when the program is started again. This situation makes these formats very suitable for configuration data or application status information. These flat-file formats are less useful when you need to handle large volumes of data non-sequentially or search for specific records or fields. For that, you need a database.

A database is just a data storage system that enables you to create, read, update, and delete individual records—this set of four fundamental data management functions is often referred to as a CRUD interface. Database records consist of one or more key fields that uniquely identify the record, plus any other fields needed to represent the attributes of the entity that the record represents.

A Python dictionary can be used as a type of non-persistent database in that you can use the dictionary key to create, read, update, or delete a value associated with a given dictionary key; that could be a tuple of fields, or a record. All that’s missing is the ability to store the data between sessions. The concept of a dictionary as a database has been exploited over the years, and various forms of persistent dictionaries exist. The oldest are the database management (DBM) family of files.

Using DBM as a Persistent Dictionary

DBM files originated in UNIX but have been developed over the years for other platforms as well. Python supports several variations. These variations are hidden by the dbm module that automatically determines the best solution based on which libraries are supported by the OS installation at hand. If no native DBM library can be found, a basic, pure Python version is used.

The DBM system is a simplified version of a dictionary in that both the keys and values must be strings, which means that some data conversion and string formatting is necessary if you are using non-string data. The advantages of a DBM database are that it is simple to use, fast, and fairly compact.

You can see how DBM works by revisiting the tool-hire example from Chapter 2. When you last looked at it you were working from a spreadsheet as the master data source. Suppose you decided to migrate the solution to a pure Python application? You would need a storage mechanism for the various data elements.

Recall that the spreadsheet had two sheets, one representing the tools for hire and the other the actual loans by the members. The record formats are shown in Table 3.1.

Table 3.1 Tool-Hire Data Entities

TOOL	LOAN
ItemID	ItemID
Name	Name
Description	Description
Owner	Owner
Price	Borrower
Condition	Date Borrowed
Date Registered	Date Returned

That design is fine for a human working with a spreadsheet, but if you want to convert it into a full-blown data application you need to overcome a number of issues with it:

First, there is a lot of duplication between the two entities. The Name, Description, and Owner fields are all duplicated, and therefore need to be changed in two places whenever they are edited.
Both entities use the ItemID as a key, which suggests the ItemID represents both a Tool and a Loan which is confusing.
Several fields store names of subscribers to the service, but it would be better to have a separate entity to describe those members and reference that member entity from the other entities.
Finally, although this started out as a tool-hire application, there is no reason to limit it to tools. The members could just as well borrow books or DVDs or anything else. So rather than restrict it to tools, you can rename the Tool entity as Item. And in keeping with that, you can rename the application to reflect its more generic approach. Call it LendyDB.

With very little effort, you can rearrange things to overcome the issues with the spreadsheet. Table 3.2 shows the resulting database design.

Table 3.2 LendyDB Data Design

ITEM	MEMBER	LOAN
ItemID	MemberID	LoanID
Name	Name	ItemID
Description	Email	BorrowerID
OwnerID		Date Borrowed
Price		Date Returned
Condition
Date Registered

You now have three entities, so you need to store the data in three data files. You can use the DBM format for this because each entity now has a unique identifier field, which, in string format, works well as a DBM key. You need to populate these files with data, and that means reformatting the data from the spreadsheet. You could write a Python program to do that but, because the sample data set is small, it’s easier to just cut and paste the data into the new format. (Or you can extract the files from the LendyDB folder of the Chapter3.zip file from the download site.) Once you have the data you can save it into DBM files quite easily, as shown in the following Try It Out.

TRY IT OUT Creating a LendyDB DBM Database (create-lendyDB.py)

In this Try It Out you translate the data from the tool-hire spreadsheet into the LendyDB data format and save it as three sets of DBM files. You then prove that it worked by reading the files and printing their contents. To do so, follow these steps:

Create a project folder and name it LendyDB.

Start your favorite editor or IDE and type the following code (or load the file create-lendyDB.py from the book’s website):

  import dbm

  # ID, Name, Description, OwnerID, Price, Condition, DateRegistered
  items = [
  ['1','Lawnmower','Tool','1','$150','Excellent','2012-01-05'],
  ['2','Lawnmower','Tool','2','$370','Fair','2012-04-01'],
  ['3','Bike','Vehicle','3','$200','Good','2013-03-22'],
  ['4','Drill','Tool','4','$100','Good','2013-10-28'],
  ['5','Scarifier','Tool','5','$200','Average','2013-09-14'],
  ['6','Sprinkler','Tool','1','$80','Good','2014-01-06']
  ]

  # ID, Name, Email
  members = [
  ['1', 'Fred', '[email protected]'],
  ['2', 'Mike', '[email protected]'],
  ['3', 'Joe', '[email protected]'],
  ['4', 'Rob', '[email protected]'],
  ['5', 'Anne', '[email protected]'],
  ]

  # ID, ItemID, BorrowerID, DateBorrowed, DateReturned
  loans = [
  ['1','1','3','4/1/2012','4/26/2012'],
  ['2','2','5','9/5/2012','1/5/2013'],
  ['3','3','4','7/3/2013','7/22/2013'],
  ['4','4','1','11/19/2013','11/29/2013'],
  ['5','5','2','12/5/2013','None']
  ]

  def createDB(data, dbName):
 	try:
 		 db = dbm.open(dbName, 'c')
 		 for datum in data:
 			 db[datum[0]] = ','.join(datum)
 	finally:
 		 db.close()
 		 print(dbName, 'created')
  def readDB(dbName):
 	try:
  		 db = dbm.open(dbName, 'r')
  		 print('Reading ', dbName)
  		 return [db[datum] for datum in db]
 	finally:
  		 db.close()
  def main():
  	print('Creating data files...')
  	createDB(items, 'itemdb')
  	createDB(members, 'memberdb')
  	createDB(loans, 'loandb')

  	print('reading data files...')
  	print(readDB('itemdb'))
  	print(readDB('memberdb'))
  	print(readDB('loandb'))
  if __name__ == "__main__": main()

Save the file as create-lendyDB.py and run it. Verify that your output looks like this:

  Creating data files...
  itemdb created
  memberdb created
  loandb created
  reading data files...
  Reading itemdb
 	[b'2,Lawnmower,Tool,2,$370,Fair,2012-04-01', b'3,Bike,Vehicle,3,$200,
  Good,2013-03-22', b'1,Lawnmower,Tool,1,$150,Excellent,2012-01-05',
 	b'6,Sprinkler,Tool,1,$80,Good,2014-01-06', b'4,Drill,Tool,4,$100,
  Good,2013-10-28', b'5,Scarifier,Tool,5,$200,Average,2013-09-14']
  Reading memberdb
 	[b'2,Mike,[email protected]', b'3,Joe,[email protected]', b'1,Fred,
  [email protected]', b'4,Rob,[email protected]', b'5,Anne,[email protected]']
  Reading loandb
  [b'2,2,5,9/5/2012,1/5/2013', b'3,3,4,7/3/2013,7/22/2013', b'1,1,3,4/1/2012,
  4/26/2012', b'4,4,1,11/19/2013,11/29/2013', b'5,5,2,12/5/2013,None']

Inspect the contents of your LendyDB folder. You should see three sets of three files: one set per entity, with extensions of .bak, .dat, and .dir.

How It Works

You started off by importing the dbm module. (The module internally analyzes your system to determine which DBM library is available and initializes it for use.) You then created the raw data items by extracting the values from the Excel spreadsheet data. Note that you changed the item Description field so that it now records what kind of item you have (tool, book, DVD, and so on). You could have created an extra field instead and that would have been equally valid, but for this exercise you chose to reuse the existing field name.

You then defined the createDB() function, which opens the DBM database file in c, for create mode. (The c mode creates a new file if one does not exist or opens the existing file if it has already been created.) You then used a for loop to read each data item and store it in the database using the first field as the key and joining all the fields as a comma-separated string for the value.

You used a try/finally construct to ensure all data was written to the file and it was closed properly.

The readDB() function is the converse operation to createDB(). It opens the file using r, for read mode, and then returns the contents as a list using a list comprehension. If you expected the database to be very large you could have made this function into a generator instead and yielded each line in turn. Because you don’t expect the lending library to contain vast numbers of items or members, returning a list is fine.

Finally, the main() function calls the createDB() function once for each data entity. Note that you do not provide any file extension; dbm does that itself. main() then checks that the data has been created correctly by printing the output from readDB() for each database. The databases created by dbm consist of three files. One file contains the actual data; the other two files hold the index information that dbm uses to find the records in the data file. It is this indexing mechanism that makes dbm so much faster than simply searching sequentially through a plaintext file. You should not try to edit the dbm files directly because this could corrupt the database.

Having created your database, you can now use it to read or edit the contents. This is best demonstrated from an interactive session at the Python prompt, so fire up the Python interpreter from the folder where you saved the data files and type the following:

  >>> import dbm
  >>> items = dbm.open('itemdb')
  >>> members = dbm.open('memberdb')
  >>> loans = dbm.open('loandb','w')
  >>> loan2 = loans['2'].decode()
  >>> loan2
  '2,2,5,9/5/2012,1/5/2013'
  >>> loan2 = loan2.split(',')
  >>> loan2
  ['2', '2', '5', '9/5/2012', '1/5/2013']
  >>> item2 = items[loan2[1]].decode().split(',')
  >>> item2
  ['2', 'Lawnmower', 'Tool', '2', '$370', 'Fair', '2012-04-01']
  >>> member2 = members[loan2[2]].decode().split(',')
  >>> member2
  ['5', 'Anne', '[email protected]']
  >>> print('{} borrowed a {} on {}'.format(
  ... member2[1],item2[1],loan2[3]))
  Anne borrowed a Lawnmower on 9/5/2012

With the preceding commands, you opened the three databases, extracted loan number 2 (using decode() to convert from the dbm bytes format into a normal Python string), and split it into its separate fields. You then extracted the corresponding member and item records by using the loan record values as keys. Finally, you printed a message reporting the data in human-readable form.

Of course, you can create new records in the data set, too. Here is how you create a new loan record:

  >>> max(loans.keys()).decode()
  '5'
  >>> key = int(max(loans.keys()).decode()) + 1
  >>> newloan = [str(key),'2','1','4/5/2014']
  >>> loans[str(key)] = ','.join(newloan)
  >>> loans[str(key)]
  b'6,2,1,4/5/2014'

With the preceding code, you used the built-in max() function to find the highest existing key value in the loans database. You then created a new key by converting that maximum value to an integer and adding one. Next, you used the string version of the new key value to create a new loan record. You then wrote that record out to the database using the new key field. Finally, you checked that the new record existed by using the new key value to read the record back.

You can see that DBM files can be used as a database, even with multiple entities. However, if the data is not naturally string-based, or has many fields, extracting the fields and converting to the appropriate format becomes tedious. You can write helper functions or methods to do that conversion for you, but there is an easier way. Python has a module that can store arbitrary Python objects to files and read them back without you having to do any data conversion. It’s time to meet the pickle module.

Using Pickle to Store and Retrieve Objects

The pickle module is designed to convert Python objects into sequences of binary bytes. The object types converted include the basic data types, such as integers and boolean values, as well as system- and user-defined classes and even collections such as lists, tuples, and functions (except those defined using lambda). A few restrictions exist on objects that can be pickled; these are described in the module documentation.

pickle is not of itself a data management solution; it merely converts objects into binary sequences. These sequences can be stored in binary files and read back again so they can be used as a form of data persistence. But pickle does not provide any means to search the stored objects or retrieve a single object from among many stored items. You must read the entire stored inventory back into memory and access the objects that way. pickle is ideal when you just want to save the state of a program so that you can start it up and continue from the same position as before (for example, if you were playing a game).

The pickle module provides several functions and classes, but you normally only use the dump() and load() functions. The dump() function dumps an object (or objects) to a file and the load() function reads an object from a file (usually an object previously written with dump).

To see how this works, you can use the interactive prompt and experiment with the Item data definition from LendyDB in the previous section. You start off by creating a single item and this time, instead of using a single string for all the values, you use a tuple, like this:

  >>> import pickle
  >>> anItem = ['1','Lawnmower','Tool','1','$150','Excellent','2012-01-05']
  >>> with open('item.pickle','wb') as pf:
  ...	pickle.dump(anItem,pf)
  ...
  >>> with open('item.pickle','rb') as pf:
  ...	itemCopy = pickle.load(pf)
  ...
  >>> print(itemCopy)
  ['1', 'Lawnmower', 'Tool', '1', '$150', 'Excellent', '2012-01-05']
  >>>

Notice that you have to use binary file modes for the pickle file. Most importantly, notice that you got a list back from the file, not just a string. Of course, these elements are all strings, so just for fun try pickling some different data types:

  >>> funData = ('a string', True, 42, 3.14159, ['embedded', 'list'])
  >>> with open('data.pickle','wb') as pf:
  ...	pickle.dump(funData, pf)
  ...
  >>> with open('data.pickle','rb') as pf:
  ...	copyData = pickle.load(pf)
  ...
  >>> print (copyData)
  ('a string', True, 42, 3.14159, ['embedded', 'list'])
  >>>

That all worked as expected, and you got back the same data that you put in. The only other thing you need to know about pickle is that it is not secure. It potentially executes objects that get unpickled, so you should never use pickle to read data received from untrusted sources. But for local object persistence in a controlled environment, it does a great job very simply. If you are using pickle in your own projects you should be aware that you can get some pickle-specific exceptions raised so you might want to wrap your code inside a try/except construct.

For your LendyDB project, the big problem with pickle is that you can only access the data by reading the whole file into memory. Wouldn’t it be great if you could have an indexed file of arbitrary Python objects by combining the features of pickle and dbm? It turns out that you can, and the work has been done for you in the shelve module.

Accessing Objects with shelve

The shelve module combines the dbm module’s ability to provide random access to files with pickle’s ability to serialize Python objects. It is not a perfect solution in that the key field must still be a string and the security issue with pickle also applies to shelve, so you must ensure your data sources are safe. Also, like dbm files the module cannot tell if you modify data read into memory, so you must explicitly write any changes back to the file by reassigning to the same key. Finally, dbm files impose some limits around the size of objects they can store and are not designed for concurrent access from, for example, multiple threads or users. However, for many projects, shelve provides a simple, lightweight, and fairly fast solution for storing and accessing data.

So as far as you are concerned, shelve acts just like a dictionary. Almost everything you do with a dictionary you can also do with shelve instances. The only difference is that the data remains on the disk rather than being in memory. This has obvious speed implications, but on the other hand, it means you can work with very large dictionaries even when memory is limited.

Before you build LendyDB with shelve, you’ll experiment with some dummy data that includes a bigger selection of data types, including a user-defined class. The first thing you do is create the shelve database file (or, as they are sometimes known, a shelf):

  >>> shelf = shelve.open('fundata.shelve','c')

The open function takes the same arguments as the dbm version discussed earlier. Because you are creating a new shelf, you use mode c for create. Now you can start adding items to the shelf:

  >>> shelf['tuple'] = (1,2,'a','b',True,False)
  >>> shelf['lists'] = [[1,2,3],[True,False],[3.14159, -66]]

With these commands, you saved two items, each of which contains a mix of Python data types, and shelve happily stored them without any data conversion required by you. You can check that shelve saved the items by reading the values back:

  >>> shelf['tuple']
  (1, 2, 'a', 'b', True, False)
  >>> shelf['lists']
  [[1, 2, 3], [True, False], [3.14159, -66]]

To make the data changes permanent you need to call close (normally you would use a try/finally construct or, unlike dbm, you can use a context manager style):

  >>> shelf.close()
  >>> shelf['tuple']
  Traceback (most recent call last):
 	File "C:Python33libshelve.py", line 111, in __getitem__
 		value = self.cache[key]
  KeyError: 'tuple'

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:Python33libshelve.py", line 113, in __getitem__
 	f = BytesIO(self.dict[key.encode(self.keyencoding)])
  File "C:Python33libshelve.py", line 70, in closed
 	raise ValueError('invalid operation on closed shelf')
  ValueError: invalid operation on closed shelf
  >>>

You can see that after closing the shelf you can no longer access the data; you need to reopen the shelf.

Now you can try something slightly more complex. First, you define a class, create some instances, and then store them to a shelf:

  >>> class Test:
  ...	def __init__(self,x,y):
  ...		self.x = x
  ...		self.y = y
  ... def show(self):
  ...	print(self.x, self.y)
  ...
  >>> shelf = shelve.open('test.shelve','c')
  >>> a = Test(1,2)
  >>> a.show()
  1 2
  >>> b = Test('a','b')
  >>> b.show()
  a b
  >>> shelf['12'] = a
  >>> shelf['ab'] = b

So far, so good. You have saved two instances of the class. Getting them back is just as easy:

  >>> shelf['12']
  <__main__.Test object at 0x01BD1570>
  >>> shelf['ab']
  <__main__.Test object at 0x01BD1650>
  >>> c = shelf['12']
  >>> c.show()
  1 2
  >>> d = shelf['ab']
  >>> d.show()
  a b
  >>> shelf.close()

Notice that the object returned was reported as a __main__.Test object. That raises one very important caveat about saving and restoring user-defined classes. You must make sure that the very same class definition used by shelf for the save is also available to the module that reads the class back from the shelf, and the class definitions must be the same. If the class definition changes between writing the data and reading it back, the result is unpredictable. The usual way to make the class visible is to put it into its own module. That module can then be imported, and used in the code that writes, as well as the code that reads, the shelf.

It’s time to revisit your lending library, LendyDB. This time you replicate what you did with the dbm database, but use the shelve module instead.

TRY IT OUT Using shelve to Store LendyDB (shelve-lendyDB.py)

In this Try It Out, you replicate the functionality of the dbm example but use shelve instead. The code is simpler as a result. To do this, complete the following steps:

Change into your LendyDB project folder.

Open your favorite editor or IDE and type in the following code (or load shelve-lendyDB.py from the LendyDB folder of the downloaded files):

  import shelve
  # ID, Name, Description, OwnerID, Price, Condition, DateRegistered
  items = [
  ['1','Lawnmower','Tool','1','$150','Excellent','2012-01-05'],
  ['2','Lawnmower','Tool','2','$370','Fair','2012-04-01'],
  ['3','Bike','Vehicle','3','$200','Good','2013-03-22'],
  ['4','Drill','Tool','4','$100','Good','2013-10-28'],
  ['5','Scarifier','Tool','5','$200','Average','2013-09-14'],
  ['6','Sprinkler','Tool','1','$80','Good','2014-01-06']
  ]

  # ID, Name, Email
  members = [
  ['1', 'Fred', '[email protected]'],
  ['2', 'Mike', '[email protected]'],
  ['3', 'Joe', '[email protected]'],
  ['4', 'Rob', '[email protected]'],
  ['5', 'Anne', '[email protected]'],
  ]

  # ID, ItemID, BorrowerID, DateBorrowed, DateReturned
  loans = [
  ['1','1','3','4/1/2012','4/26/2012'],
  ['2','2','5','9/5/2012','1/5/2013'],
  ['3','3','4','7/3/2013','7/22/2013'],
  ['4','4','1','11/19/2013','11/29/2013'],
  ['5','5','2','12/5/2013','None']
  ]

  def createDB(data, shelfname):
 	try:
  		 shelf = shelve.open(shelfname,'c')
  		 for datum in data:
  		 shelf[datum[0]] = datum
 	finally:
  		 shelf.close()
  def readDB(shelfname):
 	try:
  		 shelf = shelve.open(shelfname,'r')
  		 return [shelf[key] for key in shelf]
 	finally:
  		 shelf.close()
  def main():
  	print('Creating data files...')
  	createDB(items, 'itemshelf')
  	createDB(members, 'membershelf')
  	createDB(loans, 'loanshelf')
  	print('reading items...')
  	print(readDB('itemshelf'))
  	print('reading members...')
  	print(readDB('membershelf'))
  	print('reading loans...')
  	print(readDB('loanshelf'))
  if __name__ == "__main__": main()

Save the file as shelve-lendyDB.py and run it.

Check that your output matches the following output:

  Creating data files...
  reading items...
  [['1', 'Lawnmower', 'Tool', '1', '$150', 'Excellent', '2012-01-05'], ['3', 'Bike
  ', 'Vehicle', '3', '$200', 'Good', '2013-03-22'], ['2', 'Lawnmower', 'Tool', '2'
  , '$370', 'Fair', '2012-04-01'], ['5', 'Scarifier', 'Tool', '5', '$200', 'Averag
  e', '2013-09-14'], ['4', 'Drill', 'Tool', '4', '$100', 'Good', '2013-10-28'], ['
  6', 'Sprinkler', 'Tool', '1', '$80', 'Good', '2014-01-06']]
  reading members...
  [['1', 'Fred', '[email protected]'], ['3', 'Joe', '[email protected]'], ['2', 'Mi
  ke', '[email protected]'], ['5', 'Anne', '[email protected]'], ['4', 'Rob', 'rjb@som
  corp.com']]
  reading loans...
  [['1', '1', '3', '4/1/2012', '4/26/2012'], ['3', '3', '4', '7/3/2013', '7/22/201
  3'], ['2', '2', '5', '9/5/2012', '1/5/2013'], ['5', '5', '2', '12/5/2013', 'None
  '], ['4', '4', '1', '11/19/2013', '11/29/2013']]

Start the Python interpreter and experiment with the data by typing the following:

  >>> import shelve
  <module 'shelve' from 'C:\Python33\lib\shelve.py'>
  >>> items = shelve.open('itemshelf','w')
  >>> members = shelve.open('membershelf','w')
  >>> loans = shelve.open('loanshelf','w')
  >>> loan2 = loans['2']
  >>> loan2
  ['2', '2', '5', '9/5/2012', '1/5/2013']
  >>> item2 = items[loan2[1]]
  >>> item2
  ['2', 'Lawnmower', 'Tool', '2', '$370', 'Fair', '2012-04-01']
  >>> member2 = members[loan2[2]]
  >>> print('{} borrowed a {} on {}'.format(
  ... member2[1],item2[1],loan2[3]))
  Anne borrowed a Lawnmower on 9/5/2012
  >>>

Add a new loan record by typing the following:

  >>> key = int(max(loans.keys())) + 1
  >>> newloan = [str(key), '2','1','4/5/2014']
  >>> loans[str(key)] = newloan
  >>> loans[str(key)]
  ['6', '2', '1', '4/5/2014']
  >>> loans.close() # make the change permanent

How It Works

The file is very similar to the one using dbm. You start off by importing shelve instead of dbm. The three sets of data definitions that follow are identical to the earlier example. You then define the two functions: createDB() and readDB().

This is where the shelve version starts to simplify the code. For creation, the shelf is opened and the data is written to the shelf directly instead of having to use the string join() method. For reading, things are almost identical but you use a list comprehension to retrieve, store, and return the shelf content.

The main() function is also very similar to the dbm example except for a few tweaks to the printed messages.

At this stage the shelve solution doesn’t seem to have been a huge advantage. However, when you start to access the database and modify it, the situation begins to improve. You open the three shelves and repeat the exercises from the dbm section. But this time you do not need to split() the values to get a list and you do not need to mess with decode() to get from bytes to normal strings. This makes the code shorter and easier to read. (If the records had contained mixed types, the savings would have been even more obvious.)

Finally, you create a new loan record. Again, this does not require any decoding or joining of strings. When you close the shelf, you ensure the data is written to disk.

You’ve now seen the various options Python offers for storing objects and retrieving them. The shelve module, in particular, offers a persistence mechanism that is compact, fairly fast, and simple to use. If you have a solution that uses Python dictionaries in memory, switching to a shelve solution is almost a trivial task. However, this is still a long way short of what is needed for complex data handling. Operations like finding a set of records based on non-key values or sorting the data essentially still require a complete read of the data into memory. The only way to avoid that is to move to a full-blown database solution. However, before you look at that you should consider some aids that Python provides to make data analysis of in-memory data sets easier.

Analyzing Data with Python

Once you have a set of data, you usually want to ask questions about it. For example, in the lending library example, you might want to know the total cost of the items or even the average cost of an item. You might want to know who contributed the most items, which items are out on loan at any given time, and so on. You can do that using Python, and you could write functions using all the standard Python features that would answer those questions. However, Python has some powerful features that often get overlooked that are especially useful for analyzing data sets.

In this section you look at some of the built-in features you can use, especially the functional programming features of the language. Then you turn your attention to the itertools module, which offers more advanced features that often save time and computing resources when compared with the standard alternatives.

Analyzing Data Using Built-In Features of Python

When you analyze data, it is important to select the right data structure. For example, Python includes a set data type that automatically eliminates duplicates. If you care only about unique values, converting (or extracting) the data to a set can simplify the process considerably. Similarly, using Python dictionaries to provide keyword access rather than numeric indices often improves code readability, and thus reliability (you saw an example of that in Chapter 2 that compared the CSV dictionary-based reader with the standard tuple-based reader). If you are finding that your code is getting complicated, it’s often worthwhile to stop and consider whether a different data structure would help.

In addition to the wide variety of data structures, Python also offers many built-in and standard library functions that you can use, such as any, all, map, sorted, and slicing. (Slicing isn’t technically a function but an operation, however it does return a value in a similar way that a function would.) When you combine these functions with Python generator expressions and list comprehensions, you have a powerful toolkit for slicing and dicing your data.

You can apply these techniques to your LendyDB data to answer the questions raised in the opening paragraph of this “Analyzing Data with Python” section. You can try that now.

TRY IT OUT Analyzing LendyDB with Python (lendydata.py)

In this Try It Out, you use standard Python features to answer the questions about the LendyDB data raised earlier: 1) What is the total cost of all items? 2) What is the average cost of an item? 3) Who contributed the most items? 4) Which items are currently on loan? To do this, complete the following steps.

Create a module called lendydata.py containing the following code (or load it from the Analysis folder of the downloadable files):

  items = [
  ['ID','Name', 'Description', 'OwnerID', 'Price', 'Condition', 'Registered'],
  ['1','Lawnmower','Tool','1','$150','Excellent','2012-01-05'],
  ['2','Lawnmower','Tool','2','$370','Fair','2012-04-01'],
  ['3','Bike','Vehicle','3','$200','Good','2013-03-22'],
  ['4','Drill','Tool','4','$100','Good','2013-10-28'],
  ['5','Scarifier','Tool','5','$200','Average','2013-09-14'],
  ['6','Sprinkler','Tool','1','$80','Good','2014-01-06']
  ]

  members = [
  ['ID', 'Name', 'Email'],
  ['1', 'Fred', '[email protected]'],
  ['2', 'Mike', '[email protected]'],
  ['3', 'Joe', '[email protected]'],
  ['4', 'Rob', '[email protected]'],
  ['5', 'Anne', '[email protected]'],
  ]

  loans = [
  ['ID','ItemID','BorrowerID','DateBorrowed','DateReturned'],
  ['1','1','3','4/1/2012','4/26/2012'],
  ['2','2','5','9/5/2012','1/5/2013'],
  ['3','3','4','7/3/2013','7/22/2013'],
  ['4','4','1','11/19/2013','11/29/2013'],
  ['5','5','2','12/5/2013','None']
  ]

Start the Python interpreter and import the data using the following command:
```
  >>> from lendydata import *
```

To answer the question, “What is the total cost of all items?” type the following code:

  >>> def cost(item):
  ...	return int(item[4][1:])
  ...
  >>> cost(items[2])
  370
  >>> sum(cost(item) for item in items[1:])
  1100

To answer the question, “What is the average cost of an item?” type this:

  >>> sum(cost(item) for item in items[1:])/len(items)-1
  183.33333333334
  >>>

To answer the question, “Who contributed the most items?” type this:

  >>> def owner(item): return item[3]
  ...
  >>> for member in members[1:]:
  ...	count = 0
  ...	for item in items[1:]:
  ...		if owner(item) == member[0]:
  ...			count += 1
  ...	print(member[1],':',count)
  ...
  Fred : 2
  Mike : 1
  Joe : 1
  Rob : 1
  Anne : 1
  >>>

To answer the question, “Which items are currently on loan?” type this:

  >>> def onLoan(loan): return loan[-1] == 'None'
  ...
  >>> [items[int(loan[1])] for loan in loans if onLoan(loan)]
  [['5', 'Scarifier', 'Tool', '5', '$200', 'Average', '2013-09-14']]
  >>>

How It Works

You started by creating a Python module containing your sample data and importing that data into the interpreter. Note the handy trick of using the first entry (having index 0) in each data section to store a list of that section’s field descriptions.

This has two useful effects:

Every ID value of any given data section (of the three sections) now matches the zero-relative index of that same row in its section. For example, the ‘Mike’ row of the members data section, having an ID of ‘3’, can now be accessed as members[3].
You have access to the field names, both programmatically and as an aide-de-memoir in the interpreter, by accessing the first record.

The downside is that, in your processing code, you must remember to adjust by 1 the length, and indices, of the data sections, to account for the extra header-line record of each section.

You then used standard Python tools to answer several questions about the data using standard Python tools. For each question you defined a small helper function that typically just extracted a field from a data entry. For the first question it returned the cost as an integer value by extracting the string value and stripping the dollar sign from the front before converting to an integer. You then used Python’s built-in sum() function applied to a generator expression to calculate the total cost of the items. And you computed the average item cost by dividing that total cost by the number of items.

To find out who contributed which items, you defined a function owner() that simply extracted the ownerID field from an item record. You then looped over all the members checking how many items each member owned.

Finally, you determined which items were out on loan by creating a helper function, called onLoan(), that returned a boolean result depending on whether or not the DateReturned field was None. You then used this in a list comprehension with a filter condition using the onLoan() function.

In the preceding Try It Out you saw that you can use the built-in functions and data structures combined with loops and generators to answer most questions about data. The problem is that for volumes of data this technique requires storing large lists in memory and may involve looping over those lists many times. This can become very slow and resource-intensive. The Python itertools module provides several functions that can reduce the load significantly.

Analyzing Data with itertools

The itertools module of the standard Python library provides a set of tools that utilize functional programming principles to consume iterable objects and produce other iterables as results. This means that the functions can be combined to build sophisticated data filters.

Before looking at how itertools can be used on the LendyDB data, you should look at some of the functions provided using simpler data sets. These functions are powerful, but operate in slightly different ways than most of the functions you have dealt with in the past. In particular, they are all geared around processing iterators. You should recall that all the standard Python collections, as well as objects such as files, are iterators. You can also create your own custom iterators by defining some methods that adhere to the Python iterator protocol. The simplest iterators are strings, so that’s mainly what the documentation uses to demonstrate the itertools functions, but remember that these functions work with any kind of iterator, not just strings.

Utility Functions

The first group of functions you look at includes relatively simple functions that you typically use to provide input to the other functions in the module. The count() function works a lot like the built-in range() function, except where range() generates numbers up to a limit, count() generates an indefinite series of numbers from a start point, incrementing by a given, optional, stepsize. It looks like this:

  >>> import itertools as it
  >>> for n in it.count(15,2):
  ...	if n < 40: print(n, end=' ')
  ...	else: break
  ...
  15 17 19 21 23 25 27 29 31 33 35 37 39

The repeat() function is even simpler; it just repeats its argument continuously, or for the number of repetitions specified, like this:

  >>> for n in range(7):
  ...	print(next(it.repeat('yes ')), end='')
  ...
  yes yes yes yes yes yes yes >>>
  >>> list(it.repeat(6,3))
  [6, 6, 6]
  >>>

The cycle() function rotates over the input sequence over and over again. This is useful for building round-robin–style iterations for load balancing or resource allocation. Consider the case where you have a number of resources and want to allocate data to each resource in turn. You can build a list of resources, then cycle over that list until you run out of data. You can simulate this technique using lists as resources, like this:

  >>> res1 = []
  >>> res2 = []
  >>> res3 = []
  >>> resources = it.cycle([res1,res2,res3])
  >>> for n in range(30):
  ...	res = next(resources)
  ...	res.append(n)
  ...
  >>> res1
  [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
  >>> res2
  [1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
  >>> res3
  [2, 5, 8, 11, 14, 17, 20, 23, 26, 29]
  >>>

The chain() function concatenates all the input arguments into a single collection and then returns each element. If the arguments were all of the same type, you could achieve the same result by adding the collections together with the plus operator, but chain() also works for types of collections that are not compatible with the plus operator. Here is an example using a list, a string, and a set:

  >>> items = it.chain([1,2,3],'astring',{'a','set','of','strings'})
  >>> for item in items:
  ...	print(item)
  ...
  1
  2
  3
  a
  s
  t
  r
  i
  n
  g
  a
  of
  set
  strings

Finally, there is the islice() function that works like the slice operator but, because it uses a generator, is more memory efficient. It does have one significant difference from the normal slice: You cannot use negative indices to count backward from the end, because iterators do not always have well-defined endpoints.

You could use islice()like this:

  >>> data = list(range(20))
  >>> data[3:12:2]
  [3, 5, 7, 9, 11]
  >>> for d in it.islice(data,3,12,2): print(d, end=' ')
  ...
  3 5 7 9 11

itertools can do much more than just generate data. It can also help analyze data using a variety of data processing functions.

Data Processing Functions

itertools has many data processing functions that either take input data and transform the elements, or filter the contents in some way. By combining these functions you can build sophisticated data processing tools. One feature that many of these functions have in common is that they accept a function object as a parameter.

The compress() function acts a little bit like a higher-order version of the bitmasks that you explored in Chapters 1 and 2. It takes a collection of data as its first argument and a collection of boolean values as its second. It returns those items of the first collection that correspond to the True values of the second collection. Here is a basic example:

  >>> for item in it.compress([1,2,3,4,5],[False,True,False,0,1]):
  ...	 print (item)
  ...
  2
  5

Note that the boolean values do not need to be pure boolean values; they can be anything that Python can convert to boolean, even expressions. (The itertools.filterfalse() function works in exactly the same way, but in reverse; it returns those elements whose corresponding boolean flags are False instead of True.)

Likewise, the dropwhile() and takewhile() functions have related, but opposite, effects. Both take an input function and a collection, or iterator, as arguments and then apply the function to the input data elements one at a time. dropwhile() ignores all of the input elements until the function argument evaluates to False, whereas takewhile() returns the elements until the result is False. You can see the difference in these examples that use the same input data and argument function:

  >>> def singleDigit(n): return n < 10
  ...
  >>> for n in it.dropwhile(singleDigit,range(20)): print(n,end=' ')
  ...
  10 11 12 13 14 15 16 17 18 19
  >>> for n in it.takewhile(singleDigit,range(20)): print(n,end=' ')
  ...
  0 1 2 3 4 5 6 7 8 9

Note that both of these functions stop processing the data after the first time a trigger is detected. Consider this example:

  >>> for n in it.dropwhile(singleDigit,[1,2,12,4,20,7,999]): print(n,end=' ')
  ...
  12 4 20 7 999

Notice that the output includes the single-digit numbers following the first non–single-digit number, for the reason just indicated: Once dropwhile stops dropping, nothing else is dropped thereafter. (And takewhile's taking behavior is analogous.)

The accumulate() function applies its input function to each element of the input data along with the result of the previous operation. (The default function is addition and the first result is always the first element.) Thus, for an input data set of [1,2,3,4] the initial value, result1, is 1, followed by the function applied to result1 and 2 to produce result2, and to result2 and 3 to create result3, and to result3 and 4 to create result4. The output is result1, result2, result3, and result4. (The final result value is the same as applying the reduce() function from the functools module.) Here is an example using accumulate()'s default addition operator:

  >>> for n in it.accumulate([1,2,3,4,]): print(n, end=' ')
  ...
  1 3 6 10

Taming the Vagaries of groupby()

groupby() is one of the most useful and powerful of the itertools functions, but it has a number of little foibles that can catch you out. Its basic role is to collect the input data into groups based on a key derived by an input function and return those groups as iterators in their own right.

The first problem is that the function only groups for as long as it finds the same key, but it creates a new group if a new key is found. Then, if the original key is found later in the sequence, it creates a new group with the same key rather than adding the new element to the original group. To avoid this behavior, it is best if the input data is sorted using the same key function used by groupby().

The second snag is that the groups generated by groupby() are not really independent iterators; they are effectively views into the original input collection. Thus, if the function moves on to the next group of data, the previous groups become invalid. The only way to retain the groups for later processing is to copy them into a separate container—a list, for example.

To reinforce these concepts, you look at an example that produces a set of data groups that can be processed independently. The example is built up to the final, correct solution starting from an initial, naïve, but broken solution.

First, you define several groups of data and use the built-in all() function as a key. The all() function returns True when all of its input data items are True.

  >>> data = [[1,2,3,4,5],[6,7,8,9,0],[0,2,4,6,8],[1,3,5,7,9]]
  >>> for d in data: print(all(d))
  ...
  True
  False
  False
  True

Next, you apply the groupby() function to your data:

  >>> for ky,grp in it.groupby(data,key=all):
  ...	print(ky, grp)
  ...
  True <itertools._grouper object at 0x7fd3ee2c>
  False <itertools._grouper object at 0x7fd3ee8c>
  True <itertools._grouper object at 0x7fd3ee2c>

You can see that groupby() returned two separate groups both keyed on True. To avoid that you must sort the data before processing it with groupby(), like this:

  >>> for ky,grp in it.groupby(sorted(data,key=all), key=all):
  ...	print(ky, grp)
  ...
  False <itertools._grouper object at 0x7fd3ef4c>
  True <itertools._grouper object at 0x7fd3ee2c>

Now you want to try to access these groups, so you store each one in a variable:

  >>> for ky,grp in it.groupby(sorted(data,key=all), key=all):
  ...	if ky: trueset = grp
  ...	else: falseset=grp
  ...
  >>> for item in falseset: print(item)
  ...
  >>>

As you can see, falseset is empty. That’s because the falseset group was created first and then the underlying iterator (grp) moved on, thus invalidating the value just stored in falseset. To save the sets for later access, you need to store them as lists, like this:

  >>> groups = {True:[], False:[]}
  >>> for ky,grp in it.groupby(sorted(data,key=all), key=all):
  ...	groups[ky].append(list(grp))
  ...
  >>> groups
  {False: [[[6, 7, 8, 9, 0], [0, 2, 4, 6, 8]]],
 	True: [[[1, 2, 3, 4, 5], [1, 3, 5, 7, 9]]]}
  >>>

Notice that you created a dictionary whose keys are the expected ones (True and False), and whose values are lists. You then had to append the groups, converted to lists, as you found them. This may seem complex, but if you remember to sort the input data first and copy the groups into lists as groupby() generates them, you will find that groupby() is a powerful and useful tool.

Using itertools to Analyze LendyDB Data

You’ve seen what itertools has to offer, so now it’s time to try using it with your LendyDB data. You want to repeat the analysis that you did using the standard tools, but see how the itertools functions can be brought to bear, too. Remember, the real point of the itertools module is not so much that it gives you new features, but rather that it lets you process large volumes of data more efficiently. Given the tiny amount of data you are using in this chapter, you won’t actually see any efficiency improvements, but as you scale the data volumes up, it does make a difference.

TRY IT OUT Analyzing the LendyDB Data Using itertools

In this Try It Out, you repeat the earlier analysis of the LendyDB data using some of the itertools functions. To achieve this, follow these steps:

Change into the folder where you stored the lendydata.py file.
Start the Python interpreter and import the data file using the following command:
```
  >>> from lendydata import *
```
Import the itertools module:
```
  >>> from itertools import *
```

To answer the question, “What is the total cost of all items?” type this:

  >>> def cost(item):
  ...	return int(item[4][1:])
  ...
  >>> for n in islice(accumulate(cost(item) for item in items[1:]),
  ...		 len(items)-2,None):
  ...	print(n)
  ...
  1100

To answer the question, “What is the average cost of an item?” type this:
```
  >>> n/len(items)-1
  183.33333333334
  >>>
```

To answer the question, “Who contributed the most items?” type this:

  >>> def owner(item): return item[3]
  ...
  >>>		owners = {}
  >>> for ky,grp in groupby(sorted(items[1:], key=owner), key=owner):
  ... owners[ky] = len(list(grp))
  ...
  >>> for member in members[1:]:
  ...	print(member[1],' : ', owners[member[0]])
  ...
  Fred : 2
  Mike : 1
  Joe : 1
  Rob : 1
  Anne : 1

To answer the question, “Which items are currently on loan?” type this:

  >>> def returned(loan): return not (loan[-1] == 'None')
  ...
  >>> [items[int(loan[1])] for loan in filterfalse(returned,loans)]
  [['5', 'Scarifier', 'Tool', '5', '$200', 'Average', '2013-09-14']]

How It Works

As in the previous Try It Out you created small helper functions to improve the readability of the code. To answer the first question you used the accumulate() function to produce a running count of the costs, then used islice() to extract only the last item by specifying a start index of len(items)-2 and a stop index of None. (You had to subtract 2 to account for the headers line at the start of items.) Because the result, n, was still in scope, you could calculate the average by simply dividing n by the number of items.

The question of who contributed most is answered quite differently from the previous Try It Out because you used groupby() to gather the related items. In this case you are interested only in the size of the group, not the details, so you used len() to calculate the size of the group. You then iterated over the members, in conventional style, to print the names and counts.

You answered the final question by inverting the logic of the helper function, returned(), to return whether an item has been returned. By using that in the filterfalse() function, you could find those items that had not been returned and, therefore, were still out on loan.

In this section you have seen how a mix of the conventional Python data structures, functions, and operators, combined with the functional techniques of the itertools module, enable you to perform complex analysis of quite large data sets. However, there comes a point when the volume and complexity of the data call for another approach, and that means introducing a new technology: relational databases powered by the Structured Query Language (SQL).

Managing Data Using SQL

In this section you are introduced to some of the concepts behind SQL and relational databases. You find out how to use SQL to create data tables and populate them with data, and how to manipulate the data contained within those tables. You go on to link tables to capture the relationships between data and finally apply all of these techniques to your lending library data.

Relational Database Concepts

The basic principle of a relational database is very simple. It’s simply a set of two-dimensional tables. Columns are known as fields and rows as records. Field values can refer to other records, either in the same table, or in another table—that’s the “relational” part.

A table holding data about employees might look like Table 3.3.

Table 3.3 Employee Data

EMPID	NAME	HIREDATE	GRADE	MANAGERID
1020304	John Brown	20030623	Foreman	1020311
1020305	Fred Smith	20040302	Laborer	1020304
1020307	Anne Jones	19991125	Laborer	1020304

Notice a couple of conventions demonstrated by this data:

You have an identifier (ID) field to uniquely identify each row; this ID is known as the primary key. It is possible to have other keys too, but conventionally, there is nearly always an ID field to uniquely identify a record. This helps should an employee decide to change her name, for example.
You can link one row to another by having a field that holds the primary key value for another row. Thus an employee’s manager is identified by the ManagerID field, which is simply a reference to another EmpID entry in the same table. Looking at your data, you see that both Fred and Anne are managed by John who is, in turn, managed by someone else, whose details are not visible in this section of the table.

RELATIONSHIP CARDINALITY

Relationships within a database link two or more entities together. The number of each entity involved in the relationship is known as its cardinality. The relationships can be one-to-one, where one record links to exactly one other record. They can also be one-to-many, such as the employee-to-manager relationship in the example.

A relationship can also be many-to-many. This relationship is best explained by an example. Suppose you introduced a new table of tasks. Each task could have many employees assigned to it. At the same time, each employee could have several tasks. There is, therefore, a many-to-many relationship between employees and tasks.

Much of the work in any database application is focused on maintaining the many-to-many relationships within the database. Several forms of graphical notation (known as entity-relationship diagrams) are used to describe database structures. Most of these notations have a strong emphasis on showing the cardinality of each relationship.

You are not restricted to linking data within a single table. You can create another table for Salary. A salary can be related to Grade, and so you get a second table like Table 3.4.

Table 3.4 Salary Data

SALARYID	GRADE	AMOUNT
000010	Foreman	60000
000011	Laborer	35000

To determine employee John Brown’s salary, you would first look up John’s grade in the main employee data table. You would then consult the Salary table to learn what an employee of that grade is paid. Thus you can see that John, a foreman, is paid $60,000.

Relational databases take their name from this ability to link table rows together in relationships. Other database types include network databases, hierarchical databases, and flat-file databases (which includes the DBM databases you looked at earlier in the chapter). For large volumes of data, relational databases are by far the most common.

You can do much more sophisticated queries, too, and you look at how to do this in the next few sections. But before you can query anything, you need to create a database and insert some data.

Structured Query Language

The Structured Query Language, or SQL (pronounced either as Sequel or as the letters S-Q-L), is the standard software tool for manipulating relational databases. In SQL an expression is often referred to as a query, regardless of whether it actually returns any data.

SQL is comprised of two parts. The first is the data definition language (DDL). This is the set of commands used to create and alter the shape of the database itself—its structure. DDL tends to be quite specific to each database, with each vendor’s DLL having a slightly different syntax.

The other part of SQL is the data manipulation language (DML). DML, used to manipulate database content rather than structure, is much more highly standardized between databases. You spend the majority of your time using DML rather than DDL.

You only look briefly at DDL, just enough to create (with the CREATE command) and destroy (with the DROP command) your database tables so that you can move on to filling them with data, retrieving that data in interesting ways, and even modifying it, using the DML commands (INSERT, SELECT, UPDATE, and DELETE).

Creating Tables

To create a table in SQL you use the CREATE command. It is quite easy to use and takes the following form:

  CREATE TABLE tablename (fieldName, fieldName,....);

SQL is not case-sensitive and, unlike Python, does not care about whitespace or indentation levels. An informal style convention is used, but it is not rigidly adhered to, and SQL itself cares not a jot!

Try creating your Employee and Salary tables in SQLite. The first thing to do is start the interpreter, which you do simply by invoking sqlite3 with a single command-line argument, the database filename. If that database file exists, the interpreter will open it, otherwise it will create a new database file by that name. (If you omit the database filename entirely, the interpreter will still process your commands, but your data will exist only in RAM, and will disappear irretrievably when you exit the interpreter.)

Thus, to create an employee database you execute the SQLite interpreter like this:

  $ sqlite3 employee.db
  SQLite version 3.8.2 2013-12-06 14:53:30
  Enter ".help" for instructions
  Enter SQL statements terminated with a ";"
  sqlite>

The interpreter creates an empty database called employee.db and leaves you at the sqlite> prompt, ready to type SQL commands. You are now ready to create some tables:

  sqlite> create table Employee
 	...> (EmpID,Name,HireDate,Grade,ManagerID);
  sqlite> create table Salary
 	...> (SalaryID, Grade,Amount);
  sqlite> .tables
  Employee Salary
  sqlite>

Note that you moved the list of fields into a separate line, making it easier to see them. The fields are listed by name but have no other defining information such as data type. This is a peculiarity of SQLite; most databases require you to specify the type along with the name. It is possible to specify types in SQLite too, but it is not essential (you look at this in more detail later in the chapter).

Also note that you tested that the create statements had worked by using the .tables command to list all the tables in the database. The SQLite interpreter supports several of these dot commands that you use to find out about your database. .help provides a list of the commands along with a brief description of their functions.

You can do lots of other things when you create a table. As well as declaring the types of data in each column, you can also specify constraints on the values. Constraints are rules that the database enforces to help ensure the data remains consistent. For example, NOT NULL means the value is mandatory and must be filled in, and UNIQUE means that no other record can have the same value in that field. Usually you specify the primary key field to be NOT NULL and UNIQUE. You can also specify which field is the PRIMARY KEY. You look more closely at these more advanced creation options later in the chapter.

For now you leave the basic table definition as it is and move on to the more interesting topic of actually creating some data.

Inserting Data

The first thing to do after creating the tables is to fill them with data. You do this using the SQL INSERT statement. The structure is very simple:

  INSERT INTO Tablename ( column1, column2... ) VALUES ( value1, value2... );

An alternate form of INSERT uses a query to select data from elsewhere in the database, but that’s too advanced at this stage. You can read about it in the SQLite manual, which you can find at: http://sqlite.org/lang.html.

To insert some rows into your Employee table, do the following:

  sqlite> insert into Employee (EmpID, Name, HireDate, Grade, ManagerID)
  	...> values ('1020304','John Brown','20030623','Foreman','1020311'),
  sqlite> insert into Employee (EmpID, Name, HireDate, Grade, ManagerID)
  	...> values ('1020305','Fred Smith','20040302','Laborer','1020304'),
  sqlite> insert into Employee (EmpID, Name, HireDate, Grade, ManagerID)
  	...> values ('1020307','Anne Jones','19991125','Laborer','1020304'),

And for the Salary table:

  sqlite> insert into Salary (SalaryID, Grade,Amount)
  	...> values('000010','Foreman',60000);
  sqlite> insert into Salary (SalaryID, Grade,Amount)
  	...> values('000011','Laborer',35000);

And that’s it. You have created two tables and populated them with data corresponding to the values described in the introduction. Notice that you used actual numbers for the salary amount, not just string representations. SQLite tries to determine the correct data type based on the INSERT input values we provide. Because it makes the most sense to have SQLite maintain salary data as a numeric type, it behooves you to inform SQLite of that preference in your INSERT statements by specifying salary data in numeric—not string—format.

Now you are ready to start experimenting with the data. This is where the fun starts!

Reading Data

You read data from a database using SQL’s SELECT command. SELECT is the very heart of SQL and has the most complex structure of all the SQL commands. You start with the most basic form and add additional features as you go. The most basic SELECT statement looks like this:

  SELECT column1, column2... FROM table1,table2...;

To select the names of all employees in your database, you use:

  sqlite> SELECT Name from Employee;
  John Brown
  Fred Smith
  Anne Jones

You are rewarded with a list of all of the names in the Employee table. In this case that’s only three, but if you have a big database that’s probably going to be more information than you want. To restrict the output, you need to be able to limit your search somewhat. SQL enables you to do this by adding a WHERE clause to your SELECT statement, like this:

  SELECT col1,col2... FROM table1,table2... WHERE condition;

The condition is an arbitrarily complex boolean expression that can even include nested SELECT statements within it.

Now, add a WHERE clause to refine your search of names. This time you only look for names of employees who are laborers:

  sqlite> SELECT Name
  	...> FROM Employee
  	...> WHERE Employee.Grade = 'Laborer';
  Fred Smith
  Anne Jones

You only get two names back, not three (because John Brown is not a laborer). You can extend the WHERE condition using boolean operators such as AND, OR, NOT, and so on. Note that = in a WHERE condition performs a case-sensitive test. When using the = test, the case of the string is important; testing for 'laborer' would not have worked!

SQLite has some functions that can be used to manipulate strings, but it also has a comparison operator called LIKE that uses % as a wildcard character for more flexible searching. The example just shown, written using LIKE, looks like this:

  sqlite> SELECT Name FROM Employee
  	...> WHERE lower(employee.grade) LIKE 'lab%';
  Fred Smith
  Anne Jones

After converting grade to lowercase, you then tested it for an initial substring of 'lab'. When used in conjunction with lower(), upper(), and SQLite’s other string-manipulation functions, LIKE can greatly increase the scope of your text-based searches. The SQLite documentation has a full list of the functions available.

Notice too that in the WHERE clause you used dot notation (employee.grade) to signify the Grade field. In this case it was not really needed because you were only working with a single table (Employee, as specified in the FROM clause) but, where multiple tables are specified, you need to make clear which table the field belongs to. As an example, change your query to find the names of all employees paid more than $50,000. To do that, you need to consider data in both tables:

  sqlite> SELECT Name, Amount
  	...> FROM Salary, Employee
  	...> WHERE Employee.Grade = Salary.Grade
  	...> AND Salary.Amount > 50000;
  John Brown|60000

As expected, you only get one name back—that of the foreman. But notice that you also got back the salary, because you added Amount to the list of columns selected. Also note that you have two parts to your WHERE clause, combined using an AND boolean operator. The first part links the two tables by ensuring that the common fields are equal; this is known as a join in SQL. A couple other features of this query are worth noting.

Because the fields that you are selecting exist in two separate tables, you have to specify both of the tables from which the result comes. The order of the field names is the order in which you get the data back, but the order of the tables doesn’t matter so long as the specified fields appear in those tables.

You specified two unique field names so SQLite can figure out which table to take them from. If you had also wanted to display the Grade, which appears in both tables, you would have had to use dot notation to specify which table’s Grade you wanted, like this:

  sqlite> SELECT Employee.Grade, Name, Amount
  	...> FROM Employee, Salary
 	etc/...

Note in particular that SQL would require such qualification even though the choice of table here for the Grade field really does not matter, because the WHERE condition guarantees that for any result row displayed the grades of the two tables will have the identical value in any case.

The final feature of SELECT discussed here (although you can read about several more in the SQL documentation for SELECT) is the capability to sort the output. Databases generally hold data either in the order that makes it easiest to find things, or in the order in which they are inserted; in either case that’s not usually the order you want things displayed! To deal with that you can use the ORDER BY clause of the SELECT statement. It looks like this:

  SELECT columns FROM tables WHERE expression ORDER BY columns;

Notice that the final ORDER BY clause can take multiple columns; this enables you to have primary, secondary, tertiary, and so on sort orders.

You can use this to get a list of names of employees sorted by HireDate:

  sqlite> SELECT Name
  	...> FROM Employee
  	...> ORDER BY HireDate;
  Anne Jones
  John Brown
  Fred Smith

(It is interesting to note that HireDate was perfectly acceptable as an ORDER BY column, even though HireDate is not a column SELECTed for display.)

And that’s really all there is to it; you can’t get much easier than that! The only thing worthy of mention is that you didn’t use a WHERE clause. If you had used one, it would have had to come before the ORDER BY clause. Thus, although SQL doesn’t require that all components of a SELECT statement be present, it does require that those elements that are present will appear in a prescribed order.

That’s enough about reading data; you now look at how to modify your data in place.

Modifying Data

You can change the data in your database in two ways. You can alter the contents of one or more records, or, more drastically, you can delete a record or even the contents of a whole table. Changing the content of an existing record is the more common case, and you do that using SQL’s UPDATE command.

The basic format is:

  UPDATE table SET column = value WHERE condition;

You can try it out on the Employee database by changing the salary of a foreman to $70,000:

  sqlite> UPDATE Salary
  	...> SET Amount = 70000
  	...> WHERE Grade = 'Foreman';

Be careful to get the WHERE clause right. If you don’t specify one, every row in the table is modified, and that’s not usually a good idea. Similarly, if the WHERE clause is not specific enough, you end up changing more rows than you want. One way to check you have it right is to do a SELECT using the same WHERE clause and check that only the rows you want to change are found. If all is well, you can repeat the WHERE clause from the SELECT in your UPDATE statement.

A more drastic change you might need to make to your database table, rather than merely modifying certain fields of a given row or rows, is to entirely delete one or more rows from the table. You would do this using SQL’s DELETE FROM command, whose basic form looks like this:

  DELETE FROM Table WHERE condition

So, if you want to delete Anne Jones from your Employee table you can do it like this:

  sqlite> DELETE FROM Employee
  	...> WHERE Name = 'Anne Jones';

If more than one row matches your WHERE condition, all of the matching rows are deleted. SQL always operates on all the rows that match the specified WHERE condition. In this respect SQL is quite different from, say, using a regular expression in a Python program to perform substring substitution on a string (where the default behavior is to modify only the first occurrence found, unless you specifically request otherwise).

An even more drastic change you might want to make to your database is to delete not only all of a table’s rows, but to delete the entire table itself. This is done using SQL’s DROP command.

Obviously, destructive commands like DELETE and DROP must be used with extreme caution.

Linking Data Across Tables

The possibility of linking data between tables was mentioned earlier, in the section on SELECT. However, this is such a fundamental part of relational database theory that you consider it in more depth here. The links between tables represent the relationships between data entities that give a relational database such as SQLite its name. The database maintains not only the raw data about the entities, but also information about the relationships.

The information about the relationships is stored in the form of database constraints, applied when you define the database structure using the CREATE statement. Before you see how to use constraints to model relationships, you first need to look deeper into the kinds of constraints available in SQLite.

Digging Deeper into Data Constraints

You normally express the constraints on a field-by-field basis within the CREATE statement. This means you can expand the basic CREATE definition from,

  CREATE Tablename (Column, Column,...);

to:

  CREATE Tablename (
  ColumnName Type Constraints,
  ColumnName Type Constraints,
  ...);

The most common constraints are:

NOT NULL
PRIMARY KEY [AUTOINCREMENT]
UNIQUE
DEFAULT value

NOT NULL is fairly self-explanatory; it indicates that the value must exist and not be NULL. A NULL value is simply one that has no specified value, rather like None in Python. If no suitable value is provided for a field governed by a NOT NULL constraint, data insertion will utterly fail—quite possibly not only for that particular field, but also for the entire row. (Or, far worse, violation of the constraint could cause a very large database update transaction—possibly involving hundreds or thousands of rows—to fail in its entirety.)

PRIMARY KEY tells SQLite to use this column as the main key for lookups (in practice this means it is optimized for faster searches). The optional AUTOINCREMENT keyword means that an INTEGER type value is automatically assigned on each INSERT and the value automatically incremented by one. This saves a lot of work for the programmer in maintaining separate counts. Note that the AUTOINCREMENT “keyword” is not normally used; rather, it is implied from a type/constraint combination of INTEGER PRIMARY KEY. This not-so-obvious quirk of the SQLite documentation trips up enough people for it to appear at the top of the SQLite Frequently Asked Questions page, found here: http://sqlite.org/faq.html.

The UNIQUE constraint means that the value of the field must be unique within the specified column. If you try to insert a duplicate value into a column that has a UNIQUE constraint, an error results and the row is not inserted. UNIQUE is often used for non-INTEGER type PRIMARY KEY columns.

DEFAULT is always accompanied by a value. The value is what SQLite inserts into that field if the user does not explicitly provide one.

Here is a short example showing some of these constraints, including the use of DEFAULT:

  sqlite> CREATE table test
  	...> (Id INTEGER PRIMARY KEY,
  	...> Name NOT NULL,
  	...> Value INTEGER DEFAULT 42);
  sqlite> INSERT INTO test (Name, Value) VALUES ('Alan', 24);
  sqlite> INSERT INTO test (Name) VALUES ('Heather'),
  sqlite> INSERT INTO test (Name, Value) VALUES ('Laura', NULL);
  sqlite> SELECT * FROM test;
  1|Alan|24
  2|Heather|42
  3|Laura|

The first thing to notice is that although none of the INSERT statements had an Id value, there is an Id value in the SELECT output. That’s because by specifying Id to be an INTEGER PRIMARY KEY, it is automatically generated by SQLite. Notice too how the entry for Heather has the default Value set. Also, note that the Value for Linda is nonexistent, or NULL. There is an important difference between NOT NULL and DEFAULT. The former does not allow NULL values, either by default or explicitly. The DEFAULT constraint prevents unspecified NULLs, but does not prevent deliberate creation of NULL values.

You can also apply constraints to the table itself, such as how to handle data conflicts like duplicates in a UNIQUE column. For example, a table constraint could specify that where a conflict occurs the entire database query will be cancelled, or it could specify that only the changes to the conflicting row be cancelled. Table constraints are not discussed further in this chapter; you should consult the documentation for details.

Revisiting SQLite Field Types

The other kind of constraint that you can apply, as already mentioned, is to specify the column type. This is very like the concept of data types in a programming language. The valid types in SQLite are as follows:

TEXT
INTEGER
REAL
NUMERIC
BLOB
NULL

These types should be self-evident, with the possible exceptions of NUMERIC, which enables the storage of floating-point numbers as well as integers, and BLOB, which stands for Binary Large Object, typically used for media data such as images. NULL is not really a type, but simply suggests that no explicit type has been specified. Most databases come with a much wider set of types including, crucially, a DATE type. As you are about to see, however, SQLite has a somewhat unconventional approach to types that renders such niceties less relevant.

Most databases strictly apply the types specified. However, SQLite employs a more dynamic scheme, where the type specified is more like a hint and any type of data can be stored in the table. When data of a different type is loaded into a field, SQLite uses the declared type to try to convert the data, but if it cannot be converted, it is stored in its original form. Thus, if a field is declared as INTEGER, but the TEXT value '123' is passed in, SQLite converts the string '123' to the number 123. However, if the TEXT value 'Freddy' is passed in, the conversion fails, so SQLite simply stores the string 'Freddy' in the field! This can cause some strange behavior if you are not aware of this foible. Most databases treat the type declaration as a strict constraint and fail if an illegal value is passed.

Modeling Relationships with Constraints

Having seen the various kinds of constraints available and how you can use them in your database, it’s time to return to the topic of modeling relationships. So how do constraints help you to model data and, in particular, relationships? Look again at your simple two-table database, as summarized in Table 3.5 and Table 3.6.

Table 3.5 Employee Database Table

EMPID	NAME	HIREDATE	GRADE	MANAGERID
1020304	John Brown	20030623	Foreman	1020311
1020305	Fred Smith	20040302	Laborer	1020304
1020307	Anne Jones	19991125	Laborer	1020304

Table 3.6 Salary Database Table

SALARYID	GRADE	AMOUNT
000010	Foreman	60000
000011	Laborer	35000

Looking at the Employee table first you see that the EmpID value should be of INTEGER type and have a PRIMARY KEY constraint; the other columns, with the possible exception of the ManagerID, should be NOT NULL. ManagerID should also be of type INTEGER.

For the Salary table you see that, once more, the SalaryID should be an INTEGER with PRIMARY KEY. The Amount column should also be an INTEGER, and you should apply a DEFAULT value of, say, 20000. Finally, the Grade column should be constrained as UNIQUE because you don’t want more than one salary per grade! (Actually, this is a bad idea because normally salary varies with things like length of service as well as grade, but you ignore such niceties for now. In fact, in the real world, you should probably call this a Grade table and not Salary.)

The modified SQL looks like this:

  sqlite> CREATE TABLE Employee (
  	...> EmpID INTEGER PRIMARY KEY,
  	...> Name NOT NULL,
  	...> HireDate NOT NULL,
  	...> Grade NOT NULL,
  	...> ManagerID INTEGER
  	...> );
  sqlite> CREATE TABLE Salary (
  	...> SalaryID INTEGER PRIMARY KEY,
  	...> Grade UNIQUE,
  	...> Amount INTEGER DEFAULT 20000
  	...> );

You can try out these constraints by attempting to enter data that breaks them and see what happens. Hopefully you see an error message!

One thing to point out here is that the INSERT statements you used previously are no longer adequate. You previously inserted your own values for the ID fields, but these are now auto-generated so you can (and should) omit them from the inserted data. But this gives rise to a new difficulty. How can you populate the ManagerID field if you don’t know the EmpID of the manager? The answer is you can use a nested select statement. In this example, you do this in two stages using NULL fields initially and then using an UPDATE statement after creating all the rows.

To avoid a lot of repeat typing you can put all of the commands in a couple of files, called employee.sql for the table creation commands and load_employee.sql for the INSERT statements. This is the same idea as creating a Python script file ending in .py to save typing everything at the >>> prompt.

The employee.sql file looks like this (and is in the SQL folder of the Chapter3.zip download):

  DROP TABLE Employee;
  CREATE TABLE Employee (
  EmpID INTEGER PRIMARY KEY,
  Name NOT NULL,
  HireDate NOT NULL,
  Grade NOT NULL,
  ManagerID INTEGER
  );

  DROP TABLE Salary;
  CREATE TABLE Salary (
  SalaryID INTEGER PRIMARY KEY,
  Grade UNIQUE,
  Amount INTEGER DEFAULT 10000
  );

Notice that you drop the tables before creating them. The DROP TABLE command, as mentioned earlier, deletes the table and any data within it. This ensures the database is in a completely clean state before you start creating your new table. (You will get some errors reported the first time you run this script because no tables exist to DROP, but you can ignore them. Subsequent executions should be error free.)

The load_employee.sql script looks like this (and is also available in the SQL folder of the .zip file):

  INSERT INTO Employee (Name, HireDate, Grade, ManagerID)
  		 VALUES ('John Brown','20030623','Foreman', NULL);
  INSERT INTO Employee (Name, HireDate, Grade, ManagerID)
  		 VALUES ('Fred Smith','20040302','Labourer',NULL);
  INSERT INTO Employee (Name, HireDate, Grade, ManagerID)
  		 VALUES ('Anne Jones','19991125','Labourer',NULL);

  UPDATE Employee
  SET ManagerID = (SELECT EmpID
  		 From Employee
  		 WHERE Name = 'John Brown')
  WHERE Name IN ('Fred Smith','Anne Jones'),

  INSERT INTO Salary (Grade, Amount)
  		 VALUES('Foreman','60000'),
  INSERT INTO Salary (Grade, Amount)
  		 VALUES('Labourer','35000'),

Notice the use of the nested SELECT statement inside the UPDATE command, and also the fact that you used a single UPDATE to modify both employee rows at the same time by using The SQL IN operator that works like the Python in keyword for testing membership of a collection. By extending the set of names being tested, you can easily add more employees with the same manager.

This is typical of the problems you can have when populating a database when constraints are being used. You need to plan the order of the statements carefully to ensure that, for every row that contains a reference to another table, you have already provided the data for it to reference! It’s a bit like starting at the leaves of a tree and working back to the trunk. Always create, or insert, the data with no references first, then the data that references that data, and so on. If you are adding data after the initial creation, you need to use queries to check that the data you need already exists, and add it if it doesn’t. At this point a scripting language like Python becomes invaluable!

Finally, you run these from the SQLite prompt like this:

  sqlite> .read employee.sql
  sqlite> .read load_employee.sql

Make sure you have the path issues sorted out, though: Either start sqlite3 from wherever the .sql scripts are stored (as you did earlier) or provide the full path to the script.

Now try a query to check that everything is as it should be:

  sqlite> SELECT Name
  	...> FROM Employee
  	...> WHERE Grade IN
  	...> (SELECT Grade FROM Salary WHERE amount >50000)
  	...> ;
  John Brown

That seems to work; John Brown is the only employee earning more than $50,000. Notice that you used an IN condition combined with another embedded SELECT statement. This is a variation on a similar query that you performed previously using a cross table join. Both techniques work, but usually the join approach will be faster.

Although this is an improvement over the original unconstrained definition and it ensures that the ManagerID is an integer, it does not ensure that the integer is a valid EmpID key. You need the embedded SELECT statement for that. However, SQLite offers one more constraint that helps you ensure the data is consistent, and that is the REFERENCES constraint. This tells SQLite that a given field references another specified field somewhere in the database. You can apply the REFERENCES constraint to the ManagerID field by modifying the CREATE statement like this:

  CREATE TABLE Employee (
  ...
  ManagerID INTEGER REFERENCES Employee(EmpID)
  );

You see that the REFERENCES constraint specifies the table and the key field within it. At the time of writing, SQLite does not actually enforce this constraint by default (although it is worth including in your code as documentation of your intentions). However, you can turn checking on by using a pragma statement. A pragma is a special command that is used to control the way the interpreter works. It looks like this:

  PRAGMA Foreign_Keys=True;

If you now modify the employee.sql file to add the pragma and modify the create statement, it should look like this:

  PRAGMA Foreign_Keys=True;

  DROP TABLE Employee;
  CREATE TABLE Employee (
  EmpID INTEGER PRIMARY KEY,
  Name NOT NULL,
  HireDate NOT NULL,
  Grade NOT NULL,
  ManagerID INTEGER REFERENCES Employee(EmpID)
  );

  DROP TABLE Salary;
  CREATE TABLE Salary (
  SalaryID INTEGER PRIMARY KEY,
  Grade UNIQUE,
  Amount INTEGER DEFAULT 10000
  );

After running the script and reloading the data with the load_employee.sql script, you can check that it works by trying to insert a new employee with a ManagerID not already in the table. Like this:

  sqlite> .read employee.sql
  sqlite> .read load_employee.sql
  sqlite> insert into Employee (Name,HireDate,Grade,ManagerID)
  	...> values('Fred', '20140602','Laborer',999);
  Error: FOREIGN KEY constraint failed
  sqlite>

This is a big advantage in keeping your data consistent. It is now impossible for a non-valid worker-to-manager relationship to be created (although you can still use NULL values that indicate that no manager relationship exists).

Many-to-Many Relationships

One scenario you haven’t covered is where two tables are linked in a many-to-many relationship. That is, a row in one table can be linked to several rows in a second table and a row in the second table can at the same time be linked to many rows in the first table.

Consider an example. Imagine creating a database to support a book publishing company. It needs lists of authors and lists of books. Each author may write one or more books. Each book may have one or more authors. How do you represent that in a database? The solution is to represent the relationship between books and authors as a table in its own right. Such a table is often called an intersection table or a mapping table. Each row of this table represents a book/author relationship. Now each book has potentially many book/author relationships, but each relationship only has one book and one author, so you have converted a many-to-many relationship into two one-to-many relationships. And you already know how to build one-to-many relationships using IDs. It looks like this (you can find the code in the file books.sql in the SQL folder of the .zip file):

  PRAGMA Foreign_Keys=True;

  drop table author;
  create table author (
  ID Integer PRIMARY KEY,
  Name Text NOT NULL
  );

  drop table book;
  create table book (
  ID Integer PRIMARY KEY,
  Title Text NOT NULL
  );

  drop table book_author;
  create table book_author (
  bookID Integer NOT NULL REFERENCES book(ID),
  authorID Integer NOT NULL REFERENCES author(ID)
  );

  insert into author (Name) values ('Jane Austin'),
  insert into author (Name) values ('Grady Booch'),
  insert into author (Name) values ('Ivar Jacobson'),
  insert into author (Name) values ('James Rumbaugh'),

  insert into book (Title) values('Pride & Prejudice'),
  insert into book (Title) values('Emma'),
  insert into book (Title) values('Sense &; Sensibility'),
  insert into book (Title) values ('Object Oriented Design with Applications'),
  insert into book (Title) values ('The UML User Guide'),

  insert into book_author (BookID,AuthorID) values (
  (select ID from book where title = 'Pride &; Prejudice'),
  (select ID from author where Name = 'Jane Austin')
  );

  insert into book_author (BookID,AuthorID) values (
  (select ID from book where title = 'Emma'),
  (select ID from author where Name = 'Jane Austin')
  );

  insert into book_author (BookID,AuthorID) values (
  (select ID from book where title = 'Sense & Sensibility'),
  (select ID from author where Name = 'Jane Austin')
  );

  insert into book_author (BookID,AuthorID) values (
  (select ID from book where title = 'Object Oriented Design with Applications'),
  (select ID from author where Name = 'Grady Booch')
  );

  insert into book_author (BookID,AuthorID) values (
  (select ID from book where title = 'The UML User Guide'),
  (select ID from author where Name = 'Grady Booch')
  );

  insert into book_author (BookID,AuthorID) values (
  (select ID from book where title = 'The UML User Guide'),
  (select ID from author where Name = 'Ivar Jacobson')
  );

  insert into book_author (BookID,AuthorID) values (
  (select ID from book where title = 'The UML User Guide'),
  (select ID from author where Name = 'James Rumbaugh')
  );

If you look at the values inserted into the tables, you see that Jane Austin has three books to her credit, while the book The UML User Guide has three authors.

If you load that into SQLite in a database called books.db (or just use the file books.db found in the SQL folder of the Chapter3. zip file), you can try some queries to see how it works:

  $ sqlite3 books.db
  SQLite version 3.8.2 2013-12-06 14:53:30
  Enter ".help" for instructions
  Enter SQL statements terminated with a ";"
  sqlite> .read books.sql
  Error: near line 3: no such table: author
  Error: near line 9: no such table: book
  Error: near line 15: no such table: book_author

  sqlite> .tables
  author	book	book_author

Notice the errors resulting from the DROP statements. You always get those the first time you run the script because the tables don’t exist yet. Now you can find out which Jane Austin books are published:

  sqlite> SELECT title FROM book, book_author
  	...> WHERE book_author.bookID = book.ID
  	...> AND book_author.authorID = (
  	...> SELECT ID from Author
  	...> WHERE name='Jane Austin'),
  Pride & Prejudice
  Emma
  Sense & Sensibility

Things are getting a bit more complex, but if you sit and work through it you’ll get the idea soon enough. Notice you need to include both of the referenced tables—book and book_author—in the table list after the SELECT. (The third table, author, is not listed there because it is listed against its own embedded SELECT statement.) Now try it the other way around—see who wrote The UML User Guide:

  sqlite> SELECT name FROM author, book_author
  	...> WHERE book_author.authorID = author.ID
  	...> AND book_author.bookID = (
  	...> SELECT ID FROM book
  	...> WHERE title='The UML User Guide'),
  Grady Booch
  Ivar Jacobson
  James Rumbaugh

If you look closely you see that the structure of the two queries is identical—you just swapped around the table and field names a little.

That’s enough for that example; you now return to your lending library. You now see how you convert it from file-based storage to a full SQL database. You then go on to build an accompanying Python module that enables application writers to ignore the SQL and just call Python functions.

Migrating LendyDB to an SQL Database

In this section you re-create the LendyDB database using a combination of SQL and Python code. Before getting into the nitty-gritty, you need to see how Python and SQL come together.

Accessing SQL from Python

SQLite provides an application programming interface or API consisting of a number of standard functions that allow programmers to perform all the same operations that you have been doing without using the interactive SQL prompt. The SQLite API is written in C, but wrappers have been provided for other languages, including Python. Python has similar interfaces to many other databases, and they all provide a standard set of functions and provide very similar functionality. This interface is called the Python DBAPI, and its existence makes porting data between databases much easier than if each database had its own interface.

The DBAPI defines a couple of useful conceptual objects that form the core of the interface. These are connections and cursors.

Using SQL Connections

A connection is the conduit between your application code and the database’s SQL engine. The name comes from the client-server architecture used by most SQL databases whereby the client must connect to the server over a network. For SQLite the connection is actually to the data file via the SQLite library. The arguments passed to create the connection are of necessity database-specific. For example, many databases require a user ID and password, and some require IP addresses and ports, whereas SQLite just requires a filename. It looks like this:

  >>> import sqlite3
  >>> db = sqlite3.connect('D:/PythonCode/Chapter3/SQL/lendy.db')

Once a connection has been established, you can go on to create cursors, which are the mechanism used to issue SQL commands and receive their output. In principle you can have more than one cursor per connection, but in practice that is rarely needed.

Using a Cursor

When accessing a database from within a program, one important consideration is how to access the many rows of data potentially returned by a SELECT statement without running out of memory. The answer is to use what is known in SQL as a cursor. A cursor is a Python iterator so it can be accessed iteratively one row at a time. Thus, by selecting data into a cursor and using the cursor methods to extract the results either as a list (for smaller volumes of data), or row by row, you can process large collections of data. You now try that out.

TRY IT OUT Using the Python DBAPI

In this Try It Out, you use the sqlite3 module to create a table, populate it with data, and perform some queries with SELECT. You then delete the data, drop the tables, and exit. To achieve that complete the following steps:

Start the Python interpreter and type the following code:

  >>> import sqlite3
  >>> db = sqlite3.connect(':memory:')
  >>> cur = db.cursor()
  >>> cur.execute("create table test(id,name)")
  <sqlite3.Cursor object at 0x7fd28ca0>
  >>> cur.execute(
  ... "insert into test (id,name) values (1,'Alan')")
  <sqlite3.Cursor object at 0x7fd28ca0>
  >>> cur.execute(
  ... "insert into test (id,name) values (2,'Laura')")
  <sqlite3.Cursor object at 0x7fd28ca0>
  >>> cur.execute(
  ... "insert into test (id,name) values (3,'Jennifer')")
  <sqlite3.Cursor object at 0x7fd28ca0>
  >>> cur.execute("Select * FROM test")
  <sqlite3.Cursor object at 0x7fd28ca0>
  >>> print(cur.fetchall())
  [(1, 'Alan'), (2, 'Laura'), (3, 'Jennifer')]

As you can see, the cursor returns a list of tuples. This is very similar to what you started off with, back in the “Using Comma-Separated Values” section of Chapter 2! And you could simply use this list in your program as if you had read it from a file—using the database merely as a persistence mechanism. However, the real power of the database lies in its ability to perform sophisticated queries using SELECT. Try out a few queries using the following code:
```
  >>> def findData(cursor, aString):
  ... cursor.execute("select * from test where name like ?",(aString,))
  ... return cur.fetchall()
  ...
  >>> findData(cur,'A%')
  [(1, 'Alan')]
  >>> findData(cur,'%a%')
  [(1, 'Alan'), (2, 'Laura')]
```

Finally, try a few data manipulation commands:

  >>> cur.execute("delete from test")
  <sqlite3.Cursor object at 0x7fd28ca0>
  >>> cur.execute("drop table test")
  <sqlite3.Cursor object at 0x7fd28ca0>
  >>> cur.close()
  >>> db.commit()
  >>> db.close()

How It Works

You started by importing the sqlite module. The next step was to create a connection to the database file, which in this case was :memory:, indicating you wanted a temporary database held in the computer’s memory. You then created a cursor object that you subsequently used for all the SQL interaction.

You created a test table and inserted three records. You then used the Select * idiom to extract all the data from the table. You used the fetchall() method of the cursor to view the output as a list of tuples. (Other options include using fetchone() within a loop or fetchmany() to return batches of results.)

The next step involved defining a Python function that used a string to locate records of interest. This demonstrates how Python functions can be created to hide the details of the SQL from the user. You use this technique later to create an API for the lending library applications.

Having shown that the function works, you deleted the records, dropped the tables, and closed the cursor. You then executed the connection commit() method that permanently applies any changes you made, followed by closing the database connection itself.

The use of commit is slightly complex, with the DBAPI sometimes calling commit for you, and at other times you need to do it explicitly. As a general rule, you should always call commit before closing the database connection. In this example it makes no real difference because the database exists only in memory and disappears whenever you close the connection. If you are using a file-based database, you need the commit() to ensure your changes are visible next time you open the database.

Creating the LendyDB SQL Database

The database design is not significantly different from its previous incarnations. You just have to translate it into SQL syntax and add in some constraints to improve data integrity.

The initial database setup is usually easiest to do using raw SQL commands as you did in the earlier sections. Although it is possible to do it all from Python, occasionally another tool is better suited to the task at hand.

The code looks like this (and is available in the file lendydb.sql in the SQL folder of the .zip file):

  PRAGMA Foreign_Keys=True;

  drop table loan;
  drop table item;
  drop table member;

  create table member (
  ID INTEGER PRIMARY KEY,
  Name TEXT NOT NULL,
  Email TEXT);

  create table item (
  ID INTEGER PRIMARY KEY,
  Name TEXT NOT NULL,
  Description TEXT NOT NULL,
  OwnerID INTEGER NOT NULL REFERENCES member(ID),
  Price NUMERIC,
  Condition TEXT,
  DateRegistered TEXT);

  create table loan (
  ID INTEGER PRIMARY KEY,
  ItemID INTEGER NOT NULL REFERENCES item(ID),
  BorrowerID INTEGER NOT NULL REFERENCES member(ID),
  DateBorrowed TEXT NOT NULL,
  DateReturned TEXT);

You need to drop all the tables at the top of the script, and the order is important because otherwise the referential constraints fail, and an error results. Notice also that you used the NUMERIC type for the item price because this caters for both integer and floating-point values. It also means you don’t need to worry about those pesky dollar signs that you had to strip off in the previous incarnations of the data. Apart from the various key fields, the other fields are all declared as TEXT.

Now that the database is ready, you can insert the test data.

Inserting Test Data

The database design now includes some referential constraints so you need to think about the order in which you populate the data. The member data has no references, so it can be loaded first. The item data only references members, so it comes next. Finally, the loan data references both members and items, so it must come last.

For the data insertion, you are essentially repeating the same INSERT operation over and over with different data values. This time Python is the better solution because you can write the SQL once and execute it many times using either the executemany() method of the cursor object or by calling the execute() method from within a Python for loop. It looks like this (and the code is in the file lendydata-sql.py):

  import sqlite3
  members = [
  	['Fred', '[email protected]'],
  	['Mike', '[email protected]'],
  	['Joe', '[email protected]'],
  	['Rob', '[email protected]'],
  	['Anne', '[email protected]'],
  ]

  member_sql = '''insert into member (Name, Email) values (?, ?)'''
  items = [
  	['Lawnmower','Tool', 	0, 150,'Excellent', 	'2012-01-05'],
  	['Lawnmower','Tool', 	0, 370,'Fair',		'2012-04-01'],
  	['Bike', 'Vehicle', 	0, 200,'Good',		'2013-03-22'],
  	['Drill', 'Tool', 	0, 100,'Good',		'2013-10-28'],
  	['Scarifier','Tool', 	0, 200,'Average',	'2013-09-14'],
  	['Sprinkler','Tool', 	0, 80,'Good',		'2014-01-06']
  ]
  item_sql = '''
  insert into item
  (Name, Description, ownerID, Price, Condition, DateRegistered)
  values (?, ?, ?, ?, ?, date(?))'''
  set_owner_sql = '''
  update item
  set OwnerID = (SELECT ID from member where name = ?)
  where item.id = ?
  '''

  loans = [
  	[1,3,'2012-01-04','2012-04-26'],
  	[2,5,'2012-09-05','2013-01-05'],
  	[3,4,'2013-07-03','2013-07-22'],
  	[4,1,'2013-11-19','2013-11-29'],
  	[5,2,'2013-12-05', None]
  ]
  loan_sql = '''
  insert into loan
  (itemID, BorrowerID, DateBorrowed, DateReturned )
  values (?, ?, date(?), date(?))'''

  db = sqlite3.connect('lendy.db')
  cur = db.cursor()

  cur.executemany(member_sql, members)
  cur.executemany(item_sql, items)
  cur.executemany(loan_sql, loans)

  owners = ('Fred','Mike','Joe','Rob','Anne','Fred')
  for item in cur.execute("select id from item").fetchall():
 	itemID = item[0]
  cur.execute(set_owner_sql, (owners[itemID-1], itemID))

  cur.close()
  db.commit()
  db.close()

Several things are noteworthy in this script. The first point is the format of the dates. Although SQLite does not have a Date data type, it does have a number of functions that can be used to create standardized date strings and values that it then stores as text (or in some cases as floating-point numbers). The date() function used here requires the date string to be in the format shown, and an invalid date string will be stored as NULL. Using date() therefore improves data quality by ensuring only valid and consistently formatted dates get stored.

NOTE SQLite’s use of date functions to handle date fields is very unusual, but it fits well with SQLite’s dynamic typing mechanism.

The other date functions include a version of the strftime() function that you met in the “Using the time Module” section of Chapter 2. The SQLite version adds some extra features.

One point to note about SQLite’s time structure is that it always uses UTC (aka GMT) because this avoids any of the complex issues surrounding time zones and daylight saving time. For display of the data in local time format, you need to use time functions within Python to convert the times returned by SQLite. A page in the SQLite documentation describes how the various date functions can be used in several common scenarios (http://www.sqlite.org/lang_datefunc.html).

The item OwnerID field is specified as NOT NULL, so you filled it with a dummy value (0) that is then overwritten by the UPDATE code later in the script.

You could have used a for loop to process the INSERT statements, but instead used the executemany() method that takes the statement and a sequence (or iterator or generator) and repeatedly applies the statement until the iteration is completed.

The variable values are inserted into the query strings by using a question mark as a placeholder. This is very similar to the string formatting method that uses {} as a place marker.

At this point you have created the database and populated it with some test data. The next stage is to make that data accessible to applications via an application programming interface (API).

Creating a LendyDB API

When defining an API for a database, it’s normal to start off by thinking about the basic entities and providing functions to create, read, update, and delete the items. (This is often known as a CRUD interface after the initials of the operations.) The temptation is to provide these functions as a thin wrapper around the corresponding SQL commands. However, to the application programmer it is not particularly useful to just be given an ID of a member—for example, it would be much better in most cases to get the member name instead. Otherwise, the programmer must perform multiple reads of the member database to create a useful display for the user. On the other hand, there may be times when an ID is the best option because the programmer may want to get more specific details about the member. The skill in building a good API comes from being able to resolve these contradictions in a way that makes the application programmer effective while retaining full access to the data.

Although the CRUD operations are a good foundation, it is often more useful to the application programmer if some higher-level operations are also available that span entities. To do this you need to think about how your data is likely to be used in an application. What kinds of questions will the designer want to ask? If you are also the application programmer, this is relatively easy, but if you are providing the database as part of a bigger project, it gets more difficult.

For the lending library example, you focus only on the CRUD interface to items and members. The principles should be obvious, and you have the opportunity to extend the API to cover loans in Exercise 3.3 at the end of the chapter. (The code for items and members is in the lendydata.py file in the SQL folder of the .zip file.)

  '''
  Lending library database API

  Provides a CRUD interface to item and member entities
  and init and close functions for database control.
  '''

  import sqlite3 as sql

  db=None
  cursor = None

  ##### CRUD functions for items ######

  def insert_item(Name, Description, OwnerID, Price, Condition):
  	query = '''
  	insert into item
  	(Name, Description, OwnerID, Price, Condition, DateRegistered)
  	values (?,?,?,?,?, date('now'))'''
  	cursor.execute(query,(Name,Description,OwnerID,Price,Condition))

  def get_items():
  	query = '''
  	select ID, Name, Description, OwnerID, Price, Condition, DateRegistered
  	from item'''
  	return cursor.execute(query).fetchall()

  def get_item_details(id):
  	query = '''
  	select name, description, OwnerID, Price, Condition, DateRegistered
  	from item
  	where id = ?'''
  	return cursor.execute(query,(id,)).fetchall()[0]

  def get_item_name(id):
  	return get_item_details(id)[0]

  def update_item(id, Name=None, Description=None,
  		 OwnerID=None, Price=None, Condition=None):
  query = '''
  	update item
  	set Name=?, Description=?, OwnerID=?, Price=?, Condition=?
  	where id=?'''
  	data = get_item_details(id)
  	if not Name: Name = data[0]
  	if not Description: Description = data[1]
  	if not OwnerID: OwnerID = data[2]
  	if not Price: Price = data[3]
  	if not Condition: Condition = data[4]

  	cursor.execute(query, (Name,Description,OwnerID,Price,Condition,id))

  def delete_item(id):
  	query = '''
  	delete from item
  	where id = ?'''
  	cursor.execute(query,(id,))
  ##### CRUD functions for members ######

  def insert_member(name, email):
  	query = '''
  	insert into member (name, email)
  	values (?, ?)'''
  	cursor.execute(query, (name,email))

  def get_members():
  	query = '''
  	select id, name, email
  	from member'''
  	return cursor.execute(query).fetchall()

  def get_member_details(id):
  	query = '''
  	select name, email
  	from member
  	where id = ?'''
  	return cursor.execute(query, (id,)).fetchall()[0]

  def get_member_name(id):
  	return get_member_details(id)[0]

  def update_member(id, Name=None, Email=None):
  	query = '''
  	update member
  	set name=?, email=?
  	where id = ?'''
  	data = get_member_details(id)
  	if not Name: Name = data[0]
  	if not Email: Email = data[1]
  	cursor.execute(query, (Name, Email, id))

  def delete_member(id):
  	query = '''
  	delete from member
  	where id = ?''' 
  	cursor.execute(query,(id,))
  ##### Database init and close #######

  def initDB(filename = None):
  	global db, cursor
  	if not filename:
  		 filename = 'lendy.db'
  	try:
  		 db = sql.connect(filename)
  		 cursor = db.cursor()
  	except:
  		 print("Error connecting to", filename)
  		 cursor = None
  		 raise

  def closeDB():
  	try:
  		 cursor.close()
  		 db.commit()
  		 db.close()
  	except:
  		 print("problem closing database...")
  		 raise
  if __name__ == "__main__":
  	initDB() # use default file
  	print("Members:
", get_members())
  	print("Items:
",get_items())

In this module you create two global variables for the database connection and the cursor. The initDB() function initializes these variables, and the same cursor is used in each of the query functions within the module. The closeDB() function then closes these objects when you are finished using the module. This approach means that the initDB()/closeDB() functions must be called to initialize and finalize the database, which adds a little bit of complexity for the user, but it means that the individual queries are much simpler because you do not need to manage the database and cursor objects each time a query is called. Notice, too, that you set the Foreign_Keys pragma in the initDB() function to ensure that all the transactions are checked for referential integrity.

The query functions all follow the same pattern. A SQL query string is created, using triple quotes to allow multiple line layouts, and then that string is passed to the cursor.execute() method. The retrieved values are passed back where appropriate. The cursor.execute() parameter substitution mechanism is used throughout.

The two update methods have an extra twist in that they have defaulted input parameters. This means that the user can provide only those fields that are changing. The other fields are populated based on the existing values that are retrieved by using the appropriate get-details function.

The two get-name functions are provided for the convenience of the user to easily map from the database identifiers returned in the get-details queries to meaningful names. Typically, the application programmer should use these functions before displaying the results to the end user.

The final section of the module, inside the if test, is a very basic test function just to check that the connection and cursor objects are working as expected. It does not comprehensively test all of the API functions; you will do that in the Try It Out that follows.

TRY IT OUT Using the LendyDB API

In this Try It Out, you use the lendydata.py module to insert, read, update, and delete entities from the lendy.db database created earlier. To do this, complete the following steps:

Switch to the folder containing your lendy.db database file.

Start the Python interpreter and type the following code:

  >>> import lendydata as ld
  >>> ld.initDB()
  >>> ld.get_members()
  [(1, 'Fred', '[email protected]'), (2, 'Mike', '[email protected]'),
 	(3, 'Joe', '[email protected]'), (4, 'Rob', '[email protected]'),
 	(5, 'Anne', '[email protected]')]
  >>> ld.get_items()
  [(1, 'Lawnmower', 'Tool', 1, 150, 'Excellent', '2012-01-05'),
 	(2, 'Lawnmower', 'Tool', 2, 370, 'Fair', '2012-04-01'),
 	(3, 'Bike', 'Vehicle', 3, 200, 'Good', '2013-03-22'),
 	(4, 'Drill', 'Tool', 4, 100, 'Good', '2013-10-28'),
 	(5, 'Scarifier', 'Tool', 5, 200, 'Average', '2013-09-14'),
 	(6, 'Sprinkler', 'Tool', 1, 80, 'Good', '2014-01-06')]

Having proved you can access the data, you now add an item by typing:

  >>> ld.insert_item('Python Projects','Book',6,30,'Excellent')
  Traceback (most recent call last):
 	File "<stdin>", line 1, in <module>
 	File "lendydata.py", line 20, in insert_item
 	cursor.execute(query,(Name,Description,OwnerID,Price,Condition))
  sqlite3.IntegrityError: FOREIGN KEY constraint failed

The preceding error proves that the referential integrity check works; you first need to add a member and then add the new item. Type this:

  >>> ld.insert_member('Alan','[email protected]')
  >>> ld.get_members()
  [(1, 'Fred', '[email protected]'), (2, 'Mike', '[email protected]'),
 	(3, 'Joe', '[email protected]'), (4, 'Rob', '[email protected]'),
 	(5, 'Anne', '[email protected]'), (6, 'Alan', '[email protected]')]
  >>> ld.insert_item('Python Projects','Book',6,30,'Excellent')
  >>> ld.get_items()
  [(1, 'Lawnmower', 'Tool', 1, 150, 'Excellent', '2012-01-05'),
 	(2, 'Lawnmower', 'Tool', 2, 370, 'Fair', '2012-04-01'),
 	(3, 'Bike', 'Vehicle', 3, 200, 'Good', '2013-03-22'),
 	(4, 'Drill', 'Tool', 4, 100, 'Good', '2013-10-28'),
 	(5, 'Scarifier', 'Tool', 5, 200, 'Average', '2013-09-14'),
 	(6, 'Sprinkler', 'Tool', 1, 80, 'Good', '2014-01-06'),
 	(7, 'Python Projects', 'Book', 6, 30, 'Excellent', '2014-06-23')]

You successfully added an item. Now modify the item and member by typing this:

  >>> ld.update_item(7,Price=25)
  >>> ld.get_item_details(7)
  ('Python Projects', 'Book', 6, 25, 'Excellent', '2014-06-23')
  >>> ld.get_member_name(6)
  'Alan'
  >>> ld.update_member(6,Name='Alan Gauld')
  >>> ld.get_member_details(6)
  ('Alan Gauld', '[email protected]')

To delete the data you added, type this:

  >>> ld.delete_member(6)
  Traceback (most recent call last):
 	File "<stdin>", line 1, in <module>
 	File "lendydata.py", line 97, in delete_member
 		cursor.execute(query,(id,))
  sqlite3.IntegrityError: FOREIGN KEY constraint failed
  >>> ld.delete_item(7)
  >>> ld.delete_member(6)

To see how the API can be used for arbitrary queries, type the following:

  >>> ld.cursor.execute('''
  ... select * from item
  ... where OwnerID in (select id from member where name like '%e%')
  ... ''').fetchall()
  [(1, 'Lawnmower', 'Tool', 1, 150, 'Excellent', '2012-01-05'),
 	(2, 'Lawnmower', 'Tool', 2, 370, 'Fair', '2012-04-01'),
 	(3, 'Bike', 'Vehicle', 3, 200, 'Good', '2013-03-22'),
 	(5, 'Scarifier', 'Tool', 5, 200, 'Average', '2013-09-14'),
 	(6, 'Sprinkler', 'Tool', 1, 80, 'Good', '2014-01-06')]
  >>> ld.cursor.execute('''
  ... select * from item
  ... where OwnerID not in (select id from member where name like '%e%')
  ... ''').fetchall()
  [(4, 'Drill', 'Tool', 4, 100, 'Good', '2013-10-28')]
  >>> ld.get_member_name(4)
  'Rob'

Finally, close the database like this:
```
  >>> ld.closeDB()
```

How It Works

You started by importing the lendydata.py module and assigning a shorter alias, ld. You then initialized the module by calling initDB(). You proved that the connection was valid by fetching the list of members and items.

You next proceeded to try to add an item, but the OwnerID value was for a nonexistent member and the database Foreign_Keys constraint check prevented the operation from completing. You then added the missing member followed by the item, which, this time, was successfully added.

Having added data, you went on to explore the update functions by modifying the price of the new item and the name of the new member. You also used the get_member_name() helper function to verify that the OwnerID of 6 was indeed the member that you had created.

You next tried to delete the new member but, again, referential integrity checks prevented this because the member was being referenced by an item. First you needed to delete the item and then the member.

Having exercised the high-level functions of the API, you proceeded to use the cursor object directly to execute arbitrary SQL statements. In this case you found all items owned by members with an “e” in their name followed by all items owned by members without an “e”. You used the get_member_name() helper function to verify that the only such member with an ID of 4 was in fact Rob.

Finally, you closed the database, which ensured that all changes were committed, and that the connection and cursor objects were properly finalized.

You have now built a high-level API for application programmers to use and have seen that by making the underlying cursor object accessible you enable the user to issue arbitrary low-level SQL commands, too. This kind of flexibility is usually appreciated by application programmers, although if they find that they must use SQL extensively, they should request an addition to the functional API.

Of course SQLite is not the only tool available for working with large data volumes, and Python can provide support for these alternatives, too. In the next section, you look at some alternatives to SQLite and how Python works in these environments.

Exploring Other Data Management Options

You have many options for managing large amounts of data. In this section you look at traditional client-server databases, newer databases known as “NoSQL” databases, and how cloud storage is changing approaches to data management. Finally, you consider some powerful data analysis tools that can be accessed from Python using third-party modules.

Client-Server Databases

The traditional SQL database is rather different from SQLite in the way it is constructed. In these databases you have a database server process accessed over a network by multiple clients. The Python DBAPI is designed so that you can work with these databases just as easily as you did with SQLite. Only minor changes in the way you initialize the database connection are usually all that is necessary. Occasionally you’ll see minor changes to the SQL syntax, and the parameter substitution symbol is sometimes different, too. But, in general, swapping from one SQL interface to another is a relatively painless experience.

Several reasons exist for wanting to adopt a client-server database or to migrate to one from SQLite. The first is capacity; most client-server databases can scale up to much larger volumes of data than SQLite. In addition, they can be configured to distribute their processing over several servers and disks, which greatly improves performance when multiple users are accessing the data concurrently. The second reason for migrating from SQLite is that larger databases usually come with a much richer set of SQL commands and query types as well as more data types. Many even support object-oriented techniques and have their own built-in programming languages that enable you to write stored procedures, effectively putting the API into the database itself. Usually, a much richer set of database constraints can be used to tighten up data integrity far beyond the simple foreign key checks that SQLite performs.

The biggest downside of selecting a client-server database is the extra complexity of administering the database. Usually a dedicated administrator is needed to set up users, adjust their access privileges, tune the SQL query engine performance, and do backups, data extracts, and loads.

Popular client-server databases include commercial offerings such as SQL Server, Oracle, and DB2, as well as open source projects such as MySQL, PostGres, and Firebird.

NoSQL

As data storage needs have expanded both in size and in variety, several projects have been exploring alternatives to SQL. Many of these projects are associated with what is called “Big Data,” which usually relates to the harvesting of large volumes of, often unstructured, data from places such as social media sites or from sensors in factories or hospitals and so on. One of the features of this kind of data is that most of it is of little significance to the database user, but amongst the detritus are gems to be had that can influence sales strategy or alert you to imminent failure of components or processes so that action can be taken in advance. The nature, volumes, and need for rapid access of such data means that traditional SQL databases are not well suited to the task.

The solution has been a variety of technologies that go under the collective term NoSQL. NoSQL does not mean that no SQL is used, but rather it stands for Not Only SQL. SQL may still be available for traditional queries on these new databases, but alternative query techniques are also used, especially for searching unstructured data. Some examples of NoSQL approaches (with typical implementations) are Document (MongoDB), Key-Value (Dynamo), Columnar (HBase), and Graphs (Allegro). All of these diverge to some degree or other from the traditional relational model of multiple, two-dimensional tables with cross-references. To get the best from these systems, you need to consider which architecture best suits the nature of your data. Many solutions are open source, but commercial offerings exist, too.

Although NoSQL databases do provide the potential for faster and more flexible access to a wider variety of data, they generally sacrifice qualities like data integrity, transactional control, and usability. The products are all evolving rapidly, although at the time of writing they require lower-level programming skills to utilize the data compared to traditional SQL databases that could change quite quickly. Most of the popular databases offer Python modules that facilitate programming them from Python.

The Cloud

Cloud computing has become popular in recent years with its promise of computing on-demand. This potentially offers lower costs, more flexibility, and lower risks than traditional data center–based solutions. It brings its own set of concerns, of course, especially for sensitive data or where network reliability is an issue. The biggest use of cloud technologies in the database world has been in combination with the NoSQL solutions, discussed in the previous section, and other Big Data solutions such as Hadoop. Many cloud storage providers offer these technologies on a software-as-a-service (SAAS) basis. This offers an attractive option for those just dipping a toe into the Big Data or NoSQL world.

The advantage of cloud computing for the application programmer is that the data is abstracted into a virtual database located at a single consistent network address. The physical location of the data may change over time, or the amount of storage available may grow or shrink, but to the programmer it doesn’t matter. (This does not mean that normal error handling can be ignored, but it does mean that the code can be largely decoupled from the physical design of the data storage.)

One of the biggest providers of cloud storage and database technology is the online retailer Amazon. It provides storage and an API (Amazon Web Services, or AWS) as well as a proprietary NoSQL database called SimpleDB, in addition to several other open source offerings. Other cloud suppliers are starting to offer similar products. Many providers offer small amounts of free storage and access as an incentive to try the product before committing to a significant investment.

Amazon AWS has Python support available, including a comprehensive tutorial to both Python and the AWS interface, at http://boto.readthedocs.org/en/latest/. Similar interfaces are available, or will likely soon become available, from the other providers.

Data Analysis with RPy

Though client-server SQL, NoSQL, and cloud computing all provide solutions for handling large data volumes or many users, you have other data management issues to consider. Often, the processing of large volumes of data is more important than the storage or retrieval. If that processing involves a high degree of statistical exploration or manipulation, Python offers a basic statistics module (introduced to the standard library in version 3.4). If that is not enough, there is the R programming language. R is a specialized language designed for statistical number crunching on a large scale. Like Python it has accumulated a large library of add-in modules, and many statistical researchers use R as their platform of choice, publishing their research using R.

The good news for Python programmers is that there is an interface from Python to R called rpy2 that opens up this processing power without having to become an expert in R. Knowing the basics of R, especially its data handling concepts, is pretty much essential, but much of your Python knowledge can be applied, too. You can find rpy2 on the Python Package Index and install it via pip.

Data management has many facets, and this chapter has reviewed how Python supports you, from small volumes to large, from simple data persistence through to complex data analysis. Once you have control of your data, you are in a much stronger position to create powerful, user-friendly applications whether on the desktop or on the web. That’s the subject of the next two chapters.

Summary

In this chapter you learned how to store and retrieve data so that your application can pick up from where it left off or so that you can work on multiple different projects. You saw how you could do this with flat files using indexed files (DBM), pickles, and shelves. You then looked at SQL and relational databases using SQLite and concluded with a review of some alternative technologies.

Flat files are good for storing small amounts of data or for saving the program’s state information when it shuts down. They are less useful where multiple queries have to be made, especially if querying on more than one key value. DBM files act like persistent dictionaries that can only store strings. Pickles turn binary objects into strings. Shelves combine DBM and pickle to act like persistent dictionaries, albeit limited to using strings for the keys.

SQL is used to manage data in a relational database management system (RDBMS). Relational databases are conceptually made up of one or more two-dimensional tables, each representing one logical entity. Cross-references between tables act as a model for the relationships between entities. SQL provides commands to build and manage the tables in a database as well as operations to create, delete, query, and update the data within those tables.

Python provides the DBAPI, which is a standard protocol for accessing SQL-based databases. The DBAPI primarily consists of two objects: a connection and a cursor. The cursor is the primary object used for executing SQL commands and retrieving their results. Results are returned as a list of tuples.

The SQLite database interface is provided in Python’s standard library, and you can download a standalone interpreter. SQLite, as the name suggests, is a lightweight version of SQL that stores the entire database in a single file and supports a lightweight subset of the SQL language. SQLite can be used on small- to medium-sized projects, and it is then relatively easy to migrate to a larger database if the project expands in scale. DBAPI libraries are available for most major databases and can be downloaded from third parties.

By using the DBAPI it is possible to build a data abstraction layer that hides both the details of the database design and the SQL language from the application programmer, further facilitating the migration process if it becomes necessary.

Several other database technologies exist, and this is an active area of development with several NoSQL projects vying for popular support. These databases tend to have quite specific areas of applicability, and no single solution suits all scenarios. Many are focused on the challenges of managing “Big Data” and are well suited to cloud-based solutions.

EXERCISES

To appreciate the work that pickle does for you, try building a simple serialization function for numbers, called ser_num(). It should accept any valid integer or float number as an argument and convert it into a byte string. You should also write a function to perform the reverse operation to read a byte string produced by your ser_num() function and convert it back to a number of the appropriate type. (Hint: You may find the struct module useful for this exercise.)
Write a version of the employee database example using shelve instead of SQLite. Populate the shelf with the sample data and write a function that lists the name of all employees earning more than a specified amount.
Extend the lendydata.py module to provide CRUD functions for the loan table. Add an extra function, get_active_loans(), to list those loans that are still active. (Hint: That means the DateReturned field is NULL.)
Explore the Python statistics module to see what it offers (only available in Python 3.4 or later).

WHAT YOU LEARNED IN THIS CHAPTER

KEY CONCEPT	DESCRIPTION
Persistence	The ability to store data between program executions such that when a program is restarted it can restore its data to the state it was in when the program was last stopped.
Flat file	A standard file containing data. The data may be in text form or binary form.
DBM	A form of file storage that uses a combination of a flat file to store the data and index files to store the location of individual data records for rapid retrieval. Both keys and data must be in string format. In Python a dictionary-like interface is provided by the dbm module.
Serialization	The process of converting binary data into a string of bytes for storage or transmission over a network.
Pickle	A Python-specific format for serializing binary data. Most Python objects can be serialized using Pickle. Because it poses some security risks, care must be exercised when reading data from untrusted sources.
Shelve	A combination of DBM and Pickle technologies to provide a persistent dictionary. Arbitrary binary data can be stored against a string-based key.
Relational Database	A database comprising one or more tables. Table rows represent records, and columns represent the fields of the record. Field values can refer to other records within the database, thus representing relationships between entities.
Relationship	Each record in a relational database has a unique “primary key,” and other records can store references to the primary key of another record, thus establishing a relationship between the two records and their two tables. Relationships can span tables.
Constraint	Various rules can be defined to ensure data integrity is maintained within the database. These rules are known as constraints and regulate such things as data types, whether a value is required or optional, and whether a cross-reference must contain a valid key from another entity.
Cardinality	Relationships in a database can represent various types of mapping. The numbers of each entity in the mapping are called its cardinality. For example, if one entity refers to exactly one other entity, that is a 1-1 mapping. If several entities refer to one other entity, that is a 1-N mapping. If many entities refer to many other entities, that is a many-to-many mapping.
Structured Query Language (SQL)	A standardized mechanism for working with relational databases. The language contains commands for defining the database structure, known as the data definition language (DDL), and commands for manipulating and retrieving the data within the database, known as data manipulation language (DML).
DBAPI	A standard programming protocol for accessing relational databases from Python. The DBAPI implementations vary slightly in details, but porting code between DBAPI libraries is much easier than porting at the native database library level.
NoSQL	Not Only SQL is a term used to describe several database technologies that do not conform to the traditional relational model. Many of these technologies are focused on managing very large volumes of data with a variety of data types, much of it unstructured, such as documents and social media data. SQL is not well suited to handling such unstructured data, and hence NoSQL technology has become increasingly important.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 3: Managing Data

Create new playlist

Sign In

Sign Up

Storing Data Using Python

Using DBM as a Persistent Dictionary

Using Pickle to Store and Retrieve Objects

Accessing Objects with shelve

Analyzing Data with Python

Analyzing Data Using Built-In Features of Python

Analyzing Data with itertools

Utility Functions

Data Processing Functions

Taming the Vagaries of groupby()

Using itertools to Analyze LendyDB Data

Managing Data Using SQL

Relational Database Concepts

Structured Query Language

Creating Tables

Inserting Data

Reading Data

Modifying Data

Linking Data Across Tables

Digging Deeper into Data Constraints

Revisiting SQLite Field Types

Modeling Relationships with Constraints

Many-to-Many Relationships

Migrating LendyDB to an SQL Database

Accessing SQL from Python

Using SQL Connections

Using a Cursor

Creating the LendyDB SQL Database

Inserting Test Data

Creating a LendyDB API

Exploring Other Data Management Options

Client-Server Databases

NoSQL

The Cloud

Data Analysis with RPy

Summary

EXERCISES

WHAT YOU LEARNED IN THIS CHAPTER

Table of Contents for
Chapter 3: Managing Data