Chapter 25

Persistent Tables


25.1 Constraints

  • The data exists beyond the execution of programs that use it, and is meant to be used by many different programs.
  • The data is stored in a way that makes it easier/faster to explore. For example:
    • The input data of the problem is modeled as one or more series of domains, or types, of data.
    • The concrete data is modeled as having components of several domains, establishing relationships between the application's data and the domains identified.
  • The problem is solved by issuing queries over the data.

25.2 A Program in this Style

  1 #!/usr/bin/env python
  2 import sys, re, string, sqlite3, os.path
  3
  4 #
  5 # The relational database of this problem consists of 3 tables:
  6 # documents, words, characters
  7 #
  8 def create_db_schema(connection):
  9     c = connection.cursor()
 10     c.execute('''CREATE TABLE documents (id INTEGER PRIMARY KEY AUTOINCREMENT, name)''')
 11     c.execute('''CREATE TABLE words (id, doc_id, value)''')
 12     c.execute('''CREATE TABLE characters (id, word_id, value)''')
 13     connection.commit()
 14     c.close()
 15
 16 def load_file_into_database(path_to_file, connection):
 17     """ Takes the path to a file and loads the contents into the database """
 18     def _extract_words(path_to_file):
 19         with open(path_to_file) as f:
 20             str_data = f.read()
 21         pattern = re.compile('[\W_]+')
 22         word_list = pattern.sub(' ', str_data).lower().split()
 23         with open('../stop_words.txt') as f:
 24             stop_words = f.read().split(',')
 25         stop_words.extend(list(string.ascii_lowercase))
 26         return [w for w in word_list if not w in stop_words]
 27
 28     words = _extract_words(path_to_file)
 29
 30     # Now let's add data to the database
 31     # Add the document itself to the database
 32     c = connection.cursor()
 33     c.execute("INSERT INTO documents (name) VALUES (?)", (path_to_file,))
 34     c.execute("SELECT id from documents WHERE name=?", (path_to_file,))
 35     doc_id = c.fetchone()[0]
 36
 37     # Add the words to the database
 38     c.execute("SELECT MAX(id) FROM words")
 39     row = c.fetchone()
 40     word_id = row[0]
 41     if word_id == None:
 42         word_id = 0
 43     for w in words:
 44         c.execute("INSERT INTO words VALUES (?, ?, ?)", (word_id, doc_id, w))
 45         # Add the characters to the database
 46         char_id = 0
 47         for char in w:
 48             c.execute("INSERT INTO characters VALUES (?, ?, ?)", (char_id, word_id, char))
 49             char_id += 1
 50         word_id += 1
 51     connection.commit()
 52     c.close()
 53
 54 #
 55 # Create if it doesn't exist
 56 #
 57 if not os.path.isfile('tf.db'):
 58     with sqlite3.connect('tf.db') as connection:
 59         create_db_schema(connection)
 60         load_file_into_database(sys.argv[1], connection)
 61
 62 # Now, let's query
 63 with sqlite3.connect('tf.db') as connection:
 64     c = connection.cursor()
 65     c.execute("SELECT value, COUNT(*) as C FROM words GROUP BY value ORDER BY C DESC")
 66     for i in range(25):
 67         row = c.fetchone()
 68         if row != None:
 69             print row[0] + ' - ' + str(row[1])

25.3 Commentary

IN THIS STYLE, we want to model and store the data so that it is amenable to future retrieval in all sorts of ways. For the term-frequency task, if the task is performed several times, we might not want to read and parse the file in its raw form every single time. Also, we might want to mine for more facts about the books, not just term frequencies. So, instead of consuming the data in its raw form, this style encourages using alternative representations of the input data that make it easier to mine, now and in the future. In one approach to this goal, the types of data (domains) that need to be stored are identified, and pieces of concrete data are associated with those domains, forming tables. With an unambiguous entity-relationship model in place, we can then fill the tables with data, so that portions of it can later be retrieved with declarative queries.
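
To make that payoff concrete, here is a minimal sketch, separate from the example program, of a different fact being mined from the same tables; it assumes the tf.db file produced by the program in the previous section:

  import sqlite3

  # A sketch: once the data is in tf.db, other facts can be mined with
  # declarative queries alone, without re-reading or re-parsing the book.
  # Assumes the documents and words tables created by the example program.
  with sqlite3.connect('tf.db') as connection:
      c = connection.cursor()
      # Average word length per document, computed entirely in SQL
      c.execute("""SELECT d.name, AVG(LENGTH(w.value))
                   FROM words w JOIN documents d ON w.doc_id = d.id
                   GROUP BY d.name""")
      for name, avg_len in c.fetchall():
          print(name, round(avg_len, 2))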

Let's take a look at the example program, starting at the bottom. Lines #57–60 check whether the database file already exists; if it doesn't, the program creates it and fills it with the input file's data.

The rest of the program (from line #63 onward), which could very well be another program, queries the database. The language used to query it is the well-known Structured Query Language (SQL) used pervasively in the programming world. Line #64 gets a cursor – an object that enables traversal over records in the database (similar to an iterator in programming languages). On that cursor we then execute a SQL statement that counts the number of occurrences of each word in the words table, ordered by decreasing frequency. Finally we iterate through the first 25 rows of the retrieved data, printing out the first column (the word) and the second one (the count). Let's look into each of the functions of the program.
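
As a side note, the 25-row cut-off could also be pushed into the query itself with LIMIT. The following is a sketch of an equivalent stand-alone reader, not the book's program, assuming tf.db already exists:

  import sqlite3

  # A sketch of an equivalent reader: the 25-row cut-off is expressed in SQL
  # via LIMIT instead of a counted fetchone() loop.
  with sqlite3.connect('tf.db') as connection:
      c = connection.cursor()
      c.execute("""SELECT value, COUNT(*) AS c FROM words
                   GROUP BY value ORDER BY c DESC LIMIT 25""")
      for word, count in c.fetchall():
          print(word, '-', count)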

create_db_schema (lines #8–14) takes the connection to the database and creates our relational schema. We divided the data into documents, words, and characters, each in its own table. A document is a tuple consisting of an id (an integer) and a name; a word is a tuple consisting of an id, a cross-reference to the document id where it occurs, and a value; finally, a character is a tuple consisting of an id, a cross-reference to the word id where it occurs, and a value.1
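
SQLite happily accepts the untyped columns used here. If we wanted the entity-relationship model to be explicit in the schema itself, a stricter variant might declare column types and foreign keys. This is only a sketch, using a hypothetical tf_strict.db file name, not the program's schema:

  import sqlite3

  # A sketch of a stricter version of the same schema: column types and
  # explicit foreign keys make the cross-references visible to the database.
  # tf_strict.db is a hypothetical file name used only for this illustration.
  SCHEMA = """
  CREATE TABLE documents  (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT);
  CREATE TABLE words      (id INTEGER, doc_id INTEGER REFERENCES documents(id), value TEXT);
  CREATE TABLE characters (id INTEGER, word_id INTEGER REFERENCES words(id), value TEXT);
  """

  with sqlite3.connect('tf_strict.db') as connection:
      connection.executescript(SCHEMA)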

load_file_into_database (lines #16–52) takes a path to a file and the connection to the database and populates the tables. It first adds the document row (line #33) using the file name as the value. Lines #34–35 retrieve the automatically-generated document id from the row we just inserted, so that we can use it for the words. Line #38 queries the words table for its latest word id, so that it can continue from there. Then the function proceeds to fill in the words and the characters tables (lines #43–50). Finally, the data is committed to the database (line #51) and the cursor is discarded (line #52).
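
As a side note, the document id retrieved on lines #34–35 could also be taken from the cursor's lastrowid attribute, and the per-word inserts could be batched with executemany. The following is only a sketch of that variant, with a hypothetical helper name load_words, not the book's code:

  import sqlite3

  def load_words(path_to_file, words, connection):
      # A sketch of an alternative to the document and word insertion on
      # lines #32-#44: lastrowid replaces the SELECT on lines #34-#35, and
      # executemany batches the inserts. 'words' is assumed to be the list
      # returned by _extract_words; the characters table is left out for brevity.
      c = connection.cursor()
      c.execute("INSERT INTO documents (name) VALUES (?)", (path_to_file,))
      doc_id = c.lastrowid                  # id of the row just inserted
      c.execute("SELECT MAX(id) FROM words")
      start = c.fetchone()[0] or 0          # continue from the latest word id
      c.executemany("INSERT INTO words VALUES (?, ?, ?)",
                    [(start + i, doc_id, w) for i, w in enumerate(words)])
      connection.commit()
      c.close()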

25.4 This Style in Systems Design

Databases are pervasive in the computing world, and relational databases, in particular, are the most popular ones. Their purpose is the same as it was back in 1955: to store data for later retrieval.

Whenever applications have this need (and they often do), there are some alternatives to the Persistent Tables style, but often those alternatives fall short. For starters, applications can store the data in some ad-hoc manner. For simple bulk storage and retrieval, this approach may be perfectly appropriate. For example, it is quite common to see applications storing data in comma-separated value (CSV) files. However, when data is to be retrieved from storage selectively, rather than in bulk, better data structures need to be used. Invariably, one is led to some form of database technology, because they tend to be mature and solid pieces of software – and fast, too.
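
For instance, once the data lives in a SQLite file, selective retrieval can be backed by an index instead of a full scan of a flat file. A small sketch, reusing the words table of the example program, with an index name and an example word chosen just for this illustration:

  import sqlite3

  # A small sketch: selective retrieval backed by an index, rather than
  # re-scanning the raw text or a CSV file. idx_words_value and 'pride'
  # are illustrative choices, not part of the example program.
  with sqlite3.connect('tf.db') as connection:
      c = connection.cursor()
      c.execute("CREATE INDEX IF NOT EXISTS idx_words_value ON words(value)")
      c.execute("SELECT COUNT(*) FROM words WHERE value = ?", ('pride',))
      print(c.fetchone()[0])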

The type of database technology used depends on the application. Relational databases are good at supporting complex queries that involve many pieces of data. They take a conservative (Tantrum-ish) approach to adversity, by aborting any partial changes that fail to commit as a whole. Because of the ACID properties (Atomicity, Consistency, Isolation, Durability), relational databases are guaranteed to be consistent. While some applications need these features, many others don't, and more lightweight technologies, such as NoSQL, can be used instead.
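
As an illustration of that all-or-nothing behavior, here is a minimal sketch, separate from the example program, of a transaction that is rolled back as a whole when one of its statements fails:

  import sqlite3

  # A minimal sketch of atomicity: either every insert in the transaction
  # commits, or none of them does. Uses a throwaway in-memory database.
  connection = sqlite3.connect(':memory:')
  connection.execute("CREATE TABLE words (id INTEGER PRIMARY KEY, value TEXT)")
  try:
      with connection:   # one transaction: commits on success, rolls back on error
          connection.execute("INSERT INTO words VALUES (1, 'white')")
          connection.execute("INSERT INTO words VALUES (1, 'whale')")  # duplicate id
  except sqlite3.IntegrityError:
      pass               # the whole transaction was rolled back
  print(connection.execute("SELECT COUNT(*) FROM words").fetchone()[0])  # prints 0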

In applications where there is no need to store the data for later analysis, this style of programming is, obviously, overkill.

25.5 Historical Notes

By the early 1960s, a few companies and government laboratories were already storing and processing relatively large amounts of data, and using computers primarily as data processors. The term database emerged in the mid-1960s, coinciding with the availability of direct-access storage, aka disks – an improvement over tapes. Early on, engineers realized that storing the data in some structured manner would support faster retrieval on the new storage technology. During the 1960s, the main model used was navigational. A navigational database is one where the records, or objects, are found by following references from one to another. Within that model, two main approaches were being used: hierarchical databases and network databases. Hierarchical databases decompose the data into a tree, so parents can have many children but children have exactly one parent (one-to-many); network databases extend that model to a graph.

The relational database model was formulated in the late 1960s by a computer scientist working for IBM, Edgar Codd. The ideas that came along with this model were so much better than the technology of the time that relational databases quickly became the de facto standard for storing data.

In the 1980s, the emergence of object-oriented programming brought along the "object-relational impedance mismatch", the observation that the object model of OOP programs and the relational data model of long-term storage were somehow in conflict. OOP data is more like a graph, so it brought back some of the concepts of network data models of the 1960s. This mismatch gave rise to object and object-relational databases, which had some success, but not as much as one would expect. Relational databases continue to be the databases of choice these days, even when OOP languages are used.

More recently, there has been a push towards NoSQL databases, a class of data storage systems that use highly optimized key-value pairs as the basis for storage, still tabular in nature. NoSQL databases are intended for simple retrieval and appending operations, rather than retrieval of complicated data relations.

25.6 Further Reading

Codd, E.F. (1970). A relational model of data for large shared data banks. Communications of the ACM 13(6): 377–387.
Synopsis: The original paper that described the relational model and that started it all in the database field.

25.7 Glossary

Entity: An N-tuple in a relational database containing data from N domains.

Relationship: An association between data and domains (a table).

25.8 Exercises

25.1 Another language. Implement the example program in another language, but preserve the style.

25.2 Separate writer from readers. Starting with the example program, split it into two programs: one that adds data to the database given a file, and one that reads data from the database. Change your program so that it stores the data on the file system instead of storing it in memory.

25.3 More books. Download another book from the Gutenberg collection, e.g. http://www.gutenberg.org/files/44534/44534-0.txt. Populate your database with both Pride and Prejudice and this second book.

25.4 More queries. Query your database to find out the answers to the following questions, and show your queries together with your answers (ignore stop words):

  1. What are the 25 most frequently occurring words in each book?
  2. How many words does each book have?
  3. How many characters does each book have?
  4. What is the longest word in each book?
  5. What is the average number of characters per word?
  6. What is the combined length of characters in the top 25 words of each book?

25.5 A different task. Write one of the tasks proposed in the Prologue using this style.

1 The design of entity-relationship models is a much-studied topic in CS; it is not the goal here to teach how to do it.
