Chapter 8. Tables

An associative table is a set of key-value pairs. It’s like an array except that the indices can be values of any type. Many applications use tables. Compilers, for example, maintain symbol tables, which map names to sets of attributes for those names. Some window systems maintain tables that map window titles into some kind of window-related data structures. Document-preparation systems use tables to represent indices: For example, the index might be a table in which the keys are one-character strings — one for each section of the index — and the values are other tables in which the keys are the strings for the index entries themselves and the values are lists of page numbers.

Tables have many uses, and the examples alone could fill a chapter. The Table interface is designed so that it can be used for many of these uses. It maintains key-value pairs, but it never inspects keys themselves; only clients inspect keys via functions passed to routines in Table. Section 8.2 describes a typical Table client, a program that prints the number of occurrences of words in its input. This program, wf, also uses the Atom and Mem interfaces.

Interface

Table represents an associative table with an opaque pointer type:

<table.h>≡
   #ifndef TABLE_INCLUDED
   #define TABLE_INCLUDED
   #define T Table_T
   typedef struct T *T;

   <exported functions 116>

   #undef T
   #endif

The exported functions allocate and deallocate Table_Ts, add and remove key-value pairs from those tables, and visit the key-value pairs in them. It is a checked runtime error to pass a null Table_T or null key to any function in this interface.

Table_Ts are allocated and deallocated by

<exported functions 116>≡
   extern T    Table_new (int hint,
       int cmp(const void *x, const void *y),
       unsigned hash(const void *key));
   extern void Table_free(T *table);

Table_new’s first argument, hint, is an estimate of the number of entries that the new table is expected to hold. All tables can hold an arbitrary number of entries regardless of the value of hint, but accurate values of hint may improve performance. It is a checked runtime error for hint to be negative. The functions cmp and hash manipulate client-specific keys. Given two keys, x and y, cmp(x,y) must return an integer less than zero, equal to zero, or greater than zero, if, respectively, x is less than y, x equals y, or x is greater than y. The standard library function strcmp is an example of a comparison function suitable for keys that are strings. hash must return a hash number for key; if cmp(x,y) returns zero, then hash(x) must be equal to hash(y). Each table can have its own hash and cmp functions.

Atoms are often used as keys, so if hash is the null function pointer, the keys in the new table are assumed to be atoms and the implementation of Table provides a suitable hash function. Similarly, if cmp is the null function pointer, keys are assumed to be atoms, and two keys x and y are equal if x = y.

Table_new can raise Mem_Failed.

Table_new’s arguments — a size hint, a hash function, and a comparison function — provide more information than most implementations need. For example, the hash table implementation described in Section 8.3 needs a comparison function that tests only for equality, and implementations that use trees don’t need the hint or the hash function. This complexity is the price of a design that permits multiple implementations, and this feature is one of the reasons designing good interfaces is difficult.

Table_free deallocates *table and sets it to the null pointer. It is a checked runtime error for table or *table to be null. Table_free does not deallocate the keys or values; see Table_map.

The functions

<exported functions 116>+≡
   extern int   Table_length(T table);
   extern void *Table_put   (T table, const void *key,
       void *value);
   extern void *Table_get   (T table, const void *key);
   extern void *Table_remove(T table, const void *key);

return the number of keys in a table, add a new key-value pair or change the value of an existing pair, fetch the value associated with a key, and remove a key-value pair.

Table_length returns the number of key-value pairs in table.

Table_put adds the key-value pair given by key and value to table. If table already holds key, value overwrites the previous value, and Table_put returns the previous value. Otherwise, key and value are added to table, which grows by one entry, and Table_put returns the null pointer. Table_put can raise Mem_Failed.

Table_get searches table for key and, if it’s found, returns its associated value. If table doesn’t hold key, Table_get returns the null pointer. Notice that returning the null pointer is ambiguous if table holds null pointer values.

Table_remove searches table for key and, if it’s found, removes the key-value pair from table, which thus shrinks by one entry, and returns the removed value. If table doesn’t hold key, Table_remove has no effect on table and returns the null pointer.

The functions

<exported functions 116>+≡
   extern void   Table_map    (T table,
       void apply(const void *key, void **value, void *cl),
       void *cl);
   extern void **Table_toArray(T table, void *end);

visit the key-value pairs and collect them into an array. Table_map calls the function pointed to by apply for every key-value pair in table in an unspecified order. apply and cl specify a closure: Clients can pass an application-specific pointer, cl, to Table_map and this pointer is passed along to apply at each call. For each pair in table, apply is called with its key, a pointer to its value, and cl. Since apply is called with pointers to the values, it can change them. Table_map can also be used to deallocate keys or values before deallocating the table. For example, assuming the keys are atoms,

   static void vfree(const void *key, void **value,
       void *cl) {
       FREE(*value);
   }

deallocates just the values, so

   Table_map(table, vfree, NULL);
   Table_free(&table);

deallocates the values in table and then table itself.

It is a checked runtime error for apply to change the contents of table by calling Table_put or Table_remove.

Given a table with N key-value pairs, Table_toArray builds an array with 2N+1 elements and returns a pointer to the first element. The keys and values alternate, with keys appearing in the even-numbered elements and their associated values in the following odd-numbered elements. The last even-numbered element, at index 2N, is assigned end, which is often the null pointer. The order of the key-value pairs in the array is unspecified. The program described in Section 8.2 illustrates the use of Table_toArray.

Table_toArray can raise Mem_Failed, and clients must deallocate the array it returns.

Example: Word Frequencies

wf lists the number of times each word appears in a list of named files or in the standard input if no files are specified. For example:

% wf table.c mem.c
table.c:
3   apply
7   array
13  assert
9   binding
18  book
2   break
10  buckets
...
4   y
mem.c:
1   allocation
7   assert
12  book
1   stdlib
9   void
...

As this output shows, the words in each file are listed in alphabetical order and are preceded by the number of times they appear in the file. For wf, a word is a letter followed by zero more letters or underscores, and case doesn’t matter.

More generally, a word begins with a character in a first set followed by zero or more characters in a rest set. Words of this form are recognized by getword, which is a generalization of double’s getword described in Section 1.1. It’s used enough in this book to be packaged separately in its own interface:

<getword.h>≡
   #include <stdio.h>

   extern int getword(FILE *fp, char *buf, int size,
       int first(int c), int rest(int c));

getword consumes the next word in the file opened on fp, stores it as a null-terminated string in buf[0..size-1], and returns one. When it reaches the end of file without consuming a word, it returns zero. The functions first and rest test a character for membership in first and rest. A word is a contiguous sequence of characters; it starts with a character for which first returns a nonzero value followed by characters for which rest returns nonzero values. If a word is longer than size-2 characters, the excess characters are discarded. size must exceed one, and fp, buf, first, and rest must be nonnull.

<getword.c>≡
   #include <ctype.h>
   #include <string.h>
   #include <stdio.h>
   #include "assert.h"
   #include "getword.h"

   int getword(FILE *fp, char *buf, int size,
       int first(int c), int rest(int c)) {
       int i = 0, c;

       assert(fp && buf && size > 1 && first && rest);
       c = getc(fp);
       for ( ; c != EOF; c = getc(fp))
           if (first(c)) {
               <store c in buf if it fits 120>
               c = getc(fp);
               break;
           }
       for ( ; c != EOF && rest(c); c = getc(fp))
           <store c in buf if it fits 120>
       if (i < size)
           buf[i] = '';
       else
           buf[size-1] = '';
       if (c != EOF)
           ungetc(c, fp);
       return i > 0;
   }

<store c in buf if it fits 120>≡
   {
        if (i < size - 1)
            buf[i++] = c;
   }

This version of getword is a bit more complex than the version in double because this one must work when a character is in first but not in rest. When first returns nonzero, that character is stored in buf and only subsequent characters are passed to rest.

wf’s main function processes its arguments, which name files. main opens each file and calls wf with the file pointer and file name:

<wf functions 121>≡
   int main(int argc, char *argv[]) {
       int i;

       for (i = 1; i < argc; i++) {
           FILE *fp = fopen(argv[i], "r");
           if (fp == NULL) {
               fprintf(stderr, "%s: can't open '%s' (%s)
",
                   argv[0], argv[i], strerror(errno));
               return EXIT_FAILURE;
           } else {
               wf(argv[i], fp);
               fclose(fp);
           }
       }
       if (argc == 1) wf(NULL, stdin);
       return EXIT_SUCCESS;
   }

<wf includes 121>≡
   #include <stdio.h>
   #include <stdlib.h>
   #include <errno.h>

If there are no arguments, main calls wf with a null file name and the file pointer for the standard input. The null file name tells wf not to print the name of the file.

wf uses a table to store the words and their counts. Each word is folded to lowercase, converted to an atom, and used as a key. Using atoms lets wf use the defaults for the table’s hash and comparison functions. Values are pointers, but wf needs to associate an integer count with each key. It thus allocates space for a counter and stores a pointer to this space in the table.

<wf functions 121>+≡
   void wf(char *name, FILE *fp) {
       Table_T table = Table_new(0, NULL, NULL);
       char buf[128];

       while (getword(fp, buf, sizeof buf, first, rest)) {
           const char *word;
           int i, *count;
           for (i = 0; buf[i] != ''; i++)
               buf[i] = tolower(buf[i]);
           word = Atom_string(buf);
           count = Table_get(table, word);
           if (count)
               (*count)++;
           else {
               NEW(count);
               *count = 1;
               Table_put(table, word, count);
           }
       }
       if (name)
           printf("%s:
", name);
       { <print the words 123> }
       <deallocate the entries and table 124>
   }

<wf includes 121>+≡
   #include <ctype.h>
   #include "atom.h"
   #include "table.h"
   #include "mem.h"
   #include "getword.h"

<wf prototypes 122>≡
   void wf(char *, FILE *);

count is a pointer to an integer. If Table_get returns null, the word isn’t in table, so wf allocates space for the counter, initializes it to one to account for this first occurrence of the word, and adds it to the table. When Table_get returns a nonnull pointer, the expression (*count)++ increments the integer pointed to by that pointer. This expression is much different than *count++, which would increment count instead of the integer it points to.

Membership in first and rest is tested by functions of the same names that use the predicates defined in the standard header ctype.h:

<wf functions 121>+≡
   int first(int c) {
       return isalpha(c);
   }

   int rest(int c) {
       return isalpha(c) || c == '_';
   }

<wf prototypes 122>+≡
   int first(int c);
   int rest (int c);

Once wf has read all of the words, it must sort and print them. qsort, the standard C library sorting function, sorts an array, so wf can sort the array returned by Table_toArray if it tells qsort that key-value pairs in the array should be treated as single elements. It can then print the words and their counts by walking down the array:

<print the words 123>≡
   int i;
   void **array = Table_toArray(table, NULL);
   qsort(array, Table_length(table), 2*sizeof (*array),
       compare);
   for (i = 0; array[i]; i += 2)
       printf("%d	%s
", *(int *)array[i+1],
           (char *)array[i]);
   FREE(array);

qsort takes four arguments: the array, the number of elements, the size of each element in bytes, and a function that’s called to compare two elements. To treat each of the N key-value pairs as a single element, wf tells qsort that there are N elements and that each takes the space occupied by two pointers.

qsort calls the comparison function with pointers to the elements. Each element is itself two pointers — one to the word and one to the count — so the comparision function is called with two pointers to pointers to characters. For instance, when assert from mem.c is compared with book, the arguments x and y are

Example: Word Frequencies

The comparison function can compare the words by calling strcmp:

<wf functions 121>+≡
   int compare(const void *x, const void *y) {
       return strcmp(*(char **)x, *(char **)y);
   }

<wf includes 121>+≡
   #include <string.h>

<wf prototypes 122>+≡
   int compare(const void *x, const void *y);

The wf function is called for each file-name argument, so, to save space, it should deallocate the table and the counts before it returns. A call to Table_map deallocates the counts, and Table_free deallocates the table itself.

<deallocate the entries and table 124>≡
   Table_map(table, vfree, NULL);
   Table_free(&table);

<wf functions 121>+≡
   void vfree(const void *key, void **count, void *cl) {
       FREE(*count);
   }

<wf prototypes 122>+≡
   void vfree(const void *, void **, void *);

The keys aren’t deallocated because they’re atoms, and so must not be. Besides, some of them are likely to appear in subsequent files.

Collecting the various wf.c fragments forms the program wf:

<wf.c>≡
  <wf includes 121>
  <wf prototypes 122>
  <wf functions 121>

Implementation

<table.c>≡
   #include <limits.h>
   #include <stddef.h>
   #include "mem.h"
   #include "assert.h"
   #include "table.h"

   #define T Table_T

   <types 125>
   <static functions 127>
   <functions 126>

Hash tables are one of the obvious data structures for representing associative tables (trees are the other one; see Exercise 8.2). Each Table_T is thus a pointer to a structure that holds a hash table of bindings, which carry the key-value pairs:

<types 125>≡
   struct T {
       <fields 126>
       struct binding {
           struct binding *link;
           const void *key;
           void *value;
       } **buckets;
   };

buckets points to an array with the appropriate number of elements. The cmp and hash functions are associated with a particular table, so they are also stored in the structure along with the number of elements in buckets:

<fields 126>≡
   int size;
   int (*cmp)(const void *x, const void *y);
   unsigned (*hash)(const void *key);

Table_new uses its hint argument to choose a prime for the size of buckets, and it saves either cmp and hash or pointers to static functions for comparing and hashing atoms:

<functions 126>≡
   T Table_new(int hint,
       int cmp(const void *x, const void *y),
       unsigned hash(const void *key)) {
       T table;
       int i;
       static int primes[] = { 509, 509, 1021, 2053, 4093,
           8191, 16381, 32771, 65521, INT_MAX };

       assert(hint >= 0);
       for (i = 1; primes[i] < hint; i++)
           ;
       table = ALLOC(sizeof (*table) +
           primes[i-1]*sizeof (table->buckets[0]));
       table->size = primes[i-1];
       table->cmp  = cmp  ?  cmp : cmpatom;
       table->hash = hash ? hash : hashatom;
       table->buckets = (struct binding **)(table + 1);
       for (i = 0; i < table->size; i++)
           table->buckets[i] = NULL;
       table->length = 0;
       table->timestamp = 0;
       return table;
   }

The for loop sets i to the index of the first element in primes that is equal to or exceeds hint, and primes[i-1] gives the number of elements in buckets. Notice that the loop starts at index 1. Mem’s ALLOC allocates the structure and space for buckets. Table uses a prime for the size of its hash table because it has no control over how hash numbers for keys are computed. The values in primes are the primes nearest 2n for n from 9 to 16, which yield a wide range of hash-table sizes. Atom uses a simpler algorithm because it also computes the hash numbers.

If cmp or hash is the null function pointer, the functions

<static functions 127>≡
   static int cmpatom(const void *x, const void *y) {
       return x != y;
   }

   static unsigned hashatom(const void *key) {
       return (unsigned long)key>>2;
   }

are used instead. Since atoms x and y are equal if x = y, cmpatom returns zero when x = y and one otherwise. This particular implementation of Table tests keys for equality only, so cmpatom doesn’t need to test the relative order of x and y. An atom is an address and this address itself can be used as a hash number; it is shifted right two bits because it’s likely that each atom starts on a word boundary, so the rightmost two bits are probably zero.

Each element in buckets heads a linked list of binding structures that hold a key, its associated value, and a pointer to the next binding structure on the list. Figure 8.1 gives an example. All of the keys in each list have the same hash number.

Table layout

Figure 8.1. Table layout

Table_get finds a binding by hashing its key, taking it modulo the number of elements in buckets, and searching the list for a key equal to key. It calls the table’s hash and cmp functions.

<functions 126>+≡
   void *Table_get(T table, const void *key) {
       int i;
       struct binding *p;
       assert(table);
       assert(key);
       <search table for key 128>
       return p ? p->value : NULL;
   }

<search table for key 128>≡
   i = (*table->hash)(key)%table->size;
   for (p = table->buckets[i]; p; p = p->link)
       if ((*table->cmp)(key, p->key) == 0)
           break;

This for loop terminates when it finds the key, and it thus leaves p pointing to the binding of interest. Otherwise, p ends up null.

Table_put is similar; it searches for a key and, if it finds it, changes the associated value. If Table_put doesn’t find the key, it allocates and initializes a new binding, and adds that binding to the front of the appropriate list hanging off of buckets. It could link the new binding in anywhere on the list, but adding to the front of the list is the easiest and most efficient alternative.

<functions>+≡
   void *Table_put(T table, const void *key, void *value) {
       int i;
       struct binding *p;
       void *prev;

       assert(table);
       assert(key);
       <search table for key 128>
       if (p == NULL) {
           NEW(p);
           p->key = key;
           p->link = table->buckets[i];
           table->buckets[i] = p;
           table->length++;
           prev = NULL;
       } else
           prev = p->value;
       p->value = value;
       table->timestamp++;
       return prev;
   }

Table_put increments two per-table counters:

<fields 126>+≡
   int length;
   unsigned timestamp;

length is the number of bindings in the table; it’s returned by Table_length:

<functions 126>+≡
   int Table_length(T table) {
       assert(table);
       return table->length;
   }

A table’s timestamp is incremented every time the table is changed by Table_put or Table_remove. timestamp is used to implement the checked runtime error that Table_map must enforce: the table can’t be changed while Table_map is visiting its bindings. Table_map saves the value of timestamp upon entry. After each call to apply, it asserts that the table’s timestamp is still equal to this saved value.

<functions 126>+≡
   void Table_map(T table,
       void apply(const void *key, void **value, void *cl),
       void *cl) {
       int i;
       unsigned stamp;
       struct binding *p;

       assert(table);
       assert(apply);
       stamp = table->timestamp;
       for (i = 0; i < table->size; i++)
           for (p = table->buckets[i]; p; p = p->link) {
               apply(p->key, &p->value, cl);
               assert(table->timestamp == stamp);
           }
   }

Table_remove also searches for a key, but does so by using a pointer to a pointer to a binding so that it can remove the binding for the key if it finds it:

<functions 126>+≡
   void *Table_remove(T table, const void *key) {
       int i;
       struct binding **pp;

       assert(table);
       assert(key);
       table->timestamp++;
       i = (*table->hash)(key)%table->size;
       for (pp = &table->buckets[i]; *pp; pp = &(*pp)->link)
           if ((*table->cmp)(key, (*pp)->key) == 0) {
               struct binding *p = *pp;
               void *value = p->value;
               *pp = p->link;
               FREE(p);
               table->length--;
               return value;
           }
       return NULL;
   }

The for loop is functionally equivalent to the one in <search table for key 128>, except that pp points to the pointer to the binding for each key. pp starts by pointing to table->buckets[i] and follows along the list, pointing to the link field of the kth binding in the list when the k+1st binding is examined, as depicted below.

Table layout

If *pp holds key, the binding can be unlinked from the list by setting *pp to (*pp)->link; p holds the value of *pp. If Table_remove finds the key, it also decrements the table’s length.

Table_toArray is similar to List_toArray. It allocates an array to hold the key-value pairs followed by a terminating end pointer, and fills in the array by visiting each binding in table:

<functions 126>+≡
   void **Table_toArray(T table, void *end) {
       int i, j = 0;
       void **array;
       struct binding *p;
       assert(table);
       array = ALLOC((2*table->length + 1)*sizeof (*array));
       for (i = 0; i < table->size; i++)
           for (p = table->buckets[i]; p; p = p->link) {
               array[j++] = (void *)p->key;
               array[j++] = p->value;
           }
       array[j] = end;
       return array;
   }

p->key must be cast from const void * to void * because the array is not declared const. The order of the key-value pairs in the array is arbitrary.

Table_free must deallocate the binding structures and the Table_T structure itself. The former step is needed only if the table isn’t empty:

<functions 126>+≡
   void Table_free(T *table) {
       assert(table && *table);
       if ((*table)->length > 0) {
           int i;
           struct binding *p, *q;
           for (i = 0; i < (*table)->size; i++)
               for (p = (*table)->buckets[i]; p; p = q) {
                   q = p->link;
                   FREE(p);
               }
       }
       FREE(*table);
   }

Further Reading

Tables are so useful that many programming languages use them as built-in data types. AWK (Aho, Kernighan, and Weinberger 1988) is a recent example, but tables appeared in SNOBOL4 (Griswold 1972), which predates AWK, and in SNOBOL4’s successor, Icon (Griswold and Griswold 1990). Tables in SNOBOL4 and Icon can be indexed by and can hold values of any type, but AWK tables (which are called arrays) can be indexed by and hold only strings and numbers. Table’s implementation uses some of the same techniques used to implement tables in Icon (Griswold and Griswold 1986).

PostScript (Adobe Systems 1990), a page-description language, also has tables, which it calls dictionaries. PostScript tables can be indexed only by “names,” which are PostScript’s rendition of atoms, but can hold values of any type, including dictionaries.

Tables also appear in object-oriented languages, either as built-in types or in libraries. The foundation libraries in both SmallTalk and Objective-C include dictionaries, which are much like the tables exported by Table. These kinds of objects are often called container objects because they hold collections of other objects.

Table’s implementation uses fixed-size hash tables. As long as the load factor, which is the number of table entries divided by the number of elements in the hash table, is reasonably small, keys can be found by looking at only a few entries. Performance suffers, however, when the load factor gets too high. The load factor can be kept within reasonable bounds by expanding the hash table whenever the load factor exceeds, say, five. Exercise 8.5 explores an effective but naive implementation of dynamic hash tables, which expands the hash table and rehashes all the existing entries. Larson (1988) describes, in great detail, a more sophisticated approach in which the hash table is expanded (or contracted) incrementally, one hash chain at a time. Larson’s approach eliminates the need for hint, and it can save storage because all tables can start small.

Exercises

8.1

There are many viable alternatives for associative-table ADTs. For example, in earlier versions of Table, Table_get returned pointers to the values instead of returning the values themselves, so clients could change them. In one design, Table_put always added a new binding to the table even if the key was already present, effectively “hiding” a previous binding with the same key, and Table_remove removed only the most recent binding. Table_map, however, visited all bindings in the table. Discuss the pros and cons of these and other alternatives. Design and implement a different table ADT.

8.2

The Table interface is designed so that other data structures can be used to implement tables. The comparison function reveals the relative order of two keys to admit implementations that use trees, for example. Reimplement Table using binary search trees or red-black trees. See Sedgewick (1990) for details about these data structures.

8.3

The order in which Table_map and Table_toArray visit the bindings in a table is unspecified. Suppose the interface were amended so that Table_map visited the bindings in the order they were added to the table and Table_array returned an array with the bindings in the same order. Implement this amendment. What are the practical advantages of this behavior?

8.4

Suppose the interface stipulated that Table_map and Table_array visited the bindings in sorted order. This stipulation would complicate the implementation of Table, but would simplify clients like wf that sort the table’s bindings anyway. Discuss the merits of this proposal and implement it. Hint: In the current implementation, the average-case running time of Table_put is constant and that of Table_get is nearly so. What are the average-case running times of Table_put and Table_get in your revised implementation?

8.5

Once buckets is allocated, it’s never expanded or contracted. Revise the Table implementation so that it uses a heuristic to adjust the size of buckets periodically as pairs are added and removed. Devise a test program that tests the effectiveness of your heuristics, and measure its benefit.

8.6

Implement the linear dynamic-hashing algorithm described in Larson (1988), and compare its performance with your solution to the previous exercise.

8.7

Revise wf.c to measure how much space is lost because atoms are never deallocated.

8.8

Change wf.c’s compare function so that it sorts the array in decreasing order of count values.

8.9

Change wf.c so that it prints the output for each file argument in alphabetical order of file names. With this change, the counts for mem.c would appear before those for table.c in the example shown at the beginning of Section 8.2.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset