An associative table is a set of key-value pairs. It’s like an array except that the indices can be values of any type. Many applications use tables. Compilers, for example, maintain symbol tables, which map names to sets of attributes for those names. Some window systems maintain tables that map window titles into some kind of window-related data structures. Document-preparation systems use tables to represent indices: For example, the index might be a table in which the keys are one-character strings — one for each section of the index — and the values are other tables in which the keys are the strings for the index entries themselves and the values are lists of page numbers.
Tables have many uses, and the examples alone could fill a chapter. The Table
interface is designed so that it can be used for many of these uses. It maintains key-value pairs, but it never inspects keys themselves; only clients inspect keys via functions passed to routines in Table
. Section 8.2 describes a typical Table
client, a program that prints the number of occurrences of words in its input. This program, wf
, also uses the Atom
and Mem
interfaces.
Table
represents an associative table with an opaque pointer type:
<table.h>≡ #ifndef TABLE_INCLUDED #define TABLE_INCLUDED #define T Table_T typedef struct T *T; <exported functions 116> #undef T #endif
The exported functions allocate and deallocate Table_T
s, add and remove key-value pairs from those tables, and visit the key-value pairs in them. It is a checked runtime error to pass a null Table_T
or null key to any function in this interface.
Table_T
s are allocated and deallocated by
<exported functions 116>≡
extern T Table_new (int hint,
int cmp(const void *x, const void *y),
unsigned hash(const void *key));
extern void Table_free(T *table);
Table_new
’s first argument, hint
, is an estimate of the number of entries that the new table is expected to hold. All tables can hold an arbitrary number of entries regardless of the value of hint
, but accurate values of hint
may improve performance. It is a checked runtime error for hint
to be negative. The functions cmp
and hash
manipulate client-specific keys. Given two keys, x
and y
, cmp(x,y)
must return an integer less than zero, equal to zero, or greater than zero, if, respectively, x
is less than y
, x
equals y
, or x
is greater than y
. The standard library function strcmp
is an example of a comparison function suitable for keys that are strings. hash
must return a hash number for key
; if cmp(x,y)
returns zero, then hash(x)
must be equal to hash(y)
. Each table can have its own hash
and cmp
functions.
Atoms are often used as keys, so if hash
is the null function pointer, the keys in the new table are assumed to be atoms and the implementation of Table
provides a suitable hash function. Similarly, if cmp
is the null function pointer, keys are assumed to be atoms, and two keys x
and y
are equal if x
= y
.
Table_new
can raise Mem_Failed
.
Table_new
’s arguments — a size hint, a hash function, and a comparison function — provide more information than most implementations need. For example, the hash table implementation described in Section 8.3 needs a comparison function that tests only for equality, and implementations that use trees don’t need the hint or the hash function. This complexity is the price of a design that permits multiple implementations, and this feature is one of the reasons designing good interfaces is difficult.
Table_free
deallocates *table
and sets it to the null pointer. It is a checked runtime error for table
or *table
to be null. Table_free
does not deallocate the keys or values; see Table_map
.
The functions
<exported functions 116>+≡
extern int Table_length(T table);
extern void *Table_put (T table, const void *key,
void *value);
extern void *Table_get (T table, const void *key);
extern void *Table_remove(T table, const void *key);
return the number of keys in a table, add a new key-value pair or change the value of an existing pair, fetch the value associated with a key, and remove a key-value pair.
Table_length
returns the number of key-value pairs in table
.
Table_put
adds the key-value pair given by key
and value
to table
. If table
already holds key
, value
overwrites the previous value, and Table_put
returns the previous value. Otherwise, key
and value
are added to table
, which grows by one entry, and Table_put
returns the null pointer. Table_put
can raise Mem_Failed
.
Table_get
searches table
for key
and, if it’s found, returns its associated value. If table
doesn’t hold key
, Table_get
returns the null pointer. Notice that returning the null pointer is ambiguous if table
holds null pointer values.
Table_remove
searches table
for key
and, if it’s found, removes the key-value pair from table, which thus shrinks by one entry, and returns the removed value. If table
doesn’t hold key
, Table_remove
has no effect on table
and returns the null pointer.
The functions
<exported functions 116>+≡
extern void Table_map (T table,
void apply(const void *key, void **value, void *cl),
void *cl);
extern void **Table_toArray(T table, void *end);
visit the key-value pairs and collect them into an array. Table_map
calls the function pointed to by apply
for every key-value pair in table
in an unspecified order. apply
and cl
specify a closure: Clients can pass an application-specific pointer, cl
, to Table_map
and this pointer is passed along to apply
at each call. For each pair in table
, apply
is called with its key, a pointer to its value, and cl
. Since apply
is called with pointers to the values, it can change them. Table_map
can also be used to deallocate keys or values before deallocating the table. For example, assuming the keys are atoms,
static void vfree(const void *key, void **value, void *cl) { FREE(*value); }
deallocates just the values, so
Table_map(table, vfree, NULL); Table_free(&table);
deallocates the values in table
and then table
itself.
It is a checked runtime error for apply
to change the contents of table
by calling Table_put
or Table_remove
.
Given a table with N
key-value pairs, Table_toArray
builds an array with 2N
+1 elements and returns a pointer to the first element. The keys and values alternate, with keys appearing in the even-numbered elements and their associated values in the following odd-numbered elements. The last even-numbered element, at index 2N
, is assigned end
, which is often the null pointer. The order of the key-value pairs in the array is unspecified. The program described in Section 8.2 illustrates the use of Table_toArray
.
Table_toArray
can raise Mem_Failed
, and clients must deallocate the array it returns.
wf
lists the number of times each word appears in a list of named files or in the standard input if no files are specified. For example:
% wf table.c mem.c
table.c:
3 apply
7 array
13 assert
9 binding
18 book
2 break
10 buckets
...
4 y
mem.c:
1 allocation
7 assert
12 book
1 stdlib
9 void
...
As this output shows, the words in each file are listed in alphabetical order and are preceded by the number of times they appear in the file. For wf
, a word is a letter followed by zero more letters or underscores, and case doesn’t matter.
More generally, a word begins with a character in a first set followed by zero or more characters in a rest set. Words of this form are recognized by getword
, which is a generalization of double
’s getword
described in Section 1.1. It’s used enough in this book to be packaged separately in its own interface:
<getword.h>≡
#include <stdio.h>
extern int getword(FILE *fp, char *buf, int size,
int first(int c), int rest(int c));
getword
consumes the next word in the file opened on fp
, stores it as a null-terminated string in buf[0
..size-1]
, and returns one. When it reaches the end of file without consuming a word, it returns zero. The functions first
and rest
test a character for membership in first and rest. A word is a contiguous sequence of characters; it starts with a character for which first
returns a nonzero value followed by characters for which rest
returns nonzero values. If a word is longer than size-2
characters, the excess characters are discarded. size
must exceed one, and fp
, buf
, first
, and rest
must be nonnull.
<getword.c>≡ #include <ctype.h> #include <string.h> #include <stdio.h> #include "assert.h" #include "getword.h" int getword(FILE *fp, char *buf, int size, int first(int c), int rest(int c)) { int i = 0, c; assert(fp && buf && size > 1 && first && rest); c = getc(fp); for ( ; c != EOF; c = getc(fp)) if (first(c)) { <store c in buf if it fits 120> c = getc(fp); break; } for ( ; c != EOF && rest(c); c = getc(fp)) <store c in buf if it fits 120> if (i < size) buf[i] = ' '; else buf[size-1] = ' '; if (c != EOF) ungetc(c, fp); return i > 0; } <store c in buf if it fits 120>≡ { if (i < size - 1) buf[i++] = c; }
This version of getword
is a bit more complex than the version in double
because this one must work when a character is in first but not in rest. When first
returns nonzero, that character is stored in buf
and only subsequent characters are passed to rest
.
wf
’s main
function processes its arguments, which name files. main
opens each file and calls wf
with the file pointer and file name:
<wf functions 121>≡ int main(int argc, char *argv[]) { int i; for (i = 1; i < argc; i++) { FILE *fp = fopen(argv[i], "r"); if (fp == NULL) { fprintf(stderr, "%s: can't open '%s' (%s) ", argv[0], argv[i], strerror(errno)); return EXIT_FAILURE; } else { wf(argv[i], fp); fclose(fp); } } if (argc == 1) wf(NULL, stdin); return EXIT_SUCCESS; } <wf includes 121>≡ #include <stdio.h> #include <stdlib.h> #include <errno.h>
If there are no arguments, main
calls wf
with a null file name and the file pointer for the standard input. The null file name tells wf
not to print the name of the file.
wf
uses a table to store the words and their counts. Each word is folded to lowercase, converted to an atom, and used as a key. Using atoms lets wf
use the defaults for the table’s hash and comparison functions. Values are pointers, but wf
needs to associate an integer count with each key. It thus allocates space for a counter and stores a pointer to this space in the table.
<wf functions 121>+≡ void wf(char *name, FILE *fp) { Table_T table = Table_new(0, NULL, NULL); char buf[128]; while (getword(fp, buf, sizeof buf, first, rest)) { const char *word; int i, *count; for (i = 0; buf[i] != ' '; i++) buf[i] = tolower(buf[i]); word = Atom_string(buf); count = Table_get(table, word); if (count) (*count)++; else { NEW(count); *count = 1; Table_put(table, word, count); } } if (name) printf("%s: ", name); { <print the words 123> } <deallocate the entries and table 124> } <wf includes 121>+≡ #include <ctype.h> #include "atom.h" #include "table.h" #include "mem.h" #include "getword.h" <wf prototypes 122>≡ void wf(char *, FILE *);
count
is a pointer to an integer. If Table_get
returns null, the word isn’t in table
, so wf
allocates space for the counter, initializes it to one to account for this first occurrence of the word, and adds it to the table. When Table_get
returns a nonnull pointer, the expression (*count)++
increments the integer pointed to by that pointer. This expression is much different than *count++
, which would increment count
instead of the integer it points to.
Membership in first and rest is tested by functions of the same names that use the predicates defined in the standard header ctype.h
:
<wf functions 121>+≡ int first(int c) { return isalpha(c); } int rest(int c) { return isalpha(c) || c == '_'; } <wf prototypes 122>+≡ int first(int c); int rest (int c);
Once wf
has read all of the words, it must sort and print them. qsort
, the standard C library sorting function, sorts an array, so wf
can sort the array returned by Table_toArray
if it tells qsort
that key-value pairs in the array should be treated as single elements. It can then print the words and their counts by walking down the array:
<print the words 123>≡
int i;
void **array = Table_toArray(table, NULL);
qsort(array, Table_length(table), 2*sizeof (*array),
compare);
for (i = 0; array[i]; i += 2)
printf("%d %s
", *(int *)array[i+1],
(char *)array[i]);
FREE(array);
qsort
takes four arguments: the array, the number of elements, the size of each element in bytes, and a function that’s called to compare two elements. To treat each of the N
key-value pairs as a single element, wf
tells qsort
that there are N
elements and that each takes the space occupied by two pointers.
qsort
calls the comparison function with pointers to the elements. Each element is itself two pointers — one to the word and one to the count — so the comparision function is called with two pointers to pointers to characters. For instance, when assert
from mem.c
is compared with book
, the arguments x
and y
are
The comparison function can compare the words by calling strcmp
:
<wf functions 121>+≡ int compare(const void *x, const void *y) { return strcmp(*(char **)x, *(char **)y); } <wf includes 121>+≡ #include <string.h> <wf prototypes 122>+≡ int compare(const void *x, const void *y);
The wf
function is called for each file-name argument, so, to save space, it should deallocate the table and the counts before it returns. A call to Table_map
deallocates the counts, and Table_free
deallocates the table itself.
<deallocate the entries and table 124>≡ Table_map(table, vfree, NULL); Table_free(&table); <wf functions 121>+≡ void vfree(const void *key, void **count, void *cl) { FREE(*count); } <wf prototypes 122>+≡ void vfree(const void *, void **, void *);
The keys aren’t deallocated because they’re atoms, and so must not be. Besides, some of them are likely to appear in subsequent files.
Collecting the various wf.c
fragments forms the program wf
:
<wf.c>≡ <wf includes 121> <wf prototypes 122> <wf functions 121>
<table.c>≡ #include <limits.h> #include <stddef.h> #include "mem.h" #include "assert.h" #include "table.h" #define T Table_T <types 125> <static functions 127> <functions 126>
Hash tables are one of the obvious data structures for representing associative tables (trees are the other one; see Exercise 8.2). Each Table_T
is thus a pointer to a structure that holds a hash table of bindings, which carry the key-value pairs:
<types 125>≡ struct T { <fields 126> struct binding { struct binding *link; const void *key; void *value; } **buckets; };
buckets
points to an array with the appropriate number of elements. The cmp
and hash
functions are associated with a particular table, so they are also stored in the structure along with the number of elements in buckets
:
<fields 126>≡
int size;
int (*cmp)(const void *x, const void *y);
unsigned (*hash)(const void *key);
Table_new
uses its hint
argument to choose a prime for the size of buckets
, and it saves either cmp
and hash
or pointers to static functions for comparing and hashing atoms:
<functions 126>≡
T Table_new(int hint,
int cmp(const void *x, const void *y),
unsigned hash(const void *key)) {
T table;
int i;
static int primes[] = { 509, 509, 1021, 2053, 4093,
8191, 16381, 32771, 65521, INT_MAX };
assert(hint >= 0);
for (i = 1; primes[i] < hint; i++)
;
table = ALLOC(sizeof (*table) +
primes[i-1]*sizeof (table->buckets[0]));
table->size = primes[i-1];
table->cmp = cmp ? cmp : cmpatom;
table->hash = hash ? hash : hashatom;
table->buckets = (struct binding **)(table + 1);
for (i = 0; i < table->size; i++)
table->buckets[i] = NULL;
table->length = 0;
table->timestamp = 0;
return table;
}
The for loop sets i
to the index of the first element in primes
that is equal to or exceeds hint
, and primes[i-1]
gives the number of elements in buckets
. Notice that the loop starts at index 1. Mem
’s ALLOC
allocates the structure and space for buckets
. Table
uses a prime for the size of its hash table because it has no control over how hash numbers for keys are computed. The values in primes
are the primes nearest 2n for n
from 9 to 16, which yield a wide range of hash-table sizes. Atom
uses a simpler algorithm because it also computes the hash numbers.
If cmp
or hash
is the null function pointer, the functions
<static functions 127>≡
static int cmpatom(const void *x, const void *y) {
return x != y;
}
static unsigned hashatom(const void *key) {
return (unsigned long)key>>2;
}
are used instead. Since atoms x
and y
are equal if x
= y
, cmpatom
returns zero when x
= y
and one otherwise. This particular implementation of Table
tests keys for equality only, so cmpatom
doesn’t need to test the relative order of x
and y
. An atom is an address and this address itself can be used as a hash number; it is shifted right two bits because it’s likely that each atom starts on a word boundary, so the rightmost two bits are probably zero.
Each element in buckets
heads a linked list of binding
structures that hold a key, its associated value, and a pointer to the next binding
structure on the list. Figure 8.1 gives an example. All of the keys in each list have the same hash number.
Table_get
finds a binding by hashing its key, taking it modulo the number of elements in buckets
, and searching the list for a key equal to key
. It calls the table’s hash
and cmp
functions.
<functions 126>+≡ void *Table_get(T table, const void *key) { int i; struct binding *p; assert(table); assert(key); <search table for key 128> return p ? p->value : NULL; } <search table for key 128>≡ i = (*table->hash)(key)%table->size; for (p = table->buckets[i]; p; p = p->link) if ((*table->cmp)(key, p->key) == 0) break;
This for loop terminates when it finds the key, and it thus leaves p
pointing to the binding
of interest. Otherwise, p
ends up null.
Table_put
is similar; it searches for a key and, if it finds it, changes the associated value. If Table_put
doesn’t find the key, it allocates and initializes a new binding
, and adds that binding to the front of the appropriate list hanging off of buckets
. It could link the new binding in anywhere on the list, but adding to the front of the list is the easiest and most efficient alternative.
<functions>+≡ void *Table_put(T table, const void *key, void *value) { int i; struct binding *p; void *prev; assert(table); assert(key); <search table for key 128> if (p == NULL) { NEW(p); p->key = key; p->link = table->buckets[i]; table->buckets[i] = p; table->length++; prev = NULL; } else prev = p->value; p->value = value; table->timestamp++; return prev; }
Table_put
increments two per-table counters:
<fields 126>+≡
int length;
unsigned timestamp;
length
is the number of bindings in the table; it’s returned by Table_length
:
<functions 126>+≡
int Table_length(T table) {
assert(table);
return table->length;
}
A table’s timestamp
is incremented every time the table is changed by Table_put
or Table_remove
. timestamp
is used to implement the checked runtime error that Table_map
must enforce: the table can’t be changed while Table_map
is visiting its bindings. Table_map
saves the value of timestamp
upon entry. After each call to apply
, it asserts that the table’s timestamp
is still equal to this saved value.
<functions 126>+≡
void Table_map(T table,
void apply(const void *key, void **value, void *cl),
void *cl) {
int i;
unsigned stamp;
struct binding *p;
assert(table);
assert(apply);
stamp = table->timestamp;
for (i = 0; i < table->size; i++)
for (p = table->buckets[i]; p; p = p->link) {
apply(p->key, &p->value, cl);
assert(table->timestamp == stamp);
}
}
Table_remove
also searches for a key, but does so by using a pointer to a pointer to a binding
so that it can remove the binding for the key if it finds it:
<functions 126>+≡
void *Table_remove(T table, const void *key) {
int i;
struct binding **pp;
assert(table);
assert(key);
table->timestamp++;
i = (*table->hash)(key)%table->size;
for (pp = &table->buckets[i]; *pp; pp = &(*pp)->link)
if ((*table->cmp)(key, (*pp)->key) == 0) {
struct binding *p = *pp;
void *value = p->value;
*pp = p->link;
FREE(p);
table->length--;
return value;
}
return NULL;
}
The for loop is functionally equivalent to the one in <search
table
for
key 128
>, except that pp
points to the pointer to the binding for each key. pp
starts by pointing to table->buckets[i]
and follows along the list, pointing to the link
field of the k
th binding
in the list when the k
+1st binding
is examined, as depicted below.
If *pp
holds key
, the binding
can be unlinked from the list by setting *pp
to (*pp)->link
; p
holds the value of *pp
. If Table_remove
finds the key, it also decrements the table’s length.
Table_toArray
is similar to List_toArray
. It allocates an array to hold the key-value pairs followed by a terminating end pointer, and fills in the array by visiting each binding in table
:
<functions 126>+≡
void **Table_toArray(T table, void *end) {
int i, j = 0;
void **array;
struct binding *p;
assert(table);
array = ALLOC((2*table->length + 1)*sizeof (*array));
for (i = 0; i < table->size; i++)
for (p = table->buckets[i]; p; p = p->link) {
array[j++] = (void *)p->key;
array[j++] = p->value;
}
array[j] = end;
return array;
}
p->key
must be cast from const void * to void * because the array is not declared const. The order of the key-value pairs in the array is arbitrary.
Table_free
must deallocate the binding structures and the Table_T
structure itself. The former step is needed only if the table isn’t empty:
<functions 126>+≡
void Table_free(T *table) {
assert(table && *table);
if ((*table)->length > 0) {
int i;
struct binding *p, *q;
for (i = 0; i < (*table)->size; i++)
for (p = (*table)->buckets[i]; p; p = q) {
q = p->link;
FREE(p);
}
}
FREE(*table);
}
Tables are so useful that many programming languages use them as built-in data types. AWK (Aho, Kernighan, and Weinberger 1988) is a recent example, but tables appeared in SNOBOL4 (Griswold 1972), which predates AWK, and in SNOBOL4’s successor, Icon (Griswold and Griswold 1990). Tables in SNOBOL4 and Icon can be indexed by and can hold values of any type, but AWK tables (which are called arrays) can be indexed by and hold only strings and numbers. Table
’s implementation uses some of the same techniques used to implement tables in Icon (Griswold and Griswold 1986).
PostScript (Adobe Systems 1990), a page-description language, also has tables, which it calls dictionaries. PostScript tables can be indexed only by “names,” which are PostScript’s rendition of atoms, but can hold values of any type, including dictionaries.
Tables also appear in object-oriented languages, either as built-in types or in libraries. The foundation libraries in both SmallTalk and Objective-C include dictionaries, which are much like the tables exported by Table
. These kinds of objects are often called container objects because they hold collections of other objects.
Table
’s implementation uses fixed-size hash tables. As long as the load factor, which is the number of table entries divided by the number of elements in the hash table, is reasonably small, keys can be found by looking at only a few entries. Performance suffers, however, when the load factor gets too high. The load factor can be kept within reasonable bounds by expanding the hash table whenever the load factor exceeds, say, five. Exercise 8.5 explores an effective but naive implementation of dynamic hash tables, which expands the hash table and rehashes all the existing entries. Larson (1988) describes, in great detail, a more sophisticated approach in which the hash table is expanded (or contracted) incrementally, one hash chain at a time. Larson’s approach eliminates the need for hint
, and it can save storage because all tables can start small.
8.1 | There are many viable alternatives for associative-table ADTs. For example, in earlier versions of |
8.2 | The |
8.3 | The order in which |
8.4 | Suppose the interface stipulated that |
8.5 | Once |
8.6 | Implement the linear dynamic-hashing algorithm described in Larson (1988), and compare its performance with your solution to the previous exercise. |
8.7 | Revise |
8.8 | Change |
8.9 | Change |