Chapter 25. A Hashed Database Library

Applications commonly wish to store some form of binary data in a file. Storing such data for efficient retrieval is tricky and error-prone. There are a number of libraries that provide simple APIs for storing information in files. The dbm library was an early part of Unix systems (and was later reimplemented as ndbm), which led later to the the Berkeley db library and the gdbm library from the GNU Project. All of these libraries provide easy access to files that are set up as hash tables, with a binary key providing access to a binary data region.[1]

While gdbm and Berkeley db are both widely available on Linux systems, the licenses for each of them make them unsuitable for commercial development.[2] The qdbm library is quite similar to those other choices, but is licensed under the LGPL, making it more attractive to many developers. The basic API for each of these libraries is quite similar to the others, however, and porting code between the libraries is straightforward.

Full source code and documentation for qdbm are available from the http://qdbm.sourceforge.net Web site. This chapter describes all of the functions most applications need to use qdbm (all of which have close analogues in Berkeley db, gdbm, and ndbm). There are other API functions available, all of which are documented at the qdbm Web site.

Overview

Qdbm provides quite a few different APIs. The most basic of them, Depot, is the lowest-level API and the one we discuss here. The Curia interface allows the database to be split into multiple files (for better scalability or to work around file system limits) and the Villa functions provide both B-Trees and a transaction model. Inverted indexes[3] are available through the Odeon API. The final two APIs are Relic and Hovel, which provide reimplementations of the ndbm and gdbm APIs.

The Depot functions provide basic key/value operations, where a key is used to retrieve the value. Both the key and value are arbitrary binary streams that have their size passed separately from the data; the library does not need to know anything about their structure. However, Depot has a couple of features that make using strings as both keys and data items more convenient. First, whenever the size of a key or data item is passed into the library, -1 can be passed instead, which tells Depot to use strlen() to calculate the size to use. Second, most functions that read keys and data items automatically append a 0 byte to the returned value. This extra character is not included in the size returned, so it can be ignored if the value is not a string. If it is a string, the returned value can be treated as a string without the application having to NULL-terminate it.

Depot uses the global integer dpecode to store error codes; when Depot functions return failures, dpecode tells what went wrong (this is just like the errno variable used by system calls and the C library). Text error messages are provided by dperrmsg().

#include <depot.h>

const char * dperrmsg(int ecode);

Like strerror(), dperrmsg() takes an error code (normally from dpecode) and returns a string describing what went wrong.

The functions in the Depot API use a pointer to a DEPOT structure. This structure is opaque (programs that use Depot cannot inspect the values in the structure itself), but contains all of the information the library uses to maintain the on-disk file.

Basic Operations

Opening a qdbm File

The dpopen() library function is used to open db files.

#include <depot.h>

DB * dpopen(const char * filename, int omode, int bnum);

The first argument is the name of the file to use for the database.[4] The omode specifies how the file should be accessed, and should be DP_OREADER or DP_OWRITER, depending on whether the program wants read or write access to the database. By default, the database is locked to allow multiple programs read access or a single program write access. If the application does not want qdbm to perform any locking, DP_ONOLCK may be bitwise OR’ed with omode.

When applications are creating new databases, they should also bitwise OR DP_CREAT to tell qdbm to create a new file if it does not already exist. The DP_OTRUNC flag causes whatever contents filename originally contained to be erased and replaced with an empty database.

The final parameter for dpopen(), bnum, tells qdbm how many buckets to use in the hash array. The lower this number is, the smaller the database; the larger it is, the faster it is, due to reducing the number of hash collisions. The qdbm documentation recommends that this number be anywhere from half to four times the number of items that are expected to be in the database.[5] If you are not sure what number to use, zero gives a default value.[6]

dpopen() returns a pointer to a DEPOT structure, which is passed into the rest of the Depot functions. If an error occurs, dpopen() returns NULL and sets dpecode.

Closing a Database

Database files are closed through dpclose().

int dpclose(DEPOT * depot);

The dpclose() function returns zero on success, and nonzero if it fails, which can occur if the database’s buffers cannot be flushed for any reason. Here is an example program that opens a database file in the current directory and immediately closes it:

 1: /* qdbmsimple.c */
 2:
 3: #include <depot.h>
 4: #include <errno.h>
 5: #include <fcntl.h>
 6: #include <stdio.h>
 7:
 8: int main(void) {
 9:     DEPOT * dp;
10:
11:     dp = dpopen("test.db", DP_OWRITER | DP_OCREAT, 0);
12:     if (!dp) {
13:         printf("error: %s
", dperrmsg(dpecode));
14:         return 1;
15:     }
16:
17:     dpclose(dp);
18:
19:     return 0;
20: }

Obtaining the File Descriptor

While qdbm provides automatic locking, some programs may wish to substitute their own locking logic. To enable this, qdbm provides access to the file descriptor that refers to the database.

int dpfdesc(DEPOT * depot);

This function returns the file descriptor the database referred to by depot.[7]

Syncing the Database

Qdbm caches data in RAM to provide faster database access, and the Linux kernel caches disk writes for low-latency write() calls. An application can ensure that the on-disk database is consistent with the buffered structures by syncing the database. When a database is synced, qdbm flushes all its internal buffers and calls fsync() on the file descriptor.

int dpsync(DEPOT * depot);

Reading Records

There are two ways to read records from the database: looking up a record by its key, and reading sequential key/value pairs.

Reading a Particular Record

Both dpget() and dpgetwb() look up database entries by a key.

int dpget(DEPOT * depot, const char * key, int keySize, int start,
          int max, int * dataSize);

The key is the item to look up in the database, and keySize is the length of the key (or -1, in which case Depot uses strlen(key) to determine the length). The next two parameters, start and max, allow partial records to be read in; start is the offset in the data where the read begins, and max is the maximum number of bytes to be read. If the data region were an array of 4 byte int values, for example, setting start to 12 and max to 8 would read in the fourth and fifth elements from the array. If less than start bytes are available, dpget() returns NULL. To read in all of the bytes from the data, set max to -1.

If the final parameter, dataSize, is not NULL, the integer it points to is filled in with the number of bytes read.

This function returns NULL on failure and a pointer to the data read in when it succeeds. If it fails, dpecode tells what caused the failure. In particular, if the item does not exist or has less than start bytes of data, it sets dpecode to DP_ENOITEM.

When dpget() returns data, a 0 byte is appended to the data, allowing it to be used as a string. The pointer is allocated using malloc(), and the application is responsible for freeing the memory when it is finished with it. If applications would like to read into a buffer rather than having Depot allocate one with malloc(), they should instead use the dpgetwb() function.

int dpgetwb(DEPOT * depot, const char * key, int keySize, int start,
            int max, const char * data);

The only parameters that are different between dpgetwb() and dpget() are max (which is interpreted differently) and data (which replaces the dataSize parameter from dpgetwb()). data should point to a buffer of max bytes for dpgetwb() to fill in with the data read from the database. The max parameter should not be -1 for dpgetwb() and the buffer does not have a 0 byte automatically appended to it by dpgetwb(). This function returns the number of bytes stored at data, and -1 if the record was not found, the data was smaller than start bytes, or an error occurred.

Reading Records Sequentially

Applications can iterate over all of the keys in the database by using dpiterinit() and dpiternext(). The keys are not returned in any particular order[8] and the database should not be modified while the application is iterating over it.

int dpiterinit(DEPOT * depot);
char * dpiternext(DEPOT * depot, int * keySize);

Calling dpiterinit() tells qdbm to return the first key in the database the next time dpiternext() is called.

dpiternext() returns a pointer to either the first key in the database (if dpiterinit() was just called) or the key in the database just after the key it returned the previous time. If there are no more keys in the database, it returns NULL. If keySize is not NULL, the integer it points to is set to the size of the key returned.

The buffer dpiternext() returns a pointer to is allocated with malloc(), and the pointer needs to be deallocated with free() when the application is done with the key. The buffer is also NULL terminated, allowing it to be used as a string, if appropriate.

Modifying the Database

There are two operations that modify qdbm databases: adding records and removing records. Updating records is done using the same function as adding records.

Adding Records

New and updated records are recorded in the database through the dpput() function.

int dpput(DEPOT * depot, const char * key, int keySize,
          const char * data, int dataSize, int dmode);

The key is the index value that can later be used to retrieve the information referenced by data. Both keySize and dataSize may be -1, which tells dpput() to use strlen() to get the size of that field. It looks at the dmode parameter only if there is already a data item associated with key in the database. dmode can have one of the following values:

DP_DCAT

The new data is appended to the end of the data already in the database.

DP_DKEEP

The database is not modified; dpput() returns failure and dpecode is set to DP_EKEEP.

DP_DOVER

The data in the database is overwritten with the new value.

dpput() returns zero if an error occurred (or if the key already existed and DP_DKEEP was specified), and nonzero if the data for the key was updated successfully.

Removing Records

Remove records from the database by passing the key whose data should be removed to dpout().

int dpout(DEPOT * depot, const char * key, int keySize);

The specified key and the data associated with it are removed from the database, and a nonzero value is returned. If there is no data for that key, zero is returned. As for all of the other functions that take a key, if the keySize is -1, dpout uses strlen() to find the length of the key.

Example

To help make all of this more concrete, here is a sample application that makes use of most of qdbm’s features. It is intended as a simple phone database, although it can be used to store any simple name/value pairs. It stores its database in the user’s home directory as .phonedb.

The -a flag adds an entry to the database. If -f is specified, any existing entry is overwritten with the new data. The next parameter is the key value to use, and the last parameter is the data (usually a phone number).

The -q flag queries the database for a particular key, which should be the only other parameter specified. Entries are removed from the database through the -d flag, which takes the key value to remove as the only other parameter.

If -l is specified, all the key/value pairs in the database are listed.

Here is an example of using phones:

$ ./phones -a Erik 374-5876
$ ./phones -a Michael 642-4235
$ ./phones -a Larry 527-7976
$ ./phones -a Barbara 227-2272
$ ./phones -q Larry
Larry 527-7976
$ ./phones -l
Larry 527-7976
Erik 374-5876
Michael 642-4235
Barbara 227-2272
$ ./phones -d Michael
$ ./phones -l
Larry 527-7976
Erik 374-5876
Barbara 227-2272

This program does something quite useful, has fewer than 200 lines of source code, and is well suited to managing large numbers of key/value pairs, clearly illustrating the value of the qdbm library.

  1: /* phones.c */
  2:
  3: /* This implements a very simple phone database. Full usage
  4:    information is given in the text. */
  5:
  6: #include <alloca.h>
  7: #include <depot.h>
  8: #include <errno.h>
  9: #include <fcntl.h>
 10: #include <stdio.h>
 11: #include <stdlib.h>
 12: #include <string.h>
 13: #include <unistd.h>
 14:
 15: void usage(void) {
 16:     fprintf(stderr, "usage: phones -a [-f] <name> <phone>
");
 17:     fprintf(stderr, "              -d <name>
");
 18:     fprintf(stderr, "              -q <name>
");
 19:     fprintf(stderr, "              -l
");
 20:     exit(1);
 21: }
 22:
 23: /* Opens the database $HOME/.phonedb. If writeable is nonzero,
 24:    the database is opened for updating. If writeable is 0, the
 25:    database is opened read-only. */
 26: DEPOT * openDatabase(int writeable) {
 27:     DEPOT * dp;
 28:     char * filename;
 29:     int flags;
 30:
 31:     /* Set the open mode */
 32:     if (writeable) {
 33:         flags = DP_OWRITER | DP_OCREAT;
 34:     } else {
 35:         flags = DP_OREADER;
 36:     }
 37:
 38:     filename = alloca(strlen(getenv("HOME")) + 20);
 39:     strcpy(filename, getenv("HOME"));
 40:     strcat(filename, "/.phonedb");
 41:
 42:     dp = dpopen(filename, flags, 0);
 43:     if (!dp) {
 44:         fprintf(stderr, "failed to open %s: %s
", filename,
 45:                 dperrmsg(dpecode));
 46:         return NULL;
 47:     }
 48:
 49:     return dp;
 50: }
 51:
 52: /* add a new record to the database; this parses the
 53:    command-line arguments directly */
 54: int addRecord(int argc, char ** argv) {
 55:     DEPOT * dp;
 56:     char * name, * phone;
 57:     int rc = 0;
 58:     int overwrite = 0;
 59:     int flag;
 60:
 61:     /* check for our parameters; -f means overwrite an
 62:        existing entry, and the name and phone number should
 63:        be all that remains */
 64:     if (!argc) usage();
 65:     if (!strcmp(argv[0], "-f")) {
 66:         overwrite = 1;
 67:         argc--, argv++;
 68:     }
 69:
 70:     if (argc != 2) usage();
 71:
 72:     name = argv[0];
 73:     phone = argv[1];
 74:
 75:     /* open the database for writing */
 76:     if (!(dp = openDatabase(1))) return 1;
 77:
 78:     /* if we shouldn't overwrite an existing entry, check
 79:        to see if this name is already used */
 80:     if (!overwrite) {
 81:         flag = DP_DKEEP;
 82:     } else {
 83:         flag = DP_DOVER;
 84:     }
 85:
 86:     if (!dpput(dp, name, -1, phone, -1, flag)) {
 87:         if (dpecode == DP_EKEEP) {
 88:             fprintf(stderr, "%s already has a listing
", name);
 89:         } else {
 90:             fprintf(stderr, "put failed: %s
", dperrmsg(dpecode));
 91:         }
 92:
 93:         rc = 1;
 94:     }
 95:
 96:     dpclose(dp);
 97:
 98:     return rc;
 99: }
100:
101: /* looks up a name, and prints the phone number associated
102:    with it; parses the command line directly */
103: int queryRecord(int argc, char ** argv) {
104:     DEPOT * dp;
105:     int rc;
106:     char * phone;
107:
108:     /* only one argument is expected, a name to look up */
109:     if (argc != 1) usage();
110:
111:     /* open the database for reading */
112:     if (!(dp = openDatabase(0))) return 1;
113:
114:     phone = dpget(dp, argv[0], -1, 0, -1, NULL);
115:     if (!phone) {
116:         if (dpecode == DP_ENOITEM)
117:             fprintf(stderr, "%s is not listed
", argv[0]);
118:         else
119:             fprintf(stderr, "error reading database: %s
",
120:                     dperrmsg(dpecode));
121:
122:         rc = 1;
123:     } else {
124:         printf("%s %s
", argv[0], (char *) phone);
125:         rc = 0;
126:     }
127:
128:     dpclose(dp);
129:
130:     return rc;
131: }
132:
133: /* delete the specified record; the name is passed as a
134:    command-line argument */
135: int delRecord(int argc, char ** argv) {
136:     DEPOT * dp;
137:     int rc;
138:
139:     /* only a single argument is expected */
140:     if (argc != 1) usage();
141:
142:     /* open the database for updating */
143:     if (!(dp = openDatabase(1))) return 1;
144:
145:     if (!(rc = dpout(dp, argv[0], -1))) {
146:         if (dpecode == DP_ENOITEM)
147:             fprintf(stderr, "%s is not listed
", argv[0]);
148:         else
149:             fprintf(stderr, "error removing item: %s
",
150:                     dperrmsg(dpecode));
151:
152:         rc = 1;
153:     }
154:
155:     dpclose(dp);
156:
157:     return rc;
158: }
159:
160: /* lists all of the records in the database */
161: int listRecords(void) {
162:     DEPOT * dp;
163:     char * key, * value;
164:
165:     /* open the database read-only */
166:     if (!(dp = openDatabase(0))) return 1;
167:
168:     dpiterinit(dp);
169:
170:     /* iterate over all of the records */
171:     while ((key = dpiternext(dp, NULL))) {
172:         value = dpget(dp, key, -1, 0, -1, NULL);
173:         printf("%s %s
", key, value);
174:     }
175:
176:     dpclose(dp);
177:
178:     return 0;
179: }
180:
181: int main(int argc, char ** argv) {
182:     if (argc == 1) usage();
183:
184:     /* look for a mode flag, and call the appropriate function
185:        with the remainder of the arguments */
186:     if (!strcmp(argv[1], "-a"))
187:         return addRecord(argc - 2, argv + 2);
188:     else if (!strcmp(argv[1], "-q"))
189:         return queryRecord(argc - 2, argv + 2);
190:     else if (!strcmp(argv[1], "-d"))
191:         return delRecord(argc - 2, argv + 2);
192:     else if (!strcmp(argv[1], "-l")) {
193:         if (argc != 2) usage();
194:         return listRecords();
195:     }
196:
197:     usage();  /* did not recognize any options */
198:     return 0; /* doesn't get here due to usage() */
199: }


[1] The Berkeley db library has been greatly extended, and now includes a B-Tree implementation and full transaction capabilities.

[2] Berkeley db is now developed by a for-profit organization, and they sell alternative licenses for their library that are attractive for certain applications.

[3] Inverted indexes are a data structure designed for full text searches

[4] Unlike some database libraries that use multiple files, commonly ending with .pag and .dir,the Depot library uses a single file.

[5] For a good introduction to hash tables, see [Cormen, 1992].

[6] This value can be changed only by optimizing the database with dpoptimize(), which is documented on the qdbm Web site.

[7] While qdbm makes the file descriptor available, be careful how you use it. All reading and writing to the file should be done through the qdbm library; operations that do not modify the data of the file, such as locking or setting the close-on-exec flag are acceptable.

[8] Well, they are returned in the order the items are referenced from the hash bucket. While this is an order, it is not a useful one.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset