An important application of hash tables is the way compilers maintain information about symbols encountered in a program. Formally, a compiler translates a program written in one language, a source language such as C, into another language, which is a set of instructions for the machine on which the program will run. In order to maintain information about the symbols in a program, compilers make use of a data structure called a symbol table. Symbol tables are often implemented as hash tables because a compiler must be able to store and retrieve information about symbols very quickly.
Several parts of a compiler access the symbol table during various phases of the compilation process. One part, the lexical analyzer, inserts symbols. The lexical analyzer is the part of a compiler charged with grouping characters from the source code into meaningful strings, called lexemes. These are translated into syntactic elements, called tokens , that are passed on to the parser . The parser performs syntactical analysis. As the lexical analyzer encounters symbols in its input stream, it stores information about them into the symbol table. Two important attributes stored by the lexical analyzer are a symbol’s lexeme and the type of token the lexeme constitutes (e.g., an identifier or an operator).
The example presented here is a very simple lexical analyzer that analyzes a string of characters and then groups the characters into one of two types of tokens: a token consisting only of digits or a token consisting of something other than digits alone. For simplicity, we assume that tokens are separated in the input stream by a single blank. The lexical analyzer is implemented as a function, lex (see Examples Example 8.4 and Example 8.5), which a parser calls each time it requires another token.
The function works by first calling the
next_token function (whose implementation is not shown) to get the next
blank-delimited string from the input stream
istream
. If next_token
returns NULL, there are no more tokens in the input stream. In this
case, the function returns lexit
, which
tells the parser that there are no more tokens to be processed. If
next_token finds a string, some simple analysis
is performed to determine what type of token the string represents.
Next, the function inserts the lexeme and token type together as a
Symbol
structure into the symbol table,
symtbl
, and returns the token type to the
parser. The type Symbol
is defined in
symbol.h, which is not included
in this example.
A chained hash table is a good way to implement a symbol table because, in addition to being an efficient way to store and retrieve information, we can use it to store a virtually unlimited amount of data. This is important for a compiler since it is difficult to know how many symbols a program will contain before lexical analysis.
The runtime complexity of lex is O (1), assuming next_token runs in a constant amount of time. This is because lex simply calls chtbl_insert, which is an O (1) operation.
/***************************************************************************** * * * --------------------------------- lex.h -------------------------------- * * * *****************************************************************************/ #ifndef LEX_H #define LEX_H #include "chtbl.h" /***************************************************************************** * * * Define the token types recognized by the lexical analyzer. * * * *****************************************************************************/ typedef enum Token_ {lexit, error, digit, other} Token; /***************************************************************************** * * * --------------------------- Public Interface --------------------------- * * * *****************************************************************************/ Token lex(const char *istream, CHTbl *symtbl); #endif
/***************************************************************************** * * * --------------------------------- lex.c -------------------------------- * * * *****************************************************************************/ #include <ctype.h> #include <stdlib.h> #include <string.h> #include "chtbl.h" #include "lex.h" #include "symbol.h" /***************************************************************************** * * * ---------------------------------- lex --------------------------------- * * * *****************************************************************************/ Token lex(const char *istream, CHTbl *symtbl) { Token token; Symbol *symbol; int length, retval, i; /***************************************************************************** * * * Allocate space for a symbol. * * * *****************************************************************************/ if ((symbol = (Symbol *)malloc(sizeof(Symbol))) == NULL) return error; /***************************************************************************** * * * Process the next token. * * * *****************************************************************************/ if ((symbol->lexeme = next_token(istream)) == NULL) { /************************************************************************** * * * Return that there is no more input. * * * **************************************************************************/ free(symbol); return lexit; } else { /************************************************************************** * * * Determine the token type. * * * **************************************************************************/ symbol->token = digit; length = strlen(symbol->lexeme); for (i = 0; i < length; i++) { if (!isdigit(symbol->lexeme[i])) symbol->token = other; } memcpy(&token, &symbol->token, sizeof(Token)); /************************************************************************** * * * Insert the symbol into the symbol table. * * * **************************************************************************/ if ((retval = chtbl_insert(symtbl, symbol)) < 0) { free(symbol); return error; } else if (retval == 1) { /*********************************************************************** * * * The symbol is already in the symbol table. * * * ***********************************************************************/ free(symbol); } } /***************************************************************************** * * * Return the token for the parser. * * * *****************************************************************************/ return token ; }