With Huffman coding, we try to compress data by encoding symbols as Huffman codes generated in a Huffman tree. To uncompress the data, we rebuild the Huffman tree used in the compression process and convert each code back to the symbol it represents. In the implementation presented here, a symbol in the original data is one byte.
The huffman_compress operation
(see Example 14.3) compresses data
using Huffman coding. It begins by scanning the data to determine
the frequency of each symbol. The frequencies are placed in an
array, freqs
. After scanning the data,
the frequencies are scaled so that each can be represented in a
single byte. This is done by determining the maximum number of times
any symbol occurs in the data and adjusting the other frequencies
accordingly. Since symbols that do not occur in the data should be
the only ones with frequencies of 0, we perform a simple test to
ensure that any nonzero frequencies that scale to less than 1 end up
being set to 1 instead of 0.
Once we have determined and scaled the frequencies, we call
build_tree to build the Huffman tree. The
build_tree function begins by inserting into a
priority queue one binary tree for each symbol occurring at least
once in the data. Nodes in the trees are
HuffNode
structures (see Example 14.1). This structure consists of two members:
symbol
is a symbol from the data (used
only in leaf nodes), and freq
is a
frequency. Each tree initially contains only a single node, which
stores one symbol and its scaled frequency as recorded and scaled in
the freqs
array.
To build the Huffman tree, we use a loop to perform
size
- 1 merges of the trees within the
priority queue. On each iteration, we call
pqueue_extract twice to extract the two binary
trees whose root nodes have the smallest frequencies. We then sum
the frequencies, merge the trees into a new one, store the sum of
the frequencies in the new tree’s root, and insert the new tree back
into the priority queue. We continue this process until, after
size
- 1 iterations, the only tree
remaining in the priority queue is the final Huffman tree.
Using the Huffman tree built in the previous step, we call
build_table to build a table of the Huffman codes assigned to every
symbol. Each entry in the table is a
HuffCode
structure. This structure
consists of three members: used
is a flag
set to 1 or indicating whether the entry has a code stored in it,
code
is the Huffman code stored in the
entry, and size
is the number of bits the
code contains. Each code is a short integer because it can be proven
(although this is not shown here) that when all frequencies are
scaled to fit within one byte, no code will be longer than 16
bits.
We build the table by traversing the Huffman tree using a
preorder traversal (see Chapter
9). In each activation of build_table,
code
keeps track of the current Huffman
code being generated, and size
maintains
the number of bits it contains. As we traverse the tree, each time
we move to the left, we append to the code; each time we move to the
right, we append 1. Once we encounter a leaf node, we store the
Huffman code into the table of codes at the appropriate entry. As we
store each code, we call the network function
htons as a convenient way to ensure that the
code is stored in big-endian format. This is the format required
when we actually generate the compressed data in the next step as
well as when we uncompress it.
While generating the compressed data, we use
ipos
to keep track of the current byte
being processed in the original data, and
opos
to keep track of the current bit we
are writing to the buffer of compressed data. To begin, we write a
header that will help to rebuild the Huffman tree in
huffman_uncompress. The header contains a
four-byte value for the number of symbols about to be encoded
followed by the scaled frequencies of all 256 possible symbols,
including those that are 0. Finally, to encode the data, we read one
symbol at a time, look up its Huffman code in the table, and write
each code to the compressed buffer. We allocate space for each byte
in the compressed buffer as we need it.
The runtime complexity of huffman_compress is O (n), where n is the number of symbols in the original data. Only two parts of the algorithm depend on the size of the data: the part in which we determine the frequency of each symbol, and the part in which we read the data so we can compress it. Each of these runs in O (n) time. The time to build the Huffman tree does not affect the complexity of huffman_compress because the running time of this process depends only on the number of different symbols in the data, which in this implementation is a constant, 256.
The huffman_uncompress operation
(see Example 14.3) uncompresses
data compressed with huffman_compress. This
operation begins by reading the header prepended to the compressed
data. Recall that the first four bytes of the header contain the
number of encoded symbols. This value is stored in
size
. The next 256 bytes contain the
scaled frequencies for all symbols.
Using the information stored in the header, we call
build_tree to rebuild the Huffman tree used in
compressing the data. Once we have rebuilt the tree, the next step
is to generate the buffer of restored data. To do this, we read the
compressed data bit by bit. Starting at the root of the Huffman
tree, whenever we encounter a bit that is in the data, we move to
the left; whenever we encounter a bit that is 1, we move to the
right. Once we encounter a leaf node, we have obtained the Huffman
code for a symbol. The decoded symbol resides in the leaf. Thus, we
write this symbol to the buffer of restored data. After writing the
symbol, we reposition ourselves at the root of the tree and repeat
the process. We use ipos
to keep track of
the current bit being processed in the compressed data, and
opos
to keep track of the current byte we
are writing to the buffer of restored data. Once
opos
reaches
size
, we have regenerated all of the
symbols from the original data.
The runtime complexity of huffman_uncompress is O (n), where n is the number of symbols in the original data. This is because for each of the n symbols we decode, the number of levels we must descend in the Huffman tree is a bounded constant that depends on the number of different symbols in the data: in this implementation, 256. The time to build the Huffman tree does not affect the complexity of huffman_uncompress because this process depends only on the number of different symbols in the data.
/***************************************************************************** * * * ------------------------------- huffman.c ------------------------------ * * * *****************************************************************************/ #include <limits.h> #include <netinet/in.h> #include <stdlib.h> #include <string.h> #include "bit.h" #include "bitree.h" #include "compress.h" #include "pqueue.h" /***************************************************************************** * * * ----------------------------- compare_freq ----------------------------- * * * *****************************************************************************/ static int compare_freq(const void *tree1, const void *tree2) { HuffNode *root1, *root2; /***************************************************************************** * * * Compare the frequencies stored in the root nodes of two binary trees. * * * *****************************************************************************/ root1 = (HuffNode *)bitree_data(bitree_root((const BiTree *)tree1)); root2 = (HuffNode *)bitree_data(bitree_root((const BiTree *)tree2)); if (root1->freq < root2->freq) return 1; else if (root1->freq > root2->freq) return -1; else return 0; } /***************************************************************************** * * * ----------------------------- destroy_tree ----------------------------- * * * *****************************************************************************/ static void destroy_tree(void *tree) { /***************************************************************************** * * * Destroy and free one binary tree from the priority queue of trees. * * * *****************************************************************************/ bitree_destroy(tree); free(tree); return; } /***************************************************************************** * * * ------------------------------ build_tree ------------------------------ * * * *****************************************************************************/ static int build_tree(int *freqs, BiTree **tree) { BiTree *init, *merge, *left, *right; PQueue pqueue; HuffNode *data; int size, c; /***************************************************************************** * * * Initialize the priority queue of binary trees. * * * *****************************************************************************/ *tree = NULL; pqueue_init(&pqueue, compare_freq, destroy_tree); for (c = 0; c <= UCHAR_MAX; c++) { if (freqs[c] != 0) { /*********************************************************************** * * * Set up a binary tree for the current symbol and its frequency. * * * ***********************************************************************/ if ((init = (BiTree *)malloc(sizeof(BiTree))) == NULL) { pqueue_destroy(&pqueue); return -1; } bitree_init(init, free); if ((data = (HuffNode *)malloc(sizeof(HuffNode))) == NULL) { pqueue_destroy(&pqueue); return -1; } data->symbol = c; data->freq = freqs[c]; if (bitree_ins_left(init, NULL, data) != 0) { free(data); bitree_destroy(init); free(init); pqueue_destroy(&pqueue); return -1; } /*********************************************************************** * * * Insert the binary tree into the priority queue. * * * ***********************************************************************/ if (pqueue_insert(&pqueue, init) != 0) { bitree_destroy(init); free(init); pqueue_destroy(&pqueue); return -1; } } } /***************************************************************************** * * * Build a Huffman tree by merging trees in the priority queue. * * * *****************************************************************************/ size = pqueue_size(&pqueue); for (c = 1; c <= size - 1; c++) { /************************************************************************** * * * Allocate storage for the next merged tree. * * * **************************************************************************/ if ((merge = (BiTree *)malloc(sizeof(BiTree))) == NULL) { pqueue_destroy(&pqueue); return -1; } /************************************************************************** * * * Extract the two trees whose root nodes have the smallest frequencies. * * * **************************************************************************/ if (pqueue_extract(&pqueue, (void **)&left) != 0) { pqueue_destroy(&pqueue); free(merge); return -1; } if (pqueue_extract(&pqueue, (void **)&right) != 0) { pqueue_destroy(&pqueue); free(merge); return -1; } /************************************************************************** * * * Allocate storage for the data in the root node of the merged tree. * * * **************************************************************************/ if ((data = (HuffNode *)malloc(sizeof(HuffNode))) == NULL) { pqueue_destroy(&pqueue); free(merge); return -1; } memset(data, 0, sizeof(HuffNode)); /************************************************************************** * * * Sum the frequencies in the root nodes of the trees being merged. * * * **************************************************************************/ data->freq = ((HuffNode *)bitree_data(bitree_root(left)))->freq + ((HuffNode *)bitree_data(bitree_root(right)))->freq; /************************************************************************** * * * Merge the two trees. * * * **************************************************************************/ if (bitree_merge(merge, left, right, data) != 0) { pqueue_destroy(&pqueue); free(merge); return -1; } /************************************************************************** * * * Insert the merged tree into the priority queue and free the others. * * * **************************************************************************/ if (pqueue_insert(&pqueue, merge) != 0) { pqueue_destroy(&pqueue); bitree_destroy(merge); free(merge); return -1; } free(left); free(right); } /***************************************************************************** * * * The last tree in the priority queue is the Huffman tree. * * * *****************************************************************************/ if (pqueue_extract(&pqueue, (void **)tree) != 0) { pqueue_destroy(&pqueue); return -1; } else { pqueue_destroy(&pqueue); } return 0; } /***************************************************************************** * * * ------------------------------ build_table ----------------------------- * * * *****************************************************************************/ static void build_table(BiTreeNode *node, unsigned short code, unsigned char size, HuffCode *table) { if (!bitree_is_eob(node)) { if (!bitree_is_eob(bitree_left(node))) { /*********************************************************************** * * * Move to the left and append 0 to the current code. * * * ***********************************************************************/ build_table(bitree_left(node), code << 1, size + 1, table); } if (!bitree_is_eob(bitree_right(node))) { /*********************************************************************** * * * Move to the right and append 1 to the current code. * * * ***********************************************************************/ build_table(bitree_right(node), (code << 1) | 0x0001, size + 1, table); } if (bitree_is_eob(bitree_left(node))&&bitree_is_eob(bitree_right(node))) { /*********************************************************************** * * * Ensure that the current code is in big-endian format. * * * ***********************************************************************/ code = htons(code); /*********************************************************************** * * * Assign the current code to the symbol in the leaf node. * * * ***********************************************************************/ table[((HuffNode *)bitree_data(node))->symbol].used = 1; table[((HuffNode *)bitree_data(node))->symbol].code = code; table[((HuffNode *)bitree_data(node))->symbol].size = size; } } return; } /***************************************************************************** * * * --------------------------- huffman_compress --------------------------- * * * *****************************************************************************/ int huffman_compress(const unsigned char *original, unsigned char **compressed, int size) { BiTree *tree; HuffCode table[UCHAR_MAX + 1]; int freqs[UCHAR_MAX + 1], max, scale, hsize, ipos, opos, cpos, c, i; unsigned char *comp, *temp; /***************************************************************************** * * * Initially, there is no buffer of compressed data. * * * *****************************************************************************/ *compressed = NULL; /***************************************************************************** * * * Get the frequency of each symbol in the original data. * * * *****************************************************************************/ for (c = 0; c <= UCHAR_MAX; c++) freqs[c] = 0; ipos = 0; if (size > 0) { while (ipos < size) { freqs[original[ipos]]++; ipos++; } } /***************************************************************************** * * * Scale the frequencies to fit into one byte. * * * *****************************************************************************/ max = UCHAR_MAX; for (c = 0; c <= UCHAR_MAX; c++) { if (freqs[c] > max) max = freqs[c]; } for (c = 0; c <= UCHAR_MAX; c++) { scale = (int)(freqs[c] / ((double)max / (double)UCHAR_MAX)); if (scale == 0 && freqs[c] != 0) freqs[c] = 1; else freqs[c] = scale; } /***************************************************************************** * * * Build the Huffman tree and table of codes for the data. * * * *****************************************************************************/ if (build_tree(freqs, &tree) != 0) return -1; for (c = 0; c <= UCHAR_MAX; c++) memset(&table[c], 0, sizeof(HuffCode)); build_table(bitree_root(tree), 0x0000, 0, table); bitree_destroy(tree); free(tree); /***************************************************************************** * * * Write the header information. * * * *****************************************************************************/ hsize = sizeof(int) + (UCHAR_MAX + 1); if ((comp = (unsigned char *)malloc(hsize)) == NULL) return -1; memcpy(comp, &size, sizeof(int)); for (c = 0; c <= UCHAR_MAX; c++) comp[sizeof(int) + c] = (unsigned char)freqs[c]; /***************************************************************************** * * * Compress the data. * * * *****************************************************************************/ ipos = 0; opos = hsize * 8; while (ipos < size) { /************************************************************************** * * * Get the next symbol in the original data. * * * **************************************************************************/ c = original[ipos]; /************************************************************************** * * * Write the code for the symbol to the buffer of compressed data. * * * **************************************************************************/ for (i = 0; i < table[c].size; i++) { if (opos % 8 == 0) { /******************************************************************** * * * Allocate another byte for the buffer of compressed data. * * * ********************************************************************/ if ((temp = (unsigned char *)realloc(comp,(opos / 8) + 1)) == NULL) { free(comp); return -1; } comp = temp; } cpos = (sizeof(short) * 8) - table[c].size + i; bit_set(comp, opos, bit_get((unsigned char *)&table[c].code, cpos)); opos++; } ipos++; } /***************************************************************************** * * * Point to the buffer of compressed data. * * * *****************************************************************************/ *compressed = comp; /***************************************************************************** * * * Return the number of bytes in the compressed data. * * * *****************************************************************************/ return ((opos - 1) / 8) + 1; } /***************************************************************************** * * * -------------------------- huffman_uncompress -------------------------- * * * *****************************************************************************/ int huffman_uncompress(const unsigned char *compressed, unsigned char **original) { BiTree *tree; BiTreeNode *node; int freqs[UCHAR_MAX + 1], hsize, size, ipos, opos, state, c; unsigned char *orig, *temp; /***************************************************************************** * * * Initially there is no buffer of original data. * * * *****************************************************************************/ *original = orig = NULL; /***************************************************************************** * * * Get the header information from the buffer of compressed data. * * * *****************************************************************************/ hsize = sizeof(int) + (UCHAR_MAX + 1); memcpy(&size, compressed, sizeof(int)); for (c = 0; c <= UCHAR_MAX; c++) freqs[c] = compressed[sizeof(int) + c]; /***************************************************************************** * * * Rebuild the Huffman tree used previously to compress the data. * * * *****************************************************************************/ if (build_tree(freqs, &tree) != 0) return -1; /***************************************************************************** * * * Uncompress the data. * * * *****************************************************************************/ ipos = hsize * 8; opos = 0; node = bitree_root(tree); while (opos < size) { /************************************************************************** * * * Get the next bit in the compressed data. * * * **************************************************************************/ state = bit_get(compressed, ipos); ipos++; if (state == 0) { /*********************************************************************** * * * Move to the left. * * * ***********************************************************************/ if (bitree_is_eob(node) || bitree_is_eob(bitree_left(node))) { bitree_destroy(tree); free(tree); return -1; } else node = bitree_left(node); } else { /*********************************************************************** * * * Move to the right. * * * ***********************************************************************/ if (bitree_is_eob(node) || bitree_is_eob(bitree_right(node))) { bitree_destroy(tree); free(tree); return -1; } else node = bitree_right(node); } if (bitree_is_eob(bitree_left(node))&&bitree_is_eob(bitree_right(node))) { /*********************************************************************** * * * Write the symbol in the leaf node to the buffer of original data. * * * ***********************************************************************/ if (opos > 0) { if ((temp = (unsigned char *)realloc(orig, opos + 1)) == NULL) { bitree_destroy(tree); free(tree); free(orig); return -1; } orig = temp; } else { if ((orig = (unsigned char *)malloc(1)) == NULL) { bitree_destroy(tree); free(tree); return -1; } } orig[opos] = ((HuffNode *)bitree_data(node))->symbol; opos++; /*********************************************************************** * * * Move back to the top of the tree. * * * ***********************************************************************/ node = bitree_root(tree); } } bitree_destroy(tree); free(tree); /***************************************************************************** * * * Point to the buffer of original data. * * * *****************************************************************************/ *original = orig; /***************************************************************************** * * * Return the number of bytes in the original data. * * * *****************************************************************************/ return opos; }