Implementation and Analysis of Huffman Coding

With Huffman coding, we try to compress data by encoding symbols as Huffman codes generated in a Huffman tree. To uncompress the data, we rebuild the Huffman tree used in the compression process and convert each code back to the symbol it represents. In the implementation presented here, a symbol in the original data is one byte.

huffman_compress

The huffman_compress operation (see Example 14.3) compresses data using Huffman coding. It begins by scanning the data to determine the frequency of each symbol. The frequencies are placed in an array, freqs. After scanning the data, the frequencies are scaled so that each can be represented in a single byte. This is done by determining the maximum number of times any symbol occurs in the data and adjusting the other frequencies accordingly. Since symbols that do not occur in the data should be the only ones with frequencies of 0, we perform a simple test to ensure that any nonzero frequencies that scale to less than 1 end up being set to 1 instead of 0.

Once we have determined and scaled the frequencies, we call build_tree to build the Huffman tree. The build_tree function begins by inserting into a priority queue one binary tree for each symbol occurring at least once in the data. Nodes in the trees are HuffNode structures (see Example 14.1). This structure consists of two members: symbol is a symbol from the data (used only in leaf nodes), and freq is a frequency. Each tree initially contains only a single node, which stores one symbol and its scaled frequency as recorded and scaled in the freqs array.

To build the Huffman tree, we use a loop to perform size - 1 merges of the trees within the priority queue. On each iteration, we call pqueue_extract twice to extract the two binary trees whose root nodes have the smallest frequencies. We then sum the frequencies, merge the trees into a new one, store the sum of the frequencies in the new tree’s root, and insert the new tree back into the priority queue. We continue this process until, after size - 1 iterations, the only tree remaining in the priority queue is the final Huffman tree.

Using the Huffman tree built in the previous step, we call build_table to build a table of the Huffman codes assigned to every symbol. Each entry in the table is a HuffCode structure. This structure consists of three members: used is a flag set to 1 or indicating whether the entry has a code stored in it, code is the Huffman code stored in the entry, and size is the number of bits the code contains. Each code is a short integer because it can be proven (although this is not shown here) that when all frequencies are scaled to fit within one byte, no code will be longer than 16 bits.

We build the table by traversing the Huffman tree using a preorder traversal (see Chapter 9). In each activation of build_table, code keeps track of the current Huffman code being generated, and size maintains the number of bits it contains. As we traverse the tree, each time we move to the left, we append to the code; each time we move to the right, we append 1. Once we encounter a leaf node, we store the Huffman code into the table of codes at the appropriate entry. As we store each code, we call the network function htons as a convenient way to ensure that the code is stored in big-endian format. This is the format required when we actually generate the compressed data in the next step as well as when we uncompress it.

While generating the compressed data, we use ipos to keep track of the current byte being processed in the original data, and opos to keep track of the current bit we are writing to the buffer of compressed data. To begin, we write a header that will help to rebuild the Huffman tree in huffman_uncompress. The header contains a four-byte value for the number of symbols about to be encoded followed by the scaled frequencies of all 256 possible symbols, including those that are 0. Finally, to encode the data, we read one symbol at a time, look up its Huffman code in the table, and write each code to the compressed buffer. We allocate space for each byte in the compressed buffer as we need it.

The runtime complexity of huffman_compress is O (n), where n is the number of symbols in the original data. Only two parts of the algorithm depend on the size of the data: the part in which we determine the frequency of each symbol, and the part in which we read the data so we can compress it. Each of these runs in O (n) time. The time to build the Huffman tree does not affect the complexity of huffman_compress because the running time of this process depends only on the number of different symbols in the data, which in this implementation is a constant, 256.

huffman_uncompress

The huffman_uncompress operation (see Example 14.3) uncompresses data compressed with huffman_compress. This operation begins by reading the header prepended to the compressed data. Recall that the first four bytes of the header contain the number of encoded symbols. This value is stored in size. The next 256 bytes contain the scaled frequencies for all symbols.

Using the information stored in the header, we call build_tree to rebuild the Huffman tree used in compressing the data. Once we have rebuilt the tree, the next step is to generate the buffer of restored data. To do this, we read the compressed data bit by bit. Starting at the root of the Huffman tree, whenever we encounter a bit that is in the data, we move to the left; whenever we encounter a bit that is 1, we move to the right. Once we encounter a leaf node, we have obtained the Huffman code for a symbol. The decoded symbol resides in the leaf. Thus, we write this symbol to the buffer of restored data. After writing the symbol, we reposition ourselves at the root of the tree and repeat the process. We use ipos to keep track of the current bit being processed in the compressed data, and opos to keep track of the current byte we are writing to the buffer of restored data. Once opos reaches size, we have regenerated all of the symbols from the original data.

The runtime complexity of huffman_uncompress is O (n), where n is the number of symbols in the original data. This is because for each of the n symbols we decode, the number of levels we must descend in the Huffman tree is a bounded constant that depends on the number of different symbols in the data: in this implementation, 256. The time to build the Huffman tree does not affect the complexity of huffman_uncompress because this process depends only on the number of different symbols in the data.

Example 14.3. Implementation of Huffman Coding
/*****************************************************************************
*                                                                            *
*  ------------------------------- huffman.c ------------------------------  *
*                                                                            *
*****************************************************************************/

#include <limits.h>
#include <netinet/in.h>
#include <stdlib.h>
#include <string.h>

#include "bit.h"
#include "bitree.h"
#include "compress.h"
#include "pqueue.h"

/*****************************************************************************
*                                                                            *
*  ----------------------------- compare_freq -----------------------------  *
*                                                                            *
*****************************************************************************/

static int compare_freq(const void *tree1, const void *tree2) {

HuffNode           *root1,
                   *root2;

/*****************************************************************************
*                                                                            *
*  Compare the frequencies stored in the root nodes of two binary trees.     *
*                                                                            *
*****************************************************************************/

root1 = (HuffNode *)bitree_data(bitree_root((const BiTree *)tree1));
root2 = (HuffNode *)bitree_data(bitree_root((const BiTree *)tree2));

if (root1->freq < root2->freq)
   return 1;
else if (root1->freq > root2->freq)
   return -1;
else
   return 0;

}

/*****************************************************************************
*                                                                            *
*  ----------------------------- destroy_tree -----------------------------  *
*                                                                            *
*****************************************************************************/

static void destroy_tree(void *tree) {

/*****************************************************************************
*                                                                            *
*  Destroy and free one binary tree from the priority queue of trees.        *
*                                                                            *
*****************************************************************************/

bitree_destroy(tree);
free(tree);

return;

}

/*****************************************************************************
*                                                                            *
*  ------------------------------ build_tree ------------------------------  *
*                                                                            *
*****************************************************************************/

static int build_tree(int *freqs, BiTree **tree) {

BiTree             *init,
                   *merge,
                   *left,
                   *right;

PQueue             pqueue;

HuffNode           *data;

int                size,
                   c;

/*****************************************************************************
*                                                                            *
*  Initialize the priority queue of binary trees.                            *
*                                                                            *
*****************************************************************************/

*tree = NULL;

pqueue_init(&pqueue, compare_freq, destroy_tree);

for (c = 0; c <= UCHAR_MAX; c++) {

   if (freqs[c] != 0) {

      /***********************************************************************
      *                                                                      *
      *  Set up a binary tree for the current symbol and its frequency.      *
      *                                                                      *
      ***********************************************************************/

      if ((init = (BiTree *)malloc(sizeof(BiTree))) == NULL) {

         pqueue_destroy(&pqueue);
         return -1;

      }

      bitree_init(init, free);

      if ((data = (HuffNode *)malloc(sizeof(HuffNode))) == NULL) {

         pqueue_destroy(&pqueue);
         return -1;

      }

      data->symbol = c;
      data->freq = freqs[c];

      if (bitree_ins_left(init, NULL, data) != 0) {

         free(data);
         bitree_destroy(init);
         free(init);
         pqueue_destroy(&pqueue);
         return -1;

      }

      /***********************************************************************
      *                                                                      *
      *  Insert the binary tree into the priority queue.                     *
      *                                                                      *
      ***********************************************************************/

      if (pqueue_insert(&pqueue, init) != 0) {

         bitree_destroy(init);
         free(init);
         pqueue_destroy(&pqueue);
         return -1;

      }

   }

}

/*****************************************************************************
*                                                                            *
*  Build a Huffman tree by merging trees in the priority queue.              *
*                                                                            *
*****************************************************************************/

size = pqueue_size(&pqueue);

for (c = 1; c <= size - 1; c++) {

   /**************************************************************************
   *                                                                         *
   *  Allocate storage for the next merged tree.                             *
   *                                                                         *
   **************************************************************************/

   if ((merge = (BiTree *)malloc(sizeof(BiTree))) == NULL) {

      pqueue_destroy(&pqueue);
      return -1;

   }

   /**************************************************************************
   *                                                                         *
   *  Extract the two trees whose root nodes have the smallest frequencies.  *
   *                                                                         *
   **************************************************************************/

   if (pqueue_extract(&pqueue, (void **)&left) != 0) {

      pqueue_destroy(&pqueue);
      free(merge);
      return -1;

   }

   if (pqueue_extract(&pqueue, (void **)&right) != 0) {

      pqueue_destroy(&pqueue);
      free(merge);
      return -1;

   }

   /**************************************************************************
   *                                                                         *
   *  Allocate storage for the data in the root node of the merged tree.     *
   *                                                                         *
   **************************************************************************/

   if ((data = (HuffNode *)malloc(sizeof(HuffNode))) == NULL) {

      pqueue_destroy(&pqueue);
      free(merge);
      return -1;

   }

   memset(data, 0, sizeof(HuffNode));

   /**************************************************************************
   *                                                                         *
   *  Sum the frequencies in the root nodes of the trees being merged.       *
   *                                                                         *
   **************************************************************************/

   data->freq = ((HuffNode *)bitree_data(bitree_root(left)))->freq +
      ((HuffNode *)bitree_data(bitree_root(right)))->freq;

   /**************************************************************************
   *                                                                         *
   *  Merge the two trees.                                                   *
   *                                                                         *
   **************************************************************************/

   if (bitree_merge(merge, left, right, data) != 0) {

      pqueue_destroy(&pqueue);
      free(merge);
      return -1;

   }

   /**************************************************************************
   *                                                                         *
   *  Insert the merged tree into the priority queue and free the others.    *
   *                                                                         *
   **************************************************************************/

   if (pqueue_insert(&pqueue, merge) != 0) {

      pqueue_destroy(&pqueue);
      bitree_destroy(merge);
      free(merge);
      return -1;

   }

   free(left);
   free(right);

}

/*****************************************************************************
*                                                                            *
*  The last tree in the priority queue is the Huffman tree.                  *
*                                                                            *
*****************************************************************************/

if (pqueue_extract(&pqueue, (void **)tree) != 0) {

   pqueue_destroy(&pqueue);
   return -1;

   }

else {

   pqueue_destroy(&pqueue);

}

return 0;

}

/*****************************************************************************
*                                                                            *
*  ------------------------------ build_table -----------------------------  *
*                                                                            *
*****************************************************************************/

static void build_table(BiTreeNode *node, unsigned short code, unsigned char
   size, HuffCode *table) {

if (!bitree_is_eob(node)) {

   if (!bitree_is_eob(bitree_left(node))) {

      /***********************************************************************
      *                                                                      *
      *  Move to the left and append 0 to the current code.                  *
      *                                                                      *
      ***********************************************************************/

      build_table(bitree_left(node), code << 1, size + 1, table);

   }

   if (!bitree_is_eob(bitree_right(node))) {

      /***********************************************************************
      *                                                                      *
      *  Move to the right and append 1 to the current code.                 *
      *                                                                      *
      ***********************************************************************/

      build_table(bitree_right(node), (code << 1) | 0x0001, size + 1, table);

   }

   if (bitree_is_eob(bitree_left(node))&&bitree_is_eob(bitree_right(node))) {

      /***********************************************************************
      *                                                                      *
      *  Ensure that the current code is in big-endian format.               *
      *                                                                      *
      ***********************************************************************/

      code = htons(code);

      /***********************************************************************
      *                                                                      *
      *  Assign the current code to the symbol in the leaf node.             *
      *                                                                      *
      ***********************************************************************/

      table[((HuffNode *)bitree_data(node))->symbol].used = 1;
      table[((HuffNode *)bitree_data(node))->symbol].code = code;
      table[((HuffNode *)bitree_data(node))->symbol].size = size;

   }

}

return;

}

/*****************************************************************************
*                                                                            *
*  --------------------------- huffman_compress ---------------------------  *
*                                                                            *
*****************************************************************************/

int huffman_compress(const unsigned char *original, unsigned char
   **compressed, int size) {

BiTree             *tree;
HuffCode           table[UCHAR_MAX + 1];

int                freqs[UCHAR_MAX + 1],
                   max,
                   scale,
                   hsize,
                   ipos,
                   opos,
                   cpos,
                   c,
                   i;

unsigned char      *comp,
                   *temp;

/*****************************************************************************
*                                                                            *
*  Initially, there is no buffer of compressed data.                         *
*                                                                            *
*****************************************************************************/

*compressed = NULL;

/*****************************************************************************
*                                                                            *
*  Get the frequency of each symbol in the original data.                    *
*                                                                            *
*****************************************************************************/

for (c = 0; c <= UCHAR_MAX; c++)
   freqs[c] = 0;

ipos = 0;

if (size > 0) {

   while (ipos < size) {

      freqs[original[ipos]]++;
      ipos++;

   }

}

/*****************************************************************************
*                                                                            *
*  Scale the frequencies to fit into one byte.                               *
*                                                                            *
*****************************************************************************/

max = UCHAR_MAX;

for (c = 0; c <= UCHAR_MAX; c++) {

   if (freqs[c] > max)
      max = freqs[c];

}

for (c = 0; c <= UCHAR_MAX; c++) {

   scale = (int)(freqs[c] / ((double)max / (double)UCHAR_MAX));

   if (scale == 0 && freqs[c] != 0)
      freqs[c] = 1;
   else
      freqs[c] = scale;

}

/*****************************************************************************
*                                                                            *
*  Build the Huffman tree and table of codes for the data.                   *
*                                                                            *
*****************************************************************************/

if (build_tree(freqs, &tree) != 0)
   return -1;

for (c = 0; c <= UCHAR_MAX; c++)
   memset(&table[c], 0, sizeof(HuffCode));

build_table(bitree_root(tree), 0x0000, 0, table);

bitree_destroy(tree);
free(tree);

/*****************************************************************************
*                                                                            *
*  Write the header information.                                             *
*                                                                            *
*****************************************************************************/

hsize = sizeof(int) + (UCHAR_MAX + 1);

if ((comp = (unsigned char *)malloc(hsize)) == NULL)
   return -1;

memcpy(comp, &size, sizeof(int));

for (c = 0; c <= UCHAR_MAX; c++)
   comp[sizeof(int) + c] = (unsigned char)freqs[c];

/*****************************************************************************
*                                                                            *
*  Compress the data.                                                        *
*                                                                            *
*****************************************************************************/

ipos = 0;
opos = hsize * 8;

while (ipos < size) {

   /**************************************************************************
   *                                                                         *
   *  Get the next symbol in the original data.                              *
   *                                                                         *
   **************************************************************************/

   c = original[ipos];

   /**************************************************************************
   *                                                                         *
   *  Write the code for the symbol to the buffer of compressed data.        *
   *                                                                         *
   **************************************************************************/

   for (i = 0; i < table[c].size; i++) {

      if (opos % 8 == 0) {

         /********************************************************************
         *                                                                   *
         *  Allocate another byte for the buffer of compressed data.         *
         *                                                                   *
         ********************************************************************/

         if ((temp = (unsigned char *)realloc(comp,(opos / 8) + 1)) == NULL) {

            free(comp);
            return -1;

         }


         comp = temp;

      }

      cpos = (sizeof(short) * 8) - table[c].size + i;
      bit_set(comp, opos, bit_get((unsigned char *)&table[c].code, cpos));
      opos++;

   }

   ipos++;

}

/*****************************************************************************
*                                                                            *
*  Point to the buffer of compressed data.                                   *
*                                                                            *
*****************************************************************************/

*compressed = comp;

/*****************************************************************************
*                                                                            *
*  Return the number of bytes in the compressed data.                        *
*                                                                            *
*****************************************************************************/

return ((opos - 1) / 8) + 1;

}

/*****************************************************************************
*                                                                            *
*  -------------------------- huffman_uncompress --------------------------  *
*                                                                            *
*****************************************************************************/

int huffman_uncompress(const unsigned char *compressed, unsigned char
   **original) {

BiTree             *tree;
BiTreeNode         *node;

int                freqs[UCHAR_MAX + 1],
                   hsize,
                   size,
                   ipos,
                   opos,
                   state,
                   c;

unsigned char      *orig,
                   *temp;

/*****************************************************************************
*                                                                            *
*  Initially there is no buffer of original data.                            *
*                                                                            *
*****************************************************************************/

*original = orig = NULL;

/*****************************************************************************
*                                                                            *
*  Get the header information from the buffer of compressed data.            *
*                                                                            *
*****************************************************************************/

hsize = sizeof(int) + (UCHAR_MAX + 1);
memcpy(&size, compressed, sizeof(int));

for (c = 0; c <= UCHAR_MAX; c++)
   freqs[c] = compressed[sizeof(int) + c];

/*****************************************************************************
*                                                                            *
*  Rebuild the Huffman tree used previously to compress the data.            *
*                                                                            *
*****************************************************************************/

if (build_tree(freqs, &tree) != 0)
   return -1;

/*****************************************************************************
*                                                                            *
*  Uncompress the data.                                                      *
*                                                                            *
*****************************************************************************/

ipos = hsize * 8;
opos = 0;
node = bitree_root(tree);

while (opos < size) {

   /**************************************************************************
   *                                                                         *
   *  Get the next bit in the compressed data.                               *
   *                                                                         *
   **************************************************************************/

   state = bit_get(compressed, ipos);
   ipos++;

   if (state == 0) {

      /***********************************************************************
      *                                                                      *
      *  Move to the left.                                                   *
      *                                                                      *
      ***********************************************************************/

      if (bitree_is_eob(node) || bitree_is_eob(bitree_left(node))) {

         bitree_destroy(tree);
         free(tree);
         return -1;

         }

      else
         node = bitree_left(node);

      }

   else {

      /***********************************************************************
      *                                                                      *
      *  Move to the right.                                                  *
      *                                                                      *
      ***********************************************************************/

      if (bitree_is_eob(node) || bitree_is_eob(bitree_right(node))) {

         bitree_destroy(tree);
         free(tree);
         return -1;

         }

      else
         node = bitree_right(node);

   }

   if (bitree_is_eob(bitree_left(node))&&bitree_is_eob(bitree_right(node))) {

      /***********************************************************************
      *                                                                      *
      *  Write the symbol in the leaf node to the buffer of original data.   *
      *                                                                      *
      ***********************************************************************/

      if (opos > 0) {

         if ((temp = (unsigned char *)realloc(orig, opos + 1)) == NULL) {

            bitree_destroy(tree);
            free(tree);
            free(orig);
            return -1;

         }

         orig = temp;

         }

      else {

         if ((orig = (unsigned char *)malloc(1)) == NULL) {

            bitree_destroy(tree);
            free(tree);
            return -1;

         }

      }

      orig[opos] = ((HuffNode *)bitree_data(node))->symbol;
      opos++;

      /***********************************************************************
      *                                                                      *
      *  Move back to the top of the tree.                                   *
      *                                                                      *
      ***********************************************************************/

      node = bitree_root(tree);

   }

}

bitree_destroy(tree);
free(tree);

/*****************************************************************************
*                                                                            *
*  Point to the buffer of original data.                                     *
*                                                                            *
*****************************************************************************/

*original = orig;

/*****************************************************************************
*                                                                            *
*  Return the number of bytes in the original data.                          *
*                                                                            *
*****************************************************************************/

return opos;

}
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset