Building a Huffman Code

Before we start studying an algorithm to solve this problem, we should introduce prefix codes: codes in which no code word is also a prefix of some other code word. As you have seen from the proposed variable-length code, this property is essential, since we want to concatenate code words and still be able to decode the result unambiguously afterwards.

For example, using the code shown previously, we encode the string abc as 0101100. Since no code word is a prefix of any other, decoding is vastly simplified, as we can identify the initial code word, translate it, and repeat the process on the remainder of the encoded sequence.
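
To make the decoding procedure concrete, here is a minimal sketch in Python. The code table is assumed from the example above (a mapping to 0, b to 101, c to 100) and is purely illustrative; the decode function is my own naming, not the book's.

    # Code table assumed from the example above (a -> 0, b -> 101, c -> 100);
    # this mapping is illustrative, not the book's full table.
    CODE = {"a": "0", "b": "101", "c": "100"}
    DECODE = {bits: char for char, bits in CODE.items()}

    def decode(encoded):
        """Scan left to right; since no code word is a prefix of another,
        the first complete match is always the correct one."""
        result, current = [], ""
        for bit in encoded:
            current += bit
            if current in DECODE:        # a complete code word has been read
                result.append(DECODE[current])
                current = ""
        return "".join(result)

    print(decode("0101100"))  # prints "abc"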

A convenient (for decoding purposes) representation of prefix codes is a binary tree whose leaves are the characters of the original data sequence. For the proposed variable-length code, we have the following binary tree:

Figure 4.1: A representation of prefix codes

The binary code word for a character is given by the path from the root to that character's leaf, reading the binary digit on each edge along the way. Note that each node also holds the total frequency of the characters in its subtree. Such a tree has some interesting properties. A tree for an optimal prefix code has exactly |C| leaves, C being the alphabet from which the characters are drawn, and exactly |C| - 1 internal nodes. We also know that the number of bits needed to encode all occurrences of a particular character is the frequency of that character multiplied by the depth of the leaf that holds it. As such, the number of bits required to encode the full character sequence is simply the sum of these values over all the characters in the alphabet.
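
As a rough illustration of this cost computation, the following sketch sums frequency times leaf depth over a small hand-built tree. The Node class and the example frequencies are assumptions made here for illustration, not taken from the text.

    # The Node class and the frequencies below are assumptions for illustration.
    class Node:
        def __init__(self, freq, char=None, left=None, right=None):
            self.freq, self.char = freq, char
            self.left, self.right = left, right

    def total_bits(node, depth=0):
        """Sum of frequency * leaf depth over every character in the tree."""
        if node.char is not None:            # leaf holding a single character
            return node.freq * depth
        return total_bits(node.left, depth + 1) + total_bits(node.right, depth + 1)

    # A small hand-built tree for the alphabet {a: 45, b: 13, c: 12}.
    tree = Node(70, left=Node(45, "a"),
                right=Node(25, left=Node(12, "c"), right=Node(13, "b")))
    print(total_bits(tree))                  # 45*1 + 12*2 + 13*2 = 95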

If we can build such a tree, we can compute an optimal prefix code. David A. Huffman invented a greedy algorithm to construct an optimal prefix code, called a Huffman code. Its basic idea is to build the tree in a bottom-up fashion. We start with just the leaves, one for each character. We then repeatedly merge the two least-frequent nodes into a new node whose frequency is the sum of the two, until we are left with a single node, which is the root of the tree.
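
The following sketch shows one way this greedy construction could look in Python, using a min-priority queue (heapq) keyed on frequency. The function name, the tuple-based tree representation, and the example frequencies are assumptions made for illustration, not the book's code.

    import heapq
    from itertools import count

    def huffman(frequencies):
        """Build the code tree bottom-up; leaves are characters, internal
        nodes are (left, right) tuples. frequencies maps char -> frequency."""
        tie = count()                    # tie-breaker so heap entries compare
        heap = [(freq, next(tie), char) for char, freq in frequencies.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)    # the two least-frequent nodes
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
        return heap[0][2]                        # root of the finished tree

    # Example frequencies (made up for illustration).
    print(huffman({"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}))

Assigning 0 to each left edge and 1 to each right edge of the resulting tree then yields the code word for every character, as described earlier.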
