To gain more insight about greedy algorithms, let's look at another problem that is solvable by a greedy algorithm.
Huffman codes represent a way of compressing data effectively. Data is considered to be a sequence of characters. Huffman's greedy algorithm uses a table with the frequency of each character to build up an optimal way of representing each character as a binary string.
To illustrate this, imagine we have a 1,00,000 character data file that we wish to store in a compressed fashion. The frequency of each character in the data file is given by the following table:
Character | a | b | c | d | e | f |
Frequency | 45,000 | 13,000 | 12,000 | 16,000 | 9,000 | 5,000 |
There are various ways to represent this information. For the purpose of this problem, let's say that we want to design a binary character code in which each character is represented by a unique binary string (which we shall call a code word). One option is to use a fixed-length code (for example, each character is represented by a code word of the same size). If we opt for that, we need three bits to represent each the six characters, as shown in the following table:
Character | a | b | c | d | e | f |
Frequency | 45,000 | 13,000 | 12,000 | 16,000 | 9,000 | 5,000 |
Code Word | 000 | 001 | 010 | 011 | 100 | 101 |
Using this method, we need 3,00,000 bits to code the entire sequence of characters. Can we do better?
A variable-length code can do a lot better than a fixed-length code. Since we want to minimize the size of the compressed sequence of bits, we want to give short code words to frequent characters and long code words to infrequent characters. A possible code for this character sequence is shown in the following table:
Character | a | b | c | d | e | f |
Frequency | 45,000 | 13,000 | 12,000 | 16,000 | 9,000 | 5,000 |
Code Word | 0 | 101 | 100 | 111 | 1101 | 1100 |
Instead of 3,00,000 bits, this code requires only 2,24,000 bits to represent the character sequence. Using this code, we save around 28% of space. The code we have presented is also an optimal character code for this sequence, as we shall see.