Huffman Compression 423
After writing the code book, the header uses the next 4 bytes (32 bits) to write the
length of the article. This is an unsigned integer and can be as large as 2
32
− 1, more than
four billion. A 1000-page novel has about 500,000 words, a few million characters. Thus, 32
bits are sufficient. After the length, a new line character is written to the file, signifying the
end of the header.
The output file cannot be viewed easily in a text editor, because the data is compressed
in bit representations. We can use the xxd program in Linux to see the hexadecimal values
that represent each byte in the file. The following shows the compressed file in hexadecimal
format using xxd.
0000000: e8e4 f0d4 c5ce 639c e0b2 0000 000a e9d7 ......c.........
0000010: af7d f7df 7df7 ce73 9ce7 3fff ffff ffff .}..}..s..?.....
0000020: ffff ffff 6db6 db6d b6db 6db6 d555 5555 ....m..m..m..UUU
0000030: 5555 5555 5555 5554 0000 0000 0000 0000 UUUUUUUT........
0000040: 00
The following shows the compressed file in binary format using
0000000: 11101000 11100100 11110000 11010100 11000101 11001110 ......
0000006: 01100011 10011100 11100000 10110010 00000000 00000000 c.....
000000c: 00000000 00001010 11101001 11010111 10101111 01111101 .....}
0000012: 11110111 11011111 01111101 11110111 11001110 01110011 ..}..s
0000018: 10011100 11100111 00111111 11111111 11111111 11111111 ..?...
000001e: 11111111 11111111 11111111 11111111 11111111 11111111 ......
0000024: 01101101 10110110 11011011 01101101 10110110 11011011 m..m..
000002a: 01101101 10110110 11010101 01010101 01010101 01010101 m..UUU
0000030: 01010101 01010101 01010101 01010101 01010101 01010101 UUUUUU
0000036: 01010101 01010100 00000000 00000000 00000000 00000000 UT....
000003c: 00000000 00000000 00000000 00000000 00000000 .....
24.3 Decoding
Decoding is the reverse of encoding. The decoder first reconstructs the code tree from
the file header, and then reads the compressed codes of the characters. From the codes, the
decompresser traverses the code tree and outputs the characters stored in the tree’s leaf
nodes. To reconstruct the tree, the decoder needs to know how the tree is represented in the
header. In our case, the code book is encoded using the rules described in Section 24.2.5.
The header contains both commands (1 bit, either 0 or 1) and characters (7 bits). To build
the code tree, the decoder does the following:
• The first bit is a command bit and it is always 1.
• If a command is 1, then the next 7 bits are the value stored in a leaf node. Create a
tree node to store this value. Add this tree node to the beginning of the list. This tree
node is a single-node tree.
• If a command is 0 and the list has only one node, then the complete tree has been
built. If a command is 0 and the list has two or more nodes, then take the first two
nodes from the list, create a tree node as the parent. Add this parent node to the list.
• After the tree is completely built, then read one more bit. If this is not the last
(rightmost) bit of the byte, discard the remaining bits in the byte. The next four