Chapter 24
Huffman Compression
24.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
24.2 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
24.2.1 Count Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
24.2.2 Sort by Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
24.2.3 Build a Code Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
24.2.4 Build a Code Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
24.2.5 Compress a File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
24.2.6 Compress with Bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
24.3 Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
Section 19.4 introduced binary trees through the example of binary search trees. This
chapter describes another way to use binary trees, in a popular compression technique called
Huffman Compression or Huffman Coding. Huffman Coding was developed by David Huffman
in the early 1950s, while he was still a graduate student at MIT. After more than
60 years, Huffman Coding remains one of the best general-purpose compression algorithms
available: it is fast, effective, and widely used.
It is easier to understand Huffman Coding by comparing it to ASCII. ASCII is a “fixed-
length code” using 8 bits for each letter, even though some letters (such as e and s) are
more common than some others (such as q and z). In contrast, Huffman Compression uses a
"variable-length code". If a letter appears frequently, then it is encoded with fewer bits. If a
letter appears infrequently, then more bits are used. This means that the average length of
all letters is shorter—the information is compressed. Huffman Coding is lossless compression
because the original data can be fully recovered. Lossy compression means the original data
cannot be fully recovered. Lossy compression may achieve higher compression ratios than
lossless compression, and is useful when full recovery of the original data is unnecessary.
Lossy compression is frequently used to compress images and JPEG is an example of lossy
compression. A compression ratio is defined as:
compression ratio = (size of uncompressed file) / (size of compressed file)    (24.1)
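As a quick illustration (the file sizes here are made up, not taken from this chapter's example), a 10,000-byte file compressed to 4,000 bytes has a compression ratio of 2.5. The definition translates directly into C:

```c
/* compression ratio as defined in Eq. (24.1):
   uncompressed size divided by compressed size */
double compressionRatio(double uncompressedSize, double compressedSize)
{
    return uncompressedSize / compressedSize;
}
```

A ratio greater than 1 means the compressed file is smaller than the original.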
This chapter uses Huffman Coding to compress articles written in English. Given a set
of letters (or symbols) and their frequencies, Huffman Coding is optimal because Huffman
Coding uses the fewest bits on average.
24.1 Example
Consider an article with only eight different characters: E, N, G, T, g, p, d, and h. To
encode these characters, only 3 bits are sufficient because 2^3 = 8. That means that each
character can be assigned a unique 3-bit sequence. Next consider the situation when the
letters' frequencies are quite different, as shown in this table:
character frequency
E 0.56%
N 1.12%
G 3.93%
T 3.93%
g 10.67%
p 12.92%
d 25.84%
h 41.01%
Consider the following codes (bit sequences) for the characters. We will explain how to
generate these codes (called a “code book”) in the next section.
character code length
E 1110100 7
N 1110101 7
G 111011 6
T 11100 5
g 1111 4
p 110 3
d 10 2
h 0 1
The average number of bits is 7 × 0.0056 + 7 × 0.0112 + 6 × 0.0393 + 5 × 0.0393 + 4 ×
0.1067 + 3 × 0.1292 + 2 × 0.2584 + 1 × 0.4101 = 2.29. This is a 1 − 2.29/3 ≈ 1 − 0.76 = 24%
reduction in the total length over using 3-bit codes. The code book can be displayed as a
tree, as shown in Fig. 24.1. At any tree node, 0 means moving left and 1 means moving
right.
FIGURE 24.1: Graphical representation of the code book.
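The 2.29 average computed above can be double-checked with a short C function. The lengths and frequencies below are copied from the code book table; the checker itself is not part of the book's code, only a sketch:

```c
/* Code lengths and frequencies copied from the example code book,
   in the order E, N, G, T, g, p, d, h */
static const int lengths[8] = {7, 7, 6, 5, 4, 3, 2, 1};
static const double freqs[8] =
    {0.0056, 0.0112, 0.0393, 0.0393, 0.1067, 0.1292, 0.2584, 0.4101};

/* weighted average number of bits per character */
double averageBits(void)
{
    double sum = 0.0;
    for (int i = 0; i < 8; i++)
    {
        sum += lengths[i] * freqs[i];
    }
    return sum; /* approximately 2.29 */
}
```

Calling averageBits() returns approximately 2.29, matching the hand computation.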
The code book tree has several important properties:

• Characters are stored only at leaf nodes; non-leaf nodes do not store any characters.
If a non-leaf node stored a character, ambiguity would arise. For example, if the root's
right child stored the character 'm', would 10 mean 'mh' or 'd'? To put it another
way, no code is the prefix of another code.

• If a character is more frequent, then its distance to the root is shorter. This means
that its bit sequence will be shorter and the average total length will be shorter.

• If the entire article contains only one character (perhaps repeated many times), then
the code book has only one row in the table and the tree has one node: the root node.

• If the article contains n different characters (n > 1), then the tree has n leaf nodes
and each non-leaf node has exactly two children. It is impossible for a non-leaf node
to have only one child.
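The first property (no code is the prefix of another) can be checked mechanically. The helper below is not part of the book's code; it is a small sketch that compares every pair of codes with strncmp:

```c
#include <string.h>

/* Returns 1 if no code in the array is a prefix of another code,
   0 otherwise. */
int isPrefixFree(const char *codes[], int n)
{
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            /* does codes[i] match the first strlen(codes[i]) bits of codes[j]? */
            if (i != j && strncmp(codes[i], codes[j], strlen(codes[i])) == 0)
            {
                return 0; /* codes[i] is a prefix of codes[j] */
            }
        }
    }
    return 1;
}
```

Applied to the eight codes in the example table, this function returns 1.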
24.2 Encoding
To compress a file, the following steps are needed:
1. Count the frequencies of the characters.
2. Sort the characters by frequency.
3. Build a tree similar to Fig. 24.1.
4. Build the code book.
5. Use the code book to compress the file.
This chapter uses "encoder" and "compressor" interchangeably. Similarly, "decoder" and
"decompressor" are used interchangeably.
24.2.1 Count Frequencies
The first step of Huffman Coding is to determine the frequency of each letter in the
document being compressed. We will use the Charter of the United Nations as an example:
The Charter of the United Nations was signed on 26 June 1945, in San
Francisco, at the conclusion of the United Nations Conference on
International Organization, and came into force on 24 October
1945. The Statute of the International Court of Justice is an
integral part of the Charter.

Amendments to Articles 23, 27 and 61 of the Charter were adopted by
the General Assembly on 17 December 1963 and came into force on 31
August 1965. A further amendment to Article 61 was adopted by the
General Assembly on 20 December 1971, and came into force on 24
September 1973. An amendment to Article 109, adopted by the General
Assembly on 20 December 1965, came into force on 12 June 1968.

The amendment to Article 23 enlarges the membership of the Security
Council from eleven to fifteen. The amended Article 27 provides that
decisions of the Security Council on procedural matters shall be made
by an affirmative vote of nine members (formerly seven) and on all
other matters by an affirmative vote of nine members (formerly
seven), including the concurring votes of the five permanent members
of the Security Council.

The amendment to Article 61, which entered into force on 31 August
1965, enlarged the membership of the Economic and Social Council from
eighteen to twenty-seven. The subsequent amendment to that Article,
which entered into force on 24 September 1973, further increased the
membership of the Council from twenty-seven to fifty-four.

The amendment to Article 109, which relates to the first paragraph of
that Article, provides that a General Conference of Member States for
the purpose of reviewing the Charter may be held at a date and place
to be fixed by a two-thirds vote of the members of the General
Assembly and by a vote of any nine members (formerly seven) of the
Security Council. Paragraph 3 of Article 109, which deals with the
consideration of a possible review conference during the tenth
regular session of the General Assembly, has been retained in its
original form in its reference to a "vote, of any seven members of
the Security Council", the paragraph having been acted upon in 1955
by the General Assembly, at its tenth regular session, and by the
Security Council.
The table below shows the frequencies of the characters in the UN charter:
V: ASCII value C: character (if printable) F: frequency
V C F V C F V C F V C F V C F
10 38 11 0 12 0 13 0 14 0
30 0 31 0 32 340 33 ! 0 34 2
35 # 0 36 $ 0 37 % 0 38 & 0 39 ' 0
40 ( 3 41 ) 3 42 * 0 43 + 0 44 , 20
45 - 4 46 . 11 47 / 0 48 0 5 49 1 22
50 2 11 51 3 8 52 4 5 53 5 7 54 6 9
55 7 6 56 8 1 57 9 14 58 : 0 59 ; 0
60 < 0 61 = 0 62 > 0 63 ? 0 64 @ 0
65 A 21 66 B 0 67 C 15 68 D 3 69 E 1
70 F 1 71 G 7 72 H 0 73 I 2 74 J 3
75 K 0 76 L 0 77 M 1 78 N 2 79 O 2
80 P 1 81 Q 0 82 R 0 83 S 12 84 T 7
85 U 2 86 V 0 87 W 0 88 X 0 89 Y 0
90 Z 0 91 [ 0 92 \ 0 93 ] 0 94 ^ 0
95 _ 0 96 ` 0 97 a 105 98 b 38 99 c 64
100 d 46 101 e 263 102 f 58 103 g 19 104 h 73
105 i 96 106 j 0 107 k 0 108 l 56 109 m 64
110 n 140 111 o 121 112 p 23 113 q 1 114 r 119
115 s 71 116 t 157 117 u 37 118 v 21 119 w 13
120 x 1 121 y 30 122 z 1 123 { 0 124 | 0
125 } 0 126 ~ 0 127 0
The important point is that some characters appear much more frequently than others.
For example, the letter 'e' appears 263 times and 't' appears 157 times. In contrast,
'z' appears only once and 'B' does not appear in the article at all. Note that there are
some invisible characters. For example, 10 is the newline character ('\n') and it is used
38 times (there are 38 lines in the charter). The ASCII code 32 is the space character,
and it is used 340 times.
24.2.2 Sort by Frequency
After finding the frequencies of the characters, the frequencies are sorted in the ascending
order. The table below shows the characters sorted by their frequencies. If a character (for
example, ’B’ and ’H’) does not appear in the article, then it is discarded. If two characters
have the same frequency, then their order does not matter. When this occurs, we order the
letters of the same frequency by their ASCII value. For example, ’E’ and ’F’ appear once
and ’E’ is before ’F’ in this table.
V: value C: character F: frequency
V C F V C F V C F V C F V C F
56 8 1 69 E 1 70 F 1 77 M 1 80 P 1
113 q 1 120 x 1 122 z 1 34 " 2 73 I 2
78 N 2 79 O 2 85 U 2 40 ( 3 41 ) 3
68 D 3 74 J 3 45 - 4 48 0 5 52 4 5
55 7 6 53 5 7 71 G 7 84 T 7 51 3 8
54 6 9 46 . 11 50 2 11 83 S 12 119 w 13
57 9 14 67 C 15 103 g 19 44 , 20 65 A 21
118 v 21 49 1 22 112 p 23 121 y 30 117 u 37
10 38 98 b 38 100 d 46 108 l 56 102 f 58
99 c 64 109 m 64 115 s 71 104 h 73 105 i 96
97 a 105 114 r 119 111 o 121 110 n 140 116 t 157
101 e 263 32 340
The following code gives a sample implementation for determining the frequency of each
character, and then sorting the characters by their frequencies. This is the header file:
// freq.h
#ifndef FREQ_H
#define FREQ_H
typedef struct
{
    char value;
    int freq;
} CharFreq;
// count the frequencies of the letters
// NUMLETTER is a constant (128) defined in constant.h
// frequencies is an array of NUMLETTER elements
// The function returns the number of characters in the file
// The function returns 0 if it cannot read from the file
int countFrequency(char * filename, CharFreq * frequencies);
// print the array
void printFrequency(CharFreq * frequencies);
// sort the array
void sortFrequency(CharFreq * frequencies);
#endif
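The sort described in the previous subsection can be built on the standard library's qsort. The sketch below is one possible implementation, not necessarily the book's: the CharFreq type matches the struct in freq.h, but the helper name sortByFrequency and its explicit length parameter are assumptions made so the sketch stands alone.

```c
#include <stdlib.h>

typedef struct
{
    char value;
    int freq;
} CharFreq;

/* ascending by frequency; ties are broken by ASCII value */
static int compareFreq(const void *a, const void *b)
{
    const CharFreq *ca = (const CharFreq *) a;
    const CharFreq *cb = (const CharFreq *) b;
    if (ca->freq != cb->freq)
    {
        return ca->freq - cb->freq;
    }
    return ca->value - cb->value;
}

/* hypothetical helper: sort n entries by the rule above */
void sortByFrequency(CharFreq *frequencies, int n)
{
    qsort(frequencies, n, sizeof(CharFreq), compareFreq);
}
```

The tie-breaking branch is what places 'E' before 'F' when both appear once.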
This listing defines the function implementations.
// freq.c
#include "constant.h"
#include "freq.h"
#include <stdio.h>
#include <stdlib.h>
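The implementations are not reproduced in full here. A minimal sketch of countFrequency, consistent with the comments in freq.h, might look like the following; the book's actual version may differ in details, and NUMLETTER (128) is defined inline here to stand in for constant.h:

```c
#include <stdio.h>

#define NUMLETTER 128 /* stands in for the definition in constant.h */

typedef struct
{
    char value;
    int freq;
} CharFreq;

/* Count how often each ASCII character occurs in the file.
   Returns the total number of characters read,
   or 0 if the file cannot be opened. */
int countFrequency(char *filename, CharFreq *frequencies)
{
    FILE *fp = fopen(filename, "r");
    if (fp == NULL)
    {
        return 0;
    }
    /* initialize every entry before counting */
    for (int i = 0; i < NUMLETTER; i++)
    {
        frequencies[i].value = (char) i;
        frequencies[i].freq = 0;
    }
    int total = 0;
    int ch;
    while ((ch = fgetc(fp)) != EOF)
    {
        if (ch < NUMLETTER)
        {
            frequencies[ch].freq++;
            total++;
        }
    }
    fclose(fp);
    return total;
}
```

Applied to the UN charter above, this tally would produce the frequency table shown earlier in this section.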