Chapter 24
Huffman Compression
24.1 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
24.2 Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
24.2.1 Count Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
24.2.2 Sort by Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
24.2.3 Build a Code Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
24.2.4 Build a Code Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
24.2.5 Compress a File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415
24.2.6 Compress with Bits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
24.3 Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 423
Section 19.4 introduced binary trees through the example of binary search trees. This
chapter describes another way to use binary trees, in a popular compression technique called
Huffman Compression or Huffman Coding. Huffman Coding was developed by David Huffman
in the early 1950s, while he was still a graduate student at MIT. After more than
60 years, Huffman Coding remains one of the best general-purpose compression algorithms
available: it is fast, effective, and widely used.
It is easier to understand Huffman Coding by comparing it to ASCII. ASCII is a “fixed-
length code” using 8 bits for each letter, even though some letters (such as e and s) are
more common than some others (such as q and z). In contrast, Huffman Compression uses a
"variable-length code". If a letter appears frequently, then it is encoded with fewer bits. If a
letter appears infrequently, then more bits are used. This means that the average length of
all letters is shorter—the information is compressed. Huffman Coding is lossless compression
because the original data can be fully recovered. Lossy compression means the original data
cannot be fully recovered. Lossy compression may achieve higher compression ratios than
lossless compression, and is useful when full recovery of the original data is unnecessary.
Lossy compression is frequently used to compress images and JPEG is an example of lossy
compression. A compression ratio is defined as:
compression ratio = (size of uncompressed file) / (size of compressed file)    (24.1)
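As a quick illustration (the file sizes here are made up, not taken from this chapter's example), a 10,000-byte file compressed to 4,000 bytes has a compression ratio of 2.5. The definition translates directly into C:

```c
/* compression ratio as defined in Eq. (24.1):
   uncompressed size divided by compressed size */
double compressionRatio(double uncompressedSize, double compressedSize)
{
    return uncompressedSize / compressedSize;
}
```

A ratio greater than 1 means the compressed file is smaller than the original.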
This chapter uses Huffman Coding to compress articles written in English. Given a set
of letters (or symbols) and their frequencies, Huffman Coding is optimal because Huffman
Coding uses the fewest bits on average.
24.1 Example
Consider an article with only eight different characters: E, N, G, T, g, p, d, and h. To
encode these characters, only 3 bits are sufficient because 2^3 = 8. That means that each
character can be assigned a unique 3-bit sequence. Next consider the situation when the
letters' frequencies are quite different, as shown in this table:
character frequency
E 0.56%
N 1.12%
G 3.93%
T 3.93%
g 10.67%
p 12.92%
d 25.84%
h 41.01%
Consider the following codes (bit sequences) for the characters. We will explain how to
generate these codes (called a “code book”) in the next section.
character code length
E 1110100 7
N 1110101 7
G 111011 6
T 11100 5
g 1111 4
p 110 3
d 10 2
h 0 1
The average number of bits is 7 × 0.0056 + 7 × 0.0112 + 6 × 0.0393 + 5 × 0.0393 + 4 ×
0.1067 + 3 × 0.1292 + 2 × 0.2584 + 1 × 0.4101 = 2.29. This is a 1 − 2.29/3 ≈ 1 − 0.76 = 24%
reduction in the total length over using 3-bit codes. The code book can be displayed as a
tree, as shown in Fig. 24.1. At any tree node, 0 means moving left and 1 means moving
right.
FIGURE 24.1: Graphical representation of the code book.
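The 2.29 average computed above can be double-checked with a short C function. The lengths and frequencies below are copied from the code book table; the checker itself is not part of the book's code, only a sketch:

```c
/* Code lengths and frequencies copied from the example code book,
   in the order E, N, G, T, g, p, d, h */
static const int lengths[8] = {7, 7, 6, 5, 4, 3, 2, 1};
static const double freqs[8] =
    {0.0056, 0.0112, 0.0393, 0.0393, 0.1067, 0.1292, 0.2584, 0.4101};

/* weighted average number of bits per character */
double averageBits(void)
{
    double sum = 0.0;
    for (int i = 0; i < 8; i++)
    {
        sum += lengths[i] * freqs[i];
    }
    return sum; /* approximately 2.29 */
}
```

Calling averageBits() returns approximately 2.29, matching the hand computation.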
The code book tree has several important properties:

• Characters are stored only at leaf nodes; non-leaf nodes do not store any characters.
If a non-leaf node stored a character, ambiguity would arise. For example, if the root's
right child stored the character 'm', would 10 mean 'mh' or 'd'? To put it another
way, no code is the prefix of another code.

• If a character is more frequent, then its distance to the root is shorter. This means
that its bit sequence will be shorter and the average total length will be shorter.

• If the entire article contains only one character (perhaps repeated many times), then
the code book has only one row in the table and the tree has one node: the root node.

• If the article contains n different characters (n > 1), then the tree has n leaf nodes
and each non-leaf node has exactly two children. It is impossible for a non-leaf node
to have only one child.
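The first property (no code is the prefix of another) can be checked mechanically. The helper below is not part of the book's code; it is a small sketch that compares every pair of codes with strncmp:

```c
#include <string.h>

/* Returns 1 if no code in the array is a prefix of another code,
   0 otherwise. */
int isPrefixFree(const char *codes[], int n)
{
    for (int i = 0; i < n; i++)
    {
        for (int j = 0; j < n; j++)
        {
            /* does codes[i] match the first strlen(codes[i]) bits of codes[j]? */
            if (i != j && strncmp(codes[i], codes[j], strlen(codes[i])) == 0)
            {
                return 0; /* codes[i] is a prefix of codes[j] */
            }
        }
    }
    return 1;
}
```

Applied to the eight codes in the example table, this function returns 1.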
24.2 Encoding
To compress a file, the following steps are needed:
1. Count the frequencies of the characters.
2. Sort the characters by frequency.
3. Build a tree similar to Fig. 24.1.
4. Build the code book.
5. Use the code book to compress the file.
This chapter uses "encoder" and "compressor" interchangeably. Similarly, "decoder" and
"decompressor" are used interchangeably.
24.2.1 Count Frequencies
The first step of Huffman Coding is to determine the frequency of each letter in the
document being compressed. We will use the Charter of the United Nations as an example:
The Charter of the United Nations was signed on 26 June 1945, in San
Francisco, at the conclusion of the United Nations Conference on
International Organization, and came into force on 24 October
1945. The Statute of the International Court of Justice is an
integral part of the Charter.

Amendments to Articles 23, 27 and 61 of the Charter were adopted by
the General Assembly on 17 December 1963 and came into force on 31
August 1965. A further amendment to Article 61 was adopted by the
General Assembly on 20 December 1971, and came into force on 24
September 1973. An amendment to Article 109, adopted by the General
Assembly on 20 December 1965, came into force on 12 June 1968.

The amendment to Article 23 enlarges the membership of the Security
Council from eleven to fifteen. The amended Article 27 provides that
decisions of the Security Council on procedural matters shall be made
by an affirmative vote of nine members (formerly seven) and on all
other matters by an affirmative vote of nine members (formerly
seven), including the concurring votes of the five permanent members
of the Security Council.

The amendment to Article 61, which entered into force on 31 August
1965, enlarged the membership of the Economic and Social Council from
eighteen to twenty-seven. The subsequent amendment to that Article,
which entered into force on 24 September 1973, further increased the
membership of the Council from twenty-seven to fifty-four.

The amendment to Article 109, which relates to the first paragraph of
that Article, provides that a General Conference of Member States for
the purpose of reviewing the Charter may be held at a date and place
to be fixed by a two-thirds vote of the members of the General
Assembly and by a vote of any nine members (formerly seven) of the
Security Council. Paragraph 3 of Article 109, which deals with the
consideration of a possible review conference during the tenth
regular session of the General Assembly, has been retained in its
original form in its reference to a "vote, of any seven members of
the Security Council", the paragraph having been acted upon in 1955
by the General Assembly, at its tenth regular session, and by the
Security Council.
The table below shows the frequencies of the characters in the UN charter:
V: ASCII value C: character (if printable) F: frequency
V C F V C F V C F V C F V C F
10 38 11 0 12 0 13 0 14 0
30 0 31 0 32 340 33 ! 0 34 2
35 # 0 36 $ 0 37 % 0 38 & 0 39 ' 0
40 ( 3 41 ) 3 42 * 0 43 + 0 44 , 20
45 - 4 46 . 11 47 / 0 48 0 5 49 1 22
50 2 11 51 3 8 52 4 5 53 5 7 54 6 9
55 7 6 56 8 1 57 9 14 58 : 0 59 ; 0
60 < 0 61 = 0 62 > 0 63 ? 0 64 @ 0
65 A 21 66 B 0 67 C 15 68 D 3 69 E 1
70 F 1 71 G 7 72 H 0 73 I 2 74 J 3
75 K 0 76 L 0 77 M 1 78 N 2 79 O 2
80 P 1 81 Q 0 82 R 0 83 S 12 84 T 7
85 U 2 86 V 0 87 W 0 88 X 0 89 Y 0
90 Z 0 91 [ 0 92 \ 0 93 ] 0 94 ^ 0
95 _ 0 96 ` 0 97 a 105 98 b 38 99 c 64
100 d 46 101 e 263 102 f 58 103 g 19 104 h 73
105 i 96 106 j 0 107 k 0 108 l 56 109 m 64
110 n 140 111 o 121 112 p 23 113 q 1 114 r 119
115 s 71 116 t 157 117 u 37 118 v 21 119 w 13
120 x 1 121 y 30 122 z 1 123 { 0 124 | 0
125 } 0 126 ~ 0 127 0
The important point is that some characters appear much more frequently than others.
For example, the letter 'e' appears 263 times and 't' appears 157 times. In contrast,
'z' appears only once and 'B' does not appear in the article at all. Note that there are
some invisible characters. For example, 10 is the newline character ('\n') and it is used
38 times (there are 38 lines in the charter). The ASCII code 32 is the space character,
and it is used 340 times.
24.2.2 Sort by Frequency
After finding the frequencies of the characters, the frequencies are sorted in the ascending
order. The table below shows the characters sorted by their frequencies. If a character (for
example, ’B’ and ’H’) does not appear in the article, then it is discarded. If two characters
have the same frequency, then their order does not matter. When this occurs, we order the
letters of the same frequency by their ASCII value. For example, ’E’ and ’F’ appear once
and ’E’ is before ’F’ in this table.
V: value C: character F: frequency
V C F V C F V C F V C F V C F
56 8 1 69 E 1 70 F 1 77 M 1 80 P 1
113 q 1 120 x 1 122 z 1 34 " 2 73 I 2
78 N 2 79 O 2 85 U 2 40 ( 3 41 ) 3
68 D 3 74 J 3 45 - 4 48 0 5 52 4 5
55 7 6 53 5 7 71 G 7 84 T 7 51 3 8
54 6 9 46 . 11 50 2 11 83 S 12 119 w 13
57 9 14 67 C 15 103 g 19 44 , 20 65 A 21
118 v 21 49 1 22 112 p 23 121 y 30 117 u 37
10 38 98 b 38 100 d 46 108 l 56 102 f 58
99 c 64 109 m 64 115 s 71 104 h 73 105 i 96
97 a 105 114 r 119 111 o 121 110 n 140 116 t 157
101 e 263 32 340
The following code gives a sample implementation for determining the frequency of each
character, and then sorting the characters by their frequencies. This is the header file:
// freq.h
#ifndef FREQ_H
#define FREQ_H
typedef struct
{
    char value;
    int freq;
} CharFreq;
// count the frequencies of the letters
// NUMLETTER is a constant (128) defined in constant.h
// frequencies is an array of NUMLETTER elements
// The function returns the number of characters in the file
// The function returns 0 if it cannot read from the file
int countFrequency(char * filename, CharFreq * frequencies);
// print the array
void printFrequency(CharFreq * frequencies);
// sort the array
void sortFrequency(CharFreq * frequencies);
#endif
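The sort described in the previous subsection can be built on the standard library's qsort. The sketch below is one possible implementation, not necessarily the book's: the CharFreq type matches the struct in freq.h, but the helper name sortByFrequency and its explicit length parameter are assumptions made so the sketch stands alone.

```c
#include <stdlib.h>

typedef struct
{
    char value;
    int freq;
} CharFreq;

/* ascending by frequency; ties are broken by ASCII value */
static int compareFreq(const void *a, const void *b)
{
    const CharFreq *ca = (const CharFreq *) a;
    const CharFreq *cb = (const CharFreq *) b;
    if (ca->freq != cb->freq)
    {
        return ca->freq - cb->freq;
    }
    return ca->value - cb->value;
}

/* hypothetical helper: sort n entries by the rule above */
void sortByFrequency(CharFreq *frequencies, int n)
{
    qsort(frequencies, n, sizeof(CharFreq), compareFreq);
}
```

The tie-breaking branch is what places 'E' before 'F' when both appear once.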
This listing defines the function implementations.
// freq.c
#include "constant.h"
#include "freq.h"
#include <stdio.h>
#include <stdlib.h>
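The implementations are not reproduced in full here. A minimal sketch of countFrequency, consistent with the comments in freq.h, might look like the following; the book's actual version may differ in details, and NUMLETTER (128) is defined inline here to stand in for constant.h:

```c
#include <stdio.h>

#define NUMLETTER 128 /* stands in for the definition in constant.h */

typedef struct
{
    char value;
    int freq;
} CharFreq;

/* Count how often each ASCII character occurs in the file.
   Returns the total number of characters read,
   or 0 if the file cannot be opened. */
int countFrequency(char *filename, CharFreq *frequencies)
{
    FILE *fp = fopen(filename, "r");
    if (fp == NULL)
    {
        return 0;
    }
    /* initialize every entry before counting */
    for (int i = 0; i < NUMLETTER; i++)
    {
        frequencies[i].value = (char) i;
        frequencies[i].freq = 0;
    }
    int total = 0;
    int ch;
    while ((ch = fgetc(fp)) != EOF)
    {
        if (ch < NUMLETTER)
        {
            frequencies[ch].freq++;
            total++;
        }
    }
    fclose(fp);
    return total;
}
```

Applied to the UN charter above, this tally would produce the frequency table shown earlier in this section.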