24. Huffman Compression (5/10)

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Huﬀman Compression 415

int col = 1;54

// print the code55

while ( codebook [ row ][ col ] != -1)56

{57

printf (" %d ", code b o ok [ row ][ col ]) ;58

col ++;59

}60

printf (" n" );61

}62

}63

After the code book has been built, it is printed. Below is the output of printCodeBook.

h: 0

d: 1 0

p: 1 1 0

T: 1 1 1 0 0

E: 1 1 1 0 1 0 0

N: 1 1 1 0 1 0 1

G: 1 1 1 0 1 1

g: 1 1 1 1

24.2.5 Compress a File

The ﬁle can be compressed once the code book has been created. The compressed ﬁle

must include the code book because it is required to decode the ﬁle. The code book is

dependent on the characters, and the frequencies of the characters, within the ﬁle, and thus

each code book is unique to a given ﬁle. Thus, a compressed ﬁle has two parts: a header

that represents the code book and the data immediately after the code book. The code can

be expressed in diﬀerent ways as long as the compression program and the decompression

program agree to the format. This book uses the following way to express the code book:

• The tree is traversed using post-order traversal. As explained earlier, in-order traversal

cannot distinguish diﬀerent shapes of binary trees. Thus, in-order traversal cannot be

used.

• When encountering a leaf node, ’1’ is printed before printing the character. This ’1’

is a “command” for describing the tree.

• When encountering a non-leaf node, one ’0’ is printed. This is the other “command”.

Commands can be either 1 or 0. Remember, the non-leaf node is encountered after

visiting both the left subtree and the right subtree—this is what a post-order traversal

means.

• One 0 is added after visiting the root. After that, a new line ’n’ is added.

Fig. 24.11 shows several examples that express the code trees using this format. Note

that if ’n’ appears in an article, then it is a leaf node in the code tree, and the node is

expressed as 1’n’. This method can also handle ’1’ and ’0’ in the article. They become 11

and 10. For Fig. 24.1, the header is 1h1d1p1T1E1N01G001g00000.

After the header, the rest of the compressed ﬁle is the data. If ’h’ appears in the original

uncompressed ﬁle, then it is replaced by the code 0. If ’g’ appears in the original uncom-

pressed ﬁle, then it is replaced by the code 1111. To do this eﬃciently, the program can

build a mapping table from the ASCII value to the indexes in the code book. Below is the

mapping to the code book.

416 Intermediate C Programming

(a) (b) (c)

(d)

FIGURE 24.11: The expressions for the code trees are (a) 1a1b00, (b) 1a1b1c000, (c)

1a1b01c1d000, (d) 1a1b01c1d1e0000. For each tree, the number of 1s is the same as the

number of leaf nodes. The number of 0s is one plus the number of non-leaf nodes.

Character index

E 4

G 6

N 5

T 3

d 1

g 7

h 0

p 2

The listing below shows a sample implementation that makes the header and compresses

the data.

// compres s . c1

#in clude < stdio .h >2

#in clude " tree .h "3

#in clude " constant .h "4

s t a t i c void Tree_ heade rHelp er ( TreeNode * tn , FILE * outfptr )5

i f ( tn == NULL )7

return ; // should not get here9

}10

TreeNode * lc = tn -> left ;11

TreeNode * rc = tn -> right ;12

i f (( lc == NULL ) && ( rc == NULL ) )13

{14

// leaf node15

fprintf ( outfptr , " 1% c" , tn -> value ) ;16

return ;17

Huﬀman Compression 417

}18

Tree _head erHel per ( lc , outfptr ) ;19

Tree _head erHel per ( rc , outfptr ) ;20

fprintf ( outfptr , "0 ");21

}22

void Tree_he ader ( TreeNode * tn , char * outfile )23

{24

FILE * outfptr = fopen ( outfile , "w ");25

i f ( outfptr == NULL )26

{27

return ;28

}29

Tree _head erHel per ( tn , outfptr ) ;30

fprintf ( outfptr , " 0 n" );31

fclose ( outfptr );32

}33

int compress ( char * infile , char * outfile ,34

int * * codebook , i n t * mapping )35

{36

FILE * infptr = fopen ( infile , "r ") ;37

i f ( infptr == NULL )38

{39

return 0;40

}41

FILE * outfptr = fopen ( outfile , "a "); // append42

i f ( outfptr == NULL )43

{44

fclose ( outfptr );45

return 0;46

}47

while (! feof ( infptr ) )48

{49

int onechar = fgetc ( infptr ) ;50

i f ( onechar != EOF )51

{52

int ind = mapping [ o n e c h a r ];53

int ind2 = 1;54

while ( codebook [ ind ][ ind2 ] != -1)55

{56

fprintf ( outfptr , "% d" , codeboo k [ ind ][ ind2 ]) ;57

ind2 ++;58

}59

}60

}61

fclose ( infptr ) ;62

fclose ( outfptr );63

return 1;64

}65

// * *** **** *** **** *** **** *** **** *** * *** *** * *** *66

// continu e encode ...67

Tree_h eader ( root , outfile );68

418 Intermediate C Programming

// mapping from ASCII to the in d exes of the code book69

int mapping [ N UMLETTER ];70

int ind ;71

for ( ind = 0; ind < NUMLETTER ; ind ++)72

{73

mapping [ ind ] = -1; // initial ized to invalid index74

int ind2 ;75

for ( ind2 = 0; ind2 < numRow ; ind2 ++)76

{77

i f ( codeboo k [ ind2 ][0] == ind )78

{79

mapping [ ind ] = ind2 ;80

}81

}82

}83

compress ( infile , outfile , codebook , mapping );84

If this is the input ﬁle:

ENNGGGGGGGTTTTTTTgggggggggggggggggggpppppppppppppppppppppppd

dddddddddddddddddddddddddddddddddddddddddddddhhhhhhhhhhhhhhh

hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh

This is the compressed ﬁle:

1h1d1p1T1E1N01G001g00000

1110100111010111101011110111110111110111110111110111110111110

1111100111001110011100111001110011100111111111111111111111111

1111111111111111111111111111111111111111111111111111110110110

1101101101101101101101101101101101101101101101101101101101101

0101010101010101010101010101010101010101010101010101010101010

1010101010101010101010101010100000000000000000000000000000000

000000000000000000000000000000000000000000

There is only one newline ’n’, which appears after the header. Line breaks are added

so that the very long sequence of 1s and 0s can ﬁt on this page. Please notice that the

header does not include the characters’ frequencies because this information is unnecessary

for decoding.

24.2.6 Compress with Bits

You may have noticed that the “compressed” ﬁle is actually longer than the original

ﬁle. The original ﬁle has 178 bytes and the “compressed” ﬁle has 433 bytes. Why? Because

the compressed ﬁle uses ’1’ and ’0’ (8-bit characters) to represent single bits. That is a

lot of waste. In order to actually compress the ﬁle, only single bits should be used. This

will not reduce the ﬁle size to one eighth or the original; however, it should be smaller

than 178 bytes. The compressed ﬁle contains more than the bits describing the data. The

compressed ﬁle starts with the header that describes the code book. The header includes

the “commands” 1 or 0 to indicate whether the following information is a leaf node or not.

If a command bit is 1, it must be followed by an ASCII character. ASCII codes go from 0

to 127 (this book ignores the characters whose values are greater than 128), and 2

= 128,

which means that 7 bits is suﬃcient to store a character. This allows us to put a command

Huﬀman Compression 419

bit and a character into a single byte. Note, however, that if the command bit is 0, then

the rest of the header needs to be shifted right by one bit. If some bits in the last byte are

unused, then these bits are zero. The header in the ASCII format is:

1h1d1p1T1E1N01G001g00000

If each letter uses 7 bits, then the header in the binary format becomes:

11101000 11100100 11110000 11010100 11000101

11001110 01100011 10011100 11100000

For clarity, we have added space between every byte. The ﬁrst four bytes are generated

by the following rules:

1. The letter ’h’ is 0x68 in hexadecimal and the 7-bit representation is 1101000. The ﬁrst

byte is 11101000.

2. The letter ’d’ is 0x64 and the 7-bit representation is 1100100. The second byte is

11100100.

3. The letter ’p’ is 0x70 and the 7-bit representation is 1110000. The third byte is

11110000.

4. The letter ’T’ is 0x54 and the 7 bit representation is 1010100. The fourth byte is

11010100.

The ﬁrst zero appears after ’N’ and it is the seventh byte. This zero is followed by

1. Thus, the ﬁrst two bits are 01. The next character is ’G’. It is 0x47 and the 7-bit

representation is 01000111. Only six bits can be accommodated in the seventh byte and

this byte is 010100011. The remaining (rightmost) bit enters the leftmost part of the eighth

byte. The following table shows the bits for the header:

C: command D: data U: Unused

1 1 1 0 1 0 0 0 1 1 1 0 0 1 0 0

C ← ’h’ → C ← ’d’ →

1 1 1 1 0 0 0 0 1 1 0 1 0 1 0 0

C ← ’p’ → C ← ’T’ →

1 1 0 0 0 1 0 1 1 1 0 0 1 1 1 0

C ← ’E’ → C ← ’N’ →

0 1 1 0 0 0 1 1 1 0 0 1 1 1 0 0

C C ← ’G’ → C C C ← ’g’

1 1 1 0 0 0 0 0

→ C C C C U

One major problem of using bits is that the minimum unit of memory in C is a byte

(unsigned char). The following is a function for writing one bit to a ﬁle. This function

accumulates 8 bits in a buﬀer, and then writes to the ﬁle. It is called whenever a bit needs

to be written to the ﬁle. The function uses curbyte to keep all of the written bits and when

8 bits have been sent, the function writes a single byte to the ﬁle.

#in clude " utility . h"1

// functio n for debugg ing purpose2

s t a t i c void printByte ( unsigned char onebyte )3

unsigned char mask = 0 x80 ;5

while ( mask > 0)6

printf (" %d" , ( onebyte & mask ) == mask );8

mask >>= 1;9

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 24. Huffman Compression (5/10)

Create new playlist

Sign In

Sign Up

Table of Contents for
24. Huffman Compression (5/10)