Huffman Compression 415
int col = 1;54
// print the code55
while ( codebook [ row ][ col ] != -1)56
{57
printf (" %d ", code b o ok [ row ][ col ]) ;58
col ++;59
}60
printf (" n" );61
}62
}63
After the code book has been built, it is printed. Below is the output of printCodeBook.
h: 0
d: 1 0
p: 1 1 0
T: 1 1 1 0 0
E: 1 1 1 0 1 0 0
N: 1 1 1 0 1 0 1
G: 1 1 1 0 1 1
g: 1 1 1 1
24.2.5 Compress a File
The file can be compressed once the code book has been created. The compressed file
must include the code book because it is required to decode the file. The code book is
dependent on the characters, and the frequencies of the characters, within the file, and thus
each code book is unique to a given file. Thus, a compressed file has two parts: a header
that represents the code book and the data immediately after the code book. The code can
be expressed in different ways as long as the compression program and the decompression
program agree to the format. This book uses the following way to express the code book:
The tree is traversed using post-order traversal. As explained earlier, in-order traversal
cannot distinguish different shapes of binary trees. Thus, in-order traversal cannot be
used.
When encountering a leaf node, ’1’ is printed before printing the character. This ’1’
is a “command” for describing the tree.
When encountering a non-leaf node, one ’0’ is printed. This is the other “command”.
Commands can be either 1 or 0. Remember, the non-leaf node is encountered after
visiting both the left subtree and the right subtree—this is what a post-order traversal
means.
One 0 is added after visiting the root. After that, a new line n is added.
Fig. 24.11 shows several examples that express the code trees using this format. Note
that if n appears in an article, then it is a leaf node in the code tree, and the node is
expressed as 1’n’. This method can also handle ’1’ and ’0’ in the article. They become 11
and 10. For Fig. 24.1, the header is 1h1d1p1T1E1N01G001g00000.
After the header, the rest of the compressed file is the data. If ’h’ appears in the original
uncompressed file, then it is replaced by the code 0. If ’g’ appears in the original uncom-
pressed file, then it is replaced by the code 1111. To do this efficiently, the program can
build a mapping table from the ASCII value to the indexes in the code book. Below is the
mapping to the code book.
416 Intermediate C Programming
(a) (b) (c)
(d)
FIGURE 24.11: The expressions for the code trees are (a) 1a1b00, (b) 1a1b1c000, (c)
1a1b01c1d000, (d) 1a1b01c1d1e0000. For each tree, the number of 1s is the same as the
number of leaf nodes. The number of 0s is one plus the number of non-leaf nodes.
Character index
E 4
G 6
N 5
T 3
d 1
g 7
h 0
p 2
The listing below shows a sample implementation that makes the header and compresses
the data.
// compres s . c1
#in clude < stdio .h >2
#in clude " tree .h "3
#in clude " constant .h "4
s t a t i c void Tree_ heade rHelp er ( TreeNode * tn , FILE * outfptr )5
{6
i f ( tn == NULL )7
{8
return ; // should not get here9
}10
TreeNode * lc = tn -> left ;11
TreeNode * rc = tn -> right ;12
i f (( lc == NULL ) && ( rc == NULL ) )13
{14
// leaf node15
fprintf ( outfptr , " 1% c" , tn -> value ) ;16
return ;17
Huffman Compression 417
}18
Tree _head erHel per ( lc , outfptr ) ;19
Tree _head erHel per ( rc , outfptr ) ;20
fprintf ( outfptr , "0 ");21
}22
void Tree_he ader ( TreeNode * tn , char * outfile )23
{24
FILE * outfptr = fopen ( outfile , "w ");25
i f ( outfptr == NULL )26
{27
return ;28
}29
Tree _head erHel per ( tn , outfptr ) ;30
fprintf ( outfptr , " 0 n" );31
fclose ( outfptr );32
}33
int compress ( char * infile , char * outfile ,34
int * * codebook , i n t * mapping )35
{36
FILE * infptr = fopen ( infile , "r ") ;37
i f ( infptr == NULL )38
{39
return 0;40
}41
FILE * outfptr = fopen ( outfile , "a "); // append42
i f ( outfptr == NULL )43
{44
fclose ( outfptr );45
return 0;46
}47
while (! feof ( infptr ) )48
{49
int onechar = fgetc ( infptr ) ;50
i f ( onechar != EOF )51
{52
int ind = mapping [ o n e c h a r ];53
int ind2 = 1;54
while ( codebook [ ind ][ ind2 ] != -1)55
{56
fprintf ( outfptr , "% d" , codeboo k [ ind ][ ind2 ]) ;57
ind2 ++;58
}59
}60
}61
fclose ( infptr ) ;62
fclose ( outfptr );63
return 1;64
}65
// * *** **** *** **** *** **** *** **** *** * *** *** * *** *66
// continu e encode ...67
Tree_h eader ( root , outfile );68
418 Intermediate C Programming
// mapping from ASCII to the in d exes of the code book69
int mapping [ N UMLETTER ];70
int ind ;71
for ( ind = 0; ind < NUMLETTER ; ind ++)72
{73
mapping [ ind ] = -1; // initial ized to invalid index74
int ind2 ;75
for ( ind2 = 0; ind2 < numRow ; ind2 ++)76
{77
i f ( codeboo k [ ind2 ][0] == ind )78
{79
mapping [ ind ] = ind2 ;80
}81
}82
}83
compress ( infile , outfile , codebook , mapping );84
If this is the input file:
ENNGGGGGGGTTTTTTTgggggggggggggggggggpppppppppppppppppppppppd
dddddddddddddddddddddddddddddddddddddddddddddhhhhhhhhhhhhhhh
hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
This is the compressed file:
1h1d1p1T1E1N01G001g00000
1110100111010111101011110111110111110111110111110111110111110
1111100111001110011100111001110011100111111111111111111111111
1111111111111111111111111111111111111111111111111111110110110
1101101101101101101101101101101101101101101101101101101101101
0101010101010101010101010101010101010101010101010101010101010
1010101010101010101010101010100000000000000000000000000000000
000000000000000000000000000000000000000000
There is only one newline n’, which appears after the header. Line breaks are added
so that the very long sequence of 1s and 0s can fit on this page. Please notice that the
header does not include the characters’ frequencies because this information is unnecessary
for decoding.
24.2.6 Compress with Bits
You may have noticed that the “compressed” file is actually longer than the original
file. The original file has 178 bytes and the “compressed” file has 433 bytes. Why? Because
the compressed file uses 1 and 0 (8-bit characters) to represent single bits. That is a
lot of waste. In order to actually compress the file, only single bits should be used. This
will not reduce the file size to one eighth or the original; however, it should be smaller
than 178 bytes. The compressed file contains more than the bits describing the data. The
compressed file starts with the header that describes the code book. The header includes
the “commands” 1 or 0 to indicate whether the following information is a leaf node or not.
If a command bit is 1, it must be followed by an ASCII character. ASCII codes go from 0
to 127 (this book ignores the characters whose values are greater than 128), and 2
7
= 128,
which means that 7 bits is sufficient to store a character. This allows us to put a command
Huffman Compression 419
bit and a character into a single byte. Note, however, that if the command bit is 0, then
the rest of the header needs to be shifted right by one bit. If some bits in the last byte are
unused, then these bits are zero. The header in the ASCII format is:
1h1d1p1T1E1N01G001g00000
If each letter uses 7 bits, then the header in the binary format becomes:
11101000 11100100 11110000 11010100 11000101
11001110 01100011 10011100 11100000
For clarity, we have added space between every byte. The first four bytes are generated
by the following rules:
1. The letter ’h’ is 0x68 in hexadecimal and the 7-bit representation is 1101000. The first
byte is 11101000.
2. The letter ’d’ is 0x64 and the 7-bit representation is 1100100. The second byte is
11100100.
3. The letter ’p’ is 0x70 and the 7-bit representation is 1110000. The third byte is
11110000.
4. The letter ’T’ is 0x54 and the 7 bit representation is 1010100. The fourth byte is
11010100.
The first zero appears after ’N’ and it is the seventh byte. This zero is followed by
1. Thus, the first two bits are 01. The next character is ’G’. It is 0x47 and the 7-bit
representation is 01000111. Only six bits can be accommodated in the seventh byte and
this byte is 010100011. The remaining (rightmost) bit enters the leftmost part of the eighth
byte. The following table shows the bits for the header:
C: command D: data U: Unused
1 1 1 0 1 0 0 0 1 1 1 0 0 1 0 0
C ’h’ C ’d’
1 1 1 1 0 0 0 0 1 1 0 1 0 1 0 0
C ’p’ C ’T’
1 1 0 0 0 1 0 1 1 1 0 0 1 1 1 0
C ’E’ C ’N’
0 1 1 0 0 0 1 1 1 0 0 1 1 1 0 0
C C ’G’ C C C ’g’
1 1 1 0 0 0 0 0
C C C C U
One major problem of using bits is that the minimum unit of memory in C is a byte
(unsigned char). The following is a function for writing one bit to a file. This function
accumulates 8 bits in a buffer, and then writes to the file. It is called whenever a bit needs
to be written to the file. The function uses curbyte to keep all of the written bits and when
8 bits have been sent, the function writes a single byte to the file.
#in clude " utility . h"1
// functio n for debugg ing purpose2
s t a t i c void printByte ( unsigned char onebyte )3
{4
unsigned char mask = 0 x80 ;5
while ( mask > 0)6
{7
printf (" %d" , ( onebyte & mask ) == mask );8
mask >>= 1;9
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset