A few years back I designed a program for encoding and decoding files using a Huffman tree. This was for my Data Structures class.
Huffman compression works by assigning symbols that seldom appear longer representations than symbols that appear with a higher frequency; thus Huffman compression is variable-length in its representation of symbols. These representations are generated by traversing a Huffman tree. The process of tree generation is as follows:
- Count up the frequencies of each symbol in a string of data.
- Enqueue each of these frequencies, associated with its symbol, in a priority queue.
- Each entry is the leaf node for its symbol.
- Remove the two nodes with the lowest frequencies from the queue, and combine them.
- The new node's frequency will be the sum of the previous two, and its children will likewise be the previous two nodes. Enqueue this new node.
- Continue the process until all nodes have been removed and combined into a tree. We are left with the root node after all is said and done.
We then traverse the tree to each leaf node to find its corresponding binary representation. I arbitrarily picked '0' to be added to the representation when I went left, and '1' when I went right. This could be reversed and the data would be compressed all the same, so long as we use the same convention for both encoding and decoding.
For this project, the symbols were the ASCII characters found in a text file, each of which is represented with 8 bits. I created a table of 256 boolean vectors, one per character, reducing the number of tree traversals needed to encode each character to zero after the table is built.
The file representation is very important: we have to be able to store and read the compressed file for later viewing. The problem is that C++ does not have a 1-bit data type (since the smallest addressable unit in hardware is a byte), and using an entire 8-bit character to represent a single bit would defeat the purpose of compression. Instead, I treated a string of characters as a binary stream: a character in C/C++ is 8 bits long, and each bit of each character holds the next boolean value in the stream. Since each character is 8 bits long, the worst case wastes only 7 bits, when the binary stream ends on the first bit of a character.
The next problem is saving the Huffman tree for decoding. I simply saved the 256 frequency counts as 32-bit integers. In addition, I saved the size of the "stream" of booleans I had stored as characters, allowing me to read the characters back in as a boolean array of the proper size. Any boolean values beyond that size limit are simply discarded (these garbage values are the aforementioned waste in the final character of the stream).
This means I have 257 32-bit integers of overhead in the saved file, which is about one kilobyte. As a consequence, encoding a text file smaller than around 2.5 kilobytes is a waste, since I get only about a 40% reduction in the space taken up by a compressed file, as I shall demonstrate shortly.
To decode, we reconstruct the Huffman tree from the stored frequency counts, and we then read in all of those characters that we saved to the file. These characters are translated back into a boolean array (vector) with an appropriate size (since we saved that size as an integer), and we run that through the Huffman tree.
To properly decode a symbol, we simply trace a path along the Huffman tree with its binary representation. Therefore, we follow the binary stream (array), tracing along the tree by taking a left path for every 0 and a right path for every 1. Whenever we reach a leaf node, we output the appropriate character contained in the leaf node.
My results with various text files were as follows:

    File                                      Size      Compressed Size
    (untitled text file)                      4.2 MB    2.5 MB
    War and Peace (Leo Tolstoy)               3.1 MB    1.8 MB
    Anna Karenina (Leo Tolstoy)               1.9 MB    1.1 MB
    USA Zip Codes (List)                      908 KB    460 KB
    Meditations on Anger                      6.7 KB    4.6 KB
    Millionaire Gets Mugged (News Article)    896 B     1.5 KB
We can expect to have a 30 to 40 percent reduction in file size when the overhead of one kilobyte becomes negligible. For small files, it's clear that compressing them is an act of futility.
This algorithm should work just as well for arbitrary binary files. However, the decoding routine I've implemented is buggy: it has problems decoding binary files after the initial encoding.
The source code can be found in this directory. Compilation has been tested with GNU GCC and with Microsoft's CL compiler.
Invoking the Program
The program takes 3 arguments. Assuming the name of the program is "huffman," for encoding type:
huffman e filename_to_encode output_filename
For decoding type:
huffman d filename_to_decode output_filename