Notes for August 5, 2005

(courtesy of Stuart Reges)

We are exploring a technique known as "compression" that involves storing a file in a special format that allows it to take up less space on disk. Programs like Winzip use sophisticated compression algorithms to do this. We are going to examine a basic form of this algorithm that can be implemented with binary trees.

Normally characters are stored as a sequence of bits of a fixed length. For example, the original ASCII character set had a total of 128 characters that could be stored in 7 bits. The eighth bit was often used to indicate "parity" (odd or even), although this so-called parity bit often turned out to be more trouble than it was worth. Later we found ourselves wanting more than the 128 standard characters and that led to something known as extended ASCII which has 256 characters.

The nice thing about extended ASCII is that it fits nicely in 8 bits (what is known as a byte). The different integers we can form with one byte range from 00000000 to 11111111 in binary (which is 0 to 255 in base 10). So with one byte we can store 256 different sequences.

Most simple text files are stored this way, as a sequence of bytes each representing one character. To compress such a file, we need to come up with a different encoding scheme. The key idea is to abandon the requirement that the number of bits be a fixed number like 8. Instead we allow ourselves to have variable length codes. That way, we can use short codes for characters that occur often and we can have long codes for characters that appear less frequently.

The Huffman algorithm is a particular approach to finding such an encoding. We construct a binary tree that indicates how each different character is to be encoded. The particular tree we build will depend on the frequency of each character in the file we are trying to compress. So in the first part of this two-part assignment, the HuffmanTree constructor is passed an array of character frequencies.

First you construct a leaf node for each character with a non-zero frequency (we don't need codes for the other characters since they don't appear in the file). This gives us a list of leaf nodes with different frequencies. We now pick the two with lowest frequency and combine them into a new subtree whose frequency is the sum of the frequencies of the two we are combining. Once you make that subtree, you put it back into the list.

This process is repeated until you get down to one tree. Each time we remove two, combine them, and put the new subtree back into the list. That means that each time we get one closer to having a single tree.
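The combining process above can be sketched in Java as follows. This is only an illustration under some assumptions: the Node class and the build method are hypothetical names (not the assignment's required ones), a java.util.PriorityQueue stands in for the "list" so that the two lowest-frequency nodes are easy to remove, and the pseudo-eof character discussed later in these notes is left out.

```java
import java.util.PriorityQueue;

public class HuffmanBuildSketch {
    // Hypothetical node type; the assignment's own node class may differ.
    static class Node implements Comparable<Node> {
        final int character;   // character value, or -1 for a branch node
        final int frequency;
        final Node left;
        final Node right;

        Node(int character, int frequency, Node left, Node right) {
            this.character = character;
            this.frequency = frequency;
            this.left = left;
            this.right = right;
        }

        // Nodes with lower frequency come out of the queue first.
        @Override
        public int compareTo(Node other) {
            return Integer.compare(frequency, other.frequency);
        }
    }

    // counts[i] is the number of occurrences of the character with value i.
    static Node build(int[] counts) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {               // only characters that appear
                queue.add(new Node(i, counts[i], null, null));
            }
        }
        if (queue.isEmpty()) {
            return null;                       // nothing to encode
        }
        while (queue.size() > 1) {
            Node first = queue.remove();       // lowest frequency
            Node second = queue.remove();      // next lowest
            int sum = first.frequency + second.frequency;
            queue.add(new Node(-1, sum, first, second));
        }
        return queue.remove();                 // the single remaining tree
    }

    public static void main(String[] args) {
        int[] counts = new int[256];
        counts['a'] = 3;
        counts['b'] = 1;
        counts['c'] = 1;
        Node root = build(counts);
        System.out.println(root.frequency);    // prints 5
    }
}
```

Each pass through the loop removes two nodes and adds one, so the queue shrinks by one node per pass until only the root is left.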

Once the process is complete, we have the root of our Huffman tree. We assign character codes by thinking of each left branch as a 0 and each right branch as a 1. The leaves of the tree each contain the information for a single character. The path from the root to the leaf tells us what code to use for that character.
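As a rough sketch of how the codes fall out of the tree, the following traversal appends a 0 for each left branch and a 1 for each right branch, recording the path whenever it reaches a leaf. The Node class, the codes method and the small hand-built tree are made up for this illustration and are not part of the assignment.

```java
import java.util.Map;
import java.util.TreeMap;

public class HuffmanCodesSketch {
    // Hypothetical node type used only for this illustration.
    static class Node {
        final int character;   // meaningful only for leaves
        final Node left;
        final Node right;

        Node(int character) {              // leaf
            this.character = character;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {      // branch
            this.character = -1;
            this.left = left;
            this.right = right;
        }
    }

    // Returns a map from character value to its code (a string of 0's and 1's).
    static Map<Integer, String> codes(Node root) {
        Map<Integer, String> result = new TreeMap<>();
        collect(root, "", result);
        return result;
    }

    private static void collect(Node node, String path, Map<Integer, String> result) {
        if (node == null) {
            return;
        }
        if (node.left == null && node.right == null) {
            result.put(node.character, path);          // path from root to this leaf
        } else {
            collect(node.left, path + "0", result);    // left branch is a 0
            collect(node.right, path + "1", result);   // right branch is a 1
        }
    }

    public static void main(String[] args) {
        // 'a' hangs directly off the root; 'b' and 'c' share a subtree.
        Node root = new Node(new Node(new Node('b'), new Node('c')), new Node('a'));
        System.out.println(codes(root));   // prints {97=1, 98=00, 99=01}
    }
}
```

Note how the leaf closest to the root ('a') gets the shortest code, which is the behavior we want for the most frequent character.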

In the first part of the assignment, you are responsible for building up a Huffman tree given an array of frequencies and printing out the codes for each character in the tree.

When we go to use these codes to compress a file, we have to write a series of bits to an output stream in a compact format. The class called BitOutputStream does that. It has a significant limitation. The number of bits it writes will always be a multiple of 8. For example, suppose that you write a total of 803 bits to one of these output streams. The actual number of bits written will be 808. Your output will be "padded" with five 0's at the end. That's because the underlying input/output mechanisms are all based on bytes. You can't write part of a byte to a file.
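The padding arithmetic can be checked with a few lines of Java. The paddedBits method is a made-up helper for this illustration; it is not part of BitOutputStream.

```java
public class PaddingSketch {
    // Rounds a bit count up to the next multiple of 8 (a whole number of bytes).
    static int paddedBits(int bitsWritten) {
        return (bitsWritten + 7) / 8 * 8;
    }

    public static void main(String[] args) {
        int written = 803;
        int actual = paddedBits(written);
        System.out.println(actual);             // prints 808
        System.out.println(actual - written);   // prints 5
    }
}
```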

This limitation of BitOutputStream causes a potential problem for our compression algorithm. Consider the case where we had written 803 bits to the output stream. When we read it back in, we'll get those 803 bits plus we'll get 5 extra 0's at the end. What if the code "0" represents a letter like "e"? Then those 5 extra 0's will look like 5 e's.

To get around this problem, we introduce a "fake" character that we refer to as the pseudo-eof character. We make up a character that doesn't actually exist and we write it to the output stream after the actual characters. That way, when we read the file back in, we'll know when to stop reading. That means that the "multiple of 8" limitation of BitOutputStream won't be a problem for us because we have a special signal to let us know when to stop reading the file.

For our purposes, we'll use an integer value one higher than the highest character code we've been asked to work with. In our case, we're dealing with character codes 0 through 255, so we'll use 256 as the code for the pseudo-eof character. You shouldn't include the actual value 256 in your code. Your code should be flexible enough that we could use a different maximum value. You can use the array length to determine this maximum value.

For the first part of the assignment, the only place this enters into things is that you have to manually add this character to the initial set of leaves for the Huffman algorithm. You're given the frequencies of each of the real characters from the input file and you will make a leaf node for each of them. You should also make a leaf node for the pseudo-eof character and give it a frequency of 1 since it will appear exactly once at the end of the file.
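Here is one way the initial set of leaves might be put together, including the pseudo-eof leaf. The Leaf class and the initialLeaves method are hypothetical names chosen for this sketch. The point to notice is that the pseudo-eof value comes from the array length rather than from a hard-coded 256.

```java
import java.util.ArrayList;
import java.util.List;

public class PseudoEofSketch {
    // Hypothetical leaf type used only for this illustration.
    static class Leaf {
        final int character;
        final int frequency;

        Leaf(int character, int frequency) {
            this.character = character;
            this.frequency = frequency;
        }
    }

    // counts[i] is the number of occurrences of the character with value i.
    static List<Leaf> initialLeaves(int[] counts) {
        List<Leaf> leaves = new ArrayList<>();
        for (int i = 0; i < counts.length; i++) {
            if (counts[i] > 0) {
                leaves.add(new Leaf(i, counts[i]));
            }
        }
        // One higher than the highest real character value: the array length.
        int pseudoEof = counts.length;
        leaves.add(new Leaf(pseudoEof, 1));   // appears exactly once, at the end
        return leaves;
    }

    public static void main(String[] args) {
        int[] counts = new int[256];
        counts['a'] = 2;
        List<Leaf> leaves = initialLeaves(counts);
        Leaf last = leaves.get(leaves.size() - 1);
        System.out.println(last.character + " " + last.frequency);   // prints 256 1
    }
}
```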

For the homework assignment, there are a total of three executable programs: MakeCode, Encode and Decode.

Part 1 of the assignment involves writing supporting code for MakeCode (a constructor that takes an array of frequencies and builds a tree and a method that writes the code to an output file). The program Encode.java doesn't require the use of the tree. Part 2 of the assignment involves writing supporting code for the Decode program.

First, the Decode program expects to call a constructor that can reconstruct the original Huffman tree from a Scanner that contains the character codes. For this second part of the assignment, we don't need frequencies any longer. We just need to know the structure of the tree. You can use frequencies of 0 when reconstructing the individual nodes.

In reconstructing the tree, you have to be careful not to create a node more than once. For example, the first pair of lines in short.code indicates that the character with ASCII value 108 should be encoded as 0000. Looking at that code, you'd know to make a left node of the overall root and a left node of that node and a left node of that node and a left node of that node. That would get you to a leaf where you'd store the character value 108. The second pair of lines in short.code indicates that the character with ASCII value 110 should be encoded as 0001. You could imagine starting from the beginning and constructing a left node of the overall root with a left node with a left node with a right node storing the character value 110. But that won't work. You don't want to construct a brand new set of nodes for the second character. Otherwise the first leaf won't be in the same tree as the second leaf. You want to take advantage of the fact that most of those nodes were created in processing the information for the first character. You'd only want to create new nodes that you hadn't yet encountered (in this case, the leaf for the character 110).
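The idea of reusing nodes that already exist can be sketched like this. The Node class and the method names are hypothetical, and the sketch assumes the code file consists of pairs of lines (a character value, then its code), which is how the example above describes short.code.

```java
import java.util.Scanner;

public class HuffmanRebuildSketch {
    // Hypothetical node type; frequencies are not needed here, so they stay 0.
    static class Node {
        int character = -1;   // -1 until this node is known to be a leaf
        int frequency = 0;
        Node left;
        Node right;
    }

    // Assumed input format: a line with the character value, then a line with its code.
    static Node fromScanner(Scanner input) {
        Node root = new Node();
        while (input.hasNextLine()) {
            int character = Integer.parseInt(input.nextLine().trim());
            String code = input.nextLine().trim();
            insert(root, character, code);
        }
        return root;
    }

    // Follows the code from the root, creating a node only where none exists yet.
    static void insert(Node root, int character, String code) {
        Node current = root;
        for (int i = 0; i < code.length(); i++) {
            if (code.charAt(i) == '0') {
                if (current.left == null) {
                    current.left = new Node();
                }
                current = current.left;
            } else {
                if (current.right == null) {
                    current.right = new Node();
                }
                current = current.right;
            }
        }
        current.character = character;   // the end of the path is the leaf
    }

    public static void main(String[] args) {
        Node root = fromScanner(new Scanner("108\n0000\n110\n0001\n"));
        Node shared = root.left.left.left;            // built once, used by both codes
        System.out.println(shared.left.character);    // prints 108
        System.out.println(shared.right.character);   // prints 110
    }
}
```

Because insert checks for null before creating a node, the three left branches made for 0000 are simply followed again when 0001 is processed, and only the final right leaf is new.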

The other method you need to implement for part 2 is one that decodes the file. The two classes called BitOutputStream and BitInputStream write and read a series of bits in a compact manner. The Encode program uses BitOutputStream to produce the encoded binary file. The Decode program opens this file as a BitInputStream and passes it to your Huffman tree to have it do the actual decoding. These classes are truly minimal classes that have only three public methods each. BitOutputStream has a constructor, a method called writeBit and a method called close. BitInputStream has a constructor, a method called readBit and a method called close.

The only method you'll have to worry about is the readBit method of the BitInputStream class. The Decode program constructs the BitInputStream and also closes it. In between, it passes the stream to your Huffman tree by calling the method that decodes the file.

The operation you perform repeatedly is to go to the top of your tree and to read bits from the input file, going left or right in the tree depending upon whether you see a 0 or 1 in the input stream. When you hit a leaf, you know that you've found the next character from the original file and you write it to the PrintStream object you've been passed. Then you go back to the top of the tree and descend again until you hit a leaf and you print that character. Then go back to the top of the tree and start all over.

Instead of calling print or println as we normally do, you should call the method called write that writes a single byte to the output stream. It takes a parameter of type int. As an example, the ASCII values 97, 98 and 99 correspond to the characters 'a', 'b' and 'c'. This code:

    System.out.write(97);
    System.out.write(98);
    System.out.write(99);
    System.out.println();

would produce a line of output in the console window with "abc" on it. Recall that System.out is of type PrintStream. The PrintStream object you will be manipulating will be sending output to an external output file stored on disk, but the process is the same as with System.out. You should only call the write method. Unlike the code above, you don't need to call println.

In decoding the file, you'll have to know when to stop processing. As mentioned above, you can't rely on the bit stream to reliably tell you where the end of the stream occurs. It will always have a length that is a multiple of 8 (because it is composed of complete bytes, each 8 bits long). That's where the pseudo-eof character comes in. The Encode program writes the characters of the original file to the bit stream and then it writes the code for the pseudo-eof character. So as you are processing characters, eventually you will come across this eof character. When you do, you should stop decoding. You should not write this character to the PrintStream because it is not an actual character from the original file. It's a fictitious character that we made up to signal the end of the input.
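Putting the pieces together, the decoding loop might look something like the sketch below. Because the real BitInputStream class isn't shown in these notes, the sketch uses a hypothetical BitSource interface with a readBit method as a stand-in, and it assumes that readBit returns -1 when no bits remain. The Node class, the decode method and the demo data are likewise invented for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class HuffmanDecodeSketch {
    // Hypothetical node type used only for this illustration.
    static class Node {
        final int character;
        final Node left;
        final Node right;

        Node(int character) { this.character = character; left = null; right = null; }
        Node(Node left, Node right) { this.character = -1; this.left = left; this.right = right; }

        boolean isLeaf() { return left == null && right == null; }
    }

    // Hypothetical stand-in for BitInputStream; assumed to return -1 when out of bits.
    interface BitSource {
        int readBit();
    }

    static void decode(Node root, BitSource input, PrintStream output, int eof) {
        Node current = root;
        while (true) {
            int bit = input.readBit();
            if (bit < 0) {
                return;                          // ran out of bits (not expected)
            }
            current = (bit == 0) ? current.left : current.right;
            if (current.isLeaf()) {
                if (current.character == eof) {
                    return;                      // pseudo-eof: stop, don't write it
                }
                output.write(current.character); // write a single byte
                current = root;                  // back to the top of the tree
            }
        }
    }

    static String demo() {
        // Codes: 'a' = 1, 'b' = 00, pseudo-eof (256) = 01.
        Node root = new Node(new Node(new Node('b'), new Node(256)), new Node('a'));
        // "aba", then the pseudo-eof code, then two padding 0's to fill the byte.
        int[] bits = {1, 0, 0, 1, 0, 1, 0, 0};
        int[] position = {0};
        BitSource source = () -> position[0] < bits.length ? bits[position[0]++] : -1;
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(bytes);
        decode(root, source, out, 256);
        out.flush();
        return bytes.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());   // prints aba
    }
}
```

In the demo, the two padding 0's happen to spell the code for 'b'. Without the pseudo-eof check, the decoder would write an extra 'b' that was never in the original file.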

The extra credit version has its own main programs called Encode2 and Decode2. Encode2 produces a single binary file as output. It contains information about the tree and the encoded text. That way everything is kept in one file instead of two files. Decode2 reconstructs the tree from the information in the binary file and then decodes the file as before.