CSE143 Notes for Wednesday, 11/30/05

I used the lecture time to describe the next programming assignment. We are going to write a program that compresses a text file by creating something known as a "Huffman tree." It is a binary tree.

I started by demonstrating the execution of the various programs involved. I have written a total of three main programs:

MakeCode.java which takes an input file and produces a code file (an encoding scheme based on a Huffman tree)
Encode.java which takes an input file and code file and produces a binary output file (the encoded, compressed file)
Decode.java which takes a code file and a binary encoded file and recreates the original text file

I mentioned that I have a short data file called short.txt and a long data file called hamlet.txt (the full text of the play). I opened a command window so that we could do "dir" commands to see the various files in the directory. We ran each program using short.txt:

Using the MakeCode program and short.txt, we made a file called short.code that contained an encoding scheme for the file.
Then we ran the Encode program that took short.txt and short.code to produce a compressed file called short.short. This is a binary file like a zip file, meaning that it is stored with a different encoding scheme than the normal ASCII text files that appear on the system.
Then we ran the Decode program that took short.short and short.code to produce a new file that we called short.new. This is supposed to be the "inflated" version of the binary file, which means it should be exactly the same as short.txt. Using the "dir" command, we could see that it had the same byte length. Using the "fc" command in Windows, we could see that the file was exactly the same as the original.

Of course, this isn't very impressive for the short file. The compressed file was shorter, but the code file was so long that the combination of the compressed file and the code file was larger than the original. But compression wasn't designed for short files because they don't need to be compressed.

So then we went through the same process with hamlet.txt. We found that it reduced the file size from around 200 thousand characters to around 110 thousand characters (roughly cutting the file size in half). This isn't as good as the standard zip compression, but it's pretty good given the fact that the algorithm is relatively simple.

I then switched to the overhead to discuss the details of the assignment. There is a detailed example in the assignment writeup, so I won't duplicate the one we did in lecture, but I'll review the key points for the assignment.

We are exploring a technique known as "compression" that involves storing a file in a special format that allows it to take up less space on disk. Programs like WinZip use sophisticated compression algorithms to do this. We are going to examine a basic compression algorithm that can be implemented with binary trees.

Normally characters are stored as a sequence of bits of a fixed length. One such scheme is known as ASCII:

        A merican
        S tandard
        C ode for
        I nformation
        I nterchange

The original ASCII character set had a total of 128 characters that could be stored in 7 bits. The eighth bit was often used to indicate "parity" (odd or even), although this so-called parity bit often turned out to be more trouble than it was worth. Later we found ourselves wanting more than the 128 standard characters and that led to something known as extended ASCII which has 256 characters.

A convenient property of extended ASCII is that it fits exactly in 8 bits (what is known as a byte). The different integers we can form with one byte range from 00000000 to 11111111 in binary (which is 0 to 255 in base 10). So with one byte we can store 256 different values.
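The byte range described above can be checked with a few lines of Java. This is only a small illustration; the class and method names (ByteRangeSketch, fromBinary, toEightBits) are made up for this example and are not part of the assignment.

```java
class ByteRangeSketch {
    // Parses a string of 0s and 1s as a base-2 integer.
    static int fromBinary(String bits) {
        return Integer.parseInt(bits, 2);
    }

    // Formats a value from 0 to 255 as exactly eight binary digits.
    static String toEightBits(int value) {
        String bits = Integer.toBinaryString(value & 0xFF);
        // Pad on the left with zeros so that we always show a full byte.
        return String.format("%8s", bits).replace(' ', '0');
    }

    public static void main(String[] args) {
        System.out.println(fromBinary("00000000")); // 0
        System.out.println(fromBinary("11111111")); // 255
        System.out.println(toEightBits('A'));       // 01000001 (ASCII value 65)
    }
}
```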

Most simple text files are stored this way, as a sequence of bytes each representing one character. To compress such a file, we need to come up with a different encoding scheme. The key idea is to abandon the requirement that the number of bits be a fixed number like 8. Instead we allow ourselves to have variable length codes. That way, we can use short codes for characters that occur often and we can have long codes for characters that appear less frequently.
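To see why variable length codes help, here is a minimal sketch that compares the number of bits needed under the two schemes. The code table used here (a is 0, b is 10, c is 11) is a hypothetical example chosen for illustration, and the class and method names are not part of the assignment.

```java
import java.util.HashMap;
import java.util.Map;

class CodeLengthSketch {
    // Bits needed when every character takes a fixed 8 bits.
    static int fixedBits(String text) {
        return text.length() * 8;
    }

    // Bits needed when each character uses its own code from the table.
    static int variableBits(String text, Map<Character, String> codes) {
        int total = 0;
        for (char ch : text.toCharArray()) {
            total += codes.get(ch).length();
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical code table: the most frequent character gets the shortest code.
        Map<Character, String> codes = new HashMap<>();
        codes.put('a', "0");
        codes.put('b', "10");
        codes.put('c', "11");

        String text = "aaaabbc";
        System.out.println(fixedBits(text));           // 56
        System.out.println(variableBits(text, codes)); // 10
    }
}
```

For the seven-character text, the fixed scheme needs 7 * 8 = 56 bits while the variable scheme needs 4 * 1 + 2 * 2 + 1 * 2 = 10 bits.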

The Huffman algorithm is a particular approach to finding such an encoding. We construct a binary tree that indicates how each different character is to be encoded. The particular tree we build will depend on the frequency of each character in the file we are trying to compress. So in the first part of this two-part assignment, the HuffmanTree constructor is passed an array of character frequencies.
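As an illustration of what an array of character frequencies looks like, here is a minimal sketch that counts characters in a string. It assumes an array of 256 counts indexed by character value, which matches extended ASCII; the class and method names are hypothetical, and in the assignment the array is handed to your constructor rather than computed by you.

```java
class FrequencySketch {
    // Counts how often each character value from 0 to 255 occurs in the text.
    static int[] countFrequencies(String text) {
        int[] counts = new int[256];
        for (char ch : text.toCharArray()) {
            if (ch < 256) {   // ignore anything outside extended ASCII
                counts[ch]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countFrequencies("aaaabbc");
        System.out.println(counts['a']); // 4
        System.out.println(counts['b']); // 2
        System.out.println(counts['c']); // 1
        System.out.println(counts['z']); // 0
    }
}
```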

First you construct a leaf node for each character with a non-zero frequency (we don't need codes for the other characters since they don't appear in the file). This gives us a list of leaf nodes with different frequencies. We now pick the two with lowest frequency and combine them into a new subtree whose frequency is the sum of the frequencies of the two we are combining. Once you make that subtree, you put it back into the list.

This process is repeated until you get down to one tree. Each time we remove two, combine them, and put the new subtree back into the list. That means that each time we get one closer to having a single tree.
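The combining process described above can be sketched as follows. This version assumes that java.util.PriorityQueue is used as the "list" so that the two lowest frequencies are always at the front; the Node class, its fields, and the build method are hypothetical names for this example, not the required design for the assignment.

```java
import java.util.PriorityQueue;

class HuffmanBuildSketch {
    // Hypothetical node type: leaves hold a character, branches hold two subtrees.
    static class Node implements Comparable<Node> {
        final int ch;          // character value, or -1 for a branch node
        final int frequency;
        final Node left;
        final Node right;

        Node(int ch, int frequency, Node left, Node right) {
            this.ch = ch;
            this.frequency = frequency;
            this.left = left;
            this.right = right;
        }

        // Orders nodes by frequency so the queue yields the lowest first.
        public int compareTo(Node other) {
            return Integer.compare(frequency, other.frequency);
        }
    }

    static Node build(int[] frequencies) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        // One leaf per character that actually occurs.
        for (int ch = 0; ch < frequencies.length; ch++) {
            if (frequencies[ch] > 0) {
                queue.add(new Node(ch, frequencies[ch], null, null));
            }
        }
        // Remove two, combine them, put the new subtree back.
        while (queue.size() > 1) {
            Node first = queue.remove();
            Node second = queue.remove();
            queue.add(new Node(-1, first.frequency + second.frequency, first, second));
        }
        return queue.poll(); // null if no character had a non-zero frequency
    }

    public static void main(String[] args) {
        int[] frequencies = new int[256];
        frequencies['a'] = 4;
        frequencies['b'] = 2;
        frequencies['c'] = 1;
        Node root = build(frequencies);
        System.out.println(root.frequency); // 7
    }
}
```

Each pass through the loop shrinks the queue by one, which is why the process always ends with a single tree.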

Once the process is complete, we have the root of our HuffmanTree. We assign character codes by thinking of each left branch as a 0 and each right branch as a 1. The leaves of the tree each contain the information for a single character. The path from the root to the leaf tells us what code to use for that character.
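Here is a minimal sketch of how the codes can be read off a tree with a recursive traversal, adding a 0 to the path for each left branch and a 1 for each right branch. The Node class and method names are hypothetical, and the small tree is built by hand just for the example.

```java
import java.util.Map;
import java.util.TreeMap;

class HuffmanCodesSketch {
    // Hypothetical node type for this example.
    static class Node {
        final char ch;      // meaningful only for leaves
        final Node left;
        final Node right;

        Node(char ch) {                  // leaf
            this.ch = ch;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {    // branch
            this.ch = '\0';
            this.left = left;
            this.right = right;
        }
    }

    // Records the path from the root to each leaf as that leaf's code.
    static void collect(Node node, String path, Map<Character, String> codes) {
        if (node == null) {
            return;
        }
        if (node.left == null && node.right == null) {
            codes.put(node.ch, path);
            return;
        }
        collect(node.left, path + "0", codes);
        collect(node.right, path + "1", codes);
    }

    static Map<Character, String> codesFor(Node root) {
        Map<Character, String> codes = new TreeMap<>();
        collect(root, "", codes);
        return codes;
    }

    public static void main(String[] args) {
        // Tree with c and b under the left branch and a as the right leaf.
        Node root = new Node(new Node(new Node('c'), new Node('b')), new Node('a'));
        System.out.println(codesFor(root)); // {a=1, b=01, c=00}
    }
}
```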

In the first part of the assignment, you are responsible for building up a Huffman tree given an array of frequencies and printing out the codes for each character in the tree.

In the second part of the assignment, you have to reconstruct the tree from the code file. For this second part of the assignment, the frequencies don't matter. The frequencies are only used in constructing the tree. That's why the instructions say for the second part that you can use frequencies like 0 or -1 when you reconstruct the nodes.
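Reconstruction can be sketched as follows: for each character, follow its code down from the root, creating nodes as needed, and mark the final node as the leaf for that character. This sketch assumes the codes have already been read into a map from character to code string; it does not show the format of the code file, and the Node class and rebuild method are hypothetical names.

```java
import java.util.Map;
import java.util.TreeMap;

class HuffmanRebuildSketch {
    // Hypothetical node type; frequency is just a placeholder here.
    static class Node {
        int ch = -1;          // stays -1 unless this node turns out to be a leaf
        int frequency = -1;   // not needed once the tree is being rebuilt
        Node left;
        Node right;
    }

    static Node rebuild(Map<Character, String> codes) {
        Node root = new Node();
        for (Map.Entry<Character, String> entry : codes.entrySet()) {
            Node current = root;
            // Walk the code: 0 means go left, 1 means go right.
            for (char bit : entry.getValue().toCharArray()) {
                if (bit == '0') {
                    if (current.left == null) {
                        current.left = new Node();
                    }
                    current = current.left;
                } else {
                    if (current.right == null) {
                        current.right = new Node();
                    }
                    current = current.right;
                }
            }
            current.ch = entry.getKey(); // end of the path is the leaf
        }
        return root;
    }

    public static void main(String[] args) {
        Map<Character, String> codes = new TreeMap<>();
        codes.put('a', "1");
        codes.put('b', "01");
        codes.put('c', "00");
        Node root = rebuild(codes);
        System.out.println((char) root.right.ch);      // a
        System.out.println((char) root.left.right.ch); // b
        System.out.println((char) root.left.left.ch);  // c
    }
}
```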

The other method you need to implement for part 2 is one that decodes the file. I pointed out that I had written two classes called BitOutputStream and BitInputStream that write and read a series of bits in a compact manner. The Encode program uses BitOutputStream to produce the encoded binary file. The Decode program opens this file as a BitInputStream and passes it to your HuffmanTree to have it do the actual decoding. These classes are truly minimal classes that have only three public methods each. BitOutputStream has a constructor, a method called writeBit and a method called close. BitInputStream has a constructor, a method called readBit and a method called close.

The only method you'll have to worry about is the readBit method of the BitInputStream class. The Decode program constructs the BitInputStream and also closes it. In between, it passes the stream to your HuffmanTree by calling a method that decodes the file.
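To give a feel for what storing bits "in a compact manner" can mean, here is a minimal sketch that packs individual bits into bytes, eight at a time. This is not the code of BitOutputStream; the bit order (most significant bit first) and the padding of a final partial byte with 0 bits are assumptions made for this example, and the provided classes may do it differently.

```java
import java.io.ByteArrayOutputStream;

class BitPackingSketch {
    // Packs a string of '0' and '1' characters into bytes, eight bits per byte,
    // most significant bit first. A final partial byte is padded with 0 bits.
    static byte[] pack(String bits) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int current = 0;   // bits collected so far for the next byte
        int count = 0;     // how many bits are in current
        for (char bit : bits.toCharArray()) {
            current = (current << 1) | (bit == '1' ? 1 : 0);
            count++;
            if (count == 8) {
                out.write(current);
                current = 0;
                count = 0;
            }
        }
        if (count > 0) {
            out.write(current << (8 - count)); // pad the last byte on the right
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] packed = pack("0100000110");
        System.out.println(packed.length);    // 2
        System.out.println(packed[0]);        // 65
        System.out.println(packed[1] & 0xFF); // 128
    }
}
```

Ten bits fit in two bytes this way, whereas writing each bit as a '0' or '1' character would take ten bytes.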

The operation you perform repeatedly is to go to the top of your tree and to read bits from the input file, going left or right in the tree depending upon whether you see a 0 or 1 in the input stream. When you hit a leaf, you know that you've found the next character from the original file and you write it to the PrintStream object you've been passed. Then you go back to the top of the tree and descend again until you hit a leaf and you print that character. Then go back to the top of the tree and start all over.
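The decoding loop described above can be sketched as follows. To keep the example self-contained it uses a stand-in class called BitSource instead of the real BitInputStream, and it assumes that readBit returns 0 or 1 for each bit and -1 when there are no more bits. The Node class and the way the loop decides when to stop are also assumptions for this example; the assignment writeup has the actual details.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

class HuffmanDecodeSketch {
    // Hypothetical node type for this example.
    static class Node {
        final int ch;
        final Node left;
        final Node right;

        Node(int ch) {                   // leaf
            this.ch = ch;
            this.left = null;
            this.right = null;
        }

        Node(Node left, Node right) {    // branch
            this.ch = -1;
            this.left = left;
            this.right = right;
        }
    }

    // Stand-in for BitInputStream: hands out bits from a string of 0s and 1s.
    static class BitSource {
        private final String bits;
        private int index = 0;

        BitSource(String bits) {
            this.bits = bits;
        }

        // Returns the next bit, or -1 when the bits run out (an assumption).
        int readBit() {
            return index < bits.length() ? bits.charAt(index++) - '0' : -1;
        }
    }

    static void decode(Node root, BitSource input, PrintStream output) {
        Node current = root;
        int bit;
        while ((bit = input.readBit()) != -1) {
            current = (bit == 0) ? current.left : current.right;
            if (current.left == null && current.right == null) {
                output.write(current.ch); // found the next character
                current = root;           // go back to the top of the tree
            }
        }
    }

    public static void main(String[] args) {
        // Codes for this tree: c is 00, b is 01, a is 1.
        Node root = new Node(new Node(new Node('c'), new Node('b')), new Node('a'));
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        PrintStream output = new PrintStream(buffer);
        decode(root, new BitSource("1010011"), output);
        output.flush();
        System.out.println(buffer.toString()); // abcaa
    }
}
```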

Stuart Reges

Last modified: Sun Dec 4 19:15:42 PST 2005