handout #25
CSE143—Computer Programming II
Programming Assignment #8
due: Tuesday, 8/16/05,
(courtesy of Stuart Reges)
This
assignment is worth a total of 30 points.
This document explains part 1 of the assignment, which is worth approximately
half of the 30 points.
This
program will give you practice with binary trees and priority queues. In this program you will explore how text
files can be compressed by using a coding scheme based on the frequency of
characters. We will use a coding scheme
called Huffman coding.
The
basic idea is to abandon the way that text files are usually stored. Instead of using the usual seven or eight
bits per character, Huffman's method uses only a few bits for characters that
are used often, more bits for those that are rarely used.
You
will solve this problem using a structure known as a priority queue. In a priority queue each value inserted into
the queue has a priority that determines when it will be removed. There are many ways to specify the priorities. For this program you will construct objects
that implement the Comparable interface, with objects that are “less” being
given a higher priority (to be removed first).
The
first step is to compute the frequency of each character in the file you wish
to encode. This allows you to determine
which characters should have the fewest bits, etc. The next step is to build a “coding tree”
from the bottom up according to the frequencies. An example will probably make this the most
clear. To make the example easier, suppose we only want to
encode the five letters (a, b, c, d, e) and they have frequencies 3, 3, 1, 1,
and 2, respectively.
We
first create a leaf node for each character/frequency pair and put them into a
priority queue, so that the characters with lower frequencies appear first:
+----+ +----+
+----+ +----+ +----+
| 1 |
| 1 | | 2
| |
3 | | 3 |
+----+ +----+ +----+
+----+ +----+
'c' 'd'
'e' 'a' 'b'
Now
we pick the two nodes with the smallest frequencies (the two at the front of
the priority queue) and create a new node with those two nodes as children (the
first value from the queue becomes the left, the second value from the queue
becomes the right). We assign this new
branch node a frequency that is the sum of the frequencies of the two
children. This new node is then put back
into the priority queue:
+----+ +----+
+----+ +----+
| 2 |
| 2 | | 3
| |
3 |
+----+ +----+ +----+
+----+
'e'
/ \
'a' 'b'
+----+ +----+
|
1 | | 1 |
+----+ +----+
'c' 'd'
Continuing
in this way, we build up larger and larger subtrees.
Here are the rest of the steps:
+----+ +----+
+----+
| 3 |
| 3 | | 4
|
+----+ +----+ +----+
'a' 'b'
/ \
+----+ +----+
| 2 |
| 2 |
+----+ +----+
'e' /
\
+----+ +----+
| 1 |
| 1 |
+----+ +----+
'c' 'd'
+----+ +----+
| 4 | | 6 |
+----+ +----+
/ \ / \
+----+ +----+
+----+ +----+
|
2 | | 2 |
| 3 | | 3
|
+----+ +----+
+----+ +----+
'e' /
\ 'a' 'b'
+----+ +----+
| 1 |
| 1 |
+----+ +----+
'c' 'd'
+----+
| 10 |
+----+
/ \
/ \
+----+ +----+
| 4 | | 6 |
+----+ +----+
/ \ / \
+----+ +----+
+----+ +----+
|
2 | | 2 |
| 3 | | 3
|
+----+ +----+
+----+ +----+
'e' /
\ 'a' 'b'
+----+ +----+
| 1 |
| 1 |
+----+ +----+
'c' 'd'
Note
that the nodes with low frequencies end up far down in the tree, and nodes with
high frequencies end up near the root of the tree. It turns out that this structural description
is exactly what is needed to create an efficient encoding. The Huffman code is derived from this coding
tree simply by assigning a zero to each left branch and a one to each right
branch. The code can be read directly
from the tree. The code for a is 10, the
code for b is 11, the code for c is 010, the code for d is 011 and the code for
e is 00.
An
interesting feature of the Huffman code is that delimiters between characters
are not stored, even though different characters may be coded with different
numbers of bits. The key is that a code
created by this method exhibits what is known as the prefix property, which means that no code for a character is the
prefix of the code of any other character.
Thus, to decode a message we need only traverse our tree. When we reach a leaf, we know that we have
decoded one character, and can now start decoding the next character.
For
our purposes, we will encode what are known as “bytes” (8 bits). This will allow us to encode standard text
files that use ASCII and binary files as well.
From the point of view of your Huffman code, you can think about the
individual bytes as simple integers in the range of 0 to 255, each representing
the ASCII value of a particular character.
In part 1, you are working with a program called MakeCode. It prompts the user for a file to examine and
it computes the frequency of each character in the file. These counts are passed as an array to your
HuffmanTree constructor.
The
array passed to your constructor will have exactly 256 values in it, but your
program should not depend on this.
Instead, you can use the length field of the array to know how many
there are. In your constructor, you
should use a priority queue to build up the tree as described above. First you will add a leaf node for each
character that has a frequency greater than 0 (we don’t include the other
characters in our tree). These should be
added in increasing character order (character 0, character 1, and so on).
Then
you build the tree. Initially you have a
bunch of leaf nodes. Your goal is to get
a single tree. While you haven’t gotten
down to a single tree, you remove two values from the priority queue and
combine them to make a new branch node which you put back into the queue, as
described above. You continue combining
subtrees until you get down to one tree.
This becomes your Huffman tree.
You are to define a
class called HuffmanTree with the following public methods (more methods will
be added in part 2 of this assignment):
Method |
Description |
HuffmanTree(int[] count) |
Constructs a Huffman tree using the given array of frequencies where count[i] is the number of occurrences of the character with ASCII value i. |
void write(PrintStream output) |
Writes the current tree to the given output stream in standard format. |
In
defining your class, you will also define a node class called HuffmanNode. You may decide what data fields to include
with this class.
In
writing the tree, you should produce two lines of output for each
character. The first line should have
the ASCII value of the character. The
second line should have the code (0's and 1's) for the character with this
ASCII value. For the simple example
above, the output would be (the letter a has ASCII value 97):
101
00
99
010
100
011
97
10
98
11
The
codes should be written in “traversal order.”
In other words, they should be written in the order that any standard
traversal of the tree would visit them.
It
turns out that Huffman coding works best if one character is designated as “end
of file,” meaning that every file is guaranteed to end with such a character
and it will be used for no other purpose.
(Please note that the simple sample output above is not complete,
because it does not show the EOF
character.) Some operating systems have such a character, but if we want to
write a general-purpose program, we have to do something that is not specific
to any one operating system. So in
addition to encoding the actual characters that appear in the file, we will
create a code for a fictitious end-of-file character that will be used only by
the Huffman encoding and decoding programs.
That means that in addition to all of the legal characters, you are also
going to introduce a special character that will be used to signal
end-of-file. We will refer to this as
the “pseudo-eof” character. Its value
will be one higher than the value of the highest character in the frequency
array passed to the constructor. It will
always have a frequency of 1 because it appears exactly once at the end of each
file to be encoded. You will have to
manually add this character to your priority queue because it will not be
included as part of the frequency array.
You
will be provided with a class called PriorityQueue that you should use to build
your Huffman Tree. As with the
LinkedQueue class that we saw earlier in the quarter, the PriorityQueue class
implements the Queue interface. The only
difference between it and a standard Queue is that it uses the natural ordering
of the objects to decide which object to dequeue first, with objects considered
“less” returned first. You are going to
be putting subtrees into your priority queue, which means you’ll be adding
values of type HuffmanNode. This means
that your HuffmanNode class will have to implement the Comparable
interface. It should use the frequency
of the subtree to determine its ordering relative to other subtrees, with lower
frequencies considered “less” than higher frequencies.
You are being given two data files for this assignment called short.txt and hamlet.txt. The file short.txt is a short input file suitable for preliminary testing. The file hamlet.txt contains the full text of Shakespeare’s play Hamlet.
In terms of correctness, your class must provide
all of the functionality described above.
In terms of style, we will be grading on your use of comments, good
variable names, consistent indentation and good coding style to implement these
operations. Remember that you will lose
points if you declare variables as data fields that can instead be declared as
local variables. You should also avoid
extraneous cases (e.g., don’t make something into a special case if it doesn’t
have to be).
You MUST name your files HuffmanNode.java and HuffmanTree.java and turn it in electronically from the “assignments” link on the class web page. A collection of files needed for the assignment is included on the web page as ass8.zip. You will need to have MakeCode.java, Queue.java, PriorityQueue.java, and Scanner.java in the same directory as your files in order to run MakeCode. The zip file will also include the sample data files (short.txt and hamlet.txt) along with their code files (short.code and hamlet.code).
If you would like to read more about Huffman’s algorithm, there are many sources on the web including, for example, http://www.maths.abdn.ac.uk/~igc/tch/mx4002/notes/node59.html.