Name: ________________________________
Section: ____________

CSE373 Winter Quarter
University of Washington
Midterm #2
February 18, 2005
Closed book, closed notes, closed neighbor; no calculators
2 points per part except as noted

 
1.
Describe a good (standard) algorithm widely used for hashing strings.  

The algorithm used by Java, which is both good and standard, is to treat the characters as coefficients of a polynomial in powers of 31 (or 37).  This can be evaluated efficiently using Horner's Rule.

Less good is to treat the entire string as one vast binary number (effectively, this is a polynomial in powers of 2^16).  Not terribly good, though simple, is to add up the individual character values.
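
A minimal sketch of the polynomial hash, evaluated with Horner's Rule (this mirrors java.lang.String.hashCode; the method name is mine):

    // Treat the characters as coefficients of a polynomial in powers of 31,
    // evaluated left to right by Horner's Rule: h = h*31 + next coefficient.
    static int stringHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }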

2.
1. Define "perfect hash function".

One that produces no collisions: distinct keys map to distinct slots in the array index range.


2. Explain why a perfect hash function would be preferable to an ordinary (imperfect) hash function.

Collisions are bad: resolving them costs extra probes or chain traversal, whereas a perfect hash function finds every key in a single probe.

[Note: graded as one 2-pt question, since the answer to the first part is essentially the answer to the second]

3.
One hashing function for numbers is to add together the individual decimal digits of the number (for example, 12345 gives 15), and then mod'ing by the table size.

Suppose this strategy is used with bank account numbers that are 3 digits long.
1. (1 pt.) If we wanted the table size to be about 30 (based on memory constraints -- a really tiny memory!!), what would be a good actual choice for the table size?

29 or 31 (prime numbers close to the target table size)


2. Assuming instead that a table size of 10 is chosen, show the result of adding these bank accounts to an initially empty table, assuming open addressing with linear probing:

445   111   202   510    321     610   042


Pretty much everyone got this, except for a few people who used the wrong hash calculation.  "445" hashes to 3, not to 5, for example.
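
For reference, the worked result (hash = digit sum mod 10, probing linearly to the next free slot):

    index:  0   1   2   3    4    5    6    7    8    9
    value:  -   -   -   445  111  202  510  321  610  042

(042 hashes to 6 and probes slots 7 and 8 before landing in 9.)

A minimal sketch of the hash and insert, assuming an Integer[] table with nulls for empty slots (names are mine):

    // Hash an account number by summing its decimal digits,
    // then reducing mod the table size.
    static int digitSumHash(int account, int tableSize) {
        int sum = 0;
        for (int n = account; n > 0; n /= 10) sum += n % 10;
        return sum % tableSize;
    }

    // Open addressing with linear probing: step to the next slot
    // (wrapping around) until an empty one is found.
    static void insert(Integer[] table, int account) {
        int i = digitSumHash(account, table.length);
        while (table[i] != null) i = (i + 1) % table.length;
        table[i] = account;
    }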

4.
(1 pt. each) Assuming chained hashing, with N records and M table entries, what are the best-case and worst-case asymptotic complexities of the find(key) function?
1. Best case

O(1)

2. Worst case

O(N)  [in the case where all elements hash to the same value]

M enters into the average or expected complexity through the load factor N/M (which we didn't study in depth, but which can be estimated using the Poisson distribution).
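
A sketch of find() for a chained table, assuming each bucket is a linked list of entries (names are mine): in the best case the key's bucket holds a one-element chain; in the worst case all N records collided into one bucket and the whole chain is scanned.

    import java.util.List;
    import java.util.Map;

    // Chained hashing: hash to a bucket, then walk that bucket's chain.
    static <K, V> V find(List<Map.Entry<K, V>>[] buckets, K key) {
        int i = (key.hashCode() & 0x7fffffff) % buckets.length;  // non-negative index
        for (Map.Entry<K, V> e : buckets[i]) {
            if (e.getKey().equals(key)) return e.getValue();     // O(chain length)
        }
        return null;  // not found
    }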

5.
Using tree notation, show the max-heap (not min-heap!) that results when these values are inserted, in this order, into an initially empty heap:
100   200   50   60   1000    3    6   777
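
For reference, inserting with the usual percolate-up gives:

                  1000
                 /    \
              777      50
             /   \    /  \
           200   100 3    6
           /
         60

A minimal array-based insert sketch (heap kept 1-based in heap[1..size]; names are mine):

    static int[] heap = new int[100];
    static int size = 0;

    static void insert(int value) {
        heap[++size] = value;                      // place at the next free slot
        int i = size;
        while (i > 1 && heap[i / 2] < heap[i]) {   // percolate up: parent must be larger
            int tmp = heap[i / 2];
            heap[i / 2] = heap[i];
            heap[i] = tmp;
            i /= 2;
        }
    }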

6.
Whatever your final answer was for the previous question, show it again, this time using array representation.
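
For reference, the heap above in array form (assuming the usual convention with the root at index 1):

    index:  1     2    3   4    5    6  7  8
    value:  1000  777  50  200  100  3  6  60

Each node at index i has its children at 2i and 2i+1.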

7.
Show the binary search tree that would result from adding the following values, in the order shown, into an initially empty tree:

100   200   50   60   1000    3    6   777
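
For reference, standard BST insertion gives:

             100
            /   \
          50     200
         /  \       \
        3    60      1000
         \           /
          6        777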

8.
Redraw your answer from the above question, and then show the result of deleting the root.   Use the "left" choice, if you know what I mean (if you don't -- don't ask).

The algorithm for deletion, when the node to be deleted has two children, is to replace that node with either the largest value in its left subtree (the "left choice") or the smallest value in its right subtree.
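
For reference: the largest value in the root's left subtree is 60, so 60 replaces 100 at the root and the old 60 node (a leaf) is removed:

             60
            /   \
          50     200
         /          \
        3            1000
         \           /
          6        777

A sketch of the "left choice" deletion (Node is a hypothetical minimal BST node class):

    class Node { int val; Node left, right; }

    static Node delete(Node root, int key) {
        if (root == null) return null;
        if (key < root.val) {
            root.left = delete(root.left, key);
        } else if (key > root.val) {
            root.right = delete(root.right, key);
        } else if (root.left != null && root.right != null) {
            Node max = root.left;
            while (max.right != null) max = max.right;  // largest in left subtree
            root.val = max.val;                         // overwrite, then
            root.left = delete(root.left, max.val);     // delete the donor node
        } else {
            root = (root.left != null) ? root.left : root.right;  // 0 or 1 child
        }
        return root;
    }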

9.
A binary search tree contains the values 6, 7, 8, 9, 10, 11.   The nodes with 7 and 11 have the same depth but different heights.  Draw a tree satisfying these constraints.
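
One tree (of several) satisfying the constraints:

            9
           / \
          8   10
         /      \
        7        11
       /
      6

Here 7 and 11 are both at depth 2, but 7 has height 1 (it has the child 6) while 11 has height 0 (it is a leaf).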

10.
Is it possible for a binary search tree to simultaneously be a heap?  Explain.

Yes, but only in two very limited cases:
a) the tree has only one node
b) there are two nodes, the root and a left child.  In this case you have a complete tree with the max-heap property (the root is larger than its only child).

If a complete tree has 3 or more nodes, then the root must have two children, and to satisfy the heap property the root would have to be either <= or >= both of its children; either way this violates the search property, which requires the root to be greater than its left child but less than its right child.
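
For example, the two-node case:

      2
     /
    1

The left child is smaller than the root (the search property), and the parent is >= its only child (the max-heap property).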

11.
True or false: every node in a complete tree is height-balanced.   Explain.

True.  [Reminder: a node is "height-balanced" if the heights of its left and right subtrees differ by at most 1.]

The height of a node is the length of the longest path from that node down to a leaf.  In a complete tree, all leaves lie at depth d or depth d+1, for some d.  So for any node, every path from one of its children down to a leaf ends at one of those two depths, and thus the heights of its two subtrees cannot differ by more than 1.

12. (8 pts.)
The overall idea of a Favorites List is to make recently and/or frequently accessed elements easier to find.   The implementation we discussed used a List as the underlying data structure.

Design a Favorites "List" using a different strategy: keeping the data primarily in one structure (A) but also placing recent accesses in a separate, probably smaller structure (B).

Come up with a design and then answer the questions.   Hint: read all the questions first!



1a. What is A?

A can be anything reasonable.  You could use an unordered List, but now that we know about faster structures, why not use a balanced binary search tree?


1b. What is B?

The key is to make B such that entries which are in B are guaranteed to be retrievable faster than from A (otherwise, what's the point?).  A good choice is a hash table, since, if collisions are avoided, it has O(1) access time (for the elements in it).

A queue or stack is not a good choice for either A or B.  The "find" requests that come in are not likely to match either a FIFO or LIFO pattern, and queues and stacks do not naturally support searching.


2. Is your design intended primarily to speed up access of the most recently accessed elements, or the most frequently accessed elements?

Most recently accessed.



3. Give the "insert" algorithm


Add the new element to B.  If there is already a (different) element at that location in B, move the displaced one into A.  (A sketch appears after the "find" algorithm below.)


4. Give the "find" algorithm


See if the element is in B.  If so, return it.  If not, search for it in A.  If it is not in A either, report failure.  If it is in A, remove it from A and add it to B before returning it.
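
A minimal sketch of one such design (all names are mine, not from the course): A is a balanced BST (TreeMap), and B is a small direct-mapped array that evicts a colliding entry into A, so B itself never has chains.

    import java.util.TreeMap;

    class Favorites {
        private final TreeMap<String, String> a = new TreeMap<>(); // A: balanced BST
        private final String[] keys = new String[16];              // B: small and fixed
        private final String[] vals = new String[16];

        private int slot(String key) {
            return (key.hashCode() & 0x7fffffff) % keys.length;
        }

        // insert: the new element goes into B; a displaced entry moves to A
        void insert(String key, String value) {
            int i = slot(key);
            if (keys[i] != null && !keys[i].equals(key)) {
                a.put(keys[i], vals[i]);   // evict the older entry into A
            }
            keys[i] = key;
            vals[i] = value;
        }

        // find: one probe into B (no chains); fall back to A,
        // and promote a hit in A into B
        String find(String key) {
            int i = slot(key);
            if (key.equals(keys[i])) return vals[i];
            String v = a.remove(key);
            if (v != null) insert(key, v);
            return v;
        }
    }

Because B is fixed-size and evicts into A, it stays small, which addresses the "B just kept getting bigger" problem noted below.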



5. Justify that your "find" is faster than it would be for either A or B by itself.  (If it isn't, then you really don't have a Favorites List!)

Any kind of favorites scheme depends on a locality assumption.  If you don't frequently re-access previously accessed elements, there isn't any reason to expect a speedup.  If you do have an appropriate access pattern, though, why not just use a hash table by itself?  The answer is that it would be slower because of collisions -- on a find, each chain of collisions has to be searched.  With our scheme, B never has any collisions.  If the element is in B, it is found immediately.  Thus, if recently accessed elements are frequently reaccessed, the cost remains low.

One problem noted with some solutions is that both A and B just kept getting bigger.  Any advantage B might have had at the start, simply by being small, eroded over time.  You need a way to keep B small.


Fascinating Footnote: in the design of hardware memory systems, the practice is to use a large, inexpensive, but often slow memory for A (main memory), and for B a relatively small but fast (and expensive) "associative" memory as a cache.  Associativity is basically a hardware hash based on the address.  There is usually a hardware-controlled mechanism for handling a fixed number of address collisions (typically 2, 4, or 8).