In this assignment, you consider a problem faced by a certain large
online bookseller which we will call Anonymous.com. Over time,
Anonymous has gathered a large database with entries of the form
(ID, book)
, signifying that the customer with the given
ID has purchased the given book. With such a large database in hand,
the company can easily gather associations of the form "People who buy
book A often buy book B" and use this to provide recommendations on
titles of interest to customers.
Suppose that Anonymous wishes to begin selling compact discs online
also. As customers buy CDs, it begins to develop a database with
entries of the form (ID, CD title)
. Unfortunately, this
database is not yet large enough to make associations between CD
titles, but the company still wants to provide its customers with CD
recommendations. One way of doing this is to leverage off of the book
recommendation information. Specifically, if book A is recommended to
a customer and people who buy book A often buy CD B, then Anonymous
can recommend CD B to the customer.
In this assignment, we will use the hash table and heap data
structures to efficiently solve the following problem: given the
database of (ID, book)
pairs and a separate database of
(ID, CD title)
pairs, find all of the pairs (CD,
book)
such that the CD and book were both purchased by the same
customer. Then, print out the k most-frequently occurring
such pairs.
Guidelines:You should write your programs individually, but you can consult with other students on all aspects of the work. A good rule of thumb is that I should not be able to hold two programs side by side and easily identify areas that were written jointly. You can use any language and computing platform that you like. As described in more detail below, you will hand in a printed copy of your program and sample output on specific test cases given below. Also see the programming project web page for grading guidelines.
The first step in solving the problem above is to generate a list of
all of the (book, CD)
pairs such that the book and CD
were purchased by the same customer. The easiest way to do this is to
use a brute-force quadratic-time algorithm which compares each pair of
entries of the form (ID, book)
, (ID, CD)
and
adds (book, CD)
to the list if the ID is the same for
both pairs. However, there is a better procedure which runs in expected
linear time. The procedure is called a hash join in database
lingo, and, as the name implies, is based on hash tables.
To do a hash join on our data, we first construct an open
(chained) hash table keyed on the customer ID. Into this hash table,
we insert all of the (ID, CD)
pairs (we use the smaller
of the two databases to minimize the table size). Now, for each
(ID, book)
pair, we go to the hash table list associated
with the ID. In that list, we look for all of the (ID,
CD)
entries it contains which have the same ID as the book
purchaser, and output the pair (book, CD)
for each match.
We now need to efficiently figure out the k most-frequent
(book, CD)
pairs. To do this, we want to maintain a max
heap whose entries are (book, CD)
pairs and which is
keyed on the number of times we have seen that pair. In order to
generate the heap while we process the pairs, however, we must augment
the heap to have an increase_key
function which
percolates specific heap entries up after their number of occurrences
increases. We also need to maintain a search structure (such as a
hash table) which can quickly tell us which index in the heap contains
a specific pair.
To demonstrate, suppose we have a structure called Pair
with the values num_occurrences
, heap_index
,
and id
for each (CD, book)
pair. As we
process the next such pair, we use the following routine:
process(cd : String, book : String, pairs : HashTable, heap : Heap) { x : Pointer to Pair; x = pairs.lookup(cd, book); /* Assume this returns NULL if the pair is not in the table */ if (x is NULL) { x = new Pair; x.id = (cd, book); x.num_occurrences = 1; heap.insert(x); /* Note that the heap operations must update the "heap_index" field! */ pairs.insert(cd, book, x); } else { x.num_occurrences++; heap.increase_key(x.heap_index); /* This percolates x up, changing "heap_index" fields of all affected items. */ } }
In order to receive full credit, your program must have the following features:
(book, CD)
pairs, along
with the number of times they occurred.
Here is a breakdown into small steps which might be helpful, although you are not required to follow it.
(CD, book)
pairs to verify your program is working so far.
process
routine above, and modify
your program to call it as each pair is generated via the
hash join algorithm.
remove
operation to retrieve
the k most frequent pairs.