Programming Project II

Generating CD Recommendations for Online Shoppers

CSE 373, Spring 1998

Due June 5, 1998

Overview

In this assignment, you consider a problem faced by a certain large online bookseller which we will call Anonymous.com. Over time, Anonymous has gathered a large database with entries of the form (ID, book), signifying that the customer with the given ID has purchased the given book. With such a large database in hand, the company can easily gather associations of the form "People who buy book A often buy book B" and use this to provide recommendations on titles of interest to customers.

Suppose that Anonymous wishes to begin selling compact discs online also. As customers buy CDs, it begins to develop a database with entries of the form (ID, CD title). Unfortunately, this database is not yet large enough to make associations between CD titles, but the company still wants to provide its customers with CD recommendations. One way of doing this is to leverage off of the book recommendation information. Specifically, if book A is recommended to a customer and people who buy book A often buy CD B, then Anonymous can recommend CD B to the customer.

In this assignment, we will use the hash table and heap data structures to efficiently solve the following problem: given the database of (ID, book) pairs and a separate database of (ID, CD title) pairs, find all of the pairs (CD, book) such that the CD and book were both purchased by the same customer. Then, print out the k most-frequently occurring such pairs.

Guidelines:You should write your programs individually, but you can consult with other students on all aspects of the work. A good rule of thumb is that I should not be able to hold two programs side by side and easily identify areas that were written jointly. You can use any language and computing platform that you like. As described in more detail below, you will hand in a printed copy of your program and sample output on specific test cases given below. Also see the programming project web page for grading guidelines.

The Algorithms

Generating (Book, CD) Pairs

The first step in solving the problem above is to generate a list of all of the (book, CD) pairs such that the book and CD were purchased by the same customer. The easiest way to do this is to use a brute-force quadratic-time algorithm which compares each pair of entries of the form (ID, book), (ID, CD) and adds (book, CD) to the list if the ID is the same for both pairs. However, there is a better procedure which runs in expected linear time. The procedure is called a hash join in database lingo, and, as the name implies, is based on hash tables.

To do a hash join on our data, we first construct an open (chained) hash table keyed on the customer ID. Into this hash table, we insert all of the (ID, CD) pairs (we use the smaller of the two databases to minimize the table size). Now, for each (ID, book) pair, we go to the hash table list associated with the ID. In that list, we look for all of the (ID, CD) entries it contains which have the same ID as the book purchaser, and output the pair (book, CD) for each match.

Finding the k Most-Frequent Pairs

We now need to efficiently figure out the k most-frequent (book, CD) pairs. To do this, we want to maintain a max heap whose entries are (book, CD) pairs and which is keyed on the number of times we have seen that pair. In order to generate the heap while we process the pairs, however, we must augment the heap to have an increase_key function which percolates specific heap entries up after their number of occurrences increases. We also need to maintain a search structure (such as a hash table) which can quickly tell us which index in the heap contains a specific pair.

To demonstrate, suppose we have a structure called Pair with the values num_occurrences, heap_index, and id for each (CD, book) pair. As we process the next such pair, we use the following routine:

process(cd : String, book : String, pairs : HashTable, heap : Heap)
{
   x : Pointer to Pair;

   x = pairs.lookup(cd, book);  /* Assume this returns NULL if
                                   the pair is not in the table */
   if (x is NULL) {
     x = new Pair;
     x.id = (cd, book);
     x.num_occurrences = 1;
     heap.insert(x);            /* Note that the heap operations
                                   must update the "heap_index"
                                   field! */
     pairs.insert(cd, book, x);
   }
   else {
     x.num_occurrences++;
     heap.increase_key(x.heap_index);  /* This percolates x up,
                                   changing "heap_index" fields
                                   of all affected items. */
   }
}

Specific Instructions and Advice

In order to receive full credit, your program must have the following features:

It must implement the hash table and heap structures described above. For this assignment, you must write these data structures yourselves. Do not use pre-written or built-in hash tables or heaps. Learning to implement these data structures is the whole point of the assignment. You may, however, copy code from the book and modify it.
It must be able to read in the sample database from the assignment web page and print out the k most frequent (book, CD) pairs, along with the number of times they occurred.

Here is a breakdown into small steps which might be helpful, although you are not required to follow it.

Write a hash table ADT whose key value is a string. Since we use several such tables in this assignment to store different types of information, I suggest writing a "generic" hash table if possible. In C++, this is done via templates. In C, it is typically done via void pointers (see my generic C list class from the Project 1 Sample Solution).
Once your hash table is done, writing the hash join is trivial. Do this and print out all of the (CD, book) pairs to verify your program is working so far.
Write the heap data structure and test it.
Implement the process routine above, and modify your program to call it as each pair is generated via the hash join algorithm.
Finally, use heap remove operation to retrieve the k most frequent pairs.

Submission Procedure

You must print out, staple, and hand in a copy of your code.
You must print out the 10 most frequent pairs in the databases given by the following files, along with their frequencies:
- books.dat [ The (ID, Book) pairs ]
- cds.dat [ The (ID, CD) pairs ]
(However, you are highly encouraged to hand-make smaller examples to test and debug your programs).

Maintained by:
dfasulo@cs.washington.edu
(Last Update: )