CSEP 546: Data Mining (Spring 2010)

Assignment 1: Collaborative Filtering, IBL, Decision Trees, Methodology

Due Date: Monday, April 12, 2010 in class; submit your code by emailing it to akolobov@cs.washington.edu by 6.30 p.m.

  1. (40 points) Collaborative Filtering on Netflix Ratings:

    • Read the paper Empirical Analysis of Predictive Algorithms for Collaborative Filtering. You need to read up to Section 2.1, and are encouraged to read further if you have time.

    • The dataset we will be using is a subset of the movie ratings data from the Netflix Prize. You can download it here. It contains a training set, a test set, a movies file, a dataset description file, and a README file. The training and test sets are both subsets of the Netflix training data. You will use the ratings provided in the training set to predict those in the test set. You will compare your predictions with the actual ratings provided in the test set. The evaluation metrics you need to measure are the Mean Absolute Error (as defined here) and the Root Mean Squared Error (as defined here). The dataset description file further describes the dataset, and will help you get started. The README file is from the original set of Netflix files, and has been included to comply with the terms of use for this data.
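    A minimal sketch of the two metrics (function names and the toy ratings are my own, not the assignment's file format):

```python
import math

def mae(predicted, actual):
    # Mean Absolute Error: average absolute deviation from the true ratings.
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def rmse(predicted, actual):
    # Root Mean Squared Error: penalizes large deviations more heavily than MAE.
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Toy illustration (made-up ratings, not Netflix data):
print(mae([3.5, 4.0, 2.0], [4, 4, 3]))   # 0.5
print(rmse([3.5, 4.0, 2.0], [4, 4, 3]))  # ~0.645
```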

    • Implement the collaborative filtering algorithm described in Section 2.1 of the paper (Equations 1 and 2; ignore Section 2.1.2) for making the predictions.
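    One possible sketch of that memory-based scheme: Pearson correlation weights (Equation 2) plugged into the mean-vote-plus-weighted-deviations prediction (Equation 1). The `{user: {movie: rating}}` dictionary layout and all names are assumptions, not the assignment's data format:

```python
import math

def mean_vote(ratings, u):
    # Mean rating of user u over everything u has rated.
    votes = ratings[u]
    return sum(votes.values()) / len(votes)

def pearson_weight(ratings, a, i):
    # Eq. 2: Pearson correlation over the movies both users rated.
    common = set(ratings[a]) & set(ratings[i])
    if not common:
        return 0.0
    va, vi = mean_vote(ratings, a), mean_vote(ratings, i)
    num = sum((ratings[a][j] - va) * (ratings[i][j] - vi) for j in common)
    da = sum((ratings[a][j] - va) ** 2 for j in common)
    di = sum((ratings[i][j] - vi) ** 2 for j in common)
    return 0.0 if da == 0 or di == 0 else num / math.sqrt(da * di)

def predict(ratings, a, j):
    # Eq. 1: active user's mean vote plus correlation-weighted deviations
    # of every other user who rated movie j, normalized by sum of |weights|.
    va = mean_vote(ratings, a)
    num = norm = 0.0
    for i in ratings:
        if i == a or j not in ratings[i]:
            continue
        w = pearson_weight(ratings, a, i)
        num += w * (ratings[i][j] - mean_vote(ratings, i))
        norm += abs(w)
    return va if norm == 0 else va + num / norm
```

    On the full training set you will want to cache the user means and weights rather than recompute them for every prediction.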

    • (10 points of extra credit) Add yourself as a new user to the training set. To do this, you will need to create a new user ID for yourself. Select some movies that you have seen among those in the training set, and add your ratings for them to the training set. Extend your system to output predictions for the movies you haven't rated, ranked in order of decreasing ratings. Do you agree with the predictions of your system? Check out some of the top ranked movies that you haven't seen (but only after you have finished work on the project).

  2. (20 points) Suppose the instance space is the square -4 < X,Y < 4, and you have a training set composed of positive examples at (-2, 0), (0, 0), (0, 2), (2, 0) and negative examples at (-1, 1), (1, 1), (1, -1), (0, -2), (-1, -1).

    (a) (4 points) Draw the Voronoi diagram of this training set using Euclidean distance.

    (b) (8 points) Suppose you measure the error rate of the 3-nearest neighbor algorithm on this training set using leave-one-out cross-validation (i.e., you take out each example in turn, and predict its class using the other examples). What would the measured error rate be? If an example is tied with others for the third-nearest distance, and different choices of the 3 nearest neighbors would give different predictions, count that example as an error. You are thus computing the worst-case error rate.

    (c) (8 points) Suppose you apply the 3-nearest-neighbor algorithm with backward elimination. Which features would be eliminated (X, Y, both, neither, or depends)? Why?
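    The leave-one-out procedure of part (b), including the worst-case tie rule, can be sketched as follows; this is a generic helper of my own (the toy set below is illustrative, not the training set above):

```python
def loocv_3nn_worst_case_error(points):
    # points: list of ((x, y), label) with integer coordinates, so squared
    # distances are exact and ties at the cutoff can be detected reliably.
    errors = 0
    for held, (p, label) in enumerate(points):
        d2 = sorted(
            ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2, lab)
            for k, (q, lab) in enumerate(points) if k != held
        )
        cut = d2[2][0]                             # squared distance of the 3rd neighbor
        fixed = [lab for d, lab in d2 if d < cut]  # always among the 3 nearest
        ties = [lab for d, lab in d2 if d == cut]  # compete for the remaining slots
        need = 3 - len(fixed)
        # Best- and worst-case count of positives among the 3 chosen neighbors:
        max_pos = fixed.count('+') + min(need, ties.count('+'))
        min_pos = fixed.count('+') + max(0, need - ties.count('-'))
        pred_hi = '+' if max_pos >= 2 else '-'
        pred_lo = '+' if min_pos >= 2 else '-'
        # An ambiguous tie and a plain misclassification both count as errors.
        if pred_hi != pred_lo or pred_hi != label:
            errors += 1
    return errors / len(points)

# Toy check: three clustered positives and one far-away negative. Each
# positive is correctly predicted by its peers; the negative is not.
toy = [((0, 0), '+'), ((1, 0), '+'), ((0, 1), '+'), ((5, 5), '-')]
print(loocv_3nn_worst_case_error(toy))  # 0.25
```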

  3. (20 points) Give decision trees to represent the following boolean functions (5 points each):

    (a) A ∧ ¬ B

    (b) A ∨ [B ∧ C]

    (c) [A ∨ ¬B] ∧ [B ∧ C]

    (d) [A ∧ B] ∨ [C ∧ D]

  4. (10 points) Mitchell 3.2:

    Consider the following set of training examples:
    Instance   Classification   a1   a2
    1          +                T    T
    2          +                T    T
    3          -                T    F
    4          +                F    F
    5          -                F    T
    6          -                F    T

    (a) What is the entropy of this collection of training examples with respect to the target classification?

    (b) What is the information gain of a2 relative to these training examples?
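    Both quantities follow from the definitions in Mitchell, Chapter 3; they can be computed generically as below. The helper functions and the perfectly-split/useless toy examples are my own, not the table above:

```python
import math

def entropy(labels):
    # H(S) = -sum over classes of p_c * log2(p_c).
    n = len(labels)
    counts = [labels.count(v) for v in set(labels)]
    return -sum((c / n) * math.log2(c / n) for c in counts)

def info_gain(examples, attr):
    # Gain(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v).
    # examples: list of (label, {attribute: value}) pairs.
    labels = [lab for lab, _ in examples]
    gain = entropy(labels)
    for value in set(feats[attr] for _, feats in examples):
        subset = [lab for lab, feats in examples if feats[attr] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

# A balanced binary collection has entropy exactly 1 bit:
print(entropy(['+', '+', '-', '-']))  # 1.0
```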

  5. (10 points) You are given a classifier that, for each data point, outputs the probability of this point having the "+" label. Running the classifier on a set of 10 examples has yielded the following results:

    Example #   Classifier Output   True Label
    1           94%                 +
    2           91%                 +
    3           88%                 -
    4           83%                 +
    5           76%                 -
    6           46%                 +
    7           37%                 -
    8           29%                 -
    9           14%                 +
    10          8%                  -

    Based on the above data, build a ROC curve for the classifier as described in the lecture using five different threshold values of your choice.
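    Each threshold yields one point of the curve: predict "+" whenever the classifier output is at or above the threshold, then plot the false positive rate against the true positive rate. A sketch with made-up scores (deliberately not the table above):

```python
def roc_point(scores, labels, threshold):
    # Predict '+' when the score is >= threshold; return (FPR, TPR).
    tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == '+')
    fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == '-')
    pos, neg = labels.count('+'), labels.count('-')
    return fp / neg, tp / pos

# Illustrative scores and labels (my own, not the assignment data):
scores = [0.9, 0.7, 0.4, 0.2]
labels = ['+', '-', '+', '-']
print(roc_point(scores, labels, 0.5))   # (0.5, 0.5)
print(roc_point(scores, labels, 0.95))  # (0.0, 0.0)
print(roc_point(scores, labels, 0.1))   # (1.0, 1.0)
```

    Sweeping the threshold from high to low traces the curve from (0, 0) to (1, 1).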

Turn in the following:

  • Your code for Problem 1. Include instructions on how to run your code. Send your files in a single archive (zip, tar, etc) to akolobov@cs.washington.edu. The due date is 6.30 p.m., April 12, 2010. Your code should contain appropriate comments to facilitate understanding.

  • A report with your answers to the homework problems. Hand in a hard copy of it at the beginning of class. For problem 1, you should describe from a high-level perspective how your code works and what the accuracy of your results is. If your accuracy is low, analyze the errors and explain what you think might be the cause and what changes can be made for improvement.

    Good luck, and have fun!