CSEP 546: Data Mining (Spring 2010)

Assignment 4: Clustering, SVMs, PAC Learning, Association Rules

Due Date: Tuesday, June 1, 2010, via email.

  1. (25 points) Perform by hand two iterations of the K-Means clustering algorithm on the following set of points: (-1.1, 2.4), (-0.9, 0.1), (-0.2, 0.2), (-0.1, 1.1), (0.2, 0.8), (0.4, 0.1), (1.2, 2.0), (1.4, 0.3). Each iteration consists of a point-reassignment step followed by a centroid-recalculation step. Take the number of clusters to be two, with initial centroids at (1, 1) and (-1, 1). Use Euclidean distance as the distance measure. Show your work.
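
    A minimal NumPy sketch for checking your hand computation afterwards (a
    verification aid only; the problem asks for the work by hand):

      import numpy as np

      # The eight data points and the two initial centroids from the problem.
      points = np.array([(-1.1, 2.4), (-0.9, 0.1), (-0.2, 0.2), (-0.1, 1.1),
                         (0.2, 0.8), (0.4, 0.1), (1.2, 2.0), (1.4, 0.3)])
      centroids = np.array([(1.0, 1.0), (-1.0, 1.0)])

      for it in range(2):
          # Reassignment step: each point joins its nearest centroid (Euclidean).
          dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
          labels = dists.argmin(axis=1)
          # Recalculation step: each centroid moves to the mean of its cluster
          # (assumes neither cluster becomes empty, which holds for these data).
          centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
          print(f"iteration {it + 1}: assignments {labels}")
          print(centroids)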

  2. (10 points) Given all the rule learning schemes we have previously seen in class, why do you think there is a need for association rule mining at all?

  3. (20 points) Consider the following set of transactions from an electronics shop. The entries in columns 2-5 indicate how many items of the corresponding kind were bought in each transaction.
    Transaction ID | Photo camera | Lens | Carrying bag | Tripod
    ---------------+--------------+------+--------------+-------
                 1 |            1 |    1 |            0 |      0
                 2 |            1 |    1 |            0 |      1
                 3 |            1 |    0 |            1 |      0
                 4 |            0 |    0 |            0 |      1
                 5 |            1 |    1 |            1 |      1
                 6 |            1 |    1 |            0 |      1

    Given these data, calculate the support and confidence of each of the
    following rules (a sketch for checking your numbers appears below):

      (a) {Camera} implies {Carrying bag}
      (b) {Camera, Lens} implies {Tripod}
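
    A minimal Python sketch for checking these calculations. It binarizes the
    quantity entries to presence/absence, the usual convention when computing
    support and confidence (an assumption about how the counts should be
    treated):

      # Transactions as item sets (quantities binarized to presence/absence).
      transactions = [
          {"camera", "lens"},                   # 1
          {"camera", "lens", "tripod"},         # 2
          {"camera", "bag"},                    # 3
          {"tripod"},                           # 4
          {"camera", "lens", "bag", "tripod"},  # 5
          {"camera", "lens", "tripod"},         # 6
      ]

      def support(itemset):
          # Fraction of transactions containing every item in the itemset.
          return sum(itemset <= t for t in transactions) / len(transactions)

      def confidence(antecedent, consequent):
          # confidence(A => B) = support(A union B) / support(A)
          return support(antecedent | consequent) / support(antecedent)

      # Rule (a), then rule (b).
      print(support({"camera", "bag"}), confidence({"camera"}, {"bag"}))
      print(support({"camera", "lens", "tripod"}),
            confidence({"camera", "lens"}, {"tripod"}))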

  4. (20 points) Mitchell 7.5a. You get 8 points for your answer and 12 points for your justification.

  5. (15 points) Consider learning a perceptron on nine-dimensional linearly separable data. How many training examples do you need to guarantee with 99% confidence that the learned perceptron has true error of at most 10%?
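
    One standard tool here is Mitchell's VC-dimension sample-complexity bound,
    m >= (1/epsilon) * (4*log2(2/delta) + 8*VC(H)*log2(13/epsilon)), together
    with the fact that a perceptron over n-dimensional inputs has VC dimension
    n + 1. The sketch below evaluates that bound numerically, assuming it is
    the bound the problem intends:

      from math import ceil, log2

      def vc_sample_bound(vc_dim, epsilon, delta):
          # m >= (1/epsilon) * (4*log2(2/delta) + 8*vc_dim*log2(13/epsilon))
          return ceil((4 * log2(2 / delta)
                       + 8 * vc_dim * log2(13 / epsilon)) / epsilon)

      # Perceptron over 9-dimensional inputs: VC dimension 9 + 1 = 10.
      # Problem parameters: 99% confidence (delta = 0.01), error <= 10% (epsilon = 0.1).
      print(vc_sample_bound(vc_dim=10, epsilon=0.1, delta=0.01))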

  6. (10 points) Consider running an SVM twice on the same data set, once with Kernel 1 and once with Kernel 2. With Kernel 1 you obtain 1,000 support vectors; with Kernel 2 you obtain 100 support vectors. Which kernel would you expect to generalize better to unseen data, and why?

Turn in the following:

  • A report with your answers to the homework problems.

Good luck, and have fun!