CSEP 546: Data Mining (Spring 2010)

Assignment 3: Voted Perceptron, Ensemble methods, Genetic Algorithms

Due Date: Monday, May 10, 2010 in class; submit your code by emailing it to akolobov@cs.washington.edu by 6.30 p.m.

  1. (50 points) Spam filtering Redux: Now with voted perceptron.

    • The dataset is the same subset of 2005 TREC Public Spam Corpus. You can download it here. The training and a test sets it contains are in the same format: each line represents the space-delimited properties of an email, with the first one being the email ID, the second one being whether it is a spam or ham, and the rest are words and their occurence numbers in this email. Non-word characters have been removed in preprocessing and feature selection similar to Mehran Sahami's has been conducted.

    • Implement the perceptron algorithm to classify spam. Use your implementation to learn from the training set and report accuracy on the test set. Does the accuracy seem high or low?

    • How many iterations does it take for perceptron learning to converge? Compared to Naive Bayes, which algorithm performs better? Which one learns faster? (If your Naive Bayes implementation didn't work in one way or another, assume that a correct one would run in 3 minutes and would yield 90% accuracy.)

    • (10 points of extra credit, carryover from last homework) If you didn't get a chance to do the extra credit part of the previous homework, you can do so now. Alternatively, you may do the same exercise with the perceptron -- pick additional features, retrain the perceptron, and describe the results. Note, however, that if you did the extra credit part last time, you won't get credit for it this time.

  2. (10 points) Decision stumps are decision trees with only one node (the root). Which ensemble method would you use to improve the accuracy of decision stumps - bagging or boosting? Why?

  3. (10 points) What is the result of applying crossover to the strings 00110 and 10111, with crossover mask 11100?

  4. (20 points) Mitchell 4.2

  5. (20 points) Genetic Algorithm for Sudoku.

    Design a genetic algorithm that generates valid 3x3 Sudoku grids. Describe how you encode the Sudoku, and which heuristic genetic operators you can design for this particular problem.

Turn in the following:

  • Your code for Problem 1. Include instructions on how to run your code. Send your files in a single archive (zip, tar, etc) to akolobov@cs.washington.edu. The due date is 6.30 p.m., May 10, 2010. Your code should contain appropriate comments to facilitate understanding.

  • A report with your answers to the homework problems. For problem 1, as usual, you should describe from a high-level perspective how your code works.

    Good luck, and have fun!