Due Date: Monday, April 26, 2010 in class; submit your code by emailing it to akolobov@cs.washington.edu by 6.30 p.m.
|
- (40 points) Spam filtering.
- The dataset we will be using is a subset of 2005 TREC Public Spam Corpus. You can download it here. It contains a training set and a test set. Both files use the same format: each line represents the space-delimited properties of an email, with the first one being the email ID, the second one being whether it is a spam or ham (non-spam), and the rest are words and their occurence numbers in this email. In preprocessing, I removed non-word characters, and conducted feature selection similar to what Mehran Sahami did in his original paper using Naive Bayes to classify spams, though with larger cut-offs since our corpus is larger.
- Implement the multinomial naive Bayes algorithm classify spam. Use your algorithm to learn from the training set and report accuracy on the test set.
- Try various smoothing parameters for the Naive Bayes learner. Which parameters work best?
- (10 points of extra credit) Our feature selection makes learning much easier, but it also throws out useful information. For example, exclamation mark (!) often occurs in spams. Even the format of email sender matters: in the case when an email address appears in the address book, a typical email client will replace it with the contact name, which means that the email is unlikely to be a spam (unless, of course, you are a friend of the spammer!). Sahami's paper talked about a few such features he had used in his classifier. For extra credit, you can play with the original files and come up with useful features that improve your classifier. Here are the lists of the files used in train/test.
- For those who are interested in adversial classification, the emerging theme for spam detection, here is a nice KDD-04 paper from UW. Also, NIPS (a very popular machine learning conference) has recently hosted a workshop on adversial learning.
- (20 points) Apply FOIL to relational learning:
- (20 points) Mitchell 10.1.
- (10 points) Bayesian learning.
(a) (5 points) The fraction of rabid raccoons in the entire raccoon population is 4.2%. About 79% of rabid raccoons have trouble controlling their glands and constantly drool as a result. However, around 6% of healthy raccoons are always hungry so they drool all the time as well (OK, they are healthy physically but perhaps not psychologically). Suppose that you see a drooling raccoon in your kitchen, about to devour the contents of your garbage can. What is the probability that this raccoon is rabid?
(b) (5 points) Drooling isn't the only problem with rabid raccoons; 97% of them also tend to attack people at the first opportunity. At the same time, 2% of the physically healthy but raccoons will also attack people, as they feel that the latter threaten their food supply. Now, suppose the drooling raccoon in your kitchen, upon seeing you, starts charging at you and producing threatening sounds in the process. Assuming that the events of a raccoon drooling and attacking are independent given the state of the animal's health, what is the probability that your visitor is rabid?
P.S. Don't quote the figures mentioned in this problem's statement. Remember that 78.41936% of all statistical data is made up on the spot.
- (10 points) Conditional independence in Bayesian networks.
Consider the following Bayesian network:
- Is D independent of E?
- Is A independent of B given C?
- Is E independent of B given C?
- Is A independent of B given D?
- Is E independent of D given B?
Justify your answers.
Turn in the following:
- Your code for Problem 1. Include instructions on how to run your code. Send your files in a single archive (zip, tar, etc) to akolobov@cs.washington.edu. The due date is 6.30 p.m., April 26, 2010. Your code should contain appropriate comments to facilitate understanding.
- A report with your answers to the homework problems. For problem 1, as usual, you should describe from a high-level perspective how your code works and what the accuracy of your results is; if your accuracy is low, analyze the errors and explain what you think might be the cause and what changes can be made for improvement.
Good luck, and have fun!
|