CSEP 546: Data Mining (Spring 2010)

Assignment 2: Rule Induction and Bayesian Learning

Due Date: Monday, April 26, 2010 in class; submit your code by emailing it to akolobov@cs.washington.edu by 6.30 p.m.

  1. (40 points) Spam filtering.

    • The dataset we will be using is a subset of the 2005 TREC Public Spam Corpus. You can download it here. It contains a training set and a test set. Both files use the same format: each line represents the space-delimited properties of an email, with the first token being the email ID, the second indicating whether it is spam or ham (non-spam), and the rest being words and their occurrence counts in that email. In preprocessing, I removed non-word characters and conducted feature selection similar to what Mehran Sahami did in his original paper on using Naive Bayes to classify spam, though with larger cut-offs since our corpus is larger.

    • Implement the multinomial naive Bayes algorithm to classify spam. Use your algorithm to learn from the training set and report its accuracy on the test set. (A minimal sketch of one possible approach appears at the end of this list.)

    • Try various smoothing parameters for the Naive Bayes learner. Which parameters work best?

    • (10 points of extra credit) Our feature selection makes learning much easier, but it also throws out useful information. For example, the exclamation mark (!) often occurs in spam. Even the format of the email sender matters: when an email address appears in the address book, a typical email client will replace it with the contact name, which means the email is unlikely to be spam (unless, of course, you are a friend of the spammer!). Sahami's paper describes a few such features he used in his classifier. For extra credit, you can play with the original files and come up with useful features that improve your classifier. Here are the lists of the files used in train/test.

    • For those who are interested in adversarial classification, an emerging theme in spam detection, here is a nice KDD-04 paper from UW. Also, NIPS (a very popular machine learning conference) recently hosted a workshop on adversarial learning.
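
    • To make the expected input and output concrete, here is a minimal sketch (in Python 3) of a multinomial naive Bayes learner for the file format described above. Treat it as one possible starting point, not the required implementation: the file names ("train" and "test"), the helper names, and the particular smoothing values tried in the loop are all assumptions you are free to change. The add-alpha term is the smoothing parameter you are asked to vary, and for the extra credit you could inject raw-text features (e.g., a count of "!" characters) into each email's count dictionary as additional pseudo-words.

      import math
      from collections import defaultdict

      def load(path):
          """Parse lines of the form: <id> <spam|ham> <word> <count> <word> <count> ..."""
          emails = []
          with open(path) as f:
              for line in f:
                  tokens = line.split()
                  if not tokens:
                      continue
                  email_id, label = tokens[0], tokens[1]
                  counts = {}
                  for word, count in zip(tokens[2::2], tokens[3::2]):
                      counts[word] = counts.get(word, 0) + int(count)
                  emails.append((email_id, label, counts))
          return emails

      def train_nb(emails, alpha=1.0):
          """Multinomial naive Bayes with add-alpha (Laplace/Lidstone) smoothing."""
          doc_count = defaultdict(int)                        # label -> number of emails
          word_count = defaultdict(lambda: defaultdict(int))  # label -> word -> total count
          total_words = defaultdict(int)                      # label -> total word occurrences
          vocab = set()
          for _, label, counts in emails:
              doc_count[label] += 1
              for word, c in counts.items():
                  word_count[label][word] += c
                  total_words[label] += c
                  vocab.add(word)
          n_docs = sum(doc_count.values())
          model = {"alpha": alpha, "vocab_size": len(vocab), "labels": {}}
          for label in doc_count:
              model["labels"][label] = {
                  "log_prior": math.log(doc_count[label] / n_docs),
                  "word_count": word_count[label],
                  "total_words": total_words[label],
              }
          return model

      def classify(model, counts):
          """Return the label with the highest log posterior for one email."""
          alpha, v = model["alpha"], model["vocab_size"]
          best_label, best_score = None, float("-inf")
          for label, stats in model["labels"].items():
              score = stats["log_prior"]
              denom = stats["total_words"] + alpha * v
              for word, c in counts.items():
                  # Unseen words get probability alpha / denom instead of zero.
                  score += c * math.log((stats["word_count"].get(word, 0) + alpha) / denom)
              if score > best_score:
                  best_label, best_score = label, score
          return best_label

      if __name__ == "__main__":
          train_set = load("train")   # assumed file names; adjust to your setup
          test_set = load("test")
          for alpha in (0.1, 1.0, 10.0):   # a few smoothing values to compare
              model = train_nb(train_set, alpha=alpha)
              correct = sum(1 for _, label, counts in test_set
                            if classify(model, counts) == label)
              print("alpha = %.1f  accuracy = %.4f" % (alpha, correct / len(test_set)))

      Working in log space to avoid underflow and smoothing the counts of unseen words are the two details implementations most often get wrong; the rest is bookkeeping.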

  2. (20 points) Apply FOIL to relational learning:

    • The purpose of this exercise is to familiarize you with FOIL and to give you a look at the kinds of rules an ILP system can produce.

    • Download FOIL6 from here. (It also contains a picture of our influential alumnus.) Decompress the shell archive and understand how FOIL works by checking out the following files: README, MANUAL, and member.explain. You can also try running FOIL with some example files (e.g., sort.d).

    • (10 points) Download the kinship data file. Run FOIL and report the rules you learn and their accuracy on the test relations. This file was created from the original Kinship dataset by a random split. The command is

      ./foil6 -v0 < kin.d

      Here, -v controls the verbosity level; you can use higher numbers (1-6) for more detailed output.

    • Look carefully at the rules FOIL has learned. Do all of them make sense? (You won't be graded on your answer to this question.)

    • If you only have access to a Windows machine, you will need to run FOIL from Cygwin, a tool that lets you compile and run many *nix-based applications on Windows. If you don't have Cygwin installed already, please follow these steps:

      • Go to www.cygwin.com and start Cygwin installation.
      • Accept the default choices the installer offers (make sure to note the root directory) until it asks you to choose a download server. Pick any server and click Next.
      • The installer screen you should now see lets you select packages for installation. Find the package named "Devel" (the 7th package from the beginning of the list). Click on "Default" next to "Devel" once; it should change to "Install". Click "Next" at the bottom of the installer window and relax while Cygwin installs.
      • Go to the %cygwin_root_directory%\home\%your_username%\ directory. Create a directory named "foil" and download the FOIL script, named "foil6.sh", from the above link into this directory.
      • Start Cygwin (e.g., from the desktop shortcut). Execute "cd foil", then "./foil6.sh", and, finally, "make". You are now ready to run FOIL as described above!

    • (10 points) Suppose that you are learning a rule for brother(A,B). Your current candidate is brother(A,B) :- wife(C,A) ^ daughter(E,A), and you are considering candidate specialization by adding another literal to the antecedent. How many distinct literals can be generated? Two literals are considered identical if they differ only in the names of the new variables that they contain.

  3. (20 points) Mitchell 10.1.

  4. (10 points) Bayesian learning.

    (a) (5 points) The fraction of rabid raccoons in the entire raccoon population is 4.2%. About 79% of rabid raccoons have trouble controlling their glands and constantly drool as a result. However, around 6% of healthy raccoons are always hungry so they drool all the time as well (OK, they are healthy physically but perhaps not psychologically). Suppose that you see a drooling raccoon in your kitchen, about to devour the contents of your garbage can. What is the probability that this raccoon is rabid?

    (b) (5 points) Drooling isn't the only problem with rabid raccoons; 97% of them also tend to attack people at the first opportunity. At the same time, 2% of the physically healthy raccoons will also attack people, as they feel that the latter threaten their food supply. Now, suppose the drooling raccoon in your kitchen, upon seeing you, starts charging at you and producing threatening sounds in the process. Assuming that the events of a raccoon drooling and attacking are independent given the state of the animal's health, what is the probability that your visitor is rabid?

    P.S. Don't quote the figures mentioned in this problem's statement. Remember that 78.41936% of all statistical data is made up on the spot.

  5. (10 points) Conditional independence in Bayesian networks.

    Consider the following Bayesian network:

    • Is D independent of E?
    • Is A independent of B given C?
    • Is E independent of B given C?
    • Is A independent of B given D?
    • Is E independent of D given B?

    Justify your answers.

Turn in the following:

  • Your code for Problem 1. Include instructions on how to run your code. Send your files in a single archive (zip, tar, etc.) to akolobov@cs.washington.edu. The deadline is 6.30 p.m. on April 26, 2010. Your code should contain appropriate comments to facilitate understanding.

  • A report with your answers to the homework problems. For Problem 1, as usual, you should describe at a high level how your code works and report the accuracy you obtained; if your accuracy is low, analyze the errors, explain what you think the cause might be, and suggest changes that could improve it.

    Good luck, and have fun!