CSEP 573: Applications of Artificial Intelligence
Homework Assignment #4
Due: Before midnight Tuesday, March 16, 2010
Turn-in procedure: Use the dropbox: https://catalysttools.washington.edu/collectit/dropbox/afriesen/8677
to submit a folder containing a document with all your answers and any supporting files. Name the folder HW4_<LastnameFirstname>.
Machine Learning for Spam Detection
“Viagra for free, pay shipping only” “RE:
These are examples of phrases (PG-13 ones only) found in spam email received at a UW CSE account over the past month. Spammers are constantly finding new ways to get around current spam filters. Many early filters, built on straightforward principles such as a fixed set of keywords, are no longer effective because spam keeps changing. What we need are spam filters that can themselves change and adapt to new spamming strategies. This looks like a job for machine learning.
In this exercise, you will apply several of the machine learning algorithms discussed in class to a dataset containing examples of spam and legitimate email. The exercise is based on the work of Andy Menz at
Next, download the Weka machine learning Java toolkit using this link: Weka version 3-2-3 (note: download the version from this link, not the latest version from the Weka home page; the FeatureFinder.java program we will use only works with the older version). Extract all the files to a directory. Double-click the executable jar file called “weka” and you should see a window for the Weka GUI with a picture of the Weka bird on it. Click the “Explorer” button to start exploring the world of machine learning, Weka-style. The interface is quite intuitive, but if you need more information, you can read the README and Tutorial.pdf files in your Weka directory or consult the Witten & Frank Data Mining textbook.
Download the feature extractor program FeatureFinder.java, which extracts features from email text files, and read its documentation. Create a directory called Ling-Spam in the same directory as FeatureFinder. Download the Ling-Spam dataset and extract it into your Ling-Spam directory (you should end up with files in subdirectories of the form Ling-Spam/lingspam_public/bare/part#, where # is 1-10). FeatureFinder requires that you gunzip all of the files in the lingspam_public/bare/part# folders before running it.
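The decompression step above can be scripted rather than done by hand; a minimal sketch, assuming the dataset was extracted into Ling-Spam/ as described:

```shell
# Ensure the expected directory exists (a no-op if the dataset is already in place).
mkdir -p Ling-Spam/lingspam_public/bare
# Decompress every gzipped message under the bare/ folders so that
# FeatureFinder can read the plain-text files.
find Ling-Spam/lingspam_public/bare -name '*.gz' -exec gunzip {} \;
```

After this runs, each part# folder should contain only plain-text message files, with no remaining .gz files.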
Compile FeatureFinder against weka.jar (in the weka-3-2-3 folder). Execute FeatureFinder with the following parameter settings: Feature Vector = Boolean, No Stemming, No Stop Terms, Number of Features = 250. The output will be an .arff file you can use with Weka. To get familiar with the .arff file format, you may open the file in a text editor such as Notepad. Other example data files can be found in the “data” directory in the weka-3-2-3 folder.
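For reference, an .arff file with Boolean word features looks roughly like the sketch below. The attribute names and class labels here are made up for illustration; the file FeatureFinder produces will have different attributes and many more of them:

```
@relation spam

@attribute "free"  {0,1}
@attribute "money" {0,1}
@attribute class   {spam,nonspam}

@data
1,0,spam
0,1,nonspam
```

Each @attribute line declares one feature (here, whether a word occurs in the message), and each line after @data is one email encoded as a feature vector plus its class label.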
Use the Weka Explorer GUI to run all your experiments below.
Turn in a write-up answering the questions for each experiment.
(Each problem below is worth 25 points)
Extra Credit Problem 1 (25 points): Doing it better with better data (maybe). The Ling-Spam dataset is based on email sent to a linguistics mailing list. A more general dataset is the SpamAssassin corpus. Create a dataset of spam/ham email messages (ham = non-spam email) from the SpamAssassin corpus and run a version of MyFeatureFinder to obtain an .arff file. Repeat the experiments in Problems 3 and 4 (but using the SpamAssassin dataset instead of Ling-Spam). Is the accuracy rate of the SpamAssassin classifier on your test dataset better than what you got in Problem 4 using the Ling-Spam classifier?
Extra Credit Problem 2 (25 points): Tracking spam evolution. Spamming techniques typically evolve in response to progressively better spam filters. As a result, the criteria that a spam filter might rely on to catch spam can also be expected to evolve over time. The goal of this problem is to get a snapshot of spam evolution in action. Download a dataset from SpamAssassin corpus and split the data into several different time periods (e.g., Jan-March, March-May, etc.). For each time period, run the J48 decision tree learner (weka.classifiers.j48.J48) in Weka. Copy and paste to your write-up the output of Weka for the first and last time periods. Recall that the decision tree algorithm chooses attributes according to information gain. Note down the top 6 attributes used by the decision tree for each time period (you can get the top 6 attributes via a breadth-first traversal of the decision tree; the tree can be visualized by right clicking on the highlighted item in “Result list”). Construct a table where each column lists the top 6 attributes for one time period. Based on the table, do you see any trends in the evolution of spam in this dataset? Are there attributes that drop in importance, remain relatively constant, or rise in importance over the time course of the dataset?
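As a reminder of the selection criterion mentioned above, the standard information-gain formula (written here for the two-class spam/legitimate case) is:

```latex
% Entropy of a set S of messages with spam fraction p_+ and
% legitimate fraction p_- = 1 - p_+:
\mathrm{Entropy}(S) = -\,p_+ \log_2 p_+ \;-\; p_- \log_2 p_-

% Information gain from splitting S on attribute A, where S_v is the
% subset of S for which A takes value v:
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) \;-\; \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```

An attribute with higher gain splits the data into purer spam/legitimate subsets, which is why the words nearest the root of the tree are the strongest spam indicators at that point in time.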
Extra Credit Problem 3 (25 points): Personalizing your spam filter. Create a reasonably large personalized training dataset from your own email containing examples of spam and non-spam email. Repeat the experiments in Problems 3 and 4 (but using your personalized email training dataset instead of Ling-Spam). Is the accuracy rate of your personalized spam classifier on your test dataset better than what you got in Problem 4 using the Ling-Spam classifier?