Problems 1-4, Due May 28, 2004
This page has been updated on May 24 to include the full text of Problem 5 and an appendix to further explain how to choose split points when building decision trees.
This page has been updated on June 1 to make the text of the explanations in Problem 5 consistent with the lecture slides.
Problem 1 [4 points] R&N problem 18.7
Problem 2 [2+4+2 points] R&N 11.4a, b, & d.
Problem 3 [2+4 points] R&N problem 11.17 a & c
Problem 4: Decision Trees And Ensembles [30 points]
Overview: In this exercise you will form groups of two and use machine learning techniques to build two spam filters, then compare their performance. The concept of spam is diverse and difficult to define precisely: advertisements for products/web sites, make-money-fast schemes, chain letters, pornography... Our collection of spam e-mails comes from George Forman (a former UW graduate student, now working at HP Labs); he gathered the data set from the HP postmaster and from individuals who had filed spam. The collection of non-spam e-mails came from filed work and personal e-mails, and hence the word 'george' and the area code '650' are indicators of non-spam. While specific to this corpus, such words are useful when constructing a personalized spam filter. To build a general-purpose spam filter, one would either have to blind such non-spam indicators or gather a very wide collection of non-spam. We won't worry about such issues for this assignment.
The dataset contains about 39% spam. Each instance corresponds to a single e-mail and is represented with 57 attributes plus a class label (1 if spam, 0 if not). The data files contain one instance per line, and each line has 58 comma-delimited attributes, ending with the class label. Most of the attributes indicate how frequently a particular word or character occurs in the e-mail; frequency is encoded as a percentage in [0, 100]. A few attributes measure the length of sequences of consecutive capital letters. The files spambase-docs.txt and spambase-names.txt have more detailed information.
You will implement a decision tree learner that hill-climbs through the space of decision trees, driven by the information-gain heuristic, as discussed in class. To avoid overfitting, use chi-squared pruning or cross-validation as described in R&N pages 662-663. The latter (also called reduced-error pruning or expected-error pruning) is simpler. You can find additional information on it at the bottom of http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html.
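As a concrete starting point, here is a minimal sketch (in Java, since your submission will be a .jar) of the entropy and information-gain computations the learner hill-climbs on. The class and method names are our own illustration, not a required interface.

```java
public class InfoGain {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Entropy of a node containing p positive and n negative examples.
    public static double entropy(int p, int n) {
        if (p == 0 || n == 0) return 0.0;
        double pp = (double) p / (p + n), pn = (double) n / (p + n);
        return -pp * log2(pp) - pn * log2(pn);
    }

    // Information gain of splitting a parent node (p, n) into children,
    // given as an array of {positives, negatives} pairs.
    public static double infoGain(int p, int n, int[][] children) {
        double remainder = 0.0;
        for (int[] c : children) {
            double weight = (double) (c[0] + c[1]) / (p + n);
            remainder += weight * entropy(c[0], c[1]);
        }
        return entropy(p, n) - remainder;
    }

    public static void main(String[] args) {
        // A 50/50 node has entropy 1 bit; a perfect split gains all of it.
        System.out.println(entropy(3, 3));                              // 1.0
        System.out.println(infoGain(3, 3, new int[][]{{3, 0}, {0, 3}})); // 1.0
    }
}
```

At each node, the learner evaluates `infoGain` for every candidate attribute (or threshold) and greedily picks the largest.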
One challenge stems from the fact that this data set has a plethora of real-valued attributes, which we only touched on in class. R&N explains them briefly on page 664, but here are some additional ideas. Since a real-valued attribute might spawn an infinite number of children when tested against specific values, we must convert it into a discrete attribute with a finite number of values. There are several options here. 1) Before doing any learning, you could assign each real value to a bucket. For example, you could arbitrarily divide a [0,100] percentage attribute into 10 regions: less than 10%, between 10% and 20%, etc. This is simple, but often doesn't work so well. In addition, you will probably need to counteract the tendency of the information-gain heuristic to prefer discrete attributes with many values. One way to do this is by using the "gain ratio" heuristic instead. See R&N page 663 and exercise 18.13. See also slides 32 and 33 on the web for the May 7 lecture (note we didn't discuss these slides in class, but they may help).
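If you go the bucketing route, the gain-ratio correction mentioned above can be sketched as follows: divide the raw information gain by the "split information," i.e., the entropy of the sizes of the subsets the test creates. This is only an illustrative sketch; consult R&N page 663 for the definition.

```java
public class GainRatio {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Split information: entropy of the partition sizes a test induces.
    public static double splitInfo(int[] sizes) {
        int total = 0;
        for (int s : sizes) total += s;
        double si = 0.0;
        for (int s : sizes) {
            if (s == 0) continue;
            double f = (double) s / total;
            si -= f * log2(f);
        }
        return si;
    }

    // Gain ratio penalizes many-valued attributes by dividing by split info.
    public static double gainRatio(double gain, int[] sizes) {
        double si = splitInfo(sizes);
        return si == 0.0 ? 0.0 : gain / si;
    }

    public static void main(String[] args) {
        // Same raw gain, but a 10-way split is penalized more than a 2-way one.
        System.out.println(gainRatio(0.5, new int[]{5, 5}));  // 0.5
        System.out.println(gainRatio(0.5,
                new int[]{1, 1, 1, 1, 1, 1, 1, 1, 1, 1}));   // ~0.15
    }
}
```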
2) Another method is to convert a real-valued attribute to a binary attribute, e.g., is the percentage greater than threshold T or not? (R&N discuss this at the top of p. 664 under the name "split point".) Now the question is how to determine the best split point, T. Instead of an arbitrary choice, one can use the data values to suggest the best threshold, i.e., the one with the biggest information gain. If you sort the instances on the real values of the attribute in question and then look for adjacent instances whose classifications differ, you may generate a set of thresholds intermediate in value between those values. It turns out that these values dominate all other possible split points, so we need only choose the best of these. For example, suppose we had the following six instances of e-mails (with one word-frequency attribute for simplicity):
Freq:   10%   20%   40%   50%   64%   80%
Spam?   No    No    No    Yes   Yes   No
Then it would clearly be crazy to test on splits other than a threshold of 45% or a threshold of 72%. (For 45% I've chosen the midpoint between 40% and 50%; 72% is likewise the midpoint between 64% and 80%. Testing on thresholds outside these two ranges will lead to a smaller expected information gain. Why the precise midpoint of the range? Other values in the range work as well, but experience shows that the midpoint is the best choice unless you have a strong reason to try something else.)
See Appendix B for a clarification.
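The candidate-threshold procedure above can be sketched as follows; it reproduces the 45% and 72% thresholds from the example. Note this simple version assumes all attribute values are distinct; Appendix B explains the adjustments needed for repeated values.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPoints {
    // Given attribute values sorted ascending and their class labels,
    // return the midpoints between adjacent instances whose labels differ.
    public static List<Double> candidates(double[] values, boolean[] spam) {
        List<Double> mids = new ArrayList<>();
        for (int i = 1; i < values.length; i++) {
            if (spam[i] != spam[i - 1]) {
                mids.add((values[i - 1] + values[i]) / 2.0);
            }
        }
        return mids;
    }

    public static void main(String[] args) {
        double[] freq = {10, 20, 40, 50, 64, 80};
        boolean[] spam = {false, false, false, true, true, false};
        System.out.println(candidates(freq, spam)); // [45.0, 72.0]
    }
}
```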
Goals and procedures: Study and empirically compare the accuracy of plain decision trees with your ensemble method. Define a method's accuracy as the percentage of test examples for which it makes the correct class prediction. To keep things simple, learn on the training set and test on the test set. Remember, no peeking.
Once you have written a standard decision tree learner, your next task is to create an ensemble of k decision trees. To start, I suggest setting k=13. You may use bagging, cross-validated committees, boosting, stacking, or another method.
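If you choose bagging, the two key pieces are bootstrap sampling (train each tree on n examples drawn with replacement from the n training examples) and majority voting at prediction time. The sketch below shows only those two pieces; the tree training itself is whatever your Problem 4 learner does. An odd k such as 13 conveniently avoids voting ties.

```java
import java.util.Arrays;
import java.util.Random;

public class Bagging {
    // Draw a bootstrap sample: n indices chosen with replacement from 0..n-1.
    public static int[] bootstrap(int n, Random rng) {
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) idx[i] = rng.nextInt(n);
        return idx;
    }

    // Majority vote over the k trees' predictions (1 = spam, 0 = non-spam).
    public static int vote(int[] predictions) {
        int ones = 0;
        for (int p : predictions) ones += p;
        return 2 * ones > predictions.length ? 1 : 0;
    }

    public static void main(String[] args) {
        Random rng = new Random(473);
        // Each of the k trees would be trained on its own bootstrap sample.
        System.out.println(Arrays.toString(bootstrap(5, rng)));
        // Five of seven trees say spam, so the ensemble says spam.
        System.out.println(vote(new int[]{1, 0, 1, 1, 0, 1, 1})); // 1
    }
}
```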
What to turn in:
Optional Extra Credit: If you wish, you can extend this assignment in several ways. Implement and compare more than one ensemble method. Or, as a separate experiment, compare several different values of k (the number of learners in your ensemble): does it matter? What's the best number for this problem? Or, as a separate experiment, compare different pruning methods in your DT learner and report on the effect. Or something else that interests you.
Problem 5: Naive Bayes [20 points]
Due date: June 4
Overview: This problem is an extension of Problem 4. You need to implement another (very simple) learning algorithm and compare it with your DT learner and your ensemble method; your report will be a refinement of what you have already produced.
The lecture slides for May 21 may be a good reference for naive Bayes, but the book's coverage is also quite good. You will need to process the training data to compute, for each attribute X, P(X|spam); you'll also need P(not-X|spam), which is just 1 minus the first number. When given a test example, all you need do is multiply together P(X_i|spam) for all 57 attributes X_i (and then multiply the whole thing by P(spam)) and compare this to the corresponding probability for not-spam. Note how this assumes independence. And don't forget to smooth or use a prior when you estimate the values of the parameters!
There are two tricks necessary to get this to work correctly.
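Here is one sketch of the core computation, with Laplace smoothing and log-space scoring. Summing logs instead of multiplying 57 small probabilities avoids floating-point underflow, which we would guess is one of the tricks in question (that is our assumption, not a statement of the assignment). The toy counts and binarized attributes below are purely illustrative.

```java
public class NaiveBayesSketch {
    // Laplace-smoothed estimate of P(X=1 | class): (count + 1) / (total + 2).
    public static double pGivenClass(int countOn, int classTotal) {
        return (countOn + 1.0) / (classTotal + 2.0);
    }

    // Score a test example by summing log-probabilities, starting from the
    // log prior; the class with the higher score wins.
    public static double logScore(boolean[] x, double[] pOn, double logPrior) {
        double s = logPrior;
        for (int i = 0; i < x.length; i++) {
            s += Math.log(x[i] ? pOn[i] : 1.0 - pOn[i]);
        }
        return s;
    }

    public static void main(String[] args) {
        // Toy model: 2 binarized attributes, with made-up training counts.
        double[] pOnSpam = {pGivenClass(8, 10), pGivenClass(1, 10)};
        double[] pOnHam  = {pGivenClass(1, 10), pGivenClass(7, 10)};
        boolean[] test = {true, false};
        double spamScore = logScore(test, pOnSpam, Math.log(0.39));
        double hamScore  = logScore(test, pOnHam,  Math.log(0.61));
        System.out.println(spamScore > hamScore ? "spam" : "not spam");
    }
}
```

For the real data you would first binarize each of the 57 attributes (e.g., frequency above some threshold or not), then estimate one smoothed parameter per attribute per class from the training set.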
Goals and procedures: Study and empirically compare the accuracy of the naive Bayes learner against plain decision trees and your ensemble method. The experiments should be conducted as in Problem 4. (You will only need to test the accuracy of your naive Bayes classifier, because you already have the results for the other two methods.) Extend your report to cover a similar discussion of your NB learner and your expanded experimental comparison.
Appendix A: instructions on turning in your code
As before, please submit both the source code and an executable .jar file. This time we will automatically test your programs on a secret set of test data (drawn from the same distribution as the data we gave you), so it is important that your programs conform to the input and output syntax specified here.
The turnin commands are as follows:
For Problem 4:
turnin -c cse473 -p trees your files
For Problem 5:
turnin -c cse473 -p bayes your files
Arguments: We should be able to execute your program with the following command syntax:
java -jar yourProgram.jar -t|-e|-b trainingFile testFile
The first argument is one of three switches:
Output format: It is OK for your program to generate arbitrary diagnostic messages, but at some point it must produce output in the following format:
====
1
0
1
1
0
1
...
====
In other words, the "official" output starts and ends with a line of four equal signs, and in between you output, one per line, your classification for each of the entries in the test file (1=spam, 0=non-spam). We will make the automatic testing program available to you before the assignment is due so that you can verify that your program uses the correct syntax.
Appendix B: Clarifications on Problem 4
The explanation of how to pick the best split point is ambiguous, because it assumes that every training instance has a unique value of each attribute. What should one do if several instances have the same numeric value? Indeed, what should you do if a spam and a non-spam message share the same value? Consider the following dataset, where this occurs. + means a single spam message and - means a single non-spam message.
Attribute X:  5%   10%   20%   22%   30%
Messages:     ++   +     +--   -     -
In other words, there are two spam messages with X=5%, one spam message with X=10%, and one spam message plus two regular messages with X=20%.
What split points should you consider? Threshold = 21% results in two groups; focus on the one for x < 21%. This set has [4+, 2-]. If, on the other hand, one split at 15%, the corresponding set would be [3+, 0-], which has much lower entropy! Now it's true that the entropy of the x >= 15% group will be higher than that of the x >= 21% group, but the reduction for x < 15% outweighs this, and the net information gain is higher for threshold = 15%.
So what does this mean in terms of the original problem set text? You need to consider the 10%/20% boundary as a place where adjacent classifications differ (even though it is also a place where adjacent classifications agree!). Similarly, your system needs to consider splitting at 21%, because there exists an instance at 20% with a different class than one at 22%.
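You can verify the appendix's claim numerically. The sketch below computes the information gain of both thresholds on the [4+, 4-] dataset above; the 15% split wins.

```java
public class SplitGainCheck {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    public static double entropy(int p, int n) {
        if (p == 0 || n == 0) return 0.0;
        double pp = (double) p / (p + n), pn = (double) n / (p + n);
        return -pp * log2(pp) - pn * log2(pn);
    }

    // Gain of splitting a [p+, n-] parent into a left group (pL, nL)
    // and the complementary right group.
    public static double gain(int p, int n, int pL, int nL) {
        int pR = p - pL, nR = n - nL, total = p + n;
        return entropy(p, n)
             - (pL + nL) / (double) total * entropy(pL, nL)
             - (pR + nR) / (double) total * entropy(pR, nR);
    }

    public static void main(String[] args) {
        double gain15 = gain(4, 4, 3, 0); // x < 15%: [3+, 0-]
        double gain21 = gain(4, 4, 4, 2); // x < 21%: [4+, 2-]
        System.out.printf("gain@15 = %.3f, gain@21 = %.3f%n", gain15, gain21);
        // gain@15 (about 0.549) beats gain@21 (about 0.311), as claimed.
    }
}
```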
Finally, remember that your split points should always be points in between observed data values for each attribute.