CSEP 546 - Data Mining - Autumn 2004 - Project 2:

Text Classification

Due Date: Tue, Nov 30, 2004 at 6:30 PM

The project is to be done individually.

The project is to implement text classification algorithms and apply them to a standard data set, the Reuters collection of news stories. The goal is to predict the topic of the each news story in the test set, given the topics of the stories in the training set. You will implement the following algorithms.

  1. Naive Bayes

  2. K-nearest neighbor

Naive Bayes should be implemented as described in class and in Section 6.10 of Mitchell. Be sure to implement Naive Bayes as a sum of logarithms of probabilities, rather than a product of probabilities, to avoid numeric underflow problems.

For k-nearest neighbor, you should try the following representations/distance measures.

  1. Hamming distance: between binary vectors, where each bit of a document's vector represents whether the corresponding word/token appears in the document.
  2. Euclidean distance: between numeric vectors, where each coordinate is the number of times the word/token occurs in the document.
  3. Cosine distance (often used in information retrieval): dot product of the vectors in (b) above, divided by the product of their norms.

You should try k = 1, k = 3 and k = 5 with each of the distance measures above.

Feel free to experiment with additional variations of the algorithms, and to investigate any other relevant design choices. Please describe these (along with their justification and evaluation) in your report.

Turn in the following:

  • Your code, and reasonable documentation for it (i.e., enough for us to understand how it works and how to use it). Please place this documentation in a file named README.

  • A report of at most 3 pages. Include a description of your architecture, approach, extra features, bugs and any problems you had. Report the test-set accuracy of each algorithm that you tried, and its standard deviation. Comment on the results. Which algorithm(s) did best? Why? When does each algorithm tend to do well versus poorly? How would you further improve your best text classifier? Any of the Word/PS/PDF/HTML/plain text files are ok for the report.

Turn-in procedure: Follow the link for turning in your project. Make a compressed archive of all your code and the report and submit this single archived file. The deadline for submission is before class on Tue, Nov 30.

Files:

Format:

The format of the test and training files is the same. Each file contains a set of examples delimited by two blank lines. Each example is composed of the following parts:

  1. topic classification.
  2. blank line
  3. story title
  4. blank line
  5. story location, date
  6. blank line
  7. story text

URLs:

  1. Training Data

  2. Test Data

  3. These are the topics used for classifying the texts:

  4. The original source for this data is the UCI Knowledge Discovery in Databases Archive

  5. The Reuters-21578 Text Categorization Collection was the collection used to generate the test and training data above. The SGML was converted to XML, cleaned up, and queried using XPath to generate a simple text file format. To access the collection, Click here.

We may ask you to do a demo of your system and/or an oral discussion.

Good luck, and have fun!