CSE143 Autumn 2004 Project #2 Part A

Eine Kleine Google - Reading Words 

Due: Wednesday, November 3, at 9:00 pm. No late assignments will be accepted.

In this assignment you will be reading text from files and searching it for strings. By part B, the program will take a query and provide the documents that match that query. For part A, we're just concerned with reading the file, parsing it into documents, separating out the words, and storing them in HashMaps.

Work with your assigned partner. The same rules as always apply: try to do most of the actual coding while sitting at the same computer, trading off the keyboard. The two of you should turn in a single set of files. After the final part of the project is done, each of you will individually write a final report.

Grading: When the project is complete, it will be evaluated both for how well the code works and how well it is written and tested. For this part of the project, we will try to give you quick feedback on a scale of 0-3: 3 = no major problems, 2 = something needs to be fixed, 1 = serious trouble, and 0 = no credible effort. For all phases of the project, be sure to include appropriate JavaDoc and other comments in your code, use meaningful names, indent sensibly, provide toString() methods where appropriate, and so forth.

Keep track of the major design decisions, algorithms, and tests you perform to verify your code is working. You will be asked to include a summary of these in the report that you will write individually at the end of the final part of this project.

Overview

When presented with several documents, it's normal to want to be able to pick out a few of them that are of interest. In the case of the Internet, the documents are web pages. You're all familiar with searching the 'net; you provide a query of some search terms, and the search engine provides documents that match those terms. For this project, rather than searching the 'net, you'll be searching part of the UW course catalog.

There are two major steps in running a search engine.

  1. Read in text files and record which documents have which words.
  2. Given a query from the user, find the documents that best match the query, sorted by how well they match.

We're going to do something odd on this project for the sake of convenience. While a "document" normally comes in its own file, in this case we're redefining "document" to mean one entry of the UW course catalog. We're providing a text file of part of the course catalog. Your code should treat adjacent non-blank lines as part of the same document. So a group of lines that are right next to each other forms one document; when you reach a blank line, the current document ends and the next non-blank line starts a new one.

The file is here: coursecatelog.txt. If you choose to make more files to search, be careful that they follow the same format of blank lines separating the documents. (Also beware of tricky situations such as blank lines at the beginning of the file.)

What we're doing for part A

For part A of the project, you and your partner should implement the part of the program that opens an input file, reads lines from it, separates out the individual "documents", and breaks the text into individual words. Parts of what you write for part A are likely to change when you use them in part B, but they will provide the basic infrastructure for the rest of the project.

To gain a bit of experience with HashMaps, you should also store each individual word in a HashMap, along with a count of how many times each word appears. There should be one of these frequency maps per document containing the frequencies of the words in that document. Be sure to include the words that appear in the first line of the course description (the course number and title) in the word counts.
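
For example, here is a minimal, hedged sketch of how one of these per-document frequency maps might be updated. The class and method names (WordCountSketch, countWord) are our own invention, not part of the assignment, and the code uses a raw HashMap with explicit Integer objects so that it matches the description above and works on older Java versions.

    import java.util.HashMap;

    // Hypothetical sketch: one frequency map per document.
    public class WordCountSketch {
        // maps a word (String) to the number of times it appears (Integer)
        private HashMap wordCounts = new HashMap();

        // Record one occurrence of the given word.
        public void countWord(String word) {
            Integer count = (Integer) wordCounts.get(word);
            if (count == null) {
                wordCounts.put(word, new Integer(1));   // first time we've seen this word
            } else {
                wordCounts.put(word, new Integer(count.intValue() + 1));
            }
        }

        public HashMap getWordCounts() {
            return wordCounts;
        }
    }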

For our purposes in this project, a word is any sequence of letter or digit characters, separated from other words by punctuation or whitespace (spaces, tabs, end-of-line characters, and so on).

Since a word has the same meaning regardless of the punctuation next to it or its capitalization, we will consider “example”, “Example”, “example,”, “example:”, “EXAMPLE!”, and “EXAMPLE!!!” to all be the same word. It's best to remove the punctuation and capitalization and store them all as "example" in the frequency table. This is not required in part A - you can store the words without converting to lowercase or stripping punctuation, but you can do this now if you want to get a bit ahead.

Hints: Look at classes Character and String for methods that can help you decide if characters are whitespace, punctuation, or letters or digits. You may also decide to use regular expressions, though these can be tricky.
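
To make that hint concrete, here is one possible (not required) way to break a line into normalized words using Character.isLetterOrDigit. The class and method names are hypothetical, and the lowercasing step is the optional part described above.

    import java.util.ArrayList;

    // Hypothetical sketch of one way to pull normalized words out of a line of text.
    public class WordSplitterSketch {

        // Returns the words in the given line, lowercased, with punctuation and
        // whitespace treated as separators. (Lowercasing is optional in part A.)
        public ArrayList splitIntoWords(String line) {
            ArrayList words = new ArrayList();
            StringBuffer current = new StringBuffer();
            for (int i = 0; i < line.length(); i++) {
                char ch = line.charAt(i);
                if (Character.isLetterOrDigit(ch)) {
                    current.append(Character.toLowerCase(ch));   // build up the current word
                } else if (current.length() > 0) {
                    words.add(current.toString());               // a separator ends the word
                    current = new StringBuffer();
                }
            }
            if (current.length() > 0) {
                words.add(current.toString());                   // word at the end of the line
            }
            return words;
        }
    }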

Implementation Requirements

The purpose of this project is to give you practice with various file and text processing techniques. Unlike the previous project, which was almost totally free-form, for this project there are some specific requirements for how you implement your program.

  • Use a JFileChooser dialog to select the input file.
  • Use a BufferedReader object to read lines from the file one at a time.
  • Use exception handlers (try-catch clauses) to detect problems opening the input file and reading lines from it, and print or display sensible error messages if something goes wrong. In particular, you may not use a “throws IOException” clause everywhere to avoid handling errors, and you must have separate error handling code to report problems opening the input file and problems reading the input.
  • Create a Document class. It should store the original text of the document and a HashMap of the word frequencies for that document (more on this below). Remember that we're using an unusual definition of "document"; see the overview above. A rough sketch of such a class appears after this list.
  • Put each separate course description from the file in its own Document object.
  • Separate the text of each document into individual words. You may do this by hand using methods from the Character and String classes, or you may use the StringTokenizer class from the standard library.
  • For each Document, store the words and the number of times that each word appears in the input in a HashMap using the word (a String) as a key and the count (an Integer object) as the associated value. (We recommend making the word lowercase and removing punctuation before storing it. This is not a requirement for part A, but it will be for part B.)
  • Include at least a few JUnit unit tests to verify that your code that breaks lines into individual words works properly. (Hint: you can use a StringReader object to turn a String in the test into a stream that your code can read instead of a file.)
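
Putting a few of these requirements together, here is a rough skeleton of what a Document class could look like. The field and method names are our own choices rather than a required interface, and the word-counting helper is left as a stub (it would use something like the word-splitting sketch shown earlier).

    import java.util.HashMap;

    // Hypothetical skeleton of a Document: stores the original text of one
    // course description plus a word-frequency map for that document.
    public class Document {
        private String text;           // the original text of this document
        private HashMap wordCounts;    // String word -> Integer count

        public Document(String text) {
            this.text = text;
            this.wordCounts = new HashMap();
            buildWordCounts();
        }

        // Split this.text into words and record a count for each one.
        // (Body omitted here; see the splitting and counting sketches above.)
        private void buildWordCounts() {
        }

        public String getText() {
            return text;
        }

        public HashMap getWordCounts() {
            return wordCounts;
        }

        public String toString() {
            return "Document with " + wordCounts.size() + " distinct words";
        }
    }

How much work the Document constructor does itself, versus what is handled by the code that reads the file, is a design decision for you and your partner.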

Implementation Hints

As usual on a software project, it's best to start small and gradually add to what you've got, eventually arriving at a program that does the entire job. That means looking at the entire problem and figuring out the smallest part that you can implement without having to do anything else, then which small part can be added next, and so on. At each stage of the implementation, it is important to test your code to be sure that the new part that you’ve added works properly and, furthermore, didn’t break any of the previous code.

While it’s up to you and your partner to figure out how to proceed, here is a suggested order that you might find useful. If you and your partner decide to do something different, you should think about the implications and be as sure as you can that your strategy makes sense.

We recommend that you start by implementing the code to open a file and initialize a BufferedReader object to read from it. Check that the code works correctly before you move on to the rest of the assignment.
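
As a starting point, here is a hedged sketch of how that opening-and-reading code might be organized, including the blank-line splitting described in the overview. The class name, method name, and error messages are only one reasonable choice, not a requirement, and the Document constructor refers to the skeleton shown earlier.

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import javax.swing.JFileChooser;

    // Hypothetical sketch of the reading side of part A: choose a file, open it,
    // and split its lines into "documents" separated by blank lines.
    public class CatalogReaderSketch {

        // Ask the user for a file and return the documents it contains,
        // or an empty list if anything goes wrong.
        public ArrayList readDocuments() {
            JFileChooser chooser = new JFileChooser();
            if (chooser.showOpenDialog(null) != JFileChooser.APPROVE_OPTION) {
                return new ArrayList();          // user cancelled
            }
            File file = chooser.getSelectedFile();

            BufferedReader reader;
            try {
                reader = new BufferedReader(new FileReader(file));
            } catch (IOException e) {
                System.out.println("Could not open " + file + ": " + e.getMessage());
                return new ArrayList();
            }

            ArrayList documents = new ArrayList();
            StringBuffer current = new StringBuffer();
            try {
                String line = reader.readLine();
                while (line != null) {
                    if (line.trim().length() == 0) {
                        // blank line: the current document (if any) is finished
                        if (current.length() > 0) {
                            documents.add(new Document(current.toString()));
                            current = new StringBuffer();
                        }
                    } else {
                        current.append(line).append("\n");
                    }
                    line = reader.readLine();
                }
                if (current.length() > 0) {
                    documents.add(new Document(current.toString()));   // last document
                }
            } catch (IOException e) {
                System.out.println("Error reading " + file + ": " + e.getMessage());
            } finally {
                try {
                    reader.close();
                } catch (IOException e) {
                    // ignore problems closing the file
                }
            }
            return documents;
        }
    }

Note that the code reports problems opening the file separately from problems reading it, as the requirements ask.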

As always, you need to think about how to divide your code into appropriate classes, each of which does one thing well (high cohesion) and is loosely coupled to other classes to the extent that this is possible.

Testing

Be sure to test your code thoroughly. In particular, be sure you can handle various kinds of input: extra blank lines in the middle or at the beginning or end of the file, leading or trailing whitespace (blanks, tabs, etc.) on a line, an arbitrary number of whitespace characters between words, etc. You and your partner should think through the possibilities and create a comprehensive (note that this does not necessarily mean numerous) set of tests. You should use JUnit to verify that the code that breaks words apart works properly, then use some plain text files (created with Notepad or a similar editor) to test that your program can open a file and read it.
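
For instance, a couple of JUnit tests (in JUnit 3 TestCase style) along the lines of the earlier sketches might look like the following. The WordSplitterSketch class and splitIntoWords method are our hypothetical names from the sketch above, not something the assignment requires; the second test simply shows how a StringReader can stand in for a file.

    import java.io.BufferedReader;
    import java.io.StringReader;
    import java.util.List;
    import junit.framework.TestCase;

    // Hypothetical JUnit test sketch for the word-splitting code.
    public class WordSplittingTest extends TestCase {

        public void testSimpleLine() {
            WordSplitterSketch splitter = new WordSplitterSketch();
            List words = splitter.splitIntoWords("CSE 143: Computer   Programming II!");
            assertEquals(5, words.size());
            assertEquals("cse", words.get(0));
            assertEquals("143", words.get(1));
            assertEquals("ii", words.get(4));
        }

        public void testStringReaderAsInput() throws Exception {
            // StringReader turns a String into something a BufferedReader can read,
            // so tests don't need real files on disk.
            BufferedReader reader = new BufferedReader(
                    new StringReader("first line\n\nsecond line\n"));
            assertEquals("first line", reader.readLine());
            assertEquals("", reader.readLine());      // the blank separator line
            assertEquals("second line", reader.readLine());
            assertNull(reader.readLine());            // end of input
        }
    }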

To the extent you can, you should also test your exception handling code to see if it properly reports errors. This may be a bit tricky to do, but you might be able to try things like deleting a file after selecting its name but before clicking OK in a JFileChooser window, trying to open a file using an invalid File object, deleting a very large file while the program is reading it, etc. See what you can come up with!

What to turn in

Use this online turnin form to turn in the files that make up your project and the test programs and test input files you used to verify that it works. Do not turn the project in under two different names.