CSE 143 Autumn 2004

UW Home

CSE Home

Announcements

Message Board

Contact Info

CSE 143

Home page

Syllabus

Classwork

Calendar/Lectures

Projects

Exams

Academic Conduct

Staying in Touch

Mailing Lists

Discussion Board

Announcement Archive

People

Instructors

TAs

Consultants

Consultant Schedule

Software & Computing

Programming Lab Info

Computing at Home

Java Libraries

Check Your Scores

MyUW

CSE143 Autumn 2004 Project #2 Part B

Eine Kleine Google - Searching for Documents

Due: Wednesday, November 10, at 9:00 pm. No late assignments will be accepted.

In this part of the project, you'll take a query from the user, find the documents that best match that query, and report them.

You should continue to work with the same partner you had for part A of this project, using the pair-programming style discussed in class. Remember that not only is it important to get the project done, but it is also important that you learn how to work effectively with someone else.

Grading: You and your partner will receive the same scores on the programming parts of the project. Programming projects will be evaluated both for how well the code works and how well it is written and tested, each on a scale of 0 to 4. Be sure to include appropriate JavaDoc and other comments in your code, use meaningful names, indent sensibly, and so forth.

Keep track of the major design decisions, algorithms, and tests you perform to verify your code is working. You will be asked to include a summary of these in the report that you will write individually at the end of the final part of this project.

Overview

The scenario for this project was described in part A. You've already gotten and analyzed the documents; now comes the part where you let the user search their contents.
Basic Requirement

To receive full credit for this project (assuming everything works and is well written, of course), you are only required to do the following:

Get a query (search terms) from the user. (See the section below about interacting with the user.) If necessary, break the query into its individual words (its "search terms"). Your program must be able to correctly handle queries with one or several terms.
Find the documents that match those search terms. The basic requirement is that you find all the documents that match all the terms. So if my query is "cat dog shoe" then your program should report every document that contains all of the words "cat", "dog" and "shoe".
The search terms must match occurences in the documents that differ in punctuation or capitalization. So if your query is "PUPPY", documents containing "puppy!!!", "Puppy?" and so on should match.
Give the documents that match a score which represents how well they match the query. For the basic part of the project, you should compute a score that is just the sum of the frequencies of the search terms in the document. (So, in the example above, the sum of the number of occurences of "cat", of "dog" and of "shoe".)
Report the documents that match and their scores. This report should show the first line of the document (which in the case of the course catalog is conveniently the same as the course title) along with the score. It'd be nice if this report were sorted by scores, but this is not required.
Let the user pick one of the hits to see in its entirety. The most basic way to do this is to associate a number or letter with each matching document, and display those when you display the titles and scores; then let the user enter one of those numbers. When the user does this, display the whole document that they chose, including the title line. You can do this in the interactions pane or console window, or if you want you can implement a Swing interface (see extra credit, below). If you'd like to do a GUI that lets the user click on the document they want, that's also fine. But the basic requirement can be done just with interaction on the command line (the interactions pane).
After presenting the chosen document, let the user choose another document, or do another search, or quit.

Implementation Hints
The simplest way to find documents which match the query terms is to go through the list of documents and look for the ones that contain all the search terms. That's fine for this project. However, if there is a large number of documents, this would be very inefficient. A much more efficient implementation would include a way to look up documents based on a single word. This could be done with a HashMap where the keys are words and the values are Sets of Documents.
To display the list of hits, you can just print them out in the interactions pane or console window (remember to take out the extraneous printlns that might get mixed in). When the user selects the one they want, print out its text.
To keep track of which documents are getting what scores for a given query, consider using a HashMap with Documents as keys and their scores as values, or some other data structure where you can store information about documents and their scores. Do not add a "score" instance variable to Document -- the score only makes sense in the context of the one particular query, which the Document shouldn't know about.
Most people probably won't need to use this, but in case you're interested, the Java regular expression package provides powerful tools for changing strings and finding patterns in them. But it's very tricky to get the hang of. If you have the basic requirements working, consider spiffing up the existing code with regular expressions if it helps, or using them for some of the extra credit.
Interacting with the User

To read user input, you can use a InputStreamReader to read from the DrJava interactions window (or whatever console window is available in your development environment):
Reader systemIn = new InputStreamReader(System.in); BufferedReader input = new BufferedReader(systemIn); String lineFromConsole = input.readLine();
Another, probably easier, way to read input is to use the static showInputDialog method of JOptionPane to pop up a dialog box with a message and a space for the user to enter a response. See the JavaDoc pages for JOptionPane for the details.
You can use the same method to get the user's choice of which document to display.
Extra Credit
There are plenty of things that can be done for extra credit.

Replace the console interaction with a GUI, getting user information from dialog boxes and/or letting them pick which document to display by clicking its description. If you do this, look at JTextComponent and its subclasses for Swing components that can be used to display blocks of text. There's an overview of this in the Text Components section of Sun's Java Swing Tutorial.
Implement the hash map that maps words to the set of documents that contain at least one occurrence of that word and use that to find relevant documents without having to search each document.
Use a more complex scoring algorithm. You might weight words more heavily if they appear in the first line of the document, or if they're less common across all the documents, or give more weight to words that came earlier in the query. And so on. (Note that the score need not be an integer if something else makes more sense.)
Disregard extremely common words like "a" "the" "of" "is", etc. in the search.
Display partial hits (that only hit on some of the search terms).
Highlight the search terms when displaying a document (possibly just by putting stars on either side of them).
Allow the user to specify words their documents must not match, for instance by putting a "-" flag in front of the term.
Allow the user to search for whole strings of words, by putting quote marks around parts of their queries. (This is harder than it sounds at first -- unless you store substrings of the documents, or use regular expressions to find the search terms in the (possibly capitalized and punctuated) text.
Be able to write the results of one or more queries to a text output file.
Come up with something else.

One requirement about extra credit: If you do implement one or more additional scoring algorithms, you must also implement the basic scoring algorithm described above (search for all the words in the query and add up the number of occurrences), and you must provide some way in the user interface for the user to select which scoring algorithm to use. This could be as simple as entering a number to select the algorithm from a list, or you could use radio buttons in a GUI or something similar.

If you implement additional scoring algorithms, you might do some experiments to see which algorithms give the best results. You'll want to discuss the different algorithms, describe the results, and discuss why you think you got the results you did in your final project report.

Testing

Be sure to try your program on some small examples so you can verify that it is actually analyzing the input correctly and producing appropriate output. For instance, run it on a very small section of the catalog, one where you can be sure that it returns the documents that really are the best match. You should use JUnit to run these sample tests.

Remember to test that your program gets all the documents that match all the search terms when you use the basic algorithm.
What to turn in

Use this online turnin form to turn in the files that make up your project, including all the tests from both parts of the project (A and B). You should also turn in at least two different text files containing sample output generated by your program for different pairs of starter words (these should be relatively short, and you can cut and paste text or screen snapshots into a text editor to do this - you don't need to include the ability to write files in your program). Do not turn the project in under two different names.

Department of Computer Science & Engineering
University of Washington
[comments to cse143-webmaster]

CSE143 Autumn 2004 Project #2 Part B

Eine Kleine Google - Searching for Documents

Due: Wednesday, November 10, at 9:00 pm. No late assignments will be accepted.

Overview

Basic Requirement

Implementation Hints

Interacting with the User

Extra Credit

Testing

What to turn in