FAQ and Tips for Assignment #1

FAQ

How do I get the indexer class to recognize everything it needs to import from within Eclipse?
Is there any way to run the cse454 file locally within (Windows) Eclipse or do I have to drag my StudentIndexer.class file onto a Linux server and run it from command line? If that's the way I have to do it, how do I set the classpath to look for it without screwing up other classpaths?
Exception in thread "main" java.lang.NoClassDefFoundError: StudentIndexer (wrong name: edu/washington/cse454/StudentIndexer)
It's clear from the assignment that we shouldn't index stopwords but should they be taken into account for term positions? For example, should a snippet like "hair of the cat" count the position(hair) = position(cat) - 1 or position(hair) = position(cat) - 3?
When I try to use cse454 -----, I always get an error message like cse454: Command not found.
What's the purpose of saveIndex() / loadIndex()?
Some documents don't have title or text
How do I instantiate objects that are supposed to be inner classes?
How do I submit my project?

Q: How do I get the indexer class to recognize everything it needs to import from within Eclipse?
A: See this email message
Q: Is there any way to run the cse454 file locally within (Windows) Eclipse or do I have to drag my StudentIndexer.class file onto a Linux server and run it from command line? If that's the way I have to do it, how do I set the classpath to look for it without screwing up other classpaths?
A: See this email message

Q: Exception in thread "main" java.lang.NoClassDefFoundError: StudentIndexer (wrong name: edu/washington/cse454/StudentIndexer)

I try to run cse454:
(CSE454_CLASSPATH points to the folder where I have StudentIndexer.class)

cse454 index -indexclass StudentIndexer

And it gives me:

Setting student classpath .. .. ..
Creating StudentIndexer .. .. ..
Exception in thread "main" java.lang.NoClassDefFoundError: StudentIndexer (wrong name: edu/washington/cse454/StudentIndexer)
    at java.lang.ClassLoader.. .. .. ..
    .. .. ..

What could be wrong?

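A: The "wrong name" part of the message means the class is declared in the package edu.washington.cse454, so you need to pass the fully qualified class name. Java normally also expects the classpath to point at the directory that contains the top-level edu folder, not at the folder that holds StudentIndexer.class itself: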

cse454 index -indexclass edu.washington.cse454.StudentIndexer

Q: It's clear from the assignment that we shouldn't index stopwords but should they be taken into account for term positions? For example, should a snippet like "hair of the cat" count the position(hair) = position(cat) - 1 or position(hair) = position(cat) - 3?
A: Yes, you have to take stop words into account for term positions. So in the document "hair of the cat," position(hair) = position(cat) - 3.
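To make the position rule above concrete, here is a minimal sketch. The stop list, the whitespace tokenizer, and the choice to start counting at 0 are all assumptions made for illustration; use whatever the assignment specifies. The only point is that the position counter advances for every token, including the stop words you skip.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PositionSketch {
    // Hypothetical stop list, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("of", "the"));

    // Maps each indexed (non-stop) term to the positions where it occurs.
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        int position = 0; // assumption: positions start at 0
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // leading whitespace produces one empty token
            }
            if (!STOP_WORDS.contains(token)) {
                result.computeIfAbsent(token, k -> new ArrayList<>()).add(position);
            }
            position++; // advance for every token, stop word or not
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hair=[0], cat=[3]}
        System.out.println(positions("hair of the cat"));
    }
}
```

Here position(hair) = position(cat) - 3, matching the answer above.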
Q: When I try to use cse454 -----, I always get an error message like cse454: Command not found.
A: You have to put in the full path to the program each time. So "/projects/instr/cse454-05au/assignment1/bin/cse454 index ..." instead of "cse454 index ..."
Q: What's the purpose of saveIndex() / loadIndex()?
A: The cse454 command calls saveIndex() after indexPages() finishes. saveIndex() should return the path to your index on disk. You will submit that path along with your write-up and your entire index. I will then instantiate your indexer object and call loadIndex() on the path returned by your saveIndex() in order to compare your index with mine.
Q: Some documents don't have title or text
A: Apparently some page contents are missing in the crawl but are returned as documents anyway. I'm not sure why that's the case, but it's just a small percentage of them. You can treat them as zero-length pages.
Q: How do I instantiate objects that are supposed to be inner classes?
A: Some classes, such as IIndexer.FileList.DocOccurrence, can be a pain to instantiate (sorry). The easiest thing to do is to have a FileList object around so you can do something like
filelist.new DocOccurrence()

(assuming filelist is a FileList object that you've instantiated earlier).
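Here is a small self-contained sketch of that syntax. The FileList and DocOccurrence classes below are hypothetical stand-ins written only for this example; the real IIndexer.FileList.DocOccurrence has its own members and constructors, so check the provided interface rather than relying on the field shown here.

```java
class InnerClassSketch {
    // Hypothetical stand-in for IIndexer.FileList (not the course-provided class).
    static class FileList {
        // A non-static inner class: every instance belongs to one FileList.
        class DocOccurrence {
            int docId; // made-up field for illustration
        }
    }

    static FileList.DocOccurrence make(int docId) {
        FileList filelist = new FileList();
        // "outerInstance.new Inner()" ties the new object to filelist.
        FileList.DocOccurrence occurrence = filelist.new DocOccurrence();
        occurrence.docId = docId;
        return occurrence;
    }
}
```

Writing new FileList.DocOccurrence() without an enclosing FileList instance does not compile from a static context, which is why the filelist object has to exist first.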
Q: How do I submit my project?
A: See this email message

Tips

Practical concerns when developing your code
Discussion of disk usage while indexing

Practical concerns when developing your code

To use the medium-sized crawl, add "-size med" to the cse454 command invocation. I'd recommend against using it while you're developing, as it will be much slower and take up significantly more disk space. Once your indexer is further along, try indexing the medium-sized crawl to stress test your program. You definitely want to do this to see if you run out of memory.

Find a Linux machine that doesn't have a lot of users on it, if possible. Try finding a machine in one of the labs and remember its hostname so you can ssh to it in order to run your indexer.

Use the 'df' command to see how much free space is available on each filesystem. Directories with names like /local1 or /tmp or /var/tmp are the best places to save the files your indexer generates. If you try saving to your own home directory, you'll quickly hit your quota.

Discussion of disk usage while indexing

As the assignment says, the lexicon should be kept in main memory. But the occurrence index gets too big to keep entirely in memory, so we have to write out parts of it to disk at times. We read those parts back in when we have to make changes to them. So how should you divide up your occurrence index? This is one of the design decisions you must make, and it will have a big impact on how long it takes your indexer to index a crawl. If you choose a bad way of keeping track of term occurrences, you'll end up reading and writing too often.

One possible intermediate structure is a trie (see the indexing lecture). A trie makes it easy to enumerate terms in alphabetical order, which is a huge benefit if you want to create a sorted index. With a sorted index, you never have to read old inverted file lists back in, because you never need to add occurrences for a term you have already passed alphabetically. The problem is that the trie represents your entire index, so it won't fit in memory either. You'll have to consider ways of splitting it into chunks.
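As a rough illustration of why a trie gives you alphabetical order for free, here is a minimal in-memory sketch. It is not the structure from the lecture and stores only terms, not occurrence lists; the class and method names are made up for this example. Keeping each node's children in a sorted map means a plain depth-first walk visits terms in lexicographic order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TrieSketch {
    private static final class Node {
        // TreeMap keeps children ordered by character.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isTerm; // true if the path from the root spells a term
    }

    private final Node root = new Node();

    void insert(String term) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isTerm = true;
    }

    // Returns every inserted term in alphabetical order.
    List<String> sortedTerms() {
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    private static void collect(Node node, StringBuilder prefix, List<String> out) {
        if (node.isTerm) {
            out.add(prefix.toString()); // a term comes before its extensions
        }
        for (Map.Entry<Character, Node> entry : node.children.entrySet()) {
            char c = entry.getKey();
            prefix.append(c);
            collect(entry.getValue(), prefix, out);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }
}
```

Inserting "hair", "cat", "cats", and "dog" in that order and calling sortedTerms() yields cat, cats, dog, hair. Splitting the trie into chunks that fit in memory is still up to you.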

Here's a naive approach that you could (but shouldn't) use: put every unique term's inverted file list in a separate file. Why is this bad? You'll end up with a huge number of files, which the operating system will not handle very well. You could put your files in a trie structure, so that all files placed in the same directory start with the same prefix, but this won't decrease the number of files and will increase the number of directories, so you'll be even less efficient! Clearly, you have to find some middle ground between one file per term and one file for all terms...
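One possible middle ground (an illustration, not the required design) is to map many terms onto a fixed number of files. The sketch below hashes each term into one of NUM_BUCKETS files; the bucket count and the file naming scheme are assumptions invented for this example.

```java
class BucketSketch {
    // Hypothetical bucket count; tune it to your memory and disk limits.
    static final int NUM_BUCKETS = 256;

    // Always returns a value in [0, NUM_BUCKETS), even for negative hash codes.
    static int bucketFor(String term) {
        return Math.floorMod(term.hashCode(), NUM_BUCKETS);
    }

    // Made-up file naming scheme, e.g. "bucket-042.idx".
    static String fileNameFor(String term) {
        return String.format("bucket-%03d.idx", bucketFor(term));
    }
}
```

The number of files is now bounded no matter how many unique terms the crawl contains. The trade-off is that hashing scatters terms, so the files are not in alphabetical order; bucketing by alphabetical range instead would keep the sorted-index property discussed above.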

One last idea is to expand the number of steps and intermediate files you use. You can't sort all the terms and their document occurrences all at once, but you can sort a small portion at a time...
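The "sort a small portion at a time" idea is essentially an external merge sort. The sketch below shows the two phases in memory: cut the postings into runs small enough to sort, then merge the sorted runs with a priority queue. In a real indexer each run would be an intermediate file on disk, and the records here are hypothetical "term:docId" strings compared as plain text, which is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

class RunMergeSketch {
    // Phase 1: split into runs of at most runSize (must be > 0) and sort each.
    static List<List<String>> sortedRuns(List<String> postings, int runSize) {
        List<List<String>> runs = new ArrayList<>();
        for (int i = 0; i < postings.size(); i += runSize) {
            int end = Math.min(i + runSize, postings.size());
            List<String> run = new ArrayList<>(postings.subList(i, end));
            Collections.sort(run);
            runs.add(run);
        }
        return runs;
    }

    // The smallest unread record of one run, plus the rest of that run.
    private static final class Head implements Comparable<Head> {
        final String value;
        final Iterator<String> rest;

        Head(String value, Iterator<String> rest) {
            this.value = value;
            this.rest = rest;
        }

        @Override
        public int compareTo(Head other) {
            return value.compareTo(other.value);
        }
    }

    // Phase 2: k-way merge; only one record per run is held at a time.
    static List<String> merge(List<List<String>> runs) {
        PriorityQueue<Head> heap = new PriorityQueue<>();
        for (List<String> run : runs) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();
            merged.add(smallest.value);
            if (smallest.rest.hasNext()) {
                heap.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
        return merged;
    }
}
```

Calling merge(sortedRuns(postings, 2)) on the records hair:1, cat:1, dog:2, cat:2, ant:3 produces ant:3, cat:1, cat:2, dog:2, hair:1. A real implementation would compare the term first and the document ID numerically, and would stream the merged output to the final index file rather than collecting it in a list.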