CSE 490i - Project Part 3: Crawling the web

Due: February 23, 2000; 5pm.

Please: Be sure to read the whole assignment from start to finish before you start working; there are numerous dependencies where the wrong design choice early on could hurt you later.

Objective: At the highest level, you should seek to build the coolest music search site on the web, one with a great UI and comprehensive music information. See Gigabeat for one possible role model. Specifically, you should take your site from part 2 and improve it in two ways: populate your database with more MP3 files by crawling the web and extracting information from various web sites, and flesh out your site's user interface to provide a complete web site experience.

Groups & Collaboration: As before, we encourage groups to collaborate, provided that (1) each group does the reading and spends some time thinking about each problem before discussing it with others, and (2) each group writes up each problem itself. Directly copying another group's work (architectures, diagrams, or code) is cheating, a serious offense.

Safety

Now that you have a database containing MP3 files, and a web interface for searching and displaying the contents of this database, the next step is to round out your collection by finding more MP3 files on the web to add to your database.

Of course, adding entries to your database by hand would be a tedious and time-consuming task. In this part of the project, you're going to automate it as far as you can by constructing a web crawler that automatically traverses the web, extracts information about MP3 files, and adds that information to your database.
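To make the overall structure concrete, here is a minimal sketch of a crawl loop in Python (the language is just for illustration; use whatever your group is building with). It keeps a frontier of URLs to visit, downloads each page, records any links ending in .mp3, and queues the rest for later. The function names are our own, and the sketch deliberately omits the politeness rules discussed below, which you must add before letting it loose.

    import re
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin

    HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

    def fetch(url):
        """Download a page and return its text, or None on failure."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except Exception:
            return None

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)    # URLs waiting to be visited
        visited = set()                # URLs already fetched (avoid repeats)
        mp3_urls = []                  # .mp3 links found so far

        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)

            html = fetch(url)
            if html is None:
                continue

            for href in HREF_RE.findall(html):
                link = urljoin(url, href)          # resolve relative links
                if link.lower().endswith(".mp3"):
                    mp3_urls.append(link)          # a real crawler would insert
                                                   # this into the database instead
                elif link not in visited:
                    frontier.append(link)

        return mp3_urls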

The first step in building a web crawler is to learn the proper etiquette. Virtually all content providers frown on misuse of their resources by poorly written crawlers that try to download big chunks of a site in the space of a few seconds. A rapid stream of hits from a single source may even look like a coordinated attack! So the foremost principle to keep in mind in this part of the project is being a well-behaved net citizen: space out your requests to any given site and don't request the same content over and over again. We recommend never hitting a given site more than four times a minute (and even this may be too high).
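One simple way to enforce such a spacing rule is to remember when you last contacted each host and sleep until enough time has passed. The sketch below (again in Python, purely illustrative) uses a 15-second minimum interval per host, which corresponds to the four-hits-per-minute ceiling mentioned above; the class and parameter names are our own, not part of any standard.

    import time
    from urllib.parse import urlparse

    class HostThrottle:
        """Enforce a minimum delay between requests to the same host."""

        def __init__(self, min_interval=15.0):    # 15 s => at most 4 hits per minute
            self.min_interval = min_interval
            self.last_request = {}                # host -> time of most recent request

        def wait(self, url):
            host = urlparse(url).netloc
            now = time.time()
            earliest = self.last_request.get(host, 0.0) + self.min_interval
            if now < earliest:
                time.sleep(earliest - now)        # pause until the host has "cooled off"
            self.last_request[host] = time.time()

    # Usage: call throttle.wait(url) immediately before every HTTP request.
    throttle = HostThrottle()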

In addition, content providers may wish to keep automated spiders out of certain parts of their sites. To this end, the robots.txt protocol was created, which lets site owners specify which parts of a site are accessible to which web robots. A good place to start reading about this is the Web Robots Pages, which link to the robots.txt standard, the list of active robots, and a FAQ. You can also look at our own CS department's robots.txt file as an example.
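As one way to honor robots.txt automatically, Python's standard library includes a parser for the format (urllib.robotparser); a minimal check might look like the sketch below. The user-agent string is a placeholder for whatever name you give your crawler, and if a site's robots.txt cannot be retrieved at all, this code errs on the side of not fetching.

    import urllib.robotparser
    from urllib.parse import urlparse

    USER_AGENT = "cse490i-crawler"         # placeholder: use your own crawler's name

    _robot_parsers = {}                    # cache one parser per host

    def allowed_by_robots(url):
        """Return True if the site's robots.txt lets USER_AGENT fetch this URL."""
        parts = urlparse(url)
        rp = _robot_parsers.get(parts.netloc)
        if rp is None:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
            try:
                rp.read()                  # download and parse robots.txt once per host
            except OSError:
                pass                       # unreadable robots.txt: the unparsed parser
                                           # conservatively refuses everything
            _robot_parsers[parts.netloc] = rp
        return rp.can_fetch(USER_AGENT, url)

Combined with the throttling sketch above, this amounts to the API layer recommended next: route every request through one function that first checks allowed_by_robots() and then waits out the per-host delay.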

We recommend writing an API layer above HTTP that ensures you never violate the connection-frequency or robots.txt rules. Failure to observe these principles could have serious consequences: