CSE 490i - Project Part 3: Crawling the web

Due: February 23, 2000; 5pm.

Please: Be sure to read the whole assignment from start to finish before you start working; there are numerous dependencies where the wrong design choice early on could hurt you later.

Objective: At the highest level, you should seek to build the coolest music search site on the web, one with a great UI and comprehensive music information. See Gigabeat for one possible role model. Specifically, you should take your site from part 2 and improve it in two ways: populate your database with more MP3 files by crawling the web and extracting information from various web sites, and flesh out your site's user interface to provide a complete web site experience.

Groups & Collaboration: As before, we encourage groups to collaborate, provided that (1) each group does the reading and spends some time thinking about each problem before discussing it with others, and (2) each group writes up each problem itself. Directly copying another group's work (architectures, diagrams, or code) is cheating, a serious offense.

Safety

Now that you have a database containing MP3 files, and a web interface for searching and displaying the contents of this database, the next step is to round out your collection by finding more MP3 files on the web to add to your database.

Of course, adding entries to your database by hand would be a tedious and time-consuming task. In this part of the project, you're going to automate it as far as you can by constructing a web crawler that automatically traverses the web, extracts information about MP3 files, and adds that information to your database.
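To make the overall structure concrete, here is a minimal sketch of a crawl loop in Python (the language is just for illustration; use whatever your group is building with). It keeps a frontier of URLs to visit, downloads each page, records any links ending in .mp3, and queues the rest for later. The function names are our own, and the sketch deliberately omits the politeness rules discussed below, which you must add before letting it loose.

    import re
    import urllib.request
    from collections import deque
    from urllib.parse import urljoin

    HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

    def fetch(url):
        """Download a page and return its text, or None on failure."""
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except Exception:
            return None

    def crawl(seed_urls, max_pages=100):
        frontier = deque(seed_urls)    # URLs waiting to be visited
        visited = set()                # URLs already fetched (avoid repeats)
        mp3_urls = []                  # .mp3 links found so far

        while frontier and len(visited) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)

            html = fetch(url)
            if html is None:
                continue

            for href in HREF_RE.findall(html):
                link = urljoin(url, href)          # resolve relative links
                if link.lower().endswith(".mp3"):
                    mp3_urls.append(link)          # a real crawler would insert
                                                   # this into the database instead
                elif link not in visited:
                    frontier.append(link)

        return mp3_urls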

The first step in building a web crawler is to learn the proper etiquette. Virtually all content providers frown on misuse of their resources by poorly written crawlers that try to download big chunks of a site in the space of a few seconds. A rapid stream of hits from a single source may even look like a coordinated attack! So the foremost principle to keep in mind in this part of the project is being a well-behaved net citizen: space out your requests to any given site and don't request the same content over and over again. We recommend never hitting a given site more than four times a minute (and even this may be too high).
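One simple way to enforce such a spacing rule is to remember when you last contacted each host and sleep until enough time has passed. The sketch below (again in Python, purely illustrative) uses a 15-second minimum interval per host, which corresponds to the four-hits-per-minute ceiling mentioned above; the class and parameter names are our own, not part of any standard.

    import time
    from urllib.parse import urlparse

    class HostThrottle:
        """Enforce a minimum delay between requests to the same host."""

        def __init__(self, min_interval=15.0):    # 15 s => at most 4 hits per minute
            self.min_interval = min_interval
            self.last_request = {}                # host -> time of most recent request

        def wait(self, url):
            host = urlparse(url).netloc
            now = time.time()
            earliest = self.last_request.get(host, 0.0) + self.min_interval
            if now < earliest:
                time.sleep(earliest - now)        # pause until the host has "cooled off"
            self.last_request[host] = time.time()

    # Usage: call throttle.wait(url) immediately before every HTTP request.
    throttle = HostThrottle()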

In addition, content providers may wish to keep automated spiders out of certain parts of their sites. To this end, the robots.txt protocol was created, which lets site owners specify which parts of a site are accessible to which web robots. A good place to start reading about this is the Web Robots Pages, which link to the robots.txt standard, the list of active robots, and a FAQ. You can also look at our own CS department's robots.txt file as an example.
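As one way to honor robots.txt automatically, Python's standard library includes a parser for the format (urllib.robotparser); a minimal check might look like the sketch below. The user-agent string is a placeholder for whatever name you give your crawler, and if a site's robots.txt cannot be retrieved at all, this code errs on the side of not fetching.

    import urllib.robotparser
    from urllib.parse import urlparse

    USER_AGENT = "cse490i-crawler"         # placeholder: use your own crawler's name

    _robot_parsers = {}                    # cache one parser per host

    def allowed_by_robots(url):
        """Return True if the site's robots.txt lets USER_AGENT fetch this URL."""
        parts = urlparse(url)
        rp = _robot_parsers.get(parts.netloc)
        if rp is None:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
            try:
                rp.read()                  # download and parse robots.txt once per host
            except OSError:
                pass                       # unreadable robots.txt: the unparsed parser
                                           # conservatively refuses everything
            _robot_parsers[parts.netloc] = rp
        return rp.can_fetch(USER_AGENT, url)

Combined with the throttling sketch above, this amounts to the API layer recommended next: route every request through one function that first checks allowed_by_robots() and then waits out the per-host delay.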

We recommend writing an API layer above HTTP that ensures you never violate the connection-frequency or robots.txt rules. Failure to observe these principles could have serious consequences: