Due: February 23, 2000; 5pm.
Please: Be sure to read the whole assignment from start to finish before you start working; there are numerous dependencies where the wrong design choice early-on could hurt you later.
Objective: At the highest level, you should seek to build the coolest music search site on the web, one with a great UI and comprehensive music information. See Gigabeat for one possible role model. Specifically, you should take your site from part 2 and improve it in two ways: populate your database with more MP3 files by crawling the web and extracting information from various web sites, and flesh out your site's user interface to provide a complete web site experience.
Groups & Collaboration: As before, we encourage groups to collaborate, provided that (1) each group does the reading and spends some time thinking about each problem before discussing it with others, and (2) each group writes up each problem itself. Directly copying another group's work (architectures, diagrams, or code) is cheating, a serious offense.
Of course, adding entries to your database by hand would be a tedious and time-consuming task. In this part of the project, you're going to automate this task as well as you can, by constructing a web crawler that automatically goes and traverses the web, extracts information about MP3 files, and adds that information to your database.
The first step in building a web crawler is to learn the proper etiquette. Virtually all content providers frown on misuse of their resources by poorly written crawlers that try to download big chunks of a site in the space of a few seconds. Worse, a large burst of hits from a single source can even be mistaken for a coordinated attack! So the foremost principle you should be thinking about in this part of the project is how to be a well-behaved net citizen: space out your requests to any given site, and don't request the same content over and over again. We recommend never hitting a given site more than four times a minute (and even that may be too high).
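One way to enforce the spacing rule is a small per-host gatekeeper that your fetching code consults before every request. The sketch below is a minimal illustration (the class and method names are made up, not part of any standard library); it allows at most one request per host every 15 seconds, i.e. four hits per minute.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a per-host politeness check: allow at most one
// request per host every MIN_INTERVAL_MS milliseconds (15s = 4 hits/minute).
public class PolitenessGate {
    public static final long MIN_INTERVAL_MS = 15_000;
    private final Map<String, Long> lastRequest = new HashMap<>();

    // Returns true (and records the hit) only if enough time has passed
    // since the last request to this host; otherwise the caller should wait.
    public synchronized boolean tryAcquire(String host, long nowMs) {
        Long last = lastRequest.get(host);
        if (last != null && nowMs - last < MIN_INTERVAL_MS) {
            return false;
        }
        lastRequest.put(host, nowMs);
        return true;
    }

    public static void main(String[] args) {
        PolitenessGate gate = new PolitenessGate();
        System.out.println(gate.tryAcquire("example.com", 0));      // true
        System.out.println(gate.tryAcquire("example.com", 5_000));  // false: too soon
        System.out.println(gate.tryAcquire("other.org", 5_000));    // true: different host
        System.out.println(gate.tryAcquire("example.com", 20_000)); // true: 20s elapsed
    }
}
```

Passing the clock in as a parameter (rather than calling System.currentTimeMillis() inside) makes the policy easy to test without real waiting.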
In addition, content providers might wish to limit access to certain parts of their site to automated spiders. To this end, the robots.txt protocol was created, so that site owners can specify which parts of the site are accessible to which web robots. A good place to start reading about this is the Web Robots Pages. There are links to the robots.txt standard, the list of active robots, and a FAQ. For example, you can also view our own CS department's robots.txt file.
We recommend writing an API layer above HTTP which ensures that you never violate the frequency-of-connection or robots.txt principles. Failure to observe these principles could have serious consequences.
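As a starting point for such a layer, here is a deliberately minimal robots.txt check. It handles only the "User-agent: *" record and treats each Disallow line as a path prefix; the full standard (named agents, record boundaries, and so on) needs more care, so treat this as a sketch rather than a complete implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal robots.txt rules: honor only the "User-agent: *" record and
// treat each Disallow value as a path prefix.  A real crawler should
// implement the full robots exclusion standard.
public class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    public RobotsRules(String robotsTxt) {
        boolean inStarRecord = false;
        for (String raw : robotsTxt.split("\n")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inStarRecord = value.equals("*");
            } else if (field.equals("disallow") && inStarRecord && !value.isEmpty()) {
                disallowed.add(value);
            }
        }
    }

    // A path is fetchable unless it starts with some disallowed prefix.
    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        RobotsRules rules = new RobotsRules(
            "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /private\n");
        System.out.println(rules.isAllowed("/music/index.html")); // true
        System.out.println(rules.isAllowed("/cgi-bin/search"));   // false
    }
}
```

Your HTTP layer could fetch and cache one RobotsRules object per host, consulting it before every request to that host.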
In the previous part of the project, you should have learned how to parse an XML document and extract information from it. While it would be great to be able to reuse your code, unfortunately the dominant language on the web is still HTML, so you will have to figure out how to parse HTML.
Various HTML parsers exist for practically every language. Java seems to have one built-in: see the class javax.swing.text.html.parser.DocumentParser. This interface allows you to register callbacks that are called when a tag is opened or closed; you can then build up a tree-node representation of the document (similar to the XML DOM), or process only certain tags, as necessary. Of course, you're welcome to use any HTML parser you wish.
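To make the callback style concrete, here is a small sketch using HTMLEditorKit.ParserCallback and the ParserDelegator that drives it (both in the JDK, built on the DocumentParser class mentioned above) to collect the target of every <a href=...> link on a page. The sample page in main is made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Sketch: use the JDK's callback-style HTML parser to collect the
// href target of every <a> tag on a page.
public class LinkExtractor {
    public static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) links.add(href.toString());
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // A StringReader won't actually fail; real network input might.
        }
        return links;
    }

    public static void main(String[] args) {
        String page = "<html><body><a href=\"song.mp3\">a song</a>"
                    + "<a href=\"more.html\">more</a></body></html>";
        System.out.println(extractLinks(page));
    }
}
```

The same callback could also record the text near each link (via handleText), which becomes useful for the information-extraction heuristics discussed later in this handout.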
After parsing the HTML document, you should be able to extract links to more HTML web pages from the document, so from this point you can choose which link to follow next (do you want to do breadth-first or depth-first search? how will you prioritize which links to follow first?). Alternatively, you could use an existing search engine like AltaVista or HotBot to find pages that are likely to contain MP3 files and not do any real "crawling" yourself (i.e., only search one ply deep from the sites suggested by the search engine), or do a hybrid where you search two or three plies deep from the search engine's results.
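The breadth-first-versus-depth-first choice, and the "only go a few plies deep" hybrid, both come down to how you manage the crawl frontier. The sketch below runs a depth-limited breadth-first crawl over a toy in-memory "web" (a made-up map from each URL to the links found on that page); swapping addLast for addFirst would give depth-first order instead.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of a depth-limited breadth-first crawl over a toy in-memory "web":
// each URL maps to the links found on that page.  The link graph is made up.
public class Frontier {
    public static List<String> crawl(Map<String, List<String>> web,
                                     String seed, int maxDepth) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String[]> queue = new ArrayDeque<>(); // each entry: {url, depth}
        queue.addLast(new String[] {seed, "0"});
        seen.add(seed);
        while (!queue.isEmpty()) {
            String[] entry = queue.pollFirst();
            String url = entry[0];
            int depth = Integer.parseInt(entry[1]);
            visited.add(url);
            if (depth >= maxDepth) continue; // don't follow links past the ply limit
            for (String link : web.getOrDefault(url, List.of())) {
                if (seen.add(link)) { // skip URLs we've already queued
                    queue.addLast(new String[] {link, String.valueOf(depth + 1)});
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = Map.of(
            "seed", List.of("a", "b"),
            "a", List.of("c"),
            "c", List.of("d"));
        // With maxDepth = 2 we reach c (two plies out) but never d.
        System.out.println(crawl(web, "seed", 2)); // [seed, a, b, c]
    }
}
```

For the hybrid approach, the seed would be each page suggested by the search engine, with maxDepth set to 2 or 3. The seen set is also what keeps you from requesting the same content over and over, which matters for the etiquette rules above.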
In addition to providing links, each web page you parse might have information you wish to add to your database; in this case, you should be able to extract that information and use it to update your database as well. There are several different approaches to follow here, depending on what kind of site you want to build.
One kind of music site is typified by mp3.com. It lists plenty of information about each artist, such as a history of the artist, pictures, links to CDs for sale, etc. You can browse the collection of artists and tracks by genre or by artist, find other similar songs, see the list of top requested songs, and generally wander around the site browsing all of its rich content. On the other hand, if you're searching for one particular popular song, you're unlikely to find it on mp3.com.
The other kind of site is exemplified by gigabeat.com. Instead of having a broad browsing-friendly interface, gigabeat gives you a single search box. Type in the name of your favorite artist or song, and you'll probably find a link to the (undoubtedly pirated) mp3 file. Gigabeat trades off information quality for completeness. They probably have an automated crawler that looks for mp3 files on the web and extracts artist and song title information---just enough data for people to search on.
Your own creation can draw on any of these features, or invent new ones. If you were looking for music, what kind of site would you like to visit? What kind of information would you want to see? We're giving you the opportunity to build exactly the kind of music resource you wish existed on the web. So based on the kinds of features you want to support on your site, here are some ideas on how to accomplish them.
The advantage of this approach is simplicity: it should be easy to get something working. The disadvantage is that you won't get many interesting files containing music by popular bands.
One way to approach this problem would be to look for pages that contain links to MP3 files and then heuristically try to classify each such link by assigning an artist and song description to it. How would you go about doing this? You could start by looking at the text surrounding the link. Consider all the words in the link text and in the text surrounding the link. Then look up all these words or phrases in a list of bands, such as the Ultimate Band List, and see which words or phrases are valid band names. You'll probably need to experiment with several heuristics here.
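A first cut at this heuristic might look like the sketch below: collect every one- and two-word phrase from the text around an MP3 link and keep the ones that appear in a list of known band names. The band list here is made up for illustration; a real crawler would load something like the Ultimate Band List, and would need longer phrases and fuzzier matching than this.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of the band-name heuristic: check every one- and
// two-word phrase in the text around a link against a set of known band
// names (all lowercase).  Real data will need more forgiving matching.
public class BandMatcher {
    public static Set<String> candidateBands(String context, Set<String> bandNames) {
        String[] words = context.toLowerCase().split("[^a-z0-9']+");
        Set<String> matches = new LinkedHashSet<>();
        for (int i = 0; i < words.length; i++) {
            if (bandNames.contains(words[i])) matches.add(words[i]);
            if (i + 1 < words.length) {
                String phrase = words[i] + " " + words[i + 1];
                if (bandNames.contains(phrase)) matches.add(phrase);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Set<String> bands = Set.of("pink floyd", "radiohead");
        String context = "Click here for a live Pink Floyd track (128kbps MP3)";
        System.out.println(candidateBands(context, bands)); // [pink floyd]
    }
}
```

When several phrases match, you still need a tie-breaking heuristic (e.g., prefer phrases in the link text itself over those in the surrounding paragraph).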
The advantage of this approach is that it should be capable of collecting a much more interesting set of songs than the site-specific approach. The disadvantage is that it is much harder and more likely to result in noisy (i.e., bogus) data.