CSE 490i - Project Part 2: Database-Driven Websites

Due: February 4, 2000; 5pm.

Please: Be sure to read the whole assignment from start to finish before you start working; there are numerous dependencies where the wrong design choice early-on could hurt you later.

Objective: Provide the infrastructure so that your site can support large numbers of MP3 files and support quick and flexible searches.

Groups & Collaboration: Starting with this assignment, students will form into groups of two or three students. We encourage groups to collaborate, provided that (1) each group does the reading and spends some time thinking about each problem before discussing it with others, and (2) each group writes up each problem itself. Directly copying another group's work (architectures, diagrams, or code) is cheating, a serious offense.

Creating the DBMS

Rather that managing a huge XML document of all your MP3 files, a more scalable approach is to store them in a relational database. We've hooked MySQL to Cocoon to facilitate your effort.

The first step is creating a short name for your group and sending Tessa the name along with the usernames of the group participants; she'll create a MySQL database for each group and name it using the group name.

Your next step is designing the database schema, i.e. answering the following questions: How many tables should we create? Which attributes will each table have? What data type should be used for each attribute? Which attributes should have indicies? If a group member has taken a database class, he or she will have studied normalization procedures and should know quite a bit about this schema design problem, but since this is a relatively small project it's pretty easy to come up with a reasonable design by following one's intuitions. You should feel to use any design you think is best, but one reasonable choice is to use the following two tables:

band(name, location, genre)
track(title, band, size-in-mb, source, url)

where name is the key (i.e. is presumed unique) of the band relation, and the band attribute of the track relation indexes into the band relation. It's probably wise to have indicies on name and title attributes; possibly also on band and genre.

Since bands often release multiple recordings (e.g., studio, live, etc.) of a given song, it seems likely that one needs to give a title, band, and date in order to uniquely specify a track. Feel free to augment your tables in order to store this information, pictures, lyrics, or any other information you like. You might wish to add another string field to store any XML data which doesn't match the relational attributes you have defined (or you could simply discard all that info). Or you may choose a completely different table design if you prefer. Simple is fine, but we'll award extra credit for ambitious efforts.

Once you have designed your table format, you'll want to connect to the database and create the tables. Probably the easiest way to do this is through the command-line program mysql. Invoke it by specifying your username and password: mysql -u user -p (which will prompt you to type in your password).

Loading Data into the DBMS

The objective of this part of the assignment is twofold:
  1. To transfer the XML music information which you created in part 1 of the project to the database, and
  2. To develop the infrastructure and experience with programmatic database access (i.e. embedded SQL) that you will need in part 3 of the project (where you'll be writing an MP3 crawler).
So you have to write some sort of program to parse the XML and add it to the database. WHile you may use any language you choose, we strongly recommend Java for two reasons. First, there are many XML parsers written in Java. Second, Java's support for multithreading will make it the ideal platform for the web crawler of part 3.

To help get you started, we provide a sample Java program which uses the Xerces parser to create a DOM tree and extract elements. Specifically, /www/htdocs/project2/xmlparse.java parses an XML document (such as the hello.xml file in that directory) and extracts <title> elements and their contents. There are plenty of XML parsers, and you are welcome to use a different one if you choose. We expect (hope?) that the choice of parser shouldn't matter much, since the DOM part of the XML standard hasn't fluctuated nearly as much as the XSL parts we used in part 1, and the parsers have been around longer and appear substantially more stable.

Parsing XML is just a a matter of instantiating the parser, converting a relative filename into a file:// URL, and feeding that URL to the parser. The hardest part is figuring out what to do with a DOM tree once you get it from the parser. You should read the DOM specification at w3.org in conjunction with its appendix on the Java language binding.

We also provide DB.java (in the same directory) which will insert a (name, age) pair into the test database, and print the updated contents of the database after the insert.

DB to Web

One of the advantages of storing your MP3 information in a relational database is the ability to efficiently process SQL queries on large datasets. In this part of the assignment, you should create one or more forms that allow users to search the DB by specifying Traditionally, programmers have implemented database search by writing some web forms that connect to CGI scripts on the server, having the scripts perform the query, and return the results as HTML; we've already talked about the performance penalty of this approach. An alternative method is to write a Java servlet which uses JDBC to connect to the DB. Some people may prefer to use PHP, JSP, Perl or some other scripting language; these approaches may be faster to get running, but tend not to scale as well.

A final approach is to use the SQLProcessor part of Cocoon to generate dynamic search results. With this approach, however, it seems that you have to embed the SQL query into the XML file rather than having it formulated on the fly. It's probably possible to write server-side JavaScript to generate a query dynamically, and have Cocoon render the page twice (once to turn the query paramaters into a SQL query using JavaScript, and another time to turn the XMLified query results into HTML). You can see a demo of this approach as well as read documentation.

You are welcome to use any approach you wish; just explain what you did and why.

Note that a small part of the grade will depend on the cleanliness and usability of your forms interface. You might wish to look at several professional sites to get some tips about usability.

What to Hand In

Hand in the URL of a top-level web page that lists the team name (be creative here!) along with the name and contact information for each team member.
  1. A description of your database schema (and if you did anything clever, please add any rationale you think is helpful )
  2. The program you wrote to add the XML data to the database.
  3. A very brief explanation about your treatment of extra XML entities and attributes. I.e. did you store them in an overflow table or drop them or ?
  4. The URL of your search form, the code which drives it, and a very brief explanation of its operation.
If you get stuck or can't complete every part of the assignment, do as much as you can. Partial credit will definitely be awarded. If a bug or software glitch gets you, let us know as soon as possible (we'll give credit for finding these, and even more for finding the solution or workaround) and keep working on other parts of the assignment.

Pointers to readings


Tessa Lau | tlau@cs.washington.edu