CSE 303: Concepts and Tools for Software Development, Winter 2008
  CSE Home   About Us   Search   Contact Info 
 
Course Home
 Home
Administation
 Overview
 Course Wiki
 Email archive
 Anonymous feedback
 View feedback
 Homework Turnin
 
Most Everything
 Schedule
 
Other Information
 UW/ACM Tutorials
 303 Computing: Getting Started
   

CSE 303 Homework 8
Due: Friday, 3/14/08, 6:00PM
Turnin: Described Below

FAQ

  • HW8 Q&A Wiki Page
  • Overview

    This assignment is to be done individually.

    There are two parts. The first is a small Perl program, with no databases involved. The second is a small Perl program that accesses an sqlite database.

    Part A: freedb CD Data

    freedb is an organization that maintains a database of CD metadata: who the artist is; what the CD title is; what the tracks are; etc. The database is used by a large number of CD ripping tools, and the like. It's also available for download.

    The database distribution is actually a tar'ed set of files, each of which describes a single CD. This pdf file is the official description of the format of those files. The example on the last page is particularly useful.

    In this part of the assignment we're going to use a Perl script to look through the information in the freed files. The full set of files is very large (XXGB). Worse, it's a very, very large number of files. Opening and closing individual files is very, very slow. To speed things up for this assignment, I've taken just a portion of the full information (that which freedb put in a directory named 'rock') and cat'ed it into a single file, called rock.txt. Presumably each of the original files looked like what's described on the last page of the specification (linked above), so that's what you should expect to find in rock.txt. On the other hand, there is so much data, and it has been contributed by so many people, that there is sure to be thousands (or more) of errors in it. We're not going to worry about those errors - we're going to write a straightforward script whose results are our definition of correctness.

    The goal of Part A of this assignment is to write a Perl script invoked like this:

    ./findCDs.pl 'P.J. Harvey' /projects/instr/08wi/cse303/hw8/rock.txt
    
    (Bash requires apostrophes (or quotes) around an argument that contains embedded blanks.) The Perl script looks through the freedb information contained in the file (/projects/.../rock.txt), finds all albums that are attributed to the artist named as the first argument (P.J. Harvey), and then prints the names of all the albums and the year they were published, sorted by year (ascending). Here's an example result, for the command above:
       0 Send Your Love To Me (bootleg)
       0 Send His Love To Me
       0 Black Sessions
       0 Room Five O Nine
    1992 Dry
    1995 To Bring You My Love
    1998 A Perfect Day Elise CDS
    1999 The Wind (single-volume 1)
    

    A warning: rock.txt is big: 918MB and 1.1M songs. It takes a little under two minutes for my sample Perl script implemenation to run over it on attu. The full (untar'ed, uncompressed) set of freedb files is 11GB. They're located in /projects/instr/08wi/cse303/hw8/completeData, in case you want to "browse."

    Details

    • We consider rock.txt to be made up of successive CD "entries," each corresponding to one of the original freedb files.
    • Each entry starts with the "# xmcd" line. That's the definition of a new entry -- seeing that line.
    • We are generous about allowing whitespace where it's easy to do so. For the line above, it is allowed to have whitespace after the 'xmcd', for instance.
    • An entry SHOULD have a DTITLE=artist/CD title line (but I wouldn't be surprised to find that some do not). We allow arbitrary whitespace after the '=', before and after the '/', and at the end of the line. That white space is not significant. The artist on that line matches the one on the command line argument if they are identical, doing a case insensitive match. The title is everything between the non-significant whitespace after the '/' and at the end of the line.
    • If the same album is in the file more than once, we keep track of only the last occurence. (For example, if the title is "Great CD" and it shows up twice, with years 2005 and 2011 (in that order), we report only the 2011 copy.)
    • We don't care that there might be some entries with "Harvey, P.J." or "Harvey, Polly Jean" for the artist. Matches have to be exact; that's our definition of correctness.
    • An entry SHOULD have a DYEAR=year line, where the year is made up of digits. If there is no such line, or year is missing, we take the year to be 0. Whitespace is allowed after the '=' and at the end of the line. The year must be a string of digits (0..9). If not, it's as though no year were specified.
    • You do a case insenstive match in perl by putting a 'i' after the pattern: m/regexp/i.
    • You can use parentheses in the regular expression to retrieve the matching portion afterwards; the special variables $1, $2, ... are the pieces matching the sub-expressions (of the regular expression) inside parentheses, in order. For example, m/(1+).*(A+)/ will result in some string of 1's as $1 and some string of A's as $2, assuming the match succeeds.
    • Testing on one of the original freedb files, rather than rock.txt, will probably save some time.
    • There is plenty of opportunity for us all to get results that will not pass diff. Among other things, the output is sorted by year, but when there are multiple CDs in a year it's undefined what order they should come in.
    • When you're all done with this, think about what happens if the user gives an artist name like this: '.*Harvey.*'. Cool, eh? (Well, probably cool - it depends a bit on just how you implemented. It's not required as part of the assignment that anything cool happen; it should "just work" on it's own or else forget about it.)

    Sample Perl Code

    The sample Perl programs shown in class are here (and also linked from the course schedule).

    Part B: Perl + sql = Survey Results

    Preliminaries

    This part is based on a SQLite database that stores both the questions of a survey and the responses received. You start with a Perl program that extracts a bit of information from the database, and modify it to extract a bit more. The main focus is on reading some (heavily commented) Perl and on writing a SQL SELECT statement with a join.

    Other Preliminaries

    I'm supplying a dummy SQLite database (which is just a file). It has completely silly responses to the suvey questions, generated by a program, so that we have some data in the database to work on.

    Please consider filling out the survey questions; having some non-silly data has to make this more interesting. The survey is here.

    The questions in our survey are taken from this survey. It appeared in January 2008.

    What To Do

    The survey consists of a number of sections, each of which has a number of questions. That information is (part of what is) stored in the database. The starter program reads the database, extracts the section, question number, and text of each question, and prints it. Here's a slightly reformatted excerpt:

    Using '../dbDir/cse303-08wiSurvey.db'
             Section     Q#  Question
      People/Society      1  I feel like I'm part of the society I live in
      People/Society      2  My future is promising
      People/Society      3  Society's future is promising
      People/Society      4  People can change society by their choices and actions
      People/Society      5  People in my country can choose their own way of living
    
    All questions are printed as one long list. (That first line is just alerting you to the name of the database file being used, so that we don't have any confusion because of it.)

    Your task is to modify the information that is extracted, and the format in which it is printed. Instead of the question number, we want the minimum, average, and maximum response for given for each question. Instead of printing the section title on each line, we want to print it as a kind of header. Each section also has a prompt string, and we want to print that as well. Here's an example (using the synthetic data):

    Using '../dbDir/cse303-08wiSurvey.db'
    
    People/Society
    To what extent are you in agreement with the following statements?
    
      Min    Avg    Max  Question
        1      4      7  I feel like I'm part of the society I live in
        2      4      6  My future is promising
        3      4      5  Society's future is promising
        4      4      4  People can change society by their choices and actions
        3      4      5  People in my country can choose their own way of living
    

    The Database Schema

    The database consists of five tables, as shown in this figure.

    Information about the questions is kept in tables questions and sections. Each section is described by one row of the sections table. A section has an id (an integer), a section that follows it (nextsectionid, an integer), a title (a string), a prompt (a string), and two labels used to compose the web page for the survey. Each question occupies a row of the questions table. A question has an id (an integer), is part of some section (sectionid, an integer corresponding to the id field of the sections table), a unique question number (an integer), and some text (the question field).

    When someone fills out the survey their responses are recorded in the responses table. Each row is an answer to a single question by a single "user." The userid field corresponds to the id field of the people table; the questionid field to the id field of the questions table; the value field is a number from 1 to 7, indicating their answer to the question on the survey.

    The other two tables, people and origins, could be used to analyze the data by demographics. Since we aren't doing that, they end up being irrelevant to this assignment.

    SQL SELECT

    SQL's select statement fetches information from the database. The data is returned as an array of rows.

    The starter program (whose output is shown above) contains this code:

    $query = "SELECT Q.id, S.title, Q.question
              FROM questions AS Q, sections as S
              WHERE Q.sectionid=S.id";
    $sth = doQuery( \$query );
    ...
    while ( my ($qid, $sTitle, $q) = $sth->fetchrow_array() ) {
       ...
    }
    
    The $query variable is set to a SQL select statement. This particular statement says to fetch data from tables questions and sections. More particularly, it fetches the id and question fields from the questions table, and the title field from the sections table. The FROM clause says to form the cross product (known as a join, in database terminology) of the rows in the questions and sections table: for each row in the questions table, append every row of the sections table, creating long rows with fields and values from both tables. The WHERE clause says which of these long rows are returned: only those in which the sectionid field value from the question table matches the id field in the sections table. Finally, the list of fields after SELECT means that I don't want the entire long row returned, I just want the three fields named returned (for each long row that matches the WHERE criterion).

    The query is executed by calling the doQuery() method, which is a Perl subroutine that is part of the sample code. (Note the backslash before the argument ($query) in the call.) Executing the query doesn't actually return the results, it simply sets things up so those results can be fetched a row at a time. The while loop is doing that: fetching the results a row at a time. The fetchrow_array() function returns the next result row, as a Perl array. The stuff on the left of the equals means that each element of the array should be assigned to the scalar variables named there, in order. (I.e., the 0th element of the array is assigned to $qid, the 1st to $sTitle, and the thrid to $q.) The result array has values in the order given by the fields named after SELECT in the SQL query.

    One main part of this assignment is modifying the SQL query to get the information you need. You might, as a Java/C programmer, be tempted to use many SQL queries to do this (e.g., one or more to get section headings, one for minimum response values, on for average values, etc.), but that isn't how we think in SQL. Instead, just fetch everything as one SELECT query; it will be a lot easier, and so be less frustrating, and one could easily make an argument that in most real situations it would be faster than issuing many distinct SQL queries.

    Additional information sources for SQL are linked from the course schedule. We are using such a basic form of SELECT that almost any source of information should be accurate. However, different database implementations can do things slightly differently, so be aware of that. The SQLite documentation is here. You will want to use the aggregate function (for min, avg, and max) documented (by SQLite) here. MySQL has more extensive documentation. The SELECT page is here.

    Note that you must use a GROUP BY clause in your SELECT if you use an aggregate function.

    Debugging SQL

    Unless you're an experience SQL programmer, you should debug your SELECT separately from worrying about your Perl program. To do that, find the directory that has the database file, and issue this command:

    attu> sqlite3 cse303-08wiSurvey.db
    SQLite version 3.4.2
    Enter ".help" for instructions
    sqlite>
    
    You can now just issue any SQL query (e.g., your select), ending it with a semi-colon:
    sqlite> SELECT * FROM people;
    1|male|1
    2|male|1
    
    If you want to see/verify the database schema, SQLite has a special command:
    sqlite> .schema
    CREATE TABLE origins(id integer primary key,value text);
    CREATE TABLE people(id integer primary key,gender text,originid integer);
    CREATE TABLE questions(id integer primary key,sectionid integer,
                           questionnum integer,question text);
    CREATE TABLE responses(userid integer,questionid integer,value integer);
    CREATE TABLE sections(id integer primary key,nextsectionid integer,title text,
                          secprompt text,lowlabel text,highlabel text);
    
    Once your query is giving back the results you want, copy it to the Perl program and work on the Perl part of the assginment.

    Ctrl-d to get out of sqlite3.

    Files

    Note: The file distribution scheme has changed since the original posting of this assignment. If you have already fetched and modified the Part A starter code, be sure not to clobber it when doing the following. Once you have fetched the re-packaged files, just copy your modified Part A code over the newly fetched starter copy.

    Fetch attu:/cse/courses/cse303/08wi/hw8Dist.tar.gz. Once unpacked, you will find subdirectories:

    • freedb - contains findCDs.pl, nearly functionless Part A starter code.
    • perlScripts - contains printQuestions.pl, starter code for Part B.
    • dbDir - contains cse303-08wiSurvey.db, the dummy SQLite database.

    Once there are a few more responses to the online survey, I'll make it's "live" cse303-08wiSurvey.db file available.

    Turnin

    Follow the usual online procedure. It's okay to hand in the SQLite database file, but please make sure you don't turnin a copy of rock.txt or anything else big. (You shouldn't have copied anything big from Part A anyway - you can just directly access them out in the /projects/... space.)

    Computer Science & Engineering
    University of Washington
    Box 352350
    Seattle, WA  98195-2350
    (206) 543-1695 voice, (206) 543-2969 FAX
    [comments to zahorjan at cs.washington.edu]