freedb is an organization that
maintains a database of CD metadata: who the artist is; what the
CD title is; what the tracks are; etc. The database is used by a large
number of CD ripping tools, and the like. It's also available for
download.
The database distribution is actually a tar'ed set of files, each of
which describes a single CD.
This pdf file
is the official description of the format of those files. The example on the
last page is particularly useful.
In this part of the assignment we're going to use a Perl script to
look through the information in the freed files. The full set of files
is very large (XXGB). Worse, it's a very, very large number of files.
Opening and closing individual files is very, very slow.
To speed things up for this assignment, I've taken just a portion of
the full information (that which freedb put in a directory named 'rock')
and cat'ed it into a single file, called rock.txt.
Presumably each of the original files looked like what's described on the
last page of the specification (linked above), so that's what you should
expect to find in rock.txt. On the other hand, there is so much data,
and it has been contributed by so many people, that there is sure to be
thousands (or more) of errors in it. We're not going to worry about
those errors - we're going to write a straightforward script whose
results are our definition of correctness.
The goal of Part A of this assignment is to write a Perl script
invoked like this:
./findCDs.pl 'P.J. Harvey' /projects/instr/08wi/cse303/hw8/rock.txt
(Bash requires apostrophes (or quotes) around an argument that contains
embedded blanks.)
The Perl script looks through the freedb information contained in the
file (/projects/.../rock.txt), finds all albums that are attributed to
the artist named as the first argument (P.J. Harvey), and then prints
the names of all the albums and the year they were published,
sorted by year (ascending).
Here's an example result, for the command above:
0 Send Your Love To Me (bootleg)
0 Send His Love To Me
0 Black Sessions
0 Room Five O Nine
1992 Dry
1995 To Bring You My Love
1998 A Perfect Day Elise CDS
1999 The Wind (single-volume 1)
A warning: rock.txt is big: 918MB and 1.1M songs.
It takes a little under two minutes for my sample Perl script implemenation
to run over it on attu.
The full (untar'ed, uncompressed) set of freedb files is 11GB.
They're located in /projects/instr/08wi/cse303/hw8/completeData,
in case you want to "browse."
Details
- We consider rock.txt to be made up of successive CD "entries,"
each corresponding to one of the original freedb files.
- Each entry starts with the "# xmcd" line. That's the definition
of a new entry -- seeing that line.
- We are generous about allowing whitespace where it's easy to do
so. For the line above, it is allowed to have whitespace after the 'xmcd',
for instance.
- An entry SHOULD have a DTITLE=artist/CD title line (but I wouldn't be
surprised to find that some do not).
We allow arbitrary whitespace after the '=', before and after the '/', and
at the end of the line. That white space is not significant.
The artist on that line matches the one on the command line argument
if they are identical, doing a case insensitive match.
The title is everything between the non-significant whitespace after
the '/' and at the end of the line.
- If the same album is in the file more than once, we keep track of only
the last occurence. (For example, if the title is "Great CD" and it shows
up twice, with years 2005 and 2011 (in that order), we report only the
2011 copy.)
- We don't care that there might be some entries with "Harvey, P.J." or
"Harvey, Polly Jean" for the artist. Matches have to be exact; that's
our definition of correctness.
- An entry SHOULD have a DYEAR=year line, where the year is made up
of digits. If there is no such line, or year is missing, we take
the year to be 0.
Whitespace is allowed after the '=' and at the end of the line.
The year must be a string of digits (0..9).
If not, it's as though no year were specified.
- You do a case insenstive match in perl by putting a 'i' after the
pattern: m/regexp/i.
- You can use parentheses in the regular expression to retrieve the matching
portion afterwards; the special variables $1, $2, ... are the pieces matching
the sub-expressions (of the regular expression) inside parentheses, in order.
For example, m/(1+).*(A+)/ will result in some string of 1's as $1
and some string of A's as $2, assuming the match succeeds.
- Testing on one of the original freedb files, rather than
rock.txt, will probably save some time.
- There is plenty of opportunity for us all to get results that will not
pass diff. Among other things, the output is sorted by year, but when
there are multiple CDs in a year it's undefined what order they should come in.
- When you're all done with this, think about what happens if the
user gives an artist name like this: '.*Harvey.*'. Cool, eh? (Well, probably
cool - it depends a bit on just how you implemented. It's not required
as part of the assignment that anything cool happen; it should "just work"
on it's own or else forget about it.)
Sample Perl Code
The sample Perl programs shown in class are
here (and also linked
from the course schedule).
Preliminaries
This part is based on a SQLite database that stores both
the questions of a survey and the responses received.
You start with a Perl program that extracts a bit of
information from the database, and modify it to extract
a bit more. The main focus is on reading some (heavily
commented) Perl and on writing a SQL SELECT statement
with a join.
Other Preliminaries
I'm supplying a dummy SQLite database (which is just a file).
It has completely silly responses to the suvey questions,
generated by a program, so that we have some data in the
database to work on.
Please consider filling out the survey questions; having
some non-silly data has to make this more interesting.
The survey is here.
The questions in our survey are taken from
this survey.
It appeared in January 2008.
What To Do
The survey consists of a number of sections, each of which
has a number of questions. That information is (part of what is)
stored in the database.
The starter program reads the database, extracts the section,
question number, and text of each question, and prints it.
Here's a slightly reformatted excerpt:
Using '../dbDir/cse303-08wiSurvey.db'
Section Q# Question
People/Society 1 I feel like I'm part of the society I live in
People/Society 2 My future is promising
People/Society 3 Society's future is promising
People/Society 4 People can change society by their choices and actions
People/Society 5 People in my country can choose their own way of living
All questions are printed as one long list. (That first line is just
alerting you to the name of the database file being used, so that
we don't have any confusion because of it.)
Your task is to modify the information that is extracted, and the
format in which it is printed. Instead of the question number,
we want the minimum, average, and maximum response for given for
each question. Instead of printing the section title on each line,
we want to print it as a kind of header. Each section also has a
prompt string, and we want to print that as well.
Here's an example (using the synthetic data):
Using '../dbDir/cse303-08wiSurvey.db'
People/Society
To what extent are you in agreement with the following statements?
Min Avg Max Question
1 4 7 I feel like I'm part of the society I live in
2 4 6 My future is promising
3 4 5 Society's future is promising
4 4 4 People can change society by their choices and actions
3 4 5 People in my country can choose their own way of living
The Database Schema
The database consists of five tables, as shown in
this figure.
Information about the questions is kept in tables questions and sections.
Each section is described by one row of the sections table.
A section has an id (an integer), a section that follows it (nextsectionid,
an integer), a title (a string), a prompt (a string), and two labels used
to compose the web page for the survey.
Each question occupies a row of the questions table.
A question has an id (an integer), is part of some section (sectionid, an
integer corresponding to the id field of the sections table),
a unique question number (an integer), and some text (the question field).
When someone fills out the survey their responses are recorded in the
responses table.
Each row is an answer to a single question by a single "user."
The userid field corresponds to the id field of the people table;
the questionid field to the id field of the questions table;
the value field is a number from 1 to 7, indicating their answer
to the question on the survey.
The other two tables, people and origins, could be used to analyze the data
by demographics. Since we aren't doing that, they end up being irrelevant
to this assignment.
SQL SELECT
SQL's select statement fetches information from the database.
The data is returned as an array of rows.
The starter program (whose output is shown above)
contains this code:
$query = "SELECT Q.id, S.title, Q.question
FROM questions AS Q, sections as S
WHERE Q.sectionid=S.id";
$sth = doQuery( \$query );
...
while ( my ($qid, $sTitle, $q) = $sth->fetchrow_array() ) {
...
}
The $query variable is set to a SQL select statement.
This particular statement says to fetch data from tables
questions and sections. More particularly, it fetches the
id and question fields from the questions table, and the title
field from the sections table.
The FROM clause says to form the cross product (known as a join,
in database terminology) of the rows in the questions and
sections table: for each row in the questions table, append
every row of the sections table, creating long rows with fields
and values from both tables.
The WHERE clause says which of these long rows are returned:
only those in which the sectionid field value from the question
table matches the id field in the sections table.
Finally, the list of fields after SELECT means that I don't want
the entire long row returned, I just want the three fields named
returned (for each long row that matches the WHERE criterion).
The query is executed by calling the doQuery() method, which
is a Perl subroutine that is part of the sample code.
(Note the backslash before the argument ($query) in the call.)
Executing the query doesn't actually return the results, it
simply sets things up so those results can be fetched a row
at a time. The while loop is doing that: fetching the results
a row at a time. The fetchrow_array() function returns the next
result row, as a Perl array. The stuff on the left of the equals
means that each element of the array should be assigned to the scalar
variables named there, in order. (I.e., the 0th element of the array
is assigned to $qid, the 1st to $sTitle, and the thrid to $q.)
The result array has values in the order given by the fields named
after SELECT in the SQL query.
One main part of this assignment is modifying the SQL query
to get the information you need. You might, as a Java/C programmer,
be tempted to use many SQL queries to do this (e.g., one or more to get
section headings, one for minimum response values, on for average
values, etc.), but that isn't how we think in SQL. Instead,
just fetch everything as one SELECT query; it will be a lot easier,
and so be less frustrating, and one could easily make an argument that in
most real situations it would be faster than issuing many distinct
SQL queries.
Additional information sources for SQL are linked from the course schedule.
We are using such a basic form of SELECT that almost any source
of information should be accurate. However, different database implementations
can do things slightly differently, so be aware of that.
The SQLite documentation is here.
You will want to use the aggregate function (for min, avg, and max)
documented (by SQLite) here.
MySQL has more extensive documentation. The SELECT page is
here.
Note that you must use a GROUP BY clause in your SELECT if you use
an aggregate function.
Debugging SQL
Unless you're an experience SQL programmer, you should debug
your SELECT separately from worrying about your Perl program.
To do that, find the directory that has the database file,
and issue this command:
attu> sqlite3 cse303-08wiSurvey.db
SQLite version 3.4.2
Enter ".help" for instructions
sqlite>
You can now just issue any SQL query (e.g., your select), ending it
with a semi-colon:
sqlite> SELECT * FROM people;
1|male|1
2|male|1
If you want to see/verify the database schema, SQLite has a special
command:
sqlite> .schema
CREATE TABLE origins(id integer primary key,value text);
CREATE TABLE people(id integer primary key,gender text,originid integer);
CREATE TABLE questions(id integer primary key,sectionid integer,
questionnum integer,question text);
CREATE TABLE responses(userid integer,questionid integer,value integer);
CREATE TABLE sections(id integer primary key,nextsectionid integer,title text,
secprompt text,lowlabel text,highlabel text);
Once your query is giving back the results you want, copy it
to the Perl program and work on the Perl part of the assginment.
Ctrl-d to get out of sqlite3.