CSE 341 -- Programming Languages

Autumn 2000

Department of Computer Science and Engineering, University of Washington

Steve Tanimoto (instructor) and Jeremy Baer (teaching assistant).

Assignment P1

Version 1.1 of November 14.  Subject to change.

Perl Introduction

Due dates and time:  This assignment is due on Wednesday, November 22, 2000 at 5:30.  Submit your assignment with this form.

Title: Processing Web Pages with Perl

Purposes:  To get you acquainted with Perl and its facilities for string processing, CGI and document fetching on the web.

Instructions:  Create a directory P1 under your www directory and put your files for this assignment in that directory. First create a web page index.html with HTML that provides a pair of text field input areas and a submit button.  Set these up so that the user can enter a pair of URLs.  Next write a CGI script in Perl that processes the user's submission by (a) fetching the web pages (reporting any problems), (b) comparing the words used in the two documents, and (c) generating a new web page that gives the results of the analysis.

The results page should include (1) an alphabetical list of those words that occur in both documents, (2) an alphabetical list of the words occurring in the first document but not the second, (3) an alphabetical list of the words occurring in the second document but not the first, and (4) the following statistics:

n1 = the total number of words (but not HTML tags) in the first document, counting each occurrence of any repeated word;
n2 = the corresponding total for the second document;
m1 = the number of word occurrences in the first document involving words that do not occur in the second document;
m2 = the corresponding number for the second document;
(m1 + m2) / (n1 + n2) = a kind of distance between the two documents.

The distance should be 0 if the two documents are identical or use the same set of words, and it should be 1 if the documents don't have any words in common.

Use the following rules in determining the set of words in each document:  Convert all words to lower case; consider hyphenated words as separate words; remove all HTML tags and discard them.
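As a rough illustration of the word rules and the statistics above, here is a minimal sketch in Perl. It is not a reference solution: the helper names extract_words and compare_words are hypothetical, the tag-stripping regular expression is a simple heuristic rather than a real HTML parser, and treating a word as a run of ASCII letters is an assumption (so "don't" splits into "don" and "t", and entities such as "&amp;" are not decoded). Fetching the pages and generating the results page are left out.

```perl
use strict;
use warnings;

# Hypothetical helper: strip HTML tags, then return the lower-cased words.
# Assumption: a "word" is a run of ASCII letters, so hyphenated words
# are split into separate words automatically.
sub extract_words {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>/ /g;    # replace each tag with a space
    return map { lc($_) } ($text =~ /[A-Za-z]+/g);
}

# Hypothetical helper: build the three word lists and the statistics
# n1, n2, m1, m2 and the distance from two array references.
sub compare_words {
    my ($words1, $words2) = @_;
    my (%count1, %count2);
    $count1{$_}++ for @$words1;             # occurrences per word, doc 1
    $count2{$_}++ for @$words2;             # occurrences per word, doc 2

    my @both  = sort { $a cmp $b } grep {  exists $count2{$_} } keys %count1;
    my @only1 = sort { $a cmp $b } grep { !exists $count2{$_} } keys %count1;
    my @only2 = sort { $a cmp $b } grep { !exists $count1{$_} } keys %count2;

    my $n1 = @$words1;                      # total occurrences in doc 1
    my $n2 = @$words2;                      # total occurrences in doc 2
    my ($m1, $m2) = (0, 0);
    $m1 += $count1{$_} for @only1;          # occurrences unique to doc 1
    $m2 += $count2{$_} for @only2;          # occurrences unique to doc 2

    # Assumption: report distance 0 when both documents are empty.
    my $distance = ($n1 + $n2) ? ($m1 + $m2) / ($n1 + $n2) : 0;

    my %result = (
        both => \@both, only1 => \@only1, only2 => \@only2,
        n1 => $n1, n2 => $n2, m1 => $m1, m2 => $m2, distance => $distance,
    );
    return \%result;
}

# Small usage example on two in-line HTML fragments.
my @w1 = extract_words('<p>The quick-brown fox</p>');
my @w2 = extract_words('<b>The lazy dog</b>');
my $r  = compare_words(\@w1, \@w2);
printf "n1=%d n2=%d m1=%d m2=%d distance=%.3f\n",
    @{$r}{qw(n1 n2 m1 m2 distance)};
# prints: n1=4 n2=3 m1=3 m2=2 distance=0.714
```

Counting occurrences in a hash for each document keeps the set operations simple: the keys give the distinct words for the three lists, and the values give the occurrence counts needed for m1 and m2.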

Optional enhancements:  (a) After you have finished the program above, modify the form to provide a radio button or check box that lets the user specify either "Normal" or "Enhanced" document comparison.  The normal choice should invoke the program as described above.  The enhanced choice should cause the words to be "stemmed" -- this means that standard verb endings such as "ing" and "ed" should be removed.  Also, reduce nouns to the singular, so that "cats" becomes "cat".  You can use simple heuristic rules here, and you do not have to handle all the special cases correctly.  This option of the assignment simply gives you the opportunity to play around with Perl's regular expression capability and try your hand at stemming.  If your stemming does not work very well, you will certainly not receive any fewer points than if you didn't try it at all.

(b) Before comparing words from each document with words in the other, eliminate all "stop words".  Stop words are words like "a", "an", "the" and other common words that don't carry hints about the topic being discussed.  A list of stop words that your Perl script should read in is available at this address on cubist.
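The two optional enhancements could be sketched along the following lines. This is only one possible heuristic, not the expected answer: the helper names stem and read_stop_words are hypothetical, the stemming rules are deliberately naive (for example, "this" becomes "thi" unless it is removed as a stop word first), and the stop-word list is assumed to contain one word per line, which may not match the actual format of the file on cubist. The example reads the list from an in-memory filehandle so that it runs without any external file.

```perl
use strict;
use warnings;

# Hypothetical heuristic stemmer. Requires at least three characters
# before the ending so that short words such as "bring" are left alone.
sub stem {
    my ($word) = @_;
    $word =~ s/^(\w{3,})(?:ing|ed)$/$1/;               # "walked" -> "walk"
    $word =~ s/^(\w{3,})s$/$1/ unless $word =~ /ss$/;  # "cats" -> "cat"
    return $word;
}

# Hypothetical helper: read stop words from an open filehandle.
# Assumption: the list has one stop word per line.
sub read_stop_words {
    my ($fh) = @_;
    my %stop;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;                       # trim whitespace
        $stop{ lc($line) } = 1 if length $line;
    }
    return \%stop;
}

# Usage example with an in-memory stand-in for the stop-word file.
my $list = "a\nan\nthe\n";
open my $fh, '<', \$list or die "Cannot open in-memory list: $!";
my $stop = read_stop_words($fh);
close $fh;

my @words = qw(the cats walked across a crowded crossing);
my @kept  = map { stem($_) } grep { !$stop->{$_} } @words;
print "@kept\n";
# prints: cat walk across crowd cross
```

Removing stop words before stemming, as done here, means the stop-word list can be matched against the unmodified lower-cased words.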

Programming style:  Comment your code.