CSE 303 Homework 5 - Software Updates

Version 1.1

Version 1.1 fixes a bug in the lexer that made the output from getToken() slightly dependent on the order in which pages are scanned. It also provides a new version of the sample solution, which incorporates the lexer bug fix and contains some new functionality.

Updating

$ scp <login>@attu:/cse/courses/cse303/05au/hw5/V1.1/hw5soln* .
hw5soln                                       100%   31KB  30.9KB/s   00:00
hw5soln.exe                                   100%   37KB  37.4KB/s   00:00

$ rm xxx    # replace xxx with the executable you don't need: hw5soln is the attu executable; hw5soln.exe is the cygwin executable

$ cd lexer

$ scp <login>@attu:/cse/courses/cse303/05au/hw5/V1.1/lexer.c .
lexer.c                                       100% 7680     7.5KB/s   00:00

$ scp <login>@attu:/cse/courses/cse303/05au/hw5/V1.1/html.flex .
html.flex                                     100% 1439     1.4KB/s   00:00

$ make clean
rm -f *.o ../liblexer.a lex.yy.c *~

$ export MOREFLAGS="-O3"    # capital-oh 3

$ make
flex -i html.flex
gcc -O3 -Wall -c lex.yy.c
lex.yy.c:1469: warning: 'yyunput' defined but not used
gcc -O3 -Wall -c lexer.c
ar -rvc ../liblexer.a lex.yy.o lexer.o
a - lex.yy.o
a - lexer.o

$ cd ..

$ ls -l liblexer.a
-rw-rw-rw-  1 zahorjan None 22860 Nov  9 11:39 liblexer.a

$ ranlib liblexer.a    # this command is needed on Mac systems only

V1.1 Sample Solution Output

attu
Done:
        1898 pages, 1464414 words, 161944 links scanned
        54432 distinct words

cygwin
Done:
        1899 pages, 1464981 words, 162011 links scanned
        54432 distinct words

Different Number of Pages Scanned on Different Machines

Unix file systems are case sensitive; Windows file systems are not. Whether a URL is case sensitive seems to depend on whether the web server is running Windows or Unix. As a result, a link embedded in a web page can differ in case from the name of the page it references. When that happens in the sample data, the link appears on attu to refer to a page that doesn't exist, while on cygwin (i.e., Windows) a page corresponding to the link does exist. That is why the cygwin run scans 1899 pages and the attu run scans 1898.

(Don't try to "fix" this in code; it can't be done in any reasonable sense.)
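
To see the effect for a single link, a small diagnostic along the following lines may help. It is only a sketch for inspecting the data, not a fix and not part of the hw5 distribution; the function name classify_link and its output labels are made up for illustration. On a case-insensitive file system such as cygwin's, the first test already succeeds for a link that differs only in case, which is exactly the difference described above.

```shell
# classify_link ROOT LINK
# Hypothetical diagnostic (not part of the hw5 distribution). Prints:
#   exact      LINK names an existing file under ROOT
#   case-only  no exact match, but some file differs from LINK only in case
#   missing    no file matches even when case is ignored
classify_link() {
    root=$1
    link=$2
    if [ -e "$root/$link" ]; then
        echo "exact"
        return 0
    fi
    # List files relative to ROOT, then look for a whole-line,
    # fixed-string, case-insensitive match against LINK.
    if (cd "$root" && find . -type f | sed 's|^\./||') | grep -i -x -q -F -- "$link"; then
        echo "case-only"
    else
        echo "missing"
    fi
}
```

Running classify_link with the data root directory and a link path taken from a page reports which of the three cases applies on the machine you are using.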

V1.1 Sample Solution Debugging Enhancements

V1.1 of the sample solution has some enhancements that might be useful for debugging. New switches cause the sample solution to write files listing the data it processed. If you're not getting the same results as the sample solution, you can write similar data from your program and compare the two output files. (The best way to do that is probably to sort each file with sort and then compare the sorted files with diff.)

$ ./hw5soln -h

Usage:   ./hw5soln [OPTION] dataRootDir startFile [startFile...]

OPTIONS:
    -dl file  Output list of links returned by getToken() to 'file'.
    -dw file  Output list of words returned by getToken() to 'file'.
    -dp file  Output list of pages scanned to 'file'.
    -dd file  Output list of distinct words found to 'file'.
    -v        Print version and quit.
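
The sort-then-diff comparison suggested above could be wrapped up roughly as follows. This is a minimal sketch: compare_dumps is a hypothetical helper, the file names in the closing comment are placeholders, and it assumes your program writes its dump in the same one-item-per-line format as the sample solution.

```shell
# compare_dumps SOLN_DUMP MY_DUMP
# Hypothetical helper (not part of the hw5 distribution).
# Sorts both dump files in byte order, then diffs the sorted copies.
# Exit status is diff's: 0 if the sorted files match, 1 if they differ.
compare_dumps() {
    soln_sorted=$(mktemp) || return 2
    mine_sorted=$(mktemp) || return 2
    # LC_ALL=C forces plain byte ordering, so the result does not
    # depend on the locale settings of the machine you run on.
    LC_ALL=C sort "$1" > "$soln_sorted"
    LC_ALL=C sort "$2" > "$mine_sorted"
    status=0
    diff "$soln_sorted" "$mine_sorted" || status=$?
    rm -f "$soln_sorted" "$mine_sorted"
    return $status
}

# Example (file names are placeholders):
#   ./hw5soln -dw soln_words.txt dataRootDir startFile
#   ... have your own program write my_words.txt ...
#   compare_dumps soln_words.txt my_words.txt
```

Lines that diff marks with < appear only in the first file (the sample solution's dump); lines marked with > appear only in the second.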