Version 1.1
Version 1.1 fixes a bug in the lexer that made the output from getToken() slightly dependent on the order in which pages are scanned. It also provides a new version of the sample solution, which incorporates the lexer bug fix and adds some new functionality.
Updating
$ scp <login>@attu:/cse/courses/cse303/05au/hw5/V1.1/hw5soln* .
hw5soln       100%   31KB  30.9KB/s   00:00
hw5soln.exe   100%   37KB  37.4KB/s   00:00
$ rm xxx    # hw5soln is an attu executable; hw5soln.exe is a cygwin executable. Delete one...
$ cd lexer
$ scp <login>@attu:/cse/courses/cse303/05au/hw5/V1.1/lexer.c .
lexer.c       100% 7680     7.5KB/s   00:00
$ scp <login>@attu:/cse/courses/cse303/05au/hw5/V1.1/html.flex .
html.flex     100% 1439     1.4KB/s   00:00
$ make clean
rm -f *.o ../liblexer.a lex.yy.c *~
$ export MOREFLAGS="-O3"    # capital-oh 3
$ make
flex -i html.flex
gcc -O3 -Wall -c lex.yy.c
lex.yy.c:1469: warning: 'yyunput' defined but not used
gcc -O3 -Wall -c lexer.c
ar -rvc ../liblexer.a lex.yy.o lexer.o
a - lex.yy.o
a - lexer.o
$ cd ..
$ ls -l liblexer.a
-rw-rw-rw- 1 zahorjan None 22860 Nov 9 11:39 liblexer.a
$ ranlib liblexer.a    # this command is needed on mac systems only
V1.1 Sample Solution Output
attu
Done: 1898 pages, 1464414 words, 161944 links scanned 54432 distinct words
cygwin
Done: 1899 pages, 1464981 words, 162011 links scanned 54432 distinct words
Different Number of Pages Scanned on Different Machines
Unix file systems are case sensitive; Windows file systems are not. URLs may or may not be case sensitive; it seems to depend on whether the web server is running Windows or Unix. This means a link embedded in a web page can differ in case from the name of the page it references. When that happens in the sample data, on attu the link appears to refer to a page that doesn't exist, while on cygwin (i.e., Windows) a page corresponding to the link does exist. That is why the cygwin run above scans one more page than the attu run. (Don't think about trying to "fix" this in code; it can't be done in any reasonable sense.)
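If you want to see which links are affected, a small diagnostic along the following lines may help. It is only a sketch: link_case_status is a hypothetical helper, not part of the assignment or the sample solution. It compares a name against the directory listing, so it should report the same answer on attu and on cygwin.

```shell
# link_case_status DIR NAME   (hypothetical helper, for diagnosis only)
# Prints "exact" if DIR contains an entry spelled exactly NAME,
# "case-mismatch" if an entry matches only when case is ignored,
# and "missing" otherwise.
link_case_status() {
    if ls -A "$1" | grep -qxF -- "$2"; then
        echo exact
    elif ls -A "$1" | grep -qixF -- "$2"; then
        echo case-mismatch
    else
        echo missing
    fi
}
```

A "case-mismatch" result marks a link that resolves on cygwin but not on attu. This only reports the situation; it does not change what your program should do.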
V1.1 Sample Solution Debugging Enhancements
V1.1 of the sample solution has some enhancements that might be useful for debugging. New switches cause the sample solution to dump files listing the data it processed. If you're not getting the same results as the sample solution, you can dump similar data from your program and compare the two output files. (The best way to do that is probably to sort each file with sort and then compare the sorted files with diff.)
$ ./hw5soln -h
Usage: ./hw5soln [OPTION] dataRootDir startFile [startFile...]
OPTIONS:
    -dl file    Output list of links returned by getToken() to 'file'.
    -dw file    Output list of words returned by getToken() to 'file'.
    -dp file    Output list of pages scanned to 'file'.
    -dd file    Output list of distinct words found to 'file'.
    -v          Print version and quit.
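The sort-then-diff comparison suggested above could be wrapped up roughly as follows. This is a sketch under assumptions: compare_dumps is a hypothetical helper, and the file names soln.words and mine.words are placeholders for a dump written by the sample solution (for example with -dw) and the corresponding dump from your own program.

```shell
# compare_dumps SOLN_FILE MY_FILE   (hypothetical helper)
# Sorts both dump files, then prints the lines that differ.
# Returns 0 if the sorted files are identical, 1 if they differ.
compare_dumps() {
    soln_sorted=$(mktemp) || return 2
    mine_sorted=$(mktemp) || return 2
    sort "$1" > "$soln_sorted"
    sort "$2" > "$mine_sorted"
    diff "$soln_sorted" "$mine_sorted"
    status=$?
    rm -f "$soln_sorted" "$mine_sorted"
    return $status
}

# Example use (file names are placeholders):
#   compare_dumps soln.words mine.words
```

Sorting first means the comparison does not depend on the order in which pages were scanned. In diff's default output, lines marked "<" appear only in the first file and lines marked ">" only in the second.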