CSE 303: Concepts and Tools for Software Development, Winter 2008

CSE 303: Concepts and Tools for Software Development, Winter 2008

CSE Home

About Us

Search

Contact Info

Course Home
	Home

Administation
	Overview
	Course Wiki
	Email archive
	Anonymous feedback
	View feedback
	Homework Turnin
Most Everything
	Schedule
Other Information
	UW/ACM Tutorials
	303 Computing: Getting Started

CSE 303 Homework 4

Due: Friday 2/1/08 Sunday, 2/3/08 11:59PM
Turnin: online - turnin instruction

FAQ

HW4 Q&A Wiki Page

Overview

This homework is a set of short exercises involving input, arrays, strings, and more multi-file builds. A few of them are inter-related. Each is reasonably small. I expect you'll encounter quite a few problems, though, of the "I don't understand why it won't compile" form. Start early. Work in piece by piece, in smallish doses. Don't plan on a superhuman effort at the last minute (or last eight hours).
Please read through the entire assignment promptly, so that we can resolve any questions about what to do early.
All programs in this homework read stdin and write stdout (and possibly, when appropriate, stderr). There is no need to enable command line specification of arguments, or other goodies that you might want if we were actually going to use these programs.

For this assignment, you should check for errors when the checking is just a line or so of code, plus maybe some code to print an error message. As we'll see, some kinds of errors are very difficult to prevent in C, and we won't be trying to do that here.

The files for each part should be put in separate directories, with names following the style of 'hw4A'. Each part must build with the command "gcc *.c" and then run with "./a.out".

Your program should not crash, no matter what the input is, WITH THE SOLE EXCEPTION of crashes that are due to individual input lines or words that are larger than expected. (We'd try not to crash for those errors as well, but it takes more and more complicated C code than we want at this point.)
When you detect an error condition that prevents your program from producing correct results, it should return (as a process) a non-zero status code. (Distinct non-zero values for distinct error conditions would be nice.)

A. Simple Input

You'd think it would be easy to read in a sequence of integers and print out the max from among them. It is, except that you have to do it the C way. This part is about the simplest possible exercise that involves reading input, and leads to the next parts.
The full specification is: read a stream of integers until you detect end-of-file. (End-of-file can be indicated for keyboard input by typing Ctrl+D as the first character in a line.) When you detect end-of-file, print the maximum of those you've read. The very first token printed must be the maximum -- if you want to print anything more, you can, just print it after that.
You should detect when the input stream contains something that scanf doesn't think is an integer. In that case, print an error message and call exit() with a non-zero status (for help,' man 3 exit').
Uses: scanf, the '&' operator

B. Simple Input Into Array Elements

Declare an array of 50 floats. Read a stream of floats and put them into successive array elements. When end-of-file is detected, print all the negative values, one value per line in any order, and then all the positive values, one value per line in any order. For instance, if the input is 1.2 -3.5 2 2 -7, the output could be
-7
-3.5
2
1.2
2
(It will probably be some other ordering, though.)
If the stream contains something scanf doesn't think is a float, print an error and call exit(). If the stream contains 51 or more floats, print an error and call exit(). (C does not detect an out of bounds array index. It's behavior when you use one can be anything from "hey, it works" to a seg fault, and can vary from run to run.)
Uses: scanf, a one-dimensional array, the '&' operator, #define to create symbolic constants (i.e., don't have '50' sprinkled throughout your code)

C. Reading Entire Lines Into An Array

Write a program that reads a stream of lines and, on end-of-file, prints them in reverse order. To do this declare an array of 100 "strings," where each string can be up to 128 characters long. (In C this means you'll end up creating a 2 dimensional array of characters.) Use symbolic constants.
Follow the lead of the rules from part B for dealing with input of 101 or more lines.
To read a line, use fgets, which allows you to specify a maximum number of characters to read. (If the input line is longer than that, fgets() will read only the initial portion the first time its called, and will read the rest (up to and including the '\n') the second. That is not considered an error for this assignment, so you don't have to detect it or otherwise worry about it.)
Example: if the input is
Time goes by, time comes along.
All is old and all is new;
What is right and what is wrong,
You must think and ask of you;
Have no hope and have no fear,
Waves that rise can never hold,
If they urge or if they cheer,
You remain aloof and cold.
your program should print
You remain aloof and cold.
If they urge or if they cheer,
Waves that rise can never hold,
Have no hope and have no fear,
You must think and ask of you;
What is right and what is wrong,
All is old and all is new;
Time goes by, time comes along.
To quote Andrei (who suggested this problem), "The excerpt is from a poem in the most Draconian poetry form ever - the glossa. The last stanza must repeat the first stanza in reverse form."
Uses: two-dimensional array of characters, other notions about C strings, fgets, the '&' operator

D. Word Input With Scanf

Alert: From here on all our programs will be broken.  We're going
to use scanf() to read 'words'.  scanf() doesn't know how big the
array it's writing into is.  If a word in the input stream is longer
than the array provided to scanf, bad things happen.  It is
non-trivial to fix this, and we're not going to worry about it for
now.  We will assume that through incredible luck no word is ever
longer than 127 characters (which requires an array of size 128 in C,
because of the '\0' that is always used to terminate a string).

Read each word from a text input stream, "clean it up," and print it, one word per line.
Here 'word' means whatever scanf('%s', xxx) decides one is. (It's going to decide that it's a consecutive string of non-whitespace characters. Whether or not that is a word in English, or any other language, is irrelevant.) For the input text in the example of the last section, the first three words are 'Time', 'goes', and 'by,'. Note the comma in the 'by,'.
Once you've read a word, you should clean it. By that I mean doing two things: (1) eliminate all characters that for which isalpha() returns false; and (2) turn all upper case letters into lower case. So, if scanf gave us 'Time' we'd print 'time'. If it gave us 'by,' we'd print 'by'. If it gave us 'Arts+Sciences' we'd print 'artssciences'.
Unlike the previous three parts, there is some code being distributed for this part. In the hw4D directory you'll find the file cleanWord.h. You should implement the one procedure it declares, putting your code in file cleanWord.c. That procedure takes a string as an argument and modifies it in the way described above. (This can, and should, be done without requiring any temporary string variables in the cleanWord() procedure - you can modify the string in place by operating on it as an array of characters.)
You must also write a main.c that reads words from standard input, calls the cleanWord procedure, and prints the cleaned words, one per line.
Uses: 1-d array of char, scanf, isalpha, the fact that all C strings are '\0' terminated, separate compilation
May or may not use: the '&' operator

E. Another Word Counting Program

Read successively each word from an input stream of text. Clean each word. Look in an array you keep of previously seen and cleaned word to determine whether or not you've already seen this cleaned word. If you have, increment a counter. If not, add the word to the array (and cause its counter to have value 1). When you reach end of file, print the count of each cleaned word and the cleaned word itself, one per line, in any order. For instance, the output might look like this
    301   'then'
   1785   'to'
   3068   'and'
   1568   'of'
etc.
You'll find the file wordHisto.h in directory hw4E. It defines an interface that you should implement. There is also the very beginnings of the file you should put that implementation in, wordHisto.c. I'm supplying a few of the most confusing lines (for Java programmers) so that you don't spend a lot of time trying to figure out how to get started.
You should write a mainline that uses the wordHisto interface, plus your cleanWord implementation, to build the complete program. (So, you'll end up with five files in this directory, after copying the releveant cleanWord files.)
There are sample data files, taken from Project Gutenberg, in the directory datafiles.
Uses: an array of WordEntry structs, your solution (minus the mainline) to hw4D, more complicated multi-file build, plus all the other things above

F. Sorted Output

HW3E but the output is printed in descending order by count.
There is nothing to implement for this. I'm going to give you the solution:
cat myFile | ./hw4E | sort -r -n

Uses: the work you did building hw4D, plus the work someone else did building sort

G. Going Faster

The hw4F program runs pretty slowly. I measure somewhere in the 3 to 6 second range for (and using) this command
time cat ../datafiles/74.txt | a.out >/dev/null
I get times of 1 to 3 minutes for this:
time cat ../datafiles/*.txt | a.out >/dev/null
(The variation in times depends on which machine I'm using for the testing.)
One thing that is slow is that looking to see if a newly cleaned word is already in the array requires that you look through the array element by element. If the new word is near the end, that can take a while. If a frequently occuring word is near the end, long lookups happen a lot.
CSE 326 will tell you more about data structures, and many more sophisticated solutions than the one we're going to use here. Our solution is the following: each time you look for a word in the array and find it, move it to the 0th position in the array, shifting everything else down one (from the current element at index 0 up through the elemnt that came just before the one found). For instance, if I'm looking for 'aardvark' and find it at element 273 in the array, I shift element 272 to 273, element 271 to 272, ..., element 0 to element 1, and then move the aardvark entry into element 0. This scheme tries to keep the most frequently encountererd words near the head of the array, resulting in short searches in the most common cases.
Uses: nothing much all that new here; the time command can be used to measure running times

H. Still Faster, Maybe

You might guess that a lot of the time to execute hw4G is going into shifting array elements. We try to make that shifting a little faster here. The key idea is to use "an indirection array" - an array of integers that hold indices into the array of WordEntry's. For instance, suppose there are 5 elements in the WordEntry array. If no shifting has taken place, the indirection array would have values 0, 1, 2, 3, 4 in successive elements. Now suppose you want to shift the WordEntry from index 4 in its array to index 0. Instead of copying all those big WordEntry structures, you instead shift the integers in the index array, resulting in (in order) 4, 0, 1, 2, 3. To look up what is logically the 0th WordEntry element, you first fetch the 0th index array element, then use its value as an index into the WordEntry array.
As stated earlier, the goal here is to try to speed things up by reducing the time spent copying elements (because an int is a lot smaller than a WordEntry, so requires less machine time to copy). My experience says that sometimes this approach speeds things up and sometimes it slows things down. It seems to depend on the details of exactly which processor you're running on, and by that I don't mean Intel vs. AMD, or even Pentium vs. X2, I mean exactly which model (as in "Intel Core Duo T2400 Yonah 1.833GHz Socket M 31W Dual-Core Processor Model BX80539T2400"). All this is to say that you may, or may not, find that this version runs faster than the previous one, so don't worry too much if it doesn't. (Do verify that your program is doing what it is supposed to do, though!)
Uses: an indirection array (that's the entire point of this, to use an indirection array)

I. OPTIONAL 1

If you have time (perhaps because you're already an experienced C programmer) try to make the Part H program run as fast as you can, while still maintaining one or more statically allocated arrays as the only data structures. (No dynamic memory allocation.) For a benchmark, we'll use execution times on attu.

J. OPTIONAL 2

(This is indpendent of Optional 1.) Turn your cleanWord code into a library. This means that instead of copying cleanWord.h and cleanWord.c into the folders for parts E-G and recompiling the source over and over, you instead compile it once and then create libcleanWord.a. In parts E-G you compile against cleanWord.h (which is still needed) and link against libcleanWord.a.
Uses: gcc -c, ar and maybe ranlib, the LIBRARY_PATH environment variable (also man gcc and then '/LIBRARY_PATH', plus much more web information)

Starter Files

attu:/cse/courses/cse303/08wi/hw4V1.0Dist.tar.gz

Computer Science & Engineering
University of Washington
Box 352350
Seattle, WA 98195-2350
(206) 543-1695 voice, (206) 543-2969 FAX
[comments to zahorjan at cs.washington.edu]