|
|
|
|
CSE 303 Homework 4
Due: Friday 2/1/08 Sunday, 2/3/08 11:59PM
Turnin: online - turnin instruction
FAQ
HW4 Q&A Wiki Page
Overview
This homework is a set of short exercises involving input, arrays, strings,
and more multi-file builds.
A few of them are inter-related.
Each is reasonably small. I expect you'll encounter quite a few
problems, though, of the "I don't understand why it won't compile"
form. Start early. Work in piece by piece, in smallish doses.
Don't plan on a superhuman effort at the last minute (or last eight hours).
Please read through the entire assignment promptly, so that we can
resolve any questions about what to do early.
All programs in this homework read stdin and write stdout (and
possibly, when appropriate, stderr). There is no need to
enable command line specification of arguments, or other goodies
that you might want if we were actually going to use these
programs.
For this assignment, you should check for errors when the checking
is just a line or so of code, plus maybe some code to print an
error message. As we'll see, some kinds of errors are very
difficult to prevent in C, and we won't be trying to do that here.
The files for each part should be put in separate directories,
with names following the style of 'hw4A'. Each part must build
with the command "gcc *.c" and then run with "./a.out".
Your program should not crash, no matter what the input is,
WITH THE SOLE EXCEPTION of crashes that are due to individual
input lines or words that are larger than expected. (We'd
try not to crash for those errors as well, but it takes more
and more complicated C code than we want at this point.)
When you detect an error condition that prevents your program
from producing correct results, it should return (as a process)
a non-zero status code. (Distinct non-zero values for distinct
error conditions would be nice.)
A. Simple Input
You'd think it would be easy to read in a sequence of
integers and print out the max from among them.
It is, except that you
have to do it the C way. This part is about the simplest
possible exercise that involves reading input, and leads
to the next parts.
The full specification is: read a stream of integers
until you detect end-of-file. (End-of-file can be
indicated for keyboard input by typing Ctrl+D as the first
character in a line.) When you detect end-of-file,
print the maximum of those you've read. The very first
token printed must be the maximum -- if you want to print
anything more, you can, just print it after that.
You should detect when the input stream contains something that
scanf doesn't think is an integer. In that case, print an
error message and call exit()
with a non-zero status (for help,' man 3 exit').
Uses: scanf , the '&' operator
B. Simple Input Into Array Elements
Declare an array of 50 floats. Read a stream of floats
and put them into successive array elements. When end-of-file
is detected, print all the negative values, one value per line
in any order,
and then all the positive values, one value per line in any order.
For instance, if the input is 1.2 -3.5 2 2 -7, the output could be
-7
-3.5
2
1.2
2
(It will probably be some other ordering, though.)
If the stream contains something scanf doesn't think is a float,
print an error and call exit() . If the stream
contains 51 or more floats, print an error and
call exit() . (C does not detect an out of bounds
array index. It's behavior when you use one can be anything
from "hey, it works" to a seg fault, and can vary from run to run.)
Uses: scanf , a one-dimensional array, the '&' operator,
#define to create symbolic constants (i.e., don't have
'50' sprinkled throughout your code)
C. Reading Entire Lines Into An Array
Write a program that reads a stream of lines and,
on end-of-file, prints them in reverse order. To do
this declare an array of 100 "strings," where each string
can be up to 128 characters long. (In C this means you'll
end up creating a 2 dimensional array of characters.)
Use symbolic constants.
Follow the lead of the rules from part B for dealing with
input of 101 or more lines.
To read a line, use fgets , which allows you
to specify a maximum number of characters to read. (If the
input line is longer than that, fgets() will read only
the initial portion the first time its called, and will read
the rest (up to and including the '\n') the second.
That is not considered an error for this assignment, so you don't have to
detect it or otherwise worry about it.)
Example: if the input is
Time goes by, time comes along.
All is old and all is new;
What is right and what is wrong,
You must think and ask of you;
Have no hope and have no fear,
Waves that rise can never hold,
If they urge or if they cheer,
You remain aloof and cold.
your program should print
You remain aloof and cold.
If they urge or if they cheer,
Waves that rise can never hold,
Have no hope and have no fear,
You must think and ask of you;
What is right and what is wrong,
All is old and all is new;
Time goes by, time comes along.
To quote Andrei (who suggested this problem),
"The excerpt is from a poem in the most Draconian poetry form ever - the
glossa. The last stanza must repeat the first stanza in reverse form."
Uses: two-dimensional array of characters, other notions about C strings,
fgets , the '&' operator
D. Word Input With Scanf
Alert: From here on all our programs will be broken. We're going
to use scanf() to read 'words'. scanf() doesn't know how big the
array it's writing into is. If a word in the input stream is longer
than the array provided to scanf, bad things happen. It is
non-trivial to fix this, and we're not going to worry about it for
now. We will assume that through incredible luck no word is ever
longer than 127 characters (which requires an array of size 128 in C,
because of the '\0' that is always used to terminate a string).
Read each word from a text input stream,
"clean it up," and print it, one word per line.
Here 'word' means whatever scanf('%s', xxx) decides one is.
(It's going to decide that it's a consecutive string
of non-whitespace characters. Whether or not that is a word
in English, or any other language, is irrelevant.)
For the input text in
the example of the last section, the first three
words are 'Time', 'goes', and 'by,'. Note the comma
in the 'by,'.
Once you've read a word, you should clean it. By that
I mean doing two things: (1) eliminate all characters that
for which isalpha() returns false; and (2)
turn all upper case letters into lower case.
So, if scanf gave us 'Time' we'd print 'time'.
If it gave us 'by,' we'd print 'by'. If it gave us
'Arts+Sciences' we'd print 'artssciences'.
Unlike the previous three parts, there is some code
being distributed for this part. In the hw4D directory
you'll find the file cleanWord.h. You should implement
the one procedure it declares, putting your code in
file cleanWord.c. That procedure takes a string as
an argument and modifies it in the way described above.
(This can, and should, be done without requiring any
temporary string variables in the cleanWord() procedure -
you can modify the string in place by operating on it as
an array of characters.)
You must also write a main.c that reads words from standard input,
calls the cleanWord procedure, and prints the cleaned words, one per line.
Uses: 1-d array of char, scanf, isalpha, the fact that all
C strings are '\0' terminated, separate compilation
May or may not use: the '&' operator
E. Another Word Counting Program
Read successively each word from an input stream
of text. Clean each word. Look in an array you
keep of previously seen and cleaned word to
determine whether or not you've already seen this
cleaned word. If you have, increment a counter.
If not, add the word to the array (and cause its
counter to have value 1).
When you reach end of file, print the count of each
cleaned word and the cleaned word itself, one per line,
in any order.
For instance, the output might look like this
301 'then'
1785 'to'
3068 'and'
1568 'of'
etc.
You'll find the file wordHisto.h in directory hw4E.
It defines an interface that you should implement. There is also
the very beginnings of the file you should put that implementation in,
wordHisto.c. I'm supplying a few of the most confusing lines
(for Java programmers) so that you don't spend a lot of time trying
to figure out how to get started.
You should write a mainline that uses
the wordHisto interface, plus your cleanWord implementation, to
build the complete program. (So, you'll end up with five
files in this directory, after copying the releveant cleanWord files.)
There are sample data files, taken from
Project Gutenberg,
in the directory datafiles .
Uses: an array of WordEntry structs, your solution (minus
the mainline) to hw4D, more complicated
multi-file build, plus all the other things above
F. Sorted Output
HW3E but the output is printed in descending order by count.
There is nothing to implement for this. I'm going to
give you the solution:
cat myFile | ./hw4E | sort -r -n
Uses: the work you did building hw4D, plus the work someone
else did building sort
G. Going Faster
The hw4F program runs pretty slowly. I measure somewhere in the 3 to 6
second range for (and using) this command
time cat ../datafiles/74.txt | a.out >/dev/null
I get times of 1 to 3 minutes for this:
time cat ../datafiles/*.txt | a.out >/dev/null
(The variation in times depends on which machine I'm using for the testing.)
One thing that is slow is that looking to see if a newly cleaned word
is already in the array requires that you look through the array element by element.
If the new word is near the end, that can take a while. If a frequently
occuring word is near the end, long lookups happen a lot.
CSE 326 will tell you more about data structures, and many more
sophisticated solutions than the one we're going to use here. Our
solution is the following: each time you look for a word in the array
and find it, move it to the 0th position in the array, shifting
everything else down one (from the current element at index 0 up
through the elemnt that came just before the one found). For
instance, if I'm looking for 'aardvark' and find it at element 273 in
the array, I shift element 272 to 273, element 271 to 272, ...,
element 0 to element 1, and then move the aardvark entry into element
0.
This scheme tries to keep the most frequently encountererd words
near the head of the array, resulting in short searches in the
most common cases.
Uses: nothing much all that new here; the time command
can be used to measure running times
H. Still Faster, Maybe
You might guess that a lot of the time to execute hw4G is going into
shifting array elements. We try to make that shifting a little
faster here. The key idea is to use "an indirection array" - an array
of integers that hold
indices into the array of WordEntry's. For instance, suppose there
are 5 elements in the WordEntry array. If no shifting has taken place,
the indirection array would have
values 0, 1, 2, 3, 4 in successive elements. Now suppose you want
to shift the WordEntry from index 4 in its array to index 0. Instead
of copying all those big WordEntry structures, you instead shift the
integers in the index array, resulting in (in order) 4, 0, 1, 2, 3.
To look up what is logically the 0th WordEntry element, you first fetch
the 0th index array element, then use its value as an index into the
WordEntry array.
As stated earlier, the goal here is to try to speed things up by
reducing the time spent copying elements (because an int is a lot smaller
than a WordEntry, so requires less machine time to copy).
My experience says that sometimes this approach speeds things up and sometimes
it slows things down. It seems to depend on the details of exactly which
processor
you're running on, and by that I don't mean Intel vs. AMD, or even Pentium vs.
X2, I mean exactly which model (as in "Intel Core Duo T2400 Yonah 1.833GHz Socket M 31W Dual-Core Processor Model BX80539T2400"). All this is to say that
you may, or may not, find that this version runs faster than the previous one,
so don't worry too much if it doesn't. (Do verify that your program is doing
what it is supposed to do, though!)
Uses: an indirection array (that's the entire point of this, to use an indirection array)
I. OPTIONAL 1
If you have time (perhaps because you're already an experienced C programmer)
try to make the Part H program run as fast as you can, while still maintaining
one or more statically allocated arrays as the only data structures. (No
dynamic memory allocation.) For a
benchmark, we'll use execution times on attu .
J. OPTIONAL 2
(This is indpendent of Optional 1.) Turn your cleanWord code into a library.
This means that instead of copying cleanWord.h and cleanWord.c into
the folders for parts E-G and recompiling the source over and over,
you instead compile it once and then create libcleanWord.a.
In parts E-G you compile against cleanWord.h (which is still needed) and
link against libcleanWord.a.
Uses: gcc -c , ar and maybe ranlib ,
the LIBRARY_PATH environment variable (also man gcc and then '/LIBRARY_PATH', plus much more web information)
Starter Files
attu:/cse/courses/cse303/08wi/hw4V1.0Dist.tar.gz
|