Homework 0
Out: Tuesday January 5
Due: Tuesday, January 12 (midnight)
Turnin: Online
Optional plot tools now available
Wiki page for hw0 set up: It's here.
Javadoc for hw0 classes: It's here.
(You can also make yourself a local copy with 'make jdoc'.)
Assignment Overview
There are three parts to this assignment:
- Answer some questions about how the code we're
giving you works.
- Complete the implementation of the code. (My solution
to this part is 19 lines, including lines that are only braces.)
When done, your code will be able to communicate across a network.
- Use the code to take some measurements of goodput (data bits
delivered per second) as a function of the total amount of data sent
and evaluate a hypothesis about TCP behavior.
In many ways, this assignment is preparation for what follows,
rather than an end in itself. You'll create and use the client side
of a TCP connection, which is a good thing to have experienced, and
you'll likely come away understanding how to set up and use both sides.
You'll have seen some code that is structured like a lot of networking
code, and in particular the next assignment. And you'll have some
experience with both the characteristics of the live networks
you'll measure and with how difficult it is to measure them.
A Word of Preamble
You've got a week. A week has limited hours. This assignment is
open-ended. What's expected is what is appropriate for a week's work
for this one course, no more. I think that should mean working code,
correct answers to the questions part, and then as much of the measurement
part as you have time for (which I'd expect would be more than none).
Application Overview / One Part of the Assignment
The application consists of three things. First, there is a
regular old user program, like you're used to. This is
called the Client - it's the client of the pieces
we're actually interested in, as you'll see.
The essence of what the Client does is this
(written in a language that is not claimed to be real):
DataSource src;
DataSink sink;
while ( true ) {
    int len = src.read(buf, maxBytes);
    if ( len < 0 ) break;
    sink.write(buf, len);
}
The Client is intended to allow us to measure how fast
the sources and sinks operate.
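In real Java, that loop might look like the following sketch, with ordinary streams standing in for the assignment's DataSource/DataSink classes (the stream types here are illustrative, not the assignment's actual interfaces):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// A concrete Java rendering of the pseudocode loop above, with
// ordinary streams standing in for the assignment's DataSource/DataSink.
public class CopyLoop {
    static long copy(InputStream src, OutputStream sink) throws IOException {
        byte[] buf = new byte[4096];
        long total = 0;
        while (true) {
            int len = src.read(buf, 0, buf.length);
            if (len < 0) break;          // source signals end of data
            sink.write(buf, 0, len);
            total += len;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Copy 10,000 bytes from an in-memory source to an in-memory sink.
        System.out.println(copy(new ByteArrayInputStream(new byte[10_000]), out));
    }
}
```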
There are three kinds of sources and sinks:
- Null
As a source, just says "okay" to read() calls, without
having actually provided any data. As a sink, just ignores
the data and returns "okay." These are intended to allow
measuring the overhead of executing the code, to help
isolate just how fast the "real" sinks and sources operate
from how slow Java might be.
- File
As a source, reads a file. As a sink, writes a file.
These exist due to the high ratio of how interesting it might
be to try to evaluate file system/disk performance to the
trivial effort required to create the code that implements them.
(Plus, while we're focused on networks in this course, in any
networking application it could well be that the bottleneck
is the disk at one end or the other of the connection, and not
the network. So, it might prove useful to you in the future
to have some sense of what filesystem
throughput rates might be.)
- Net
As a source, reads from a network connection, and to be more
precise, a TCP connection. As a sink, writes to a network connection.
This code can be run in what will feel like two different ways.
If you don't want to use TCP in a run, it looks like this:
The source can be a Null or a File, and, independently,
the Sink can be a Null or a File.
This is the execution of a single Java program, and can be run on any single
machine with a sufficiently recent release of Java.
The second way is to run two instances of the above, with the
first having a Network sink and the second having a Network
source:
These are two distinct running instances of the program.
The upper one is called "the client"
and the bottom one is called "the server" (confusingly enough, eh?).
The "network" in that picture could be the internet, in which case the
two instances could run on any two machines capable of talking
with each other. Alternatively, you can run both
instances on a single machine. They'll run fine, with no
code modification required, but won't
actually be putting any bytes out on any network when they
run (they'll just be moving bytes through the operating system
from one process to another).
You're going to eventually want to run on two distinct machines,
although it might be handy to debug while running on only one.
It turns out that attu.cs.washington.edu provides
all the separate machines you'll need, because that name is really
a pseudonym for four distinct machines, named attu1
through attu4. You can log in to a specific one
just by giving its specific name when you connect.
Everything but the NetSink (which is the component that knows how to
establish a client-side TCP connection and write to it) is already
fully implemented.
One part of the assignment is to complete
the implementation of the NetSink class.
I've presented that first, but you probably want to do it
after the piece described next.
Note: All instantiatable sink classes are derived from the
DataSink abstract class. There are two already fully implemented,
NullSink and FileSink. The latter, especially, looks a whole
lot like what is required to implement NetSink.
If you're a confident programmer, and especially if you have
some socket programming experience and/or have heard of Google,
you can probably complete this part in short order, even
doing it first. But, you'll still have to do the next part.
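For orientation only, here is a hypothetical sketch of the client-side connect/write path a NetSink needs, using java.net.Socket. The class and method names below are illustrative; the real NetSink must derive from the assignment's DataSink abstract class, which this sketch does not attempt to reproduce:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

// Hypothetical sketch: the real NetSink derives from the assignment's
// DataSink abstract class; the names and structure here are illustrative only.
public class NetSinkSketch {
    private Socket socket;
    private OutputStream out;

    // Establish the client side of a TCP connection.
    public void connect(String host, int port) throws IOException {
        socket = new Socket(host, port);  // connects to a listening server
        out = socket.getOutputStream();
    }

    // Write the first len bytes of buf to the connection.
    public void write(byte[] buf, int len) throws IOException {
        out.write(buf, 0, len);
    }

    // Flush and close once the data is exhausted.
    public void close() throws IOException {
        out.flush();
        socket.close();
    }
}
```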
The Implementation / Another Assignment Part
I'm going to start out talking about threading. I understand
that most of you haven't had any exposure to this, and that it isn't
a pre-requisite. You don't have to write any threaded code
in this (or the next) assignment. You also don't even really have to
know anything about threads, or that this implementation uses them,
to write your code. The overall application has been carefully written
so that the parts you touch look, feel, and behave like the
standard (single-threaded) Java programs you've seen since 142.
On the other hand, it's a lot easier to see the truth of "you don't need to know
about the rest of the code" when you understand the rest of the code
than when you don't, and most everyone wants to know a lot about the
context their code is running in before they really feel ready to start.
So, here's the explanation.
First off, the programs you're accustomed to writing are called "single threaded,"
meaning they have one thread. That thread is what you're thinking of when
you think about where the program is executing right now -- it's some one statement,
wherever it is. The single thread executes one statement after another,
but it's always in some one place. In a multi-threaded program, each of the
threads executes one statement after another, following the usual rules; it's
just that each one can be at a different place at any one time. That is, the
execution of the program as a whole is in as many places as there are threads.
That's half of everything you need to know for CSE461: the program is executing in more than one
place.
Second, what makes threads different from distinct running programs (processes) is
that they share some of the variables. That is, if one thread executes a
statement like:
foo = 0;
and another
System.out.println(foo);
they may, or may not, be referring to the same foo.
The only rule we need about that is this: local variables (those created inside procedures)
are private to individual threads -
what one thread does to a local variable named foo has nothing to do with what
other threads do to foo.
This should seem very, very familiar. It's the same rule that governs what the name foo
means for recursive calls of a procedure, say.
If foo is not local, then the threads share foo, and what one does affects
what another will see. This "not local" case includes instance variables of objects,
and that's the only shared case used in this and the next assignment (and I believe,
without double checking, in any assignment in this course).
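A tiny illustrative example (not from the assignment code) of that rule: each thread's locals are private, while an instance variable is shared by both threads:

```java
// Illustrative sketch: an instance variable is shared between threads,
// while each thread's local variables are private to it.
public class SharingDemo {
    int shared = 0;   // instance variable: both threads see the same one

    void bump() {
        int local = 0;            // local: each thread gets its own copy
        for (int i = 0; i < 1000; i++) {
            local++;
        }
        synchronized (this) {     // guard the update to the shared variable
            shared += local;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        SharingDemo demo = new SharingDemo();
        Thread t1 = new Thread(demo::bump);
        Thread t2 = new Thread(demo::bump);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(demo.shared);  // both threads' work is visible: 2000
    }
}
```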
All that said, let's revisit the client application picture from above, this time showing threads.
Execution begins in the usual way: there is a single thread, which begins at main() .
That thread creates a DataSource object and prepares it for use. As part of that process, the DataSource
code creates a new thread. In Java, the usual way to create a thread is to create an object of a class
derived from Thread, and then to call the start() method of that new object.
That causes a new thread of execution to begin running, starting at method run() of the
new object. That's right -- you call start() and what happens is some new thread executes
run() . The original thread then executes its next statement, just as though start()
were any old method you're used to.
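A minimal sketch of that pattern (illustrative names, not the assignment's classes):

```java
// Minimal sketch of the pattern described above: derive a class from
// Thread, override run(), and call start().
class Worker extends Thread {
    int result = 0;

    @Override
    public void run() {
        // A brand-new thread begins executing here once start() is called.
        result = 42;
    }
}

public class StartDemo {
    public static void main(String[] args) throws InterruptedException {
        Worker w = new Worker();
        w.start();   // returns immediately; run() executes in the new thread
        // The original thread continues on its own; here it just waits.
        w.join();
        System.out.println(w.result);
    }
}
```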
So, the original thread creates a DataSource and prepares it, resulting in a DataSource object with its own
thread. Then the original thread creates and prepares a DataSink object, which results in a new thread
in that object as well.
At this point the picture looks like this:
Here I've left off the names of the components either to protect the innocent or because
I'm running out of space.
Now what are these two extra threads doing? They're busy communicating with their source
or sink devices. Let's take the case that both the source and sink communicate with files.
In that case the source thread is busy reading the input file as fast as it can, and the sink thread
is busy writing the output file as fast as it can. The original thread, executing the Client object
code, is glue - it carries data from the source object to the sink object.
To make this work, we need some buffering. Both the source and sink have FIFO queue objects inside them.
The source object's thread is reading data from the file and stuffing it into that queue (which we call
a buffer). The main thread is sitting in a loop taking data out of the source's buffer
and putting it in the sink's buffer. The sink's thread is in a loop taking data out of its buffer and
writing it to the output file.
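The handoff just described is the classic producer/consumer arrangement. Here is a sketch using a standard bounded blocking queue standing in for the assignment's own FIFO class (which this does not claim to match):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the source-thread / main-thread handoff described above,
// with a standard bounded FIFO in place of the assignment's queue class.
public class BufferDemo {
    static int runDemo() throws InterruptedException {
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(4);

        // "Source" thread: stuffs data into its buffer as fast as it can.
        Thread source = new Thread(() -> {
            try {
                for (int i = 0; i < 10; i++) {
                    buffer.put(i);   // blocks while the buffer is full
                }
                buffer.put(-1);      // end-of-data marker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        source.start();

        // Consumer loop, playing the role of the Client's glue thread.
        int sum = 0;
        while (true) {
            int v = buffer.take();   // blocks while the buffer is empty
            if (v < 0) break;
            sum += v;
        }
        source.join();
        return sum;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runDemo());
    }
}
```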
Okay, why do it this way? We could get everything done with just one thread - the main thread would
invoke read() on the source, the code of which would read from the source file. Then
the main thread would invoke write() on the sink, the code of which would write the file.
One reason to use threads is that we hope we can overlap some execution, and so get better performance.
If we used only a single thread, then, for instance, while that thread was reading there couldn't possibly
be any writing going on. With multiple threads it's possible to be both reading and writing at the same time.
There are other reasons as well. Using multiple threads turns out to be a very manageable way to build some
kinds of software, like some networking software, in the sense that the code is much cleaner and simpler
than any single-threaded implementation could be. We'll see the why's of that later in the quarter.
That's everything, except for some details.
To terminate the entire execution, all threads have to terminate. They terminate by
returning from the procedure they started in. For the original (main) thread, this means returning from
main(), as always. For the two other threads, it means returning from their run()
methods.
A design (and implementation) issue is how all the threads know when it's time to terminate.
Exactly how this is implemented is one of the questions below. At the highest level, though,
they terminate when they know they have nothing more left to do.
The Doubly Schizophrenic Nature of the Code
There are two things to know about the code that might help you be
comfortable with it. Both have a double-intention flavor to them.
First, the code is intended both to serve as a vehicle for you to write a little
networking code yourself, and for you to use to obtain measurements.
For the former, we want something simple. In particular, the way to think about
what the code is trying to do is that it allows you to start a server instance,
then start a client instance that sends data to the server, and when that's done
they both terminate.
That is, the first use is one data transfer per execution.
For measurement, you want to run multiple trials, so you can average the results.
It turns out that running multiple transfers involves avoiding some not so obvious
issues having to do with asynchrony (both sides are running at once) and the
details of how ports are managed by the operating system.
Because I wanted you to be able to run many trials with a single invocation
(rather than having to launch the program 30 times to get 30 samples, say),
I wrote the code so that it works when used that way.
On the other hand, the code required for that has nothing to do with most
of the questions, and so you should try to ignore it.
For the most part, you'll recognize that code easily because either it's
looping on a variable called something like "trial" or because the comments
are talking about "port hopping." We'll see why port hopping is required later
in the course. There's only a bit of code dealing with it, and what that code
is doing is changing the server's port number on each trial. If you think of the
code as running only one trial (as it in fact does, out of the box), none of the
port-hopping code has any effect, and it can be ignored.
The final artifact of multiple trials is that source/sink object creation and preparation
is broken up into two pieces - some in the constructor, and some in a separate method -
rather than all just being in the constructor. This lets the multiple-trial scenario
reinitialize objects, and so reuse them rather than having to create new ones each time.
The second dually-personalitied aspect of the code has to do with how the
sources and sinks are written. Although this is one big piece of code,
the source and sink classes are really something that would be written
to be reusable by many applications. One implication of that is that their
implementations shouldn't depend on exactly how their only user in this assignment
(the Client class) operates. They should do reasonable things for all the usage
scenarios one can imagine. Another is that you shouldn't think of one person
having written all this code. Instead, different people would have written
each of the sinks and sources, and the Client. That means that the ultimate user,
the one writing the Client code, won't know anything about the internal implementation
of the source/sink classes. And that means they're going to make mistakes using
them.
It's much, much nicer to detect mistakes
and give informative messages than to simply crash if the client code has a bug.
So, the upshot is that I've written some tests for reasonableness of the arguments
being handed to the sources/sinks, and thrown exceptions (say) when things don't look
right, even though the things being checked can never go wrong in this particular application.
(There was a point at which
there were many more of these checks, but I deleted a bunch in an attempt to make finding
the interesting code easier. So, the current implementation is neither free of these checks
nor comes close to doing as many as it should. Still, I think this state is much better than
no checks at all, because none of us needs more practice looking at code that is as far as
possible from what one should have written.)
The Questions
Okay, here's the list of questions to answer:
- How, specifically, does termination work? For each of the three threads, what happens that causes it to decide to
terminate? What causes those things to happen? Your answer should be comprehensive and coherent enough that the reader understands
how termination is done without having read the details of the code.
- Imagine we run one copy of the Client, with a file source and a file sink. How many times is each byte of the source
file copied by Java code before it gets to the output file? That is, from the time a byte of the source file makes its way
into the Java code, how many times is it copied before we're done with it?
- Taking the DataSink class as a specific example, why is it copying the bytes handed to it via write() calls? That is,
why does it
arraycopy the byte[] argument buf
rather than just doing something in the nature of byte[] myBuf = buf?
(The answer isn't "because the rest of the code wouldn't work.")
- If you run a Client with a network source, it's operating as a server. A server creates a socket
and then waits around for a client to connect to it. Is that waiting time included in the elapsed time measurement
for the server or not? Explain briefly.
Measurements / Final Assignment Part
In this section we want to understand what the throughput of a connection might be.
Let's define "goodput" as (bytes transferred)/(time taken to transfer). Let's use
"link bandwidth" to mean the hardware bit transmission rate. For example, the
wired infrastructure in our building all has 1Gbps link bandwidth.
Our goal is to get an idea how TCP goodput varies as a function of the amount of
data transferred, as well as how close to the link bandwidth it might come.
Why might transfer size affect goodput? A reasonable assumption about almost anything
is that there is a fixed cost plus a cost proportional to the size of the operation.
In this case, the fixed cost might be things like
the execution time required to communicate the destination address to the network
interface card. It doesn't matter how many bytes we're trying to send, the cost of
that is the same. The proportional cost is the additional time each byte requires.
Putting that together, we might guess that the time required to transfer is
something like
transfer time = FixedCost + (# bytes)/(link bandwidth)
and so
goodput = (# bytes) / (FixedCost + (# bytes)/(link bandwidth))
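Plugging illustrative numbers into that model (the fixed cost and bandwidth below are made up, not measured) shows the expected shape: small transfers are dominated by the fixed cost, while large ones approach the link bandwidth:

```java
// Evaluate the goodput model with illustrative (made-up) parameters:
//   goodput = bytes / (FixedCost + bytes/linkBandwidth)
public class GoodputModel {
    static double goodput(double bytes, double fixedCostSec, double linkBytesPerSec) {
        return bytes / (fixedCostSec + bytes / linkBytesPerSec);
    }

    public static void main(String[] args) {
        double bw = 125_000_000.0;  // 1 Gbps expressed as bytes/second
        double fixed = 0.001;       // assume a 1 ms fixed cost per transfer
        // A small transfer: goodput is dominated by the fixed cost...
        System.out.printf("%.0f%n", goodput(1_000, fixed, bw));
        // ...while a large transfer approaches the link bandwidth.
        System.out.printf("%.0f%n", goodput(100_000_000, fixed, bw));
    }
}
```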
Measure actual transfers across a network and try to come to some conclusion about whether
that goodput expression seems to be a reasonable model of what you experience or not. Explain how you
came to this conclusion. Does some simple modification of that expression seem to be adequate, if this one
itself isn't? Make sure you briefly explain just what measurements you took, including the measurement
environment.
Note: it is impossible to get this question wrong, so long as you do a responsible job of trying to answer it.
The real point is to look at and think about what you see. The second point is to have some firsthand knowledge
of what performance is actually obtained. We'll talk later in the quarter about theoretical models of TCP performance.
(But even those would be hard to validate in the situation in which you're almost certainly going to be taking
measurements.)
The Source and Sundry Tools
The source is a gzip'ed tar file: hw0-dist.tar.gz.
On Linux, tar xzf hw0-dist.tar.gz will create a directory, hw0-dist, with the code.
The build tools are Linux-y. You can run make in the main directory to build the source.
(Note that it won't build until you supply at least a tiny bit of NetSink code.)
Building that way, all class files end up in directory classes, keeping their clutter out of the way.
The invocation lines given in a bit assume this, but are easily modified if you build in some other way.
The code is broken into Java packages. Each package lives in a distinct subdirectory.
The directory dataFiles is a place where source files to transfer can live, and output files can be created.
make in that directory will create files with sizes 0 and then powers of 10 from 0 to 6 or 7 or thereabouts.
If you transfer one of these files across your NetSink object and back into that directory, you can run diff
to verify that the input and output files have the same contents. (And the files are generated in a way that makes it
extremely unlikely their contents will be the same unless all code is working correctly.)
To run a client/server execution, you must start the server before the client.
To run an execution that performs multiple trials and prints average statistics on the results, you need to do two things:
- Flip the commented/uncommented lines in two places in Client.main():
//for ( trial=0; trialsWatch.elapsedTime() < trialTimeLimit || trial%2==1 || amServer; trial++ ) {
for ( trial=0; trial<1; trial++ ) {
//if (amServer) {
if (false) {
- Use the NetSink I wrote, rather than the one you wrote. To do this,
you essentially need to use NetSink.class-multitrial as your NetSink.class
file.
One way is to say
make multitrial in the main directory.
(This works even before you've got your NetSink implementation working,
since it replaces it. You have to make/unmake the edits noted above,
though, to move between the multitrial version and your single-trial version.)
The multitrial version runs as many trials as it can in (about) 10 seconds. You can change just what it does if you'd like,
though.
In the multitrial version, the server never terminates. You have to kill (ctrl-C) it. (There are, of course, reasons for this.)
The multitrial version may generate a very high rate of connection requests.
If you're operating over a non-CSE network (e.g., from home to an attu), it's
possible that some network you're traversing will view this as suspicious,
and may block your requests after a short while (and for
a short while, most likely). You'll see some successes (probably), and then
connection exception messages.
There is a bash script, driver, that tries to automate running many kinds of trials using many
different transfer sizes. There are too many things that can go wrong for me to recommend its use, but
if you love bash scripts, it might be for you.
Invoking the Application
Here's an example invocation:
java -cp classes netapp.Client file dataFiles/file1000 file dataFiles/outFile1000
The "-cp classes" is telling java where the class files are.
After that come the source specifier and then the sink specifier.
Possible source specifiers are:
- null <N>
Use the Null source (no data actually produced), and have it indicate end-of-data after N bytes have been produced.
Example: null 1000
- file <filename>
Read the file whose name is given.
Example: file dataFiles/file0
- net <port>
Port numbers are integers. The system on which you're running restricts which ones you can request. On the attu's, the
range seems to be 32768 to 61000. Windows is rumored to have a slightly wider range. If you try one that isn't allowed, you'll
see some kind of exception. Try another one.
You should also know that only one process on the system can use a port number at a time. So, if you're unlucky enough
to ask for one that is already in use, you'll see some kind of exception message. Try another one.
(If everyone tries to use the one in the example next, collisions are likely.)
Example: net 34567
The sink specifiers are nearly the same, so I'll be brief.
If you run the server on one of the attu's, you should be able to connect to it from any machine with internet access, e.g.,
a home machine. The reverse may not be true: if you run the server on your home machine and try to connect from attu, that could
fail. Unless you already know that you can connect to your home machine from campus, you probably can't.
Here's an example running the server on attu3 and the client on my home machine.
First, on attu3 I say:
java -cp classes netapp.Client net 34567 null
Then on my home machine:
java -cp classes netapp.Client null 100000 net attu3.cs.washington.edu 34567