---------------------------------------------------
From: Matthai Philipose <matthai@franklin.cs.washington.edu>
Date: Mon, 13 Jan 1997 17:22:43 PST
Subject: SRCRPC 552-reading summary

The Firefly RPC paper discusses a set of measurements that characterize the performance of the SRC RPC, and gives a timing breakdown of the common-case critical path of the RPC. Getting a detailed performance breakdown of the parts of a reasonably complex system is difficult but useful, whatever the system being built, and I haven't read many papers that try to do a thorough job of it. I therefore found the measurement methodology more interesting than the fairly conventional techniques used to make the SRC RPC fast (by "collapsing layers of abstraction", special-casing the OS for the RPC, sacrificing security in the multi-user case, hand-coding in assembly, etc.).

As far as methodology is concerned, the paper addresses at least two thorny issues: (1) what to measure, and (2) how to measure it. The "what" part is a question of identifying the common case. So retransmission, assembling multi-packet messages, acks, binding to a server, and waiting for threads to service requests are not considered part of the RPC fast path that needs to be optimized and examined. There's not much justification given for ignoring these aspects, except to note that they "intrude very little on the fast path", which I interpret as "they happen rarely". S & B identify an "if-conditions-are-favorable" scenario, call it the "send+receive" operation, assert it is the common case, and break down its overhead.

The "how" part is interesting too, because they use different tricks to measure different parts of the system. Many of the software parts of the system seem to be timed by "counting up" the number of instructions of different types in the path, figuring out how long each type of instruction takes, and taking the weighted sum to get the total running time of that part of the code. On a modern (wide) processor, this would be akin to assuming that the machine sticks to the schedule (based on average-case latencies) produced by the static compiler. I'm not sure how good an assumption this is in general (instruction latencies may vary because of cache misses, VM interactions, branch misprediction, interrupts, etc.), but maybe the kernel code in the RPC is especially well behaved and experiences no interrupts. On the whole, though, I think this could be a dicey way of estimating software running time on a modern processor. On the other hand, if this is a reasonable technique, they could have used it for measuring UDP overhead too (and gotten yet another sanity check). It's cool that they stuck an oscilloscope in to get Ethernet overheads. The correspondence between the total measured overhead and the sum of the parts is impressive.

I wonder if their estimate of the "common case" was right, and how the RPC performed on various classes of workloads (some of which presumably violate the common-case workload).
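The "counting up" technique Matthai describes amounts to a weighted sum over instruction classes. A minimal sketch in C of that arithmetic, with invented instruction classes and timings (the paper worked from actual MicroVAX II instruction times):

    /* Sketch of the "counting up" estimate: sum count * fixed latency
     * over instruction classes on the fast path.  The classes and the
     * numbers below are invented for illustration only. */
    #include <stdio.h>

    struct iclass { const char *name; long count; double usec_each; };

    int main(void) {
        struct iclass fast_path[] = {
            { "register-register", 120, 0.4 },  /* hypothetical figures */
            { "memory load/store",  45, 0.8 },
            { "branch",             20, 0.6 },
        };
        double total = 0.0;
        for (int i = 0; i < 3; i++)
            total += fast_path[i].count * fast_path[i].usec_each;
        printf("estimated fast-path time: %.1f usec\n", total);
        return 0;
    }

As Matthai notes, the estimate is only as good as the assumption that every instruction runs at its nominal latency, with no cache misses or interrupts.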
---------------------------------------------------
From: owa@linen
Date: Mon, 13 Jan 1997 19:42:01 -0800 (PST)
Subject: SRCRPC 552-reading summary

The paper reports measurements of Firefly RPC performance to show that Firefly RPC is fast enough to serve as the primary communication paradigm for applications and the system. The hardware measured here consists of 16 MB of RAM, 5 MicroVAX II CPUs, and a 10 Mbit/sec Ethernet connected to only one CPU via a DEQNA device controller with 16 Mbit/sec bandwidth. The Firefly RPC is implemented with the standard stub procedures and is heavily optimized, especially on the fast path. It takes advantage of several design features that lower latency, as well as the power of assembly language.

They measure the base latency and the server-to-caller (and caller-to-server) throughput of the RPC, then compare these with figures calculated by adding up the time of each instruction executed and of each hardware latency encountered. Since the difference between the two is within about 5%, the result gives a good picture of where time is consumed. With these measurements, they point out further possibilities for improving performance and also estimate their effects. The paper concludes that the performance of Firefly RPC is good enough and currently appears to be limited by the network controller hardware.

-- owa

---------------------------------------------------
From: Sujay Parekh
Date: Mon, 13 Jan 1997 22:46:05 -0800
Subject: SRCRPC 552-reading summary

The basic motivation of this paper is to speed up RPC so that it can become a realistic means of standard inter-address-space communication. In order to do so, they were, first, very clever in structuring the RPC code to use less copying. Second, they reduce procedure calls without overly destroying modularity. Third, by being very careful about memory reuse they avoid unnecessary time spent in memory operations. The final notable optimization was to overlap computation and communication where possible.

The basic result of the paper is that by carefully measuring and analyzing the time spent in the various components of RPC, it is possible to reduce the cost to an acceptable level. By rewriting the most time-intensive software routines in assembly, it is possible to further improve performance. Also, they were able to identify areas where the hardware is the bottleneck and how changing that could improve RPC response. While they admit that some of their optimizations do compromise the structural elegance of the RPC mechanism, the potential benefits outweigh this loss.
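The memory-reuse point above is classically implemented as a free list of packet buffers, so the common case is a pointer swap rather than a general allocation. A minimal single-threaded sketch; all names are invented and this is not the Firefly's actual code:

    /* Packet-buffer recycling via a free list.  Single-threaded
     * illustration only; a real implementation would need locking. */
    #include <stdlib.h>

    struct pbuf { struct pbuf *next; char data[1500]; };
    static struct pbuf *free_list = NULL;

    struct pbuf *pbuf_get(void) {
        if (free_list) {                     /* fast path: recycle */
            struct pbuf *p = free_list;
            free_list = p->next;
            return p;
        }
        return malloc(sizeof(struct pbuf));  /* slow path: allocate */
    }

    void pbuf_put(struct pbuf *p) {          /* return buffer for reuse */
        p->next = free_list;
        free_list = p;
    }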
---------------------------------------------------
From: ddion@june (David Dion)
Date: Mon, 13 Jan 1997 23:41:02 -0800 (PST)
Subject: SRCRPC 552-reading summary

RPC is established as the primary means for inter-address-space communication on the Firefly, whether across machine boundaries or not. The goal of this work is to make RPC fast enough that programmers will use it rather than develop custom communication mechanisms. Improving RPC performance involved breaking down exactly where time was spent, removing unnecessary work, and translating some steps into hand-coded machine language. Additional performance proposals are included, although they have not been implemented.

The Firefly RPC model consists of multiple software layers on both the client and server. By streamlining individual layers, avoiding data copies between layers, allowing faster context switches between layers, and eliminating layers altogether, overall performance can be improved. For instance, the Ethernet interrupt service routine layer was streamlined by translating it to hand-generated assembly code; its execution time was cut by 75%. The largest software cost is waking up user-level threads from the transport layer, and a proposal to busy-wait in these threads could significantly reduce this context-switch cost. Clever buffer management allows client and server stubs direct access to transport-layer packets, eliminating extra copies. Lastly, Ethernet interrupt service routines are able to decode RPC packet headers and wake up user-level threads, skipping the datalink layer. These strategies help make Firefly RPC competitive in throughput and latency with RPC mechanisms in other operating systems. Further improvements may require hardware, such as more advanced network controller cards.
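Dion's last two observations fit together: an interrupt routine that can decode RPC headers can hand the packet to the waiting thread and wake it in one step, with no intermediate datalink thread. A sketch of the idea using POSIX threads (anachronistic for the Firefly; every name and structure here is invented, and initialization is assumed done elsewhere):

    /* One wakeup instead of two: the receive path signals the blocked
     * caller directly rather than scheduling a datalink thread. */
    #include <pthread.h>

    struct pending_call {
        void *reply_pkt;
        pthread_mutex_t m;
        pthread_cond_t c;
        int done;
    };

    /* called from the (simulated) receive-interrupt path */
    void rpc_interrupt(struct pending_call *pc, void *pkt) {
        pthread_mutex_lock(&pc->m);
        pc->reply_pkt = pkt;          /* hand packet straight to caller */
        pc->done = 1;
        pthread_cond_signal(&pc->c);  /* the single wakeup */
        pthread_mutex_unlock(&pc->m);
    }

    void *rpc_wait(struct pending_call *pc) {
        pthread_mutex_lock(&pc->m);
        while (!pc->done)
            pthread_cond_wait(&pc->c, &pc->m);
        pthread_mutex_unlock(&pc->m);
        return pc->reply_pkt;
    }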
---------------------------------------------------
From: tian@wally
Date: Tue, 14 Jan 1997 00:20:00 -0800 (PST)
Subject: SRCRPC 552-reading summary

In attempting to prove Firefly's RPC services adequate for a fast distributed system, Schroeder and Burrows have constructed an interesting case study in the methodology and analysis of measuring a complex system. The paper also describes the optimizations they employ, which involve fast packet recycling, language optimizations, a form of coscheduling/smart interrupt handlers to avoid extra context switches, and globally mapped I/O buffers. A number of these have compromised the clean design considerably.

They begin by taking end-to-end measurements of RPCs that test throughput and latency in the "fast" case. They outline the "fast" pathway that has been measured, obtain numbers for each step, add up the results, and compare them against the latency numbers originally obtained. In optimizing for the common case they assume that few retransmissions are necessary, that server threads are available, and that multipacket calls are rare. They also ignore binding costs. However, some of their other assumptions seem questionable. The most obvious is that to measure small code sequences, they measure execution times for individual instructions and then sum them up. How would I/D cache effects and page faults, especially under a mixed workload, affect these numbers (and how would the authors counter them)? That their computed times are within 5% of the measured latency suggests that they did an excellent job of estimating the costs of the component operations, and allows them to isolate the expensive operations. Finally, I am curious why they did not consider a more realistic workload (probably to sell Firefly RPC): for instance, garbage collection/VM/cache interactions, or more traffic on the internal bus or network controller. It would also be interesting to see how well their methodology maps to newer distributed systems on modern processors.

---------------------------------------------------
From: rgrimm@cs.washington.edu (Robert Grimm)
Date: Tue, 14 Jan 1997 00:39:28 -0800
Subject: SRCRPC 552-reading summary

Schroeder and Burrows provide a comprehensive and detailed study of the performance of RPC on the Firefly shared-memory multiprocessor system. Using a null RPC and two calls that take a 1440-byte argument and result respectively (the largest argument that fits into a single packet), Schroeder and Burrows first provide the base latency and the maximum throughput relative to the number of caller threads. They then give a detailed account of the individual steps in the Firefly RPC and the latency of each. Based on these measurements and instruction timing characteristics, they provide estimates of how various improvements would influence RPC latency. Finally, Schroeder and Burrows measure the impact of the number of processors on RPC performance.

The fast path of the Firefly RPC has been carefully optimized (for example, by allowing the interrupt driver to directly schedule waiting threads, thus avoiding the scheduling of an intermediate datalink thread) while still using standard networking protocols. The paper thus presents a good argument that good RPC performance can be achieved without special-purpose protocols that sacrifice the portability, flexibility, and reliability of standard protocols such as UDP and IP. However, Schroeder's and Burrows' results are optimistic for modern timesharing systems, since one of their optimizations is to map all buffers into all application address spaces (at the same location). This optimization presents an obvious security leak in the presence of untrusted applications.

A rather curious result for the impact of the number of processors on RPC performance is the jump in null RPC latency when moving from two server processors to one, compared with the slow increase seen when incrementally reducing the number of processors from five to two. While it is unclear to what degree other factors (such as the structure of the operating system) influence this result, it seems to indicate that the null server thread does not scale well over processors (i.e., two processors see most of the benefits of multiprocessing). This result is surprising given that a null server thread has no shared state or communication with other server threads and can thus be perfectly partitioned.
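The optimization Grimm questions, packet buffers mapped into every address space so that stubs and transport share them without copying, can be made concrete with POSIX shared memory. This is only an analogy under modern APIs, not the Firefly's mechanism, and the name /rpc_pool is invented:

    /* Every process that maps this pool sees every packet in it, which
     * is exactly the security problem noted above.  Compile with -lrt
     * on older systems. */
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define POOL_BYTES (64 * 1024)

    char *map_shared_pool(void) {
        int fd = shm_open("/rpc_pool", O_CREAT | O_RDWR, 0666);
        if (fd < 0) return 0;
        if (ftruncate(fd, POOL_BYTES) < 0) { close(fd); return 0; }
        /* MAP_FIXED at an agreed address would mimic "same location in
           all address spaces"; omitted because picking one is unsafe. */
        char *p = mmap(0, POOL_BYTES, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? 0 : p;
    }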
---------------------------------------------------
From: govindk@shasta.ee.washington.edu (Govindarajan K)
Date: Tue, 14 Jan 1997 08:31:03 -0800
Subject: SRCRPC 552-reading summary

The paper analyzes the performance of the Firefly RPC system. The main aim was to show that RPC provides a viable means of communication in distributed systems. The paper divides into the following parts: the first describes the criteria the authors use for measuring performance, the next justifies their measurements, and the conclusion presents ideas for further improving the RPC mechanism.

The authors begin by measuring the end-to-end latency and throughput of Firefly RPC using the Null() and MaxResult() routines respectively, followed by measurements of marshalling time. They show later that their computed estimates were within 5% of the measured latency. In the analysis portion of the paper the authors describe the various steps involved in the RPC, covering both the caller and the server side in great detail. Here they show that waking up the RPC thread is the major contributor to the overall latency of the procedure.

The authors also explain in detail the various measures they have taken to optimize RPC performance, including rewriting some portions of the Modula-2+ code in assembly language, directly awakening a suitable thread from the Ethernet interrupt routine, and good packet buffer management. The authors compare latency and throughput with varying numbers of processors and show that there is a major jump in latency when moving from 3 to 2 processors on the caller side. The MaxResult thread, on the other hand, seems to scale well over processors. The authors did their performance analysis on a dedicated Ethernet; it would be interesting to see how performance is affected when external traffic is present.
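A base-latency number like the ones Govindarajan describes is typically obtained by timing many back-to-back null calls and dividing. A self-contained sketch, with null_rpc() as an empty stand-in for the real client stub:

    #include <stdio.h>
    #include <sys/time.h>

    /* stand-in for the real stub; timing this empty version only
       measures loop overhead */
    static void null_rpc(void) { }

    static double measure_null_usec(int n) {
        struct timeval t0, t1;
        gettimeofday(&t0, 0);
        for (int i = 0; i < n; i++)
            null_rpc();
        gettimeofday(&t1, 0);
        return ((t1.tv_sec - t0.tv_sec) * 1e6
                + (t1.tv_usec - t0.tv_usec)) / n;  /* mean per call */
    }

    int main(void) {
        printf("mean latency: %.2f usec\n", measure_null_usec(10000));
        return 0;
    }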
---------------------------------------------------
From: yasushi@silk
Date: Tue, 14 Jan 1997 08:31:49 -0800 (PST)
Subject: SRCRPC 552-reading summary

This paper presents a detailed analysis of the performance of the Firefly RPC system. The authors measured the speed of a null RPC and of an RPC with an argument that uses up one Ethernet packet. They also provide a detailed breakdown of where the time is spent. Finally, they speculate about improvements from faster hardware and fancier software techniques.

The contribution of the paper is that it precisely measured the hardware overhead of network data transmission, and showed that software overhead was the dominant cost in an RPC transaction, especially when the argument size is small. The methodology the paper used for the measurements seems a bit out of date, however. For example, it simply adds up instruction execution cycles to calculate some of the software overhead, but this is meaningless on today's pipelined processors. Also, the UDP checksum overhead calculation cannot take into account the cache misses it induces.
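The UDP checksum yasushi mentions is the 16-bit one's-complement sum of RFC 1071; a straightforward C version follows. Each 16-bit word costs an add, which is why its cost scales per byte, and the walk over the data is also what drags the packet through the cache:

    #include <stdint.h>
    #include <stddef.h>

    uint16_t inet_checksum(const void *buf, size_t len) {
        const uint16_t *p = buf;
        uint32_t sum = 0;
        while (len > 1) { sum += *p++; len -= 2; }
        if (len)                        /* odd trailing byte */
            sum += *(const uint8_t *)p;
        while (sum >> 16)               /* fold carries back in */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }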
---------------------------------------------------
From: echris@merganser
Date: Tue, 14 Jan 1997 08:55:05 -0800 (PST)
Subject: SRCRPC 552-reading summary

This paper is an organized brain dump by two people who tried with great vigor (and success) to understand in magnificent detail the performance of Firefly RPC. The thesis sentence reads, "Programming a fast RPC is not for the squeamish." (p. 8) Indeed, nearly every microsecond of the fast path is accounted for, and speculation on the effects of certain hardware and software improvements is given.

Naturally, this paper raises many questions. For example, how much would I need to pay (on the Firefly) in order NOT to share packet buffers between all user address spaces? How much would I need to pay on a modern OS? Assembly language code is faster than well-tuned Modula-2+, but how does it compare against more "conventional" systems programming languages? Do I REALLY need to write the fast path in assembly? Is this still a relevant picture of an RPC system, or have modern OSes relegated it to the historical archives? And so on...

---------------------------------------------------
From: Sung-Eun Choi
Date: Tue, 14 Jan 1997 08:59:23 -0800 (PST)
Subject: SRCRPC 552-reading summary

This paper details the performance of an RPC system built for the Firefly, a cache-coherent, shared-memory multiprocessor of MicroVAX processors. Specifically, the authors report measurements taken for "inter-machine" RPC for synthetic procedures run from one or more threads in a single address space on the client machine. Finally, they comment on the projected improvement to RPC performance given improvements in other aspects of their system.

A primary goal in designing this RPC system was to provide a single communication paradigm to be used by all programs, i.e., discouraging users from rolling their own interprocessor communication. To this end, they present performance results for the optimized common inter-machine case. Each step of a remote procedure call is broken into distinct parts, and the time to execute each of those parts is measured using two dedicated(?) machines on an uncongested network (10 Mbit/sec Ethernet). From numbers cited in the paper (and what I think I understand about the system), it appears that about 1057 microseconds (plus data marshalling) of any remote procedure call is performed by the client. The rest of the latency is "hideable", for in it no work is performed by the client; i.e., if there is enough other work in the program, the thread will experience little more than this amount for the remote procedure call. For a call to a remote procedure that does nothing and sends no data, this is about 40% of the total latency, a decent upper bound. If little other work is available, the entire latency (which seems to increase significantly with larger packet sizes) is experienced. Regardless, without context, it is difficult to decide whether their goals were met.

---------------------------------------------------
From: Michael Ernst
Date: Tue, 14 Jan 1997 10:36:35 -0800
Subject: SRCRPC 552-reading summary

Performance of Firefly RPC

Main question: What contributes to RPC latency? (Tells us what to concentrate on improving, lets us make predictions about the effect of changes.)

Firefly is a 5-way multiprocessor: single memory, coherent cache, 10 Mbit/sec Ethernet, one processor connected to the network. Standard RPC scheme with local stubs (actually, just one stub that also takes a function identifier as an argument) for remote procedures.

The local stub does (sketched in C after this message):
1. allocate packet buffer
2. marshal arguments (most marshalling is trivial -- no functions to call)
3. call the transport mechanism; this involves transmittal over both the QBus and the Ethernet, plus registering oneself to receive a reply and blocking to wait for the response
4. unmarshal results
5. deallocate packet buffer

The server stub reuses the incoming packet buffer, so it need not do steps 1 or 5.

Optimizations include:
* awaken the thread directly from the Ethernet interrupt, so only one thread wakeup instead of two
* on-the-fly packet buffer recycling
* packet buffers in shared memory, so no extra copying (but insecure, and the data still has to be copied into the network processor's cache)

General performance observations:
* Send and receive dominate for small packets (not unexpectedly).
* More threads => better performance due to network saturation (which hides latency).
* Throughput was limited by network controller hardware (likewise for other contemporary systems).
* In stubs and RPC runtime, 20% of the cost is procedure call overhead.
* Some code was good for uni- but not multi-processors, but not vice versa. (No details given about this.)
* Predictions matched (best) measurements within 5%; we can trust the numbers.

Predicted performance improvements (how much of the current time each eliminates):
* Overlap Ethernet and QBus interface: 11-28%
* 10x network speedup, to 100 Mbit/sec: 4-18%
* 3x processor speedup: 36-52%
* No UDP checksum: 7-16%
* Redesign RPC packet header: 3-8%
* No IP/UDP, use a lower level of abstraction: 1-4%
* Busy wait, so no wakeup (but no concurrency): 7-17%
* More assembly code: 4-10% (assembly gave a 5x across-the-board improvement over Modula-2+)

It is hard to compare RPC performance numbers across systems:
* different hardware
* does the RPC run in kernel or user space?
* generality of header information -- what networks is it compatible with?
* hand-coded versus automatically generated stubs
* checksums vs. no checksums
* implementation language

What about real loads? The tests were only on:
* no arguments
* very big IN argument
* very big OUT argument

Page 1, line -7: why are multiple threads per address space necessary? (The processes mentioned are in *different* address spaces.)
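Ernst's five stub steps compress into a small sketch. Everything below (the wire layout, pkt_alloc(), the in-place "transport") is invented to show the shape of the path, not the Firefly's real code; the server's reuse of the caller's buffer is mimicked by transport_call() writing the result into the same packet:

    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>

    /* invented wire format: one packet carries the call and, on
       return, the result in place */
    struct pkt { uint32_t proc_id; uint32_t arg; uint32_t result; };

    static struct pkt *pkt_alloc(void) { return malloc(sizeof(struct pkt)); }
    static void pkt_free(struct pkt *p) { free(p); }

    /* stand-in for step 3: a real transport would send over the QBus
       and Ethernet, register for the reply, and block; here the
       "server" runs in place */
    static void transport_call(struct pkt *p) {
        if (p->proc_id == 17)          /* dispatch on the function id */
            p->result = p->arg + 1;
    }

    uint32_t remote_add_one(uint32_t x) {
        struct pkt *p = pkt_alloc();   /* 1. allocate packet buffer */
        p->proc_id = 17;               /* one generic stub + function id */
        p->arg = x;                    /* 2. marshal (trivial for an int) */
        transport_call(p);             /* 3. transport: send, block, reply */
        uint32_t r = p->result;        /* 4. unmarshal result */
        pkt_free(p);                   /* 5. deallocate packet buffer */
        return r;
    }

    int main(void) {
        printf("%u\n", remote_add_one(41));   /* prints 42 */
        return 0;
    }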
---------------------------------------------------
From: rgrimm@lutefisk (Robert Grimm)
Date: Thu, 16 Jan 1997 08:27:31 -0800
Subject: SRCRPC 552-reading summary

This is a postscript to my observation about the jump in null RPC latency when moving from two server processors to one. One possible explanation could be that the authors observed a load of 1.2 in the original experiments using five processors; obviously, such a load can be better distributed over two processors than over one. However, this can only be a partial explanation, given that the client load in the original experiments was also 1.2 while only one processor is used on the client side.