---------------------------------------------------
From: Matthai Philipose <matthai@franklin.cs.washington.edu>
Date: Mon, 13 Jan 1997 17:22:43 PST
Subject: SRCRPC 552-reading summary

The Firefly RPC paper discusses a set of measurements that characterize the performance of the SRC RPC, and gives a timing breakdown of the common-case critical path of the RPC. Getting a detailed performance breakdown of the parts of a reasonably complex system is difficult but useful, whatever the system being built, and I haven't read many papers that try to do a thorough job of it. I therefore found the measurement methodology more interesting than the fairly conventional techniques used to make the SRC RPC fast (by "collapsing layers of abstraction", special-casing the OS for the RPC, sacrificing security in the multi-user case, hand-coding in assembly, etc.).

As far as methodology is concerned, the paper addresses at least two thorny issues: (1) what to measure, and (2) how to measure it. The "what" part is a question of identifying the common case. So retransmission, assembling multi-packet messages, acks, binding to a server, and waiting for threads to service requests are not considered part of the RPC fast path that needs to be optimized and examined. There's not much justification given for ignoring these aspects, except to note that they "intrude very little on the fast path", which I interpret as "they happen rarely". S & B identify an "if-conditions-are-favorable" scenario, call it the "send+receive" operation, assert it is the common case, and break down its overhead.

The "how" part is interesting too, because they use different tricks to measure different parts of the system. Many of the software parts of the system seem to be timed by "counting up" the number of instructions of different types in the path, figuring out how long each type of instruction takes, and taking the weighted sum to get the total running time of that part of the code. On a modern (wide) processor, this would be akin to assuming that the machine sticks to the schedule (based on average-case latencies) produced by the static compiler. I'm not sure how good an assumption this is in general (instruction latencies may vary because of cache misses, VM interactions, branch misprediction, interrupts, etc.), but maybe the kernel code in the RPC is especially well behaved and experiences no interrupts. On the whole, though, I think this could be a dicey way of estimating software running time on a modern processor. On the other hand, if this is a reasonable technique, they could have used it for measuring UDP overhead too (and gotten yet another sanity check). It's cool that they stuck an oscilloscope in to get Ethernet overheads. The correspondence between the total measured overhead and the sum of the parts is impressive.

I wonder if their estimate of the "common case" was right, and how the RPC performed on various classes of workloads (some of which presumably violate the common-case workload).
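The "counting up" technique Matthai describes amounts to a weighted sum over instruction classes. A minimal sketch in C of that arithmetic, with invented instruction classes and timings (the paper worked from actual MicroVAX II instruction times):

    /* Sketch of the "counting up" estimate: sum count * fixed latency
     * over instruction classes on the fast path.  The classes and the
     * numbers below are invented for illustration only. */
    #include <stdio.h>

    struct iclass { const char *name; long count; double usec_each; };

    int main(void) {
        struct iclass fast_path[] = {
            { "register-register", 120, 0.4 },  /* hypothetical figures */
            { "memory load/store",  45, 0.8 },
            { "branch",             20, 0.6 },
        };
        double total = 0.0;
        for (int i = 0; i < 3; i++)
            total += fast_path[i].count * fast_path[i].usec_each;
        printf("estimated fast-path time: %.1f usec\n", total);
        return 0;
    }

As Matthai notes, the estimate is only as good as the assumption that every instruction runs at its nominal latency, with no cache misses or interrupts.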
---------------------------------------------------
From: owa@linen
Date: Mon, 13 Jan 1997 19:42:01 -0800 (PST)
Subject: SRCRPC 552-reading summary

The paper reports measurements of Firefly RPC performance to show that Firefly RPC is fast enough to serve as the primary communication paradigm for applications and the system. The hardware measured here consists of 16 MB of RAM, 5 MicroVAX II CPUs, and a 10 Mbit/sec Ethernet connected to only one CPU via a DEQNA device controller with 16 Mbit/sec bandwidth. The Firefly RPC is implemented with the standard stub procedures and is heavily optimized, especially on the fast path. It takes advantage of several design features that lower latency, as well as the power of assembly language.

They measure the base latency and the server-to-caller (and caller-to-server) throughput of the RPC, then compare these with figures calculated by adding up the time of each instruction executed and of each hardware latency encountered. Since the difference between the two is within about 5%, the result gives a good picture of where time is consumed. With these measurements, they point out further possibilities for improving performance and also estimate their effects. The paper concludes that the performance of Firefly RPC is good enough and currently appears to be limited by the network controller hardware.

-- owa

---------------------------------------------------
From: Sujay Parekh
Date: Mon, 13 Jan 1997 22:46:05 -0800
Subject: SRCRPC 552-reading summary

The basic motivation of this paper is to speed up RPC so that it can become a realistic means of standard inter-address-space communication. In order to do so, they were, first, very clever in structuring the RPC code to use less copying. Second, they reduce procedure calls without overly destroying modularity. Third, by being very careful about memory reuse they avoid unnecessary time spent in memory operations. The final notable optimization was to overlap computation and communication where possible.

The basic result of the paper is that by carefully measuring and analyzing the time spent in the various components of RPC, it is possible to reduce the cost to an acceptable level. By rewriting the most time-intensive software routines in assembly, it is possible to further improve performance. Also, they were able to identify areas where the hardware is the bottleneck and how changing that could improve RPC response. While they admit that some of their optimizations do compromise the structural elegance of the RPC mechanism, the potential benefits outweigh this loss.
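The memory-reuse point above is classically implemented as a free list of packet buffers, so the common case is a pointer swap rather than a general allocation. A minimal single-threaded sketch; all names are invented and this is not the Firefly's actual code:

    /* Packet-buffer recycling via a free list.  Single-threaded
     * illustration only; a real implementation would need locking. */
    #include <stdlib.h>

    struct pbuf { struct pbuf *next; char data[1500]; };
    static struct pbuf *free_list = NULL;

    struct pbuf *pbuf_get(void) {
        if (free_list) {                     /* fast path: recycle */
            struct pbuf *p = free_list;
            free_list = p->next;
            return p;
        }
        return malloc(sizeof(struct pbuf));  /* slow path: allocate */
    }

    void pbuf_put(struct pbuf *p) {          /* return buffer for reuse */
        p->next = free_list;
        free_list = p;
    }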
---------------------------------------------------
From: ddion@june (David Dion)
Date: Mon, 13 Jan 1997 23:41:02 -0800 (PST)
Subject: SRCRPC 552-reading summary

RPC is established as the primary means for inter-address-space communication on the Firefly, whether across machine boundaries or not. The goal of this work is to make RPC fast enough that programmers will use it rather than develop custom communication mechanisms. Improving RPC performance involved breaking down exactly where time was spent, removing unnecessary work, and translating some steps into hand-coded machine language. Additional performance proposals are included, although they have not been implemented.

The Firefly RPC model consists of multiple software layers on both the client and server. By streamlining individual layers, avoiding data copies between layers, allowing faster context switches between layers, and eliminating layers altogether, overall performance can be improved. For instance, the Ethernet interrupt service routine layer was streamlined by translating it to hand-generated assembly code; its execution time was cut by 75%. The largest software cost is waking up user-level threads from the transport layer, and a proposal to busy-wait in these threads could significantly reduce this context-switch cost. Clever buffer management allows client and server stubs direct access to transport-layer packets, eliminating extra copies. Lastly, Ethernet interrupt service routines are able to decode RPC packet headers and wake up user-level threads, skipping the datalink layer. These strategies help make Firefly RPC competitive in throughput and latency with RPC mechanisms in other operating systems. Further improvements may require hardware, such as more advanced network controller cards.
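Dion's last two observations fit together: an interrupt routine that can decode RPC headers can hand the packet to the waiting thread and wake it in one step, with no intermediate datalink thread. A sketch of the idea using POSIX threads (anachronistic for the Firefly; every name and structure here is invented, and initialization is assumed done elsewhere):

    /* One wakeup instead of two: the receive path signals the blocked
     * caller directly rather than scheduling a datalink thread. */
    #include <pthread.h>

    struct pending_call {
        void *reply_pkt;
        pthread_mutex_t m;
        pthread_cond_t c;
        int done;
    };

    /* called from the (simulated) receive-interrupt path */
    void rpc_interrupt(struct pending_call *pc, void *pkt) {
        pthread_mutex_lock(&pc->m);
        pc->reply_pkt = pkt;          /* hand packet straight to caller */
        pc->done = 1;
        pthread_cond_signal(&pc->c);  /* the single wakeup */
        pthread_mutex_unlock(&pc->m);
    }

    void *rpc_wait(struct pending_call *pc) {
        pthread_mutex_lock(&pc->m);
        while (!pc->done)
            pthread_cond_wait(&pc->c, &pc->m);
        pthread_mutex_unlock(&pc->m);
        return pc->reply_pkt;
    }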
---------------------------------------------------
From: tian@wally
Date: Tue, 14 Jan 1997 00:20:00 -0800 (PST)
Subject: SRCRPC 552-reading summary

In attempting to prove Firefly's RPC services adequate for a fast distributed system, Schroeder and Burrows have constructed an interesting case study in the methodology and analysis of measuring a complex system. The paper also describes the optimizations they employ, which involve fast packet recycling, language optimizations, a form of coscheduling/smart interrupt handlers to avoid extra context switches, and globally mapped I/O buffers. A number of these have compromised the clean design considerably.

They begin by taking end-to-end measurements of RPCs that test throughput and latency in the "fast" case. They outline the "fast" pathway that has been measured, obtain numbers for each step, add up the results, and compare them against the latency numbers originally obtained. In optimizing for the common case they assume that few retransmissions are necessary, that server threads are available, and that multipacket calls are rare. They also ignore binding costs. However, some of their other assumptions seem questionable. The most obvious is that to measure small code sequences, they measure execution times for individual instructions and then sum them up. How would I/D cache effects and page faults, especially under a mixed workload, affect these numbers (and how would the authors counter them)? That their computed times are within 5% of the measured latency suggests that they did an excellent job of estimating the costs of the component operations, and allows them to isolate the expensive operations. Finally, I am curious why they did not consider a more realistic workload (probably to sell Firefly RPC): for instance, garbage collection/VM/cache interactions, or more traffic on the internal bus or network controller. It would also be interesting to see how well their methodology maps to newer distributed systems on modern processors.

---------------------------------------------------
From: rgrimm@cs.washington.edu (Robert Grimm)
Date: Tue, 14 Jan 1997 00:39:28 -0800
Subject: SRCRPC 552-reading summary

Schroeder and Burrows provide a comprehensive and detailed study of the performance of RPC on the Firefly shared-memory multiprocessor system. Using a null RPC and two calls that take a 1440-byte argument and result respectively (the largest argument that fits into a single packet), Schroeder and Burrows first provide the base latency and the maximum throughput relative to the number of caller threads. They then give a detailed account of the individual steps in the Firefly RPC and the latency of each. Based on these measurements and instruction timing characteristics, they provide estimates of how various improvements would influence RPC latency. Finally, Schroeder and Burrows measure the impact of the number of processors on RPC performance.

The fast path of the Firefly RPC has been carefully optimized (for example, by allowing the interrupt driver to directly schedule waiting threads, thus avoiding the scheduling of an intermediate datalink thread) while still using standard networking protocols. The paper thus presents a good argument that good RPC performance can be achieved without special-purpose protocols that sacrifice the portability, flexibility, and reliability of standard protocols such as UDP and IP. However, Schroeder's and Burrows' results are optimistic for modern timesharing systems, since one of their optimizations is to map all buffers into all application address spaces (at the same location). This optimization presents an obvious security leak in the presence of untrusted applications.

A rather curious result for the impact of the number of processors on RPC performance is the jump in null RPC latency when moving from two server processors to one, compared with the slow increase seen when incrementally reducing the number of processors from five to two. While it is unclear to what degree other factors (such as the structure of the operating system) influence this result, it seems to indicate that the null server thread does not scale well over processors (i.e., two processors see most of the benefits of multiprocessing). This result is surprising given that a null server thread has no shared state or communication with other server threads and can thus be perfectly partitioned.
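The optimization Grimm questions, packet buffers mapped into every address space so that stubs and transport share them without copying, can be made concrete with POSIX shared memory. This is only an analogy under modern APIs, not the Firefly's mechanism, and the name /rpc_pool is invented:

    /* Every process that maps this pool sees every packet in it, which
     * is exactly the security problem noted above.  Compile with -lrt
     * on older systems. */
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define POOL_BYTES (64 * 1024)

    char *map_shared_pool(void) {
        int fd = shm_open("/rpc_pool", O_CREAT | O_RDWR, 0666);
        if (fd < 0) return 0;
        if (ftruncate(fd, POOL_BYTES) < 0) { close(fd); return 0; }
        /* MAP_FIXED at an agreed address would mimic "same location in
           all address spaces"; omitted because picking one is unsafe. */
        char *p = mmap(0, POOL_BYTES, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);
        return p == MAP_FAILED ? 0 : p;
    }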
---------------------------------------------------
From: govindk@shasta.ee.washington.edu (Govindarajan K)
Date: Tue, 14 Jan 1997 08:31:03 -0800
Subject: SRCRPC 552-reading summary

The paper analyzes the performance of the Firefly RPC system. The main aim was to show that RPC provides a viable means of communication in distributed systems. The paper divides into the following parts: the first describes the criteria the authors use for measuring performance, the next justifies their measurements, and the conclusion presents ideas for further improving the RPC mechanism.

The authors begin by measuring the end-to-end latency and throughput of Firefly RPC using the Null() and MaxResult() routines respectively, followed by measurements of marshalling time. They show later that their computed estimates were within 5% of the measured latency. In the analysis portion of the paper the authors describe the various steps involved in the RPC, covering both the caller and the server side in great detail. Here they show that waking up the RPC thread is the major contributor to the overall latency of the procedure.

The authors also explain in detail the various measures they have taken to optimize RPC performance, including rewriting some portions of the Modula-2+ code in assembly language, directly awakening a suitable thread from the Ethernet interrupt routine, and good packet buffer management. The authors compare latency and throughput with varying numbers of processors and show that there is a major jump in latency when moving from 3 to 2 processors on the caller side. The MaxResult thread, on the other hand, seems to scale well over processors. The authors did their performance analysis on a dedicated Ethernet; it would be interesting to see how performance is affected when external traffic is present.
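A base-latency number like the ones Govindarajan describes is typically obtained by timing many back-to-back null calls and dividing. A self-contained sketch, with null_rpc() as an empty stand-in for the real client stub:

    #include <stdio.h>
    #include <sys/time.h>

    /* stand-in for the real stub; timing this empty version only
       measures loop overhead */
    static void null_rpc(void) { }

    static double measure_null_usec(int n) {
        struct timeval t0, t1;
        gettimeofday(&t0, 0);
        for (int i = 0; i < n; i++)
            null_rpc();
        gettimeofday(&t1, 0);
        return ((t1.tv_sec - t0.tv_sec) * 1e6
                + (t1.tv_usec - t0.tv_usec)) / n;  /* mean per call */
    }

    int main(void) {
        printf("mean latency: %.2f usec\n", measure_null_usec(10000));
        return 0;
    }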
---------------------------------------------------
From: yasushi@silk
Date: Tue, 14 Jan 1997 08:31:49 -0800 (PST)
Subject: SRCRPC 552-reading summary

This paper presents a detailed analysis of the performance of the Firefly RPC system. The authors measured the speed of a null RPC and of an RPC with an argument that uses up one Ethernet packet. They also provide a detailed breakdown of where the time is spent. Finally, they speculate about improvements from faster hardware and fancier software techniques.

The contribution of the paper is that it precisely measured the hardware overhead of network data transmission, and showed that software overhead was the dominant cost in an RPC transaction, especially when the argument size is small. The methodology the paper used for the measurements seems a bit out of date, however. For example, it simply adds up instruction execution cycles to calculate some of the software overhead, but this is meaningless on today's pipelined processors. Also, the UDP checksum overhead calculation cannot take into account the cache misses it induces.
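The UDP checksum yasushi mentions is the 16-bit one's-complement sum of RFC 1071; a straightforward C version follows. Each 16-bit word costs an add, which is why its cost scales per byte, and the walk over the data is also what drags the packet through the cache:

    #include <stdint.h>
    #include <stddef.h>

    uint16_t inet_checksum(const void *buf, size_t len) {
        const uint16_t *p = buf;
        uint32_t sum = 0;
        while (len > 1) { sum += *p++; len -= 2; }
        if (len)                        /* odd trailing byte */
            sum += *(const uint8_t *)p;
        while (sum >> 16)               /* fold carries back in */
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }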
---------------------------------------------------
From: echris@merganser
Date: Tue, 14 Jan 1997 08:55:05 -0800 (PST)
Subject: SRCRPC 552-reading summary

This paper is an organized brain dump by two people who tried with great vigor (and success) to understand in magnificent detail the performance of Firefly RPC. The thesis sentence reads, "Programming a fast RPC is not for the squeamish." (p. 8) Indeed, nearly every microsecond of the fast path is accounted for, and speculation on the effects of certain hardware and software improvements is given.

Naturally, this paper raises many questions. For example, how much would I need to pay (on the Firefly) in order NOT to share packet buffers between all user address spaces? How much would I need to pay on a modern OS? Assembly language code is faster than well-tuned Modula-2+, but how does it compare against more "conventional" systems programming languages? Do I REALLY need to write the fast path in assembly? Is this still a relevant picture of an RPC system, or have modern OSes relegated it to the historical archives? And so on...

---------------------------------------------------
From: Sung-Eun Choi
Date: Tue, 14 Jan 1997 08:59:23 -0800 (PST)
Subject: SRCRPC 552-reading summary

This paper details the performance of an RPC system built for the Firefly, a cache-coherent, shared-memory multiprocessor of MicroVAX processors. Specifically, the authors report measurements taken for "inter-machine" RPC for synthetic procedures run from one or more threads in a single address space on the client machine. Finally, they comment on the projected improvement to RPC performance given improvements in other aspects of their system.

A primary goal in designing this RPC system was to provide a single communication paradigm to be used by all programs, i.e., discouraging users from rolling their own interprocessor communication. To this end, they present performance results for the optimized common inter-machine case. Each step of a remote procedure call is broken into distinct parts, and the time to execute each of those parts is measured using two dedicated(?) machines on an uncongested network (10 Mbit/sec Ethernet). From numbers cited in the paper (and what I think I understand about the system), it appears that about 1057 microseconds (plus data marshalling) of any remote procedure call is performed by the client. The rest of the latency is "hideable", for in it no work is performed by the client; i.e., if there is enough other work in the program, the thread will experience little more than this amount for the remote procedure call. For a call to a remote procedure that does nothing and sends no data, this is about 40% of the total latency, a decent upper bound. If little other work is available, the entire latency (which seems to increase significantly with larger packet sizes) is experienced. Regardless, without context, it is difficult to decide whether their goals were met.

---------------------------------------------------
From: Michael Ernst
Date: Tue, 14 Jan 1997 10:36:35 -0800
Subject: SRCRPC 552-reading summary

Performance of Firefly RPC

Main question: What contributes to RPC latency? (Tells us what to concentrate on improving, lets us make predictions about the effect of changes.)

Firefly is a 5-way multiprocessor: single memory, coherent cache, 10 Mbit/sec Ethernet, one processor connected to the network. Standard RPC scheme with local stubs (actually, just one stub that also takes a function identifier as an argument) for remote procedures.

The local stub does (sketched in C after this message):
1. allocate packet buffer
2. marshal arguments (most marshalling is trivial -- no functions to call)
3. call the transport mechanism; this involves transmittal over both the QBus and the Ethernet, plus registering oneself to receive a reply and blocking to wait for the response
4. unmarshal results
5. deallocate packet buffer

The server stub reuses the incoming packet buffer, so it need not do steps 1 or 5.

Optimizations include:
* awaken the thread directly from the Ethernet interrupt, so only one thread wakeup instead of two
* on-the-fly packet buffer recycling
* packet buffers in shared memory, so no extra copying (but insecure, and the data still has to be copied into the network processor's cache)

General performance observations:
* Send and receive dominate for small packets (not unexpectedly).
* More threads => better performance due to network saturation (which hides latency).
* Throughput was limited by network controller hardware (likewise for other contemporary systems).
* In stubs and RPC runtime, 20% of the cost is procedure call overhead.
* Some code was good for uni- but not multi-processors, but not vice versa. (No details given about this.)
* Predictions matched (best) measurements within 5%; we can trust the numbers.

Predicted performance improvements (how much of the current time each eliminates):
* Overlap Ethernet and QBus interface: 11-28%
* 10x network speedup, to 100 Mbit/sec: 4-18%
* 3x processor speedup: 36-52%
* No UDP checksum: 7-16%
* Redesign RPC packet header: 3-8%
* No IP/UDP, use a lower level of abstraction: 1-4%
* Busy wait, so no wakeup (but no concurrency): 7-17%
* More assembly code: 4-10% (assembly gave a 5x across-the-board improvement over Modula-2+)

It is hard to compare RPC performance numbers across systems:
* different hardware
* does the RPC run in kernel or user space?
* generality of header information -- what networks is it compatible with?
* hand-coded versus automatically generated stubs
* checksums vs. no checksums
* implementation language

What about real loads? The tests were only on:
* no arguments
* very big IN argument
* very big OUT argument

Page 1, line -7: why are multiple threads per address space necessary? (The processes mentioned are in *different* address spaces.)
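Ernst's five stub steps compress into a small sketch. Everything below (the wire layout, pkt_alloc(), the in-place "transport") is invented to show the shape of the path, not the Firefly's real code; the server's reuse of the caller's buffer is mimicked by transport_call() writing the result into the same packet:

    #include <stdint.h>
    #include <stdlib.h>
    #include <stdio.h>

    /* invented wire format: one packet carries the call and, on
       return, the result in place */
    struct pkt { uint32_t proc_id; uint32_t arg; uint32_t result; };

    static struct pkt *pkt_alloc(void) { return malloc(sizeof(struct pkt)); }
    static void pkt_free(struct pkt *p) { free(p); }

    /* stand-in for step 3: a real transport would send over the QBus
       and Ethernet, register for the reply, and block; here the
       "server" runs in place */
    static void transport_call(struct pkt *p) {
        if (p->proc_id == 17)          /* dispatch on the function id */
            p->result = p->arg + 1;
    }

    uint32_t remote_add_one(uint32_t x) {
        struct pkt *p = pkt_alloc();   /* 1. allocate packet buffer */
        p->proc_id = 17;               /* one generic stub + function id */
        p->arg = x;                    /* 2. marshal (trivial for an int) */
        transport_call(p);             /* 3. transport: send, block, reply */
        uint32_t r = p->result;        /* 4. unmarshal result */
        pkt_free(p);                   /* 5. deallocate packet buffer */
        return r;
    }

    int main(void) {
        printf("%u\n", remote_add_one(41));   /* prints 42 */
        return 0;
    }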
---------------------------------------------------
From: rgrimm@lutefisk (Robert Grimm)
Date: Thu, 16 Jan 1997 08:27:31 -0800
Subject: SRCRPC 552-reading summary

This is a postscript to my observation about the jump in null RPC latency when moving from two server processors to one. One possible explanation could be that the authors observed a load of 1.2 in the original experiments using five processors; obviously, such a load can be better distributed over two processors than over one. However, this can only be a partial explanation, given that the client load in the original experiments was also 1.2 while only one processor is used on the client side.