Cheating the I/O Bottleneck:
Network Storage with Trapeze/Myrinet
Darrell Anderson, Jeff Chase, Syam Gadde,
Andrew Gallatin, Alvin Lebeck, and Ken Yocum
Duke University
Mike Feeley
University of British Columbia
Two recent hardware advances boost the potential of cluster computing:
high-quality PCI bus implementations, and switched cluster interconnects that can
deliver a gigabit or more of point-to-point bandwidth. We are
developing system facilities to realize the potential for high-speed
data transfer over Myricom's 1.28 Gb/s Myrinet LAN, and harness it for
cluster file systems, network memory systems, and other distributed
OS services that cooperatively share data
across the cluster. Our broad goal is to use the power of the network
to sidestep the disk I/O bottleneck for data-intensive computing on
workstation clusters. Our recent work focuses on three areas:
- Trapeze network interface. Trapeze is a new messaging
interface for cluster interconnects, prototyped as a custom firmware
program for Myrinet LANs. The interface delivers short, fixed-size
control messages (128 bytes) with optional attached payloads (up
to 8K) transferred directly to or from arbitrary frames of host
memory. The firmware implements a DMA pipelining technique called
adaptive cut-through delivery to minimize latency of large
payloads (e.g., page migration traffic) while yielding high
bandwidth under load. Experiments with Intel platforms based on the
Natoma PCI bridge show that raw Trapeze can demand-fetch a 4K page
from remote memory in less than 100 µs, with
bandwidth over 105 MB/s for streams of 8K payloads.
- Trapeze-based messaging software. Above the raw Trapeze
interface, a software layer supports kernel-to-kernel messaging for
cluster OS services. A primary focus of our effort has been zero-copy
handling of responses for RPC-style request/response messaging, in
which the network interface deposits a response payload directly
into a memory frame specified by the RPC caller. We support zero-copy
responses for two RPC variants: (1) multiway exchanges in
which requests are delegated to third parties, and (2) asynchronous
handling of responses using continuation procedures invoked from
interrupt handlers. These features are important for peer-to-peer
cluster OS services: the first supports directory lookups for fetched
data, and the second supports lightweight asynchronous calls, which
are useful for prefetching.
- GMS over Trapeze/Myrinet. Trapeze was optimized for page
migration traffic in the Global Memory Service (GMS), presented at
SOSP 95. GMS implements remote paging and cooperative caching of file
blocks at a low level of the OS kernel. Our enhanced GMS runs over
Trapeze on Myrinet clusters of DEC AlphaStations running Digital Unix.
It uses the Trapeze facilities to unify buffering of
migrated pages, eliminating all page copies by sending and
receiving directly from the file/VM page cache. We extended GMS for
zero-copy prefetching of sequentially accessed files; the current
prototype can read a mapped sequential file from network memory at 50
MB/s, saturating a 266 MHz Alpha CPU in the file system code.
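To illustrate why pipelining DMA transfers lowers the latency of large payloads, the following is a minimal sketch of a generic pipeline latency model. It is not the Trapeze firmware's adaptive cut-through algorithm. The three stages (sender host-to-NIC DMA, network link, receiver NIC-to-host DMA), the 1 KB chunk size, and the 120 MB/s host DMA rate are assumptions chosen for illustration; only the 160 MB/s link rate follows from the 1.28 Gb/s figure above.

```python
def store_and_forward_us(payload_bytes, stage_mb_per_s):
    """Latency when each stage waits for the whole payload before forwarding.

    bytes / (MB/s) gives microseconds when 1 MB is taken as 10**6 bytes.
    """
    return sum(payload_bytes / bw for bw in stage_mb_per_s)


def pipelined_us(payload_bytes, chunk_bytes, stage_mb_per_s):
    """Latency when the payload moves through the stages in equal chunks.

    The first chunk crosses every stage in turn; each remaining chunk then
    completes one slowest-stage time later. Assumes chunk_bytes divides
    payload_bytes evenly and ignores per-chunk setup overhead.
    """
    n_chunks = payload_bytes // chunk_bytes
    per_chunk = [chunk_bytes / bw for bw in stage_mb_per_s]
    return sum(per_chunk) + (n_chunks - 1) * max(per_chunk)


# Hypothetical stage bandwidths in MB/s: host DMA, link, host DMA.
STAGES = [120.0, 160.0, 120.0]

if __name__ == "__main__":
    payload = 8192  # one 8K payload
    print("store-and-forward: %.1f us" % store_and_forward_us(payload, STAGES))
    print("pipelined (1K chunks): %.1f us" % pipelined_us(payload, 1024, STAGES))
```

In this model the pipelined transfer approaches the time of the slowest stage alone as the chunks shrink, while a single chunk equal to the payload reduces to the store-and-forward case.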
We are pursuing several directions for continuing work. First, our
preliminary results have exposed unnecessary overheads in common-case
paths through the file and VM systems, which were previously limited
by disk speeds. Second, cluster OS services using RPC-style messaging
introduce new flow control considerations, particularly with
prefetching, which is bursty and hungry for bandwidth. GMS/Trapeze
systems under load can interfere with application-mandated network
traffic, or even saturate network trunk links and node I/O buses. We
have extended the Trapeze/Myrinet firmware to reflect back to the host
precise information about congestion delays incurred while moving a
packet through the interconnect, and we are experimenting with a
modified GMS system that factors this data into its policies for
selecting peer sites to receive its page migration requests. This is
one example of a network-aware distributed system that can
adapt what it sends and where in response to feedback
about network topology and conditions. We believe that these mechanisms,
which supplement more familiar mechanisms to control the
send rate and routing of the generated traffic, are essential for
delivering the potential of gigabit interconnects for data-intensive
computing on workstation clusters.
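As an illustration of how congestion feedback might be factored into peer selection, the following is a minimal sketch. It assumes that the host keeps an exponentially weighted moving average of the congestion delay reported for each peer and sends the next page migration request to the peer with the lowest average. The class name, the smoothing factor, and the selection policy are hypothetical and are not taken from the modified GMS system.

```python
class PeerSelector:
    """Tracks smoothed congestion delay per peer (hypothetical policy)."""

    def __init__(self, peers, alpha=0.25):
        # alpha is an assumed smoothing factor: higher reacts faster.
        self.alpha = alpha
        self.delay_us = {peer: 0.0 for peer in peers}

    def record_delay(self, peer, observed_us):
        """Fold one reported congestion delay into the peer's average."""
        old = self.delay_us[peer]
        self.delay_us[peer] = (1 - self.alpha) * old + self.alpha * observed_us

    def pick(self):
        """Return the peer with the lowest smoothed congestion delay."""
        return min(self.delay_us, key=self.delay_us.get)


if __name__ == "__main__":
    selector = PeerSelector(["node-a", "node-b", "node-c"])
    selector.record_delay("node-a", 400.0)
    selector.record_delay("node-b", 50.0)
    selector.record_delay("node-c", 200.0)
    print("next target:", selector.pick())
```

A real policy would also have to weigh factors such as the age of the pages held at each peer. This sketch isolates only the congestion term.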
Jeff Chase
Thu Aug 21 16:48:08 EDT 1997