Limits to OS-based ATM Network Striping
(work-in-progress proposal for SOSP)
Jacob J. Schrøder, Thomas Stormark, Povl T. Koch*
Department of Computer Science, University of Copenhagen (DIKU)
Universitetsparken 1, DK-2100 Copenhagen East, Denmark
OS-based network striping is a flexible way to provide higher aggregate network bandwidth, lower message latency for large packets, and more reliable communication by using multiple network adapters in parallel. We have experimented with network striping using two 155 Mbps ATM connections between dual Pentium Pro machines running Windows NT 4.0. We find that performance is limited by the design of the operating system and the ATM driver. We have developed a new striping algorithm, called TRR, that preserves packet ordering, as required by ATM. We see reasonable benefits for UDP/IP, but are still working on getting decent TCP/IP performance.
In today's distributed systems, the limited bandwidth and high latency of even high-speed networks are major bottlenecks. One way to increase bandwidth and lower latency is network striping, also known as inverse multiplexing, in which multiple parallel network connections (stripes) are aggregated into a single logical connection. The potential benefits of network striping are: (1) multiple low-cost network interfaces may provide a cost-effective alternative to a single expensive high-speed network interface, (2) multiple interfaces may provide higher performance than can be achieved with current single-interface technology, and (3) the reliability of the network subsystem may be improved by using multiple stripes.
Implementation: We have access to the source code for the ATM driver module (from Olicom), and we have placed the striping point at the lowest possible level. This allows all applications to benefit from network striping, using protocols such as TCP/IP, UDP/IP, and LAN Emulation. Three different striping protocols have been implemented: Basic Round-Robin (BRR), Surplus Round-Robin (SRR), and our new algorithm, Trigger-based Surplus Round-Robin (TRR). The main contribution of the TRR algorithm is that it guarantees in-order delivery of all packets, although some packets may be lost. This is an ATM requirement. In-order delivery is not guaranteed by the BRR and SRR algorithms; BRR gives no guarantees about delivery, and SRR only guarantees almost-FIFO delivery. The TRR algorithm is based on ATM control cells, called trigger cells, injected into the stream of normal ATM cells. The trigger cells allow for correct ordering of ATM cells or AAL5 datagrams during reassembly by (a) designating which stripe to read from next and (b) giving a high-level sequence number. One trigger cell is sent for every 100-200 normal ATM cells, so the overhead is very low.
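The round-robin striping schemes named above can be illustrated with a minimal sketch. This is not the driver code: the function names (brr_stripe, srr_stripe, srr_reassemble, trigger_overhead) and the quantum parameter are assumptions made for illustration, and the SRR shown is a simplified byte-counting rendering of the surplus round-robin idea. The receiver mirrors the sender's counters and therefore only restores order when no packet is lost; the trigger cells of TRR are not modeled because their format is not given here.

```python
from collections import deque


def brr_stripe(packets, n_stripes):
    """Basic round-robin: packet i goes to stripe i mod n_stripes."""
    stripes = [deque() for _ in range(n_stripes)]
    for i, pkt in enumerate(packets):
        stripes[i % n_stripes].append(pkt)
    return stripes


def srr_stripe(packets, n_stripes, quantum):
    """Simplified surplus round-robin sender (illustrative assumption).

    Each stripe has a byte counter. A stripe keeps sending while its
    counter is positive; every packet subtracts its length. When the
    counter drops to zero or below, the sender moves to the next stripe
    and adds `quantum` bytes to that stripe's counter.
    """
    if quantum <= 0:
        raise ValueError("quantum must be positive")
    stripes = [deque() for _ in range(n_stripes)]
    surplus = [0] * n_stripes
    current = 0
    surplus[current] += quantum
    for pkt in packets:
        while surplus[current] <= 0:
            current = (current + 1) % n_stripes
            surplus[current] += quantum
        stripes[current].append(pkt)
        surplus[current] -= len(pkt)
    return stripes


def srr_reassemble(stripes, quantum):
    """Receiver that replays the sender's counters (assumes no loss)."""
    queues = [deque(s) for s in stripes]
    n = len(queues)
    surplus = [0] * n
    current = 0
    surplus[current] += quantum
    remaining = sum(len(q) for q in queues)
    out = []
    while remaining:
        while surplus[current] <= 0:
            current = (current + 1) % n
            surplus[current] += quantum
        if not queues[current]:
            break  # a packet was lost; plain SRR cannot resynchronize here
        pkt = queues[current].popleft()
        surplus[current] -= len(pkt)
        out.append(pkt)
        remaining -= 1
    return out


def trigger_overhead(cells_per_trigger):
    """Fraction of cells on a stripe that are trigger cells."""
    return 1.0 / (cells_per_trigger + 1)


if __name__ == "__main__":
    sizes = [1500, 40, 9000, 576, 40, 1500, 65000, 40, 1500, 576]
    pkts = [bytes([i]) * size for i, size in enumerate(sizes)]
    striped = srr_stripe(pkts, n_stripes=2, quantum=1500)
    print([sum(len(p) for p in s) for s in striped])  # bytes per stripe
    print(srr_reassemble(striped, quantum=1500) == pkts)
    print(trigger_overhead(100), trigger_overhead(200))
```

The break in srr_reassemble marks the point where a lost packet would desynchronize sender and receiver. That is the situation the trigger cells address, by telling the receiver which stripe to read from next and by carrying a sequence number.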
Early performance results: At the lowest level, between ATM drivers, using two 155 Mbps ATM adapters, we obtain a throughput of 261 Mbps. At the user level, throughput is not as good. With UDP/IP in a non-striped environment, on one connection, we obtain a user-level throughput of 109 Mbps. With striping, using two parallel connections in the ATM driver, we obtain bandwidths of 171, 167, and 160 Mbps for the BRR, SRR, and TRR algorithms, respectively. The round trip time of 65,527 byte packets drops from around 11.5 to 8.9 milliseconds when we use striping. For TCP/IP, the throughput in a non-striped environment is 130 Mbps. When using striping, the throughput drops to 120 Mbps. TCP/IP round trip times increase from 10.9 milliseconds to around 12 milliseconds when using striping for 65,627 byte packets. The reason is not yet clear (see below). The round trip times for small packets are affected very little by striping.
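To put the figures above in proportion, the following short calculation derives the striping speedup for each protocol and the fraction of the driver-level throughput that reaches user level. The numbers are those quoted in the text; the variable names are only illustrative.

```python
# Throughput figures in Mbps, as quoted in the text.
LINK_RATE = 155.0        # nominal rate of one ATM connection
DRIVER_STRIPED = 261.0   # driver-to-driver throughput with two stripes
UDP_SINGLE = 109.0       # user-level UDP/IP, one connection
UDP_STRIPED = {"BRR": 171.0, "SRR": 167.0, "TRR": 160.0}
TCP_SINGLE = 130.0       # user-level TCP/IP, one connection
TCP_STRIPED = 120.0      # user-level TCP/IP, two stripes

# Speedup of striped over non-striped UDP/IP (2.0 would be ideal).
udp_speedup = {name: mbps / UDP_SINGLE for name, mbps in UDP_STRIPED.items()}

# Share of the driver-level striped throughput visible at user level.
udp_vs_driver = {name: mbps / DRIVER_STRIPED for name, mbps in UDP_STRIPED.items()}

# Share of the combined nominal link rate achieved between drivers.
driver_vs_links = DRIVER_STRIPED / (2 * LINK_RATE)

# TCP/IP "speedup"; a value below 1.0 means striping made it slower.
tcp_speedup = TCP_STRIPED / TCP_SINGLE

for name in UDP_STRIPED:
    print(f"{name}: speedup {udp_speedup[name]:.2f}, "
          f"{udp_vs_driver[name]:.0%} of driver-level throughput")
print(f"driver level: {driver_vs_links:.0%} of 2 x {LINK_RATE:.0f} Mbps")
print(f"TCP/IP: speedup {tcp_speedup:.2f}")
```

UDP/IP gains roughly a factor of 1.5 from the second stripe, well short of the ideal factor of 2, while TCP/IP ends up below 1.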
OS striping limitations: Our very first experiments showed no benefit from network striping because we used spin locks to synchronize the interrupt handlers' access to shared data structures in the ATM driver. Now, with a dedicated worker thread per adapter, we have lowered the CPU usage and obtained much higher bandwidth. For all our bandwidth and round trip measurements, the CPU utilization is far from 100%. The main questions now are why we cannot get closer to the 261 Mbps throughput seen at the lowest level, and why TCP/IP performs so much worse with striping. For TCP/IP, we see no packet retransmissions, and with the TRR algorithm the packets are delivered in the same order as they were sent. We see a major limitation in the way Windows NT handles memory buffers: only a single receive buffer can be handled at a time per virtual adapter (which handles the striping). Internal buffers can be used inside the driver to receive packets in parallel, but this requires extra, expensive copies when the packets are handed to the upper layers. We believe it is a general problem that driver interfaces and protocol stacks serialize packets at a very early point, thereby greatly reducing the benefit of multiple network interfaces and multiple processors. We are now investigating how to avoid these limitations in the OS.
C. Brendan S. Traw and J.M. Smith. Striping Within the Network Subsystem. In IEEE Network, July/August, 1995.
H. Adiseshu, G. Parulkar, and G. Varghese. A Reliable and Scalable Striping Protocol. In Proceedings of SIGCOMM'96. October, 1996.
J.S. Hansen and E. Jul. A Scheduling Scheme for Network Saturated NT Multiprocessors. In Proceedings of the USENIX Windows NT Workshop. USENIX. August, 1997.
*Currently at INRIA Rhone-Alpes, Project SIRAC, Grenoble, France.