Department of Computer Science, Princeton University
Popular communication APIs like stream sockets or RPC are connection-based. The performance advantage of new Gigabit networks like Myrinet is often not delivered to user space because previous implementations of these APIs incur high overhead and multiple data copies.
The basic virtual memory-mapped communication (VMMC) model provides protected, direct communication between the sender's and receiver's virtual address spaces. Each message carries a destination address in the receiver's virtual memory. When a message arrives at its destination, it is transferred directly into the memory of the receiving process, without interrupting the receiver's CPU. Thus, there is no explicit receive operation in VMMC.
The basic VMMC model has been implemented on two platforms: the SHRIMP multicomputer, which consists of PCs connected by the Intel Paragon routing network, and a set of PCs connected by a Myrinet network. Programming directly with VMMC delivers communication latency and bandwidth very close to the hardware limits. We already have a set of communication libraries built on VMMC, including NX, Sun RPC, stream sockets and shared virtual memory (SVM).
However, the basic model does not support connection-oriented high-level communication APIs well. The problem is that the VMMC model requires the sender to know the destination address before it sends data. Connection-based high-level APIs such as stream sockets, on the other hand, do not expose destination addresses on the sending side; a sender knows only the name of the connection. Achieving a zero-copy protocol on top of the basic model therefore often requires a scout message or a handshake in the implementation.
In this work, we propose, implement and evaluate a mechanism called transfer redirection. The basic idea is to use a default, redirectable receive buffer when a sender does not know the address of the final receive buffer. When data arrives at the receiving side, the redirection mechanism checks whether a redirection address has been posted. If the receiver has posted its buffer address before the message arrives, the message is placed into the user buffer directly from the network without any copying. If no redirection address has been posted, the data is moved to the default buffer. Later, when the receiver posts the receive buffer address, the data is copied from the default buffer to the receive buffer. This mechanism naturally extends the virtual memory-mapped communication model and can easily support connection-based APIs, achieving zero-copy transfers without any additional messages.
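The two arrival cases above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the implementation evaluated in this work: the class name `Receiver`, its methods, and the handling of data that overflows a posted buffer are hypothetical, and the redirection that a real system would perform on the network interface is modeled here as ordinary Python code.

```python
class Receiver:
    """Toy model of a receive endpoint with a default, redirectable buffer."""

    def __init__(self):
        self.default_buffer = bytearray()  # holds data that arrived before a buffer was posted
        self.redirect = None               # user buffer posted by the receiver, or None
        self.extra_copies = 0              # number of default-to-user copies performed

    def post_buffer(self, user_buffer):
        """Receiver posts its buffer address; returns bytes delivered immediately."""
        if self.default_buffer:
            # Data arrived early: one extra copy from the default buffer.
            n = min(len(user_buffer), len(self.default_buffer))
            user_buffer[:n] = self.default_buffer[:n]
            del self.default_buffer[:n]
            self.extra_copies += 1
            return n
        # Nothing has arrived yet: remember the redirection address.
        self.redirect = user_buffer
        return 0

    def deliver(self, data):
        """Called when a message arrives from the network."""
        if self.redirect is not None:
            # Redirection posted in time: place the data straight into the user buffer.
            n = min(len(self.redirect), len(data))
            self.redirect[:n] = data[:n]
            self.redirect = None
            # Assumption: bytes that do not fit fall back to the default buffer.
            self.default_buffer += data[n:]
            return "direct"
        # No redirection posted: fall back to the default buffer.
        self.default_buffer += data
        return "default"


# Usage: the receiver posts first, so the message needs no extra copy.
receiver = Receiver()
user_buffer = bytearray(5)
receiver.post_buffer(user_buffer)
receiver.deliver(b"hello")
```

In neither case does the sender need to know the final receive buffer address, and no scout message or handshake is exchanged; the only cost of a late post is the single copy out of the default buffer.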
To support redirection efficiently, the network interface must be able to transfer data directly into user buffers. Our previous implementations of virtual memory-mapped communication required static pinning of all receive buffers, which limits the total receive space. This work describes a user-managed TLB (UTLB) mechanism that allows a user library to dynamically manage the amount of pinned memory used for both send and receive operations. The UTLB consists of a per-process array holding the physical addresses of those pages of the process's virtual memory regions that are pinned in host physical memory. Each UTLB is allocated by the driver in kernel memory. For protection, the VMMC library cannot read or modify the UTLB directly. Instead, the library manages the UTLB indirectly by asking the device driver to replace UTLB entries and to provide physical addresses. The library maintains a lookup data structure to keep track of pages that have been pinned and whose translations are present in the UTLB. This data structure is consulted for every send and redirect request to obtain UTLB indices for the user buffer. The network interface uses a UTLB index passed with a user request to obtain the page addresses of the user buffer. To avoid the cost of fetching UTLB entries from the host, the network interface maintains a cache of UTLB entries.
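The division of labor between library, driver and network interface can be sketched as follows. This is a minimal model under stated assumptions rather than the actual implementation: all class and method names are hypothetical, the physical addresses are fabricated from a made-up base constant, the least-recently-used replacement policy is an assumed choice, and cache invalidation on replacement is modeled as a simple callback.

```python
from collections import OrderedDict

PAGE_SIZE = 4096
FAKE_PHYS_BASE = 0x100000  # made-up base; a real driver would pin the page and look up its frame


class Driver:
    """Stand-in for the kernel driver, which owns the UTLB array."""

    def __init__(self, entries):
        self._utlb = [None] * entries
        self.on_replace = None  # hypothetical hook used to invalidate the interface cache

    def replace(self, index, vpage):
        self._utlb[index] = FAKE_PHYS_BASE + vpage * PAGE_SIZE
        if self.on_replace:
            self.on_replace(index)

    def fetch(self, index):
        return self._utlb[index]


class UtlbLibrary:
    """User-level lookup structure: virtual page number -> UTLB index."""

    def __init__(self, driver, entries):
        self.driver = driver
        self.free = list(range(entries))
        self.index_of = OrderedDict()  # least recently used page first

    def indices_for(self, vaddr, length):
        """Return UTLB indices covering [vaddr, vaddr + length).

        Assumes the buffer spans no more pages than the UTLB has entries.
        """
        first = vaddr // PAGE_SIZE
        last = (vaddr + length - 1) // PAGE_SIZE
        result = []
        for vpage in range(first, last + 1):
            if vpage in self.index_of:
                self.index_of.move_to_end(vpage)  # hit: mark as recently used
            else:
                if self.free:
                    index = self.free.pop(0)
                else:
                    # Assumed policy: evict the least recently used page.
                    _, index = self.index_of.popitem(last=False)
                self.driver.replace(index, vpage)  # the library never touches the UTLB itself
                self.index_of[vpage] = index
            result.append(self.index_of[vpage])
        return result


class NetworkInterface:
    """Caches UTLB entries to avoid fetching them from the host on every request."""

    def __init__(self, driver):
        self.driver = driver
        self.cache = {}
        self.host_fetches = 0
        driver.on_replace = self.invalidate

    def invalidate(self, index):
        self.cache.pop(index, None)

    def physical_address(self, index):
        if index not in self.cache:
            self.cache[index] = self.driver.fetch(index)
            self.host_fetches += 1
        return self.cache[index]
```

Note that the library only ever handles indices, never physical addresses; in this sketch the translation from index to physical address happens in the driver and the network interface, which mirrors the protection argument made above.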
The UTLB mechanism requires only standard support available in most existing operating systems. As a result, one can realize protected, user-level communication without modifying operating systems and without pinning a large amount of space for communication.
To validate our extensions, we have implemented stream sockets on top of the extended VMMC running on a Myrinet network of Pentium PCs. This is a zero-copy implementation with a maximum bandwidth of over 84 Mbytes/s and a one-way latency of 20 microseconds.