IMMD IV - Department of Computer Science (Operating Systems)
University of Erlangen-Nürnberg
Processor caches are designed to store the most recently used subset of the main memory and to provide this subset with low latency. Most contemporary cache architectures use physically indexed caches to facilitate cache coherency and to avoid cache flushing during a context switch. Memory of multiple contexts can simultaneously be encached in physically indexed caches. If two threads with a different working set follow upon each other, the cache content is displaced. In this situation, the processor stalls due to cache misses in the instruction-fetch and load/store pipeline stage.
Frequently, kernel threads share a lot of memory. Examples are code segments, shared-memory segments, or shared address spaces in multi-threaded applications. Switching between threads that share large parts of their working set results in a high cache reusage, in few cache misses, and in a good system performance . Our analysis of typical application- and information-servers shows that a large fraction of the physical memory pages is shared by multiple contexts. In a highly loaded server system, there are normally many runable threads in each run-queue sharing a lot of memory. It is desirable that threads with the same working-set follow upon each other to reuse the content of the cache, but in contemporary operating systems the sequence of execution is independent of working set aspects.
Our approach to improve system performance uses information derived from the translation lookaside buffer (TLB) to detect kernel threads which share lots of memory pages. To figure out the pages a thread is accessing, we need to analyze the content of the TLB. As we can not easily look into the TLB, we have to modify the TLB miss-handler or have to analyze data structures used to cache the page tables (e.g., translation storage buffer TSB in the Sparc V9 architecture ). If related threads reside in the same dispatch queue, the order of the dispatch queue is carefully changed such that related threads follow upon each other. This approach draws most profit from operating systems, where threads are enqueued into a fixed priority-dispatch queue after they have returned from sleep (e.g, like in System V Release 4).
The determination of related threads is a compute-intensive procedure, because there exists no trivial mapping from protection domains (physical pages/contexts) used in the VM subsystem to active entities (threads) managed by the process management. Therefore, we were forced to implement a reverse lookup mechanism to coalesce threads and pages.
Measurements of a prototype implementation using the Solaris source tree demonstrate the advantages of using TLB information in a multiprogrammed information- and application-server (SUN Enterprise X000 architecture). We also propose some simple enhancements of the TLB structures to reduce the overhead implied by the determination of related threads.
 F. Bellosa, "The Performance Implications of Locality Information Usage in Shared-Memory Multiprocessors", Journal of Parallel and Distributed Computing, Special Issue on Multithreading for Multiprocessors, Vol. 37 No. 1, Aug. 96
 "UltraSPARC Programmer Reference Manual", Revision 1.0, SUN Microsystems Inc., 1995