Multiprocessor Architecture

The organization and management of the memory hierarchy of uniprocessors and of MIMD multiprocessors depends very much on the structure of the elements of the memory system and, in the multiprocessor case, on the interconnection between processors.

In single processors and in bus-based multiprocessors, either with a single shared-bus or a hierarchy of buses, private caches are associated with processors. In spite of these caches, the overhead of cache misses for very fast processors and bus contention are the main impediments to high performance. In addition, the presence of multiple caches in multiprocessors gives rise to the cache coherence problem. We are studying various ways to minimize the cache miss penalty, e.g., (1) through hardware and compiler-based prefetching, (2) through compiler restructuring of shared data to reduce false sharing traffic, (3) through placement of data in various levels of multibus and cache hierarchies, and (4) through structures such as sector caches and associated protocols.

When the switching structure is not a single shared bus, solving the cache coherence problem solely through hardware mechanisms is less appealing. We are investigating a number of schemes ranging from combining a compile-time marking of (data) references and a strictly local hardware mechanism to an entirely compile-time analysis of cross-processor memory references that determines the type of coherency code to be generated.

Another area of research is centered around multiprocessors whose processors have multiple hardware contexts. Multithreaded architectures hide the latency of individual operations by executing useful instructions from another context while waiting for them to complete. However, frequent context-switching among multiple hardware contexts can increase interconnect traffic by increasing conflict misses for the combined working sets of multiple threads. We are investigating several design and performance issues of multithreaded multiprocessors: the tradeoffs between scheduling a thread in another hardware context within the same processor or in a context on another processor, developing thread placement (on processors) algorithms, evaluating architectural techniques for reducing conflict misses that result from the enlarged multiple context working sets, and evaluating the interaction between synchronization techniques and processor context switch policies.

A related project will investigate issues of multiscaling vs. multithreading, i.e., how best to utilize multiple functional units in a wide superscalar processor. At one extreme a single thread would issue multiple instructions in the same cycle; at the other extreme each functional unit would execute an instruction from a different thread (a multithreaded processor). The optimal design is probably somewhere in between. It will investigate a variety of processor design, thread scheduling and memory management issues under this general umbrella.

There are several techniques that can be used to evaluate multiprocessor architectures: analytical models, trace driven simulation with either synthetic or real program traces, and instruction-level simulations. We are investigating several of these techniques. One of our projects is to obtain parallel program traces with which to run simulations. We have built a facility that modifies parallel programs for our Sequent multiprocessor to trace themselves as they run. The facility is integrated with our Presto lightweight thread package so that both the programs and the thread package itself are traced. Data from this facility is now being used to study sharing patterns, prefetching algorithms and scheduling for multiple hardware contexts, as well as for validating the tracing methodology itself.

A second project related to performance evaluation is parallel trace-driven simulation. We have implemented both conservative and optimisitic trace-driven simulation algorithms on the KSR-2. We are in the process of finding the bottlenecks of the two algorithms (using the traces generated by the facility described in the previous paragraph) in order to perform needed optimizations. The next step will be to compare the two methods qualitatively and quantitatively to see which one should be pursued for a full-fledged simulation of parallel architectures.

Principal Investigators: Baer, Eggers, Levy

webmaster@cs.washington.edu