Data Prefetching for High-Performance Processors, UW CSE TR 93-07-01

By Tien-Fu Chen

ABSTRACT

Recent technological advances are such that the gap between processor cycle times and memory cycle times is growing. Techniques to reduce or tolerate large memory latencies become essential for achieving high processor utilization. In this dissertation, we propose and evaluate data prefetching techniques that address the data access penalty problems.

First, we propose a hardware-based data prefetching approach for reducing memory latency. The basic idea of the prefetching scheme is to keep track of data access patterns in a reference prediction table (RPT) organized as an instruction cache. It includes three variations of the design of the RPT and associated logic: generic design, a lookahead mechanism, and a correlated scheme. They differ mostly on the timing of the prefetching. We evaluate the three schemes by simulating them in a uniprocessor environment using the ten SPEC benchmarks. The results show that the prefetching scheme effectively eliminates a major portion of data access penalty and is particularly suitable to an on-chip design and a primary-secondary cache hierarchy.

Next, we study and compare the substantive performance gains that could be achieved with hardware-controlled and software-directed prefetching on shared-memory multiprocessors. Simulation results indicate that both hardware and software schemes can handle programs with regular access patterns. The hardware scheme is good at manipulating dynamic information, whereas software prefetching has the flexibility of prefetching larger blocks of data and of dealing with complex data access patterns. The execution overhead of the additional prefetching instructions may decrease the software prefetching performance gains. An approach that combines software and hardware schemes is shown to be very promising for reducing the memory latency with the least overhead.

Finally, we study non-blocking caches that can tolerate read and write miss penalties by exploiting the overlap between post-miss computations and data accesses. We show that hardware data prefetching caches generally outperform non-blocking caches. We derive a static instruction scheduling algorithm to order instructions at compile time. The algorithm is shown to be effective in exploiting instruction parallelism available in a basic block for non-blocking loads.

%A Tien-Fu Chen
%T Data Prefetching for High-Performance Processors
%R 93-07-01
%I University of Washington Department of Computer Science and Engineering
%D July 1993

pardo@cs.washington.edu