The current work in compilers focuses on developing backend compiler algorithms, and then using them, along with compiler techniques developed by others, to alleviate memory bottlenecks, both through reducing memory latencies and eliminating memory operations.
We are addressing a full range of issues in this area. For uniprocessors we have developed an instruction scheduling algorithm, known as Balanced Scheduling, that supports machines in which the uncertainty of memory latencies is exposed to the compiler through lockup-free caches or multiple hardware contexts. Our current work extends Balanced Scheduling in two respects: combining it with compiler optimizations that increase the size of the basic block (such as loop unrolling and trace scheduling), thereby providing more opportunities to exploit instruction level parallelism; and avoiding using it when latencies are known. If that work proves fruitful, we will then begin scheduling for superscalar architectures.
We also eliminate memory operations through dynamic compilation. By applying (input) value-specific optimizations and then generating special case code at runtime, we turn memory references into instruction stream constants. The technique enables further compiler optimizations, such as fullu unrolling loops, that eliminate additional memory operations.
In the shared memory arena, we have eliminated memory operations that were lost to false sharing and prefetch-induced cache conflicts within and across process working sets. The latter were eliminated through a variety of architecture as well as compiler techniques, such as victim caching, special prefetch algorithms for shared data and compiler-based shared data restructuring. The underpinning of the false sharing work is control flow analysis on process families, non-currency analysis applied to barriers and traditional summary side effect analysis, extended to handle patterns and frequencies of cross-process memory accesses. The approach has potentially wide applicability in areas ranging from memory allocation in distributed memory machines to less conservative automatic parallelization techniques. Thus far, we have used it to improve the benefits of prefetching, and, of course, to eliminate false sharing.
Principal Investigator: Eggers