CHiMPS, with Andrew Putnam and the CHiMPS group from Xilinx
There have been (at least) two hindrances to the widespread adoption of FPGAs by scientific application developers: having to code in a hardware description language, such as Verilog, with its accompanying hardware-based programming model; and poor FPGA memory performance for random memory accesses. CHiMPS, our C-to-FPGA synthesis compiler, solves both problems with a single memory architecture, the many-cache memory model. Many-cache organizes the small, distributed memories on an FPGA into application-specific caches, each targeting a particular data structure or region of memory in an application, and each customized for the particular memory operations that access it.
CHiMPS provides all the traditional benefits we expect from caching. To reduce cache latency, CHiMPS duplicates the caches, so that they're physically located near the hardware logic blocks that access them. To increase memory bandwidth, CHiMPS banks the caches to match the memory parallelism in the code. To increase task-level parallelism, CHiMPS duplicates caches (and their computation blocks) through loop unrolling and tiling. Despite the lack of FPGA support for cache coherency, CHiMPS facilitates data sharing among FPGA caches, and between the FPGA and its CPU, through a simple flushing of cached values. Finally, to harness the massively parallel computation offered by FPGAs, CHiMPS compiles to a spatial dataflow execution model and provides a mechanism to order dependent memory operations, retaining C memory-ordering semantics.
CHiMPS's compiler analyses automatically generate the caches from C source. The solution lets scientific programmers retain their familiar programming environment and memory model, while providing performance that is on average 7.8x greater, and power consumption roughly one-fourth, that of a CPU executing the same source code. The CHiMPS work has been published at the International Symposium on Computer Architecture (ISCA, 2009), the International Conference on Field Programmable Logic and Applications (FPL, 2008), and High-Performance Reconfigurable Computing Technology and Applications (HPRCTA, 2008), where it received the Best Paper Award.
Concurrency Exceptions, with Luis Ceze and the UW Sampa group
With ubiquitous multicores and emerging parallel programs, we now face far harder software reliability problems. Concurrency errors are hard to understand, are typically non-deterministic, and manifest themselves well past the point of their occurrence. Moreover, they have major implications for programming language semantics. Recent work on support for concurrency debugging has made good progress, but it has often focused on best-effort bug-detection techniques with only probabilistic guarantees.
Our work takes a direct approach to the problem: making concurrency errors fail-stop by delivering an exception before the error manifests itself. In other words, the system will detect that a concurrency error is about to happen and will raise an exception before the erroneous code is allowed to execute. We call this mechanism concurrency exceptions. Concurrency exceptions will allow concurrency errors to be handled as conveniently as division by zero or a segmentation fault.
Treating concurrency errors as exceptions has major positive implications for debugging, because execution can stop exactly at the point where the bug occurred. It also improves safety, because the exception will either exit the program, avoiding delayed ill effects, or trigger a recovery action that prevents misbehavior. Moreover, supporting fail-stop behavior for data races in particular will significantly simplify the specification of programming languages.
Like all other exceptions, concurrency exceptions require exactness in the definition of concurrency error conditions (to unambiguously designate what an error is), absolute precision in detecting those conditions (so that there are neither false positives nor false negatives), and high efficiency (since the feature will be enabled continuously and used synchronously with respect to program execution). We propose to provide these guarantees with a variety of techniques that span the entire systems stack.
- WaveScalar, the first dataflow machine to execute programs written in imperative languages, such as C.
- Simultaneous multithreading, the first commercially viable multithreaded architecture.
- DyC, the first (C) source-to-machine-language dynamic compiler.
- Coarse-grain, multithreaded architectures
- False sharing optimizations
- Prefetching on shared memory multiprocessors
- Code scheduling
- Cache coherency
- Other research