Coarse-grained reconfigurable architectures (CGRAs) have the potential to offer performance approaching that of an ASIC while retaining, within an application domain, flexibility similar to that of a digital signal processor. In the past, CGRAs have been encumbered by challenging programming models that are either too far removed from the hardware to offer reasonable performance or bury the programmer in the minutiae of hardware specification. Additionally, the ratio of performance to power has not been compelling enough to justify overcoming the hurdles of the programming model, which has limited adoption.
The goal of our research is to improve the power efficiency of a CGRA at an architectural level, with respect to a traditional island-style FPGA. Additionally, we are continuing previous research into a unified mapping tool that simplifies the scheduling, placement, and routing of an application onto a CGRA.
Reconfigurable computing architectures provide large numbers of tightly integrated processing elements to achieve the performance of custom hardware without sacrificing reprogrammability. These "micro-parallel" architectures have yet to be adopted for general computation for two main reasons: First, they do not execute sequential code efficiently. Second, writing programs for micro-parallel execution is an arcane art far removed from the experience of most programmers.
The first problem has been addressed by a hybrid computing model that integrates a sequential processor with a spatial fabric such as an FPGA. Such hybrid processors have become prevalent with recent FPGAs like the Altera Stratix and the Xilinx Virtex. While hybrid computers address the issue of combining sequential and spatial computation, integrating sequential and spatial code in a single application, and especially programming the spatial fabric, remains challenging. Part of the difficulty lies in the lack of an agreed-upon computational model and family of programming languages.
To address this challenge, we have developed a type architecture¹ that extends the familiar von Neumann model to include the micro-parallel engine in hybrid architectures. This hybrid type architecture provides an abstraction for programmers that allows them to understand the essential features and constraints of the underlying hybrid system without being overwhelmed by second-order details.
In addition to a mental model, programmers need a language that allows them to address key features of the model. We propose a language based on C called Macah, with extensions that reflect our proposed type architecture. Programmers can take advantage of the features of the type architecture through the language extensions that Macah provides.
We use "micro-parallel" to describe both computations and architectures. The defining features of micro-parallel computations are:
Note that programs can have several micro-parallel sections embedded within control-flow dominated code, which drives the need for hybrid architectures. The defining features of micro-parallel architectures are:
Micro-parallel computations can be mapped to efficient spatial implementations whose static structure reflects the structure of the computation: Parallel operations are executed by different, dedicated, functional units, communication between operations is performed by dedicated wires and registers, and repetition means that the same structure can be reused many times for different data.
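The mapping described above can be made concrete with a small sketch in plain C (not Macah). It models a finite impulse response (FIR) filter as a fixed datapath. The filter, the tap count `NUM_TAPS`, and the names `fir_datapath`, `fir_tick`, and `fir_run` are all assumptions chosen for illustration; none of them comes from the original text. The loops are sequential C, but the comments mark what would plausibly become dedicated functional units, dedicated registers and wires, and reuse of the same structure in a spatial implementation.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TAPS 4 /* hypothetical filter size, chosen only for illustration */

/* Static structure of the datapath: one register per tap (a delay line). */
typedef struct {
    int delay[NUM_TAPS];
} fir_datapath;

/* Clear every register in the delay line. */
void fir_reset(fir_datapath *dp) {
    for (size_t i = 0; i < NUM_TAPS; i++) {
        dp->delay[i] = 0;
    }
}

/* Model one clock tick of the datapath and return the filter output. */
int fir_tick(fir_datapath *dp, const int coeff[NUM_TAPS], int sample) {
    /* Communication: each register hands its value to its neighbour,
       which in hardware would be a dedicated wire between registers. */
    for (size_t i = NUM_TAPS - 1; i > 0; i--) {
        dp->delay[i] = dp->delay[i - 1];
    }
    dp->delay[0] = sample;

    /* Parallel operations: each iteration stands for its own multiplier;
       the running sum stands for an adder chain or tree. */
    int sum = 0;
    for (size_t i = 0; i < NUM_TAPS; i++) {
        sum += coeff[i] * dp->delay[i];
    }
    return sum;
}

/* Repetition: the same structure is reused for every input sample. */
void fir_run(const int coeff[NUM_TAPS], const int *in, int *out, size_t n) {
    assert(in != NULL && out != NULL);
    fir_datapath dp;
    fir_reset(&dp);
    for (size_t t = 0; t < n; t++) {
        out[t] = fir_tick(&dp, coeff, in[t]);
    }
}
```

Feeding a unit impulse (a 1 followed by zeros) through `fir_run` returns the coefficients one per tick, which is a quick way to check that the delay line shifts in the intended direction.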
Computations that run efficiently on systolic arrays are good examples of micro-parallel computations and systolic arrays are good examples of micro-parallel engines. FPGAs, as well as various FPGA cousins (for example RaPiD and PipeRench), are "configurable" micro-parallel engines that can be restructured dynamically to execute different micro-parallel computations. The vector unit of a vector processor might also be considered a micro-parallel engine, but we do not consider the datapath of a superscalar processor to be a micro-parallel engine, because the number of function units is relatively small and cannot be dedicated, the communication between function units is not scalable, and such processors utilize significant control resources to maintain the appearance that instructions execute sequentially.
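As a rough illustration of the systolic style mentioned above, the following C sketch simulates a linear array of `N` cells computing a matrix-vector product y = A x, cycle by cycle. This is one common textbook arrangement, not a design taken from the original text. The array size `N`, the `cell` structure, and the function `systolic_matvec` are hypothetical names, and the matrix is assumed to be stored row-major so that cell i can read its own row locally. The elements of x enter at cell 0 and move one cell to the right per cycle, while each cell keeps its own accumulator.

```c
#include <assert.h>
#include <stddef.h>

#define N 3 /* hypothetical array size: N cells for an N x N matrix */

/* One processing element of the linear array. */
typedef struct {
    int x_reg;   /* element of x currently held by this cell */
    int x_valid; /* nonzero when x_reg holds real data */
    int acc;     /* partial result y[i], kept local to cell i */
} cell;

/* Simulate the array; a is an N x N matrix in row-major order. */
void systolic_matvec(const int *a, const int *x, int *y) {
    assert(a != NULL && x != NULL && y != NULL);
    cell cells[N];
    for (size_t i = 0; i < N; i++) {
        cells[i].x_reg = 0;
        cells[i].x_valid = 0;
        cells[i].acc = 0;
    }

    /* The last element of x reaches the last cell at cycle 2N - 2. */
    for (size_t t = 0; t < (size_t)(2 * N - 1); t++) {
        /* Shift phase: walk right to left so that every cell latches
           the value its left neighbour held in the previous cycle. */
        for (size_t i = N - 1; i > 0; i--) {
            cells[i].x_reg = cells[i - 1].x_reg;
            cells[i].x_valid = cells[i - 1].x_valid;
        }
        cells[0].x_valid = (t < N);
        cells[0].x_reg = (t < N) ? x[t] : 0;

        /* Compute phase: conceptually all cells fire at once; each one
           performs a single multiply-accumulate on local data. */
        for (size_t i = 0; i < N; i++) {
            if (cells[i].x_valid) {
                size_t j = t - i; /* cell i holds x[t - i] in cycle t */
                cells[i].acc += a[i * N + j] * cells[i].x_reg;
            }
        }
    }

    for (size_t i = 0; i < N; i++) {
        y[i] = cells[i].acc;
    }
}
```

Each cell communicates only with its immediate neighbour and repeats the same operation every cycle, which is the property the text attributes to micro-parallel engines. The two-phase loop is only a simulation convenience; in hardware the shift and the multiply-accumulate would happen on the same clock edge.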
There is an important distinction between micro-parallelism and task-, process-, or thread-level parallelism². Algorithms that can be implemented to take advantage of micro-parallelism can, in most cases, be implemented in a task-parallel style as well. However, multiprocessor architectures have so much overhead relative to micro-parallel architectures that, for equivalent performance, a micro-parallel implementation is much cheaper in terms of both dollars and energy. In some cases, the close, fine-grained communication between the operations in a micro-parallel computation precludes the efficient use of a task-parallel architecture like a multiprocessor. Many computations exhibit both task- and micro-parallelism, and for those it is appropriate to build a multiprocessor from the hybrid compute nodes that we describe in this paper. Architectures such as the Cray XD1, SCORE, and Merrimac focus on both task- and micro-parallelism.
¹ A type architecture is an abstract model of a family of computers.
² Micro-parallelism encompasses traditional instruction-level parallelism (ILP).
UW Embedded Research Group. Last modified: Wed Jun 7 16:38:57 PDT 2006