|
|
|||||||||||||
|

Kernel accelerator architectures have at least an order of magnitude advantage over conventional sequential architectures in terms of both performance and energy efficiency on kernels from a wide variety of applications. However, kernel accelerators are significantly harder to program than sequential processors, and this programming gap has contributed to their relegation to research prototypes and small niche products. At the University of Washington, we are working on a new C-level programming language, called Macah, that will make it much easier to write efficient, predictable, concise, and portable programs for kernel accelerator architectures.
In designing Macah, we are attempting to achieve the same kind of balance between abstraction from the underlying hardware and exposure of implementation details that C does for sequential processors. The balance that C represents not only allows C to be compiled relatively easily and efficiently to a wide range of sequential processors (a feat achieved by almost all programming languages in wide use today), but also gives C programmers a relatively clear view of how their program is executed. This transparency gives programmers a great deal of control over the behavior and performance of their programs. Somewhat paradoxically, this control promotes a useful and less widely recognized kind of portability: compiler portability. Because C compilers and runtime systems have less freedom than, say, Java or SML compilers, C programmers can be relatively confident that their programs will behave and perform similarly on most platforms ("platform" including architecture, compiler and runtime software). These two kinds of portability make C an extremely useful language, both for programmers who want more control over how their programs are implemented and for compilers of more abstract languages.
Probably the most important guiding principle in deciding how much abstraction is "just enough" is that we should abstract away an aspect of the implementation if we know or can develop algorithms that can fill in the details without making many (if any) non-trivial trade-offs. Non-trivial trade-offs are best left to the programmer, because humans tend to be much better than computers at making those kinds of decisions. For orientation, here is how we see some of C's features breaking down into features that abstract away architectural details and features expose important implementation issues.
| Abstract | Exposed Implementation |
|
|
Notice that the abstraction features correspond neatly to the major components of a typical C compiler backend. These features are exactly those whose implementation the designers of C felt could be efficiently and reliably automated. Notice also that a great deal of language design effort has gone into making new languages that expose fewer implementation details than C, in order to make programming faster, easier and safer. Nonetheless, C still remains a popular language, which I believe is a testament to the strength of the "local optimum" in the abstraction-portability-performance-reliability space that C represents.
Here is a high-level look at how we have decided to strike the abstraction-implementation exposure balance in Macah.
| Abstract | Exposed Implementation |
|
|
We believe that a promising path to a high performance, reliable, predictable programming systems for micro-parallel architectures is through the development of languages designed for easy expression and analysis of micro-parallelism. Macah is our language that will serve as a laboratory for micro-parallel language features and semantics. It is intended to be for micro-parallel architectures what C is for conventional sequential architectures-it abstracts away details like the particulars of the state and routing resources, but gives the programmer the kind of control that C does over issues like the layout of data structures in memory.
We propose a familiar programming language extended with features that allow the programmer to conveniently describe the order of execution and the data accesses required. As long as the programmer exposes the parallelism in the program and meets the constraints of the model, the compiler should be able to generate the details of the micro-parallel execution. However, if the program is written without regard to what is possible in the model, even a heroic compiler will, in general, be unable to find the appropriate execution order.
Just as the HMP architecture is an extension of the von Neumann machine, our language, called Macah, is a variation on C. It is very similar in spirit to the genealogically related languages NAPA C, Streams-C, and Impulse C. In contrast to those languages, however, Macah includes features for micro-parallelism in the language itself, as opposed to adding special libraries to plain C. Boehm has argued that it is impossible to correctly implement thread-level parallelism as a library, and we speculate that the same is true of micro-parallelism. Also, particularly in Streams-C and Impulse C, processes are emphasized as a critical feature. Though we intend to introduce process-, thread- or task-level concurrency in the future, it is not currently a part of Macah.
Most programming systems (i.e., languages, compilers, libraries) for micro-parallel engines fall into one of three categories. First, it is possible to design compilers for conventional languages, like C and Fortran, that target micro-parallel architectures. The and Cash compiler are excellent examples of this approach. This approach has worked well for vector computers, but has not been widely adopted for more flexible and complicated architectures, like FPGAs. Alternatively, one can force programmers to use languages like Verilog and VHDL that typically demand a great deal of understanding of the details of micro-parallel execution (e.g. pipelining), and perhaps an understanding of the micro-architectural details of particular computers. Finally, many researchers have explored the use of programming languages that to a greater extent than C or Java allow programmers to specify \textit{what} to compute, as opposed to how to compute it, such as SA-C or pH. Compilers for such languages have more freedom to automatically parallelize code, but may not produce implementations as efficient as a programmer with more low-level control could achieve in many cases.
Hybrid computers introduce the added complexity of deciding which parts of a program run where, and how communication between the components is specified. Again, because automatic partitioning is difficult, and manual programming of low-level interfaces is tedious and error-prone, we advocate a style in which the programmer is explicit about the most crucial details of sequential{\slash}micro-parallel interaction and communication.
Though all of these approaches offer significant benefits, we believe that there is an important void that in the sequential domain is filled primarily by C. C's enormous and enduring popularity can be attributed, at least in part, to the fact that C gives programmers the power to explicitly manage important resources, like memory, and a relatively high degree of predictability. On the other hand, C does abstract away some architecture-specific details, such as register file structure and instruction set idiosyncrasies. By reflecting the HMP architecture, in the way that C reflects the von Neumann machine, Macah will function as an ``abstract assembly'' for hybrid architectures.
We will introduce each of Macah's relevant features in the context of the stripped-down Smith-Waterman code in Figure \ref{fig:SWMacah}.
Micro-parallel engines can be extremely sensitive to long latencies, such as the ever-increasing (in terms of processor cycles) latency to main memory. Luckily, micro-parallel algorithms usually have predictable memory access patterns. These two facts led to link between the main memory and the workspace memory in the HMP model, and also \textit{memory streams} in Macah.
A memory stream is defined by an accessor function that reads from some data structure and sends data out or receives data and writes it into some data structure. The main program can spawn memory streams and then access them through handles returned by the spawning functions. This notion of streams is similar to that found in Stream-C.
The functions defined on lines 3 and 8 will be used as memory stream accessor functions. The first argument to both functions is a stream handle. On lines 5, 6 and 9, the stream receive/send operator (\texttt{<|}) is used to send to or receive from a stream. On lines 26 through 29, the memory streams are spawned by calling the appropriate \texttt{spawn} function with a reference to the accessor function and the actual arguments for the accessor function. On line 82, the main program waits for the memory streams to finish and recovers resources allocated to them.
In the future, we plan to extend Macah with general thread-, process-, or task-parallelism, at which point the streams will be extended to first-in, first-out, unidirectional channels between two different threads, similar to those in StreamIt.
Data that needs to be accessed often or with low latency by code running on a micro-parallel engine should be allocated into local memory\footnote{We do not mean local in the standard C sense of function-local, but rather local to the micro-parallel engine, that is, in the workspace memory.}. Up to this point it has been sufficient for us to simply allocate all variables accessed within a {\mpar} block into local memory. It may be necessary in the future to give the programmer more control of where particular variables are allocated, but we have not yet seen a need.
To encourage programmers to use efficient data access patterns, and to ease compilation, we have introduced a new primitive data type called a shiftable array. Shiftable arrays are similar to conventional arrays, but also support a shift operator that takes a shiftable array $A$ and an integer $i$ as arguments and moves each element of $A$ at some index $j$ to index $j + i$; similar constructs were presented in Silage. Shiftable arrays can be implemented efficiently on micro-parallel engines as FIFOs or chains of connected registers. Examples of a shiftable array declarations can be seen on lines 17, 19, and 24. On lines 46 through 48, the \texttt{rotate} and \texttt{last} macros, which use the subscript and shift operations, are used to access shiftable arrays. In sequential mode, shiftable arrays can be implemented efficiently by simply storing an offset with each shiftable array that is used in address computations and updated by shift operations.
The {\mpar} keyword before the block that begins on line 38 serves two roles. The statement following a {\mpar}, which can be an entire block, is assumed to run on the micro-parallel engine. In this role {\mpar} is simply a hint; implementations are free to run any part of a program on the micro-parallel engine, as long as the meaning is preserved. However, this construct gives programmers a way to guide the partitioning process.
The second issue \mpar was introduced to address is more semantically significant. A fundamental complication arises whenever code written in a sequential language is executed in a micro-parallel fashion. In order to implement constructs such as loops efficiently in a micro-parallel engine, they must be pipelined. This pipelining means that the beginning of ``later'' iterations will execute \textit{before} the end of ``earlier'' iterations. This reordering does not create any difficulties, as long as the system can identify and handle all of the inter-iteration dependences. Unfortunately, in general, identifying where all inter-iteration dependences are is undecidable. Aliasing--two different names referring to the same object--is a particularly well-known and well-studied, though in general unsolvable problem that complicates dependence analysis. Vectorizing compilers demonstrate both the impressive transformations that are possible, and also the fact that there are limitations on what code can be automatically transformed.
We have experimented with several different mechanisms for allowing programmers to give compilers more information about dependences, such as which variables may or may not alias, but have not yet arrived at a definition that seems ``right''. Explicitly specifying exactly which statements in which iterations of which loops depend on other statements in previous iterations would quickly make the code unreadable--and likely unwritable in the first place. Programmers may be asked to explicitly state that memory accessed through certain variables or streams will never overlap. We have also explored systems inspired by dataflow machines, in which programmers have to annotate expressions and statements with ordering constraints in certain cases. While uncertain about the exact details, we know that {\mpar} blocks will relax execution order and simplify alias analysis.
A good rule of thumb in Macah is that each expression or statement within a {\mpar} block corresponds to a unique computational resource in the micro-parallel engine. This rule can be violated by implementations that automatically apply transformations like loop unrolling or time-multiplexing, and though these transformations can be convenient, it can also be desirable for the programmer to have control over how many resources are used. This rule suggests that if a programmer wants to instantiate several multipliers and adders to compute, say, a matrix multiplication in a micro-parallel fashion, he or she is forced to write an unreasonable number of more \texttt{*}'s and \texttt{+}'s.
In order to reconcile our rule of thumb with programs brevity, we provide some metaprogramming facilities. Metaprogramming is the programmatic generation of programs; C's preprocessor macros are a particularly simple example. The only metaprogramming construct that we have used to this point is {\FOR}, which looks like a standard \texttt{for} loop, but is automatically and completely unrolled by our compiler. If compile-time unrolling is not possible (for example, if the loop bound is dependent on some value that cannot be known until run-time), an error is generated. In the future, we plan to explore more general metaprogramming features, as found in languages like `C.
{\FOR} is used in the S-W code on line 49. The body of this loop is the core of S-W algorithm, and \texttt{TILESZ} determines how many entries of the S-W table will be computed in parallel. Notice that the \texttt{for} loop that begins on line 39, and iterates over the columns of the table, proceeds by increments of \texttt{TILESZ}, not \texttt{1}. The diligent reader may notice that there are data dependences between adjacent iterations of the {\FOR} loop (for example, each iteration reads \texttt{E[k-1]} and writes to \texttt{E[k]}), and may be concerned that these iterations will not, in fact, be able to run in parallel. As explained in section \ref{sec:sched}, the compiler will automatically pipeline code, and will therefore enable all of the iterations of the {\FOR} loop to work concurrently.
|
UW
Embedded Research Group Last modified: Wed Jun 7 16:38:57 PDT 2006 |