

Dataflow machine

- good at exploiting ILP (dataflow parallelism)
- also traditional coarser-grain parallelism
  - cheap thread management
- · low operand latency because of a hierarchical organization
- memory ordering enforced through wave-ordered memory
  - no special dataflow languages















































## Multithreading the WaveCache

Architectural-support for WaveScalar threads

- instructions to start & stop memory orderings, i.e., threads
- memory-free synchronization to allow exclusive access to data (thread communicate instruction)
- fence instruction to force all previous memory operations to fully execute (to allow other threads to see the results of this one's memory ops)

Combine to build threads with multiple granularities

- coarse-grain threads: 25-168X over a single thread; 2-16X over CMP, 5-11X over SMT
- · fine-grain, dataflow-style threads: 18-242X over single thread
- a demonstration that one can combine the two in the same application (equake): 1.6X or 7.9X -> 9X











## **Building WaveScalar**

**RTL-level** implementation

- · some didn't believe it could be built in a normal-sized chip
- some didn't believe it could achieve a decent cycle time and loaduse latencies
- Verilog & Synopsis CAD tools

Different WaveCache's for different applications

- 1 cluster: low-cost, low power, single-thread or embedded
  - 42 mm<sup>2</sup> in 90 nm process technology, 2.2 AIPC on Splash2
- 16 clusters: multiple threads, higher performance: 378 mm<sup>2</sup>, 15.8 AIPC

Board-level FPGA implementation

OS & real application simulations

