## WaveScalar [MICRO 03]

## **WaveScalar**

#### Dataflow machine

- good at exploiting ILP
- dataflow parallelism + traditional coarser-grain parallelism
  - cheap thread management
- low operand latency because of a hierarchical organization
- memory ordering enforced through wave-ordered memory
  - no special languages

- Additional motivation:
  - increasing disparity between computation (fast transistors) & communication (long wires)
  - increasing circuit complexity
  - decreasing fabrication reliability

Spring 2006 471

#### **Monolithic von Neumann Processors**



A phenomenal success today. But in 2016?

- Performance: centralized processing & control, e.g., operand broadcast networks
- Complexity: 40-75% of "design" time is design verification
- Defect tolerance: 1 flaw -> paperweight

#### **WaveScalar's Microarchitecture**

Good performance via distributed microarchitecture

- hundreds of PEs
- dataflow execution, no centralized control
- short point-to-point communication
- organized hierarchically for fast communication between neighboring PEs
- scalable

Low design complexity through simple, identical PEs

- design one & stamp out thousands

Defect tolerance

- route around a bad PE


## Processing Element

- Simple, small (0.5M transistors)
- 5-stage pipeline (receive input operands, match tags, schedule instruction, execute, send output)
- Holds 64 (decoded) instructions
- 128-entry token store
- 4-entry output buffer
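
The tag-matching step of the pipeline can be illustrated with a toy model. This is a minimal sketch, not the actual hardware: the class `ProcessingElement`, its method names, and the tuple used as a tag are hypothetical, and only the two capacities (64 instructions, 128 token-store entries) come from the slide. An instruction fires once every operand carrying the same tag has arrived.

```python
from collections import defaultdict

MAX_INSTRUCTIONS = 64       # from the slide
TOKEN_STORE_ENTRIES = 128   # from the slide


class ProcessingElement:
    """Toy PE: an instruction fires when all operands with one tag are present."""

    def __init__(self):
        self.instructions = {}                 # instr_id -> (function, arity)
        self.token_store = defaultdict(dict)   # (instr_id, tag) -> {port: value}
        self.tokens_held = 0

    def load(self, instr_id, func, arity):
        if len(self.instructions) >= MAX_INSTRUCTIONS:
            raise RuntimeError("instruction store full")
        self.instructions[instr_id] = (func, arity)

    def receive(self, instr_id, tag, port, value):
        """Store one operand token; return the result if the instruction fires."""
        func, arity = self.instructions[instr_id]
        if self.tokens_held >= TOKEN_STORE_ENTRIES:
            raise RuntimeError("token store full")
        slot = self.token_store[(instr_id, tag)]
        if port in slot:
            raise ValueError("duplicate operand for this tag and port")
        slot[port] = value
        self.tokens_held += 1
        if len(slot) < arity:
            return None                        # still waiting for a matching operand
        del self.token_store[(instr_id, tag)]  # matched: free the entries and execute
        self.tokens_held -= arity
        return func(*(slot[p] for p in range(arity)))


pe = ProcessingElement()
pe.load("add", lambda a, b: a + b, 2)
first = pe.receive("add", tag=("thread0", 3), port=0, value=5)     # waits
other = pe.receive("add", tag=("thread0", 4), port=1, value=100)   # different tag, no match
result = pe.receive("add", tag=("thread0", 3), port=1, value=7)    # matches the first token
```

Here `result` is the sum of the two operands tagged `("thread0", 3)`, while the operand tagged `("thread0", 4)` stays in the token store until its partner arrives.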



# PEs in a Pod



- Back-to-back producer-consumer execution across PEs
- Relieve congestion on intradomain bus


## **Domain**




## **Cluster**




### **WaveScalar Processor**

Long-distance communication

- dynamic routing
- grid-based network
- 2-cycle hop/cluster
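
As a rough illustration of what a 2-cycle hop means for long-distance communication, the sketch below estimates the network latency between two clusters. It assumes a minimal (Manhattan-distance) path with no contention and a 4x4 arrangement of 16 clusters; the function name and the coordinate scheme are hypothetical, and real dynamic routing may take other paths.

```python
CYCLES_PER_HOP = 2  # from the slide: 2-cycle hop per cluster


def intercluster_latency(src, dst, cycles_per_hop=CYCLES_PER_HOP):
    """Cycles to travel between clusters at (x, y) grid coordinates.

    Assumes a shortest path on the grid and no congestion.
    """
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return hops * cycles_per_hop


# Corner to corner on an assumed 4x4 grid: 3 hops in x plus 3 hops in y.
corner_to_corner = intercluster_latency((0, 0), (3, 3))
neighbor = intercluster_latency((1, 1), (1, 2))
```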


## Whole Chip

- Can hold 32K instructions
- Normal memory hierarchy
- Traditional directory-based cache coherence
- ~400 mm<sup>2</sup> in 90 nm technology
- 1 GHz
- ~85 watts
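
The 32K-instruction capacity is consistent with 64 instructions per PE once the PE count is known. The short calculation below is a sketch under stated assumptions: the per-pod, per-domain, and per-cluster counts are not given on these slides and are assumed here for illustration, while the 16-cluster configuration and the 64 instructions per PE are taken from the slides.

```python
PES_PER_POD = 2            # assumption, not stated on the slides
PODS_PER_DOMAIN = 4        # assumption, not stated on the slides
DOMAINS_PER_CLUSTER = 4    # assumption, not stated on the slides
CLUSTERS = 16              # from the slides (16-cluster configuration)
INSTRUCTIONS_PER_PE = 64   # from the slides

total_pes = PES_PER_POD * PODS_PER_DOMAIN * DOMAINS_PER_CLUSTER * CLUSTERS
capacity = total_pes * INSTRUCTIONS_PER_PE

print(total_pes, capacity)
```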




# WaveScalar Instruction Placement




## **Instruction Placement Trade-offs**






#### WaveScalar Instruction Placement

Place instructions in PEs to maximize data locality & instruction-level parallelism.

- Instruction placement algorithm based on a performance model that captures the important performance factors [SPAA 06]
- Depth-first traversal of the dataflow graph to make chains of dependent instructions
- Chains are broken into segments [ASPLOS 06]
- Snakes segments across the chip on demand
- K-loop bounding to prevent instruction "explosion"
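
The chain-building and segmenting steps can be sketched as follows. This is an illustrative approximation, not the published placement algorithm: the function names, the adjacency-list graph format, and the `segment_size` parameter are hypothetical, and the sketch leaves out the performance model, the snaking of segments across the chip, and k-loop bounding.

```python
def build_chains(graph, roots):
    """Depth-first traversal that groups dependent instructions into chains.

    graph maps each instruction to the list of instructions that consume its result.
    """
    visited, chains = set(), []
    stack = list(reversed(roots))
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        chain = []
        while node is not None:
            visited.add(node)
            chain.append(node)
            successors = [s for s in graph.get(node, []) if s not in visited]
            # Follow the first unvisited consumer; the others start new chains later.
            stack.extend(reversed(successors[1:]))
            node = successors[0] if successors else None
        chains.append(chain)
    return chains


def split_into_segments(chain, segment_size):
    """Cut one chain into consecutive pieces of at most segment_size instructions."""
    return [chain[k:k + segment_size] for k in range(0, len(chain), segment_size)]


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
chains = build_chains(graph, roots=["a"])
segments = [split_into_segments(chain, segment_size=3) for chain in chains]
```

Each chain is a producer-to-consumer path, so placing a segment on one PE (or on neighboring PEs) keeps dependent instructions close together.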

## Example to Illustrate the Memory Ordering Problem

Source code:

- `A[j + i*i] = i;`
- `b = A[i*j];`

[Figure: dataflow graph of the two statements, with inputs A, j, and i feeding multiply and add nodes, a Load that produces b, and a Store.]
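
In the dataflow graph, no data dependence connects the store for `A[j + i*i] = i` to the load for `b = A[i*j]`, so a pure dataflow machine could fire either one first. The sketch below shows why that matters when the two addresses happen to alias. The helper `run`, the dictionary used as memory, and the chosen values of `i` and `j` are illustrative assumptions.

```python
def run(order, i, j, memory):
    """Execute the store and the load in the given order and return b."""
    mem = dict(memory)            # copy so each run starts from the same state
    b = None
    for op in order:
        if op == "store":
            mem[j + i * i] = i    # A[j + i*i] = i
        elif op == "load":
            b = mem.get(i * j)    # b = A[i*j]
    return b


i, j = 2, 4                       # chosen so that j + i*i == i*j == 8
initial = {8: -1}                 # old contents of A[8]

program_order = run(["store", "load"], i, j, initial)
load_first = run(["load", "store"], i, j, initial)
```

With these values `program_order` sees the newly stored value, while `load_first` returns the stale contents of `A[8]`, which is the behavior a memory ordering mechanism has to rule out.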








#### Wave-ordered Memory
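
One way to picture wave-ordered memory is as a buffer that receives memory operations in any order and issues them in program order by following links between them. The sketch below is a simplified model under stated assumptions: each operation carries its own sequence number plus its predecessor's and successor's numbers when they are statically known, with `None` standing in for a link that is unknown because of a branch. The `MemOp` class, the function name, and the `first_seq` parameter are hypothetical, and details such as multiple waves are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemOp:
    name: str
    pred: Optional[int]   # sequence number of the previous memory op; None = unknown
    seq: int              # this op's sequence number within the wave
    succ: Optional[int]   # sequence number of the next memory op; None = unknown


def issue_in_order(arrivals, first_seq=0):
    """Buffer ops arriving in any order; issue them along the pred/succ chain."""
    pending, issued, last = [], [], None
    for op in arrivals:
        pending.append(op)
        progressed = True
        while progressed:
            progressed = False
            for cand in pending:
                if last is None:
                    ready = cand.seq == first_seq
                else:
                    # Either link is enough to prove cand directly follows last.
                    ready = cand.pred == last.seq or last.succ == cand.seq
                if ready:
                    pending.remove(cand)
                    issued.append(cand.name)
                    last = cand
                    progressed = True
                    break
    return issued


# Ops 0 and 1, then a branch whose taken path holds op 2, then a join at op 4.
arrivals = [
    MemOp("join_store", pred=None, seq=4, succ=None),
    MemOp("taken_load", pred=1, seq=2, succ=4),
    MemOp("first_load", pred=None, seq=0, succ=1),
    MemOp("store", pred=0, seq=1, succ=None),
]
order = issue_in_order(arrivals)
```

Although the operations arrive out of order, they are issued as first_load, store, taken_load, join_store, and the operation on the untaken path (sequence number 3) is never waited for.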







## WaveScalar Tag-matching



## Single-thread Performance



## Single-thread Performance per Area



# Multithreading the WaveCache

Architectural support for WaveScalar threads

- instructions to start & stop memory orderings, i.e., threads
- memory-free synchronization to allow exclusive access to data (thread communicate instruction)
- fence instruction to allow other threads to see this one's memory ops

Combine to build threads with multiple granularities

- coarse-grain threads: 25-168X over a single thread; 2-16X over CMP, 5-11X over SMT
- fine-grain, dataflow-style threads: 18-242X over single thread
- combine the two in the same application: 1.6X or 7.9X -> 9X


## **Creating & Terminating a Thread**



## **Thread Creation Overhead**



## Performance of Coarse-grain Parallelism



# CMP Comparison




#### Performance of Fine-grain Parallelism

Relies on:

- Cheap synchronization
- Load once, pass data (not load/compute/store)



## **Building the WaveCache**

RTL-level implementation [ISCA 06]

- some didn't believe it could be built in a normal-sized chip
- some didn't believe it could achieve a decent cycle time and load-use latencies
- Verilog & Synopsys CAD tools

Different WaveCaches for different applications

- 1 cluster: low-cost, low power, single-thread or embedded
  - 42 mm<sup>2</sup> in 90 nm process technology, 2.2 AIPC on Splash2
- 16 clusters: multiple threads, higher performance: 378 mm<sup>2</sup>, 15.8 AIPC

Board-level FPGA implementation

- OS & real application simulations
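
The 1-cluster and 16-cluster figures above can be compared on a performance-per-area basis. The snippet below is only arithmetic on the numbers quoted on this slide; the dictionary layout and variable names are illustrative.

```python
configs = {
    "1 cluster": {"area_mm2": 42, "aipc": 2.2},
    "16 clusters": {"area_mm2": 378, "aipc": 15.8},
}

# AIPC delivered per square millimeter for each configuration.
aipc_per_mm2 = {name: c["aipc"] / c["area_mm2"] for name, c in configs.items()}

# How much area and throughput grow when moving from 1 to 16 clusters.
area_ratio = configs["16 clusters"]["area_mm2"] / configs["1 cluster"]["area_mm2"]
aipc_ratio = configs["16 clusters"]["aipc"] / configs["1 cluster"]["aipc"]

for name, value in aipc_per_mm2.items():
    print(f"{name}: {value:.3f} AIPC per mm^2")
print(f"area grows {area_ratio:.1f}x, AIPC grows {aipc_ratio:.1f}x")
```

On these numbers the larger design uses 9 times the area for roughly 7 times the AIPC, so the single-cluster configuration is somewhat more area-efficient while the 16-cluster one delivers far higher absolute throughput.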


### Compiling for the WaveCache

Eliminating dataflow control flow instructions [PACT 06]
