### **WaveScalar: the Executive Summary**

### A modern dataflow machine

- solves the language & memory ordering issues
- solves the scalability issue

### The executive summary:

- good at exploiting ILP (dataflow parallelism)
- also traditional coarser-grain parallelism
  - cheap thread management
- low operand latency because of a hierarchical PE-interconnect organization
- memory ordering enforced through wave-ordered memory
  - can execute imperative language programs
  - no special dataflow languages

Spring 2011 CSE 471 - WaveScalar

### **WaveScalar**

Motivation stems from shrinking feature sizes:

- increasing disparity between computation (fast transistors) & communication (long wires)
- increasing circuit complexity
- decreasing fabrication reliability


### **Monolithic von Neumann Processors**



A success a few years ago. But in 2016?

Performance

- centralized processing & control
- long wires, e.g., operand broadcast networks

Complexity

- 40-75% of "design" time is design verification

Defect tolerance

- 1 flaw -> the chip becomes a tie pin, earrings, ...


### **WaveScalar's Microarchitecture**

Good performance via distributed microarchitecture

- hundreds of PEs
- organized hierarchically for fast communication between neighboring PEs
- short point-to-point (producer to consumer) operand communication
- dataflow execution, no centralized control
- consequently scalable

Low design complexity through simple, identical PEs

- design one & stamp out hundreds

Defect tolerance

- route around a bad PE

# **Processing Element**



- Simple, small (0.5M transistors)
- 5-stage pipeline (receive input operands, match tags, instruction issue, execute, send output)
- Holds 64 (decoded) instructions
- 128-entry token store
- 4-entry output buffer


# PEs in a Pod



- Share operand bypass network
- Back-to-back producer-consumer execution across 2 PEs






# WaveScalar Processor

- Long distance communication
  - grid-based network
  - 2-cycle hop/cluster
  - dynamic routing
- Can hold 32K instructions
- Normal memory hierarchy
- Traditional directory-based cache coherence
- ~400 mm² in 90 nm technology
- 1 GHz
- ~85 watts
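The capacity figures on these slides can be cross-checked with a little arithmetic. The sketch below assumes that "32K" means 32 × 1024, and that the 32K-instruction processor is the 16-cluster design listed under "Building WaveScalar". Neither assumption is stated on the slides. The per-PE and per-pod numbers come from the Processing Element and Pod slides.

```python
# Figures taken from the slides.
INSTRUCTIONS_PER_PE = 64
PES_PER_POD = 2

# Assumptions, not stated on the slides: 32K = 32 * 1024, and the
# 32K-instruction processor is the 16-cluster configuration.
PROCESSOR_INSTRUCTIONS = 32 * 1024
CLUSTERS = 16

# Derived quantities.
total_pes = PROCESSOR_INSTRUCTIONS // INSTRUCTIONS_PER_PE
pes_per_cluster = total_pes // CLUSTERS
pods_per_cluster = pes_per_cluster // PES_PER_POD

print(f"PEs in the processor: {total_pes}")
print(f"PEs per cluster:      {pes_per_cluster}")
print(f"Pods per cluster:     {pods_per_cluster}")
```

Under these assumptions the result is 512 PEs, which is consistent with the "hundreds of PEs" claim on the microarchitecture slide.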





















# **WaveScalar Tag-matching**

### WaveScalar tag

- thread identifier
- wave number

Token: tag & value
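To make the matching rule concrete, here is a minimal Python sketch of tag matching for a single two-input instruction. It is an illustration, not the WaveScalar hardware design: the names `Tag`, `Token`, `Instruction` and `receive` are hypothetical, and the dictionary stands in for the PE's token store. An instruction fires only once every input port holds a token carrying the same (thread identifier, wave number) tag.

```python
from collections import defaultdict
from typing import Callable, NamedTuple, Optional


class Tag(NamedTuple):
    thread_id: int
    wave_number: int


class Token(NamedTuple):
    tag: Tag
    value: int


class Instruction:
    """Hypothetical dataflow instruction with a per-tag operand store."""

    def __init__(self, op: Callable[..., int], num_inputs: int):
        self.op = op
        self.num_inputs = num_inputs
        # Stand-in for the token store: tag -> {input port -> value}.
        self.waiting = defaultdict(dict)

    def receive(self, port: int, token: Token) -> Optional[Token]:
        """Store an operand; fire if all operands with this tag are present."""
        slots = self.waiting[token.tag]
        slots[port] = token.value
        if len(slots) < self.num_inputs:
            return None  # still waiting for a matching operand
        del self.waiting[token.tag]
        result = self.op(*(slots[p] for p in range(self.num_inputs)))
        # The output token carries the same tag as its inputs.
        return Token(token.tag, result)


add = Instruction(lambda a, b: a + b, num_inputs=2)
t0w0 = Tag(thread_id=0, wave_number=0)
t0w1 = Tag(thread_id=0, wave_number=1)

first = add.receive(0, Token(t0w0, 3))    # no partner yet
second = add.receive(1, Token(t0w1, 10))  # different wave: no match
third = add.receive(1, Token(t0w0, 4))    # same tag as the first: fires
```

Here `first` and `second` are `None`, while `third` is a token tagged (thread 0, wave 0) with value 7. The wave-1 operand stays in the store until its partner arrives.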






# **Multithreading the WaveCache**

Architectural support for WaveScalar threads

- instructions to start & stop memory orderings, i.e., threads
- memory-free synchronization to allow exclusive access to data (thread communicate instruction)
- "barrier" instruction to force all previous memory operations to fully execute (to allow other threads to see the results of this one's memory ops)


# **Creating & Terminating a Thread**




# **Multithreading the WaveCache**

Combine to build threads with multiple granularities

- coarse-grain threads: 25-168X over a single thread; 2-16X over CMP, 5-11X over SMT
- fine-grain, dataflow-style threads: 18-242X over single thread
- a demonstration that one can combine the two in the same application (equake): 1.6X or 7.9X with either style alone -> 9X combined


### **Building WaveScalar**

### RTL-level implementation

- some didn't believe it could be built in a normal-sized chip
- some didn't believe it could achieve a decent cycle time and load-use latencies
- Verilog & Synopsys CAD tools

### Different WaveScalar designs for different applications

- 1 cluster: low-cost, low power, single-thread or embedded
  - 42 mm² in 90 nm process technology, 2.2 AIPC on Splash2
- 16 clusters: multiple threads, higher performance: 378 mm², 15.8 AIPC
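As a rough worked comparison of the two design points, the sketch below computes AIPC per mm² and the scaling ratios from the figures above. It assumes the 15.8 AIPC figure was measured on the same Splash2 workload as the 2.2 AIPC figure, which the slide does not state explicitly.

```python
# (area in mm^2, AIPC) for each design point, taken from the slide.
# Assumption: both AIPC figures refer to the same Splash2 workload.
designs = {
    "1 cluster": (42.0, 2.2),
    "16 clusters": (378.0, 15.8),
}

# Area efficiency of each design.
efficiency = {name: aipc / area for name, (area, aipc) in designs.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.4f} AIPC per mm^2")

# How area and throughput scale from the small design to the large one.
area_ratio = designs["16 clusters"][0] / designs["1 cluster"][0]
aipc_ratio = designs["16 clusters"][1] / designs["1 cluster"][1]
print(f"area grows {area_ratio:.1f}x, AIPC grows {aipc_ratio:.2f}x")
```

The 16-cluster design uses 9 times the area for roughly 7.2 times the AIPC, so its area efficiency is somewhat lower (about 0.042 versus 0.052 AIPC per mm²) in exchange for much higher total throughput.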

### Board-level FPGA implementation

- OS & real application simulations


# **Important Issues**

Modern dataflow machines, aka WaveScalar

- comparison to von Neumann microarchitecture
- hierarchical structure
- wave-ordered memory
- thread management

Spring 2011 CSE 471 - Dataflow Machines