#### Pipelining

- Delay, d, of the slowest combinational stage determines performance
  - Throughput = 1/d: rate at which outputs are produced
  - Latency = n·d: number of stages × clock period
- Pipelining increases circuit utilization
- Registers slow down data, synchronize data paths
- Wave-pipelining
  - no pipeline registers: waves of data flow through the circuit
  - relies on equal-delay circuit paths: no short paths

#### Retiming Algorithm

- Representation of circuit as directed graph
  - nodes: combinational logic
  - edges: connections between logic that may or may not include registers
  - weights: propagation delay for nodes, number of registers for edges
  - path delay (D): sum of propagation delays of the nodes along a path
  - path weight (W): sum of edge weights along a path
    - always > 0 around any cycle: no asynchronous feedback
- Problem statement
  - given: cycle time, T, and a circuit graph
  - adjust edge weights (number of registers) so that all path delays are $\le$ T, unless their path weight is $\ge$ 1, and the outputs to the host are the same (in both function and delay) as in the original graph

Pipelining and Retiming 10

#### Retiming Algorithm Approach

- Compute path weights and delays between each pair of nodes
  - W and D matrices
- Choose a cycle time T
- Determine whether it is possible to assign new weights so that all paths with delay greater than T have a weight of 1 or greater (use linear programming)
- Choose a smaller cycle time and repeat until the smallest T is found

#### Computing W and D

- W[u,v] = number of registers on the minimum-weight path from $u\to v$
  - Any retiming changes the weight of all $u\to v$ paths by the same constant, i.e.
retiming cannot change which path is the minimum-weight path
- D[u,v] = maximum delay over all $u\to v$ paths with W[u,v] registers
  - Retiming does not affect D[u,v]
- These matrices contain all the required register and delay information
  - If retiming removes all registers from the path $u\to v$, then D[u,v] is the largest path delay that results

#### Retiming: One Step at a Time (cont'd)

and after a few more . . .

#### Retiming: Problem Formulation

- r(v): number of registers pushed through node v in the forward direction
  - $w_{new}(u, v) = w_{old}(u, v) + r(u) - r(v)$
- Problem statement
  - $r(v_h) = 0$ (host is not retimed)
  - $W_{new}(u, v) = W_{old}(u, v) + r(u) - r(v) \ge 0$, for all u, v
  - $w_{new}(u, v) = w_{old}(u, v) + r(u) - r(v) \ge 0$, for all u, v
    - $r(u) - r(v) \ge -w_{old}(u, v)$ (no negative registers!)
  - For all D[u,v] > $T_{clk}$, $W_{new}(u, v) = W_{old}(u, v) + r(u) - r(v) \ge 1$
    - $r(u) - r(v) \ge -W_{old}(u, v) + 1$ (every long path has at least 1 reg)
- Difference constraints like these can be solved by generating a graph that represents the constraints and using a shortest-path algorithm like Bellman-Ford to find a set of r(v) values that meets all the constraints
- The value of r(v) returned by the algorithm can be used to generate the new positions of the registers in the retimed circuit

#### Extensions to Retiming

- Host interface
  - add latency
  - multiple hosts
- Area considerations
  - limit number of registers
- Optimize logic across register boundaries
  - peripheral retiming
  - incremental retiming
  - pre-computation
- Generality
  - different propagation delays for different signals
  - widths of interconnections

#### Systolic Arrays

- Set of identical processing elements
  - specialized or programmable
- Efficient nearest-neighbor interconnections (in 1-D, 2-D, other)
- SIMD-like
- Multiple data flows, converging to engage in computation

Analogy: data flowing through the system in a rhythmic fashion – from main memory
through a series of processing elements and back to main memory

#### Systolic Algorithms

- 2D convolution
  - image processing
- String matching
- Dynamic programming
  - DNA comparison
- Matrix computations
  - LU decomposition
  - QR factorization

#### Systolic Architectures

- Highly parallel
  - "fine-grained" parallelism
  - deep pipelining
- Local communication
  - wires are short: no global communication (except CLK)
  - linear array: no clock skew
  - increasingly important as wire delays increase (relative to gate delays)
- Linear arrays
  - most systolic algorithms can be done with a linear array
  - include memory in each cell in the array
  - linear array a better match to I/O limitations
- Contrast to superscalar and vector architectures

#### Systolic Computers

- Custom chips, early 1980s
- Warp (CMU), 1987
  - linear array of 10 or more processing cells
  - optimized inter-cell communication for low latency
  - pipelined cells and communication
  - conditional execution
  - compiler partitions problem into cells and generates microcode
- i-Warp (Intel), 1990
  - successor to Warp
  - two-dimensional array
  - time-multiplexing of physical busses between cells
  - 32x32 array has 20 Gflops peak performance
  - not a commercial success
- Currently confined to ASIC implementations

#### Digital Correlator Revisited

- Optimally retimed circuit (clock cycle 13)

#### C-slow'ing a Circuit

- Replace every register with C registers
- Now retime (clock cycle now …)

#### Computation Spread over Time

- Only need one multiplier and one adder
- We can use this method to schedule for any number of resources
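
#### Sketch: Retiming as Difference Constraints

Returning to the retiming problem formulation, the W and D matrices and the Bellman-Ford solve for r(v) can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the lecture's implementation: the function names (`compute_W_D`, `retime`, `apply_retiming`, `clock_period`, `min_cycle_time`) and the four-node ring circuit are hypothetical, and the search tries candidate cycle times in increasing order rather than shrinking T step by step. It follows the sign convention used above, $w_{new}(u,v) = w_{old}(u,v) + r(u) - r(v)$, and assumes every cycle in the circuit contains at least one register.

```python
INF = float("inf")

def compute_W_D(nodes, delay, edges):
    """W[u,v]: fewest registers on any u->v path.
    D[u,v]: largest delay among the u->v paths that have W[u,v] registers."""
    # All-pairs shortest paths over (registers, -delay) pairs, compared lexicographically.
    dist = {(u, v): (INF, 0) for u in nodes for v in nodes}
    for u in nodes:
        dist[u, u] = (0, 0)
    for (u, v), w in edges.items():
        dist[u, v] = min(dist[u, v], (w, -delay[u]))
    for k in nodes:  # Floyd-Warshall
        for u in nodes:
            for v in nodes:
                a, b = dist[u, k], dist[k, v]
                cand = (a[0] + b[0], a[1] + b[1])
                if cand < dist[u, v]:
                    dist[u, v] = cand
    W = {p: x for p, (x, _) in dist.items() if x < INF}
    D = {(u, v): delay[v] - y for (u, v), (x, y) in dist.items() if x < INF}
    return W, D

def retime(nodes, delay, edges, host, T):
    """Return r(v) satisfying all constraints for cycle time T, or None if infeasible."""
    W, D = compute_W_D(nodes, delay, edges)
    # A constraint r(v) - r(u) <= c is a constraint-graph edge u -> v of weight c.
    cons = [(u, v, w) for (u, v), w in edges.items()]             # w_new >= 0
    cons += [(u, v, W[u, v] - 1) for (u, v) in W if D[u, v] > T]  # W_new >= 1
    # Starting at 0 everywhere acts like a virtual source with 0-weight edges to all nodes.
    r = {v: 0 for v in nodes}
    for _ in range(len(nodes)):  # Bellman-Ford relaxation passes
        changed = False
        for u, v, c in cons:
            if r[u] + c < r[v]:
                r[v] = r[u] + c
                changed = True
        if not changed:
            shift = r[host]  # normalize so that r(host) = 0
            return {v: r[v] - shift for v in nodes}
    return None  # still relaxing after |V| passes: negative cycle, T is infeasible

def apply_retiming(edges, r):
    """New register counts: w_new(u,v) = w_old(u,v) + r(u) - r(v)."""
    return {(u, v): w + r[u] - r[v] for (u, v), w in edges.items()}

def clock_period(nodes, delay, edges):
    """Largest delay over all register-free paths."""
    W, D = compute_W_D(nodes, delay, edges)
    return max(D[p] for p in W if W[p] == 0)

def min_cycle_time(nodes, delay, edges, host):
    """Try each distinct D[u,v] value as T, smallest first; return the first feasible one."""
    _, D = compute_W_D(nodes, delay, edges)
    for T in sorted(set(D.values())):
        r = retime(nodes, delay, edges, host, T)
        if r is not None:
            return T, r
    return None

# Hypothetical example: host h feeding a ring of three logic blocks of delay 3,
# with both registers initially on the edge h -> a.
nodes = ["h", "a", "b", "c"]
delay = {"h": 0, "a": 3, "b": 3, "c": 3}
edges = {("h", "a"): 2, ("a", "b"): 0, ("b", "c"): 0, ("c", "h"): 0}

if __name__ == "__main__":
    T, r = min_cycle_time(nodes, delay, edges, "h")
    print("original period:", clock_period(nodes, delay, edges))
    print("best T:", T, "r:", r)
    print("retimed edges:", apply_retiming(edges, r))
```

For this example the original clock period is 9 (the register-free path a, b, c, h). A cycle time of 6 is feasible and 5 is not, because the ring holds only two registers and any two adjacent logic blocks already add up to a delay of 6.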