# Pipelining

CSE 410, Spring 2004

Computer Systems

http://www.cs.washington.edu/education/courses/410/04sp/

19-Apr-2004 cse410-10-pipelining-a © 2004 University of Washington

### Reading and References

- Sections 6.1-6.3, Computer Organization and Design, Patterson and Hennessy

### Execution Cycle

IF ID EX MEM WB

1. Instruction Fetch
2. Instruction Decode
3. Execute
4. Memory
5. Write Back

### IF and ID Stages

#### 1. Instruction Fetch

- » Get the next instruction from memory
- » Increment Program Counter value by 4

#### 2. Instruction Decode

- » Figure out what the instruction says to do
- » Get values from the named registers
- » Simple instruction format means we know which registers we may need before the instruction is fully decoded

#### Simple MIPS Instruction Formats

| Format | Fields, most significant bits first (width in bits)                        |
|--------|----------------------------------------------------------------------------|
| R      | op code (6), source 1 (5), source 2 (5), dest (5), shamt (5), function (6) |
| I      | op code (6), base reg (5), src/dest (5), offset or immediate value (16)    |
| J      | op code (6), word offset (26)                                              |
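The field boundaries in the formats above translate directly into shifts and masks. The following is a minimal Python sketch, not part of the original slides; the function name `decode_fields` and the dictionary keys are illustrative choices. It extracts every field from a 32-bit word and leaves it to the caller to pick the ones that apply to the instruction's format.

```python
def decode_fields(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into R-, I-, and J-format fields."""
    return {
        "op":     (word >> 26) & 0x3F,    # bits 31..26, present in all formats
        "rs":     (word >> 21) & 0x1F,    # source 1 (R) or base reg (I)
        "rt":     (word >> 16) & 0x1F,    # source 2 (R) or src/dest (I)
        "rd":     (word >> 11) & 0x1F,    # dest (R)
        "shamt":  (word >> 6) & 0x1F,     # shift amount (R)
        "funct":  word & 0x3F,            # function (R)
        "imm":    word & 0xFFFF,          # offset or immediate value (I)
        "target": word & 0x03FFFFFF,      # word offset (J)
    }


# 0x02328020 is the bit pattern 000000 10001 10010 10000 00000 100000
fields = decode_fields(0x02328020)
```

Because the two register-number fields sit at the same bit positions in the R and I formats, the register reads can start before the op code has been fully decoded, which is the point made on the IF and ID Stages slide.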

### EX, MEM, and WB stages

#### 3. Execute

- » On a memory reference, add up base and offset
- » On an arithmetic instruction, do the math

#### 4. Memory Access

- » If load or store, access memory
- » If branch, replace PC with destination address
- » Otherwise do nothing

#### 5. Write Back

- » Place the results in the appropriate register


# Example: add \$s0, \$s1, \$s2

- **IF** get instruction at PC from memory

| op code | source 1 | source 2 | dest  | shamt | function |
|---------|----------|----------|-------|-------|----------|
| 000000  | 10001    | 10010    | 10000 | 00000 | 100000   |

- **ID** determine what the instruction is and read registers
  - » 000000 with 100000 is the add instruction
  - » get contents of \$s1 and \$s2 (e.g., \$s1=7, \$s2=12)
- **EX** add 7 and 12 = 19
- **MEM** do nothing for this instruction
- **WB** store 19 in register \$s0
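The walk-through above can be traced in a few lines of Python. This is a sketch under stated assumptions, not the hardware's actual behavior: the starting address `PC_START`, the dictionary-based instruction memory `imem`, and the function name `run_add` are hypothetical, and overflow trapping is ignored.

```python
# Hypothetical machine state; the address and memory layout are assumptions.
PC_START = 0x00400000
regs = [0] * 32
regs[17], regs[18] = 7, 12            # $s1 = 7, $s2 = 12, as in the example
imem = {PC_START: 0x02328020}         # encoding of add $s0, $s1, $s2


def run_add(pc: int) -> int:
    """Carry one R-format add through the five stages; return the next PC."""
    # IF: fetch the instruction and increment the PC by 4
    word = imem[pc]
    next_pc = pc + 4
    # ID: decode the fields and read both source registers
    op, funct = (word >> 26) & 0x3F, word & 0x3F
    rs, rt, rd = (word >> 21) & 0x1F, (word >> 16) & 0x1F, (word >> 11) & 0x1F
    if (op, funct) != (0b000000, 0b100000):
        raise ValueError("this sketch only handles add")
    a, b = regs[rs], regs[rt]
    # EX: do the math (result wrapped to 32 bits)
    result = (a + b) & 0xFFFFFFFF
    # MEM: nothing to do for add
    # WB: place the result in the destination register
    regs[rd] = result
    return next_pc


next_pc = run_add(PC_START)
```

After the call, `regs[16]` (register \$s0) holds 19 and `next_pc` is four bytes past `PC_START`.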


## Example: lw \$t2, 16(\$s0)

- **IF** get instruction at PC from memory

| op code | base reg | src/dest | offset or immediate value |
|---------|----------|----------|---------------------------|
| 100011  | 10000    | 01010    | 0000000000010000          |

- **ID** determine what 100011 is
  - » 100011 is lw
  - » get contents of \$s0 and \$t2 (at this point we don't yet know that we don't care about \$t2): \$s0=0x200D1C00, \$t2=77763
- **EX** add 16 to 0x200D1C00 = 0x200D1C10
- **MEM** load the word stored at 0x200D1C10
- **WB** store loaded value in \$t2
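The same steps can be sketched in Python for the load. The sketch assumes the standard MIPS32 encoding, in which lw has op code 100011 and \$t2 is register 10, so the instruction word is 0x8E0A0010. The memory contents in `dmem` are a made-up value for illustration, and `sign_extend16` is a helper introduced here.

```python
def sign_extend16(imm: int) -> int:
    """Interpret a 16-bit field as a signed two's-complement value."""
    return imm - 0x10000 if imm & 0x8000 else imm


regs = [0] * 32
regs[16] = 0x200D1C00                 # $s0, base register value from the example
regs[10] = 77763                      # $t2, old value that the load overwrites
dmem = {0x200D1C10: 0x12345678}       # hypothetical memory contents

word = 0x8E0A0010                     # lw $t2, 16($s0)

# ID: decode the I-format fields and read the base register
op = (word >> 26) & 0x3F
base = (word >> 21) & 0x1F
dest = (word >> 16) & 0x1F
offset = sign_extend16(word & 0xFFFF)
# EX: add base and offset to form the effective address
address = (regs[base] + offset) & 0xFFFFFFFF
# MEM: load the word stored at that address
loaded = dmem[address]
# WB: store the loaded value in the destination register
regs[dest] = loaded
```

The old value of \$t2 is read in ID along with \$s0 and then simply never used, which mirrors the remark in the ID step.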

#### Latency & Throughput

| 1  | 2  | 3  | 4   | 5  | 6  | 7  | 8  | 9   | 10 |        |
|----|----|----|-----|----|----|----|----|-----|----|--------|
| IF | ID | EX | MEM | WB |    |    |    |     |    | inst 1 |
|    |    |    |     |    | IF | ID | EX | MEM | WB | inst 2 |

- What's the latency for this implementation?
  - » One instruction takes 5 clock cycles
  - » Cycles per Instruction (CPI) = 5
- What's the throughput of this implementation?
  - » Throughput is the number of instructions that execute per unit time
  - » One instruction is completed every 5 cycles
  - » Average CPI = 5

#### A case for pipelining

- If execution is non-overlapped, the functional units are underutilized because each unit is used only once every five cycles
- If the Instruction Set Architecture is carefully designed, organization of the functional units can be arranged so that they execute in parallel
- **Pipelining** overlaps the stages of execution so every stage has something to do each cycle

#### Pipelined Latency & Throughput

| 1  | 2  | 3  | 4   | 5   | 6   | 7   | 8   | 9  |        |
|----|----|----|-----|-----|-----|-----|-----|----|--------|
| IF | ID | EX | MEM | WB  |     |     |     |    | inst 1 |
|    | IF | ID | EX  | MEM | WB  |     |     |    | inst 2 |
|    |    | IF | ID  | EX  | MEM | WB  |     |    | inst 3 |
|    |    |    | IF  | ID  | EX  | MEM | WB  |    | inst 4 |
|    |    |    |     | IF  | ID  | EX  | MEM | WB | inst 5 |

- What's the latency of this implementation?
- What's the throughput of this implementation?
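The latency and throughput questions for the non-overlapped and pipelined timing diagrams can be checked with a small cycle count. This is a rough sketch that assumes an ideal five-stage pipeline with no stalls; the function names `total_cycles` and `average_cpi` are illustrative.

```python
def total_cycles(n_instructions: int, n_stages: int = 5, pipelined: bool = True) -> int:
    """Cycles needed to finish n_instructions, assuming no stalls."""
    if n_instructions <= 0:
        return 0
    if pipelined:
        # The first instruction fills the pipeline; each later one finishes
        # one cycle after its predecessor.
        return n_stages + (n_instructions - 1)
    # Non-overlapped: each instruction occupies the machine for all stages.
    return n_stages * n_instructions


def average_cpi(n_instructions: int, n_stages: int = 5, pipelined: bool = True) -> float:
    """Average cycles per instruction over the whole run."""
    return total_cycles(n_instructions, n_stages, pipelined) / n_instructions


non_overlapped = total_cycles(2, pipelined=False)   # two instructions, back to back
overlapped = total_cycles(5)                        # five instructions, pipelined
```

In both cases a single instruction still takes five cycles from IF to WB, so latency is unchanged in cycles. Throughput differs: the non-overlapped design completes one instruction every five cycles, while the full pipeline completes one per cycle, and the average CPI approaches 1 as the instruction count grows.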

# **Pipelined Analysis**

- A pipeline with N stages could improve throughput by N times, but
  - » each stage must take the same amount of time
  - » each stage must always have work to do
  - » there may be some overhead to implement
- Also, latency for each instruction may go up
  - » Within some limits, we don't care
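The conditions above can be made concrete with a small calculation. The stage delays below are illustrative numbers chosen for this sketch, not measurements from the slides, and `pipeline_timing` is a hypothetical helper. The pipeline clock must be as long as the slowest stage plus any latch overhead.

```python
# Illustrative stage delays in picoseconds; these values are assumptions.
stage_ps = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def pipeline_timing(stages: dict, overhead_ps: int = 0) -> dict:
    """Compare an unpipelined datapath with a pipeline built from the same stages."""
    n = len(stages)
    unpipelined_latency = sum(stages.values())       # one long pass per instruction
    clock = max(stages.values()) + overhead_ps       # slowest stage sets the clock
    return {
        "clock_ps": clock,
        "unpipelined_latency_ps": unpipelined_latency,
        "pipelined_latency_ps": n * clock,           # every stage takes a full clock
        "throughput_speedup": unpipelined_latency / clock,
    }


timing = pipeline_timing(stage_ps)
```

With these numbers the five-stage pipeline improves throughput by a factor of 4 rather than 5, because the stages are not equally long, and the latency of a single instruction rises from 800 ps to 1000 ps.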

#### Throughput is good!



# MIPS ISA: Born to Pipeline

- Instructions all one length
  - » simplifies Instruction Fetch stage
- Regular format
  - » simplifies Instruction Decode
- Few memory operands, only registers
  - » only lw and sw instructions access memory
- Aligned memory operands
  - » only one memory access per operand

### Memory accesses

- Efficient pipeline requires each stage to take about the same amount of time
- CPU is much faster than memory hardware
- Cache is provided on chip
  - » i-cache holds instructions
  - » d-cache holds data
  - » critical feature for successful RISC pipeline
  - » more about caches next week

# The Hazards of Parallel Activity


- Any time you get several things going at once, you run the risk of interactions and dependencies
  - » juggling doesn't take kindly to irregular events
- Unwinding activities after they have started can be very costly in terms of performance
  - » drop everything on the floor and start over


### Design for Speed

- Most of what we talk about next relates to the CPU hardware itself
  - » problems keeping a pipeline full
  - » solutions that are used in the MIPS design
- Some programmer visible effects remain
  - » many are hidden by the assembler or compiler
  - » the code that you write tells what you want done, but the tools rearrange it for speed


# **Pipeline Hazards**

- Structural hazards
  - » instructions in different stages need the same resource, e.g., memory
- Data hazards
  - » data not available to perform next operation
- Control hazards
  - » data not available to make branch decision

#### Structural Hazards

- Concurrent instructions want same resource
  - » **lw** instruction in stage four (memory access)
  - » add instruction in stage one (instruction fetch)
  - » Both of these actions require access to memory; they would collide if not designed for
- Add more hardware to eliminate the problem
  - » separate instruction and data caches
- Or stall (cheaper and easier), though this is not usually done

#### Data Hazards


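- When an instruction depends on the results of a previous instruction still in the pipeline
- This is a data dependency

|                      | 1  | 2  | 3  | 4   | 5   | 6  |
|----------------------|----|----|----|-----|-----|----|
| add \$s0, \$s1, \$s2 | IF | ID | EX | MEM | WB  |    |
| add \$s4, \$s3, \$s0 |    | IF | ID | EX  | MEM | WB |

- \$s0 is written here: WB of the first add (cycle 5)
- \$s0 is read here: ID of the second add (cycle 3)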
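A data dependency of this kind can be found mechanically by comparing each instruction's source registers with the destinations of the instructions just ahead of it. The sketch below is illustrative only: the `Instr` tuple, the function name, and the default `window` of three instructions (which assumes results are written in WB, read in ID, and never forwarded) are choices made here, not something the slides specify.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                    # assembly text, for display only
    dest: Optional[str]          # register written, or None
    srcs: Tuple[str, ...]        # registers read


def raw_dependencies(program: List[Instr], window: int = 3) -> List[Tuple[int, int, str]]:
    """Return (producer index, consumer index, register) for each read of a
    register written by one of the `window` preceding instructions."""
    found = []
    for j, consumer in enumerate(program):
        for reg in consumer.srcs:
            # Search backwards so the nearest earlier writer is the one reported.
            for i in range(j - 1, max(j - window, 0) - 1, -1):
                if program[i].dest == reg:
                    found.append((i, j, reg))
                    break
    return found


program = [
    Instr("add $s0, $s1, $s2", "$s0", ("$s1", "$s2")),
    Instr("add $s4, $s3, $s0", "$s4", ("$s3", "$s0")),
]
deps = raw_dependencies(program)
```

For the two add instructions above, `deps` contains a single entry linking instruction 0 to instruction 1 through \$s0.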


### Stall for register data dependency

- Stall the pipeline until the result is available
  - » this would create a 3-cycle *pipeline bubble*

#### Read & Write in same Cycle

- Write the register in the first part of the clock cycle
- Read it in the second part of the clock cycle
- A 2-cycle stall is still required
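The 3-cycle and 2-cycle figures can be reproduced by counting cycles. This is a sketch under the assumption that there is no forwarding: the producer writes its register in WB (cycle 5 when its IF is cycle 1) and the consumer reads in ID. The function name `stall_cycles` and its parameters are illustrative.

```python
def stall_cycles(distance: int, split_cycle: bool = False) -> int:
    """Stalls needed by a consumer issued `distance` instructions after the
    producer of a register it reads, with no forwarding."""
    if distance < 1:
        raise ValueError("the consumer must come after the producer")
    wb_cycle = 5                          # producer: IF in cycle 1, WB in cycle 5
    id_cycle = 2 + distance               # consumer's ID cycle if never stalled
    # With write-then-read in one cycle, ID may share the WB cycle;
    # otherwise the read must wait until the cycle after WB.
    earliest_read = wb_cycle if split_cycle else wb_cycle + 1
    return max(0, earliest_read - id_cycle)


back_to_back = stall_cycles(1)                          # plain register file
back_to_back_split = stall_cycles(1, split_cycle=True)  # write first half, read second
```

For back-to-back dependent instructions this gives 3 stalls without the split-cycle register file and 2 with it, matching the two slides.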



#### Stall for lw hazard

- We can stall for one cycle, but we hate to stall

|              | 1  | 2  | 3  | 4     | 5  | 6   | 7  |
|--------------|----|----|----|-------|----|-----|----|
| lw s0,0(s2)  | IF | ID | EX | MEM   | WB |     |    |
| add s4,s3,s0 |    | IF | ID | stall | EX | MEM | WB |

#### Instruction Reorder for lw hazard

- Try to execute an unrelated instruction between the two instructions

|              | 1  | 2  | 3  | 4   | 5   | 6   | 7  |
|--------------|----|----|----|-----|-----|-----|----|
| lw s0,0(s2)  | IF | ID | EX | MEM | WB  |     |    |
| sub t4,t2,t3 |    | IF | ID | EX  | MEM | WB  |    |
| add s4,s3,s0 |    |    | IF | ID  | EX  | MEM | WB |

# **Reordering Instructions**

- Reordering instructions is a common technique for avoiding pipeline stalls
- Static reordering
  - » programmer, compiler and assembler do this
- Dynamic reordering
  - » modern processors can see several instructions
  - » they execute any that have no dependency
  - » this is known as *out-of-order execution* and is complicated to implement, but effective
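Static reordering for the lw hazard can be sketched as a tiny scheduling pass. The code below is a simplified illustration, not how any particular compiler or assembler does it: `Instr`, `independent`, and `fill_load_delay` are names invented here, only the instruction immediately after the dependent one is considered as a filler, and dependencies through memory are ignored.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]          # register written, or None
    srcs: Tuple[str, ...]        # registers read


def independent(a: Instr, b: Instr) -> bool:
    """True if a and b share no register dependency (RAW, WAR, or WAW)."""
    return (a.dest not in b.srcs
            and b.dest not in a.srcs
            and (a.dest is None or a.dest != b.dest))


def fill_load_delay(program: List[Instr]) -> List[Instr]:
    """If an instruction uses a just-loaded register, swap it with the next
    instruction when that one is unrelated, so no stall is needed."""
    out = list(program)
    for i in range(len(out) - 2):
        load, user, filler = out[i], out[i + 1], out[i + 2]
        if (load.op == "lw" and load.dest in user.srcs
                and load.dest not in filler.srcs
                and independent(user, filler)):
            out[i + 1], out[i + 2] = filler, user
    return out


program = [
    Instr("lw", "s0", ("s2",)),           # lw  s0,0(s2)
    Instr("add", "s4", ("s3", "s0")),     # add s4,s3,s0
    Instr("sub", "t4", ("t2", "t3")),     # sub t4,t2,t3
]
scheduled = fill_load_delay(program)
```

For the three-instruction example the pass moves sub between lw and add, giving the order shown on the Instruction Reorder slide.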