# Optimality of Tomasulo's Algorithm

Luna, Dong Gang, Zhao Feb 28th, 2002

### **Our Questions about Tomasulo**

- Questions about Tomasulo's Algorithm
  - Is it optimal (can always produce the wisest instruction execution stream)?
  - Is there any room for improvement if it is not optimal?
  - Is it a wise trade-off between time and complexity?
- Related work
  - A lot of study on its correctness
  - Some tests on its performance under different implementation details
  - Few theoretical analysis of its performance

## **Our Approach**

- Our approach
  - Build up mathematical models for formal proof or find counterexamples for disproof
- Features of our approach
  - Making comparison with a reference system –
    Data Driven System which can produce all possible instruction execution orders without violating data dependences.
  - Introducing time description variables into the model to record the finish time of each instruction
  - Examining the problem under a group of ideal assumptions first, and then drop them one by one.

- Unlimited hardware resources, no delay, only calculations – optimal <sup>(2)</sup>
- With transmission delay optimal ③
- With limited instruction window – optimal <sup>©</sup>
- With Load/Store
  - not optimal 
    IMPROVEMENT
- With limited hardware resources – not optimal 
   IMPROVEMENT
- Issue one instruction per cycle – not optimal <sup>(3)</sup>

### **Assumptions for all of the Models**

- Register renaming in both systems – no WAR, WAW
- No jump, no branch
- Instruction misses fully hidden
- Fixed-point function units are regarded just as another kind of units like floatingpoint adder and floating-point multiplier



## Unlimited hardware resources, no delay – optimal <sup>©</sup>

- With transmission delay optimal ③
- With limited instruction window – optimal <sup>©</sup>
- With Load/Store not optimal ③
- With limited hardware resources – not optimal <sup>(3)</sup>
- Issue one instruction per cycle – not optimal <sup>(3)</sup>

### Model I – Utopia Model

#### DATA DRIVEN SYSTEM



### **Model I – Utopia Model**

#### **TOMASULO SYSTEM**



### Model I – Utopia Model

### PROOF

Time<sub>d</sub>[n] = max(time<sub>d</sub>[sr1],time<sub>d</sub>[sr2])+opTime+waitTime

>= max(time<sub>d</sub>[sr1],time<sub>d</sub>[sr2])+opTime

Time<sub>t</sub>[n] = max(time<sub>t</sub>[sr1],time<sub>t</sub>[sr2])+opTime

- Unlimited hardware resources, no delay – optimal <sup>©</sup>
- With transmission delay optimal ©
- With limited instruction window – optimal ③
- With Load/Store not optimal ③
- With limited hardware resources – not optimal <sup>(3)</sup>
- Issue one instruction per cycle – not optimal <sup>(3)</sup>

#### DATA DRIVEN SYSTEM



#### **TOMASULO SYSTEM**



### PROOF

IV

 $Time_{d}[n] = max(time_{d}[sr1], time_{d}[sr2]) + opTime + waitTime + tranTime$ 

>= max(time<sub>d</sub>[sr1],time<sub>d</sub>[sr2])+opTime+tranTime

Time<sub>t</sub>[n] = max(time<sub>t</sub>[sr1],time<sub>t</sub>[sr2])+opTime+tranTime



| P1 [ | T1 | T2 |   |    |    |   | Т5 | ji<br>V |
|------|----|----|---|----|----|---|----|---------|
| P2   |    |    |   | Т3 | T4 |   |    |         |
| 0    | 1  | Î  | 2 | 3  | 3  | 4 | 5  | i       |

| P1 | Т1 | T2 |    | T4 |
|----|----|----|----|----|
| P2 | Т1 | Т3 | T5 |    |



- Unlimited hardware resources, no delay – optimal <sup>(3)</sup>
- With transmission delay optimal ③
- With limited instruction window – optimal <sup>©</sup>
- With Load/Store not optimal ③
- With limited hardware resources – not optimal <sup>(2)</sup>
- Issue one instruction per cycle – not optimal <sup>(3)</sup>

### **Model III – Limited Instru Win**

#### **DATA DRIVEN SYSTEM**

**Instruction Window** 



### **Model III – Limited Instru Win**

#### TOMASULO SYSTEM



## **Model III – Limited Instru Win**

### PROOF

 $Time_{d} = \max(time_{d}[sr1], time_{d}[sr2], inWinTime) + opTime + waitTime + tranTime$ 

>= max(time\_d[sr1],time\_d[sr2],inWinTime)+opTime+tranTime

**Time**<sub>t</sub> = max(time<sub>t</sub>[sr1],time<sub>t</sub>[sr2],inWinTime)+opTime+tranTime

- Unlimited hardware resources, no delay
  optimal <sup>(c)</sup>
- With transmission delay optimal ③
- With limited instruction window
  optimal ③
- With Load/Store not optimal 🛞
- With limited hardware resources – not optimal <sup>(2)</sup>
- Issue one instruction per cycle – not optimal <sup>(3)</sup>

#### Load and store

#### Data dependence in load and store.

#### - The principles (We have discussed in our class)

- A stores is dependent on the previous store
- All Load are dependent on the previous store
- A store is dependent on all loads between it and the previous store.

#### Load and store buffer used in Tomasulo.

- L/S dependence may block instruction window as Tomasulo's design.
  - For load instructions, no address conflict in store buffer
  - For store instructions, no address conflict in load and store buffer
- In Tomasulo, the L/S buffer is a little bit conservative. More aggressive strategy may boost throughput in some case.



#### Load and store (cont.)

### Modifying L/S buffer to RS-like structure.



- Unlimited hardware resources, no delay – optimal <sup>(3)</sup>
- With transmission delay optimal ③
- With limited instruction window – optimal ③
- With Load/Store not optimal ③
- With limited hardware resources – not optimal <sup>(2)</sup>
- Issue one instruction per cycle
  not optimal

#### The last-ditch of the optimality of Tomasulo

- Resource competition forces you to make choice.
  - Finite RS,FU, CDB
  - Scheduling under resource competition is an NPC problem. (It's impossible to get the optimal result in polynomial time.)
  - Tomasulo is actually a greedy algorithm.
  - Tomasulo uses simple FIFO strategy to to solve ties in competition.







### Improvement

- So far, only "**previous information**" is used.
  - Enough to deal with dependence problem.
  - Far less for wise scheduling.
- "Following information" is valuable for scheduling previous instruction.
  - Construct the critical path information in RS
  - Instructions in the critical path should get higher priority.
  - Following instruction entered RS should affect previous instructions' priority in some way.
  - Introducing earliest possible finish time and latest necessary finish time to recode the information of critical path.
- Again, scheduling under resource competition is an NP-complete problem. It's still true for our improvement.





### Improvement(cont.)

- Is this always better than Tomasulo's when facing single FU? (LIKELY)
- Is it optimal in this condition? (HOPEFULLY)
- Is this always better than Tomasulo's when facing multiple FUs? (NOT SURE)
- Is it optimal in this condition? (NEVER)

- Unlimited hardware resources, no delay – optimal <sup>©</sup>
- With transmission delay optimal ③
- With limited instruction window – optimal ③
- With Load/Store not optimal ③
- With limited hardware resources – not optimal <sup>(2)</sup>
- Issue one instruction per cycle – not optimal <sup>(2)</sup>

### Answers to our previous questions.

#### Is it optimal?

- The essence of the problem underlying is NP-Complete.
- Is there any room for improvement if it is not optimal?
  - Yes, some problem comes from Tomasulo algorithm itself.
    - Conservative Load and store buffer.
    - Greedy dispatch.
    - FIFO strategy to solve ties in competition.
  - Two ways to verify the our improvement.
- Is it a wise trade-off between time and complexity?
  - Parallelism, SMT, ILP
  - Cost-efficient.

