







## Multithreading

Traditional multithreaded processors switch in hardware to a different context to avoid processor stalls.

1. Coarse-grain multithreading
   - switch on a long-latency operation (e.g., L2 cache miss)
   - another thread executes while the miss is handled
   - modest increase in instruction throughput
   - doesn't hide latency of short-latency operations
   - no switch if no long-latency operations
   - need to fill the pipeline
   - potentially no slowdown to …
   - if stall is long, pipeline …

Spring 2010, CSE471
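
The switch-on-miss policy above can be sketched as a toy cycle counter. This is only an illustration under stated assumptions: the function name `simulate_switch_on_miss`, the `"alu"`/`"miss"` opcodes, and the default latency and switch-penalty values are hypothetical and do not come from the slides.

```python
def simulate_switch_on_miss(threads, miss_latency=20, switch_penalty=3):
    """Count cycles for a toy coarse-grain multithreaded pipeline.

    threads: list of instruction lists; each instruction is "alu" (1 cycle)
    or "miss" (1 cycle to issue, then the thread waits miss_latency cycles).
    switch_penalty models refilling the pipeline after a context switch.
    All parameter values are illustrative assumptions.
    """
    n = len(threads)
    pc = [0] * n          # next instruction index for each thread
    ready_at = [0] * n    # cycle at which each thread may issue again
    cycle = 0
    current = 0
    while True:
        unfinished = [t for t in range(n) if pc[t] < len(threads[t])]
        if not unfinished:
            break
        if current not in unfinished or ready_at[current] > cycle:
            # Current thread is done or waiting on a miss: look for another.
            ready = [t for t in unfinished if ready_at[t] <= cycle]
            if not ready:
                # Nobody can run: the processor stalls until a miss returns.
                cycle = min(ready_at[t] for t in unfinished)
                continue
            current = ready[0]
            cycle += switch_penalty   # pipeline refill after the switch
        op = threads[current][pc[current]]
        pc[current] += 1
        cycle += 1
        if op == "miss":
            ready_at[current] = cycle + miss_latency
    # Outstanding misses must complete before the run counts as finished.
    return max([cycle] + ready_at)


work = ["alu"] * 4 + ["miss"] + ["alu"] * 4
alone = simulate_switch_on_miss([work])
together = simulate_switch_on_miss([work, work])
```

With these made-up numbers, one thread takes 29 cycles and two copies take 40 cycles instead of 58 back to back. A thread with no misses never triggers a switch, so it pays no penalty.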











## Cray (Tera) MTA

Interesting features

- No processor-side data caches
  - increases the latency for data accesses but reduces the variation between memory ops
  - to avoid having to keep caches coherent
  - memory-side buffers instead
- L1 & L2 instruction caches
  - instruction accesses are more predictable & have no coherency problem
  - prefetch fall-through & target code





## Cray (Tera) MTA

Interesting features

- tagged memory, i.e., full/empty bits
  - indirectly set full/empty bits to prevent data races
    - prevents a consumer/producer from loading/overwriting a value before a producer/consumer has written/read it
    - example for the consumer:
      - set to empty when producer instruction starts executing
      - consumer instructions block if they try to read the producer value
      - set to full when producer writes value
      - consumers can now read a valid value
  - explicitly set full/empty bits for cheap thread synchronization
    - primarily used when accessing shared data
      - lock: read memory location & set to empty
        - other readers are blocked
      - unlock: write & set to full
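
The lock/unlock use of full/empty bits can be imitated in software. The sketch below is a software analogy only, not the MTA hardware mechanism: the class `FullEmptyWord`, its method names, and the `demo` helper are hypothetical, and Python's `threading.Condition` stands in for the per-word tag bit.

```python
import threading


class FullEmptyWord:
    """One memory word with a full/empty tag (software analogy)."""

    def __init__(self, value=None, full=False):
        self._value = value
        self._full = full
        self._cond = threading.Condition()

    def read_and_empty(self):
        """'lock': block until full, read the value, set the tag to empty."""
        with self._cond:
            self._cond.wait_for(lambda: self._full)
            self._full = False
            self._cond.notify_all()
            return self._value

    def write_and_fill(self, value):
        """'unlock': block until empty, write the value, set the tag to full."""
        with self._cond:
            self._cond.wait_for(lambda: not self._full)
            self._value = value
            self._full = True
            self._cond.notify_all()


def demo(num_threads=4, increments=1000):
    """Several threads increment one shared word guarded only by its tag."""
    counter = FullEmptyWord(0, full=True)   # full means unlocked

    def worker():
        for _ in range(increments):
            v = counter.read_and_empty()    # other readers now block
            counter.write_and_fill(v + 1)   # releases one blocked reader

    workers = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return counter.read_and_empty()
```

Because a reader empties the word before anyone else can read it, no increment is lost: `demo(4, 1000)` returns 4000.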

















## Implementing SMT

Thread-shared hardware:

- fetch buffers
- branch target buffer
- instruction queues
- functional units
- all caches (physical tags)
- TLBs
- store buffers & MSHRs

Thread-shared hardware is another reason why there is little single-thread performance degradation (~1.5%).

What hardware might you not want to share?

## Implementing SMT

Does sharing hardware cause more …?

- (-) 2X more data cache misses
- (+) other threads hide the miss …
- (+) data sharing

Bottom line is huge overall performance boost.












| Tiling |  |
|--------|--|
| Blocked                                                                                                                                                     | <ul> <li>Threads share a tile &amp; iterations are cyclically distributed across threads</li> <li>Better performance for cache hierarchies of different sizes</li> <li>Insensitive to tile size</li> <li>Tiles can be large to reduce loop control overhead</li> </ul> |
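
The row above describes threads sharing a tile, with the tile's iterations dealt out cyclically. A minimal sketch of that assignment follows; the function name `shared_tile_cyclic` and its parameters are hypothetical, and the sketch only computes which thread gets which iteration, with no claim about cache behavior.

```python
def shared_tile_cyclic(n_iters, tile_size, n_threads):
    """Return schedule[tile][thread] = iterations that thread runs in that tile.

    All threads work on the same tile at once; within a tile, iteration k
    (counted from the tile start) goes to thread k % n_threads.
    """
    schedule = []
    for tile_start in range(0, n_iters, tile_size):
        tile_end = min(tile_start + tile_size, n_iters)
        per_thread = [[] for _ in range(n_threads)]
        for i in range(tile_start, tile_end):
            per_thread[(i - tile_start) % n_threads].append(i)
        schedule.append(per_thread)
    return schedule


# 8 iterations, tiles of 4, 2 threads:
# tile 0 -> thread 0: [0, 2], thread 1: [1, 3]
# tile 1 -> thread 0: [4, 6], thread 1: [5, 7]
plan = shared_tile_cyclic(8, 4, 2)
```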





