


























## **Cray MTA**

Fine-grain multithreaded processor
- can switch to a different thread each cycle
  - switches to ready threads only
- up to 128 hardware contexts/processor
  - lots of latency to hide, mostly from the multi-hop interconnection network
  - average instruction latency for computation: 22 cycles (i.e., 22 instruction streams needed to keep functional units busy)
  - average instruction latency including memory: 120 to 200 cycles (i.e., 120 to 200 instruction streams needed to hide all latency, on average)
- processor state for all 128 contexts
  - GPRs (total of 4K registers!)
  - status registers (includes the PC)
  - branch target registers
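The stream counts above follow from simple arithmetic, which the sketch below spells out. It assumes a deliberately simplified model that is not stated on the slide: each stream has exactly one instruction in flight and can issue again only after the full latency has elapsed. The function names `issue_utilization` and `gprs_per_context` are hypothetical, chosen only for illustration.

```c
#include <assert.h>

/* Fraction of cycles in which some stream can issue, under the assumed model:
   one instruction in flight per stream, re-issue after latency_cycles. */
static double issue_utilization(int ready_streams, int latency_cycles) {
    if (ready_streams >= latency_cycles)
        return 1.0;                       /* enough streams to fill every cycle */
    return (double)ready_streams / (double)latency_cycles;
}

/* Registers per context implied by the slide's totals (4K GPRs, 128 contexts). */
static int gprs_per_context(int total_gprs, int contexts) {
    return total_gprs / contexts;
}
```

Under this model, 22 streams fully cover a 22-cycle computation latency, while 128 streams cover only part of a 200-cycle memory latency, which is why the slide counts 120 to 200 streams as the number needed to hide all latency.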

## **Cray MTA**

Interesting features
- No processor-side data caches
  - increases the maximum latency for data accesses but reduces the variation between memory ops
  - to avoid having to keep caches coherent
  - memory-side buffers instead
- L1 & L2 instruction caches
  - instructions have more locality & have no coherency problem
  - prefetch fall-through & target code

Spring 2014, 471 - Multithreaded Processors





## **Cray MTA**

Interesting features
- tagged memory, i.e., full/empty bits
  - indirectly set full/empty bits to prevent data races
    - prevents a consumer from loading a value before a producer has written it
    - prevents a producer from overwriting a value before a consumer has read it
  - example for the consumer:
    - set to empty when producer instruction starts executing
    - consumer instructions block if they try to read the producer value
    - set to full when producer writes value
    - consumers can now read a valid value
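The full/empty protocol can be modeled in software as a small state machine. The sketch below is a single-threaded model, not the MTA hardware interface: where the hardware would block the thread, the model simply returns `false`. The type `fe_word` and the functions `fe_write_and_fill` and `fe_read_and_empty` are hypothetical names, and the sketch assumes that a successful read resets the bit to empty so that both races listed above are covered.

```c
#include <assert.h>
#include <stdbool.h>

/* One memory word plus its full/empty tag bit (software model only). */
typedef struct {
    long value;
    bool full;
} fe_word;

/* Producer side: write only when the word is empty, then mark it full.
   Returns false where the hardware would make the producer wait. */
static bool fe_write_and_fill(fe_word *w, long v) {
    if (w->full)
        return false;          /* previous value has not been consumed yet */
    w->value = v;
    w->full = true;
    return true;
}

/* Consumer side: read only when the word is full, then mark it empty.
   Returns false where the hardware would make the consumer wait. */
static bool fe_read_and_empty(fe_word *w, long *out) {
    if (!w->full)
        return false;          /* producer has not written the value yet */
    *out = w->value;
    w->full = false;
    return true;
}

/* Walks through the producer/consumer sequence; returns the value consumed. */
static long fe_demo(void) {
    fe_word w = { 0, false };                 /* starts empty */
    long got = -1;
    assert(!fe_read_and_empty(&w, &got));     /* consumer must wait */
    assert(fe_write_and_fill(&w, 42));        /* producer fills the word */
    assert(!fe_write_and_fill(&w, 7));        /* cannot overwrite unread data */
    assert(fe_read_and_empty(&w, &got));      /* consumer now succeeds */
    return got;
}
```

Compared with a lock, the synchronization state travels with each data word, so no separate lock variable has to be acquired and released around the access.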
























## **Implementing SMT**

Thread-shared hardware:
- fetch buffers
- branch target buffer
- instruction queues
- functional units
- all caches (physical tags)
- TLBs
- store buffers & MSHRs

Thread-shared hardware is why there is little single-thread performance degradation (~1.5%).

What hardware might you not want to share?

## **Implementing SMT**

Does thread-shared hardware cause more conflicts?
- […] more data cache misses

Does it matter?
- threads hide miss latencies for each other
- data sharing








## **Tiling Example**

```
/* matrix multiply before tiling */
for (i = 0; i < n; i = i + 1)
    for (j = 0; j < n; j = j + 1) {
        r = 0;
        for (k = 0; k < n; k = k + 1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }

/* matrix multiply after tiling (tile size T); x must start out zeroed
   because each (jj, kk) tile adds a partial sum into x[i][j] */
for (jj = 0; jj < n; jj = jj + T)
    for (kk = 0; kk < n; kk = kk + T)
        for (i = 0; i < n; i = i + 1)
            /* exclusive bound jj+T (not jj+T-1) so no column is skipped */
            for (j = jj; j < min(jj + T, n); j = j + 1) {
                r = 0;
                for (k = kk; k < min(kk + T, n); k = k + 1)
                    r = r + y[i][k] * z[k][j];
                x[i][j] = x[i][j] + r;
            }
```
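A self-contained version of the tiling transformation can be checked against the untiled loop nest. The sketch below is illustrative only: the matrix size `N`, tile size `T`, the helper `min_int`, and the function names are assumptions, and integer elements are used so the two results can be compared exactly. It uses exclusive upper bounds `min(jj + T, N)` so that no row or column is skipped when `N` is not a multiple of `T`.

```c
#include <assert.h>

#define N 7   /* assumed size; not a multiple of T, to exercise edge tiles */
#define T 3   /* assumed tile size */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Untiled reference: x = y * z. */
static void matmul_naive(long x[N][N], long y[N][N], long z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            long r = 0;
            for (int k = 0; k < N; k++)
                r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Tiled version: works on T x T sub-blocks of z so they stay cache-resident. */
static void matmul_tiled(long x[N][N], long y[N][N], long z[N][N]) {
    for (int i = 0; i < N; i++)          /* partial sums accumulate into x, */
        for (int j = 0; j < N; j++)      /* so it has to start at zero      */
            x[i][j] = 0;
    for (int jj = 0; jj < N; jj += T)
        for (int kk = 0; kk < N; kk += T)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + T, N); j++) {
                    long r = 0;
                    for (int k = kk; k < min_int(kk + T, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}

/* Returns 1 if the tiled and untiled products are identical, else 0. */
static int tiled_matches_naive(void) {
    long y[N][N], z[N][N], a[N][N], b[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            y[i][j] = i + 2 * j;         /* arbitrary test data */
            z[i][j] = i - j;
        }
    matmul_naive(a, y, z);
    matmul_tiled(b, y, z);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            if (a[i][j] != b[i][j])
                return 0;
    return 1;
}
```

Tiling changes only the order in which the partial products are accumulated, so with integer data the two versions agree exactly; with floating-point data small rounding differences would be expected.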



## **Tiling**

[Figure: tiles labeled by thread number (1-4); panel label: Cyclic]

The Normal Way (blocked):
- Tiled to exploit data reuse, separate tiles/thread
- Often works, except when: large number of threads, large number of arrays, small data cache
- Issue of tile size sweet spot

The SMT-friendly Way (cyclic):
- Threads share a tile so there is less pressure on the data cache
- Less sensitive to tile size
  - tiles can be large to reduce loop control overhead
  - cross-thread latency hiding hides misses
- more adaptable to different cache configurations
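The difference between the two distributions can be sketched as an iteration-to-thread mapping. The code below is an illustration under assumptions, not the compiler's actual scheme: iterations are numbered from 0, threads are numbered from 0 (the figure numbers them 1-4), and the names `blocked_owner`, `cyclic_owner`, and `threads_in_tile` are hypothetical.

```c
#include <assert.h>

#define MAX_THREADS 64   /* assumed upper limit for this sketch */

/* Blocked: each thread owns one contiguous chunk of iterations,
   so each thread ends up working in its own tiles. */
static int blocked_owner(int i, int n, int nthreads) {
    int chunk = (n + nthreads - 1) / nthreads;   /* ceiling division */
    return i / chunk;
}

/* Cyclic: iterations are dealt out round-robin, so neighboring
   iterations (and therefore one tile) are spread over all threads. */
static int cyclic_owner(int i, int nthreads) {
    return i % nthreads;
}

/* Counts how many distinct threads touch iterations [start, start + len). */
static int threads_in_tile(int start, int len, int n, int nthreads, int cyclic) {
    int seen[MAX_THREADS] = { 0 };
    int count = 0;
    for (int i = start; i < start + len && i < n; i++) {
        int t = cyclic ? cyclic_owner(i, nthreads)
                       : blocked_owner(i, n, nthreads);
        if (!seen[t]) {
            seen[t] = 1;
            count++;
        }
    }
    return count;
}
```

With 16 iterations and 4 threads, a 4-iteration tile is touched by one thread under the blocked mapping and by all four under the cyclic mapping, which is the sense in which cyclic threads share a tile and share its cache footprint.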







## **Important Issues**

Cray
- what are its goals & how are they met?
- full-empty bits vs. locks vs. transactional memory

SMT
- what are its goals & how are they met?
- what extra hardware is needed, what extra hardware is not needed?
- how does it do synchronization? fetch instructions? schedule instructions?

Matching hardware & compiler optimizations