

























## **Cray MTA**

Fine-grain multithreaded processor

- can switch to a different thread each cycle
  - switches to ready threads only
- up to 128 hardware contexts/processor
  - lots of latency to hide, mostly from the multi-hop interconnection network
  - average instruction latency for computation: 22 cycles (i.e., 22 instruction streams needed to keep functional units busy)
  - average instruction latency including memory: 120 to 200 cycles (i.e., 120 to 200 instruction streams needed to hide all latency on average)
- processor state for all 128 contexts
  - GPRs (total of 4K registers!)
  - status registers (includes the PC)
  - branch target registers

Spring 2013, 471 - Multithreaded Processors

## **Cray MTA**

Interesting features

- No processor-side data caches
  - increases the latency for data accesses but reduces the variation between memory ops
  - to avoid having to keep caches coherent
  - memory-side buffers instead
- L1 & L2 instruction caches
  - instructions have more locality & have no coherency problem
  - prefetch fall-through & target code






## **Cray MTA**

Interesting features

- indirectly set full/empty bits to prevent data races
  - prevents a consumer from loading a value before a producer has written it
  - prevents a producer from overwriting a value before a consumer has read it
- example for the consumer:
  - set to empty when producer instruction starts executing
  - consumer instructions block if they try to read the producer value
  - set to full when producer writes value
  - consumers can now read a valid value
























## **Implementing SMT**

Thread-shared hardware:

- fetch buffers
- branch target buffer
- instruction queues
- functional units
- all caches (physical tags)
- TLBs
- store buffers & MSHRs

Thread-shared hardware is why there is little single-thread performance degradation (~1.5%).

What hardware might you not want to share?










## **Tiling Example**

```
/* matrix multiply before tiling */
/* x, y, z are n-by-n arrays; r is a scalar accumulator */
for (i=0; i<n; i=i+1)
    for (j=0; j<n; j=j+1) {
        r = 0;
        for (k=0; k<n; k=k+1) {
            r = r + y[i][k] * z[k][j]; }
        x[i][j] = r;
    }
/* matrix multiply after tiling (T = tile size) */
/* x must be zeroed first, since each (jj,kk) tile adds a partial sum */
for (jj=0; jj<n; jj=jj+T)
  for (kk=0; kk<n; kk=kk+T)
    for (i=0; i<n; i=i+1)
        for (j=jj; j<min(jj+T,n); j=j+1) {
            r = 0;
            for (k=kk; k<min(kk+T,n); k=k+1) {
                r = r + y[i][k] * z[k][j]; }
            x[i][j] = x[i][j] + r;
        }
```



## **Tiling**

[Figure: blocked vs. cyclic assignment of threads to tiles]

The Normal Way (blocked):

- Tiled to exploit data reuse, separate tiles/thread
- Often works, except when: large number of threads, ...
- Issue of tile size sweet spot

The SMT-friendly Way (cyclic):

- Threads share a tile so there is less pressure on the data cache
- Less sensitive to tile size
  - tiles can be large to reduce loop control overhead
  - cross-thread latency hiding hides misses







## **Important Issues**

Cray

- what are its goals & how are they met?
- full-empty bits vs. locks vs. transactional memory

SMT

- what is it?
- what are its goals & how are they met?
- what extra hardware is needed, what extra hardware is not needed?
- how does it do synchronization?

Matching hardware & compiler optimizations