

























|  | Cray MTA |  |
|---|---|---|
|  | Fine-grain multithreaded processor |  |
|  | • can switch to a different thread each cycle |  |
|  | ◦ switches to ready threads only |  |
|  | • up to 128 hardware contexts/processor |  |
|  | ◦ lots of latency to hide, mostly from the multi-hop interconnection network |  |
|  | ◦ average instruction latency for computation: 22 cycles (i.e., 22 instruction streams needed to keep functional units busy) |  |
|  | ◦ average instruction latency including memory: 120 to 200 cycles (i.e., 120 to 200 instruction streams needed to hide all latency, on average) |  |
|  | • processor state for all 128 contexts |  |
|  | ◦ GPRs (total of 4K registers!) |  |
|  | ◦ status registers (includes the PC) |  |
|  | ◦ branch target registers |  |
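
The stream counts on this slide follow from simple arithmetic: if a thread cannot issue again until its previous instruction completes, a latency of L cycles needs about L ready streams to fill every issue slot. The sketch below is a toy model only, not a description of the actual MTA pipeline. It assumes one issue slot per cycle, a fixed latency per instruction, and round-robin selection among ready threads; the function name `utilization` is made up for illustration.

```python
def utilization(num_threads: int, latency: int, cycles: int = 10_000) -> float:
    """Fraction of cycles in which some thread issues an instruction.

    Assumed model: one issue slot per cycle; a thread that issues at cycle t
    cannot issue again until cycle t + latency; the processor scans threads
    round-robin and only ever selects a ready one.
    """
    ready_at = [0] * num_threads   # earliest cycle each thread may issue again
    issued = 0
    next_thread = 0                # round-robin starting point
    for cycle in range(cycles):
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:          # switch to ready threads only
                ready_at[t] = cycle + latency
                issued += 1
                next_thread = (t + 1) % num_threads
                break                         # at most one issue per cycle
    return issued / cycles


if __name__ == "__main__":
    for threads, latency in [(11, 22), (22, 22), (128, 120), (128, 200)]:
        print(threads, latency, round(utilization(threads, latency), 2))
```

Under these assumptions, 22 streams are enough to keep the issue slot busy at a 22-cycle latency, while 128 contexts cover a 120-cycle average latency but not a 200-cycle one.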







|  | <u>Cray MTA</u> |  |
|---|---|---|
|  | Interesting features |  |
|  | • tagged memory, i.e., full/empty bits |  |
|  | ◦ indirectly set full/empty bits to prevent data races |  |
|  | ▪ prevents a consumer from loading a value before a producer has written it |  |
|  | ▪ prevents a producer from overwriting a value before a consumer has read it |  |
|  | ◦ example for the consumer: |  |
|  | ▪ set to empty when producer instruction starts executing |  |
|  | ▪ consumer instructions block if they try to read the producer value |  |
|  | ▪ set to full when producer writes value |  |
|  | ▪ consumers can now read a valid value |  |
|  |  |  |
| Spring 2015                | 471: Multithreaded Processors                                              | 18 |
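
The producer/consumer behavior of a full/empty bit can be imitated in software. The sketch below is a minimal model and not the MTA hardware mechanism: it assumes a single tagged word, a store that waits for the tag to be empty and then sets it to full, and a load that waits for full and then resets it to empty. The class and method names (`FullEmptyWord`, `write_when_empty`, `read_when_full`, `run_demo`) are hypothetical.

```python
import threading


class FullEmptyWord:
    """One memory word with a full/empty tag (software model only)."""

    def __init__(self):
        self._value = None
        self._full = False                  # the tag starts out empty
        self._cond = threading.Condition()

    def write_when_empty(self, value):
        """Producer store: block while full, write, then mark the word full."""
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()

    def read_when_full(self):
        """Consumer load: block while empty, read, then mark the word empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            value = self._value
            self._full = False
            self._cond.notify_all()
            return value


def run_demo(n=5):
    """Pass n values from a producer to a consumer through one tagged word."""
    word = FullEmptyWord()
    received = []

    def consumer():
        for _ in range(n):
            received.append(word.read_when_full())

    t = threading.Thread(target=consumer)
    t.start()                    # the consumer blocks until the word is full
    for i in range(n):
        word.write_when_empty(i)  # blocks until the previous value was read
    t.join()
    return received


if __name__ == "__main__":
    print(run_demo())
```

In this model the consumer never sees a value before it is written, and the producer never overwrites a value that has not been read, which are the two data races the slide lists.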
























|  | Implementing SMT |  |
|---|---|---|
|  | Thread-shared hardware |  |
|  | • branch target buffer |  |
|  | • instruction queues |  |
|  | • functional units |  |
|  | • all caches (physical tags) |  |
|  | • TLBs |  |
|  | • store buffers & MSHRs |  |
|  | Thread-shared hardware is why there is little single-thread performance degradation (~1.5%). |  |
|  | What hardware might you not want to share? |  |

|  | Implementing SMT |  |
|---|---|---|
|  | Does thread-shared hardware cause more conflicts? |  |
|  | • more data cache misses |  |
|  | Does it matter? |  |
|  | • threads hide miss latencies for each other |  |
|  | • data … |  |









|  | Tiling Example |  |
|---|---|---|
|  | <pre>/* matrix multiply before */<br>for (i=0; i&lt;n; i=i+1)<br>  for (j=0; j&lt;n; j=j+1) {<br>    r = 0;<br>    for (k=0; k&lt;n; k=k+1)<br>      r = r + y[i,k] * z[k,j];<br>    x[i,j] = r;<br>  };</pre> |  |
|  | <pre>/* matrix multiply <b>after</b> tiling; x must start at zero */<br>for (jj=0; jj&lt;n; jj=jj+<b>T</b>)<br> for (kk=0; kk&lt;n; kk=kk+<b>T</b>)<br>  for (i=0; i&lt;n; i=i+1)<br>   for (j=jj; j&lt;min(jj+T,n); j=j+1) {<br>    r = 0;<br>    for (k=kk; k&lt;min(kk+T,n); k=k+1)<br>     r = r + y[i,k] * z[k,j];<br>    x[i,j] = x[i,j] + r;  /* add this tile's partial sum */<br>   };</pre> |  |
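
A runnable version of the tiling transformation makes it easy to check that the tiled loop nest computes the same product as the untiled one. The sketch below is in Python for convenience, so it demonstrates correctness only and not cache behavior. The function names are made up for illustration, and the tile bounds are written as half-open ranges, `range(jj, min(jj + T, n))`, so each tile covers exactly T columns (fewer at the edge).

```python
def matmul_naive(y, z, n):
    """x = y * z for n-by-n matrices, untiled loop order i, j, k."""
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            r = 0
            for k in range(n):
                r += y[i][k] * z[k][j]
            x[i][j] = r
    return x


def matmul_tiled(y, z, n, T):
    """Same product, computed one T-by-T tile of z at a time."""
    x = [[0] * n for _ in range(n)]       # must start at zero: tiles accumulate
    for jj in range(0, n, T):             # tile of columns j
        for kk in range(0, n, T):         # tile of the reduction index k
            for i in range(n):
                for j in range(jj, min(jj + T, n)):
                    r = 0
                    for k in range(kk, min(kk + T, n)):
                        r += y[i][k] * z[k][j]
                    x[i][j] += r          # add this tile's partial sum
    return x


if __name__ == "__main__":
    n, T = 5, 2
    y = [[i + j for j in range(n)] for i in range(n)]
    z = [[i * j + 1 for j in range(n)] for i in range(n)]
    print(matmul_tiled(y, z, n, T) == matmul_naive(y, z, n))
```

Choosing n that is not a multiple of T, as in the usage above, exercises the `min(...)` edge handling.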









|  | Important Issues |  |
|---|---|---|
|  | • …y? |  |
|  | • …n do they solve? |  |
|  | • hardware su… |  |
|  | Coarse-grain vs. fine-grain vs. simultaneous multithreading |  |

|  | Important Issues |  |
|---|---|---|
|  | • what are its goals & how are they met? |  |
|  | • full-empty bits vs. locks vs. transactional memory |  |
|  |  |  |
|  | • what are its goals & how are they met? |  |
|  | • what extra hardware is needed, what extra hardware is not needed? |  |
|  | • how does it do synchronization? fetch instructions? schedule instructions? |  |
|  | Matching hardware & compiler optimizations |  |