







| Non-blocking Caches       |                                                           |
|---------------------------|-----------------------------------------------------------|
| in-order processors       |                                                           |
| lw <b>\$3</b>, 100(\$4)   | in execution, cache miss                                  |
| add \$2, <b>\$3</b>, \$4  | consumer waits until the miss is satisfied                |
| sub \$5, \$6, \$7         | independent instruction waits for the add                 |
| out-of-order processors   |                                                           |
| lw <b>\$3</b>, 100(\$4)   | in execution, cache miss                                  |
| sub \$5, \$6, \$7         | independent instruction can execute during the cache miss |
| add \$2, <b>\$3</b>, \$4  | consumer waits until the miss is satisfied                |
| Spring 2013               | CSE 471 - Advanced Caching Techniques                     |
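
The two orderings above can be sketched with a toy timing model. This is a minimal sketch, not a pipeline simulator: the `Instr` record, the `schedule` function, the one-instruction-per-cycle fetch rate, and the unlimited out-of-order window are all assumptions made for illustration, and the slide itself gives no latencies.

```c
#include <assert.h>

#define NUM_REGS 32

/* Hypothetical instruction record (not from the slides). */
typedef struct {
    int dest, src1, src2;  /* register numbers; use 0 for an unused source */
    int latency;           /* cycles from start of execution until dest is ready */
} Instr;

static int max_int(int a, int b) { return a > b ? a : b; }

/*
 * Fills start[i] with the cycle in which instruction i begins executing.
 * In order: an instruction cannot start before its predecessor has started.
 * Out of order: an instruction starts as soon as it has been fetched
 * (one per cycle) and its source registers are ready.
 * Assumes register renaming, so only true (read-after-write) dependences stall.
 */
void schedule(const Instr *prog, int n, int out_of_order, int *start)
{
    int ready[NUM_REGS] = {0};  /* cycle at which each register becomes ready */

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        const Instr *p = &prog[i];
        assert(p->dest >= 0 && p->dest < NUM_REGS);
        assert(p->src1 >= 0 && p->src1 < NUM_REGS);
        assert(p->src2 >= 0 && p->src2 < NUM_REGS);

        int operands = max_int(ready[p->src1], ready[p->src2]);
        int earliest = out_of_order ? i : (i == 0 ? 0 : start[i - 1] + 1);

        start[i] = max_int(earliest, operands);
        ready[p->dest] = start[i] + p->latency;
    }
}
```

For the slide's three instructions with a hypothetical 10-cycle miss, encoded as `{3,4,0,10}`, `{2,3,4,1}`, `{5,6,7,1}`, this model starts the `add` at cycle 10 in both cases, but starts the `sub` at cycle 11 in order and at cycle 2 out of order.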



| Sub-block Placement                                              |        |        |        |        |
|------------------------------------------------------------------|--------|--------|--------|--------|
| Divide a block into sub-blocks                                   |        |        |        |        |
| tag                                                              | I data | V data | V data | I data |
| tag                                                              | I data | V data | V data | V data |
| tag                                                              | V data | V data | V data | V data |
| tag                                                              | I data | I data | I data | I data |
| • sub-block = unit of transfer on a cache miss                   |        |        |        |        |
| • one valid bit per sub-block (V = valid, I = invalid)           |        |        |        |        |
| • 2 kinds of misses:                                             |        |        |        |        |
| ◦ block-level miss: tags didn't match                            |        |        |        |        |
| ◦ sub-block-level miss: tags matched, valid bit was clear        |        |        |        |        |
| + a miss costs only the transfer time of a sub-block             |        |        |        |        |
| + fewer tags than if each block were the size of a sub-block     |        |        |        |        |
| - less implicit prefetching                                      |        |        |        |        |
| How does sub-block placement improve memory system performance?  |        |        |        |        |
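
The two kinds of misses can be made concrete with a small lookup routine. This is a sketch under stated assumptions: a direct-mapped cache with 4 blocks of 4 sub-blocks of 16 bytes each, and the names `Cache`, `Block`, `Outcome`, and `access_cache` are hypothetical, not taken from the slides. Only the bookkeeping is modeled, not the data.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry, chosen only for illustration. */
enum { NUM_BLOCKS = 4, SUBS_PER_BLOCK = 4, SUB_BYTES = 16 };

typedef enum { HIT, SUBBLOCK_MISS, BLOCK_MISS } Outcome;

typedef struct {
    int      in_use;                 /* has this frame ever been allocated? */
    uint32_t tag;                    /* one tag for the whole block */
    uint8_t  valid[SUBS_PER_BLOCK];  /* one valid bit per sub-block */
} Block;

typedef struct { Block blocks[NUM_BLOCKS]; } Cache;

/* Classifies one access and updates the tag and valid bits accordingly. */
Outcome access_cache(Cache *c, uint32_t addr)
{
    uint32_t block_bytes = SUB_BYTES * SUBS_PER_BLOCK;
    uint32_t sub   = (addr / SUB_BYTES) % SUBS_PER_BLOCK;
    uint32_t index = (addr / block_bytes) % NUM_BLOCKS;
    uint32_t tag   = addr / (block_bytes * NUM_BLOCKS);
    Block *b = &c->blocks[index];

    assert(c != 0);
    if (!b->in_use || b->tag != tag) {
        /* Block-level miss: replace the tag, invalidate every sub-block,
           and fetch only the sub-block that was requested. */
        b->in_use = 1;
        b->tag = tag;
        for (int i = 0; i < SUBS_PER_BLOCK; i++)
            b->valid[i] = 0;
        b->valid[sub] = 1;
        return BLOCK_MISS;
    }
    if (!b->valid[sub]) {
        /* Sub-block-level miss: tag matched, valid bit was clear. */
        b->valid[sub] = 1;
        return SUBBLOCK_MISS;
    }
    return HIT;
}
```

Starting from an empty cache (`Cache c = {0};`), address `0x000` is a block-level miss, `0x004` then hits in the same sub-block, and `0x010` is a sub-block-level miss because only its neighbor was fetched, which is the "less implicit prefetching" cost listed above.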









| Tiling Example |
|----------------|

<pre>
/* before */
for (i = 0; i &lt; n; i = i+1)
  for (j = 0; j &lt; n; j = j+1) {
    r = 0;
    for (k = 0; k &lt; n; k = k+1)
      r = r + y[i][k] * z[k][j];
    x[i][j] = r;
  }

/* after: T x T tiles; x must be zeroed first */
for (jj = 0; jj &lt; n; jj = jj+T)
  for (kk = 0; kk &lt; n; kk = kk+T)
    for (i = 0; i &lt; n; i = i+1)
      for (j = jj; j &lt; min(jj+T, n); j = j+1) {
        r = 0;
        for (k = kk; k &lt; min(kk+T, n); k = k+1)
          r = r + y[i][k] * z[k][j];
        x[i][j] = x[i][j] + r;
      }
</pre>
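
The loop nests above can be written as compilable C. This is a minimal sketch, assuming square `n` x `n` matrices stored row-major in flat arrays; the function names `matmul_naive` and `matmul_tiled` and the helper `min_int` are hypothetical. The tile size `T` would normally be chosen so that the tiles being reused fit in the cache, which is not modeled here.

```c
#include <assert.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Untiled version: x = y * z, all matrices n x n, row-major. */
void matmul_naive(int n, const double *y, const double *z, double *x)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double r = 0.0;
            for (int k = 0; k < n; k++)
                r += y[i * n + k] * z[k * n + j];
            x[i * n + j] = r;
        }
}

/* Tiled version: the same products, visited one T x T tile of z at a time. */
void matmul_tiled(int n, int T, const double *y, const double *z, double *x)
{
    assert(n >= 0 && T > 0);

    /* Each x[i][j] is accumulated across the kk tiles, so clear it first. */
    for (int i = 0; i < n * n; i++)
        x[i] = 0.0;

    for (int jj = 0; jj < n; jj += T)
        for (int kk = 0; kk < n; kk += T)
            for (int i = 0; i < n; i++)
                for (int j = jj; j < min_int(jj + T, n); j++) {
                    double r = 0.0;
                    for (int k = kk; k < min_int(kk + T, n); k++)
                        r += y[i * n + k] * z[k * n + j];
                    x[i * n + j] += r;
                }
}
```

Both functions perform the same multiplications; the tiled one adds partial sums in a different order, so with general floating-point inputs the results can differ in the last bits.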





| Memory Banks                                                          |
|-----------------------------------------------------------------------|
| Independent memory banks                                              |
| • different banks can be accessed at once, with different addresses   |
| • allows parallel access, possibly parallel data transfer             |
| • multiple memory controllers & separate address lines, one for each  |
| ◦ different controllers cannot access the same bank                   |
| • less area than dual porting                                         |
| Effect on memory system performance?                                  |
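
The bank-conflict rule above can be sketched as a small counting routine. Everything concrete here is an assumption for illustration: 4 banks, 2 controllers, 4-byte words, word-interleaved bank mapping, and the names `bank_of` and `access_slots`. The slide does not say how addresses map to banks.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed configuration, chosen only for illustration. */
enum { NUM_BANKS = 4, NUM_CONTROLLERS = 2, WORD_BYTES = 4 };

/* Assumed mapping: consecutive words go to consecutive banks. */
int bank_of(uint32_t addr)
{
    return (int)((addr / WORD_BYTES) % NUM_BANKS);
}

/*
 * Counts the access slots needed to serve the requests in order.
 * In one slot, each controller may start one access, but two controllers
 * cannot access the same bank, so a conflicting request waits for the
 * next slot.
 */
int access_slots(const uint32_t *addrs, int n)
{
    int slots = 0;
    int i = 0;

    assert(n >= 0);
    while (i < n) {
        int busy[NUM_BANKS] = {0};  /* banks already claimed in this slot */
        int issued = 0;

        while (i < n && issued < NUM_CONTROLLERS && !busy[bank_of(addrs[i])]) {
            busy[bank_of(addrs[i])] = 1;
            issued++;
            i++;
        }
        slots++;
    }
    return slots;
}
```

Under these assumptions, four requests to addresses 0, 4, 8, 12 fall in four different banks and need 2 slots, while 0, 16, 32, 48 all map to bank 0 and need 4.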



| Other Techniques                                                                              |
|-----------------------------------------------------------------------------------------------|
| compiler-based prefetching (decreases misses)                                                 |
| … write-through memory update policy with a write buffer (eliminates … hides store latencies) |
| Merging requests to the same cache block in a non-blocking cache (hide miss penalty)          |
| TLB (reduce page fault time (penalty))                                                        |
| Cache hierarchies (reduce miss penalty)                                                       |
| Virtual caches (reduce L1 cache access time)                                                  |
| Wider bus (increase bandwidth)                                                                |
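
Merging requests to the same cache block, listed above, can be sketched as a table of outstanding misses. The slides do not describe a mechanism, so this is an assumption-laden illustration: the table plays the role usually given to miss status holding registers, and the names `Mshr`, `mshr_miss`, `mshr_fill`, the 4-entry size, and the 64-byte block size are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizes, chosen only for illustration. */
enum { MSHR_ENTRIES = 4, BLOCK_BYTES = 64 };

typedef enum { MSHR_NEW_MISS, MSHR_MERGED, MSHR_FULL } MshrResult;

typedef struct {
    int      in_use[MSHR_ENTRIES];
    uint32_t block[MSHR_ENTRIES];    /* block number of the outstanding miss */
    int      waiters[MSHR_ENTRIES];  /* requests waiting on that block */
} Mshr;

/* Records a cache miss; only MSHR_NEW_MISS would send a request to memory. */
MshrResult mshr_miss(Mshr *m, uint32_t addr)
{
    uint32_t block = addr / BLOCK_BYTES;
    int free_slot = -1;

    assert(m != 0);
    for (int i = 0; i < MSHR_ENTRIES; i++) {
        if (m->in_use[i] && m->block[i] == block) {
            m->waiters[i]++;         /* same block is already on its way */
            return MSHR_MERGED;
        }
        if (!m->in_use[i] && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return MSHR_FULL;            /* no entry left: the cache must stall */

    m->in_use[free_slot] = 1;
    m->block[free_slot] = block;
    m->waiters[free_slot] = 1;
    return MSHR_NEW_MISS;
}

/* Called when a block arrives; returns how many waiting requests it satisfies. */
int mshr_fill(Mshr *m, uint32_t addr)
{
    uint32_t block = addr / BLOCK_BYTES;

    assert(m != 0);
    for (int i = 0; i < MSHR_ENTRIES; i++) {
        if (m->in_use[i] && m->block[i] == block) {
            m->in_use[i] = 0;
            return m->waiters[i];
        }
    }
    return 0;
}
```

In this sketch, misses to `0x100` and `0x108` share one memory request because both addresses lie in the same 64-byte block, so the second miss adds no extra memory traffic.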