### **Advanced Caching Techniques**

#### Approaches to improving memory system performance

- eliminate memory operations
- decrease the number of misses
- decrease the miss penalty
- decrease the cache/memory access times
- hide memory latencies
- increase cache throughput
- increase memory bandwidth

Autumn 2006

CSE P548 - Advanced Caching Techniques

# **Handling a Cache Miss the Old Way**

- (1) Send the address & read operation to the next level of the hierarchy
- (2) Wait for the data to arrive
- (3) Update the cache entry with data\*, rewrite the tag, turn the valid bit on, clear the dirty bit (if data cache & write back)
- (4) Resend the memory address; this time there will be a hit.
- \* There are variations:
  - get the data before replacing the block
  - send the requested word to the CPU as soon as it arrives at the cache (early restart)
  - requested word is sent from memory first; then the rest of the block follows (requested word first)

How do the variations improve memory system performance?
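
One way to reason about the question above is a small timing model. The sketch below is only an illustration under stated assumptions, not something taken from the slides: the first word of a block arrives `first_word_latency` cycles after the miss, each later word arrives `per_word` cycles after the previous one, and all function names are hypothetical.

```c
#include <assert.h>

/* Stall cycles when the CPU waits until the whole block has arrived. */
int stall_whole_block(int first_word_latency, int per_word, int words_per_block) {
    assert(words_per_block > 0);
    return first_word_latency + (words_per_block - 1) * per_word;
}

/* Early restart: the block arrives in address order and the CPU restarts
   as soon as the requested word (0-based index `requested`) arrives. */
int stall_early_restart(int first_word_latency, int per_word, int requested) {
    assert(requested >= 0);
    return first_word_latency + requested * per_word;
}

/* Requested word first: memory sends the requested word before the rest
   of the block, so the CPU waits only for the first transfer. */
int stall_requested_word_first(int first_word_latency) {
    assert(first_word_latency >= 0);
    return first_word_latency;
}
```

With an assumed 20-cycle first-word latency, 2 cycles per additional word, an 8-word block, and the requested word at index 5, the three functions return 34, 30, and 20 cycles respectively.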

## **Non-blocking Caches**

#### Non-blocking cache (lockup-free cache)

- allows the CPU to continue executing instructions while a miss is handled
- some processors allow only 1 outstanding miss ("hit under miss")
- some processors allow multiple misses outstanding ("miss under miss")
- miss status holding registers (MSHR)
  - hardware structure for tracking outstanding misses
    - physical address of the block
    - which word in the block
    - destination register number (if data)
    - mechanism to merge requests to the same block
    - mechanism to ensure accesses to the same location execute in program order
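
The MSHR fields listed above can be sketched as a small table of records. This is a minimal illustration, not any particular processor's design: the sizes (`NUM_MSHRS`, `MAX_TARGETS`, 64-byte blocks, 4-byte words) and all names are assumptions chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_MSHRS   4   /* assumed number of outstanding block misses */
#define MAX_TARGETS 4   /* assumed number of merged requests per block */
#define BLOCK_BYTES 64  /* assumed block size */
#define WORD_BYTES  4   /* assumed word size */

typedef struct { uint8_t word; uint8_t dest_reg; } Target;

typedef struct {
    bool     valid;                 /* entry tracks an outstanding miss */
    uint64_t block_addr;            /* physical address of the block */
    int      num_targets;
    Target   targets[MAX_TARGETS];  /* which word, which destination register */
} Mshr;

typedef enum { MSHR_NEW_MISS, MSHR_MERGED, MSHR_STALL } MshrResult;

/* Record a load miss: merge with an outstanding miss to the same block,
   otherwise allocate a free entry, otherwise report that the CPU must stall. */
MshrResult mshr_record_miss(Mshr *m, uint64_t addr, uint8_t dest_reg) {
    assert(m != NULL);
    uint64_t block = addr / BLOCK_BYTES;
    uint8_t  word  = (uint8_t)((addr % BLOCK_BYTES) / WORD_BYTES);
    int free_slot = -1;

    for (int i = 0; i < NUM_MSHRS; i++) {
        if (m[i].valid && m[i].block_addr == block) {
            if (m[i].num_targets == MAX_TARGETS) return MSHR_STALL;
            m[i].targets[m[i].num_targets++] = (Target){word, dest_reg};
            return MSHR_MERGED;   /* no second request goes to memory */
        }
        if (!m[i].valid && free_slot < 0) free_slot = i;
    }
    if (free_slot < 0) return MSHR_STALL;

    m[free_slot].valid = true;
    m[free_slot].block_addr = block;
    m[free_slot].num_targets = 1;
    m[free_slot].targets[0] = (Target){word, dest_reg};
    return MSHR_NEW_MISS;
}
```

Targets are appended in arrival order, which is one simple way to keep accesses to the same block in program order when the data returns.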

## **Non-blocking Caches**

#### Non-blocking cache (lockup-free cache)

- can be used with both in-order and out-of-order processors
  - in-order processors stall when an instruction that uses the load data is the next instruction to be executed (non-blocking loads)
  - out-of-order processors can execute instructions after the load consumer

How do non-blocking caches improve memory system performance?

## **Victim Cache**

#### Victim cache

- small fully-associative cache
  - contains the most recently replaced blocks of a direct-mapped cache
- check it on a cache miss
  - on a victim cache hit, swap the direct-mapped block and victim cache block
- alternative to 2-way set-associative cache
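
The lookup-and-swap behavior above can be sketched as follows. The sizes, the FIFO replacement in the victim cache, and every name here are assumptions made for illustration; the slides do not specify them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DM_SETS    4  /* assumed number of direct-mapped lines */
#define VC_ENTRIES 2  /* assumed victim cache size */

typedef struct { bool valid; uint64_t block; } Line;

typedef struct {
    Line dm[DM_SETS];     /* direct-mapped cache */
    Line vc[VC_ENTRIES];  /* fully-associative victim cache */
    int  vc_next;         /* FIFO replacement pointer (an assumption) */
} Cache;

typedef enum { HIT_DM, HIT_VICTIM, MISS } Outcome;

Outcome cache_access(Cache *c, uint64_t block) {
    assert(c != NULL);
    Line *slot = &c->dm[block % DM_SETS];
    if (slot->valid && slot->block == block) return HIT_DM;

    /* Miss in the direct-mapped cache: search every victim entry. */
    for (int i = 0; i < VC_ENTRIES; i++) {
        if (c->vc[i].valid && c->vc[i].block == block) {
            Line tmp = *slot;      /* swap the two blocks */
            *slot = c->vc[i];
            c->vc[i] = tmp;
            return HIT_VICTIM;
        }
    }

    /* Miss in both: the replaced block becomes the newest victim. */
    if (slot->valid) {
        c->vc[c->vc_next] = *slot;
        c->vc_next = (c->vc_next + 1) % VC_ENTRIES;
    }
    slot->valid = true;
    slot->block = block;
    return MISS;
}
```

Two blocks that conflict in the direct-mapped cache (here blocks 0 and 4) ping-pong between the two structures instead of going to the next level on every access.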

How do victim caches improve memory system performance?

Why do victim caches work?

# **Sub-block Placement**

Divide a block into sub-blocks

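| tag | valid | sub-block 0 | valid | sub-block 1 | valid | sub-block 2 | valid | sub-block 3 |
|-----|-------|-------------|-------|-------------|-------|-------------|-------|-------------|
| tag | I     | data        | V     | data        | V     | data        | I     | data        |
| tag | I     | data        | V     | data        | V     | data        | V     | data        |
| tag | V     | data        | V     | data        | V     | data        | V     | data        |
| tag | I     | data        | I     | data        | I     | data        | I     | data        |

V = valid sub-block, I = invalid sub-block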
- sub-block = unit of transfer on a cache miss
- one valid bit per sub-block
- misses:
  - block-level miss: tags didn't match
  - sub-block-level miss: tags matched, valid bit was clear
- + miss penalty is only the transfer time of a sub-block
- + fewer tags than if each sub-block were a block
- − less implicit prefetching
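
The two kinds of miss can be told apart with a tag compare followed by a valid-bit check. The sketch below models a single cache frame; the four sub-blocks per block and all names are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SUBBLOCKS_PER_BLOCK 4  /* assumed; matches the diagram's four sub-blocks */

typedef struct {
    uint64_t tag;                         /* one tag for the whole block */
    bool     valid[SUBBLOCKS_PER_BLOCK];  /* one valid bit per sub-block */
} Frame;

typedef enum { SB_HIT, SB_SUBBLOCK_MISS, SB_BLOCK_MISS } SbOutcome;

SbOutcome frame_access(Frame *f, uint64_t tag, int sub) {
    assert(sub >= 0 && sub < SUBBLOCKS_PER_BLOCK);
    if (f->tag == tag) {
        if (f->valid[sub]) return SB_HIT;
        f->valid[sub] = true;      /* fetch only the missing sub-block */
        return SB_SUBBLOCK_MISS;
    }
    /* Block-level miss: new tag, every other sub-block becomes invalid. */
    f->tag = tag;
    for (int i = 0; i < SUBBLOCKS_PER_BLOCK; i++) f->valid[i] = false;
    f->valid[sub] = true;          /* only the requested sub-block is fetched */
    return SB_BLOCK_MISS;
}
```

Either kind of miss transfers just one sub-block, which is where the shorter miss penalty comes from; the neighboring sub-blocks stay invalid, which is the lost implicit prefetching.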

How does sub-block placement improve memory system performance?

## **Pseudo-set associative Cache**

#### Pseudo-set associative cache

- access the cache
- if miss, invert the high-order index bit & access the cache again
- + miss rate of 2-way set associative cache
- + access time of direct-mapped cache if hit in the "fast-hit block"
  - predict which is the fast-hit block
- − increase in hit time (relative to 2-way associative) if always hit in the "slow-hit block"
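
The second probe differs from the first only in the high-order index bit. A minimal sketch, assuming a 3-bit index and storing the full block address in place of a tag (both simplifications for the example; the names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INDEX_BITS 3                   /* assumed index width */
#define NUM_LINES  (1u << INDEX_BITS)

typedef struct { bool valid; uint64_t block; } PLine;
typedef enum { FAST_HIT, SLOW_HIT, PMISS } POutcome;

/* Invert the high-order index bit to find the second place to look. */
unsigned alternate_index(unsigned index) {
    assert(index < NUM_LINES);
    return index ^ (1u << (INDEX_BITS - 1));
}

POutcome pseudo_lookup(const PLine *lines, uint64_t block) {
    unsigned index = (unsigned)(block % NUM_LINES);
    if (lines[index].valid && lines[index].block == block)
        return FAST_HIT;               /* direct-mapped access time */
    unsigned alt = alternate_index(index);
    if (lines[alt].valid && lines[alt].block == block)
        return SLOW_HIT;               /* costs a second cache access */
    return PMISS;
}
```

Because a block can live at either index, the stored tag has to distinguish the two candidates; keeping the whole block address, as done here, is the simplest way to guarantee that.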

How does pseudo-set associativity improve memory system performance?

## **Pipelined Cache Access**

#### Pipelined cache access

- simple 2-stage pipeline
  - access the cache
  - data transfer back to CPU
  - tag check & hit/miss logic go in whichever stage is shorter
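
A rough way to see what pipelining buys is to count cycles for a run of back-to-back hits. The model below is an assumption for illustration only: every stage takes one cycle, there are no misses or stalls, and the function names are hypothetical.

```c
#include <assert.h>

/* Unpipelined: each access occupies the cache for `latency` cycles. */
long cycles_unpipelined(long n, long latency) {
    assert(n >= 0 && latency > 0);
    return n * latency;
}

/* Pipelined into `latency` one-cycle stages: a new access can start every
   cycle, so only the first access pays the full latency. */
long cycles_pipelined(long n, long latency) {
    assert(n >= 0 && latency > 0);
    return n == 0 ? 0 : latency + (n - 1);
}
```

For 100 hits through a 2-stage pipeline this gives 101 cycles instead of 200: the latency of a single access is unchanged, but throughput nearly doubles.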

How do pipelined caches improve memory system performance?

## **Mechanisms for Prefetching**

#### Hardware-controlled prefetching

- overlap prefetching & execution
- design issue: how close to the CPU to put the prefetched data
- stream buffers
  - where prefetched instructions/data are held
  - if the requested block is in the stream buffer, then cancel the cache access
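
A stream buffer can be sketched as a small FIFO of sequentially prefetched block addresses. The version below makes several assumptions the slides do not state: only the head entry is compared, the buffer holds `SB_DEPTH` blocks, a miss in the buffer restarts the stream at the next sequential block, and prefetched data is assumed to have already arrived. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SB_DEPTH 4  /* assumed number of prefetched blocks held */

typedef struct {
    bool     valid;
    uint64_t blocks[SB_DEPTH];  /* prefetched block addresses, FIFO order */
    int      head;              /* oldest entry, the only one compared */
    uint64_t next_block;        /* next sequential block to prefetch */
} StreamBuffer;

/* Discard the old stream and start prefetching after the missing block. */
static void sb_restart(StreamBuffer *sb, uint64_t miss_block) {
    for (int i = 0; i < SB_DEPTH; i++) sb->blocks[i] = miss_block + 1 + (uint64_t)i;
    sb->head = 0;
    sb->next_block = miss_block + 1 + SB_DEPTH;
    sb->valid = true;
}

/* Called on a cache miss. Returns true if the stream buffer supplies the
   block, in which case the access to the next level can be cancelled. */
bool sb_lookup(StreamBuffer *sb, uint64_t block) {
    assert(sb != NULL);
    if (sb->valid && sb->blocks[sb->head] == block) {
        sb->blocks[sb->head] = sb->next_block++;  /* refill the freed slot */
        sb->head = (sb->head + 1) % SB_DEPTH;
        return true;
    }
    sb_restart(sb, block);
    return false;
}
```

A sequential walk through memory misses once and then hits in the buffer on every following block, which is how the prefetching overlaps with execution.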

How do prefetching mechanisms improve memory system performance?

## **Trace Cache**

#### **Trace cache contents**

- contains instructions from the *dynamic* instruction stream
  - + fetch statically noncontiguous instructions in a single cycle
  - + a more efficient use of "I-cache" space
- a trace is analogous to a cache block with respect to accessing

### **Trace Cache**

#### Accessing a trace cache

- trace cache state includes low bits of next addresses (target & fallthrough code) for the last instruction in the currently executing trace, which is a branch
- trace cache tag is high branch address bits + predictions for all branches in the trace
- access trace cache & branch predictor, BTB, I-cache in parallel
- compare high PC bits & prediction history of the current branch instruction to the trace cache tag
- hit: use trace cache & I-cache fetch ignored
- miss: use the I-cache & start constructing a new trace
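
The tag comparison described above can be sketched as a match on both the address and the predicted branch outcomes. This is a simplified illustration, not the organization of any real trace cache: it compares a full starting address rather than separate high and low bits, and the structure layout and names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    uint32_t start_pc;      /* address the trace begins at */
    unsigned num_branches;  /* branches embedded in the trace */
    uint32_t outcomes;      /* bit i = 1 if branch i was taken when the trace was built */
} TraceEntry;

/* A trace is usable only if the fetch address matches and the branch
   predictor currently predicts the same path the trace followed. */
bool trace_hit(const TraceEntry *t, uint32_t fetch_pc, uint32_t predictions) {
    assert(t != NULL && t->num_branches < 32);
    uint32_t mask = (1u << t->num_branches) - 1u;  /* only this trace's branches */
    return t->valid
        && t->start_pc == fetch_pc
        && ((t->outcomes ^ predictions) & mask) == 0;
}
```

On a match, the whole trace, including instructions on both sides of taken branches, is delivered in one fetch; on a mismatch the fetch falls back to the I-cache.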

Why does a trace cache work?

### **Trace Cache**

Effect on performance?

# **Cache-friendly Compiler Optimizations**

Exploit spatial locality

- schedule for array misses
  - hoist first load to each cache block

Improve spatial locality

- group & transpose
  - makes portions of vectors that are accessed together lie in memory together
- loop interchange
  - so inner loop follows memory layout

Improve temporal locality

- loop fusion
  - do multiple computations on the same portion of an array
- tiling (also called blocking)
  - do all computation on a small block of memory that will fit in the cache
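
Loop interchange is the easiest of these to show in a few lines. The sketch below assumes C's row-major array layout; the array dimensions and function names are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 64
#define COLS 64

/* Before: the inner loop walks down a column, so consecutive accesses
   are COLS elements apart and touch a different cache block each time. */
long sum_column_order(int a[ROWS][COLS]) {
    assert(a != NULL);
    long sum = 0;
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            sum += a[i][j];
    return sum;
}

/* After interchange: the inner loop walks along a row, so consecutive
   accesses are adjacent in memory and share cache blocks. */
long sum_row_order(int a[ROWS][COLS]) {
    assert(a != NULL);
    long sum = 0;
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            sum += a[i][j];
    return sum;
}
```

Both functions compute the same sum; only the order in which memory is touched changes, which is what makes the transformation legal here.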

### Tiling Example

```
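/* Matrix multiply x = y * z on n-by-n matrices; T is the tile size. */
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* before */
for (i = 0; i < n; i = i + 1)
    for (j = 0; j < n; j = j + 1) {
        r = 0;
        for (k = 0; k < n; k = k + 1) { r = r + y[i][k] * z[k][j]; }
        x[i][j] = r;
    }

/* after: x must start out zeroed, because each (jj, kk) tile adds a partial sum */
for (jj = 0; jj < n; jj = jj + T)
    for (kk = 0; kk < n; kk = kk + T)
        for (i = 0; i < n; i = i + 1)
            for (j = jj; j < MIN(jj + T, n); j = j + 1) {
                r = 0;
                /* upper bound is jj+T / kk+T (not T-1) so no column is skipped */
                for (k = kk; k < MIN(kk + T, n); k = k + 1) { r = r + y[i][k] * z[k][j]; }
                x[i][j] = x[i][j] + r;
            }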
```
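
To check that the tiled loop nest computes the same product as the original, both versions can be wrapped in functions and compared. This is a small test sketch, not part of the original slide: the matrix size `N`, the tile size `T` (chosen so that it does not divide `N` evenly), the `int` element type, and the function names are all assumptions.

```c
#include <assert.h>

#define N 6  /* matrix dimension (illustrative) */
#define T 4  /* tile size; deliberately not a divisor of N */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Untiled version: x = y * z. */
void matmul_before(int x[N][N], int y[N][N], int z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int r = 0;
            for (int k = 0; k < N; k++) r += y[i][k] * z[k][j];
            x[i][j] = r;
        }
}

/* Tiled version: works on a T-by-T piece of z at a time and accumulates
   partial sums into x, so x is cleared first. */
void matmul_after(int x[N][N], int y[N][N], int z[N][N]) {
    assert(T > 0);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i][j] = 0;

    for (int jj = 0; jj < N; jj += T)
        for (int kk = 0; kk < N; kk += T)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min_int(jj + T, N); j++) {
                    int r = 0;
                    for (int k = kk; k < min_int(kk + T, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}
```

Using a tile size that does not divide the matrix dimension exercises the `min` bounds on the last, partial tile.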

# **Memory Banks**

#### Interleaved memory:

- multiple memory banks
  - word locations are assigned across banks
  - interleaving factor: number of banks
  - send a single address to all banks at once

Word addresses held in each bank:

| Bank 0 | Bank 1 | Bank 2 | Bank 3 |
|--------|--------|--------|--------|
| 0      | 1      | 2      | 3      |
| 4      | 5      | 6      | 7      |
| 8      | 9      | 10     | 11     |
| 12     | 13     | 14     | 15     |
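
The assignment of word addresses to banks in the table follows from simple modular arithmetic. A minimal sketch, assuming four banks as in the table and hypothetical function names:

```c
#include <assert.h>

#define NUM_BANKS 4  /* interleaving factor, as in the table */

/* Bank that holds a given word address. */
unsigned bank_of(unsigned word_addr) {
    return word_addr % NUM_BANKS;
}

/* Position of the word within its bank (the row of the table). */
unsigned row_in_bank(unsigned word_addr) {
    return word_addr / NUM_BANKS;
}

/* Inverse mapping: rebuild the word address from bank and row. */
unsigned word_address(unsigned bank, unsigned row) {
    assert(bank < NUM_BANKS);
    return row * NUM_BANKS + bank;
}
```

Consecutive word addresses fall in consecutive banks, so sending one row address to all banks returns four adjacent words in a single access.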

## **Memory Banks**

### Interleaved memory:

- + get more data for one transfer
  - data is probably used (why?)
- − larger DRAM chip capacity means fewer banks
- − power issue

Effect on performance?

# **Memory Banks**

#### **Independent memory banks**

- different banks can be accessed at once, with different addresses
- allows parallel access, possibly parallel data transfer
- multiple memory controllers & separate address lines, one for each access
  - different controllers cannot access the same bank
- less area than dual porting

Effect on performance?

# **Today's Memory Subsystems**

Look for designs in common:

|               | 21264                                               | R12000                                | UltraSPARC-III                  | Pentium IV                    |
|---------------|-----------------------------------------------------|---------------------------------------|---------------------------------|-------------------------------|
| L1 I (onchip) | 64KB                                                | 32KB                                  | 32KB                            | 12Kuop trace cache (~8-16KB)  |
|               | 2-way with set prediction                           | 2-way                                 | 4-way                           |                               |
|               | 64B block                                           | 64B block                             | 32B block                       | 6 uops/line                   |
|               | virtually indexed                                   |                                       | virtually indexed, virtual tags | virtually indexed             |
|               |                                                     | 2-cycle access<br>critical word first | pipelined 2-cycle access        |                               |
| L1 D (onchip) | 64KB                                                | 32KB                                  | 64KB                            | 8KB                           |
|               | 2-way                                               | 2-way, LRU replacement                | 4-way                           | 4-way                         |
|               | 64B block                                           | 32B block                             | 32B block                       | 64B block                     |
|               | write-back                                          |                                       | write-through                   | write-through                 |
|               |                                                     |                                       | store compression               |                               |
|               | virtually indexed, physical tags                    | physical tags                         | virtually indexed               | virtually indexed             |
|               | TLB in parallel                                     |                                       | TLB in parallel                 |                               |
|               | 3 (int) or 4 (FP) cycle reads                       | 2-cycle access                        |                                 | 2 cycle latency               |
|               | phase-pipelined (read twice each cycle)             |                                       | pipelined 2-cycle access        | pipelined                     |
|               | miss under miss (32 loads or 8 blocks outstanding)  | nonblocking                           | nonblocking                     | nonblocking                   |
|               | victim cache                                        | critical word first                   |                                 | requested word first          |
| L2            | external                                            | external                              | external                        | onchip                        |
|               | 1MB-16MB                                            | 1MB-16MB                              | up to 8MB                       | 256KB                         |
|               | direct-mapped                                       | 2-way pseudo, way<br>prediction, LRU  | direct-mapped                   | 8-way                         |
|               | 64B block                                           | 128B blocks                           | 32B blocks                      | 128B block<br>64B "subblocks" |
|               | write-back                                          | write-back                            | write-back                      | write-back                    |
|               | physical                                            |                                       | physical                        | physically indexed            |
|               | nonblocking                                         |                                       |                                 | nonblocking                   |
|               | 12 cycles                                           |                                       | 12 cycles                       |                               |
|               |                                                     |                                       | pipelined access                | pipelined                     |
| TLB           | 128 entries                                         | 64 entries, each maps to 2 pages      |                                 |                               |
|               | FA                                                  | FA                                    |                                 |                               |
|               | dual-ported                                         |                                       |                                 |                               |
|               | multiple page sizes                                 | 4KB - 16MB pages                      | multiple page sizes             | multiple page sizes           |
|               | PAL code handling                                   |                                       | software handling               | hardware handling             |

## **Advanced Caching Techniques**

#### Approaches to improving memory system performance

- eliminate memory operations
- decrease the number of misses
- decrease the miss penalty
- decrease the cache/memory access times
- hide memory latencies
- increase cache throughput
- increase memory bandwidth

### Wrap-up

Victim cache (reduce miss penalty)

TLB (reduce address translation time)

Hardware or compiler-based prefetching (reduce misses)

Cache-conscious compiler optimizations (reduce misses or hide miss penalty)

Coupling a write-through memory update policy with a write buffer (eliminate store ops/hide store latencies)

Handling the read miss before replacing a block with a write-back memory update policy (reduce miss penalty)

Sub-block placement (reduce miss penalty)

Non-blocking caches (hide miss penalty)

Merging requests to the same cache block in a non-blocking cache (hide miss penalty)

Requested word first or early restart (reduce miss penalty)

Cache hierarchies (reduce misses/reduce miss penalty)

Virtual caches (reduce miss penalty)

Pipelined cache accesses (increase cache throughput)

Pseudo-set associative cache (reduce misses)

Banked or interleaved memories (increase bandwidth)

Independent memory banks (hide latency)

Wider bus (increase bandwidth)