HW4 Sample Solution


Calculating MCPI

In this problem, I assumed that "penalty" meant the number of cycles an instruction takes beyond the normal CPI. So for a read hit (1 cycle) there is no penalty, and for a write hit (2 cycles) there is a 1-cycle penalty. You may instead have counted the read hit as a 1-cycle penalty and the write hit as a 2-cycle penalty. The problem stated that there was no penalty for a write miss; I assumed this meant write misses took 0 penalty cycles. You may have picked 1 cycle (if you said that the write hit penalty was 2 cycles). If you made any assumptions that differ from mine, be sure to plug in the appropriate numbers.

The I-cache references, which occur every cycle, will not affect our MCPI.

Assume there are I instructions.
There are 0.3*I data memory accesses.
0.25*0.3*I = 0.075*I are writes.
0.75*0.3*I = 0.225*I are reads.

95% of writes and reads are hits.
The read hits incur no memory penalty.
0.95*0.075*I write hits * 1 penalty cycle = 0.071*I write hit penalty cycles.

There is no penalty for a write miss.
We will ignore the writing of data back to main memory since no information was specified; assume an infinite write buffer for now.
The memory penalty for a read miss is 4 cycles to send the address + 24 cycles to get the 4 word block out of memory + 8 cycles to get those words over the bus (4 cycles for each 2-word pair). This adds up to 36 cycles.
0.05*0.225*I read misses * 36 cycles = 0.405*I read miss penalty cycles.

Write hit penalty cycles + read miss penalty cycles = total memory penalty cycles
0.071*I + 0.405*I = 0.476*I.
0.476*I/I = 0.476
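
As a sanity check, here is the arithmetic above as a small Python sketch (the variable names are mine, not from the problem statement):

    # Base MCPI, under the penalty assumptions stated above:
    # write hit = 1 penalty cycle, read miss = 36, write miss = 0.
    data_frac  = 0.30                 # data accesses per instruction
    write_frac = 0.25                 # fraction of accesses that are writes
    miss_rate  = 0.05                 # miss rate for reads and writes
    read_miss_penalty = 4 + 24 + 8    # address + memory latency + bus transfer

    write_hits  = (1 - miss_rate) * write_frac * data_frac   # 0.07125 per instr.
    read_misses = miss_rate * (1 - write_frac) * data_frac   # 0.01125 per instr.

    mcpi = write_hits * 1 + read_misses * read_miss_penalty
    print(mcpi)                       # 0.47625, i.e. about 0.476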

With a victim cache, we will have a smaller penalty on read misses, since 20% of our misses will now take only one penalty cycle instead of 36 penalty cycles.
0.8*0.05*0.225*I non-victim-hit read misses * 36 cycles + 0.2*0.05*0.225*I victim-hit read misses * 1 cycle = 0.326*I read miss penalty cycles.
0.071*I + 0.326*I = 0.397*I.
0.397*I/I = 0.397
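
The same sketch, adjusted for the victim cache:

    # Victim cache: 20% of read misses hit in the victim cache and
    # cost 1 penalty cycle; the other 80% still cost 36.
    write_hits  = 0.95 * 0.25 * 0.30            # 0.07125 per instruction
    read_misses = 0.05 * 0.75 * 0.30            # 0.01125 per instruction
    mcpi = write_hits * 1 + read_misses * (0.8 * 36 + 0.2 * 1)
    print(mcpi)                                 # 0.3975, i.e. about 0.397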

With the L2 cache, we will also have a smaller penalty on read misses. 50% of our misses will now take 6 penalty cycles. The other 50% will now take a little bit longer since our extra check in L2 requires two more penalty cycles. So the other 50% will take 38 penalty cycles now.
0.5*0.05*0.225*I non-L2-hit read misses * 38 cycles + 0.5*0.05*0.225*I L2-hit read misses * 6 cycles = 0.248*I read miss penalty cycles.
0.071*I + 0.248*I = 0.319*I.
0.319*I/I = 0.319
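
And adjusted for the L2 cache:

    # L2 cache: 50% of read misses hit in L2 (6 penalty cycles);
    # the other 50% pay 2 extra cycles for the L2 check (38 total).
    write_hits  = 0.95 * 0.25 * 0.30            # 0.07125 per instruction
    read_misses = 0.05 * 0.75 * 0.30            # 0.01125 per instruction
    mcpi = write_hits * 1 + read_misses * (0.5 * 38 + 0.5 * 6)
    print(mcpi)                                 # 0.31875, i.e. about 0.319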

It is pretty clear from these numbers that the L2 cache will be more useful.


Bus occupancy - Will a write-back cause a read/write miss to be delayed?

Many of you stated that the occupancy of the bus due to read and write misses (ignoring write-backs) was 54%, but this would only be true if the CPU did not stall on read and write misses. Those of you who stated this occupancy did not mention instructions that may depend on the read misses: if the CPU does not stall on read/write misses, you must account for stalls due to dependent instructions. Since this information was not given, the better assumption is that the CPU stalls on a read/write miss; otherwise you would also have to assume how soon after a read its data is used, what percent of the time that occurs, and so on.
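
For reference, the 54% figure presumably came from counting one 36-cycle bus transaction per miss with no CPU stalls, at a base CPI of 1:

    # Bus occupancy from read and write misses alone, if the CPU
    # never stalled (base CPI of 1, 36 bus cycles per miss):
    misses_per_instr = 0.05 * 0.30           # read + write misses, 0.015/instr.
    print(misses_per_instr * 36)             # 0.54 bus cycles per cycle -> 54%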

Is the assumption that write-backs of replaced dirty blocks do not prevent misses from being processed immediately a correct one?

A data memory reference occurs once every 10/3 instructions, or once every 3.33 instructions.
Of these references, only 5% are misses requiring bus traffic. So one in (100/5)*3.33 = 66.7 instructions is a miss.
3/10 of these misses will evict a dirty block, but we will ignore this detail for now.
Assume a given miss kicks out a dirty block from the cache, storing it in the write buffer. When the miss occurs, we stall on the read or write instruction until the data is brought in from memory to the cache. When we are done, our CPU continues executing.
The CPU now has, on average, 66.7 cycles (at a base CPI of 1) to handle the write-back before the next miss arrives. Since the write-back will take 36 cycles (under my pessimistic assumption), there is plenty of time.
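
A quick check of this timing argument in Python, assuming a base CPI of 1 so that instructions and cycles line up:

    # Average spacing between misses vs. time to drain one write-back:
    cycles_between_misses = 1 / (0.05 * 0.30)         # ~66.7
    write_back_cycles = 36                            # pessimistic full transaction
    print(write_back_cycles < cycles_between_misses)  # True: buffer drains in time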

Do you think that the "uniform distribution assumption" is a good one in reality?

The uniform distribution assumption is only a good one if you are looking for a best-case estimate of performance. To get a better feel for the true performance, you would need to account for the fact that write and read misses can fall close together in many situations. You would need to look at how often such situations occur and figure out an average penalty based on those numbers. Generally, you would use a powerful analysis tool to compute statistics like these, since the number of variables grows very large.


Infinite write buffer for a write-through write-around cache.

Which of the two policies would give a lower MCPI?

The read and write hits have the same penalty in both policies now. (A write hit takes no time for write-through now because the write buffer is there.) The read misses for both also have the same penalty. On the other hand, a write miss in write-allocate must go to the next level of memory to obtain its data, whereas a write miss in write-around simply writes to the write buffer, never loading from the next level of memory on a write. Because of this, the write-around method will have the lower MCPI, since it has no write-miss penalty.
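
To put a number on the gap, here is a sketch under my assumption (not stated in the problem) that a write-allocate miss pays the same 36-cycle block fill as a read miss:

    # Extra MCPI that write-allocate pays on write misses, assuming
    # (my assumption) a 36-cycle block fill; write-around pays zero.
    write_misses = 0.05 * 0.25 * 0.30    # 0.00375 per instruction
    print(write_misses * 36)             # 0.135 extra MCPI for write-allocate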

Does bus occupancy become a bottleneck when the write buffer is eliminated?

Now every write is going to take up the bus. I will argue that it will take 36 cycles to perform the write since no other reads or writes will be able to take place until 36 cycles after a write is begun. Yes, it may only take 4+8=12 cycles to send all the data, but I claim that the memory will not be able to handle any requests for another 24 cycles, when it has finally finished writing the data.

I will also assume that a write will never stall the CPU. On a write hit, it writes the cache and sends the data to the bus. On a write miss, it just sends the data to the bus. I assume this can be handled by a controller of some sort, especially since the CPU doesn't need to wait for any data back.

So now, since writes occur on 25% of the 30% of instructions that access data, one in every 13.3 instructions is a write. Each write will occupy the bus for 36 cycles (in that no other reads or writes can occur during that time). So even if there were never a read miss, each write would have to wait on the previous one.
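
A quick Python sketch of the saturation argument, assuming a base CPI of 1:

    # Bus demand from writes alone, with no write buffer:
    writes_per_instr = 0.25 * 0.30                   # one write per ~13.3 instr.
    bus_cycles_per_write = 36
    print(writes_per_instr * bus_cycles_per_write)   # 2.7 bus cycles demanded per
                                                     # executed cycle -> saturated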

Do you think your analysis is a step towards justifying today's design of cache hierarchies that have write-through lock-up free L1 caches and write-back L2 caches?

Our analysis does hint that the setup described would be good. Since the L1 is lockup-free and write-through, reads and writes will not stall the CPU unless a dependent instruction needs the result of a load. These stalls will be rare and quick, since our L2 cache will likely have the necessary data. By making the L2, which sits on the bus, write-back, we avoid saturating the bus with writes. Since writes from L1 to L2 will likely take very little time, it is unlikely that writes from L1's write buffer will interfere with read misses in the L1 cache.