Cache Fault Propagation Through Program Execution

Caches are fast memories that hold frequently used data. Large cache memories are used in both small and large systems, single-processor and multiprocessor alike. Introducing caches, however, increases the probability of fault occurrence and of latent errors. A fault in a processor or an error in a cache memory may cause the processor's state to diverge: the fault may corrupt the cache memory or lead to an erroneous internal CPU state. We are investigating how errors propagate in a processor/cache memory system once a transient fault affects cache memory and/or processor registers. Error propagation occurs as programs execute and can lead to system failure. The intent of this study is to understand this behavior and then devise recovery mechanisms for such failures.

Our initial results show that a single word/register error has about a 50% probability of not causing any system failure. However, the other half of such errors may be critical and need attention. We obtain our results through simulation and analytical methods.
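
As a rough illustration of what a fault-injection simulation of this kind can look like, the sketch below flips a single bit in one register of a toy four-register machine and compares the final state with a fault-free run. It is not the simulator used in this study: the instruction format, the 16-bit register width, and the names run and masked_fraction are all assumptions made for the example. A fault counts as masked when the final register state matches the fault-free run, for instance because the corrupted register is overwritten before it is read.

```python
import random

MASK = 0xFFFF  # assumed 16-bit registers (illustrative choice)

def run(program, regs, fault=None):
    """Execute a toy program and return the final register state.

    program: list of (dst, op, src_a, src_b) tuples.
    fault:   optional (step, reg, bit); that bit of that register is
             flipped just before the given step executes.
    """
    regs = list(regs)
    for step, (dst, op, a, b) in enumerate(program):
        if fault is not None and fault[0] == step:
            regs[fault[1]] ^= 1 << fault[2]  # inject a single-bit transient fault
        if op == "add":
            regs[dst] = (regs[a] + regs[b]) & MASK
        elif op == "and":
            regs[dst] = regs[a] & regs[b]
        elif op == "mov":
            regs[dst] = regs[a]
    return regs

def masked_fraction(program, regs, trials=1000, seed=0):
    """Estimate the fraction of random single-bit faults that leave the
    final state identical to the fault-free (golden) run."""
    rng = random.Random(seed)
    golden = run(program, regs)
    masked = 0
    for _ in range(trials):
        fault = (rng.randrange(len(program)),
                 rng.randrange(len(regs)),
                 rng.randrange(16))
        if run(program, regs, fault) == golden:
            masked += 1
    return masked / trials

# Hypothetical two-instruction program: r0 = r1; r2 = r0 + r1
program = [(0, "mov", 1, 1), (2, "add", 0, 1)]
initial = [5, 7, 0, 0]
print(masked_fraction(program, initial))
```

In this toy program, a fault in r0 before the first instruction is overwritten by the mov and disappears, whereas a fault in r1 at the same point propagates into both r0 and r2. The fraction printed depends entirely on the toy program and says nothing about the results reported above.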

Principal Investigator: Somani
