|
Commercial microprocessor systems are being operated at higher
and higher clock rates. Faster clocks reduce the time available
to fetch data from cache memory and increase the probability of
transient errors in cache memory systems. Besides data
read errors, an error may also occur in the processor subsystem,
which may then write erroneous data into the cache memory. Error-correcting
codes allow detection of, and recovery from, some of these errors, within
the limits of the code word. Fast error detection allows damage containment
and reduces recovery time, which could otherwise be very costly.
The goals of the proposed research are: (1) to study the extent of error
propagation due to transient faults in a computer system when a fault originates
either in a processor register or a cache location; and (2) to develop
techniques and hardware support needed for early detection of, and recovery
from, such errors in computation tasks with low overhead and low
performance loss. Our techniques will cover a broad spectrum, from
detection only to full error recovery.
Our research identifies architectural features which, when provided
in commercial microprocessors, will make them suitable for use in
fault-tolerant applications. The intent is to keep the performance
impact on normal operation to a minimum.
|