Fault Error Detection and Recovery in High-Performance Fault-Tolerant Processor Systems Employing Caches

Sponsor: NSF

We proposed to develop and evaluate techniques for efficient error detection, recovery, and synchronization maintenance in high-performance redundant processor systems that employ cache memories and are intended for real-time applications. The effective design of such systems relies heavily on the ability to recover from transient faults, which have been estimated to occur at 5 to 100 times the rate of permanent faults. Transient-fault recovery schemes with near-perfect fault detection capability can greatly improve system reliability with very little loss of performance. To achieve this goal, we proposed mechanisms that keep cache data consistent across redundant copies, prevent processors from operating on erroneous cache data, and limit error propagation within the cache memory.

The proposed scheme addresses the inadequacy of error-correcting codes when faced with processor transient faults. Unlike other schemes, ours assumes that both the processor and the cache memory are subject to transient faults. Because errors are detected and recovered quickly, damage is contained and re-synchronization overhead is reduced; that overhead usually includes excessive delays associated with main-memory recovery. We proposed to study the following issues:

  • Development of cache protocols to improve fault detection and recovery capability.
  • Development of schemes to maintain synchronization among redundant channels in high-performance fault-tolerant systems for use in real-time applications.
  • Performance evaluation of these protocols using representative real-time applications.
  • Estimation of the fault coverage of the proposed schemes.
  • Integration and evaluation of the proposed cache schemes with error correcting codes.
  • Applications of the proposed schemes in parallel computer systems with cache coherency protocols.
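The page does not specify the protocols themselves, so the following is only a minimal sketch of the general idea of detecting a transient fault by comparing redundant channels and recovering at the cache level rather than from main memory. It models a generic duplex compare-and-rollback scheme, not the project's actual design; the names `Channel`, `step`, and `run_with_retry`, and the single-bit-flip fault model, are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Channel:
    """One redundant processor channel with a private cache (hypothetical model)."""
    cache: dict = field(default_factory=dict)       # address -> value
    checkpoint: dict = field(default_factory=dict)  # last state both channels agreed on

    def write(self, addr, value):
        self.cache[addr] = value

    def commit(self):
        # Record the current cache contents as the new recovery point.
        self.checkpoint = dict(self.cache)

    def rollback(self):
        # Discard unverified data and return to the last agreed state.
        self.cache = dict(self.checkpoint)


def step(a, b, addr, compute, fault_in_a=False):
    """Run one computation on both channels, compare, then commit or roll back.

    Returns True if the channels agreed, False if a mismatch was detected.
    """
    va = compute()
    vb = compute()
    if fault_in_a:
        va ^= 1  # assumed fault model: a transient single-bit flip in channel A
    a.write(addr, va)
    b.write(addr, vb)
    if a.cache[addr] == b.cache[addr]:
        a.commit()
        b.commit()
        return True
    # Mismatch: contain the error inside the caches and restore both channels.
    a.rollback()
    b.rollback()
    return False


def run_with_retry(a, b, addr, compute, faults):
    """Retry a step until the channels agree; `faults` marks which attempts are faulty."""
    attempts = 0
    for faulty in faults:
        attempts += 1
        if step(a, b, addr, compute, fault_in_a=faulty):
            return attempts
    raise RuntimeError("channels never agreed")
```

In this sketch a mismatch never reaches the checkpointed state, so recovery is a local re-execution instead of a main-memory restore. A two-channel comparison can detect a disagreement but cannot tell which channel is faulty, which is why both channels roll back.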

Our research identifies architectural features that, when provided in commercial microprocessors, will make them suitable for use in fault-tolerant applications. Performance will not be affected when the microprocessor is used in non-fault-tolerant applications.

