|
We proposed to develop and evaluate techniques for efficient error
detection, recovery, and synchronization maintenance in high-performance
redundant processor systems that employ cache memories for use in
real-time applications. The effective design of such systems relies
heavily on the ability to recover from transient faults, which have
been estimated to occur 5 to 100 times as often as permanent
ones. Transient restoration schemes with near-perfect
fault detection capability can greatly improve system reliability
with very little loss of performance. To achieve this goal, we proposed
mechanisms that maintain cache data consistency across redundant copies,
prevent processors from working on erroneous cache data, and constrain
possible error propagation in the cache memory.
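As a purely illustrative sketch of the general idea (not the proposed protocol, whose details are not given in this summary), the toy model below assumes a duplex pair of channels, each with a private cache copy. A value is committed to both caches only if the two channels agree; a mismatch is treated as a transient fault, the results are discarded, and the operation is retried. The names `DuplexCache`, `make_flaky`, and `max_retries` are hypothetical and exist only for this example.

```python
from typing import Callable, Dict, List, Set


class DuplexCache:
    """Toy model (an assumption, not the proposed design): two redundant
    channels, each holding its own copy of the cache contents."""

    def __init__(self, max_retries: int = 3) -> None:
        self.caches: List[Dict[int, int]] = [{}, {}]
        self.max_retries = max_retries
        self.detected = 0  # mismatches caught before anything was committed

    def write(self, addr: int, compute: Callable[[int], int]) -> int:
        """Run compute(channel) on both channels; commit only on agreement."""
        for _ in range(self.max_retries + 1):
            a, b = compute(0), compute(1)
            if a == b:
                # Agreement: update both copies so they stay consistent.
                for cache in self.caches:
                    cache[addr] = a
                return a
            # Mismatch: nothing is written, so the error cannot propagate
            # into either cache. Retry, assuming the fault is transient.
            self.detected += 1
        raise RuntimeError("persistent mismatch; fault is not transient")

    def consistent(self) -> bool:
        """True when the redundant cache copies hold identical data."""
        return self.caches[0] == self.caches[1]


def make_flaky(value: int, faulty_calls: Set[int]) -> Callable[[int], int]:
    """Hypothetical fault injector: the k-th call (0-based) returns `value`
    with bit 0 flipped when k is in `faulty_calls`."""
    counter = {"n": 0}

    def compute(channel: int) -> int:
        k = counter["n"]
        counter["n"] += 1
        return value ^ 1 if k in faulty_calls else value

    return compute


dc = DuplexCache()
# Call 1 (channel 1 on the first attempt) is corrupted; the retry succeeds.
result = dc.write(0x10, make_flaky(42, {1}))
print(result, dc.detected, dc.consistent())  # prints: 42 1 True
```

In this sketch the comparison happens before the commit, so a corrupted result never reaches either cache copy and recovery costs only a retry. A real design would also have to cover faults in the cache arrays themselves and the timing of the comparison, which this model ignores.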
The proposed scheme remedies the insufficiency of error correcting
codes when faced with processor transient faults. Unlike other schemes,
we allow both the processor and the cache memory to be subject to
transient faults. Fast error detection and recovery of this kind
allows damage containment and reduces the re-synchronization overhead,
which usually incurs excessive delays associated with main memory
recovery. We proposed to study the following issues:
- Development of cache protocols to improve fault detection and recovery capability.
- Development of schemes to maintain synchronization among redundant channels in high-performance fault-tolerant systems for use in real-time applications.
- Performance evaluation of these protocols using representative real-time applications.
- Estimation of the fault coverage of the proposed schemes.
- Integration and evaluation of the proposed cache schemes with error correcting codes.
- Applications of the proposed schemes in parallel computer systems with cache coherency protocols.
Our research identifies architectural features which, when provided in
commercial microprocessors, will make them suitable for use in
fault-tolerant applications. Performance will not be affected when the
microprocessor is used in non-fault-tolerant applications.
|