Introduction
Dependability is becoming an increasingly important quality measure of microprocessors.
As high computing power becomes available at an affordable cost, we rely
on microprocessor-based systems for a much greater variety of
applications. This dependence means that a processor failure
could have diverse impacts on our daily lives. Temporary hardware
malfunctions caused by unstable environmental conditions can induce
transient or soft errors in the processor's operation. As advances
in VLSI technology reduce circuit dimensions dramatically, processor
chips become more vulnerable to soft errors. Studies have shown
that soft errors are the major source of system failures. Providing
system- or thread-level fault protection mechanisms in the processor
results in large execution overhead.
Basic Ideas
The goals of the proposed research are: (1) to characterize soft error behavior
on commercial microprocessors through fault injection experiments;
(2) to provide a guideline for exploiting soft error susceptibility
in integrity checking strategies and for predicting the error characteristics
from the processor's architecture; and (3) to develop comprehensive
micro-architectural solutions without full duplication that go beyond
localized solutions for specific aspects of the pipeline, cache memory,
or register file, realizing high dependability with low hardware
and performance overhead. Our proposed research differs from
earlier work in that we assess the fault sensitivity of various
on-chip logic blocks and address the issues according to the relative
importance of each logic block. We will investigate individual components
of the processor with circuit-level approaches. The localized fault
protection mechanisms for individual logic blocks are backed up
by one or more global protection mechanisms. Subsequently, we study
the area overhead vs. fault coverage trade-off to show the effectiveness
of our proposed solutions. For design and verification simplicity,
the chip- and system-level techniques will be developed and analyzed.
Our research will provide a basis for the dependability enhancement
of cost-sensitive products.
Representative Papers
Growing wire delays and clock rates are an important issue in the
design of large on-chip caches. We propose a novel fault-tolerant substrate
superimposed on the D-NUCA cache architecture. We combine two architectural
concepts, D-NUCA and shadow caching, to offer protection against
both data corruption and micro-network disruption. We tolerate
network disruption by replicating each information packet and sending
the shadow packet along a different route than the original packet.
The data corruption problem is addressed by reserving a small portion
of the overall cache capacity for in-cache shadow space. Our results
show that an average data error coverage of 96% can be achieved for
Spec2K benchmarks, and that more than 99% of the transient faults on the
underlying switched micro-network can also be tolerated, while incurring
less than 3% performance degradation in most of the above benchmarks.
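The shadow-packet idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the routing scheme of the paper: the micro-network is modeled as a 2D mesh, the original packet is routed x-first, the shadow packet y-first, and a transient fault simply makes one link unusable. The function names are hypothetical.

```python
def xy_route(src, dst):
    """Hop sequence that travels along x first, then along y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def yx_route(src, dst):
    """Hop sequence that travels along y first, then along x."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    return path


def survives(path, faulty_links):
    """A packet survives if none of the links it crosses is faulty."""
    return not any(frozenset(link) in faulty_links
                   for link in zip(path, path[1:]))


def deliver(src, dst, faulty_links):
    """Send the original (x-first) and the shadow (y-first) packet.

    Delivery succeeds if at least one copy reaches the destination.
    """
    return (survives(xy_route(src, dst), faulty_links)
            or survives(yx_route(src, dst), faulty_links))


# One faulty link on the x-first route; the shadow packet avoids it.
faulty = {frozenset({(0, 0), (1, 0)})}
delivered = deliver((0, 0), (2, 2), faulty)
```

In this toy model the two routes coincide when source and destination share a row or column, so a fault on that line defeats both copies; a real design would need route diversity in that case as well.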
Traditionally, the random logic of most microprocessors is not checked for soft
errors because of the high overhead, while the regularly structured memory
arrays are often protected with error correcting codes. This paper
presents a low-cost reliability enhancement scheme for the processor’s
control logic. We classify control logic signals into static and
dynamic control depending on their changeability for a given instruction,
and employ different mechanisms for each. For static control, signals
used in each pipeline stage are integrated into a signature and
verified with a cached check code at commit time. The concept of
caching signatures is introduced. Dynamic control is checked at
the point where the signals are generated, using component-level
duplication. Fault injection simulations on the RTL model of a MIPS-like
processor demonstrate that our scheme can achieve more than 99%
coverage on average with a very small addition of hardware. We have
also investigated the criticality of errors in the processor logic,
which provides direction for devising an efficient allocation of redundancy.
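The static-control check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the signature function (rotate and XOR over 16 bits), the use of the opcode as the cache key, and the policy of trusting the first occurrence of an instruction are all hypothetical choices, not details taken from the paper.

```python
def fold(signature, stage_signals, width=16):
    """Fold one stage's control signals into a running signature.

    Assumed compaction: rotate the signature left by one bit, then XOR
    in the stage's signals.
    """
    mask = (1 << width) - 1
    rotated = ((signature << 1) | (signature >> (width - 1))) & mask
    return rotated ^ (stage_signals & mask)


class SignatureCache:
    """Cache of check codes for static control signals, keyed by opcode."""

    def __init__(self):
        self.codes = {}

    def check_at_commit(self, opcode, per_stage_signals):
        """Return True if the collected signature matches the cached code."""
        signature = 0
        for signals in per_stage_signals:
            signature = fold(signature, signals)
        expected = self.codes.get(opcode)
        if expected is None:
            # Assumption: on a miss, the first signature is trusted and cached.
            self.codes[opcode] = signature
            return True
        return signature == expected


cache = SignatureCache()
good = [0b1010, 0b0110, 0b0001]      # control signals seen in three stages
faulty = [0b1010, 0b0111, 0b0001]    # one bit flipped in the second stage
first = cache.check_at_commit("add", good)     # miss, signature is cached
second = cache.check_at_commit("add", good)    # hit, signatures match
third = cache.check_at_commit("add", faulty)   # hit, mismatch is flagged
```

Because static control signals do not change for a given instruction, a mismatch at commit time indicates an error somewhere along the pipeline rather than legitimate variation.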
This paper proposes an integrity checking architecture for superscalar processors that
can achieve the fault tolerance capability of a duplex system at much
lower cost than the traditional duplication approach. The pipeline
of the CPU core (P-pipeline) is combined in series with another
pipeline (V-pipeline), which re-executes instructions processed
in the P-pipeline. Operations in the two pipelines are compared,
and any mismatch triggers a recovery process. The V-pipeline design
is based on replication of the P-pipeline, and minimized in size
and functionality by taking advantage of control flow and data dependency
resolved in the P-pipeline. Idle cycles propagated from the P-pipeline
become extra time for the V-pipeline to keep up with program re-execution.
For a large-scale superscalar processor, the proposed architecture
can bring up to a 61.4% reduction in die area, with an average execution
time increase of 0.3%.
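A toy model of the P-pipeline/V-pipeline arrangement is sketched below. It is a behavioral illustration only: the three-operation instruction set, the record passed between the pipelines, the fault model (XOR of a mask into the result), and the recovery policy (re-execute the instruction) are all assumptions introduced here, not the design from the paper.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def p_execute(instr, regs, fault_mask=0):
    """P-pipeline: read operands, execute, optionally corrupt the result."""
    op, dst, src_a, src_b = instr
    operands = (regs[src_a], regs[src_b])
    result = OPS[op](*operands) ^ fault_mask  # fault_mask models a soft error
    return {"op": op, "dst": dst, "operands": operands, "result": result}


def v_check(record):
    """V-pipeline: re-execute with the operands already resolved by P.

    No dependency resolution is repeated here, which is why the checker
    can be smaller than a full copy of the P-pipeline.
    """
    return OPS[record["op"]](*record["operands"]) == record["result"]


def run(program, regs, faults=None):
    """Commit each result only after the V-pipeline has confirmed it."""
    faults = faults or {}
    recovered = []
    for pc, instr in enumerate(program):
        record = p_execute(instr, regs, faults.get(pc, 0))
        if not v_check(record):
            # Assumed recovery policy: re-execute the instruction.
            recovered.append(pc)
            record = p_execute(instr, regs)
        regs[record["dst"]] = record["result"]
    return regs, recovered


program = [("add", "r2", "r0", "r1"), ("mul", "r3", "r2", "r2")]
final_regs, recovered = run(program, {"r0": 2, "r1": 3}, faults={0: 1})
```

Since the faulty result of the first instruction is never committed, the second instruction reads the corrected value of `r2`.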
Information integrity
in cache memories is a fundamental requirement for dependable computing.
Conventional architectures for enhancing cache reliability using
check codes make it difficult to trade between the level of data
integrity and the chip area requirement. We focus on transient fault
tolerance in primary cache memories and develop new architectural
solutions to maximize fault coverage when the budgeted silicon area
is not sufficient for the conventional configuration of an error
checking code. The underlying idea is to exploit the corollary of
reference locality in the organization and management of the code.
A higher protection priority is dynamically assigned to the portions
of the cache that are more error-prone and have a higher probability
of access. The likelihood that a portion is error-prone is predicted from its
access frequency. We evaluate the effectiveness of the proposed
schemes using a trace-driven simulation combined with software error
injection using four different fault manifestation models. From
the simulation results, we show that for most benchmarks the proposed
architectures are effective and area efficient for increasing the
cache integrity under all four models.
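The idea of steering a limited check-code budget toward frequently accessed lines can be sketched as follows. This is a simplified illustration, not the architecture evaluated in the paper: the check code is a single parity bit, the budget is a fixed number of protected slots, and the class and method names are hypothetical.

```python
from collections import Counter


def parity(word):
    """Even-parity bit of an integer word."""
    return bin(word).count("1") & 1


class FrequencyProtectedCache:
    """Toy cache with check-code storage for only a few lines.

    Assumption: the most frequently accessed lines receive the
    available check-code slots.
    """

    def __init__(self, protected_slots):
        self.protected_slots = protected_slots
        self.data = {}
        self.hits = Counter()
        self.check = {}  # line -> parity bit, at most protected_slots entries

    def _touch(self, line):
        self.hits[line] += 1
        hottest = {l for l, _ in self.hits.most_common(self.protected_slots)}
        for l in list(self.check):
            if l not in hottest:
                del self.check[l]  # demoted line loses its check code
        for l in hottest:
            if l not in self.check:
                self.check[l] = parity(self.data[l])  # newly protected line

    def write(self, line, value):
        self.data[line] = value
        self.check.pop(line, None)  # old code is stale after a write
        self._touch(line)

    def read(self, line):
        """Return (value, ok); ok is False if a protected line fails its check."""
        value = self.data[line]
        self._touch(line)
        ok = line not in self.check or self.check[line] == parity(value)
        return value, ok


cache = FrequencyProtectedCache(protected_slots=1)
cache.write("A", 0b1011)
cache.write("B", 0b0001)
for _ in range(3):
    cache.read("A")              # "A" becomes the hot, protected line
cache.data["A"] ^= 0b0100        # inject a single-bit error into "A"
cache.data["B"] ^= 0b0001        # inject a single-bit error into "B"
_, a_ok = cache.read("A")        # detected: "A" holds a check code
_, b_ok = cache.read("B")        # missed: "B" is unprotected
```

The example makes the trade-off visible: the error in the hot line is caught, while the error in the rarely used line goes unnoticed because no check-code slot was spent on it.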
Expected Impact
The intellectual
merit of this proposal lies in developing specific tailored dependability
solutions that are independent of technology and based on fault
occurrence and error propagation characteristics of a given processor
architecture. The techniques, however, are easily adapted to future
generations of architectures.
Since the use of microprocessors and microcontrollers is widespread and affects
every aspect of our lives, any improvement in the dependability
of such systems would have a very broad impact. That
is the case for the proposed research.