Integrity Monitoring and Recovery Techniques for Error-Prone Submicron Microprocessors

Dependable Computing and Networking Laboratory, Dept. of Electrical & Computer Engineering,

Iowa State University, Ames, IA 50010

PI: Arun Somani

Students: Amy Hammond, Adeel Israr (MSEE '02), Seongwoo Kim (Ph.D. '02), Joe B. Nickel (MSEE '01), Heng Xu

Sponsor: NSF

 

Introduction

Dependability is becoming an increasingly important quality measure of microprocessors. As high computing power becomes available at an affordable cost, we rely on microprocessor-based systems for a much greater variety of applications. This dependence means that a processor failure could have diverse impacts on our daily lives. Temporary hardware malfunctions caused by unstable environmental conditions can induce transient, or soft, errors in the processor's operation. As advances in VLSI technology reduce circuit dimensions dramatically, processor chips become more vulnerable to soft errors. Studies have shown that soft errors are the major source of system failures. Providing system- or thread-level fault protection mechanisms in the processor results in large execution overhead.

Basic Ideas

The goals of the proposed research are: (1) to characterize soft error behavior on commercial microprocessors through fault injection experiments; (2) to provide guidelines for exploiting soft error susceptibility in the integrity checking strategy and for predicting the error characteristics from the processor's architecture; and (3) to develop comprehensive micro-architectural solutions without full duplication that go beyond localized solutions for specific aspects of the pipeline, cache memory, or register file, in order to realize high dependability with low hardware and performance overhead. Our proposed research differs from earlier work in that we assess the fault sensitivity of various on-chip logic blocks and address the issues according to the relative importance of each logic block. We will investigate individual components of the processor with circuit-level approaches. The localized fault protection mechanisms for individual logic blocks are backed up by one or more global protection mechanisms. Subsequently, we study the trade-off between area overhead and fault coverage to show the effectiveness of our proposed solutions. For design and verification simplicity, the chip- and system-level techniques will be developed and analyzed. Our research will provide a basis for the dependability enhancement of cost-sensitive products.

Representative Papers

An important issue in cache design is the growing wire delays and clock rates in large on-chip caches. We propose a novel fault-tolerant substrate superimposed on the D-NUCA cache architecture. We combine two architectural concepts, namely D-NUCA and shadow caching, to offer protection against both data corruption and micro-network disruption. We tolerate network disruption by replicating each information packet and sending the shadow packet along a different route than the original packet. The data corruption problem is addressed by reserving a small portion of the overall cache capacity for in-cache shadow space. Our results show that an average data error coverage of 96% can be achieved for Spec2K benchmarks, and that more than 99% of the transient faults on the underlying switched micro-network can also be tolerated, while incurring less than 3% performance degradation in most of these benchmarks.
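The shadow-packet idea can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the micro-network is modeled as a 2D grid of switches, the original packet uses X-then-Y routing, the shadow packet uses Y-then-X routing, and a fault simply disables a switch. The function names `xy_route`, `yx_route`, and `delivered` are hypothetical.

```python
def xy_route(src, dst):
    """Switches visited when moving along X first, then along Y (assumed routing)."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def yx_route(src, dst):
    """Switches visited when moving along Y first, then along X (assumed shadow route)."""
    swapped = xy_route(src[::-1], dst[::-1])
    return [p[::-1] for p in swapped]


def delivered(src, dst, faulty):
    """True if the original or the shadow packet reaches dst without crossing a faulty switch."""
    routes = (xy_route(src, dst), yx_route(src, dst))
    return any(not (set(route) & faulty) for route in routes)


# One disrupted switch on the original route: the shadow packet still arrives.
print(delivered((0, 0), (2, 2), faulty={(1, 0)}))
# Both routes disrupted: the request is lost.
print(delivered((0, 0), (2, 2), faulty={(1, 0), (0, 1)}))
```

In this toy model the two routes share only the source and destination switches, so a single transient fault on an intermediate switch cannot block both packets.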

Traditionally, the random logic of most microprocessors is not checked for soft errors because of the large overhead, while the regularly structured memory arrays are often protected with error correcting codes. This paper presents a low-cost reliability enhancement scheme for the processor’s control logic. We classify control logic signals into static and dynamic control depending on their changeability for a given instruction, and employ different mechanisms for each. For static control, signals used in each pipeline stage are integrated into a signature and verified against a cached check code at commit time. The concept of caching signatures is introduced. Dynamic control is examined at the point where the signals are created, using component-level duplication. Fault injection simulations on the RTL model of a MIPS-like processor demonstrate that our scheme can achieve more than 99% coverage on average with a very small addition of hardware. We have also investigated the criticality of errors in the processor logic, which provides direction for devising an efficient allocation of redundancy.
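A minimal software sketch of the static-control check might look as follows. It assumes details the abstract does not give: the signature is an XOR fold of the per-stage control words, the signature cache is indexed by opcode, and the first signature seen for an opcode is trusted as the check code. The names `signature` and `SignatureCache` are hypothetical.

```python
from functools import reduce


def signature(stage_controls):
    """Fold the static control words of all pipeline stages into one value (XOR, assumed)."""
    return reduce(lambda acc, word: acc ^ word, stage_controls, 0)


class SignatureCache:
    """Holds one check code per opcode and verifies signatures at commit time."""

    def __init__(self):
        self._codes = {}

    def check_at_commit(self, opcode, stage_controls):
        sig = signature(stage_controls)
        expected = self._codes.get(opcode)
        if expected is None:
            # Miss: assume this first execution is fault-free and cache its signature.
            self._codes[opcode] = sig
            return True
        return sig == expected


cache = SignatureCache()
good = [0b1010, 0b0110, 0b0001]
cache.check_at_commit(0x23, good)                            # fills the cache
print(cache.check_at_commit(0x23, good))                     # signature matches
print(cache.check_at_commit(0x23, [0b1010, 0b0111, 0b0001])) # one flipped bit is caught
```

An XOR fold detects any single flipped bit but misses an even number of flips in the same bit position, so a real design would likely use a stronger compaction function.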

This paper proposes an integrity checking architecture for superscalar processors that can achieve the fault tolerance capability of a duplex system at much lower cost than the traditional duplication approach. The pipeline of the CPU core (P-pipeline) is combined in series with another pipeline (V-pipeline), which re-executes instructions processed in the P-pipeline. Operations in the two pipelines are compared, and any mismatch triggers a recovery process. The V-pipeline design is based on replication of the P-pipeline, and is minimized in size and functionality by taking advantage of the control flow and data dependencies already resolved in the P-pipeline. Idle cycles propagated from the P-pipeline become extra time for the V-pipeline to keep up with program re-execution. For a large-scale superscalar processor, the proposed architecture can bring up to a 61.4% reduction in die area, and the average execution time increase is 0.3%.
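The re-execute-and-compare step can be sketched in a few lines. This is an illustrative model only, under stated assumptions: the P-pipeline is assumed to forward each instruction as a record of operation, resolved operand values, and its own result, and the V-pipeline is modeled as a simple in-order loop. The operation table `OPS` and the function `verify` are hypothetical.

```python
import operator

# A tiny instruction set, assumed for illustration.
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "and": operator.and_,
    "xor": operator.xor,
}


def execute(op, a, b):
    """Recompute one instruction from operands already resolved by the P-pipeline."""
    return OPS[op](a, b)


def verify(forwarded):
    """Re-execute forwarded (op, a, b, p_result) records in order.

    Returns the index of the first mismatch, where recovery would start,
    or None if every result agrees.
    """
    for index, (op, a, b, p_result) in enumerate(forwarded):
        if execute(op, a, b) != p_result:
            return index
    return None


clean = [("add", 2, 3, 5), ("xor", 6, 3, 5)]
faulty = [("add", 2, 3, 5), ("sub", 9, 4, 4)]  # P-pipeline produced 4 instead of 5
print(verify(clean))
print(verify(faulty))
```

Because operands arrive already resolved, the checking loop needs no branch prediction or dependency tracking of its own, which is the intuition behind shrinking the V-pipeline relative to a full duplicate.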

Information integrity in cache memories is a fundamental requirement for dependable computing. Conventional architectures for enhancing cache reliability using check codes make it difficult to trade between the level of data integrity and the chip area requirement. We focus on transient fault tolerance in primary cache memories and develop new architectural solutions to maximize fault coverage when the budgeted silicon area is not sufficient for the conventional configuration of an error checking code. The underlying idea is to exploit the corollary of reference locality in the organization and management of the code. A higher protection priority is dynamically assigned to the portions of the cache that are more error-prone and have a higher probability of access. The error-prone likelihood prediction is based on the access frequency. We evaluate the effectiveness of the proposed schemes using a trace-driven simulation combined with software error injection using four different fault manifestation models. From the simulation results, we show that for most benchmarks the proposed architectures are effective and area efficient for increasing the cache integrity under all four models.
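As a rough illustration of prioritizing protection by access frequency, the sketch below assigns a limited budget of check-code entries to the most frequently accessed cache lines in a trace and measures what fraction of accesses land on protected lines. It is an offline simplification, not the dynamic hardware scheme described above; the trace format and the names `protected_lines` and `access_coverage` are assumptions.

```python
from collections import Counter


def protected_lines(access_trace, budget):
    """Pick the `budget` most frequently accessed cache lines for check-code protection."""
    counts = Counter(access_trace)
    return {line for line, _ in counts.most_common(budget)}


def access_coverage(access_trace, protected):
    """Fraction of accesses in the trace that hit a protected line."""
    hits = sum(1 for line in access_trace if line in protected)
    return hits / len(access_trace)


trace = [0, 1, 0, 2, 0, 1, 3, 0]  # cache line indices, in access order
chosen = protected_lines(trace, budget=2)
print(sorted(chosen), access_coverage(trace, chosen))
```

With strong reference locality, a small budget covers a large share of accesses, which is why a code sized well below the conventional configuration can still protect most of the data actually in use.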

Expected Impact

The intellectual merit of this proposal lies in developing specifically tailored dependability solutions that are independent of technology and based on the fault occurrence and error propagation characteristics of a given processor architecture. The techniques, however, are easily adapted to future generations of architectures.

Since the use of microprocessors and microcontrollers is widespread and affects every aspect of our lives, any improvement in the dependability of such systems would have a very broad impact. That is the case for the proposed research.

 
