Self-Repair by Program Reconfiguration in VLIW Processor Architectures

Mario Schölzel (Brandenburg University of Technology Cottbus, Germany), Pawel Pawlowski (Poznan University of Technology, Poland) and Adam Dabrowski (Poznan University of Technology, Poland)
Copyright: © 2011 | Pages: 27
DOI: 10.4018/978-1-60960-212-3.ch011

Abstract

Statically scheduled superscalar processors (e.g., very long instruction word processors) are characterized by multiple parallel execution units and small control logic. This makes them easily scalable and therefore attractive for use in embedded systems as application-specific processors. The shrinking feature size in CMOS technology makes such processors in long-lived embedded systems more susceptible to several types of faults. Therefore, it should be possible to keep running an application even if one or more components in the data path of a statically scheduled processor become permanently faulty. It then becomes necessary to reconfigure either the hardware or the executed program such that operations are scheduled around the faulty units. The authors present recent investigations into rescheduling operations in the field, either on-line in hardware or off-line in software. Thus, the reconfiguration of the program is done either dynamically by the hardware or permanently by self-modifying code. If a permanent fault is present in the data path, then in both cases the execution of the application may be delayed. This graceful performance degradation may become critical for real-time applications. A framework to overcome this problem by using scalable algorithms is also provided.

Introduction

The continuously shrinking feature size of CMOS circuits makes them more susceptible to temporary (transient or intermittent) and permanent faults. Transient faults are caused by external events such as radiation hitting the circuit, while intermittent faults are caused by internal events in the circuit, such as a voltage drop in a certain system state. A temporary fault does not cause permanent damage to the silicon and can be handled by means of fault tolerance techniques (Lala, 2000; Koren & Krishna, 2007). Such techniques use some kind of redundancy in order to detect and recover from a temporary fault. This can be hardware redundancy (providing backup components), information redundancy (e.g., error correction codes or control-flow-based signatures), time redundancy (executing the same implementation, or different implementations, of the same function several times on the same piece of hardware), or a combination of these. Because temporary faults are not permanently present, they must be detected while the application is executing. Thus, some kind of on-line monitoring of the system is required in order to check the correctness of internal results before they are used outside the system. The amount of redundancy required for on-line monitoring depends on the required fault coverage and on the acceptable delay between the occurrence of a fault and its notification. If this delay must be short, a widely used technique is the concurrent execution of the same operation, which means an overhead of more than 100%, either in time or in the required computational resources. Special codes may be used for detecting faults in order to reduce this overhead. This is done, for example, in (Huang & Abraham, 1984) in order to detect faults that occur during matrix operations.
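As an illustration of how a code can replace full duplication for matrix operations, the following is a minimal sketch of the checksum idea associated with (Huang & Abraham, 1984): a row of column sums is appended to one factor and a column of row sums to the other, so that the product carries checksums that can be verified afterwards. The function names and the restriction to integer matrices (which avoids rounding issues) are assumptions made for this sketch, not details taken from the chapter.

```python
def column_checksum(a):
    """Append one row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append one column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)]
            for row in x]


def find_inconsistencies(c):
    """Return (bad_rows, bad_cols) of a full checksum matrix c.

    The last row and last column of c hold the checksums; the data
    part is everything else.
    """
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[k][j] for k in range(n)) != c[n][j]]
    return bad_rows, bad_cols


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(column_checksum(a), row_checksum(b))  # full checksum matrix
clean = find_inconsistencies(c)

c[0][1] += 1  # simulate a single corrupted element in the result
faulty = find_inconsistencies(c)
```

In this sketch a single corrupted data element shows up as exactly one inconsistent row and one inconsistent column, so their intersection points at the element. The extra work is one additional row and column rather than a second full multiplication.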
However, techniques that use codes always suffer from two problems: they are not generally applicable, and they have lower fault coverage than techniques that produce results concurrently and check them for equality. In any case, a recovery process is necessary after a fault has been detected in order to replace the wrong internal result with the correct one. This can be accomplished by error correcting codes, check-pointing with roll-back techniques, or voting. Locating the source of a transient or intermittent fault is not necessary, because such a fault disappears on its own without any kind of repair process.
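The difference between detection by comparing redundant results and recovery by voting can be sketched as follows. This is a simplified software illustration under stated assumptions, not the mechanism described in the chapter: the helper names `run_duplicated` and `majority_vote` are hypothetical, and the redundant executions are modelled as repeated calls of the same Python function.

```python
from collections import Counter


def run_duplicated(op, *args):
    """Execute op twice and compare the results (detection only)."""
    first = op(*args)
    second = op(*args)
    if first != second:
        raise RuntimeError("results differ: fault detected")
    return first


def majority_vote(results):
    """Return the value produced by a strict majority of executions."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority: cannot recover")
    return value


def add(x, y):
    return x + y


checked = run_duplicated(add, 3, 4)

# Three executions, one of which is assumed to be hit by a temporary fault.
recovered = majority_vote([add(3, 4), add(3, 4) + 2, add(3, 4)])
```

Duplication can only signal that a fault occurred, so some recovery step such as a roll-back to a checkpoint still has to follow. With three executions, voting masks a single wrong result directly, at the cost of a higher overhead.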
