Advanced Technologies for Transient Faults Detection and Compensation

Matteo Sonza Reorda (Politecnico di Torino, Italy), Luca Sterpone (Politecnico di Torino, Italy) and Massimo Violante (Politecnico di Torino, Italy)
Copyright: © 2011 | Pages: 23
DOI: 10.4018/978-1-60960-212-3.ch006


Transient faults have become a growing concern in the past few years, as the smaller geometries of newer, highly miniaturized silicon manufacturing technologies have brought to the mass market failure mechanisms traditionally confined to niche markets such as electronic equipment for avionic, space, or nuclear applications. This chapter presents the origin of transient faults, discusses their propagation mechanism, outlines the models devised to represent them, and finally discusses the state-of-the-art design techniques that can be used to detect and correct them. The concepts of hardware, data, and time redundancy are presented, and their implementations to cope with transient faults affecting storage elements, combinational logic, and IP cores (e.g., processor cores) typically found in a System-on-Chip are discussed.
Chapter Preview

1. Introduction

Advanced semiconductor technologies developed in the past few years are enabling giant leaps forward for the electronics industry. Portable devices are now available that provide several orders of magnitude more computing power than the top-of-the-line workstations of a few years ago.

Advanced semiconductor technologies achieve such improvements by shrinking the feature size, which is now at 22 nm and below, allowing the integration of millions of devices on a single chip. As a result, it is now possible to manufacture an entire system (encompassing processors, companion chips, memories, and input/output modules) on a single chip. Smaller transistors also switch faster, allowing operational frequencies in the GHz range. Finally, low operational voltages are possible, significantly reducing the energy needs of complex chips.

All these benefits, however, come with a downside: the higher sensitivity of newer devices to soft errors. The reduced amount of charge needed to store memory bits, the increased operational frequencies, and the reduced noise margins coming from lower operational voltages are making soft errors, i.e., unexpected random failures of the system, more likely during the system lifetime.

Among the different sources of soft errors, radiation-induced events are becoming more and more important, and interest in this topic is growing in both the academic and the industrial communities.

As described in (Dodd et al., 2004), when ionizing radiation (heavy ions or protons in space, neutrons and alpha particles in the earth's atmosphere) hits the sensitive volume of a semiconductor device (its reverse-biased depletion region), the injected charge is accelerated by an electric field, resulting in a parasitic current that can produce a number of effects, generally referred to as Single Event Effects (SEEs). Single Event Latchup (SEL) is the destructive event that takes place when the parasitic current triggers non-functional structures hidden in the semiconductor device (like parasitic transistors that short ground lines to power lines, which should never conduct when the device is operating correctly). Single Event Upset (SEU) is the non-destructive event that takes place when the parasitic current is able to trigger the modification of a storage cell, whose content flips from 0 to 1, or vice versa. If the injected charge reaches the sensitive volume of more than one memory device, multiple SEUs may happen simultaneously, causing the phenomenon known as Multiple Bit Upset (MBU). Finally, Single Event Transient (SET) is the non-destructive event that takes place when the parasitic current produces glitches on the values of nets in the circuit compatible with the noise margins of the technology, thus resulting in the temporary modification of the value of the nets from 0 to 1, or vice versa.
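At the logical level, the SEU and MBU effects described above are commonly modeled as bit flips in a stored word. The following is a minimal sketch of such a model, not a method taken from the chapter; the function names `inject_seu` and `inject_mbu` and the 8-bit example value are assumptions made purely for illustration.

```python
def inject_seu(word: int, bit: int) -> int:
    """Model a Single Event Upset: flip exactly one bit of a stored word."""
    return word ^ (1 << bit)


def inject_mbu(word: int, bits) -> int:
    """Model a Multiple Bit Upset: flip several distinct bits at once."""
    for bit in set(bits):  # each affected cell flips once
        word = inject_seu(word, bit)
    return word


stored = 0b10100110                      # hypothetical content of a memory word
after_seu = inject_seu(stored, 3)        # bit 3 flips from 0 to 1
after_mbu = inject_mbu(stored, [0, 1])   # bits 0 and 1 both flip

print(f"{stored:08b} -> SEU {after_seu:08b}, MBU {after_mbu:08b}")
```

An SET differs from this model in that the wrong value is temporary on a net rather than stored, so it becomes an error only if it is captured by a storage element.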

Among SEEs, SEL is the most worrisome, as it corresponds to the destruction of the device; it is therefore normally addressed by means of SEL-aware layout of silicon cells, or by current sensing and limiting circuits. SEUs, MBUs, and SETs can be tackled in different ways, depending on the market the application aims at. When vertical, high-budget applications are considered, such as electronic devices for telecom satellites, SEE-immune manufacturing technologies can be adopted, which are immune by construction to SEUs, MBUs, and SETs, but whose costs are prohibitive for any other market. When budget-constrained applications are considered, from electronic devices for space exploration missions to automotive and commodity applications, SEUs, MBUs, and SETs should be tackled by adopting fault detection and compensation techniques that allow developing dependable systems (i.e., systems where SEEs have a negligible impact on the application end user) on top of intrinsically non-dependable technologies (i.e., technologies that can be subject to SEUs, MBUs, and SETs), whose manufacturing costs are affordable.

Different types of fault detection and compensation techniques have been developed in the past years, which are based on the well-known concepts of resource, information or time redundancy (Pradhan, 1996).
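The three forms of redundancy can be illustrated in software with a small sketch. This is only an illustrative model under stated assumptions, not an implementation taken from the chapter: triple modular redundancy with a bitwise majority voter stands in for resource redundancy, a single parity bit for information redundancy, and repeated execution with comparison for time redundancy. All function names are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Resource redundancy: bitwise majority of three replicated values."""
    return (a & b) | (a & c) | (b & c)


def parity(word: int) -> int:
    """Information redundancy: one check bit (1 if the word has an odd number of ones)."""
    return bin(word).count("1") % 2


def run_twice(compute, x):
    """Time redundancy: execute twice and report whether the results agree."""
    first = compute(x)
    second = compute(x)
    return first, first == second


value = 0b10100110
corrupted = value ^ (1 << 3)          # one replica hit by a single bit flip

# The voter masks the fault because two of three replicas still agree.
voted = majority_vote(value, corrupted, value)

# The stored parity bit no longer matches the corrupted word: fault detected.
detected = parity(corrupted) != parity(value)

# Two executions of a fault-free computation produce matching results.
result, agree = run_twice(lambda n: n * n, 7)
```

Note the different guarantees: the voter corrects the error, whereas a single parity bit only detects an odd number of flipped bits, and re-execution only detects faults that do not affect both runs in the same way.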

In this chapter we first look at the source of soft errors, by presenting some background on radioactive environments, and then discussing how soft errors can be seen at the device level. We then present the most interesting mitigation techniques, organized according to the component they aim at: processor, memory module, and random logic. Finally, we draw some conclusions.
