Self-Adaptive SoCs for Dependability: Review and Prospects

Liang Guang (University of Turku, Finland), Juha Plosila (University of Turku, Finland) and Hannu Tenhunen (University of Turku, Finland & Royal Institute of Technology, Sweden)
DOI: 10.4018/978-1-4666-6034-2.ch001

Abstract

Dependability is a primary concern for emerging billion-transistor SoCs (Systems-on-Chip), especially as continued technology scaling introduces an increasing rate of faults and errors. Given time-dependent device degradation (e.g. caused by aging and by run-time voltage and temperature variations), using self-adaptive circuits and architectures to improve dependability is promising and very likely inevitable. This chapter extensively surveys existing work on monitoring, decision-making, and reconfiguration that addresses different dependability threats to Very Large Scale Integration (VLSI) chips. Centralized, distributed, and hierarchical fault management approaches, which utilize various redundancy schemes and exploit logical or physical reconfiguration methods, are all examined. As future research directions, the chapter identifies the challenge of integrating different error management schemes to account for multifold threats and the great promise of error-resilient computing. For chip designers, this chapter provides much-needed insights on applying a self-adaptive computing paradigm to achieve dependability on error-prone, cost-sensitive SoCs.

Introduction

Dependability is a primary requirement for computer-based systems. Communication systems, embedded control systems and even entertainment systems all demand a certain level of dependability to satisfy their application requirements (Kobayashi & Onodera, 2008). Dependability broadly encompasses the attributes of availability, reliability, safety, integrity, confidentiality and maintainability (Avizienis, Laprie, Randell, & Landwehr, 2004; Knight, 2002).

The SoC (System-on-Chip) is a modern form of VLSI (Very Large Scale Integration) chip. Thanks to the constant scaling of semiconductor devices, diverse system components, e.g. processors, memories, accelerators, sensors and actuators, together with the interconnections between them, can be designed and implemented on a single chip. The increasing use of SoCs in modern and emerging computing systems, e.g. health monitoring and industrial automation (Eshraghian, 2006), demands more attentive research on their dependability. In particular, the shrinking transistor feature size introduces significant process variations (Unsal et al., 2006), crosstalk, leakage and other submicron effects. Process variations refer to the deviation of actual physical parameters from nominal values due to the manufacturing process or post-manufacture aging. Time-dependent device degradation (Collet, Zajac, Psarakis, & Gizopoulos, 2011), e.g. due to aging and material wearout, run-time voltage and temperature variations (Henkel, Ebi, Amrouch, & Khdr, 2013), and even radiation (Duzellier, 2005) pose still more profound challenges to the design of SoCs.

Conventional techniques to provide dependability are either too costly or very limited in applicability. For example, TMR (triple modular redundancy) replicates a circuit three times to detect and correct errors, which incurs a large area penalty (Kobayashi & Onodera, 2008). Worst-case-based design, which applies a conservatively large safety margin (e.g. a high supply voltage to account for voltage variations), incurs significant performance or power overhead (Ernst et al., 2003). Moreover, effects such as aging cannot be detected by conventional duplication-based techniques, because the replicas may suffer from aging simultaneously.
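The voting principle behind TMR, and the reason a common degradation escapes it, can be illustrated with a minimal software sketch. It is not a hardware implementation and is not taken from the cited works; the names `majority_vote`, `tmr_execute`, `good_adder` and `faulty_adder`, and the injected bit-flip fault, are hypothetical and chosen only for illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three replica outputs."""
    return (a & b) | (a & c) | (b & c)


def tmr_execute(replicas, x):
    """Run three replicas on the same input and vote on their outputs."""
    outputs = [replica(x) for replica in replicas]
    voted = majority_vote(*outputs)
    # Any replica that disagrees with the voted result is flagged.
    suspects = [i for i, out in enumerate(outputs) if out != voted]
    return voted, suspects


def good_adder(x: int) -> int:
    # Fault-free 8-bit incrementer (hypothetical module under protection).
    return (x + 1) & 0xFF


def faulty_adder(x: int) -> int:
    # Hypothetical fault model: bit 3 of the output is flipped.
    return good_adder(x) ^ 0b1000


# A single faulty replica is out-voted and identified.
result, suspects = tmr_execute([good_adder, faulty_adder, good_adder], 41)

# If all replicas degrade in the same way, the voter sees full agreement
# and the error passes undetected.
common_result, common_suspects = tmr_execute([faulty_adder] * 3, 41)
```

In the first call the voter masks the single fault and flags replica 1; in the second call all three replicas agree on the same wrong value, so nothing is flagged. This is the limitation noted above for simultaneous aging of replicas.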

Self-adaptive computing is an emerging paradigm to achieve functional diversity, energy and power efficiency, and dependability in systems of different scales (Salehie & Tahvildari, 2009). By monitoring its internal (e.g. power consumption) and external (e.g. ambient temperature) status, a self-adaptive system can tune its own configuration to approach various objectives (specified by the user or system administrator, or derived from these external sources) (Guang, 2012). Because they do not rely heavily on static assumptions about circuit and environmental status, self-adaptive techniques can better address run-time dependability concerns and can apply diverse temporal or spatial recovery techniques to maintain system function and performance. Such techniques have been widely proposed for various processing, communication and memory components and for overall systems (e.g. Das et al., 2009; Collet et al., 2011; Boesen, Keymeulen, Madsen, Lu, & Tien-Hsin Chao, 2011b), and this direction shows great potential.
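The monitor, decide and reconfigure cycle described above can be sketched as a small control loop. This is an illustrative toy under stated assumptions, not a method from the cited works: the `Status` and `Config` fields, all thresholds and step sizes, the policy in `decide`, and the `toy_monitor` model are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Status:
    temperature_c: float  # monitored temperature (hypothetical sensor)
    error_rate: float     # observed errors per operation (hypothetical)


@dataclass(frozen=True)
class Config:
    voltage_mv: int
    frequency_mhz: int


# Hypothetical limits and step sizes, chosen only for illustration.
MAX_TEMP_C = 85.0
MAX_ERROR_RATE = 1e-3
V_STEP_MV, V_MAX_MV = 50, 1200
F_STEP_MHZ, F_MIN_MHZ = 100, 200


def decide(status: Status, config: Config) -> Config:
    """Decision step: choose the next configuration from monitored status."""
    if status.temperature_c > MAX_TEMP_C:
        # Too hot: lower the clock frequency, down to a floor.
        lowered = max(F_MIN_MHZ, config.frequency_mhz - F_STEP_MHZ)
        return replace(config, frequency_mhz=lowered)
    if status.error_rate > MAX_ERROR_RATE:
        # Too many errors: raise the supply voltage, up to a ceiling.
        raised = min(V_MAX_MV, config.voltage_mv + V_STEP_MV)
        return replace(config, voltage_mv=raised)
    return config


def adapt(monitor, config: Config, steps: int):
    """Run the monitor -> decide -> reconfigure loop for a fixed number of steps."""
    history = [config]
    for _ in range(steps):
        status = monitor(config)          # monitoring
        config = decide(status, config)   # decision-making
        history.append(config)            # reconfiguration takes effect
    return history


def toy_monitor(config: Config) -> Status:
    # Toy model: errors appear whenever the supply is below 1000 mV.
    error_rate = 1e-2 if config.voltage_mv < 1000 else 0.0
    return Status(temperature_c=60.0, error_rate=error_rate)


history = adapt(toy_monitor, Config(voltage_mv=900, frequency_mhz=800), steps=4)
final = history[-1]
```

Starting from 900 mV, the loop raises the voltage in two steps until the toy error model reports no errors, then holds the configuration. A real system would replace `toy_monitor` with on-chip sensors and `decide` with a policy matched to the targeted threats.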

Existing survey papers cover either self-adaptive software, e.g. (Salehie & Tahvildari, 2009), or the dependability of specific subsystems, e.g. (Radetzki, Feng, Zhao, & Jantsch, 2013), while overview papers address the dependability issues of embedded systems in general, e.g. (Henkel et al., 2011; Knight, 2002). This chapter gives a dedicated study of self-adaptive architectures and techniques for providing dependability on SoCs. Particular focus lies on time-dependent hardware errors and degradation, e.g. due to aging, thermal stress or radiation, and on run-time techniques to predict, detect, recover from and tolerate these dependability threats.
