Enhancing System Reliability Through Targeting Fault Propagation Scope

Hakduran Koc (University of Houston, Clear Lake, USA) and Oommen Mathews (University of Houston, Clear Lake, USA)
Copyright: © 2020 | Pages: 30
DOI: 10.4018/978-1-7998-1718-5.ch004

Abstract

The unprecedented scaling of embedded devices, and the stochastic fault occurrences that result from it, make reliability a critical design and optimization metric. This chapter presents a task recomputation-based approach for improving the reliability of multi-core embedded systems. Given a task graph representation of the application, the proposed technique targets the tasks whose failures have the most significant effect on overall system reliability. The results of tasks with a larger fault propagation scope are recomputed during the idle times of the available processors without incurring any performance or power overhead. The technique incorporates the fault propagation scope of each task and its degree of criticality into the scheduling algorithm and maximizes the usage of the processing elements. The experimental evaluation demonstrates the viability of the proposed approach, which generates more efficient results under different latency constraints.

Introduction

Embedded technology solutions have become pervasive owing to the convenience they offer across a plethora of applications. Because the computing environments running embedded applications are typically characterized by performance and power constraints, numerous energy-efficient, performance-driven techniques have been proposed in the literature. In addition, since the accuracy of the results is the primary concern of the design process, system reliability is considered another important optimization metric in any computing platform. The reliability of a system can be defined as the probability that the system generates its intended output in accordance with its specifications for a given time period, given that it was functioning correctly at the start of that period. Critical real-time systems must function correctly within their timing constraints even in the presence of faults, so designing such systems with fault tolerance is a challenge. Fault tolerance is defined as the ability of a system to comply with its specifications despite the presence of faults in any of its components (Ayav, Fradet, & Girault, 2006). With submicron process technologies, many faults have surfaced. Such faults can be categorized as a) permanent faults (e.g., damaged microcontrollers or communication links), b) transient faults (e.g., electromagnetic interference), and c) intermittent faults (e.g., a cold solder joint). Among them, transient faults are the most common, driven by increasing complexity, transistor sizes, wafer misalignments, operational frequency, and voltage levels. The challenge is often to introduce fault tolerance efficiently while still meeting performance constraints.
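
The definition of reliability given above can be written compactly as a conditional probability. The notation below is an assumption introduced here for illustration and does not appear in the chapter preview: T denotes the time at which the system first deviates from its specification, and t is the length of the time period of interest.

```latex
R(t) \;=\; \Pr\{\, T > t \;\mid\; \text{the system is functioning correctly at time } 0 \,\}
```

Under this reading, R(0) = 1 and R(t) cannot increase as t grows, since a longer period gives faults more opportunity to occur.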

When reliability is improved using techniques such as checkpointing or replication, there is invariably a performance overhead as well. In many cases, the penalty is additional off-chip memory accesses, which may cause performance degradation and energy overhead. The situation is more serious in chip multiprocessor-based environments, where multiple processors may try to use the limited off-chip bandwidth simultaneously, resulting in bus contention that hampers performance and increases power consumption (Koc, Kandemir, Ercanli, & Ozturk, 2007; Koc, Kandemir, & Ercanli, 2010).

In this work, we propose a technique based on task recomputation to improve system reliability without incurring any performance degradation. Given an embedded application represented as a task graph, task recomputation regenerates the result of a task from its predecessors whenever that result is needed, instead of making an explicit off-chip memory access, provided that doing so is beneficial in terms of performance or energy. The proposed technique is an enhanced version of task recomputation that handles different fault scenarios. The approach iteratively searches for idle time frames on the available processing cores and assigns tasks for recomputation based on their criticality with respect to the task graph. The necessary condition for a task to be recomputed is that all of its preceding tasks have been scheduled for execution and their outputs are available in memory. The task is then recomputed immediately before its successor tasks require its output, so that memory is not occupied for a prolonged period. Each recomputation is made in accordance with the latency constraint and the available resources so that the overall execution deadline is met.
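
The idle-time search and the recomputation condition described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the chapter's algorithm: the names `Slot`, `idle_windows`, and `place_recomputation`, the integer time units, and the single-core example schedule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    """One scheduled task occupying a core during [start, end). Hypothetical type."""
    task: str
    start: int
    end: int


def idle_windows(busy, deadline):
    """Return the (start, end) gaps on one core between time 0 and the deadline."""
    gaps, cursor = [], 0
    for slot in sorted(busy, key=lambda s: s.start):
        if slot.start > cursor:
            gaps.append((cursor, slot.start))
        cursor = max(cursor, slot.end)
    if cursor < deadline:
        gaps.append((cursor, deadline))
    return gaps


def place_recomputation(task, window, preds, finish, duration, needed_by):
    """Return (start, end) for recomputing `task` inside `window`, or None.

    Assumed inputs: `finish` maps each already scheduled task to its finish
    time, and `needed_by` maps a task to the time its successors consume it.
    """
    w_start, w_end = window
    # Necessary condition: every predecessor has been scheduled.
    if any(p not in finish for p in preds[task]):
        return None
    # Predecessor outputs are available once the last predecessor finishes.
    ready = max((finish[p] for p in preds[task]), default=0)
    earliest = max(w_start, ready)
    latest_end = min(w_end, needed_by[task])
    if latest_end - earliest < duration[task]:
        return None
    # Recompute as late as possible so the result is held in memory briefly.
    return (latest_end - duration[task], latest_end)


# Hypothetical single-core schedule: A runs in [0, 2), D runs in [5, 8).
busy = [Slot("A", 0, 2), Slot("D", 5, 8)]
windows = idle_windows(busy, deadline=10)

# Task B depends on A, takes 2 time units, and is consumed by D at time 5.
placement = place_recomputation(
    "B", windows[0],
    preds={"B": ["A"]}, finish={"A": 2},
    duration={"B": 2}, needed_by={"B": 5},
)
print(windows, placement)
```

In this example the core is idle in two windows, and B fits only in the first one. Placing it at the end of that window reflects the goal of recomputing a result just before it is required.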

The proposed approach involves a new fault-tolerant scheduling algorithm built on two metrics: Fault Propagation Scope (FPS) and Degree of Criticality (DoC). Using these two metrics, our technique strategically reduces the probability of fault propagation during execution. Recomputation is applied to selected tasks based on their FPS and DoC values. FPS is an effective measure for analyzing and evaluating the gravity of faults in embedded systems. The approach has been evaluated experimentally on benchmarks, and the results are discussed later in this chapter.
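
To make the idea of ranking tasks by fault propagation scope concrete, the sketch below uses a simple stand-in definition. This preview does not define FPS precisely, so the code assumes, purely for illustration, that the FPS of a task is the number of distinct tasks reachable from it in the task graph (the tasks a faulty result could reach). DoC is not modeled because the preview gives no definition for it. The function name and the example graph are hypothetical.

```python
def fault_propagation_scope(succs):
    """Map each task to the number of distinct tasks reachable from it.

    `succs` maps every task to a list of its direct successors.
    Assumption: FPS is modeled as a descendant count; the chapter's
    own definition may differ.
    """
    def reachable(task, seen):
        for nxt in succs.get(task, ()):
            if nxt not in seen:
                seen.add(nxt)
                reachable(nxt, seen)
        return seen

    return {task: len(reachable(task, set())) for task in succs}


# Hypothetical diamond-shaped task graph: A feeds B and C, which both feed D.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
fps = fault_propagation_scope(graph)

# Tasks with a larger scope come first; ties are broken by task name.
ranked = sorted(fps, key=lambda t: (-fps[t], t))
print(fps, ranked)
```

Here a fault in A can reach every other task, so A is the first candidate for recomputation, while a fault in D affects no downstream task.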
