A Fault-Tolerant Scheduling Algorithm Based on Checkpointing and Redundancy for Distributed Real-Time Systems

Barkahoum Kada, Hamoudi Kalla
DOI: 10.4018/978-1-7998-5339-8.ch036

Abstract

Real-time systems are becoming ever more widely used in life-critical applications, and the need for fault-tolerant scheduling can only grow in the years ahead. This article presents a novel approach for tolerating transient faults in hard real-time systems. The proposed approach combines checkpointing with rollback and active replication to tolerate several transient faults. Based on this approach, a new static fault-tolerant scheduling algorithm, SFTS, is presented. It is based on a list of scheduling heuristics that satisfy the application's time constraints even in the presence of faults by exploiting the spare capacity of the available processors in the architecture. Simulation results show the performance and effectiveness of the proposed approach compared to other fault-tolerant approaches. The results reveal that, in the presence of multiple transient faults, the average timing overhead of this approach is lower than that of the checkpointing technique. Moreover, the proposed SFTS algorithm achieves a better feasibility rate in the presence of multiple transient faults.

Literature Review

Extensive research has investigated software-based fault tolerance techniques against transient faults. In the software replication technique (Girault et al., 2004; Assayad et al., 2012; Samal et al., 2013; Meroufel & Belalem, 2014), multiple replicas (active or passive) of each task are executed on different processors.

Assayad et al. (2012) proposed a new tri-criteria scheduling heuristic that minimizes the schedule length, the global system failure rate, and the power consumption of the generated schedule. Active replication of tasks and data dependencies is used to increase system reliability. Samal et al. (2013) used the primary-backup approach (passive replication) as a fault-tolerant scheduling technique to guarantee real-time task constraints in the presence of permanent or transient faults. The authors proposed fault-tolerant scheduling for independent tasks using a hybrid genetic algorithm.

The replication technique is effective at tolerating multiple spatial faults (permanent or transient) and is preferable for safety-critical systems (Ejlali et al., 2012). However, scheduling multiple replicas of each task on different processors may not be affordable due to cost constraints (Ropars et al., 2015).
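The cost argument can be made concrete with a small sketch. It is not taken from any of the cited works: it assumes fail-silent faults, so that k + 1 active replicas of a task on distinct processors are enough to tolerate k faults, and it uses a hypothetical greedy placement (function name `place_replicas`, least-loaded processors first) purely for illustration.

```python
def place_replicas(tasks, n_processors, k_faults):
    """Place k_faults + 1 active replicas of each task on distinct processors.

    tasks: dict mapping task name -> worst-case execution time.
    Returns (placement, loads): the processors chosen for each task and the
    total execution time assigned to each processor.
    Assumption: faults are fail-silent, so k + 1 replicas tolerate k faults.
    """
    replicas = k_faults + 1
    if replicas > n_processors:
        raise ValueError("need at least k + 1 processors for k faults")

    loads = [0.0] * n_processors
    placement = {}
    # Longest tasks first, each replica on one of the least-loaded processors.
    for name, wcet in sorted(tasks.items(), key=lambda item: -item[1]):
        by_load = sorted(range(n_processors), key=lambda p: (loads[p], p))
        chosen = by_load[:replicas]
        for p in chosen:
            loads[p] += wcet
        placement[name] = sorted(chosen)
    return placement, loads


tasks = {"a": 4, "b": 3, "c": 2}
placement, loads = place_replicas(tasks, n_processors=3, k_faults=1)
print(placement)
print(loads, "total processor time:", sum(loads))
```

Under these assumptions the total processor time is always (k + 1) times the fault-free workload, whether or not a fault actually occurs, which is the overhead that makes replication expensive.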

Checkpointing with rollback recovery (Han et al., 2015; Izosimov et al., 2012; Wei et al., 2012; Zhang & Chakrabarty, 2006; Kumar et al., 2015) and re-execution (Izosimov et al., 2008; Gui & Luo, 2013) are classified by Motaghi and Zarandi (2014) as time-based redundancy methods. These methods deal with transient faults by executing the faulty task again, serially, on the same processor. Izosimov et al. (2008) proposed quasi-static scheduling for fault-tolerant embedded systems composed of hard and soft processes, in which re-execution is employed to recover from multiple faults. Han et al. (2015) presented a task allocation scheme that minimizes energy consumption while meeting the fault tolerance requirement of the system. They developed an efficient method for determining the checkpointing scheme that tolerates k transient faults on a single processor. These methods do not impose any hardware cost overhead, but they are not effective at tolerating transient faults of very long duration. Moreover, serial execution may cause time constraints to be violated.
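The timing risk of time-based redundancy can be illustrated with a simplified model. The sketch below is an assumption for illustration and is not the scheme of Han et al. (2015) or of any other cited work: a task with worst-case execution time `wcet` is split into `n_checkpoints` equal segments, each checkpoint costs `checkpoint_cost`, each rollback costs `recovery_cost`, and each of the `k_faults` faults forces one segment to be re-executed. All function and parameter names are hypothetical.

```python
import math


def worst_case_time(wcet, n_checkpoints, k_faults, checkpoint_cost, recovery_cost):
    """Worst-case completion time with equidistant checkpoints (simplified model)."""
    if n_checkpoints < 1:
        raise ValueError("at least one checkpoint segment is required")
    segment = wcet / n_checkpoints
    # Fault-free time plus checkpoint overhead, plus one segment re-execution
    # and one rollback per fault.
    return (wcet + n_checkpoints * checkpoint_cost
            + k_faults * (segment + recovery_cost))


def best_checkpoint_count(wcet, k_faults, checkpoint_cost, recovery_cost):
    """Checkpoint count that minimizes worst_case_time under this model."""
    if checkpoint_cost <= 0:
        raise ValueError("checkpoint_cost must be positive")
    if k_faults == 0:
        return 1
    # The continuous minimum is at sqrt(k * C / checkpoint_cost); the cost is
    # convex in n, so the integer optimum is one of the two neighbours.
    ideal = math.sqrt(k_faults * wcet / checkpoint_cost)
    candidates = {max(1, math.floor(ideal)), max(1, math.ceil(ideal))}
    return min(candidates, key=lambda n: (
        worst_case_time(wcet, n, k_faults, checkpoint_cost, recovery_cost), n))


n = best_checkpoint_count(wcet=60, k_faults=2, checkpoint_cost=5, recovery_cost=1)
total = worst_case_time(60, n, 2, 5, 1)
print(n, total, "meets deadline of 100:", total <= 100)
```

With these illustrative numbers the best worst-case completion time is 111 time units for a task whose fault-free execution takes 60, so a deadline of 100 would be missed even with the best checkpoint count. This is the kind of deadline violation that serial recovery on a single processor can cause.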
