A Replica Based Co-Scheduler (RBS) for Fault Tolerant Computational Grid

Zahid Raza, Deo P. Vidyarthi
Copyright: © 2011 | Pages: 16
DOI: 10.4018/978-1-60960-603-9.ch007

Abstract

A Grid is a parallel and distributed computing system comprising heterogeneous computing resources spread over multiple administrative domains, offering high-throughput computing. Because the Grid operates at a large scale, there is always a possibility of failure, ranging from hardware to software. The penalty paid for these failures may be very large. The system needs to tolerate the various failures which, in spite of many precautions, are bound to happen. Replication is a strategy often used to introduce fault tolerance into the system and ensure successful execution of the job even when some of the computational resources fail. Though replication incurs a heavy cost, a selective degree of replication can offer a good compromise between performance and cost. This chapter proposes a co-scheduler that can be integrated with the main scheduler for the execution of jobs submitted to a computational Grid. The main scheduler may have any performance optimization criterion; integrating the co-scheduler adds fault tolerance on top of it. The chapter evaluates the performance of the co-scheduler with a main scheduler designed to minimize the turnaround time of a modular job, introducing module replication to counter the effects of node failures in a Grid. A simulation study reveals that the model works well under various conditions, resulting in a graceful degradation of the scheduler’s performance while improving the overall reliability offered to the job.
Chapter Preview

Introduction

Because computational resources are scarce, they must be used efficiently. Resources may range from specialized computational machines and storage machines to heterogeneous applications. A grid aggregates resources across the world seamlessly and enables their use as, when, and wherever desired, so that individual groups need not invest heavily in high-performance computational resources. In the era of high-performance and high-throughput computing, the grid has emerged as an efficient means of connecting distributed computers or resources scattered all over the world for collaborative computing, essentially unifying various heterogeneous resources on a common platform while diminishing administrative boundaries to provide transparent access to the user. In essence, by becoming part of the grid a user gains, in principle, the capability to execute any kind of job anywhere. Therefore, even if the appropriate computational capabilities are not available to the user, the grid helps the job execute on the right resources, making it efficient as well as cost-effective.

Depending on their use, grids can be classified as computational grids, data grids, sensor grids, biological grids, etc. A computational grid emphasizes the computing aspect: it schedules the job onto grid resources by examining the job's computational requirements and balancing the load effectively. Scheduling can be based on various objectives, such as maximizing the reliability of job execution, minimizing the makespan, or maximizing the Quality of Service (QoS) of job execution (Grid Computing Info centre, 2008; Baker, Buyya, & Laforenza, 2002; Tarricone & Esposito, 2005; Ernemann, Hamscher, & Yahyapour, 2002; Casanova, 2002; Vidyarthi, Sarker, Tripathi & Yang, 2009; Raza & Vidyarthi, 2008, 2009).

Execution of a job on the complex and dynamic grid poses a number of challenges. One of these is to provide a reliable environment for the job so that it can cope with any kind of failure. Since grid resources are heterogeneous in behavior and administrative control, introducing fault tolerance into the system is very difficult. In addition, the jobs submitted to the grid may themselves be very complex and take a long time to execute, making them vulnerable to failures. Further, the resources are under user control, so accidental damage or a forced shutdown may cause the execution to fail. The same is true of network failures. Failures may thus range from hardware to software to the network. Fault-tolerance techniques can accordingly vary from proactive to reactive approaches to counter failure at any level (Dai, Xie, & Poh, 2002; Huda, Schmidt & Peake, 2005; Mujumdar, Bheevgade, Malik & Patrikar, 2008). In spite of these measures, the possibility of failure cannot be ruled out. The desired objective is to accept these failures and minimize their effect by degrading the system gracefully, continuing job execution at the cost of reduced overall performance. One popular mechanism for handling failures is replication. This can take a hardware or a software form, in which the same application is executed or stored on more than one resource. Therefore, with a slight increase in execution cost, replication increases the probability of successful execution of the job, making the system fault tolerant.

Replication incurs a heavy cost, but this cost can be minimized by adopting selective replication. The selection of nodes or job modules depends on parameters that the system can set according to the scheduling requirements. The RBS works by replicating some of the modules allocated to a node with a high failure rate onto nodes with a lower failure rate. It thereby increases the fault tolerance of the system without severely affecting performance.
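
The idea of selective replication can be illustrated with a minimal Python sketch. It is not the chapter's actual algorithm, whose selection parameters are not given in this preview. The names `Node`, `plan_replicas`, `module_success_prob`, and the `threshold` cut-off are hypothetical, and the sketch assumes that each node has a known failure probability and that nodes fail independently. It also ignores load on the replica host, which a real co-scheduler would have to weigh against turnaround time.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    failure_prob: float  # assumed probability that the node fails during the job


def plan_replicas(allocation, nodes, threshold):
    """Pick a replica host for every module placed on a failure-prone node.

    allocation: {module: node_name}, as produced by some main scheduler.
    threshold:  hypothetical cut-off; a node whose failure_prob exceeds it
                is treated as "high failure rate".
    Returns {module: replica_node_name}; modules on reliable nodes get none.
    """
    by_name = {n.name: n for n in nodes}
    # Candidate replica hosts: nodes at or below the threshold, most reliable first.
    safe = sorted(
        (n for n in nodes if n.failure_prob <= threshold),
        key=lambda n: n.failure_prob,
    )
    replicas = {}
    for module, node_name in allocation.items():
        if by_name[node_name].failure_prob > threshold and safe:
            replicas[module] = safe[0].name
    return replicas


def module_success_prob(module, allocation, replicas, nodes):
    """Probability that at least one copy of the module survives.

    Assumes independent node failures, so the module is lost only if
    its primary node and its replica node (if any) both fail.
    """
    by_name = {n.name: n for n in nodes}
    p_fail = by_name[allocation[module]].failure_prob
    if module in replicas:
        p_fail *= by_name[replicas[module]].failure_prob
    return 1.0 - p_fail


# Illustrative values only, not taken from the chapter.
nodes = [Node("A", 0.4), Node("B", 0.1), Node("C", 0.05)]
allocation = {"m1": "A", "m2": "B"}
replicas = plan_replicas(allocation, nodes, threshold=0.2)
```

With these illustrative values only `m1`, which sits on the unreliable node `A`, receives a replica, placed on `C`. Under the independence assumption its chance of completing rises from 0.6 to about 0.98, while `m2` is left alone and incurs no replication cost.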

This chapter has six sections. The next section discusses related work reported in the literature with a similar objective, followed by a section elaborating the need for RBS and its integration with a main scheduler. The working of the model is then illustrated using a suitable example, along with details of the results obtained from the simulation study. The chapter concludes by detailing the achievements and drawbacks of the work.
