Reliability Based Scheduling Model (RSM) for Computational Grids

Zahid Raza (Jawaharlal Nehru University, India) and Deo P. Vidyarthi (Jawaharlal Nehru University, India)
Copyright: © 2011 | Pages: 18
DOI: 10.4018/jdst.2011040102

Abstract

The computational grid, characterized by distributed load sharing, has evolved into a platform for large-scale problem solving. A grid is a collection of heterogeneous resources offering services of varying nature, in which jobs may be submitted to any of the participating nodes. Scheduling these jobs in such a complex and dynamic environment poses many challenges. Reliability analysis of the grid is of paramount importance because a grid involves a large number of resources that may fail at any time, making it unreliable. These failures waste both computational power and the money spent on scarce grid resources. It is therefore normally desired that a job be scheduled in an environment that ensures maximum reliability for its execution. This work presents a reliability-based scheduling model for jobs on the computational grid. The model considers the failure rates of both the software and hardware constituents of the grid, such as the application demanding execution, the nodes executing the job, and the network links supporting data exchange between the nodes. Job allocation using the proposed scheme becomes trusted, as it schedules the job based on an a priori reliability computation.

Introduction

The scientific community always thirsts for powerful computational tools and methods. This has driven enormous developments in the computing world: faster processors, fast and large memory, efficient network devices for fast and reliable data transmission, and advances in software technology. The thirst for computational power led to newer tools, which in turn fed back to improve scientific research. This self-feeding cycle resulted in the aggregation of heterogeneous resources known as the Grid, which enables collaborative engineering (Foster & Kesselman, 1998; Foster, 2002; Tarricone & Esposito, 2005; Taylor & Harrison, 2009).

A grid can be viewed as a number of clusters, each comprising computing resources of nearly the same nature, although the nature of the nodes may differ across clusters. Participants inside a cluster agree to cooperate in problem solving, thus forming a virtual organization (VO). At any moment there may be many virtual organizations inside the grid, each with a dynamic constitution. Jobs may enter the grid through any of the participating nodes. To harness the advantages of the grid, these jobs should be scheduled so as to exploit their parallel and concurrent nature. Scheduling is the problem of mapping jobs onto grid resources, and it is said to be efficient if the mapping takes the job requirements into account, e.g., the nature of the job, its inherent parallelism, and proper load balancing. Since scheduling is an NP-hard problem, many scheduling models have been proposed in the literature, each optimizing one parameter or another.

Whenever a job enters the grid for execution, its possible causes of failure range from application failure to resource failure (node failure etc.). Failure can result from many things, viz. specification mistakes (incorrect algorithms, architectures etc.), hardware failures (hot crash, network partition etc.), software failures (numerical exception, failed application etc.), implementation mistakes, component defects, external disturbances (radiation, electromagnetic waves, interference etc.), performance failures (application not completing within a specified time etc.), or other failures (machine rebooted by the owner, excessive CPU load, decreased priority assigned by the local resource to the current task etc.) (Huda, Schmidt, & Peake, 2005). A fault-tolerant system is one that continues to perform even in the presence of hardware and software failures. A fault is a physical defect, imperfection, or flaw that occurs within some hardware or software component, whereas an error is the manifestation of a fault and is a deviation from accuracy or correctness. Specifically, faults cause errors, and errors cause failures. Depending on its type, a grid may be susceptible to any or all types of faults.

Reliability is the ability of a system to perform and maintain its functions in routine circumstances as well as in hostile or unexpected ones. The more fault tolerant a system is, the more reliable it is. Reliability adds quality to the system and is an often-desired parameter for schedulers, owing to the large size of the grid and its composition of scarce resources. Failures can result in huge losses, both in money and in wasted computational power. Thus, a grid scheduler is always expected to ensure a reliable environment for job execution. Whenever a grid is designed, the hardware components come with a failure rate specified by the manufacturer and supplied as part of the hardware specifications. Software components also have a failure rate, specified during software design using software engineering paradigms. These failure rates reflect the reliability of the system, which is desired to be high. For the scheduling decision, reliability should be computed beforehand, taking into account the contributions of both the hardware and the software, so that the probability of successful job execution increases. In this work, we propose a Reliability Based Scheduling Model (RSM), which allocates a modular job to the cluster of the grid that matches the job's requirements and offers the most reliable environment for job execution.
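
To make the idea of an a priori reliability computation concrete, the following is a minimal sketch and not the RSM formulation itself, which this preview does not give. It assumes that each constituent (application, node, network link) fails independently at a constant rate, so that its reliability over a duration t is exp(-λt), and that the job succeeds only if all constituents survive. The function names, cluster names, and failure-rate values are hypothetical and chosen only for illustration.

```python
import math

def series_reliability(failure_rates, duration):
    """Reliability of independent components that must all survive.

    Assumption (not from the article): constant failure rates, so each
    component has reliability exp(-rate * duration) and the product
    collapses to exp(-sum(rates) * duration).
    """
    return math.exp(-sum(failure_rates) * duration)

def most_reliable_cluster(clusters, app_failure_rate, duration):
    """Return (name, reliability) of the cluster with the highest reliability."""
    best_name, best_rel = None, -1.0
    for name, rates in clusters.items():
        rel = series_reliability(
            [app_failure_rate, rates["node"], rates["link"]], duration
        )
        if rel > best_rel:
            best_name, best_rel = name, rel
    return best_name, best_rel

# Hypothetical failure rates (failures per unit time) for two clusters.
clusters = {
    "cluster_a": {"node": 1e-4, "link": 5e-5},
    "cluster_b": {"node": 2e-5, "link": 2e-4},
}

name, rel = most_reliable_cluster(clusters, app_failure_rate=1e-5, duration=1000.0)
print(name, round(rel, 4))
```

Under these assumed numbers the scheduler would prefer cluster_a: cluster_b has the more reliable node, but its weaker network link dominates the combined failure rate. This illustrates why the hardware and software contributions need to be considered together.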
