Achieving QoS in Highly Unreliable Grid Environments

Antonios Litke (National Technical University of Athens, Greece)
DOI: 10.4018/978-1-60566-370-8.ch009

Abstract

Grids can form the basis for pervasive computing because they are open, scalable, and flexible with respect to various changes (from topology changes to unpredicted node failures). However, such environments are inherently prone to failures and need a certain level of reliability to provide viable and commercially exploitable solutions. This has prompted significant research on achieving specific levels of Quality of Service (QoS) in highly unreliable environments such as mobile and ad hoc Grids. This study analyzes the state of the art in QoS for Grids and the technological means by which it is achieved, and presents a short survey of related work. The features of unreliable environments such as mobile Grids are then analyzed in more detail. Finally, an innovative and efficient mechanism designed especially for such environments is described. It enhances them with two QoS attributes: reliability (fault tolerance through task replication) and service differentiation for Grid users through a simple task prioritization scheme. The results of this recent work are promising for the future commercialization of Grids in such environments.

Introduction

Grid computing has recently moved from traditional high-performance and distributed computing to pervasive and utility computing, building on the advanced capabilities of wireless networks and lightweight, thin devices. The result is a new computing paradigm: the Mobile Grid. The Mobile Grid fully inherits the properties of the Grid, with the additional feature of supporting mobile users and resources in a seamless, transparent, secure, and efficient way. It can deploy underlying ad hoc networks and provide a self-configuring Grid system of mobile resources (hosts and users) connected by wireless links that form arbitrary and unpredictable topologies. Moreover, because it is open, highly heterogeneous, and scalable, it is also the basis and enabling technology for pervasive and utility computing.

Combining thousands of parts into one large system generally yields a less reliable platform. Together with long-running codes, this leads to application execution times that exceed the mean time to failure of the machines. Fault tolerance is therefore vital in the mobile Grid paradigm, since both mission-critical systems and computationally intensive applications operate in diverse, dependable, cross-organizational environments, as is the case in the emerging mobile Grid. Mobile Grid computing is a typical example of a highly unreliable computational environment, so it cannot be expected to be fault-free, even if individual techniques such as fault avoidance and fault removal (Lyu, 1995) are also applied to its resources. Fault tolerance mechanisms must therefore be deployed so that the Grid system performs correctly in the presence of faults, giving it the appropriate reliability.
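
To make the point about execution times exceeding the mean time to failure concrete, assume (this model is not stated in the chapter) that failures of a resource follow an exponential distribution with mean time to failure M. The probability that a task of duration t is interrupted by a failure is then:

```latex
P_{\text{fail}}(t) = 1 - e^{-t/M}, \qquad
P_{\text{fail}}(M) = 1 - e^{-1} \approx 0.632, \qquad
P_{\text{fail}}(2M) = 1 - e^{-2} \approx 0.865
```

Under this assumption, a single execution that runs as long as the mean time to failure already fails more often than not, which is why redundancy rather than retrying alone is attractive.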

We present a fault-tolerant task-scheduling model for mobile Grid systems based on task replication. The basic idea is to create multiple replicas of a given task and schedule them on the Grid infrastructure. The number of replicas is calculated by the Grid middleware from the failure probability of the Grid resources and the policy adopted for providing a specific level of fault tolerance. The replication model is static (Nguyen-Tuong, 2000): when a replica fails, it is not replaced by a new one. Introducing task replicas adds overhead to the workload allocated to the Grid for execution. Scheduling and resource management are important both for optimizing Grid resource allocation (Ramamritham, Stankovic, & Shiah, 1990) and for determining the Grid's ability to deliver the negotiated Quality of Service (QoS) to all users (“Scheduling Working Group,” 2001; Wang, Ramamritham, & Stankovic, 1995). The idea in this chapter is to handle the additional tasks with a resource management scheme based on the knapsack problem formulation (Pisinger, 1995), in which each task is assigned a weight and a profit for its correct execution. This allows efficient time scheduling of the tasks and their replicas, maximizing both the utilization of the Grid resources and the profit gained from successfully executed tasks. In addition, a task prioritization scheme is applied to make scheduling more efficient and to complement the QoS attributes of the Grid.
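
This excerpt does not give the replica-count formula or the exact knapsack formulation, so the following Python sketch is illustrative only. It assumes that replicas fail independently with a common probability p_fail and that the fault-tolerance policy is expressed as a target probability that at least one replica succeeds; replicas_needed then returns the smallest r with 1 - p_fail**r >= target. schedule_knapsack is a standard 0/1 knapsack dynamic program in which each task's weight is the integer resource time it occupies and its value is its profit multiplied by a priority factor standing in for the prioritization scheme. All names (replicas_needed, Task, schedule_knapsack, priority) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


def replicas_needed(p_fail: float, target: float) -> int:
    """Smallest replica count r such that at least one replica succeeds with
    probability >= target, assuming each replica fails independently with
    probability p_fail (an illustrative policy, not the chapter's formula)."""
    if not 0.0 < p_fail < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p_fail and target must lie in (0, 1)")
    r = 1
    # All r replicas fail together with probability p_fail ** r.
    while 1.0 - p_fail ** r < target:
        r += 1
    return r


@dataclass
class Task:
    name: str
    weight: int         # resource time the task occupies (integer units)
    profit: float       # profit for correct execution
    priority: int = 1   # hypothetical multiplier for service differentiation


def schedule_knapsack(tasks: List[Task], capacity: int) -> Tuple[float, List[Task]]:
    """0/1 knapsack by dynamic programming: pick the subset of tasks whose total
    weight fits in `capacity` and whose priority-weighted profit is maximal."""
    n = len(tasks)
    # best[i][c] = best value using the first i tasks within capacity c
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i, task in enumerate(tasks, start=1):
        value = task.profit * task.priority
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip task i
            if task.weight <= c:         # option 2: schedule task i
                best[i][c] = max(best[i][c], best[i - 1][c - task.weight] + value)
    # Walk back through the table to recover which tasks were scheduled.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(tasks[i - 1])
            c -= tasks[i - 1].weight
    chosen.reverse()
    return best[n][capacity], chosen


if __name__ == "__main__":
    print(replicas_needed(p_fail=0.3, target=0.99))  # 4
    items = [
        Task("T1", weight=3, profit=10, priority=2),
        Task("T2", weight=4, profit=8),
        Task("T3", weight=2, profit=5),
    ]
    value, chosen = schedule_knapsack(items, capacity=6)
    print(value, [t.name for t in chosen])  # 25.0 ['T1', 'T3']
```

In a full scheduler the replicas of each task would enter the knapsack as additional items; how their profit is accounted for is not specified here, so the example schedules a plain list of items. The dynamic program runs in O(n·C) time for n items and capacity C, which is practical when resource time is discretized into modest integer units.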
