QoS-Oriented Grid-Enabled Data Warehouses

Rogério Luís de Carvalho Costa (University of Coimbra, Portugal) and Pedro Furtado (University of Coimbra, Portugal)
DOI: 10.4018/978-1-60566-756-0.ch009

Abstract

Globally accessible data warehouses are useful in many commercial and scientific organizations. For instance, research centers can be brought together through a grid infrastructure to form a large virtual organization with a huge virtual data warehouse, which grid participants should be able to query transparently and efficiently. As is common in grid environments, a grid-based data warehouse may both face resource constraints and define Service Level Objectives (SLOs) that provide some Quality of Service (QoS) differentiation for each group of users, participant organization or requested operation. In this work, we discuss query scheduling and data placement in the grid-based data warehouse and propose the use of QoS-aware strategies. There is prior work on parallel and distributed data warehouses, but most of it does not address the grid environment, and the work that does relies on best-effort strategies. Our experimental results show the importance and effectiveness of the proposed strategies.

Introduction

In the last few years, Grid technology has become a key component in many widely distributed applications from distinct domains, including both research-oriented and business-related projects. The Grid is used as an underlying infrastructure that provides transparent access to shared and distributed resources, such as supercomputers, workstation clusters, storage systems and networks (Foster, 2001). In Data Grids, the infrastructure is used to coordinate the storage of huge volumes of data or the distributed execution of jobs that consume or generate large volumes of data (Krauter et al., 2002; Venugopal et al., 2006). Most work on data grids considers the use or management of large files, but grid-enabled Database Management Systems (DBMS) may be highly useful in several applications from distinct domains (Nieto-Santisteban et al., 2005; Watson, 2001).

Data warehouses, in turn, are mostly read-only databases that store historical data commonly used for decision support and knowledge discovery (Chaudhuri & Dayal, 1997). Grid-based data warehouses are useful in many real and virtual global organizations that generate huge volumes of distributed data. In such a context, the data warehouse is a highly distributed database whose data may be loaded from distinct sites and that should be transparently queried by users from distinct domains.

However, constructing effective grid-based applications is not simple. Grids are usually very heterogeneous environments composed of resources that may belong to distinct organizational domains. Each domain administrator may have a certain degree of autonomy and impose local resource usage constraints on remote users (Foster, 2001).

Such site autonomy is reflected in scheduling algorithms and scheduler architectures. The hierarchical architecture is one of the most commonly used scheduling architectures in Grids (Krauter et al., 2002). In such an architecture, a Community Scheduler (or Resource Broker) is responsible for transforming submitted jobs into tasks and assigning them to sites for execution. At each site, a Local Scheduler manages local queues and implements local domain scheduling policies. This split is what preserves a certain degree of site autonomy.
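
The division of responsibilities in the hierarchical architecture can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the class and method names, the round-robin assignment and the FIFO local queue are all assumptions chosen only to show which decisions belong to the Community Scheduler and which remain local to each site.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class LocalScheduler:
    """Per-site scheduler. A FIFO queue stands in for the local policy (assumption)."""
    site: str
    queue: deque = field(default_factory=deque)

    def submit(self, task: str) -> None:
        # The site decides how to queue the task; here it is simply appended.
        self.queue.append(task)

    def next_task(self):
        # Return the next task to run at this site, or None if the queue is empty.
        return self.queue.popleft() if self.queue else None


class CommunityScheduler:
    """Grid-level scheduler (Resource Broker) that turns jobs into tasks."""

    def __init__(self, sites):
        self.sites = list(sites)

    def submit_job(self, job_id: str, n_tasks: int) -> dict:
        # Split the job into n_tasks tasks and assign them round-robin
        # (hypothetical policy) to the sites' Local Schedulers.
        placement = {}
        for i in range(n_tasks):
            task = f"{job_id}.task{i}"
            target = self.sites[i % len(self.sites)]
            target.submit(task)
            placement[task] = target.site
        return placement


if __name__ == "__main__":
    sites = [LocalScheduler("site-A"), LocalScheduler("site-B")]
    broker = CommunityScheduler(sites)
    print(broker.submit_job("q1", 3))
```

The Community Scheduler only chooses a site for each task; the order in which a site actually runs its queued tasks stays under the control of that site's Local Scheduler.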

In addition, tasks in Grids are usually specified together with Service Level Objectives (SLOs) or Quality-of-Service (QoS) requirements. In fact, in many Grid systems, scheduling is QoS-oriented instead of performance-oriented (Roy & Sander, 2004). In such situations, the main objective is to increase user satisfaction rather than to achieve high performance. Hence, the Community Scheduler may use the user-specified SLOs to negotiate Service Level Agreements (SLAs) with Local Schedulers. SLOs can also be used to provide some kind of differentiation among users or jobs. Execution deadlines and execution cost limits are examples of commonly used SLOs.

We consider here the use of deadline-marked queries in grid-based Data Warehouses. In such a context, execution time objectives can provide some differentiation between interactive queries and report queries. For example, one can establish that interactive queries should complete within a 20-second deadline and that report queries should complete within 5 minutes. Different deadlines may also be specified in other ways, such as creating privileged groups of users that should obtain faster responses, or assigning shorter deadlines to queries submitted by users affiliated with institutions that have contributed more resources to the grid-based data warehouse.
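
As an illustration of how per-class deadlines can differentiate queries, the following Python sketch attaches the 20-second and 5-minute deadlines from the example above to submitted queries and orders them by absolute deadline. The earliest-deadline-first ordering is a generic textbook policy used here only as an assumption, not the scheduling strategy proposed in the chapter, and the names `DEADLINE_SECONDS`, `Query`, `make_query` and `edf_order` are hypothetical.

```python
import heapq
from dataclasses import dataclass, field

# Per-class deadlines taken from the example in the text, in seconds.
DEADLINE_SECONDS = {"interactive": 20.0, "report": 300.0}


@dataclass(order=True)
class Query:
    # Only the absolute deadline takes part in comparisons.
    absolute_deadline: float
    query_id: str = field(compare=False)
    query_class: str = field(compare=False)


def make_query(query_id: str, query_class: str, submitted_at: float) -> Query:
    """Mark a query with the absolute deadline implied by its class."""
    deadline = submitted_at + DEADLINE_SECONDS[query_class]
    return Query(deadline, query_id, query_class)


def edf_order(queries):
    """Return query ids ordered earliest deadline first (assumed policy)."""
    heap = list(queries)
    heapq.heapify(heap)
    return [heapq.heappop(heap).query_id for _ in range(len(heap))]


if __name__ == "__main__":
    pending = [
        make_query("r1", "report", submitted_at=0.0),
        make_query("i1", "interactive", submitted_at=10.0),
        make_query("i2", "interactive", submitted_at=0.0),
    ]
    print(edf_order(pending))
```

Under this assumed policy an interactive query submitted after a report query is still served first, because its absolute deadline is earlier. Group-based or institution-based differentiation could be modeled by adding further entries to the deadline table.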

Data placement is a key issue in grid-based applications. Because of the grid's heterogeneity and the high cost of moving data across sites, data replication is commonly used to improve performance and availability (Ranganathan & Foster, 2004). However, most work on replica selection and creation in data grids considers generic file replication (e.g., Lin et al., 2006; Siva Sathya et al., 2006; Haddad & Slimani, 2007). Therefore, the use of specialized data placement strategies for deploying data warehouses in grids remains an open issue.
