Adaptive Query Processing in Data Grids

Adaptive Query Processing in Data Grids

Chunjiang Zhao (National Engineering and Research Center for Information Technology for Agriculture, China), Junwei Cao (Tsinghua University, China), Huarui Wu (Tsinghua National Laboratory for Information Science and Technology, China) and Weiwei Chen (Tsinghua University, China)
DOI: 10.4018/978-1-61520-686-5.ch016


The data grid integrates wide-area autonomous data sources and provides users with a unified data query and processing infrastructure. Adaptive data query and processing is required by data grids to provide better quality of services (QoS) to users and applications in spite of dynamically changing resources and environments. Existing AQP techniques can only meet partially data grid requirements. Some existing work is either addressing domain-specific or single-node query processing problems. Data grids provide new mechanisms for monitoring and discovering data and resources in a cross-domain wide area. Data query in grids can benefit from this information and provide better adaptability to the dynamic nature of the grid environment. In this work, an adaptive controller is proposed that dynamically adjusts resource shares to multiple data query requests in order to meet a specified level of service differentiation. The controller parameters are automatically tuned at runtime based on a predefined cost function and an online learning method. Simulation results show that our controller can meet given QoS differentiation targets and adapt to dynamic system resources among multiple data query processing requests while total demand from users and applications exceeds system capability.
Chapter Preview

1 Introduction

Query processing (QP) is an essential technology for traditional database management systems (Kim et al., 1985). QP aims to transform a query in a high-level declarative language (e.g. SQL) into a correct and efficient execution strategy. Query optimization (Jarke & Koch, 1984) is one of key techniques to achieve high performance data query using cost estimation in various types of database systems, e.g. multimedia, object-oriented, deductive, parallel, distributed databases, heterogeneous multidatabase systems, fuzzy relational databases, and so on (Yu & Meng, 1997).

Traditional query processing in database management systems is usually carried out in two phases: optimization and execution. While the details of optimization have been improved over the years, the basic approach of optimization followed by execution has not been changed. In this way, optimization could only be carried out in a coarse-grained way, since during the execution environmental changes could not be identified and feedback to implement an improved optimization. If data query processing has to be carried out in a long time, QP performance may not satisfy user requirements. This is why adaptability of QP is required.

Adaptive Query Processing (AQP) (Hellerstien et al., 2000) is becoming more popular in recent years where optimization is required to be carried out during execution. The main reason is the emergence of new domains, e.g. peer-to-peer (P2P) computing and grid computing, where it is nearly impossible to use traditional query processing, because of lack of reliable performance statistics or the dynamic nature of data and environments. Two styles of adaptation in AQP are summarized in Deshpande, Hellerstein, and Raman (2006): plan-change based adaptation provides a well-defined query execution plan but allow the plan to be changed during query processing; tuple-routing based adaptation views query processing as routing of tuples through operators and effects plan changes by changing the order in which tuples are routed.

P2P computing provides a dynamic and data sharing environment, where adaptability of data access and query is implemented by optimal selection in peers of data providers. All data requester at the same time become a data provider after its request is fulfilled. Due to the absence of a central control in a P2P environment, further fine-grained adaptability cannot be implemented. In this work, we only address AQP issues in data grids where AQP is required for distributed data access and fine-grid resource management and scheduling.

Grid computing aims for integration and sharing geographically distributed resources in multiple management domains (Foster & Kesselman, 1998). While the grid is originally motivated by computational power sharing, data management turns out to be an essential service since large volumes of data processing are involved in most grid applications. Data grids (Chervenak et al., 2001) provide a transparent and seamless infrastructure for cross-domain distributed data access, leading to the following challenges for data query processing:

  • Performance of grid resources may change dramatically over time, since most these resources are shared and not dedicated to the grid.

  • QoS requirements of data query processing from grid applications may also change over time, since most grid applications last for a long time with large amount of data processing involved.

Existing AQP techniques can only meet partially data grid requirements. Some existing work is either addressing domain-specific or single-node query processing problems (Gounaris et al., 2001). Data grids provide new mechanisms for monitoring and discovering data and resources in a cross-domain wide area. Data query in grids can benefit from this information and provide better adaptability to the dynamic nature of the grid environment.

Key Terms in this Chapter

Query processing: Is to transform a query in a high level declarative language (such as SQL) into a correct and efficient execution strategy. It includes query decomposition (analysis, conjunctive and disjunctive normalization and semantic analysis), query optimization and query evaluation (execution).

Query Optimization: Is a function of many relational database management systems to find a good (not necessarily the best) query plan from multiple query plans satisfying a query. Its goal is to eliminate as many unneeded tuples, or rows as possible. Query optimization may be based on heuristic or cost.

Data Grid: Is a grid computing system that deals with data — the controlled sharing and management of large amounts of distributed data. These are often, but not always, combined with computational grid computing systems.

System Identification: Is a dynamical mathematical model in this context is a mathematical description of the dynamic behavior of a system or process in either the time or frequency domain. Usually, it is started from measurements of the behavior of the system and the external influences (inputs to the system) and try to determine a mathematical relation between them without going into the details of what is actually happening inside the system. Commonly used models include grey box model and black box model.

Adaptive Query Processing: Is an improvement of traditional query processing, in order to addresses the problems of missing statistics, unexpected correlations, unpredictable costs, and dynamic data by using feedback to tune execution. It is one of the cornerstones of so-called autonomic database management systems, although it also generalizes to many other contexts, particularly at the intersection of database query processing and the Web.

Grid Computing: Is a form of networking, which, unlike conventional networks that focus on communication among devices, harnesses unused processing cycles of all computers in a network for solving problems too intensive for any stand-alone machine. It is applying the resources of many computers in a network to a single problem at the same time - usually to a scientific or technical problem that requires a great number of computer processing cycles or access to large amounts of data.

Adaptive Control: Is a form of control system whose parameters may be changed dynamically in order to adapt to a changing environment. Generally, it is a system of automatic monitoring and adjustment, usually by computer, of an industrial process. Adaptive control allows operating parameters to be changed continuously in response to a changing environment in order to achieve optimum performance.

Complete Chapter List

Search this Book: