Article Preview
Top1. Introduction
Distributed databases are essential in the current business environment, in which data spread across multiple geographical locations is connected by a computer network (Ceri & Pelagatti, 1984). An effective distributed database management system (DDBMS) is required to manage and retrieve data from such disparate data sources. DDBMS provides a simple, unified interface to the user, which appears like a single database instead of disparate databases (Ozsu & Valduriez, 2011). Query processing plays a crucial role in the performance of DDBMS (Rho & March, 1997). The query processing problem is far more challenging in a distributed environment, since a distributed query may involve relations that are fragmented and/or replicated across multiple distinct sites leading to the incurring of site-to-site data transmission costs. In order to address this problem, an optimal Distributed Query Processing (DQP) strategy has to be devised. In DQP, the user queries are analysed and transformed into a set of data manipulation operations (Ozsu & Valduriez, 2011). In DDBMS, first the user query is passed to a query parser, where the user query is parsed and the corresponding relational algebraic expression is generated (Liu & Yu, 1993). Using the query optimizer, an optimal relational algebraic expression, comprising an effective query processing plan, is generated. The user query is broken into sub-queries and the execution plan is devised for all such sub-queries, each of which is processed in parallel at their respective sites. The results of these sub-queries are thereafter integrated to produce the final result for the user query (Ozsu & Valduriez, 2011). Thus, the goal of the DQP strategy is to generate query processing plans in a manner that reduces the data transmission cost, which would substantially reduce the total cost of processing a distributed query. This query processing cost comprises of local processing costs and site-to-site data transmission cost (Ozsu & Valduriez, 2011).
In DQP, the data transmission cost is more significant than the local processing cost (Ozsu & Valduriez, 2011). In order to make distributed query processing more efficient, communication of data between sites needs to be minimized in order to reduce the site-to-site data transmission cost. Further, an optimal ordering of operations like selection, projection and join could minimize the inter site data transfers resulting in reduced query response times. In DDBMS, multiple copies of a relation are stored at different sites to enhance data availability. Often user queries require access to multiple relations stored at multiple sites for their processing. So, it would be economical to access only those copies of relations for answering the user query that would minimize the inter-site data transmission.