1. Introduction
A distributed database is defined as a collection of data belonging to logically interrelated databases spread over different sites of a computer network (Hakimzadeh, 2005; Özsu & Valduriez, 1991; Özsu & Valduriez, 2004). A distributed database management system is a software system that manages such distributed databases while making the distribution transparent to users (Özsu & Valduriez, 2004). It provides high-level support for developing complex applications. In a centralized database system, the only resource that needs to be shielded from the user is the data; in a distributed database management system, the communication network must also be managed, so that the user is shielded from its operational details. This kind of transparency is referred to as network transparency or distribution transparency, and it enables users to pose queries without knowing where the data is located. Another important issue in distributed databases is the replication of data across the database nodes in the network (Özsu & Valduriez, 1997). Data is replicated in order to improve the performance, reliability and availability of the system. Data residing at one site is also stored at the sites where it is accessed more frequently, which increases the locality of reference. Replication should be transparent: to a user there appears to be a single copy of the data, although in reality multiple copies are held at database nodes spread across the network. Data fragmentation is also desirable: the data is divided into fragments, and each fragment is stored at a different database node in the network (Ceri & Pelagatti, 1985; Özsu & Valduriez, 2004; Özsu & Valduriez, 1997). Fragmentation increases the performance, availability and reliability of the system.
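The interplay of fragmentation, replication and distribution transparency can be illustrated with a small sketch. It is only an illustration under stated assumptions, not how any particular system is built: the `Catalog` class, the `query` function, and the fragment and site names are all hypothetical. The caller poses a selection over the whole relation and never names a site; the catalog decides which replica of each fragment to read.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical global catalog: fragment name -> {site name -> rows held there}."""
    replicas: dict = field(default_factory=dict)

    def store(self, fragment, site, rows):
        # Storing the same fragment at a second site creates a replica.
        self.replicas.setdefault(fragment, {})[site] = list(rows)


def query(catalog, predicate, available_sites):
    """Select rows from the whole relation; the caller never names a site."""
    result = []
    for fragment, copies in catalog.replicas.items():
        # Read from any reachable replica of this fragment.
        site = next((s for s in sorted(copies) if s in available_sites), None)
        if site is None:
            raise RuntimeError(f"no reachable replica of {fragment}")
        result.extend(row for row in copies[site] if predicate(row))
    return result


# Assumed example data: one relation, horizontally fragmented by region.
catalog = Catalog()
east = [{"id": 1, "region": "east"}, {"id": 2, "region": "east"}]
west = [{"id": 3, "region": "west"}]
catalog.store("customers_east", "site_A", east)
catalog.store("customers_east", "site_B", east)  # replica of the east fragment
catalog.store("customers_west", "site_C", west)

# site_A is unreachable, yet the query is still answered from the replica at site_B.
rows = query(catalog, lambda r: r["id"] >= 2, {"site_B", "site_C"})
```

In this sketch the user sees a single logical relation, and the replica at `site_B` keeps the east fragment available when `site_A` cannot be reached.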
The aim of distributed query processing is to answer user queries effectively and efficiently. In distributed databases, queries are usually non-procedural: the user specifies what is required without specifying how the answer should be retrieved. The retrieval procedure is devised by the query processor of the distributed database management system (Özsu & Valduriez, 2004), which relieves the user of the tedious task of working out how to process the query. Query processing is a critical performance issue and has received a considerable amount of attention in the context of both centralized and distributed database systems (Ceri & Pelagatti, 1985; Kossmann, 2000; Özsu & Valduriez, 2004). It becomes more complex and more performance-critical in distributed database systems, because issues such as data fragmentation, data replication and remotely located data all affect query processing. A distributed query may involve relations that have been fragmented and/or replicated, which adds communication overhead to the processing cost. If relations at many sites are required to answer a user query, query processing may be time-consuming; that is, the query response time would be high because of the communication between the sites involved.
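To make the effect of site involvement concrete, the following sketch enumerates the ways of choosing one replica site per relation and keeps the choice that touches the fewest distinct sites. Everything here is an assumption made for illustration: the relation and site names, the `placement` mapping, and in particular the cost model, which simply counts distinct sites as a crude proxy for communication overhead. Exhaustive enumeration grows exponentially with the number of relations, so this is a toy, not a practical optimizer.

```python
from itertools import product


def plan_cost(plan):
    """Hypothetical cost: the number of distinct sites the plan touches."""
    return len(set(plan.values()))


def cheapest_plan(placement):
    """Pick one site per relation so that as few sites as possible are involved."""
    relations = sorted(placement)
    best = None
    # Try every combination of one candidate site per relation.
    for choice in product(*(sorted(placement[r]) for r in relations)):
        plan = dict(zip(relations, choice))
        if best is None or plan_cost(plan) < plan_cost(best):
            best = plan
    return best


# Assumed placement: each relation is replicated at two sites.
placement = {
    "R1": {"S1", "S2"},
    "R2": {"S2", "S3"},
    "R3": {"S2", "S4"},
}
best = cheapest_plan(placement)
```

Because every relation in this assumed placement has a copy at `S2`, the selected plan reads all three relations there and needs no communication between sites, whereas a plan such as `R1` at `S1`, `R2` at `S3` and `R3` at `S4` would involve three sites.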