Introduction
A Data Grid (Chervenak, 2003; Foster et al., 2002) is an infrastructure that manages huge amounts of data and enables grid applications to share data files in a coordinated manner. Such an approach is intended to provide fast, reliable and transparent data access. Nevertheless, it poses a challenging problem in a grid environment because the volume of data to be shared is large while storage space and network bandwidth are limited. Furthermore, the resources involved are heterogeneous, as they belong to different administrative domains in a distributed environment.
It is infeasible for all users to access a single instance of data (e.g., a data file) held by one single organization (e.g., a site), as this would increase data access latency. Furthermore, a single organization may not be able to handle such a huge volume of data by itself. Motivated by these considerations, data grids, like other distributed systems, commonly employ a strategy known as replication. Replication enables efficient access without large bandwidth consumption or high access latency (Chervenak et al., 2001; Chervenak et al., 2002; Guy et al., 2002; Lamehamedi et al., 2003; Otoo et al., 2002; Ranganathan & Foster, 2001b). The replication technique is one of the major factors affecting the performance of data grids (You et al., 2006). Creating replicas allows client requests to be rerouted to particular replica sites, offering a higher access speed (Tang et al., 2005).
Replication is also bounded by two factors: the size of the storage available at different sites within the Data Grid and the bandwidth between these sites (Venugopal et al., 2006). Furthermore, the files in a Data Grid are mostly large (Rahman et al., 2009), so replication to every site is infeasible. Therefore, deciding on the optimal locations to host popular files is necessary in order to reduce the bandwidth consumption of the network. In this work, we propose a Replica Placement Strategy (RPS) to find the best sites to host newly created replicas. The proposed model addresses the problems of current replication models, which can be summarized in two points:
1. A large amount of network bandwidth is consumed as a result of poor network utilization by the existing systems (Chang, 2006; Rasool et al., 2008; Ruay-Shiung et al., 2008; Shorfuzzaman et al., 2008; Tang et al., 2005; Tang et al., 2006; Wang et al., 2007; Yang et al., 2007).
2. This poor utilization of network bandwidth in turn increases the job execution time (Mansouri et al., 2008; Pangfeng & Jan-Jan, 2006; Ranganathan & Foster, 2001a; Ranganathan et al., 2002; Ruay-Shiung et al., 2008; Yi-Fang et al., 2006).
The proposed work is expected to minimize network bandwidth consumption and reduce job execution time.
There are many studies in the literature that address replica placement issues. Wang et al. (2007) proposed a replica placement scheme that tries to overcome the bottleneck caused by an increasing number of simultaneous downlinks. Their strategy chooses the best site to host the replica based on an evaluation of the number of user requests and the transmission cost. The goal is to replicate the file to the site that provides the minimum average transmission cost, where transmission cost is defined to be inversely proportional to bandwidth.
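The selection rule described above can be illustrated with a minimal Python sketch. It is not the implementation of Wang et al. (2007). It assumes that the cost of serving one request over a link is exactly 1/bandwidth, that local accesses are free, and that the average is weighted by the number of requests per site. The function names, site names, request counts and bandwidth values are all hypothetical.

```python
def average_transmission_cost(candidate, requests, bandwidth):
    """Request-weighted average cost of serving every requesting site
    from `candidate`. Assumption: cost per request = 1 / bandwidth."""
    total_requests = sum(requests.values())
    if total_requests == 0:
        return 0.0
    cost = 0.0
    for site, count in requests.items():
        if site == candidate:
            continue  # assumption: a local access costs nothing
        link = frozenset((candidate, site))  # links are undirected
        cost += count * (1.0 / bandwidth[link])
    return cost / total_requests


def select_replica_site(sites, requests, bandwidth):
    """Return the candidate site with the minimum average transmission cost."""
    return min(
        sites,
        key=lambda s: average_transmission_cost(s, requests, bandwidth),
    )


# Hypothetical example: three sites, requests per site, link bandwidths.
sites = ["A", "B", "C"]
requests = {"A": 10, "B": 40, "C": 5}
bandwidth = {
    frozenset(("A", "B")): 100.0,
    frozenset(("A", "C")): 10.0,
    frozenset(("B", "C")): 50.0,
}

best_site = select_replica_site(sites, requests, bandwidth)
print(best_site)
```

With these illustrative numbers, site B is selected: it issues most of the requests and has the highest-bandwidth links to the other two sites, so placing the replica there gives the lowest average cost.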