This chapter presents two algorithms for efficient data transfer in the Grid environment. From a node’s perspective, multiple data transfer channels can be formed by selecting other nodes as relays. One algorithm requires the sender to know the global connection information, while the other does not. Experimental results indicate that both algorithms transfer data efficiently under various circumstances.
Distributed data-intensive computing, in fields such as gravitational-wave physics (Barish, 1999), high-energy physics (Wulz, 1998) and astronomy (Szalay, 2008), has become an important application of Grid technology (Foster, 2001; Chervenak, 2001; Rajasekar, 2003). The future of these paradigms is to share a variety of resources within a collaboration in pursuit of common goals (Deelman, 2004). All these paradigms are data driven, which means that the computing resources can work only when the data is ready. Hence, data is regarded as the most important resource, and it is the bridge that connects the activities and people in those paradigms. For good performance, it is important to fetch data as fast as possible. In other words, the performance of data transfer is the key factor affecting the efficiency of data-intensive distributed computing.
Data transfer between a server and a client over the Internet is limited by several bottlenecks (Gkantsidis, 2003). First, the bandwidth achievable by the client is limited by the server’s bandwidth to the Internet, which is referred to as the First-Mile problem. Secondly, the achievable bandwidth is limited by the data transfer speed of the path connecting the server and the client. Thirdly, the bottleneck may lie in the client’s connection to the Internet, namely the Last-Mile problem. The data transfer speed can therefore only be as high as the slowest of these three segments, and the optimization of data transfer revolves around them. The typical solutions to the First-Mile and Last-Mile problems are to improve the bandwidth of the client and the server using advanced network techniques, such as Gigabit Ethernet and Fibre Channel. As a result, the bandwidth of the direct connection from a node to the Internet is usually high. The main reason for low data transfer speed therefore lies in the path between the server and the client: the bandwidth of the path selected by the routing algorithm, in other words the direct bandwidth from the source to the destination, is usually much lower than the bandwidth available on the connections from the source and the destination to the Internet. Under this circumstance, both the source and the destination are capable of sending and receiving more data, provided that the larger bandwidth of their own connections to the Internet can be better utilized.
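The bottleneck argument above can be made concrete with a small numerical sketch. The function names, the bandwidth figures, and the assumption that relay paths do not share a common bottleneck link are all illustrative and are not taken from the algorithms presented in this chapter. The sketch only shows why adding relay paths can raise the achievable rate when the direct path, rather than the First-Mile or Last-Mile link, is the slowest segment.

```python
def direct_rate(first_mile, path, last_mile):
    """Rate of a single direct transfer: the slowest of the three segments."""
    return min(first_mile, path, last_mile)


def relayed_rate(first_mile, last_mile, direct_path, relay_paths):
    """Upper bound on the aggregate rate when relays add parallel paths.

    relay_paths is a list of (source_to_relay, relay_to_destination)
    bandwidths. Assumption: the relay paths are independent of each other
    and of the direct path, so their rates simply add up.
    """
    # Each relay path is limited by its slower hop.
    through_relays = sum(min(a, b) for a, b in relay_paths)
    # The total can never exceed the access links of source or destination.
    return min(first_mile, direct_path + through_relays, last_mile)


# Hypothetical figures in Mbit/s: fast access links, slow direct path.
print(direct_rate(100, 10, 100))                         # 10
print(relayed_rate(100, 100, 10, [(30, 20), (15, 40)]))  # 45
```

With these hypothetical figures the direct transfer is held to 10 Mbit/s by the path, whereas two relays raise the bound to 45 Mbit/s, still below the 100 Mbit/s access links. Once the sum of the path rates exceeds an access link, the First-Mile or Last-Mile limit applies again.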