As the data requirements of scientific distributed applications increase, access to remote data becomes the main performance bottleneck for these applications. Traditional distributed computing systems closely couple data placement and computation, and treat data placement as a side effect of computation. Data placement is either embedded in the computation, where it delays the computation, or performed via simple scripts that lack the privileges of a job. The insufficiency of traditional systems and existing CPU-oriented schedulers in dealing with this complex data handling problem has given rise to a new era: data-aware schedulers. This chapter discusses the challenges in this area as well as future trends, with a focus on the Stork case study.
Modern scientific applications and experiments are becoming increasingly data intensive. Large experiments, such as high-energy physics simulations, genome mapping, and climate modeling, generate data volumes reaching hundreds of terabytes (Hey, 2003). Similarly, remote sensors and satellites also produce extremely large amounts of data for scientists (Tummala & Kosar, 2007; Ceyhan & Kosar, 2007). In order to process these data, scientists are turning towards distributed resources owned by the collaborating parties to provide them the computing power and storage capacity needed to push their research forward. But the use of distributed resources imposes new challenges (Kosar, 2006). Even simply sharing and disseminating subsets of the data to the scientists’ home institutions is difficult. The systems managing these resources must provide robust scheduling and allocation of storage resources, as well as efficient management of data movement.
One key benefit of distributed resources is that they allow institutions and organizations to gain access to resources needed for large-scale applications that they would not otherwise have. But in order to facilitate the sharing of compute, storage, and network resources between collaborating parties, middleware is needed for planning, scheduling, and management of the tasks as well as the resources. The majority of existing research has been on the management of compute tasks and resources, as they are widely considered to be the most expensive. As scientific applications become more data intensive, however, the management of storage resources and of data movement between the storage and compute resources is becoming the main bottleneck. Many jobs executing in distributed environments fail or are inhibited by overloaded storage servers. These failures prevent scientists from making progress in their research.
According to the ‘Strategic Plan for the US Climate Change Science Program (CCSP)’, one of the main objectives of future research programs should be “Enhancing the data management infrastructure”, since “The users should be able to focus their attention on the information content of the data, rather than how to discover, access, and use it.” (CCSP, 2003). This statement by CCSP summarizes the goal of many cyberinfrastructure efforts initiated by DOE, NSF, and other federal agencies, as well as the research direction of several leading academic institutions.
NSF’s ‘Cyberinfrastructure Vision for 21st Century’ states that “The national data framework must provide for reliable preservation, access, analysis, interoperability, and data movement” (NSF, 2006). The same report also says: “NSF will ensure that its efforts take advantage of innovation in large data management and distribution activities sponsored by other agencies and international efforts as well.” According to the NSF report on ‘Research Challenges in Distributed Computing Systems’, “Data storage is a fundamental challenge for large-scale distributed systems, and advances in storage research promise to enable a range of new high-impact applications and capabilities” (NSF, 2005).
It would not be too bold to claim that research and development in computation-oriented distributed computing has reached maturity, and there is now an obvious shift of focus towards data-oriented distributed computing. This is mainly because existing solutions work very well for computationally intense applications, but inadequately address applications which access, create, and move large amounts of data over wide-area networks.
Key Terms in this Chapter
Condor: A batch scheduling system for computational tasks. It provides a job queuing mechanism and resource monitoring capabilities, and allows users to specify scheduling policies and enforce priorities.
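As a brief illustration, a minimal Condor submit description file might look like the sketch below; the executable and file names are hypothetical, chosen only to show the general shape of such a file.

```
# Minimal Condor submit description (job name and files are hypothetical)
universe   = vanilla
executable = analyze
arguments  = input.dat
output     = analyze.out
error      = analyze.err
log        = analyze.log
queue
```

Submitting this file places one job in Condor’s queue; the scheduler then matches it to an available resource according to the configured policies.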
Stork: A specialized scheduler for data placement activities in heterogeneous environments. Stork can queue, schedule, monitor, and manage data placement jobs and ensure that they complete.
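A Stork data placement job is described in a ClassAd-style submit file. The sketch below assumes a simple transfer job; the hostnames and paths are hypothetical.

```
[
  dap_type = "transfer";
  src_url  = "gsiftp://source.example.edu/data/input.dat";
  dest_url = "file:///scratch/input.dat";
]
```

Because the transfer is expressed as a first-class job rather than an inline script, Stork can queue it, retry it on failure, and report its completion like any other scheduled task.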
Condor-G: An extension of Condor that allows users to submit their jobs to inter-domain resources by using Globus Toolkit functionality. In this way, user jobs can be scheduled and run not only on Condor resources but also on PBS, LSF, LoadLeveler, and other grid resources.
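As a hedged sketch, a Condor-G submission targets a remote Globus gatekeeper through the grid universe; the gatekeeper address and jobmanager below are hypothetical, and the exact keyword (e.g. `grid_resource` versus the older `globusscheduler`) depends on the Condor version.

```
# Condor-G submit description routing a job to a remote PBS cluster
# via a Globus GRAM gatekeeper (hostname is hypothetical)
universe      = grid
grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
executable    = analyze
output        = analyze.out
log           = analyze.log
queue
```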
Distributed Computing: A type of parallel computing in which different parts of the same application run concurrently on multiple geographically distributed computers.
Batch Scheduling: Scheduling and execution of a series of jobs in the background “batch” mode, without any human interaction.
Data Placement: It encompasses all data-movement-related activities, such as transfer, staging, replication, space allocation and de-allocation, registering and unregistering metadata, and locating and retrieving data.
DAGMan: It manages dependencies between tasks in a Directed Acyclic Graph (DAG), where tasks are represented as nodes and the dependencies between tasks are represented as directed arcs between the respective nodes.
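The sketch below shows how a DAG file can interleave data placement and computation; the file names are hypothetical, and the `DATA` keyword assumes the Stork-extended DAGMan described in the Stork literature (a plain DAGMan DAG would use `JOB` entries only).

```
# DAG combining a Stork data placement job with a Condor compute job
# (file names are hypothetical; DATA is the Stork extension to DAGMan)
DATA   StageIn  stage_in.stork
JOB    Analyze  analyze.submit
PARENT StageIn  CHILD Analyze
```

DAGMan releases the `Analyze` job only after the `StageIn` transfer completes, so the computation never starts before its input data is in place.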