Data-Aware Distributed Batch Scheduling

Tevfik Kosar

doi:10.4018/978-1-60566-184-1.ch005


Source Title: Handbook of Research on Grid Technologies and Utility Computing: Concepts for Managing Large-Scale Applications

Abstract

As the data requirements of distributed scientific applications increase, access to remote data becomes the main performance bottleneck for these applications. Traditional distributed computing systems closely couple data placement and computation, and treat data placement as a side effect of computation. Data placement is either embedded in the computation, which delays the computation, or performed by simple scripts that do not have the privileges of a job. The inability of traditional systems and existing CPU-oriented schedulers to deal with the complex data handling problem has given rise to a new class of schedulers: data-aware schedulers. This chapter discusses the challenges in this area as well as future trends, with a focus on the Stork case study.

Chapter Preview

Introduction

Modern scientific applications and experiments are becoming increasingly data intensive. Large experiments, such as high-energy physics simulations, genome mapping, and climate modeling, generate data volumes reaching hundreds of terabytes (Hey, 2003). Similarly, remote sensors and satellites produce extremely large amounts of data for scientists (Tummala & Kosar, 2007; Ceyhan & Kosar, 2007). To process these data, scientists are turning to distributed resources owned by the collaborating parties for the computing power and storage capacity needed to push their research forward. But the use of distributed resources imposes new challenges (Kosar, 2006). Even simply sharing and disseminating subsets of the data to the scientists’ home institutions is difficult. The systems managing these resources must provide robust scheduling and allocation of storage resources, as well as efficient management of data movement.

One key benefit of distributed resources is that they allow institutions and organizations to gain access to resources needed for large-scale applications that they would not otherwise have. But to facilitate the sharing of compute, storage, and network resources between collaborating parties, middleware is needed for planning, scheduling, and management of the tasks as well as the resources. The majority of existing research has focused on the management of compute tasks and resources, as they are widely considered to be the most expensive. As scientific applications become more data intensive, however, the management of storage resources and of data movement between storage and compute resources is becoming the main bottleneck. Many jobs executing in distributed environments fail or are inhibited by overloaded storage servers. These failures prevent scientists from making progress in their research.

According to the ‘Strategic Plan for the US Climate Change Science Program (CCSP)’, one of the main objectives of future research programs should be “Enhancing the data management infrastructure”, since “The users should be able to focus their attention on the information content of the data, rather than how to discover, access, and use it.” (CCSP, 2003). This statement by CCSP summarizes the goal of many cyberinfrastructure efforts initiated by DOE, NSF, and other federal agencies, as well as the research direction of several leading academic institutions.

NSF’s ‘Cyberinfrastructure Vision for 21st Century’ states that “The national data framework must provide for reliable preservation, access, analysis, interoperability, and data movement” (NSF, 2006). The same report also says: “NSF will ensure that its efforts take advantage of innovation in large data management and distribution activities sponsored by other agencies and international efforts as well.” According to the NSF report on ‘Research Challenges in Distributed Computing Systems’, “Data storage is a fundamental challenge for large-scale distributed systems, and advances in storage research promise to enable a range of new high-impact applications and capabilities” (NSF, 2005).

It would not be too bold to claim that research and development in computation-oriented distributed computing has reached maturity, and that there is now an obvious shift of focus towards data-oriented distributed computing. This is mainly because existing solutions work very well for computationally intensive applications, but inadequately address applications that access, create, and move large amounts of data over wide-area networks.

Key Terms in this Chapter

Condor: It is a batch scheduling system for computational tasks. It provides a job queuing mechanism and resource monitoring capabilities. It allows the users to specify scheduling policies and enforce priorities.

Stork: It is a specialized scheduler for data placement activities in heterogeneous environments. Stork can queue, schedule, monitor, and manage data placement jobs and ensure that the jobs complete.

Condor-G: It is an extension of Condor, which allows users to submit their jobs to inter-domain resources by using the Globus Toolkit functionality. In this way, user jobs can get scheduled and run not only on Condor resources but also on PBS, LSF, LoadLeveler, and other grid resources.

Distributed Computing: It is a type of parallel computing where different parts of the same application can run on multiple geographically distributed computers.

Batch Scheduling: Scheduling and execution of a series of jobs in background (“batch”) mode, without any human interaction.

Data Placement: It encompasses all activities related to data movement, such as transfer, staging, replication, space allocation and de-allocation, registering and unregistering metadata, and locating and retrieving data.

DAGMan: It manages dependencies between tasks in a Directed Acyclic Graph (DAG), where tasks are represented as nodes and the dependencies between tasks are represented as directed arcs between the respective nodes.
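
The DAG idea in the last definition can be illustrated with a minimal Python sketch. This is not DAGMan's input format or API; the task names (`stage_in`, `compute`, `stage_out`, `cleanup`) and the helper functions are hypothetical, chosen only to show how directed arcs between nodes determine a valid execution order. It relies on the standard-library `graphlib` module (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each task (node) maps to the set of tasks it
# depends on (incoming arcs). The names are illustrative only.
dependencies = {
    "stage_in": set(),         # move input data to the execution site
    "compute": {"stage_in"},   # run once the input data is in place
    "stage_out": {"compute"},  # move results back after the computation
    "cleanup": {"stage_out"},  # release storage after results are safe
}


def execution_order(deps):
    """Return an order in which every task follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def is_acyclic(deps):
    """Return True if the dependency graph contains no cycle."""
    try:
        TopologicalSorter(deps).prepare()
    except CycleError:
        return False
    return True


if __name__ == "__main__":
    print(execution_order(dependencies))
    # A graph where two tasks wait on each other is not a valid DAG.
    print(is_acyclic({"a": {"b"}, "b": {"a"}}))
```

Because this example is a simple chain, only one order is valid: data staging, then computation, then staging out, then cleanup. With branching dependencies several orders may be valid, and a workflow manager is free to run independent nodes concurrently.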
