Karma2: Provenance Management for Data-Driven Workflows

Yogesh L. Simmhan; Beth Plale; Dennis Gannon

doi:10.4018/978-1-60566-370-8.ch020

Source Title: Quantitative Quality of Service for Grid Computing: Applications for Heterogeneity, Large-Scale Distribution, and Dynamic Environments

Abstract

The increasing ability of the sciences to sense the world around us is resulting in a growing need for data-driven e-Science applications that are under the control of workflows composed of services on the Grid. The focus of our work is on provenance collection for these workflows, which is necessary to validate the workflow and to determine the quality of generated data products. The challenge we address is to record uniform and usable provenance metadata that meets the domain needs while minimizing the modification burden on the service authors and the performance overhead on the workflow engine and the services. The framework is based on generating discrete provenance activities during the lifecycle of a workflow execution that can be aggregated to form complex data and process provenance graphs that can span workflows. The implementation uses a loosely coupled publish-subscribe architecture for propagating these activities, and the capabilities of the system satisfy the needs of detailed provenance collection. A performance evaluation of a prototype finds a minimal performance overhead (in the range of 1% for an eight-service workflow using 271 data products).

Chapter Preview

Introduction

The need to access and share large-scale computational and data resources to support dynamic computational science and agile enterprises is driving the growth of Grids (Foster, Kesselman, Nick & Tuecke, 2002). In the realm of e-Science, science gateways are archetypes for accessing, managing, and sharing virtualized resources to solve large collaboratory challenges (Catlett, 2002; Gannon et al., 2005). Science gateways are built as a service-oriented architecture with Grid resources virtualized as services. These resources—including physical resources such as sensors, computational clusters, and mass storage devices, and software resources such as scientific tasks and models—are available as services that provide an abstraction to access the resources through well-defined interfaces.

A significant share of the applications that make use of science gateways are data-driven applications (Simmhan, Pallickara, Vijayakumar & Plale, 2006a). The proliferation of wireless networking and inexpensive sensor technology is giving the sciences an increasing ability to sense the world around us (West, 2005). This is specifically resulting in a growing need for data-driven applications; that is, applications that can be computation-intensive and are usually either dataflow applications, in which data flows from one process to another, or demand-driven applications, in which computations are triggered in response to events occurring in the world around us. Data-driven scientific experiments are designed as workflows composed of services on the Grid, and data flow from one service to another, being transformed, filtered, fused, and used in complex models. These workflows capture the invocation logic for the scientific investigation and may be composed of hundreds of services connected as complex graphs. Data-driven workflow executions also see the participation of thousands of data products that reach terabytes in size. At this scale of processing, users need the ability to automatically track the execution of their experiments and the multitude of data products created and consumed by the services in the workflow. Provenance collection and management, also called process mining, workflow tracing, or lineage collection, is a new line of research on the execution of workflows, and the derivation and usage trail of data products that are involved in the workflows (Bose & Frew, 2005; Moreau & Ludascher, 2007; Simmhan, Plale & Gannon, 2005).

Provenance collected about the tasks of a workflow describes the workflow’s service invocations during its execution (Simmhan et al., 2005). This helps track service and resource usage patterns, and forms metadata for service and workflow discovery. In data-driven applications, however, it is provenance about the data that is central to understanding and recreating earlier runs. In data-driven workflows, data products are first-class parameters to services that consume and transform the input data to generate derived data products. These derived data products are ingested by other services in the same or a different workflow, forming a data derivation and data usage trail. Data provenance provides this derivation history of data that includes information about services and input data that contributed to the creation of a data product. This kind of information is extremely valuable, not only for diagnosing problems and understanding performance of a particular workflow run, but also to determine the origin and quality of a particular piece of derived information (Goble, 2002; Simmhan, Plale & Gannon, 2006b).
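
The derivation trail described above can be illustrated with a minimal Python sketch. It is not the Karma2 implementation or API: the `Activity` record, the `"consumed"`/`"produced"` kinds, and the service and data product names are all hypothetical, chosen only to show how discrete records of data usage and creation could be aggregated into a derivation history for one data product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """One discrete provenance record (hypothetical shape, not Karma2's schema)."""
    kind: str      # "consumed" or "produced"
    service: str   # service invocation that touched the data product
    data_id: str   # identifier of the data product


def build_derivations(activities):
    """Map each produced data product to (service, inputs that service consumed)."""
    inputs = defaultdict(set)
    for a in activities:
        if a.kind == "consumed":
            inputs[a.service].add(a.data_id)
    derived_from = {}
    for a in activities:
        if a.kind == "produced":
            derived_from[a.data_id] = (a.service, frozenset(inputs[a.service]))
    return derived_from


def lineage(data_id, derived_from):
    """Return all ancestor data products of data_id by walking the derivation trail."""
    ancestors = set()
    stack = [data_id]
    while stack:
        current = stack.pop()
        if current not in derived_from:
            continue  # a source data product: nothing produced it in this trace
        _, parents = derived_from[current]
        for parent in parents:
            if parent not in ancestors:
                ancestors.add(parent)
                stack.append(parent)
    return ancestors


# Hypothetical two-service trace: "filter" feeds "model".
activities = [
    Activity("consumed", "filter", "raw"),
    Activity("produced", "filter", "filtered"),
    Activity("consumed", "model", "filtered"),
    Activity("consumed", "model", "params"),
    Activity("produced", "model", "forecast"),
]
derived_from = build_derivations(activities)
print(sorted(lineage("forecast", derived_from)))
```

Because the records are keyed by data product rather than by workflow, the same walk would follow a derived product into a second workflow that consumed it, which is the sense in which a derivation trail can cross workflow boundaries.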

Current methods collect provenance either from workflow engine logs (IBM, 2005) or by instrumenting the services (Bose & Frew, 2004; Zhao, Wroe, Goble, Stevens, Quan & Greenwood, 2004). In the former case, the logs from the workflow engine are at the message level and are insufficient for deciphering provenance about the data products, while instrumenting services places a burden on service authors, who must modify their services to generate provenance metadata. These methods also tend to be specific to the workflow framework and are not interoperable with the heterogeneous workflow models that are likely to be present in a Grid environment. Work is also emerging on more general information models for provenance collection (Moreau & Ludascher, 2007).
