Karma2: Provenance Management for Data-Driven Workflows

Yogesh L. Simmhan; Beth Plale; Dennis Gannon

doi:10.4018/978-1-60566-370-8.ch020

Hershey, Pennsylvania

New York, New YorkBeijing, China

Special Offers
- Up to 50% off Thousands of Research Books
  From July 1st through October 31st, 2025, we are offering discounts of up to 50% across thousands of titles in Business & Management; Science, Technology, & Medicine; and Education & Social Sciences. Through this campaign, we’re committed to ensuring that our mutual library customers worldwide can continue to access high-quality, peer-reviewed content during these challenging times. If this campaign is successful, we will extend through the end of the year and beyond if there’s a benefit to all parties involved. When hosted on the InfoSci^® Platform, e-books feature no DRM, no additional cost for unlimited-user licensing, full-text PDF & HTML formats, and more. Discount is automatically added at checkout.
  Browse Titles
- IGI Global Scientific Publishing Launches International Brand Ambassador Program
  IGI Global Scientific Publishing has launched a new Ambassador Program, designed to empower research professionals to help spread scholarly resources and foster global research engagement. As a local, mid-sized publisher, this initiative offers IGI Global Scientific Publishing an exciting opportunity to expand its global presence in the academic community and foster meaningful connections among scholars around the world. With currently over 130 ambassadors worldwide, these scholarly experts are dedicated to supporting the publisher’s initiative of disseminating cutting-edge research.
  Learn More
- Emerging Topic e-Book Collections
  Acquire highly focused and affordable Cutting-Edge Peer-Reviewed Research Content through a selection of 20 topic-focused e-Book Collections discounted up to 90%, compared to list prices. Collection topics include Artificial Intelligence, Data Science, Language Learning, Marketing and Customer Relations, Sustainability, and many more. Hosted on the InfoSci^® platform, these collections feature no DRM, no hosting or maintenance fees, no additional cost for unlimited-user licensing, full-text PDF & HTML format, and more.
  Learn More
Books
- - Books by Subject
  - Business, Administration, & Management
  - Scientific, Technical, & Medical (STM)
  - Education & Social Sciences
  - Books by Field
Journals
- - Journals
  - OnDemand Journal Articles
  - Journals by Subject
  - Business, Administration, & Management
  - Scientific, Technical, & Medical (STM)
  - Education & Social Sciences
  - Journals by Field
e-Collections
OnDemand
Open Access
- View All Open Access Opportunities
  Search across all available IGI Global Scientific Publishing open access publishing opportunities to unleash your research potential.
  Find an Open Access Journal for Your Next Manuscript
  Search across all available IGI Global Scientific Publishing open access publishing opportunities to unleash your research potential.
  Submit an Open Access Book Proposal
  Learn more about open access book publishing and how it can propel your research forward in the field.
  Convert Your Work to Open Access
  Already published? You can convert your work to open access to increase its impact through the IGI Global Scientific Publishing Restrospective Open Access Program.
  Utilize Open Access Collection Database
  Open up your research potential by utilizing our open access content or integrating the open access collection into your library
  Consider Open Access Agreements
  For Libraries: consider no-cost or investment-level open access agreements with IGI Global Scientific Publishing to support your faculty's research endeavors.
  Search Funding Resources
  Looking for additional funding resources to support your open access endeavors? View industry resources compiled by our open access team.
  Review Open Access Policies & Ethical Guidelines
  Considering IGI Global Scientific Publishing to publish your work under open access? Review the IGI Global Scientific Publishing open access policies and ethical guidelines
Publish with Us
Resources
- - Instructors
  - Course Adoption
  - Teaching Cases
  - K-12 Online Learning Collection
  - Authors and Editors
  - eEditorial Discovery^® System
  - Peer Review Process
  - Ethics and Malpractice
  - COPE Membership
  - Fair Use Policy
  - Open Access Publishing
  - FAQ
Catalogs
About Us

Karma2: Provenance Management for Data-Driven Workflows

Yogesh L. Simmhan (Microsoft Research, USA), Beth Plale (Indiana University, USA), and Dennis Gannon (Indiana University, USA)

Source Title: Quantitative Quality of Service for Grid Computing: Applications for Heterogeneity, Large-Scale Distribution, and Dynamic Environments

DOI: 10.4018/978-1-60566-370-8.ch020

OnDemand:

(Individual Chapters)

Available

$37.50

Current Special Offers

No Current Special Offers

Abstract

The increasing ability for the sciences to sense the world around us is resulting in a growing need for datadriven e-Science applications that are under the control of workflows composed of services on the Grid. The focus of our work is on provenance collection for these workflows that are necessary to validate the work-flow and to determine quality of generated data products. The challenge we address is to record uniform and usable provenance metadata that meets the domain needs while minimizing the modification burden on the service authors and the performance overhead on the workflow engine and the services. The framework is based on generating discrete provenance activities during the lifecycle of a workflow execution that can be aggregated to form complex data and process provenance graphs that can span across workflows. The implementation uses a loosely coupled publish-subscribe architecture for propagating these activities, and the capabilities of the system satisfy the needs of detailed provenance collection. A performance evaluation of a prototype finds a minimal performance overhead (in the range of 1% for an eight-service workflow using 271 data products).

Chapter Preview

Top

Introduction

The need to access and share large-scale computational and data resources to support dynamic computational science and agile enterprises is driving the growth of Grids (Foster, Kesselman, Nick & Tuecke, 2002). In the realm of e-Science, science gateways are archetypes for accessing, managing, and sharing virtualized resources to solve large collaboratory challenges (Catlett, 2002; Gannon et al., 2005). Science gateways are built as a service-oriented architecture with Grid resources virtualized as services. These resources—including physical resources such as sensors, computational clusters, and mass storage devices, and software resources such as scientific tasks and models—are available as services that provide an abstraction to access the resources through well-defined interfaces.

A significant constituent of applications that make use of the science gateways is data-driven applications (Simmhan, Pallickara, Vijayakumar & Plale, 2006a). The proliferation of wireless networking and inexpensive sensor technology is allowing the sciences an increasing ability to sense the world around us (West, 2005). This is specifically resulting in a growing need for data-driven applications; that is, applications that can be computation-intense and are usually either dataflow applications in which data flows from one process to another, or demand-driven in which computations are triggered in response to events occurring in the world around us. Data-driven scientific experiments are designed as workflows composed of services on the Grid, and data flow from one service to another, being transformed, filtered, fused, and used in complex models. These workflows capture the invocation logic for the scientific investigation and may be composed of hundreds of services connected as complex graphs. Data-driven workflow executions also see the participation of thousands of data products that reach terabytes in size. At this scale of processing, users need the ability to automatically track the execution of their experiments and the multitude of data products created and consumed by the services in the workflow. Provenance collection and management, also called process mining, workflow tracing, or lineage collection, is a new line of research on the execution of workflows, and the derivation and usage trail of data products that are involved in the workflows (Bose & Frew, 2005; Moreau & Ludascher, 2007; Simmhan, Plale & Gannon, 2005).

Provenance collected about the tasks of a workflow describes the workflow’s service invocations during its execution (Simmhan et al., 2005). This helps track service and resource usage patterns, and forms metadata for service and workflow discovery. In data-driven applications, however, it is provenance about the data that is central to understanding and recreating earlier runs. In data-driven workflows, data products are first-class parameters to services that consume and transform the input data to generate derived data products. These derived data products are ingested by other services in the same or a different workflow, forming a data derivation and data usage trail. Data provenance provides this derivation history of data that includes information about services and input data that contributed to the creation of a data product. This kind of information is extremely valuable, not only for diagnosing problems and understanding performance of a particular workflow run, but also to determine the origin and quality of a particular piece of derived information (Goble, 2002; Simmhan, Plale & Gannon, 2006b).

Current methods of collecting provenance are from workflow engine logs (IBM, 2005) or by instrumenting the services (Bose & Frew, 2004; Zhao, Wroe, Goble, Stevens, Quan & Greenwood, 2004). In the former case, the logs from the workflow engine are at the message level and insufficient for deciphering provenance about the data products, while instrumenting services introduce a burden on the service author to modify their service to generate provenance metadata. They also tend to be specific to the workflow framework and are not interoperable with heterogeneous workflow models that are likely to be present in a Grid environment. Work is also emerging on more general information models for provenance collection (Moreau & Ludascher, 2007).

Complete Chapter List

Search this Book:

Reset

MLA

APA

Chicago

Export Reference

Karma2: Provenance Management for Data-Driven Workflows

Abstract

Introduction

Complete Chapter List