Karma2: Provenance Management for Data-Driven Workflows

Yogesh L. Simmhan (Microsoft Research, USA), Beth Plale (Indiana University, USA) and Dennis Gannon (Indiana University, USA)
DOI: 10.4018/978-1-60566-370-8.ch020

Abstract

The increasing ability of the sciences to sense the world around us is resulting in a growing need for data-driven e-Science applications that are under the control of workflows composed of services on the Grid. The focus of our work is on provenance collection for these workflows, which is necessary to validate the workflow and to determine the quality of generated data products. The challenge we address is to record uniform and usable provenance metadata that meets the domain needs while minimizing the modification burden on the service authors and the performance overhead on the workflow engine and the services. The framework is based on generating discrete provenance activities during the lifecycle of a workflow execution that can be aggregated to form complex data and process provenance graphs that can span across workflows. The implementation uses a loosely coupled publish-subscribe architecture for propagating these activities, and the capabilities of the system satisfy the needs of detailed provenance collection. A performance evaluation of a prototype finds a minimal performance overhead (in the range of 1% for an eight-service workflow using 271 data products).
Chapter Preview

Introduction

The need to access and share large-scale computational and data resources to support dynamic computational science and agile enterprises is driving the growth of Grids (Foster, Kesselman, Nick & Tuecke, 2002). In the realm of e-Science, science gateways are archetypes for accessing, managing, and sharing virtualized resources to solve large collaboratory challenges (Catlett, 2002; Gannon et al., 2005). Science gateways are built as a service-oriented architecture with Grid resources virtualized as services. These resources—including physical resources such as sensors, computational clusters, and mass storage devices, and software resources such as scientific tasks and models—are available as services that provide an abstraction to access the resources through well-defined interfaces.

A significant constituent of applications that make use of the science gateways is data-driven applications (Simmhan, Pallickara, Vijayakumar & Plale, 2006a). The proliferation of wireless networking and inexpensive sensor technology is allowing the sciences an increasing ability to sense the world around us (West, 2005). This is specifically resulting in a growing need for data-driven applications; that is, applications that can be computation-intense and are usually either dataflow applications in which data flows from one process to another, or demand-driven in which computations are triggered in response to events occurring in the world around us. Data-driven scientific experiments are designed as workflows composed of services on the Grid, and data flow from one service to another, being transformed, filtered, fused, and used in complex models. These workflows capture the invocation logic for the scientific investigation and may be composed of hundreds of services connected as complex graphs. Data-driven workflow executions also see the participation of thousands of data products that reach terabytes in size. At this scale of processing, users need the ability to automatically track the execution of their experiments and the multitude of data products created and consumed by the services in the workflow. Provenance collection and management, also called process mining, workflow tracing, or lineage collection, is a new line of research on the execution of workflows, and the derivation and usage trail of data products that are involved in the workflows (Bose & Frew, 2005; Moreau & Ludascher, 2007; Simmhan, Plale & Gannon, 2005).

Provenance collected about the tasks of a workflow describes the workflow’s service invocations during its execution (Simmhan et al., 2005). This helps track service and resource usage patterns, and forms metadata for service and workflow discovery. In data-driven applications, however, it is provenance about the data that is central to understanding and recreating earlier runs. In data-driven workflows, data products are first-class parameters to services that consume and transform the input data to generate derived data products. These derived data products are ingested by other services in the same or a different workflow, forming a data derivation and data usage trail. Data provenance provides this derivation history of data that includes information about services and input data that contributed to the creation of a data product. This kind of information is extremely valuable, not only for diagnosing problems and understanding performance of a particular workflow run, but also to determine the origin and quality of a particular piece of derived information (Goble, 2002; Simmhan, Plale & Gannon, 2006b).
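
The derivation trail described above can be made concrete with a small sketch. It is an illustration only, not Karma2's actual data model or API: the `Invocation` record, the function names, and the sample service and data product names are all hypothetical. The sketch records which data products each service invocation consumed and produced, then walks backwards from one derived product to collect every service and input that contributed to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invocation:
    """Hypothetical record of one service invocation in a workflow run."""
    service: str
    consumed: tuple  # identifiers of input data products
    produced: tuple  # identifiers of derived data products


def build_derivations(invocations):
    """Map each derived data product to the invocation that produced it."""
    produced_by = {}
    for inv in invocations:
        for product in inv.produced:
            produced_by[product] = inv
    return produced_by


def derivation_history(product, produced_by):
    """Collect all services and input data that contributed to `product`."""
    services, inputs = set(), set()
    seen = set()
    stack = [product]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        inv = produced_by.get(current)
        if inv is None:
            continue  # source data: no recorded producer
        services.add(inv.service)
        for source in inv.consumed:
            inputs.add(source)
            stack.append(source)
    return services, inputs


# Made-up example: three services chained by the data they exchange.
invocations = [
    Invocation("filter", ("radar_raw",), ("radar_clean",)),
    Invocation("fuse", ("radar_clean", "station_obs"), ("merged",)),
    Invocation("forecast_model", ("merged",), ("forecast",)),
]
produced_by = build_derivations(invocations)
services, inputs = derivation_history("forecast", produced_by)
```

Because the trail is keyed on data product identifiers rather than on a particular workflow, the same walk would cross workflow boundaries whenever a product created in one run is consumed in another.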

Current methods collect provenance either from workflow engine logs (IBM, 2005) or by instrumenting the services (Bose & Frew, 2004; Zhao, Wroe, Goble, Stevens, Quan & Greenwood, 2004). In the former case, the logs from the workflow engine are at the message level and are insufficient for deciphering provenance about the data products, while instrumenting services introduces a burden on service authors, who must modify their services to generate provenance metadata. These methods also tend to be specific to the workflow framework and are not interoperable with the heterogeneous workflow models that are likely to be present in a Grid environment. Work is also emerging on more general information models for provenance collection (Moreau & Ludascher, 2007).
