Introduction
Since the start of the new millennium, people have been sharing data at an unprecedented scale and richness. In scientific domains such as biology and chemistry, the trend of “big science”, exemplified by large-scale collaborative projects such as the iPlant Collaborative (http://www.iplantcollaborative.org), demands the sharing of data across organizational boundaries and even across disciplines. For businesses, Big Data is a key component of competition, growth and innovation, and much of Big Data originates outside the company that absorbs it. With the large-scale proliferation and sharing of data, questions such as “Where did this data come from?”, “Who else is using this data?”, and “Why is this piece of data here?” are becoming increasingly common (Ram & Liu, 2012). Data provenance, often referred to as the “origin”, “lineage”, “history”, or “pedigree” of data, contains the answers to these questions. When data travel beyond the specific setting in which they are generated, it is imperative that their provenance be captured to ensure the trustworthiness of the data.
In the last decade, significant research has been conducted to standardize the semantics of data provenance and to develop a shared provenance ontology that allows unambiguous interpretation of provenance, supports interoperability of data provenance between systems, and improves the usability of data provenance by enabling richer queries. One of the earliest efforts to standardize provenance semantics was the development of the W7 model (Ram & Liu, 2007). The W7 model conceptualizes provenance as consisting of seven Ws (what, when, where, how, who, which and why), and it has been adopted in subsequent research (e.g., Lupelli et al., 2015; Narock, Yoon, & March, 2014; Prat & Madnick, 2008). Another widely used provenance model is the Open Provenance Model (OPM) (Moreau et al., 2011). The OPM represents the provenance of objects as an annotated causality graph, which captures the causal dependencies between three types of nodes: artifacts, processes and agents. Other well-known provenance ontologies include the Provenance Vocabulary (Hartig & Zhao, 2010) and the PROV-DM model (Belhajjame et al., 2012). These generic provenance ontologies are designed to be domain- and architecture-independent. They support a digital representation of provenance for any “thing”, so that provenance can be exchanged between systems by means of a compatibility layer based on a shared provenance model (Moreau et al., 2011).
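To make the idea of a causality graph over artifacts, processes and agents more concrete, the following is a minimal illustrative sketch in Python, not an implementation of the OPM itself. The class names (Node, CausalityGraph), the method causes_of, and the example file and agent names are all hypothetical. The edge labels (used, wasGeneratedBy, wasControlledBy) are borrowed from the OPM vocabulary and are treated here as plain strings with no enforced semantics.

```python
from dataclasses import dataclass, field

# The three node types named in the text.
NODE_KINDS = {"artifact", "process", "agent"}


@dataclass(frozen=True)
class Node:
    """A node in the causality graph (hypothetical helper class)."""
    kind: str
    name: str

    def __post_init__(self):
        if self.kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {self.kind}")


@dataclass
class CausalityGraph:
    """Edges point from an effect to its cause: (effect, label, cause)."""
    edges: list = field(default_factory=list)

    def add_edge(self, effect, label, cause):
        self.edges.append((effect, label, cause))

    def causes_of(self, node):
        """Return every node that `node` depends on, directly or transitively."""
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for effect, _label, cause in self.edges:
                if effect == current and cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen


# Hypothetical example: an alignment step run by one person.
raw = Node("artifact", "raw_reads.fastq")
aligned = Node("artifact", "aligned.bam")
align = Node("process", "align")
alice = Node("agent", "alice")

graph = CausalityGraph()
graph.add_edge(aligned, "wasGeneratedBy", align)   # artifact <- process
graph.add_edge(align, "used", raw)                 # process  <- artifact
graph.add_edge(align, "wasControlledBy", alice)    # process  <- agent

# Answers "Where did this data come from?" for the aligned file.
lineage = graph.causes_of(aligned)
```

In this sketch, lineage contains the alignment process, the raw input file and the controlling agent, which is the kind of answer a provenance query for the aligned file would be expected to return. Storing edges as a flat list keeps the example short; a real system would likely index edges by node.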