Enriching Agronomic Experiments with Data Provenance

Sergio Manuel Serra da Cruz (Federal Rural University of Rio de Janeiro (UFRRJ), Department of Mathematics, Rio de Janeiro, Brazil) and Jose Antonio Pires do Nascimento (Brazilian Agricultural Research Corporation (EMBRAPA), Agricultural Sector, Rio de Janeiro, Brazil)
DOI: 10.4018/IJAEIS.2017070102

Abstract

Reproducibility is a major feature of Science. Even agronomic research of exemplary quality may yield irreproducible empirical findings because of random or systematic error. Reproducing agronomic experiments that rely on statistical data and legacy scripts is not easily achieved. We propose RFlow, a tool that aids researchers in managing, sharing, and enacting scientific experiments that encapsulate legacy R scripts. RFlow transparently captures the provenance of scripts and makes experiments reproducible. Unlike existing computational approaches, RFlow is non-intrusive and does not require users to change their way of working: it wraps agronomic experiments in a scientific workflow system. Our computational experiments show that the tool can collect different types of provenance metadata from real experiments and enrich agronomic data with that metadata. This study shows the potential of RFlow to serve as the primary integration platform for legacy R scripts, with implications for other data- and compute-intensive agronomic projects.
Article Preview

Introduction

Climate and environmental change, soil fertility depletion, land and water misuse, and slowing growth in rangeland and crop yields are global concerns. Addressing them demands the integration of areas such as Agronomy, Biology, Engineering, and Computer Science to advance agronomic practices. The use of low-cost sensors and modern experimental devices in the field demands new approaches to ensure the reproducibility of agronomic experiments, particularly in managing the increasing amount of data generated by field trials.

Reproducibility, rigor, and independent verification are fundamental tenets of the scientific method. However, scientists frequently start thinking about making their work reproducible only near the end of the research process, when they write up their results, or even later, when a journal requires that their datasets and scripts be made available for publication (Gandrud, 2015). According to Stodden et al. (2014), the vast amounts of data involved make some experiments quite difficult to reproduce or publish when disconnected from their original datasets and scripts.

The problem of irreproducible research is neither new nor limited to Agronomy. A decade ago, Ioannidis (2005) wrote a seminal paper exposing growing concerns about modern research, arguing that false research findings are increasing and may even constitute the majority of published research claims in certain domains of knowledge.

Today, there is growing alarm about scientific results that cannot be reproduced (Stodden et al., 2014). In a recent Nature survey of 1,576 scientists who took an online questionnaire on reproducibility in research, more than 70% had tried and failed to reproduce another scientist's experiments, and more than half had been unable to reproduce their own. According to the survey, the best-known analyses come from cancer biology and psychology, where the reported reproducibility rates were around 10% and 40%, respectively (Baker, 2016).

All this, together with the complexity of modern experiments, has led to the development of scientific workflows in many scientific areas. Scientific workflows are an abstraction for managing the execution of several kinds of computations in a systematic and easy-to-follow way. Many scientists now test their hypotheses as in silico experiments implemented as workflows, running them on local servers or on geographically distributed computing infrastructures such as High-Performance Computing clusters and clouds (Hey et al., 2009).

Currently, the use of scientific workflows for the statistical processing of agronomic experiments' datasets is increasing rapidly and gaining importance (Mullis et al., 2014). Such workflows are also known as statistical scientific workflows (SSWf) (Nascimento & Cruz, 2013). An SSWf can be understood as a script consisting of statistical software functions connected to one another and to databases.
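To make the SSWf notion concrete, the following is a minimal sketch (not taken from the paper, and written in Python rather than R for illustration) of a statistical workflow: ordinary functions chained together, much as an R analysis script chains calls to statistical routines and database reads. The function names and the yield values are hypothetical.

```python
# Minimal sketch of a statistical scientific workflow (SSWf):
# each function is one step; the workflow is their composition.
import statistics

def load_trial_data():
    # Stand-in for reading field-trial measurements from a database.
    return [4.2, 3.9, 4.5, 4.1, 4.8, 3.7]  # hypothetical yields (t/ha)

def summarize(yields):
    # One analysis step: descriptive statistics over the trial data.
    return {"mean": statistics.mean(yields),
            "stdev": statistics.stdev(yields)}

def run_workflow():
    # Chaining the steps reproduces what a legacy script does inline.
    return summarize(load_trial_data())

result = run_workflow()
print(round(result["mean"], 2))  # → 4.2
```

In a real R script the same chain would typically be implicit, which is precisely why sharing and maintaining such scripts is hard without explicit workflow structure.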

The practice of conducting agronomic experiments through time- and labor-intensive, field-scale trials that may last for years is not new. These experiments are also data-intensive, requiring statistical scripts to analyze the trials and their final results. The scripts act as SSWf written in a textual programming language such as R, which often has an extensive syntax and complex semantics. Such scripts are not easily shared or maintained by unskilled users, and there are few mechanisms to collect their provenance (Murta et al., 2014; McPhillips et al., 2015).

Data provenance is well established in Computer Science and is becoming increasingly important for inspecting and verifying the quality and reliability of data and scientific experiments (Cruz, Campos & Mattoso, 2009). However, to the best of our knowledge, few agronomic experiments consider the central role of provenance in enriching their research data.
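As a rough illustration of what non-intrusive provenance capture can record, the sketch below wraps the execution of a (hypothetical) legacy analysis and stores retrospective provenance metadata: what script ran, on which inputs, when, and with what result. This is a generic sketch in the spirit of RFlow, not its actual API; all names and values here are assumptions for illustration.

```python
# Hedged sketch: wrap a legacy script run and record provenance
# metadata without asking the user to change the script itself.
import hashlib
from datetime import datetime, timezone

def capture_provenance(script_text, inputs, run_fn):
    # Record when the run started, what ran (by content hash),
    # what it consumed, and what it produced.
    started = datetime.now(timezone.utc).isoformat()
    output = run_fn(inputs)
    return {
        "script_sha256": hashlib.sha256(script_text.encode()).hexdigest(),
        "inputs": inputs,
        "started": started,
        "output": output,
    }

# Hypothetical legacy analysis: mean yield of a trial.
record = capture_provenance(
    "mean(yields)",                 # the legacy script's source text
    [4.2, 3.9, 4.5],                # hypothetical trial measurements
    lambda xs: sum(xs) / len(xs),   # stand-in for enacting the script
)
print(sorted(record))  # → ['inputs', 'output', 'script_sha256', 'started']
```

Hashing the script text lets a later run verify that the analysis code is byte-for-byte the one that produced a given result, which is one of the checks reproducibility depends on.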
