Introduction
Climate and environmental change, soil fertility depletion, land and water misuse, and slowing growth in rangeland and crop yields are global concerns. Such issues demand the integration of areas such as Agronomy, Biology, Engineering, and Computer Science to advance agronomic practices. Likewise, the use of low-cost sensors and modern experimental devices in the field demands new approaches to ensure the reproducibility of agronomic experiments, particularly regarding the management of the increasing amount of data generated by field trials.
Reproducibility, rigor, and independent verification are fundamental tenets of the scientific method. However, scientists frequently start thinking about making their work reproducible only near the end of the research process, when they write up their results, or even later, when a journal requires that their datasets and scripts be made available for publication (Gandrud, 2015). According to Stodden et al. (2014), vast amounts of data are making some experiments quite difficult to reproduce or publish when disconnected from their original datasets and scripts.
The problem of irreproducible research is neither new nor limited to Agronomy. A decade ago, Ioannidis (2005) wrote a seminal paper exposing growing concerns in modern research: according to the author, false research findings are increasing and may even constitute the majority of published research claims in certain domains of knowledge.
Today, there is growing alarm about scientific results that cannot be reproduced (Stodden et al., 2014). According to a recent Nature survey of 1,576 scientists who answered an online questionnaire on reproducibility in research, more than 70% had tried and failed to reproduce another scientist's experiments, and more than half had been unable to reproduce their own. According to the survey, the best-known analyses are from cancer biology and psychology, where reported reproducibility rates were around 10% and 40%, respectively (Baker, 2016).
All this, together with the complexity of modern experiments, has led to the development of scientific workflows in many scientific areas. Scientific workflows are an abstraction for managing the execution of several kinds of computations in a systematic and easy-to-follow way. Many scientists now test their hypotheses as in silico experiments implemented as workflows, running them on local servers or on geographically distributed computing infrastructures such as high-performance computing clusters and clouds (Hey et al., 2009).
Currently, the use of scientific workflows focused on the statistical processing of agronomic experiments' datasets is witnessing a rapid increase and gaining importance (Mullis et al., 2014). Such workflows are also known as statistical scientific workflows (SSWf) (Nascimento & Cruz, 2013). An SSWf can be understood as a script consisting of statistical software functions connected to one another and to databases.
The notion that agronomic experiments are based on time- and labor-intensive experimental methods, using field-scale trials that may last for years, is not new. These experiments are also data-intensive, requiring statistical scripts to analyze the trials and their final results. Such scripts act as SSWf, written in a textual programming language described by a regular grammar, comparable to programming languages like R. The scripts often have extensive syntax and complex semantics, and are not easily shared or maintained by unskilled users. Besides, there are few mechanisms to collect scripts' provenance (Murta et al., 2014; McPhillips et al., 2015).
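To make the chained-script view of an SSWf concrete, the sketch below shows a minimal workflow of statistical steps applied to field-trial data. This is a hypothetical illustration (in Python rather than R); the data, treatment names, and function names are invented, and a real SSWf would read its records from a trial database.

```python
import statistics

# Hypothetical field-trial records: (treatment, plot yield in t/ha).
TRIALS = [
    ("control", 4.1), ("control", 3.9), ("control", 4.3),
    ("fertilized", 5.2), ("fertilized", 5.0), ("fertilized", 5.5),
]

def extract(records, treatment):
    """Workflow step 1: select the yields observed under one treatment."""
    return [y for t, y in records if t == treatment]

def summarize(yields):
    """Workflow step 2: descriptive statistics for one treatment."""
    return {"mean": statistics.mean(yields), "sd": statistics.stdev(yields)}

def compare(records):
    """Workflow step 3: chain the previous steps for every treatment."""
    treatments = sorted({t for t, _ in records})
    return {t: summarize(extract(records, t)) for t in treatments}

results = compare(TRIALS)
for treatment, stats in results.items():
    print(f"{treatment}: mean={stats['mean']:.2f} sd={stats['sd']:.2f}")
```

Each function is one workflow activity, and the output of one step feeds the next; this chaining is what makes the script a workflow rather than a loose collection of commands.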
Data provenance is well established in Computer Science and is becoming increasingly important for inspecting and verifying the quality and reliability of data and scientific experiments (Cruz, Campos & Mattoso, 2009). However, to the best of our knowledge, few agronomic experiments consider the central role of provenance in enriching their research data.
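As a rough illustration of what collecting script provenance could look like, the sketch below wraps a workflow step so that each execution is logged with content hashes of its input and output, allowing a result to be traced back to the exact data that produced it. This is an assumption for illustration only, not a description of any published provenance system, and all names are invented.

```python
import hashlib
import json
from datetime import datetime, timezone

# Minimal retrospective-provenance log: one record per executed step.
PROVENANCE = []

def _digest(obj):
    """Stable short hash of a JSON-serializable object."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()[:12]

def traced(step):
    """Wrap a workflow step so each call appends a provenance record."""
    def wrapper(data):
        result = step(data)
        PROVENANCE.append({
            "step": step.__name__,
            "input": _digest(data),
            "output": _digest(result),
            "when": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper

@traced
def mean_yield(yields):
    return sum(yields) / len(yields)

mean_yield([4.1, 3.9, 4.3])
print(json.dumps(PROVENANCE, indent=2))
```

Because the log stores hashes rather than the data itself, it stays small while still detecting whether a later re-run used different inputs, which is the core of verifying an analysis against its original datasets.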