Reproducible Computing

Patrick Wessa, Ian E. Holliday
DOI: 10.4018/978-1-4666-5888-2.ch647

Background

The problem of irreproducible research has received a great deal of attention within the statistical computing and bioinformatics communities. Arguably, the most famous statement of this problem is Claerbout’s Principle (de Leeuw, 2001): “An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and that complete set of instructions that generated the figures.”

Other scientists have extended Claerbout’s Principle and specified additional requirements for Reproducible Research (de Leeuw, 2001): “First, there is no reason to single out figures. The same ‘Principle’ obviously applies to tables, standard errors, and so on. The fact that figures often happen to be easier to reproduce, does not preclude that we should apply the same rule to any form of computer-generated output. Second, there is no reason to limit the Claerbout’s Principle to published articles. We can make exactly the same statement about our lectures and teaching, certainly in the context of graduate teaching. We must be able to give our students our code and our graphics files, so that they can display and study them on their own computers (and not only on our workstations, or in crowded university labs). And third, and perhaps most importantly, it is not clearly defined what a ‘software environment’ is. Buckheit and Donoho apply the principle in such a way that everybody who wants to check their results is forced to buy MatLab®. Not Mathematica®, Macsyma®, or S-plus®. Those you may need to buy for other articles. This violates the Freeware Principle...”

Several solutions have been proposed, but the most prominent one is based on the concept of Literate Programming (Knuth, 1984). It has been implemented in an R package called “Sweave” (Leisch, 2003), in which the concept of the so-called Compendium plays a fundamental role. The (traditional) Compendium is defined as an integrated collection of text (written in LaTeX), statistical code (written in R), and data that allows the presented science to be reproduced. All the documents needed to create the article are contained in an archive file (preferably in tar.gz or zip format).
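The archive structure of a traditional Compendium can be illustrated with a minimal sketch. The example below is not Sweave itself: it uses Python's standard `tarfile` module, and the file names (`article.Rnw`, `analysis.R`, `data.csv`), their contents, and the helper functions `build_compendium` and `list_compendium` are all hypothetical, chosen only to show text, code, and data travelling together in one tar.gz file.

```python
import io
import tarfile


def build_compendium(files: dict[str, bytes]) -> bytes:
    """Pack text, code, and data files into one in-memory .tar.gz archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in sorted(files):  # sorted for a stable member order
            content = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def list_compendium(archive_bytes: bytes) -> list[str]:
    """Return the member names stored in a .tar.gz compendium."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        return archive.getnames()


# Hypothetical compendium contents: LaTeX text with an embedded R chunk,
# a standalone R script, and the data set both of them read.
example_files = {
    "article.Rnw": (
        b"\\documentclass{article}\n\\begin{document}\n"
        b"<<>>=\nmean(read.csv('data.csv')$x)\n@\n\\end{document}\n"
    ),
    "analysis.R": b"d <- read.csv('data.csv')\nprint(mean(d$x))\n",
    "data.csv": b"x\n1\n2\n3\n",
}

compendium = build_compendium(example_files)
members = list_compendium(compendium)
print(members)
```

A reader who receives such an archive still has to unpack it and run the code locally with the right software installed, which is one reason the archive-based approach places demands on the reader.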

While the traditional Compendium, based on Sweave-like technology, seems to solve the problem of irreproducible research, there are several shortcomings and remaining problems:

Key Terms in this Chapter

Literate Programming: A mixture of software code and human-readable text in which the language supports the understanding of the code. The Literate Programming document is the result of software execution, which produces a readable text that contains author-specified results and descriptions.

Reproducible Computing: Allows any (non-expert) writer to create and publish (based on free technologies) electronic documents which allow any (non-expert) reader to reproduce and re-use the computations that are presented, without the need to download or install anything on the client machine. The reader only needs to click on a table or picture to open a web application which provides instant access to the underlying data and software.

Compendium (Revised Definition): A research document where each computation is referenced by a unique web page containing all the information that is necessary to re-compute and re-use the analysis on distributed web servers or on the user’s machine.

Reproducibility: An important principle of the scientific method. It relates to the fact that certain research-related actions can be replicated by others in order to verify/falsify published results or to re-use prior research in new studies.

Compendium (Traditional Definition): An integrated collection of text, code, and data that allows the presented science to be reproduced. All the necessary documents that are needed to create the article (with embedded R code) and the data are contained in an archive file.

Reproducibility Level: Reproducibility is not a binary property; rather, it should be defined on a scale that represents the degree of our ability to replicate or re-use aspects of the research.

Collaborative Reproducible Writing: Allows authors to collaboratively and simultaneously write and edit documents that are hosted in a cloud computing environment. Each keystroke is associated with a co-author and can be traced back in time. Reproducible Writing is more than just a Wiki because it records changes at the keystroke level and it encompasses the statistical analyses that have been generated by any author through Reproducible Computing.
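
The keystroke-level record described under Collaborative Reproducible Writing can be pictured as an append-only event log. The following is a minimal sketch under stated assumptions, not the system described in the chapter: the `Keystroke` record, its fields, and the `replay` function are hypothetical, and deletions are modelled simply as events with an empty `inserted` string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keystroke:
    timestamp: float  # seconds since the (hypothetical) session started
    author: str       # co-author who produced the keystroke
    position: int     # index in the text where the edit applies
    inserted: str     # character typed; empty string means a deletion


def replay(log: list[Keystroke], until: float) -> str:
    """Rebuild the document text as it stood at time `until`."""
    text: list[str] = []
    for event in sorted(log, key=lambda e: e.timestamp):
        if event.timestamp > until:
            break
        if event.inserted:
            text.insert(event.position, event.inserted)
        elif 0 <= event.position < len(text):
            del text[event.position]
    return "".join(text)


# Hypothetical editing session with two co-authors.
log = [
    Keystroke(0.0, "alice", 0, "R"),
    Keystroke(0.5, "alice", 1, "x"),
    Keystroke(1.0, "bob", 1, ""),   # bob deletes the "x"
    Keystroke(1.5, "bob", 1, "!"),
]

early_text = replay(log, until=0.5)
final_text = replay(log, until=2.0)
contributors = sorted({event.author for event in log})
print(early_text, final_text, contributors)
```

Because every event carries an author and a timestamp, any earlier state of the document can be reconstructed and each change attributed, which is the property that distinguishes this kind of record from page-level Wiki revisions.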
