1. Introduction
Among the different distributed platforms, cluster and grid infrastructures have emerged as powerful options for tackling ambitious problems. However, implementing and executing distributed applications that can run on both cluster and grid infrastructures is far from trivial. Although the execution model of distributed applications is similar on both platforms, their particularities make the application requirements completely different. First, while a cluster is usually managed by someone from the same organization as the user, grid resources belong to different organizations, each with its own hardware, software, usage and security policies (Foster, 2002). Also, while the physical resources of a local cluster do not change over time, grid infrastructures are highly dynamic. Moreover, while local clusters are considered reliable, the execution of tasks on the grid can be problematic (Botón-Fernández et al., 2015), so the application must be designed to overcome failures.
One of the recent developments in distributed computing has been the use of Open Access databases. Data analysis can now be performed on different resources, possibly belonging to different institutions, that communicate over high performance networks. This introduces a new set of challenges, including how to refer to remote data and how to ensure that the reference remains stable for a reasonable amount of time. This is of utmost importance in data curation, that is, the active management of data over its life-cycle of interest, establishing long-term repositories for current and future use.
The work presented here aims to solve both problems with a unified cross-disciplinary approach.
The first objective is to provide developers with an efficient instrument to create and port distributed applications. To this end, DistributedToolbox (Rodríguez-Pascual and Mayo-García, 2013) encapsulates a set of tools that enable the development and execution of highly portable distributed applications on clusters and grids. It ensures correct task completion, so the application developer is relieved of the low-level operations of task management and control.
DistributedToolbox is built around RemoteAPI, a very simple API designed to define the tasks to execute on the distributed infrastructure. One of the dedicated tools included in the toolbox then takes care of the task execution. In this sense, the proposed solution does not compete with existing ones; instead, it embraces the different alternatives.
After dealing with the creation and execution of portable applications, this work tackles the problem of reproducible research. Persistent IDentifiers (PIDs) are long-lasting references to digital objects (single files or sets of files) (Hakala, 2010). In scientific computing, PIDs can reference primary and secondary scientific data in a unique and persistent manner, very similar to how DOI numbers are used to identify articles.
The objective is to ensure reproducibility, that is, the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else, and to serve as a basis for new work. To this end, open on-line databases are a basic tool, making both raw and processed data freely available. Of course, given the commitment to building portable, long-lasting distributed applications, the input data employed by these applications should persist as well. PIDs represent a powerful tool for this purpose, ensuring that future changes to URIs or to the internal organization of Open Access databases will be transparent to a user willing to repeat a given experiment.
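The indirection that makes a PID stable can be illustrated with a minimal sketch. It is not the resolution service used in this work: the class PidRegistry, its methods, the identifier string and the example.org URLs are all hypothetical, chosen only to show that an application keeps the same PID while the location behind it changes.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PidRegistry:
    """Hypothetical in-memory stand-in for a PID resolution service."""

    # Maps each persistent identifier to the current location of the object.
    _locations: Dict[str, str] = field(default_factory=dict)

    def register(self, pid: str, url: str) -> None:
        """Bind a new PID to the current URL of a digital object."""
        if pid in self._locations:
            raise ValueError(f"PID already registered: {pid}")
        self._locations[pid] = url

    def relocate(self, pid: str, new_url: str) -> None:
        """Update the location of an object without changing its PID."""
        if pid not in self._locations:
            raise KeyError(f"Unknown PID: {pid}")
        self._locations[pid] = new_url

    def resolve(self, pid: str) -> str:
        """Return the current URL for a PID (raises KeyError if unknown)."""
        return self._locations[pid]


# Hypothetical identifier and URLs, for illustration only.
DATASET_PID = "pid:example/input-dataset"

registry = PidRegistry()
registry.register(DATASET_PID, "https://old-repository.example.org/data/input.dat")
first_location = registry.resolve(DATASET_PID)

# The database reorganizes its storage; the application still uses the same PID.
registry.relocate(DATASET_PID, "https://new-repository.example.org/archive/input.dat")
second_location = registry.resolve(DATASET_PID)
```

In this sketch the application stores only DATASET_PID; the move from the old to the new repository is absorbed by the registry, which is the property the text relies on for repeating an experiment later.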