Integrating Data Management and Collaborative Sharing with Computational Science Research Processes

Kerstin Kleese van Dam (Pacific Northwest National Laboratory, USA), Mark James (University of California San Diego, USA) and Andrew M. Walker (University of Bristol, UK)
DOI: 10.4018/978-1-61350-116-0.ch021

This chapter describes the key principles and components of a good data management system, provides real-world examples of how these can be integrated with scientific research processes to enable successful data sharing, gives an outlook on future developments, and discusses lessons learned. We conclude with a short section on how to get started for those whose interest has been piqued.
Chapter Preview


Scientific research can be characterized by its aim to make descriptive, explanatory and predictive inferences on the basis of observed or simulated information about the real world. Ideally it uses explicit, codified and public methods and rules for its data collection and analysis. Repeatability, reproducibility and transparency are seen as the main pillars of good scientific research (King, 1994). In current computational science research these aims are often difficult to achieve because of its inherent complexities and distributed nature.

Scientists today rarely engage directly with their research object, but do so via digitally captured, reduced, calibrated, analyzed, synthesized and visualized data in combination with computer simulations of the processes of interest. Advances in experimental and computational technologies have led to an exponential growth in the volume, variety and complexity of this data (Southan, 2009; Goble, 2009), and whilst the data deluge is not found everywhere in an absolute sense, it is seen in a relative sense within most research groups. Many groups lack the methods, tools and infrastructure to deal effectively with the increasing volume, complexity and geographical distribution of the relevant data. But it is not data alone that challenges the scientific community. Scientists now use a far more varied and extensive array of software products to engage with their data, combined in ever more complex workflows that are executed on very different platforms, some of them (grids or clouds) unknown to the user. This makes it much more difficult to follow the aims of good scientific research practice in terms of repeatability, reproducibility and transparency.

Leaving the aspirational aspects of scientific investigations aside, research practice has become much more collaborative than it was even a few years ago (Jones, 2008; Guimera, 2005), and few research projects do not rely on the sharing of processes and data amongst different group members or groups to accomplish their scientific goals. The increasing complexity of scientific challenges requires more interdisciplinary and multidisciplinary information and knowledge exchange (Committee on Facilitating Interdisciplinary Research, 2004). Whilst multidisciplinary data sharing is still rare, sharing of key data sets within particular research communities has become more mainstream in a range of scientific domains such as environmental sciences or biology (Field, 2009). This is often facilitated through dedicated data centers and expert data collections. In other fields, and specifically computational sciences, working practices around the sharing of research results have, however, not changed much over the past years. Research publications are still the main sources of information exchange in the wider community. Unfortunately, publications have certain limitations in conveying comprehensive information on a particular subject: they are limited in length and thus in detail, and their main purpose is to convey the scientists’ point of view rather than a comprehensive, objective representation of all facts (Shotton, 2009; de Waard, 2006; Kuhn, 1962; Latour, 1987). Publications thus provide at best a very coarse and high-level summary of the research work undertaken by the authors. The associated raw and derived data should be a rich source of supporting information, particularly if coupled with the appropriate metadata and documented scientific workflows, forming a complete research object (DeRoure, 2009).

In recognition of the desire by the research community to have access not only to the summary of a research project, but also to the underpinning data, more publishers now require authors to share their raw and derived data by depositing it in publicly accessible archives or by providing it on request. However, recent studies have shown (Savage, 2009; Wicherts, 2006) that few authors comply with the journals’ data deposition requirements, and only enforced deposition before publication seems to produce the desired result, indicating a continued reluctance to share in-depth research results with the general research community.

Key Terms in this Chapter

Semantic Technologies: Encoding of meaning separately from data files, content files, and application codes to enable machines as well as people to understand, share and reason with them at execution time.

Data Management: All disciplines related to managing data as a valuable resource. Here we refer in particular to scientific data management: the management of storage, access, usage, lifecycle, content and meaning for scientific data.

High Performance Computing: The use of parallel processing for running advanced application programs efficiently, reliably and quickly on leadership-class computer systems.

Collaborative: The term refers here both to working practices and supporting tools that allow and further the joint working of researchers with common interests that are often in geographically distributed locations.

Data Curation: The preservation and management of scientific data specifically for continuous reuse by identified, dedicated user groups.

Metadata: Data about data.

Data Lifecycle Management: Effective management and exploitation of data from its creation until it is obsolete.
