BioSimGrid Biomolecular Simulation Database

Kaihsu Tai (University of Oxford, UK) and Mark Sansom (University of Oxford, UK)
DOI: 10.4018/978-1-60566-374-6.ch016


BioSimGrid is a distributed biomolecular simulation database. It is a general-purpose database for trajectories from molecular dynamics simulations. Though initially designed as a distributed data grid, BioSimGrid allows for installation as a stand-alone instance. This can later be integrated into a wider, networked system. This presentation of BioSimGrid follows a scenario in biological research to demonstrate how to install the system, and how to deposit, query, and analyze trajectories in this system, with real Python code examples for each step. What then follow are explanations of the underlying concepts in the implementation of BioSimGrid: relational database, distributed computing, and the input/output (deposit and analysis) modules. Finishing the presentation is a discussion of the emerging trends and concerns in the further development of BioSimGrid and similar biological databases. This discussion touches on quality-assurance issues and the use of BioSimGrid as a back-end for other speciality databases. The experience of developing BioSimGrid compels the conclusion: In the development and maintenance of biomolecular simulation databases, it is essential that sustainability be asserted as a key principle.
Chapter Preview

1. Introduction And Background: A Repository Of Biomolecular Simulations

Since the first application of molecular dynamics to proteins in 1976 (Adcock and McCammon, 2006), this simulation methodology has added value to experimental structural biology by making biomolecules ‘come alive’ and by covering the nanosecond time-scale, which experimental methods are only beginning to access. Adding to this, we have the method of comparison, a precursor to the process of classification, which is fundamental to biology (Brooks and McLennan, 2006). Insights into the internal motions of proteins can come from comparing the results of molecular dynamics simulations, namely the trajectories (Pang et al., 2005; Tai et al., 2007). This process can be facilitated by having a database of trajectories. Over the past few years (2003 to 2006), we have developed such a database, called BioSimGrid (; Feig et al., 1999). To differentiate, BioSimGrid is a general-purpose database for trajectories from molecular dynamics simulations. It is free software licensed under the terms of the GNU General Public License (Stallman, 2002). It can take advantage of distributed (‘grid’) computing to enhance reliability and ensure longevity of the trajectory content (Berman et al., 2003a).

By ‘general-purpose’, we mean the following. Firstly, BioSimGrid can admit trajectories generated by different simulation packages, such as Amber (Pearlman et al., 1995), Gromacs (Lindahl et al., 2001), Charmm (Brooks et al., 1983), NWChem (Straatsma et al., 2000), and NAMD (Kalé et al., 1999). Secondly, BioSimGrid is not restricted to a special kind of system or granularity. It can host simulations for nucleic acids, proteins, small molecules, or even non-biological polymers; simulations at the all-atom level or coarse-grain molecular dynamics (Bond and Sansom, 2006; Marrink et al., 2004; Nielsen et al., 2004). It can also store non-sequential ‘trajectories’, or rather ensembles, generated by other methods such as Monte Carlo, homology modelling (Šali and Blundell, 1993), and CONCOORD (de Groot et al., 1997).

Here we briefly describe the implementation of BioSimGrid, so the reader can have a conceptual understanding of the system: aspects such as installation, interfaces, architecture, data schema, and the deposit and analysis modules. We present an end-to-end case scenario where a scientist can deposit a trajectory into BioSimGrid, query the database, and analyze a trajectory. Finally, we discuss the prospects of such a database: the quality-assurance and sustainability issues, and customization of BioSimGrid as ‘back-ends’ to specialist databases.

Key Terms in this Chapter

Trajectories: The result of a molecular dynamics simulation is called a trajectory. This is a time-series of molecular conformations, or in other words, a movie of three-dimensional snapshots of the molecules in the system. For a system of N particles and m frames, a trajectory can be seen conceptually as a 3N × m matrix: On one side, it contains the 3N Cartesian coordinates, three scalars for each atom. On the other, it contains a list of frames, each associated with a time-point. Often the time-step for Newtonian integration in an atomistic simulation is 1 fs or 2 fs, and the snapshots are recorded at the rate of one frame per 1 ps. For a 1 ns trajectory with a 1 fs time-step, this gives 10⁶ time-steps and 10³ frames.
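The arithmetic in this definition can be checked with a short sketch. The function and constant names are hypothetical, and the system size of 10,000 particles is an arbitrary assumption chosen only to illustrate the 3N × m shape.

```python
# Illustrative arithmetic only; all names here are hypothetical.
# Times are kept as integers in femtoseconds to avoid rounding issues.
NS_IN_FS = 1_000_000  # 1 ns expressed in fs
PS_IN_FS = 1_000      # 1 ps expressed in fs


def count_time_steps(length_fs, time_step_fs):
    """Number of integration steps needed to cover the simulated time."""
    return length_fs // time_step_fs


def trajectory_shape(n_particles, length_fs, frame_interval_fs):
    """Shape (3N, m) of the conceptual trajectory matrix."""
    n_frames = length_fs // frame_interval_fs
    return 3 * n_particles, n_frames


# A 1 ns trajectory, 1 fs time-step, one frame saved every 1 ps.
steps = count_time_steps(NS_IN_FS, 1)
rows, frames = trajectory_shape(10_000, NS_IN_FS, PS_IN_FS)
print(steps, rows, frames)
```

With a 2 fs time-step the number of integration steps halves, while the number of stored frames stays the same because it depends only on the recording interval.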

Molecular Dynamics Simulation (MD): A method to simulate molecules in which atoms (or particles in general) within molecules interact under the laws of physics (traditionally the classical Newtonian laws) for some time, giving a view of the motion of the molecule. The underlying framework of physical (Newtonian) laws is called ‘molecular mechanics’ (MM). This simulation can be constructed using real ‘balls’ (spheres representing the atoms/particles) and appropriate ‘springs’ (possibly non-Hookean, to represent the interactions between the particles); but more conveniently, it is commonly done in silico, using a model of the molecules in the computer to progress through time according to the physical laws. Though usually only the classical interactions are considered, it is possible to mix in quantum mechanical considerations for certain regions where bond-breaking and bond-forming activities are of interest. This mixed method is called ‘quantum mechanics-molecular mechanics’ (QM/MM).
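As a toy illustration of the ‘ball and spring’ picture, the sketch below integrates Newton's equations for a single particle on a Hookean spring in one dimension. The velocity Verlet scheme is a widely used integrator, but it is an assumption here: the text does not say which integrator any particular package uses. All names and the reduced units are hypothetical.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations of motion in one dimension.

    Returns the list of positions (one per step, including the start)
    and the final velocity.
    """
    positions = [x]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        positions.append(x)
    return positions, v


K = 1.0  # spring constant in reduced units (assumption)


def spring_force(x):
    """Hookean restoring force towards the origin."""
    return -K * x


# Start at x = 1 with zero velocity; unit mass, 1000 steps of dt = 0.01.
positions, final_v = velocity_verlet(1.0, 0.0, spring_force, 1.0, 0.01, 1000)
energy = 0.5 * final_v ** 2 + 0.5 * K * positions[-1] ** 2
expected_x = math.cos(10.0)  # analytic solution x(t) = cos(t) at t = 10
print(positions[-1], expected_x, energy)
```

The list of positions plays the role of a trajectory for this one-particle system; a real simulation tracks three coordinates per particle and far more elaborate forces.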

Metadata: ‘A set of data that describes and gives information about other data’ (Oxford English Dictionary). In the context of a database of trajectories, this may provide the name of the trajectory, the molecules included in the simulation system, related scientific publications, and the provenance (ontogeny) of the trajectory. For the last item, it may include the original molecular dynamics parameters with which the trajectory was generated.
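One way to picture such metadata is as a small record attached to each trajectory. The sketch below is purely illustrative: the class name, field names, and example values are hypothetical and do not reflect the actual BioSimGrid schema.

```python
from dataclasses import dataclass, field


@dataclass
class TrajectoryMetadata:
    """Hypothetical metadata record; not the BioSimGrid schema."""
    name: str
    molecules: list
    publications: list = field(default_factory=list)
    # Provenance: the parameters with which the trajectory was generated.
    md_parameters: dict = field(default_factory=dict)


meta = TrajectoryMetadata(
    name="example trajectory",
    molecules=["protein", "water"],
    md_parameters={"package": "Gromacs", "time_step_fs": 2},
)
print(meta)
```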

Forcefield: In molecular dynamics, a forcefield is the collection of potentials of interaction between all possible pairs of atom/particle types within the modelled system. This includes short-range interactions such as covalent and ionic bonds, and non-bonded interactions (such as van der Waals); and long-range interactions such as electrostatics. Two examples are the eponymous forcefields in the molecular dynamics packages Amber and Gromacs.
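The kinds of interaction named above can be written as simple functions of the inter-particle distance. The sketch below uses generic textbook forms (a harmonic bond, a 12-6 Lennard-Jones term for van der Waals interactions, and a Coulomb term for electrostatics) in reduced units. These are assumptions for illustration, not the parameter sets or exact functional forms of Amber or Gromacs, which also include further terms.

```python
def harmonic_bond(r, k, r0):
    """Energy of a covalent bond modelled as a Hookean spring."""
    return 0.5 * k * (r - r0) ** 2


def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy, a common model for van der Waals terms."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)


def coulomb(r, q1, q2, ke=1.0):
    """Electrostatic energy between two point charges (ke in reduced units)."""
    return ke * q1 * q2 / r


# The Lennard-Jones minimum lies at r = 2**(1/6) * sigma with depth -epsilon.
r_min = 2.0 ** (1.0 / 6.0)
print(lennard_jones(r_min, 1.0, 1.0))
print(coulomb(2.0, 1.0, -1.0))
print(harmonic_bond(1.1, 100.0, 1.0))
```

The Lennard-Jones term decays as r⁻⁶ and is usually truncated at a cutoff, whereas the Coulomb term decays only as r⁻¹, which is why electrostatics is treated as a long-range interaction.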

Relational Database: A relational database is a database based on predicate logic and set theory, or in other words, on tables of relations. An example table of relations is the simple publication database shown in the table below. Further relations can be expressed by adding links and keys between tables, and by performing mathematical set operations on the tables. For an excellent introduction to the theory of relational databases, see C. J. Date’s book (Date, 2000). Software exists to model, create, and manage relational databases. Such a software package is called a relational database management system (RDBMS).
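A minimal sketch of these ideas, using the SQLite engine bundled with Python as a stand-in RDBMS, follows. The two tables, their column names, and the placeholder rows are hypothetical; they are not the publication table of the chapter or the BioSimGrid schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Two relations linked by a key: each publication refers to one author.
conn.executescript("""
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE publication (
    publication_id INTEGER PRIMARY KEY,
    title          TEXT NOT NULL,
    year           INTEGER NOT NULL,
    author_id      INTEGER NOT NULL REFERENCES author(author_id)
);
""")

# Placeholder rows, invented for illustration only.
conn.executemany("INSERT INTO author VALUES (?, ?)",
                 [(1, "Author A"), (2, "Author B")])
conn.executemany("INSERT INTO publication VALUES (?, ?, ?, ?)",
                 [(1, "Paper one", 2005, 1),
                  (2, "Paper two", 2006, 1),
                  (3, "Paper three", 2006, 2)])

# Following the key between the tables (a join).
rows = conn.execute("""
    SELECT author.name, publication.title
    FROM publication
    JOIN author ON publication.author_id = author.author_id
    WHERE publication.year = 2006
    ORDER BY publication.publication_id
""").fetchall()

# A set operation: authors who published in both 2005 and 2006.
both = conn.execute("""
    SELECT author_id FROM publication WHERE year = 2005
    INTERSECT
    SELECT author_id FROM publication WHERE year = 2006
""").fetchall()

print(rows)
print(both)
```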

Coarse-Grain Molecular Dynamics: A coarse-grain model is one where some of the fine details of the modelled system have been smoothed over, grouped together, or averaged out. In molecular dynamics, ‘coarse-grain’, as opposed to ‘atomistic’, is a model where not every atom is represented individually; instead, some atoms are grouped together to form a (larger) particle. The justification is that the atoms grouped together within the larger particle may have behaviours that are too detailed and of no interest to the present investigation. Such grouping therefore saves computational time, so that longer-timescale ‘interesting’ events can be more readily observed. An early example of coarse-grain MD is the ‘united-atom’ forcefields, where non-ionizable hydrogen atoms are ‘united’ with the heavy atom they bond to. Recent examples include lipid and biomolecular simulations, where the nanosecond time-scale afforded by atomistic MD has been surpassed by microsecond-scale coarse-grain MD simulations.
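The grouping step can be sketched as a mapping from atoms to larger particles. Placing each particle at the centre of mass of its atoms is one common choice and is an assumption here; the function names, the masses, and the made-up geometry of the four-atom group are all hypothetical.

```python
def center_of_mass(atoms):
    """Return (position, total_mass) for a list of (mass, (x, y, z)) atoms."""
    total = sum(mass for mass, _ in atoms)
    position = tuple(
        sum(mass * pos[i] for mass, pos in atoms) / total for i in range(3)
    )
    return position, total


def coarse_grain(atoms, groups):
    """Replace each group of atom indices by one particle at its centre of mass."""
    beads = []
    for indices in groups:
        position, mass = center_of_mass([atoms[i] for i in indices])
        beads.append((mass, position))
    return beads


# One heavy atom and three hydrogens (invented coordinates), plus a lone atom.
atoms = [
    (12.0, (0.0, 0.0, 0.0)),   # heavy atom
    (1.0, (1.0, 0.0, 0.0)),    # hydrogen
    (1.0, (-1.0, 0.0, 0.0)),   # hydrogen
    (1.0, (0.0, 1.5, 0.0)),    # hydrogen
    (16.0, (3.0, 0.0, 0.0)),   # separate heavy atom, kept as its own particle
]
beads = coarse_grain(atoms, groups=[[0, 1, 2, 3], [4]])
print(beads)
```

Five atoms become two particles, so the number of coordinates to propagate and store drops from 15 to 6; this reduction is what makes longer time-scales affordable.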
