High-Throughput Data Analysis of Proteomic Mass Spectra on the SwissBioGrid
Andreas Quandt (Swiss Institute of Bioinformatics, Switzerland), Sergio Maffioletti (University of Lugano, Switzerland), Cesare Pautasso (Swiss Institute of Bioinformatics, Switzerland), Heinz Stockinger (Swiss Institute of Bioinformatics, Switzerland) and Frederique Lisacek (ETH Zurich, Switzerland)
Copyright: © 2009
Proteomics is currently one of the most promising fields in bioinformatics as it provides important insights into the protein function of organisms. Mass spectrometry is one of the techniques to study the proteome, and several software tools exist for this purpose. The authors provide an extendable software platform called swissPIT that combines different existing tools and exploits Grid infrastructures to speed up the data analysis process for the proteomics pipeline.
In the past, biology was not closely linked to computer science, but the rise of molecular biology brought a major change: biological experiments now produce terabytes of data, so data analysis requires large infrastructures. Grids are a promising solution to this issue. They enable job distribution, task parallelization, and the sharing of data and results between geographically scattered groups. Grid computing has been used in scientific domains for several years. Whereas physics (in particular high energy physics) was one of the main early drivers, other scientific domains, bioinformatics in particular, have started to leverage Grid computing technologies for certain compute- and/or data-intensive applications. However, using this technology is far from trivial because Grids are heterogeneous, geographically distributed resources. They are more complicated to maintain, and the organization and monitoring of the computation steps, as well as the secure storage and distribution of data, require additional knowledge from the user. These computing-related issues may become a burden to users who are not computer specialists and therefore need to be hidden from end users as much as possible.
Key Terms in this Chapter
Mass Spectrometry: In the field of proteomics, mass spectrometry is a technique to analyze, identify and characterize proteins. In particular, it measures the mass-to-charge ratio of ions.
Grid Workflow: In general, a workflow can be considered as the automation of a specific process which can further be divided into smaller tasks. A Grid workflow consists of several tasks that need to be executed in a Grid environment but not necessarily on the same computing hardware.
High Performance Computing (HPC): HPC is a particular field in computer science that deals with performance optimization of single applications, usually by running parallel instances on high performance computing clusters or supercomputers.
Proteomics: The large-scale study of proteins, their functions and their structures. It is supposed to complement physical genome research. It can also be defined as the qualitative and quantitative comparison of proteomes under different conditions to further unravel biological processes (http://www.expasy.ch/proteomics_def.html).
High Throughput Computing: In contrast to HPC, high throughput computing does not aim to optimize a single application but to serve several users and applications. Many applications share a computing infrastructure at the same time, with the goal of maximizing their overall throughput.
Bioinformatics: Comprises the management and the analysis of biological databases.
Grid Job Submission and Execution: Workflows are typically expressed in dedicated workflow languages and then have to be executed. Often, the entire workflow is called a “job” which needs to be submitted to the Grid and executed on Grid computing resources.
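To make the "Grid Workflow" and "Grid Job Submission and Execution" terms more concrete, the following minimal Python sketch models a workflow as a set of tasks with dependencies and runs each task as soon as its prerequisites have finished. It is an illustration only, not the swissPIT implementation: a local thread pool stands in for Grid resources, and the function `run_workflow` as well as the task names (`convert`, `search_a`, `search_b`, `merge`) are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, dependencies, max_workers=4):
    """Run every task once all of its prerequisites have finished.

    tasks:        dict mapping a task name to a function that receives a dict
                  of prerequisite results and returns this task's result
    dependencies: dict mapping a task name to the set of its prerequisites
    """
    graph = {name: set(dependencies.get(name, ())) for name in tasks}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the workflow has a cycle

    results = {}
    running = {}  # maps each submitted future to its task name
    # Assumption: a local thread pool stands in for remote Grid resources.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose prerequisites are all done.
            for name in sorter.get_ready():
                inputs = {dep: results[dep] for dep in graph[name]}
                running[pool.submit(tasks[name], inputs)] = name
            # Block until at least one submitted task has finished.
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in finished:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)  # may unlock dependent tasks
    return results


# Hypothetical four-task workflow: one preparation step, two independent
# analysis steps that can run in parallel, and a final merge step.
tasks = {
    "convert": lambda inputs: ["spectrum_1", "spectrum_2"],
    "search_a": lambda inputs: {s: "hit_a" for s in inputs["convert"]},
    "search_b": lambda inputs: {s: "hit_b" for s in inputs["convert"]},
    "merge": lambda inputs: len(inputs["search_a"]) + len(inputs["search_b"]),
}
dependencies = {
    "search_a": {"convert"},
    "search_b": {"convert"},
    "merge": {"search_a", "search_b"},
}

results = run_workflow(tasks, dependencies)
print(results["merge"])
```

In this sketch the whole call to `run_workflow` plays the role of the "job", while the individual tasks correspond to the smaller units that a Grid could place on different computing hardware.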