High-Throughput GRID Computing for Life Sciences

Giulia De Sario, Angelica Tulipano, Giacinto Donvito, Giorgio Maggi
DOI: 10.4018/978-1-60566-374-6.ch010

Abstract

The number of fully sequenced genomes increases daily, producing exponential growth in the sequence, annotation and metadata databases. Data analysis on a genome-wide level, or investigation within a specific data repository, has become a data- and calculation-intensive process that can occupy single computers and even large computer clusters for months or even years. In most cases such applications can be subdivided into many independent smaller tasks. These smaller tasks are particularly suited to distribution over a computational GRID infrastructure, which drastically reduces the time needed to reach the final result. In our analysis of gene ontology data and their associations with gene products from any kind of organism, carried out to find gene products with similar functionalities, we developed a system that divides the full search into a large number of jobs and keeps submitting these jobs to the GRID infrastructure until all of them have been processed successfully, guaranteeing an analysis of the data without missing any information.

Introduction

Data analysis in bioinformatics is becoming a very complex and data-intensive procedure because the volume of data is increasing at a drastically high rate, not only in size but also in diversity. It occupies large numbers of computing units which, for a typical user under normal conditions, are often not available. Sequencing projects all over the world, including high-throughput approaches such as microarray technology, next-generation sequencing and large-scale mass spectrometry analysis, are responsible for this exponential growth of complex biological data sets. Data analysis within such projects, and even more so in comparisons and integration processes between such projects, often involves the examination of several big data sets with sizes on the order of hundreds of gigabytes. Fortunately, many of these analyses can be divided into many small tasks, which makes it possible to distribute the workload over an infrastructure such as the computational GRID. However, when the number of jobs needed to carry out a particular analysis becomes huge, controlling the full production is not a simple enterprise. It is very important to monitor the success of each submitted job carefully, so that the full analysis can be completed without any missing data.
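
The split-and-monitor idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system developed in the chapter: the names `Job`, `split_workload`, `run_on_grid` and `run_until_complete` are hypothetical, and `run_on_grid` merely simulates a submission that fails at random instead of contacting any real GRID service.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    job_id: int
    chunk: range          # slice of the full workload handled by this job
    attempts: int = 0
    done: bool = False


def split_workload(n_items: int, chunk_size: int) -> List[Job]:
    """Divide n_items independent work items into jobs of at most chunk_size."""
    return [
        Job(job_id=i, chunk=range(start, min(start + chunk_size, n_items)))
        for i, start in enumerate(range(0, n_items, chunk_size))
    ]


def run_on_grid(job: Job, rng: random.Random, failure_rate: float) -> bool:
    """Hypothetical stand-in for 'submit and wait'; True means success."""
    job.attempts += 1
    return rng.random() >= failure_rate


def run_until_complete(jobs: List[Job], rng: random.Random,
                       failure_rate: float = 0.3,
                       max_attempts: int = 20) -> List[Job]:
    """Resubmit failed jobs until every job has succeeded."""
    pending = list(jobs)
    while pending:
        still_pending = []
        for job in pending:
            if run_on_grid(job, rng, failure_rate):
                job.done = True
            elif job.attempts >= max_attempts:
                # Give up loudly instead of silently losing part of the data.
                raise RuntimeError(f"job {job.job_id} failed {max_attempts} times")
            else:
                still_pending.append(job)
        pending = still_pending
    return jobs


jobs = run_until_complete(split_workload(1000, 64), random.Random(0))
covered = sorted(i for job in jobs for i in job.chunk)
```

The final check that matters is coverage: every work item must belong to exactly one successfully completed job, which is what `covered` makes easy to verify.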

The biological task we describe in this chapter is the comparison of gene products in a new, non-conventional way in order to find gene products with similar functionality. Usually gene products are compared by aligning sequences and looking for sequence similarity, on the assumption that high similarity corresponds to similar functionality (Skolnick and Fetrow 2000). However, the relationship between sequence and function does not always hold: often only small differences in the sequence result in drastic changes in functionality. Such sequence differences are hardly detectable in a conventional sequence alignment.

Further, several gene products can have similar functionality even though the active site or the conformation is completely different and originates from a completely different primary sequence. It has also been shown that the same functionality is often associated with a similar tertiary structure, but a similar tertiary structure does not need to have a similar primary sequence.

For this reason we propose a different approach to finding gene products which have similar functionalities. Most sequence databases contain keywords describing the functionality and the cellular localisation of the gene products they harbour. Those keywords, and other functional annotations in other biological databases, were converted to a standardised vocabulary describing gene products in terms of the molecular functions and biological processes in which they are involved and the cellular components where they are localised, i.e., the gene ontology (GO; Ashburner, Ball et al. 2000). The GO consortium (GOC) took the lead on this effort, so that we now have a repository of more than 3.6 million gene products described by more than 25,000 GO terms, creating more than 19.1 million associations (as of January 2008). All this information is available as a relational database (MySQL) called the GO database (GODB; GOC 2001), which occupies more than 2.7 gigabytes. The vocabulary within GO is well structured and well adapted to computational processing. This standardised vocabulary is well suited to comparing the functionality of gene products and to searching for ‘functional analogue’ gene products, i.e., gene products which have the same functionalities but possibly a clearly different sequence, so that, unlike homologous gene products, they are unlikely to be evolutionarily related.
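
To make the idea of comparing gene products through their GO annotations concrete, here is a minimal Python sketch. It assumes a simple Jaccard index over sets of GO term accessions. That measure is chosen purely for illustration and is not necessarily the one used in the chapter. The gene product names (`gp1`, `gp2`, `gp3`) and their term assignments are invented, and the accessions serve only as opaque labels.

```python
from itertools import combinations


def jaccard(a: frozenset, b: frozenset) -> float:
    """Shared GO terms divided by all GO terms of the two gene products."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_pairs(annotations: dict, threshold: float) -> list:
    """Return (name_a, name_b, score) for every pair scoring >= threshold."""
    pairs = []
    for (name_a, terms_a), (name_b, terms_b) in combinations(
            sorted(annotations.items()), 2):
        score = jaccard(terms_a, terms_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    return pairs


# Invented example data: gene product -> set of GO term accessions.
annotations = {
    "gp1": frozenset({"GO:0003677", "GO:0005634", "GO:0006355"}),
    "gp2": frozenset({"GO:0003677", "GO:0005634", "GO:0006355", "GO:0005524"}),
    "gp3": frozenset({"GO:0016301", "GO:0005737", "GO:0006468"}),
}

pairs = similar_pairs(annotations, threshold=0.5)
```

Here `gp1` and `gp2` share three of four distinct terms and are reported as a candidate pair, while `gp3` shares none. If every pair of gene products were compared in this way, the number of comparisons would grow quadratically with the number of gene products, which is why such a search lends itself to being split into many independent jobs.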

Key Terms in this Chapter

Workload Management Service (WMS): The service that takes care of submitting and distributing jobs over the gLite infrastructure.

Sequence Alignment: A way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

Semantics: Relation between signs and the things they refer to, their denotata.

Grid Computing: The creation of a ‘virtual supercomputer’ by using a network of geographically dispersed computers.

gLite: Lightweight Middleware for GRID Computing – middleware stack developed inside the EGEE project.

Analogy: A structure performs the same or similar function by a similar mechanism but evolved separately. Similar structures may have evolved through different pathways, a process known as convergent evolution.

Job: A scheduled and/or automated task for a worker node (WN) in a GRID processing environment. It is defined by the executable with its parameters, and by the input and output files.

Homology: In evolutionary biology, homology has come to mean any similarity between characters that is due to their shared ancestry.

Job Description Language (JDL): The JDL is used in gLite to specify desired job characteristics and constraints, which are used by the match-making process to select the resources that the job can use.

Task: The work assigned to a job. It can contain several elementary steps or can consist of a single operation.

Ontology: The study of conceptions of reality and the nature of being. In computer science, ontologies are representations of a set of concepts within a domain and the relationships between these concepts.
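
As an illustration of the Job and JDL entries above, the following Python sketch writes a minimal JDL description for one job from its executable, parameters, and input and output files. It is a sketch under assumptions: the helper `make_jdl` and all file names are hypothetical, and only a handful of commonly used JDL attributes (`Executable`, `Arguments`, `StdOutput`, `StdError`, `InputSandbox`, `OutputSandbox`) are included.

```python
def make_jdl(executable, arguments, input_files, output_files):
    """Build a minimal JDL text for one job (hypothetical helper)."""

    def quoted_list(items):
        # JDL list syntax: {"a", "b", "c"}
        return "{" + ", ".join('"' + item + '"' for item in items) + "}"

    # The executable is shipped with the job alongside its input files.
    input_sandbox = quoted_list([executable] + list(input_files))
    # Standard output and error are retrieved together with the results.
    output_sandbox = quoted_list(["std.out", "std.err"] + list(output_files))

    lines = [
        'Executable = "' + executable + '";',
        'Arguments = "' + arguments + '";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        "InputSandbox = " + input_sandbox + ";",
        "OutputSandbox = " + output_sandbox + ";",
    ]
    return "[\n" + "\n".join("  " + line for line in lines) + "\n]\n"


# Hypothetical file names for one chunk of a larger analysis.
jdl = make_jdl("compare.sh", "chunk_0007.txt",
               ["chunk_0007.txt"], ["result_0007.txt"])
```

Generating one such description per chunk of the workload keeps the job definitions uniform, so a failed job can be resubmitted from the same text.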
