PheGeeatHome: A Grid-Based Tool for Comparative Genomics

Bertil Schmidt (Nanyang Technological University, Singapore), Chen Chen (Nanyang Technological University, Singapore), Weiguo Liu (Nanyang Technological University, Singapore) and Wayne Mitchell (Experimental Therapeutics Centre, Singapore)
DOI: 10.4018/978-1-60566-374-6.ch009


In this chapter we present PheGeeatHome, a grid-based comparative genomics tool that nominates candidate genes responsible for a given phenotype. A phenotype is the physical manifestation of the interplay of genetic, epigenetic and environmental factors. Our tool is designed to facilitate the discovery and prioritization of candidate genes that control or contribute to the genetically determined portion of a specified phenotype. However, reliable nomination of candidate genes from sequence data requires several genome-size sequence datasets. This makes the approach impractical on traditional computer architectures, where it leads to prohibitively long runtimes. Therefore, we use a computational architecture based on a desktop grid environment and commodity graphics hardware to significantly accelerate PheGee. We validate this approach by deploying and evaluating it on a grid testbed for the comparison of microbial genomes.
Chapter Preview


High-throughput techniques for DNA sequencing have led to an enormous growth in the amount of publicly available genomic data. As of February 2008, 716 complete genome sequences are available and another 2,756 genome-sequencing projects are in progress. As the sequences of more and more genomes become available, we have reached a critical mass where, instead of focusing on a subset of sequences, we can use entire genome data sets to derive global inferences and metadata. Comparative genomics refers to the study of relationships between the genomes of different species or strains. It is currently being used for ortholog detection (Itoh, Goto, Akutsu & Kanehisa, 2005), bacterial pharmacogenomics (Fraser, et al., 2000), and clustering of similar protein sequences (Itoh, Akutsu & Kanehisa, 2004), among other applications. Unfortunately, comparative genomics applications are computationally intensive because of the large sequence data sets involved, and they typically take a few months to complete. These runtime requirements are likely to become even more severe due to the rapid growth in the size of genomic databases.

The objectives of this chapter are therefore two-fold:

  1. The presentation of a new comparative genomics tool called PheGee (Phenotype Genotype Explorer). PheGee nominates candidate genes responsible for a certain phenotype π given genomic sequence datasets of phenotype positive (π+) and phenotype negative (π−) species.

  2. The proposition of a hybrid computational grid platform to accelerate PheGee.
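
To make the first objective concrete, the following minimal Python sketch illustrates one simple way of nominating candidates from π+ and π− genomes. It is an illustration under strong assumptions, not PheGee's actual method: each genome is assumed to be already reduced to a set of gene-family identifiers, and the function name `nominate_candidates` and its parameters are hypothetical. In a real tool, gene presence would have to be decided by sequence comparison rather than by identifier equality.

```python
from typing import Iterable, List, Set


def nominate_candidates(
    reference_genes: Iterable[str],
    positive_genomes: List[Set[str]],
    negative_genomes: List[Set[str]],
) -> List[str]:
    """Return reference genes found in every pi+ genome and in no pi- genome.

    Assumption: genomes are given as sets of gene-family identifiers,
    so "presence" is a simple set-membership test (hypothetical input format).
    """
    candidates = []
    for gene in reference_genes:
        in_all_positive = all(gene in genome for genome in positive_genomes)
        in_no_negative = not any(gene in genome for genome in negative_genomes)
        if in_all_positive and in_no_negative:
            candidates.append(gene)
    return candidates


if __name__ == "__main__":
    positives = [{"g1", "g2"}, {"g1", "g2", "g4"}]
    negatives = [{"g2", "g5"}]
    print(nominate_candidates(["g1", "g2", "g3"], positives, negatives))
```

The strict "all π+ and no π−" rule is chosen only for brevity; a practical scheme would more likely rank genes by a score so that candidates can be prioritized rather than merely filtered.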

The proposed hybrid grid architecture efficiently combines desktop grid computing with GPGPUs (General-Purpose computation on Graphics Processing Units). The driving force and motivation behind this architecture is the price/performance ratio. Using desktop grids, as in the volunteer computing approach, is currently one of the most efficient and simple ways to gain supercomputer power for a reasonable price. Additionally installing massively parallel add-on boards, such as modern computer graphics cards, in each desktop can further improve the price/performance ratio significantly. We show how this architecture can be used to accelerate PheGee efficiently. Moreover, the proposed grid approach is flexible and is therefore applicable to a variety of compute-intensive genomics applications.

Key Terms in this Chapter

Comparative Genomics: Comparative genomics refers to the study of relationships between the genomes of different species or strains.

Genotype: A genotype is the genetic constitution of a cell, an organism, or an individual.

Desktop Grid Computing: Desktop grid computing is a form of distributed computing in which an organization uses its existing computers to handle long-running computational tasks.

General-purpose computing on graphics processing units (GPGPU): GPGPU is the technique of using a GPU, which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the CPU.

Sequence Alignment: In bioinformatics, a sequence alignment is a way of arranging the biological strings – such as DNA or protein – to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships.

Phenotype: A phenotype is any observable characteristic of an organism, such as its morphology, development, biochemical or physiological properties, or behavior.

Smith-Waterman Algorithm: The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two nucleotide or protein sequences.
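
As an illustration of the Smith-Waterman definition above, the following minimal Python sketch computes the best local alignment score of two sequences by dynamic programming. It assumes a simple linear gap penalty and illustrative scoring values (`match=2`, `mismatch=-1`, `gap=-1`), which are not taken from the chapter; it returns only the score, not the alignment itself, and makes no attempt at the GPU acceleration discussed in the chapter.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix, so memory use is O(len(b)).
    Scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: that is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

The running time is proportional to the product of the two sequence lengths, which gives a sense of why comparing genome-size datasets with this kind of algorithm is computationally demanding.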
