Integrative Data Analysis for Biological Discovery

Sai Moturu
Copyright: © 2009 |Pages: 8
DOI: 10.4018/978-1-60566-010-3.ch164

Abstract

As John Muir noted, “When we try to pick out anything by itself, we find it hitched to everything else in the Universe” (Muir, 1911). In tune with Muir’s elegantly stated notion, research in molecular biology is progressing toward a systems-level approach, with the goal of modeling biological systems at the molecular level. Achieving such a lofty goal requires the analysis of multiple datasets to form a clearer picture of entire biological systems (Figure 1). Traditional molecular biology studies focus on a specific process in a complex biological system. High-throughput technologies now allow us to sample tens of thousands of features of biological samples at the molecular level. Even so, each technology is limited to one particular view of a biological system governed by complex relationships and feedback mechanisms on a variety of levels. Integrated analysis of varied biological datasets from the genetic, translational, and protein levels promises more accurate and comprehensive results, which help uncover concepts that cannot be found through separate, independent analyses. In this article, we attempt to provide a comprehensive review of the existing body of research in this domain.

Figure 1. Complexity increases from the molecular and genetic level to the systems level view of the organism (Poste, 2005).

Background

The rapid development of high-throughput technologies has allowed biologists to obtain increasingly comprehensive views of biological samples at the genetic level. For example, microarrays can measure gene expression for the complete human genome in a single pass. The output from such analyses is generally a list of genes (features) that are differentially expressed (upregulated or downregulated) between two groups of samples, or that are coexpressed across a group of samples. Though every gene is measured, many are irrelevant to the phenomenon being studied. Such irrelevant features tend to mask interesting patterns, making gene selection difficult. To overcome this, external information is required to draw meaningful inferences (guided feature selection). Numerous high-throughput techniques now exist alongside diverse annotation datasets, and together they present considerable challenges for data mining (Allison, Cui, Page & Sabripour, 2006).
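The idea of differential expression followed by guided feature selection can be illustrated with a minimal sketch. It is not the method of any particular study reviewed here: the gene names, expression values, threshold, and the `pathway_genes` set are all hypothetical, and Welch's t statistic is assumed as one simple way to score the difference between two groups of samples.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent groups of measurements."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0  # identical, constant groups: treat as no difference
    return (mean(a) - mean(b)) / se


def differential_genes(expr, group_a, group_b, threshold=4.0):
    """Return {gene: t} for genes whose |t| meets the (assumed) threshold."""
    selected = {}
    for gene, values in expr.items():
        t = welch_t([values[i] for i in group_a],
                    [values[i] for i in group_b])
        if abs(t) >= threshold:
            selected[gene] = t
    return selected


def guided_selection(candidates, gene_set):
    """Keep only candidates that also appear in external background knowledge."""
    return {g: t for g, t in candidates.items() if g in gene_set}


# Hypothetical log2 expression values; samples 0-2 are group A, 3-5 group B.
expr = {
    "GENE_A": [8.1, 8.3, 7.9, 5.0, 5.2, 4.8],  # higher in group A
    "GENE_B": [6.0, 6.1, 5.9, 6.0, 6.2, 5.8],  # unchanged
    "GENE_C": [3.0, 3.2, 2.8, 6.1, 5.9, 6.0],  # lower in group A
    "GENE_D": [7.0, 7.5, 6.5, 4.0, 4.4, 3.6],  # higher in group A
}
# Hypothetical membership list for a pathway of interest.
pathway_genes = {"GENE_A", "GENE_C", "GENE_E"}

candidates = differential_genes(expr, group_a=[0, 1, 2], group_b=[3, 4, 5])
selected = guided_selection(candidates, pathway_genes)
print(sorted(candidates))
print(sorted(selected))
```

The first step scores every measured gene, while the second uses outside knowledge to discard candidates unrelated to the question at hand. In practice the score would also need correction for testing thousands of genes at once, which this sketch leaves out.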

Available sources of background knowledge include metabolic and regulatory pathways, gene ontologies, protein localization, transcription factor binding, molecular interactions, protein family and phylogenetic information, and information mined from biomedical literature. Sources of high-throughput data include gene expression microarrays, comparative genomic hybridization (CGH) arrays, single nucleotide polymorphism (SNP) arrays, genetic and physical interactions (affinity precipitation, two-hybrid techniques, synthetic lethality, synthetic rescue), and protein arrays (Troyanskaya, 2005). Each type of data can be richly annotated using clinical data from patients and background knowledge. This article focuses on studies that use microarray data for the core analysis and combine it with other data or background knowledge. Microarray data are the most commonly available at the moment, but the same concepts can be applied to new types of data and knowledge as they emerge.
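As a rough illustration of annotating a gene list with several sources of background knowledge, the sketch below merges per-gene records from independent lookup tables. The source names, gene identifiers, and annotation values are placeholders invented for the example and do not come from any real database.

```python
def annotate(genes, sources):
    """Attach every annotation available for each gene.

    `sources` maps a source name to a {gene: annotation} table. Genes that
    are missing from a source simply carry no entry for that source.
    """
    return {
        gene: {name: table[gene]
               for name, table in sources.items() if gene in table}
        for gene in genes
    }


# Hypothetical background knowledge tables (placeholder values only).
sources = {
    "pathway": {"GENE_A": "PATHWAY_1", "GENE_C": "PATHWAY_2"},
    "ontology": {"GENE_A": "TERM_X"},
    "localization": {"GENE_C": "COMPARTMENT_Y"},
}

# Hypothetical gene list, e.g. the output of a microarray analysis.
annotated = annotate(["GENE_A", "GENE_B", "GENE_C"], sources)
for gene, record in annotated.items():
    print(gene, record)
```

Keeping each source in its own table means that a new type of data or knowledge can be added without changing the annotation step itself.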
