Pattern Discovery in Gene Expression Data

Gráinne Kerr (Dublin City University, Ireland), Heather Ruskin (Dublin City University, Ireland) and Martin Crane (Dublin City University, Ireland)
DOI: 10.4018/978-1-59904-982-3.ch003


Microarray technology provides an opportunity to monitor the mRNA expression levels of thousands of genes simultaneously in a single experiment. The enormous amount of data produced by this high-throughput approach presents a challenge for data analysis: to extract meaningful patterns, to evaluate its quality, and to interpret the results. The most commonly used method of identifying such patterns is cluster analysis. Approaches that are common and sufficient for many data-mining problems, for example hierarchical and K-means clustering, do not address the properties of "typical" gene expression data well and fail, in significant ways, to account for its profile. This chapter clarifies some of the issues and provides a framework to evaluate clustering in gene expression analysis. Methods are categorised explicitly in the context of application to data of this type, providing a basis for reverse engineering of gene regulation networks. Finally, areas for possible future development are highlighted.
Chapter Preview


A fundamental factor in the function of a living cell is the abundance of proteins present at the molecular level, that is, its proteome. The variation between proteomes of different cells is often used to explain differences in phenotype and cell function. Crucially, gene expression is the set of reactions that controls the level of messenger RNA (mRNA) in the transcriptome, which in turn maintains the proteome of a given cell. The transcriptome is never synthesized de novo; instead, it is maintained by gene expression replacing mRNAs that have been degraded, with changes in composition brought about by switching different sets of genes on and off. To understand the mechanisms of cells involved in a given biological process, it is necessary to measure and compare gene expression levels in different biological phases, body tissues, clinical conditions, and organisms. Information on the set of genes expressed in a particular biological process can be used to characterise unknown gene function, identify targets for drug treatments, determine effects of treatment on cell function, and understand the molecular mechanisms involved.

DNA microarray technology has advanced rapidly over the past decade, although the concept itself is not new (Friemert, Erfle, & Strauss, 1989; Gress, Hoheisel, Sehetner, & Leahrach, 1992). It is now possible to measure the expression of an entire genome simultaneously (equivalent to the collection and examination of data from thousands of single-gene experiments). Components of the system technology can be divided into: (1) sample preparation, (2) array generation and sample analysis, and (3) data handling and interpretation. The focus of this chapter is on the third of these.

Microarray technology utilises the base-pairing hybridisation properties of nucleic acids, whereby each of the four base nucleotides (A, T, G, C) will bind with only one of the four base ribonucleotides (A, U, G, C), with pairings A-U, T-A, C-G, and G-C. Thus, a unique sequence of DNA that characterises a gene will bind to a unique mRNA sequence. Synthesized DNA molecules complementary to known mRNA, referred to as probes, are attached to a solid surface. These are used to measure the quantity of specific mRNA of interest that is present in a sample (the target). The molecules in the target are labelled, and a specialised scanner is used to measure the amount of hybridisation (intensity) of the target at each probe. Gene intensity values are recorded for a number of microarray experiments, typically carried out for targets derived under various experimental conditions (Figure 1). Secondary variables (covariates) that affect the relationship between the dependent variable (experimental condition) and independent variables of primary interest (gene expression) include, for example, age, disease, and geography, among others, and can also be measured.
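The pairing rule described above can be illustrated with a minimal Python sketch. The function names `complementary_mrna` and `hybridises` are hypothetical, chosen only for illustration, and the sketch compares sequences position by position, ignoring strand orientation and partial or mismatched binding, which real hybridisation does not.

```python
# DNA base on the probe -> mRNA base it pairs with (A-U, T-A, C-G, G-C).
DNA_TO_MRNA_PAIR = {"A": "U", "T": "A", "C": "G", "G": "C"}


def complementary_mrna(dna_probe: str) -> str:
    """Return the mRNA bases that pair with the probe, position by position.

    Simplification (assumption): strand orientation is ignored.
    """
    return "".join(DNA_TO_MRNA_PAIR[base] for base in dna_probe.upper())


def hybridises(dna_probe: str, mrna_target: str) -> bool:
    """True if every probe base pairs with the target base at the same position."""
    return complementary_mrna(dna_probe) == mrna_target.upper()


probe = "ATGC"
print(complementary_mrna(probe))
print(hybridises(probe, "UACG"))
print(hybridises(probe, "UACC"))
```

Because each DNA base has exactly one partner, the mapping is a plain dictionary lookup, which mirrors the claim that a unique DNA sequence binds a unique mRNA sequence.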

Figure 1.

mRNA is extracted from a transcriptome of interest (derived from cells grown under precise experimental conditions). Each mRNA sample is hybridised to a reference microarray. The gene intensity values for each experiment are then recorded.

An initial cluster analysis step is applied to gene expression data to search for meaningful, informative patterns and dependencies among genes. These provide a basis for hypothesis testing: the basic assumption is that genes showing similar patterns of expression across experimental conditions may be involved in the same underlying cellular mechanism. For example, Alizadeh et al. (2000) applied a hierarchical clustering technique to gene expression data derived from diffuse large B-cell lymphomas (DLBCL) and identified two molecularly distinct subtypes. These had gene expression patterns indicative of different stages of B-cell differentiation: germinal centre B-like DLBCL and activated B-like DLBCL. Findings suggested that patients with germinal centre B-like DLBCL had a significantly better overall survival rate than those with activated B-like DLBCL. This work indicated a significant methodological shift towards characterisation of cancers based on gene expression, rather than on morphological, clinical, and molecular variables.
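As a rough illustration of the clustering assumption above, the following Python sketch groups genes whose profiles rise and fall together across conditions. It is not the procedure of the cited study: the choice of average linkage, the use of 1 minus the Pearson correlation as a distance, the function names, and the expression values are all assumptions made for this example.

```python
from itertools import combinations
from math import sqrt


def pearson_distance(x, y):
    """1 - Pearson correlation: near 0 when two profiles vary together.

    Assumes neither profile is constant (non-zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[gene] for gene in profiles]

    def cluster_distance(c1, c2):
        # Average linkage: mean pairwise distance between members.
        dists = [pearson_distance(profiles[a], profiles[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index j is still valid
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical intensity values: four genes measured under five conditions.
expression = {
    "geneA": [0.1, 0.9, 1.8, 2.7, 3.9],
    "geneB": [0.0, 1.1, 2.1, 2.9, 4.2],
    "geneC": [3.8, 3.1, 1.9, 1.2, 0.2],
    "geneD": [4.1, 2.8, 2.2, 0.9, 0.1],
}
groups = average_linkage(expression, n_clusters=2)
print(groups)
```

A correlation-based distance is used here because it compares the shape of a profile across conditions rather than its absolute intensity; other distance measures and linkage rules would be equally valid choices for a sketch.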
