AI Methods for Analyzing Microarray Data
Amira Djebbari (National Research Council Canada, Canada), Aedín C. Culhane (Harvard School of Public Health, USA), Alice J. Armstrong (The George Washington University, USA) and John Quackenbush (Harvard School of Public Health, USA)
Copyright: © 2009
Biological systems can be viewed as information management systems, with a basic instruction set stored in each cell’s DNA as “genes.” For most genes, their information is enabled when they are transcribed into RNA which is subsequently translated into the proteins that form much of a cell’s machinery. Although details of the process for individual genes are known, more complex interactions between elements are yet to be discovered. What we do know is that diseases can result if there are changes in the genes themselves, in the proteins they encode, or if RNAs or proteins are made at the wrong time or in the wrong quantities. Recent advances in biotechnology led to the development of DNA microarrays, which quantitatively measure the expression of thousands of genes simultaneously and provide a snapshot of a cell’s response to a particular condition. Finding patterns of gene expression that provide insight into biological endpoints offers great opportunities for revolutionizing diagnostic and prognostic medicine and providing mechanistic insight in data-driven research in the life sciences, an area with a great need for advances, given the urgency associated with diseases. However, microarray data analysis presents a number of challenges, from noisy data to the curse of dimensionality (large number of features, small number of instances) to problems with no clear solutions (e.g. real world mappings of genes to traits or diseases that are not yet known). Finding patterns of gene expression in microarray data poses problems of class discovery, comparison, prediction, and network analysis which are often approached with AI methods. Many of these methods have been successfully applied to microarray data analysis in a variety of applications ranging from clustering of yeast gene expression patterns (Eisen et al., 1998) to classification of different types of leukemia (Golub et al., 1999). Unsupervised learning methods (e.g. hierarchical clustering) explore clusters in data and have been used for class discovery of distinct forms of diffuse large B-cell lymphoma (Alizadeh et al., 2000). Supervised learning methods (e.g. artificial neural networks) utilize a previously determined mapping between biological samples and classes (i.e. labels) to generate models for class prediction. A k-nearest neighbor (k-NN) approach was used to train a gene expression classifier of different forms of brain tumors and its predictions were able to distinguish biopsy samples with different prognosis suggesting that microarray profiles can predict clinical outcome and direct treatment (Nutt et al., 2003). Bayesian networks constructed from microarray data hold promise for elucidating the underlying biological mechanisms of disease (Friedman et al., 2000).
Cells dynamically respond to their environment by changing the set and concentrations of active genes by altering the associated RNA expression. Thus “gene expression” is one of the main determinants of a cell’s state, or phenotype. For example, we can investigate the differences between a normal cell and a cancer cell by examining their relative gene expression profiles.
Microarrays quantify gene expression levels in various conditions (such as disease vs. normal) or across time points. For n genes and m instances (biological samples), microarray measurements are stored in an n by m matrix where each row is a gene, each column is a sample and each element in the matrix is the expression level of a gene in a biological sample, where samples are instances and genes are features describing those instances. Microarray data is available through many public online repositories (Table 1). In addition, the Kent-Ridge repository (http://sdmc.i2r.a-star.edu.sg/rp/) contains pre-formatted data ready to use with the well-known machine learning tool Weka (Witten & Frank, 2000).
Key Terms in this Chapter
Microarray: A microarray is an experimental assay which measures the abundances of mRNA (intermediary between DNA and proteins) corresponding to gene expression levels in biological samples.
Supervised Learning: A learning algorithm that is given a training set consisting of feature vectors associated with class labels and whose goal is to learn a classifier that can predict the class labels of future instances.
Over-Fitting: A situation where a model learns spurious relationships and as a result can predict training data labels but not generalize to predict future data.
Multiple testing problem: A problem that occurs when a large number of hypotheses are tested simultaneously using a user-defined a cut off p-value which may lead to rejecting a non-negligible number of null hypotheses by chance.
Curse of Dimensionality: A situation where the number of features (genes) is much larger than the number of instances (biological samples) which is known in statistics as p >> n problem.
Unsupervised Learning: A learning algorithm that tries to identify clusters based on similarity between features or between instances or both but without taking into account any prior knowledge.
Feature Selection: A problem of finding a subset (or subsets) of features so as to improve the performance of learning algorithms.