CNS Tumor Prediction Using Gene Expression Data Part II
Atiq Islam (University of Memphis, USA), Khan M. Iftekharuddin (University of Memphis, USA), E. Olusegun George (University of Memphis, USA) and David J. Russomanno (University of Memphis, USA)
Copyright: © 2009
In this chapter, we propose a novel algorithm for characterizing a variety of CNS tumors. The proposed algorithm is illustrated with an analysis of Affymetrix gene expression data from CNS tumor samples (Pomeroy et al., 2002). As discussed in the previous chapter, CNS Tumor Prediction Using Gene Expression Data Part I, we used an ANOVA model to normalize the microarray gene expression measurements. In this chapter, we introduce a systematic way of building tumor prototypes to facilitate automatic prediction of CNS tumors.
DNA microarrays, also known as genome or DNA chips, have become an important tool for predicting CNS tumor types (Pomeroy et al., 2002; Islam et al., 2005; Dettling et al., 2002). Several researchers have shown that cluster analysis of DNA microarray gene expression data helps both in finding functionally similar genes and in predicting different cancer types. Eisen et al. (1998) used average-linkage hierarchical clustering, with the correlation coefficient as the similarity measure, to organize gene expression values from microarray data. They showed that functionally similar genes group into the same cluster. Herwig et al. (1999) proposed a variant of the K-means algorithm to cluster genes of cDNA clones. Tamayo et al. (1999) used self-organized feature maps (SOFMs) to organize genes into biologically relevant groups. They found that SOFMs reveal the true cluster structure better than either hierarchical clustering, which imposes a rigid structure, or K-means, which imposes none. Considering the many-to-many relationships between genes and their functions, Dembele et al. (2003) proposed a fuzzy C-means clustering technique. The central goal of these clustering procedures (Eisen et al., 1998; Herwig et al., 1999; Tamayo et al., 1999; Dembele et al., 2003) was to group genes based on their functionality. However, none of these works provides a systematic way of discovering or predicting tissue sample groups, as we propose in our current work.
Key Terms in this Chapter
False Discovery Rate (FDR): The expected proportion of false positives among the tests called significant. Controlling the FDR limits this proportion rather than the chance of any false positive at all. An FDR threshold is determined from the observed distribution of p-values across multiple single-hypothesis tests.
Wilcoxon Rank Sum Test: A nonparametric alternative to the two-sample t-test, based on the order in which the observations from the two samples fall.
Self-Organizing Maps (SOMs): A method that learns to cluster input vectors according to how they are naturally grouped in the input space. In its simplest form, the map consists of a regular grid of units, each holding a model vector that represents part of the data. During training, the model vectors are adjusted gradually until the map forms an ordered non-linear regression of the model vectors into the data space.
Parallel Coordinates: A multidimensional data visualization scheme that exploits the 2D pattern recognition capabilities of humans. In this plot, the axes are equally spaced and arranged parallel to one another rather than mutually perpendicular, as in the Cartesian case.
Kruskal-Wallis Test: A nonparametric, rank-based test that, unlike the Wilcoxon Rank Sum Test, can be applied when there are more than two sample groups.
q-values: A measure of the FDR incurred when a particular test is called significant; the q-value of a test is the minimum FDR at which that test can be called significant.
DNA Microarray: Also known as a DNA chip, it is a collection of microscopic DNA spots, commonly representing single genes, arrayed on a solid surface by covalent attachment to chemically suitable matrices.
Histologic Examination: The examination of tissue specimens under a microscope.
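As an illustration of the FDR and q-value entries above, the following is a minimal sketch of the Benjamini-Hochberg step-up adjustment, a common and conservative way to turn a list of per-gene p-values into q-value-like quantities. The chapter excerpt does not state which FDR procedure the authors use, so the choice of Benjamini-Hochberg, the function names `bh_qvalues` and `significant_at`, and the example p-values are all assumptions made for illustration.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed stand-in for q-values).

    For the p-value of rank r (1 = smallest) among m tests, the adjusted
    value is min over ranks k >= r of p_(k) * m / k, capped at 1.
    """
    m = len(pvalues)
    if m == 0:
        return []
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the result stays monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def significant_at(pvalues, fdr=0.05):
    """Flag the tests whose adjusted value is at or below the FDR level."""
    return [q <= fdr for q in bh_qvalues(pvalues)]


if __name__ == "__main__":
    # Hypothetical p-values, e.g. from per-gene rank tests.
    example = [0.01, 0.04, 0.03, 0.5]
    print(bh_qvalues(example))
    print(significant_at(example, fdr=0.05))
```

In this sketch a gene is called significant when its adjusted value does not exceed the chosen FDR level, which matches the idea that a q-value is the smallest FDR at which the test would be called significant. Estimators that also estimate the proportion of true null hypotheses would generally give smaller q-values than this adjustment.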