Wave-SOM: A Novel Wavelet-Based Clustering Algorithm for Analysis of Gene Expression Patterns

Andrew Blanchard, Christopher Wolter, David S. McNabb, Eitan Gross
Copyright: © 2012 | Pages: 23
DOI: 10.4018/978-1-4666-1785-8.ch007

In this paper, the authors present a wavelet-based algorithm (Wave-SOM) to help visualize and cluster oscillatory time-series data from two-dimensional gene expression microarrays. Using various wavelet transformations, raw data are first de-noised by decomposing each time series into low- and high-frequency wavelet coefficients. Following thresholding, the coefficients are fed as an input vector into a two-dimensional Self-Organizing Map (SOM) clustering algorithm. Transformed data are then clustered by minimizing the Euclidean (L2) distance between their corresponding fluctuation patterns. A multi-resolution analysis by Wave-SOM of expression data from the yeast Saccharomyces cerevisiae, exposed to oxidative stress and glucose-limited growth, identified 29 genes with correlated expression patterns that were mapped into 5 different nodes. The ordered clustering of yeast genes by Wave-SOM illustrates that the same set of genes (encoding ribosomal proteins) can be regulated by two different environmental stresses, oxidative stress and starvation. The algorithm provides heuristic information regarding the similarity of different genes. Using previously studied expression patterns of yeast cell-cycle and functional genes as test data sets, the authors’ algorithm outperformed five other competing programs.
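
The pipeline summarized in the abstract (wavelet decomposition, thresholding, then SOM clustering by Euclidean distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the Haar wavelet, the soft-threshold rule, the threshold value, the linear decay schedules, and every function name below are assumptions chosen for brevity, and the input length is assumed to be divisible by 2 raised to the number of levels.

```python
import math
import random

def haar_dwt(signal, levels):
    """Multi-level Haar transform: [approx, coarsest detail, ..., finest detail]."""
    approx, details = list(signal), []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.insert(0, [(a - b) / math.sqrt(2) for a, b in pairs])  # high-frequency band
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]            # low-frequency band
    return [approx] + details

def soft_threshold(coeffs, t):
    """Shrink each coefficient toward zero by t; magnitudes below t become zero."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in coeffs]

def wavelet_features(signal, levels=2, t=0.2):
    """Concatenate the approximation band with the thresholded detail bands."""
    bands = haar_dwt(signal, levels)
    feats = list(bands[0])
    for detail in bands[1:]:
        feats.extend(soft_threshold(detail, t))
    return feats

def best_node(weights, x):
    """Grid position of the node whose weight vector is nearest to x (squared L2)."""
    return min(weights, key=lambda n: sum((w - v) ** 2 for w, v in zip(weights[n], x)))

def train_som(data, rows, cols, epochs=200, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols SOM with a Gaussian neighborhood on the node grid."""
    rng = random.Random(seed)
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    weights = {n: list(rng.choice(data)) for n in nodes}  # initialize from data
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr, sigma = lr0 * decay, sigma0 * decay + 0.01
        for x in rng.sample(data, len(data)):              # shuffled pass over the data
            bmu = best_node(weights, x)
            for n in nodes:
                grid_d2 = (n[0] - bmu[0]) ** 2 + (n[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * sigma * sigma))
                weights[n] = [w + lr * h * (v - w) for w, v in zip(weights[n], x)]
    return weights

# Toy data (hypothetical): four noisy fast oscillators and four noisy slow ones.
noise = random.Random(1)
fast = [1, -1, 1, -1, 1, -1, 1, -1]
slow = [1, 1, 1, 1, -1, -1, -1, -1]
profiles = [[v + noise.gauss(0, 0.05) for v in base] for base in (fast,) * 4 + (slow,) * 4]
feats = [wavelet_features(p) for p in profiles]
som = train_som(feats, rows=1, cols=2)
clusters = [best_node(som, f) for f in feats]
```

In this toy setting the fast and slow profiles land on different map nodes because their energy sits in different wavelet bands. A real analysis would likely use a larger map and a wavelet family and threshold suited to the data.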
Chapter Preview


Saccharomyces cerevisiae (baker’s yeast) is commonly used by molecular biologists as a model eukaryote for the study of basic cellular processes. In addition, studies of how yeast responds to different stresses have implications for the brewing and baking industries and for developing strains of yeast that produce lower levels of acetic acid (Mizuno et al., 2006) in response to varying ethanol concentrations. Such strains could become a viable source of ethanol for the biofuel industry.

In the budding yeast there are over 6000 expressed genes. Until very recently, analyzing the expression patterns of these genes was confined to only a few gene products or message levels at a time, or could only be addressed by means of perturbation analysis in combination with computer simulations (Glass & Mackey, 1979). The advent of microarray techniques can now provide a means for assessing the sum total of expressed genes in a cell in a quantitative, reproducible, and internally standardized manner. A single array provides a snapshot of the transcriptional state of the cell at some point in time. When multiple snapshots are taken from a temporally coherent system such as synchronous cells, these signals yield the characteristic time signature of each of the genes in the cell. Thus, by clustering the genes according to the similarity in their expression patterns, one can learn about potential functions of the gene products that otherwise would be very difficult to determine.

Furthermore, the regulation of gene expression in yeast is a non-linear process under the control of a network of connected regulatory proteins called transcription factors (Nicholas & Prigogine, 1977), although many epigenetic factors, such as RNA splicing and degradation, are known to modify expression levels. Consequently, the concentrations of a variety of cellular reagents oscillate as the cell goes through the different phases of the cell cycle (Mitchison, 1971; Klevecz et al., 1984). Chemical or physical perturbations to the cell cycle can thus introduce a phase shift in the onset time of a given cell cycle event, e.g. mitosis (Klevecz et al., 1978). If the oscillatory kinetics of expression are confined to a small number of genes, then finding these ‘cell cycle regulated’ genes becomes a fairly easy clustering task. However, if a large number of genes, or indeed the entire genome, oscillates and the fundamental harmonic of this oscillation is clearly distinguishable from the characteristic period of the cell cycle, then the different functional groups can be identified by their characteristic kinetics or oscillation frequency.

A successful gene-expression clustering program must be able to handle noisy time-series data whose time points may be unevenly spaced. In addition, the algorithm needs to compensate for shifts in the onset time of expression of one or more genes in the dataset, which arise from variability in experimental protocols or from differences in timing between labs.
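
The preview does not say how Wave-SOM meets these two requirements, so the following is only a generic sketch of one common approach, not the chapter's method. It assumes linear interpolation onto an evenly spaced grid for uneven sampling, and a circular cross-correlation search for the time shift, which in turn assumes the sampled window covers whole periods of the oscillation. All function names are hypothetical.

```python
def resample_uniform(times, values, n_points):
    """Linearly interpolate (times, values) onto n_points evenly spaced times.

    Assumes times is strictly increasing and n_points >= 2.
    """
    t0, t1 = times[0], times[-1]
    step = (t1 - t0) / (n_points - 1)
    grid = [t0 + k * step for k in range(n_points)]
    out, j = [], 0
    for t in grid:
        # Advance to the sampling interval [times[j], times[j + 1]] containing t.
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        frac = (t - times[j]) / (times[j + 1] - times[j])
        out.append(values[j] + frac * (values[j + 1] - values[j]))
    return grid, out

def best_circular_lag(reference, series):
    """Shift, in samples, that maximizes the dot product with the reference."""
    n = len(reference)
    def score(lag):
        return sum(reference[i] * series[(i + lag) % n] for i in range(n))
    return max(range(n), key=score)

def align(reference, series):
    """Rotate series so that it lines up with reference."""
    lag = best_circular_lag(reference, series)
    return series[lag:] + series[:lag]

# Uneven sampling of the line y = t, resampled at unit spacing.
grid, even = resample_uniform([0, 1, 3, 4], [0, 1, 3, 4], 5)

# One period of a triangle wave and a copy delayed by two samples.
reference = [0, 1, 2, 1, 0, -1, -2, -1]
delayed = reference[-2:] + reference[:-2]
realigned = align(reference, delayed)
```

Interpolation and lag estimation are both sensitive to noise, so in practice they would normally follow, or be combined with, a de-noising step.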
