Important insights into gene function can be gained by gene expression analysis. For example, some genes are turned on (expressed) or turned off (repressed) when there is a change in external conditions or stimuli. The expression of one gene is often regulated by the expression of other genes. A detail analysis of gene expression information will provide an understanding about the inter-networking of different genes and their functional roles. DNA microarray technology allows massively parallel, high throughput genome-wide profiling of gene expression in a single hybridization experiment [Lockhart & Winzeler, 2000]. It has been widely used in numerous studies over a broad range of biological disciplines, such as cancer classification (Armstrong et al., 2002), identification of genes relevant to a certain diagnosis or therapy (Muro et al., 2003), investigation of the mechanism of drug action and cancer prognosis (Kim et al., 2000; Duggan et al., 1999). Due to the large number of genes involved in microarray experiment study and the complexity of biological networks, clustering is an important exploratory technique for gene expression data analysis. In this article, we present a succinct review of some of our work in cluster analysis of gene expression data.
Cluster analysis is a fundamental technique in exploratory data analysis (Jain & Dubes, 1988). It aims at finding groups in a given data set such that objects in the same group are similar to each other while objects in different groups are dissimilar. It aids in the discovery of gene function because genes with similar gene expression profiles can be an indicator that they participate in related cellular processes. Clustering of genes may suggest possible roles for genes with unknown functions based on the known functions of some other genes in the same cluster. Clustering of gene expression data has been applied to, for example, the study of temporal expression of yeast genes in sporulation (Chu et al., 1998), the identification of gene regulatory networks (Chen, Filkov, & Skiena, 1999), and the study of cancer (Tamayo et al., 1999).
Many clustering algorithms have been applied to the analysis of gene expression data (Sharan, Elkon, & Shamir, 2002). They can be broadly classified as either hierarchical or partition-based depending on how they group the data. Hierarchical clustering is further subdivided into agglomerative methods and divisive methods. The former proceed by successive merging of the N objects into larger groups, whereas the latter divide a larger group successively into finer groupings. Agglomerative techniques are more common in hierarchical clustering.
Hierarchical clustering is among the first clustering technique being applied to gene expression data (Eisen et al., 1998). In hierarchical clustering, each of the gene expression profile is considered as a cluster initially. Then, pairs of clusters with the smallest distance between them, are merged together to form a single cluster. This process is repeated until there is only one cluster left. The hierarchical clustering algorithm arranges the gene expression data into a hierarchical tree structure known as a dendrogram, which allows easy visualization and interpretation of results. However, the hierarchical tree cannot indicate the optimal number of clusters in the data. The user has to interpret the tree topologies and identify branch points that segregate clusters of biological relevance. In addition, once a data is assigned to a node in the tree, it cannot be reassigned to a different node even though it is later found to be closer to that node.
Key Terms in this Chapter
Gene Expression: Gene expression is the process by which a gene’s DNA sequence is converted into functional proteins. Some genes are turned on (expressed) or turned off (repressed) when there is a change in external conditions or stimuli.
SSMCL Clustering: A partition-based clustering algorithm that is based on the one-prototype-take-one-cluster (OPTOC) learning paradigm. OPTOC learning is achieved by constructing a dynamic neighborhood that favors patterns inside the neighborhood. Eventually, the prototype settles at the center of a natural cluster, while ignoring competitions from other clusters. Together with the over-clustering and merging process, SSMCL is able to find all natural clusters in the data.
BHC Clustering: A clustering algorithm based on the hierarchical binary subdivision framework. The algorithm proceeds by partitioning a cluster into two classes, and then check for the validity of the partition by Fisher discriminant analysis. The algorithm terminates when no further valid subdivision is possible.
Hierarchical Clustering: A clustering method that finds successive clusters using previously established clusters. Hierarchical algorithms can be agglomerative (bottom-up) or divisive (top-down). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
Cluster Analysis: An exploratory data analysis technique that aims at finding groups in a data such that objects in the same group are similar to each other while objects in different groups are dissimilar.
Biclustering: Also called two-way clustering or co-clustering. In biclustering, not only the objects but also the features of the objects are clustered. If the data is represented in a data matrix, the rows and columns are clustered simultaneously.
DNA Microarray technology: A technology that allows massively parallel, high throughput genome-wide profiling of gene expression in a single hybridization experiment. The method is based on the complementary hybridization of DNA sequence.
K-Means Clustering: K-means clustering is the most well-known partition-based clustering algorithm. The algorithm starts by choosing k initial centroids, usually at random. Then the algorithm alternates between updating the cluster assignment of each data point by associating with the closest centroid and updating the centroids based on the new clusters until convergence.