Fuzzy Clustering of Large Relational Bioinformatics Datasets

Fuzzy Clustering of Large Relational Bioinformatics Datasets

Mihail Popescu (University of Missouri, USA)
DOI: 10.4018/978-1-60566-858-1.ch016
OnDemand PDF Download:


In this chapter the author presents a fuzzy clustering methodology that can be employed for large relational datasets. Relational data is an N×N matrix that consists of pair-wise dissimilarities among N objects. Large relational datasets are encountered in many domains such as psychology or medical informatics, but they are abundant in bioinformatics where gene products are compared to each other based on various characteristics such as DNA or amino acid sequence. The fuzzy clustering methodology is exemplified on a set of about 30,000 human gene products.
Chapter Preview


Bioinformatics is, arguably, the application domain where large relational datasets are most abundant. There are two main reasons for this abundance. First, numerous genome projects completed in the last 10 years have generated a large amount of sequence data. For example, the RefSeq database (http://www.ncbi.nlm.nih.gov/Genbank (Pruitt, Tatusova & Maglott, 2007), is close to 85 million. The difference between the numbers of sequences in the two databases is represented, in part, by sequences with unknown function. Even for the case of the human genome, only about 21,000 genes have been annotated from an estimated total of 30,000. Second, functional annotation is a tedious process that is mainly accomplished by comparing the sequence of an unknown protein to the sequence of a protein with known functions. The sequence comparison is often performed with BLAST (Altshul et al., 1990), one of the most used tool in bioinformatics. When performed on an entire genome, this method produces large matrices (relational data) of gene sequence similarity values. There are other applications beside sequence comparison where large relational datasets are generated, such as gene comparison based on Gene Ontology annotations or microarray expression (Havens et al. 2008), and document comparison based on Medical Subject Headings (MeSH) annotations.

Clustering plays an important role in genome annotation process. In the first phase of the process, after the hypothetical gene boundaries are determined, the gene products are annotated based on their sequence similarity to gene products in related species that are well studied. Next, all the gene products in a given genome are clustered based on their sequence similarity in order to find proteins with similar functions (Enright, Van Dongen & Ouzounis, 2002). Most of the time, gene products with high sequence similarity have similar functions. However, there are gene products with similar functions with less than 30% sequence similarity. The well characterized proteins from the same cluster can then be used to determine the functions of the unknown members of the group, a strategy often called “guilt by association”. The most used annotation type is based on Gene Ontology terms. For example, if an annotated gene A clusters together with an annotated gene B due to a high sequence similarity (e.g. computed with BLAST) or due to a high similarity of their expression profiles (computed based on microarray data), we have reason to believe that B shares all/some of the annotations (functions) of A. The most popular clustering algorithms for relational datasets in bioinformatics are hierarchical clustering and Markov clustering (Enright, Van Dongen & Ouzounis, 2002). A scalable version of the hierarchical clustering algorithm, CURE, has been proposed (Guha et al., 1998), but we are not aware of its application to bioinformatics. An implementation of Markov clustering, TRIBE-MCL, has reportedly grouped about 80,000 sequences in 8,000 clusters in approximately 5 minutes on a Sun Ultra 10 workstation. However, both previous clustering approaches are crisp, that is, they assign each sequence to a unique cluster. Because many proteins have multiple sequence domains that correspond to various functions, it is more natural to allow each sequence to belong to multiple clusters (Xu et al., 2008). By employing fuzzy clustering, an unknown gene product can be assigned to more than one group, receiving in this fashion putative annotations from multiple gene families (Popescu et al., 2004). For example, if the unknown gene B has a 0.5 membership in A's cluster and, at the same time, has a 0.5 membership in another cluster where gene C is representative, then we have 50% confidence that B shares both A's and C's annotations. More applications of the fuzzy clustering in bioinformatics, such as gene product summarization and microarray processing, are presented in (Xu et al., 2008).

Complete Chapter List

Search this Book: